Improve math tutorial README documentation #1766
Conversation
Address documentation gaps including: platform requirements (vLLM x86_64 Linux only), GPU VRAM requirements and compatible GPUs, AWS credential setup, HuggingFace token guidance with error notes, all 12 available prompts documented with selection guidance, quality score interpretation table, LSH parameter documentation for deduplication, cache clearing warning, compact horizontal pipeline diagram, and a troubleshooting section for common errors.

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
**Greptile Summary**

This PR significantly improves the math tutorial README with AWS credential guidance, a full listing of all 12 LLM prompts with selection guidance, quality score and LSH parameter tables, a collapsible pipeline diagram, and a troubleshooting section. Most previously flagged issues are addressed.

**Confidence Score: 5/5**

Safe to merge; all remaining findings are P2 style suggestions that do not block merge. All P0/P1 issues from prior rounds are resolved or addressed (wrong input dir fixed, bands_per_iteration added, AWS framing corrected). The two remaining findings are P2 documentation polish (missing platform details and missing MIND example command) that can be addressed in a follow-up. No files require special attention.
**Flowchart**

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart LR
    HF([HuggingFace]) --> D[0_download]
    D --> RAW[(Raw Data)]
    RAW --> CC[1_cc_index_lookup*]
    CC --> PRE[2_text_preprocess]
    S3_CC([Common Crawl S3]) --> PRE
    PRE --> LLM[3_llm_cleanup]
    LLM --> QC[4_quality_classifier]
    QC --> DEDUP[5_deduplication]
    DEDUP --> FINAL[(Final Data)]
    note1["* Only for datasets without WARC metadata"] -.-> CC
    style HF fill:#ff9,stroke:#888
    style FINAL fill:#9f9,stroke:#888
    style S3_CC fill:#ff9,stroke:#888
    style note1 fill:#eee,stroke:#aaa,color:#555
```
Reviews (8). Last reviewed commit: "Merge branch 'main' into lbliii/math-tut..."
```diff
@@ -523,3 +596,87 @@ python tutorials/math/3_llm_cleanup.py \
     --max_model_len 16384 \
     --input_filetype jsonl
```
**Wrong input directory in Content Cleaning example**
This example uses --input $MATH_DATA_DIR/deduplicated, but deduplicated is the final output of Step 5. LLM cleanup (Step 3) consumes preprocessed text from Step 2 and must run before Steps 4 and 5 — the pipeline table on line 315 and the main usage example on line 483 both confirm this ordering. Feeding deduplicated output back through LLM cleanup is logically backwards and will confuse users following this example.
Suggested change:

```shell
python tutorials/math/3_llm_cleanup.py \
    --input $MATH_DATA_DIR/preprocessed \
    --output $MATH_DATA_DIR/cleaned_code \
    --model microsoft/phi-4 \
    --prompt HTML_TO_TEXT_PROMPT_CODE \
    --chunk_data \
    --chunk_length 5000 \
    --max_model_len 16384 \
    --input_filetype jsonl
```
**LSH parameters** (tunable via command-line flags):

| Flag | Default | Description |
|------|---------|-------------|
| `--char_ngrams` | 24 | Character n-gram size for MinHash. Values below 20 may produce ~5% false positives. |
| `--num_bands` | 20 | Number of LSH bands. More bands = higher recall but slower. |
| `--minhashes_per_band` | 13 | Hashes per band. More hashes = higher precision but lower recall. |
| `--use_64_bit_hash` | False | Use 64-bit hash for fewer collisions on very large datasets. |
| `--seed` | 42 | Seed for MinHash permutations (for reproducibility). |

The similarity threshold is implicitly controlled by `num_bands` and `minhashes_per_band`. The approximate threshold is `(1/num_bands)^(1/minhashes_per_band)`. With the defaults (20 bands, 13 hashes/band), this is approximately 0.72. To detect more similar pairs (stricter dedup), increase `num_bands`; to be more lenient, decrease it.
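As a sanity check on the threshold formula quoted above, the arithmetic can be reproduced in a few lines of Python (a standalone sketch; the variables simply mirror the flag defaults):

```python
# Approximate LSH similarity threshold: (1/num_bands) ** (1/minhashes_per_band)
num_bands = 20           # --num_bands default
minhashes_per_band = 13  # --minhashes_per_band default

threshold = (1 / num_bands) ** (1 / minhashes_per_band)
print(f"approximate threshold: {threshold:.3f}")  # ~0.794
```

Note that the formula actually evaluates to roughly 0.79, not 0.72 — consistent with the later commit in this thread that fixes the documented value from 0.72 to 0.79.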
**`--bands_per_iteration` missing from LSH parameters table**
The troubleshooting section (line 682) references --bands_per_iteration (default: 5) as a tuning knob for OOM errors, and the flag exists in 5_deduplication.py (line 75–78). Adding it to the parameters table here would make the documentation consistent and help users find it before they hit an OOM error.
Suggested change:

```diff
 | Flag | Default | Description |
 |------|---------|-------------|
 | `--char_ngrams` | 24 | Character n-gram size for MinHash. Values below 20 may produce ~5% false positives. |
 | `--num_bands` | 20 | Number of LSH bands. More bands = higher recall but slower. |
 | `--minhashes_per_band` | 13 | Hashes per band. More hashes = higher precision but lower recall. |
+| `--bands_per_iteration` | 5 | Number of bands to shuffle concurrently. Reduce if you hit OOM errors. |
 | `--use_64_bit_hash` | False | Use 64-bit hash for fewer collisions on very large datasets. |
 | `--seed` | 42 | Seed for MinHash permutations (for reproducibility). |
```
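To make the interplay of these flags concrete, here is a toy pure-Python sketch of MinHash with LSH banding. This is illustrative only — not Curator's actual GPU implementation — and all function names are hypothetical:

```python
import hashlib

def char_ngrams(text, n=24):
    # Shingle the document into overlapping character n-grams (--char_ngrams).
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingles, num_bands=20, minhashes_per_band=13):
    # One minimum per seeded hash function; total = num_bands * minhashes_per_band.
    num_hashes = num_bands * minhashes_per_band
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingles)
        for seed in range(num_hashes)
    ]

def estimate_jaccard(sig_a, sig_b):
    # The fraction of agreeing positions estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_candidate_pair(sig_a, sig_b, num_bands=20, minhashes_per_band=13):
    # Split each signature into bands; any fully matching band flags a candidate pair.
    for b in range(num_bands):
        lo, hi = b * minhashes_per_band, (b + 1) * minhashes_per_band
        if sig_a[lo:hi] == sig_b[lo:hi]:
            return True
    return False
```

An identical pair of documents matches in every band; raising `num_bands` (for a fixed number of hashes per band) makes a partial match more likely to surface in at least one band, which is why it lowers the effective similarity threshold.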
- Fix wrong input directory in Content Cleaning example (deduplicated → preprocessed) to match pipeline order
- Add `--bands_per_iteration` to LSH parameters table
- Correct AWS description: Common Crawl is part of the AWS Open Data program, not requester-pays

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
**sarahyurick** left a comment:
Thanks! Mostly looks good to me, left some minor comments. Let's see if we can get @ronjer30 and/or @raosukrit67 to review as well.
- Remove Platform Requirements section (Curator is Linux-only)
- Revert GPU requirements to original wording
- Remove CUDA verification from troubleshooting
- Remove redundant "Repository Not Found" troubleshooting section
- Fix LSH threshold value from 0.72 to 0.79

Signed-off-by: Lawrence Lane <llane@nvidia.com>
```markdown
## Troubleshooting

### No GPUs Detected
```
I think this subsection can be removed. Otherwise LGTM, thanks!
Thanks @lbliii, looks good to me.
Addresses reviewer feedback: the pynvml reinstall note already appears in the Install section, so duplicating it under Troubleshooting is unnecessary. Signed-off-by: Lawrence Lane <llane@nvidia.com>
thanks for taking a look! do you have the ability to hit approve to unblock merge? if not i can find another team member
/ok to test 6cd8a0d
* Improve math tutorial README documentation

  Address documentation gaps including: platform requirements (vLLM x86_64 Linux only), GPU VRAM requirements and compatible GPUs, AWS credential setup, HuggingFace token guidance with error notes, all 12 available prompts documented with selection guidance, quality score interpretation table, LSH parameter documentation for deduplication, cache clearing warning, compact horizontal pipeline diagram, and a troubleshooting section for common errors.

* Address PR review comments on math tutorial README

  - Fix wrong input directory in Content Cleaning example (deduplicated → preprocessed) to match pipeline order
  - Add --bands_per_iteration to LSH parameters table
  - Correct AWS description: Common Crawl is part of AWS Open Data program, not requester-pays

* Address reviewer feedback on math tutorial README

  - Remove Platform Requirements section (Curator is Linux-only)
  - Revert GPU requirements to original wording
  - Remove CUDA verification from troubleshooting
  - Remove redundant "Repository Not Found" troubleshooting section
  - Fix LSH threshold value from 0.72 to 0.79

* Remove redundant "No GPUs Detected" troubleshooting subsection

  Addresses reviewer feedback: the pynvml reinstall note already appears in the Install section, so duplicating it under Troubleshooting is unnecessary.

---------

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Description
Addresses documentation gaps in the math data curation pipeline tutorial README, based on feedback from internal review. Changes include:
Checklist