
Improve math tutorial README documentation#1766

Merged
sarahyurick merged 8 commits into main from lbliii/math-tutorial-docs-fix
Apr 22, 2026

Conversation

Contributor

@lbliii lbliii commented Apr 8, 2026

Description

Addresses documentation gaps in the math data curation pipeline tutorial README, based on feedback from internal review. Changes include:

  • Added platform requirements (vLLM x86_64 Linux only), GPU VRAM needs (≥20 GB), and compatible GPU list (A100, A40, L40, L40S, H100)
  • Added AWS credential setup section and improved HuggingFace token guidance with common error notes
  • Documented all 12 available prompts (up from 2) with selection guidance, organized into cleaning, classification, and MIND dialogue categories
  • Added quality score interpretation table (0–5) with threshold recommendations for Step 4
  • Added LSH parameter table with tuning guidance and similarity threshold formula for Step 5
  • Added compact horizontal pipeline overview and wrapped the large mermaid diagram in a collapsible section
  • Added troubleshooting section covering common errors (no GPUs, 0 CC matches, auth errors, empty LLM output, all-zero scores, S3 failures, dedup issues)
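As a rough illustration of how a Step 4 quality-score cutoff from the new interpretation table might be applied downstream — the field name `score` and the cutoff of 3 are hypothetical here, not taken from the README:

```python
# Hypothetical sketch of filtering JSONL records by an integer quality
# score (0-5). The "score" field name and threshold are illustrative;
# consult the README's interpretation table for recommended cutoffs.
import json

def filter_by_score(lines, threshold=3):
    """Keep only records whose quality score meets the threshold."""
    kept = []
    for line in lines:
        record = json.loads(line)
        if record.get("score", 0) >= threshold:
            kept.append(record)
    return kept

sample = [
    json.dumps({"text": "theorem statement...", "score": 5}),
    json.dumps({"text": "nav bar junk", "score": 1}),
]
print(len(filter_by_score(sample)))  # 1
```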

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Address documentation gaps including: platform requirements (vLLM
x86_64 Linux only), GPU VRAM requirements and compatible GPUs, AWS
credential setup, HuggingFace token guidance with error notes, all 12
available prompts documented with selection guidance, quality score
interpretation table, LSH parameter documentation for deduplication,
cache clearing warning, compact horizontal pipeline diagram, and a
troubleshooting section for common errors.

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lbliii lbliii requested a review from a team as a code owner April 8, 2026 15:54
@lbliii lbliii requested review from suiyoubi and removed request for a team April 8, 2026 15:54
@copy-pr-bot

copy-pr-bot Bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@lbliii lbliii self-assigned this Apr 8, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Apr 8, 2026

Greptile Summary

This PR significantly improves the math tutorial README with AWS credential guidance, a full listing of all 12 LLM prompts with selection guidance, quality score and LSH parameter tables, a collapsible pipeline diagram, and a troubleshooting section. Most previously flagged issues are addressed: the wrong --input deduplicated path in the Content Cleaning example is fixed, --bands_per_iteration appears in the LSH table, and AWS access is correctly described as free-but-auth-required.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style suggestions that do not block merge.

All P0/P1 issues from prior rounds are resolved or addressed (wrong input dir fixed, bands_per_iteration added, AWS framing corrected). The two remaining findings are P2 documentation polish (missing platform details and missing MIND example command) that can be addressed in a follow-up.

No files require special attention.

Important Files Changed

| Filename | Overview |
|----------|----------|
| tutorials/math/README.md | Comprehensive documentation improvements including AWS credentials guidance, all 12 prompts documented, LSH parameter table, quality score interpretation table, collapsible pipeline diagram, and troubleshooting section; minor gap between PR description and implemented content |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart LR
    HF([HuggingFace]) --> D[0_download]
    D --> RAW[(Raw Data)]
    RAW --> CC[1_cc_index_lookup*]
    CC --> PRE[2_text_preprocess]
    S3_CC([Common Crawl S3]) --> PRE
    PRE --> LLM[3_llm_cleanup]
    LLM --> QC[4_quality_classifier]
    QC --> DEDUP[5_deduplication]
    DEDUP --> FINAL[(Final Data)]

    note1["* Only for datasets without WARC metadata"] -.-> CC

    style HF fill:#ff9,stroke:#888
    style FINAL fill:#9f9,stroke:#888
    style S3_CC fill:#ff9,stroke:#888
    style note1 fill:#eee,stroke:#aaa,color:#555
```


Comment thread tutorials/math/README.md
Comment on lines 589 to 597
@@ -523,3 +596,87 @@ python tutorials/math/3_llm_cleanup.py \
--max_model_len 16384 \
--input_filetype jsonl
Contributor


P1 Wrong input directory in Content Cleaning example

This example uses --input $MATH_DATA_DIR/deduplicated, but deduplicated is the final output of Step 5. LLM cleanup (Step 3) consumes preprocessed text from Step 2 and must run before Steps 4 and 5 — the pipeline table on line 315 and the main usage example on line 483 both confirm this ordering. Feeding deduplicated output back through LLM cleanup is logically backwards and will confuse users following this example.

Suggested change
```shell
python tutorials/math/3_llm_cleanup.py \
    --input $MATH_DATA_DIR/preprocessed \
    --output $MATH_DATA_DIR/cleaned_code \
    --model microsoft/phi-4 \
    --prompt HTML_TO_TEXT_PROMPT_CODE \
    --chunk_data \
    --chunk_length 5000 \
    --max_model_len 16384 \
    --input_filetype jsonl
```

Comment thread tutorials/math/README.md Outdated
Comment on lines +562 to +572
**LSH parameters** (tunable via command-line flags):

The LLM cleanup step supports various specialized prompts for different mathematical content processing needs:
| Flag | Default | Description |
|------|---------|-------------|
| `--char_ngrams` | 24 | Character n-gram size for MinHash. Values below 20 may produce ~5% false positives. |
| `--num_bands` | 20 | Number of LSH bands. More bands = higher recall but slower. |
| `--minhashes_per_band` | 13 | Hashes per band. More hashes = higher precision but lower recall. |
| `--use_64_bit_hash` | False | Use 64-bit hash for fewer collisions on very large datasets. |
| `--seed` | 42 | Seed for MinHash permutations (for reproducibility). |

The similarity threshold is implicitly controlled by `num_bands` and `minhashes_per_band`. The approximate threshold is `(1/num_bands)^(1/minhashes_per_band)`. With the defaults (20 bands, 13 hashes/band), this is approximately 0.72. To detect more similar pairs (stricter dedup), increase `num_bands`; to be more lenient, decrease it.
Contributor


P2 --bands_per_iteration missing from LSH parameters table

The troubleshooting section (line 682) references --bands_per_iteration (default: 5) as a tuning knob for OOM errors, and the flag exists in 5_deduplication.py (line 75–78). Adding it to the parameters table here would make the documentation consistent and help users find it before they hit an OOM error.

Suggested change

**LSH parameters** (tunable via command-line flags):

| Flag | Default | Description |
|------|---------|-------------|
| `--char_ngrams` | 24 | Character n-gram size for MinHash. Values below 20 may produce ~5% false positives. |
| `--num_bands` | 20 | Number of LSH bands. More bands = higher recall but slower. |
| `--minhashes_per_band` | 13 | Hashes per band. More hashes = higher precision but lower recall. |
| `--bands_per_iteration` | 5 | Number of bands to shuffle concurrently. Reduce if you hit OOM errors. |
| `--use_64_bit_hash` | False | Use 64-bit hash for fewer collisions on very large datasets. |
| `--seed` | 42 | Seed for MinHash permutations (for reproducibility). |
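The implicit threshold formula quoted in this thread, `(1/num_bands)^(1/minhashes_per_band)`, can be sanity-checked with a short sketch; with the default 20 bands and 13 hashes per band it comes out to about 0.79, which is consistent with the later commit correcting the stated value from 0.72 to 0.79:

```python
# Quick check of the approximate MinHash-LSH similarity threshold,
#   threshold ~= (1 / num_bands) ** (1 / minhashes_per_band),
# using the defaults documented in the LSH parameter table.
def lsh_threshold(num_bands: int, minhashes_per_band: int) -> float:
    return (1 / num_bands) ** (1 / minhashes_per_band)

print(round(lsh_threshold(20, 13), 2))  # 0.79
```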

Comment thread tutorials/math/README.md Outdated
- Fix wrong input directory in Content Cleaning example
  (deduplicated → preprocessed) to match pipeline order
- Add --bands_per_iteration to LSH parameters table
- Correct AWS description: Common Crawl is part of AWS Open Data
  program, not requester-pays

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lbliii lbliii requested a review from sarahyurick April 14, 2026 14:59
Contributor

@sarahyurick sarahyurick left a comment


Thanks! Mostly looks good to me, left some minor comments. Let's see if we can get @ronjer30 and/or @raosukrit67 to review as well.

Comment threads (4, resolved) tutorials/math/README.md Outdated
- Remove Platform Requirements section (Curator is Linux-only)
- Revert GPU requirements to original wording
- Remove CUDA verification from troubleshooting
- Remove redundant "Repository Not Found" troubleshooting section
- Fix LSH threshold value from 0.72 to 0.79

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Comment thread tutorials/math/README.md Outdated

## Troubleshooting

### No GPUs Detected
Contributor


I think this subsection can be removed. Otherwise LGTM, thanks!

@ronjer30
Contributor

Thanks @lbliii, looks good to me.

@lbliii lbliii requested a review from sarahyurick April 21, 2026 14:23
Remove redundant "No GPUs Detected" troubleshooting subsection

Addresses reviewer feedback: the pynvml reinstall note already appears
in the Install section, so duplicating it under Troubleshooting is
unnecessary.

Signed-off-by: Lawrence Lane <llane@nvidia.com>
@sarahyurick sarahyurick added the r1.2.0 Pick this label for auto cherry-picking into r1.2.0 label Apr 21, 2026
@lbliii
Contributor Author

lbliii commented Apr 21, 2026

> Thanks @lbliii, looks good to me.

Thanks for taking a look! Do you have the ability to hit approve to unblock merge? If not, I can find another team member.

@sarahyurick
Contributor

/ok to test 6cd8a0d

@sarahyurick sarahyurick enabled auto-merge (squash) April 22, 2026 18:16
@sarahyurick sarahyurick merged commit 4c3af97 into main Apr 22, 2026
45 checks passed
sarahyurick added a commit that referenced this pull request Apr 22, 2026
* Improve math tutorial README documentation

Address documentation gaps including: platform requirements (vLLM
x86_64 Linux only), GPU VRAM requirements and compatible GPUs, AWS
credential setup, HuggingFace token guidance with error notes, all 12
available prompts documented with selection guidance, quality score
interpretation table, LSH parameter documentation for deduplication,
cache clearing warning, compact horizontal pipeline diagram, and a
troubleshooting section for common errors.




* Address PR review comments on math tutorial README

- Fix wrong input directory in Content Cleaning example
  (deduplicated → preprocessed) to match pipeline order
- Add --bands_per_iteration to LSH parameters table
- Correct AWS description: Common Crawl is part of AWS Open Data
  program, not requester-pays




* Address reviewer feedback on math tutorial README

- Remove Platform Requirements section (Curator is Linux-only)
- Revert GPU requirements to original wording
- Remove CUDA verification from troubleshooting
- Remove redundant "Repository Not Found" troubleshooting section
- Fix LSH threshold value from 0.72 to 0.79



* Remove redundant "No GPUs Detected" troubleshooting subsection

Addresses reviewer feedback: the pynvml reinstall note already appears
in the Install section, so duplicating it under Troubleshooting is
unnecessary.



---------

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
