Improve math tutorial README documentation #1766
Conversation
Address documentation gaps including: platform requirements (vLLM x86_64 Linux only), GPU VRAM requirements and compatible GPUs, AWS credential setup, HuggingFace token guidance with error notes, all 12 available prompts documented with selection guidance, quality score interpretation table, LSH parameter documentation for deduplication, cache clearing warning, compact horizontal pipeline diagram, and a troubleshooting section for common errors.

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
**Greptile Summary**

This PR significantly improves the math tutorial README with AWS credential guidance, a full listing of all 12 LLM prompts with selection guidance, quality score and LSH parameter tables, a collapsible pipeline diagram, and a troubleshooting section. Most previously flagged issues are addressed.

**Confidence Score: 5/5**

Safe to merge; all remaining findings are P2 style suggestions that do not block merge. All P0/P1 issues from prior rounds are resolved or addressed (wrong input dir fixed, bands_per_iteration added, AWS framing corrected). The two remaining findings are P2 documentation polish (missing platform details and missing MIND example command) that can be addressed in a follow-up. No files require special attention.
**Flowchart**

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart LR
    HF([HuggingFace]) --> D[0_download]
    D --> RAW[(Raw Data)]
    RAW --> CC[1_cc_index_lookup*]
    CC --> PRE[2_text_preprocess]
    S3_CC([Common Crawl S3]) --> PRE
    PRE --> LLM[3_llm_cleanup]
    LLM --> QC[4_quality_classifier]
    QC --> DEDUP[5_deduplication]
    DEDUP --> FINAL[(Final Data)]
    note1["* Only for datasets without WARC metadata"] -.-> CC
    style HF fill:#ff9,stroke:#888
    style FINAL fill:#9f9,stroke:#888
    style S3_CC fill:#ff9,stroke:#888
    style note1 fill:#eee,stroke:#aaa,color:#555
```
Reviews (8). Last reviewed commit: "Merge branch 'main' into lbliii/math-tut..."
```diff
@@ -523,3 +596,87 @@ python tutorials/math/3_llm_cleanup.py \
     --max_model_len 16384 \
     --input_filetype jsonl
```
**Wrong input directory in Content Cleaning example**
This example uses --input $MATH_DATA_DIR/deduplicated, but deduplicated is the final output of Step 5. LLM cleanup (Step 3) consumes preprocessed text from Step 2 and must run before Steps 4 and 5 — the pipeline table on line 315 and the main usage example on line 483 both confirm this ordering. Feeding deduplicated output back through LLM cleanup is logically backwards and will confuse users following this example.
Suggested change:

```shell
python tutorials/math/3_llm_cleanup.py \
    --input $MATH_DATA_DIR/preprocessed \
    --output $MATH_DATA_DIR/cleaned_code \
    --model microsoft/phi-4 \
    --prompt HTML_TO_TEXT_PROMPT_CODE \
    --chunk_data \
    --chunk_length 5000 \
    --max_model_len 16384 \
    --input_filetype jsonl
```
**LSH parameters** (tunable via command-line flags):

| Flag | Default | Description |
|------|---------|-------------|
| `--char_ngrams` | 24 | Character n-gram size for MinHash. Values below 20 may produce ~5% false positives. |
| `--num_bands` | 20 | Number of LSH bands. More bands = higher recall but slower. |
| `--minhashes_per_band` | 13 | Hashes per band. More hashes = higher precision but lower recall. |
| `--use_64_bit_hash` | False | Use 64-bit hash for fewer collisions on very large datasets. |
| `--seed` | 42 | Seed for MinHash permutations (for reproducibility). |

The similarity threshold is implicitly controlled by `num_bands` and `minhashes_per_band`. The approximate threshold is `(1/num_bands)^(1/minhashes_per_band)`. With the defaults (20 bands, 13 hashes/band), this is approximately 0.72. To detect more similar pairs (stricter dedup), increase `num_bands`; to be more lenient, decrease it.
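As a sanity check on the threshold formula quoted above, the arithmetic can be reproduced in a few lines of Python (a standalone sketch; the variables simply mirror the flag defaults):

```python
# Approximate LSH similarity threshold: (1/num_bands) ** (1/minhashes_per_band)
num_bands = 20           # --num_bands default
minhashes_per_band = 13  # --minhashes_per_band default

threshold = (1 / num_bands) ** (1 / minhashes_per_band)
print(f"approximate threshold: {threshold:.3f}")  # ~0.794
```

Note that the formula actually evaluates to roughly 0.79, not 0.72 — consistent with the later commit in this thread that fixes the documented value from 0.72 to 0.79.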
**`--bands_per_iteration` missing from LSH parameters table**
The troubleshooting section (line 682) references --bands_per_iteration (default: 5) as a tuning knob for OOM errors, and the flag exists in 5_deduplication.py (line 75–78). Adding it to the parameters table here would make the documentation consistent and help users find it before they hit an OOM error.
Suggested change:

```diff
 | Flag | Default | Description |
 |------|---------|-------------|
 | `--char_ngrams` | 24 | Character n-gram size for MinHash. Values below 20 may produce ~5% false positives. |
 | `--num_bands` | 20 | Number of LSH bands. More bands = higher recall but slower. |
 | `--minhashes_per_band` | 13 | Hashes per band. More hashes = higher precision but lower recall. |
+| `--bands_per_iteration` | 5 | Number of bands to shuffle concurrently. Reduce if you hit OOM errors. |
 | `--use_64_bit_hash` | False | Use 64-bit hash for fewer collisions on very large datasets. |
 | `--seed` | 42 | Seed for MinHash permutations (for reproducibility). |
```
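To make the interplay of these flags concrete, here is a toy pure-Python sketch of MinHash with LSH banding. This is illustrative only — not Curator's actual GPU implementation — and all function names are hypothetical:

```python
import hashlib

def char_ngrams(text, n=24):
    # Shingle the document into overlapping character n-grams (--char_ngrams).
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingles, num_bands=20, minhashes_per_band=13):
    # One minimum per seeded hash function; total = num_bands * minhashes_per_band.
    num_hashes = num_bands * minhashes_per_band
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingles)
        for seed in range(num_hashes)
    ]

def estimate_jaccard(sig_a, sig_b):
    # The fraction of agreeing positions estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_candidate_pair(sig_a, sig_b, num_bands=20, minhashes_per_band=13):
    # Split each signature into bands; any fully matching band flags a candidate pair.
    for b in range(num_bands):
        lo, hi = b * minhashes_per_band, (b + 1) * minhashes_per_band
        if sig_a[lo:hi] == sig_b[lo:hi]:
            return True
    return False
```

An identical pair of documents matches in every band; raising `num_bands` (for a fixed number of hashes per band) makes a partial match more likely to surface in at least one band, which is why it lowers the effective similarity threshold.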
- Fix wrong input directory in Content Cleaning example (deduplicated → preprocessed) to match pipeline order
- Add `--bands_per_iteration` to LSH parameters table
- Correct AWS description: Common Crawl is part of the AWS Open Data program, not requester-pays

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
**sarahyurick** left a comment:
Thanks! Mostly looks good to me, left some minor comments. Let's see if we can get @ronjer30 and/or @raosukrit67 to review as well.
- Remove Platform Requirements section (Curator is Linux-only)
- Revert GPU requirements to original wording
- Remove CUDA verification from troubleshooting
- Remove redundant "Repository Not Found" troubleshooting section
- Fix LSH threshold value from 0.72 to 0.79

Signed-off-by: Lawrence Lane <llane@nvidia.com>
```markdown
## Troubleshooting

### No GPUs Detected
```
I think this subsection can be removed. Otherwise LGTM, thanks!
Thanks @lbliii, looks good to me.
Addresses reviewer feedback: the pynvml reinstall note already appears in the Install section, so duplicating it under Troubleshooting is unnecessary. Signed-off-by: Lawrence Lane <llane@nvidia.com>
thanks for taking a look! do you have the ability to hit approve to unblock merge? if not i can find another team member
/ok to test 6cd8a0d
* Improve math tutorial README documentation

  Address documentation gaps including: platform requirements (vLLM x86_64 Linux only), GPU VRAM requirements and compatible GPUs, AWS credential setup, HuggingFace token guidance with error notes, all 12 available prompts documented with selection guidance, quality score interpretation table, LSH parameter documentation for deduplication, cache clearing warning, compact horizontal pipeline diagram, and a troubleshooting section for common errors.

* Address PR review comments on math tutorial README

  - Fix wrong input directory in Content Cleaning example (deduplicated → preprocessed) to match pipeline order
  - Add --bands_per_iteration to LSH parameters table
  - Correct AWS description: Common Crawl is part of AWS Open Data program, not requester-pays

* Address reviewer feedback on math tutorial README

  - Remove Platform Requirements section (Curator is Linux-only)
  - Revert GPU requirements to original wording
  - Remove CUDA verification from troubleshooting
  - Remove redundant "Repository Not Found" troubleshooting section
  - Fix LSH threshold value from 0.72 to 0.79

* Remove redundant "No GPUs Detected" troubleshooting subsection

  Addresses reviewer feedback: the pynvml reinstall note already appears in the Install section, so duplicating it under Troubleshooting is unnecessary.

---------

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Description
Addresses documentation gaps in the math data curation pipeline tutorial README, based on feedback from internal review. Changes include:
Checklist