Improve math tutorial README documentation (#1766)
* Improve math tutorial README documentation
Address documentation gaps including: platform requirements (vLLM
x86_64 Linux only), GPU VRAM requirements and compatible GPUs, AWS
credential setup, HuggingFace token guidance with error notes, all 12
available prompts documented with selection guidance, quality score
interpretation table, LSH parameter documentation for deduplication,
cache clearing warning, compact horizontal pipeline diagram, and a
troubleshooting section for common errors.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Address PR review comments on math tutorial README
- Fix wrong input directory in Content Cleaning example
(deduplicated → preprocessed) to match pipeline order
- Add --bands_per_iteration to LSH parameters table
- Correct AWS description: Common Crawl is part of AWS Open Data
program, not requester-pays
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Address reviewer feedback on math tutorial README
- Remove Platform Requirements section (Curator is Linux-only)
- Revert GPU requirements to original wording
- Remove CUDA verification from troubleshooting
- Remove redundant "Repository Not Found" troubleshooting section
- Fix LSH threshold value from 0.72 to 0.79
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Remove redundant "No GPUs Detected" troubleshooting subsection
Addresses reviewer feedback: the pynvml reinstall note already appears
in the Install section, so duplicating it under Troubleshooting is
unnecessary.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Downloading the CC Index (Option 1) and fetching WARC content via S3 both require AWS credentials. Common Crawl data is part of the [AWS Open Data Sponsorship Program](https://aws.amazon.com/opendata/open-data-sponsorship-program/) and is free to download — but you still need an AWS account and credentials to authenticate with the `aws s3 cp` command:

```bash
# Option 1: Environment variables
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key

# Option 2: AWS CLI configuration
aws configure
```

See the [AWS CLI documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) for more details. If you don't have AWS credentials, you can use the CDX Index via HTTPS as a fallback (see [CC Index Requirements](#common-crawl-cc-index-requirements)).

### Common Crawl (CC) Index Requirements

The index lookup script (`1_cc_index_lookup.py`) uses **cuDF** against a local CC Index in **parquet format**.
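Because the lookup reads the index straight off disk, a quick layout check before running step 1 can save a long debugging session. A minimal sketch, assuming the hive-partitioned layout documented in this README; the helper name, crawl ID, and use of `MATH_DATA_DIR` as the base path are our own illustration:

```python
import glob
import os

def find_index_shards(base_path, crawl="CC-MAIN-2024-10"):
    """Return parquet shards under the hive-partitioned CC Index layout:
    <base_path>/crawl=<crawl>/subset=warc/*.parquet
    """
    pattern = os.path.join(base_path, f"crawl={crawl}", "subset=warc", "*.parquet")
    return sorted(glob.glob(pattern))

# An empty result usually means a wrong base path or a misnamed partition directory.
shards = find_index_shards(os.environ.get("MATH_DATA_DIR", "."))
if not shards:
    print("No CC Index shards found -- check the directory layout before running step 1.")
```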
For datasets with WARC metadata already included, set `"needs_cc_lookup": false`.

## Complete Pipeline Flow

**Quick overview:**

```
HuggingFace ──► 0_download ──► Raw Data ──► 1_cc_index_lookup* ──► 2_text_preprocess ──► 3_llm_cleanup ──► 4_quality_classifier ──► 5_deduplication ──► Final Data
                                                    ▲                       ▲
                                              CC Index (S3)         Common Crawl (S3)

* Step 1 only needed for datasets without WARC metadata (OpenWebMath, MegaMath, etc.)
```

<details>
<summary>Detailed pipeline diagram (click to expand)</summary>
The `0_download.py` script downloads math datasets from HuggingFace Hub. It reads dataset configurations from `datasets.json` and downloads parquet files to `$MATH_DATA_DIR/raw/<dataset_name>/`.

### HuggingFace Authentication

Several steps in this pipeline require a HuggingFace token:

- **Step 0**: Downloading gated datasets (e.g., some FineMath splits)
- **Step 4**: The FineMath quality classifier model (`HuggingFaceTB/finemath-classifier`)

**Set up your token before starting the pipeline:**

```bash
# Option 1: Environment variable (recommended)
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Option 2: CLI login (saves token to ~/.cache/huggingface/token)
huggingface-cli login
```

Get your token at: https://huggingface.co/settings/tokens

> **Note:** A missing or invalid token typically produces a "repository not found" error rather than an explicit authentication error. If you see this error, verify your `HF_TOKEN` is set and valid.
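Before starting a long run, it can be worth verifying that a token is actually visible to the pipeline. A minimal sketch; the helper name is ours, and the token file path is the default location written by `huggingface-cli login` as mentioned above:

```python
import os
from pathlib import Path

def hf_token_available() -> bool:
    """True if a HuggingFace token is set via env var or a prior CLI login."""
    if os.environ.get("HF_TOKEN"):
        return True
    # Default location written by `huggingface-cli login`
    token_file = Path.home() / ".cache" / "huggingface" / "token"
    return token_file.is_file() and token_file.read_text().strip() != ""

if not hf_token_available():
    print("Warning: no HF token found -- gated downloads (Step 0) and the "
          "finemath-classifier (Step 4) may fail with 'repository not found'.")
```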
**Output**: JSONL files with `cleaned_text` (LLM-processed text). When chunking is enabled, chunks are automatically merged back into one row per document.

**Key flags**:

- `--chunk_data` / `--chunk_length`: Enable token-based chunking before LLM processing. **These flags must be used together** — `--chunk_length` is required when `--chunk_data` is set, and `--max_model_len` is also required for chunking.
- `--prompt`: Name of the prompt to use. See [Available Prompts](#available-prompts) for the full list and selection guidance.
- `--groupby`: Columns to group by for chunk merging (default: `url`)
- `--max_text_length`: Maximum merged text length in chars (default: 900,000)
- `--classification`: Output classification labels instead of cleaned text
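The chunk-then-merge behavior these flags enable can be pictured with a simplified sketch. This is an illustration only: the real script chunks by model tokens and merges on the `--groupby` columns, while here we split a plain token list and join with spaces, and both function names are ours:

```python
def chunk_tokens(tokens, chunk_length):
    """Split a token list into consecutive chunks of at most chunk_length tokens."""
    return [tokens[i:i + chunk_length] for i in range(0, len(tokens), chunk_length)]

def merge_chunks(cleaned_chunks):
    """Merge LLM-cleaned chunks back into one document row."""
    return " ".join(cleaned_chunks)

tokens = "a b c d e f g".split()
chunks = chunk_tokens(tokens, 3)  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
```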
**Input**: JSONL files from Step 3. The classifier reads from the `text` field by default; use `--text-field cleaned_text` if your data was processed by the LLM cleanup step.

**Output**: JSONL files with additional columns:

- `finemath_scores`: float scores (0.0–5.0)
- `finemath_int_scores`: integer scores (0–5)

**Interpreting quality scores:**

| Score | Interpretation |
|-------|----------------|
| 0 | No mathematical content |
| 1 | Minimal or tangential math content |
| 2 | Some math content but low quality (e.g., poorly formatted, incomplete) |
| 3 | Moderate quality math content — usable for general training |
| 4 | High quality math content — well-structured and educational |

As a guideline, the pre-configured FineMath datasets use score thresholds of ≥3 (`FINEMATH_3PLUS`) and ≥4 (`FINEMATH_4PLUS`). A threshold of ≥3 is a reasonable starting point; use ≥4 for higher-quality, smaller datasets.
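Applying such a threshold is a one-line filter over the classifier output. A sketch over plain dicts; the field name matches the output column above, and the sample rows are invented for illustration:

```python
rows = [
    {"cleaned_text": "The quadratic formula ...", "finemath_int_scores": 4},
    {"cleaned_text": "Buy cheap textbooks ...",   "finemath_int_scores": 0},
    {"cleaned_text": "An integral example ...",   "finemath_int_scores": 3},
]

# Keep documents at FineMath-3+ quality (score >= 3)
finemath_3plus = [r for r in rows if r["finemath_int_scores"] >= 3]
print(len(finemath_3plus))  # 2 of the 3 sample rows survive
```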
1. First stage: Duplicate IDs are identified and saved to `duplicate_ids_dir`
2. Second stage: Duplicates are removed from the dataset
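The two stages above can be pictured as: flag duplicate IDs, then filter the dataset against that set. A toy sketch using exact-text matching on in-memory dicts; the real pipeline uses MinHash/LSH over JSONL files at scale:

```python
# Stage 1: identify duplicate IDs (exact-text duplicates, for illustration)
docs = [{"id": 1, "text": "x = 2"}, {"id": 2, "text": "x = 2"}, {"id": 3, "text": "y = 3"}]
seen, duplicate_ids = {}, set()
for doc in docs:
    if doc["text"] in seen:
        duplicate_ids.add(doc["id"])  # keep the first occurrence, flag the rest
    else:
        seen[doc["text"]] = doc["id"]

# Stage 2: remove the flagged duplicates from the dataset
deduped = [doc for doc in docs if doc["id"] not in duplicate_ids]
```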

> **Important:** The `cache_dir` **must be manually cleared between runs**. Stale cache files from a previous run will cause incorrect results. Delete or empty the cache directory before re-running:
>
> ```bash
> rm -rf $MATH_DATA_DIR/dedup_cache/*
> ```

**LSH parameters** (tunable via command-line flags):

| Flag | Default | Description |
|------|---------|-------------|
| `--char_ngrams` | 24 | Character n-gram size for MinHash. Values below 20 may produce ~5% false positives. |
| `--num_bands` | 20 | Number of LSH bands. More bands = higher recall but slower. |
| `--minhashes_per_band` | 13 | Hashes per band. More hashes = higher precision but lower recall. |
| `--bands_per_iteration` | 5 | Number of bands to shuffle concurrently. Reduce if you hit OOM errors. |
| `--use_64_bit_hash` | False | Use a 64-bit hash for fewer collisions on very large datasets. |
| `--seed` | 42 | Seed for MinHash permutations (for reproducibility). |

The similarity threshold is implicitly controlled by `num_bands` and `minhashes_per_band`. The approximate threshold is `(1/num_bands)^(1/minhashes_per_band)`. With the defaults (20 bands, 13 hashes/band), this is approximately 0.79. To detect more similar pairs (stricter dedup), increase `num_bands`; to be more lenient, decrease it.
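The approximate-threshold formula is easy to check numerically; the helper name below is ours:

```python
def lsh_threshold(num_bands: int, minhashes_per_band: int) -> float:
    """Approximate Jaccard similarity at which a pair becomes an LSH candidate."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)

print(round(lsh_threshold(20, 13), 2))  # defaults -> 0.79
# Doubling the bands lowers the threshold, i.e. stricter dedup:
print(round(lsh_threshold(40, 13), 2))  # -> ~0.75
```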

## Available Prompts

The LLM cleanup step (`3_llm_cleanup.py`) supports 12 prompts via the `--prompt` flag. The prompt is loaded by name from `nemo_curator.utils.prompts`.

### Content Cleaning Prompts

Use these for extracting and cleaning text from raw HTML/web content:

| Prompt | Use Case | When to Choose |
|--------|----------|----------------|
| **`HTML_TO_TEXT_PROMPT`** (default) | Extract main content, preserve math, standardize equations to LaTeX `$...$`, remove boilerplate | General-purpose math content extraction |
| **`HTML_TO_TEXT_PROMPT_CODE`** | Same as above but also preserves code blocks | Pages mixing math and significant code (e.g., computational math tutorials, Jupyter-style content) |
Use these with `--classification` to label content rather than clean it:

| Prompt | Use Case |
|--------|----------|
| **`MATH_TOPIC_CLASSIFICATION_PROMPT`** | Classifies content into topics: Math, CS, Physics, Statistics, Chemistry, Economics, or Other |
| **`CODE_QUALITY_PROMPT`** | Rates code quality on a 0–5 scale with detailed criteria |
| **`CODE_QUALITY_PROMPT_SIMPLIFIED`** | Rates code quality on a 0–2 scale (simpler/faster) |

```bash
# Example: classify math topics
python tutorials/math/3_llm_cleanup.py \
  --input $MATH_DATA_DIR/preprocessed \
  --output $MATH_DATA_DIR/classified_topics \
  --model microsoft/phi-4 \
  --prompt MATH_TOPIC_CLASSIFICATION_PROMPT \
  --classification \
  --max_model_len 16384 \
  --input_filetype jsonl
```

### Synthetic Dialogue Prompts (MIND Dataset)

These prompts convert source text into multi-turn dialogue formats, based on the [MIND paper](https://arxiv.org/pdf/2410.12881). Useful for generating conversational training data from math content:

| Prompt | Format |
|--------|--------|
| **`mind_two_profs`** | Discussion between two professors |
| **`mind_layman_knowall`** | Expert explaining to a layperson |
| **`mind_debate`** | Debate-style discussion |

## Troubleshooting

### 0 URLs Matched in CC Index Lookup (Step 1)

- Verify your CC Index parquet files follow the required hive-partitioned directory structure: `<base_path>/crawl=CC-MAIN-YYYY-WW/subset=warc/*.parquet`
- Check that the crawl ID(s) you downloaded actually contain your URLs. Not all URLs appear in every crawl — try multiple crawls.
- Confirm the `url_col` in `datasets.json` matches the actual column name in your dataset.

### Empty LLM Output (Step 3)

- Check that `--max_model_len` is large enough for your input chunks. The script filters out chunks that exceed 80% of `max_model_len`.
- Verify the model downloaded successfully and you have sufficient VRAM (~20 GB for Phi-4).
- Try a smaller `--chunk_length` value if documents are being dropped.

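The 80% cutoff mentioned above can be sketched as follows; the helper name is ours, and the real script counts model tokens rather than taking a precomputed count:

```python
def fits_in_context(num_tokens: int, max_model_len: int) -> bool:
    """Chunks above 80% of max_model_len are dropped before LLM cleanup."""
    return num_tokens <= 0.8 * max_model_len

chunks = [3000, 12000, 14000]  # token counts per chunk
kept = [n for n in chunks if fits_in_context(n, max_model_len=16384)]
# 14000 exceeds 0.8 * 16384 = 13107, so that chunk would be dropped
```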
### All Classifier Scores Are 0 (Step 4)

- Ensure you're using the correct `--text-field`. After LLM cleanup, the text is in `cleaned_text`, not `text`:

  ```bash
  python tutorials/math/4_quality_classifier.py \
    --input "$MATH_DATA_DIR/cleaned/*.jsonl" \
    --output $MATH_DATA_DIR/classified \
    --text-field cleaned_text
  ```

- Verify the input files are not empty and contain the expected text field.

### Common Crawl S3 Fetch Failures (Step 2)

- If using S3 (`CC_USE_S3=1`), ensure your AWS credentials are configured — see [AWS Credentials](#aws-credentials-for-common-crawl-s3-access).
- HTTPS fetching (the default) does not require AWS credentials but may be slower.
- Malformed WARC records are skipped silently. Use `--report-stats` to see extraction statistics and identify how many records failed.

### Deduplication Errors (Step 5)

- If results seem incorrect, ensure you cleared `cache_dir` from any previous run: `rm -rf $MATH_DATA_DIR/dedup_cache/*`
- Out-of-memory errors: try reducing `--bands_per_iteration` (default: 5) to process fewer bands concurrently.