
Commit 4c3af97

Authored by lbliii, claude, ayushdg, and sarahyurick
Improve math tutorial README documentation (#1766)
* Improve math tutorial README documentation

  Address documentation gaps including: platform requirements (vLLM x86_64 Linux only), GPU VRAM requirements and compatible GPUs, AWS credential setup, HuggingFace token guidance with error notes, all 12 available prompts documented with selection guidance, quality score interpretation table, LSH parameter documentation for deduplication, cache clearing warning, compact horizontal pipeline diagram, and a troubleshooting section for common errors.

  Signed-off-by: Lawrence Lane <llane@nvidia.com>
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address PR review comments on math tutorial README

  - Fix wrong input directory in Content Cleaning example (deduplicated → preprocessed) to match pipeline order
  - Add --bands_per_iteration to LSH parameters table
  - Correct AWS description: Common Crawl is part of AWS Open Data program, not requester-pays

  Signed-off-by: Lawrence Lane <llane@nvidia.com>
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address reviewer feedback on math tutorial README

  - Remove Platform Requirements section (Curator is Linux-only)
  - Revert GPU requirements to original wording
  - Remove CUDA verification from troubleshooting
  - Remove redundant "Repository Not Found" troubleshooting section
  - Fix LSH threshold value from 0.72 to 0.79

  Signed-off-by: Lawrence Lane <llane@nvidia.com>

* Remove redundant "No GPUs Detected" troubleshooting subsection

  Addresses reviewer feedback: the pynvml reinstall note already appears in the Install section, so duplicating it under Troubleshooting is unnecessary.

  Signed-off-by: Lawrence Lane <llane@nvidia.com>

---------

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
1 parent 72b20ba commit 4c3af97

1 file changed

Lines changed: 155 additions & 13 deletions

File tree

tutorials/math/README.md

@@ -25,6 +25,21 @@ uv pip install --force-reinstall pynvml
- RHEL/Fedora: `sudo dnf install -y lynx` (or `sudo yum install -y lynx`)
- Conda: `conda install -c conda-forge lynx`

### AWS Credentials (for Common Crawl S3 Access)

Downloading the CC Index (Option 1) and fetching WARC content via S3 both require AWS credentials. Common Crawl data is part of the [AWS Open Data Sponsorship Program](https://aws.amazon.com/opendata/open-data-sponsorship-program/) and is free to download, but you still need an AWS account and credentials to authenticate with the `aws s3 cp` command:

```bash
# Option 1: Environment variables
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key

# Option 2: AWS CLI configuration
aws configure
```

See the [AWS CLI documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) for more details. If you don't have AWS credentials, you can use the CDX Index via HTTPS as a fallback (see [CC Index Requirements](#common-crawl-cc-index-requirements)).
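
To confirm your credentials work before running the pipeline, a quick listing against the Common Crawl bucket is enough (a minimal check, assuming the AWS CLI is installed; `cc-index/table/cc-main/warc/` is the standard location of the columnar index in the `s3://commoncrawl` bucket):

```bash
# List one level of the columnar CC Index; this succeeds only if
# the AWS CLI picks up valid credentials
aws s3 ls s3://commoncrawl/cc-index/table/cc-main/warc/
```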

### Common Crawl (CC) Index Requirements

The index lookup script (`1_cc_index_lookup.py`) uses **cuDF** against a local CC Index in **parquet format**.
@@ -168,6 +183,19 @@ For datasets with WARC metadata already included, set `"needs_cc_lookup": false`

## Complete Pipeline Flow

**Quick overview:**

```
HuggingFace ──► 0_download ──► Raw Data ──► 1_cc_index_lookup* ──► 2_text_preprocess ──► 3_llm_cleanup ──► 4_quality_classifier ──► 5_deduplication ──► Final Data
                                                    ▲                       ▲
                                              CC Index (S3)        Common Crawl (S3)

* Step 1 only needed for datasets without WARC metadata (OpenWebMath, MegaMath, etc.)
```

<details>
<summary>Detailed pipeline diagram (click to expand)</summary>

```mermaid
flowchart TD
subgraph download["Download (Optional)"]
@@ -276,6 +304,8 @@ flowchart TD
linkStyle default stroke:#76b900,stroke-width:2px
```

</details>

### Pipeline Summary

| Step | Script | Input | Output | Required For |
@@ -301,12 +331,16 @@ mkdir -p $MATH_DATA_DIR/{raw,enriched,preprocessed,cleaned,classified,dedup_cach

The `0_download.py` script downloads math datasets from HuggingFace Hub. It reads dataset configurations from `datasets.json` and downloads parquet files to `$MATH_DATA_DIR/raw/<dataset_name>/`.

### HuggingFace Authentication

Several steps in this pipeline require a HuggingFace token:
- **Step 0**: Downloading gated datasets (e.g., some FineMath splits)
- **Step 4**: The FineMath quality classifier model (`HuggingFaceTB/finemath-classifier`)

**Set up your token before starting the pipeline:**

```bash
# Option 1: Environment variable (recommended)
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Option 2: CLI login (saves token to ~/.cache/huggingface/token)
@@ -315,6 +349,8 @@ huggingface-cli login

Get your token at: https://huggingface.co/settings/tokens

> **Note:** A missing or invalid token typically produces a "repository not found" error rather than an explicit authentication error. If you see this error, verify your `HF_TOKEN` is set and valid.
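
You can verify the token the pipeline will see with the `huggingface-cli` tool that ships with `huggingface_hub`:

```bash
# Prints your HF username if the token is valid; errors out otherwise
huggingface-cli whoami
```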

### Download Commands

```bash
@@ -457,7 +493,8 @@ python tutorials/math/3_llm_cleanup.py \
**Output**: JSONL files with `cleaned_text` (LLM-processed text). When chunking is enabled, chunks are automatically merged back into one row per document.

**Key flags**:
- `--chunk_data` / `--chunk_length`: Enable token-based chunking before LLM processing. **These flags must be used together:** `--chunk_length` is required when `--chunk_data` is set, and `--max_model_len` is also required for chunking (see the example after this list).
- `--prompt`: Name of the prompt to use. See [Available Prompts](#available-prompts) for the full list and selection guidance.
- `--groupby`: Columns to group by for chunk merging (default: `url`)
- `--max_text_length`: Maximum merged text length in chars (default: 900,000)
- `--classification`: Output classification labels instead of cleaned text
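
A chunked run then looks like this (an illustrative sketch: the flag names come from the list above, but the `--chunk_length` value and the directories are placeholders to adapt):

```bash
# Chunked cleanup: --chunk_data, --chunk_length, and --max_model_len together
python tutorials/math/3_llm_cleanup.py \
  --input $MATH_DATA_DIR/preprocessed \
  --output $MATH_DATA_DIR/cleaned \
  --model microsoft/phi-4 \
  --chunk_data \
  --chunk_length 4096 \
  --max_model_len 16384 \
  --input_filetype jsonl
```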
@@ -473,11 +510,24 @@ python tutorials/math/4_quality_classifier.py \
  --output $MATH_DATA_DIR/classified
```

**Input**: JSONL files from Step 3. The classifier reads from the `text` field by default; use `--text-field cleaned_text` if your data was processed by the LLM cleanup step.

**Output**: JSONL files with additional columns:
- `finemath_scores`: float scores (0.0–5.0)
- `finemath_int_scores`: integer scores (0–5)

**Interpreting quality scores:**

| Score | Interpretation |
|-------|----------------|
| 0 | No mathematical content |
| 1 | Minimal or tangential math content |
| 2 | Some math content but low quality (e.g., poorly formatted, incomplete) |
| 3 | Moderate quality math content; usable for general training |
| 4 | High quality math content; well-structured and educational |
| 5 | Excellent math content; textbook-quality, clear explanations |

As a guideline, the pre-configured FineMath datasets use score thresholds of ≥3 (`FINEMATH_3PLUS`) and ≥4 (`FINEMATH_4PLUS`). A threshold of ≥3 is a reasonable starting point; use ≥4 for higher-quality, smaller datasets.
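
Once scores are in place, filtering to a FineMath-style subset is a one-liner per file (an illustrative sketch, assuming `jq` is installed; the `filtered/` directory is a placeholder you create for the output):

```bash
# Keep only rows with finemath_int_scores >= 3 (the FINEMATH_3PLUS threshold)
mkdir -p $MATH_DATA_DIR/filtered
for f in $MATH_DATA_DIR/classified/*.jsonl; do
  jq -c 'select(.finemath_int_scores >= 3)' "$f" \
    > "$MATH_DATA_DIR/filtered/$(basename "$f")"
done
```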

## Step 5: Deduplication

@@ -500,21 +550,41 @@ python tutorials/math/5_deduplication.py \
1. First stage: Duplicate IDs are identified and saved to `duplicate_ids_dir`
2. Second stage: Duplicates are removed from the dataset

> **Important:** The `cache_dir` **must be manually cleared between runs**. Stale cache files from a previous run will cause incorrect results. Delete or empty the cache directory before re-running:
> ```bash
> rm -rf $MATH_DATA_DIR/dedup_cache/*
> ```

**LSH parameters** (tunable via command-line flags):

| Flag | Default | Description |
|------|---------|-------------|
| `--char_ngrams` | 24 | Character n-gram size for MinHash. Values below 20 may produce ~5% false positives. |
| `--num_bands` | 20 | Number of LSH bands. More bands = higher recall but slower. |
| `--minhashes_per_band` | 13 | Hashes per band. More hashes = higher precision but lower recall. |
| `--bands_per_iteration` | 5 | Number of bands to shuffle concurrently. Reduce if you hit OOM errors. |
| `--use_64_bit_hash` | False | Use 64-bit hash for fewer collisions on very large datasets. |
| `--seed` | 42 | Seed for MinHash permutations (for reproducibility). |

The similarity threshold is implicitly controlled by `num_bands` and `minhashes_per_band`. The approximate threshold is `(1/num_bands)^(1/minhashes_per_band)`. With the defaults (20 bands, 13 hashes/band), this is approximately 0.79. To detect more similar pairs (stricter dedup), increase `num_bands`; to be more lenient, decrease it.
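
You can check the implied threshold for any configuration directly from the formula above:

```bash
# Approximate Jaccard threshold: (1/num_bands)^(1/minhashes_per_band)
python3 -c "print((1/20) ** (1/13))"   # ~0.794 with the defaults
```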

## Available Prompts

The LLM cleanup step (`3_llm_cleanup.py`) supports 12 prompts via the `--prompt` flag. The prompt is loaded by name from `nemo_curator.utils.prompts`.

### Content Cleaning Prompts

Use these for extracting and cleaning text from raw HTML/web content:

| Prompt | Use Case | When to Choose |
|--------|----------|----------------|
| **`HTML_TO_TEXT_PROMPT`** (default) | Extract main content, preserve math, standardize equations to LaTeX `$...$`, remove boilerplate | General-purpose math content extraction |
| **`HTML_TO_TEXT_PROMPT_CODE`** | Same as above but also preserves code blocks | Pages mixing math and significant code (e.g., computational math tutorials, Jupyter-style content) |

```bash
# Example: cleaning code-heavy math content
python tutorials/math/3_llm_cleanup.py \
  --input $MATH_DATA_DIR/preprocessed \
  --output $MATH_DATA_DIR/cleaned_code \
  --model microsoft/phi-4 \
  --prompt HTML_TO_TEXT_PROMPT_CODE \
@@ -523,3 +593,75 @@ python tutorials/math/3_llm_cleanup.py \
  --max_model_len 16384 \
  --input_filetype jsonl
```

### Classification Prompts

Use these with `--classification` to label content rather than clean it:

| Prompt | Use Case |
|--------|----------|
| **`MATH_TOPIC_CLASSIFICATION_PROMPT`** | Classifies content into topics: Math, CS, Physics, Statistics, Chemistry, Economics, or Other |
| **`CODE_QUALITY_PROMPT`** | Rates code quality on a 0–5 scale with detailed criteria |
| **`CODE_QUALITY_PROMPT_SIMPLIFIED`** | Rates code quality on a 0–2 scale (simpler/faster) |

```bash
# Example: classify math topics
python tutorials/math/3_llm_cleanup.py \
  --input $MATH_DATA_DIR/preprocessed \
  --output $MATH_DATA_DIR/classified_topics \
  --model microsoft/phi-4 \
  --prompt MATH_TOPIC_CLASSIFICATION_PROMPT \
  --classification \
  --max_model_len 16384 \
  --input_filetype jsonl
```

### Synthetic Dialogue Prompts (MIND Dataset)

These prompts convert source text into multi-turn dialogue formats, based on the [MIND paper](https://arxiv.org/pdf/2410.12881). Useful for generating conversational training data from math content:

| Prompt | Format |
|--------|--------|
| **`mind_two_profs`** | Discussion between two professors |
| **`mind_teacher_student`** | Teacher-student Q&A |
| **`mind_two_students`** | Two students discussing an assignment |
| **`mind_interview`** | Interview with an expert |
| **`mind_problem_solving`** | Problem-solving conversation |
| **`mind_layman_knowall`** | Expert explaining to a layperson |
| **`mind_debate`** | Debate-style discussion |
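
Invocation follows the same pattern as the other prompts (an illustrative sketch; the output directory is a placeholder, and the model and flags mirror the examples above):

```bash
# Example: generate teacher-student dialogues from preprocessed text
python tutorials/math/3_llm_cleanup.py \
  --input $MATH_DATA_DIR/preprocessed \
  --output $MATH_DATA_DIR/dialogues \
  --model microsoft/phi-4 \
  --prompt mind_teacher_student \
  --max_model_len 16384 \
  --input_filetype jsonl
```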

## Troubleshooting

### 0 URLs Matched in CC Index Lookup (Step 1)

- Verify your CC Index parquet files follow the required hive-partitioned directory structure: `<base_path>/crawl=CC-MAIN-YYYY-WW/subset=warc/*.parquet`
- Check that the crawl ID(s) you downloaded actually contain your URLs. Not all URLs appear in every crawl; try multiple crawls.
- Confirm the `url_col` in `datasets.json` matches the actual column name in your dataset.
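
A quick way to confirm the layout (a minimal sketch; `$CC_INDEX_DIR` is a hypothetical variable standing in for your local index base path):

```bash
# Should print parquet files under crawl=.../subset=warc/ partitions
find "$CC_INDEX_DIR" -path "*crawl=CC-MAIN-*" -path "*subset=warc*" \
  -name "*.parquet" | head
```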

### Empty LLM Output (Step 3)

- Check that `--max_model_len` is large enough for your input chunks. The script filters out chunks that exceed 80% of `max_model_len` (e.g., with `--max_model_len 16384`, chunks longer than ~13,100 tokens are dropped).
- Verify the model downloaded successfully and you have sufficient VRAM (~20 GB for Phi-4).
- Try a smaller `--chunk_length` value if documents are being dropped.

### All Classifier Scores Are 0 (Step 4)

- Ensure you're using the correct `--text-field`. After LLM cleanup, the text is in `cleaned_text`, not `text`:
  ```bash
  python tutorials/math/4_quality_classifier.py \
    --input "$MATH_DATA_DIR/cleaned/*.jsonl" \
    --output $MATH_DATA_DIR/classified \
    --text-field cleaned_text
  ```
- Verify the input files are not empty and contain the expected text field.

### Common Crawl S3 Fetch Failures (Step 2)

- If using S3 (`CC_USE_S3=1`), ensure your AWS credentials are configured; see [AWS Credentials](#aws-credentials-for-common-crawl-s3-access).
- HTTPS fetching (the default) does not require AWS credentials but may be slower.
- Malformed WARC records are skipped silently. Use `--report-stats` to see extraction statistics and identify how many records failed.

### Deduplication Errors (Step 5)

- If results seem incorrect, ensure you cleared `cache_dir` from any previous run: `rm -rf $MATH_DATA_DIR/dedup_cache/*`
- Out-of-memory errors: try reducing `--bands_per_iteration` (default: 5) to process fewer bands concurrently.
