Adds difficulty sampling curriculum dataloader and dataset builder #1661

Open
undfined wants to merge 45 commits into main from undfined/difficulty-sampling

@undfined commented May 6, 2026

Summary

Adds a difficulty-map generation pipeline and a difficulty-aware prompt sampling path in the dataloader.

  • Adds scripts/data/difficulty_sampling/create_difficulty_map.py to build per-instance difficulty metadata from Hugging Face datasets with pass-count / attempt-count aggregates.
  • The difficulty builder computes beta-binomial posterior statistics and writes difficulty.value, difficulty.posterior_mean, difficulty.posterior_lower_bound, difficulty.expected_quantile, difficulty.bucket_index, and difficulty.bucket_count.
  • The difficulty builder writes JSONL outputs plus .schema.json and .metadata.json sidecars, and supports optional push-to-hub for a single task/model output group.
  • Adds open_instruct/difficulty_curriculum.py, which defines the difficulty curriculum config, metadata parsing, bucket schedule, within-bucket weighting, optional adaptive bucket reweighting, and quantile-based filtering over difficulty metadata.
  • Extends open_instruct/data_loader.py with DifficultyCurriculumHFDataLoader and prompt-loader integration so training can sample prompts through the difficulty curriculum instead of uniform reshuffling.
  • Threads the training step into the sampler, records observed rewards / advantages back into the sampler for adaptive updates, and logs curriculum sampling metrics during training.
  • Adds reference curriculum metadata/schema artifacts under configs/curriculum/Qwen_Qwen3-4B-Base/.
  • Adds user-facing documentation in scripts/data/difficulty_sampling/README.md.
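
As a rough illustration of the posterior and bucketing steps described above, the sketch below turns pass-count / attempt-count aggregates into a posterior mean and an equal-frequency bucket index. It is not the code in create_difficulty_map.py: the uniform Beta(1, 1) prior, the definition of difficulty as one minus the posterior mean pass rate, and the helper names `beta_posterior` and `assign_buckets` are all assumptions made for the example.

```python
import math


def beta_posterior(passes, attempts, prior_alpha=1.0, prior_beta=1.0):
    """Return (mean, std) of the Beta posterior over a prompt's pass rate.

    The Beta(1, 1) prior is an assumption for this sketch.
    """
    a = prior_alpha + passes
    b = prior_beta + (attempts - passes)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, math.sqrt(var)


def assign_buckets(difficulties, bucket_count):
    """Equal-frequency buckets by difficulty rank; bucket 0 is the easiest."""
    order = sorted(range(len(difficulties)), key=lambda i: difficulties[i])
    buckets = [0] * len(difficulties)
    for rank, i in enumerate(order):
        buckets[i] = min(rank * bucket_count // len(difficulties), bucket_count - 1)
    return buckets


# (pass_count, attempt_count) per prompt; values are made up for illustration.
rows = [(30, 32), (16, 32), (2, 32), (0, 32)]
# Assumed convention: difficulty = 1 - posterior mean pass rate.
difficulties = [1.0 - beta_posterior(p, n)[0] for p, n in rows]
bucket_index = assign_buckets(difficulties, bucket_count=2)
```

With these four rows the two most-solved prompts land in bucket 0 and the two least-solved in bucket 1. A lower bound such as difficulty.posterior_lower_bound would additionally need a Beta quantile, which this sketch leaves out.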

Tests

  • Adds tests/test_create_difficulty_map.py for difficulty-map generation.
  • Adds open_instruct/test_difficulty_curriculum.py for curriculum sampling, parser wiring, filtering, and loader integration.

gemini-code-assist[bot]

This comment was marked as low quality.

@undfined changed the title from "WIP: Adds difficulty sampling curriculum dataloader and dataset builder" to "Adds difficulty sampling curriculum dataloader and dataset builder" on May 11, 2026

Copilot AI left a comment

Pull request overview

Adds an end-to-end difficulty-aware curriculum pipeline: a dataset-side difficulty metadata builder (Beta-Binomial posterior + bucketing) and a training-side sampler/dataloader path that can shift prompt sampling over training steps (optionally adaptive based on observed rewards/advantages).

Changes:

  • Add create_difficulty_map.py + docs/tests to generate per-row difficulty.* metadata from HF pass-rate aggregates and write JSONL + schema/metadata sidecars (optionally push to Hub).
  • Add open_instruct/difficulty_curriculum.py sampler/config + integrate into open_instruct/data_loader.py and grpo_fast.py CLI/config wiring.
  • Add launch scripts + reference curriculum artifacts under configs/curriculum/..., and update changelog.
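
To make "shift prompt sampling over training steps" concrete, here is a minimal sketch of a step-dependent bucket schedule. It does not reproduce the sampler in open_instruct/difficulty_curriculum.py: the functions `bucket_probs` and `sample_prompt`, the `sharpness` parameter, and the exponential-decay weighting are hypothetical choices made only to illustrate the idea.

```python
import random


def bucket_probs(step, total_steps, bucket_count, sharpness=2.0):
    """Hypothetical schedule: probability mass drifts from easy to hard buckets."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    # The favoured bucket moves from 0 (easiest) to bucket_count - 1 (hardest).
    center = progress * (bucket_count - 1)
    weights = [2.0 ** (-sharpness * abs(b - center)) for b in range(bucket_count)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_prompt(step, total_steps, buckets, rng):
    """Pick a bucket from the schedule, then a prompt uniformly within it.

    Assumes every bucket is non-empty.
    """
    probs = bucket_probs(step, total_steps, len(buckets))
    bucket = rng.choices(range(len(buckets)), weights=probs, k=1)[0]
    return rng.choice(buckets[bucket])


# Three buckets of prompt indices, easiest first (illustrative data).
buckets = [[0, 1], [2, 3], [4, 5]]
rng = random.Random(0)
early = bucket_probs(0, 100, 3)    # favours bucket 0
late = bucket_probs(100, 100, 3)   # favours bucket 2
picked = sample_prompt(50, 100, buckets, rng)
```

An adaptive variant could rescale these probabilities using observed rewards or advantages per bucket; that part is omitted here.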

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/test_create_difficulty_map.py | Unit tests for difficulty-map generation logic (with stubs for external deps). |
| scripts/train/qwen/difficult-curriculum/qwen3_4b_dapo_math_dc.sh | Example launch script enabling difficulty curriculum flags. |
| scripts/train/qwen/difficult-curriculum/qwen3_4b_dapo_math_dc_longer_easy_bootstrap.sh | Variant launcher adjusting bootstrap/warmup steps. |
| scripts/train/qwen/difficult-curriculum/qwen3_4b_dapo_math_dc_hardest50.sh | Variant launcher filtering to hardest quantiles. |
| scripts/train/qwen/difficult-curriculum/qwen3_4b_dapo_math_dc_adaptive_strong.sh | Variant launcher enabling stronger adaptive sampling. |
| scripts/train/qwen/difficult-curriculum/qwen3_4b_dapo_math_dc_adaptive_strong_hardest50.sh | Adaptive+hardest50 launcher variant. |
| scripts/train/qwen/difficult-curriculum/qwen3_4b_dapo_math_dc_adaptive_light.sh | Variant launcher enabling lighter adaptive sampling. |
| scripts/train/qwen/difficult-curriculum/qwen3_4b_dapo_math_dc_adaptive_light_hardest50.sh | Adaptive-light + hardest50 launcher variant. |
| scripts/data/difficulty_sampling/README.md | User-facing documentation for difficulty metadata + curriculum usage. |
| scripts/data/difficulty_sampling/create_difficulty_map.py | Difficulty-map builder script (load HF dataset, estimate posterior, bucket, write/push outputs). |
| open_instruct/test_difficulty_curriculum.py | Tests for sampler behavior, quantile filtering, adaptive stats, and loader integration. |
| open_instruct/test_data_loader.py | Extends tests to ensure truncation masking keeps arrays/batch aligned. |
| open_instruct/grpo_fast.py | Wires curriculum args/config into GRPO entrypoint and DataPreparationActor creation. |
| open_instruct/difficulty_curriculum.py | New curriculum sampler implementation (schedule, weighting, adaptive reweighting, metrics, state). |
| open_instruct/data_loader.py | Adds curriculum-backed prompt dataloader + adaptive observation recording + curriculum metrics. |
| configs/curriculum/Qwen_Qwen3-4B-Base/math__Qwen_Qwen3-4B-Base__bbq-eb-q10-k5.schema.json | Reference schema artifact for difficulty-annotated dataset. |
| configs/curriculum/Qwen_Qwen3-4B-Base/math__Qwen_Qwen3-4B-Base__bbq-eb-q10-k5.metadata.json | Reference metadata artifact describing generation configuration. |
| CHANGELOG.md | Notes the new difficulty curriculum + builder feature. |

Comments suppressed due to low confidence (1)

open_instruct/data_loader.py:1542

  • advantages (and scores_per_prompt / mean_grouped_rewards) are computed before maybe_mask_truncated_completions filters out non-stop rollouts. When masking is enabled, truncated rollouts still influence the per-prompt mean/std used for advantage normalization, which biases the remaining trainable samples (and can make the grouping logic inconsistent if some rollouts are removed). Consider applying the truncation mask before computing per-prompt statistics, or recomputing per-prompt means/stds from the retained rollouts (grouping by prompt_id / index) so only trainable samples contribute to the advantage calculation.
            scores = np.array(batch.scores)
            scores_per_prompt = scores.reshape(-1, self.config.num_samples_per_prompt_rollout)
            mean_grouped_rewards = scores_per_prompt.mean(axis=-1)
            mean_grouped_rewards = np.repeat(mean_grouped_rewards, self.config.num_samples_per_prompt_rollout, axis=0)
            std_grouped_rewards = scores_per_prompt.std(axis=-1)
            std_grouped_rewards = np.repeat(std_grouped_rewards, self.config.num_samples_per_prompt_rollout, axis=0)

            if self.config.advantage_normalization_type == "standard":
                advantages = (scores - mean_grouped_rewards) / (std_grouped_rewards + 1e-8)
            elif self.config.advantage_normalization_type == "centered":
                advantages = scores - mean_grouped_rewards
            else:
                raise ValueError(f"Invalid advantage normalization type: {self.config.advantage_normalization_type}")
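
One way the second option (recomputing per-prompt statistics from the retained rollouts) could look is sketched below. It is a standalone illustration in plain Python, not a patch against data_loader.py: the function `masked_group_advantages`, its parallel-list inputs, and the `keep` mask are hypothetical stand-ins for whatever maybe_mask_truncated_completions actually produces.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def masked_group_advantages(scores, prompt_ids, keep, normalization="standard", eps=1e-8):
    """Compute per-prompt advantages from retained rollouts only.

    scores, prompt_ids and keep are parallel lists; keep[i] is False for a
    truncated rollout. Returns (retained_indices, advantages).
    """
    # Group the indices of retained rollouts by prompt.
    groups = defaultdict(list)
    for i, (pid, kept) in enumerate(zip(prompt_ids, keep)):
        if kept:
            groups[pid].append(i)

    retained, advantages = [], []
    for members in groups.values():
        group_scores = [scores[i] for i in members]
        mean = fmean(group_scores)
        std = pstdev(group_scores)  # population std, matching np.std's default
        for i in members:
            centered = scores[i] - mean
            if normalization == "standard":
                advantages.append(centered / (std + eps))
            elif normalization == "centered":
                advantages.append(centered)
            else:
                raise ValueError(f"Invalid advantage normalization type: {normalization}")
            retained.append(i)
    return retained, advantages


# Two prompts with three rollouts each; rollout 2 was truncated.
scores = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
prompt_ids = [0, 0, 0, 1, 1, 1]
keep = [True, True, False, True, True, True]
retained, adv = masked_group_advantages(scores, prompt_ids, keep, normalization="centered")
```

Here prompt 0 is centered on a mean of 0.5 from its two retained rollouts; with the truncated rollout included, the mean would have been 1/3. Grouping by prompt id rather than reshaping also keeps the logic valid when groups end up with different sizes.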

Comment on lines +1524 to 1529
prompt_dataset_indices: list[int] = []
if self.curriculum_dataloader is not None and batch.indices is not None:
    prompt_dataset_indices = [
        int(index) for index in batch.indices[:: self.config.num_samples_per_prompt_rollout]
    ]
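
For readers unfamiliar with the stride slice in this hunk, the sketch below shows what it recovers. It assumes, as the reshape in the advantage code suggests but the diff does not state outright, that each prompt's dataset index appears num_samples_per_prompt_rollout times in a row; the function name `prompt_indices` is made up for the example.

```python
def prompt_indices(rollout_indices, num_samples_per_prompt):
    """Take one dataset index per prompt from a per-rollout index list.

    Assumes each prompt's index is repeated consecutively
    num_samples_per_prompt times.
    """
    return [int(i) for i in rollout_indices[::num_samples_per_prompt]]


# Two prompts (dataset rows 7 and 2), three rollouts each.
per_prompt = prompt_indices([7, 7, 7, 2, 2, 2], 3)
```

If rollouts were dropped before this point so that groups no longer have equal size, the stride would pick the wrong entries, so the slice is only safe on the unfiltered batch.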

BUDGET="${BUDGET:-ai2/oe-omai}"

# Difficulty-annotated variant of hamishivi/DAPO-Math-17k-Processed_filtered
DATASET_WITH_DIFFICULTY="undfined/dapo-math-17k-processed-filtered-qwen3-4b-base-32samples-ds"