Commit d00faa7

sdg_pipeline: multi-model profiling stage replaces single-model difficulty_estimation

Introduce a new `profiling` stage that runs per-model generate->judge->aggregate chains in parallel for N models and merges results into a single `profiling` array per problem:

    "profiling": [
        {"model": "ModelA", "pass_rate": 0.5, "pass_at_n": "2/4"},
        {"model": "ModelB", "pass_rate": 0.8, "pass_at_n": "4/5"},
    ]

Changes:
- `run_pipeline.profiling()` orchestrates: shared prepare -> per-model chains (generate, judge, aggregate) in parallel -> final merge. Judge kwargs are copied per-iteration so args don't leak across models; `num_random_seeds` is inherited from generation if not explicitly set.
- `aggregate_profiling_model.py` (new): per-model aggregator over per-seed `output-rs*.jsonl` files. Streams inputs, keeping only the BASE_FIELDS projection + a small counters dict per `(id, problem)` key, so aggregation fits in memory at 1M+ problem scale. Falls back to `(_lineno, line_number)` when neither `id` nor `problem` is present on a record.
- `merge_profiling.py` (new): merges per-model result files. Asserts every per-model file contains the same `(id, problem)` key set so row-alignment mismatches fail loudly instead of silently dropping problems. After a successful merge, removes the per-model `result.jsonl` intermediates (the generation/, judgement/, and logs/ folders are retained for debugging).
- `filter_solutions.py`: replaces the scalar `difficulty_model_pass_rate` bounds with a per-model dict `profiling_pass_rate_ranges: {model_name: [min, max]}` (min exclusive, max inclusive).
- `validate_pipeline.py` and `scripts/utils/constants.py`: update stage-name and field-set checks (`PROFILING_FIELDS`, required `profiling` key, row-count equality for the new stage).
- Base + settings YAMLs: renamed stage (`difficulty_estimation` -> `profiling`) and directory (`step-3-difficulty-estimation` -> `step-3-profiling`); new `profiling.models: [...]` list with per-model `generation_kwargs` and an optional per-model `judge_kwargs` override.
- SLURM test: references `stages.profiling.models.0.generation_kwargs...` instead of the old top-level path.
- README: updates the stage list + filter-parameter description.
- `aggregate_difficulty.py` is removed (replaced by the new aggregator + merger).

Signed-off-by: Tatevik Ter-Hovhannisyan <tterhovhanni@nvidia.com>
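
Reviewer sketch of the per-model aggregation described in the commit message. `aggregate_profiling_model.py` itself is not shown on this page, so this is only an illustration under stated assumptions: the function name `aggregate_model` and the boolean `is_correct` field are hypothetical stand-ins for however the real script reads the judge output.

```python
from collections import defaultdict


def aggregate_model(records, model_name):
    """Fold per-seed judged records into one profiling entry per problem key."""
    # Only a [correct, total] counter is kept per key, so memory grows with the
    # number of distinct problems rather than with problems * seeds.
    counters = defaultdict(lambda: [0, 0])
    for rec in records:
        key = (rec.get("id"), rec.get("problem"))
        counters[key][0] += int(bool(rec["is_correct"]))  # hypothetical field
        counters[key][1] += 1
    return {
        key: {
            "model": model_name,
            "pass_rate": correct / total,
            "pass_at_n": f"{correct}/{total}",
        }
        for key, (correct, total) in counters.items()
    }


# Four seeds for one problem, two of them judged correct.
seed_records = [
    {"id": "q1", "problem": "2+2?", "is_correct": True},
    {"id": "q1", "problem": "2+2?", "is_correct": False},
    {"id": "q1", "problem": "2+2?", "is_correct": True},
    {"id": "q1", "problem": "2+2?", "is_correct": False},
]
entries = aggregate_model(iter(seed_records), "ModelA")
print(entries[("q1", "2+2?")])
```

Passing an iterator rather than a list mirrors the streaming intent: the function never needs the full set of per-seed rows in memory at once.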
1 parent 410b337 commit d00faa7

14 files changed

Lines changed: 355 additions & 181 deletions


recipes/opensciencereasoning/sdg_pipeline/README.md

Lines changed: 3 additions & 3 deletions
@@ -5,7 +5,7 @@ This folder provides templates, prompts, and scripts for the automated pipeline
 - Deduplicate and clean incoming problems via [`filter_problems`](scripts/filter_problems.py).
 - Run contamination checks in [`decontaminate`](scripts/decontaminate.py).
 - Launch [`generate_solutions`](run_pipeline.py) to obtain model answers when no GT is supplied, then run majority voting to recover a GT answer. Will only be applied with the `without_gt` setting.
-- Score questions with [`difficulty_estimation`](run_pipeline.py) and enrich metadata with [`topics_labeling`](run_pipeline.py).
+- Profile problem difficulty with multiple models via [`profiling`](run_pipeline.py) and enrich metadata with [`topics_labeling`](run_pipeline.py).
 - Finish with [`aggregate`](scripts/aggregate_metadata.py) and [`filter_solutions`](scripts/filter_solutions.py) to produce deliverables.
 
 ## SFT Data Flow
@@ -27,10 +27,10 @@ This folder provides templates, prompts, and scripts for the automated pipeline
 - `generation/output*.jsonl`: raw generations.
 - `with_predictions/output*.jsonl`: adds `predicted_answer`, and when the majority answer is applied, also adds `expected_answer`, `majority_voting_agreement_rate`, and `majority_voting_agreement_at_n`.
 - Optional `judgement/output*.jsonl`: contains `judgement` strings when `make_judgement` is enabled. The aggregated stage output also adds `is_correct`, `generation_model_pass_rate`, `generation_model_pass_at_n`, and `generation_model` to each sample.
-- [`difficulty_estimation`](run_pipeline.py): Requires GT answers. Uses [`remove_redundant_fields.py`](scripts/remove_redundant_fields.py) to keep baseline keys, generates boxed-format solutions (`generation_kwargs`), judges them (`judge_kwargs`), and writes `final_result.jsonl` with `difficulty_model`, `difficulty_model_pass_rate`, and `difficulty_model_pass_at_n` fields (see [`aggregate_difficulty.py`](scripts/aggregate_difficulty.py)).
+- [`profiling`](run_pipeline.py): Requires GT answers. Runs multiple models in parallel, each through a generate-judge-aggregate chain. Uses [`remove_redundant_fields.py`](scripts/remove_redundant_fields.py) to keep baseline keys, generates boxed-format solutions per model, judges them, and writes `final_result.jsonl` with a `profiling` array field containing per-model `{model, pass_rate, pass_at_n}` entries (see [`aggregate_profiling_model.py`](scripts/aggregate_profiling_model.py) and [`merge_profiling.py`](scripts/merge_profiling.py)).
 - [`aggregate`](scripts/aggregate_metadata.py): Merges metadata (`metadata_files`) and optional solution glob (`solutions_path`) into `final_result.jsonl`. The resulting records combine base fields with appended metadata and solution statistics.
 - [`prepare_for_sft`](run_pipeline.py): Calls `nemo_skills.training.prepare_data` via the configured `prepare_data_kwargs` (tokenizer, prompt config, formatting toggles). Outputs an instruction-tuning JSONL file.
-- [`filter_solutions`](scripts/filter_solutions.py): Applies correctness/pass-rate/metadata filters. Parameters: `only_correct_solutions`, `generation_model_pass_rate_range`, `difficulty_model_pass_rate_range`, `metadata_values`, `only_samples_with_ground_truth_answer`. The filtered output preserves the same schema as the input `final_result.jsonl`.
+- [`filter_solutions`](scripts/filter_solutions.py): Applies correctness/pass-rate/metadata filters. Parameters: `only_correct_solutions`, `generation_model_pass_rate_range`, `profiling_pass_rate_range` (JSON dict `{model_name: [min, max]}`), `metadata_values`, `only_samples_with_ground_truth_answer`. The filtered output preserves the same schema as the input `final_result.jsonl`.
 - [`process_messages_and_bucket`](run_pipeline.py): Uses [`scripts/process_messages_and_bucket.py`](scripts/process_messages_and_bucket.py) to transform prepared rows into input/output message text, compute `input_token_length` and `output_token_length`, and optionally split into token-length buckets based on `bucket_field` and `bucket_sizes`.
 - [`validate`](scripts/validate_pipeline.py): Reuses the automated checker to verify artifacts exist, counts add up, and required metadata fields are present, so failures point directly to the problematic stage. See [What the Validation Stage Covers](#what-the-validation-stage-covers) for details and caveats.
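
Reviewer sketch of the per-model filter described in the `filter_solutions` bullet, using the "min exclusive, max inclusive" bounds stated in the commit message. The changes to `filter_solutions.py` are not shown on this page, so the function name `passes_profiling_filter` is hypothetical, and rejecting a record that lacks an entry for a configured model is an assumption rather than confirmed behavior.

```python
def passes_profiling_filter(profiling, ranges):
    """Return True if every configured model's pass rate lies in (min, max]."""
    rates = {entry["model"]: entry["pass_rate"] for entry in profiling}
    for model_name, (low, high) in ranges.items():
        rate = rates.get(model_name)
        # Assumption: a record with no entry for a configured model is dropped.
        if rate is None or not (low < rate <= high):
            return False
    return True


profiling = [
    {"model": "ModelA", "pass_rate": 0.5, "pass_at_n": "2/4"},
    {"model": "ModelB", "pass_rate": 0.8, "pass_at_n": "4/5"},
]

keep_upper_edge = passes_profiling_filter(profiling, {"ModelA": [0.0, 0.5]})
drop_lower_edge = passes_profiling_filter(profiling, {"ModelA": [0.5, 1.0]})
keep_everything = passes_profiling_filter(profiling, {"ModelB": [-1.0, 1.0]})
```

The exclusive lower bound is why the base config uses `-1.0` as a minimum elsewhere: a range of `[0.0, 1.0]` would drop problems with a pass rate of exactly 0.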

recipes/opensciencereasoning/sdg_pipeline/configs/pipelines/base.yaml

Lines changed: 23 additions & 21 deletions
@@ -21,7 +21,7 @@ pipeline_stages:
   - decontaminate # Decontaminate problems
   - topics_labeling # Label topics and subtopics
   - generate_solutions # Generate solutions
-  - difficulty_estimation # Estimate difficulty of problems
+  - profiling # Profile problem difficulty with multiple models
   - aggregate # Aggregate all the metadata into a single file
   - filter_solutions # Filter solutions
   - prepare_for_sft # Prepare for SFT
@@ -33,7 +33,7 @@ directories:
   step-0-filter-problems: ${base_output_dir}/solution-sdg/step-0-filter-problems
   step-1-decontaminate: ${base_output_dir}/solution-sdg/step-1-decontaminate
   step-2-topics-labeling: ${base_output_dir}/solution-sdg/step-2-topics-labeling
-  step-3-difficulty-estimation: ${base_output_dir}/solution-sdg/step-3-difficulty-estimation
+  step-3-profiling: ${base_output_dir}/solution-sdg/step-3-profiling
   step-4-generate-solutions: ${base_output_dir}/solution-sdg/step-4-generate-solutions
   step-5-aggregate: ${base_output_dir}/solution-sdg/step-5-aggregate
   step-6-filter-solutions: ${base_output_dir}/solution-sdg/step-6-filter-solutions
@@ -134,35 +134,37 @@ stages:
     dependencies:
       - decontaminate
 
-  difficulty_estimation:
-    output_dir: ${directories.step-3-difficulty-estimation}
+  profiling:
+    output_dir: ${directories.step-3-profiling}
     input_file: ${directories.step-1-decontaminate}/final_result.jsonl # Should have expected answers
 
-    generation_kwargs:
-      args:
-        model: /hf_models/Qwen3-30B-A3B
-        server_type: vllm
-        server_gpus: 8
-        server_nodes: 1
-        dependent_jobs: 1
-        num_random_seeds: 5
-        num_chunks: 20
-      ctx_args: >-
-        ++prompt_config=generic/general-boxed
-        ++inference.tokens_to_generate=16000
-
+    # Shared judge config -- used for all models unless overridden per-model
     judge_kwargs:
       args:
         model: /hf_models/gpt-oss-20b
         server_type: vllm
         server_gpus: 8
         server_nodes: 1
-        num_random_seeds: ${stages.difficulty_estimation.generation_kwargs.args.num_random_seeds}
         dependent_jobs: 1
         num_chunks: 5
       ctx_args: >-
         ++prompt_config=judge/general-judge
 
+    models:
+      - name: Qwen3-30B-A3B
+        generation_kwargs:
+          args:
+            model: /hf_models/Qwen3-30B-A3B
+            server_type: vllm
+            server_gpus: 8
+            server_nodes: 1
+            dependent_jobs: 1
+            num_random_seeds: 5
+            num_chunks: 20
+          ctx_args: >-
+            ++prompt_config=generic/general-boxed
+            ++inference.tokens_to_generate=16000
+
     dependencies:
       - decontaminate
 
@@ -232,18 +234,18 @@ stages:
     solutions_path: ${directories.step-4-generate-solutions}/final_result.jsonl
     metadata_files:
       - ${directories.step-2-topics-labeling}/final_result.jsonl
-      - ${directories.step-3-difficulty-estimation}/final_result.jsonl
+      - ${directories.step-3-profiling}/final_result.jsonl
     dependencies:
       - topics_labeling
-      - difficulty_estimation
+      - profiling
       - generate_solutions
 
   filter_solutions:
     output_dir: ${directories.step-6-filter-solutions}
     input_file: ${directories.step-5-aggregate}/final_result.jsonl
     only_correct_solutions: True
     generation_model_pass_rate_range: [-1.0, 1.0] # minimum exclusive, maximum inclusive
-    difficulty_model_pass_rate_range: [-1.0, 1.0] # minimum exclusive, maximum inclusive
+    profiling_pass_rate_range: null # Optional: JSON dict {"ModelName": [min, max]} for per-model filtering
     only_samples_with_ground_truth_answer: True
     metadata_values:
       topic: ["Biology", "Chemistry", "Physics", "Mathematics", "Other", "undefined"]
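
Reviewer sketch of how the shared `judge_kwargs` block and a per-model override resolve, extracted from the loop body of `profiling()` in `run_pipeline.py`. The helper name `resolve_judge_args`, the model names, and the `/hf_models/other-judge` path are hypothetical; the lookup logic follows the diff. Note that, as written, a per-model `judge_kwargs` replaces the shared block wholesale rather than being merged key by key.

```python
def resolve_judge_args(model_cfg, shared_judge_kwargs):
    """Mirror the judge-args resolution performed per model in profiling()."""
    generation_args = model_cfg.get("generation_kwargs", {}).get("args", {})
    # A per-model judge_kwargs entry replaces the shared one entirely.
    judge_kwargs = model_cfg.get("judge_kwargs", shared_judge_kwargs)
    judge_args = judge_kwargs.get("args", {})
    if "num_random_seeds" not in judge_args and "num_random_seeds" in generation_args:
        # Copy before writing so the shared dict is not mutated across models.
        judge_args = dict(judge_args)
        judge_args["num_random_seeds"] = generation_args["num_random_seeds"]
    return judge_args


shared = {"args": {"model": "/hf_models/gpt-oss-20b", "num_chunks": 5}}
model_a = {"name": "ModelA", "generation_kwargs": {"args": {"num_random_seeds": 5}}}
model_b = {
    "name": "ModelB",
    "generation_kwargs": {"args": {"num_random_seeds": 4}},
    # Hypothetical override: everything the judge needs must be restated here.
    "judge_kwargs": {"args": {"model": "/hf_models/other-judge", "num_random_seeds": 2}},
}

args_a = resolve_judge_args(model_a, shared)
args_b = resolve_judge_args(model_b, shared)
```

Here `args_a` inherits five seeds from generation, while `args_b` keeps its explicit two seeds and carries no `num_chunks`, since the shared block is not consulted once an override exists.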

recipes/opensciencereasoning/sdg_pipeline/configs/settings/multiple_prompts.yaml

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ stages:
     enabled: False
   topics_labeling:
     enabled: False
-  difficulty_estimation:
+  profiling:
     enabled: False
   aggregate:
     enabled: False

recipes/opensciencereasoning/sdg_pipeline/configs/settings/seed_data.yaml

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ stages:
     solutions_path: null
     dependencies:
       - topics_labeling
-      - difficulty_estimation
+      - profiling
   filter_solutions:
     input_file: ${directories.step-4-aggregate}/final_result.jsonl
     output_dir: ${directories.step-5-filter-solutions}

recipes/opensciencereasoning/sdg_pipeline/configs/settings/seed_data_postprocess.yaml

Lines changed: 2 additions & 2 deletions
@@ -7,7 +7,7 @@ stages:
     enabled: False
   topics_labeling:
     enabled: False
-  difficulty_estimation:
+  profiling:
     enabled: False
   generate_solutions:
     generation_kwargs:
@@ -21,5 +21,5 @@ stages:
   filter_solutions:
     only_correct_solutions: True
     is_ground_truth_answer_present: True
-    difficulty_model_pass_rate_range: null
+    profiling_pass_rate_range: null
     metadata_values: null

recipes/opensciencereasoning/sdg_pipeline/configs/settings/without_gt.yaml

Lines changed: 4 additions & 4 deletions
@@ -2,7 +2,7 @@
 
 directories:
   step-3-generate-solutions: ${base_output_dir}/solution-sdg/step-3-generate-solutions
-  step-4-difficulty-estimation: ${base_output_dir}/solution-sdg/step-4-difficulty-estimation
+  step-4-profiling: ${base_output_dir}/solution-sdg/step-4-profiling
   step-5-aggregate: ${base_output_dir}/solution-sdg/step-5-aggregate
   step-6-filter-solutions: ${base_output_dir}/solution-sdg/step-6-filter-solutions
 
@@ -13,17 +13,17 @@ stages:
     make_majority_voting: True
     dependencies:
       - decontaminate
-  difficulty_estimation:
+  profiling:
     input_file: ${directories.step-3-generate-solutions}/final_result.jsonl
-    output_dir: ${directories.step-4-difficulty-estimation}
+    output_dir: ${directories.step-4-profiling}
     dependencies:
       - generate_solutions
   aggregate:
     output_dir: ${directories.step-5-aggregate}
     solutions_path: ${directories.step-3-generate-solutions}/final_result.jsonl
     metadata_files:
       - ${directories.step-2-topics-labeling}/final_result.jsonl
-      - ${directories.step-4-difficulty-estimation}/final_result.jsonl
+      - ${directories.step-4-profiling}/final_result.jsonl
   filter_solutions:
     input_file: ${directories.step-5-aggregate}/final_result.jsonl
     output_dir: ${directories.step-6-filter-solutions}

recipes/opensciencereasoning/sdg_pipeline/run_pipeline.py

Lines changed: 90 additions & 49 deletions
@@ -410,32 +410,29 @@ def generate_solutions(cluster, expname, run_after, stage_config, **kwargs):
     )
 
 
-def difficulty_estimation(cluster, expname, run_after, stage_config, **kwargs):
-    """Run difficulty estimation generation, judge correctness, and postprocess metrics.
+def profiling(cluster, expname, run_after, stage_config, **kwargs):
+    """Run multi-model profiling: generate, judge, and aggregate per model in parallel, then merge.
 
-    This stage:
-    - Generates multiple solutions per problem using the provided model/prompt.
-    - Runs LLM-based judging (math_judge) over those generations to get Yes/No per sample.
-    - Postprocesses the judgements to append three keys to the final results file:
-        - difficulty_model: the model used for generation
-        - difficulty_model_pass_rate: decimal fraction of correct judgements (e.g., 0.5)
-        - difficulty_model_pass_at_n: formatted fraction "correct/total" (e.g., 2/4)
+    This stage runs N models in parallel, each through a generate -> judge -> aggregate chain.
+    All models share a common input preparation step. A final merge step combines all per-model
+    results into a single profiling array per problem.
+
+    Output format per record:
+        "profiling": [
+            {"model": "ModelA", "pass_rate": 0.5, "pass_at_n": "2/4"},
+            {"model": "ModelB", "pass_rate": 0.8, "pass_at_n": "4/5"},
+        ]
 
     Note: The judging step extracts predicted answers using the \\boxed{...} convention.
     It will only work out-of-the-box when generations include a final answer in boxed format.
     """
     output_dir = stage_config["output_dir"]
     input_file = stage_config["input_file"]
+    models = stage_config.get("models", [])
+    shared_judge_kwargs = stage_config.get("judge_kwargs", {})
 
-    generation_kwargs = stage_config.get("generation_kwargs", {})
-    judge_kwargs = stage_config.get("judge_kwargs", {})
-
-    generation_args = generation_kwargs.get("args", {})
-    generation_ctx_args = generation_kwargs.get("ctx_args", "")
-
-    judge_args = judge_kwargs.get("args", {})
-    judge_ctx_args = judge_kwargs.get("ctx_args", "")
-
+    # Step 1: Shared prepare job
+    prepare_expname = f"{expname}_prepare_profiling"
     run_cmd(
         ctx=wrap_arguments(
             f"python /nemo_run/code/recipes/opensciencereasoning/sdg_pipeline/scripts/remove_redundant_fields.py "
@@ -446,41 +443,85 @@ def difficulty_estimation(cluster, expname, run_after, stage_config, **kwargs):
         ),
         cluster=cluster,
         log_dir=f"{output_dir}/tmp/logs",
-        expname=f"{expname}_prepare_difficulty_estimation",
+        expname=prepare_expname,
         run_after=run_after,
     )
 
-    generate(
-        ctx=wrap_arguments(generation_ctx_args),
-        cluster=cluster,
-        input_file=f"{output_dir}/tmp/prepared.jsonl",
-        output_dir=f"{output_dir}/generation",
-        expname=f"{expname}-generation",
-        run_after=f"{expname}_prepare_difficulty_estimation",
-        **generation_args,
-    )
+    # Step 2: Per-model generate -> judge -> aggregate chains (all depend on prepare, run in parallel)
+    per_model_aggregate_expnames = []
+    per_model_result_files = []
 
-    generate(
-        ctx=wrap_arguments(judge_ctx_args),
-        generation_type="math_judge",
-        cluster=cluster,
-        input_dir=f"{output_dir}/generation",
-        output_dir=f"{output_dir}/judgement",
-        expname=f"{expname}-judgement",
-        run_after=f"{expname}-generation",
-        **judge_args,
-    )
+    for model_cfg in models:
+        model_name = model_cfg["name"]
+        safe_name = model_name.replace("/", "-").lower()
+        model_dir = f"{output_dir}/{safe_name}"
+
+        generation_kwargs = model_cfg.get("generation_kwargs", {})
+        generation_args = generation_kwargs.get("args", {})
+        generation_ctx_args = generation_kwargs.get("ctx_args", "")
+
+        # Per-model judge_kwargs: use model-level override if present, else shared
+        judge_kwargs = model_cfg.get("judge_kwargs", shared_judge_kwargs)
+        judge_args = judge_kwargs.get("args", {})
+        judge_ctx_args = judge_kwargs.get("ctx_args", "")
+
+        # Ensure judge num_random_seeds matches generation if not explicitly set
+        if "num_random_seeds" not in judge_args and "num_random_seeds" in generation_args:
+            judge_args = dict(judge_args)
+            judge_args["num_random_seeds"] = generation_args["num_random_seeds"]
+
+        gen_expname = f"{expname}-{safe_name}-generation"
+        judge_expname = f"{expname}-{safe_name}-judgement"
+        agg_expname = f"{expname}-{safe_name}-aggregate"
+
+        generate(
+            ctx=wrap_arguments(generation_ctx_args),
+            cluster=cluster,
+            input_file=f"{output_dir}/tmp/prepared.jsonl",
+            output_dir=f"{model_dir}/generation",
+            expname=gen_expname,
+            run_after=prepare_expname,
+            **generation_args,
+        )
+
+        generate(
+            ctx=wrap_arguments(judge_ctx_args),
+            generation_type="math_judge",
+            cluster=cluster,
+            input_dir=f"{model_dir}/generation",
+            output_dir=f"{model_dir}/judgement",
+            expname=judge_expname,
+            run_after=gen_expname,
+            **judge_args,
+        )
+
+        model_result_file = f"{model_dir}/result.jsonl"
+        run_cmd(
+            ctx=wrap_arguments(
+                f"python /nemo_run/code/recipes/opensciencereasoning/sdg_pipeline/scripts/aggregate_profiling_model.py "
+                f" --judgement_dir '{model_dir}/judgement' "
+                f" --output_file '{model_result_file}' "
+                f" --model_name '{model_name}' "
+            ),
+            cluster=cluster,
+            log_dir=f"{model_dir}/logs",
+            run_after=judge_expname,
+            expname=agg_expname,
+        )
+
+        per_model_aggregate_expnames.append(agg_expname)
+        per_model_result_files.append(model_result_file)
 
+    # Step 3: Merge all per-model results into a single profiling array
     run_cmd(
         ctx=wrap_arguments(
-            f"python /nemo_run/code/recipes/opensciencereasoning/sdg_pipeline/scripts/aggregate_difficulty.py "
-            f" --judgement_dir '{output_dir}/judgement' "
+            f"python /nemo_run/code/recipes/opensciencereasoning/sdg_pipeline/scripts/merge_profiling.py "
+            f" --model_result_files {shlex.quote(json.dumps(per_model_result_files, ensure_ascii=False))} "
             f" --output_file '{output_dir}/{OUTPUT_FILE}' "
-            f" --difficulty_model '{generation_args['model'].split('/')[-1]}' "
         ),
         cluster=cluster,
         log_dir=f"{output_dir}/logs",
-        run_after=f"{expname}-judgement",
+        run_after=per_model_aggregate_expnames,
         expname=expname,
     )

@@ -519,7 +560,7 @@ def filter_solutions(cluster, expname, run_after, stage_config, **kwargs):
     Supported filters (see `filter_solutions.py`):
     - `only_correct_solutions`: keep only samples marked `is_correct`.
     - `generation_model_pass_rate_range`: JSON `[min, max]` range (min exclusive, max inclusive).
-    - `difficulty_model_pass_rate_range`: JSON `[min, max]` range over difficulty pass rates.
+    - `profiling_pass_rate_range`: JSON dict `{model_name: [min, max]}` for per-model profiling filtering.
     - `metadata_values`: dict of field -> allowed values.
 
     Replace `filter_solutions.py` with your own implementation if custom filtering logic is required.
@@ -528,7 +569,7 @@ def filter_solutions(cluster, expname, run_after, stage_config, **kwargs):
     input_file = stage_config["input_file"]
     only_correct_solutions = stage_config.get("only_correct_solutions", False)
     generation_model_pass_rate_range = stage_config.get("generation_model_pass_rate_range", None)
-    difficulty_model_pass_rate_range = stage_config.get("difficulty_model_pass_rate_range", None)
+    profiling_pass_rate_range = stage_config.get("profiling_pass_rate_range", None)
     metadata_values = stage_config.get("metadata_values", None)
     only_samples_with_ground_truth_answer = stage_config.get("only_samples_with_ground_truth_answer", False)
 
@@ -537,9 +578,9 @@ def filter_solutions(cluster, expname, run_after, stage_config, **kwargs):
         if generation_model_pass_rate_range
         else ""
     )
-    difficulty_model_pass_rate_range_arg = (
-        f" --difficulty_model_pass_rate_range {shlex.quote(json.dumps(difficulty_model_pass_rate_range, ensure_ascii=False))} "
-        if difficulty_model_pass_rate_range
+    profiling_pass_rate_range_arg = (
+        f" --profiling_pass_rate_range {shlex.quote(json.dumps(profiling_pass_rate_range, ensure_ascii=False))} "
+        if profiling_pass_rate_range
         else ""
     )
     metadata_values_arg = (
@@ -558,7 +599,7 @@ def filter_solutions(cluster, expname, run_after, stage_config, **kwargs):
             f" --output_file '{output_dir}/{OUTPUT_FILE}' "
             f"{only_correct_arg}"
             f"{generation_model_pass_rate_range_arg} "
-            f"{difficulty_model_pass_rate_range_arg} "
+            f"{profiling_pass_rate_range_arg} "
             f"{metadata_values_arg} "
             f"{only_samples_with_ground_truth_answer_arg} "
         ),
@@ -752,7 +793,7 @@ def derive_variant_name():
         "decontaminate": decontaminate,
         "topics_labeling": topics_labeling,
         "generate_solutions": generate_solutions,
-        "difficulty_estimation": difficulty_estimation,
+        "profiling": profiling,
         "aggregate": aggregate,
         "filter_solutions": filter_solutions,
         "prepare_for_sft": prepare_for_sft,
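
Reviewer sketch of the final merge step that `profiling()` schedules after all per-model aggregate jobs. `merge_profiling.py` is not shown on this page, so the function below only illustrates the behavior the commit message describes (fail loudly when per-model key sets differ, then build one `profiling` array per problem). The in-memory row shape, with a `profiling_entry` field holding each model's result, is a hypothetical simplification.

```python
def merge_profiling(per_model_rows):
    """Merge per-model rows (dicts mapping key -> row) into one record per problem."""
    if not per_model_rows:
        return []
    reference_keys = set(per_model_rows[0])
    for rows in per_model_rows[1:]:
        differing = reference_keys ^ set(rows)
        # Refuse to merge rather than silently dropping unmatched problems.
        if differing:
            raise AssertionError(f"{len(differing)} keys differ between per-model results")
    merged = []
    for key, first_row in per_model_rows[0].items():
        record = {k: v for k, v in first_row.items() if k != "profiling_entry"}
        record["profiling"] = [rows[key]["profiling_entry"] for rows in per_model_rows]
        merged.append(record)
    return merged


key = ("q1", "2+2?")
entry_a = {"model": "ModelA", "pass_rate": 0.5, "pass_at_n": "2/4"}
entry_b = {"model": "ModelB", "pass_rate": 0.8, "pass_at_n": "4/5"}
rows_a = {key: {"id": "q1", "problem": "2+2?", "profiling_entry": entry_a}}
rows_b = {key: {"id": "q1", "problem": "2+2?", "profiling_entry": entry_b}}

merged = merge_profiling([rows_a, rows_b])
```

The order of entries in each `profiling` array follows the order of the per-model inputs, which in the pipeline corresponds to the order of `per_model_result_files`.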
