This directory contains the benchmark-side evaluation scripts for SourceBench.
The file names are historical, but the intended workflow is simple:
`validate_official_submission.py` → `official_submission_backend.py` → `official_run.py` → `compute_metrics.py`
If you are only running the public benchmark locally, you usually need just `compute_metrics.py`.
If you are running official leaderboard evaluation server-side, the full flow is (an end-to-end sketch follows the list):
- validate submission schema
- intake the submission into an internal queue
- execute the official run on the hidden holdout split
- compute final leaderboard metrics
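For orientation, here is a minimal sketch of that server-side flow, chained together with the per-script commands documented below. All paths (the submission JSON, `SUBMISSION_ID`, the holdout manifest) are placeholders.

```python
# Minimal sketch of the server-side flow; all paths are placeholders and
# SUBMISSION_ID must be taken from the directory created by the intake step.
import subprocess

def run(*args: str) -> None:
    """Run one evaluation script and fail loudly on a non-zero exit code."""
    subprocess.run(["python", *args], check=True)

# 1. validate submission schema
run("src/evaluation/validate_official_submission.py",
    "--input", "submission.json",
    "--output", "output/validation_report.json")

# 2. intake the submission into the internal queue
run("src/evaluation/official_submission_backend.py",
    "--input", "submission.json")

# 3. execute the official run on the hidden holdout split; this stage also
#    produces the final leaderboard metrics
run("src/evaluation/official_run.py",
    "--submission-dir", "leaderboard/.official_submissions/SUBMISSION_ID",
    "--holdout-manifest", "path/to/internal_holdout.csv")
```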
## compute_metrics.py

Purpose:
- convert one or more scored source files into leaderboard-ready metrics
- aggregate overall metrics and per-query-type metrics
- export JSON and CSV artifacts
Use this when:
- you already have judged source files from `scoring.py`
- you want final metrics for local comparison or official reporting
Main input:
- one or more scored JSON files, usually `*.enriched.json`
Main output:
- `leaderboard_data.json`
- `leaderboard_overall.csv`
- `leaderboard_by_query_type.csv`
- query- and source-level CSVs
Typical command:
```bash
python src/evaluation/compute_metrics.py \
  --run MODEL_NAME=path/to/scored.enriched.json \
  --query-metadata data/queries/sourcebench_public_queries_v1.csv \
  --out-dir output/metrics
```
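For intuition, the sketch below shows the kind of aggregation this step performs over scored records. The record fields (`query_type`, `score`) and the plain averaging are illustrative assumptions; the real schema and metric definitions live in `compute_metrics.py`.

```python
# Illustrative only: field names and averaging are assumptions, not the
# actual enriched schema or the official metric definitions.
import json
from collections import defaultdict

def summarize(enriched_path: str) -> dict:
    """Aggregate one scored file into overall and per-query-type averages."""
    with open(enriched_path) as f:
        records = json.load(f)
    by_type = defaultdict(list)
    for rec in records:
        by_type[rec["query_type"]].append(rec["score"])   # assumed fields
    scores = [s for group in by_type.values() for s in group]
    return {
        "overall": sum(scores) / len(scores),
        "by_query_type": {t: sum(g) / len(g) for t, g in by_type.items()},
    }

print(summarize("path/to/scored.enriched.json"))
```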
## validate_official_submission.py

Purpose:
- validate the schema of an official submission package
- support both `endpoint` and `answer_url_bundle` submissions
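The two kinds differ in what the participant provides up front: roughly, an endpoint submission points at an API for the hidden queries to be run against, while an answer+URL bundle ships answers and source URLs directly (see `official_run.py` below). The layouts here use hypothetical field names for illustration only; the real schema is defined by this validator and the files under `leaderboard/examples/`.

```python
# Hypothetical layouts only; field names are assumptions. Consult the
# validator and leaderboard/examples/ for the real schema.

# "endpoint" submission: the hidden queries are run against the participant's API.
endpoint_submission = {
    "submission_type": "endpoint",             # assumed field name
    "endpoint_url": "https://example.com/answer",
    "team_name": "example-team",
}

# "answer_url_bundle" submission: answers and source URLs are provided up front.
bundle_submission = {
    "submission_type": "answer_url_bundle",    # assumed field name
    "answers": [
        {"query_id": "q-001", "answer": "...", "source_urls": ["https://example.com/doc"]},
    ],
    "team_name": "example-team",
}
```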
Use this when:
- a participant has prepared a submission JSON
- you want a strict yes/no schema check before any execution
Main input:
- one submission JSON file
Main output:
- validation report JSON
- exit code: `0` for valid submissions, `1` for invalid submissions
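Because the verdict is also reported through the exit code, the check is easy to gate on in automation. A minimal sketch:

```python
# Gate a pipeline on the validator's exit code (0 = valid, 1 = invalid).
import subprocess
import sys

result = subprocess.run([
    "python", "src/evaluation/validate_official_submission.py",
    "--input", "submission.json",
    "--output", "output/validation_report.json",
])
if result.returncode != 0:
    sys.exit("submission rejected: see output/validation_report.json")
```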
Typical command:
```bash
python src/evaluation/validate_official_submission.py \
  --input leaderboard/examples/endpoint_submission.example.json \
  --output output/validation_report.json
```

## official_submission_backend.py

Purpose:
- intake a validated submission into the internal official evaluation queue
- assign a submission id
- write submission status, evaluation request, and redacted copies
This script is best understood as the official submission intake step.
Use this when:
- you want to register a participant submission for server-side evaluation
- you want a reproducible internal directory for that submission
Main input:
- one participant submission JSON
Main output:
- a submission directory under `leaderboard/.official_submissions/`, containing:
  - `validation_report.json`
  - `submission_status.json`
  - `evaluation_request.json`
  - redacted or raw submission copies
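A sketch of inspecting what the intake step wrote. It assumes the assigned submission id doubles as the directory name under `leaderboard/.official_submissions/`, which is consistent with the `official_run.py` command shown later.

```python
# After intake, list the per-submission directories and their artifacts.
import pathlib

root = pathlib.Path("leaderboard/.official_submissions")
for submission_dir in sorted(root.iterdir()):
    print(submission_dir.name)        # assigned submission id (assumed = dir name)
    for artifact in sorted(submission_dir.iterdir()):
        print("  ", artifact.name)    # e.g. submission_status.json, evaluation_request.json
```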
Typical command:
```bash
python src/evaluation/official_submission_backend.py \
  --input submission.json
```

## official_run.py

Purpose:
- execute the official SourceBench pipeline for a queued submission
- run hidden queries for endpoint submissions, or continue from answer+URL bundles
- run scraping, judging, and final metric computation
This script is best understood as the official evaluation runner.
Use this when:
- a submission has already been validated and accepted into the queue
- you want to produce official artifacts for the holdout evaluation
Main input:
- a submission directory produced by `official_submission_backend.py`
- an internal holdout manifest CSV
Main output:
- stage outputs for source collection, scraping, scoring, and metrics
- final official metrics artifacts
- updated submission status and run manifest
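A sketch of inspecting a submission directory after the official run completes. Whether the stage artifacts live directly under the submission directory is an assumption, as flagged in the comments.

```python
# After an official run, check the updated status and the generated artifacts.
import json
import pathlib

submission_dir = pathlib.Path("leaderboard/.official_submissions/SUBMISSION_ID")

# submission_status.json is written at intake and updated by the official run.
print(json.loads((submission_dir / "submission_status.json").read_text()))

# Stage outputs (source collection, scraping, scoring, metrics) are produced by
# the run; the exact layout is an assumption here, so just list what exists.
for artifact in sorted(submission_dir.rglob("*.json")):
    print(artifact.relative_to(submission_dir))
```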
Typical command:
```bash
python src/evaluation/official_run.py \
  --submission-dir leaderboard/.official_submissions/SUBMISSION_ID \
  --holdout-manifest path/to/internal_holdout.csv
```

Think about these files as four layers:
- `validate_official_submission.py`: schema validation
- `official_submission_backend.py`: intake and queueing
- `official_run.py`: execution
- `compute_metrics.py`: metric aggregation
If this repository is refactored further, the clearer long-form names would be:
- `validate_submission.py`
- `intake_official_submission.py`
- `run_official_evaluation.py`
- `compute_metrics.py`
The current file names are kept for compatibility with the existing docs and examples.