| 1 | +# Scorecard Reference |
| 2 | + |
| 3 | +After each evaluation run, the contest validation harness produces a `scorecard.json` containing the run-level and per-benchmark results that determined a team's score. Scorecards are distributed to teams alongside their results to show how those results were interpreted and how the score was calculated. |
| 4 | + |
| 5 | +This page documents every field that may appear on a scorecard, organized into three groups: |
| 6 | + |
| 7 | +1. [Run-Level Fields](#run-level-fields) — metadata describing the validation run as a whole. |
| 8 | +2. [Benchmark-Level Fields](#benchmark-level-fields) — per-benchmark measurements and outcomes. |
| 9 | +3. [Validation Check Fields](#validation-check-fields) — the individual pass/fail checks that gate scoring. |
| 10 | + |
| 11 | +A short summary of the [scoring formula](#scoring-formula) is included at the end for convenience; full scoring rules live on the [Scoring Criteria](score.html) page. |
| 12 | + |
| 13 | +> ℹ️ **NOTE** |
| 14 | +> Some fields are diagnostic and may be `null` when not applicable (for example, `failure_reason` on a successful run). Cost and runtime fields are reported with the raw measured values; any caps or saturations are indicated by companion boolean fields such as `gamma_capped`. |
| 15 | + |
| 16 | +--- |
| 17 | + |
| 18 | +## Run-Level Fields |
| 19 | + |
| 20 | +These fields describe the validation run as a whole and are emitted once per scorecard. |
| 21 | + |
| 22 | +| Field | Description | |
| 23 | +|---|---| |
| 24 | +| `validation_id` | Unique ID for this validation attempt. | |
| 25 | +| `round` | Contest round being validated, for example `alpha`. | |
| 26 | +| `validator_git_sha` | Git commit of the trusted official validation tools bundle used for functional validation. | |
| 27 | +| `validator_s3_key` | Internal S3 location of the trusted validation tools bundle used for this run. | |
| 28 | +| `validator_validate_dcps_sha256` | SHA256 checksum of the trusted `validate_dcps.py` script used for functional validation. | |
| 29 | +| `total_score` | Sum of all benchmark scores. Failed benchmarks contribute `0`. | |
| 30 | +| `failure_reason` | Run-level failure reason if validation failed before completing benchmark evaluation. Usually `null` for completed runs. | |
| 31 | +| `status` | Overall validation status, such as `completed` or `failed`. | |
| 32 | + |
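As a quick consistency check, a team can confirm that `total_score` matches the sum of the per-benchmark `score` values. The sketch below is illustrative only: the helper name `total_score_matches`, the tolerance, and the sample values are hypothetical, and it assumes that a missing or `null` `score` counts as `0`.

```python
import math


def total_score_matches(scorecard: dict, tol: float = 1e-6) -> bool:
    """Return True if total_score equals the sum of benchmark scores.

    Assumption: a benchmark whose score is missing or null contributes 0.
    """
    benchmark_sum = sum(
        (b.get("score") or 0.0) for b in scorecard.get("benchmarks", [])
    )
    return math.isclose(scorecard["total_score"], benchmark_sum, abs_tol=tol)


# Hypothetical scorecard fragment; all names and values are made up.
example_scorecard = {
    "status": "completed",
    "total_score": 12.5,
    "benchmarks": [
        {"name": "bench_a", "status": "scored", "score": 12.5},
        {"name": "bench_b", "status": "failed", "score": 0.0},
    ],
}

print(total_score_matches(example_scorecard))
```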
| 33 | +--- |
| 34 | + |
| 35 | +## Benchmark-Level Fields |
| 36 | + |
| 37 | +Each entry in the `benchmarks` array describes the result for one benchmark. |
| 38 | + |
| 39 | +### Identity & Status |
| 40 | + |
| 41 | +| Field | Description | |
| 42 | +|---|---| |
| 43 | +| `name` | Benchmark name. | |
| 44 | +| `checksum_sha256` | SHA256 checksum of the input benchmark DCP used for validation. | |
| 45 | +| `status` | Benchmark outcome. `scored` means the benchmark completed validation and was scored; `failed` means a required step failed. | |
| 46 | +| `produced_output_dcp` | `true` if the validation harness found an output DCP for this benchmark. | |
| 47 | +| `failure_reason` | Reason the benchmark did not receive a positive score. Set to `no_improvement` when validation passed but Fmax did not improve; `null` when not applicable. |
| 48 | + |
| 49 | +### Timing Measurements |
| 50 | + |
| 51 | +| Field | Description | |
| 52 | +|---|---| |
| 53 | +| `fmax_input_mhz` | Baseline Fmax of the input DCP, in MHz. | |
| 54 | +| `fmax_output_mhz` | Measured Fmax of the submitted output DCP, in MHz. This is `0.0` if no valid Fmax measurement was produced. | |
| 55 | +| `wns_ns` | Worst negative slack from timing analysis, in nanoseconds. | |
| 56 | +| `whs_ns` | Worst hold slack, in nanoseconds. | |
| 57 | + |
| 58 | +### Scoring Inputs |
| 59 | + |
| 60 | +| Field | Description | |
| 61 | +|---|---| |
| 62 | +| `alpha_fmax_improvement_mhz` | Fmax improvement in MHz. Computed as `fmax_output_mhz - fmax_input_mhz` when a valid Fmax measurement exists. | |
| 63 | +| `beta_openrouter_cost_usd` | OpenRouter API spend for this benchmark, in USD. Used as a score penalty. | |
| 64 | +| `gamma_runtime_hours` | Runtime of `make run_optimizer` for this benchmark, in hours. Capped at `1.0` for scoring. | |
| 65 | +| `score` | Final score for this benchmark after applying improvement, API-cost penalty, and runtime penalty. | |
| 66 | + |
| 67 | +### Diagnostics |
| 68 | + |
| 69 | +| Field | Description | |
| 70 | +|---|---| |
| 71 | +| `validation` | Object containing detailed pass/fail checks for routing, DRC, timing, and functional validation. See [Validation Check Fields](#validation-check-fields). | |
| 72 | +| `wall_time_seconds` | Actual wall-clock runtime of `make run_optimizer`, in seconds. | |
| 73 | +| `gamma_capped` | `true` if runtime reached the per-benchmark cap and the runtime penalty was saturated. | |
| 74 | +| `openrouter_key_hash` | Internal hash identifier for the temporary OpenRouter key used for this benchmark. | |
| 75 | +| `openrouter_metering_failed` | `true` if OpenRouter usage could not be measured reliably. | |
| 76 | + |
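The relationship between `wall_time_seconds`, `gamma_runtime_hours`, and `gamma_capped` can be sketched as follows. This is an assumption-based illustration, not the harness code: it assumes the runtime in hours is simply `wall_time_seconds / 3600`, that the cap is `1.0` hour, and that "reached the cap" means greater than or equal to it. The function name `runtime_hours_for_scoring` is hypothetical.

```python
def runtime_hours_for_scoring(wall_time_seconds: float, cap_hours: float = 1.0):
    """Convert wall-clock seconds to hours and apply the scoring cap.

    Returns (hours_used_for_scoring, capped).
    Assumption: the cap is reached when the raw runtime is >= cap_hours.
    """
    raw_hours = wall_time_seconds / 3600.0
    capped = raw_hours >= cap_hours
    return min(raw_hours, cap_hours), capped


# A 30-minute run stays under the cap; a 90-minute run saturates it.
half_hour = runtime_hours_for_scoring(1800)
ninety_minutes = runtime_hours_for_scoring(5400)
print(half_hour, ninety_minutes)
```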
| 77 | +--- |
| 78 | + |
| 79 | +## Validation Check Fields |
| 80 | + |
| 81 | +The `validation` object on each benchmark contains the gating pass/fail checks. A benchmark must pass all required checks to be eligible for a positive score. |
| 82 | + |
| 83 | +| Field | Description | |
| 84 | +|---|---| |
| 85 | +| `par_routed` | `true` if the output DCP is fully routed according to Vivado `report_route_status`. | |
| 86 | +| `par_drc_clean` | `true` if Vivado DRC checks passed. | |
| 87 | +| `hold_passed` | `true` if hold timing checks passed. | |
| 88 | +| `pulse_width_passed` | `true` if pulse-width checks passed. | |
| 89 | +| `sim_passed` | `true` if trusted functional validation passed using the official `validate_dcps.py` flow. | |
| 90 | + |
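When a benchmark has `status: "failed"`, the `validation` object usually shows which check blocked scoring. The sketch below lists the checks that did not pass. It assumes all five fields in the table are required and treats a missing or `null` value as not passed; the helper name `failed_checks` and the sample benchmark are hypothetical.

```python
# Assumption: every check documented in the table above is required.
REQUIRED_CHECKS = (
    "par_routed",
    "par_drc_clean",
    "hold_passed",
    "pulse_width_passed",
    "sim_passed",
)


def failed_checks(benchmark: dict) -> list[str]:
    """Return the names of validation checks that are not exactly True."""
    validation = benchmark.get("validation") or {}
    return [name for name in REQUIRED_CHECKS if validation.get(name) is not True]


# Hypothetical benchmark entry: hold timing failed and sim_passed is absent.
example_benchmark = {
    "name": "bench_a",
    "validation": {
        "par_routed": True,
        "par_drc_clean": True,
        "hold_passed": False,
        "pulse_width_passed": True,
    },
}

print(failed_checks(example_benchmark))
```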
| 91 | +--- |
| 92 | + |
| 93 | +## Scoring Formula |
| 94 | + |
| 95 | +For a benchmark with `status: "scored"`, the score is computed as: |
| 96 | + |
| 97 | +``` |
| 98 | +score = max(0, alpha - 0.1 * alpha * beta - 0.1 * alpha * gamma) |
| 99 | +``` |
| 100 | + |
| 101 | +where: |
| 102 | + |
| 103 | +| Symbol | Source field | Meaning | |
| 104 | +|---|---|---| |
| 105 | +| `alpha` | `alpha_fmax_improvement_mhz` | `fmax_output_mhz - fmax_input_mhz` | |
| 106 | +| `beta` | `beta_openrouter_cost_usd` | OpenRouter spend, in USD | |
| 107 | +| `gamma` | `gamma_runtime_hours` | Optimizer runtime, in hours (capped at `1.0`) | |
| 108 | + |
| 109 | +A benchmark receives a score of `0` if any of the following are true: |
| 110 | + |
| 111 | +- It fails a required validation check (see [Validation Check Fields](#validation-check-fields)). |
| 112 | +- It does not produce an output DCP. |
| 113 | +- It does not improve Fmax over the input. |
| 114 | + |
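The formula and the zero-score conditions above can be combined into a small sketch for sanity-checking a scorecard by hand. The function name `benchmark_score` and its parameters are hypothetical, and the sample inputs are made up; the cap on `gamma` is applied inside the function on the assumption that the caller passes the raw runtime.

```python
def benchmark_score(
    alpha: float,
    beta: float,
    gamma_hours: float,
    passed_validation: bool = True,
    produced_output_dcp: bool = True,
) -> float:
    """Score one benchmark from the documented formula.

    alpha: Fmax improvement in MHz (fmax_output_mhz - fmax_input_mhz)
    beta: OpenRouter spend in USD
    gamma_hours: optimizer runtime in hours (capped at 1.0 here)
    """
    # Any zero-score condition short-circuits the formula.
    if not passed_validation or not produced_output_dcp or alpha <= 0:
        return 0.0
    gamma = min(gamma_hours, 1.0)
    return max(0.0, alpha - 0.1 * alpha * beta - 0.1 * alpha * gamma)


# Worked example: alpha = 20 MHz, beta = 2 USD, gamma = 0.5 h
# gives 20 - 0.1*20*2 - 0.1*20*0.5 = 20 - 4 - 1 = 15 (up to float rounding).
typical = benchmark_score(20.0, 2.0, 0.5)

# Heavy API spend can drive the raw value negative; max(0, ...) clamps it.
overspent = benchmark_score(10.0, 12.0, 1.0)

print(typical, overspent)
```

Because both penalties scale with `alpha`, each USD of spend and each hour of runtime (up to the cap) removes 10% of the improvement under this formula.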
| 115 | +For full scoring rules, ranking methodology, and tie-breaking, see the [Scoring Criteria](score.html) page. |