Commit 7cf9f04

Adding scorecard reference
Signed-off-by: Chris Lavin <chris.lavin@amd.com>
1 parent c52bb7b commit 7cf9f04

2 files changed

Lines changed: 119 additions & 0 deletions

docs/score.md

Lines changed: 4 additions & 0 deletions
@@ -35,3 +35,7 @@ Since testing and validation will occur on AWS instances, we will limit runtime
## OpenRouter Cost Limit
For each team's submission, a new API key provisioned with at least $1.00 USD per benchmark will be allocated for the entire evaluation. Teams cannot provide their own API keys to enable additional spend beyond this limit.

## Scorecard
After each evaluation run, teams receive a detailed scorecard with all of the per-run and per-benchmark fields used to compute their score. See the [Scorecard Reference](scorecard.html) for a full description of every field reported.

docs/scorecard.md

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
# Scorecard Reference

After each evaluation run, the contest validation harness produces a `scorecard.json` containing the per-run and per-benchmark results that determined a team's score. Scorecards are distributed to teams alongside their results to show how those results were interpreted and how the score was calculated.

This page documents every field that may appear on a scorecard, organized into three groups:

1. [Run-Level Fields](#run-level-fields) &mdash; metadata describing the validation run as a whole.
2. [Benchmark-Level Fields](#benchmark-level-fields) &mdash; per-benchmark measurements and outcomes.
3. [Validation Check Fields](#validation-check-fields) &mdash; the individual pass/fail checks that gate scoring.

A short summary of the [scoring formula](#scoring-formula) is included at the end for convenience; full scoring rules live on the [Scoring Criteria](score.html) page.

> ℹ️ **NOTE**
> Some fields are diagnostic and may be `null` when not applicable (for example, `failure_reason` on a successful run). Cost and runtime fields are reported with the raw measured values; any caps or saturations are indicated by companion boolean fields such as `gamma_capped`.

---

## Run-Level Fields

These fields describe the validation run as a whole and are emitted once per scorecard.

| Field | Description |
|---|---|
| `validation_id` | Unique ID for this validation attempt. |
| `round` | Contest round being validated, for example `alpha`. |
| `validator_git_sha` | Git commit of the trusted official validation tools bundle used for functional validation. |
| `validator_s3_key` | Internal S3 location of the trusted validation tools bundle used for this run. |
| `validator_validate_dcps_sha256` | SHA256 checksum of the trusted `validate_dcps.py` script used for functional validation. |
| `total_score` | Sum of all benchmark scores. Failed benchmarks contribute `0`. |
| `failure_reason` | Run-level failure reason if validation failed before completing benchmark evaluation. Usually `null` for completed runs. |
| `status` | Overall validation status, such as `completed` or `failed`. |

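As a minimal sketch of how a team might read these fields, the snippet below parses a scorecard and cross-checks `total_score` against the per-benchmark scores. It assumes that the run-level fields are top-level keys of `scorecard.json` and that each entry of the `benchmarks` array carries a `score`. The sample values and the `summarize_run` helper are hypothetical and only illustrate the idea.

```python
import json
import math

# Hypothetical scorecard contents; every value here is illustrative only.
SAMPLE_SCORECARD = """
{
  "validation_id": "example-id",
  "round": "alpha",
  "status": "completed",
  "failure_reason": null,
  "total_score": 12.5,
  "benchmarks": [
    {"name": "bench_a", "status": "scored", "score": 12.5},
    {"name": "bench_b", "status": "failed", "score": 0.0}
  ]
}
"""


def summarize_run(scorecard: dict) -> dict:
    """Collect run-level fields and re-derive the total from benchmark scores."""
    benchmarks = scorecard.get("benchmarks") or []
    # Failed benchmarks contribute 0, so a missing or null score counts as 0.
    recomputed = sum(b.get("score") or 0.0 for b in benchmarks)
    return {
        "round": scorecard.get("round"),
        "status": scorecard.get("status"),
        "failure_reason": scorecard.get("failure_reason"),
        "reported_total": scorecard.get("total_score"),
        "recomputed_total": recomputed,
    }


summary = summarize_run(json.loads(SAMPLE_SCORECARD))
totals_match = math.isclose(summary["reported_total"], summary["recomputed_total"])
print(summary["status"], summary["reported_total"], totals_match)
```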
---

## Benchmark-Level Fields

Each entry in the `benchmarks` array describes the result for one benchmark.

### Identity & Status

| Field | Description |
|---|---|
| `name` | Benchmark name. |
| `checksum_sha256` | SHA256 checksum of the input benchmark DCP used for validation. |
| `status` | Benchmark outcome. `scored` means the benchmark completed validation and was scored; `failed` means a required step failed. |
| `produced_output_dcp` | `true` if the validation harness found an output DCP for this benchmark. |
| `failure_reason` | Benchmark-level failure reason if the benchmark did not receive a positive score. May also be `no_improvement` when validation passed but Fmax did not improve. |

### Timing Measurements

| Field | Description |
|---|---|
| `fmax_input_mhz` | Baseline Fmax of the input DCP, in MHz. |
| `fmax_output_mhz` | Measured Fmax of the submitted output DCP, in MHz. This is `0.0` if no valid Fmax measurement was produced. |
| `wns_ns` | Worst negative slack from timing analysis, in nanoseconds. |
| `whs_ns` | Worst hold slack, in nanoseconds. |

### Scoring Inputs

| Field | Description |
|---|---|
| `alpha_fmax_improvement_mhz` | Fmax improvement in MHz. Computed as `fmax_output_mhz - fmax_input_mhz` when a valid Fmax measurement exists. |
| `beta_openrouter_cost_usd` | OpenRouter API spend for this benchmark, in USD. Used as a score penalty. |
| `gamma_runtime_hours` | Runtime of `make run_optimizer` for this benchmark, in hours. Capped at `1.0` for scoring. |
| `score` | Final score for this benchmark after applying improvement, API-cost penalty, and runtime penalty. |

### Diagnostics

| Field | Description |
|---|---|
| `validation` | Object containing detailed pass/fail checks for routing, DRC, timing, and functional validation. See [Validation Check Fields](#validation-check-fields). |
| `wall_time_seconds` | Actual wall-clock runtime of `make run_optimizer`, in seconds. |
| `gamma_capped` | `true` if runtime reached the per-benchmark cap and the runtime penalty was saturated. |
| `openrouter_key_hash` | Internal hash identifier for the temporary OpenRouter key used for this benchmark. |
| `openrouter_metering_failed` | `true` if OpenRouter usage could not be measured reliably. |

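One plausible reading of how `wall_time_seconds`, `gamma_runtime_hours`, and `gamma_capped` relate is sketched below. The conversion (seconds divided by 3600, saturated at the 1.0-hour cap) is an assumption inferred from the field descriptions, not a confirmed description of the harness, and `runtime_penalty_inputs` is a hypothetical helper name.

```python
SECONDS_PER_HOUR = 3600.0
GAMMA_CAP_HOURS = 1.0  # cap stated in the `gamma_runtime_hours` description


def runtime_penalty_inputs(wall_time_seconds: float) -> tuple[float, bool]:
    """Return (gamma used for scoring, whether the cap was reached).

    Assumption: gamma is the wall-clock runtime in hours, saturated at the cap.
    """
    raw_hours = wall_time_seconds / SECONDS_PER_HOUR
    capped = raw_hours >= GAMMA_CAP_HOURS
    return min(raw_hours, GAMMA_CAP_HOURS), capped


print(runtime_penalty_inputs(1800.0))  # half an hour, below the cap
print(runtime_penalty_inputs(5400.0))  # ninety minutes, saturates at the cap
```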
---

## Validation Check Fields

The `validation` object on each benchmark contains the gating pass/fail checks. A benchmark must pass all required checks to be eligible for a positive score.

| Field | Description |
|---|---|
| `par_routed` | `true` if the output DCP is fully routed according to Vivado `report_route_status`. |
| `par_drc_clean` | `true` if Vivado DRC checks passed. |
| `hold_passed` | `true` if hold timing checks passed. |
| `pulse_width_passed` | `true` if pulse-width checks passed. |
| `sim_passed` | `true` if trusted functional validation passed using the official `validate_dcps.py` flow. |

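The gating rule can be sketched as a small helper that lists which checks kept a benchmark from being eligible. Treating all five documented checks as required is an assumption, since this page does not say which checks are optional, and the `failed_checks` and `is_eligible` names are hypothetical.

```python
# Check names come from the validation table; treating all five as
# required is an assumption made for this sketch.
REQUIRED_CHECKS = (
    "par_routed",
    "par_drc_clean",
    "hold_passed",
    "pulse_width_passed",
    "sim_passed",
)


def failed_checks(validation: dict) -> list[str]:
    """Return the names of checks that are missing or not exactly True."""
    return [name for name in REQUIRED_CHECKS if validation.get(name) is not True]


def is_eligible(validation: dict) -> bool:
    """A benchmark is eligible for a positive score only if no check failed."""
    return not failed_checks(validation)


# Illustrative validation object with a single failing check.
example = {
    "par_routed": True,
    "par_drc_clean": True,
    "hold_passed": False,
    "pulse_width_passed": True,
    "sim_passed": True,
}
print(failed_checks(example))
print(is_eligible(example))
```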
---

## Scoring Formula

For a benchmark with `status: "scored"`, the score is computed as:

```
score = max(0, alpha - 0.1 * alpha * beta - 0.1 * alpha * gamma)
```

where:

| Symbol | Source field | Meaning |
|---|---|---|
| `alpha` | `alpha_fmax_improvement_mhz` | `fmax_output_mhz - fmax_input_mhz` |
| `beta` | `beta_openrouter_cost_usd` | OpenRouter spend, in USD |
| `gamma` | `gamma_runtime_hours` | Optimizer runtime, in hours (capped at `1.0`) |

A benchmark receives a score of `0` if any of the following are true:

- It fails a required validation check (see [Validation Check Fields](#validation-check-fields)).
- It does not produce an output DCP.
- It does not improve Fmax over the input.

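For teams who want to reproduce a benchmark score from its scorecard fields, the formula and the zero-score conditions above can be sketched as follows. The `benchmark_score` function and its parameter names are hypothetical, and the input numbers are illustrative only. The explicit guard on `alpha` is a choice made for this sketch: with a negative `alpha`, the penalty terms change sign, so the "no improvement" rule is applied directly rather than left to `max(0, ...)`.

```python
def benchmark_score(
    alpha: float,
    beta: float,
    gamma: float,
    passed_validation: bool = True,
    produced_output_dcp: bool = True,
) -> float:
    """Sketch of score = max(0, alpha - 0.1*alpha*beta - 0.1*alpha*gamma).

    alpha: Fmax improvement in MHz
    beta:  OpenRouter spend in USD
    gamma: optimizer runtime in hours (capped at 1.0 before use)
    """
    # Zero-score conditions listed above.
    if not passed_validation or not produced_output_dcp or alpha <= 0:
        return 0.0
    gamma = min(gamma, 1.0)
    return max(0.0, alpha - 0.1 * alpha * beta - 0.1 * alpha * gamma)


# Illustrative inputs: +20 MHz improvement, $0.50 spend, 0.5 hours of runtime.
example_score = benchmark_score(alpha=20.0, beta=0.5, gamma=0.5)
print(example_score)
```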
For full scoring rules, ranking methodology, and tie-breaking, see the [Scoring Criteria](score.html) page.
