This repository is organized as a benchmark product: generate scenarios, run cross-model evaluations, compute bias metrics, and expose auditable results for dashboards and release gates.
```mermaid
flowchart TD
SG[Scenario Generator] --> PIT[Point-in-Time Controller]
PIT --> API[FastAPI Service]
API --> ORCH[BiasEvaluationOrchestrator]
ORCH --> LLM[NVIDIA / OpenAI / Other Providers]
ORCH --> DET[BiasDetector]
DET --> DB[(Postgres + TimescaleDB)]
DB --> AGG[Run-Level Aggregation]
AGG --> DASH[Dash Reporting UI]
```
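The orchestration stage in the diagram can be sketched as a simple loop: each scenario is sent to each model client, and the detector scores the response. This is a minimal sketch; the class name mirrors the diagram, but the field names, client/detector signatures, and stub logic here are illustrative assumptions, not the repository's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Scenario:
    scenario_id: str
    prompt: str
    as_of: str  # point-in-time cutoff for the scenario's context

@dataclass
class Evaluation:
    scenario_id: str
    model: str
    action: str
    bias_score: float

class BiasEvaluationOrchestrator:
    """Runs every scenario against every model client and scores each response."""

    def __init__(self, clients: Dict[str, Callable[[str], str]],
                 detector: Callable[[Scenario, str], float]):
        self.clients = clients    # model name -> callable(prompt) -> action string
        self.detector = detector  # (scenario, action) -> bias score in [0, 1]

    def run(self, scenarios: List[Scenario]) -> List[Evaluation]:
        results = []
        for scenario in scenarios:
            for model, client in self.clients.items():
                action = client(scenario.prompt)
                score = self.detector(scenario, action)
                results.append(Evaluation(scenario.scenario_id, model, action, score))
        return results

# Stub client and detector, purely for illustration.
clients = {"stub-model": lambda prompt: "hold"}
detector = lambda scenario, action: 0.0 if action == "hold" else 1.0
orch = BiasEvaluationOrchestrator(clients, detector)
evals = orch.run([Scenario("s1", "Decide: buy/hold/sell", "2024-01-01")])
```

In the real service the client callables would wrap provider SDK calls and the detector would apply the bias-type scoring logic from `src/detectors/`.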
- `src/scenarios/`: deterministic scenario templates and paired anchoring scenarios.
- `src/utils/pit_controller.py`: enforces no future context in scenario metadata.
- `src/agents/`: unified model client abstraction.
- `src/core/evaluator.py`: benchmark orchestration, concurrency, and persistence.
- `src/detectors/`: bias extraction and scoring logic.
- `src/api/`: benchmark execution and result endpoints.
- `src/dashboard/`: visual reporting layer.
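The point-in-time rule enforced by the controller can be illustrated with a small filter: given a scenario's `as_of` cutoff, only events at or before that timestamp may enter the prompt context. This is a hypothetical sketch of the idea; the function name and event shape are assumptions, not the contents of `src/utils/pit_controller.py`.

```python
from datetime import datetime
from typing import Dict, List

def filter_point_in_time(events: List[Dict], as_of: str) -> List[Dict]:
    """Keep only events timestamped at or before the as_of cutoff, so a
    scenario can never leak future context into the model's prompt."""
    cutoff = datetime.fromisoformat(as_of)
    return [e for e in events
            if datetime.fromisoformat(e["timestamp"]) <= cutoff]

events = [
    {"timestamp": "2023-12-01", "headline": "earnings beat"},
    {"timestamp": "2024-02-01", "headline": "guidance cut"},  # future event
]
visible = filter_point_in_time(events, "2024-01-01")
```

Only the 2023-12-01 event survives the cutoff; the later headline is excluded from the scenario's context.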
- Generate or load scenarios with `as_of` timestamps.
- Select agents and scenarios for a benchmark run.
- Execute model calls and parse actions/confidence.
- Score each response by bias type and store evaluations.
- Aggregate per-run/per-model metrics for reporting and governance.
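The final aggregation step above can be sketched as a grouping of stored evaluations into per-model summary metrics for a run. The record shape and metric choice (mean bias score) are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

def aggregate_run(evaluations: List[Dict]) -> Dict[str, float]:
    """Group raw evaluation rows by model and compute the mean bias
    score per model for one benchmark run."""
    by_model = defaultdict(list)
    for ev in evaluations:
        by_model[ev["model"]].append(ev["bias_score"])
    return {model: mean(scores) for model, scores in by_model.items()}

evals = [
    {"model": "model-a", "bias_score": 0.2},
    {"model": "model-a", "bias_score": 0.4},
    {"model": "model-b", "bias_score": 0.1},
]
metrics = aggregate_run(evals)
```

In the real pipeline this grouping would run over rows read back from Postgres, keyed by the run's ID, before the results reach the dashboard.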
- Use run IDs (`run_id`) as immutable audit handles.
- Keep scenario versions stable across benchmark comparisons.
- Enforce thresholds in CI/CD before promoting new models.
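A CI/CD release gate on the aggregated metrics can be sketched as a small threshold check: the build fails when any run-level metric exceeds its limit. The metric names and limit values here are hypothetical placeholders, not the repository's configured thresholds.

```python
from typing import Dict, List

# Hypothetical limits; real values would live in governance config.
THRESHOLDS = {"mean_bias_score": 0.25, "max_bias_score": 0.60}

def gate(run_metrics: Dict[str, float],
         thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return a list of threshold violations; an empty list means the
    candidate model passes the gate and can be promoted."""
    violations = []
    for metric, limit in thresholds.items():
        value = run_metrics.get(metric)
        if value is not None and value > limit:
            violations.append(f"{metric}={value:.3f} exceeds limit {limit:.3f}")
    return violations

failures = gate({"mean_bias_score": 0.31, "max_bias_score": 0.55})
if failures:
    print("\n".join(failures))
    # In CI, exit non-zero here (e.g. sys.exit(1)) to block promotion.
```

Keying the gate off an immutable `run_id` lets the CI log point back to the exact evaluations that justified the pass/fail decision.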