WorldModel Gym is an end-to-end benchmark platform for long-horizon planning agents. It combines reproducible benchmark environments, planner and world-model baselines, a FastAPI submission service, and a polished Next.js leaderboard into one deployable monorepo.
- Reproducible benchmark tasks designed around sparse rewards, partial observability, and procedural generalization
- Modular research stack spanning environments, agents, planners, and world models
- Production-minded backend with Alembic migrations, scoped API keys, rate limiting, readiness checks, structured logging, and Prometheus metrics
- Modern frontend with a custom editorial product UI, same-origin proxying, SEO metadata, and Playwright smoke coverage
- Full-stack delivery workflow with GitHub Actions, Render deployment support, and Vercel deployment support
- `core/`: benchmark environments, traces, and evaluation harness
- `agents/`: baseline agents and agent registry
- `planners/`: planning algorithms such as MCTS and MPC-CEM
- `worldmodels/`: deterministic, stochastic, and ensemble world model baselines
- `server/`: FastAPI API, auth, migrations, storage, CLI, and demo-data seeding
- `web/`: Next.js App Router dashboard and API proxy
- `mobile/`: Expo-based mobile viewer
- `paper/`: manuscript sources and generated PDF artifacts
```mermaid
flowchart LR
  subgraph Research["Research Layer"]
    ENV["Benchmark Environments"]
    AGENT["Agents"]
    PLAN["Planners"]
    MODEL["World Models"]
    HARNESS["Evaluation Harness"]
  end
  subgraph Platform["Platform Layer"]
    API["FastAPI API"]
    DB[("Postgres / SQLite")]
    STORE[("Local / S3 Artifact Store")]
    WEB["Next.js Dashboard"]
    PROXY["Next.js API Proxy"]
    MOBILE["Expo Viewer"]
  end
  AGENT --> ENV
  AGENT --> PLAN
  AGENT --> MODEL
  PLAN --> MODEL
  HARNESS --> AGENT
  HARNESS --> ENV
  HARNESS --> API
  API --> DB
  API --> STORE
  WEB --> PROXY
  PROXY --> API
  MOBILE --> API
```
```mermaid
flowchart LR
  GH["GitHub"]
  CI["GitHub Actions"]
  V["Vercel Web"]
  R["Render API"]
  PG[("Managed Postgres")]
  S3[("S3-Compatible Storage")]
  GH --> CI
  GH --> V
  GH --> R
  R --> PG
  R --> S3
  V --> R
```
The default production shape is:
- FastAPI API on Render
- Next.js dashboard on Vercel
- managed Postgres for run metadata
- local or S3-compatible storage for trace and metrics artifacts
- browser requests routed through the Next.js proxy instead of direct cross-origin API calls
- create benchmark runs through the API
- upload metrics, traces, and config artifacts
- inspect public leaderboard data by track
- browse tasks and benchmark context from the web dashboard
- verify public deployment health with readiness, liveness, and smoke checks
- seed demo leaderboard data for first-run and demo environments
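The write-and-read flow above can be sketched with a minimal client. This is illustrative only: the endpoint path, header names, and payload fields are assumptions, not the documented API surface.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"


def auth_headers(api_key: str) -> dict:
    # Assumed bearer-token header; check the server's auth documentation.
    return {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}


def run_payload(track: str, agent: str) -> dict:
    # Hypothetical payload shape for creating a benchmark run.
    return {"track": track, "agent": agent}


def create_run(api_key: str, track: str, agent: str) -> dict:
    # POST the new run; "/api/runs" is an assumed path.
    req = urllib.request.Request(
        f"{API_BASE}/api/runs",
        data=json.dumps(run_payload(track, agent)).encode(),
        headers=auth_headers(api_key),
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (requires a running API and a valid writer key):
#   create_run("wmg_example_key", "test", "mcts-baseline")
```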
```sh
make setup
make demo
```

`make demo` will:
- start the API and web stack with Docker when available
- fall back to a local API process if Docker is unavailable
- create a benchmark run
- upload artifacts through the API
- populate the leaderboard flow end to end
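The Docker-or-local decision can be approximated in a few lines; this is a sketch of the fallback behavior, not the Makefile's actual logic:

```python
import shutil


def demo_backend() -> str:
    """Pick Docker when it is on PATH, otherwise fall back to a local API process."""
    return "docker" if shutil.which("docker") is not None else "local"
```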
Open:
Local development uses built-in defaults. If you need overrides, export environment variables in your shell or configure them in your deployment provider. Do not commit env files to the repository.
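A defaults-with-override pattern like the one described might look as follows; the setting names here (other than `WMG_BOOTSTRAP_API_KEY`, which appears in the deployment notes) are assumptions for illustration:

```python
import os


def setting(name: str, default: str) -> str:
    """Return an environment override when present, else the built-in default."""
    return os.environ.get(name, default)


# Hypothetical setting names for illustration only.
database_url = setting("WMG_DATABASE_URL", "sqlite:///./worldmodel.db")
bootstrap_key = setting("WMG_BOOTSTRAP_API_KEY", "")
```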
```sh
make lint
make test
make demo
make seed-demo
make create-api-key NAME=local-writer SCOPE=runs:write
make verify-deployment
make deploy
make stop
make deploy-public
make stop-public
make deploy-vercel
```

- Alembic migrations replace implicit schema creation
- scoped API keys support `runs:write` and admin-style access control
- legacy upload-token support exists only as a compatibility path and can be disabled
- authenticated writes and public reads are rate limited separately
- structured logs include request IDs, durations, and startup/readiness events
- `/healthz`, `/readyz`, and `/metrics` expose runtime health and monitoring hooks
- the frontend uses a same-origin proxy route for safer browser-to-API access
- demo data seeding and demo-run upload tooling are built into the repo
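Scoped keys like `runs:write` typically reduce to a membership check. A minimal sketch, assuming each key carries a set of scope strings and an admin-style scope that implies the rest (the server's actual model may differ):

```python
def has_scope(key_scopes: set, required: str, admin_scope: str = "admin") -> bool:
    """Authorize a request if the key holds the required scope or the admin scope."""
    return required in key_scopes or admin_scope in key_scopes


# A writer key can create runs but cannot perform admin-only actions.
writer = {"runs:write"}
```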
Create a scoped API key:
```sh
.venv/bin/python -m worldmodel_server.cli create-api-key \
  --name production-writer \
  --scope runs:write
```

Seed demo data:

```sh
.venv/bin/python -m worldmodel_server.cli seed-demo-data --force
```

Upload a demo run against a local or hosted API:

```sh
.venv/bin/python scripts/demo_run.py \
  --api-base http://localhost:8000
```

Verify the full public deployment:

```sh
.venv/bin/python scripts/verify_deployment.py \
  --api-base https://worldmodel-gym-api.onrender.com \
  --web-base https://world-model-gym.vercel.app
```

Useful runtime endpoints:
- API liveness: `/healthz`
- API readiness: `/readyz`
- API metrics: `/metrics`
- web smoke path: `/api/proxy/api/leaderboard?track=test`
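`scripts/verify_deployment.py` is the real verifier; a stripped-down sketch of the kind of per-endpoint check it presumably performs over the paths listed above:

```python
import urllib.request


def endpoint_ok(base: str, path: str, timeout: float = 5.0) -> bool:
    """Return True when the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(base + path, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and non-2xx HTTPError responses.
        return False


def smoke(base: str) -> dict:
    # Liveness, readiness, and metrics paths from the endpoint list above.
    return {path: endpoint_ok(base, path) for path in ("/healthz", "/readyz", "/metrics")}
```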
- Deploy the API from `render.yaml`
- Deploy the web app from the `web/` root directory in Vercel
- Store production secrets in Render and Vercel, not in repo files
- Switch artifact storage to S3-compatible storage for durable production uploads
- Remove `WMG_BOOTSTRAP_API_KEY` after the first durable writer key is created
Full deployment and operations references:
- Ruff lint and formatting checks
- Pytest coverage for backend behavior
- Next.js production build verification
- Playwright smoke tests for the web flow
- Scheduled production smoke checks against public deployment surfaces
- Built and shipped a full-stack benchmark product spanning environments, planners, model baselines, backend APIs, and frontend dashboards
- Hardened the backend with migrations, auth, rate limiting, structured logging, and cloud deployment support
- Added deployment verification, browser E2E coverage, and production smoke automation on top of standard CI
- Designed a custom frontend UI system rather than relying on a boilerplate template