worldmodel-gym

WorldModel Gym is an end-to-end benchmark platform for long-horizon planning agents. It combines reproducible benchmark environments, planner and world-model baselines, a FastAPI submission service, and a polished Next.js leaderboard into one deployable monorepo.

Why This Repo Stands Out

Reproducible benchmark tasks designed around sparse rewards, partial observability, and procedural generalization
Modular research stack spanning environments, agents, planners, and world models
Production-minded backend with Alembic migrations, scoped API keys, rate limiting, readiness checks, structured logging, and Prometheus metrics
Modern frontend with a custom editorial product UI, same-origin proxying, SEO metadata, and Playwright smoke coverage
Full-stack delivery workflow with GitHub Actions, Render deployment support, and Vercel deployment support

Stack

core/: benchmark environments, traces, and evaluation harness
agents/: baseline agents and agent registry
planners/: planning algorithms such as MCTS and MPC-CEM
worldmodels/: deterministic, stochastic, and ensemble world model baselines
server/: FastAPI API, auth, migrations, storage, CLI, and demo-data seeding
web/: Next.js App Router dashboard and API proxy
mobile/: Expo-based mobile viewer
paper/: manuscript sources and generated PDF artifacts

Architecture

flowchart LR
    subgraph Research["Research Layer"]
        ENV["Benchmark Environments"]
        AGENT["Agents"]
        PLAN["Planners"]
        MODEL["World Models"]
        HARNESS["Evaluation Harness"]
    end

    subgraph Platform["Platform Layer"]
        API["FastAPI API"]
        DB[("Postgres / SQLite")]
        STORE[("Local / S3 Artifact Store")]
        WEB["Next.js Dashboard"]
        PROXY["Next.js API Proxy"]
        MOBILE["Expo Viewer"]
    end

    AGENT --> ENV
    AGENT --> PLAN
    AGENT --> MODEL
    PLAN --> MODEL
    HARNESS --> AGENT
    HARNESS --> ENV
    HARNESS --> API
    API --> DB
    API --> STORE
    WEB --> PROXY
    PROXY --> API
    MOBILE --> API

Deployment Topology

flowchart LR
    GH["GitHub"]
    CI["GitHub Actions"]
    V["Vercel Web"]
    R["Render API"]
    PG[("Managed Postgres")]
    S3[("S3-Compatible Storage")]

    GH --> CI
    GH --> V
    GH --> R
    R --> PG
    R --> S3
    V --> R

The default production shape is:

FastAPI API on Render
Next.js dashboard on Vercel
managed Postgres for run metadata
local or S3-compatible storage for trace and metrics artifacts
browser requests routed through the Next.js proxy instead of direct cross-origin API calls

Core Product Capabilities

create benchmark runs through the API
upload metrics, traces, and config artifacts
inspect public leaderboard data by track
browse tasks and benchmark context from the web dashboard
verify public deployment health with readiness, liveness, and smoke checks
seed demo leaderboard data for first-run and demo environments

Local Quickstart

make setup
make demo

make demo will:

start the API and web stack with Docker when available
fall back to a local API process if Docker is unavailable
create a benchmark run
upload artifacts through the API
populate the leaderboard flow end to end

Open:

Local development uses built-in defaults. If you need overrides, export environment variables in your shell or configure them in your deployment provider. Do not commit env files to the repository.

Developer Commands

make lint
make test
make demo
make seed-demo
make create-api-key NAME=local-writer SCOPE=runs:write
make verify-deployment
make deploy
make stop
make deploy-public
make stop-public
make deploy-vercel

Production Features

Alembic migrations replace implicit schema creation
scoped API keys support runs:write and admin-style access control
legacy upload-token support exists only as a compatibility path and can be disabled
authenticated writes and public reads are rate limited separately
structured logs include request IDs, durations, and startup/readiness events
/healthz, /readyz, and /metrics expose runtime health and monitoring hooks
the frontend uses a same-origin proxy route for safer browser-to-API access
demo data seeding and demo-run upload tooling are built into the repo

Auth, Data, and Operations

Create a scoped API key:

.venv/bin/python -m worldmodel_server.cli create-api-key \
  --name production-writer \
  --scope runs:write

Seed demo data:

.venv/bin/python -m worldmodel_server.cli seed-demo-data --force

Upload a demo run against a local or hosted API:

.venv/bin/python scripts/demo_run.py \
  --api-base http://localhost:8000

Verify the full public deployment:

.venv/bin/python scripts/verify_deployment.py \
  --api-base https://worldmodel-gym-api.onrender.com \
  --web-base https://world-model-gym.vercel.app

Useful runtime endpoints:

API liveness: /healthz
API readiness: /readyz
API metrics: /metrics
web smoke path: /api/proxy/api/leaderboard?track=test

Deployment Notes

Deploy the API from render.yaml
Deploy the web app from the web/ root directory in Vercel
Store production secrets in Render and Vercel, not in repo files
Switch artifact storage to S3-compatible storage for durable production uploads
Remove WMG_BOOTSTRAP_API_KEY after the first durable writer key is created

Full deployment and operations references:

Quality Gates

Ruff lint and formatting checks
Pytest coverage for backend behavior
Next.js production build verification
Playwright smoke tests for the web flow
scheduled production smoke checks against public deployment surfaces

Resume-Friendly Highlights

Built and shipped a full-stack benchmark product spanning environments, planners, model baselines, backend APIs, and frontend dashboards
Hardened the backend with migrations, auth, rate limiting, structured logging, and cloud deployment support
Added deployment verification, browser E2E coverage, and production smoke automation on top of standard CI
Designed a custom frontend UI system rather than relying on a boilerplate template

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
.github		.github
agents		agents
core		core
docker		docker
docs		docs
mobile		mobile
paper		paper
planners		planners
runs		runs
scripts		scripts
server		server
tests		tests
web		web
worldmodels		worldmodels
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
ROADMAP.md		ROADMAP.md
SECURITY.md		SECURITY.md
alembic.ini		alembic.ini
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
render.yaml		render.yaml
requirements-dev.txt		requirements-dev.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

worldmodel-gym

Why This Repo Stands Out

Stack

Architecture

Deployment Topology

Core Product Capabilities

Local Quickstart

Developer Commands

Production Features

Auth, Data, and Operations

Deployment Notes

Quality Gates

Resume-Friendly Highlights

License

About

Uh oh!

Releases 3

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

worldmodel-gym

Why This Repo Stands Out

Stack

Architecture

Deployment Topology

Core Product Capabilities

Local Quickstart

Developer Commands

Production Features

Auth, Data, and Operations

Deployment Notes

Quality Gates

Resume-Friendly Highlights

License

About

Topics

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages