PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

Figure: Confidence-Weighted Return Updated by Empirical Market Resolutions

Welcome to the official repository for PolyBench, the first large-scale, contamination-proof benchmark that evaluates Large Language Models (LLMs) as autonomous trading agents on live decentralized prediction markets.

This repository contains the full data collection, alignment, AI assessment, and trading execution pipeline introduced in our paper:
PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

Download the dataset from the OneDrive link, or use this repository to build your own.

📖 Overview

Predicting real-world events from live market signals demands systems that fuse qualitative news with quantitative order-book dynamics under strict temporal discipline. Existing benchmarks often reduce forecasting to static text question answering.

Figure: PolyBench Construction Pipeline

We present PolyBench, a multimodal evaluation framework built on Polymarket data that synchronously couples:

Event Resolution Criteria: Strict conditional definitions and settlement rules.
Central Limit Order Book (CLOB) States: Real-time liquidity, bid-ask spreads, and midpoint pricing.
Exogenous News Streams: Pre-fetched Google News context aligned temporally with the market snapshot.
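
To make the coupling concrete, the sketch below models one aligned sample as a plain Python record. All names here (`NewsItem`, `MarketSnapshot`, `aligned_news`) are hypothetical and are not taken from the repository's `peewee_models.py`. The sketch relies only on the standard definitions of midpoint and spread, and on the requirement that news context must not postdate the market snapshot.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class NewsItem:
    """One pre-fetched headline (hypothetical schema)."""
    title: str
    published_at: datetime


@dataclass(frozen=True)
class MarketSnapshot:
    """One market state at a fixed timestamp (hypothetical schema)."""
    question: str
    resolution_criteria: str
    taken_at: datetime
    best_bid: float  # highest price a buyer is offering, in [0, 1]
    best_ask: float  # lowest price a seller is asking, in [0, 1]

    @property
    def midpoint(self) -> float:
        # Midpoint price: the average of best bid and best ask.
        return (self.best_bid + self.best_ask) / 2

    @property
    def spread(self) -> float:
        # Bid-ask spread: the gap between best ask and best bid.
        return self.best_ask - self.best_bid


def aligned_news(snapshot: MarketSnapshot, items: list[NewsItem]) -> list[NewsItem]:
    """Keep only news published at or before the snapshot time,
    so that no information from the future leaks into the prompt."""
    return [item for item in items if item.published_at <= snapshot.taken_at]
```

Filtering on `published_at <= taken_at` is one simple way to enforce the temporal discipline described above; the actual alignment logic in `news_fetcher.py` may differ.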

Using PolyBench, we evaluated seven state-of-the-art LLMs (e.g., Gemini-3-Flash, MiMo-V2-Flash, DeepSeek-V3.2, GPT-OSS-120B) across 38,666 binary prediction markets spanning 4,997 events. The framework measures their practical financial viability via our proposed Confidence-Weighted Return (CWR), Annualized Percentage Yield (APY), and Sharpe ratio.

Table: All LLMs' Results on Metrics

🛠 Project Architecture

This is a self-contained, standalone CLI intended for continuous market evaluation and batch processing:

poly-analysis/
├── config.py              # Central configuration (API keys, constraints)
├── .env                   # Environment variables for secure credential storage
├── main.py                # Main CLI entry point
│
├── core/                  # Core business logic
│   ├── analysis.py        # LLM inference engine & heuristic prompt building
│   ├── market_data.py     # Polymarket Gamma API and CLOB interactions
│   ├── news_fetcher.py    # Exogenous context scraping pipeline
│   └── trading_engine.py  # Polygon blockchain transaction builders
│
├── batch/                 # Automated & sequential processing
│   └── unified_batch.py   # Consolidated pipeline for data prefetch, AI healing, error recovery, and backtesting
│
├── database/              # Storage layer
│   ├── peewee_models.py   # ORM schema (Markets, Snapshots, Predictions)
│   └── polymarket.db      # Live interactive execution DB (must be downloaded separately)
│
└── scripts/               # Utility & operational scripts
    ├── evaluate_mimo.py   # Compute CWR, APY, and Sharpe ratios from historical predictions
    ├── plot_portfolio.py  # Generate dynamic returns / capital allocation visualizations
    └── db_cli.py          # Interactive terminal database inspector

🚀 Quickstart

1. Installation

Clone the repository and install the required dependencies inside a virtual environment (Python 3.10+ recommended):

conda create -n polybench python=3.10
conda activate polybench
pip install -r requirements.txt

2. Configuration (`.env`)

Create a `.env` file in the repository root containing the keys needed to fetch news, load order books, and query the LLM provider:

PRIMARY_PROVIDER='OPENROUTER'
OPENROUTER_API_KEY='your_api_key_here'
OPENROUTER_MODEL='google/gemini-3-flash-preview'
OPENROUTER_PROVIDER_ORDER='google-vertex'
# OPENROUTER_MODEL='xiaomi/mimo-v2-flash'
# OPENROUTER_PROVIDER_ORDER='xiaomi/fp8'

VERBOSE_MODE=false
DEBUG_MODE=false

3. Execution

Launch the interactive terminal dashboard:

python main.py

📊 Evaluation Metrics

PolyBench enforces a strict Bayesian baseline locked to specific timestamps and discards any prediction whose confidence $c$ falls below 0.6 or that fails to identify a positive expected value (EV+) opportunity.
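
The filtering rule in the preceding paragraph can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: it assumes the confidence `c` is the model's probability that its chosen side wins, that the trade buys that side at its current ask price, and that a winning share pays 1, so the expected value per share is `c - ask_price`. The function names and the `MIN_CONFIDENCE` constant are hypothetical.

```python
MIN_CONFIDENCE = 0.6  # confidence threshold stated in the text


def expected_value(confidence: float, ask_price: float) -> float:
    """Expected profit per share, assuming a winning share pays 1 and
    `confidence` is the model's win probability (an assumption)."""
    return confidence - ask_price


def keep_prediction(
    confidence: float,
    ask_price: float,
    min_confidence: float = MIN_CONFIDENCE,
) -> bool:
    """Keep a prediction only if it is confident enough AND positive-EV."""
    if confidence < min_confidence:
        return False  # below the confidence limit
    return expected_value(confidence, ask_price) > 0  # EV+ check
```

Under these assumptions, a prediction with confidence 0.7 is kept when the ask is 0.55 but discarded when the ask is 0.80, because the market already prices the outcome above the model's own estimate.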

Confidence-Weighted Return (CWR): Measures capital allocation efficiency. Simulated position sizes scale linearly with the model's stated confidence and are priced against the active order book.
Annualized Percentage Yield (APY): CWR normalized by holding time, which accounts for the opportunity cost of locked capital.
Sharpe Ratio ($\mu_r / \sigma_r$): Risk-adjusted portfolio performance, which penalizes erratic drawdowns.
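
The exact formulas are defined in the paper and in `scripts/evaluate_mimo.py`, not in this README. The sketch below shows one plausible reading of the three metrics, under assumptions that are entirely ours: stakes are proportional to confidence, a winning share pays 1 and a losing share pays 0, annualization compounds the period return over 365 days, and the Sharpe ratio uses the sample standard deviation with a zero risk-free rate. All names (`Trade`, `confidence_weighted_return`, `annualized_yield`, `sharpe_ratio`) are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class Trade:
    """One resolved simulated trade (hypothetical schema)."""
    confidence: float   # stated model conviction, in [0, 1]
    entry_price: float  # price paid per share, in (0, 1)
    won: bool           # whether the chosen side resolved in our favor
    days_held: float    # days between entry and resolution

    @property
    def simple_return(self) -> float:
        # A winning share pays 1, a losing share pays 0 (assumption).
        payoff = 1.0 if self.won else 0.0
        return (payoff - self.entry_price) / self.entry_price


def confidence_weighted_return(trades: list[Trade]) -> float:
    """Average per-trade return, weighting each trade by its confidence
    (i.e. stake sizes scale linearly with conviction)."""
    total_weight = sum(t.confidence for t in trades)
    return sum(t.confidence * t.simple_return for t in trades) / total_weight


def annualized_yield(period_return: float, days_held: float) -> float:
    """Compound a period return to a one-year horizon (assumed convention)."""
    return (1.0 + period_return) ** (365.0 / days_held) - 1.0


def sharpe_ratio(returns: list[float]) -> float:
    """mu_r / sigma_r with sample standard deviation and zero risk-free rate."""
    return mean(returns) / stdev(returns)
```

Note that `sharpe_ratio` needs at least two returns with nonzero variance, and `confidence_weighted_return` needs at least one trade with nonzero confidence.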

For detailed execution instructions, see WALKTHROUGH.md.
