diff --git a/.cursor/rules/api-key-controls.mdc b/.cursor/rules/api-key-controls.mdc
index 7ab5b02..dd3d8ec 100644
--- a/.cursor/rules/api-key-controls.mdc
+++ b/.cursor/rules/api-key-controls.mdc
@@ -1,5 +1,5 @@
 ---
-alwaysApply: false
+alwaysApply: true
 ---
 
 # API Key Controls & Workflow Management System
@@ -11,12 +11,16 @@ Enables OpenRouter integration with automatic LM Studio fallback (default mode)
 **Key Features:**
 - **Per-Role OpenRouter Selection**: Each role independently uses LM Studio or OpenRouter (default mode); all roles use OpenRouter in generic mode
 - **Global OpenRouter API Key**: Single key for all per-role OpenRouter selections within one running backend instance. Boost can reuse it when no explicit boost-only override key is provided.
-- **OpenRouter Auto-Fill**: OpenRouter selectors auto-fill context window from model `context_length` and auto-fill max output tokens as `min(20% of model context_length, smallest available host max_completion_tokens)`
+- **OpenRouter Auto-Fill**: OpenRouter selectors fetch provider endpoint metadata and compute host-aware context/output settings from a capable endpoint set. Auto mode must ignore known weak hosts (currently Venice) and low/missing-cap outliers before computing context/max-output; manual host selection uses that exact host.
+- **OpenRouter Reasoning Effort**: Every OpenRouter role exposes a visible reasoning-effort selector. Default `auto` sends maximum OpenRouter reasoning effort (`xhigh`) through the normalized `reasoning.effort` request object; users may lower it or set `none`.
 - **LM Studio Fallback** (default mode only): Optional fallback per role on credit exhaustion
 - **Free Model Cooldown Handling**: SERIAL BOTTLENECK pause, free model looping, and auto-selector backup (see below)
-- **Boost Mode**: Selective task acceleration via two modes, using either an explicit boost override key or the active global OpenRouter key:
+- **Boost Mode**: Selective task acceleration via next-count, category, always-prefer, and per-task routing controls, using either an explicit boost override key or the active global OpenRouter key:
   - **Boost Next X Calls**: Counter-based, next X API calls regardless of task ID
-  - **Category Boost**: Role-based, boosts all calls for specific role categories (Aggregator and Compiler only; Autonomous agents inherit from their parent roles automatically)
+  - **Category Boost**: Role-based, boosts all calls for specific role categories across Aggregator, Compiler, Autonomous/proof, and LeanOJ roles
+  - **Always Prefer Boost**: Attempts boost for every API call, falling back to the primary route on boost failure
+  - **Per-Task Toggle**: Legacy task-ID boost controls for individual workflow tasks
+- **Supercharge**: Per-role setting that wraps one role answer as 4 parallel diversified full answer attempts plus a 5th same-model deterministic synthesis answer. If Boost applies, all 5 calls use the Boost route/model/provider/settings.
 - **System works without LM Studio**: Defaults to OpenRouter when LM Studio unavailable; generic mode never attempts LM Studio
 
 ## Mode-Specific Behavior
@@ -39,7 +43,20 @@ Enables OpenRouter integration with automatic LM Studio fallback (default mode)
 **Boost is a ROUTING decision, NOT a CONCURRENCY decision.**
 - Boost affects which API endpoint is used, NOT whether submitters run in parallel or serial
 - Aggregation submitters ALWAYS run in parallel regardless of boost status (unless single-model mode)
-- Single-model mode: triggered when all submitters AND validator use the SAME configured model ID. Boost routing does NOT trigger single-model mode.
+- LeanOJ topic and brainstorm submitters ALWAYS run in parallel regardless of boost/provider routing; validation batches up to 3 topics/submissions.
+- Single-model mode: triggered when all submitters AND validator use the SAME configured model ID, except when LM Studio has multiple loaded same-base numeric `:#` instances (for example `model:1`, `model:2`) for that model; in that case submitters may run in parallel and the LM Studio client routes independent calls to idle sibling instances. Boost routing does NOT trigger single-model mode.
+
+### Supercharge
+
+**Supercharge is a per-role answer-quality wrapper, NOT a routing mode.**
+- `ModelConfig.supercharge_enabled` and related request fields control it per role.
+- Frontend Supercharge controls are developer-mode-only; start/generate request payloads must force Supercharge off unless developer mode is active.
+- `api_client_manager.generate_completion()` is the only implementation choke point: checked roles run 4 parallel full answer attempts, then a 5th synthesis call that receives the original messages plus the prior 4 outputs and returns the single role answer.
+- Calls 1-4 must be full answer attempts in the original role format/schema, not short notes.
+- Calls 1-4 intentionally violate the default deterministic temperature rule with the fixed ladder `[0.0, 0.2, 0.4, 0.8]` so concurrent candidates diversify; the synthesis call stays `temperature=0.0`.
+- The synthesis call must produce the final answer in the exact original required format/schema and must not mention Supercharge or candidate attempts.
+- Tool-call requests (`tools` or `tool_choice`) bypass Supercharge because assistant/tool turn pairing must remain exact.
+- If Boost applies to the original task, all 5 Supercharge calls force the same Boost mode and Boost config first; Boost failures are strict for Supercharge and must not silently mix in the primary route. `boost_next_count` is consumed once for the successful boosted overall Supercharge answer, not once per internal attempt.
 
 ### Backend Core
 
@@ -48,11 +65,19 @@ Enables OpenRouter integration with automatic LM Studio fallback (default mode)
 - App Attribution Headers: `HTTP-Referer: https://intrafere.com/moto-autonomous-home-ai/`, `X-Title: MOTO Autonomous ASI`
 - Credit exhaustion detection: HTTP 402 OR error messages containing "credit", "insufficient", "balance", "quota", "key limit", "limit exceeded"
 - Raises `CreditExhaustionError` on exhaustion (no retries). Retries transient errors (max 3).
-- Temperature=0.0 default. No stop sequences (removed — caused premature truncation with certain models).
+- Temperature=0.0 default except Supercharge candidate attempts and parallel brainstorm submitter lanes. No stop sequences (removed — caused premature truncation with certain models).
 - Exposes both model-level metadata (`/models`) and provider endpoint metadata (`/models/{author}/{slug}/endpoints`) so the UI can compute safe host-aware OpenRouter auto-fill values.
+- Auto-routed calls include a provider `ignore` list for known weak hosts so OpenRouter can still fall back across capable providers. Explicit user-selected providers use `provider.order=[provider]` with `allow_fallbacks=false` so requests cannot silently fall back to a host whose limits were not used for settings.
 
 #### APIClientManager (`backend/shared/api_client_manager.py`)
-- Central router for all API calls: boost check → role's OpenRouter (with resettable fallback) → LM Studio (default mode) or OpenRouter-only (generic mode)
+- Central router for all API calls: optional Supercharge wrapper → boost check → role's OpenRouter (with resettable fallback) → LM Studio (default mode) or OpenRouter-only (generic mode)
+- Temperature policy exceptions live here: Supercharge attempts use `[0.0, 0.2, 0.4, 0.8]`; parallel brainstorm submitter lanes use `[0.0, 0.1, ..., 0.9]`. Validators, compiler roles, proof/final roles, JSON retries, and single-model sequential submitters stay `0.0`.
+- LM Studio instance sharing lives below this router in `lm_studio_client.generate_completion()`: only default-mode LM Studio calls can share same-base loaded numeric `:#` siblings, response metadata must preserve both the configured model and effective instance, and state-sensitive workflow ordering must not change.
+- Raw provider/model transport output must never be replayed into MOTO retry prompts, feedback memory, accepted memory, RAG, or durable context. Conversational retries are required, but failed-output context must first pass `sanitize_model_output_for_retry_context()` so only reusable visible answer text remains. The sanitizer strips known private thought/channel/control tokens only as transport scaffolding outside visible JSON/string content, not ordinary visible Lean/math/operator syntax such as `<|` or literal visible marker text such as `<|channel>final` / `` inside content.
+- Parser exception strings that are inserted into retry prompts must not contain raw response excerpts; raw excerpts are allowed only in logs/observability surfaces that are never reused as model context.
+- Observability surfaces must default to metadata/previews with secret redaction. Provider keys, URL query keys, Wolfram query/result text, and full prompt/response bodies must not be persisted or broadcast unless an explicit trusted debug path opts in. Legacy full-payload log fields are scrubbed from persisted API logs on logger startup.
+- Tool-call assistant/tool protocol turns are the only exception where exact assistant content/structure may need preservation; ordinary JSON retry assistant turns are not tool protocol turns and must use sanitized retry context.
+- Generic mode must normalize or reject LM Studio role configs and must never fall through to `lm_studio_client.generate_completion()`, even if a direct API caller submits legacy `provider="lm_studio"` or an LM fallback value.
 - Generic mode: `get_embeddings()` early-returns to `FastEmbedProvider` before the LM Studio → OpenRouter fallback chain
 - Tracks fallback state per role: `_role_fallback_state: Dict[str, str]`
 - `reset_openrouter_fallbacks()`: Resets all roles originally configured for OpenRouter back from LM Studio fallback. Called automatically on API key set, or manually via reset endpoint.
@@ -60,12 +85,15 @@ Enables OpenRouter integration with automatic LM Studio fallback (default mode)
 
 **CRITICAL REQUIREMENT - Role Configuration:**
 - **EVERY role calling `api_client_manager.generate_completion()` MUST be configured via `api_client_manager.configure_role()`**
-- This includes: aggregator submitters/validator, compiler submitters/validator/critique, autonomous agents, Tier 3 final answer agents
+- This includes: aggregator submitters/validator, compiler submitters/validator/critique, autonomous agents, Tier 3 final answer agents, and LeanOJ roles/topic/brainstorm submitters
+- Role configs must preserve `supercharge_enabled` when copied into proof snapshots, manual proof helpers, child Aggregator/Compiler coordinators, and LeanOJ grouped roles.
 - **Proof agents (Part 3, optional)** do NOT have standalone role configs. `ProofVerificationStage` reuses the stored `ProofRuntimeConfigSnapshot` (brainstorm submitter, high-context submitter, validator) captured by `autonomous_coordinator._build_proof_runtime_config_snapshot()` and persisted via `research_metadata.set_proof_runtime_config()`. Manual `POST /api/proofs/check` requires `lean4_enabled=True` AND a seeded snapshot — start autonomous research once to seed it.
 
 **Boost Mode Priority** (`should_use_boost(task_id)`):
 1. Boost Next X: `boost_next_count > 0` → True
-2. Category Boost: `_extract_role_prefix(task_id) in boosted_categories` → True
+2. Always Prefer Boost: `always_prefer_boost=True` → True
+3. Category Boost: `_extract_role_prefix(task_id) in boosted_categories` → True
+4. Per-task toggle: exact task ID is enabled → True
 
 **Counter Decrement:** `boost_next_count` decrements ONLY on successful boost API calls. Failed/exhausted calls do NOT decrement.
 
@@ -78,16 +106,18 @@ Enables OpenRouter integration with automatic LM Studio fallback (default mode)
 - `compiler_high_param` → "Compiler High-Param"
 - `compiler_validator` → "Compiler Validator"
 - `autonomous_*` → "Autonomous"
+- `proof_*` / `autonomous_proof_*` → proof-specific categories
+- `leanoj_*` → LeanOJ topic, brainstorm, subproof, final-solver, and validator categories; LeanOJ path-decision tasks are absorbed into Final Solver boost routing
 
 #### BoostManager (`backend/shared/boost_manager.py`)
-- Singleton. Key methods: `set_boost_config`, `clear_boost`, `set_boost_next_count`, `toggle_category_boost`, `should_use_boost` (main check for coordinators), `consume_boost_count` (only after successful boost call)
-- Boost can use an **explicit override** OpenRouter API key, or it falls back to the active global OpenRouter key. A temporary `OpenRouterClient` is created per boosted task and closed immediately after.
+- Singleton. Key methods: `set_boost_config`, `clear_boost`, `set_boost_next_count`, `toggle_category_boost`, `set_always_prefer`, `toggle_task_boost`, `should_use_boost` (main check for coordinators), `consume_boost_count` (only after successful boost call)
+- Boost can use an **explicit override** OpenRouter API key in process memory only, or it falls back to the active global OpenRouter key. Boost state persistence must never write provider key material; legacy persisted boost keys are scrubbed on load. A temporary `OpenRouterClient` is created per boosted task and closed immediately after.
 - **Autonomous agent task ID inheritance**: All autonomous orchestration agents use parent role task ID prefixes — Topic Selector/Completion Reviewer/Reference Selector/Paper Title Selector/Tier 3 agents use `agg_sub1_*`; Topic Validator/Redundancy Checker use `agg_val_*`. Boosting a parent role automatically covers all autonomous agents that run on that model. **Proof agents are the exception**: they use their own prefixes (`proof_id_*`, `proof_lemma_*`, `proof_form_*`, `proof_novelty_*`, `proof_framing_gate_*`) because they run under the `autonomous_proof_*` role IDs with distinct runtime-snapshot configs; Aggregator/Validator category boosts do NOT cover proof agents.
 
 #### BoostLogger (`backend/shared/boost_logger.py`)
 - Singleton. Log file resolves under the active instance data root (default desktop path: `backend/data/boost_api_log.txt`)
 - Methods: `log_api_call`, `get_logs(limit)`, `clear_logs`, `get_stats`
-- Boost logs are merged into the main API call log view; boost endpoints remain available for boost-only debugging.
+- Boost logs are merged into the main API call log view; persisted/default route output must avoid provider keys and raw full prompt/response bodies.
 
 #### Workflow Task Generation (Internal Backend Tracking)
 Coordinators track task IDs internally for boost routing. The frontend does NOT display predicted task lists.
@@ -144,7 +174,7 @@ Predictions refresh: after initialization, each task completion, mode switches,
 - `POST /api/openrouter/reset-exhaustion` — Reset all credit exhaustion flags + role fallback states mid-session
 - `DELETE /api/openrouter/api-key` — Clear key
 - `GET /api/openrouter/api-key-status` — `{ has_key, enabled }`
-- `GET /api/openrouter/models` — Available models (also caches free models for rotation)
+- `GET /api/openrouter/models` — Available models (also caches free models for rotation); temporary keys must use `Authorization: Bearer`, never URL query parameters
 - `GET /api/openrouter/providers/{model_id}` — Providers + endpoint metadata for model
 - `GET /api/openrouter/free-model-settings` — `{ looping_enabled, auto_selector_enabled, ... }`
 - `POST /api/openrouter/free-model-settings` — Update free model settings (body: `FreeModelSettings`)
@@ -176,7 +206,7 @@ Predictions refresh: after initialization, each task completion, mode switches,
 - `looping_enabled` — rotate to next available free model on rate limit (highest context first)
 - `auto_selector_enabled` — fall back to `openrouter/free` (131072 context) when all free models exhausted
 
-**Rotation chain** (in `api_client_manager._try_free_model_rotation()` called from RateLimitError handler):
+**Rotation chain** (in `api_client_manager._try_free_model_rotation()` called from RateLimitError handler; keep optional `tools` / `tool_choice` passed through when that helper is used):
 1. If `looping_enabled`: **iterate through ALL** non-rate-limited free models (highest context first) using `tried_models` set to avoid re-trying. On each `RateLimitError`, refresh rate-limited dict and continue to next model. On `CreditExhaustionError`, stop looping.
 2. If all looping candidates exhausted and `auto_selector_enabled`: try `openrouter/free`
 3. If still failed: check LM Studio fallback (default mode only; generic mode skips this)
@@ -214,6 +244,6 @@ Predictions refresh: after initialization, each task completion, mode switches,
 
 **Hosted generic mode (no keyring):** Provider keys are env-injected at sandbox launch and/or set via proxied MOTO API routes. `secret_store` persistence is bypassed; keys live in sandbox memory only. Re-injection required after sandbox recreation. `OPENROUTER_API_KEY` env var auto-loaded during lifespan if present.
 
-**localStorage:** `workflow_panel_collapsed`, `aggregatorConfig`, `compiler_settings`, `autonomousConfig` (includes `freeModelLooping`, `freeModelAutoSelector`). When `MOTO_FRONTEND_STORAGE_PREFIX` / `VITE_MOTO_STORAGE_PREFIX` is active, these keys are automatically namespaced per instance.
+**localStorage:** `workflow_panel_collapsed`, `aggregatorConfig`, `compiler_settings`, `autonomousConfig` (includes `freeModelLooping`, `freeModelAutoSelector`, per-role Supercharge settings). When `MOTO_FRONTEND_STORAGE_PREFIX` / `VITE_MOTO_STORAGE_PREFIX` is active, these keys are automatically namespaced per instance.
 
-**Session (in-memory):** fallback state per role, boosted task IDs, boost next count, boosted categories, completed task IDs, free model manager state. Boost logs and boost state persist under the active instance data root (`boost_api_log.txt`, `boost_state.json`) and are merged into the main API call log view.
+**Session (in-memory):** fallback state per role, boosted task IDs, boost next count, boosted categories, completed task IDs, free model manager state, and any explicit Boost override key. Boost override keys must never be persisted to `boost_state.json`; legacy plaintext keys are ignored/scrubbed on load. Boost logs and non-secret boost routing state persist under the active instance data root (`boost_api_log.txt`, `boost_state.json`) and are merged into the main API call log view. API call logs store previews/metadata by default; full prompt/response payload persistence is debug opt-in only, and provider/model error logs must report shape/status metadata instead of raw response bodies.
diff --git a/.cursor/rules/hosted-web-contract.mdc b/.cursor/rules/hosted-web-contract.mdc
new file mode 100644
index 0000000..6715d90
--- /dev/null
+++ b/.cursor/rules/hosted-web-contract.mdc
@@ -0,0 +1,193 @@
+---
+description: Hybrid deployment contract — generic_mode, hosted sandbox, proxy auth, web-team boundary, updater policy
+alwaysApply: true
+---
+
+# Hybrid Deployment Contract (intrafere.ai / Hosted Web Product)
+
+MOTO is ONE codebase serving TWO deployment targets. A single `generic_mode` boolean switches between them. All features, fixes, and improvements ship to both targets simultaneously.
+
+## Two Deployment Targets
+
+- **Default mode (`generic_mode=False`)**: GitHub open-source release. Desktop app with `.bat`/`.ps1` launcher. LM Studio + OpenRouter. User runs locally.
+- **Generic mode (`generic_mode=True`)**: Hosted web backend. API-only sandbox on Blaxel, fronted by the Intrafere website/control plane on AWS. FastEmbed embeddings, OpenRouter-only LLM inference, no LM Studio dependency.
+
+## Two-Team Boundary (Strict)
+
+| Team | Repo | Owns |
+|------|------|------|
+| **Upgrade Team** | MOTO repo | `generic_mode` conditional branches, FastEmbed provider, hosted runtime contract, proxy auth plumbing, `/api/features`, `/api/health`, build metadata, desktop launchers + updater contract |
+| **Web Team** | Separate repo | Website frontend, Clerk auth, Stripe billing, AWS control plane, Blaxel sandbox lifecycle, same-domain proxy, dashboard update UX, CI/CD image pipeline |
+
+The Web Team consumes MOTO as a pre-built image. They never commit into the MOTO repo. Their website wraps private MOTO sandboxes — handles auth, billing, proxying, and instance lifecycle while MOTO handles all research orchestration.
+
+## Hosted Container Artifacts
+
+- The MOTO repo provides the canonical hosted container files: `Dockerfile`, `.dockerignore`, and `docker/entrypoint.sh`.
+- Those files define the API-only `python:3.12-slim` generic-mode runtime contract. Image publish, rollout, and redeploy automation still belong to the separate Web Team repo.
+
+## The `generic_mode` Flag
+
+`SystemConfig` field in `backend/shared/config.py`:
+```python
+generic_mode: bool = False
+```
+
+Toggled via `MOTO_GENERIC_MODE=true` env var (read explicitly in `main.py` lifespan, not via Pydantic auto-mapping, to avoid adding an env_prefix to SystemConfig).
+
+When `False`: program behaves as the existing open-source desktop release. When `True`: activates conditional code paths. No existing default-mode behavior is modified.
+
+## Decision Points
+
+1. **`api_client_manager.get_embeddings()`** — generic mode early-returns to in-process `FastEmbedProvider` before the LM Studio → OpenRouter fallback chain
+2. **`rag_manager.py`** — generic mode skips global RAG lock for embedding calls (FastEmbed is in-process/thread-safe); ChromaDB write locking remains in both modes; synchronous ChromaDB calls and CPU-heavy RAG scoring must run off the FastAPI event loop
+3. **`main.py` lifespan** — generic mode skips LM Studio connection test; auto-loads `OPENROUTER_API_KEY` from env if present
+4. **`openrouter.py` LM Studio availability** — generic mode returns `{available: false, generic_mode: true}` without pinging LM Studio
+5. **`download.py` PDF** — generic mode returns `501` (Playwright/Chromium not installed in hosted image)
+6. **Frontend** — calls `GET /api/features` on mount; when `generic_mode=True`, hides all LM Studio UI, defaults everything to OpenRouter
+7. **`middleware.py` + `websocket.py`** — generic mode validates internal proxy auth (`X-Moto-*` signed headers) on all non-allowlisted routes
+8. **Long-running workflow isolation** — research/proof/RAG/Lean jobs may run in background tasks, but must not block the FastAPI event loop that serves GUI/status/health/API-key routes
+
+## Instance-Scoped Runtime Contract (Both Modes)
+
+One process pair = one MOTO instance (local or sandbox). Env inputs:
+- `MOTO_INSTANCE_ID`, `MOTO_BACKEND_HOST`/`HOST`, `MOTO_BACKEND_PORT`/`PORT`
+- `MOTO_DATA_ROOT`, optional `MOTO_LOG_ROOT`, optional `MOTO_SECRET_NAMESPACE`
+- optional `MOTO_FRONTEND_STORAGE_PREFIX`, optional `MOTO_CORS_ORIGINS`, optional `MOTO_LM_STUDIO_BASE_URL`
+- Default desktop launches bind backend and bundled Vite frontend to loopback and require `MOTO_DESKTOP_API_TOKEN` / `VITE_MOTO_DESKTOP_API_TOKEN` on protected HTTP routes. Desktop WebSockets use one-time tickets minted by authenticated `POST /api/ws-ticket`; hosted generic mode continues to use proxy HMAC auth instead.
+
+Hosted sandboxes reuse this exact contract (`MOTO_DATA_ROOT=/app/backend/data`). No separate hosted-only env model.
+
+## Proxy Auth Contract (Generic Mode Only)
+
+Browser reaches sandboxes only through the authenticated control-plane proxy, never via direct sandbox URLs.
+
+- **Proxy path**: `https://app.intrafere.com/instances/{instance_id}/moto/api/...` and `wss://.../moto/ws`
+- Proxy strips `/instances/{instance_id}/moto` prefix before forwarding
+- Control plane injects signed headers: `X-Moto-Instance-Id`, `X-Moto-Proxy-Timestamp`, `X-Moto-Proxy-Signature`, and `X-Moto-Body-SHA256` for body-capable protected requests
+- Signature payload binds `{instance_id}`, `{timestamp}`, uppercase method, stripped path, raw query string, and the `X-Moto-Body-SHA256` value (empty hash for WebSockets/bodyless requests)
+- Sandbox validates instance ID match, timestamp skew ≤60s, HMAC digest, query string, that the signed body hash matches the actual received request body, and rejects replayed signatures inside the accepted skew window
+- Protected hosted HTTP requests with `Content-Length` above `MOTO_GENERIC_MAX_REQUEST_BYTES` / `GENERIC_MAX_REQUEST_BYTES` (default 16 MiB) are rejected before route handling; the control-plane proxy should enforce the same or stricter body-size cap before forwarding
+- If `generic_mode=True` and `MOTO_INSTANCE_ID` or `MOTO_INTERNAL_PROXY_SECRET` is missing: fail closed at startup
+- Allowlisted without proxy auth: `GET /health`, `GET /api/health`, `GET /api/features`, `OPTIONS` preflight
+- `Authorization` header is NOT reused for sandbox auth (existing MOTO routes use it for OpenRouter key passthrough)
+
+Implementation: centralized in `middleware.py` (HTTP) and `websocket.py` (before `accept()`). No per-route auth changes.
+
+## `/api/features` Endpoint (Both Modes)
+
+Build 0 lands the public identity subset first. Returns:
+```python
+{
+    "version": str,
+    "build_commit": str,  # authoritative update key
+    "update_channel": "main",
+    "api_contract_version": "build5-v12",
+    "generic_mode": bool,
+    "lm_studio_enabled": bool,
+    "pdf_download_available": bool,
+}
+```
+
+The current Build 5 runtime preserves the four identity fields while exposing the stable capability flags above. Build 5 v12 replaces compiler critique rewrite WebSocket events with `self_review_appended` and changes post-body critique output to a validated appended self-review section. Later hosted work may extend `/api/features` with additional capability flags such as `max_submitters` and `tier3_available`, but the existing fields above remain stable and `api_contract_version` must bump when that happens.
+
+Must remain capability-only. Must NOT expose per-user or per-instance state (e.g. whether an OpenRouter key is set).
+
+## `/api/health` Endpoint
+
+Richer readiness alias of `/health`. Available in both modes. Hosted sandboxes use it for liveness/readiness probes.
+
+## FastAPI Responsiveness Contract
+
+GUI loads, hosted control-plane probes, and desktop status polling share the same FastAPI app as long-running MOTO workflows. Any code reachable from background research/proof tasks must preserve event-loop responsiveness:
+
+- Do not run synchronous ChromaDB operations, large in-memory RAG scoring, Lean temp-file writes/deletes, workspace repair deletes, subprocess waits, or `time.sleep()` on the event loop.
+- Use async subprocess APIs for external tools and `asyncio.to_thread()` for unavoidable synchronous filesystem, ChromaDB, or CPU-heavy scoring work.
+- Status/health/capability/key-status endpoints must be fast-lane routes: return cached/in-memory state only and must not trigger Lean, ChromaDB scans, OpenRouter model-list fetches, or large session-directory walks.
+- Do not paper over event-loop starvation with multiple Uvicorn workers unless coordinator state, WebSockets, and runtime memory have first been externalized; current singleton coordinators assume one backend process per instance.
+
+## Embedding Strategy (Generic Mode)
+
+FastEmbed by Qdrant — in-process ONNX Runtime, `nomic-embed-text-v1.5` INT8, ~200 MB RAM, no PyTorch.
+
+- Dependency in `requirements-generic.txt` (additive, not in main `requirements.txt`)
+- `fastembed_provider.py` (~30 lines) wraps the library; lazy-imported so default installs are unaffected
+- If `generic_mode=True` and `fastembed` is missing: fail fast with clear error
+- Batch query variant optimization: `_vector_search()` batches all query embeddings into one `get_embeddings()` call (benefits both modes)
+
+## Dependency Handling
+
+`requirements-generic.txt`:
+```
+-r requirements.txt
+fastembed>=0.3.6
+onnxruntime>=1.18.0,<2.0
+```
+Hosted image installs both files but does NOT run `playwright install chromium`.
+
+## Frontend Serving (Generic Mode)
+
+Sandbox is API-only. The MOTO React frontend is NOT served from the hosted sandbox. The Web Team builds their own frontend (website + dashboard + embedded MOTO UI). In default mode, the bundled frontend is served by Vite / static build as today.
+
+## PDF Download
+
+- Default mode: `POST /api/download/pdf` works via Playwright, but submitted HTML is untrusted; server-side rendering must sanitize/allowlist content, enforce PDF-specific size caps, disable JavaScript, keep Chromium sandboxing enabled, and block external browser network requests
+- Generic mode: returns `501` ("PDF generation unavailable in web mode. Use raw text download.")
+- Web Team may implement client-side PDF in their frontend independently
+
+## Secret Handling (Generic Mode)
+
+- Desktop default: `secret_store.py` uses OS keyring, restored on startup
+- Hosted generic mode: provider keys are env-injected at sandbox launch and/or set via proxied MOTO API routes. `secret_store` persistence is bypassed; keys live in sandbox memory only. Re-injection required after sandbox recreation.
+- Generic-mode OpenRouter and Wolfram routes update runtime memory only; they do not write to or clear the desktop keyring.
+- Control plane NEVER stores, logs, or persists user provider keys in its own database
+
+## Data Persistence
+
+- `backend/data/` is the default desktop working set
+- Hosted: `MOTO_DATA_ROOT=/app/backend/data` so Blaxel storage mounts to one unambiguous path
+- ChromaDB SQLite files stay on Blaxel sandbox storage (local file semantics required)
+- Sandbox recall/resume returns the same filesystem state; redeploy/recreate advances to the newest image
+- Uploads: server-side enforcement of `.txt` only, 5 MB max, filename sanitization, path traversal rejection
+
+## Updater Policy
+
+- **Authoritative update source**: GitHub `main` branch (not GitHub Releases)
+- **Desktop**: launcher compares local build metadata against GitHub `main`. Auto-apply is only for clean `origin/main` git checkouts or ZIP/extracted installs with no launcher-managed instances still running. ZIP updates preserve active data/log roots, instance storage, launcher state, env files, and keyring-related namespaces.
+- **Hosted**: sandboxes do NOT self-mutate. Redeploy/recreate uses the latest approved `main`-derived image. Recall/resume keeps the existing image. Hosted `POST /api/update/pull` must return unavailable instead of attempting in-place update.
+- **Build metadata**: `version`, `build_commit`, `update_channel`, and `api_contract_version` exposed via `/api/features`; the committed `main`-branch manifest lives at `moto-update-manifest.json`
+
+## Canonical Runtime Baselines
+
+- **Desktop release**: Windows (release-blocking); Ubuntu 24.04 LTS (validation target, separate launcher effort)
+- **Hosted sandbox**: Debian/glibc via `python:3.12-slim`
+- **Unsupported**: Alpine/musl (Python + ONNX + Chroma stack needs glibc)
+
+## What Stays the Same (Both Modes)
+
+All RAG pipeline logic, coordinator logic, prompt engineering, WebSocket routing, paper/brainstorm/outline memory management, ChromaDB usage, and REST route surface remain shared. Generic mode adds proxy auth and hides LM Studio options; proof execution routes may exist but must report disabled/unavailable capability unless the required runtime flags and toolchains are present.
+
+## Integration Contract Rule
+
+Any REST shape, auth contract, or WebSocket event change that affects the website must update **code, this rule, the live `/openapi.json` schema, and `api_contract_version` in `/api/features`** in the same approved `main` merge. The live backend's `GET /openapi.json` is the machine-readable REST schema contract.
+
+## Proof Integration Contract (Builds 1-5, optional, gated off by default)
+
+All Lean 4 and SMT behavior is gated on three runtime flags (`lean4_enabled`, `lean4_lsp_enabled`, `smt_enabled`). All three default false and stay silent when disabled. Hosted sandboxes ship with them disabled.
+
+- **Hosted image stays Lean-free and Z3-free.** No Lean toolchain, no `z3` binary, and no Python wheel for either is permitted in `Dockerfile`, `docker/entrypoint.sh`, or `requirements-generic.txt`. Proof features are desktop-opt-in only for the current contract.
+- **Lean 4 remains authoritative** for every stored proof. The `Lean4Result` contract is unchanged by SMT; SMT (when enabled) produces tactic hints consumed by the formalization agent, never a standalone proof artifact.
+- **Subprocess fallback must keep working** when `lean4_lsp_enabled=False`. LSP is a latency optimization, not a replacement.
+- **Proof routes under `/api/proofs/*`** are additive to the hosted REST contract: `GET /api/proofs` (list), `GET /api/proofs/novel`, `GET /api/proofs/status`, `POST /api/proofs/settings`, `POST /api/proofs/check` (manual check), `GET /api/proofs/{id}`, `GET /api/proofs/{id}/certificate[.lean]`, `GET /api/proofs/{id}/dependencies`, `GET /api/proofs/graph`, `GET /api/proofs/mathlib/{lemma_name}/dependents` (Build 5).
+- **LeanOJ routes** are additive to the hosted REST contract in `build5-v6`: start/resume, stop, status, clear, skip-brainstorm, force-brainstorm, master-proof draft/edit summaries, current-run proofs, and cross-session proof library endpoints live under `/api/leanoj/*`. +- **Pruned Stage 2 paper routes** are additive in `build5-v6`: pruned papers are removed from model context/RAG but remain downloadable under `/api/auto-research/paper-history/pruned*`; hard deletion is limited to explicit delete-all-pruned endpoints. +- **LeanOJ live-activity WebSocket events** include model-call failure/retry progress, initial topic generation/validation, recursive brainstorm progress, brainstorm submitter/queue/batch-validation events, sufficiency/phase-limit events, master-proof edit validation/applied/rejected events, final semantic-review rejection, and final-attempt-cycle exhaustion. +- **Compiler critique WebSocket events** include validated critique progress and `self_review_appended`; partial/total rewrite events are no longer emitted by the active critique flow. +- **Proof WebSocket events** are part of the web-surface contract: `proof_framing_decided`, `proof_check_started`, `proof_check_complete`, `proof_check_no_candidates`, `proof_check_candidates_found`, `mathlib_lemmas_suggested`, `proof_attempt_started`, `proof_verified`, `proof_attempt_failed`, `proof_attempts_exhausted`, `proof_retry_started`, `proof_retry_scheduled`, `novel_proof_discovered`, `known_proof_verified`, `proof_dependency_added`, `smt_check_started`, `smt_check_complete`. `proof_verified` is emitted only after proof registration/reuse and includes `proof_id`. +- **Proof certificate exports stay text-based** (`.lean` source + JSON metadata). No binary-only proof artifacts. 
+- **Proof runtime config snapshot** (`ProofRuntimeConfigSnapshot`) is persisted via `research_metadata` so manual `POST /api/proofs/check` can run without an active autonomous session; required state is `lean4_enabled=True` AND a seeded snapshot. +- **`api_contract_version` bumps** apply the same way to proof additions as to the base contract: any new proof route or event added after Build 5 must bump the contract version in the same merge. + +## Hosting Ownership + +Intrafere operates the service providing back-end with Blaxel. diff --git a/.cursor/rules/json-prompt-design.mdc b/.cursor/rules/json-prompt-design.mdc index 5715730..3748820 100644 --- a/.cursor/rules/json-prompt-design.mdc +++ b/.cursor/rules/json-prompt-design.mdc @@ -164,16 +164,30 @@ CORRECT RESPONSE: - Improve validator rigor (currently lacks evaluation depth) - Maintain existing prompt assembly order: System → JSON Schema → User Prompt → Context → RAG → Final Instruction - **MATH VARIANT**: Citation requirements REMOVED. Focus on mathematical rigor, logical correctness, and established mathematical principles. Models with web search capabilities are encouraged to use them for verification. Validation is purely AI-driven. +- **Proof Prompt Relevance Boundary**: Every automated proof JSON prompt must treat the USER RESEARCH PROMPT as the primary filter. Candidate identification returns every prompt-relevant, non-trivial theorem worth attempting, ordered by usefulness to the user prompt first and novelty/formalization value second. Never impose an artificial theorem-count cap unless explicitly requested. - **Compiler Outline Injection**: The compiler outline is always fully injected (never truncated, never RAGed) for all modes because it provides the structural framework for document construction and validation. -- **TEMPERATURE POLICY**: All prompts use temperature=0.0 (deterministic generation). The system's evolving context provides sufficient diversity. This applies to ALL agents. 
+- **TEMPERATURE POLICY**: Default all prompts to `temperature=0.0`. Only two exceptions are allowed: Supercharge candidate attempts and parallel brainstorm submitter lanes. Validators, compiler roles, proof/final roles, and JSON retries must stay deterministic. +- **Supercharge Schema Preservation**: Per-role Supercharge calls generate 4 full answer attempts plus a 5th synthesis answer. Candidate attempts must be sanitized to reusable visible answer text before the 5th call; private thought/channel/control transcript text must never be fed into synthesis, retries, feedback memory, accepted memory, or RAG. The synthesis prompt must place the final instruction after the candidate block, treat candidates as optional working material, and preserve the original task's exact output contract; if the original role expects JSON, the 5th answer must output only valid JSON in that same schema and must not mention Supercharge or candidate attempts. - **NATURAL COMPLETION POLICY**: Models stop naturally when JSON response is complete. No stop sequences enforced. `sanitize_json_response()` handles trailing whitespace. **CRITICAL**: Truncated JSON (unclosed braces/brackets) raises ValueError - no repair attempted. - **JSON Response Preprocessing**: All LLM responses preprocessed by `sanitize_json_response()` in `backend/shared/json_parser.py`. See implementation for complete sanitization pipeline: strips reasoning tokens/markdown/control tokens, handles LaTeX escapes (pre-escapes dangerous commands), escapes control chars in strings, rejects truncated JSON, detects pure reasoning text. Enhanced error logging with diagnostics. Array responses auto-extract `data[0]`. +- **Retry Transcript Hygiene**: Raw provider/model transport output must never enter retry prompts, feedback memory, accepted memory, RAG, synthesis prompts, or durable context. 
Keep conversational retries, but replay only `sanitize_model_output_for_retry_context()` output, which strips known leading private thought/channel/control transcript scaffolding while preserving useful visible malformed JSON/output excerpts and literal tags/operators inside visible content. Channel/control markers must be treated as transport scaffolding only when detected outside visible JSON/string content; sanitization must not treat ordinary Lean/math/operator text such as `<|`, or literal visible text such as ``, ``, `<|channel>final`, or ``, as a control token when it appears inside visible content. Exact assistant/tool protocol turns are the only exception. - **No Startup Compatibility Testing**: Models trusted to work. JSON sanitizer handles all quirks automatically. Model configs cached on first success. - **Reasoning Field Extraction**: Agent code checks BOTH `content` and `reasoning` fields for model compatibility. - **Centralized JSON Parsing**: All agents use `parse_json()` from `backend/shared/json_parser.py`. Exceptions: memory modules loading system-written files use direct `json.loads()`. +- **LeanOJ JSON Retry**: LeanOJ proof-solver roles also use centralized `parse_json()` and must retry malformed/non-object JSON before treating a role call as failed. During each configured final-attempt cycle, malformed model output is recorded as failed proof feedback and the loop continues until Lean verifies, the cycle is exhausted, or the operator stops; provider credit exhaustion/no-fallback configuration errors are non-retryable resumable pauses, not proof feedback. +- **LeanOJ Batch Validation JSON**: LeanOJ brainstorm validation may receive 2-3 submissions and must return `{"decisions": [...]}` with one ordered binary accept/reject decision per submission. 
Accepted brainstorm decisions should classify `context_role` as `active_plan`, `verified_hint`, `refuted_construction`, or `scratch`; topic validation may receive 2-3 topics and must return ordered `{"decisions": [...]}` entries keyed by `topic_number`. Initial topic validation accepts only broad locked foundation questions that cover `answer n`, lower construction, upper proof, exact LeanOJ semantics, and Lean formalization; reject narrow lemma/tactic/bound/repair topics. +- **LeanOJ Brainstorm Prune JSON**: LeanOJ prune-review prompts must ask whether any accepted brainstorm memory should be removed or updated because it is `outdated`, `redundant`, wrong, harmful, or superseded. Do not pressure the reviewer to remove content: keep the conservative `"none"` default, allow at most one operation, and preserve any idea with unique proof-solving value. Prune validation should accept deletes/edits only when the operation clearly improves the proof-solving database under those criteria. +- **LeanOJ Final Context Routing**: Final-solver direct proof context is limited to verified standalone subproofs plus accepted notes explicitly classified as `active_plan`. Ordinary accepted brainstorm notes default to `scratch`, and accepted idea artifact records must persist `context_role` metadata across resume/reload. Lean-accepted partial scaffolds with `sorry`/`admit` and failed final attempts cannot seed `master_proof.lean` unless explicitly marked high-value/master-seed eligible. The final solver may receive the most recent 5 final attempts only as compact execution feedback to avoid repeating failed edits; this feedback is not proof evidence. Failed/refuted constructions are not proof evidence: pass them only through the compact `refuted_construction_warnings` / “DO NOT USE” channel. 
+- **LeanOJ Master-Proof Editing JSON**: The final solver edits durable `master_proof.lean` with `{"action":"edit_proof","needs_more_time":true|false,"operation":"full_content|replace|insert_after|delete","old_string":"exact unique proof text","new_string":"Lean code","reasoning":"..."}`. `master_proof.lean` must contain the current chosen proof route only, not accumulated competing/refuted constructions. Final solver prompts must not expose path-transition choices, raw `need_more_brainstorming`, final-cycle failed-attempt counts, or any `stuck_needs_brainstorm` action. They may expose compact recent execution feedback such as Lean errors, stale `old_string` rejections, JSON truncation, and watchdog/no-progress notices. Required corrections from recent feedback must take priority over unrelated new additions, fresh routes, or speculative helpers; new additions are allowed only when they directly implement the required correction or helper code needed for that correction. Phase transitions are selected only by the discrete path-decision mode. Legacy `{"lean_code":...}` is compatibility only. +- **LeanOJ Master-Proof Lean Gate**: A master proof edit must never be persisted merely because the string edit applies. After structural edit application and any required shortening validation, the updated proof is checked in memory first. `needs_more_time=true` edits run Lean with placeholders allowed but still must parse/typecheck, preserve the original template/declarations, and pass forbidden-device integrity checks. `needs_more_time=false` edits run Lean with no placeholders, then final template integrity, answer adequacy, semantic review, and registration. Lean/template failure rejects the edit, preserves the prior master proof and shortening-backup metadata, and feeds the Lean diagnostics (`error_output`, diagnostic output, goal states, raw stderr when present) back to the final solver. 
+- **LeanOJ Master-Proof Shortening Validation JSON**: Material-shortening edits to `master_proof.lean` must be reviewed before the Lean gate by `leanoj_master_proof_edit_validator` using `{"decision":"accept","reasoning":"...","feedback_to_submitter":""}` or `{"decision":"reject","reasoning":"...","feedback_to_submitter":"precise correction"}`. Rejection preserves the prior proof and becomes direct final-solver feedback. Validator acceptance is not proof acceptance: shortening backup/redo state and `master_proof.lean` persistence happen only after the accepted edit also passes the Lean/template gate. The edit validator must reject changes that ignore required corrections in favor of unrelated new additions, and rejection feedback must instruct the submitter to fix the required corrections before new addition attempts. +- **LeanOJ Final Semantic Review JSON**: After Lean accepts final code and deterministic integrity checks pass, the Final Proof Solver must review the Lean-accepted code against the full LeanOJ problem prompt/template using `{"solved":true,"reasoning":"..."}` or `{"solved":false,"continuation_feedback":"...","reasoning":"..."}`. Rejection is continuation feedback, not verified success. +- **LeanOJ Formalization Semantics Guardrail**: LeanOJ planning, proof-editing, validation, and final-review prompts must state that the Lean template is the formal source of truth, template operations must not be silently reinterpreted to match informal olympiad intuition (e.g. `Nat` subtraction truncates), proposed formulas/constructions should be sanity-checked against the exact Lean predicate on small cases when feasible, and Lean acceptance alone must not be claimed as solving the informal problem unless the formal/informal correspondence is justified. +- **Shared Post-Lean Proof Integrity Gate**: Lean 4 is authoritative for proof checking, but proof outputs still pass `backend/shared/lean_proof_integrity.py` before storage/placement. 
This shared gate rejects newly introduced `axiom`/`constant`/`opaque` proof devices and uses statement-alignment validation so a Lean-accepted proof cannot be stored for an unrelated or user-prompt-irrelevant `ProofCandidate.statement`. +- **LeanOJ Proof Validation Boundary**: Lean 4 is authoritative formal checking for LeanOJ success, but LLM validators still gate planning decisions, Lean-accepted subproof relevance, and final semantic review. A compiled subproof must not be stored as verified run context unless it matches the requested subproof/role; a compiled final solution must not stop the run unless it preserves the template and the Final Proof Solver confirms it solves the actual prompt rather than a formal loophole. - **Specialized Retry for Pure Reasoning Text**: When "No JSON found" error, aggregator submitter uses specialized retry: (1) Don't think step-by-step, (2) Start with `{` immediately, (3) Raw JSON only. See `backend/aggregator/agents/submitter.py`. - **Standard LaTeX-Focused Retry**: Retry prompts explain HOW to escape LaTeX properly. **LaTeX IS allowed** - just escape backslashes once (`\mathbb` → `\\mathbb`). DO NOT double-escape. For `old_string`: copy EXACTLY from document, just escape backslashes. -- **Retry Context Overflow Prevention (CRITICAL)**: Truncate failed output to ~2000 chars before retry. Calculate if retry fits context window. Fall back to simple re-prompt if too large. Set `max_tokens` explicitly (never `None`). NEVER auto-increase beyond user limits. Applies to: `submitter.py`, `validator.py`, `high_param_submitter.py`, `compiler_validator.py`. +- **Retry Context Overflow Prevention (CRITICAL)**: Sanitize failed output, then truncate to ~2000 chars before retry. Parser exception messages that are inserted into retry prompts must report failure type/structure only and must not include raw output excerpts. Calculate if retry fits context window. Fall back to simple re-prompt if too large. 
Set `max_tokens` explicitly (never `None`). NEVER auto-increase beyond user limits. Applies to: `submitter.py`, `validator.py`, `high_context_submitter.py`, `high_param_submitter.py`, `compiler_validator.py`. ## Internal Content Warning (Required in All Prompts) @@ -227,6 +241,7 @@ WHEN IN DOUBT: Verify independently. Do not assume. Do not trust unverified inte - `backend/autonomous/prompts/paper_redundancy_prompts.py` - `backend/autonomous/prompts/paper_continuation_prompts.py` - `backend/autonomous/prompts/final_answer_prompts.py` +- `backend/autonomous/prompts/proof_prompts.py` **Note:** The prompt structure examples in the sections below show the core task-specific content. The INTERNAL CONTENT WARNING block is ALWAYS inserted between the role description and the "YOUR TASK:" section in the actual code. @@ -245,13 +260,20 @@ def get_validator_system_prompt() -> str: return """You are a validation agent in an AI cluster. Your role is to evaluate mathematical submissions and decide whether they should be added to the shared knowledge base. YOUR TASK: -Tell me if the addition of the new submission increases potential solution availability in a significant way and/or provides a valuable solution space-constraint that narrows where we need to search in a significant way. +Decide whether the submission provides the strongest rigorous progress currently justified toward solving the user's problem, with highest priority given to direct solutions, direct partial solutions, impossibility results, exact reductions, or sharp constraints. + +Essentially, you are evaluating whether the training database becomes more useful toward directly answering the user's mathematical prompt with this submission added than it was without it. -Essentially, you are evaluating whether the training database becomes more useful toward finding mathematical solutions with this submission added than it was without it. +Note: You are not generating solutions yourself. 
You are judging whether this submission directly solves, partially solves, refutes, or materially enables the user's problem better than the current knowledge base does. -Note: You are not generating solutions yourself - you are assessing if there are new solutions potentially available if we add this submission to the training database, or if the solution space becomes stronger in any way. +META-PHASE EXCEPTION: +If the USER PROMPT explicitly says TOPIC EXPLORATION PHASE or PAPER TITLE EXPLORATION PHASE, evaluate the submission as the requested candidate artifact, not as a direct solution: +- TOPIC EXPLORATION PHASE: accept a candidate brainstorm question if it is specific, distinct, relevant, grounded, and aimed at a strong direct-answer path +- PAPER TITLE EXPLORATION PHASE: accept a candidate title if it is accurate, specific, distinct, professional, and foregrounds direct answer-bearing content when justified +- Do NOT reject these meta-phase submissions merely because they are questions or titles rather than mathematical solutions EVALUATION CRITERIA - Consider: +- Does the submission directly answer, partially answer, refute, or sharply constrain the user's problem or a necessary subproblem? - Does the submission add genuinely new information or perspectives beyond what is already accepted? - Does the submission connect existing mathematical concepts in novel ways? - Does the submission provide concrete methods, theorems, proofs, or mathematical techniques? @@ -263,9 +285,9 @@ EVALUATION CRITERIA - Consider: VALIDATION DECISION RULES: A submission should be ACCEPTED if it: -1. Increases potential solution availability in a significant way, OR -2. Provides valuable solution space constraints that narrow where to search, OR -3. Offers novel mathematical insights not present in existing accepted submissions, OR +1. Directly solves, partially solves, or proves a meaningful impossibility/limitation result for the user's problem or a necessary subproblem, OR +2. 
Provides valuable solution space constraints that sharply narrow where a direct answer can lie, OR +3. Offers rigorous enabling insights not present in existing accepted submissions when a stronger direct step is not yet available, OR 4. Presents rigorous mathematical arguments based on established principles A submission should be REJECTED if it: @@ -451,9 +473,15 @@ YOUR TASK: Evaluate EACH submission INDEPENDENTLY to determine if it would make a valuable cumulative addition to the shared knowledge base. Independent Assessment: -For each submission, ask: "Does this submission increase potential solution availability or provide valuable constraints, considering only the existing database (not the other submission in this batch)?" +For each submission, ask: "Does this submission provide the strongest rigorous direct progress currently justified toward the user's problem, considering only the existing database (not the other submission in this batch)?" -Essentially, you are evaluating whether the training database becomes more useful toward finding mathematical solutions with each submission added than it was without it. +Essentially, you are evaluating whether the training database becomes more useful toward directly answering the user's mathematical prompt with each submission added than it was without it. 
+ +META-PHASE EXCEPTION: +If the USER PROMPT explicitly says TOPIC EXPLORATION PHASE or PAPER TITLE EXPLORATION PHASE, evaluate each submission as the requested candidate artifact, not as a direct solution: +- TOPIC EXPLORATION PHASE: accept a candidate brainstorm question if it is specific, distinct, relevant, grounded, and aimed at a strong direct-answer path +- PAPER TITLE EXPLORATION PHASE: accept a candidate title if it is accurate, specific, distinct, professional, and foregrounds direct answer-bearing content when justified +- Do NOT reject these meta-phase submissions merely because they are questions or titles rather than mathematical solutions EVALUATION CRITERIA (Apply to EACH submission independently): - Does the submission add genuinely new information or perspectives beyond what is already accepted? @@ -466,9 +494,9 @@ EVALUATION CRITERIA (Apply to EACH submission independently): VALIDATION DECISION RULES (for each submission): A submission should be ACCEPTED if it: -1. Increases potential solution availability in a significant way, OR -2. Provides valuable solution space constraints that narrow where to search, OR -3. Offers novel mathematical insights not present in existing accepted submissions, OR +1. Directly solves, partially solves, or proves a meaningful impossibility/limitation result for the user's problem or a necessary subproblem, OR +2. Provides valuable solution space constraints that sharply narrow where a direct answer can lie, OR +3. Offers rigorous enabling insights not present in existing accepted submissions when a stronger direct step is not yet available, OR 4. Presents rigorous mathematical arguments based on established principles A submission should be REJECTED if it: @@ -599,9 +627,15 @@ YOUR TASK: Evaluate EACH submission INDEPENDENTLY to determine if it would make a valuable cumulative addition to the shared knowledge base. 
Independent Assessment: -For each of the three submissions, ask: "Does this submission increase potential solution availability or provide valuable constraints, considering only the existing database (not the other submissions in this batch)?" +For each of the three submissions, ask: "Does this submission provide the strongest rigorous direct progress currently justified toward the user's problem, considering only the existing database (not the other submissions in this batch)?" + +Essentially, you are evaluating whether the training database becomes more useful toward directly answering the user's mathematical prompt with each submission added than it was without it. -Essentially, you are evaluating whether the training database becomes more useful toward finding mathematical solutions with each submission added than it was without it. +META-PHASE EXCEPTION: +If the USER PROMPT explicitly says TOPIC EXPLORATION PHASE or PAPER TITLE EXPLORATION PHASE, evaluate each submission as the requested candidate artifact, not as a direct solution: +- TOPIC EXPLORATION PHASE: accept a candidate brainstorm question if it is specific, distinct, relevant, grounded, and aimed at a strong direct-answer path +- PAPER TITLE EXPLORATION PHASE: accept a candidate title if it is accurate, specific, distinct, professional, and foregrounds direct answer-bearing content when justified +- Do NOT reject these meta-phase submissions merely because they are questions or titles rather than mathematical solutions EVALUATION CRITERIA (Apply to EACH submission independently): - Does the submission add genuinely new information or perspectives beyond what is already accepted? @@ -614,9 +648,9 @@ EVALUATION CRITERIA (Apply to EACH submission independently): VALIDATION DECISION RULES (for each submission): A submission should be ACCEPTED if it: -1. Increases potential solution availability in a significant way, OR -2. Provides valuable solution space constraints that narrow where to search, OR -3. 
Offers novel mathematical insights not present in existing accepted submissions, OR +1. Directly solves, partially solves, or proves a meaningful impossibility/limitation result for the user's problem or a necessary subproblem, OR +2. Provides valuable solution space constraints that sharply narrow where a direct answer can lie, OR +3. Offers rigorous enabling insights not present in existing accepted submissions when a stronger direct step is not yet available, OR 4. Presents rigorous mathematical arguments based on established principles A submission should be REJECTED if it: @@ -1652,79 +1686,37 @@ Output your response ONLY as JSON in this exact format: --- -## 8. CRITIQUE & REWRITE PHASE (POST-BODY CONSTRUCTION) +## 8. CRITIQUE & SELF-REVIEW PHASE (POST-BODY CONSTRUCTION) **File:** `backend/compiler/prompts/critique_prompts.py` ### Overview -After the body section is complete (before conclusion), the system enters a **Critique Phase** that reuses the aggregator infrastructure to collect peer review feedback. This phase ensures the body section is mathematically sound and properly aligned before proceeding. +After the body section is complete (before conclusion), the system enters a **Critique Phase** that collects validator-approved self-review notes. Accepted critiques are appended to the paper transparently instead of rewriting paper content. ### Workflow -1. **Critique Aggregation** (5 total attempts required): +1. 
**Critique Aggregation** (3 total attempts required): - Single critique submitter generates peer review feedback on body section - - **Decline Mechanism**: Submitter can assess "no critique needed" when body is academically acceptable (counts toward 5 total attempts) + - **Decline Mechanism**: Submitter can assess "no critique needed" when body is academically acceptable (counts toward 3 total attempts) - Validator validates critiques/declines (accept/reject with feedback loop) - Pruning occurs every 7 acceptances (same as aggregator cleanup review) - - Target: 5 total attempts (accepted + rejected + declined attempts) + - Target: 3 total attempts (accepted + rejected + declined attempts) - Uses aggregator workflow with critique-specific prompts -2. **Rewrite Decision**: - - If at least 1 critique accepted: Critique submitter reviews all accepted critiques + accumulated history from previous failed versions - - If 0 critiques accepted: Skip rewrite, move to next section - - Decides: "rewrite" (major issues found) or "continue" (minor/incorrect critiques) - - Validator validates the decision (accept/reject with retry loop) - - Decision includes optional new title and new outline - -3. 
**Rewrite Execution** (if approved): - - Mark rewrite as pending (counter increments only after first successful acceptance) - - Three execution paths based on decision: - - **CONTINUE**: Proceed to conclusion (critiques minor/incorrect) - - **PARTIAL_REVISION**: **ITERATIVE** edits - proposes ONE edit at a time, validates, applies, then proposes next edit - - **TOTAL_REWRITE**: Clear body section completely and rebuild from scratch - - Update title if changed (increment version number) - - Update outline if changed - - **CONTEXT FOR BOTH PARTIAL_REVISION AND TOTAL_REWRITE**: - - Pre-critique paper (paper snapshot from START of critique phase - shows what failed) - - Current accepted critique feedback (ONLY accepted, not rejected critiques) - - ALL critiques from ALL previous failed versions (accumulated feedback history) - - Original aggregator database - - Reference papers (if applicable) - -4. **Version Loop**: - - If rewrite_count >= 1 (completed rewrites): Skip critique phase entirely, proceed to conclusion - - Rewrite counts as "completed" only after first successful body acceptance - - Single completed rewrite cycle is sufficient for convergence +2. 
**Self-Review Append**: + - If at least 1 critique is accepted: append accepted critiques as `AI Self-Review and Limitations` + - If 0 critiques are accepted: move to conclusion without adding the section + - The section is placed after compiler/appended proof material when present, otherwise after conclusion + - Critiques never trigger partial rewrites, total rewrites, body clearing, title changes, or outline updates ### Rationale -**Why partial revision is ITERATIVE (one edit at a time):** -- Allows the model to see the result of each edit before proposing the next -- Each edit is validated individually for correctness -- Prevents cascading failures from a batch of edits -- Model can see pre-critique paper AND current paper to understand what started vs where we are -- More precise control over the revision process - -**Why partial revision is preferred:** -- Most critiques identify specific, localized issues that can be fixed with targeted edits -- Preserves coherence in sections that are already correct -- Faster than full body rewrite -- Reduces risk of introducing new errors in previously sound sections -- More efficient use of model context and computation - -**Why total rewrite is last resort:** -- Total rewrites are difficult and can introduce errors in areas that were previously correct -- Even with feedback, rewriting from scratch can lose coherence -- Should only be used when issues are too pervasive for targeted edits -- Catastrophic flaws (fundamental math errors throughout, complete misalignment) justify total rewrite -- Now receives full context: pre-critique paper + accepted critiques, so rewrite is informed - -**Why maximum 1 rewrite:** -- Prevents infinite rewrite loops on difficult topics -- Forces convergence to best-effort result after single revision attempt -- Accumulated feedback ensures the revision benefits from all critique history -- Single rewrite cycle is sufficient with partial revision option available +**Why append rather than 
rewrite:** +- Preserves the validated paper content and proof placement. +- Keeps model-discovered limitations visible to readers. +- Avoids rewrite loops and accidental loss of correct content. +- Makes the AI self-review honest provenance rather than hidden revision pressure. ### Decline Mechanism (Academically Acceptable Body) @@ -1740,9 +1732,9 @@ After the body section is complete (before conclusion), the system enters a **Cr - Mathematical rigor meets academic standards **Behavior When Target Met**: -- If NO accepted critiques: Skip rewrite, transition directly to conclusion -- If accepted critiques exist: Run rewrite decision -- Rationale: With only 5 attempts, no early termination mechanism is needed +- If NO accepted critiques: transition directly to conclusion +- If accepted critiques exist: append `AI Self-Review and Limitations`, then transition to conclusion +- Rationale: With only 3 attempts, no early termination mechanism is needed ### Complete Prompt Structure - Critique Generation @@ -1819,84 +1811,6 @@ Output as JSON: """ ``` -### Complete Prompt Structure - Rewrite Decision - -**Function:** `get_rewrite_decision_system_prompt()` - -```python -def get_rewrite_decision_system_prompt() -> str: - return """You are reviewing aggregated peer review critiques to decide if body needs revision. - -[... INTERNAL CONTENT WARNING ...] - -YOUR TASK: -The peer review phase collected critiques through multiple attempts. ALL accepted critiques from the CURRENT version are provided below (typically 1-3 accepted out of 5 total attempts). - -**ACCUMULATED CRITIQUE HISTORY**: If this is not the first critique phase (rewrite_count > 0), you will also see critiques from ALL previous failed versions, labeled as "FAILED - REWRITTEN". Use this accumulated feedback to understand what went wrong in past attempts and avoid repeating mistakes. - -Review all critiques and decide: - -DECISION OPTIONS: -1. CONTINUE - Minor/incorrect critiques -2. 
PARTIAL_REVISION - Fixable issues, you will propose edits ONE AT A TIME in iterative loop -3. TOTAL_REWRITE - Catastrophic flaws, rebuild from scratch (last resort) - -CONTINUE if: -- Minor issues -- Incorrect critiques -- Small gaps addressable in review - -PARTIAL_REVISION if: -- Specific sections have fixable errors -- Missing content can be inserted at specific locations -- Most of body is sound, only targeted fixes needed -- NOTE: You will then propose edits ONE AT A TIME (not all at once) - -TOTAL_REWRITE if (ONLY AS LAST RESORT): -- Fundamental mathematical errors pervasive throughout -- Body fundamentally misaligned with paper title -- Structural problems require complete reorganization -- Issues too widespread for targeted edits -- NOTE: Rewrite will have full context (pre-critique paper + accepted critiques) - -FOR ANY REVISION: -- Can change title (if scope drift) -- Can update outline (if structure needs changes) -- For PARTIAL_REVISION: Edit operations are proposed iteratively (not in this decision) - -Output as JSON: -{ - "decision": "continue | partial_revision | total_rewrite", - "new_title": "New title or null", - "new_outline": "Updated outline or null", - "reasoning": "Detailed explanation" -} -""" -``` - -### Iterative Edit Prompt Structure (for PARTIAL_REVISION) - -**Function:** `get_iterative_edit_system_prompt()` - -When PARTIAL_REVISION is chosen, the system enters an iterative edit loop. Each iteration: -1. Shows pre-critique paper (original state before this revision cycle) -2. Shows current paper (after any edits applied so far) -3. Shows accepted critique feedback -4. Shows edits already applied -5. 
Requests ONE edit proposal - -```json -{ - "operation": "replace | insert_after | delete", - "old_string": "Exact text to find in CURRENT paper", - "new_string": "Replacement text", - "reasoning": "Which critique issue this addresses", - "more_edits_needed": true | false -} -``` - -The loop continues until `more_edits_needed=false` or max iterations (20) reached. - ### Assembly in `build_critique_prompt()` ```python @@ -1938,35 +1852,6 @@ The loop continues until `more_edits_needed=false` or max iterations (20) reache } ``` -**Rewrite Decision:** -```json -{ - "decision": "continue | partial_revision | total_rewrite", - "new_title": "New title or null", - "new_outline": "Updated outline or null", - "reasoning": "Detailed explanation" -} -``` - -**Iterative Edit (for partial_revision loop):** -```json -{ - "operation": "replace | insert_after | delete", - "old_string": "Exact text to find", - "new_string": "Replacement text", - "reasoning": "Which critique this addresses", - "more_edits_needed": true | false -} -``` - -**Rewrite Decision Validation:** -```json -{ - "decision": "accept or reject", - "reasoning": "Why decision is or isn't justified" -} -``` - --- ## 9. COMPILER RIGOR PROMPTS (LEAN 4 THEOREM FLOW) @@ -1979,30 +1864,35 @@ The loop continues until `more_edits_needed=false` or max iterations (20) reache ### Four-Stage Architecture -The rigor loop no longer edits paper text. Each rigor cycle runs four stages, with the coordinator owning the validator loop and the appendix fallback: +The rigor loop no longer edits paper text directly during discovery/formalization. 
Each rigor cycle runs four stages, with the coordinator owning inline validator attempts and appendix routing: **Stage 1: Theorem discovery (unvalidated)** — `build_rigor_theorem_discovery_prompt` - High-param submitter reads the full writing context (outline direct-injected, paper direct-injected when it fits, RAG for the rest per the offload priority excluding `compiler_outline.txt` + `compiler_paper.txt`). - Sees `EXISTING VERIFIED PROOFS` block (from `proof_database.get_all_proofs()`) so it does not re-propose already-verified theorems. - Sees `OPEN LEMMA TARGETS` block (from `proof_database.get_recent_failure_hints()`) as optional retry candidates. -- Decides whether a theorem is worth attempting. Decline ends the rigor cycle. +- Decides whether a user-prompt-relevant theorem is worth attempting. Decline ends the rigor cycle. +- Discovery is explicitly allowed to construct extension theorems from partial paper work, the current outline, supporting context, or the user prompt when helpful to paper construction and/or the user's goal. It is not limited to exact claims already present in the current paper. +- Discovery must classify `theorem_origin` as `existing_paper_claim`, `extension_from_partial_work`, or `extension_from_user_prompt`, and must set `placement_preference` to `inline` or `appendix_only`. Extension-derived theorems must use `appendix_only`. **Stage 2: Lean 4 formalization** — reuses `ProofFormalizationAgent.prove_candidate(max_attempts=5)` from autonomous mode - Up to 5 Lean 4 attempts with error-feedback chaining (failing tactic + goal states + raw Lean diagnostics fed back into each retry). - Broadcasts `proof_attempt_started` / `proof_verified` / `proof_attempt_failed` / `proof_check_complete` events with `source_type="compiler_rigor"` so the existing autonomous-mode proof UI lights up for free. - All-5-fail: candidate is recorded via `proof_database.record_failed_candidate` (becomes a future open lemma target) and the cycle ends as a decline. 
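The Stage 2 retry behaviour described above (up to 5 attempts, with each failure's diagnostics fed into the next retry) can be sketched as a small loop. This is a minimal illustration and not the actual `ProofFormalizationAgent.prove_candidate` implementation: the `attempt` callable, its `(ok, diagnostics)` return shape, and the name `prove_with_feedback` are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical attempt signature: receives all prior diagnostics,
# returns (verified, diagnostics_for_this_attempt).
Attempt = Callable[[List[str]], Tuple[bool, str]]


def prove_with_feedback(attempt: Attempt, max_attempts: int = 5) -> Tuple[bool, List[str]]:
    """Run up to max_attempts tries, chaining each failure's diagnostics forward."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        # Pass a copy so the attempt cannot mutate the accumulated history.
        ok, diagnostics = attempt(list(feedback))
        if ok:
            return True, feedback
        feedback.append(diagnostics)
    # All attempts failed: the caller would record the candidate as a
    # failed candidate (a future open lemma target) and end the cycle.
    return False, feedback
```

Passing the whole diagnostic history rather than only the last error is a design assumption here; the rule text only says that errors are chained into each retry.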
-**Stage 3: Novelty classification + persistence** — shared `assess_proof_novelty` helper from `backend/autonomous/core/proof_novelty.py` +**Stage 3: Post-Lean integrity + novelty classification + persistence** — shared `validate_full_lean_proof_integrity` helper from `backend/shared/lean_proof_integrity.py`, then shared `assess_proof_novelty` helper from `backend/autonomous/core/proof_novelty.py` +- Rejects Lean-accepted proofs that introduce new fake proof devices (`axiom`, `constant`, `opaque`) not present in the source context. +- Rejects Lean-accepted proofs that do not align with the intended theorem statement. - Classifies the verified proof as novel or known. - `proof_database.add_proof(record)` stores it with `source_type="paper"`, `source_id=f"compiler_rigor:{session}"`. - Novel proofs automatically enter the highest-priority direct-injection block on the next submitter instantiation (via `proof_database.inject_into_prompt`). - Non-novel proofs stay in the database, visible through `/api/proofs/*` for future reference-selection UI flows. -**Stage 4: Placement (2 attempts + appendix fallback)** — `build_rigor_placement_prompt` -- Submitter proposes an inline edit that introduces the theorem with an explicit "verified in Lean 4, see Appendix A, " marker. +- **Stage 4: Placement routing (inline attempts OR appendix-only)** — `build_rigor_placement_prompt` +- If `placement_preference="appendix_only"`, inline placement is skipped and the verified theorem is appended directly to the Theorems Appendix with `placement_outcome="appendix_requested"`. +- If `placement_preference="inline"`, submitter proposes an inline edit that introduces the theorem with an explicit "verified in Lean 4, see Appendix A, " marker. - Validator uses the new `rigor_lean_placement` mode: judges placement and narrative only; `rigor_check` is **forced to True** regardless of LLM output (Lean 4 is the source of mathematical truth). 
- Up to 2 placement attempts; attempt 2 receives the validator's rejection feedback via `validator_rejection_feedback` field. -- On double rejection (or when attempt 1 is not produced), the theorem is appended to the **Theorems Appendix** via `paper_memory.append_to_theorems_appendix(...)`. Counts as a `rigor_acceptance` because the math is preserved. +- On double rejection (or when attempt 1 is not produced), the theorem is appended to the **Theorems Appendix** via `paper_memory.append_to_theorems_appendix(...)` with `placement_outcome="appendix_fallback"`. Counts as a `rigor_acceptance` because the math is preserved. ### Stage 1 JSON Schema (discovery) @@ -2011,7 +1901,9 @@ The rigor loop no longer edits paper text. Each rigor cycle runs four stages, wi "needs_theorem_work": true, "theorem_statement": "precise statement with explicit hypotheses", "formal_sketch": "concrete Mathlib tactics / lemmas that look promising", - "source_excerpt": "2-6 sentences from the paper that motivate this theorem", + "source_excerpt": "2-6 sentences of motivating paper/outline/context/user-prompt basis", + "theorem_origin": "existing_paper_claim | extension_from_partial_work | extension_from_user_prompt", + "placement_preference": "inline | appendix_only", "retry_existing_failure_id": "theorem_id from OPEN LEMMA TARGETS if retrying, empty otherwise", "reasoning": "why this theorem is the best target right now OR why no theorem" } @@ -2024,11 +1916,15 @@ Decline form: "theorem_statement": "", "formal_sketch": "", "source_excerpt": "", + "theorem_origin": "", + "placement_preference": "", "retry_existing_failure_id": "", "reasoning": "why declining" } ``` +Placement preference rule: `extension_from_partial_work` and `extension_from_user_prompt` MUST resolve to `appendix_only` even if the model emits `inline`. Existing paper claims may use `inline` when the theorem strengthens local prose, or `appendix_only` when it is useful but would distract from the body. 
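The placement preference rule above amounts to a small normalization step applied to the discovery output. The sketch below shows one way to express it. The function name `resolve_placement_preference` is hypothetical, and the fallback to `appendix_only` for an empty or unrecognized preference is an assumption, not something the rule states.

```python
EXTENSION_ORIGINS = {"extension_from_partial_work", "extension_from_user_prompt"}
VALID_PLACEMENTS = {"inline", "appendix_only"}


def resolve_placement_preference(theorem_origin: str, placement_preference: str) -> str:
    """Normalize the model's placement_preference against its theorem_origin."""
    # Extension-derived theorems are forced to the appendix, even if the
    # model emitted "inline".
    if theorem_origin in EXTENSION_ORIGINS:
        return "appendix_only"
    # Existing paper claims keep whichever valid preference the model chose.
    if placement_preference in VALID_PLACEMENTS:
        return placement_preference
    # Assumption: an empty or unknown value falls back to the conservative option.
    return "appendix_only"
```

For example, `resolve_placement_preference("extension_from_user_prompt", "inline")` yields `"appendix_only"`, while an `existing_paper_claim` keeps `"inline"`.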
+ ### Stage 4 JSON Schema (placement) ```json @@ -2060,7 +1956,7 @@ Each entry written by `format_theorem_appendix_entry(...)` (helper in `backend/c ``` Theorem (proof_XXX) [Novel | Known] - -Status: verified by Lean 4 () +Status: verified by Lean 4 () Statement: Lean 4 proof: @@ -2306,7 +2202,7 @@ Part 3 introduces autonomous topic selection, brainstorm-to-paper workflows, and **File:** `backend/autonomous/prompts/topic_exploration_prompts.py` -**Purpose:** Before topic selection, collect 5 validated candidate brainstorm questions using the full Part 1 aggregator infrastructure (parallel submitters, batch validation up to 3). Uses `build_exploration_user_prompt()` to frame the standard aggregator as a candidate question generator. +**Purpose:** Before topic selection, collect 5 validated candidate brainstorm questions using the full Part 1 aggregator infrastructure (parallel submitters, batch validation up to 3). Uses `build_exploration_user_prompt()` to frame the standard aggregator as a candidate question generator, with a preference for candidate questions that maximize the chance of a rigorous direct answer rather than merely broad exploration. **Architecture:** Reuses `AggregatorCoordinator` — no custom JSON schemas. Standard aggregator submitter/validator prompts handle generation and validation. The exploration user prompt provides the framing context (research goal, existing brainstorms/papers, diversity requirement). @@ -2757,7 +2653,7 @@ All proof prompts pass `temperature=0.0`. **Function:** `build_proof_framing_gate_prompt(user_prompt)` -**Purpose:** One-shot decision at autonomous start — decides whether the research program should activate the full proof pipeline. Errs on the side of `true` whenever there is meaningful mathematical substance. +**Purpose:** One-shot decision at autonomous start — decides whether the research program should activate the full proof pipeline. 
Errs on the side of `true` whenever formal proof can materially help the user's prompt. ```json { @@ -2776,7 +2672,7 @@ All proof prompts pass `temperature=0.0`. **Function:** `build_proof_identification_prompt(user_prompt, source_type, source_id, source_content)` -**Purpose:** Novelty-seeking gate that extracts the most promising non-trivial theorem candidates from a brainstorm or paper. Rejects trivial identities and textbook restatements. Returns at most 5 candidates ranked by novelty potential. +**Purpose:** User-prompt relevance gate that extracts every prompt-relevant, non-trivial theorem candidate from a brainstorm or paper. Rejects off-prompt curiosities, trivial identities, and textbook restatements. Orders candidates by direct usefulness to the user prompt first, then novelty/formalization value. No artificial theorem-count cap. ```json { @@ -2786,23 +2682,23 @@ All proof prompts pass `temperature=0.0`. "theorem_id": "thm_1", "statement": "natural-language theorem statement", "formal_sketch": "optional note about assumptions, notation, or likely Lean formalization strategy", - "novelty_rationale": "why this theorem is non-trivial and worth formalizing" + "novelty_rationale": "why this theorem helps the user prompt and is worth formalizing" } ] } ``` **Field requirements:** -- `has_provable_theorems`: Boolean. `true` when at least one non-trivial novel-potential theorem is present. -- `theorems`: Array of candidates, ranked by novelty potential. **Maximum 5 entries.** Empty array when `has_provable_theorems` is `false`. +- `has_provable_theorems`: Boolean. `true` when at least one prompt-relevant, non-trivial theorem is present. +- `theorems`: Array of every prompt-relevant candidate, ordered by direct usefulness to the user prompt first and novelty/formalization value second. Empty array when `has_provable_theorems` is `false`. - `theorem_id`: Stable string identifier such as `"thm_1"`, `"thm_2"`, etc. - `statement`: Natural-language theorem statement. 
Required. - `formal_sketch`: Optional Lean formalization hints, assumptions, or notation notes. -- `novelty_rationale`: Brief explanation of why this theorem is non-trivial and worth the cost of Lean verification. Required for each candidate. +- `novelty_rationale`: Brief explanation of why this theorem helps the USER RESEARCH PROMPT and is worth the cost of Lean verification. Required for each candidate. -**What to extract:** Novel theorems, bold conjectures that can be sharpened, non-obvious connections/bounds/structural results, ambitious claims (the formalization agent narrows if needed). +**What to extract:** Theorems, supporting lemmas, sharpened conjectures, non-obvious bounds, and structural results that materially help answer, support, or advance the USER RESEARCH PROMPT. -**What to reject:** Trivial identities (e.g. `n + 0 = n`), standard Mathlib restatements, results closable by a single tactic (`simp`, `omega`, `norm_num`, `decide`, `rfl`), tautologies, definitional equalities. +**What to reject:** Off-prompt mathematical curiosities, trivial identities (e.g. `n + 0 = n`), standard Mathlib restatements, results closable by a single tactic (`simp`, `omega`, `norm_num`, `decide`, `rfl`), tautologies, definitional equalities. --- @@ -2910,20 +2806,24 @@ All proof prompts pass `temperature=0.0`. **Function:** `build_proof_novelty_prompt(user_prompt, theorem_statement, lean_code, existing_novel_proofs)` -**Purpose:** Post-verification novelty gate — classifies a Lean-4-verified theorem as novel or known. Does NOT re-check validity. Errs on the side of recognizing novelty for results that required multi-step reasoning or non-trivial formalization work. +**Purpose:** Post-verification novelty gate — classifies a Lean-4-verified theorem into a novelty tier. Does NOT re-check validity. Errs on the side of recognizing novelty for results that required multi-step reasoning, non-trivial formalization work, or original proof strategy. 
```json { - "is_novel": true, + "novelty_tier": "mathematical_discovery", "reasoning": "brief explanation" } ``` **Field requirements:** -- `is_novel`: Boolean. `true` → proof enters the highest-priority direct-injection block for all subsequent brainstorm/paper submitters via `proof_database.get_novel_proofs_for_injection()`. `false` → stored in the database but not injected. +- `novelty_tier`: One of `not_novel`, `novel_formulation`, `novel_variant`, `mathematical_discovery`, or `major_mathematical_discovery`. Any tier except `not_novel` enters the highest-priority direct-injection block for all subsequent brainstorm/paper submitters via `proof_database.get_novel_proofs_for_injection()`. `not_novel` proofs are stored in the database but not injected. - `reasoning`: Always required. -**Novel criteria (any one sufficient):** Result not in Mathlib or standard textbooks; new connection/bound/structural insight; formalizes a previously unverified conjecture; non-trivial composition of known results yielding something new; original relative to the existing stored proofs. +**Novelty tiers:** +- `novel_formulation`: The mathematical result is historically known, but this Lean 4 formalization or mechanized proof is novel for the research program. +- `novel_variant`: A non-trivial reformulation, restructuring, generalization, different proof strategy, weaker hypotheses, stronger conclusion, or original composition based on known material. +- `mathematical_discovery`: A new theorem, bound, connection, structural insight, formally verified conjecture, or independently publishable/citable mathematical contribution. +- `major_mathematical_discovery`: A possible field-level breakthrough that may be competitive for a major prize or medal in a related field if confirmed and accepted by domain experts. This sits above ordinary `mathematical_discovery`. 
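The tier-to-injection mapping described in the field requirements can be sketched as a single predicate. This is illustrative only: `enters_direct_injection` is a hypothetical name, and raising on an unknown tier is an assumption about how invalid model output might be handled.

```python
NOVELTY_TIERS = (
    "not_novel",
    "novel_formulation",
    "novel_variant",
    "mathematical_discovery",
    "major_mathematical_discovery",
)


def enters_direct_injection(novelty_tier: str) -> bool:
    """Return True when a verified proof of this tier is injected into later prompts."""
    if novelty_tier not in NOVELTY_TIERS:
        # Assumption: unknown tiers are rejected rather than silently coerced.
        raise ValueError(f"unknown novelty tier: {novelty_tier!r}")
    # Every tier except not_novel enters the direct-injection block;
    # not_novel proofs are stored but not injected.
    return novelty_tier != "not_novel"
```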
**Not novel:** Direct Mathlib restatement; trivial identity or tautology; closable by a single standard tactic (`simp`, `omega`, `norm_num`, `decide`, `rfl`); duplicates an already-stored novel proof. @@ -2936,16 +2836,17 @@ These core requirements apply across all prompt types: 1. **Internal Content Warning**: All system prompts include the standardized skepticism warning block 2. **Concrete Format Examples**: Every prompt includes correct/wrong format examples with visual indicators 3. **Structured Rejection Feedback**: Validators use the standardized rejection format (Reason/Issue/What I Saw/Expected/Fix) -4. **Compiler Outline Injection**: The compiler outline is always fully injected (never RAGed) for structural framework -5. **Temperature Policy**: All prompts use temperature=0.0 where API calls allow (deterministic generation) - the context in the program from feedback, etc provide enough variance to avoid looping. -6. **JSON Preprocessing**: All LLM responses preprocessed by `sanitize_json_response()` -7. **Exact String Matching**: Document edits use exact verbatim matches with conservative consecutive fuzzy matching fallback for model escaping quirks (85% consecutive + tail anchor + uniqueness required) -8. **Phase-Based Construction**: Papers written in order: Body → Conclusion → Introduction → Abstract -9. **Required Sections**: +4. **Direct-Solution Preference**: Prompts should prefer the strongest rigorous direct progress toward the user's goal (direct solutions, direct partial solutions, impossibility results, exact reductions, or sharp constraints) and use indirect support only when no stronger direct step is currently justified. Meta-phases such as topic exploration and paper title exploration still output candidates, but those candidates are judged by direct-answer potential instead of being rejected for not being solutions themselves. +5. 
**Compiler Outline Injection**: The compiler outline is always fully injected (never RAGed) for structural framework +6. **Temperature Policy**: Default `temperature=0.0`; only Supercharge candidates and parallel brainstorm submitter lanes may use explicit diversity temperatures. Validators, compiler roles, proof/final roles, and JSON retries stay `0.0`. +7. **JSON Preprocessing**: All LLM responses preprocessed by `sanitize_json_response()` +8. **Exact String Matching**: Document edits use exact verbatim matches with conservative consecutive fuzzy matching fallback for model escaping quirks (85% consecutive + tail anchor + uniqueness required) +9. **Phase-Based Construction**: Papers written in order: Body → Conclusion → Introduction → Abstract +10. **Required Sections**: - **OUTLINE**: Must include Introduction, Body, Conclusion (Abstract is optional - can be "Abstract", "I. Abstract", or "0. Abstract") - **PAPER CONSTRUCTION**: Always writes Abstract → Introduction → Body → Conclusion (Abstract is always written during construction phase regardless of outline) -10. **No Placeholder Output**: Submissions must never contain placeholder markers -11. **Placeholder Resume Repair**: When resuming from existing paper, missing placeholders are automatically added via `paper_memory.ensure_placeholders_exist()` to prevent "old_string not found" failures -12. **Fake Placeholder Detection**: System distinguishes real section content from model-inserted fake placeholder text (FULL content >300 chars = real; <300 chars with keywords = fake) to prevent confusion during marker repair +11. **No Placeholder Output**: Submissions must never contain placeholder markers +12. **Placeholder Resume Repair**: When resuming from existing paper, missing placeholders are automatically added via `paper_memory.ensure_placeholders_exist()` to prevent "old_string not found" failures +13. 
**Fake Placeholder Detection**: System distinguishes real section content from model-inserted fake placeholder text (FULL content >300 chars = real; <300 chars with keywords = fake) to prevent confusion during marker repair --- diff --git a/.cursor/rules/main-rule-3-code-interaction-and-rule-interaction-rules.mdc b/.cursor/rules/main-rule-3-code-interaction-and-rule-interaction-rules.mdc index 312e9ca..242f663 100644 --- a/.cursor/rules/main-rule-3-code-interaction-and-rule-interaction-rules.mdc +++ b/.cursor/rules/main-rule-3-code-interaction-and-rule-interaction-rules.mdc @@ -4,7 +4,7 @@ alwaysApply: true # Code and Rule Interaction Rules -1.) Never introduce a new wait to hault the program unless specifically directed by the user. The program is designed to run until its goal completion or the operator presses stop. Infinite loops are probabalistically avoided due to the feedback mechanics. +1.) Never introduce a new hidden wait/halt, automatic stop, or loop-disabling cap unless specifically directed by the user or already defined by these rules as an explicit safety valve/user-configurable checkpoint. RALPH/MOTO is designed to run until goal completion or until the operator presses stop; infinite loops are intentional and are probabilistically steered by feedback mechanics, not disabled by agent edits. 2.) Always remove and cleanup old code, do not comment out code or leave broken/unused code in this program unless specifically directed by the user. @@ -18,6 +18,20 @@ alwaysApply: true 7.) Any REST shape, auth contract, WebSocket event, or `/api/features` capability change that affects the web wrapper must update **code, the relevant rule(s), and `api_contract_version` in `/api/features`** in the same approved merge. The live backend's `GET /openapi.json` is the machine-readable REST schema contract. -8.) Only ONE workflow mode may be active at a time (Aggregator, Compiler, or Autonomous Research). 
This constraint applies identically in both default mode and generic mode. +8.) Only ONE workflow mode may be active at a time (Aggregator, Compiler, Autonomous Research, or LeanOJ Proof Solver). This constraint applies identically in both default mode and generic mode. Start conflict checks must be serialized and include pending/background-task activity flags such as `autonomous_coordinator.is_active`, not only persisted `state.is_running` fields. -9.) Lean 4 and SMT features are always gated on `lean4_enabled`, `lean4_lsp_enabled`, and `smt_enabled` runtime flags. All three default false, must stay silent and side-effect-free when disabled, and must never ship Lean or Z3 toolchains or Python wheels into `requirements-generic.txt`, `Dockerfile`, or `docker/entrypoint.sh` (hosted image stays Lean-free and Z3-free). Lean 4 is authoritative for every stored proof; SMT contributes hints only. +9.) Lean 4 and SMT features are always gated on `lean4_enabled`, `lean4_lsp_enabled`, and `smt_enabled` runtime flags. All three default false, must stay silent and side-effect-free when disabled, and must never ship Lean or Z3 toolchains or Python wheels into `requirements-generic.txt`, `Dockerfile`, or `docker/entrypoint.sh` (hosted image stays Lean-free and Z3-free). Lean 4 is authoritative formal checking for every stored proof and is necessary for LeanOJ final solutions; SMT contributes hints only. Z3 executable paths are trusted startup/operator configuration only, must be rejected as runtime API input, and must resolve to a `z3`/`z3.exe` executable. Automated proof candidates must directly serve the user prompt, not merely be non-trivial or novel. + +10.) LeanOJ initial topic generation and brainstorm submitters always run in parallel and feed one validator that batch-validates up to 3 topics/submissions. Initial topic candidates/selection must be broad locked foundation questions covering the whole LeanOJ solution route, not narrow sublemma/tactic/local-repair topics. 
Recursive brainstorming has no separate recursive-topic prepass and must not re-inject the initial selected topic as active steering context; it uses the shared accepted proof-memory database plus the current proof/failure context. Accepted brainstorm memory must preserve occurrence-specific chronological metadata even for duplicate idea text. Never implement active LeanOJ topic or brainstorm phases as round-robin/serial submitter calls; one hung submitter must not halt the phase. + +11.) LeanOJ stop/crash/restart is resumable by default. `Clear Progress` / `/api/leanoj/clear?confirm=true` is the only intentional reset path. Start/restart should choose the best matching/resumable persisted session by proof context, not blindly create a new session or pick the latest file. + +12.) LeanOJ OpenRouter credit exhaustion or no-fallback provider configuration errors are non-retryable pauses, not proof-attempt failures. Do not let API credit/config failures inflate final proof attempt loops. + +13.) LeanOJ/RALPH final-proof loop checkpoints may only be user-configurable feedback checkpoints, not hidden loop shutdowns. The durable `master_proof.lean` is the authoritative working draft, and every accepted master-proof edit must pass an in-memory Lean gate before persistence: `needs_more_time=true` runs Lean with `sorry`/`admit` placeholders allowed but still requires parse/typecheck, template preservation, and no fake proof devices; `needs_more_time=false` runs Lean placeholder-free and then final semantic review against the user prompt/template before the run stops as verified. Final-proof mode is edit-only: it must not be offered, shown, or taught `stuck_needs_brainstorm`, raw `need_more_brainstorming`, failed-attempt counts, or any path transition. It may see the most recent 5 final attempts as compact execution feedback (Lean errors, stale edit rejections, JSON truncation, watchdog/no-progress notices) so it can avoid repeating failed edits. 
Lean/template rejection, semantic-review rejection, conservative no-progress/stale-edit watchdog feedback, and validator rejection of non-progressive shortening edits must preserve the master proof and persist structured continuation feedback; non-user-forced no-progress handoffs should gather recursive brainstorm context before re-entering final mode. + +14.) LeanOJ/RALPH final verification must remain placeholder-free, but Lean-accepted scaffolds containing `sorry`/`admit` should be saved as partial proofs for future context. Partial proofs are citeable incomplete references only; never count them as verified solutions and never accept fake `axiom`/`constant`/`opaque` proof devices. + +15.) Parent/user-selected phases have hierarchy precedence over child branches. When a parent phase starts (LeanOJ forced final loop, autonomous paper writing, Tier 3 final answer/final selection), lower-tier brainstorm/topic/path child tasks must stop or be ignored. LeanOJ `Skip Brainstorm` locks the run into the final loop until the configured final-attempt cycle is exhausted; model/path requests for more brainstorming cannot override that user action early. `Force Brainstorm` is a separate explicit user override that returns to recursive brainstorming while preserving proof progress. + +16.) LeanOJ prompt flows must guard formal/informal mismatches: treat the Lean template as the formal source of truth, do not silently reinterpret operations such as `Nat` subtraction, sanity-check exact-template formulas on small cases when feasible, and never claim Lean acceptance alone proves the informal problem unless that correspondence is justified. 
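The two gate modes in rules 13 and 14 (placeholders allowed while `needs_more_time=true`, placeholder-free otherwise, and no fake proof devices in either mode) can be illustrated with a text-level pre-check. This is a naive sketch under stated assumptions: it is not the real in-memory Lean gate, it does not parse or typecheck Lean, and its regular expressions ignore comments and strings. The names `precheck_master_proof_edit` and `_device_declarations` are hypothetical.

```python
import re
from typing import List, Set

_PLACEHOLDER_RE = re.compile(r"\b(?:sorry|admit)\b")
_DEVICE_RE = re.compile(r"^[ \t]*(?:axiom|constant|opaque)\b.*$", re.MULTILINE)


def _device_declarations(lean_source: str) -> Set[str]:
    """Collect lines that start with axiom/constant/opaque (naive text scan)."""
    return {m.group(0).strip() for m in _DEVICE_RE.finditer(lean_source)}


def precheck_master_proof_edit(candidate: str, template: str, needs_more_time: bool) -> List[str]:
    """Return rejection reasons; an empty list means the edit may proceed to Lean."""
    reasons: List[str] = []
    # Devices already present in the template are not counted as new.
    new_devices = _device_declarations(candidate) - _device_declarations(template)
    if new_devices:
        reasons.append("fake proof device: " + ", ".join(sorted(new_devices)))
    # Placeholders are tolerated only while the draft still needs more time.
    if not needs_more_time and _PLACEHOLDER_RE.search(candidate):
        reasons.append("placeholder (sorry/admit) in a final proof")
    return reasons
```

A pre-check like this could only reject obviously invalid edits early. Under the rules above, acceptance would still depend on the actual Lean run, template preservation, and the final semantic review.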
diff --git a/.cursor/rules/part-1-aggregator-tool-design-specifications.mdc b/.cursor/rules/part-1-aggregator-tool-design-specifications.mdc index f0b5171..702551f 100644 --- a/.cursor/rules/part-1-aggregator-tool-design-specifications.mdc +++ b/.cursor/rules/part-1-aggregator-tool-design-specifications.mdc @@ -46,7 +46,7 @@ Validator processes 1, 2, or 3 submissions simultaneously using batch-specific p **Submission context injection**: Direct inject if fits. If too large: RAG the submission as file, keep user prompt direct. If user prompt + RAG'd submission still too large: RAG all user-prompt files. If user prompt itself too large after all RAG: halt with error + diagnostic. -**Hosted upload enforcement (generic mode)**: Server-side validation of `.txt` only, 5 MB max, filename sanitization, path traversal rejection. Applied in both modes but critical for hosted sandboxes where the control plane proxies uploads. +**Upload/path enforcement**: Server-side validation of `.txt` only, 5 MB max, filename sanitization, path traversal rejection. Upload responses return logical filenames, not absolute host paths. Public Aggregator starts resolve `uploaded_files` only under `user_uploads`; internal autonomous reference-paper context may opt into trusted data-root file references via an explicit coordinator flag. ## Context Allocation @@ -66,16 +66,21 @@ No context carryover between prompts (only system-intended DB/submission transfe User selects model per role. Multiple roles can share a model. Models load with user-set context sizes. +Per-role Supercharge is optional. When enabled for a submitter or validator, `api_client_manager.generate_completion()` runs 4 parallel full answer attempts for that role call, then a 5th same-model synthesis call and returns only the synthesis result. Supercharge candidate attempts intentionally use temperatures `[0.0, 0.2, 0.4, 0.8]` to diversify parallel outputs; synthesis remains `0.0`. 
Candidate attempts are sanitized to reusable visible answer text before synthesis; private thought/channel/control transcript text must never be fed back as feedback, brainstorm memory, or synthesis context. The synthesis prompt frames candidates as optional working material: the model may use one, combine several, ignore all, or write a stronger new answer, while preserving the original role output contract. If Boost applies to that role/task, all internal Supercharge calls use the Boost config first. Tool-call requests bypass Supercharge. + +Parallel brainstorm submitter lanes intentionally use temperatures `[0.0, 0.1, ..., 0.9]` by submitter index so every parallel set includes a deterministic lane and increasing exploration lanes. This applies only to parallel submitter generation. Validators, compiler roles, JSON retries, and single-model sequential submitters remain `0.0`. + ## Single-Model Mode When ALL submitters AND validator use the same model → single-model mode: - Submitters run SEQUENTIALLY (S1 → S2 → ... → Sn) - Validator processes all queued submissions after each full submitter round - Prevents queue overflow from parallel tasks flooding when LLM completes +- Exception: if LM Studio reports multiple loaded same-base numeric `:#` instances for that model, submitters may still run in parallel while the LM Studio client routes independent calls to idle sibling instances. - Boost does NOT affect single-model detection (routing only, not model config) ## Multi-Submitter Configuration -Per-submitter: provider (LM Studio / OpenRouter in default mode; OpenRouter only in generic mode), model, OpenRouter host provider, LM Studio fallback (default mode only), context window, max output tokens. UI: "Number of Submitters" selector (1-10), "Copy Main to All" button. 
+Per-submitter: provider (LM Studio / OpenRouter in default mode; OpenRouter only in generic mode), model, OpenRouter host provider, LM Studio fallback (default mode only), context window, max output tokens, and Supercharge checkbox. UI: "Number of Submitters" selector (1-10), "Copy Main to All" button. OpenRouter auto-fill rule: selecting an OpenRouter model auto-fills from endpoint metadata only. Context window uses the smallest relevant host `context_length`; max output tokens use `min(20% of that host context, smallest relevant host max_completion_tokens)`. If `max_prompt_tokens` is available, shrink usable context to respect it. If endpoint caps are incomplete, preserve current values (no guessing). @@ -89,7 +94,7 @@ Accepted submissions database: never truncated. Live preview shows exact non-tru Every 7th acceptance (`total_acceptances % 7 == 0`, minimum 7 before first review): -**Phase 1**: Validator reviews ALL accepted submissions, identifies AT MOST ONE for removal (redundant, contradicted, superseded, or provides no unique value). +**Phase 1**: Validator reviews the accepted-submissions database and identifies AT MOST ONE for removal (redundant, contradicted, superseded, or provides no unique value). If the complete database fits, it is direct-injected in full. If it does not fit, cleanup must use the normal direct-first/RAG fallback path instead of skipping or truncating; the review is then evidence-bounded by retrieved context. **Phase 2** (only if removal proposed): Validator self-validates its removal proposal. Conservative default: if uncertain, reject removal. If validated: execute removal + full RAG rebuild (all shared-training sources are dropped and re-indexed from the post-removal file so deleted content is no longer retrievable). 
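The OpenRouter auto-fill rule stated above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `EndpointCaps` and `autofill_limits` are hypothetical names, the "relevant host" set is assumed to be already filtered by the caller, and shrinking usable context to `max_prompt_tokens` with a simple `min` is only one reading of the rule.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class EndpointCaps:
    """Hypothetical container for one host's endpoint metadata."""
    context_length: Optional[int] = None
    max_completion_tokens: Optional[int] = None
    max_prompt_tokens: Optional[int] = None


def autofill_limits(
    endpoints: Sequence[EndpointCaps],
    current_context: int,
    current_max_output: int,
) -> Tuple[int, int]:
    """Return (context_window, max_output_tokens) for the given hosts."""
    # Incomplete caps: preserve the current values instead of guessing.
    if not endpoints or any(
        e.context_length is None or e.max_completion_tokens is None
        for e in endpoints
    ):
        return current_context, current_max_output

    # Context window: smallest relevant host context_length.
    context = min(e.context_length for e in endpoints)

    # Max output: min(20% of that host context, smallest host completion cap).
    smallest_completion_cap = min(e.max_completion_tokens for e in endpoints)
    max_output = min(context // 5, smallest_completion_cap)

    # Assumption: "respect max_prompt_tokens" means capping usable context at it.
    prompt_caps = [
        e.max_prompt_tokens for e in endpoints if e.max_prompt_tokens is not None
    ]
    if prompt_caps:
        context = min(context, min(prompt_caps))

    return context, max_output
```

With two hosts reporting `(131072, 32768)` and `(65536, 8192)` for context length and completion cap, the sketch picks the 65536 context and the 8192 completion cap, because 20% of 65536 is larger than 8192.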
diff --git a/.cursor/rules/part-1-and-part-2-cointeraction-architecture.mdc b/.cursor/rules/part-1-and-part-2-cointeraction-architecture.mdc index b50a2d5..35a3455 100644 --- a/.cursor/rules/part-1-and-part-2-cointeraction-architecture.mdc +++ b/.cursor/rules/part-1-and-part-2-cointeraction-architecture.mdc @@ -6,12 +6,12 @@ alwaysApply: true This describes additional architecture for the synergy between the part 1 database aggregator tool and part 2 aggregator-compiler tool. Both modes operate identically in default and generic deployment — the only difference is provider availability (see `hosted-web-contract.mdc` for details on generic mode). - NOTE: This is a continuously-running program that does not stop itself, the user selects the aggregator to start, then starts the compiler when they desire, and then the user choses when to turn off each selective mode by turning the off switch. There is no "solution stop token" as in normal AI solution generation. + NOTE: This is a continuously-running program that does not stop itself; the user selects one top-level workflow mode to run and turns it off when desired. There is no "solution stop token" as in normal AI solution generation. ## Aggregator start-up workflow 1.) The aggregator runs initially with no compiler running. -2.) The compiler does not begin running until the user starts it manually. Aggregator can run on its own for a head-start for as long as the operator would like. If the operator desires the aggregator can also run by itself without any compilation. +2.) The compiler does not begin running until the user starts it manually, and current code enforces that Aggregator, Compiler, Autonomous Research, and LeanOJ Proof Solver are mutually exclusive top-level modes. Aggregator can run on its own without any compilation. 
## GUI Design @@ -40,22 +40,34 @@ The live-constructing compiler-written paper should be viewable in one tab and a **Use Case**: User may have domain knowledge that the brainstorm has explored sufficient territory before the automatic 10-acceptance interval, saving time and allowing manual control over the autonomous workflow. -## Multi-Submitter Architecture (Aggregator Only) +## Hierarchy / Parent Action Precedence -**Distinction**: Multiple submitters are only available for the Aggregator (Part 1 and Part 3 brainstorm aggregation). The Compiler (Part 2) uses a fixed single-submitter sequential Markov chain workflow. +Parent workflow actions override child agents immediately. Manual paper writing, forced Tier 3, LeanOJ forced final solving, and final selection phases must stop or fence off any lower-tier brainstorm/topic/path workers before continuing. Never allow stale child outputs to change phase after a parent action has taken ownership. + +## Multi-Submitter Architecture (Aggregator and LeanOJ) + +**Distinction**: Multiple submitters are available for the Aggregator (Part 1 and Part 3 brainstorm aggregation) and LeanOJ topic/brainstorm phases. The Compiler (Part 2) uses a fixed single-submitter sequential Markov chain workflow. ### Aggregator Multi-Submitter (Part 1 & Part 3) - Configurable 1-10 parallel submitters (default: 3) - Each submitter can have its own model, context window, and max output tokens - Enables multi-model exploration of different solution basins simultaneously +- Parallel submitter generation uses the shared temperature ladder `[0.0, 0.1, ..., 0.9]` by submitter index; single-model sequential submitters and validators stay `0.0`. +- If all submitters and the validator are configured with the same LM Studio model ID, the Aggregator normally uses single-model sequential mode. 
Exception: when LM Studio reports multiple loaded same-base numeric `:#` instances for that model, submitters may run in parallel and `lm_studio_client` routes independent calls to idle sibling instances while the validator remains ordered. - Single validator maintains coherent Markov chain evolution for database alignment - UI labels: "Submitter 1 (Main Submitter)", "Submitter 2", "Submitter 3", etc. - "Copy Main to All" button for quick configuration +### LeanOJ Topic/Brainstorm Multi-Submitter +- Configurable 1-10 parallel submitters generate initial topics and brainstorm ideas +- One validator batch-validates up to 3 completed topics or submissions at a time; initial topics must be broad locked foundation questions for the whole LeanOJ solution route, not narrow lemma/tactic/repair targets +- Parallel topic/brainstorm submitters use the shared temperature ladder `[0.0, 0.1, ..., 0.9]` by submitter index; LeanOJ validators, final solver, semantic review, and retry/repair calls stay `0.0`. +- No round-robin/serial submitter awaiting; a hung submitter must not block other submitters or validation + ### Compiler Single-Submitter (Part 2) - Fixed 2-submitter architecture (NOT configurable): - **High-Context Submitter**: Handles outline_create, outline_update, construction, review modes. During construction, may invoke the Wolfram Alpha tool up to 20 times per submission when `system_config.wolfram_alpha_enabled=true`. - - **High-Parameter Submitter**: Handles rigor mode. Rigor is the **Lean-4-verified-theorem flow**: discovery → up to 5 Lean 4 formalization attempts (with error feedback) → novelty classification → placement (2 attempts, validator uses `rigor_lean_placement` mode forcing `rigor_check=True`) → Theorems Appendix fallback. The compiler writes verified proofs directly into the shared `proof_database` (same database used by autonomous mode); novel proofs automatically enter the highest-priority direct-injection block on the next submitter instantiation. 
+ - **High-Parameter Submitter**: Handles rigor mode. Rigor is the **Lean-4-verified-theorem flow**: user-prompt-relevant discovery (including explicit extension theorems from partial paper work / outline / supporting context / user prompt when helpful) → up to 5 Lean 4 formalization attempts (with error feedback) → novelty classification → placement routing. Existing-paper-claim theorems may go through inline placement (2 attempts, validator uses `rigor_lean_placement` mode forcing `rigor_check=True`); extension-derived theorems are forced to `placement_preference="appendix_only"` and appended directly to the Theorems Appendix (`placement_outcome="appendix_requested"`). Inline failures still use Theorems Appendix fallback. The compiler writes verified proofs directly into the shared `proof_database` (same database used by autonomous mode); novel proofs automatically enter the highest-priority direct-injection block on the next submitter instantiation. - Sequential Markov chain workflow (only one submission at a time) - Each compiler submitter has its own model, context, and max token settings (separate from aggregator) - UI shows these as separate "High-Context Submitter" and "High-Parameter Submitter" sections @@ -65,6 +77,7 @@ The live-constructing compiler-written paper should be viewable in one tab and a ## Additional Traits Shared Between Aggregator-Submitters and Compiler-Submitters - The JSON of aggregator-subbmiters and compiler-submitters should include a "reasoning:" request below its "submission:" line. (This forces the submitter to explain the thoughts behind there reasoning and can also reveal deception for additional context for the validator.) +- MOTO conversational retries must preserve useful failed-output context, but never raw provider/model transcript text. 
Any assistant replay or reusable feedback/memory excerpt must be sanitized to remove known private thought/channel/control transport scaffolding first while preserving visible mathematical/Lean syntax such as `<|` and literal marker text inside JSON/string content. Parser error strings sent back to models must not include raw output excerpts. Exact tool-call assistant/tool protocol turns are the only exception. ## API Call Output Notes (User-Configurable) - **All `max_tokens` limits are user-configurable via GUI settings** (like context window sizes). Users can adjust these per model role based on their specific models' capabilities. diff --git a/.cursor/rules/part-2-compiler-tool-design-specification.mdc b/.cursor/rules/part-2-compiler-tool-design-specification.mdc index b397819..9406396 100644 --- a/.cursor/rules/part-2-compiler-tool-design-specification.mdc +++ b/.cursor/rules/part-2-compiler-tool-design-specification.mdc @@ -31,9 +31,11 @@ Before every `_pre_validate_exact_string_match()`, system calls `paper_memory.en **Provider Selection**: Each compiler role (validator, high-context, high-param, critique submitter) can independently use LM Studio or OpenRouter with optional host provider and LM Studio fallback (default mode). In generic mode, all roles use OpenRouter only; LM Studio options are hidden in the frontend. -**Export Behavior**: Raw text export available in both modes. PDF export (`POST /api/download/pdf`) is desktop-only — generic mode returns `501` (Playwright/Chromium not installed in hosted image). +**Supercharge**: Each compiler role has an optional Supercharge checkbox. Checked roles run 4 full answer attempts plus a 5th same-model synthesis answer through `api_client_manager.generate_completion()`. If Boost applies, every internal Supercharge call uses the Boost route/model/provider settings first. Tool-call requests bypass Supercharge; this is especially important for the Wolfram-enabled construction loop. 
-**Aggregator RAG refresh**: Every 10 accepted aggregator submissions (not immediate like aggregator). +**Export Behavior**: Raw text export available in both modes. PDF export (`POST /api/download/pdf`) is desktop-only — generic mode returns `501` (Playwright/Chromium not installed in hosted image). Server-side PDF rendering must treat submitted HTML as untrusted: sanitize/allowlist content and block external network requests from Playwright. + +**Aggregator RAG refresh**: Manual Part 2 refreshes every 10 accepted aggregator submissions (not immediate like aggregator). Autonomous/Tier 3 compiler runs do not start the manual aggregator monitor because the parent autonomous tier owns the active brainstorm/reference context. **Enhanced Rejection Feedback Format** (`compiler_rejection_log.py`): - Header: "🚫 REJECTED BECAUSE: [Failure Reason]" @@ -97,20 +99,19 @@ Body content is ALWAYS inserted BEFORE CONCLUSION_PLACEHOLDER. `_apply_edit()` a - 4× HC construction → validator - 1× HC outline update → validator *(skipped if body complete)* - 2× HC review → validator -- 1× HP rigor → validator *(skipped if body complete)* +- Then, if body is still active, run the HP Lean-4 theorem-search rigor loop until the first decline. Each successful rigor cycle lands one verified theorem inline or in the Theorems Appendix, then the rigor loop may continue; this is no longer exactly one HP pass. **Rigor Mode (Lean 4 verified theorems, 4-stage)**: The rigor loop no longer rewrites prose. Each rigor cycle: -- Stage 1 (HP, unvalidated): theorem discovery - using the full writing context, decide if a theorem worth formalizing exists that is not already verified; return `needs_theorem_work=false` to decline and end the rigor loop. +- Stage 1 (HP, unvalidated): theorem discovery - using the full writing context, decide if a user-prompt-relevant theorem worth formalizing exists that is not already verified; return `needs_theorem_work=false` to decline and end the rigor loop. 
Discovery is explicitly allowed to construct extension theorems from partial paper work, the outline, supporting context, or the user prompt when helpful to paper construction and/or the user's goal, not only exact claims already written in the current paper. +- Stage 1 output includes `theorem_origin` (`existing_paper_claim`, `extension_from_partial_work`, `extension_from_user_prompt`) and `placement_preference` (`inline`, `appendix_only`). Extension-derived theorems MUST be forced to `appendix_only`; existing-paper-claim theorems may be inline or appendix-only. - Stage 2: `ProofFormalizationAgent.prove_candidate(max_attempts=5)` - up to 5 Lean 4 attempts with error-feedback chaining. On 5 failures: record the candidate via `proof_database.record_failed_candidate` so future cycles see it as an open lemma target; end the rigor cycle as a decline. - Stage 3: novelty classification via the shared `assess_proof_novelty` helper; `proof_database.add_proof` persists the verified proof. Novel proofs automatically enter the highest-priority direct-injection block (`proof_database.inject_into_prompt`) on the next submitter instantiation. Non-novel proofs remain available through `/api/proofs` for future user-driven reference selection. -- Stage 4: placement - HP model proposes an inline edit that introduces the theorem with an explicit "verified in Lean 4" marker and an appendix cross-reference. Validator uses the new `rigor_lean_placement` mode which forces `rigor_check=True` (Lean 4 is the source of mathematical truth) and judges placement/narrative only. Up to 2 placement attempts (attempt 2 gets validator rejection feedback). -- Appendix fallback: if both placement attempts fail, the verified theorem is appended to the **Theorems Appendix** block (`THEOREMS_APPENDIX_START` / `THEOREMS_APPENDIX_END` bracket markers in `paper_memory.py`). Still counts as a `rigor_acceptance` because the math is preserved. 
+- Stage 4: placement - if `placement_preference="inline"`, HP model proposes an inline edit that introduces the theorem with an explicit "verified in Lean 4" marker and an appendix cross-reference. Validator uses `rigor_lean_placement` mode which forces `rigor_check=True` (Lean 4 is the source of mathematical truth) and judges placement/narrative only. Up to 2 placement attempts (attempt 2 gets validator rejection feedback). +- Appendix routing: if `placement_preference="appendix_only"`, skip inline placement and append directly to the **Theorems Appendix** with `placement_outcome="appendix_requested"`. If inline placement is attempted but both placement attempts fail, append with `placement_outcome="appendix_fallback"`. Both outcomes count as `rigor_acceptance` because the math is preserved. - Loop 2 ends on first **decline** (no theorem found OR 5 Lean attempts failed OR Lean 4 disabled). Every verified theorem lands somewhere so there is no "rejection" outcome at the loop level. - Config gate: `system_config.lean4_enabled=false` → every rigor cycle declines immediately. -**Decline Mechanisms:** -- `outline_update`: `needs_update: boolean` **Decline Mechanisms:** - `outline_update`: `needs_update: boolean` - `construction`: `needs_construction: boolean` @@ -133,24 +134,22 @@ Detection via `_is_body_complete()` in `compiler_coordinator.py`. ## Critique Phase (Post-Body, Pre-Conclusion) -**"5 total attempts"** = accepted + rejected + declined (not just accepted). - -**Max 1 completed rewrite**. Rewrite "completed" only after first successful body acceptance post-rewrite. After 1 completed rewrite, critique phase is skipped entirely. +**"3 total attempts"** = accepted + rejected + declined (not just accepted). **Workflow:** -1. If `rewrite_count >= 1` completed rewrites → skip critique, proceed to conclusion -2. Critique aggregation: target 5 total attempts -3. Pre-critique snapshot of paper body -4. 
If 5 attempts with ≥1 accepted → rewrite decision; if 0 accepted → skip rewrite -5. Decisions: **CONTINUE** (minor/incorrect critiques) | **PARTIAL_REVISION** (iterative one-edit-at-a-time loop until `more_edits_needed=false`) | **TOTAL_REWRITE** (catastrophic flaws only) +1. Critique aggregation: target 3 total attempts +2. Submitter may critique or decline; validator still validates every critique/decline +3. If accepted critiques exist, append them to the paper as `AI Self-Review and Limitations` +4. If 0 critiques are accepted, proceed without adding the section +5. Transition to conclusion; critique never rewrites paper content -Context for rewrites: pre-critique paper + accepted critiques only (rejected excluded) + accumulated history from prior failed versions. +The self-review section is inserted after the compiler Theorems Appendix when present, otherwise after the paper conclusion and before the paper anchor. Later autonomous proof appends must stay before this self-review section. -**Decline**: Submitter can assess "no critique needed" if body is academically acceptable (no errors, complete, meets rigor). If 0 accepted critiques at end of 5 attempts → skip rewrite. +**Decline**: Submitter can assess "no critique needed" if body is academically acceptable (no errors, complete, meets rigor). If no critiques are accepted after 3 attempts, no self-review section is appended. **Skip Critique (User Override)**: `POST /api/compiler/skip-critique` — available only during active critique phase (`in_critique_phase=True`). Immediately ends critique, transitions to conclusion, broadcasts `critique_phase_skipped` with `reason: "user_override"`. Irreversible. 
-**WebSocket Events:** `critique_phase_started`, `critique_progress`, `critique_accepted`, `critique_rejected`, `critique_decline_accepted`, `critique_decline_rejected`, `critique_removed`, `critique_phase_ended`, `critique_phase_skipped`, `rewrite_decision_rejected`, `body_rewrite_started`, `phase_transition`, `phase_completion_signal` +**WebSocket Events:** `critique_phase_started`, `critique_progress`, `critique_accepted`, `critique_rejected`, `critique_decline_accepted`, `critique_decline_rejected`, `critique_removed`, `self_review_appended`, `critique_phase_ended`, `critique_phase_skipped`, `phase_transition`, `phase_completion_signal` --- @@ -207,11 +206,11 @@ Prevents models' fake placeholder text (e.g., "XI. Conclusion\n*placeholder*") f Per-role context windows (all user-configurable, default 131072): - Validator, High-Context Submitter, High-Parameter Submitter: 131072 tokens each -- **Settings flow**: All compiler modules read from `system_config.compiler_*` at runtime. The caller that creates `CompilerCoordinator` MUST write settings to `system_config` before init (manual mode: `/api/compiler/start`; autonomous mode: `autonomous_coordinator.py` before `CompilerCoordinator()` creation). +- **Settings flow**: All compiler modules read from `system_config.compiler_*` at runtime. The caller that creates `CompilerCoordinator` MUST write settings to `system_config` before init (manual mode: `/api/compiler/start`; autonomous mode: `autonomous_coordinator.py` before `CompilerCoordinator()` creation). Per-role Supercharge flags must be passed through `ModelConfig`, not `system_config`. - **OpenRouter auto-fill**: Selecting an OpenRouter model auto-fills from endpoint metadata only. Context window uses the smallest relevant host `context_length`; max output tokens use `min(20% of that host context, smallest relevant host max_completion_tokens)`. If `max_prompt_tokens` is available, shrink usable context to respect it. 
If endpoint caps are incomplete, preserve current values (no guessing). - Rigor mode dynamically adjusts RAG budget if outline + system prompts exceed available context - Construction mode dynamically adjusts RAG budget when brainstorm content is present: `rag_budget = max(5000, max_allowed - outline_tokens - paper_tokens - brainstorm_tokens - 5000_overhead)`. Brainstorm always direct-injected at full fidelity; RAG evidence scales to fit remaining budget. -- **Wolfram Alpha as a construction tool**: During `HighContextSubmitter.submit_construction` (body / conclusion / introduction / abstract), when `system_config.wolfram_alpha_enabled=true`, the writer may invoke the `wolfram_alpha_query` OpenAI-compatible tool up to **20 times per submission** to verify factual / computational claims before writing them. On budget exhaustion, the loop forces finalization with tools disabled. Tool audit trail lives in `CompilerSubmission.metadata["wolfram_calls"]`. The validator is not re-invoking Wolfram; it just sees the audit trail. Wolfram tool is NOT available in `outline_create`, `outline_update`, `review`, or the rigor loop. +- **Wolfram Alpha as a construction tool**: During `HighContextSubmitter.submit_construction` (body / conclusion / introduction / abstract), when `system_config.wolfram_alpha_enabled=true`, the writer may invoke the `wolfram_alpha_query` OpenAI-compatible tool up to **20 times per submission** to verify factual / computational claims before writing them. On budget exhaustion, the loop forces finalization with tools disabled. Tool replies remain model-visible, but logs/WebSocket events expose only redacted metadata and lengths; paper credits store counts only. Wolfram tool is NOT available in `outline_create`, `outline_update`, `review`, or the rigor loop. **Context rules:** User prompt ALWAYS direct injected. Direct injection first; RAG only when doesn't fit. ~85% RAG retrieval, ~15% direct injections. 
Halt with error if user prompt exceeds context_window - minimum_RAG_allocation. diff --git a/.cursor/rules/part-3-autonomous-research-mode.mdc b/.cursor/rules/part-3-autonomous-research-mode.mdc index 091a96e..0b4599b 100644 --- a/.cursor/rules/part-3-autonomous-research-mode.mdc +++ b/.cursor/rules/part-3-autonomous-research-mode.mdc @@ -42,6 +42,8 @@ The autonomous coordinator USES actual Part 1 aggregator infrastructure for brai - Configures with topic-specific database path (`auto_brainstorms/brainstorm_{topic_id}.txt` under the active instance data root; default desktop path: `backend/data/auto_brainstorms/brainstorm_{topic_id}.txt`) - Runs configurable 1-10 submitters + 1 validator workflow (default 3 submitters) - Each submitter can have its own model, context window, and max output tokens for multi-model exploration +- Each role can independently enable Supercharge; child Aggregator coordinators must preserve `supercharge_enabled` from the autonomous role configs. +- Parallel brainstorm/topic/title exploration submitters inherit the Part 1 temperature ladder; autonomous validators and compiler/final-answer roles stay `0.0`. - SINGLE validator maintains coherent Markov chain evolution (same constraint as Part 1) - Monitors acceptance count for completion triggers (every 10 acceptances) - Handles pruning (every 7 acceptances) automatically via aggregator @@ -55,6 +57,7 @@ The autonomous coordinator USES actual Part 1 aggregator infrastructure for brai - Stops aggregator when completion review decides to write paper - **Phase enforcement**: Construction submitter must check current phase before declaring completion - **Premature decline rejection**: Coordinator rejects declines if required sections are missing based on current phase +- **Parent precedence**: Forced paper writing and forced Tier 3 must stop active child aggregators before the parent tier continues; local exploration/title aggregators must be tracked so they can be stopped. 
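The inherited Part 1 temperature ladder mentioned in the bullets above can be sketched as a small helper. This is an illustrative sketch only: `ladder_temperature` is a hypothetical name, and it assumes the submitter index is 1-based (so "Submitter 1 (Main Submitter)" is the deterministic `0.0` lane). The source does not state whether the real index is 0-based or 1-based.

```python
def ladder_temperature(submitter_index: int, parallel: bool = True) -> float:
    """Map a submitter index to its lane temperature.

    Assumption: submitter_index is 1-based and limited to 1..10, matching the
    configurable 1-10 submitter range.
    """
    # Validators, compiler roles, JSON retries, and single-model sequential
    # submitters stay deterministic.
    if not parallel:
        return 0.0
    if not 1 <= submitter_index <= 10:
        raise ValueError("submitter_index must be between 1 and 10")
    # Ladder [0.0, 0.1, ..., 0.9]: one deterministic lane plus increasingly
    # exploratory lanes.
    return (submitter_index - 1) / 10
```

For the default of 3 parallel submitters this yields temperatures `0.0`, `0.1`, and `0.2`. A sequential single-model run would pass `parallel=False` and get `0.0` for every submitter.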
### Part 2 Compiler Integration (Tier 2) The autonomous coordinator USES actual Part 2 compiler infrastructure for paper compilation: @@ -68,11 +71,13 @@ Compiler submitters may selectively use, synthesize beyond, or depart from brain **Critical Implementation Details**: - **system_config propagation (REQUIRED)**: Before creating `CompilerCoordinator`, autonomous mode MUST write all compiler context/token settings to `system_config` (e.g., `system_config.compiler_high_context_context_window = self._high_context_context`). Compiler modules read from `system_config` at init — the manual `/api/compiler/start` route does this, but autonomous mode bypasses that route and must do it explicitly. Applies to both `_compile_paper_from_brainstorm()` and `_compile_tier3_paper()`. +- **Supercharge propagation (REQUIRED)**: Autonomous mode must preserve per-role `supercharge_enabled` for brainstorm submitters, validator, high-context, high-param, critique submitter, proof runtime snapshots, and child Compiler/Aggregator coordinators. This setting lives in role configs / `ModelConfig`, not `system_config`. - Constrains section order: Body → Conclusion → Introduction → Abstract -- Paper is considered complete when abstract is detected in paper content -- Uses regex patterns to detect and extract abstract section +- Paper is considered complete when the abstract phase receives explicit `section_complete: true` +- Regex patterns may still extract abstract text for metadata, but do not drive phase completion - Reference papers are RAG'ed with brainstorm having higher direct injection priority - Outline is ALWAYS fully injected (never RAGed) for structural framework integrity +- Autonomous/Tier 3 compiler runs must not start the manual Part 1 aggregator monitor; the parent tier owns all brainstorm/reference context. --- @@ -282,12 +287,6 @@ JSON schemas defined in `json-prompt-design.mdc`. 
Two-step: submitter requests p - System intelligently handles large papers via RAG when needed - Maximum 3 papers enforced across the topic-cycle selection modes -### Context for Pre-Brainstorm Reference Selection -- User's high-level research prompt (direct injection) -- Current brainstorm topic prompt (direct injection) -- ALL Tier 2 paper titles + abstracts (direct injection if fits, RAG if too large) -- Instruction: "Select papers that would help inform and enhance exploration of this brainstorm topic" - ### Key Design Points - **Same references persist**: References selected here are used for BOTH brainstorming AND paper writing - **Additional selection later**: AI can select MORE references (up to 3 total) before paper writing @@ -318,13 +317,14 @@ The autonomous brainstorm aggregator inherits batch validation from Part 1 infra 1. **Topic-Specific Database**: Writes to `auto_brainstorms/brainstorm_{topic_id}.txt` under the active instance data root (default desktop path: `backend/data/auto_brainstorms/brainstorm_{topic_id}.txt`) instead of `rag_shared_training.txt` 2. **No User-Provided Topic Prompt**: Uses the AI-generated brainstorm topic prompt 3. **Completion Tracking**: Tracks acceptance count (including removals) for completion review trigger -4. **Hard Limit**: 30 accepted submissions (FORCE transition to paper writing, no completion review) +4. **Deletion Safety**: An active/current brainstorm must not be deleted while autonomous research or its aggregator is running; if its metadata or database disappears, aggregation must stop and clear stale coordinator pointers rather than recreate an invisible DB. +5. 
**Hard Limit**: 30 accepted submissions (FORCE transition to paper writing, no completion review) - Purpose: Prevents runaway brainstorms from accumulating indefinitely - Trigger: After each acceptance, check if count >= 30 - Behavior: Immediately transition to paper writing, skip completion review - WebSocket event: `brainstorm_hard_limit_reached` - **TOTAL across all rounds**: When `continue_existing` resumes an incomplete brainstorm, the 30-cap applies to the TOTAL acceptance count (prior + new). The aggregator loop tracks a `resume_acceptance_base` offset so `_acceptance_count` always reflects the true total. If a topic already has >= 30 acceptances on entry, aggregation is skipped entirely and paper writing is forced immediately. -5. **Rejection Hard Limit**: 10 consecutive rejections (with minimum 5 acceptances) FORCE transition to paper writing +6. **Rejection Hard Limit**: 10 consecutive rejections (with minimum 5 acceptances) FORCE transition to paper writing - Purpose: Prevents infinite rejection loops when brainstorm is exhausted - Trigger: After rejection, check if consecutive rejections >= 10 AND acceptances >= 5 - Behavior: Immediately transition to paper writing, skip completion review @@ -500,30 +500,6 @@ Same two-step browsing workflow as pre-brainstorm selection (expand request → **Prompts**: `paper_title_exploration_prompts.py` — `build_title_exploration_user_prompt()` frames the aggregation task for candidate title generation with context: user prompt, topic, brainstorm summary, existing papers, reference papers. -### Paper Title Exploration (Pre-Title Candidate Brainstorm) - -**Purpose**: Before committing to a paper title, the system collects 5 validated candidate titles using the Part 1 aggregator infrastructure. The final title selection then chooses from candidates, synthesizes them, or proposes a new title with justification. 
- -**Architecture**: Uses `AggregatorCoordinator` from Part 1 — same parallel submitters + batch validator, but with **cleanup/pruning disabled** (`enable_cleanup_review=False`) since target is only 5 candidates. - -**Applies to EVERY paper creation**: Tier 2 papers (1/2/3 from brainstorm), Tier 3 short-form, Tier 3 gap/intro/conclusion chapters. - -**Workflow**: -1. Aggregator starts with all configured submitters running in parallel -2. Submitters generate candidate paper titles as standard submissions -3. Validator checks quality, relevance, and DIVERSITY (rejects near-duplicates) -4. Accepted candidates accumulate in temp title DB -5. Coordinator stops at 5 acceptances (or 15 consecutive rejections safety valve) -6. Reads title DB, formats as candidate list for final title selection - -**Temp DB**: `title_candidates_{topic_id}.txt` in brainstorms dir (cleaned up after phase) - -**WebSocket Events**: `paper_title_exploration_started`, `paper_title_exploration_progress`, `paper_title_exploration_complete` - -**Crash Recovery**: On resume, exploration restarts fresh (short phase, no state to preserve). - -**Prompts**: `paper_title_exploration_prompts.py` — `build_title_exploration_user_prompt()` frames the aggregation task for candidate title generation with context: user prompt, topic, brainstorm summary, existing papers, reference papers. - ### Paper Title Selection **Context**: @@ -618,24 +594,17 @@ The validator will REJECT any outline missing these required sections or with in - Cannot skip to conclusion/introduction/abstract **Critique Phase (Post-Body, Pre-Conclusion)**: -- **Maximum Rewrites**: 1 completed rewrite allowed. Rewrite counts as "completed" only after first successful body acceptance. After 1 completed rewrite, critique phase is skipped. 
-- **Pre-Critique Snapshot**: Paper body snapshotted at critique phase start (for rewrite context) -- **Triggered**: Automatically when body construction completes (unless rewrite_count >= 1 completed rewrites) +- **Triggered**: Automatically when body construction completes - **Purpose**: Peer review body section before proceeding to conclusion -- **Target**: 5 total attempts (accepted + rejected + declined) +- **Target**: 3 total attempts (accepted + rejected + declined) - **Decline Mechanism**: Submitter can assess "no critique needed" if body is academically acceptable (no mathematical errors, all outline requirements met, proper rigor) -- **Skip Rewrite**: If 5 total attempts complete with 0 accepted critiques, skip rewrite phase and continue to conclusion -- **Rewrite Decision**: If 5 total attempts reached with ≥1 acceptance, submitter decides: continue / partial_revision / total_rewrite -- **Decision Options**: - - **CONTINUE**: Critiques minor/incorrect, proceed to conclusion - - **PARTIAL_REVISION**: **ITERATIVE** edits - proposes ONE edit at a time, validates, applies, sees result, then proposes next. Context includes pre-critique paper + current paper + accepted critiques. - - **TOTAL_REWRITE**: Clear entire body and rebuild from scratch (catastrophic flaws only). Receives pre-critique paper + accepted critiques for context. 
-- **Accumulated History**: All critiques from all previous failed versions are provided to rewrite decision -- **Context for Rewrites**: Pre-critique paper (shows what failed) + accepted critiques ONLY (rejected critiques NOT included) -- **JSON Schema**: `{"critique_needed": true/false, "submission": "...", "reasoning": "..."}` for critiques; `{"decision": "continue|partial_revision|total_rewrite", "new_title": null, "new_outline": null, "reasoning": "..."}` for rewrite decision (note: edit_operations removed, now iterative) +- **Self-Review Append**: If accepted critiques exist after 3 attempts, append them as `AI Self-Review and Limitations`; if 0 critiques are accepted, continue without the section +- **No Rewrites**: Critiques never trigger partial revision, total rewrite, body clearing, title changes, or outline updates +- **Placement**: Self-review is final reader-facing content after the compiler Theorems Appendix/proof section when present, otherwise after the conclusion; later proof appends must stay before self-review +- **JSON Schema**: `{"critique_needed": true/false, "submission": "...", "reasoning": "..."}` for critiques only **Skip Critique Phase (User Override)**: -- **Purpose**: Allow users to manually skip the critique/rewrite phase and proceed directly to conclusion +- **Purpose**: Allow users to manually skip the critique/self-review phase and proceed directly to conclusion - **API Endpoint**: `POST /api/auto-research/skip-critique` - **Availability**: Any time during Tier 2 paper writing - **Behavior**: @@ -765,8 +734,8 @@ When abstract is written and validated, the paper is considered COMPLETE. 
Additi - Uses validator model from current session configuration - Calculates average rating: `(novelty + correctness + impact) / 3` - Saves critique to paper's critique storage - - If average rating ≥ 7.0, emits `high_score_critique` WebSocket event - - Frontend displays popup notification (max 3, FIFO queue) + - If average rating ≥ 6.25, emits `high_score_critique` WebSocket event + - Frontend displays popup notification (max 3, FIFO queue) and recovers missed high-score popups from saved paper critique badges on reload/poll - Non-blocking: errors logged but don't affect paper completion - See "Auto-Critique Popup Notifications" section below for details @@ -798,9 +767,9 @@ JSON schema defined in `json-prompt-design.mdc`. Fields: `should_remove` (bool), - Maximum 1 removal per review cycle 5. **Execution**: - - If removal validated: Move paper to `auto_papers/archive/` under the active instance data root (default desktop path: `backend/data/auto_papers/archive/`) - - Update metadata to mark as "archived" - - Update statistics + - If removal validated: prune the paper into `auto_papers/pruned/` (or the session `papers/pruned/` directory) with `pruned_paper_{paper_id}*` filenames and a top-of-file `PRUNED PAPER - REMOVED FROM MODEL CONTEXT` banner + - Update metadata to mark `status="pruned"`, store prune reason/actor/time, and remove active RAG sources for that paper + - Pruned papers are excluded from all future model context/reference selection, but remain visible and downloadable for the user until they explicitly delete all pruned papers ### Return to Topic Selection / Brainstorm Multi-Paper Continuation @@ -1031,10 +1000,10 @@ Wolfram Alpha Verifications: 3 queries **Wolfram Alpha Verification Tracking**: - Wolfram Alpha API calls are tracked separately from LLM API calls -- Only ACCEPTED Wolfram verifications are counted (where result was added to paper via validated rigor submission) +- Only Wolfram calls attached to accepted construction submissions are 
counted in paper credits - Displayed in MODEL CREDITS section below LLM model list - Format: "Wolfram Alpha Verifications: N queries" -- Tracking happens in `compiler_coordinator._submit_and_validate_rigor()` after validator acceptance +- Tracking happens in `compiler_coordinator._track_submission_wolfram_calls()` after validator acceptance - If no Wolfram calls made, this line is omitted from credits - **Graceful edge case handling**: Credits show even if only Wolfram calls exist (no model tracking data), or if only model calls exist (no Wolfram calls) @@ -1049,7 +1018,7 @@ class PaperMetadata(BaseModel): **PaperModelTracker Class** (`backend/autonomous/memory/paper_model_tracker.py`): - `track_call(model_id)`: Record an API call for a model -- `track_wolfram_call(query)`: Record a Wolfram Alpha verification +- `track_wolfram_call(query)`: Increment the Wolfram verification count; query text is not persisted for credits - `get_wolfram_call_count()`: Get total Wolfram queries - `has_tracking_data()`: Returns True if any model calls OR Wolfram calls exist (handles edge cases gracefully) - `get_models_dict()`: Get Dict[str, int] for metadata storage @@ -1220,10 +1189,12 @@ Main component for displaying Tier 3 status and content: - Back button to return to list **API Endpoints**: -- `GET /auto-research/final-answer/{answer_id}/archive/papers` - List archived papers -- `GET /auto-research/final-answer/{answer_id}/archive/papers/{paper_id}` - Get paper details -- `GET /auto-research/final-answer/{answer_id}/archive/brainstorms` - List archived brainstorms -- `GET /auto-research/final-answer/{answer_id}/archive/brainstorms/{topic_id}` - Get brainstorm details +- `GET /api/auto-research/final-answer/{answer_id}/archive/papers` - List archived papers +- `GET /api/auto-research/final-answer/{answer_id}/archive/papers/{paper_id}` - Get paper details +- `GET /api/auto-research/final-answer/{answer_id}/archive/brainstorms` - List archived brainstorms +- `GET 
/api/auto-research/final-answer/{answer_id}/archive/brainstorms/{topic_id}` - Get brainstorm details + +Archive IDs are untrusted path components. Resolve `answer_id`, `paper_id`, and `topic_id` with `validate_single_path_component()` / `resolve_path_within_root()` before reading archived files. **Design Principles**: - Non-intrusive: Button is discrete, not prominently displayed @@ -1249,17 +1220,17 @@ Main component for displaying Tier 3 status and content: Runs automatically after every completed brainstorm (Tier 1) and every completed paper (Tier 2 / Tier 3 chapter), gated on `system_config.lean4_enabled`. Silent no-op when disabled. -**Proof Framing Gate (one-shot, at autonomous start)**: When `lean4_enabled`, the coordinator runs `_run_proof_framing_gate()` before research begins. A single LLM call on the user prompt decides `is_proof_amenable` (`build_proof_framing_gate_prompt` → `autonomous_proof_framing_gate` role). The gate errs on the side of `true` — it returns `false` only when the prompt is purely empirical, engineering-focused, or has no meaningful mathematical content. If `true`, `PROOF_FRAMING_CONTEXT` (which directs submissions to pursue **novel, non-trivial** theorems and explicitly discourages standard identities and Mathlib restatements) is appended to every subsequent submitter prompt via `_append_proof_framing()` and persisted to workflow state for crash recovery. Decision is broadcast via `proof_framing_decided`. Silent no-op when disabled or when the prompt is not proof-amenable. +**Proof Framing Gate (one-shot, at autonomous start)**: When `lean4_enabled`, the coordinator runs `_run_proof_framing_gate()` before research begins. A single LLM call on the user prompt decides `is_proof_amenable` (`build_proof_framing_gate_prompt` → `autonomous_proof_framing_gate` role). 
The gate errs on the side of `true` when formal proof can help the user's prompt — it returns `false` when the prompt is purely empirical, engineering-focused, or has no meaningful prompt-relevant mathematical content. If `true`, `PROOF_FRAMING_CONTEXT` (which directs submissions to pursue theorems/lemmas/formalizations that directly answer, support, or advance the user prompt, with novelty/non-triviality valuable only inside that boundary) is appended to every subsequent submitter prompt via `_append_proof_framing()` and persisted to workflow state for crash recovery. Decision is broadcast via `proof_framing_decided`. Silent no-op when disabled or when the prompt is not proof-amenable. **Pipeline** (`backend/autonomous/core/proof_verification_stage.py`): -1. **Candidate identification** — `ProofIdentificationAgent` (`build_proof_identification_prompt`) extracts up to 5 novel, non-trivial theorem candidates from brainstorm or paper content, ranked by novelty potential. Trivial identities, textbook restatements, and single-tactic-closable results are filtered out at this stage before any Lean 4 cost is incurred. -2. **Optional Mathlib lemma search** — `MathlibLemmaSearchAgent` surfaces relevant existing lemmas into the formalization prompt +1. **Candidate identification** — `ProofIdentificationAgent` (`build_proof_identification_prompt`) extracts every prompt-relevant, non-trivial theorem candidate from brainstorm or paper content. Candidates are ordered by direct usefulness to the user prompt first, then novelty/formalization value; there is no artificial theorem-count cap. Trivial identities, off-prompt curiosities, textbook restatements, and single-tactic-closable results are filtered out before any Lean 4 cost is incurred. +2. **Optional Mathlib lemma search** — `MathlibLemmaSearchAgent` surfaces relevant existing lemmas into the formalization prompt, tied to the target theorem and user prompt 3. 
**Optional SMT early-exit** — when `smt_enabled`, `SmtClient` classifies candidates conservatively; successful SMT results become Lean tactic hints (nativeDecide / omega / decide style), never stored as standalone proofs 4. **Lean 4 formalization attempts** — two-phase retry: up to 3 full-proof attempts via `ProofFormalizationAgent.prove_candidate`, then up to 2 multi-tactic script attempts via `prove_candidate_tactic_script` (5 total per candidate). Prior `FailedProofCandidate` failure hints from `proof_database.inject_failure_hints_into_prompt()` thread into each retry. 5. **Novelty check** — `autonomous_proof_novelty` role compares verified proof against existing proof library -6. **Storage** — `proof_database.add_proof()` persists novel and known proofs as session-aware records (`proofs_index.json`, `proof_.json`, `proof__lean.lean`) with extracted `ProofDependency` records and reverse Mathlib usage index. Verified proofs are also appended as a "Verified Proofs" section at the bottom of the source brainstorm DB and/or paper file via `append_proofs_section()`. Cross-session read access is provided by `proof_database.list_proof_library()` (all sessions, novelty-filtered) and `proof_database.get_library_proof(session_id, proof_id)`, consumed by the `ProofLibrary` UI component and `/api/proofs/library` endpoints. +6. **Storage** — `proof_registration.register_verified_lean_proof()` uses `proof_database.add_proof_if_absent()` to atomically persist novel and known proofs as session-aware records (`proofs_index.json`, `proof_.json`, `proof__lean.lean`) with extracted `ProofDependency` records and reverse Mathlib usage index. Duplicate detection is scoped to source type/id + normalized theorem statement + normalized Lean code and must return `duplicate=True` to callers so source files are not appended twice. If `proofs_index.json` is corrupt, rebuild from existing `proof_*.json` record files instead of replacing the library with an empty index. 
Verified proofs are appended as a "Verified Proofs" section at the bottom of the source brainstorm DB and/or paper file via `append_proofs_section()` only for non-duplicate novel records. Cross-session read access is provided by `proof_database.list_proof_library()` (all sessions, novelty-filtered) and `proof_database.get_library_proof(session_id, proof_id)`, consumed by the `ProofLibrary` UI component and `/api/proofs/library` endpoints. -**Parallelism (two-phase execution per stage run)**: Steps 2–4 above (the per-candidate "Phase A" pipeline: lemma search → optional SMT hint → `prove_candidate` → `prove_candidate_tactic_script` → `proof_attempts_exhausted` broadcast on failure) run concurrently across *all* identified candidates inside a single `ProofVerificationStage.run()` invocation, bounded by `system_config.proof_max_parallel_candidates` (default 6, env: `MOTO_PROOF_MAX_PARALLEL_CANDIDATES` / `PROOF_MAX_PARALLEL_CANDIDATES`) via an `asyncio.Semaphore`. Phase A parallelizes agent/model work, but actual Lean 4 subprocess verification is serialized by `Lean4Client` behind a shared execution lock so all candidates queue one-at-a-time against the shared Mathlib workspace; LSP mode remains independently serialized by its operation lock and subprocess fallback uses the same shared queue. The identification stage (step 1) caps candidates at 5 and filters trivial/well-known results before Phase A begins, so Phase A only processes genuinely novel-potential theorems. 
Completed candidates are consumed by the driver loop through `asyncio.as_completed`, and steps 5–6 (the "Phase B" post-processing: novelty assessment, `add_proof`, dependency extraction via `ProofDependencyExtractor`, `append_proofs_section`, `novel_proof_discovered` / `known_proof_verified` broadcast, `record_failed_candidate` for brainstorm failures) are performed strictly **one-at-a-time** in Phase-A completion order inside that driver loop so later candidates can observe earlier stored proofs as MOTO dependencies. Each Phase-A task instantiates its own `ProofIdentificationAgent` / `MathlibLemmaSearchAgent` / `ProofFormalizationAgent` so the per-agent `task_sequence` counter cannot collide across concurrent candidates. If any Phase-A task raises `FreeModelExhaustedError` (or any other exception), the driver cancels all still-running sibling tasks and re-raises so the coordinator's recovery path runs with no orphaned background API calls. `should_stop` is plumbed into each Phase-A pipeline and checked before each Phase-B pass, so a stop-request short-circuits cleanly without leaking tasks. +**Parallelism (two-phase execution per stage run)**: Steps 2–4 above (the per-candidate "Phase A" pipeline: lemma search → optional SMT hint → `prove_candidate` → `prove_candidate_tactic_script` → `proof_attempts_exhausted` broadcast on failure) run concurrently across *all* identified candidates inside a single `ProofVerificationStage.run()` invocation, bounded by `system_config.proof_max_parallel_candidates` (default 6, env: `MOTO_PROOF_MAX_PARALLEL_CANDIDATES` / `PROOF_MAX_PARALLEL_CANDIDATES`) via an `asyncio.Semaphore`. Phase A parallelizes agent/model work, but actual Lean 4 subprocess verification is serialized by `Lean4Client` behind a shared execution lock so all candidates queue one-at-a-time against the shared Mathlib workspace; LSP mode remains independently serialized by its operation lock and subprocess fallback uses the same shared queue. 
The identification stage (step 1) filters off-prompt, trivial, and well-known results before Phase A begins, so Phase A only processes prompt-relevant theorem candidates. Completed candidates are consumed by the driver loop through `asyncio.as_completed`, and steps 5–6 (the "Phase B" post-processing: novelty assessment, `add_proof`, dependency extraction via `ProofDependencyExtractor`, `append_proofs_section`, `novel_proof_discovered` / `known_proof_verified` broadcast, `record_failed_candidate` for brainstorm failures) are performed strictly **one-at-a-time** in Phase-A completion order inside that driver loop so later candidates can observe earlier stored proofs as MOTO dependencies. Each Phase-A task instantiates its own `ProofIdentificationAgent` / `MathlibLemmaSearchAgent` / `ProofFormalizationAgent` so the per-agent `task_sequence` counter cannot collide across concurrent candidates. If any Phase-A task raises `FreeModelExhaustedError` (or any other exception), the driver cancels all still-running sibling tasks and re-raises so the coordinator's recovery path runs with no orphaned background API calls. `should_stop` is plumbed into each Phase-A pipeline and checked before each Phase-B pass, so a stop-request short-circuits cleanly without leaking tasks. **Rigor mode is NOT parallelized** (compiler Part 2): `submit_rigor_lean_theorem()` runs one candidate per rigor cycle by design (discovery → 5 Lean attempts → novelty → placement) and the outer `_rigor_loop` drives cycles serially so each proven theorem can land in the paper before the next discovery sees updated context. The parallel candidate pipeline lives only in `ProofVerificationStage`. @@ -1269,11 +1240,11 @@ Runs automatically after every completed brainstorm (Tier 1) and every completed **Subprocess vs LSP**: `lean4_client` runs Lean via subprocess by default. 
When `lean4_lsp_enabled`, a persistent LSP-style process reduces cold-start overhead; the subprocess path remains the fallback and must keep working when LSP is disabled. Missing/corrupt Mathlib `.olean` diagnostics are infrastructure failures, not proof failures: the client must re-check workspace readiness inside the serialized Lean execution queue, invalidate readiness when the cache is bad, refetch the Mathlib cache, retry the same Lean check once, and return a distinct `LEAN 4 WORKSPACE ERROR` if repair still fails. Future checks may attempt repair again after external fixes or transient failures clear, but the current failed check must not burn proof attempts as ordinary Lean feedback. -**Manual proof checks** (Build 5): `POST /api/proofs/check` reuses `ProofVerificationStage.run_manual()` with the stored `ProofRuntimeConfigSnapshot` (brainstorm / paper / validator role configs captured during autonomous startup). Readiness is surfaced via `/api/proofs/status.manual_check_ready` + `manual_check_message`. Required state: `lean4_enabled=True` AND a runtime snapshot must exist (start autonomous research once to seed it). +**Manual proof checks** (Build 5): `POST /api/proofs/check` reuses `ProofVerificationStage.run_manual()` with the stored `ProofRuntimeConfigSnapshot` (brainstorm / paper / validator role configs captured during autonomous startup). Manual checks may target any brainstorm with content, including in-progress brainstorms; papers remain completed-only. Readiness is surfaced via `/api/proofs/status.manual_check_ready` + `manual_check_message`. Required state: `lean4_enabled=True` AND a runtime snapshot must exist (start autonomous research once to seed it). 
**Proof runtime config snapshot** (`research_metadata.set_proof_runtime_config`): Captures a `ProofRuntimeConfigSnapshot` with three `ProofRoleConfigSnapshot` entries — `brainstorm` (from first aggregator submitter config), `paper` (from high-context submitter config), `validator` (from validator config). Each holds provider, model_id, openrouter_provider, lm_studio_fallback_id, context_window, and max_output_tokens. Lets manual checks run without an active autonomous session. -**Proof WebSocket events** (all broadcast through the standard `/api/ws` stream): +**Proof WebSocket events** (all broadcast through the standard `/api/ws` stream). `proof_verified` is emitted only after the proof has passed integrity checks and has been registered/reused in the proof database; payloads include `proof_id`. - `proof_framing_decided` - `proof_check_started`, `proof_check_complete`, `proof_check_no_candidates` - `proof_check_candidates_found`, `mathlib_lemmas_suggested` @@ -1293,10 +1264,11 @@ Runs automatically after every completed brainstorm (Tier 1) and every completed 7. Proof certificates stay text-based (`.lean` source + JSON metadata) — no binary artifacts 8. Hosted/generic mode keeps `lean4_enabled` and `smt_enabled` default false and the hosted image stays Lean-free and Z3-free (no proof binaries in the `python:3.12-slim` runtime) 9. Proof framing gate runs once per autonomous start and only when `lean4_enabled`; the resulting `proof_framing_active` flag and `PROOF_FRAMING_CONTEXT` are persisted in workflow state for crash recovery -10. Candidate identification (`build_proof_identification_prompt`) is a novelty-seeking gate — it rejects trivial identities, textbook restatements, and single-tactic-closable results, and returns **at most 5** candidates ranked by novelty potential. 
Every candidate that passes this gate is attempted — Phase A is bounded by `proof_max_parallel_candidates` but never truncates the post-identification candidate list; Phase A agent/model work runs concurrently across candidates while actual Lean 4 subprocess verification queues one-at-a-time through `Lean4Client`, and Phase B (novelty / `add_proof` / dependency extraction / brainstorm+paper `append_proofs_section` / novel/known broadcasts / `record_failed_candidate`) remains strictly serialized in Phase-A completion order so intra-batch MOTO dependencies and per-source proof appending stay coherent +10. Candidate identification (`build_proof_identification_prompt`) is a user-prompt relevance gate first and a novelty/non-triviality gate second — it rejects off-prompt curiosities, trivial identities, textbook restatements, and single-tactic-closable results, then returns every prompt-relevant candidate ordered by direct usefulness to the user prompt. Every candidate that passes this gate is attempted — Phase A is bounded by `proof_max_parallel_candidates` but never truncates the post-identification candidate list; Phase A agent/model work runs concurrently across candidates while actual Lean 4 subprocess verification queues one-at-a-time through `Lean4Client`, and Phase B (novelty / `add_proof` / dependency extraction / brainstorm+paper `append_proofs_section` / novel/known broadcasts / `record_failed_candidate`) remains strictly serialized in Phase-A completion order so intra-batch MOTO dependencies and per-source proof appending stay coherent 11. Each Phase-A task owns its own `ProofIdentificationAgent` / `MathlibLemmaSearchAgent` / `ProofFormalizationAgent` instance to keep per-agent `task_sequence` counters collision-free; any Phase-A exception (including `FreeModelExhaustedError`) must cancel all sibling tasks and re-raise so the coordinator's recovery path runs without orphaned background API calls 12. 
`should_stop` propagates into Phase A and is re-checked before each Phase-B pass so stop-requests short-circuit without leaking tasks or partially-applied Phase-B writes -13. Compiler rigor mode (`submit_rigor_lean_theorem`, `_rigor_loop`) is NOT parallelized — rigor cycles discover, verify, and place one theorem per cycle so each verified theorem lands in the paper before the next discovery; the parallel candidate pipeline lives only in `ProofVerificationStage` +13. Compiler rigor mode (`submit_rigor_lean_theorem`, `_rigor_loop`) is NOT parallelized — rigor cycles discover, verify, and route one theorem per cycle (inline for eligible existing-paper claims, appendix-only for extension-derived theorems or placement fallback) so each verified theorem lands in the paper before the next discovery; the parallel candidate pipeline lives only in `ProofVerificationStage` +14. Post-Lean integrity scanning rejects newly introduced `axiom`, `constant`, and `opaque` declarations even when the declaration name appears on following lines. Generated source text is not an authorization baseline unless explicitly passed as allowed baseline. --- @@ -1324,6 +1296,8 @@ Contains abstract only. Contains complete brainstorm database that sourced this paper. +**Pruned Papers**: `auto_papers/pruned/pruned_paper_{paper_id}.txt` (or session `papers/pruned/`) preserves papers removed from model context. The raw file begins with a `PRUNED PAPER - REMOVED FROM MODEL CONTEXT` banner and is for user review/download only. Companion metadata/abstract/outline files use the same `pruned_` prefix. Brainstorms do not use this soft-prune preservation feature. + ### Research Metadata File **File**: `auto_research_metadata.json` under the active instance data root (default desktop path: `backend/data/auto_research_metadata.json`) @@ -1334,26 +1308,32 @@ Contains complete brainstorm database that sourced this paper. 
**File**: `auto_workflow_state.json` under the active instance data root (default desktop path: `backend/data/auto_workflow_state.json`) -This file persists the current workflow state to enable **automatic resume** after program restart or crash. The system automatically saves this state at key checkpoints: +This file persists the current workflow state to enable **automatic resume** after program restart, crash, or user stop. The system automatically saves this state at key checkpoints: - After topic selection (starting brainstorm aggregation) - Periodically during brainstorm aggregation (every 5 acceptances) +- Before and after brainstorm proof verification (`paper_phase="brainstorm_proof_verification"` / `pre_paper_compilation`) - When transitioning to paper compilation -- During paper writing phases +- During paper writing phases (`body`, `conclusion`, `introduction`, `abstract`) +- Before completed-paper proof verification (`paper_phase="paper_proof_verification"`) - **During Tier 3 final answer generation phases** -On **clean stop** (user-initiated via stop button), this file is automatically cleared. +On **clean stop** (user-initiated via stop button), this file is preserved for pause/resume. Only `clear_all_data()` should clear workflow state. `_save_workflow_state()` must preserve the previous `paper_phase` when called without an explicit phase, and only clear the phase when passed `phase=None` intentionally after successful completion. -On **restart/crash recovery**, if this file exists with `is_running: true`, the system detects an interrupted workflow and: +On **restart/crash recovery**, if this file exists with a resumable tier/topic/paper (regardless of `is_running`), the system detects an interrupted workflow and: 1. Restores internal state (topic ID, acceptance counts, model config, etc.) -2. Automatically resumes from the last known phase -3. Broadcasts `auto_research_resumed` WebSocket event +2. 
Recovers stale acceptance counts from brainstorm metadata/database files when workflow state says `0` +3. Automatically resumes from the last known phase; completed brainstorms never re-enter aggregation and instead resume at proof/paper handoff +4. Detects completed papers paused before proof verification and resumes `paper_proof_verification` before moving on +5. Broadcasts `auto_research_resumed` WebSocket event + +If `workflow_state.json` is stale, idle, or missing, session recovery must conservatively synthesize a resume point from durable `session_stats.json`, brainstorm metadata/database files, and in-progress paper metadata/content. This includes scanning `papers/*_metadata.json` for `status="in_progress"` when stats lost `current_paper_id`; the resume phase is detected from saved paper content rather than defaulting to body. **Important Notes:** - The user research prompt is saved in `auto_research_metadata.json`, not the workflow state - Model configuration is saved to allow resuming with the same model settings -- If the workflow state file is corrupted or missing, the system starts fresh +- If the workflow state file is corrupted or missing, first try durable session-file recovery; start fresh only if no current topic, in-progress paper, completed unpapered brainstorm, completed papers, or active Tier 3 state can be recovered - The `clear_all_data` API endpoint clears the workflow state along with all other data --- @@ -1501,7 +1481,7 @@ Paper library component: Persistent popup notification component for high-scoring paper critiques: - **Fixed position**: Bottom-right corner with z-index 999999 - **Max 3 notifications**: FIFO queue (oldest removed when 4th arrives) -- **Trigger condition**: Paper completed with validator critique avg rating ≥ 7.0 +- **Trigger condition**: Paper has validator critique avg rating ≥ 6.25 - **Each notification displays**: - Paper title (truncated to 2 lines) - Average rating with color coding (green ≥8, blue ≥7) @@ -1511,9 
+1491,9 @@ Persistent popup notification component for high-scoring paper critiques: - **Interactions**: - Click anywhere (except X) → opens `PaperCritiqueModal` with full critique - Click X → dismisses notification with slide-out animation -- **Persistence**: Stays visible across all screens until dismissed (not saved to localStorage) +- **Persistence**: Stays visible across screens until dismissed; dismissed/clicked high-score popup keys are stored in localStorage so missed WebSocket events can be replayed once from saved paper critique ratings without repeating forever - **Styling**: Purple gradient, compact design (~250px × ~80px), smooth animations -- **WebSocket Integration**: Listens to `high_score_critique` event from backend +- **WebSocket Integration**: Listens to `high_score_critique` event from backend and de-duplicates against recovered paper-list notifications **State Management** (in App.jsx): - `critiqueNotifications` array stores active notifications @@ -1535,7 +1515,7 @@ Settings integrated into main Settings panel: Metrics and logging component: - Real-time metrics: - Total brainstorms (complete / in-progress) - - Total papers (complete / archived) + - Total papers (complete / pruned) - Acceptance/rejection rates (brainstorm vs paper compilation) - Average submissions per brainstorm - Average words per paper @@ -1613,9 +1593,10 @@ Tier 3 Final Answer display component (separate tab for completed/overall final - Selective non-use of brainstorm/database material is allowed when the resulting paper is stronger, rigorous, and aligned with the prompt ### Running Modes -- **Part 1, Part 2, and Part 3 remain user-selectable modes** -- **Only ONE workflow mode may be active at a time** — Aggregator, Compiler, and Autonomous Research are mutually exclusive at runtime (applies identically in both default and generic deployment) +- **Part 1, Part 2, Part 3, and LeanOJ Proof Solver remain user-selectable modes** +- **Only ONE workflow mode may be active at a 
time** — Aggregator, Compiler, Autonomous Research, and LeanOJ Proof Solver are mutually exclusive at runtime (applies identically in both default and generic deployment) - **Part 3 internally controls Part 1 and Part 2 components** during autonomous execution +- **LeanOJ is proof-only and separate from Part 3** — it does not write papers, does not mutate autonomous brainstorm/paper memory, and stores resumable run-local state under the active `leanoj_sessions` data root until explicit clear - Starting any mode while another mode is running must be blocked until the active mode is stopped - In generic mode, all API routes and WebSocket events are identical — the only difference is provider availability (OpenRouter-only, FastEmbed embeddings, no PDF download) @@ -1710,9 +1691,9 @@ Tier 3 Final Answer display component (separate tab for completed/overall final 21. **Same model = single author** - Model used in multiple instances counts as ONE author entry, but all API calls tallied 22. **Paper redundancy is DISABLED during Tier 3** - `_tier3_active` flag prevents redundancy checks from purging papers being used in the final volume 23. **Brainstorm hard limit is 30 acceptances** - After 30 acceptances, paper writing is forced (no completion review) -24. **Maximum 1 completed rewrite per paper** - Rewrite counts as "completed" only after first successful body acceptance; prevents infinite loops from failed rewrite attempts -25. **Partial revision option available** - Allows targeted edits without full body rewrite -26. **Total rewrite is last resort** - Only for catastrophic issues that can't be fixed with targeted edits +24. **Critiques append as self-review, never rewrite** - Post-body critique runs 3 total attempts and appends validator-accepted critiques as `AI Self-Review and Limitations`; no partial or total body rewrites are allowed +25. 
**Self-review follows proofs/conclusion** - The self-review section is placed after compiler/appended proof material when present, otherwise after conclusion, and later proof appends must stay before it +26. **Critique declines remain valid** - If no critiques are accepted after the 3 attempts, the workflow proceeds to conclusion without adding a self-review section 27. **Rejection hard limit is 10 consecutive rejections (with 5+ acceptances)** - Prevents infinite rejection loops 28. **Retroactive brainstorm corrections during Tier 2 paper compilation** - Submitter sees unified paper+brainstorm workspace; operations validated independently by validator (paper-only context for paper ops, brainstorm-only context for brainstorm ops); each operation must stand alone without requiring the other for correctness 29. **Max 3 papers per brainstorm** - hard limit, continuation decision skipped after 3rd paper @@ -1779,4 +1760,5 @@ Out-of-order paper writing: The sequential paper writing order (body → conclus - Handles reasoning tokens (`...`), markdown wrappers, control tokens - Extracts first complete JSON object when multiple present - Handles LaTeX escape sequences comprehensively: fixes invalid `\u{word}` patterns, fixes invalid `\uXXXX` non-hex escapes, **pre-escapes dangerous LaTeX commands BEFORE any json.loads() attempt** using `(? 
api_id │ │ │ ├── api/ │ │ ├── __init__.py -│ │ ├── main.py # FastAPI app entry point (lifespan reads generic_mode from env, fail-closes hosted startup when proxy auth env is missing, skips LM Studio test in generic mode) -│ │ ├── middleware.py # CORS, error handling, proxy auth validation (X-Moto-* headers in generic mode) -│ │ ├── proxy_auth.py # Shared generic-mode proxy auth helpers (allowlist + HMAC signature validation for REST/WebSocket) +│ │ ├── main.py # FastAPI app entry point (lifespan reads generic_mode, fail-closes hosted auth env, ensures desktop API token, skips LM Studio test in generic mode) +│ │ ├── middleware.py # CORS, error handling, desktop token/origin checks, proxy auth validation, hosted body-size cap + actual-body hash check +│ │ ├── proxy_auth.py # Shared generic-mode proxy auth helpers (allowlist + HMAC over method/path/query/verified body hash for REST/WebSocket) │ │ └── routes/ │ │ ├── __init__.py │ │ ├── aggregator.py # Aggregator API endpoints (includes /events) │ │ ├── compiler.py # Compiler API endpoints │ │ ├── autonomous.py # Autonomous Research API endpoints +│ │ ├── leanoj.py # LeanOJ Proof Solver API endpoints (`/api/leanoj/*`: start/resume, stop, status, master-proof draft/edit summaries, proofs/library, skip-brainstorm, force-brainstorm, clear) │ │ ├── boost.py # Boost API endpoints (enable/disable/toggle/status + OpenRouter provider endpoint metadata) │ │ ├── workflow.py # Workflow API endpoints (predictions/history) -│ │ ├── download.py # PDF generation endpoint via Playwright (POST /api/download/pdf); returns 501 in generic mode -│ │ ├── openrouter.py # OpenRouter API endpoints (global key, models, providers + endpoint metadata, LM Studio availability, **GET /api/model-cache** for model ID caching, **POST /api/openrouter/reset-exhaustion** to reset credit exhaustion mid-session) -│ │ ├── websocket.py # WebSocket for real-time updates (proxy auth validation in generic mode before accept) +│ │ ├── update.py # Update/check 
endpoints for launcher/updater state +│ │ ├── download.py # PDF generation endpoint via Playwright (desktop only; sanitize/block external requests; returns 501 in generic mode) +│ │ ├── openrouter.py # OpenRouter API endpoints (global key, models/providers via header/body keys only, LM Studio availability, model cache, reset exhaustion) +│ │ ├── websocket.py # WebSocket for real-time updates (generic proxy auth or desktop one-time tickets before accept) │ │ ├── features.py # GET /api/features — shared build identity plus stable capability flags (`generic_mode`, `lm_studio_enabled`, `pdf_download_available`) │ │ ├── proofs.py # Proof database + Lean 4/SMT runtime + manual proof-check + certificate export + dependency graph routes; listing proofs (`GET /`, `/novel`, `/known`, `/library*`) and certificate/lean downloads (`/{id}/certificate`, `/{id}/certificate.lean`) are always available regardless of `lean4_enabled`; dependency/graph routes and `/check` are gated on `lean4_enabled`; `/status` uses short timeouts so it never blocks the UI │ │ └── health.py # GET /api/health — readiness/liveness probe with instance/build metadata @@ -205,7 +221,10 @@ project-root/ │ │ │ ├── paper_{paper_id}_abstract.txt # Abstract only │ │ │ ├── paper_{paper_id}_source_brainstorm.txt # Cached brainstorm database │ │ │ ├── paper_{paper_id}_last_10_rejections.txt # Compiler rejections for this paper -│ │ │ └── archive/ # Archived (redundant) papers +│ │ │ ├── pruned/ # Pruned papers preserved for user download, excluded from model context +│ │ │ │ ├── pruned_paper_{paper_id}.txt # Pruned full paper with top-of-file PRUNED PAPER banner +│ │ │ │ └── pruned_paper_{paper_id}_metadata.json # Pruned metadata/reason +│ │ │ └── archive/ # Legacy archived (redundant) papers, treated as pruned history │ │ │ └── paper_{paper_id}.txt │ │ ├── auto_final_answer/ # Autonomous Research - Tier 3 (LEGACY - replaced by auto_sessions) │ │ │ ├── final_answer_state.json # Tier 3 state (crash recovery) @@ 
-217,13 +236,15 @@ project-root/ │ │ ├── auto_sessions/ # Autonomous Research - Session-based folder organization │ │ │ └── {sanitized_prompt}_{timestamp}/ # Per-session folder │ │ │ ├── brainstorms/ # Tier 1 brainstorm databases -│ │ │ ├── papers/ # Tier 2 completed papers +│ │ │ ├── papers/ # Tier 2 completed papers plus pruned/ preserved context-excluded papers │ │ │ ├── final_answer/ # Tier 3 final answer data │ │ │ ├── proofs/ # Lean 4 verified-proof records (proofs_index.json, proof_.json, proof__lean.lean) │ │ │ ├── session_metadata.json # Session info (prompt, created_at, status) │ │ │ ├── session_stats.json # Session statistics │ │ │ └── workflow_state.json # Workflow state for crash recovery │ │ ├── proofs/ # Legacy (non-session) Lean 4 proof storage (mirrors per-session proofs/ layout) +│ │ ├── leanoj_sessions/ # LeanOJ run state (state.json, master_proof.lean, master_proof_edits.jsonl, master_proof_snapshots.jsonl, phase counters, subproofs, attempts, verified final Lean code; stop/crash resumes unless cleared) +│ │ ├── leanoj_artifacts/ # LeanOJ full-memory artifact logs (accepted ideas with context_role metadata, recursive topics, verified/partial/failed subproofs, final attempts, final-cycle packets) used for direct-first RAG allocation │ │ ├── auto_research_metadata.json # Autonomous Research metadata (LEGACY - now in session folders) │ │ ├── auto_research_stats.json # Autonomous Research statistics (LEGACY - now in session folders) │ │ ├── auto_workflow_state.json # Autonomous Research workflow state (LEGACY - now in session folders) @@ -247,7 +268,7 @@ project-root/ │ │ │ │ ├── CompilerLogs.jsx # Metrics: construction vs rigor, miniscule edits │ │ │ │ └── LivePaper.jsx # Real-time paper viewing, save draft, word count │ │ │ │ -│ │ │ └── autonomous/ # AUTONOMOUS RESEARCH +│ │ │ ├── autonomous/ # AUTONOMOUS RESEARCH │ │ │ ├── AutonomousResearchInterface.jsx # Main control: research prompt, start/stop, current tier │ │ │ ├── AutonomousResearch.css # 
Autonomous research styles │ │ │ ├── BrainstormList.jsx # List all brainstorm topics with status @@ -271,6 +292,17 @@ project-root/ │ │ │ ├── Stage2PaperHistory.jsx # Tier 2 paper history list (grouped per research run; sub-tab inside CompletedWorksLibrary) │ │ │ └── Stage2PaperHistory.css # Tier 2 paper history styles │ │ │ +│ │ │ └── leanoj/ # LEANOJ PROOF SOLVER UI +│ │ │ ├── LeanOJInterface.jsx # Prompt/template input, start/resume, stop, skip/force brainstorm, clear progress, live status, verified Lean output +│ │ │ ├── LeanOJSettings.jsx # LeanOJ-specific model profiles/settings; grouped UI controls map to underlying role keys (Submitter 1 also sets topic_generator, Validator sets both validators, Brainstorm Proof Solver sets subproof identifier+solver, Final Proof Solver also owns path/final-readiness decisions) +│ │ │ ├── LeanOJBrainstorms.jsx # LeanOJ accepted ideas/recursive brainstorm memory viewer +│ │ │ ├── LeanOJLogs.jsx # LeanOJ topics, subproofs, failed feedback, event stream +│ │ │ ├── LeanOJMasterProof.jsx # Master proof draft tab (on-demand draft, metadata, edit history, download) +│ │ │ ├── LeanOJMasterProof.css # Master proof draft tab styles +│ │ │ ├── LeanOJMathematicalProofs.jsx # Current-run verified LeanOJ proof/subproof viewer +│ │ │ ├── LeanOJProofLibrary.jsx # Cross-session completed LeanOJ proof-work library +│ │ │ └── index.js # LeanOJ component exports +│ │ │ │ │ ├── StartupProviderSetupModal.jsx # Post-disclaimer startup chooser for OpenRouter vs LM Studio setup (OpenRouter-only in generic mode) │ │ ├── OpenRouterApiKeyModal.jsx # Modal for global OpenRouter API key configuration with mode-aware persistence messaging │ │ ├── PaperCritiqueModal.jsx # Modal for displaying validator paper critiques (ratings, feedback, history) @@ -285,11 +317,12 @@ project-root/ │ │ ├── TextFileUploader.css # File uploader styles │ │ ├── OpenRouterPrivacyWarningModal.jsx # Privacy policy error modal (OpenRouter data sharing, capability-aware 
alternatives) │ │ ├── HelpTooltip.jsx # Shared portal-based help tooltip component (used across settings/interfaces) +│ │ ├── ProofStrengthBadge.jsx # Shared PS badge/tooltip for highlighted proof-strength models and primary proof-creation roles │ │ ├── settings-common.css # Shared settings panel styles │ │ ├── critique-modal.css # Paper critique modal styles │ │ │ │ │ ├── services/ -│ │ │ ├── api.js # Backend API calls (includes openRouterAPI, `/api/features` capability bootstrap helper, proof routes under `/api/proofs/*`, and cross-session proof library routes `getProofLibrary` / `getLibraryProof` under `/api/proofs/library`) +│ │ │ ├── api.js # Backend API calls (includes openRouterAPI, `/api/features`, proof routes, LeanOJ API, and cross-session proof library helpers) │ │ │ └── websocket.js # WebSocket connection │ │ │ │ │ ├── hooks/ @@ -300,11 +333,12 @@ project-root/ │ │ │ ├── modelCache.js # Frontend model cache utilities (display_name → api_id lookup) │ │ │ ├── openRouterSelection.js # Shared OpenRouter selector auto-fill helpers (context/output from model + host metadata) │ │ │ ├── autonomousProfiles.js # Shared autonomous recommended-profile definitions and persistence helpers +│ │ │ ├── leanojProfiles.js # LeanOJ-specific recommended/user profile definitions, persistence helpers, and request builder (topic generation uses all submitters; legacy topic_generator/selector is sourced from Brainstorm Submitter 1; legacy path_decider request field is derived from Final Proof Solver) │ │ │ ├── runtimeConfig.js # Frontend runtime helpers (instance storage prefix, active data-root display, instance ID) │ │ │ ├── researchRunHistory.js # Groups Tier 2 papers + final answers into per-run history entries for Stage2PaperHistory/FinalAnswerLibrary │ │ │ └── disclaimerHelper.js # Frontend-only disclaimer injection for brainstorm/paper views │ │ │ -│ │ ├── App.jsx # Main app shell with top-level mode switch, `/api/features` capability bootstrap, and capability 
propagation into settings/interfaces/modals +│ │ ├── App.jsx # Main app shell with top-level mode switch, `/api/features` capability bootstrap, capability propagation, and developer-mode raw-settings shortcut │ │ ├── index.css # Styles │ │ └── index.jsx # React entry point │ │ @@ -321,7 +355,7 @@ project-root/ ├── moto-update-manifest.json # Build 0 updater/build identity manifest committed on main ├── SECURITY.md # Security policy and private vulnerability reporting ├── Click To Launch MOTO.bat # The authoritative Windows launcher entrypoint (thin wrapper that delegates to moto_launcher.py) -├── Launch MOTO.sh # Linux/Ubuntu launcher entrypoint (thin bash wrapper that delegates to moto_launcher.py) +├── linux-ubuntu-launcher.sh # Linux/Ubuntu launcher entrypoint (thin bash wrapper that delegates to moto_launcher.py) ├── moto_launcher.py # Internal Python launcher orchestration (update check, runtime resolution, dependency install, service startup) ├── moto_updater.py # Build 1 updater helper (manifest fetch, install classification, ZIP/git apply flow, launcher state tracking) └── .moto_launcher_state.json # Gitignored local launcher state (tracks active service-window PIDs and runtime roots to block unsafe update-apply) @@ -331,7 +365,7 @@ project-root/ ### Launcher and Updater - `Click To Launch MOTO.bat`: The only Windows consumer entrypoint. It stays thin and always delegates to the Python launcher. -- `Launch MOTO.sh`: The Linux/Ubuntu consumer entrypoint. Same thin-wrapper contract as the `.bat`; delegates to `moto_launcher.py`. +- `linux-ubuntu-launcher.sh`: The Linux/Ubuntu consumer entrypoint. Same thin-wrapper contract as the `.bat`; delegates to `moto_launcher.py`. - `moto_launcher.py`: Orchestrates the launcher flow in order: update check, runtime resolution, dependency install, LM Studio detection, detached backend/frontend startup, and browser launch. 
- `moto_updater.py`: Owns Build 1 updater behavior, including GitHub `main` manifest fetch, install-state classification, clean-git fast-forward apply, ZIP overlay apply, rollback-aware relaunch, and launcher-managed instance safety checks. - `.moto_launcher_state.json`: Local-only state written by the launcher so future launches can detect still-open backend/frontend windows from the same install and skip update-apply until those windows are closed. @@ -345,19 +379,21 @@ project-root/ ### Shared Resources - `config.py`: RAGConfig, SystemConfig (context windows, chunk sizes, max output tokens, `generic_mode` flag) -- `models.py`: Pydantic models (ModelConfig, BoostConfig, WorkflowTask, ModelUsageTracker, FinalAnswerState) -- `lm_studio_client.py`: LM Studio HTTP client (completions, embeddings, model listing); unused in generic mode +- `models.py`: Pydantic models (ModelConfig with per-role `supercharge_enabled`, BoostConfig, WorkflowTask, ModelUsageTracker, FinalAnswerState) +- `lm_studio_client.py`: LM Studio HTTP client (completions, embeddings, model listing, same-base numeric `:#` instance sharing for independent calls); unused in generic mode - `openrouter_client.py`: OpenRouter HTTP client (credit exhaustion detection, fallback, model/provider endpoint metadata) -- `api_client_manager.py`: Unified API router (OpenRouter/LM Studio fallback + boost + model tracking); generic mode early-returns FastEmbed for embeddings -- `boost_manager.py`: Singleton boost manager (three modes: Boost Next X Calls, Always Prefer Boost, Category Boost; broadcasts events) +- `api_client_manager.py`: Unified API router (optional per-role Supercharge wrapper, OpenRouter/LM Studio fallback, boost, and model tracking); generic mode early-returns FastEmbed for embeddings +- `boost_manager.py`: Singleton boost manager (next-count, always-prefer, category, and per-task boost routing; broadcasts events) - `boost_logger.py`: Boost API call logger (persists boost-routed calls for the 
combined API log view) - `workflow_predictor.py`: Predicts next 20 API calls for internal boost routing (not displayed in UI) - `free_model_manager.py`: Free model rotation/cooldown singleton (looping, auto-selector `openrouter/free`, account exhaustion detection) -- `wolfram_alpha_client.py`: Wolfram Alpha API client. Exposed to the HighContextSubmitter.submit_construction loop as the `wolfram_alpha_query` tool (up to 20 calls per construction submission). +- `model_error_utils.py`: Shared non-retryable provider/config error detection; callers must pause/resume rather than convert those errors into proof or validation failures. +- `brainstorm_proof_gate.py`: Shared Lean 4 gate for optional proof-candidate brainstorm submissions before normal brainstorm validation. +- `wolfram_alpha_client.py`: Wolfram Alpha API client. Exposed to the HighContextSubmitter.submit_construction loop as the `wolfram_alpha_query` tool (up to 20 calls per construction submission); logs/broadcasts must redact raw query/result text. - `rag_lock.py`: Global RAG operation lock (prevents collision, retry logic for reads); embedding lock skip in generic mode (FastEmbed is in-process/thread-safe) - `token_tracker.py`: Cumulative input/output token tracker singleton with per-model breakdown and research timer. Reset on session start, timer start/stop tied to coordinator lifecycle. Stats broadcast via `token_usage_updated` WebSocket event after each successful LLM call. 
- `utils.py`: Token counting, text compression, file I/O -- `json_parser.py`: JSON parsing with sanitization for LLM responses; sanitizes reasoning tokens, markdown blocks, control tokens, LaTeX escapes, control characters; **rejects truncated JSON** (raises ValueError with diagnostics) to prevent corrupted content from passing validation +- `json_parser.py`: JSON parsing with sanitization for LLM responses; sanitizes reasoning tokens, markdown blocks, control tokens, LaTeX escapes, control characters; **rejects truncated JSON** (raises ValueError with diagnostics) to prevent corrupted content from passing validation; also provides `sanitize_model_output_for_retry_context()` so retries/memory/RAG can preserve visible failed-output excerpts without replaying known private thought/channel/control tokens or corrupting visible Lean/math syntax such as `<|`; retry-facing parser exceptions must not include raw response excerpts - `critique_memory.py`: Paper critique persistence (ratings, feedback, history, session-aware) - `critique_prompts.py`: Default critique prompt and builder function - `secret_store.py`: Secure API key persistence via OS keyring; bypassed in generic mode (keys are env-injected/in-memory only) @@ -381,7 +417,8 @@ project-root/ - `autonomous_coordinator.py`: Three-tier workflow orchestrator (Tier 1→2→3, triggers, crash recovery, invokes `ProofVerificationStage` after brainstorm/paper completion when `lean4_enabled`) - `autonomous_rag_manager.py`: Autonomous RAG wrapper -- `proof_verification_stage.py`: Proof pipeline orchestrator — candidate identification → per-candidate Phase A (Mathlib lemma search → optional SMT early-exit → Lean 4 formalization attempts, 5 retries per candidate) runs concurrently across all identified candidates bounded by `proof_max_parallel_candidates` (default 6) → Phase B (novelty check → `add_proof` → `ProofDependency` extraction → brainstorm/paper `append_proofs_section`) remains strictly serialized in Phase-A completion 
order. Per-source reservation lock prevents duplicate concurrent checks for the same `{source_type}:{source_id}`; `FreeModelExhaustedError` (or any Phase-A exception) cancels sibling tasks before the coordinator's recovery path runs. +- `proof_verification_stage.py`: Proof pipeline orchestrator — prompt-relevant candidate identification → per-candidate Phase A (Mathlib lemma search → optional SMT early-exit → Lean 4 formalization attempts, 5 retries per candidate) runs concurrently across all identified candidates bounded by `proof_max_parallel_candidates` (default 6) → Phase B (novelty check → `add_proof` → `ProofDependency` extraction → brainstorm/paper `append_proofs_section`) remains strictly serialized in Phase-A completion order. Per-source reservation lock prevents duplicate concurrent checks for the same `{source_type}:{source_id}`; `FreeModelExhaustedError` (or any Phase-A exception) cancels sibling tasks before the coordinator's recovery path runs. +- `proof_registration.py`: Shared verified-proof registration helper used by autonomous, compiler, aggregator, and LeanOJ proof flows. - `proof_dependency_extractor.py`: Parses verified Lean 4 code into `ProofDependency` records (imports, Mathlib lemmas, MOTO-origin proof ancestry). 
- Agents: `topic_selector.py`, `topic_validator.py`, `completion_reviewer.py`, `reference_selector.py`, `paper_title_selector.py`, `proof_identification_agent.py`, `proof_formalization_agent.py`, `lemma_search_agent.py` - Tier 3 Agents: `certainty_assessor.py`, `answer_format_selector.py`, `volume_organizer.py` @@ -389,18 +426,26 @@ project-root/ - Prompts: `topic_prompts.py`, `topic_exploration_prompts.py`, `completion_prompts.py`, `paper_reference_prompts.py`, `paper_title_exploration_prompts.py`, `paper_title_prompts.py`, `paper_redundancy_prompts.py`, `paper_continuation_prompts.py`, `proof_prompts.py`, `final_answer_prompts.py` - Memory: `brainstorm_memory.py`, `paper_library.py`, `research_metadata.py` (also stores the proof runtime config snapshot), `session_manager.py`, `autonomous_rejection_logs.py`, `topic_exploration_memory.py` (in-memory candidate DB), `paper_model_tracker.py` (per-paper model usage tracking and author attribution), `autonomous_api_logger.py` (API call logging singleton), `proof_database.py` (session-aware Lean 4 proof storage + novelty index + reverse Mathlib index + cross-session library access), `final_answer_memory.py` (model tracking, archival) +### LeanOJ Components + +- `leanoj_coordinator.py`: Runs the proof-only LeanOJ state machine. 
It uses parallel submitters plus batch validators for broad initial foundation topics and brainstorms; classifies accepted brainstorm context as `active_plan`, `verified_hint`, `refuted_construction`, or `scratch`; keeps ordinary partial `sorry` scaffolds and failed final attempts out of master-proof seeding unless explicitly elevated; persists accepted-idea `context_role` and chronological occurrence metadata; stores full proof memory independently from trimmed UI/status lists; rejects fake proof devices; persists final-cycle failure packets; emits LeanOJ progress events; routes prompt memory through allocated context blocks; passes the most recent 5 final attempts as compact final-solver execution feedback; and requires Final Proof Solver semantic review before a Lean-passing final proof stops as verified. +- `leanoj_context.py`: Owns LeanOJ artifact JSONL persistence under the active data root, direct-first allocation, final-solver context routing (verified subproofs + `active_plan` notes direct, refuted constructions only as compact warnings, ordinary partial scaffolds excluded from final direct proof context), source-name generation, RAG indexing, session-scoped retrieval with `include_source_prefixes`, direct-source exclusion, resume reload support, and Clear Progress cleanup for LeanOJ RAG sources. +- `prompts.py`: LeanOJ prompt builders for topic, brainstorm, prune review, path, subproof, final-solver editing, and final semantic review roles. 
These consume prepared context blocks (`direct_proof_context`, `rag_evidence_context`, `refuted_construction_warnings`, `capped_rejection_feedback`, `current_final_cycle_packet`) instead of owning persistence or truncation policy; prune prompts must conservatively ask whether any outdated/redundant memory should be removed or updated without forcing deletion; final-solver prompts must keep `master_proof.lean` to the current chosen proof route, include only compact recent-attempt execution feedback, and avoid accumulating explored/refuted routes. + ### API Routes - `compiler.py`: Compiler control (start/stop/status), paper/outline access, critique management - `autonomous.py`: Autonomous research control (start/stop/clear/status), brainstorm/paper access, Tier 3 endpoints - `proofs.py`: Proof database listing (`GET /`, `/novel`, `/known`) and `/status` runtime readiness — always available, never gated. `/{id}/certificate` and `/{id}/certificate.lean` — always available (data is stored on disk; Lean version info populated only when Lean is enabled). `/status` uses `asyncio.wait_for` timeouts (5s Lean, 3s Z3) so the endpoint never hangs. `POST /settings` runtime flag updates. `POST /check` manual proof check, `/{id}/dependencies`, `/graph`, `/mathlib/{lemma}/dependents` graph/lineage queries — gated on `lean4_enabled`. `GET /library` + `GET /library/{session_id}/{proof_id}` cross-session proof library endpoints — always available. +- `leanoj.py`: LeanOJ proof-solver routes for start/resume, stop, status, clear, skip-brainstorm, force-brainstorm, current proof listing/library, plus read-only `GET /api/leanoj/master-proof` and `/api/leanoj/master-proof/edits` for the durable master proof draft and compact edit-history summaries. ### Frontend Components -- `App.jsx`: Top-level GUI shell. Default mode is `Autonomous ASI S.T.E.M.` for Part 3 screens; `Advanced Manual ASI S.T.E.M.` contains the manual Part 1 Aggregator + Part 2 Compiler workspace. 
Shared utility controls (Boost, OpenRouter, WorkflowPanel) remain global, and Build 3C bootstraps `/api/features` here so hosted mode can hide LM Studio-only UI and copy. **Tab persistence**: `autonomousActiveTab` → `localStorage['autonomousActiveTab']`; `completedWorksSubTab` → `localStorage['completedWorksSubTab']`; `manualActiveTab` → `localStorage['manualActiveTab']`. **Autonomous tab groups**: main tabs (interface, brainstorms, papers, proofs, optional final-answer) + settings group (Your Completed Works Library, API Call Logs, Settings). The "Your Completed Works Library" tab hosts three sub-tabs rendered inside its content area: Stage 2 Papers History, Stage 3 Final Answers History, and Proof Library. +- `App.jsx`: Top-level GUI shell. Default mode is `Autonomous ASI S.T.E.M.` for Part 3 screens; `Advanced Manual ASI S.T.E.M.` contains the manual Part 1 Aggregator + Part 2 Compiler workspace; `LeanOJ Proof Solver` is a developer-mode-only proof mode. Shared utility controls (Boost, OpenRouter, WorkflowPanel) remain global, and Build 3C bootstraps `/api/features` here so hosted mode can hide LM Studio-only UI and copy. Shift + Z + X toggles persisted developer-mode settings, LeanOJ mode, raw JSON editors, and Supercharge controls. Supercharge request payloads must be forced off unless developer mode is active. **Tab persistence**: `autonomousActiveTab` → `localStorage['autonomousActiveTab']`; `completedWorksSubTab` → `localStorage['completedWorksSubTab']`; `manualActiveTab` → `localStorage['manualActiveTab']`; `leanojActiveTab` → `localStorage['leanojActiveTab']`. **Autonomous tab groups**: main tabs (interface, brainstorms, papers, proofs, optional final-answer) + settings group (Your Completed Works Library, API Call Logs, Settings). The "Your Completed Works Library" tab hosts three sub-tabs rendered inside its content area: Stage 2 Papers History, Stage 3 Final Answers History, and Proof Library. 
- **Aggregator**: `AggregatorInterface.jsx`, `AggregatorSettings.jsx`, `AggregatorLogs.jsx`, `LiveResults.jsx` - **Compiler**: `CompilerInterface.jsx`, `CompilerSettings.jsx`, `CompilerLogs.jsx`, `LivePaper.jsx` - **Autonomous**: `AutonomousResearchInterface.jsx`, `BrainstormList.jsx`, `PaperLibrary.jsx`, `AutonomousResearchSettings.jsx`, `AutonomousResearchLogs.jsx`, `LivePaperProgress.jsx`, `LiveTier3Progress.jsx`, `FinalAnswerView.jsx`, `FinalAnswerLibrary.jsx` (Stage 3 history sub-tab), `ArchiveViewerModal.jsx`, `MathematicalProofs.jsx` (live-session proof tab), `ProofGraph.jsx` (dependency graph), `ProofNotificationStack.jsx` (novel-proof popups), `ProofLibrary.jsx` (cross-session proof library sub-tab), `Stage2PaperHistory.jsx` (Stage 2 history sub-tab) +- **LeanOJ**: `LeanOJInterface.jsx`, `LeanOJBrainstorms.jsx`, `LeanOJLogs.jsx`, `LeanOJMasterProof.jsx`, `LeanOJMathematicalProofs.jsx`, `LeanOJProofLibrary.jsx`, `LeanOJSettings.jsx` - **Shared**: `StartupProviderSetupModal.jsx`, `OpenRouterApiKeyModal.jsx`, `PaperCritiqueModal.jsx`, `CritiqueNotificationStack.jsx`, `CreditExhaustionNotificationStack.jsx`, `HungConnectionNotificationStack.jsx`, `BoostControlModal.jsx`, `WorkflowPanel.jsx`, `TextFileUploader.jsx`, `OpenRouterPrivacyWarningModal.jsx`, `LatexRenderer.jsx` (dual view, KaTeX, theorem parsing), `LatexRenderer.css` - **Hooks**: `useProofCheckRuntime.js` (reads `/api/proofs/status` + runtime config so UI can enable/disable manual proof-check controls) - **Utils**: `downloadHelpers.js` (PDF/raw download), `modelCache.js` (display_name → api_id lookup), `openRouterSelection.js` (shared OpenRouter selector auto-fill helpers using model context and provider endpoint caps), `autonomousProfiles.js` (shared recommended-profile definitions + persistence helpers; when editing a preset, anchor to the exact profile block and exact nested role such as `validator` or `highContext`, never to a shared literal alone, then verify the diff only touched that intended 
profile/role), `disclaimerHelper.js` (frontend-only disclaimer injection), `api.js`, `websocket.js` diff --git a/.cursor/rules/rag-design-for-overall-program.mdc b/.cursor/rules/rag-design-for-overall-program.mdc index 6057269..6b79c5a 100644 --- a/.cursor/rules/rag-design-for-overall-program.mdc +++ b/.cursor/rules/rag-design-for-overall-program.mdc @@ -9,11 +9,31 @@ The RAG system in this program is very advanced, be certain that any changes you DIRECT INJECTION FIRST, RAG SECOND IF DIRECT INJECTION DOESN'T FIT. +Some inputs are **mandatory direct-inject** and must never be RAG'd, summarized, compressed, truncated, excerpted, or replaced by partial views. If mandatory direct-inject context does not fit the configured model context, halt with an explicit context-overflow error and tell the user which mandatory context overflowed. + If an item is direct injected, its RAG counterpart must NOT also be included. -**RAG Offload Priority — Submitter:** Shared Training DB → Local Submitter DB → Rejection Log → User Upload Files +### Paper-Writing RAG Modes + +These priorities apply to the Aggregator/Compiler/Autonomous paper-writing workflows. They do **not** describe LeanOJ proof-only memory ordering. + +**RAG Offload Priority — Paper-Writing Submitter:** Shared Training DB → Local Submitter DB → Rejection Log → User Upload Files + +**RAG Offload Priority — Paper-Writing Validator:** Shared Training DB → User Upload Files (submission under review is always direct injected) + +### LeanOJ Proof-Only RAG Mode + +These priorities apply only to the LeanOJ proof solver. LeanOJ stores proof artifacts under session-scoped sources such as `leanoj_{session_id}_accepted_ideas` and retrieves with `include_source_prefixes=[f"leanoj_{session_id}_"]`. Do not apply these orders to paper-writing prompts. + +**RAG Offload Priority — LeanOJ Final Solver:** Verified Subproofs → Partial Proof Scaffolds → Accepted Proof Memory Notes. 
Final-solver proof memory must not include recursive topics, historical final-cycle packets, failed-attempt counts, or phase-transition/path vocabulary. It is an edit-only mode. The prompt may separately include the most recent 5 final attempts as compact execution feedback for edit selection only; this is not proof evidence and must not seed `master_proof.lean`. -**RAG Offload Priority — Validator:** Shared Training DB → User Upload Files (submission under review is always direct injected) +**RAG Offload Priority — LeanOJ Proofstorm/Subproof Solver:** Current Final-Cycle Failure Packet (always direct if active) → Verified Subproofs → Relevant Partial Proof Scaffolds → Accepted Brainstorm Ideas → Historical Failed Attempts For Related Obstacles + +**RAG Offload Priority — LeanOJ Brainstorm After Final-Loop Failure:** Current final-attempt-cycle failure packet (always direct) → Accepted Brainstorm Ideas → Partial Proof Scaffolds → Verified Subproofs → Older Historical Final Failures (RAG only) + +**LeanOJ capped feedback rule:** Same-subproof prior attempt errors and rejection/failure summaries may stay capped as direct feedback. The final solver may receive compact execution feedback from the most recent 5 final attempts after filtering or rewriting path-transition vocabulary and final-cycle attempt-count summaries. Validator feedback rejecting non-progressive `master_proof.lean` shortening edits is allowed as direct final-solver feedback. The cap applies only to direct rejection/execution feedback, not to total persisted LeanOJ memory. + +**LeanOJ mandatory direct-inject inputs:** User problem, Lean template, JSON/schema/task instructions, and the canonical `master_proof.lean` during the final proof-editing loop are mandatory direct-inject context. The master proof is the active proof attempt and must be injected in full. It must never be RAG offloaded, summarized, compressed, truncated, chunk-windowed, or replaced by an excerpt. 
If the full master proof cannot fit with the other mandatory prompt context, LeanOJ must stop with a hard mandatory direct-context overflow error. ## Further RAG Specifications @@ -62,6 +82,9 @@ User-uploaded files: pre-generate ALL 4 configurations. Dynamic files (training - **Generic mode lock skip**: FastEmbed is in-process and thread-safe — embedding calls skip the global RAG lock. ChromaDB write locking remains in both modes. - **Read retry**: Vector search auto-retries with exponential backoff (0.5s → 1s → 2s, max 3 attempts) on HNSW index errors during concurrent writes - **Embedding rate limiting**: Semaphore limits concurrent embedding requests to 2 (default mode only; generic mode uses in-process FastEmbed) +- **FastAPI event-loop safety**: `rag_manager` must not run synchronous ChromaDB calls or CPU-heavy scoring directly on the event loop. Use `asyncio.to_thread()` for ChromaDB `add/query/get/delete` and for large in-memory vector/BM25 scoring. +- **Retrieval snapshots**: Take a stable chunk snapshot before threaded scoring so concurrent add/remove operations cannot mutate the iterable being scored. Worker-thread BM25/vector scoring should use local snapshot state and must not mutate shared caches such as `self.bm25_index`. +- **GUI responsiveness invariant**: RAG work runs inside long-lived research/proof tasks; it must never starve `/api/health`, `/api/features`, status polling, OpenRouter key-status, or WebSocket handling. --- @@ -69,12 +92,14 @@ User-uploaded files: pre-generate ALL 4 configurations. Dynamic files (training **Stage A — Query Rewriting**: Expands to 3-6 semantic variants; filters queries < 3 words; embeddings cached (500-entry LRU); variants batched into single embedding API call. -**Stage B — Hybrid Recall**: BM25 (exact terms) + ANN Cosine (semantic); top 120 from each, deduped by chunk_id. +**Stage B — Hybrid Recall**: BM25 (exact terms) + ANN Cosine (semantic); top 120 from each, deduped by chunk_id. 
Optional `include_sources` / `include_source_prefixes` scopes recall to named source files or source-name prefixes before reranking. Recall operates on a chunk snapshot; scoped in-memory vector fallback and BM25 scoring must run off-loop. **Stage C — Reranking + MMR**: Blend vector (60%) + BM25 (40%); MMR λ=0.8 (80% relevance, 20% diversity); removes near-duplicates (similarity > 0.85); hard cap at context budget. **Stage D — Packing**: Assembles evidence with headers; priority: document → section → relevance. Packs chunks incrementally until budget is reached (no compression — disabled as unreliable). Skips chunks from `exclude_sources` (content already direct-injected in prompt). Returns `ContextPack` with evidence tracking. +**Scoped retrieval**: `rag_manager.retrieve()` may receive `include_sources` and/or `include_source_prefixes` to restrict recall to a namespaced source set before reranking/packing. Use this for mode-specific memory namespaces such as LeanOJ so proof-solver artifacts cannot leak into unrelated paper-writing or compiler retrieval. `exclude_sources` still applies afterward for anti-duplication when a scoped source was direct-injected. + --- ## Multi-Configuration Chunk Storage @@ -103,6 +128,8 @@ User-uploaded files: pre-generate ALL 4 configurations. Dynamic files (training **Always direct injected**: User prompt/goal, JSON output format specs, system prompts. +**Mandatory direct-inject overflow**: Mandatory direct-inject inputs are not eligible for RAG fallback or compression. If they exceed available prompt context, halt with an explicit overflow error. Examples include the LeanOJ final-loop `master_proof.lean`, validator submissions under review, active Lean source/proof attempts, and proof-verification candidate theorem/formalization inputs. 
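The mandatory direct-inject overflow rule can be sketched as a pre-flight check. This is a minimal illustration, assuming the token budget formula stated in this rule file (`available_input = context_window - output_reserve - buffer(500)`); `MandatoryInput`, `MandatoryContextOverflowError`, and `check_mandatory_fit` are hypothetical names, not identifiers from the MOTO codebase, and token counts are assumed to come from the caller's tokenizer.

```python
from dataclasses import dataclass
from typing import Sequence

BUFFER_TOKENS = 500  # safety buffer taken from the token budget formula


class MandatoryContextOverflowError(RuntimeError):
    """Hypothetical error: mandatory direct-inject inputs do not fit."""


@dataclass(frozen=True)
class MandatoryInput:
    name: str    # e.g. "master_proof.lean" (illustrative)
    tokens: int  # token count measured by the caller


def available_input(context_window: int, output_reserve: int) -> int:
    """available_input = context_window - output_reserve - buffer(500)."""
    return context_window - output_reserve - BUFFER_TOKENS


def check_mandatory_fit(
    inputs: Sequence[MandatoryInput],
    context_window: int,
    output_reserve: int,
) -> int:
    """Return the tokens left for optional content, or halt on overflow.

    Mandatory inputs are never offloaded to RAG, compressed, or truncated,
    so the only outcomes are "fits in full" or an explicit error.
    """
    budget = available_input(context_window, output_reserve)
    needed = sum(item.tokens for item in inputs)
    if needed > budget:
        names = ", ".join(item.name for item in inputs)
        raise MandatoryContextOverflowError(
            f"mandatory direct-inject inputs need {needed} tokens "
            f"but only {budget} are available ({names})"
        )
    return budget - needed
```

Running the check before any optional allocation keeps the failure explicit: the caller either gets a remaining budget for direct-first/RAG content or a hard stop, never a silently shortened proof.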
+ **Token budget formula**: `available_input = context_window - output_reserve - buffer(500)` **Context allocation algorithm**: @@ -113,10 +140,12 @@ User-uploaded files: pre-generate ALL 4 configurations. Dynamic files (training **Key Invariant**: Context allocator returns content parts only. Prompt builder adds template parts (system prompt, JSON, user prompt). Both must be counted to avoid overflow. -**Overflow handling**: User prompt always direct injected; if exceeds `context_window - minimum_RAG_allocation`: HALT with error. Content too large: offload to RAG. Still doesn't fit: compress (NEVER truncate). +**Overflow handling**: User prompt always direct injected; if exceeds `context_window - minimum_RAG_allocation`: HALT with error. Mandatory direct-inject content that does not fit: HALT with explicit context-overflow error. Non-mandatory content too large: offload to RAG. Still doesn't fit: compress only when the mode explicitly allows compression (NEVER truncate). **Source Exclusion (anti-duplication)**: `rag_manager.retrieve(exclude_sources=[...])` filters chunks from named sources during Stage D packing. Callers pass source names of content already direct-injected so RAG budget goes entirely to non-duplicated content. +**Source Scoping (anti-leakage)**: `rag_manager.retrieve(include_sources=[...], include_source_prefixes=[...])` restricts recall to explicit sources before reranking/packing. Use this whenever a mode stores specialized memory in the shared Chroma collections and must prevent cross-mode retrieval. LeanOJ uses session-prefixed sources such as `leanoj_{session_id}_accepted_ideas` and retrieves with `include_source_prefixes=[f"leanoj_{session_id}_"]`. 
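Source scoping and source exclusion can be illustrated together with a small in-memory filter. This is only a sketch under stated assumptions: `Chunk` and `scope_chunks` are hypothetical stand-ins, not the real `rag_manager` internals, and the sketch collapses both filters into one pass, whereas the rule above applies scoping before reranking and exclusion during Stage D packing.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Sequence


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    source: str  # source file name, e.g. "leanoj_{session_id}_accepted_ideas"
    text: str


def scope_chunks(
    chunks: Iterable[Chunk],
    include_sources: Optional[Sequence[str]] = None,
    include_source_prefixes: Optional[Sequence[str]] = None,
    exclude_sources: Optional[Sequence[str]] = None,
) -> List[Chunk]:
    """Keep chunks inside the requested scope, then drop excluded sources."""
    snapshot = list(chunks)  # stable copy so concurrent edits cannot affect the loop
    allowed = set(include_sources or ())
    prefixes = tuple(include_source_prefixes or ())
    excluded = set(exclude_sources or ())
    scoped = bool(allowed or prefixes)  # no include arguments means "all sources"

    kept: List[Chunk] = []
    for chunk in snapshot:
        in_scope = chunk.source in allowed or chunk.source.startswith(prefixes)
        if scoped and not in_scope:
            continue  # anti-leakage: outside the requested namespace
        if chunk.source in excluded:
            continue  # anti-duplication: already direct-injected
        kept.append(chunk)
    return kept
```

With a session id of `abc`, passing `include_source_prefixes=["leanoj_abc_"]` keeps only that session's LeanOJ sources, and adding a direct-injected source name to `exclude_sources` removes it from the evidence set.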
+ | Mode | Excluded Sources | Reason | |---|---|---| | Compiler construction | `compiler_outline.txt`, `compiler_paper.txt`, brainstorm source (when direct-injected) | All three always direct-injected in construction prompts | @@ -124,6 +153,7 @@ User-uploaded files: pre-generate ALL 4 configurations. Dynamic files (training | Compiler rigor | `compiler_outline.txt` | Outline always direct-injected; paper intentionally RAG'd (smaller context) | | Aggregator submitter/validator | Direct-injected user file names + direct-injected shared-training sources (current training file + `rag_shared_training_update_*`) | Prevents RAG returning chunks already in direct context when only some content is offloaded | | Aggregator cleanup review | Same as above, when full submissions DB is direct-injected | Prevents cleanup RAG evidence from repeating already-injected submissions | +| LeanOJ proof solver | Direct-injected LeanOJ source names, scoped to `leanoj_{session_id}_*` sources | Keeps useful proof memory session-scoped and prevents cross-mode retrieval pollution | --- @@ -162,6 +192,8 @@ User-uploaded files: pre-generate ALL 4 configurations. Dynamic files (training **Proof Verification Stage (optional, gated on `lean4_enabled`)**: Proof identification, formalization, and lemma search agents operate outside the RAG pipeline. Verified `ProofRecord` summaries and `FailedProofCandidate` hints (from `proof_prompts.format_failure_hints_for_injection`) are **highest-priority direct injections** into subsequent brainstorm/paper submitter prompts when present — never RAG'd. Lean source files under the session `proofs/` directory are not indexed into Chroma. +**LeanOJ Proof Solver**: LeanOJ useful proof memory uses the existing RAG pipeline through `backend/leanoj/core/leanoj_context.py`, not a separate/simple retriever. Mandatory prompt inputs (user problem, Lean template, role task, JSON schema) stay direct. 
Useful artifacts (accepted ideas, recursive topics, verified subproofs, partial proof scaffolds, historical final attempts, final-cycle packets, failed subproof context) are persisted in full, direct-injected if they fit, otherwise indexed under session-scoped `leanoj_{session_id}_*` sources and retrieved with source scoping. Direct-injected LeanOJ sources must be excluded from RAG evidence. Current final-cycle failure packets are direct context for the next brainstorm/proofstorm phase; older final-cycle packets remain available through scoped RAG only. Recent rejection/error summaries remain capped direct feedback. During final proof-editing, allocation is narrower: no recursive topics, no historical final-cycle packets, no failed-attempt counts, and no phase-transition/path vocabulary; the prompt may still include the most recent 5 final attempts as capped execution feedback so the solver does not repeat stale edits or ignored Lean errors. Validator feedback from rejected non-progressive master-proof shortening edits may be direct feedback because it tells the next final solver what proof progress to restore. The canonical LeanOJ master proof draft (`master_proof.lean`) is file-backed working state, not a RAG artifact: during the final proof-editing loop it is mandatory direct-inject context and must be shown fully or the program must halt with a mandatory direct-context overflow error. Edits always apply to the full persisted proof. + **Embedding provider routing**: See dual-contract table above. Default mode uses LM Studio with OpenRouter fallback. Generic mode uses in-process FastEmbed. Both modes produce compatible vector dimensions for the same ChromaDB collections. 
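The LeanOJ direct-first behaviour described above (inject an artifact in full if it fits, otherwise leave it to scoped RAG, and exclude whatever was injected) might look roughly like the following. It is a sketch, not the logic in `leanoj_context.py`: `allocate_direct_first` is a hypothetical helper, and the greedy walk in priority order is an assumption made for illustration.

```python
from typing import List, Sequence, Tuple


def allocate_direct_first(
    artifacts: Sequence[Tuple[str, int]],
    budget_tokens: int,
) -> Tuple[List[str], List[str]]:
    """Split artifacts into direct-injected and RAG-offloaded source names.

    `artifacts` is a priority-ordered sequence of (source_name, token_count).
    An artifact is injected whole or not at all; nothing is truncated.
    """
    direct: List[str] = []
    offloaded: List[str] = []
    remaining = budget_tokens
    for source, tokens in artifacts:
        if tokens <= remaining:
            direct.append(source)      # fits in full: direct-inject
            remaining -= tokens
        else:
            offloaded.append(source)   # too large: retrieve through scoped RAG
    return direct, offloaded
```

The `direct` list doubles as the `exclude_sources` argument for the follow-up retrieval, so content already in the prompt is not returned again as RAG evidence.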
**Training DB files**: `rag_shared_training.txt` and `Summary_Of_Last_5_Validator_Rejections_For_Submitter_{num}.txt` live under the active instance data root (default desktop paths: `backend/data/rag_shared_training.txt` and `backend/data/Summary_Of_Last_5_Validator_Rejections_For_Submitter_{num}.txt`). @@ -170,11 +202,11 @@ User-uploaded files: pre-generate ALL 4 configurations. Dynamic files (training ## Agents Intentionally Without RAG -These agents use ONLY direct injection (no RAG fallback) by design. Each operates on compact metadata summaries where RAG is unnecessary. Documented in each file's module docstring. +These agents use ONLY direct injection for their compact metadata decision steps. If a listed agent has a later full-content expansion step, that expansion may use the normal direct-first/RAG fallback documented in its module docstring. | Agent | Inputs | Why No RAG | |---|---|---| -| Topic selector | Brainstorm metadata, paper titles/abstracts | Strategic "what to work on" decision — summaries suffice; abstracts truncated as overflow fallback | +| Topic selector | Brainstorm metadata, paper titles/abstracts | Strategic "what to work on" decision — bounded metadata summaries suffice | | Topic validator | Same as topic selector | Validates strategic decision, same compact metadata | | Paper title selector | Brainstorm summary, existing paper titles/abstracts | Title selection needs topic overview, not full content | | Paper redundancy checker | Paper titles/abstracts | Redundancy detected at abstract level, not full-content comparison | @@ -184,7 +216,7 @@ These agents use ONLY direct injection (no RAG fallback) by design. 
Each operate | Brainstorm continuation | Brainstorm summary, prior paper titles/abstracts | "Write another or move on" uses summary, not full DB | | Proof identification / formalization / lemma search | Candidate theorem text, Lean error output, targeted Mathlib lemma metadata | Operates on compact Lean source + structured hints; proof agents consume `ProofRecord` direct-injection summaries and do not route through the RAG pipeline | -**Known oversight**: Certainty assessor Step 2 drops expanded papers when they don't fit instead of RAG'ing them. Should use RAG fallback like reference_selector does. +**Certainty assessor overflow handling**: Certainty assessor Step 1 remains abstract/outline-only. Step 2 uses RAG fallback for requested expanded papers when full direct injection does not fit. --- @@ -195,7 +227,7 @@ These agents use ONLY direct injection (no RAG fallback) by design. Each operate 3. User files pre-generate 4 configs — no re-chunking during session 4. Dynamic files re-chunked on update — single config 5. Submitter cycling is independent — each maintains own cycle state -6. No truncation fallback — fails cleanly, uses RAG or compresses +6. No truncation fallback — mandatory direct-inject context fails cleanly; non-mandatory oversized content uses RAG or mode-approved compression 7. Evidence tracking mandatory — all facts map to source spans 8. User files protected from eviction — permanent cache 9. Contradiction check pre-acceptance @@ -205,3 +237,7 @@ These agents use ONLY direct injection (no RAG fallback) by design. Each operate 13. Per-size chunk cap (`max_chunks_per_size`) enforced after every add — prevents unbounded in-memory embedding growth 14. Agents that use only metadata summaries (topic selector, title selector, redundancy checker, etc.) intentionally skip RAG — see "Agents Intentionally Without RAG" table above 15. If content is already direct-injected, it must NOT also appear in RAG retrieval results — no duplication +16. 
Shared Chroma retrieval must use source scoping for mode/session-specific memory such as LeanOJ proof artifacts — no cross-mode memory leakage +17. LeanOJ `master_proof.lean` is mandatory full direct-inject context during the final proof-editing loop. Never RAG, summarize, compress, truncate, or window it. If it does not fit, halt with a mandatory direct-context overflow error. +18. Synchronous ChromaDB operations and heavy RAG scoring must be offloaded from the FastAPI event loop. +19. Threaded RAG scoring must use local snapshots and must not mutate shared retrieval indexes/caches. diff --git a/.dockerignore b/.dockerignore index 765032d..8e3b315 100644 --- a/.dockerignore +++ b/.dockerignore @@ -31,7 +31,7 @@ commits_pending.txt proof-integration-build*-plan.md Click To Launch MOTO.bat -Launch MOTO.sh +linux-ubuntu-launcher.sh moto_launcher.py moto_updater.py diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 0000000..3bdebe0 --- /dev/null +++ b/.gitattributes @@ -0,0 +1,16 @@ +* text=auto + +*.py text eol=lf +*.js text eol=lf +*.jsx text eol=lf +*.json text eol=lf +*.md text eol=lf +*.mdc text eol=lf +*.css text eol=lf +*.yml text eol=lf +*.yaml text eol=lf + +.gitignore text eol=lf +.gitattributes text eol=lf + +*.sh text eol=lf diff --git a/.gitignore b/.gitignore index 50261b8..5a6207c 100644 --- a/.gitignore +++ b/.gitignore @@ -21,6 +21,7 @@ wheels/ *.egg MANIFEST venv/ +.venv/ ENV/ env/ @@ -71,6 +72,8 @@ backend/data/auto_final_answer/* !backend/data/auto_final_answer/.gitkeep backend/data/auto_sessions/ +backend/data/leanoj_sessions/ +backend/data/leanoj_artifacts/ # Proof verification artifacts (Lean 4 / Z3 hybrid mode) backend/data/proofs/* @@ -123,4 +126,11 @@ htmlcov/ final_volume.txt RANDOM LOG.txt randomlog.txt +randomlog*.txt +leanoj_master_proof_*.lean.txt commits_pending.txt + +# Private/local planning notes that should not be published +HARDOJ_AWS_COMPUTE_DONATION_OUTLINE.md +LEANOJ_MASTER_PROOF_WRITER_REMAINDER.md 
+LEANOJ_PROBLEM_11_PROMPT.md diff --git a/README.md b/README.md index 3da2ffb..3aa162d 100644 --- a/README.md +++ b/README.md @@ -1,12 +1,12 @@ # MOTO Autonomous ASI -## An Autonomous Prototype Superintelligence - Automated Theorem Generation with Lean 4 Mathematics Proof Verification -**Version: 1.0.7** +## Autonomous Prototype Superintelligence - Automated Theorem Generation with Lean 4 Math Proof Verification +**Version: 1.0.8** [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) -[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) -[![Node.js 16+](https://img.shields.io/badge/node-16+-green.svg)](https://nodejs.org/) +[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) +[![Node.js 20.19+](https://img.shields.io/badge/node-20.19+-green.svg)](https://nodejs.org/) -**A breakthrough in AI automated theorem generation. An autonomous AI/ASI research system that generates novel and publication-worthy research papers — and the machine-checked theorem proving programming language Lean 4 proofs alongside them for definitive mathematical confirmation of correctness. This ASI is autonomously powered by Intrafere Research Group's new ASI discovery of [Top-P Exploration Through Structured Brainstorming & Validated Feedback](https://intrafere.com/structured-brainstorming-validated-feedback/). Top-P exploration assists in deciphering how we explore AI weights, a specific combination of reiterative brainstorming, validation, feedback, and pruning allows for superintelligence exploration and creative multi-model data extraction from nearly any combination of AI models. Additionally, MOTO ships an optional automated theorem generation pipeline that formalizes candidate theorems and lemmas in Lean 4 (with optional Z3/SMT hinting and Mathlib lemma search) and only stores proofs that Lean 4 accepts as genuinely verified. 
This exact version of MOTO is customized to be useful for any discipline with an interest in creative and novel solution generation in S.T.E.M.: physicists, engineers, mathematicians, chemists, etc. This harness can also easily be modified for topics such as general academic research, chatbots, niche research, robotics, or anything requiring creative output and/or general autonomy. MOTO's novel brainstorming and rejection/validation stage allows autonomous long-term runtime without user intervention — if desired, research can be conducted for days or weeks without user input.** +**A breakthrough in AI automated theorem generation. An autonomous AI/ASI research system that generates novel and publication-worthy research papers — and the machine-checked theorem proving programming language Lean 4 proofs alongside them for definitive mathematical confirmation of correctness. This ASI is autonomously powered by Intrafere Research Group's new ASI discovery of [Top-P Exploration Through Structured Brainstorming & Validated Feedback](https://intrafere.com/structured-brainstorming-validated-feedback/). Top-P exploration assists in deciphering how we explore AI weights, a specific combination of reiterative brainstorming, validation, feedback, and pruning allows for superintelligence exploration and creative multi-model data extraction from nearly any combination of AI models. Additionally, MOTO has optional automated theorem generation capabilities that formalize candidate theorems and lemmas in Lean 4 (with optional Z3/SMT hinting and Mathlib lemma search) and only stores proofs that Lean 4 accepts as genuinely mathematically verified. Lean 4 automation means the user gets guaranteed verification of the mathematical results produced. This exact version of MOTO is customized to be useful for any discipline with an interest in creative and novel solution generation in S.T.E.M.: physicists, engineers, mathematicians, chemists, researchers, etc. 
This harness can also easily be modified for topics such as general academic research, chatbots, niche research, robotics, or anything requiring creative output and/or general autonomy. MOTO's novel brainstorming and rejection/validation stage allows autonomous long-term runtime without user intervention — if desired, research can be conducted for days or weeks without user input.** ### The Core Discovery: Top-P Exploration @@ -22,7 +22,7 @@ MOTO may produce many brilliant papers as it runs; these intermediate papers are ### Secondary Feature: Automated Theorem Generation with Lean 4 Verification -Paired with Top-P Exploration — and secondary to it — MOTO ships an **optional automated theorem generation pipeline** that turns the autonomous brainstorm and paper stream into **machine-checked Lean 4 proofs**. When `lean4_enabled` is on, the coordinator first runs a one-shot *proof-framing gate* to decide whether the user's prompt is proof-amenable; if it is, every subsequent brainstorm and paper becomes a candidate source for formalization. After each completed brainstorm (Tier 1) and each completed paper (Tier 2 / Tier 3 chapter), a dedicated proof stage runs: +Paired with Top-P Exploration — and secondary to it — MOTO has an **optional automated theorem generation pipeline** that turns the autonomous brainstorm and paper stream into **machine-checked Lean 4 proofs**. When `lean4_enabled` is on, the coordinator first runs a one-shot *proof-framing gate* to decide whether the user's prompt is proof-amenable; if it is, every subsequent brainstorm and paper becomes a candidate source for formalization. After each completed brainstorm (Tier 1) and each completed paper (Tier 2 / Tier 3 chapter), a dedicated proof stage runs: 1. **Candidate identification** — an LLM agent extracts theorem/lemma candidates from the brainstorm or paper. 2. **Mathlib lemma search** — a second agent surfaces relevant existing Mathlib lemmas and threads them into the formalization prompt. 
@@ -33,7 +33,7 @@ Paired with Top-P Exploration — and secondary to it — MOTO ships an **option **Lean 4 is authoritative.** SMT results are hints only — they never substitute for Lean verification, and any proof that would compile only because of a `sorry` or `admit` is rejected. The pipeline is entirely silent and skipped when `lean4_enabled=False`, so it never blocks brainstorm or paper completion; the default hosted image stays Lean-free and Z3-free. A manual-check endpoint (`POST /api/proofs/check`) also lets you re-run the pipeline on any stored brainstorm or paper after the fact, and the compiler's "rigor mode" reuses the same Lean 4 checker to upgrade lemmas inside a paper as it's being written. -Give the program a try — MOTO is as cool as it sounds. Windows has a one-click launcher and Ubuntu 24.04 now has a repo-root launcher too. Use the two links below to download Python and Node.js, they should automatically install in seconds. Once those are downloaded, click the green "< > Code" drop-down menu on the top right of this GitHub page and download the zip file. On Windows, extract it to your desktop and double-click `Click To Launch MOTO.bat`. On Ubuntu 24.04, extract it and run `bash "Launch MOTO.sh"`. Put in your OpenRouter.AI API key (or optionally connect LM Studio for faster performance), select your agents in the settings profile - if desired and you are unsure you may use the preselected "fastest" profile. +Give the program a try — MOTO is as cool as it sounds. Windows has a one-click launcher and Ubuntu 24.04 now has a repo-root launcher too. Use the two links below to download Python and Node.js, they should automatically install in seconds. Once those are downloaded, click the green "< > Code" drop-down menu on the top right of this GitHub page and download the zip file. On Windows, extract it to your desktop and double-click `Click To Launch MOTO.bat`. On Ubuntu 24.04, extract it and run `bash linux-ubuntu-launcher.sh`. 
Put in your OpenRouter.AI API key (or optionally connect LM Studio for faster performance), select your agents in the settings profile - if desired and you are unsure you may use the preselected "fastest" profile. ***Now you are set up and every time you press launch your home lab is ready for your prompt!*** **Give MOTO the toughest question you can think of and press start to begin YOUR creations!** @@ -59,9 +59,9 @@ MOTO (Multi-Output Token Orchestrator) is a high-risk high-reward (novelty seeki Before installation, you need: -1. **Python 3.8+** - [Download here](https://www.python.org/downloads/) +1. **Python 3.10+** - [Download here](https://www.python.org/downloads/) - ⚠️ **IMPORTANT**: Check "Add Python to PATH" during installation -2. **Node.js 16+** - [Download here](https://nodejs.org/) +2. **Node.js 20.19+** - [Download here](https://nodejs.org/) 3. **LM Studio** (optional but HIGHLY recommended - otherwise your system will need to pay OpenRouter for RAG embedding calls, which is very slow compared to LM Studio's local embeddings) - [Download here](https://lmstudio.ai/) - If using OpenRouter, then download and load at least one model (e.g., DeepSeek, Llama, Qwen - older models and some models below 12 billion parameters may struggle; however, it is always worth a try!) - **Load the LM Studio RAG agent [optional but HIGHLY recommended for much faster outputs/answers]**: Load the embedding model `nomic-ai/nomic-embed-text-v1.5` in your LM Studio "Developer" tab (server tab) (search for "nomic-ai/nomic-embed-text-v1.5" to download it in the LM Studio downloads center). Please note: you may need to enable "Power User" or "Developer" to see this developer tab - this server will let you load the amount and capacity of simultaneous models that your PC will support. In this developer tab is where you load both your nomic-ai embedding agent and any optional local hosted agents you want to use in the program (e.g., GPT OSS 20b, DeepSeek 32B, etc.). 
**If you do not download LM Studio and enable the Nomic agent the system will run much slower and cost slightly more due to having to use the paid service OpenRouter for RAG calls.** @@ -69,6 +69,16 @@ Before installation, you need: 4. **If using cloud AI - Get an OpenRouter API key**: Sign up at OpenRouter.ai and get a paid or free API key to use the most powerful cloud models available from your favorite providers. OpenRouter may also offer a certain amount of free API calls per day with your account key. When you download MOTO Autonomous ASI, you can see which models are free by checking the "show only free models" check box(es) in the MOTO app settings. 5. **On first startup, pick your provider path**: After you acknowledge the disclaimer, MOTO will prompt you to either enter an OpenRouter key or confirm that LM Studio is running. If you save an OpenRouter key there, the recommended default autonomous profile is applied immediately so you can open Settings and see it already selected. +#### Optional Lean 4 / SMT Proof Verification Requirements + +Lean 4 proof verification is optional. The launcher prepares it when available, but normal brainstorming and paper generation still run when Lean 4 is disabled or unavailable. + +- **Lean 4 / elan / lake**: Required only when `lean4_enabled` is turned on. The launcher attempts a one-time `elan` install and expects both `lean` and `lake` to be available afterward. +- **Git and internet access**: Required for the first Lean 4 workspace setup because Mathlib is fetched through Lake. +- **Mathlib storage**: Plan on several additional GB for the repo-local Lean workspace, Mathlib sources, and prebuilt `.olean` cache. First setup can take a while. +- **Z3 / SMT**: Optional. When `smt_enabled` is turned on, MOTO uses Z3 only for conservative hints; Lean 4 remains authoritative. The launcher attempts to find or download Z3, and advanced users can provide a path through the proof settings or `MOTO_Z3_PATH`. 
+- **Linux note**: On Ubuntu 24.04, make sure `python3`, `python3-venv`, `bash`, `curl`, `git`, Node.js, and npm are available. A desktop keyring backend is recommended if you want provider keys saved securely. + ### Installation #### Windows (One-Click Launcher) @@ -98,7 +108,7 @@ Before installation, you need: 3. From the repo root, run: ```bash -bash "Launch MOTO.sh" +bash linux-ubuntu-launcher.sh ``` 4. The Ubuntu launcher will: @@ -112,13 +122,15 @@ bash "Launch MOTO.sh" **Ubuntu note:** If Playwright or the desktop keyring is unavailable, the launcher stays runnable and explains the limitation. Saved provider keys will only persist when a Linux desktop keyring backend is available. +**Linux support note:** Ubuntu 24.04 is the tested Linux launcher target. Other Linux distributions may work through the manual installation flow if they provide compatible Python, Node.js/npm, shell, keyring, Lean 4/elan, and browser dependencies, but they are best-effort unless explicitly tested. + ### Build Identity and Update Contract - `moto-update-manifest.json` is the authoritative Build 0 updater/build identity manifest for the `main` branch. - `GET /api/features` exposes the public build-comparison fields `version`, `build_commit`, `update_channel`, and `api_contract_version`. - Official update comparisons target GitHub `main`, not GitHub Releases. - `Click To Launch MOTO.bat` is the authoritative Windows launcher entrypoint and delegates to `moto_launcher.py`. -- `Launch MOTO.sh` is the authoritative Ubuntu 24.04 launcher entrypoint; it bootstraps the repo-local `.venv`, delegates to `moto_launcher.py`, and is used again for relaunch after an update when MOTO was started from that wrapper. +- `linux-ubuntu-launcher.sh` is the authoritative Ubuntu 24.04 launcher entrypoint; it bootstraps the repo-local `.venv`, delegates to `moto_launcher.py`, and is used again for relaunch after an update when MOTO was started from that wrapper. 
- Clean extracted ZIP installs and clean `main`-tracking git clones are the supported automatic update-apply targets. - Dirty or locally mutated repos remain runnable, but they are update-detection-only and are not eligible for automatic update-apply behavior. - If launcher-managed backend/frontend services from this install are still running, the updater warns and skips update-apply until those services are closed. @@ -173,7 +185,7 @@ bash "Launch MOTO.sh" ### Technology Stack -- **Backend**: Python 3.8+, FastAPI, Uvicorn +- **Backend**: Python 3.10+, FastAPI, Uvicorn - **Frontend**: React, Vite, Tailwind CSS - **AI**: LM Studio API, OpenRouter API - **RAG**: ChromaDB, Nomic Embeddings, or OpenRouter embeddings fallback if LM Studio is unavailable (not recommended - slower). @@ -207,6 +219,7 @@ moto-math-variant/ ├── .cursor/ │ └── rules/ # AI agent design specifications (full system documentation) ├── Click To Launch MOTO.bat # One-click Windows launcher +├── linux-ubuntu-launcher.sh # Ubuntu 24.04 launcher ├── moto_launcher.py # Internal Python launcher orchestration ├── moto_updater.py # Build 1 updater helper and launcher state manager ├── requirements.txt # Python dependencies @@ -323,7 +336,7 @@ All configurable per role: #### Manual Installation (All Platforms) -If you want the consumer launcher experience on Ubuntu 24.04, prefer `bash "Launch MOTO.sh"` instead of the manual steps below. The manual flow remains the fallback path when you intentionally want full terminal-level control. +If you want the consumer launcher experience on Ubuntu 24.04, prefer `bash linux-ubuntu-launcher.sh` instead of the manual steps below. The manual flow remains the fallback path when you intentionally want full terminal-level control. For normal desktop use, the launchers are preferred because they create the matching backend/frontend desktop API tokens automatically. 
```bash # Clone the repository @@ -346,13 +359,17 @@ mkdir -p backend/data/user_uploads mkdir -p backend/logs # Start the backend (in one terminal) -python -m uvicorn backend.api.main:app --host 0.0.0.0 --port 8000 +export MOTO_DESKTOP_API_TOKEN="local-dev-token" +python -m uvicorn backend.api.main:app --host 127.0.0.1 --port 8000 # Start the frontend (in another terminal) cd frontend +export VITE_MOTO_DESKTOP_API_TOKEN="local-dev-token" npm run dev ``` +On Windows PowerShell, use `$env:MOTO_DESKTOP_API_TOKEN="local-dev-token"` and `$env:VITE_MOTO_DESKTOP_API_TOKEN="local-dev-token"` instead of `export ...`. + Then open `http://localhost:5173` in your browser. --- @@ -427,9 +444,9 @@ All content generated by this system is for informational purposes only. Papers Best if you want to run local models in LM Studio, especially models above 20B parameters or larger MoE-style models. -- **OS**: Windows 10+, macOS 12+, Linux +- **OS**: Windows 10+, macOS 12+, Linux; Ubuntu 24.04 is the tested Linux launcher target - **RAM**: 32GB+ recommended -- **Storage**: 50GB+ free space for models and project data +- **Storage**: 50GB+ free space for models and project data; add several GB more if Lean 4 / Mathlib proof verification is enabled - **GPU**: 16GB+ VRAM recommended for practical local inference on 20B+ class models - **Internet**: Required for installation; optional afterward if staying local-only @@ -437,13 +454,13 @@ Best if you want to run local models in LM Studio, especially models above 20B p Best if you want the lightest local hardware requirements and are comfortable running inference in the cloud through OpenRouter. 
-- **OS**: Windows, macOS, Linux, or Raspberry Pi OS +- **OS**: Windows, macOS, Linux, or Raspberry Pi OS; Ubuntu 24.04 is the tested Linux launcher target - **RAM**: 4GB minimum, 8GB recommended -- **Storage**: 5GB+ free space +- **Storage**: 5GB+ free space for base MOTO; use 15GB+ if enabling Lean 4 / Mathlib proof verification - **GPU**: Not required - **Internet**: Required -Because the heavy model inference happens on OpenRouter, MOTO can run on very modest local hardware in this mode, including a Raspberry Pi, as long as it can run Python, Node.js, and maintain a stable internet connection. +Because the heavy model inference happens on OpenRouter, MOTO can run on very modest local hardware in this mode, including a Raspberry Pi, as long as it can run Python, Node.js, and maintain a stable internet connection. Lean 4 proof verification adds a local toolchain and Mathlib workspace requirement even in OpenRouter-only mode. --- diff --git a/backend/aggregator/agents/submitter.py b/backend/aggregator/agents/submitter.py index 0ca5646..0072d9f 100644 --- a/backend/aggregator/agents/submitter.py +++ b/backend/aggregator/agents/submitter.py @@ -13,8 +13,9 @@ from backend.shared.models import Submission, SubmitterState from backend.shared.lm_studio_client import lm_studio_client from backend.shared.api_client_manager import api_client_manager +from backend.shared.brainstorm_proof_gate import is_lean_proof_submission, verify_brainstorm_proof_candidate from backend.shared.openrouter_client import FreeModelExhaustedError -from backend.shared.json_parser import parse_json +from backend.shared.json_parser import parse_json, sanitize_model_output_for_retry_context from backend.autonomous.memory.proof_database import proof_database from backend.aggregator.core.context_allocator import context_allocator from backend.aggregator.core.queue_manager import queue_manager @@ -89,6 +90,12 @@ def set_task_tracking_callback(self, callback: Callable) -> None: def 
get_current_task_id(self) -> str: """Get the task ID for the current/next API call.""" return f"agg_sub{self.submitter_id}_{self.task_sequence:03d}" + + def _generation_temperature(self) -> float: + """Use diversified lanes only when the coordinator is running submitters in parallel.""" + if self.coordinator and not getattr(self.coordinator, "single_model_mode", False): + return api_client_manager.parallel_brainstorm_submitter_temperature(self.submitter_id) + return 0.0 async def start(self) -> None: """Start the submitter agent.""" @@ -250,7 +257,7 @@ async def _generate_submission(self) -> Optional[Submission]: role_id=self.role_id, model=self.model_name, messages=[{"role": "user", "content": prompt}], - temperature=0.0, # Deterministic generation - evolving context provides diversity + temperature=self._generation_temperature(), max_tokens=self.max_output_tokens # Per-submitter max output tokens ) call_metadata = api_client_manager.extract_call_metadata(response) @@ -349,13 +356,13 @@ async def _generate_submission(self) -> Optional[Submission]: ) try: - # CRITICAL FIX: Don't include full failed output - it can be 90K+ tokens! - # Truncate to prevent context overflow during retry + # Keep conversational retry context, but never replay private + # model thought/channel/control tokens as an assistant turn. 
max_failed_output_chars = 2000 # ~500 tokens - enough to show error context - if len(llm_output) > max_failed_output_chars: - failed_output_preview = llm_output[:max_failed_output_chars] + "\n[...output truncated for retry...]" - else: - failed_output_preview = llm_output + failed_output_preview = sanitize_model_output_for_retry_context( + llm_output, + max_chars=max_failed_output_chars, + ) # Calculate if conversation fits in context window prompt_tokens = count_tokens(prompt) @@ -426,11 +433,10 @@ async def _generate_submission(self) -> Optional[Submission]: ) try: - # Truncate retry output for second stage as well - if len(retry_output_1) > max_failed_output_chars: - retry_output_1_preview = retry_output_1[:max_failed_output_chars] + "\n[...truncated...]" - else: - retry_output_1_preview = retry_output_1 + retry_output_1_preview = sanitize_model_output_for_retry_context( + retry_output_1, + max_chars=max_failed_output_chars, + ) # Check if second retry conversation fits retry2_tokens = (prompt_tokens + preview_tokens + retry_prompt_tokens + @@ -516,7 +522,7 @@ async def _generate_submission(self) -> Optional[Submission]: # Record as rejection in local memory await self.local_memory.add_rejection( error_feedback, - llm_output[:750] + sanitize_model_output_for_retry_context(llm_output, max_chars=750) ) self._increment_rejection() # Notify task completed (failed but still completed) @@ -524,6 +530,83 @@ async def _generate_submission(self) -> Optional[Submission]: self.task_tracking_callback("completed", task_id) return None + proof_metadata = {} + if is_lean_proof_submission(parsed): + if not system_config.lean4_enabled: + await self.local_memory.add_rejection( + "Lean proof candidate rejected before validation because Lean 4 verification is disabled. 
" + "Submit a normal brainstorm idea or enable Lean 4 before choosing `submission_type: lean_proof`.", + str(parsed)[:750] + ) + self._increment_rejection() + if self.task_tracking_callback: + self.task_tracking_callback("completed", task_id) + return None + + validator_model = getattr(self.coordinator, "validator_model", self.model_name) if self.coordinator else self.model_name + validator_context = getattr(context_allocator, "validator_context_window", rag_config.validator_context_window) + validator_max_tokens = getattr(rag_config, "validator_max_output_tokens", self.max_output_tokens) + source_context = "\n\n".join( + part + for part in [ + allocation.get("direct", ""), + rag_evidence, + shared_training_content, + ] + if part + ) + gate_result = await verify_brainstorm_proof_candidate( + parsed=parsed, + user_prompt=self.user_prompt, + source_context=source_context, + model_id=self.model_name, + role_id=self.role_id, + task_id_prefix=f"{task_id}_lean", + max_tokens=self.max_output_tokens, + validator_model=validator_model, + validator_context=validator_context, + validator_max_tokens=validator_max_tokens, + validator_role_id="aggregator_validator", + max_attempts=5, + ) + if not gate_result.accepted: + await self.local_memory.add_rejection( + gate_result.failure_feedback, + str(parsed.get("lean_code") or parsed.get("submission") or parsed)[:750], + ) + self._increment_rejection() + if self.task_tracking_callback: + self.task_tracking_callback("completed", task_id) + return None + + parsed["submission"] = gate_result.submission_content + parsed["reasoning"] = gate_result.reasoning or parsed.get("reasoning", "") + proof_metadata = { + "brainstorm_lean_proof": { + "theorem_statement": gate_result.theorem_statement, + "theorem_name": gate_result.theorem_name, + "formal_sketch": gate_result.formal_sketch, + "lean_code": gate_result.lean_code, + "lean_feedback": gate_result.lean_feedback, + "reasoning": gate_result.reasoning, + "attempts": [ + 
attempt.model_dump(mode="json") + for attempt in (gate_result.attempts or []) + ], + "attempt_count": len(gate_result.attempts or []), + } + } + + if "submission" not in parsed or "reasoning" not in parsed: + await self.local_memory.add_rejection( + "Submission JSON missing required `submission` or `reasoning` fields after proof gating.", + str(parsed)[:750], + ) + self._increment_rejection() + if self.task_tracking_callback: + self.task_tracking_callback("completed", task_id) + return None + # Create submission submission = Submission( submission_id=str(uuid.uuid4()), @@ -535,6 +618,7 @@ async def _generate_submission(self) -> Optional[Submission]: "chunk_size": chunk_size, "rag_used": bool(allocation["rag_context"]), "llm_call": call_metadata, + **proof_metadata, } ) diff --git a/backend/aggregator/agents/validator.py b/backend/aggregator/agents/validator.py index 5551f25..6ce0dee 100644 --- a/backend/aggregator/agents/validator.py +++ b/backend/aggregator/agents/validator.py @@ -12,7 +12,7 @@ from backend.shared.lm_studio_client import lm_studio_client from backend.shared.api_client_manager import api_client_manager from backend.shared.openrouter_client import FreeModelExhaustedError -from backend.shared.json_parser import parse_json +from backend.shared.json_parser import parse_json, sanitize_model_output_for_retry_context from backend.autonomous.memory.proof_database import proof_database from backend.aggregator.core.context_allocator import context_allocator from backend.aggregator.memory.shared_training import shared_training_memory @@ -326,12 +326,13 @@ async def _assess_quality(self, submission: Submission) -> ValidationResult: ) try: - # CRITICAL FIX: Truncate failed output to prevent context overflow during retry + # Keep conversational retry context, but never replay private + # model thought/channel/control tokens as an assistant turn. 
max_failed_output_chars = 2000 # ~500 tokens - enough for error context - if len(llm_output) > max_failed_output_chars: - failed_output_preview = llm_output[:max_failed_output_chars] + "\n[...output truncated for retry...]" - else: - failed_output_preview = llm_output + failed_output_preview = sanitize_model_output_for_retry_context( + llm_output, + max_chars=max_failed_output_chars, + ) # Calculate if conversation fits in context window prompt_tokens = count_tokens(prompt) @@ -817,12 +818,13 @@ async def _retry_batch_json_parse( try: call_metadata = {} - # CRITICAL FIX: Truncate failed output to prevent context overflow during retry + # Keep conversational retry context, but never replay private + # model thought/channel/control tokens as an assistant turn. max_failed_output_chars = 2000 # ~500 tokens - enough for error context - if len(failed_output) > max_failed_output_chars: - failed_output_preview = failed_output[:max_failed_output_chars] + "\n[...output truncated for retry...]" - else: - failed_output_preview = failed_output + failed_output_preview = sanitize_model_output_for_retry_context( + failed_output, + max_chars=max_failed_output_chars, + ) # Calculate if conversation fits in context window from backend.shared.utils import count_tokens diff --git a/backend/aggregator/core/coordinator.py b/backend/aggregator/core/coordinator.py index 3b89d3a..8d0130e 100644 --- a/backend/aggregator/core/coordinator.py +++ b/backend/aggregator/core/coordinator.py @@ -12,13 +12,15 @@ import aiofiles from backend.shared.config import system_config, rag_config -from backend.shared.models import SystemStatus, Submission, ValidationResult, SubmitterConfig, WorkflowTask, ModelConfig +from backend.shared.models import SystemStatus, Submission, ValidationResult, SubmitterConfig, WorkflowTask, ModelConfig, ProofAttemptFeedback from backend.shared.lm_studio_client import lm_studio_client from backend.shared.rag_lock import rag_operation_lock from backend.shared.workflow_predictor 
import workflow_predictor from backend.shared.api_client_manager import api_client_manager from backend.shared.openrouter_client import FreeModelExhaustedError from backend.shared.free_model_manager import free_model_manager +from backend.shared.path_safety import resolve_path_within_root, validate_single_path_component +from backend.shared.log_redaction import redact_log_text from backend.aggregator.agents.submitter import SubmitterAgent from backend.aggregator.agents.validator import ValidatorAgent from backend.aggregator.core.queue_manager import queue_manager @@ -29,6 +31,36 @@ logger = logging.getLogger(__name__) +def _resolve_uploaded_user_file(file_ref: str, *, allow_trusted_context_files: bool = False) -> Optional[Path]: + """Resolve a user upload reference without exposing arbitrary local files.""" + raw_ref = str(file_ref or "").strip() + if not raw_ref: + return None + + uploads_root = Path(system_config.user_uploads_dir).resolve() + data_root = Path(system_config.data_dir).resolve() + + # Public uploads are logical filenames. Absolute paths are only accepted + # after the same root-containment check, so a caller cannot expand access. + try: + if Path(raw_ref).is_absolute(): + return resolve_path_within_root(uploads_root, raw_ref) + + safe_filename = validate_single_path_component(raw_ref, "uploaded filename") + return resolve_path_within_root(uploads_root, safe_filename) + except ValueError as exc: + upload_error = exc + + if allow_trusted_context_files: + try: + return resolve_path_within_root(data_root, raw_ref) + except ValueError: + pass + + logger.warning("Rejected unsafe uploaded file reference: %s", redact_log_text(upload_error, 240)) + return None + + class Coordinator: """ Coordinates the entire aggregator system. 
@@ -72,6 +104,7 @@ def __init__(self): self.single_model_mode = False self.submitter_configs: List[SubmitterConfig] = [] self.validator_model = "" + self.validator_provider = "lm_studio" # Workflow tracking self.workflow_tasks: List[WorkflowTask] = [] @@ -84,6 +117,12 @@ def __init__(self): # Cleanup review toggle (disabled for short-lived mini-brainstorm phases) self.enable_cleanup_review = True + + # Optional source-level hard cap used by autonomous brainstorm mode. + self.max_total_acceptances: Optional[int] = None + self.acceptance_count_offset: int = 0 + self.acceptance_cap_callback: Optional[Callable[[int], Any]] = None + self._acceptance_cap_reached = False async def _load_stats(self) -> None: """Load persisted stats from file.""" @@ -120,6 +159,43 @@ async def _save_stats(self) -> None: logger.debug("Saved stats to file") except Exception as e: logger.error(f"Failed to save stats: {e}") + + def _should_use_single_model_mode( + self, + submitter_configs: List[SubmitterConfig], + validator_model: str, + validator_provider: str, + loaded_models: List[str], + ) -> bool: + """ + Decide whether same-model aggregator roles must run sequentially. + + Multiple loaded LM Studio `:#` siblings of the same base model provide + safe submitter fan-out capacity; otherwise preserve the existing + sequential single-model mode. 
+ """ + all_models = [sc.model_id for sc in submitter_configs] + [validator_model] + if len(set(all_models)) != 1: + return False + + all_lm_studio = ( + validator_provider == "lm_studio" + and all(sc.provider == "lm_studio" for sc in submitter_configs) + ) + if not all_lm_studio: + return True + + sibling_count = lm_studio_client.count_sibling_instances_from_loaded(validator_model, loaded_models) + if sibling_count > 1: + logger.info( + "Single configured LM Studio model '%s' has %s loaded same-base instances; " + "using parallel submitter workflow with instance sharing.", + validator_model, + sibling_count, + ) + return False + + return True async def initialize( self, @@ -132,8 +208,14 @@ async def initialize( validator_max_tokens: Optional[int] = None, validator_provider: str = "lm_studio", validator_openrouter_provider: Optional[str] = None, + validator_openrouter_reasoning_effort: str = "auto", validator_lm_studio_fallback: Optional[str] = None, - enable_cleanup_review: bool = True + validator_supercharge_enabled: bool = False, + enable_cleanup_review: bool = True, + max_total_acceptances: Optional[int] = None, + acceptance_count_offset: int = 0, + acceptance_cap_callback: Optional[Callable[[int], Any]] = None, + allow_trusted_context_files: bool = False, ) -> None: """ Initialize the coordinator with configuration. 
@@ -148,12 +230,22 @@ async def initialize( validator_max_tokens: Optional max output tokens override for validator validator_provider: Provider for validator ("lm_studio" or "openrouter") validator_openrouter_provider: OpenRouter host provider for validator (e.g., "Anthropic") + validator_openrouter_reasoning_effort: OpenRouter reasoning effort for validator validator_lm_studio_fallback: LM Studio fallback model for validator when using OpenRouter + validator_supercharge_enabled: Whether validator answers should use Supercharge + max_total_acceptances: Optional hard cap for accepted submissions, including offset + acceptance_count_offset: Existing acceptances before this coordinator run + acceptance_cap_callback: Async callback fired when the cap is reached + allow_trusted_context_files: Allow internal callers to pass data-root files as context """ logger.info("Initializing coordinator...") # Store cleanup review toggle self.enable_cleanup_review = enable_cleanup_review + self.max_total_acceptances = max_total_acceptances + self.acceptance_count_offset = max(0, acceptance_count_offset) + self.acceptance_cap_callback = acceptance_cap_callback + self._acceptance_cap_reached = False # Validate submitter count num_submitters = len(submitter_configs) @@ -189,13 +281,21 @@ async def initialize( final_validator_max_output = validator_max_tokens if validator_max_tokens is not None else rag_config.validator_max_output_tokens context_allocator.set_context_windows(final_submitter_context, final_validator_context, final_submitter_max_output, final_validator_max_output) - # CRITICAL: Detect single-model mode ONLY based on configured model IDs + # Log currently loaded models for diagnostics and same-base instance scheduling. 
+ loaded_models = await lm_studio_client.get_loaded_models() + logger.info(f"Currently loaded models: {loaded_models}") + + # CRITICAL: Detect single-model mode based on configured model IDs, with + # an LM Studio sibling-instance exception for safe submitter fan-out. # Boost routing is INDEPENDENT of this decision and does NOT affect concurrency # Single-model mode prevents queue overflow when all agents share the same LM Studio server # Boost can route calls to OpenRouter even in single-model mode (if enabled) - all_models = [sc.model_id for sc in submitter_configs] + [validator_model] - unique_models = set(all_models) - self.single_model_mode = len(unique_models) == 1 + self.single_model_mode = self._should_use_single_model_mode( + submitter_configs, + validator_model, + validator_provider, + loaded_models, + ) if self.single_model_mode: logger.info( @@ -217,10 +317,6 @@ async def initialize( f"This does NOT affect parallel execution mode." ) - # Log currently loaded models for diagnostics - loaded_models = await lm_studio_client.get_loaded_models() - logger.info(f"Currently loaded models: {loaded_models}") - # CRITICAL: Warn user about potential context mismatches # LM Studio may not load models with requested context - this causes silent failures context_info = "\n".join([ @@ -260,19 +356,25 @@ async def initialize( # Load user files into RAG system user_files_content = {} - for file_path in user_files: - path = Path(file_path) + for file_ref in user_files: + path = _resolve_uploaded_user_file( + file_ref, + allow_trusted_context_files=allow_trusted_context_files, + ) + if path is None: + continue if path.exists(): # Add to RAG system with all 4 chunk configs await rag_manager.add_document( - file_path, + str(path), chunk_sizes=rag_config.submitter_chunk_intervals, is_user_file=True ) # Also load content for potential direct injection (async to avoid blocking) - async with aiofiles.open(file_path, 'r', encoding='utf-8') as f: + # codeql[py/path-injection]: 
path is resolved by _resolve_uploaded_user_file within uploads/data roots. + async with aiofiles.open(path, 'r', encoding='utf-8') as f: user_files_content[path.name] = await f.read() - logger.info(f"Loaded user file: {path.name}") + logger.info("Loaded user file: %s", redact_log_text(path.name, 120)) # Create submitter agents from configs (1-10 submitters with individual settings) self.submitters = [] @@ -301,9 +403,11 @@ async def initialize( provider=config.provider, model_id=config.model_id, openrouter_provider=config.openrouter_provider, + openrouter_reasoning_effort=config.openrouter_reasoning_effort, lm_studio_fallback_id=config.lm_studio_fallback_id, context_window=config.context_window, - max_output_tokens=config.max_output_tokens + max_output_tokens=config.max_output_tokens, + supercharge_enabled=config.supercharge_enabled ) ) logger.info(f"Created Submitter {config.submitter_id}: model={config.model_id}, provider={config.provider}, context={config.context_window}") @@ -326,9 +430,11 @@ async def initialize( provider=validator_provider, model_id=validator_model, openrouter_provider=validator_openrouter_provider, + openrouter_reasoning_effort=validator_openrouter_reasoning_effort, lm_studio_fallback_id=validator_lm_studio_fallback, context_window=final_validator_context, - max_output_tokens=final_validator_max_output + max_output_tokens=final_validator_max_output, + supercharge_enabled=validator_supercharge_enabled ) ) logger.info(f"Created Validator: model={validator_model}, provider={validator_provider}") @@ -627,6 +733,8 @@ async def _validator_loop(self) -> None: for submission, result in zip(submissions, results): if result.decision == "accept": await self._handle_acceptance(submission, result) + if self._acceptance_cap_reached: + break else: await self._handle_rejection(submission, result) @@ -721,6 +829,8 @@ async def _single_model_workflow(self) -> None: for submission, result in zip(submissions, results): if result.decision == "accept": await 
self._handle_acceptance(submission, result) + if self._acceptance_cap_reached: + break else: await self._handle_rejection(submission, result) @@ -755,10 +865,22 @@ async def _single_model_workflow(self) -> None: async def _handle_acceptance(self, submission: Submission, result: ValidationResult) -> None: """Handle accepted submission.""" + next_total_acceptances = self.acceptance_count_offset + self.total_acceptances + 1 + if ( + self.max_total_acceptances is not None + and next_total_acceptances > self.max_total_acceptances + ): + await self._handle_acceptance_cap_reached( + self.acceptance_count_offset + self.total_acceptances + ) + return + self.total_acceptances += 1 + total_acceptances_with_offset = self.acceptance_count_offset + self.total_acceptances # Add to shared training await shared_training_memory.add_accepted_submission(submission.content) + await self._register_accepted_brainstorm_proof(submission) # Notify submitter submitter = next((s for s in self.submitters if s.submitter_id == submission.submitter_id), None) @@ -812,6 +934,115 @@ async def _handle_acceptance(self, submission: Submission, result: ValidationRes # Trigger cleanup review every 7 acceptances if self.enable_cleanup_review and self.total_acceptances % 7 == 0 and self.total_acceptances > 0: await self._perform_cleanup_review() + + if ( + self.max_total_acceptances is not None + and total_acceptances_with_offset >= self.max_total_acceptances + ): + await self._handle_acceptance_cap_reached(total_acceptances_with_offset) + + async def _handle_acceptance_cap_reached(self, total_acceptances: int) -> None: + """Stop accepting new work once an optional source-level cap is reached.""" + if self._acceptance_cap_reached: + return + + self._acceptance_cap_reached = True + self.is_running = False + + logger.info( + "Acceptance cap reached at %s total acceptances; stopping aggregator at source", + total_acceptances, + ) + + await self._broadcast("acceptance_cap_reached", { + "total_acceptances": 
total_acceptances, + "max_total_acceptances": self.max_total_acceptances, + }) + + if self.acceptance_cap_callback: + try: + await self.acceptance_cap_callback(total_acceptances) + except Exception as e: + logger.error("Acceptance cap callback failed: %s", e, exc_info=True) + + current_task = asyncio.current_task() + for submitter in self.submitters: + try: + await submitter.stop() + except Exception as e: + logger.warning("Error stopping submitter after acceptance cap: %s", e) + + if self._main_task and self._main_task is not current_task and not self._main_task.done(): + self._main_task.cancel() + + def _brainstorm_proof_source_id(self) -> str: + """Derive a stable proof source id from the active brainstorm database path.""" + try: + stem = Path(shared_training_memory.file_path).stem + if stem.startswith("brainstorm_"): + return stem[len("brainstorm_"):] or stem + return stem or "manual_aggregator" + except Exception: + return "manual_aggregator" + + async def _register_accepted_brainstorm_proof(self, submission: Submission) -> None: + """Store validator-accepted Lean-verified brainstorm proofs in the proof database.""" + proof_payload = (submission.metadata or {}).get("brainstorm_lean_proof") + if not isinstance(proof_payload, dict): + return + + theorem_statement = str(proof_payload.get("theorem_statement") or "").strip() + lean_code = str(proof_payload.get("lean_code") or "").strip() + if not theorem_statement or not lean_code: + return + + try: + from backend.autonomous.core.proof_registration import register_verified_lean_proof + from backend.autonomous.memory.proof_database import proof_database + + attempts = [ + item if isinstance(item, ProofAttemptFeedback) else ProofAttemptFeedback.model_validate(item) + for item in (proof_payload.get("attempts") or []) + ] + source_id = self._brainstorm_proof_source_id() + source_title = (self.validator.user_prompt if self.validator else "")[:300] + registration = await register_verified_lean_proof( + 
proof_database=proof_database, + user_prompt=self.validator.user_prompt if self.validator else "", + theorem_statement=theorem_statement, + lean_code=lean_code, + validator_model=self.validator_model, + validator_context=rag_config.validator_context_window, + validator_max_tokens=rag_config.validator_max_output_tokens, + task_id=f"agg_proof_novelty_{self.total_acceptances:03d}", + role_id="aggregator_validator", + source_type="brainstorm", + source_id=source_id, + source_title=source_title, + theorem_id=f"brainstorm_submission_{self.total_acceptances}", + theorem_name=str(proof_payload.get("theorem_name") or ""), + formal_sketch=str(proof_payload.get("formal_sketch") or ""), + verification_notes="Lean 4 accepted this brainstorm proof before validator acceptance.", + attempt_count=int(proof_payload.get("attempt_count") or len(attempts)), + attempts=attempts, + broadcast_fn=self._broadcast, + base_event={ + "source_type": "brainstorm", + "source_id": source_id, + "submission_id": submission.submission_id, + "submitter_id": submission.submitter_id, + "trigger": "brainstorm_inline", + }, + proof_label=f"Brainstorm submission {self.total_acceptances}", + ) + submission.metadata["proof_id"] = registration.record.proof_id + except Exception as exc: + logger.warning( + "Accepted Lean brainstorm proof registration failed for submission %s: %s", + submission.submission_id, + exc, + exc_info=True, + ) async def _handle_rejection(self, submission: Submission, result: ValidationResult) -> None: """Handle rejected submission.""" diff --git a/backend/aggregator/core/queue_manager.py b/backend/aggregator/core/queue_manager.py index e7a7b50..4c5de1d 100644 --- a/backend/aggregator/core/queue_manager.py +++ b/backend/aggregator/core/queue_manager.py @@ -1,9 +1,9 @@ """ Submission queue manager. -Handles queue with special logic: if 10+ submissions waiting, skip to latest. +The coordinator handles overflow by pausing submitters; queued submissions stay FIFO. 
""" import asyncio -from typing import Optional, List +from typing import List, Optional from collections import deque import logging @@ -15,56 +15,20 @@ class QueueManager: """ - Thread-safe submission queue. - If queue >= 10 on next dequeue, jump to latest and clear rest. + Thread-safe FIFO submission queue. """ def __init__(self): self.queue: deque[Submission] = deque() self._lock = asyncio.Lock() - self._not_empty = asyncio.Event() self.overflow_threshold = system_config.queue_overflow_threshold async def enqueue(self, submission: Submission) -> None: """Add submission to queue.""" async with self._lock: self.queue.append(submission) - self._not_empty.set() logger.debug(f"Enqueued submission {submission.submission_id}. Queue size: {len(self.queue)}") - async def dequeue(self) -> Optional[Submission]: - """ - Dequeue next submission. - If queue >= overflow_threshold, skip to latest and clear rest. - """ - # Wait for queue to have items - while True: - async with self._lock: - if self.queue: - break - self._not_empty.clear() - - await self._not_empty.wait() - - async with self._lock: - # Check for overflow - if len(self.queue) >= self.overflow_threshold: - # Get latest submission - latest = self.queue[-1] - # Clear all others - cleared_count = len(self.queue) - 1 - self.queue.clear() - - logger.warning( - f"Queue overflow ({cleared_count + 1} submissions). " - f"Cleared {cleared_count} old submissions, processing latest." 
- ) - - return latest - else: - # Normal dequeue - return self.queue.popleft() - async def size(self) -> int: """Get current queue size.""" async with self._lock: @@ -79,7 +43,6 @@ async def clear(self) -> None: """Clear the queue.""" async with self._lock: self.queue.clear() - self._not_empty.clear() logger.info("Queue cleared") async def peek(self) -> Optional[Submission]: diff --git a/backend/aggregator/core/rag_manager.py b/backend/aggregator/core/rag_manager.py index 4e1e35c..676463d 100644 --- a/backend/aggregator/core/rag_manager.py +++ b/backend/aggregator/core/rag_manager.py @@ -19,6 +19,7 @@ from backend.shared.api_client_manager import api_client_manager from backend.shared.rag_lock import rag_operation_lock from backend.shared.utils import count_tokens, compress_text +from backend.shared.log_redaction import redact_log_text from backend.aggregator.ingestion.pipeline import ingestion_pipeline logger = logging.getLogger(__name__) @@ -69,7 +70,8 @@ async def add_document( self, file_path: str, chunk_sizes: List[int] = None, - is_user_file: bool = False + is_user_file: bool = False, + trusted_roots: List[str | Path] | None = None, ) -> None: """ Add a document to the RAG system. 
@@ -80,11 +82,18 @@ async def add_document( is_user_file: Whether this is a user file (never evicted) """ try: + if trusted_roots is None: + trusted_roots = [ + system_config.data_dir, + system_config.user_uploads_dir, + ] + # Ingest document chunks_by_size = await ingestion_pipeline.ingest_file( file_path, chunk_sizes, - is_user_file + is_user_file, + trusted_roots=trusted_roots, ) # Add to ChromaDB and memory @@ -106,10 +115,14 @@ async def add_document( # Enforce per-size chunk cap await self._enforce_chunk_cap() - logger.info(f"Added document: {file_path}") + logger.info("Added document: %s", redact_log_text(Path(file_path).name, 120)) except Exception as e: - logger.error(f"Failed to add document {file_path}: {e}") + logger.error( + "Failed to add document %s: %s", + redact_log_text(Path(file_path).name, 120), + redact_log_text(e, 240), + ) raise async def add_text( @@ -155,10 +168,14 @@ async def add_text( # Enforce per-size chunk cap await self._enforce_chunk_cap() - logger.info(f"Added text: {source_name}") + logger.info("Added text: %s", redact_log_text(source_name, 120)) except Exception as e: - logger.error(f"Failed to add text {source_name}: {e}") + logger.error( + "Failed to add text %s: %s", + redact_log_text(source_name, 120), + redact_log_text(e, 240), + ) raise async def retrieve( @@ -166,7 +183,9 @@ async def retrieve( query: str, chunk_size: int = 512, max_tokens: int = None, - exclude_sources: Optional[List[str]] = None + exclude_sources: Optional[List[str]] = None, + include_sources: Optional[List[str]] = None, + include_source_prefixes: Optional[List[str]] = None ) -> ContextPack: """ 4-stage retrieval pipeline. 
@@ -176,6 +195,8 @@ async def retrieve( chunk_size: Chunk size to retrieve from max_tokens: Maximum tokens in result exclude_sources: Source names to skip during packing (already direct-injected) + include_sources: Optional source allowlist for scoped retrieval + include_source_prefixes: Optional source-name prefixes for scoped retrieval Returns: ContextPack with retrieved context @@ -189,7 +210,18 @@ async def retrieve( # Stage B: Hybrid Recall (BM25 + Vector) logger.debug(f"RAG Stage 2/4: Hybrid recall (BM25 + Vector) with chunk_size={chunk_size}") - candidates = await self._hybrid_recall(queries, chunk_size) + if include_sources or include_source_prefixes: + logger.info( + "RAG Stage 2/4: Restricting retrieval scope to sources=%s prefixes=%s", + include_sources or [], + include_source_prefixes or [], + ) + candidates = await self._hybrid_recall( + queries, + chunk_size, + include_sources=include_sources, + include_source_prefixes=include_source_prefixes, + ) logger.debug(f"RAG Stage 2/4 complete: Retrieved {len(candidates)} candidate chunks") # Stage C: Reranking + MMR @@ -213,14 +245,19 @@ async def _add_chunks(self, chunks: List[DocumentChunk], chunk_size: int) -> Non texts = [chunk.text for chunk in chunks] + embeddings = None + lock_acquired = False if system_config.generic_mode: embeddings = await api_client_manager.get_embeddings(texts) await rag_operation_lock.acquire(f"RAGManager add_chunks write (size={chunk_size})") + lock_acquired = True else: await rag_operation_lock.acquire(f"RAGManager add_chunks (size={chunk_size})") - embeddings = await api_client_manager.get_embeddings(texts) - + lock_acquired = True try: + if embeddings is None: + embeddings = await api_client_manager.get_embeddings(texts) + # Update chunks with embeddings and tokens for chunk, embedding in zip(chunks, embeddings): chunk.embedding = embedding @@ -229,7 +266,8 @@ async def _add_chunks(self, chunks: List[DocumentChunk], chunk_size: int) -> Non # ChromaDB writes stay under the 
global RAG lock in both modes. collection = self.collections[chunk_size] try: - collection.add( + await asyncio.to_thread( + collection.add, ids=[chunk.chunk_id for chunk in chunks], embeddings=embeddings, documents=texts, @@ -247,7 +285,8 @@ async def _add_chunks(self, chunks: List[DocumentChunk], chunk_size: int) -> Non # Invalidate BM25 index for this size self.bm25_index[chunk_size] = None finally: - rag_operation_lock.release() + if lock_acquired: + rag_operation_lock.release() async def _rewrite_query(self, query: str) -> List[str]: """Stage A: Expand query into semantic variants.""" @@ -283,18 +322,31 @@ async def _rewrite_query(self, query: str) -> List[str]: async def _hybrid_recall( self, queries: List[str], - chunk_size: int + chunk_size: int, + include_sources: Optional[List[str]] = None, + include_source_prefixes: Optional[List[str]] = None ) -> List[Tuple[DocumentChunk, float]]: """Stage B: Hybrid BM25 + Vector search.""" - chunks = self.chunks_by_size[chunk_size] + # Work from a stable snapshot so threaded scoring does not race with + # concurrent RAG add/remove operations mutating the live chunk lists. 
+ chunks = list(self._filter_chunks_by_source_scope( + self.chunks_by_size[chunk_size], + include_sources=include_sources, + include_source_prefixes=include_source_prefixes, + )) if not chunks: return [] # Vector search - vector_results = await self._vector_search(queries, chunk_size) + vector_results = await self._vector_search(queries, chunk_size, candidate_chunks=chunks) # BM25 search - bm25_results = self._bm25_search(queries, chunk_size) + bm25_results = await asyncio.to_thread( + self._bm25_search, + queries, + chunk_size, + chunks, + ) # Combine and deduplicate combined = {} @@ -315,17 +367,26 @@ async def _hybrid_recall( async def _vector_search( self, queries: List[str], - chunk_size: int + chunk_size: int, + candidate_chunks: Optional[List[DocumentChunk]] = None ) -> List[Tuple[DocumentChunk, float]]: """Vector similarity search with retry logic for HNSW index race conditions.""" collection = self.collections[chunk_size] - chunks = self.chunks_by_size[chunk_size] + chunks = candidate_chunks if candidate_chunks is not None else self.chunks_by_size[chunk_size] if not chunks: return [] query_embeddings = await api_client_manager.get_embeddings(queries) + if candidate_chunks is not None and len(candidate_chunks) != len(self.chunks_by_size[chunk_size]): + return await asyncio.to_thread( + self._score_vector_candidates, + query_embeddings, + chunks, + ) + all_results = [] + chunk_by_id = {chunk.chunk_id: chunk for chunk in chunks} for query_embedding in query_embeddings: # Search with retry logic for transient HNSW errors during concurrent writes max_retries = 3 @@ -334,7 +395,8 @@ async def _vector_search( for attempt in range(max_retries): try: - results = collection.query( + results = await asyncio.to_thread( + collection.query, query_embeddings=[query_embedding], n_results=min(rag_config.hybrid_recall_top_k, len(chunks)) ) @@ -359,7 +421,7 @@ async def _vector_search( # Map back to chunks for chunk_id, distance in zip(results['ids'][0], 
results['distances'][0]): - chunk = next((c for c in chunks if c.chunk_id == chunk_id), None) + chunk = chunk_by_id.get(chunk_id) if chunk: # Convert distance to similarity (cosine distance -> similarity) similarity = 1.0 - distance @@ -374,23 +436,43 @@ async def _vector_search( unique_results.append((chunk, score)) return unique_results[:rag_config.hybrid_recall_top_k] + + def _score_vector_candidates( + self, + query_embeddings: List[List[float]], + chunks: List[DocumentChunk], + ) -> List[Tuple[DocumentChunk, float]]: + """Score a scoped chunk snapshot in memory without blocking the event loop.""" + scored: List[Tuple[DocumentChunk, float]] = [] + for query_embedding in query_embeddings: + for chunk in chunks: + if not chunk.embedding: + continue + scored.append((chunk, self._cosine_similarity(query_embedding, chunk.embedding))) + + seen = set() + unique_results = [] + for chunk, score in sorted(scored, key=lambda x: x[1], reverse=True): + if chunk.chunk_id in seen: + continue + seen.add(chunk.chunk_id) + unique_results.append((chunk, score)) + return unique_results[:rag_config.hybrid_recall_top_k] def _bm25_search( self, queries: List[str], - chunk_size: int + chunk_size: int, + candidate_chunks: Optional[List[DocumentChunk]] = None ) -> List[Tuple[DocumentChunk, float]]: """BM25 lexical search.""" - chunks = self.chunks_by_size[chunk_size] + chunks = list(candidate_chunks) if candidate_chunks is not None else list(self.chunks_by_size[chunk_size]) if not chunks: return [] - # Build or get BM25 index - if self.bm25_index[chunk_size] is None: - corpus = [chunk.tokens for chunk in chunks] - self.bm25_index[chunk_size] = BM25Okapi(corpus) - - bm25 = self.bm25_index[chunk_size] + # Build a local index for the snapshot. This runs in a worker thread + # and intentionally does not mutate self.bm25_index across threads. 
+ bm25 = BM25Okapi([chunk.tokens for chunk in chunks]) all_scores = np.zeros(len(chunks)) for query in queries: @@ -407,6 +489,25 @@ def _bm25_search( results = [(chunks[i], float(all_scores[i])) for i in top_indices if all_scores[i] > 0] return results + + @staticmethod + def _filter_chunks_by_source_scope( + chunks: List[DocumentChunk], + *, + include_sources: Optional[List[str]] = None, + include_source_prefixes: Optional[List[str]] = None + ) -> List[DocumentChunk]: + """Limit chunks to an explicit source allowlist and/or source prefixes.""" + include_set = {source for source in (include_sources or []) if source} + prefixes = tuple(prefix for prefix in (include_source_prefixes or []) if prefix) + if not include_set and not prefixes: + return chunks + + scoped = [] + for chunk in chunks: + if chunk.source_file in include_set or (prefixes and chunk.source_file.startswith(prefixes)): + scoped.append(chunk) + return scoped def _rerank_and_diversify( self, @@ -610,7 +711,7 @@ async def _enforce_chunk_cap(self) -> None: if evict_ids: collection = self.collections[chunk_size] try: - collection.delete(ids=evict_ids) + await asyncio.to_thread(collection.delete, ids=evict_ids) except Exception as e: logger.error(f"ChromaDB delete during chunk cap enforcement (size={chunk_size}): {e}") @@ -634,16 +735,27 @@ async def _evict_lru_document(self) -> None: return # Evict the oldest document - logger.info(f"LRU eviction: Removing oldest document '{oldest_doc}' (last accessed: {oldest_time})") + logger.info( + "LRU eviction: Removing oldest document '%s' (last accessed: %s)", + redact_log_text(oldest_doc, 120), + oldest_time, + ) try: await self.remove_document(oldest_doc) # Remove from access tracking if oldest_doc in self.document_access_order: del self.document_access_order[oldest_doc] - logger.info(f"LRU eviction complete: '{oldest_doc}' removed successfully") + logger.info( + "LRU eviction complete: '%s' removed successfully", + redact_log_text(oldest_doc, 120), + ) except 
Exception as e: - logger.error(f"LRU eviction failed for '{oldest_doc}': {e}") + logger.error( + "LRU eviction failed for '%s': %s", + redact_log_text(oldest_doc, 120), + redact_log_text(e, 240), + ) async def remove_document(self, source_name: str) -> None: """Remove a document from all collections.""" @@ -659,9 +771,9 @@ async def remove_document(self, source_name: str) -> None: # Remove from ChromaDB collection = self.collections[chunk_size] # Get IDs for this source - results = collection.get(where={"source_file": source_name}) + results = await asyncio.to_thread(collection.get, where={"source_file": source_name}) if results['ids']: - collection.delete(ids=results['ids']) + await asyncio.to_thread(collection.delete, ids=results['ids']) # Invalidate BM25 self.bm25_index[chunk_size] = None @@ -675,7 +787,7 @@ async def remove_document(self, source_name: str) -> None: if source_name in self.permanent_documents: self.permanent_documents.discard(source_name) - logger.info(f"Removed document: {source_name}") + logger.info("Removed document: %s", redact_log_text(source_name, 120)) def clear_all_documents(self) -> None: """Clear all documents from RAG database (synchronous for cleanup). 
diff --git a/backend/aggregator/ingestion/pipeline.py b/backend/aggregator/ingestion/pipeline.py index a4d681a..dd4a568 100644 --- a/backend/aggregator/ingestion/pipeline.py +++ b/backend/aggregator/ingestion/pipeline.py @@ -8,6 +8,8 @@ import logging from backend.shared.models import DocumentChunk +from backend.shared.path_safety import resolve_path_within_root +from backend.shared.log_redaction import redact_log_text from backend.aggregator.ingestion.normalizer import normalize_text from backend.aggregator.ingestion.chunker import chunker @@ -21,7 +23,8 @@ async def ingest_file( self, file_path: str, chunk_sizes: List[int] = None, - is_user_file: bool = False + is_user_file: bool = False, + trusted_roots: List[str | Path] | None = None, ) -> Dict[int, List[DocumentChunk]]: """ Ingest a file and return chunks at multiple sizes. @@ -35,15 +38,26 @@ async def ingest_file( Dict mapping chunk_size -> list of DocumentChunks """ try: + resolved_path = Path(file_path) + if trusted_roots: + for root in trusted_roots: + try: + resolved_path = resolve_path_within_root(Path(root), str(file_path)) + break + except ValueError: + continue + else: + raise ValueError("File path is outside trusted ingestion roots") + # Read file - async with aiofiles.open(file_path, 'r', encoding='utf-8') as f: + async with aiofiles.open(resolved_path, 'r', encoding='utf-8') as f: text = await f.read() # Normalize text normalized_text = normalize_text(text) # Get file name - file_name = Path(file_path).name + file_name = resolved_path.name # Chunk at multiple sizes chunks_by_size = chunker.chunk_text( @@ -53,12 +67,15 @@ async def ingest_file( is_user_file ) - logger.info(f"Ingested {file_name}: {sum(len(chunks) for chunks in chunks_by_size.values())} total chunks") + logger.info( + "Ingested trusted file into %s total chunks", + sum(len(chunks) for chunks in chunks_by_size.values()), + ) return chunks_by_size except Exception as e: - logger.error(f"Failed to ingest file {file_path}: {e}") + 
logger.error("Failed to ingest trusted file: %s", redact_log_text(e, 240)) raise async def ingest_text( @@ -92,12 +109,20 @@ async def ingest_text( is_user_file ) - logger.info(f"Ingested {source_name}: {sum(len(chunks) for chunks in chunks_by_size.values())} total chunks") + logger.info( + "Ingested %s: %s total chunks", + redact_log_text(source_name, 120), + sum(len(chunks) for chunks in chunks_by_size.values()), + ) return chunks_by_size except Exception as e: - logger.error(f"Failed to ingest text {source_name}: {e}") + logger.error( + "Failed to ingest text %s: %s", + redact_log_text(source_name, 120), + redact_log_text(e, 240), + ) raise diff --git a/backend/aggregator/memory/local_training.py b/backend/aggregator/memory/local_training.py index 9ac8a23..a501a8b 100644 --- a/backend/aggregator/memory/local_training.py +++ b/backend/aggregator/memory/local_training.py @@ -9,7 +9,10 @@ import logging from backend.shared.config import system_config, rag_config -from backend.shared.utils import truncate_with_ellipsis +from backend.shared.json_parser import ( + RETRY_CONTEXT_EMPTY_PLACEHOLDER, + sanitize_model_output_for_retry_context, +) logger = logging.getLogger(__name__) @@ -69,9 +72,14 @@ async def add_rejection( submission_content: Original submission (first 750 chars) """ async with self._lock: - # Truncate to limits - summary = truncate_with_ellipsis(validator_summary, 750) - preview = truncate_with_ellipsis(submission_content, 750) + # This log is reused as submitter context, so sanitize at the memory + # boundary rather than persisting raw provider/model transcript text. + summary = sanitize_model_output_for_retry_context(validator_summary, max_chars=750) + preview = sanitize_model_output_for_retry_context(submission_content, max_chars=750) + if summary == RETRY_CONTEXT_EMPTY_PLACEHOLDER: + summary = "Validator rejection summary unavailable after retry-context sanitization." 
+ if preview == RETRY_CONTEXT_EMPTY_PLACEHOLDER: + preview = "Rejected submission preview unavailable after retry-context sanitization." # Add rejection self.rejections.append({ diff --git a/backend/aggregator/prompts/submitter_prompts.py b/backend/aggregator/prompts/submitter_prompts.py index 5bb1fbf..dd55ee4 100644 --- a/backend/aggregator/prompts/submitter_prompts.py +++ b/backend/aggregator/prompts/submitter_prompts.py @@ -21,7 +21,7 @@ def get_submitter_system_prompt() -> str: 1. Analyze the user's prompt and provided context carefully 2. Build upon the shared training database (accepted submissions from other agents) 3. Learn from your rejection history to avoid repeating mistakes -4. Generate novel, valuable mathematical insights that advance the solution +4. Generate novel, valuable mathematical progress that advances the solution ⚠️ CRITICAL - INTERNAL CONTENT WARNING ⚠️ @@ -43,13 +43,25 @@ def get_submitter_system_prompt() -> str: --- YOUR TASK: -Generate a novel mathematical insight that advances the user's goal. +Generate the strongest rigorous mathematical contribution you can toward the user's goal, preferring direct solutions, direct partial solutions, impossibility results, exact reductions, or sharp constraints whenever they are justified. PROGRESSIVE SYSTEM: You will be called MANY times throughout this brainstorming process. Each call should produce ONE deep, well-developed mathematical insight. Do not try to cover everything at once — focus on thoroughly developing a single avenue per submission with full rigor. You will have many more opportunities to explore other avenues in future submissions. -Focus on mathematical concepts, theorems, techniques, and proofs that may provide an avenue towards solving or understanding the mathematical problem in the prompt. Use all available resources including web search if available. 
+DIRECT-SOLUTION PREFERENCE: +- If you can directly solve the user's problem, a clearly necessary subproblem, or prove a meaningful impossibility/limitation result, do that FIRST +- Prefer contributions that close the problem, partially close it, or sharply reduce what remains +- Use indirect background, exploratory framing, or supportive observations ONLY when a stronger direct step is not yet justified + +META-PHASE EXCEPTION: +If the USER PROMPT explicitly says TOPIC EXPLORATION PHASE or PAPER TITLE EXPLORATION PHASE, follow that requested output format exactly: +- For TOPIC EXPLORATION PHASE, propose one candidate brainstorm question optimized for producing a future direct answer +- For PAPER TITLE EXPLORATION PHASE, propose one candidate paper title optimized for communicating the paper's direct answer-bearing content +- In these meta-phases, do NOT solve the mathematical problem or write the paper unless the user prompt explicitly asks for that; the direct-solution preference means the candidate should point toward or communicate direct resolution + +Focus on mathematical concepts, theorems, techniques, and proofs that solve, partially solve, refute, or sharply characterize the mathematical problem in the prompt whenever possible. Use all available resources including web search if available. WHAT MAKES A VALUABLE SUBMISSION - Consider: +- Does it directly answer, partially answer, or sharply constrain the user's problem or a necessary subproblem? - Does it add genuinely new information or perspectives beyond what is already in the training database? - Does it connect existing mathematical concepts in novel ways? - Does it provide concrete methods, theorems, proofs, or mathematical techniques? 
@@ -59,6 +71,7 @@ def get_submitter_system_prompt() -> str: CRITICAL REQUIREMENTS - CONTENT: - ALL submissions must be rooted in sound mathematical reasoning - NO unfounded claims or logical fallacies +- Prefer directly resolving the user's problem or a clearly necessary subproblem over auxiliary exposition - Focus on mathematical concepts, theorems, and techniques that are verifiable and established - Be specific and actionable, not vague or generic - Avoid redundancy with existing accepted submissions @@ -67,17 +80,36 @@ def get_submitter_system_prompt() -> str: - Unsupported empirical or artifact claims must be framed as proposals, hypotheses, or future work rather than as completed results Your submission will be validated against these criteria: +- Does it provide the strongest direct progress currently justified? - Does it meaningfully advance the solution space? - Is it based on sound mathematical principles? - Does it avoid contradictions? - Is it non-redundant with existing knowledge? - Is it mathematically rigorous? -Output your response ONLY as JSON in this exact format: +OPTIONAL LEAN 4 PROOF ROUTE: +If Lean 4 proof verification is enabled and you can produce a complete Lean 4 proof that would be useful brainstorm progress, you may choose the `lean_proof` submission type. A Lean proof candidate is NOT added directly to the knowledge base: the system first runs Lean 4, gives you up to 5 repair attempts with Lean/integrity feedback, and only then sends the Lean-verified proof to the normal brainstorm validator for usefulness and redundancy review. + +Use `lean_proof` only for complete proof code you genuinely expect Lean 4 to accept. Do not use `sorry`, `admit`, or fake `axiom`/`constant`/`opaque` devices. 
+ +Output your response ONLY as JSON in one of these exact formats: + +Normal brainstorm idea: { + "submission_type": "idea", "submission": "Your detailed mathematical submission describing concepts, theorems, proofs, and approaches based on established mathematical principles.", "reasoning": "Brief explanation of why this submission is valuable" } + +Lean proof candidate: +{ + "submission_type": "lean_proof", + "theorem_statement": "Natural-language statement of the theorem or lemma proved by the Lean code.", + "formal_sketch": "Brief note about assumptions, formalization choices, and why this proof helps the brainstorm.", + "theorem_name": "Optional Lean declaration name", + "lean_code": "Complete Lean 4 code expected to verify.", + "reasoning": "Why this verified proof would be a useful brainstorm addition" +} """ @@ -85,11 +117,23 @@ def get_submitter_json_schema() -> str: """Get JSON schema specification for submitter.""" return """ REQUIRED JSON FORMAT: +Normal brainstorm idea: { + "submission_type": "idea", "submission": "string - your detailed mathematical submission with theorems, proofs, and techniques", "reasoning": "string - explanation of submission value" } +Lean proof candidate, only when Lean 4 is enabled and you can provide complete code: +{ + "submission_type": "lean_proof", + "theorem_statement": "string - natural-language statement proved", + "formal_sketch": "string - formalization notes", + "theorem_name": "string - optional Lean declaration name", + "lean_code": "string - complete Lean 4 source code", + "reasoning": "string - why the verified proof would help the brainstorm" +} + CRITICAL JSON ESCAPE RULES: 1. 
Backslashes: ALWAYS use double backslash (\\\\) for any backslash in your text - Example: Write "\\\\tau" not "\\tau", write "\\\\(" not "\\(" @@ -103,15 +147,27 @@ def get_submitter_json_schema() -> str: Example (mathematical proof): { + "submission_type": "idea", "submission": "The problem of squaring the circle is equivalent to constructing a line segment of length \\\\sqrt{\\\\pi} using only compass and straightedge. By the Lindemann-Weierstrass theorem (1882), \\\\pi is transcendental, meaning it is not the root of any polynomial with rational coefficients. Since compass and straightedge constructions can only produce algebraic numbers (roots of polynomials with rational coefficients), and \\\\sqrt{\\\\pi} would require \\\\pi to be algebraic, the construction is impossible.", "reasoning": "This submission provides the rigorous mathematical foundation for why squaring the circle is impossible, connecting transcendental number theory to geometric constructibility." } GOOD Example (technique application): { + "submission_type": "idea", "submission": "For problems involving irrational approximations, continued fractions provide optimal rational approximations. The continued fraction expansion of \\\\pi = [3; 7, 15, 1, 292, ...] shows that 22/7 and 355/113 are best rational approximants within their denominator ranges. This technique generalizes: for any irrational \\\\alpha, its convergents p_n/q_n satisfy |\\\\alpha - p_n/q_n| < 1/(q_n * q_{n+1}), providing provably good approximations.", "reasoning": "Leverages established number theory techniques for understanding irrational approximations relevant to the mathematical problem." 
} + +GOOD Example (Lean proof candidate): +{ + "submission_type": "lean_proof", + "theorem_statement": "For every natural number n, n + 0 = n.", + "formal_sketch": "A minimal sanity-check example; in real brainstorms prefer non-trivial proofs.", + "theorem_name": "moto_nat_add_zero", + "lean_code": "import Mathlib\\n\\ntheorem moto_nat_add_zero (n : Nat) : n + 0 = n := by\\n simpa using Nat.add_zero n", + "reasoning": "Demonstrates the Lean proof-candidate format." +} """ diff --git a/backend/aggregator/prompts/validator_prompts.py b/backend/aggregator/prompts/validator_prompts.py index 52822de..0b694c1 100644 --- a/backend/aggregator/prompts/validator_prompts.py +++ b/backend/aggregator/prompts/validator_prompts.py @@ -13,6 +13,12 @@ - If a submission offers an unsupported benchmark-style idea that is still useful, it must be framed as a proposed experiment, hypothesis, expected benefit, or future-work direction rather than as a completed result. - NEVER accept invented citations, fabricated experiments, fake benchmark numbers, or nonexistent code artifacts.""" +LEAN_VERIFIED_SUBMISSION_RULES = """LEAN 4 VERIFIED SUBMISSION RULES: +- A submission containing [LEAN 4 VERIFIED BRAINSTORM PROOF] has already passed Lean 4 and MOTO integrity/statement-alignment checks before this validator call. +- Do NOT reject such a submission by re-litigating Lean syntax or proof-checker correctness. +- Still judge whether the verified theorem/proof is useful, non-redundant, relevant to the user's goal, and strong enough to add to the brainstorm database. 
+- Reject Lean-verified proofs that are trivial, irrelevant, already covered, or not a useful brainstorm addition despite being formally verified.""" + def get_validator_system_prompt() -> str: """Get system prompt for validator agent.""" @@ -29,7 +35,7 @@ def get_validator_system_prompt() -> str: - NEVER cite internal documents as authoritative or established sources - Question and validate every assertion, even if it appears in validated content -""" + EMPIRICAL_PROVENANCE_VALIDATION_RULES + """ +""" + EMPIRICAL_PROVENANCE_VALIDATION_RULES + "\n\n" + LEAN_VERIFIED_SUBMISSION_RULES + """ The internal context shows what has been explored by AI agents, NOT what has been proven correct. Your role is to generate rigorous, verifiable mathematical content. Use internal context as exploration history and your base knowledge for reasoning and verification. @@ -38,13 +44,25 @@ def get_validator_system_prompt() -> str: --- YOUR TASK: -Tell me if the addition of the new submission increases potential solution availability in a significant way and/or provides a valuable solution space-constraint that narrows where we need to search in a significant way. +Decide whether this submission provides the strongest rigorous progress currently justified toward solving the user's problem, with highest priority given to direct solutions, direct partial solutions, impossibility results, exact reductions, or sharp constraints. + +Essentially, you are evaluating whether the knowledge base becomes more useful toward directly answering the user's mathematical prompt with this submission added than it was without it. -Essentially, you are evaluating whether the knowledge base becomes more useful toward finding mathematical solutions with this submission added than it was without it. +CRITICAL: You are NOT generating solutions yourself. 
You are judging whether this submission directly solves, partially solves, refutes, or materially enables the user's problem better than the current knowledge base does. -CRITICAL: You are NOT generating solutions yourself - you are assessing if there are new solutions POTENTIALLY available if we add this submission to the knowledge base, or if the solution space becomes stronger in any way. +DIRECT-SOLUTION PREFERENCE: +- If the submission directly resolves the user's problem, a clearly necessary subproblem, or proves a meaningful impossibility/limitation result, that is the strongest kind of acceptance case +- If no direct resolution is available, accept supportive material only when it materially increases the chance of a later direct answer +- Do not reward breadth, novelty, or interesting side observations over a stronger direct result + +META-PHASE EXCEPTION: +If the USER PROMPT explicitly says TOPIC EXPLORATION PHASE or PAPER TITLE EXPLORATION PHASE, evaluate the submission as the requested candidate artifact, not as a direct solution: +- TOPIC EXPLORATION PHASE: accept a candidate brainstorm question if it is specific, distinct, relevant, grounded, and aimed at a strong direct-answer path +- PAPER TITLE EXPLORATION PHASE: accept a candidate title if it is accurate, specific, distinct, professional, and foregrounds direct answer-bearing content when justified +- Do NOT reject these meta-phase submissions merely because they are questions or titles rather than mathematical solutions EVALUATION CRITERIA - Consider: +- Does the submission directly answer, partially answer, refute, or sharply constrain the user's problem or a necessary subproblem? - Does the submission add genuinely new information or perspectives beyond what is already accepted? - Does the submission connect existing mathematical concepts in novel ways? - Does the submission provide concrete methods, theorems, proofs, or mathematical techniques? 
@@ -57,9 +75,9 @@ def get_validator_system_prompt() -> str: VALIDATION DECISION RULES: A submission should be ACCEPTED if it: -1. Increases potential solution availability in a significant way, OR -2. Provides valuable solution space constraints that narrow where to search, OR -3. Offers novel mathematical insights not present in existing accepted submissions, OR +1. Directly solves, partially solves, or proves a meaningful impossibility/limitation result for the user's problem or a necessary subproblem, OR +2. Provides valuable solution space constraints that sharply narrow where a direct answer can lie, OR +3. Offers rigorous enabling insights not present in existing accepted submissions when a stronger direct step is not yet available, OR 4. Presents rigorous mathematical arguments based on established principles A submission should be REJECTED if it: @@ -71,8 +89,9 @@ def get_validator_system_prompt() -> str: 6. Contains logical fallacies or mathematically unsound reasoning 7. Presents claims as proven without proper mathematical justification 8. Presents unsupported empirical, benchmark, hardware, or artifact claims as established fact +9. Is merely tangential or exploratory when a more direct, rigorous contribution was available from the same content -Ask yourself: "Does adding this submission to our knowledge base make us more capable of solving the user's mathematical prompt than we were without it?" +Ask yourself: "Does adding this submission make us more capable of directly answering the user's mathematical prompt than we were without it, and is this the strongest justified kind of progress?" 
REJECTION FEEDBACK FORMAT: If rejecting, your "summary" field must provide CONCRETE, ACTIONABLE guidance using this structure: @@ -199,7 +218,7 @@ def get_validator_dual_system_prompt() -> str: - NEVER cite internal documents as authoritative or established sources - Question and validate every assertion, even if it appears in validated content -""" + EMPIRICAL_PROVENANCE_VALIDATION_RULES + """ +""" + EMPIRICAL_PROVENANCE_VALIDATION_RULES + "\n\n" + LEAN_VERIFIED_SUBMISSION_RULES + """ The internal context shows what has been explored by AI agents, NOT what has been proven correct. Your role is to generate rigorous, verifiable mathematical content. Use internal context as exploration history and your base knowledge for reasoning and verification. @@ -211,11 +230,23 @@ def get_validator_dual_system_prompt() -> str: Evaluate EACH submission INDEPENDENTLY to determine if it would make a valuable cumulative addition to the shared knowledge base. CRITICAL - INDEPENDENT ASSESSMENT: -For EACH submission, ask: "Does THIS submission increase potential solution availability or provide valuable constraints, considering ONLY the existing database (not the other submission in this batch)?" +For EACH submission, ask: "Does THIS submission provide the strongest rigorous direct progress currently justified toward the user's problem, considering ONLY the existing database (not the other submission in this batch)?" + +Essentially, you are evaluating whether the training database becomes more useful toward directly answering the user's mathematical prompt with each submission added than it was without it. -Essentially, you are evaluating whether the training database becomes more useful toward finding mathematical solutions with each submission added than it was without it. 
+DIRECT-SOLUTION PREFERENCE: +- Prefer submissions that directly solve, partially solve, refute, or sharply constrain the problem +- Accept supportive material only when it materially enables a later direct answer and no stronger direct step is currently justified +- Do not prefer broader or more novel side ideas over a stronger direct result + +META-PHASE EXCEPTION: +If the USER PROMPT explicitly says TOPIC EXPLORATION PHASE or PAPER TITLE EXPLORATION PHASE, evaluate each submission as the requested candidate artifact, not as a direct solution: +- TOPIC EXPLORATION PHASE: accept a candidate brainstorm question if it is specific, distinct, relevant, grounded, and aimed at a strong direct-answer path +- PAPER TITLE EXPLORATION PHASE: accept a candidate title if it is accurate, specific, distinct, professional, and foregrounds direct answer-bearing content when justified +- Do NOT reject these meta-phase submissions merely because they are questions or titles rather than mathematical solutions EVALUATION CRITERIA (Apply to EACH submission independently): +- Does the submission directly answer, partially answer, refute, or sharply constrain the user's problem or a necessary subproblem? - Does the submission add genuinely new information or perspectives beyond what is already accepted? - Does the submission connect existing mathematical concepts in novel ways? - Does the submission provide concrete methods, theorems, proofs, or mathematical techniques? @@ -227,9 +258,9 @@ def get_validator_dual_system_prompt() -> str: VALIDATION DECISION RULES (for each submission): A submission should be ACCEPTED if it: -1. Increases potential solution availability in a significant way, OR -2. Provides valuable solution space constraints that narrow where to search, OR -3. Offers novel mathematical insights not present in existing accepted submissions, OR +1. 
Directly solves, partially solves, or proves a meaningful impossibility/limitation result for the user's problem or a necessary subproblem, OR +2. Provides valuable solution space constraints that sharply narrow where a direct answer can lie, OR +3. Offers rigorous enabling insights not present in existing accepted submissions when a stronger direct step is not yet available, OR 4. Presents rigorous mathematical arguments based on established principles A submission should be REJECTED if it: @@ -239,6 +270,7 @@ def get_validator_dual_system_prompt() -> str: 4. Is too vague or generic to be actionable 5. Contains logical fallacies or mathematically unsound reasoning 6. Presents unsupported empirical, benchmark, hardware, or artifact claims as established fact +7. Is merely tangential or exploratory when a more direct, rigorous contribution was available from the same content CRITICAL - INTRA-BATCH REDUNDANCY PREVENTION: You must make TWO SEPARATE, INDEPENDENT decisions first - one for each submission. @@ -422,7 +454,7 @@ def get_validator_triple_system_prompt() -> str: - NEVER cite internal documents as authoritative or established sources - Question and validate every assertion, even if it appears in validated content -""" + EMPIRICAL_PROVENANCE_VALIDATION_RULES + """ +""" + EMPIRICAL_PROVENANCE_VALIDATION_RULES + "\n\n" + LEAN_VERIFIED_SUBMISSION_RULES + """ The internal context shows what has been explored by AI agents, NOT what has been proven correct. Your role is to generate rigorous, verifiable mathematical content. Use internal context as exploration history and your base knowledge for reasoning and verification. @@ -434,11 +466,23 @@ def get_validator_triple_system_prompt() -> str: Evaluate EACH submission INDEPENDENTLY to determine if it would make a valuable cumulative addition to the shared knowledge base. 
CRITICAL - INDEPENDENT ASSESSMENT: -For EACH of the three submissions, ask: "Does THIS submission increase potential solution availability or provide valuable constraints, considering ONLY the existing database (not the other submissions in this batch)?" +For EACH of the three submissions, ask: "Does THIS submission provide the strongest rigorous direct progress currently justified toward the user's problem, considering ONLY the existing database (not the other submissions in this batch)?" + +Essentially, you are evaluating whether the training database becomes more useful toward directly answering the user's mathematical prompt with each submission added than it was without it. -Essentially, you are evaluating whether the training database becomes more useful toward finding mathematical solutions with each submission added than it was without it. +DIRECT-SOLUTION PREFERENCE: +- Prefer submissions that directly solve, partially solve, refute, or sharply constrain the problem +- Accept supportive material only when it materially enables a later direct answer and no stronger direct step is currently justified +- Do not prefer broader or more novel side ideas over a stronger direct result + +META-PHASE EXCEPTION: +If the USER PROMPT explicitly says TOPIC EXPLORATION PHASE or PAPER TITLE EXPLORATION PHASE, evaluate each submission as the requested candidate artifact, not as a direct solution: +- TOPIC EXPLORATION PHASE: accept a candidate brainstorm question if it is specific, distinct, relevant, grounded, and aimed at a strong direct-answer path +- PAPER TITLE EXPLORATION PHASE: accept a candidate title if it is accurate, specific, distinct, professional, and foregrounds direct answer-bearing content when justified +- Do NOT reject these meta-phase submissions merely because they are questions or titles rather than mathematical solutions EVALUATION CRITERIA (Apply to EACH submission independently): +- Does the submission directly answer, partially answer, refute, or 
sharply constrain the user's problem or a necessary subproblem? - Does the submission add genuinely new information or perspectives beyond what is already accepted? - Does the submission connect existing mathematical concepts in novel ways? - Does the submission provide concrete methods, theorems, proofs, or mathematical techniques? @@ -450,9 +494,9 @@ def get_validator_triple_system_prompt() -> str: VALIDATION DECISION RULES (for each submission): A submission should be ACCEPTED if it: -1. Increases potential solution availability in a significant way, OR -2. Provides valuable solution space constraints that narrow where to search, OR -3. Offers novel mathematical insights not present in existing accepted submissions, OR +1. Directly solves, partially solves, or proves a meaningful impossibility/limitation result for the user's problem or a necessary subproblem, OR +2. Provides valuable solution space constraints that sharply narrow where a direct answer can lie, OR +3. Offers rigorous enabling insights not present in existing accepted submissions when a stronger direct step is not yet available, OR 4. Presents rigorous mathematical arguments based on established principles A submission should be REJECTED if it: @@ -462,6 +506,7 @@ def get_validator_triple_system_prompt() -> str: 4. Is too vague or generic to be actionable 5. Contains logical fallacies or mathematically unsound reasoning 6. Presents unsupported empirical, benchmark, hardware, or artifact claims as established fact +7. Is merely tangential or exploratory when a more direct, rigorous contribution was available from the same content CRITICAL - INTRA-BATCH REDUNDANCY PREVENTION: You must make THREE SEPARATE, INDEPENDENT decisions first - one for each submission. @@ -708,10 +753,11 @@ def get_cleanup_review_system_prompt() -> str: 6. Contains unsupported empirical or artifact claims presented as established fact REASONS TO KEEP - A submission should be kept if it: -1. 
Provides ANY unique information not covered elsewhere -2. Offers a different perspective or approach even if related to other content -3. Contains specific mathematical details, proofs, or techniques -4. Contributes to solution diversity in any meaningful way +1. Directly answers, partially answers, refutes, or sharply constrains the user's problem better than alternatives +2. Provides ANY unique information not covered elsewhere +3. Offers a different perspective or approach even if related to other content +4. Contains specific mathematical details, proofs, or techniques +5. Contributes to solution diversity in any meaningful way CONSERVATIVE APPROACH: - When in doubt, DO NOT recommend removal @@ -721,6 +767,9 @@ def get_cleanup_review_system_prompt() -> str: CRITICAL SELECTION RULE: When multiple submissions are redundant with each other, you MUST select the WEAKEST one for removal - the one that provides the LEAST unique value. NEVER remove a more complete submission in favor of keeping a less complete one. +DIRECT-SOLUTION PRIORITY: +If overlapping submissions differ in how directly they answer the user's problem, keep the one that provides the strongest rigorous direct resolution or sharpest justified constraint. Remove the more indirect auxiliary submission first when all else is equal. + Output your decision ONLY as JSON in this exact format: { "should_remove": true or false, @@ -850,12 +899,14 @@ def get_removal_validation_system_prompt() -> str: 2. The reasoning for removal is sound and well-justified 3. The database would be objectively better without this submission 4. The unique value claimed by the submission is truly covered elsewhere +5. Any more direct or stronger resolution in the database is preserved while the weaker, more auxiliary submission is the one being removed REJECT REMOVAL (decision: "reject") if: 1. The submission provides ANY unique value not covered elsewhere 2. The reasoning for removal is weak or unconvincing 3. 
There is ANY doubt about whether the content is truly redundant 4. Removing would reduce solution diversity or coverage +5. The proposed removal would discard a more direct answer, stronger impossibility result, or sharper constraint than the alternatives being kept CONSERVATIVE DEFAULT: - If uncertain, REJECT the removal (keep the submission) diff --git a/backend/api/main.py b/backend/api/main.py index bc1c77b..281138b 100644 --- a/backend/api/main.py +++ b/backend/api/main.py @@ -3,6 +3,7 @@ """ import asyncio import os +import secrets from pathlib import Path from typing import Optional from fastapi import FastAPI @@ -23,6 +24,7 @@ health, proofs, update, + leanoj, ) from backend.shared.build_info import get_build_info from backend.shared.lm_studio_client import lm_studio_client @@ -31,6 +33,7 @@ from backend.aggregator.core.coordinator import coordinator from backend.compiler.core.compiler_coordinator import compiler_coordinator from backend.autonomous.core.autonomous_coordinator import autonomous_coordinator +from backend.leanoj.core.leanoj_coordinator import leanoj_coordinator # Setup logging with millisecond precision for log correlation logging.basicConfig( @@ -79,6 +82,19 @@ def _validate_generic_mode_startup_env() -> None: ) +def _ensure_desktop_api_token() -> None: + """Ensure default-mode HTTP/WebSocket routes have an instance token.""" + if system_config.generic_mode: + return + + if not system_config.desktop_api_token: + system_config.desktop_api_token = secrets.token_urlsafe(32) + logger.warning( + "Generated a runtime desktop API token because MOTO_DESKTOP_API_TOKEN was not provided. " + "Launch through moto_launcher.py so the frontend receives the same token." 
+ ) + + def _apply_generic_mode_openrouter_env(api_client_manager) -> None: """Load the hosted OpenRouter key from env without using the desktop keyring.""" api_key = os.environ.get("OPENROUTER_API_KEY", "").strip() @@ -149,6 +165,7 @@ async def lifespan(app: FastAPI): """Lifespan events for the FastAPI app.""" _apply_generic_mode_from_env() _validate_generic_mode_startup_env() + _ensure_desktop_api_token() # Startup logger.info( @@ -218,6 +235,7 @@ async def lifespan(app: FastAPI): coordinator.set_websocket_broadcaster(websocket.broadcast_event) compiler_coordinator.set_websocket_broadcaster(websocket.broadcast_event) autonomous_coordinator.set_broadcast_callback(websocket.broadcast_event) + leanoj_coordinator.set_broadcast_callback(websocket.broadcast_event) # Set boost manager broadcaster from backend.shared.boost_manager import boost_manager @@ -226,6 +244,16 @@ async def lifespan(app: FastAPI): # Set API client manager broadcaster (token tracking, rate limits, fallbacks) api_client_manager.set_broadcast_callback(websocket.broadcast_event) + try: + # Restore saved LeanOJ state for the UI, but only launch model work when + # explicitly requested. Lean 4 being enabled is not enough to imply that + # LM Studio/OpenRouter models are loaded and ready at backend startup. + await leanoj_coordinator.restore_latest_session( + auto_resume=system_config.lean4_enabled and system_config.leanoj_auto_resume_enabled + ) + except Exception as exc: + logger.warning("Failed to restore LeanOJ session state on startup: %s", exc) + # Lean 4 warm start must NEVER block the FastAPI lifespan. 
A cold Mathlib # workspace can spend many minutes inside `lake update` / `lake exe cache # get`, during which the backend would otherwise refuse every HTTP request @@ -264,6 +292,7 @@ async def _warm_start_lean4() -> None: await coordinator.stop() await compiler_coordinator.stop() await autonomous_coordinator.stop() + await leanoj_coordinator.stop() await close_lean4_client() clear_lean4_client() await lm_studio_client.close() @@ -293,6 +322,7 @@ async def _warm_start_lean4() -> None: app.include_router(openrouter.router) app.include_router(download.router) app.include_router(update.router) +app.include_router(leanoj.router) app.include_router(websocket.router) diff --git a/backend/api/middleware.py b/backend/api/middleware.py index 357f160..1099165 100644 --- a/backend/api/middleware.py +++ b/backend/api/middleware.py @@ -1,13 +1,23 @@ """ Middleware for CORS and error handling. """ +import hmac import os +from urllib.parse import urlparse from fastapi import FastAPI, Request from fastapi.middleware.cors import CORSMiddleware from fastapi.responses import JSONResponse +from starlette import status import logging -from backend.api.proxy_auth import ProxyAuthError, validate_proxy_headers +from backend.api.proxy_auth import ( + EMPTY_BODY_SHA256, + PROXY_BODY_SHA256_HEADER, + ProxyAuthError, + hash_proxy_body, + is_proxy_auth_allowlisted, + validate_proxy_headers, +) from backend.shared.config import system_config logger = logging.getLogger(__name__) @@ -19,6 +29,85 @@ f"http://localhost:{system_config.backend_port}", f"http://127.0.0.1:{system_config.backend_port}", ] +DESKTOP_API_TOKEN_HEADER = "X-Moto-Desktop-Token" +UNSAFE_HTTP_METHODS = {"POST", "PUT", "PATCH", "DELETE"} + + +def _origin_from_url(value: str) -> str: + """Return scheme://host[:port] for an Origin/Referer-like value.""" + parsed = urlparse(value or "") + if not parsed.scheme or not parsed.netloc: + return "" + return f"{parsed.scheme}://{parsed.netloc}" + + +def _validate_desktop_token(request: 
Request, allowed_origins: list[str]) -> None: + """Require the launcher-provided desktop API token outside public routes.""" + if is_proxy_auth_allowlisted(request.method, request.url.path): + return + + expected = (system_config.desktop_api_token or "").strip() + if not expected: + raise ProxyAuthError( + "Desktop API token is not configured for this runtime.", + status.HTTP_503_SERVICE_UNAVAILABLE, + ) + + provided = (request.headers.get(DESKTOP_API_TOKEN_HEADER) or "").strip() + if not provided or not hmac.compare_digest(provided, expected): + raise ProxyAuthError( + "Missing or invalid desktop API token.", + status.HTTP_401_UNAUTHORIZED, + ) + + if request.method.upper() in UNSAFE_HTTP_METHODS: + origin = (request.headers.get("origin") or "").strip() + referer = _origin_from_url(request.headers.get("referer") or "") + candidate = origin or referer + if candidate and candidate not in allowed_origins: + raise ProxyAuthError( + "Unsafe request origin is not allowed for this desktop runtime.", + status.HTTP_403_FORBIDDEN, + ) + + +def _validate_generic_content_length(request: Request) -> None: + """Reject oversized hosted requests before route handlers parse the body.""" + raw_content_length = (request.headers.get("content-length") or "").strip() + if not raw_content_length: + return + + try: + content_length = int(raw_content_length) + except ValueError as exc: + raise ProxyAuthError( + "Invalid Content-Length header.", + status.HTTP_400_BAD_REQUEST, + ) from exc + + max_bytes = max(int(system_config.generic_max_request_bytes or 0), 1) + if content_length > max_bytes: + raise ProxyAuthError( + f"Request body exceeds hosted limit of {max_bytes} bytes.", + status.HTTP_413_CONTENT_TOO_LARGE, + ) + + +async def _validate_generic_body_hash(request: Request, expected_hash: str) -> str: + """Verify the signed body hash against the actual request body.""" + body = await request.body() + actual_hash = hash_proxy_body(body) + if not hmac.compare_digest(expected_hash, 
actual_hash): + raise ProxyAuthError( + "X-Moto body hash does not match the received request body.", + status.HTTP_403_FORBIDDEN, + ) + + async def receive(): + return {"type": "http.request", "body": body, "more_body": False} + + request._receive = receive + return actual_hash def setup_middleware(app: FastAPI) -> None: @@ -44,20 +133,47 @@ def setup_middleware(app: FastAPI) -> None: ) @app.middleware("http") - async def generic_mode_proxy_auth(request: Request, call_next): - """Require signed internal proxy headers for protected hosted routes.""" + async def moto_request_auth(request: Request, call_next): + """Require hosted proxy auth or desktop instance tokens for protected routes.""" if system_config.generic_mode: try: + if not is_proxy_auth_allowlisted(request.method, request.url.path): + _validate_generic_content_length(request) + + body_hash = request.headers.get(PROXY_BODY_SHA256_HEADER) + verified_body_hash = EMPTY_BODY_SHA256 + if ( + not is_proxy_auth_allowlisted(request.method, request.url.path) + and request.method.upper() not in {"GET", "HEAD"} + and not body_hash + ): + raise ProxyAuthError( + "Missing required X-Moto body hash header.", + status.HTTP_401_UNAUTHORIZED, + ) + if ( + not is_proxy_auth_allowlisted(request.method, request.url.path) + and request.method.upper() not in {"GET", "HEAD"} + ): + verified_body_hash = await _validate_generic_body_hash(request, body_hash or "") validate_proxy_headers( request.headers, method=request.method, path=request.url.path, + query_string=request.url.query, + body_hash=verified_body_hash, expected_instance_id=system_config.instance_id, shared_secret=system_config.internal_proxy_secret or "", ) except ProxyAuthError as exc: logger.warning("Rejected generic-mode request %s %s: %s", request.method, request.url.path, exc.detail) return JSONResponse(status_code=exc.status_code, content={"detail": exc.detail}) + else: + try: + _validate_desktop_token(request, origins) + except ProxyAuthError as exc: + 
logger.warning("Rejected desktop request %s %s: %s", request.method, request.url.path, exc.detail) + return JSONResponse(status_code=exc.status_code, content={"detail": exc.detail}) return await call_next(request) diff --git a/backend/api/proxy_auth.py b/backend/api/proxy_auth.py index f1577c2..f85ecd6 100644 --- a/backend/api/proxy_auth.py +++ b/backend/api/proxy_auth.py @@ -13,12 +13,16 @@ PROXY_INSTANCE_HEADER = "X-Moto-Instance-Id" PROXY_TIMESTAMP_HEADER = "X-Moto-Proxy-Timestamp" PROXY_SIGNATURE_HEADER = "X-Moto-Proxy-Signature" +PROXY_BODY_SHA256_HEADER = "X-Moto-Body-SHA256" PROXY_AUTH_MAX_SKEW_SECONDS = 60 +PROXY_REPLAY_CACHE_MAX_ENTRIES = 4096 +EMPTY_BODY_SHA256 = hashlib.sha256(b"").hexdigest() PROXY_AUTH_ALLOWLIST = { ("GET", "/health"), ("GET", "/api/health"), ("GET", "/api/features"), } +_SEEN_PROXY_SIGNATURES: dict[str, int] = {} class ProxyAuthError(RuntimeError): @@ -36,6 +40,51 @@ def normalize_proxy_path(path: str) -> str: return normalized or "/" +def normalize_proxy_query(query_string: str | bytes | None) -> str: + """Normalize the raw query string used for proxy signatures.""" + if isinstance(query_string, bytes): + query_string = query_string.decode("utf-8", errors="surrogatepass") + normalized = (query_string or "").strip() + return normalized[1:] if normalized.startswith("?") else normalized + + +def hash_proxy_body(body: bytes | str | None) -> str: + """Return the SHA-256 hex digest for the request body.""" + if body is None: + raw_body = b"" + elif isinstance(body, bytes): + raw_body = body + else: + raw_body = body.encode("utf-8") + return hashlib.sha256(raw_body).hexdigest() + + +def _remember_proxy_signature(signature: str, timestamp_value: int, current_time: int) -> None: + """Reject replayed signatures within the accepted timestamp skew window.""" + stale_cutoff = current_time - PROXY_AUTH_MAX_SKEW_SECONDS + stale_signatures = [ + seen_signature + for seen_signature, seen_timestamp in _SEEN_PROXY_SIGNATURES.items() + if seen_timestamp 
< stale_cutoff + ] + for seen_signature in stale_signatures: + _SEEN_PROXY_SIGNATURES.pop(seen_signature, None) + + if signature in _SEEN_PROXY_SIGNATURES: + raise ProxyAuthError( + "Replayed X-Moto-Proxy-Signature was rejected.", + status.HTTP_401_UNAUTHORIZED, + ) + + _SEEN_PROXY_SIGNATURES[signature] = timestamp_value + if len(_SEEN_PROXY_SIGNATURES) > PROXY_REPLAY_CACHE_MAX_ENTRIES: + for seen_signature, _ in sorted( + _SEEN_PROXY_SIGNATURES.items(), + key=lambda item: item[1], + )[: len(_SEEN_PROXY_SIGNATURES) - PROXY_REPLAY_CACHE_MAX_ENTRIES]: + _SEEN_PROXY_SIGNATURES.pop(seen_signature, None) + + def is_proxy_auth_allowlisted(method: str, path: str) -> bool: """Return True when a route is intentionally public in generic mode.""" normalized_method = (method or "").upper() @@ -45,9 +94,26 @@ def is_proxy_auth_allowlisted(method: str, path: str) -> bool: return (normalized_method, normalized_path) in PROXY_AUTH_ALLOWLIST -def build_proxy_signature(secret: str, instance_id: str, timestamp: str, method: str, path: str) -> str: +def build_proxy_signature( + secret: str, + instance_id: str, + timestamp: str, + method: str, + path: str, + query_string: str | bytes | None = "", + body_hash: str | None = EMPTY_BODY_SHA256, +) -> str: """Build the expected HMAC signature for a proxied request.""" - payload = f"{instance_id}:{timestamp}:{(method or '').upper()}:{normalize_proxy_path(path)}" + payload = "\n".join( + ( + instance_id, + timestamp, + (method or "").upper(), + normalize_proxy_path(path), + normalize_proxy_query(query_string), + body_hash or EMPTY_BODY_SHA256, + ) + ) return hmac.new(secret.encode("utf-8"), payload.encode("utf-8"), hashlib.sha256).hexdigest() @@ -56,6 +122,9 @@ def validate_proxy_headers( *, method: str, path: str, + query_string: str | bytes | None = "", + body: bytes | str | None = b"", + body_hash: str | None = None, expected_instance_id: str, shared_secret: str, now: int | None = None, @@ -107,9 +176,13 @@ def validate_proxy_headers( 
timestamp=timestamp_raw, method=method, path=path, + query_string=query_string, + body_hash=body_hash or hash_proxy_body(body), ) if not hmac.compare_digest(signature, expected_signature): raise ProxyAuthError( "Invalid X-Moto-Proxy-Signature for the requested path.", status.HTTP_403_FORBIDDEN, ) + + _remember_proxy_signature(signature, timestamp_value, current_time) diff --git a/backend/api/routes/__init__.py b/backend/api/routes/__init__.py index 512b263..9183649 100644 --- a/backend/api/routes/__init__.py +++ b/backend/api/routes/__init__.py @@ -1,4 +1,4 @@ """API routes""" -from . import aggregator, compiler, autonomous, websocket, boost, workflow, features, health, proofs, update +from . import aggregator, compiler, autonomous, websocket, boost, workflow, features, health, proofs, update, leanoj -__all__ = ['aggregator', 'compiler', 'autonomous', 'websocket', 'boost', 'workflow', 'features', 'health', 'proofs', 'update'] +__all__ = ['aggregator', 'compiler', 'autonomous', 'websocket', 'boost', 'workflow', 'features', 'health', 'proofs', 'update', 'leanoj'] diff --git a/backend/api/routes/aggregator.py b/backend/api/routes/aggregator.py index 54bb5ce..6ae150e 100644 --- a/backend/api/routes/aggregator.py +++ b/backend/api/routes/aggregator.py @@ -12,16 +12,20 @@ from backend.shared.config import system_config, rag_config from backend.shared.token_tracker import token_tracker from backend.shared.path_safety import resolve_path_within_root, validate_single_path_component +from backend.shared.workflow_start_guard import workflow_start_guard from backend.aggregator.core.coordinator import coordinator from backend.aggregator.core.context_allocator import context_allocator from backend.aggregator.memory.event_log import event_log from backend.compiler.core.compiler_coordinator import compiler_coordinator from backend.autonomous.core.autonomous_coordinator import autonomous_coordinator +from backend.leanoj.core.leanoj_coordinator import leanoj_coordinator logger = 
logging.getLogger(__name__) router = APIRouter(prefix="/api/aggregator", tags=["aggregator"]) +MAX_UPLOAD_BYTES = 5 * 1024 * 1024 + def _get_start_conflict() -> Optional[str]: """Return a user-facing conflict message if another workflow is active.""" @@ -32,9 +36,12 @@ def _get_start_conflict() -> Optional[str]: return "Cannot start Aggregator while Compiler is running. Stop Compiler first." autonomous_state = autonomous_coordinator.get_state() - if autonomous_state.is_running: + if autonomous_state.is_running or autonomous_coordinator.is_active: return "Cannot start Aggregator while Autonomous Research is running. Stop Autonomous Research first." + if leanoj_coordinator.is_active: + return "Cannot start Aggregator while Proof Solver is running. Stop Proof Solver first." + return None @@ -42,71 +49,76 @@ def _get_start_conflict() -> Optional[str]: async def start_aggregator(request: AggregatorStartRequest): """Start the aggregator system.""" try: - conflict = _get_start_conflict() - if conflict: - raise HTTPException(status_code=400, detail=conflict) - - # Validate submitter configs - num_submitters = len(request.submitter_configs) - if not (system_config.min_submitters <= num_submitters <= system_config.max_submitters): - raise HTTPException( - status_code=400, - detail=f"Number of submitters must be {system_config.min_submitters}-{system_config.max_submitters}, got {num_submitters}" - ) - - # Update validator context window configuration - rag_config.validator_context_window = request.validator_context_size - rag_config.validator_max_output_tokens = request.validator_max_output_tokens - - # Use first submitter's context for context_allocator (for compatibility) - if request.submitter_configs: - first_submitter = request.submitter_configs[0] - rag_config.submitter_context_window = first_submitter.context_window - rag_config.submitter_max_output_tokens = first_submitter.max_output_tokens - context_allocator.set_context_windows( - first_submitter.context_window, - 
request.validator_context_size, - first_submitter.max_output_tokens, - request.validator_max_output_tokens - ) - - # Log submitter configurations - for config in request.submitter_configs: - label = "(Main Submitter)" if config.submitter_id == 1 else "" + async with workflow_start_guard.reserve(): + conflict = _get_start_conflict() + if conflict: + raise HTTPException(status_code=400, detail=conflict) + + # Validate submitter configs + num_submitters = len(request.submitter_configs) + if not (system_config.min_submitters <= num_submitters <= system_config.max_submitters): + raise HTTPException( + status_code=400, + detail=f"Number of submitters must be {system_config.min_submitters}-{system_config.max_submitters}, got {num_submitters}" + ) + + # Update validator context window configuration + rag_config.validator_context_window = request.validator_context_size + rag_config.validator_max_output_tokens = request.validator_max_output_tokens + + # Use first submitter's context for context_allocator (for compatibility) + if request.submitter_configs: + first_submitter = request.submitter_configs[0] + rag_config.submitter_context_window = first_submitter.context_window + rag_config.submitter_max_output_tokens = first_submitter.max_output_tokens + context_allocator.set_context_windows( + first_submitter.context_window, + request.validator_context_size, + first_submitter.max_output_tokens, + request.validator_max_output_tokens + ) + + # Log submitter configurations + for config in request.submitter_configs: + label = "(Main Submitter)" if config.submitter_id == 1 else "" + logger.info( + f"Submitter {config.submitter_id} {label}: model={config.model_id}, " + f"context={config.context_window}, max_tokens={config.max_output_tokens}" + ) logger.info( - f"Submitter {config.submitter_id} {label}: model={config.model_id}, " - f"context={config.context_window}, max_tokens={config.max_output_tokens}" + f"Validator: model={request.validator_model}, " + 
f"context={request.validator_context_size}, max_tokens={request.validator_max_output_tokens}" ) - logger.info( - f"Validator: model={request.validator_model}, " - f"context={request.validator_context_size}, max_tokens={request.validator_max_output_tokens}" - ) - - # Initialize coordinator with per-submitter configs (includes OpenRouter provider fields) - await coordinator.initialize( - user_prompt=request.user_prompt, - submitter_configs=request.submitter_configs, - validator_model=request.validator_model, - user_files=request.uploaded_files, - validator_context_window=request.validator_context_size, - validator_max_tokens=request.validator_max_output_tokens, - # Pass OpenRouter provider config for validator - validator_provider=request.validator_provider, - validator_openrouter_provider=request.validator_openrouter_provider, - validator_lm_studio_fallback=request.validator_lm_studio_fallback - ) - - # Start coordinator - token_tracker.reset() - token_tracker.start_timer() - await coordinator.start() - - return { - "status": "started", - "message": f"Aggregator system started with {num_submitters} submitters", - "num_submitters": num_submitters - } - + + # Initialize coordinator with per-submitter configs (includes OpenRouter provider fields) + await coordinator.initialize( + user_prompt=request.user_prompt, + submitter_configs=request.submitter_configs, + validator_model=request.validator_model, + user_files=request.uploaded_files, + validator_context_window=request.validator_context_size, + validator_max_tokens=request.validator_max_output_tokens, + # Pass OpenRouter provider config for validator + validator_provider=request.validator_provider, + validator_openrouter_provider=request.validator_openrouter_provider, + validator_openrouter_reasoning_effort=request.validator_openrouter_reasoning_effort, + validator_lm_studio_fallback=request.validator_lm_studio_fallback, + validator_supercharge_enabled=request.validator_supercharge_enabled + ) + + # Start coordinator 
+ token_tracker.reset() + token_tracker.start_timer() + await coordinator.start() + + return { + "status": "started", + "message": f"Aggregator system started with {num_submitters} submitters", + "num_submitters": num_submitters + } + + except HTTPException: + raise except ValueError as e: # Model compatibility errors logger.error(f"Model compatibility error: {e}", exc_info=True) @@ -169,8 +181,8 @@ async def save_results(): return { "status": "saved", - "path": str(output_path), - "message": f"Results saved to {output_path}" + "path": output_path.name, + "message": f"Results saved to {output_path.name}" } except Exception as e: logger.error(f"Failed to save results: {e}") @@ -197,19 +209,30 @@ async def upload_file(file: UploadFile = File(...)): """Upload a user file.""" try: safe_filename = validate_single_path_component(file.filename, "filename") + if not safe_filename.lower().endswith(".txt"): + raise HTTPException(status_code=400, detail="Only .txt uploads are supported") + + content = await file.read(MAX_UPLOAD_BYTES + 1) + if len(content) > MAX_UPLOAD_BYTES: + raise HTTPException(status_code=413, detail="Upload exceeds 5 MB limit") + uploads_dir = Path(system_config.user_uploads_dir) uploads_dir.mkdir(parents=True, exist_ok=True) file_path = resolve_path_within_root(uploads_dir, safe_filename) async with aiofiles.open(file_path, 'wb') as f: - content = await file.read() await f.write(content) return { "status": "uploaded", "filename": safe_filename, - "path": str(file_path) + "path": safe_filename } + except HTTPException: + raise + except ValueError as e: + logger.warning("Rejected unsafe upload filename: %s", e) + raise HTTPException(status_code=400, detail=str(e)) except Exception as e: logger.error(f"Failed to upload file: {e}") raise HTTPException(status_code=500, detail="Internal server error") diff --git a/backend/api/routes/autonomous.py b/backend/api/routes/autonomous.py index 8f6fb79..2cb7a7c 100644 --- a/backend/api/routes/autonomous.py +++ 
b/backend/api/routes/autonomous.py @@ -3,6 +3,7 @@ Includes Tier 1 (Brainstorm), Tier 2 (Paper Writing), and Tier 3 (Final Answer) endpoints. """ import logging +import hashlib from datetime import datetime from pathlib import Path from typing import Optional, Any, Dict, List @@ -21,8 +22,11 @@ from backend.autonomous.memory.session_manager import session_manager from backend.autonomous.memory.autonomous_api_logger import autonomous_api_logger from backend.aggregator.core.coordinator import coordinator +from backend.aggregator.memory.shared_training import shared_training_memory from backend.compiler.core.compiler_coordinator import compiler_coordinator +from backend.leanoj.core.leanoj_coordinator import leanoj_coordinator from backend.shared.boost_logger import boost_logger +from backend.shared.workflow_start_guard import workflow_start_guard logger = logging.getLogger(__name__) @@ -51,6 +55,19 @@ def _parse_api_log_timestamp(timestamp: Optional[str]) -> datetime: return datetime.min +def _infer_api_log_workflow(entry: Dict[str, Any]) -> str: + """Infer workflow namespace for legacy API log entries.""" + workflow = str(entry.get("workflow") or "").strip().lower() + if workflow: + return workflow + + role_id = str(entry.get("role_id") or "") + task_id = str(entry.get("task_id") or "") + if role_id.startswith("leanoj_") or task_id.startswith("leanoj_"): + return "leanoj" + return "autonomous" + + def _normalize_autonomous_api_log(entry: Dict[str, Any]) -> Dict[str, Any]: """Normalize autonomous log entries into the combined API log shape.""" return { @@ -60,10 +77,11 @@ def _normalize_autonomous_api_log(entry: Dict[str, Any]) -> Dict[str, Any]: "boost_mode": entry.get("boost_mode"), "provider": entry.get("provider") or "unknown", "phase": entry.get("phase") or "unknown", + "workflow": _infer_api_log_workflow(entry), "prompt_preview": entry.get("prompt_preview") or "", - "prompt_full": entry.get("prompt_full") or entry.get("prompt_preview") or "", + "prompt_full": 
entry.get("prompt_full") or "", "response_preview": entry.get("response_preview") or "", - "response_full": entry.get("response_full") or entry.get("response_preview") or "", + "response_full": entry.get("response_full") or "", } @@ -71,7 +89,7 @@ def _normalize_boost_api_log(entry: Dict[str, Any]) -> Dict[str, Any]: """Normalize boost log entries so they can be shown in the main API log view.""" prompt_preview = entry.get("prompt_preview") or "" response_preview = entry.get("response_preview") or "" - response_full = entry.get("response_full") or response_preview + response_full = entry.get("response_full") or "" return { **entry, @@ -79,8 +97,9 @@ def _normalize_boost_api_log(entry: Dict[str, Any]) -> Dict[str, Any]: "boosted": True, "provider": entry.get("provider") or "openrouter", "phase": entry.get("phase") or "boost", + "workflow": _infer_api_log_workflow(entry), "prompt_preview": prompt_preview, - "prompt_full": entry.get("prompt_full") or prompt_preview, + "prompt_full": entry.get("prompt_full") or "", "response_preview": response_preview, "response_full": response_full, } @@ -214,20 +233,73 @@ def _build_combined_api_stats(logs: List[Dict[str, Any]]) -> Dict[str, Any]: } -async def _get_combined_api_logs(limit: int = 100) -> Dict[str, Any]: +def _normalize_api_log_workflow_filter(workflow: Optional[str]) -> Optional[str]: + if workflow is None: + return None + + normalized = workflow.strip().lower() + if not normalized: + return None + if normalized not in {"autonomous", "leanoj"}: + raise HTTPException(status_code=400, detail="Invalid workflow filter") + return normalized + + +def _get_api_log_key(entry: Dict[str, Any]) -> str: + """Build a stable opaque key for a combined API log entry.""" + parts = [ + str(entry.get("timestamp") or ""), + str(entry.get("task_id") or ""), + str(entry.get("role_id") or ""), + str(entry.get("model") or ""), + str(entry.get("source") or ""), + str(entry.get("boost_mode") or ""), + ] + return 
hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()[:24] + + +def _summarize_api_log_entry(entry: Dict[str, Any]) -> Dict[str, Any]: + """Return a UI-safe log-list entry without large prompt/response bodies.""" + prompt_full = str(entry.get("prompt_full") or "") + response_full = str(entry.get("response_full") or "") + prompt_size = int(entry.get("prompt_size") or len(prompt_full)) + response_size = int(entry.get("response_size") or len(response_full)) + summary = { + **entry, + "log_key": _get_api_log_key(entry), + "prompt_full": "", + "response_full": "", + "prompt_size": prompt_size, + "response_size": response_size, + "has_full_prompt": bool(entry.get("has_full_prompt", bool(prompt_full))), + "has_full_response": bool(entry.get("has_full_response", bool(response_full))), + } + return summary + + +async def _get_combined_api_logs( + limit: int = 100, + workflow: Optional[str] = None, + include_full: bool = True, +) -> Dict[str, Any]: """Fetch, deduplicate, and summarize the combined autonomous + boost API logs.""" fetch_limit = max(limit * 3, 300) - autonomous_logs = await autonomous_api_logger.get_logs(limit=fetch_limit) - boost_logs = await boost_logger.get_logs(limit=fetch_limit) - combined_logs = _merge_combined_api_logs(autonomous_logs, boost_logs, limit=limit) - combined_stats = _build_combined_api_stats( - _merge_combined_api_logs( - autonomous_logs, - boost_logs, - limit=max(fetch_limit, len(autonomous_logs) + len(boost_logs)), - ) + autonomous_logs = await autonomous_api_logger.get_logs(limit=fetch_limit, include_full=include_full) + boost_logs = await boost_logger.get_logs(limit=fetch_limit, include_full=include_full) + all_combined_logs = _merge_combined_api_logs( + autonomous_logs, + boost_logs, + limit=max(fetch_limit, len(autonomous_logs) + len(boost_logs)), ) - return {"logs": combined_logs, "stats": combined_stats} + if workflow: + all_combined_logs = [ + log for log in all_combined_logs + if log.get("workflow") == workflow + ] + 
return { + "logs": all_combined_logs[:limit], + "stats": _build_combined_api_stats(all_combined_logs), + } if session_id == "legacy": return @@ -250,6 +322,9 @@ def _get_start_conflict() -> Optional[str]: if compiler_coordinator.is_running: return "Cannot start Autonomous Research while Compiler is running. Stop Compiler first." + if leanoj_coordinator.is_active: + return "Cannot start Autonomous Research while Proof Solver is running. Stop Proof Solver first." + return None @@ -379,16 +454,20 @@ def _resolve_validator_config(request: Optional[CritiqueRequest]) -> Dict[str, A validator_max_tokens = None validator_provider = None validator_openrouter_provider = None + validator_openrouter_reasoning_effort = "auto" + validator_supercharge_enabled = False custom_prompt = None if request: custom_prompt = request.custom_prompt + validator_supercharge_enabled = bool(request.validator_supercharge_enabled) if request.validator_model: validator_model = request.validator_model validator_context_window = request.validator_context_window or 131072 validator_max_tokens = request.validator_max_tokens or 25000 validator_provider = request.validator_provider or "lm_studio" validator_openrouter_provider = request.validator_openrouter_provider + validator_openrouter_reasoning_effort = request.validator_openrouter_reasoning_effort if not validator_model: coordinator_config = autonomous_coordinator.get_validator_config() @@ -398,6 +477,8 @@ def _resolve_validator_config(request: Optional[CritiqueRequest]) -> Dict[str, A validator_max_tokens = coordinator_config["validator_max_tokens"] validator_provider = coordinator_config["validator_provider"] validator_openrouter_provider = coordinator_config.get("validator_openrouter_provider") + validator_openrouter_reasoning_effort = coordinator_config.get("validator_openrouter_reasoning_effort", "auto") + validator_supercharge_enabled = bool(coordinator_config.get("validator_supercharge_enabled", False)) if not validator_model: raise 
HTTPException( @@ -412,6 +493,8 @@ def _resolve_validator_config(request: Optional[CritiqueRequest]) -> Dict[str, A "validator_max_tokens": validator_max_tokens, "validator_provider": validator_provider, "validator_openrouter_provider": validator_openrouter_provider, + "validator_openrouter_reasoning_effort": validator_openrouter_reasoning_effort, + "validator_supercharge_enabled": validator_supercharge_enabled, } @@ -465,9 +548,11 @@ async def _generate_autonomous_paper_critique( model_id=config["validator_model"], openrouter_model_id=config["validator_model"] if config["validator_provider"] == "openrouter" else None, openrouter_provider=config["validator_openrouter_provider"], + openrouter_reasoning_effort=config["validator_openrouter_reasoning_effort"], lm_studio_fallback_id=None, context_window=config["validator_context_window"], max_output_tokens=config["validator_max_tokens"], + supercharge_enabled=bool(config.get("validator_supercharge_enabled", False)), ) ) @@ -544,9 +629,7 @@ async def _delete_autonomous_paper_from_scope( scoped_research_metadata: ResearchMetadata, paper_id: str, ) -> Dict[str, Any]: - """Delete a Stage 2 paper and clean its related metadata/critique state.""" - from backend.shared.critique_memory import clear_critiques - + """Soft-prune a Stage 2 paper and remove it from future model context.""" state = autonomous_coordinator.get_state() active_session_id = _get_active_autonomous_session_id() if ( @@ -564,18 +647,25 @@ async def _delete_autonomous_paper_from_scope( if not metadata: raise HTTPException(status_code=404, detail=f"Paper not found: {paper_id}") - paper_path = scoped_paper_library.get_paper_path(paper_id) - base_dir = Path(paper_path).parent source_brainstorms = metadata.source_brainstorm_ids or [] - success = await scoped_paper_library.delete_paper(paper_id) + prune_reason = "The user removed this paper from model context accumulation." 
+ success = await scoped_paper_library.prune_paper( + paper_id, + reason=prune_reason, + pruned_by="user", + ) if not success: raise HTTPException( status_code=500, - detail=f"Failed to delete paper files for {paper_id}" + detail=f"Failed to prune paper files for {paper_id}" ) - await scoped_research_metadata.delete_paper(paper_id) + await scoped_research_metadata.prune_paper( + paper_id, + reason=prune_reason, + pruned_by="user", + ) for topic_id in source_brainstorms: try: @@ -586,22 +676,23 @@ async def _delete_autonomous_paper_from_scope( ) try: - await clear_critiques("autonomous_paper", paper_id, base_dir) - logger.info(f"Cleared critiques for deleted paper {paper_id}") + from backend.autonomous.core.autonomous_rag_manager import autonomous_rag_manager + await autonomous_rag_manager.remove_paper_from_rag(paper_id) except Exception as e: - logger.warning(f"Failed to clear critiques for paper {paper_id}: {e}") + logger.warning(f"Failed to remove pruned paper {paper_id} from RAG: {e}") logger.info( - f"Deleted paper {paper_id} from session {session_id} " + f"Pruned paper {paper_id} from session {session_id} " f"(from brainstorms: {', '.join(source_brainstorms)})" ) return { "success": True, - "message": f"Paper {paper_id} deleted successfully", + "message": f"Paper {paper_id} was pruned from model context and preserved for download", "paper_id": paper_id, "session_id": session_id, "source_brainstorms": source_brainstorms, + "pruned": True, } @@ -611,71 +702,80 @@ async def start_autonomous_research(request: AutonomousResearchStartRequest): try: from backend.shared.config import system_config - conflict = _get_start_conflict() - if conflict: - raise HTTPException(status_code=400, detail=conflict) - - # Validate submitter configs - num_submitters = len(request.submitter_configs) - if not (system_config.min_submitters <= num_submitters <= system_config.max_submitters): - raise HTTPException( - status_code=400, - detail=f"Number of submitters must be 
{system_config.min_submitters}-{system_config.max_submitters}, got {num_submitters}" - ) - - # Log submitter configurations - for config in request.submitter_configs: - label = "(Main Submitter)" if config.submitter_id == 1 else "" + async with workflow_start_guard.reserve(): + conflict = _get_start_conflict() + if conflict: + raise HTTPException(status_code=400, detail=conflict) + + # Validate submitter configs + num_submitters = len(request.submitter_configs) + if not (system_config.min_submitters <= num_submitters <= system_config.max_submitters): + raise HTTPException( + status_code=400, + detail=f"Number of submitters must be {system_config.min_submitters}-{system_config.max_submitters}, got {num_submitters}" + ) + + # Log submitter configurations + for config in request.submitter_configs: + label = "(Main Submitter)" if config.submitter_id == 1 else "" + logger.info( + f"Brainstorm Submitter {config.submitter_id} {label}: model={config.model_id}, " + f"context={config.context_window}, max_tokens={config.max_output_tokens}" + ) logger.info( - f"Brainstorm Submitter {config.submitter_id} {label}: model={config.model_id}, " - f"context={config.context_window}, max_tokens={config.max_output_tokens}" + f"Validator: model={request.validator_model}, " + f"context={request.validator_context_window}, max_tokens={request.validator_max_tokens}" ) - logger.info( - f"Validator: model={request.validator_model}, " - f"context={request.validator_context_window}, max_tokens={request.validator_max_tokens}" - ) - - # Initialize coordinator - await autonomous_coordinator.initialize( - user_research_prompt=request.user_research_prompt, - submitter_configs=request.submitter_configs, - validator_model=request.validator_model, - validator_context_window=request.validator_context_window, - validator_max_tokens=request.validator_max_tokens, - high_context_model=request.high_context_model, - high_context_context_window=request.high_context_context_window, - 
high_context_max_tokens=request.high_context_max_tokens, - high_param_model=request.high_param_model, - high_param_context_window=request.high_param_context_window, - high_param_max_tokens=request.high_param_max_tokens, - critique_submitter_model=request.critique_submitter_model, - critique_submitter_context_window=request.critique_submitter_context_window, - critique_submitter_max_tokens=request.critique_submitter_max_tokens, - # OpenRouter provider configs for each role - validator_provider=request.validator_provider, - validator_openrouter_provider=request.validator_openrouter_provider, - validator_lm_studio_fallback=request.validator_lm_studio_fallback, - high_context_provider=request.high_context_provider, - high_context_openrouter_provider=request.high_context_openrouter_provider, - high_context_lm_studio_fallback=request.high_context_lm_studio_fallback, - high_param_provider=request.high_param_provider, - high_param_openrouter_provider=request.high_param_openrouter_provider, - high_param_lm_studio_fallback=request.high_param_lm_studio_fallback, - critique_submitter_provider=request.critique_submitter_provider, - critique_submitter_openrouter_provider=request.critique_submitter_openrouter_provider, - critique_submitter_lm_studio_fallback=request.critique_submitter_lm_studio_fallback, - tier3_enabled=request.tier3_enabled - ) - - # Start in background with a retained task handle so Stop can cancel it. 
- if not autonomous_coordinator.start_in_background(): - raise HTTPException(status_code=400, detail="Autonomous research is already running") - - return { - "success": True, - "message": f"Autonomous research started with {num_submitters} brainstorm submitters", - "num_submitters": num_submitters - } + + # Initialize coordinator + await autonomous_coordinator.initialize( + user_research_prompt=request.user_research_prompt, + submitter_configs=request.submitter_configs, + validator_model=request.validator_model, + validator_context_window=request.validator_context_window, + validator_max_tokens=request.validator_max_tokens, + high_context_model=request.high_context_model, + high_context_context_window=request.high_context_context_window, + high_context_max_tokens=request.high_context_max_tokens, + high_param_model=request.high_param_model, + high_param_context_window=request.high_param_context_window, + high_param_max_tokens=request.high_param_max_tokens, + critique_submitter_model=request.critique_submitter_model, + critique_submitter_context_window=request.critique_submitter_context_window, + critique_submitter_max_tokens=request.critique_submitter_max_tokens, + # OpenRouter provider configs for each role + validator_provider=request.validator_provider, + validator_openrouter_provider=request.validator_openrouter_provider, + validator_openrouter_reasoning_effort=request.validator_openrouter_reasoning_effort, + validator_lm_studio_fallback=request.validator_lm_studio_fallback, + high_context_provider=request.high_context_provider, + high_context_openrouter_provider=request.high_context_openrouter_provider, + high_context_openrouter_reasoning_effort=request.high_context_openrouter_reasoning_effort, + high_context_lm_studio_fallback=request.high_context_lm_studio_fallback, + high_param_provider=request.high_param_provider, + high_param_openrouter_provider=request.high_param_openrouter_provider, + 
high_param_openrouter_reasoning_effort=request.high_param_openrouter_reasoning_effort, + high_param_lm_studio_fallback=request.high_param_lm_studio_fallback, + critique_submitter_provider=request.critique_submitter_provider, + critique_submitter_openrouter_provider=request.critique_submitter_openrouter_provider, + critique_submitter_openrouter_reasoning_effort=request.critique_submitter_openrouter_reasoning_effort, + critique_submitter_lm_studio_fallback=request.critique_submitter_lm_studio_fallback, + tier3_enabled=request.tier3_enabled, + validator_supercharge_enabled=request.validator_supercharge_enabled, + high_context_supercharge_enabled=request.high_context_supercharge_enabled, + high_param_supercharge_enabled=request.high_param_supercharge_enabled, + critique_submitter_supercharge_enabled=request.critique_submitter_supercharge_enabled + ) + + # Start in background with a retained task handle so Stop can cancel it. + if not autonomous_coordinator.start_in_background(): + raise HTTPException(status_code=400, detail="Autonomous research is already running") + + return { + "success": True, + "message": f"Autonomous research started with {num_submitters} brainstorm submitters", + "num_submitters": num_submitters + } except HTTPException: raise @@ -791,14 +891,28 @@ async def get_autonomous_status(): # Try to get aggregator queue size if autonomous_coordinator._brainstorm_aggregator: - from backend.aggregator.core.queue_manager import queue_manager try: - queue_size = await queue_manager.size() + aggregator_status = await autonomous_coordinator._brainstorm_aggregator.get_status() + queue_size = aggregator_status.queue_size + aggregator_offset = autonomous_coordinator._brainstorm_aggregator.acceptance_count_offset + acceptance_count = max( + acceptance_count, + aggregator_offset + aggregator_status.total_acceptances, + aggregator_status.shared_training_size, + ) except Exception: - pass + from backend.aggregator.core.queue_manager import queue_manager + try: + 
queue_size = await queue_manager.size() + except Exception: + pass # Get counts from autonomous coordinator internal state - acceptance_count = autonomous_coordinator._acceptance_count + acceptance_count = max( + acceptance_count, + autonomous_coordinator._acceptance_count, + metadata.submission_count or 0, + ) rejection_count = autonomous_coordinator._rejection_count cleanup_removals = autonomous_coordinator._cleanup_removals @@ -1013,6 +1127,75 @@ async def get_paper_history(): raise HTTPException(status_code=500, detail="Internal server error") +@router.get("/paper-history/pruned") +async def get_pruned_paper_history(): + """Get all pruned Stage 2 papers from legacy and session history.""" + try: + papers = await paper_library.list_pruned_history_papers() + return { + "success": True, + "papers": papers, + "total_count": len(papers) + } + except Exception as e: + logger.error(f"Failed to get pruned Stage 2 paper history: {e}") + raise HTTPException(status_code=500, detail="Internal server error") + + +@router.get("/paper-history/pruned/{session_id}/{paper_id}") +async def get_pruned_history_paper(session_id: str, paper_id: str): + """Get one pruned Stage 2 paper from legacy/session history.""" + try: + paper = await paper_library.get_pruned_history_paper(session_id, paper_id) + if not paper: + raise HTTPException( + status_code=404, + detail=f"Pruned paper not found in history: session={session_id}, paper={paper_id}" + ) + + return { + "success": True, + **paper + } + except HTTPException: + raise + except Exception as e: + logger.error(f"Failed to get pruned history paper {session_id}/{paper_id}: {e}") + raise HTTPException(status_code=500, detail="Internal server error") + + +@router.delete("/paper-history/pruned/{session_id}") +async def delete_pruned_history_papers(session_id: str, confirm: bool = False): + """Permanently delete all pruned Stage 2 paper files in one legacy/session scope.""" + try: + if not confirm: + raise HTTPException( + status_code=400, + 
detail="Must confirm deletion with confirm=true" + ) + paths = _resolve_history_session_paths(session_id) + scoped_paper_library = _build_scoped_paper_library(paths) + scoped_research_metadata = await _build_scoped_research_metadata(paths) + pruned_papers = await scoped_paper_library._list_pruned_history_papers_from_directory( + paths["papers_dir"], + session_id, + ) + deleted_count = await scoped_paper_library.delete_all_pruned_papers() + for paper in pruned_papers: + await scoped_research_metadata.delete_paper(paper["paper_id"]) + return { + "success": True, + "session_id": session_id, + "deleted_count": deleted_count, + "message": f"Deleted {deleted_count} pruned paper records" + } + except HTTPException: + raise + except Exception as e: + logger.error(f"Failed to delete pruned history papers for {session_id}: {e}") + raise HTTPException(status_code=500, detail="Internal server error") + + @router.get("/paper-history/{session_id}/{paper_id}") async def get_history_paper(session_id: str, paper_id: str): """Get one completed, non-archived Stage 2 paper from legacy/session history.""" @@ -1179,7 +1362,7 @@ async def get_current_session(): return { "is_active": True, "session_id": session_manager.session_id, - "path": str(session_manager.session_path) if session_manager.session_path else None + "path": session_manager.session_id } except Exception as e: @@ -1523,15 +1706,23 @@ async def delete_brainstorm(topic_id: str, confirm: bool = False): detail="Must confirm deletion with confirm=true" ) - # Check if running - state = autonomous_coordinator.get_state() - if state.is_running and state.current_tier == "tier1_aggregation": - # Check if this is the active brainstorm - if autonomous_coordinator._current_topic_id == topic_id: - raise HTTPException( - status_code=400, - detail="Cannot delete active brainstorm while it's being aggregated. Stop autonomous research first." - ) + # Check if this brainstorm is still owned by the running coordinator. 
+ # The live aggregator keeps a direct file handle path through shared_training_memory; + # deleting it while active can recreate an unlisted "invisible" brainstorm DB. + active_topic_id = autonomous_coordinator._current_topic_id + active_aggregator = autonomous_coordinator._brainstorm_aggregator + aggregator_running = bool(active_aggregator and active_aggregator.is_running) + target_db_path = Path(brainstorm_memory.get_database_path(topic_id)).resolve() + active_shared_path = Path(shared_training_memory.file_path).resolve() + active_shared_path_matches = active_shared_path == target_db_path + if ( + (active_topic_id == topic_id or active_shared_path_matches) + and (autonomous_coordinator.is_active or aggregator_running) + ): + raise HTTPException( + status_code=400, + detail="Cannot delete the active brainstorm while autonomous research is running. Stop autonomous research first." + ) # Get brainstorm metadata metadata = await brainstorm_memory.get_metadata(topic_id) @@ -1554,6 +1745,15 @@ async def delete_brainstorm(topic_id: str, confirm: bool = False): # Remove from central metadata await research_metadata.delete_brainstorm(topic_id) + if active_topic_id == topic_id: + await autonomous_coordinator.clear_deleted_brainstorm_reference( + topic_id, + "brainstorm deleted through API while coordinator was stopped" + ) + else: + stats = await research_metadata.get_stats() + if stats.get("current_brainstorm_id") == topic_id: + await research_metadata.set_current_brainstorm(None) logger.info(f"Deleted brainstorm {topic_id} (had {len(associated_papers)} associated papers)") @@ -1574,7 +1774,7 @@ async def delete_brainstorm(topic_id: str, confirm: bool = False): @router.delete("/paper/{paper_id}") async def delete_paper(paper_id: str, confirm: bool = False): """ - Delete a paper and optionally its source brainstorm. + Prune a paper from model context while preserving it for user download. 
Query params: confirm: Must be True to execute deletion (safety check) @@ -1600,9 +1800,39 @@ async def delete_paper(paper_id: str, confirm: bool = False): raise HTTPException(status_code=500, detail="Internal server error") +@router.delete("/pruned-papers") +async def delete_current_pruned_papers(confirm: bool = False): + """Permanently delete all pruned papers in the active autonomous paper scope.""" + try: + if not confirm: + raise HTTPException( + status_code=400, + detail="Must confirm deletion with confirm=true" + ) + + pruned_papers = await paper_library._list_pruned_history_papers_from_directory( + paper_library._base_dir, + _get_active_autonomous_session_id(), + ) + deleted_count = await paper_library.delete_all_pruned_papers() + for paper in pruned_papers: + await research_metadata.delete_paper(paper["paper_id"]) + return { + "success": True, + "session_id": _get_active_autonomous_session_id(), + "deleted_count": deleted_count, + "message": f"Deleted {deleted_count} pruned paper records" + } + except HTTPException: + raise + except Exception as e: + logger.error(f"Failed to delete current pruned papers: {e}") + raise HTTPException(status_code=500, detail="Internal server error") + + @router.delete("/paper-history/{session_id}/{paper_id}") async def delete_history_paper(session_id: str, paper_id: str, confirm: bool = False): - """Delete a completed Stage 2 history paper from a specific legacy/session scope.""" + """Prune a completed Stage 2 history paper from a specific legacy/session scope.""" try: if not confirm: raise HTTPException( @@ -1994,6 +2224,10 @@ async def get_final_answer_archived_papers(answer_id: str): memory = _build_scoped_final_answer_memory(answer_id) papers = await memory.get_archived_papers_list() return {"papers": papers} + except HTTPException: + raise + except ValueError as e: + raise HTTPException(status_code=400, detail=str(e)) except Exception as e: logger.error(f"Failed to get archived papers for {answer_id}: {e}") raise 
HTTPException(status_code=500, detail="Internal server error") @@ -2020,6 +2254,8 @@ async def get_final_answer_archived_paper(answer_id: str, paper_id: str): return paper except HTTPException: raise + except ValueError as e: + raise HTTPException(status_code=400, detail=str(e)) except Exception as e: logger.error(f"Failed to get archived paper {paper_id} for {answer_id}: {e}") raise HTTPException(status_code=500, detail="Internal server error") @@ -2040,6 +2276,10 @@ async def get_final_answer_archived_brainstorms(answer_id: str): memory = _build_scoped_final_answer_memory(answer_id) brainstorms = await memory.get_archived_brainstorms_list() return {"brainstorms": brainstorms} + except HTTPException: + raise + except ValueError as e: + raise HTTPException(status_code=400, detail=str(e)) except Exception as e: logger.error(f"Failed to get archived brainstorms for {answer_id}: {e}") raise HTTPException(status_code=500, detail="Internal server error") @@ -2066,6 +2306,8 @@ async def get_final_answer_archived_brainstorm(answer_id: str, topic_id: str): return brainstorm except HTTPException: raise + except ValueError as e: + raise HTTPException(status_code=400, detail=str(e)) except Exception as e: logger.error(f"Failed to get archived brainstorm {topic_id} for {answer_id}: {e}") raise HTTPException(status_code=500, detail="Internal server error") @@ -2315,10 +2557,13 @@ async def request_final_answer_critique(answer_id: str, request: CritiqueRequest validator_max_tokens = None validator_provider = None validator_openrouter_provider = None + validator_openrouter_reasoning_effort = "auto" + validator_supercharge_enabled = False custom_prompt = None if request: custom_prompt = request.custom_prompt + validator_supercharge_enabled = bool(request.validator_supercharge_enabled) # Check if request provides validator config if request.validator_model: validator_model = request.validator_model @@ -2326,6 +2571,7 @@ async def request_final_answer_critique(answer_id: str, 
request: CritiqueRequest validator_max_tokens = request.validator_max_tokens or 25000 validator_provider = request.validator_provider or "lm_studio" validator_openrouter_provider = request.validator_openrouter_provider + validator_openrouter_reasoning_effort = request.validator_openrouter_reasoning_effort # If no validator config from request, try coordinator if not validator_model: @@ -2336,6 +2582,8 @@ async def request_final_answer_critique(answer_id: str, request: CritiqueRequest validator_max_tokens = coordinator_config["validator_max_tokens"] validator_provider = coordinator_config["validator_provider"] validator_openrouter_provider = coordinator_config.get("validator_openrouter_provider") + validator_openrouter_reasoning_effort = coordinator_config.get("validator_openrouter_reasoning_effort", "auto") + validator_supercharge_enabled = bool(coordinator_config.get("validator_supercharge_enabled", False)) # If still no config, error if not validator_model: @@ -2386,9 +2634,11 @@ async def request_final_answer_critique(answer_id: str, request: CritiqueRequest model_id=validator_model, openrouter_model_id=validator_model if validator_provider == "openrouter" else None, openrouter_provider=validator_openrouter_provider, + openrouter_reasoning_effort=validator_openrouter_reasoning_effort, lm_studio_fallback_id=None, # No fallback for direct critique calls context_window=validator_context_window, - max_output_tokens=validator_max_tokens + max_output_tokens=validator_max_tokens, + supercharge_enabled=validator_supercharge_enabled ) ) @@ -2559,7 +2809,7 @@ async def get_default_critique_prompt(): # ============================================================================ @router.get("/api-logs") -async def get_autonomous_api_logs(limit: int = 100): +async def get_autonomous_api_logs(limit: int = 100, workflow: Optional[str] = None): """ Get autonomous research API call logs. 
@@ -2570,20 +2820,57 @@ async def get_autonomous_api_logs(limit: int = 100): Dict with logs and statistics """ try: - combined = await _get_combined_api_logs(limit=limit) + safe_limit = max(1, min(limit, 100)) + workflow_filter = _normalize_api_log_workflow_filter(workflow) + combined = await _get_combined_api_logs( + limit=safe_limit, + workflow=workflow_filter, + include_full=False, + ) return { "success": True, - "logs": combined["logs"], + "logs": [_summarize_api_log_entry(log) for log in combined["logs"]], "stats": combined["stats"], } + except HTTPException: + raise except Exception as e: logger.error(f"Failed to get autonomous API logs: {e}") raise HTTPException(status_code=500, detail="Internal server error") +@router.get("/api-logs/detail/{log_key}") +async def get_autonomous_api_log_detail(log_key: str, workflow: Optional[str] = None): + """Get one full API log entry by key for explicit user inspection/copy.""" + try: + if not log_key or len(log_key) > 128: + raise HTTPException(status_code=400, detail="Invalid API log key") + + workflow_filter = _normalize_api_log_workflow_filter(workflow) + combined = await _get_combined_api_logs(limit=1000, workflow=workflow_filter) + for log in combined["logs"]: + if _get_api_log_key(log) == log_key: + return { + "success": True, + "log": { + **log, + "log_key": log_key, + "prompt_size": len(str(log.get("prompt_full") or "")), + "response_size": len(str(log.get("response_full") or "")), + }, + } + + raise HTTPException(status_code=404, detail="API log entry not found") + except HTTPException: + raise + except Exception as e: + logger.error(f"Failed to get autonomous API log detail: {e}") + raise HTTPException(status_code=500, detail="Internal server error") + + @router.post("/api-logs/clear") -async def clear_autonomous_api_logs(): +async def clear_autonomous_api_logs(workflow: Optional[str] = None): """ Clear all autonomous API logs. 
@@ -2591,20 +2878,23 @@ async def clear_autonomous_api_logs(): Success status """ try: - await autonomous_api_logger.clear_logs() - await boost_logger.clear_logs() + workflow_filter = _normalize_api_log_workflow_filter(workflow) + await autonomous_api_logger.clear_logs(workflow=workflow_filter) + await boost_logger.clear_logs(workflow=workflow_filter) return { "success": True, "message": "Combined API logs cleared successfully" } + except HTTPException: + raise except Exception as e: logger.error(f"Failed to clear autonomous API logs: {e}") raise HTTPException(status_code=500, detail="Internal server error") @router.get("/api-logs/stats") -async def get_autonomous_api_stats(): +async def get_autonomous_api_stats(workflow: Optional[str] = None): """ Get statistics about autonomous API calls. @@ -2612,12 +2902,19 @@ async def get_autonomous_api_stats(): Statistics dict (total calls, by phase, by model, success rate, etc.) """ try: - combined = await _get_combined_api_logs(limit=1000) + workflow_filter = _normalize_api_log_workflow_filter(workflow) + combined = await _get_combined_api_logs( + limit=1000, + workflow=workflow_filter, + include_full=False, + ) return { "success": True, "stats": combined["stats"] } + except HTTPException: + raise except Exception as e: logger.error(f"Failed to get autonomous API stats: {e}") raise HTTPException(status_code=500, detail="Internal server error") \ No newline at end of file diff --git a/backend/api/routes/boost.py b/backend/api/routes/boost.py index 2838296..8d87f6b 100644 --- a/backend/api/routes/boost.py +++ b/backend/api/routes/boost.py @@ -13,7 +13,7 @@ from typing import Dict, Any, Optional import logging -from backend.shared.config import rag_config +from backend.shared.config import rag_config, system_config from backend.shared.models import BoostConfig from backend.shared.boost_manager import boost_manager from backend.shared.boost_logger import boost_logger @@ -28,12 +28,17 @@ class BoostNextCountRequest(BaseModel): 
count: int -def _resolve_boost_api_key(api_key: Optional[str]) -> str: - """Use the explicit boost key when provided, otherwise fall back to the active global key.""" +def _resolve_boost_api_key(api_key: Optional[str], *, allow_current_override: bool = False) -> str: + """Use an explicit/current boost key when provided, otherwise fall back to the active global key.""" explicit_key = (api_key or "").strip() if explicit_key: return explicit_key + if allow_current_override and boost_manager.boost_config: + current_key = (boost_manager.boost_config.openrouter_api_key or "").strip() + if current_key: + return current_key + global_key = (rag_config.openrouter_api_key or "").strip() if global_key: return global_key @@ -56,7 +61,8 @@ async def enable_boost(config: BoostConfig) -> Dict[str, Any]: Status and boost configuration """ try: - effective_api_key = _resolve_boost_api_key(config.openrouter_api_key) + explicit_api_key = (config.openrouter_api_key or "").strip() + effective_api_key = _resolve_boost_api_key(explicit_api_key) client = OpenRouterClient(effective_api_key) try: @@ -70,7 +76,9 @@ async def enable_boost(config: BoostConfig) -> Dict[str, Any]: finally: await client.close() - config.openrouter_api_key = effective_api_key + # Keep explicit boost override keys in process memory only. When the + # user relies on the global OpenRouter key, Boost stores no key at all. + config.openrouter_api_key = explicit_api_key # Enable boost await boost_manager.set_boost_config(config) @@ -84,6 +92,7 @@ async def enable_boost(config: BoostConfig) -> Dict[str, Any]: "config": { "model_id": config.boost_model_id, "provider": config.boost_provider, + "reasoning_effort": config.boost_reasoning_effort, "context_window": config.boost_context_window, "max_output_tokens": config.boost_max_output_tokens } @@ -119,7 +128,11 @@ async def update_boost_model(config: BoostConfig) -> Dict[str, Any]: detail="Boost must be enabled first. Use /api/boost/enable to enable boost." 
) - effective_api_key = _resolve_boost_api_key(config.openrouter_api_key) + explicit_api_key = (config.openrouter_api_key or "").strip() + effective_api_key = _resolve_boost_api_key( + explicit_api_key, + allow_current_override=True, + ) client = OpenRouterClient(effective_api_key) try: @@ -133,7 +146,12 @@ async def update_boost_model(config: BoostConfig) -> Dict[str, Any]: finally: await client.close() - config.openrouter_api_key = effective_api_key + if explicit_api_key: + config.openrouter_api_key = explicit_api_key + elif boost_manager.boost_config and boost_manager.boost_config.openrouter_api_key: + config.openrouter_api_key = boost_manager.boost_config.openrouter_api_key + else: + config.openrouter_api_key = "" # Store current boost state before update old_boost_next_count = boost_manager.boost_next_count @@ -158,6 +176,7 @@ async def update_boost_model(config: BoostConfig) -> Dict[str, Any]: "config": { "model_id": config.boost_model_id, "provider": config.boost_provider, + "reasoning_effort": config.boost_reasoning_effort, "context_window": config.boost_context_window, "max_output_tokens": config.boost_max_output_tokens }, @@ -467,7 +486,7 @@ async def get_boost_logs(limit: int = 100) -> Dict[str, Any]: List of log entries (most recent first) """ try: - logs = await boost_logger.get_logs(limit) + logs = await boost_logger.get_logs(limit, include_full=False) stats = await boost_logger.get_stats() return { @@ -493,7 +512,7 @@ async def get_boost_log_entry(index: int) -> Dict[str, Any]: Full log entry including complete response """ try: - entry = await boost_logger.get_log_entry(index) + entry = await boost_logger.get_log_entry(index, include_full=system_config.api_log_store_full_payloads) if not entry: raise HTTPException(status_code=404, detail="Log entry not found") diff --git a/backend/api/routes/compiler.py b/backend/api/routes/compiler.py index 4561369..4c431a7 100644 --- a/backend/api/routes/compiler.py +++ b/backend/api/routes/compiler.py @@ -1,25 
+1,106 @@ """ Compiler API routes. """ +import asyncio +import hashlib from fastapi import APIRouter, HTTPException import logging from pathlib import Path import aiofiles -from backend.shared.models import CompilerStartRequest, CompilerState, CritiqueRequest +from backend.api.routes import websocket +from backend.shared.models import CompilerStartRequest, CompilerState, CritiqueRequest, ModelConfig from backend.shared.config import system_config from backend.shared.token_tracker import token_tracker -from backend.compiler.core.compiler_coordinator import compiler_coordinator +from backend.shared.api_client_manager import api_client_manager +from backend.shared.workflow_start_guard import workflow_start_guard +from backend.compiler.core.compiler_coordinator import CRITIQUE_ATTEMPT_TARGET, compiler_coordinator from backend.compiler.memory.outline_memory import outline_memory from backend.compiler.memory.paper_memory import paper_memory from backend.aggregator.core.coordinator import coordinator from backend.autonomous.core.autonomous_coordinator import autonomous_coordinator +from backend.autonomous.core.proof_verification_stage import ProofVerificationStage +from backend.autonomous.memory.proof_database import proof_database +from backend.leanoj.core.leanoj_coordinator import leanoj_coordinator logger = logging.getLogger(__name__) router = APIRouter(prefix="/api/compiler", tags=["compiler"]) +async def _run_saved_compiler_paper_proof_check( + full_content: str, + source_title: str, + proof_config: dict, +) -> None: + """Run autonomous proof extraction/tiering for a saved manual compiler paper.""" + if not proof_config.get("lean4_enabled"): + logger.info("Skipping saved compiler paper proof check: Lean 4 disabled") + return + if not full_content.strip(): + return + submitter_model = str(proof_config.get("submitter_model") or "") + validator_model = str(proof_config.get("validator_model") or "") + if not submitter_model: + logger.warning("Skipping saved compiler 
paper proof check: high-context model is unavailable") + return + if not validator_model: + logger.warning("Skipping saved compiler paper proof check: validator model is unavailable") + return + + source_hash = hashlib.sha256(full_content.encode("utf-8")).hexdigest()[:16] + source_id = f"compiler_manual_{source_hash}" + role_suffix = "compiler_manual_paper" + + submitter_config = ModelConfig( + provider=str(proof_config.get("submitter_provider") or "lm_studio"), + model_id=submitter_model, + openrouter_provider=proof_config.get("submitter_openrouter_provider"), + openrouter_reasoning_effort=proof_config.get("submitter_openrouter_reasoning_effort", "auto"), + lm_studio_fallback_id=proof_config.get("submitter_lm_studio_fallback"), + context_window=int(proof_config.get("submitter_context") or system_config.compiler_high_context_context_window), + max_output_tokens=int(proof_config.get("submitter_max_tokens") or system_config.compiler_high_context_max_output_tokens), + supercharge_enabled=bool(proof_config.get("submitter_supercharge_enabled", False)), + ) + validator_config = ModelConfig( + provider=str(proof_config.get("validator_provider") or "lm_studio"), + model_id=validator_model, + openrouter_provider=proof_config.get("validator_openrouter_provider"), + openrouter_reasoning_effort=proof_config.get("validator_openrouter_reasoning_effort", "auto"), + lm_studio_fallback_id=proof_config.get("validator_lm_studio_fallback"), + context_window=int(proof_config.get("validator_context") or system_config.compiler_validator_context_window), + max_output_tokens=int(proof_config.get("validator_max_tokens") or system_config.compiler_validator_max_output_tokens), + supercharge_enabled=bool(proof_config.get("validator_supercharge_enabled", False)), + ) + for role_id in ( + f"autonomous_proof_identification_{role_suffix}", + f"autonomous_proof_lemma_search_{role_suffix}", + f"autonomous_proof_formalization_{role_suffix}", + ): + api_client_manager.configure_role(role_id, 
submitter_config) + api_client_manager.configure_role("autonomous_proof_novelty", validator_config) + + stage = ProofVerificationStage() + await stage.run( + content=full_content, + source_type="paper", + source_id=source_id, + user_prompt=str(proof_config.get("user_prompt") or ""), + submitter_model=submitter_model, + submitter_context=submitter_config.context_window, + submitter_max_tokens=submitter_config.max_output_tokens, + validator_model=validator_model, + validator_context=validator_config.context_window, + validator_max_tokens=validator_config.max_output_tokens, + broadcast_fn=websocket.broadcast_event, + novel_proofs_db=proof_database, + source_title=source_title, + role_suffix_override=role_suffix, + trigger="manual_compiler_save", + append_to_source=False, + ) + + def _get_start_conflict() -> str | None: """Return a user-facing conflict message if another workflow is active.""" if compiler_coordinator.is_running: @@ -29,71 +110,94 @@ def _get_start_conflict() -> str | None: return "Cannot start Compiler while Aggregator is running. Stop Aggregator first." autonomous_state = autonomous_coordinator.get_state() - if autonomous_state.is_running: + if autonomous_state.is_running or autonomous_coordinator.is_active: return "Cannot start Compiler while Autonomous Research is running. Stop Autonomous Research first." + if leanoj_coordinator.is_active: + return "Cannot start Compiler while Proof Solver is running. Stop Proof Solver first." 
+ return None +def _log_background_task_failure(task: asyncio.Task) -> None: + try: + task.result() + except asyncio.CancelledError: + logger.info("Saved compiler paper proof check was cancelled") + except Exception: + logger.exception("Saved compiler paper proof check failed") + + @router.post("/start") async def start_compiler(request: CompilerStartRequest): """Start the compiler system.""" try: - conflict = _get_start_conflict() - if conflict: - raise HTTPException(status_code=400, detail=conflict) - - # Update system config with user-provided context sizes - system_config.compiler_validator_context_window = request.validator_context_size - system_config.compiler_high_context_context_window = request.high_context_context_size - system_config.compiler_high_param_context_window = request.high_param_context_size - system_config.compiler_critique_submitter_context_window = request.critique_submitter_context_window - - # Update max output token configurations - system_config.compiler_validator_max_output_tokens = request.validator_max_output_tokens - system_config.compiler_high_context_max_output_tokens = request.high_context_max_output_tokens - system_config.compiler_high_param_max_output_tokens = request.high_param_max_output_tokens - system_config.compiler_critique_submitter_max_tokens = request.critique_submitter_max_tokens - - # Store critique submitter model - system_config.compiler_critique_submitter_model = request.critique_submitter_model - - logger.info( - f"Compiler max output tokens - " - f"Validator: {request.validator_max_output_tokens}, " - f"High-context: {request.high_context_max_output_tokens}, " - f"High-param: {request.high_param_max_output_tokens}" - ) - - # Initialize coordinator with OpenRouter provider configurations - await compiler_coordinator.initialize( - compiler_prompt=request.compiler_prompt, - validator_model=request.validator_model, - high_context_model=request.high_context_model, - high_param_model=request.high_param_model, - 
critique_submitter_model=request.critique_submitter_model, - # OpenRouter provider configs for each role - validator_provider=request.validator_provider, - validator_openrouter_provider=request.validator_openrouter_provider, - validator_lm_studio_fallback=request.validator_lm_studio_fallback, - high_context_provider=request.high_context_provider, - high_context_openrouter_provider=request.high_context_openrouter_provider, - high_context_lm_studio_fallback=request.high_context_lm_studio_fallback, - high_param_provider=request.high_param_provider, - high_param_openrouter_provider=request.high_param_openrouter_provider, - high_param_lm_studio_fallback=request.high_param_lm_studio_fallback, - critique_submitter_provider=request.critique_submitter_provider, - critique_submitter_openrouter_provider=request.critique_submitter_openrouter_provider, - critique_submitter_lm_studio_fallback=request.critique_submitter_lm_studio_fallback - ) - - # Start coordinator - token_tracker.reset() - token_tracker.start_timer() - await compiler_coordinator.start() - - return {"status": "started", "message": "Compiler started successfully"} + async with workflow_start_guard.reserve(): + conflict = _get_start_conflict() + if conflict: + raise HTTPException(status_code=400, detail=conflict) + + # Update system config with user-provided context sizes + system_config.compiler_validator_context_window = request.validator_context_size + system_config.compiler_high_context_context_window = request.high_context_context_size + system_config.compiler_high_param_context_window = request.high_param_context_size + system_config.compiler_critique_submitter_context_window = request.critique_submitter_context_window + + # Update max output token configurations + system_config.compiler_validator_max_output_tokens = request.validator_max_output_tokens + system_config.compiler_high_context_max_output_tokens = request.high_context_max_output_tokens + system_config.compiler_high_param_max_output_tokens = 
request.high_param_max_output_tokens + system_config.compiler_critique_submitter_max_tokens = request.critique_submitter_max_tokens + + # Store critique submitter model + system_config.compiler_critique_submitter_model = request.critique_submitter_model + + logger.info( + f"Compiler max output tokens - " + f"Validator: {request.validator_max_output_tokens}, " + f"High-context: {request.high_context_max_output_tokens}, " + f"High-param: {request.high_param_max_output_tokens}" + ) + + # Initialize coordinator with OpenRouter provider configurations + await compiler_coordinator.initialize( + compiler_prompt=request.compiler_prompt, + validator_model=request.validator_model, + high_context_model=request.high_context_model, + high_param_model=request.high_param_model, + critique_submitter_model=request.critique_submitter_model, + # OpenRouter provider configs for each role + validator_provider=request.validator_provider, + validator_openrouter_provider=request.validator_openrouter_provider, + validator_openrouter_reasoning_effort=request.validator_openrouter_reasoning_effort, + validator_lm_studio_fallback=request.validator_lm_studio_fallback, + high_context_provider=request.high_context_provider, + high_context_openrouter_provider=request.high_context_openrouter_provider, + high_context_openrouter_reasoning_effort=request.high_context_openrouter_reasoning_effort, + high_context_lm_studio_fallback=request.high_context_lm_studio_fallback, + high_param_provider=request.high_param_provider, + high_param_openrouter_provider=request.high_param_openrouter_provider, + high_param_openrouter_reasoning_effort=request.high_param_openrouter_reasoning_effort, + high_param_lm_studio_fallback=request.high_param_lm_studio_fallback, + critique_submitter_provider=request.critique_submitter_provider, + critique_submitter_openrouter_provider=request.critique_submitter_openrouter_provider, + 
critique_submitter_openrouter_reasoning_effort=request.critique_submitter_openrouter_reasoning_effort, + critique_submitter_lm_studio_fallback=request.critique_submitter_lm_studio_fallback, + validator_supercharge_enabled=request.validator_supercharge_enabled, + high_context_supercharge_enabled=request.high_context_supercharge_enabled, + high_param_supercharge_enabled=request.high_param_supercharge_enabled, + critique_submitter_supercharge_enabled=request.critique_submitter_supercharge_enabled + ) + + # Start coordinator + token_tracker.reset() + token_tracker.start_timer() + await compiler_coordinator.start() + + return {"status": "started", "message": "Compiler started successfully"} + except HTTPException: + raise except ValueError as e: # Model compatibility errors - provide structured error response error_msg = str(e) @@ -336,13 +440,47 @@ async def save_paper(): async with aiofiles.open(output_path, 'w', encoding='utf-8') as f: await f.write(full_content) + + high_context = compiler_coordinator.high_context_submitter + proof_check_scheduled = bool( + system_config.lean4_enabled + and full_content.strip() + and high_context is not None + and getattr(high_context, "model_name", "") + and compiler_coordinator.validator_model + ) + if proof_check_scheduled: + source_title = compiler_coordinator.paper_title or compiler_coordinator.user_prompt or "Compiler Paper" + proof_config = { + "lean4_enabled": system_config.lean4_enabled, + "user_prompt": compiler_coordinator.user_prompt, + "submitter_model": high_context.model_name, + "submitter_provider": compiler_coordinator.high_context_provider, + "submitter_openrouter_provider": compiler_coordinator.high_context_openrouter_provider, + "submitter_openrouter_reasoning_effort": compiler_coordinator.high_context_openrouter_reasoning_effort, + "submitter_lm_studio_fallback": compiler_coordinator.high_context_lm_studio_fallback, + "submitter_context": system_config.compiler_high_context_context_window, + 
"submitter_max_tokens": system_config.compiler_high_context_max_output_tokens, + "submitter_supercharge_enabled": getattr(compiler_coordinator, "high_context_supercharge_enabled", False), + "validator_model": compiler_coordinator.validator_model, + "validator_provider": compiler_coordinator.validator_provider, + "validator_openrouter_provider": compiler_coordinator.validator_openrouter_provider, + "validator_openrouter_reasoning_effort": compiler_coordinator.validator_openrouter_reasoning_effort, + "validator_lm_studio_fallback": compiler_coordinator.validator_lm_studio_fallback, + "validator_context": compiler_coordinator.validator_context_window, + "validator_max_tokens": compiler_coordinator.validator_max_tokens, + "validator_supercharge_enabled": getattr(compiler_coordinator, "validator_supercharge_enabled", False), + } + task = asyncio.create_task(_run_saved_compiler_paper_proof_check(full_content, source_title, proof_config)) + task.add_done_callback(_log_background_task_failure) return { "status": "saved", - "path": str(output_path), + "path": output_path.name, "word_count": word_count, - "message": f"Paper saved to {output_path} ({word_count} words)", - "has_attribution": bool(attribution_section) + "message": f"Paper saved to {output_path.name} ({word_count} words)", + "has_attribution": bool(attribution_section), + "proof_check_scheduled": proof_check_scheduled } except Exception as e: logger.error(f"Failed to save paper: {e}") @@ -441,7 +579,7 @@ async def get_critique_status(): "in_critique_phase": compiler_coordinator.in_critique_phase, "critique_acceptances": compiler_coordinator.critique_acceptances, "paper_version": compiler_coordinator.paper_version, - "target_critiques": 5 + "target_critiques": CRITIQUE_ATTEMPT_TARGET } except Exception as e: logger.error(f"Failed to get critique status: {e}") @@ -512,6 +650,8 @@ async def request_compiler_critique(critique_request: CritiqueRequest = None): validator_max_tokens = 
critique_request.validator_max_tokens validator_provider = critique_request.validator_provider validator_openrouter_provider = critique_request.validator_openrouter_provider + validator_openrouter_reasoning_effort = critique_request.validator_openrouter_reasoning_effort + validator_supercharge_enabled = bool(critique_request.validator_supercharge_enabled) # If validator config not provided in request, fall back to coordinator config if not validator_model: @@ -520,6 +660,8 @@ async def request_compiler_critique(critique_request: CritiqueRequest = None): validator_max_tokens = system_config.compiler_validator_max_output_tokens validator_provider = getattr(compiler_coordinator, 'validator_provider', 'lm_studio') validator_openrouter_provider = getattr(compiler_coordinator, 'validator_openrouter_provider', None) + validator_openrouter_reasoning_effort = getattr(compiler_coordinator, 'validator_openrouter_reasoning_effort', 'auto') + validator_supercharge_enabled = bool(getattr(compiler_coordinator, 'validator_supercharge_enabled', False)) if not validator_model: raise HTTPException( @@ -576,9 +718,11 @@ async def request_compiler_critique(critique_request: CritiqueRequest = None): model_id=validator_model, openrouter_model_id=validator_model if validator_provider == "openrouter" else None, openrouter_provider=validator_openrouter_provider, + openrouter_reasoning_effort=validator_openrouter_reasoning_effort, lm_studio_fallback_id=None, # No fallback for direct critique calls context_window=validator_context_window, - max_output_tokens=validator_max_tokens + max_output_tokens=validator_max_tokens, + supercharge_enabled=validator_supercharge_enabled ) ) diff --git a/backend/api/routes/download.py b/backend/api/routes/download.py index 07cdf7f..dfe44dc 100644 --- a/backend/api/routes/download.py +++ b/backend/api/routes/download.py @@ -3,6 +3,8 @@ Runs in a thread pool so the FastAPI event loop is never blocked. 
""" import asyncio +from html import escape +from html.parser import HTMLParser import logging from pathlib import Path from fastapi import APIRouter, HTTPException @@ -46,6 +48,140 @@ class PDFRequest(BaseModel): filename: str = "document" +_ALLOWED_PDF_TAGS = { + "div", "span", "p", "br", "hr", + "strong", "b", "em", "i", "u", "s", "sub", "sup", "small", + "h1", "h2", "h3", "h4", "h5", "h6", + "ul", "ol", "li", "dl", "dt", "dd", + "table", "thead", "tbody", "tr", "th", "td", + "math", "semantics", "mrow", "mi", "mo", "mn", "msup", "msub", + "mfrac", "mroot", "msqrt", "mtext", "mspace", "mtable", "mtr", "mtd", + "annotation", "annotation-xml", + "svg", "path", "line", "rect", "circle", "g", "use", "defs", "clippath", +} +_VOID_PDF_TAGS = {"br", "hr", "path", "line", "rect", "circle", "use"} +_DROP_CONTENT_TAGS = {"script", "style", "iframe", "object", "embed", "form", "textarea", "select"} +_ALLOWED_PDF_ATTRS = { + "class", "id", "title", "style", + "mathvariant", "encoding", "xmlns", "displaystyle", "scriptlevel", + "columnalign", "rowalign", "columnspacing", "rowspacing", "stretchy", + "symmetric", "fence", "separator", "lspace", "rspace", "accent", + "accentunder", "movablelimits", "minsize", "maxsize", "width", "height", + "d", "viewbox", "preserveaspectratio", "fill", "stroke", "stroke-width", + "transform", "x", "y", "dx", "dy", "x1", "y1", "x2", "y2", "r", "cx", "cy", + "href", "xlink:href", "clip-path", +} +_FORBIDDEN_STYLE_TOKENS = ("url(", "expression", "@import", "behavior:") + + +class _PdfHtmlSanitizer(HTMLParser): + """Small allowlist sanitizer for already-rendered LaTeX/KaTeX HTML.""" + + def __init__(self) -> None: + super().__init__(convert_charrefs=True) + self._parts: list[str] = [] + self._drop_content_depth = 0 + + @staticmethod + def _is_safe_attr(name: str, value: str) -> bool: + attr = name.lower() + if attr not in _ALLOWED_PDF_ATTRS or attr.startswith("on"): + return False + lowered_value = (value or "").strip().lower() + if attr == 
"style": + return not any(token in lowered_value for token in _FORBIDDEN_STYLE_TOKENS) + if attr in {"href", "xlink:href"}: + return lowered_value.startswith("#") or lowered_value.startswith("data:image/") + return True + + def _append_start_tag(self, tag: str, attrs, *, self_closing: bool = False) -> None: + normalized_tag = tag.lower() + if normalized_tag not in _ALLOWED_PDF_TAGS: + if normalized_tag in _DROP_CONTENT_TAGS and not self_closing: + self._drop_content_depth += 1 + return + + rendered_attrs = [] + for name, value in attrs: + attr_name = (name or "").lower() + attr_value = "" if value is None else str(value) + if self._is_safe_attr(attr_name, attr_value): + rendered_attrs.append(f'{attr_name}="{escape(attr_value, quote=True)}"') + + suffix = " /" if self_closing and normalized_tag not in _VOID_PDF_TAGS else "" + attr_text = f" {' '.join(rendered_attrs)}" if rendered_attrs else "" + self._parts.append(f"<{normalized_tag}{attr_text}{suffix}>") + + def handle_starttag(self, tag, attrs) -> None: + self._append_start_tag(tag, attrs) + + def handle_startendtag(self, tag, attrs) -> None: + self._append_start_tag(tag, attrs, self_closing=True) + + def handle_endtag(self, tag) -> None: + normalized_tag = tag.lower() + if normalized_tag in _DROP_CONTENT_TAGS and self._drop_content_depth > 0: + self._drop_content_depth -= 1 + return + if normalized_tag in _ALLOWED_PDF_TAGS and normalized_tag not in _VOID_PDF_TAGS: + self._parts.append(f"</{normalized_tag}>") + + def handle_data(self, data) -> None: + if self._drop_content_depth > 0: + return + self._parts.append(escape(data or "")) + + def handle_entityref(self, name) -> None: + if self._drop_content_depth > 0: + return + self._parts.append(f"&{name};") + + def handle_charref(self, name) -> None: + if self._drop_content_depth > 0: + return + self._parts.append(f"&#{name};") + + def get_html(self) -> str: + return "".join(self._parts) + + +def _sanitize_pdf_html(html_body: str) -> str: + sanitizer = _PdfHtmlSanitizer() + 
sanitizer.feed(html_body or "") + sanitizer.close() + return sanitizer.get_html() + + +def _encoded_size(value: Optional[str]) -> int: + return len((value or "").encode("utf-8")) + + +def _validate_pdf_request_size(req: PDFRequest) -> None: + html_size = _encoded_size(req.html_body) + if html_size > system_config.pdf_max_html_bytes: + raise HTTPException( + status_code=413, + detail=f"html_body exceeds PDF limit of {system_config.pdf_max_html_bytes} bytes", + ) + + outline_size = _encoded_size(req.outline) + if outline_size > system_config.pdf_max_outline_bytes: + raise HTTPException( + status_code=413, + detail=f"outline exceeds PDF limit of {system_config.pdf_max_outline_bytes} bytes", + ) + + metadata_size = sum( + _encoded_size(value) + for value in (req.title, req.date, req.models, req.filename) + ) + if metadata_size > system_config.pdf_max_metadata_bytes: + raise HTTPException( + status_code=413, + detail=f"PDF metadata exceeds limit of {system_config.pdf_max_metadata_bytes} bytes", + ) + + def _build_html_document(req: PDFRequest) -> str: """ Wrap the rendered HTML body in a complete standalone HTML document @@ -56,9 +192,9 @@ def _build_html_document(req: PDFRequest) -> str: if req.word_count: meta_parts.append(f"Word Count: {req.word_count:,}") if req.date: - meta_parts.append(f"Generated: {req.date}") + meta_parts.append(f"Generated: {_escape_html(req.date)}") if req.models: - meta_parts.append(f"AI Models: {req.models}") + meta_parts.append(f"AI Models: {_escape_html(req.models)}") meta_line = "  |  ".join(meta_parts) if meta_parts else "" outline_section = "" @@ -293,10 +429,18 @@ def _generate_pdf_sync(html: str) -> bytes: with sync_playwright() as pw: browser = pw.chromium.launch( headless=True, - args=["--no-sandbox", "--disable-setuid-sandbox", "--disable-dev-shm-usage"] + args=["--disable-dev-shm-usage"] ) + context = None try: - page = browser.new_page() + context = browser.new_context(java_script_enabled=False) + page = context.new_page() + 
page.route( + "**/*", + lambda route: route.continue_() + if route.request.url.startswith(("data:", "blob:", "about:")) + else route.abort(), + ) page.set_content(html, wait_until="load", timeout=60000) pdf_bytes = page.pdf( format="A4", @@ -305,6 +449,11 @@ def _generate_pdf_sync(html: str) -> bytes: ) return pdf_bytes finally: + if context is not None: + try: + context.close() + except Exception: + pass browser.close() @@ -329,7 +478,10 @@ async def generate_pdf(req: PDFRequest): raise HTTPException(status_code=400, detail="html_body is required and cannot be empty") try: - html_document = _build_html_document(req) + _validate_pdf_request_size(req) + sanitized_body = _sanitize_pdf_html(req.html_body) + sanitized_request = req.model_copy(update={"html_body": sanitized_body}) + html_document = _build_html_document(sanitized_request) loop = asyncio.get_running_loop() pdf_bytes = await loop.run_in_executor(None, _generate_pdf_sync, html_document) diff --git a/backend/api/routes/leanoj.py b/backend/api/routes/leanoj.py new file mode 100644 index 0000000..851d88c --- /dev/null +++ b/backend/api/routes/leanoj.py @@ -0,0 +1,392 @@ +"""Proof Solver API routes backed by the LeanOJ workflow.""" +from __future__ import annotations + +import json +import logging +from pathlib import Path +from typing import Any, Optional + +from fastapi import APIRouter, HTTPException + +from backend.aggregator.core.coordinator import coordinator +from backend.autonomous.core.autonomous_coordinator import autonomous_coordinator +from backend.compiler.core.compiler_coordinator import compiler_coordinator +from backend.leanoj.core.leanoj_coordinator import leanoj_coordinator +from backend.shared.config import system_config +from backend.shared.models import LeanOJStartRequest +from backend.shared.workflow_start_guard import workflow_start_guard + +logger = logging.getLogger(__name__) + +router = APIRouter(prefix="/api/leanoj", tags=["leanoj"]) + + +def _leanoj_sessions_base_dir() -> Path: + 
return Path(system_config.data_dir) / "leanoj_sessions" + + +def _read_leanoj_state_file(path: Path) -> dict[str, Any] | None: + try: + payload = json.loads(path.read_text(encoding="utf-8")) + except Exception as exc: + logger.warning("Failed to read LeanOJ state file %s: %s", path, exc) + return None + + if not isinstance(payload, dict): + return None + payload.setdefault("session_id", path.parent.name) + return payload + + +def _iter_leanoj_state_payloads() -> list[dict[str, Any]]: + base_dir = _leanoj_sessions_base_dir() + if not base_dir.exists(): + return [] + + payloads: list[dict[str, Any]] = [] + for state_file in base_dir.glob("*/state.json"): + if not state_file.is_file(): + continue + payload = _read_leanoj_state_file(state_file) + if payload is not None: + payload["_state_file_mtime"] = state_file.stat().st_mtime + payloads.append(payload) + + return payloads + + +def _leanoj_request_payload(payload: dict[str, Any]) -> dict[str, Any]: + request_payload = payload.get("request") + return request_payload if isinstance(request_payload, dict) else {} + + +def _leanoj_prompt(payload: dict[str, Any]) -> str: + request_payload = _leanoj_request_payload(payload) + return ( + str(request_payload.get("user_prompt") or "").strip() + or str(payload.get("selected_topic") or "").strip() + or "Proof Solver problem" + ) + + +def _leanoj_created_at(payload: dict[str, Any], fallback: str = "") -> str: + return ( + str(payload.get("updated_at") or "").strip() + or fallback + or "" + ) + + +def _leanoj_library_id(session_id: str, proof_id: str) -> str: + return f"{session_id}:{proof_id}" + + +def _build_leanoj_final_proof(payload: dict[str, Any]) -> dict[str, Any] | None: + final_solution = str(payload.get("final_solution") or "").strip() + if not final_solution: + return None + + session_id = str(payload.get("session_id") or "latest") + prompt = _leanoj_prompt(payload) + request_payload = _leanoj_request_payload(payload) + proof_id = "final_solution" + shared_proof_id = 
str(payload.get("final_proof_id") or "").strip() + return { + "library_id": _leanoj_library_id(session_id, proof_id), + "proof_id": proof_id, + "shared_proof_id": shared_proof_id, + "session_id": session_id, + "proof_kind": "final", + "theorem_name": "Final Proof Solver Submission", + "theorem_statement": prompt, + "source_type": "leanoj_final", + "source_id": session_id, + "source_title": str(payload.get("selected_topic") or "").strip() or prompt, + "user_prompt": prompt, + "lean_template": str(request_payload.get("lean_template") or ""), + "lean_code": final_solution, + "solver": "Proof Solver", + "attempt_count": int(payload.get("final_attempt_count") or 0), + "verified": True, + "novel": bool(payload.get("final_novel")), + "novelty_tier": str(payload.get("final_novelty_tier") or "not_novel"), + "novelty_reasoning": str(payload.get("final_novelty_reasoning") or ""), + "created_at": _leanoj_created_at(payload), + "phase": str(payload.get("phase") or ""), + } + + +def _build_leanoj_subproofs(payload: dict[str, Any]) -> list[dict[str, Any]]: + session_id = str(payload.get("session_id") or "latest") + prompt = _leanoj_prompt(payload) + request_payload = _leanoj_request_payload(payload) + created_at_fallback = _leanoj_created_at(payload) + subproofs = payload.get("verified_subproofs") or [] + if not isinstance(subproofs, list): + return [] + + proofs: list[dict[str, Any]] = [] + for index, subproof in enumerate(subproofs, start=1): + if not isinstance(subproof, dict) or subproof.get("verified") is False: + continue + lean_code = str(subproof.get("lean_code") or "").strip() + if not lean_code: + continue + + proof_id = str(subproof.get("subproof_id") or f"subproof_{index:03d}") + shared_proof_id = str(subproof.get("proof_id") or "").strip() + request_text = str(subproof.get("request") or "").strip() + theorem_or_lemma = str(subproof.get("theorem_or_lemma") or "").strip() + return_title = theorem_or_lemma or request_text or proof_id + proofs.append( + { + "library_id": 
_leanoj_library_id(session_id, proof_id), + "proof_id": proof_id, + "shared_proof_id": shared_proof_id, + "session_id": session_id, + "proof_kind": "subproof", + "theorem_name": return_title, + "theorem_statement": theorem_or_lemma or request_text or "Verified Proof Solver subproof", + "source_type": "leanoj_subproof", + "source_id": session_id, + "source_title": request_text or prompt, + "user_prompt": prompt, + "lean_template": str(request_payload.get("lean_template") or ""), + "lean_code": lean_code, + "lean_feedback": str(subproof.get("lean_feedback") or ""), + "verification_notes": str(subproof.get("lean_feedback") or ""), + "solver": "Proof Solver", + "attempt_count": int(subproof.get("attempts_used") or 0), + "verified": True, + "novel": bool(subproof.get("novel")), + "novelty_tier": str(subproof.get("novelty_tier") or "not_novel"), + "novelty_reasoning": str(subproof.get("novelty_reasoning") or ""), + "role": str(subproof.get("role") or ""), + "created_at": str(subproof.get("created_at") or "") or created_at_fallback, + "phase": str(payload.get("phase") or ""), + } + ) + return proofs + + +def _extract_leanoj_proofs(payload: dict[str, Any], *, include_subproofs: bool = True) -> list[dict[str, Any]]: + proofs: list[dict[str, Any]] = [] + final_proof = _build_leanoj_final_proof(payload) + if final_proof is not None: + proofs.append(final_proof) + if include_subproofs: + proofs.extend(_build_leanoj_subproofs(payload)) + return proofs + + +def _build_leanoj_session_summary(payload: dict[str, Any], proofs: list[dict[str, Any]]) -> dict[str, Any]: + session_id = str(payload.get("session_id") or "latest") + prompt = _leanoj_prompt(payload) + final_count = sum(1 for proof in proofs if proof.get("proof_kind") == "final") + subproof_count = sum(1 for proof in proofs if proof.get("proof_kind") == "subproof") + return { + "session_id": session_id, + "user_prompt": prompt, + "selected_topic": str(payload.get("selected_topic") or ""), + "created_at": 
_leanoj_created_at(payload), + "updated_at": _leanoj_created_at(payload), + "phase": str(payload.get("phase") or ""), + "proof_count": len(proofs), + "final_count": final_count, + "subproof_count": subproof_count, + "is_current": session_id == leanoj_coordinator.get_state().session_id, + } + + +def _sort_leanoj_proofs(proofs: list[dict[str, Any]]) -> list[dict[str, Any]]: + return sorted( + proofs, + key=lambda proof: str(proof.get("created_at") or ""), + reverse=True, + ) + + +def _get_start_conflict() -> Optional[str]: + if leanoj_coordinator.is_active: + return "Proof Solver is already running" + if coordinator.is_running: + return "Cannot start Proof Solver while Aggregator is running. Stop Aggregator first." + if compiler_coordinator.is_running: + return "Cannot start Proof Solver while Compiler is running. Stop Compiler first." + autonomous_state = autonomous_coordinator.get_state() + if autonomous_state.is_running or autonomous_coordinator.is_active: + return "Cannot start Proof Solver while Autonomous Research is running. Stop Autonomous Research first." + return None + + +@router.post("/start") +async def start_leanoj(request: LeanOJStartRequest): + """Start a Proof Solver run.""" + try: + async with workflow_start_guard.reserve(): + conflict = _get_start_conflict() + if conflict: + raise HTTPException(status_code=400, detail=conflict) + if not system_config.lean4_enabled: + raise HTTPException(status_code=400, detail="Lean 4 is disabled. 
Enable Lean 4 proof verification before starting Proof Solver.") + resumed = await leanoj_coordinator.resume_or_initialize(request) + if not leanoj_coordinator.start_in_background(): + raise HTTPException(status_code=400, detail="Proof Solver is already running") + return { + "success": True, + "message": "Proof Solver resumed" if resumed else "Proof Solver started", + "resumed": resumed, + "session_id": leanoj_coordinator.get_state().session_id, + } + except HTTPException: + raise + except ValueError as exc: + raise HTTPException(status_code=400, detail=str(exc)) + except Exception as exc: + logger.exception("Failed to start Proof Solver") + raise HTTPException(status_code=500, detail=str(exc)) + + +@router.post("/stop") +async def stop_leanoj(): + """Stop the active Proof Solver run.""" + try: + await leanoj_coordinator.stop() + return { + "success": True, + "message": "Proof Solver stopped", + "status": leanoj_coordinator.get_status(), + } + except Exception as exc: + logger.exception("Failed to stop Proof Solver") + raise HTTPException(status_code=500, detail=str(exc)) + + +@router.post("/clear") +async def clear_leanoj(confirm: bool = False): + """Clear saved Proof Solver progress.""" + if not confirm: + raise HTTPException(status_code=400, detail="Confirmation required. 
Use ?confirm=true to clear Proof Solver progress.") + try: + await leanoj_coordinator.clear() + return { + "success": True, + "message": "Proof Solver progress cleared", + "status": leanoj_coordinator.get_status(), + } + except Exception as exc: + logger.exception("Failed to clear Proof Solver progress") + raise HTTPException(status_code=500, detail=str(exc)) + + +@router.get("/status") +async def get_leanoj_status(): + """Return the current Proof Solver state.""" + return leanoj_coordinator.get_status() + + +@router.get("/master-proof") +async def get_leanoj_master_proof(): + """Return the current Proof Solver master proof draft on demand.""" + return await leanoj_coordinator.get_master_proof_draft() + + +@router.get("/master-proof/edits") +async def get_leanoj_master_proof_edits(limit: int = 50): + """Return compact summaries of recent Proof Solver master proof edits.""" + return await leanoj_coordinator.get_master_proof_edit_summaries(limit=limit) + + +@router.get("/proofs") +async def get_leanoj_proofs(): + """Return verified proofs from the currently loaded LeanOJ run.""" + status = leanoj_coordinator.get_status() + proofs = _extract_leanoj_proofs(status) + return { + "proofs": _sort_leanoj_proofs(proofs), + "status": status, + "counts": { + "total": len(proofs), + "final": sum(1 for proof in proofs if proof.get("proof_kind") == "final"), + "subproof": sum(1 for proof in proofs if proof.get("proof_kind") == "subproof"), + }, + } + + +@router.get("/library") +async def get_leanoj_library(include_subproofs: bool = True): + """Return completed Proof Solver proof works across saved sessions.""" + payloads_by_session: dict[str, dict[str, Any]] = { + str(payload.get("session_id") or ""): payload + for payload in _iter_leanoj_state_payloads() + if payload.get("session_id") + } + + current_status = leanoj_coordinator.get_status() + current_session_id = str(current_status.get("session_id") or "") + if current_session_id: + payloads_by_session[current_session_id] = 
current_status + + proofs: list[dict[str, Any]] = [] + sessions: list[dict[str, Any]] = [] + for payload in payloads_by_session.values(): + session_proofs = _extract_leanoj_proofs(payload, include_subproofs=include_subproofs) + if not session_proofs: + continue + proofs.extend(session_proofs) + sessions.append(_build_leanoj_session_summary(payload, session_proofs)) + + return { + "proofs": _sort_leanoj_proofs(proofs), + "sessions": sorted( + sessions, + key=lambda session: str(session.get("updated_at") or ""), + reverse=True, + ), + } + + +@router.get("/library/{session_id}/{proof_id}") +async def get_leanoj_library_proof(session_id: str, proof_id: str): + """Return one completed Proof Solver proof work with full Lean source.""" + current_status = leanoj_coordinator.get_status() + if str(current_status.get("session_id") or "") == session_id: + for proof in _extract_leanoj_proofs(current_status): + if proof.get("proof_id") == proof_id: + return proof + + for payload in _iter_leanoj_state_payloads(): + if str(payload.get("session_id") or "") != session_id: + continue + for proof in _extract_leanoj_proofs(payload): + if proof.get("proof_id") == proof_id: + return proof + break + + raise HTTPException(status_code=404, detail="Proof Solver proof work not found") + + +@router.post("/skip-brainstorm") +async def skip_leanoj_brainstorm(): + """Request immediate exit from Proof Solver brainstorming into final proof solving.""" + if not leanoj_coordinator.is_active: + raise HTTPException(status_code=400, detail="Proof Solver is not running") + await leanoj_coordinator.skip_brainstorm() + return { + "success": True, + "message": "Proof Solver brainstorming will be skipped and final proof solving will start", + "status": leanoj_coordinator.get_status(), + } + + +@router.post("/force-brainstorm") +async def force_leanoj_brainstorm(): + """Request a return to recursive Proof Solver brainstorming without clearing proof progress.""" + if not leanoj_coordinator.is_active: + raise 
HTTPException(status_code=400, detail="Proof Solver is not running") + await leanoj_coordinator.force_brainstorm() + return { + "success": True, + "message": "Proof Solver will return to recursive brainstorming with the current proof preserved", + "status": leanoj_coordinator.get_status(), + } diff --git a/backend/api/routes/openrouter.py b/backend/api/routes/openrouter.py index 5289f2c..8ae6294 100644 --- a/backend/api/routes/openrouter.py +++ b/backend/api/routes/openrouter.py @@ -10,7 +10,7 @@ Note: Boost routes can reuse the active global key by default, while still allowing an explicit boost-only override key when the user provides one. """ -from fastapi import APIRouter, HTTPException, Header +from fastapi import APIRouter, HTTPException, Header, Request from pydantic import BaseModel from typing import Dict, Any, Optional import logging @@ -242,21 +242,27 @@ async def get_api_key_status() -> Dict[str, Any]: @router.get("/api/openrouter/models") -async def get_models(api_key: Optional[str] = None, free_only: bool = False) -> Dict[str, Any]: +async def get_models(request: Request, free_only: bool = False, authorization: Optional[str] = Header(None)) -> Dict[str, Any]: """ Fetch available OpenRouter models. - If api_key is provided, uses that key. Otherwise uses the stored global key. + If Authorization is provided, uses that key. Otherwise uses the stored global key. 
Args: - api_key: Optional API key to use instead of stored key (query parameter) free_only: If True, only return models with $0 pricing (query parameter) + authorization: Optional API key via Authorization header (Bearer token) Returns: List of available models with their details """ try: - # Use provided key or fall back to stored key + if "api_key" in request.query_params: + raise HTTPException( + status_code=400, + detail="OpenRouter API keys must be supplied via Authorization header, not URL query parameters.", + ) + + api_key = authorization.replace("Bearer ", "") if authorization and authorization.startswith("Bearer ") else authorization key_to_use = api_key or rag_config.openrouter_api_key if not key_to_use: diff --git a/backend/api/routes/proofs.py b/backend/api/routes/proofs.py index 9dff94b..f46333d 100644 --- a/backend/api/routes/proofs.py +++ b/backend/api/routes/proofs.py @@ -5,9 +5,10 @@ import asyncio import logging +from pathlib import Path from typing import Optional, Tuple -from fastapi import APIRouter, BackgroundTasks, HTTPException +from fastapi import APIRouter, BackgroundTasks, HTTPException, Query from fastapi.responses import JSONResponse, PlainTextResponse from backend.api.routes import websocket @@ -39,19 +40,50 @@ router = APIRouter(prefix="/api/proofs", tags=["proofs"]) +def _safe_path_label(path_value: str) -> str: + """Return a display-safe basename instead of an absolute local path.""" + text = str(path_value or "").strip() + if not text: + return "" + try: + return Path(text).name or "[configured]" + except Exception: + return "[configured]" + + def _build_model_config(role: ProofRoleConfigSnapshot) -> ModelConfig: return ModelConfig( provider=role.provider, model_id=role.model_id, openrouter_model_id=role.model_id if role.provider == "openrouter" else None, openrouter_provider=role.openrouter_provider, + openrouter_reasoning_effort=role.openrouter_reasoning_effort, lm_studio_fallback_id=role.lm_studio_fallback_id, 
context_window=role.context_window, max_output_tokens=role.max_output_tokens, + supercharge_enabled=role.supercharge_enabled, ) -async def _get_runtime_snapshot() -> Optional[ProofRuntimeConfigSnapshot]: +def _get_request_runtime_snapshot(request: Optional[ProofCheckRequest]) -> Optional[ProofRuntimeConfigSnapshot]: + if not request or not request.proof_runtime_config: + return None + + try: + return ProofRuntimeConfigSnapshot(**request.proof_runtime_config) + except Exception as exc: + logger.error("Manual proof runtime config from request is invalid: %s", exc) + raise HTTPException( + status_code=400, + detail="Manual proof runtime model configuration is invalid.", + ) + + +async def _get_runtime_snapshot(request: Optional[ProofCheckRequest] = None) -> Optional[ProofRuntimeConfigSnapshot]: + request_snapshot = _get_request_runtime_snapshot(request) + if request_snapshot is not None: + return request_snapshot + snapshot_dict = autonomous_coordinator.get_proof_runtime_config() if not snapshot_dict: snapshot_dict = await research_metadata.get_proof_runtime_config() @@ -134,7 +166,7 @@ async def _resolve_manual_source(request: ProofCheckRequest) -> Tuple[str, str]: async def _run_manual_proof_check(request: ProofCheckRequest) -> None: try: source_content, source_title = await _resolve_manual_source(request) - snapshot = await _get_runtime_snapshot() + snapshot = await _get_runtime_snapshot(request) if snapshot is None: raise RuntimeError("No proof runtime model configuration is available yet.") @@ -287,15 +319,38 @@ def _clean_content(content: str, proof_header: str) -> tuple[str, int]: @router.post("/cleanup-known-from-files") -async def cleanup_known_proofs_from_files(): +async def cleanup_known_proofs_from_files(confirm: bool = Query(default=False)): """One-time cleanup: strip non-novel proof entries from brainstorm/paper files. Non-novel proofs are stored in ProofDatabase (no data loss). 
This endpoint removes their raw Lean 4 code from brainstorm and paper .txt files so that compiler and RAG context is no longer polluted by standard known results. - Safe to call on a running session. Novel proof entries are preserved. + Requires explicit confirmation because it mutates brainstorm/paper files. + Novel proof entries are preserved. """ + if system_config.generic_mode: + raise HTTPException( + status_code=501, + detail={ + "lean4_enabled": False, + "message": "Proof file cleanup is unavailable in hosted mode.", + }, + ) + if not system_config.lean4_enabled: + raise HTTPException( + status_code=501, + detail={ + "lean4_enabled": False, + "message": "Proof file cleanup is unavailable while Lean 4 is disabled.", + }, + ) + if not confirm: + raise HTTPException( + status_code=400, + detail="Pass ?confirm=true to strip known proof entries from brainstorm and paper files.", + ) + result = await _strip_known_proofs_from_files() return result @@ -335,8 +390,11 @@ async def get_proofs_status(): return { "lean4_enabled": system_config.lean4_enabled, "lean4_lsp_enabled": system_config.lean4_lsp_enabled, - "lean4_path": system_config.lean4_path, - "lean4_workspace_dir": system_config.lean4_workspace_dir, + "lean4_path": _safe_path_label(system_config.lean4_path), + "lean4_path_configured": bool(system_config.lean4_path), + "lean4_workspace_dir": _safe_path_label(system_config.lean4_workspace_dir), + "lean4_workspace_configured": bool(system_config.lean4_workspace_dir), + "runtime_paths_redacted": True, "lean_version": version, "lean4_version": version, "lean4_proof_timeout": system_config.lean4_proof_timeout, @@ -347,7 +405,8 @@ async def get_proofs_status(): "mathlib_commit": mathlib_commit, "smt_enabled": system_config.smt_enabled, "smt_available": smt_available, - "z3_path": system_config.z3_path, + "z3_path": _safe_path_label(system_config.z3_path), + "z3_path_configured": bool(system_config.z3_path), "smt_timeout": system_config.smt_timeout, "z3_version": 
z3_version, "manual_check_ready": manual_check_ready, @@ -371,7 +430,6 @@ async def update_proof_settings(request: ProofSettingsUpdateRequest): ) previous_smt_settings = ( system_config.smt_enabled, - system_config.z3_path, system_config.smt_timeout, ) @@ -383,8 +441,6 @@ async def update_proof_settings(request: ProofSettingsUpdateRequest): system_config.lean4_lsp_idle_timeout = int(request.lean4_lsp_idle_timeout) if request.smt_enabled is not None: system_config.smt_enabled = bool(request.smt_enabled) - if request.z3_path is not None: - system_config.z3_path = str(request.z3_path or "").strip() if request.smt_timeout is not None: system_config.smt_timeout = int(request.smt_timeout) @@ -397,7 +453,6 @@ async def update_proof_settings(request: ProofSettingsUpdateRequest): ) smt_settings_changed = previous_smt_settings != ( system_config.smt_enabled, - system_config.z3_path, system_config.smt_timeout, ) @@ -421,7 +476,7 @@ async def run_manual_proof_check(request: ProofCheckRequest, background_tasks: B if not system_config.lean4_enabled: raise HTTPException(status_code=501, detail={"lean4_enabled": False, "message": "Lean 4 proof checks are disabled."}) - snapshot = await _get_runtime_snapshot() + snapshot = await _get_runtime_snapshot(request) if snapshot is None: raise HTTPException( status_code=409, @@ -431,7 +486,7 @@ async def run_manual_proof_check(request: ProofCheckRequest, background_tasks: B if not selected_role.model_id or not snapshot.validator.model_id: raise HTTPException( status_code=409, - detail="Proof runtime model configuration is incomplete. Start autonomous research again to refresh proof roles.", + detail="Proof runtime model configuration is incomplete. 
Select models for the proof role and validator, then try again.", ) await _resolve_manual_source(request) diff --git a/backend/api/routes/update.py b/backend/api/routes/update.py index 1a23699..d6fc8bf 100644 --- a/backend/api/routes/update.py +++ b/backend/api/routes/update.py @@ -14,7 +14,9 @@ from pathlib import Path from typing import Any, Dict, Tuple -from fastapi import APIRouter +from fastapi import APIRouter, HTTPException + +from backend.shared.config import system_config router = APIRouter(tags=["update"]) logger = logging.getLogger(__name__) @@ -30,7 +32,7 @@ def _parse_semver(version_str: str) -> Tuple[int, ...]: - """Extract numeric version tuple from a semver string (e.g. '1.0.7' -> (1,0,7)).""" + """Extract numeric version tuple from a semver string (e.g. '1.0.8' -> (1,0,8)).""" parts = re.findall(r"\d+", version_str or "") return tuple(int(p) for p in parts) if parts else (0,) @@ -215,6 +217,12 @@ async def _run_zip_update() -> None: @router.post("/api/update/pull") async def start_pull() -> Dict[str, Any]: """Kick off an update. Routes to git pull or ZIP overlay depending on install type.""" + if system_config.generic_mode: + raise HTTPException( + status_code=501, + detail="Self-update is unavailable in hosted generic mode.", + ) + if _pull_state["status"] == "running": return {"started": False, "reason": "An update is already in progress."} diff --git a/backend/api/routes/websocket.py b/backend/api/routes/websocket.py index 02608e4..3ccce5a 100644 --- a/backend/api/routes/websocket.py +++ b/backend/api/routes/websocket.py @@ -1,12 +1,14 @@ """ WebSocket route for real-time updates. 
""" -from fastapi import APIRouter, WebSocket, WebSocketDisconnect, status +from fastapi import APIRouter, HTTPException, WebSocket, WebSocketDisconnect, status from typing import List, Dict from datetime import datetime import asyncio import logging import json +import secrets +import time from backend.api.proxy_auth import ProxyAuthError, validate_proxy_headers from backend.shared.config import system_config @@ -57,6 +59,34 @@ async def broadcast(self, event_type: str, data: Dict): # Global connection manager manager = ConnectionManager() +_DESKTOP_WS_TICKET_TTL_SECONDS = 30 +_desktop_ws_tickets: Dict[str, float] = {} + + +def _prune_expired_desktop_tickets(now: float) -> None: + expired = [ + ticket + for ticket, expires_at in _desktop_ws_tickets.items() + if expires_at <= now + ] + for ticket in expired: + _desktop_ws_tickets.pop(ticket, None) + + +@router.post("/api/ws-ticket") +async def create_desktop_websocket_ticket(): + """Create a one-time desktop WebSocket ticket via token-authenticated HTTP.""" + if system_config.generic_mode: + raise HTTPException( + status_code=501, + detail="Desktop WebSocket tickets are not used in generic mode.", + ) + + now = time.time() + _prune_expired_desktop_tickets(now) + ticket = secrets.token_urlsafe(32) + _desktop_ws_tickets[ticket] = now + _DESKTOP_WS_TICKET_TTL_SECONDS + return {"ticket": ticket, "expires_in": _DESKTOP_WS_TICKET_TTL_SECONDS} @router.websocket("/ws") @@ -68,6 +98,8 @@ async def websocket_endpoint(websocket: WebSocket): websocket.headers, method="GET", path=websocket.url.path, + query_string=websocket.url.query, + body=b"", expected_instance_id=system_config.instance_id, shared_secret=system_config.internal_proxy_secret or "", ) @@ -78,6 +110,18 @@ async def websocket_endpoint(websocket: WebSocket): reason=exc.detail, ) return + else: + now = time.time() + _prune_expired_desktop_tickets(now) + ticket = (websocket.query_params.get("ticket") or "").strip() + expires_at = _desktop_ws_tickets.pop(ticket, 
None) if ticket else None + if not expires_at or expires_at <= now: + logger.warning("Rejected desktop websocket connection: missing or invalid ticket") + await websocket.close( + code=status.WS_1008_POLICY_VIOLATION, + reason="Missing or invalid desktop WebSocket ticket.", + ) + return await manager.connect(websocket) diff --git a/backend/api/routes/workflow.py b/backend/api/routes/workflow.py index 8ccaa45..2f9c375 100644 --- a/backend/api/routes/workflow.py +++ b/backend/api/routes/workflow.py @@ -33,12 +33,17 @@ async def get_workflow_predictions() -> Dict[str, Any]: from backend.aggregator.core.coordinator import coordinator from backend.compiler.core.compiler_coordinator import compiler_coordinator from backend.autonomous.core.autonomous_coordinator import autonomous_coordinator + from backend.leanoj.core.leanoj_coordinator import leanoj_coordinator # Determine which coordinator is active and return its workflow tasks = [] mode = "idle" - if autonomous_coordinator._running: + if leanoj_coordinator.is_active: + mode = "leanoj" + tasks = [task.model_dump(mode="json") for task in leanoj_coordinator.workflow_tasks] + logger.debug(f"Returning {len(tasks)} tasks from LeanOJ coordinator") + elif autonomous_coordinator._running: mode = "autonomous" # For autonomous mode, check which sub-coordinator is active if autonomous_coordinator._brainstorm_aggregator and autonomous_coordinator._brainstorm_aggregator.is_running: diff --git a/backend/autonomous/agents/final_answer/certainty_assessor.py b/backend/autonomous/agents/final_answer/certainty_assessor.py index 4c04b1e..e6f0961 100644 --- a/backend/autonomous/agents/final_answer/certainty_assessor.py +++ b/backend/autonomous/agents/final_answer/certainty_assessor.py @@ -9,8 +9,7 @@ CRITICAL: Operates ONLY on Tier 2 papers, NOT on Tier 1 brainstorm databases. NO RAG FOR ABSTRACTS (by design): Step 1 browses abstracts/outlines which are small metadata. 
-EXPANDED PAPERS OVERFLOW: Step 2 currently drops expanded papers if they don't fit. -TODO: Should RAG expanded papers instead of dropping — see audit note in rag-design rule. +EXPANDED PAPERS OVERFLOW: Step 2 uses RAG fallback for expanded papers when full direct injection does not fit. """ import asyncio import json diff --git a/backend/autonomous/agents/lemma_search_agent.py b/backend/autonomous/agents/lemma_search_agent.py index d2e69a2..cd696f7 100644 --- a/backend/autonomous/agents/lemma_search_agent.py +++ b/backend/autonomous/agents/lemma_search_agent.py @@ -13,6 +13,7 @@ from backend.shared.api_client_manager import api_client_manager from backend.shared.json_parser import parse_json from backend.shared.lean4_client import get_lean4_client +from backend.shared.model_error_utils import is_non_retryable_model_error from backend.shared.models import MathlibLemmaHint, ProofCandidate from backend.shared.openrouter_client import FreeModelExhaustedError from backend.shared.utils import count_tokens @@ -295,6 +296,8 @@ async def suggest_relevant_lemmas( except FreeModelExhaustedError: raise except Exception as exc: + if is_non_retryable_model_error(exc): + raise logger.warning( "MathlibLemmaSearchAgent failed for theorem %s: %s", theorem_candidate.theorem_id, diff --git a/backend/autonomous/agents/paper_title_selector.py b/backend/autonomous/agents/paper_title_selector.py index 6107b56..195c364 100644 --- a/backend/autonomous/agents/paper_title_selector.py +++ b/backend/autonomous/agents/paper_title_selector.py @@ -55,6 +55,10 @@ def set_task_tracking_callback(self, callback: Callable) -> None: def get_current_task_id(self) -> str: """Get the task ID for the current/next API call.""" return f"agg_sub1_{self.task_sequence:03d}" + + def get_current_validation_task_id(self) -> str: + """Get a validator-routed task ID for title validation.""" + return f"agg_val_{self.task_sequence:03d}" async def select_title( self, @@ -232,7 +236,7 @@ async def _generate_title( 
return None # Generate task ID for tracking - task_id = self.get_current_task_id() + task_id = self.get_current_validation_task_id() self.task_sequence += 1 # Notify task started (for workflow panel) @@ -326,7 +330,7 @@ async def _validate_title( response = await api_client_manager.generate_completion( task_id=task_id, - role_id=self.role_id, # Use same role_id for validation + role_id="autonomous_paper_title_validator", model=self.validator_model_id, messages=[{"role": "user", "content": prompt}], max_tokens=15000, diff --git a/backend/autonomous/agents/proof_formalization_agent.py b/backend/autonomous/agents/proof_formalization_agent.py index aa589e3..e43c085 100644 --- a/backend/autonomous/agents/proof_formalization_agent.py +++ b/backend/autonomous/agents/proof_formalization_agent.py @@ -10,6 +10,7 @@ from backend.shared.api_client_manager import api_client_manager from backend.shared.json_parser import parse_json from backend.shared.lean4_client import get_lean4_client +from backend.shared.model_error_utils import is_non_retryable_model_error from backend.shared.models import ProofAttemptFeedback, ProofCandidate, SmtHint from backend.shared.openrouter_client import FreeModelExhaustedError from backend.shared.utils import count_tokens @@ -266,6 +267,8 @@ async def _run_full_script_attempt( except FreeModelExhaustedError: raise except Exception as exc: + if is_non_retryable_model_error(exc): + raise is_parse_error = _is_json_parse_error(exc) feedback = ProofAttemptFeedback( attempt=attempt_number, @@ -558,6 +561,8 @@ async def prove_candidate_tactic_script( except FreeModelExhaustedError: raise except Exception as exc: + if is_non_retryable_model_error(exc): + raise is_parse_error = _is_json_parse_error(exc) feedback = ProofAttemptFeedback( attempt=attempt_number, diff --git a/backend/autonomous/agents/proof_identification_agent.py b/backend/autonomous/agents/proof_identification_agent.py index 9fa7791..82bde54 100644 --- 
a/backend/autonomous/agents/proof_identification_agent.py +++ b/backend/autonomous/agents/proof_identification_agent.py @@ -6,6 +6,7 @@ from backend.shared.api_client_manager import api_client_manager from backend.shared.json_parser import parse_json +from backend.shared.model_error_utils import is_non_retryable_model_error from backend.shared.models import ProofCandidate from backend.shared.openrouter_client import FreeModelExhaustedError from backend.shared.utils import count_tokens @@ -104,6 +105,8 @@ async def translate_candidate_to_smt( except FreeModelExhaustedError: raise except Exception as exc: + if is_non_retryable_model_error(exc): + raise logger.debug( "ProofIdentificationAgent SMT translation failed for theorem %s: %s", theorem_candidate.theorem_id, @@ -183,6 +186,8 @@ async def identify_candidates( except FreeModelExhaustedError: raise except Exception as exc: + if is_non_retryable_model_error(exc): + raise logger.error( "ProofIdentificationAgent failed for %s %s: %s", source_type, diff --git a/backend/autonomous/core/autonomous_coordinator.py b/backend/autonomous/core/autonomous_coordinator.py index 337e922..3805d1c 100644 --- a/backend/autonomous/core/autonomous_coordinator.py +++ b/backend/autonomous/core/autonomous_coordinator.py @@ -84,6 +84,10 @@ logger = logging.getLogger(__name__) +_PARENT_PHASE_SHUTDOWN_TIMEOUT_SECONDS = 60 * 60 +_WORKFLOW_PHASE_UNSET = object() +_BRAINSTORM_ACCEPTANCE_HARD_LIMIT = 30 + class AutonomousCoordinator: """ @@ -107,7 +111,9 @@ def __init__(self): self._validator_max_tokens: int = 15000 self._validator_provider: str = "lm_studio" self._validator_openrouter_provider: Optional[str] = None + self._validator_openrouter_reasoning_effort: str = "auto" self._validator_lm_studio_fallback: Optional[str] = None + self._validator_supercharge_enabled: bool = False # Compiler models (separate from aggregator submitters) self._high_context_model: str = "" @@ -116,6 +122,24 @@ def __init__(self): self._high_param_context: int = 
10000 self._high_context_max_tokens: int = 25000 self._high_param_max_tokens: int = 15000 + self._high_context_provider: str = "lm_studio" + self._high_context_openrouter_provider: Optional[str] = None + self._high_context_openrouter_reasoning_effort: str = "auto" + self._high_context_lm_studio_fallback: Optional[str] = None + self._high_context_supercharge_enabled: bool = False + self._high_param_provider: str = "lm_studio" + self._high_param_openrouter_provider: Optional[str] = None + self._high_param_openrouter_reasoning_effort: str = "auto" + self._high_param_lm_studio_fallback: Optional[str] = None + self._high_param_supercharge_enabled: bool = False + self._critique_submitter_model: str = "" + self._critique_submitter_context: int = 131072 + self._critique_submitter_max_tokens: int = 25000 + self._critique_submitter_provider: str = "lm_studio" + self._critique_submitter_openrouter_provider: Optional[str] = None + self._critique_submitter_openrouter_reasoning_effort: str = "auto" + self._critique_submitter_lm_studio_fallback: Optional[str] = None + self._critique_submitter_supercharge_enabled: bool = False # Agents (initialized during setup) self._topic_selector: Optional[TopicSelectorAgent] = None @@ -133,6 +157,7 @@ def __init__(self): # Part 1 & 2 Integration self._brainstorm_aggregator: Optional[AggregatorCoordinator] = None self._paper_compiler: Optional[CompilerCoordinator] = None + self._active_child_aggregators: List[AggregatorCoordinator] = [] # Callbacks self._broadcast_callback: Optional[Callable] = None @@ -151,6 +176,7 @@ def __init__(self): self._last_redundancy_check_at: int = 0 self._last_completion_review_at: int = 0 # Acceptance count at last completion review self._manual_paper_writing_triggered: bool = False + self._brainstorm_hard_limit_triggered: bool = False self._resume_paper_phase: Optional[str] = None # Saved phase for resume (body/conclusion/intro/abstract) self._brainstorm_missing_during_paper: bool = False @@ -190,6 +216,52 @@ 
async def _broadcast(self, event: str, data: Dict[str, Any] = None) -> None: # broadcast_event expects (event_type, data) as separate arguments await self._broadcast_callback(event, data or {}) + def _track_child_aggregator(self, aggregator: AggregatorCoordinator) -> None: + """Track local child aggregators so parent phase changes can stop them.""" + if aggregator not in self._active_child_aggregators: + self._active_child_aggregators.append(aggregator) + + def _untrack_child_aggregator(self, aggregator: Optional[AggregatorCoordinator]) -> None: + if aggregator in self._active_child_aggregators: + self._active_child_aggregators.remove(aggregator) + + async def _await_parent_phase_shutdown( + self, + label: str, + awaitable, + *, + timeout: float = _PARENT_PHASE_SHUTDOWN_TIMEOUT_SECONDS, + ) -> bool: + task = asyncio.create_task(awaitable) + try: + await asyncio.wait_for(task, timeout=timeout) + return True + except asyncio.TimeoutError: + logger.warning( + "Timed out waiting %.0fs for %s; cancelling so parent phase can continue", + timeout, + label, + ) + task.cancel() + try: + await task + except asyncio.CancelledError: + pass + return False + + async def _stop_active_child_aggregators(self, reason: str) -> None: + for aggregator in list(self._active_child_aggregators): + try: + if await self._await_parent_phase_shutdown( + f"child aggregator shutdown for {reason}", + aggregator.stop(), + ): + logger.info("Stopped child aggregator for %s", reason) + except Exception as exc: + logger.warning("Error stopping child aggregator for %s: %s", reason, exc) + finally: + self._untrack_child_aggregator(aggregator) + def _append_proof_framing(self, prompt: str) -> str: """Append the persisted proof-framing context when active.""" effective_prompt = prompt or "" @@ -240,25 +312,31 @@ def _build_proof_runtime_config_snapshot(self) -> Dict[str, Any]: provider=first_submitter.provider if first_submitter else "lm_studio", model_id=first_submitter.model_id if first_submitter else 
self._high_context_model, openrouter_provider=first_submitter.openrouter_provider if first_submitter else self._high_context_openrouter_provider, + openrouter_reasoning_effort=first_submitter.openrouter_reasoning_effort if first_submitter else self._high_context_openrouter_reasoning_effort, lm_studio_fallback_id=first_submitter.lm_studio_fallback_id if first_submitter else self._high_context_lm_studio_fallback, context_window=first_submitter.context_window if first_submitter else self._high_context_context, max_output_tokens=first_submitter.max_output_tokens if first_submitter else self._high_context_max_tokens, + supercharge_enabled=first_submitter.supercharge_enabled if first_submitter else self._high_context_supercharge_enabled, ) paper_config = ProofRoleConfigSnapshot( provider=self._high_context_provider, model_id=self._high_context_model, openrouter_provider=self._high_context_openrouter_provider, + openrouter_reasoning_effort=self._high_context_openrouter_reasoning_effort, lm_studio_fallback_id=self._high_context_lm_studio_fallback, context_window=self._high_context_context, max_output_tokens=self._high_context_max_tokens, + supercharge_enabled=self._high_context_supercharge_enabled, ) validator_config = ProofRoleConfigSnapshot( provider=self._validator_provider, model_id=self._validator_model, openrouter_provider=self._validator_openrouter_provider, + openrouter_reasoning_effort=self._validator_openrouter_reasoning_effort, lm_studio_fallback_id=self._validator_lm_studio_fallback, context_window=self._validator_context, max_output_tokens=self._validator_max_tokens, + supercharge_enabled=self._validator_supercharge_enabled, ) return ProofRuntimeConfigSnapshot( brainstorm=brainstorm_config, @@ -373,6 +451,16 @@ async def _run_brainstorm_completion_proofs(self) -> None: if not self._current_topic_id: return + # Entering Lean proof verification is already past brainstorm aggregation. 
+ # Persist that handoff before the potentially long proof stage so a restart + # cannot fall back to a fresh 0-count brainstorm loop. + self._state.current_tier = "tier2_paper_writing" + await self._recover_brainstorm_acceptance_count(self._current_topic_id) + await self._save_workflow_state( + tier="tier2_paper_writing", + phase="brainstorm_proof_verification", + ) + metadata = await brainstorm_memory.get_metadata(self._current_topic_id) brainstorm_content = await brainstorm_memory.get_database_content(self._current_topic_id) await self._run_proof_verification( @@ -381,6 +469,96 @@ async def _run_brainstorm_completion_proofs(self) -> None: self._current_topic_id, source_title=metadata.topic_prompt if metadata else "", ) + + if not self._stop_event.is_set(): + await self._save_workflow_state( + tier="tier2_paper_writing", + phase="pre_paper_compilation", + ) + + async def _recover_brainstorm_acceptance_count(self, topic_id: Optional[str]) -> int: + """Recover a non-zero brainstorm acceptance count from durable files. + + Older workflow states can be stale around the proof handoff. Use the + current workflow count when present, but fall back to brainstorm metadata + and finally to the database file so resume/status never shows a completed + brainstorm as starting from zero. 
+ """ + if not topic_id: + return self._acceptance_count + + recovered_count = max(0, int(self._acceptance_count or 0)) + metadata = await brainstorm_memory.get_metadata(topic_id) + if metadata is not None: + recovered_count = max(recovered_count, int(metadata.submission_count or 0)) + + try: + content = await brainstorm_memory.get_database_content(topic_id, strip_proofs=True) + file_count = len( + re.findall(r"^SUBMISSION\s+#\d+\s*\|", content or "", flags=re.MULTILINE) + ) + recovered_count = max(recovered_count, file_count) + except Exception as exc: + logger.debug("Failed to recover brainstorm count for %s from file: %s", topic_id, exc) + + if recovered_count > (self._acceptance_count or 0): + logger.info( + "Recovered brainstorm acceptance count for %s: %s -> %s", + topic_id, + self._acceptance_count, + recovered_count, + ) + self._acceptance_count = recovered_count + + if metadata is not None and recovered_count > int(metadata.submission_count or 0): + try: + await brainstorm_memory.update_metadata(topic_id, submission_count=recovered_count) + except Exception as exc: + logger.debug("Failed to update recovered brainstorm count for %s: %s", topic_id, exc) + + return self._acceptance_count + + async def _trigger_brainstorm_hard_limit(self, acceptance_count: int) -> None: + """Record the brainstorm hard-limit transition exactly once.""" + self._acceptance_count = max(self._acceptance_count, int(acceptance_count or 0)) + if self._brainstorm_hard_limit_triggered: + return + + self._brainstorm_hard_limit_triggered = True + logger.info( + "Hard limit of %s acceptances reached for %s. 
Forcing paper writing transition.", + _BRAINSTORM_ACCEPTANCE_HARD_LIMIT, + self._current_topic_id, + ) + + shared_training_size = 0 + try: + shared_training_size = await shared_training_memory.get_insights_count() + except Exception as exc: + logger.debug("Failed to read live brainstorm size at hard limit: %s", exc) + + if self._current_topic_id: + try: + await brainstorm_memory.update_metadata( + self._current_topic_id, + submission_count=shared_training_size or self._acceptance_count, + ) + except Exception as exc: + logger.debug("Failed to update hard-limit brainstorm metadata: %s", exc) + + await self._broadcast("brainstorm_hard_limit_reached", { + "topic_id": self._current_topic_id, + "acceptance_count": self._acceptance_count, + "message": ( + f"Brainstorm hard limit of {_BRAINSTORM_ACCEPTANCE_HARD_LIMIT} " + "acceptances reached. Forcing paper writing." + ) + }) + + await brainstorm_memory.mark_complete(self._current_topic_id) + await research_metadata.mark_brainstorm_complete(self._current_topic_id) + + await self._save_workflow_state(tier="tier1_aggregation") async def initialize( self, @@ -401,21 +579,29 @@ async def initialize( # OpenRouter provider configs for validator validator_provider: str = "lm_studio", validator_openrouter_provider: Optional[str] = None, + validator_openrouter_reasoning_effort: str = "auto", validator_lm_studio_fallback: Optional[str] = None, # OpenRouter provider configs for high-context submitter high_context_provider: str = "lm_studio", high_context_openrouter_provider: Optional[str] = None, + high_context_openrouter_reasoning_effort: str = "auto", high_context_lm_studio_fallback: Optional[str] = None, # OpenRouter provider configs for high-param submitter high_param_provider: str = "lm_studio", high_param_openrouter_provider: Optional[str] = None, + high_param_openrouter_reasoning_effort: str = "auto", high_param_lm_studio_fallback: Optional[str] = None, # OpenRouter provider configs for critique submitter 
critique_submitter_provider: str = "lm_studio", critique_submitter_openrouter_provider: Optional[str] = None, + critique_submitter_openrouter_reasoning_effort: str = "auto", critique_submitter_lm_studio_fallback: Optional[str] = None, # Tier 3 Final Answer setting - tier3_enabled: bool = False + tier3_enabled: bool = False, + validator_supercharge_enabled: bool = False, + high_context_supercharge_enabled: bool = False, + high_param_supercharge_enabled: bool = False, + critique_submitter_supercharge_enabled: bool = False ) -> None: """Initialize the coordinator with configuration.""" # Store configuration @@ -447,17 +633,25 @@ async def initialize( # Store OpenRouter provider configs for all roles self._validator_provider = validator_provider self._validator_openrouter_provider = validator_openrouter_provider + self._validator_openrouter_reasoning_effort = validator_openrouter_reasoning_effort self._validator_lm_studio_fallback = validator_lm_studio_fallback self._high_context_provider = high_context_provider self._high_context_openrouter_provider = high_context_openrouter_provider + self._high_context_openrouter_reasoning_effort = high_context_openrouter_reasoning_effort self._high_context_lm_studio_fallback = high_context_lm_studio_fallback self._high_param_provider = high_param_provider self._high_param_openrouter_provider = high_param_openrouter_provider + self._high_param_openrouter_reasoning_effort = high_param_openrouter_reasoning_effort self._high_param_lm_studio_fallback = high_param_lm_studio_fallback self._critique_submitter_provider = critique_submitter_provider self._critique_submitter_openrouter_provider = critique_submitter_openrouter_provider + self._critique_submitter_openrouter_reasoning_effort = critique_submitter_openrouter_reasoning_effort self._critique_submitter_lm_studio_fallback = critique_submitter_lm_studio_fallback self._tier3_enabled = tier3_enabled + self._validator_supercharge_enabled = validator_supercharge_enabled + 
self._high_context_supercharge_enabled = high_context_supercharge_enabled + self._high_param_supercharge_enabled = high_param_supercharge_enabled + self._critique_submitter_supercharge_enabled = critique_submitter_supercharge_enabled logger.info(f"Autonomous coordinator initializing with {len(submitter_configs)} submitters") for config in submitter_configs: @@ -613,6 +807,8 @@ async def initialize( # CRITICAL: Configure roles with api_client_manager so routing works correctly # Configure first submitter (used by topic selector, completion reviewer, reference selector, title selector) first_config = submitter_configs[0] if submitter_configs else SubmitterConfig(submitter_id=1, model_id=first_submitter_model) + first_supercharge_enabled = first_config.supercharge_enabled if hasattr(first_config, 'supercharge_enabled') else False + first_reasoning_effort = getattr(first_config, "openrouter_reasoning_effort", "auto") api_client_manager.configure_role( "autonomous_topic_selector", ModelConfig( @@ -620,9 +816,11 @@ async def initialize( model_id=first_submitter_model, openrouter_model_id=first_config.openrouter_model_id if hasattr(first_config, 'openrouter_model_id') else None, openrouter_provider=first_config.openrouter_provider if hasattr(first_config, 'openrouter_provider') else None, + openrouter_reasoning_effort=first_reasoning_effort, lm_studio_fallback_id=first_config.lm_studio_fallback_id if hasattr(first_config, 'lm_studio_fallback_id') else None, context_window=first_submitter_context, - max_output_tokens=first_submitter_max_tokens + max_output_tokens=first_submitter_max_tokens, + supercharge_enabled=first_supercharge_enabled ) ) @@ -633,9 +831,11 @@ async def initialize( model_id=first_submitter_model, openrouter_model_id=first_config.openrouter_model_id if hasattr(first_config, 'openrouter_model_id') else None, openrouter_provider=first_config.openrouter_provider if hasattr(first_config, 'openrouter_provider') else None, + 
openrouter_reasoning_effort=first_reasoning_effort, lm_studio_fallback_id=first_config.lm_studio_fallback_id if hasattr(first_config, 'lm_studio_fallback_id') else None, context_window=first_submitter_context, - max_output_tokens=first_submitter_max_tokens + max_output_tokens=first_submitter_max_tokens, + supercharge_enabled=first_supercharge_enabled ) ) @@ -646,9 +846,11 @@ async def initialize( model_id=first_submitter_model, openrouter_model_id=first_config.openrouter_model_id if hasattr(first_config, 'openrouter_model_id') else None, openrouter_provider=first_config.openrouter_provider if hasattr(first_config, 'openrouter_provider') else None, + openrouter_reasoning_effort=first_reasoning_effort, lm_studio_fallback_id=first_config.lm_studio_fallback_id if hasattr(first_config, 'lm_studio_fallback_id') else None, context_window=first_submitter_context, - max_output_tokens=first_submitter_max_tokens + max_output_tokens=first_submitter_max_tokens, + supercharge_enabled=first_supercharge_enabled ) ) @@ -659,9 +861,25 @@ async def initialize( model_id=first_submitter_model, openrouter_model_id=first_config.openrouter_model_id if hasattr(first_config, 'openrouter_model_id') else None, openrouter_provider=first_config.openrouter_provider if hasattr(first_config, 'openrouter_provider') else None, + openrouter_reasoning_effort=first_reasoning_effort, lm_studio_fallback_id=first_config.lm_studio_fallback_id if hasattr(first_config, 'lm_studio_fallback_id') else None, context_window=first_submitter_context, - max_output_tokens=first_submitter_max_tokens + max_output_tokens=first_submitter_max_tokens, + supercharge_enabled=first_supercharge_enabled + ) + ) + api_client_manager.configure_role( + "autonomous_paper_title_validator", + ModelConfig( + provider=validator_provider, + model_id=validator_model, + openrouter_model_id=validator_model if validator_provider == "openrouter" else None, + openrouter_provider=validator_openrouter_provider, + 
openrouter_reasoning_effort=validator_openrouter_reasoning_effort, + lm_studio_fallback_id=validator_lm_studio_fallback, + context_window=validator_context_window, + max_output_tokens=validator_max_tokens, + supercharge_enabled=validator_supercharge_enabled ) ) @@ -673,9 +891,11 @@ async def initialize( model_id=validator_model, openrouter_model_id=validator_model if validator_provider == "openrouter" else None, openrouter_provider=validator_openrouter_provider, + openrouter_reasoning_effort=validator_openrouter_reasoning_effort, lm_studio_fallback_id=validator_lm_studio_fallback, context_window=validator_context_window, - max_output_tokens=validator_max_tokens + max_output_tokens=validator_max_tokens, + supercharge_enabled=validator_supercharge_enabled ) ) @@ -686,9 +906,11 @@ async def initialize( model_id=validator_model, openrouter_model_id=validator_model if validator_provider == "openrouter" else None, openrouter_provider=validator_openrouter_provider, + openrouter_reasoning_effort=validator_openrouter_reasoning_effort, lm_studio_fallback_id=validator_lm_studio_fallback, context_window=validator_context_window, - max_output_tokens=validator_max_tokens + max_output_tokens=validator_max_tokens, + supercharge_enabled=validator_supercharge_enabled ) ) @@ -699,9 +921,11 @@ async def initialize( model_id=first_submitter_model, openrouter_model_id=first_config.openrouter_model_id if hasattr(first_config, 'openrouter_model_id') else None, openrouter_provider=first_config.openrouter_provider if hasattr(first_config, 'openrouter_provider') else None, + openrouter_reasoning_effort=first_reasoning_effort, lm_studio_fallback_id=first_config.lm_studio_fallback_id if hasattr(first_config, 'lm_studio_fallback_id') else None, context_window=first_submitter_context, - max_output_tokens=first_submitter_max_tokens + max_output_tokens=first_submitter_max_tokens, + supercharge_enabled=first_supercharge_enabled ) ) @@ -712,9 +936,11 @@ async def initialize( 
model_id=first_submitter_model, openrouter_model_id=first_config.openrouter_model_id if hasattr(first_config, 'openrouter_model_id') else None, openrouter_provider=first_config.openrouter_provider if hasattr(first_config, 'openrouter_provider') else None, + openrouter_reasoning_effort=first_reasoning_effort, lm_studio_fallback_id=first_config.lm_studio_fallback_id if hasattr(first_config, 'lm_studio_fallback_id') else None, context_window=first_submitter_context, - max_output_tokens=first_submitter_max_tokens + max_output_tokens=first_submitter_max_tokens, + supercharge_enabled=first_supercharge_enabled ) ) @@ -725,9 +951,11 @@ async def initialize( model_id=first_submitter_model, openrouter_model_id=first_config.openrouter_model_id if hasattr(first_config, 'openrouter_model_id') else None, openrouter_provider=first_config.openrouter_provider if hasattr(first_config, 'openrouter_provider') else None, + openrouter_reasoning_effort=first_reasoning_effort, lm_studio_fallback_id=first_config.lm_studio_fallback_id if hasattr(first_config, 'lm_studio_fallback_id') else None, context_window=first_submitter_context, - max_output_tokens=first_submitter_max_tokens + max_output_tokens=first_submitter_max_tokens, + supercharge_enabled=first_supercharge_enabled ) ) @@ -738,9 +966,11 @@ async def initialize( model_id=first_submitter_model, openrouter_model_id=first_config.openrouter_model_id if hasattr(first_config, 'openrouter_model_id') else None, openrouter_provider=first_config.openrouter_provider if hasattr(first_config, 'openrouter_provider') else None, + openrouter_reasoning_effort=first_reasoning_effort, lm_studio_fallback_id=first_config.lm_studio_fallback_id if hasattr(first_config, 'lm_studio_fallback_id') else None, context_window=first_submitter_context, - max_output_tokens=first_submitter_max_tokens + max_output_tokens=first_submitter_max_tokens, + supercharge_enabled=first_supercharge_enabled ) ) @@ -751,9 +981,11 @@ async def initialize( 
model_id=self._high_context_model, openrouter_model_id=self._high_context_model if high_context_provider == "openrouter" else None, openrouter_provider=high_context_openrouter_provider, + openrouter_reasoning_effort=high_context_openrouter_reasoning_effort, lm_studio_fallback_id=high_context_lm_studio_fallback, context_window=self._high_context_context, - max_output_tokens=self._high_context_max_tokens + max_output_tokens=self._high_context_max_tokens, + supercharge_enabled=high_context_supercharge_enabled ) ) @@ -764,9 +996,11 @@ async def initialize( model_id=self._high_context_model, openrouter_model_id=self._high_context_model if high_context_provider == "openrouter" else None, openrouter_provider=high_context_openrouter_provider, + openrouter_reasoning_effort=high_context_openrouter_reasoning_effort, lm_studio_fallback_id=high_context_lm_studio_fallback, context_window=self._high_context_context, - max_output_tokens=self._high_context_max_tokens + max_output_tokens=self._high_context_max_tokens, + supercharge_enabled=high_context_supercharge_enabled ) ) @@ -777,9 +1011,11 @@ async def initialize( model_id=self._high_context_model, openrouter_model_id=self._high_context_model if high_context_provider == "openrouter" else None, openrouter_provider=high_context_openrouter_provider, + openrouter_reasoning_effort=high_context_openrouter_reasoning_effort, lm_studio_fallback_id=high_context_lm_studio_fallback, context_window=self._high_context_context, - max_output_tokens=self._high_context_max_tokens + max_output_tokens=self._high_context_max_tokens, + supercharge_enabled=high_context_supercharge_enabled ) ) @@ -790,9 +1026,11 @@ async def initialize( model_id=validator_model, openrouter_model_id=validator_model if validator_provider == "openrouter" else None, openrouter_provider=validator_openrouter_provider, + openrouter_reasoning_effort=validator_openrouter_reasoning_effort, lm_studio_fallback_id=validator_lm_studio_fallback, 
context_window=validator_context_window, - max_output_tokens=validator_max_tokens + max_output_tokens=validator_max_tokens, + supercharge_enabled=validator_supercharge_enabled ) ) @@ -803,9 +1041,11 @@ async def initialize( model_id=first_submitter_model, openrouter_model_id=first_config.openrouter_model_id if hasattr(first_config, 'openrouter_model_id') else None, openrouter_provider=first_config.openrouter_provider if hasattr(first_config, 'openrouter_provider') else None, + openrouter_reasoning_effort=first_reasoning_effort, lm_studio_fallback_id=first_config.lm_studio_fallback_id if hasattr(first_config, 'lm_studio_fallback_id') else None, context_window=first_submitter_context, - max_output_tokens=first_submitter_max_tokens + max_output_tokens=first_submitter_max_tokens, + supercharge_enabled=first_supercharge_enabled ) ) @@ -816,9 +1056,11 @@ async def initialize( model_id=first_submitter_model, openrouter_model_id=first_config.openrouter_model_id if hasattr(first_config, 'openrouter_model_id') else None, openrouter_provider=first_config.openrouter_provider if hasattr(first_config, 'openrouter_provider') else None, + openrouter_reasoning_effort=first_reasoning_effort, lm_studio_fallback_id=first_config.lm_studio_fallback_id if hasattr(first_config, 'lm_studio_fallback_id') else None, context_window=first_submitter_context, - max_output_tokens=first_submitter_max_tokens + max_output_tokens=first_submitter_max_tokens, + supercharge_enabled=first_supercharge_enabled ) ) @@ -829,9 +1071,11 @@ async def initialize( model_id=first_submitter_model, openrouter_model_id=first_config.openrouter_model_id if hasattr(first_config, 'openrouter_model_id') else None, openrouter_provider=first_config.openrouter_provider if hasattr(first_config, 'openrouter_provider') else None, + openrouter_reasoning_effort=first_reasoning_effort, lm_studio_fallback_id=first_config.lm_studio_fallback_id if hasattr(first_config, 'lm_studio_fallback_id') else None, 
context_window=first_submitter_context, - max_output_tokens=first_submitter_max_tokens + max_output_tokens=first_submitter_max_tokens, + supercharge_enabled=first_supercharge_enabled ) ) @@ -842,9 +1086,11 @@ async def initialize( model_id=self._high_context_model, openrouter_model_id=self._high_context_model if high_context_provider == "openrouter" else None, openrouter_provider=high_context_openrouter_provider, + openrouter_reasoning_effort=high_context_openrouter_reasoning_effort, lm_studio_fallback_id=high_context_lm_studio_fallback, context_window=self._high_context_context, - max_output_tokens=self._high_context_max_tokens + max_output_tokens=self._high_context_max_tokens, + supercharge_enabled=high_context_supercharge_enabled ) ) @@ -855,9 +1101,11 @@ async def initialize( model_id=self._high_context_model, openrouter_model_id=self._high_context_model if high_context_provider == "openrouter" else None, openrouter_provider=high_context_openrouter_provider, + openrouter_reasoning_effort=high_context_openrouter_reasoning_effort, lm_studio_fallback_id=high_context_lm_studio_fallback, context_window=self._high_context_context, - max_output_tokens=self._high_context_max_tokens + max_output_tokens=self._high_context_max_tokens, + supercharge_enabled=high_context_supercharge_enabled ) ) @@ -868,9 +1116,11 @@ async def initialize( model_id=self._high_context_model, openrouter_model_id=self._high_context_model if high_context_provider == "openrouter" else None, openrouter_provider=high_context_openrouter_provider, + openrouter_reasoning_effort=high_context_openrouter_reasoning_effort, lm_studio_fallback_id=high_context_lm_studio_fallback, context_window=self._high_context_context, - max_output_tokens=self._high_context_max_tokens + max_output_tokens=self._high_context_max_tokens, + supercharge_enabled=high_context_supercharge_enabled ) ) @@ -885,11 +1135,28 @@ async def initialize( model_id=first_submitter_model, openrouter_model_id=first_config.openrouter_model_id if 
hasattr(first_config, 'openrouter_model_id') else None, openrouter_provider=first_config.openrouter_provider if hasattr(first_config, 'openrouter_provider') else None, + openrouter_reasoning_effort=first_reasoning_effort, lm_studio_fallback_id=first_config.lm_studio_fallback_id if hasattr(first_config, 'lm_studio_fallback_id') else None, context_window=first_submitter_context, - max_output_tokens=first_submitter_max_tokens + max_output_tokens=first_submitter_max_tokens, + supercharge_enabled=first_supercharge_enabled ) ) + tier3_validator_config = ModelConfig( + provider=validator_provider, + model_id=validator_model, + openrouter_model_id=validator_model if validator_provider == "openrouter" else None, + openrouter_provider=validator_openrouter_provider, + openrouter_reasoning_effort=validator_openrouter_reasoning_effort, + lm_studio_fallback_id=validator_lm_studio_fallback, + context_window=validator_context_window, + max_output_tokens=validator_max_tokens, + supercharge_enabled=validator_supercharge_enabled, + ) + api_client_manager.configure_role( + "autonomous_certainty_assessor_validator", + tier3_validator_config, + ) api_client_manager.configure_role( "autonomous_format_selector", @@ -898,11 +1165,17 @@ async def initialize( model_id=first_submitter_model, openrouter_model_id=first_config.openrouter_model_id if hasattr(first_config, 'openrouter_model_id') else None, openrouter_provider=first_config.openrouter_provider if hasattr(first_config, 'openrouter_provider') else None, + openrouter_reasoning_effort=first_reasoning_effort, lm_studio_fallback_id=first_config.lm_studio_fallback_id if hasattr(first_config, 'lm_studio_fallback_id') else None, context_window=first_submitter_context, - max_output_tokens=first_submitter_max_tokens + max_output_tokens=first_submitter_max_tokens, + supercharge_enabled=first_supercharge_enabled ) ) + api_client_manager.configure_role( + "autonomous_format_selector_validator", + tier3_validator_config, + ) 
api_client_manager.configure_role( "autonomous_volume_organizer", @@ -911,11 +1184,17 @@ async def initialize( model_id=first_submitter_model, openrouter_model_id=first_config.openrouter_model_id if hasattr(first_config, 'openrouter_model_id') else None, openrouter_provider=first_config.openrouter_provider if hasattr(first_config, 'openrouter_provider') else None, + openrouter_reasoning_effort=first_reasoning_effort, lm_studio_fallback_id=first_config.lm_studio_fallback_id if hasattr(first_config, 'lm_studio_fallback_id') else None, context_window=first_submitter_context, - max_output_tokens=first_submitter_max_tokens + max_output_tokens=first_submitter_max_tokens, + supercharge_enabled=first_supercharge_enabled ) ) + api_client_manager.configure_role( + "autonomous_volume_organizer_validator", + tier3_validator_config, + ) logger.info("Configured Tier 3 Final Answer agents with api_client_manager") @@ -963,6 +1242,8 @@ async def _check_resume_state(self) -> None: self._current_reference_papers = workflow_state.get("reference_paper_ids", []) self._current_paper_title = workflow_state.get("current_paper_title") self._acceptance_count = workflow_state.get("acceptance_count", 0) + if self._current_topic_id: + await self._recover_brainstorm_acceptance_count(self._current_topic_id) self._rejection_count = workflow_state.get("rejection_count", 0) self._consecutive_rejections = workflow_state.get("consecutive_rejections", 0) self._exhaustion_signals = workflow_state.get("exhaustion_signals", 0) @@ -1302,6 +1583,65 @@ async def _current_brainstorm_available_for_paper(self) -> bool: return True + async def clear_deleted_brainstorm_reference(self, topic_id: str, reason: str) -> None: + """Clear stale coordinator pointers after a stopped brainstorm is deleted.""" + if self._current_topic_id != topic_id: + return + + logger.warning( + f"Clearing current brainstorm reference for {topic_id}: {reason}" + ) + stale_paper_id = self._current_paper_id + if stale_paper_id: + await 
self._delete_stale_incomplete_paper(stale_paper_id, topic_id, reason) + + self._current_topic_id = None + self._current_paper_id = None + self._current_paper_title = None + self._resume_paper_phase = None + self._acceptance_count = 0 + self._rejection_count = 0 + self._cleanup_removals = 0 + self._consecutive_rejections = 0 + self._exhaustion_signals = 0 + self._brainstorm_hard_limit_triggered = False + self._current_reference_papers = [] + self._brainstorm_paper_count = 0 + self._current_brainstorm_paper_ids = [] + self._last_completed_paper_id = None + await research_metadata.set_current_brainstorm(None) + await self._save_workflow_state(tier="tier1_aggregation") + + async def _current_brainstorm_available_for_aggregation(self, db_path: Path) -> bool: + """Return False if the active brainstorm was deleted during aggregation.""" + if not self._current_topic_id: + return False + + metadata = await brainstorm_memory.get_metadata(self._current_topic_id) + if metadata is None or not db_path.exists(): + logger.warning( + f"Stopping aggregation for missing brainstorm {self._current_topic_id}: " + f"metadata_exists={metadata is not None}, db_path={db_path}, " + f"db_exists={db_path.exists()}" + ) + if self._brainstorm_aggregator and self._brainstorm_aggregator.is_running: + await self._brainstorm_aggregator.stop() + + if metadata is None and db_path.exists(): + try: + db_path.unlink() + logger.info(f"Deleted orphaned brainstorm database: {db_path}") + except Exception as e: + logger.warning(f"Failed to delete orphaned brainstorm database {db_path}: {e}") + + await self.clear_deleted_brainstorm_reference( + self._current_topic_id, + "brainstorm metadata or database disappeared during aggregation" + ) + return False + + return True + async def _preserve_failed_paper_state(self, paper_id: str, paper_title: str) -> None: """ Preserve in-progress paper state after a compiler failure so retries resume. 
@@ -1336,8 +1676,18 @@ async def _preserve_failed_paper_state(self, paper_id: str, paper_title: str) -> f"outline_chars={len(current_outline or '')}" ) - async def _save_workflow_state(self, tier: str = None, phase: str = None) -> None: + async def _save_workflow_state(self, tier: str = None, phase: Any = _WORKFLOW_PHASE_UNSET) -> None: """Save current workflow state for crash recovery.""" + if phase is _WORKFLOW_PHASE_UNSET: + phase_to_store = self._resume_paper_phase + try: + existing_state = await research_metadata.get_workflow_state() + phase_to_store = phase_to_store or existing_state.get("paper_phase") + except Exception: + phase_to_store = phase_to_store or None + else: + phase_to_store = phase + # Serialize submitter configs for storage submitter_configs_data = [ { @@ -1359,7 +1709,7 @@ async def _save_workflow_state(self, tier: str = None, phase: str = None) -> Non "current_topic_id": self._current_topic_id, "current_paper_id": self._current_paper_id, "current_paper_title": self._current_paper_title, - "paper_phase": phase, + "paper_phase": phase_to_store, "reference_paper_ids": self._current_reference_papers, # Persist reference papers across restarts "acceptance_count": self._acceptance_count, "rejection_count": self._rejection_count, @@ -1546,24 +1896,90 @@ async def log_callback(task_id, role_id, model, provider, prompt, response, # CRITICAL: Restore paper_id so compilation workflow knows to resume self._current_topic_id = resume_topic self._current_paper_id = resume_paper # FIX: Restore paper_id + await self._recover_brainstorm_acceptance_count(resume_topic) + + if resume_paper and resume_state.get("paper_phase") != "paper_proof_verification": + resume_paper_metadata = await paper_library.get_metadata(resume_paper) + if resume_paper_metadata and resume_paper_metadata.status == "complete": + if await paper_library.is_paper_complete(resume_paper): + logger.info( + "Recovered completed paper %s from stale Tier 2 resume state; " + "running paper proof 
checkpoint instead of recompiling.", + resume_paper, + ) + resume_state["paper_phase"] = "paper_proof_verification" + elif resume_paper and resume_paper_metadata is None: + logger.warning( + "Ignoring stale current_paper_id %s during resume: metadata missing", + resume_paper, + ) + self._current_paper_id = None + resume_paper = None + + paper_resume_completed = False + if resume_state.get("paper_phase") == "paper_proof_verification" and resume_paper: + logger.info( + "Resuming paper proof verification before continuing: %s", + resume_paper, + ) + paper_metadata = await paper_library.get_metadata(resume_paper) + paper_content = await paper_library.get_paper_content( + resume_paper, + strip_proofs=True, + ) + if paper_metadata and paper_content: + self._current_paper_title = paper_metadata.title + self._current_reference_papers = paper_metadata.referenced_papers or self._current_reference_papers + await self._run_completed_paper_proof_checks( + paper_id=resume_paper, + title=paper_metadata.title, + content=paper_content, + source_brainstorm_ids=paper_metadata.source_brainstorm_ids, + ) + if self._stop_event.is_set(): + break + self._last_completed_paper_id = resume_paper + self._current_paper_id = None + self._current_paper_title = None + self._current_paper_tracker = None + await self._save_workflow_state(tier=None, phase=None) + paper_resume_completed = True + else: + logger.warning( + "Cannot resume paper proof verification for %s; saved paper metadata/content is missing", + resume_paper, + ) + self._current_paper_id = None + + if resume_state.get("paper_phase") == "brainstorm_proof_verification": + logger.info( + "Resuming brainstorm proof verification before paper compilation: %s", + resume_topic, + ) + self._resume_paper_phase = None + await self._run_brainstorm_completion_proofs() + if self._stop_event.is_set(): + break + resume_state = None # Clear resume state before retry loop # A resumed brainstorm MUST produce a paper - retry until success or stop 
_resume_paper_attempt = 0 - while not self._stop_event.is_set(): - _resume_paper_attempt += 1 - if _resume_paper_attempt > 1: - logger.warning( - f"Resume paper compilation attempt {_resume_paper_attempt} " - f"for brainstorm {self._current_topic_id} - retrying..." - ) - await asyncio.sleep(5) - if await self._paper_compilation_workflow( - emit_resume_event=(_resume_paper_attempt == 1) - ): - break - if self._brainstorm_missing_during_paper: - break + if not paper_resume_completed: + while not self._stop_event.is_set(): + _resume_paper_attempt += 1 + if _resume_paper_attempt > 1: + logger.warning( + f"Resume paper compilation attempt {_resume_paper_attempt} " + f"for brainstorm {self._current_topic_id} - retrying..." + ) + await asyncio.sleep(5) + if await self._paper_compilation_workflow( + emit_resume_event=(_resume_paper_attempt == 1) + ): + break + if self._brainstorm_missing_during_paper: + break if self._brainstorm_missing_during_paper: self._brainstorm_missing_during_paper = False @@ -1704,6 +2120,24 @@ async def log_callback(task_id, role_id, model, provider, prompt, response, await self._save_workflow_state(tier="tier1_aggregation") resume_state = None continue + await self._recover_brainstorm_acceptance_count(resume_topic) + if metadata.status == "complete": + logger.info( + "Recovered completed brainstorm %s from Tier 1 resume state; " + "continuing at proof/paper handoff instead of aggregation.", + resume_topic, + ) + await self._save_workflow_state( + tier="tier2_paper_writing", + phase="brainstorm_proof_verification", + ) + resume_state = { + **resume_state, + "current_tier": "tier2_paper_writing", + "paper_phase": "brainstorm_proof_verification", + "acceptance_count": self._acceptance_count, + } + continue write_paper = await self._brainstorm_aggregation_loop() resume_state = None # Clear resume state after handling @@ -2018,7 +2452,88 @@ async def _get_resume_point(self) -> Optional[Dict[str, Any]]: """Get resume point if there's an interrupted 
workflow.""" if research_metadata.has_interrupted_workflow(): return await research_metadata.get_workflow_state() + recovered_state = await self._recover_resume_point_from_current_metadata() + if recovered_state: + await research_metadata.save_workflow_state(recovered_state) + logger.info( + "Recovered resume point from saved metadata: tier=%s topic=%s paper=%s", + recovered_state.get("current_tier"), + recovered_state.get("current_topic_id"), + recovered_state.get("current_paper_id"), + ) + return recovered_state return None + + async def _recover_resume_point_from_current_metadata(self) -> Optional[Dict[str, Any]]: + """Synthesize a resume point from durable stats/metadata when workflow state is stale.""" + try: + stats = await research_metadata.get_stats() + topic_id = stats.get("current_brainstorm_id") + paper_id = stats.get("current_paper_id") + if not topic_id and not paper_id: + return None + + paper_title = None + reference_paper_ids: List[str] = [] + paper_phase = None + if paper_id: + paper_metadata = await paper_library.get_metadata(paper_id) + if paper_metadata is None: + logger.info("Ignoring stale current_paper_id %s during resume recovery: metadata missing", paper_id) + paper_id = None + else: + paper_is_complete = await paper_library.is_paper_complete(paper_id) + paper_title = paper_metadata.title + reference_paper_ids = paper_metadata.referenced_papers or [] + if not topic_id and paper_metadata.source_brainstorm_ids: + topic_id = paper_metadata.source_brainstorm_ids[0] + if paper_metadata.status == "in_progress" or not paper_is_complete: + paper_content = await self._get_paper_content_for_resume(paper_id) + paper_phase = self._detect_paper_phase(paper_content) + else: + logger.info( + "Ignoring stale current_paper_id %s during resume recovery: paper is already complete", + paper_id, + ) + paper_id = None + paper_title = None + reference_paper_ids = [] + + metadata = await brainstorm_memory.get_metadata(topic_id) if topic_id else None + if topic_id 
and metadata is None: + return None + + current_tier = "tier2_paper_writing" if paper_id else "tier1_aggregation" + if ( + metadata is not None + and metadata.status == "complete" + and not paper_id + and not (metadata.papers_generated or []) + ): + current_tier = "tier2_paper_writing" + paper_phase = "brainstorm_proof_verification" + elif metadata is not None and metadata.status == "complete" and not paper_id: + return None + + acceptance_count = await self._recover_brainstorm_acceptance_count(topic_id) + workflow_state = await research_metadata.get_workflow_state() + workflow_state.update( + { + "is_running": False, + "current_tier": current_tier, + "current_topic_id": topic_id, + "current_paper_id": paper_id, + "current_paper_title": paper_title, + "paper_phase": paper_phase, + "reference_paper_ids": reference_paper_ids, + "acceptance_count": acceptance_count, + "papers_completed_count": stats.get("total_papers_completed", 0), + } + ) + return workflow_state + except Exception as exc: + logger.debug("Failed to recover resume point from metadata: %s", exc) + return None async def stop(self) -> None: """Stop the autonomous research gracefully. @@ -2050,6 +2565,8 @@ async def _run_shutdown_step(label: str, awaitable, timeout: float = 5.0) -> boo return False # Stop any running aggregator or compiler to prevent orphan tasks + await self._stop_active_child_aggregators("autonomous stop") + if self._brainstorm_aggregator: try: if await _run_shutdown_step("brainstorm aggregator", self._brainstorm_aggregator.stop()): @@ -2435,8 +2952,7 @@ def get_validator_config(self) -> Optional[Dict[str, Any]]: Returns None if not initialized. Returns: - Dict with validator_model, validator_context_window, validator_max_tokens, - validator_provider, and validator_openrouter_provider, or None if not initialized. + Dict with validator model/runtime settings, or None if not initialized. 
""" if not self._validator_model: return None @@ -2447,6 +2963,8 @@ def get_validator_config(self) -> Optional[Dict[str, Any]]: "validator_max_tokens": self._validator_max_tokens, "validator_provider": self._validator_provider, "validator_openrouter_provider": self._validator_openrouter_provider, + "validator_openrouter_reasoning_effort": self._validator_openrouter_reasoning_effort, + "validator_supercharge_enabled": self._validator_supercharge_enabled, } def get_proof_runtime_config(self) -> Optional[Dict[str, Any]]: @@ -2530,6 +3048,7 @@ async def _topic_exploration_phase(self) -> str: try: exploration_aggregator = AggregatorCoordinator() + self._track_child_aggregator(exploration_aggregator) await exploration_aggregator.initialize( user_prompt=exploration_prompt, @@ -2541,7 +3060,9 @@ async def _topic_exploration_phase(self) -> str: validator_max_tokens=self._validator_max_tokens, validator_provider=self._validator_provider, validator_openrouter_provider=self._validator_openrouter_provider, + validator_openrouter_reasoning_effort=self._validator_openrouter_reasoning_effort, validator_lm_studio_fallback=self._validator_lm_studio_fallback, + validator_supercharge_enabled=self._validator_supercharge_enabled, enable_cleanup_review=False ) @@ -2644,6 +3165,8 @@ async def _topic_exploration_phase(self) -> str: pass return "" finally: + self._untrack_child_aggregator(exploration_aggregator) + # Restore original shared training path system_config.shared_training_file = original_shared_path shared_training_memory.file_path = original_memory_path @@ -3019,7 +3542,7 @@ async def _pre_brainstorm_reference_selection(self) -> List[str]: logger.info(f"Pre-brainstorm reference selection: selected {len(selected_ids)} papers") return selected_ids - def _get_reference_paper_paths(self) -> List[str]: + async def _get_reference_paper_paths(self) -> List[str]: """ Get file paths for currently selected reference papers. Uses session-based paths if session manager is active. 
@@ -3029,6 +3552,10 @@ def _get_reference_paper_paths(self) -> List[str]: """ paths = [] for paper_id in self._current_reference_papers: + metadata = await paper_library.get_metadata(paper_id) + if metadata and metadata.status != "complete": + logger.info(f"Skipping pruned/non-complete reference paper {paper_id}") + continue # Use paper_library to get session-aware path # paper_library handles both legacy flat structure and session-based paths paper_path = paper_library._get_paper_path(paper_id) @@ -3052,6 +3579,9 @@ async def _get_reference_paper_details( if not metadata: logger.warning(f"Reference paper metadata not found: {paper_id}") continue + if metadata.status != "complete": + logger.info(f"Skipping non-complete reference paper metadata: {paper_id} ({metadata.status})") + continue reference_title_display = await paper_library.get_reference_title_display( paper_id, @@ -3089,6 +3619,8 @@ async def _brainstorm_aggregation_loop(self) -> bool: if metadata is None: logger.error(f"Cannot start aggregation: brainstorm {self._current_topic_id} not found") return False + + self._brainstorm_hard_limit_triggered = False # Initialize per-paper model tracker for this brainstorm/paper cycle self._current_paper_tracker = PaperModelTracker( @@ -3114,6 +3646,15 @@ async def paper_model_tracking_callback(model_id: str) -> None: # Override shared training memory path to brainstorm-specific # Use brainstorm_memory to get correct path (respects session manager) brainstorm_db_path = brainstorm_memory._get_database_path(self._current_topic_id) + if not brainstorm_db_path.exists(): + logger.error( + f"Cannot start aggregation: brainstorm database not found at {brainstorm_db_path}" + ) + await self.clear_deleted_brainstorm_reference( + self._current_topic_id, + "brainstorm database missing before aggregation start" + ) + return False brainstorm_db_path.parent.mkdir(parents=True, exist_ok=True) # Temporarily override shared training path @@ -3134,9 +3675,12 @@ async def 
paper_model_tracking_callback(model_id: str) -> None: try: # Get reference paper paths for brainstorm context # This enables compounding knowledge - brainstorm submitters can build on prior papers - reference_paper_paths = self._get_reference_paper_paths() + reference_paper_paths = await self._get_reference_paper_paths() if reference_paper_paths: logger.info(f"Loading {len(reference_paper_paths)} reference papers for brainstorm aggregation") + + async def hard_limit_callback(total_acceptances: int) -> None: + await self._trigger_brainstorm_hard_limit(total_acceptances) # Initialize aggregator with topic prompt # CRITICAL: skip_stats_load=True to prevent loading manual aggregator stats @@ -3153,7 +3697,13 @@ async def paper_model_tracking_callback(model_id: str) -> None: # Pass OpenRouter provider configs for validator validator_provider=self._validator_provider, validator_openrouter_provider=self._validator_openrouter_provider, - validator_lm_studio_fallback=self._validator_lm_studio_fallback + validator_openrouter_reasoning_effort=self._validator_openrouter_reasoning_effort, + validator_lm_studio_fallback=self._validator_lm_studio_fallback, + validator_supercharge_enabled=self._validator_supercharge_enabled, + max_total_acceptances=_BRAINSTORM_ACCEPTANCE_HARD_LIMIT, + acceptance_count_offset=max(0, self._acceptance_count), + acceptance_cap_callback=hard_limit_callback, + allow_trusted_context_files=True, ) # CRITICAL FIX: Re-ingest existing submissions into RAG after resume @@ -3240,18 +3790,22 @@ async def paper_model_tracking_callback(model_id: str) -> None: # Safety check: if topic already at or past hard cap (e.g. resume of # already-complete brainstorm that slipped past the code guard), skip # aggregation entirely and go straight to paper writing. - if self._acceptance_count >= 30: + if self._acceptance_count >= _BRAINSTORM_ACCEPTANCE_HARD_LIMIT: logger.info( f"Topic {self._current_topic_id} already at {self._acceptance_count} " - f"acceptances (>= 30 cap). 
Skipping aggregation, forcing paper writing." + f"acceptances (>= {_BRAINSTORM_ACCEPTANCE_HARD_LIMIT} cap). " + f"Skipping aggregation, forcing paper writing." ) - await brainstorm_memory.mark_complete(self._current_topic_id) - await research_metadata.mark_brainstorm_complete(self._current_topic_id) + await self._trigger_brainstorm_hard_limit(self._acceptance_count) await self._brainstorm_aggregator.stop() await self._run_brainstorm_completion_proofs() return True while self._running and not self._stop_event.is_set(): + if not await self._current_brainstorm_available_for_aggregation(brainstorm_db_path): + await self._brainstorm_aggregator.stop() + return False + # Get current aggregator stats status = await self._brainstorm_aggregator.get_status() current_acceptances = status.total_acceptances @@ -3294,25 +3848,10 @@ async def paper_model_tracking_callback(model_id: str) -> None: await self._save_workflow_state(tier="tier1_aggregation") # Check for hard limit of 30 acceptances (FORCE paper writing, skip completion review) - if self._acceptance_count >= 30: - logger.info(f"Hard limit of 30 acceptances reached for {self._current_topic_id}. Forcing paper writing transition.") - - # Broadcast hard limit reached event - await self._broadcast("brainstorm_hard_limit_reached", { - "topic_id": self._current_topic_id, - "acceptance_count": self._acceptance_count, - "message": "Brainstorm hard limit of 30 acceptances reached. Forcing paper writing." 
- }) - - # Mark brainstorm complete - await brainstorm_memory.mark_complete(self._current_topic_id) - await research_metadata.mark_brainstorm_complete(self._current_topic_id) - - # Stop aggregator + if self._acceptance_count >= _BRAINSTORM_ACCEPTANCE_HARD_LIMIT: + await self._trigger_brainstorm_hard_limit(self._acceptance_count) await self._brainstorm_aggregator.stop() await self._run_brainstorm_completion_proofs() - - # Force transition to paper writing (skip completion review) return True # Check for early completion triggers @@ -3330,6 +3869,11 @@ async def paper_model_tracking_callback(model_id: str) -> None: await self._brainstorm_aggregator.stop() await self._run_brainstorm_completion_proofs() return True + + if self._brainstorm_hard_limit_triggered: + await self._brainstorm_aggregator.stop() + await self._run_brainstorm_completion_proofs() + return True # Check for manual override trigger (before checking stop event) if self._manual_paper_writing_triggered: @@ -3422,7 +3966,10 @@ async def _check_early_completion_triggers(self) -> bool: ] exhaustion_count = 0 - for submitter_id in [1, 2, 3]: + configured_submitter_ids = sorted( + {config.submitter_id for config in self._submitter_configs if config.submitter_id} + ) or [1, 2, 3] + for submitter_id in configured_submitter_ids: rejections = await autonomous_rejection_logs.get_brainstorm_submitter_rejections( self._current_topic_id, submitter_id @@ -3481,8 +4028,19 @@ async def force_paper_writing(self) -> bool: await brainstorm_memory.mark_complete(self._current_topic_id) await research_metadata.mark_brainstorm_complete(self._current_topic_id) - # Set flag to trigger paper writing on next loop iteration + # Parent/user action wins immediately: stop child aggregation now, + # then let the owning workflow loop transition into Tier 2. 
self._manual_paper_writing_triggered = True + self._state.current_tier = "tier2_paper_writing" + await self._save_workflow_state(tier="tier2_paper_writing") + try: + if await self._await_parent_phase_shutdown( + "brainstorm aggregator shutdown for manual paper-writing override", + self._brainstorm_aggregator.stop(), + ): + logger.info("Brainstorm aggregator stopped by manual paper-writing override") + except Exception as stop_exc: + logger.warning(f"Error stopping aggregator during manual paper-writing override: {stop_exc}") return True @@ -3563,10 +4121,15 @@ async def force_tier3_final_answer(self, mode: str = "complete_current") -> dict logger.info("Force Tier 3: Main loop stopped") # Stop current aggregator if it exists (don't check tier - state is unreliable) + await self._stop_active_child_aggregators("forced Tier 3") + if self._brainstorm_aggregator: try: - await self._brainstorm_aggregator.stop() - logger.info("Aggregator stopped for forced Tier 3") + if await self._await_parent_phase_shutdown( + "brainstorm aggregator shutdown for forced Tier 3", + self._brainstorm_aggregator.stop(), + ): + logger.info("Aggregator stopped for forced Tier 3") except Exception as e: logger.warning(f"Error stopping aggregator: {e}") @@ -3581,17 +4144,35 @@ async def force_tier3_final_answer(self, mode: str = "complete_current") -> dict # Stop compiler if it exists (don't check tier - state is unreliable) if self._paper_compiler: try: - await self._paper_compiler.stop() - logger.info("Compiler stopped for forced Tier 3") + if await self._await_parent_phase_shutdown( + "paper compiler shutdown for forced Tier 3", + self._paper_compiler.stop(), + ): + logger.info("Compiler stopped for forced Tier 3") except Exception as e: logger.warning(f"Error stopping compiler: {e}") - # CRITICAL: Wait for main loop to actually exit before resetting flags - # The main loop checks these flags, and if we reset them too quickly, - # the loop will see _running=True and continue creating 
brainstorms! - # This delay ensures the main loop's next iteration sees _running=False and exits. - await asyncio.sleep(0.5) - logger.info("Force Tier 3: Waited for main loop to exit") + # CRITICAL: Wait for the old main loop to actually exit before + # resetting flags for Tier 3. Otherwise a child branch can see the + # cleared stop flag and continue under the parent final phase. + current_task = asyncio.current_task() + main_task = self._main_task + if main_task and main_task is not current_task and not main_task.done(): + try: + await asyncio.wait_for( + asyncio.shield(main_task), + timeout=_PARENT_PHASE_SHUTDOWN_TIMEOUT_SECONDS, + ) + logger.info("Force Tier 3: Main loop exited cleanly") + except asyncio.TimeoutError: + logger.warning("Force Tier 3: Main loop did not exit in time; cancelling it") + main_task.cancel() + try: + await main_task + except asyncio.CancelledError: + pass + else: + await asyncio.sleep(0) # CRITICAL: Reset flags for Tier 3 execution # Now that the main loop has exited, we can reset flags for Tier 3's internal loops @@ -3633,8 +4214,9 @@ async def force_tier3_final_answer(self, mode: str = "complete_current") -> dict logger.info("Force Tier 3: Restarting main research loop to generate more papers") # Flags are already in running state (set at lines 1737-1738) - # Create a background task to resume the main research loop - asyncio.create_task(self._resume_research_loop_after_tier3()) + # Create a tracked background task to resume the main research loop. 
+ self._main_task = asyncio.create_task(self._resume_research_loop_after_tier3()) + self._main_task.add_done_callback(self._on_main_task_done) return { "success": True, @@ -4116,6 +4698,7 @@ async def _paper_title_exploration_phase( last_rejections = 0 else: exploration_aggregator = AggregatorCoordinator() + self._track_child_aggregator(exploration_aggregator) await exploration_aggregator.initialize( user_prompt=exploration_prompt, @@ -4127,7 +4710,9 @@ async def _paper_title_exploration_phase( validator_max_tokens=self._validator_max_tokens, validator_provider=self._validator_provider, validator_openrouter_provider=self._validator_openrouter_provider, + validator_openrouter_reasoning_effort=self._validator_openrouter_reasoning_effort, validator_lm_studio_fallback=self._validator_lm_studio_fallback, + validator_supercharge_enabled=self._validator_supercharge_enabled, enable_cleanup_review=False ) @@ -4229,6 +4814,8 @@ async def _paper_title_exploration_phase( pass return "" finally: + self._untrack_child_aggregator(exploration_aggregator) + system_config.shared_training_file = original_shared_path shared_training_memory.file_path = original_memory_path @@ -4306,16 +4893,24 @@ async def _compile_paper( # Pass OpenRouter provider configs for all compiler roles validator_provider=self._validator_provider, validator_openrouter_provider=self._validator_openrouter_provider, + validator_openrouter_reasoning_effort=self._validator_openrouter_reasoning_effort, validator_lm_studio_fallback=self._validator_lm_studio_fallback, high_context_provider=self._high_context_provider, high_context_openrouter_provider=self._high_context_openrouter_provider, + high_context_openrouter_reasoning_effort=self._high_context_openrouter_reasoning_effort, high_context_lm_studio_fallback=self._high_context_lm_studio_fallback, high_param_provider=self._high_param_provider, high_param_openrouter_provider=self._high_param_openrouter_provider, + 
high_param_openrouter_reasoning_effort=self._high_param_openrouter_reasoning_effort, high_param_lm_studio_fallback=self._high_param_lm_studio_fallback, critique_submitter_provider=self._critique_submitter_provider, critique_submitter_openrouter_provider=self._critique_submitter_openrouter_provider, - critique_submitter_lm_studio_fallback=self._critique_submitter_lm_studio_fallback + critique_submitter_openrouter_reasoning_effort=self._critique_submitter_openrouter_reasoning_effort, + critique_submitter_lm_studio_fallback=self._critique_submitter_lm_studio_fallback, + validator_supercharge_enabled=self._validator_supercharge_enabled, + high_context_supercharge_enabled=self._high_context_supercharge_enabled, + high_param_supercharge_enabled=self._high_param_supercharge_enabled, + critique_submitter_supercharge_enabled=self._critique_submitter_supercharge_enabled ) # Set WebSocket broadcaster for compiler events @@ -4401,6 +4996,10 @@ async def _compile_paper( if reference_paper_ids: logger.info(f"Loading {len(reference_paper_ids)} reference papers into compiler RAG") for ref_paper_id in reference_paper_ids: + ref_metadata = await paper_library.get_metadata(ref_paper_id) + if not ref_metadata or ref_metadata.status != "complete": + logger.info(f"Skipping non-complete compiler reference paper {ref_paper_id}") + continue # IMPORTANT: Use paper_library.get_paper_path() for session-aware path resolution paper_path = paper_library.get_paper_path(ref_paper_id) if os.path.exists(paper_path): @@ -4423,6 +5022,10 @@ async def _compile_paper( if self._current_brainstorm_paper_ids: logger.info(f"Loading {len(self._current_brainstorm_paper_ids)} prior brainstorm papers as auto-references") for bp_id in self._current_brainstorm_paper_ids: + bp_metadata = await paper_library.get_metadata(bp_id) + if not bp_metadata or bp_metadata.status != "complete": + logger.info(f"Skipping non-complete prior brainstorm paper {bp_id}") + continue bp_path = paper_library.get_paper_path(bp_id) if 
os.path.exists(bp_path): bp_content = await paper_library.get_paper_content(bp_id, strip_proofs=True) @@ -4671,63 +5274,22 @@ async def _handle_paper_completion( "word_count": paper_metadata.word_count }) - await self._run_proof_verification( - content, - "paper", - paper_id, - source_title=title, + await self._save_workflow_state( + tier="tier2_paper_writing", + phase="paper_proof_verification", ) - - pending_retry_candidates: List[ProofCandidate] = [] - retry_source_ids = paper_metadata.source_brainstorm_ids or ([self._current_topic_id] if self._current_topic_id else []) - for brainstorm_id in retry_source_ids: - pending_retries = await proof_database.get_pending_retries( - brainstorm_id, - retry_source_id=paper_id, - ) - for pending_retry in pending_retries: - combined_excerpt_parts = [] - if pending_retry.source_excerpt: - combined_excerpt_parts.append( - "ORIGINAL BRAINSTORM EXCERPT:\n" + pending_retry.source_excerpt - ) - if content: - combined_excerpt_parts.append( - "REFINED PAPER CONTEXT:\n" + content[:6000] - ) - - retry_formal_sketch = pending_retry.formal_sketch - if pending_retry.error_summary: - retry_formal_sketch = ( - f"{retry_formal_sketch}\n\nPrior Lean 4 failure summary: {pending_retry.error_summary}" - ).strip() - - pending_retry_candidates.append( - ProofCandidate( - theorem_id=pending_retry.theorem_id, - statement=pending_retry.theorem_statement, - formal_sketch=retry_formal_sketch, - source_excerpt="\n\n".join(part for part in combined_excerpt_parts if part).strip(), - origin_source_id=brainstorm_id, - ) - ) - - if pending_retry_candidates: - await self._broadcast("proof_retry_scheduled", { - "source_type": "paper", - "source_id": paper_id, - "source_title": title, - "count": len(pending_retry_candidates), - "brainstorm_ids": retry_source_ids, - }) - await self._run_proof_verification( - content, - "paper", + await self._run_completed_paper_proof_checks( + paper_id=paper_id, + title=title, + content=content, + 
source_brainstorm_ids=paper_metadata.source_brainstorm_ids, + ) + if self._stop_event.is_set(): + logger.info( + "Stop requested during paper proof verification for %s; preserving proof checkpoint", paper_id, - source_title=title, - theorem_candidates=pending_retry_candidates, - trigger="retry", ) + return # Trigger auto-critique generation in background (only if marking as complete) asyncio.create_task(self._auto_generate_paper_critique( @@ -4748,6 +5310,81 @@ async def _handle_paper_completion( else: # Paper saved but still in progress - keep state logger.info(f"Paper saved (in progress): {paper_id} ({paper_metadata.word_count} words)") + + async def _run_completed_paper_proof_checks( + self, + paper_id: str, + title: str, + content: str, + source_brainstorm_ids: List[str], + ) -> None: + """Run proof checks for a completed paper and any deferred brainstorm retries.""" + self._state.current_tier = "tier2_paper_writing" + await self._save_workflow_state( + tier="tier2_paper_writing", + phase="paper_proof_verification", + ) + + await self._run_proof_verification( + content, + "paper", + paper_id, + source_title=title, + ) + + if self._stop_event.is_set(): + return + + pending_retry_candidates: List[ProofCandidate] = [] + retry_source_ids = source_brainstorm_ids or ([self._current_topic_id] if self._current_topic_id else []) + for brainstorm_id in retry_source_ids: + pending_retries = await proof_database.get_pending_retries( + brainstorm_id, + retry_source_id=paper_id, + ) + for pending_retry in pending_retries: + combined_excerpt_parts = [] + if pending_retry.source_excerpt: + combined_excerpt_parts.append( + "ORIGINAL BRAINSTORM EXCERPT:\n" + pending_retry.source_excerpt + ) + if content: + combined_excerpt_parts.append( + "REFINED PAPER CONTEXT:\n" + content[:6000] + ) + + retry_formal_sketch = pending_retry.formal_sketch + if pending_retry.error_summary: + retry_formal_sketch = ( + f"{retry_formal_sketch}\n\nPrior Lean 4 failure summary: 
{pending_retry.error_summary}" + ).strip() + + pending_retry_candidates.append( + ProofCandidate( + theorem_id=pending_retry.theorem_id, + statement=pending_retry.theorem_statement, + formal_sketch=retry_formal_sketch, + source_excerpt="\n\n".join(part for part in combined_excerpt_parts if part).strip(), + origin_source_id=brainstorm_id, + ) + ) + + if pending_retry_candidates and not self._stop_event.is_set(): + await self._broadcast("proof_retry_scheduled", { + "source_type": "paper", + "source_id": paper_id, + "source_title": title, + "count": len(pending_retry_candidates), + "brainstorm_ids": retry_source_ids, + }) + await self._run_proof_verification( + content, + "paper", + paper_id, + source_title=title, + theorem_candidates=pending_retry_candidates, + trigger="retry", + ) async def _auto_generate_paper_critique( self, @@ -4814,9 +5451,11 @@ async def _auto_generate_paper_critique( model_id=self._validator_model, openrouter_model_id=self._validator_model if self._validator_provider == "openrouter" else None, openrouter_provider=self._validator_openrouter_provider, + openrouter_reasoning_effort=self._validator_openrouter_reasoning_effort, lm_studio_fallback_id=self._validator_lm_studio_fallback, context_window=self._validator_context, - max_output_tokens=self._validator_max_tokens + max_output_tokens=self._validator_max_tokens, + supercharge_enabled=self._validator_supercharge_enabled ) ) @@ -4944,7 +5583,12 @@ async def _check_paper_redundancy(self) -> None: if result and result.should_remove and result.paper_id: # Execute removal - success = await self._redundancy_checker.execute_removal(result.paper_id) + success = await self._redundancy_checker.execute_removal( + result.paper_id, + reason=result.reasoning, + ) + if success: + await autonomous_rag_manager.remove_paper_from_rag(result.paper_id) await self._broadcast("paper_redundancy_review", { "should_remove": True, @@ -5829,16 +6473,24 @@ async def _compile_tier3_paper( # Pass OpenRouter provider configs 
for all compiler roles validator_provider=self._validator_provider, validator_openrouter_provider=self._validator_openrouter_provider, + validator_openrouter_reasoning_effort=self._validator_openrouter_reasoning_effort, validator_lm_studio_fallback=self._validator_lm_studio_fallback, high_context_provider=self._high_context_provider, high_context_openrouter_provider=self._high_context_openrouter_provider, + high_context_openrouter_reasoning_effort=self._high_context_openrouter_reasoning_effort, high_context_lm_studio_fallback=self._high_context_lm_studio_fallback, high_param_provider=self._high_param_provider, high_param_openrouter_provider=self._high_param_openrouter_provider, + high_param_openrouter_reasoning_effort=self._high_param_openrouter_reasoning_effort, high_param_lm_studio_fallback=self._high_param_lm_studio_fallback, critique_submitter_provider=self._critique_submitter_provider, critique_submitter_openrouter_provider=self._critique_submitter_openrouter_provider, - critique_submitter_lm_studio_fallback=self._critique_submitter_lm_studio_fallback + critique_submitter_openrouter_reasoning_effort=self._critique_submitter_openrouter_reasoning_effort, + critique_submitter_lm_studio_fallback=self._critique_submitter_lm_studio_fallback, + validator_supercharge_enabled=self._validator_supercharge_enabled, + high_context_supercharge_enabled=self._high_context_supercharge_enabled, + high_param_supercharge_enabled=self._high_param_supercharge_enabled, + critique_submitter_supercharge_enabled=self._critique_submitter_supercharge_enabled ) # Set WebSocket broadcaster @@ -6137,6 +6789,7 @@ def safe_rmtree(path: Path, max_retries: int = 5) -> bool: self._last_redundancy_check_at = 0 self._last_completion_review_at = 0 self._manual_paper_writing_triggered = False + self._brainstorm_hard_limit_triggered = False self._force_tier3_after_paper = False self._force_tier3_immediate = False self._tier3_active = False diff --git 
a/backend/autonomous/core/autonomous_rag_manager.py b/backend/autonomous/core/autonomous_rag_manager.py index dfabcb6..548f9ab 100644 --- a/backend/autonomous/core/autonomous_rag_manager.py +++ b/backend/autonomous/core/autonomous_rag_manager.py @@ -222,6 +222,10 @@ async def get_reference_papers_context( for paper_id in paper_ids: content = await paper_library.get_paper_content(paper_id, strip_proofs=True) metadata = await paper_library.get_metadata(paper_id) + + if metadata and metadata.status != "complete": + logger.info(f"Skipping non-complete reference paper {paper_id} ({metadata.status})") + continue if content and metadata: paper_tokens = count_tokens(content) @@ -523,6 +527,20 @@ async def remove_brainstorm_from_rag(self, topic_id: str) -> None: except Exception as e: logger.error(f"Failed to remove brainstorm {topic_id} from RAG: {e}") + async def remove_paper_from_rag(self, paper_id: str) -> None: + """Remove a pruned paper from any active paper RAG sources.""" + self._papers_indexed.discard(paper_id) + for source_name in ( + f"reference_paper_{paper_id}", + f"reference_paper_{paper_id}.txt", + f"prior_paper_{paper_id}.txt", + ): + try: + await rag_manager.remove_document(source_name) + logger.info(f"Removed pruned paper RAG source {source_name}") + except Exception as e: + logger.debug(f"Reference paper RAG source {source_name} not removed: {e}") + # Global instance autonomous_rag_manager = AutonomousRAGManager() diff --git a/backend/autonomous/core/proof_novelty.py b/backend/autonomous/core/proof_novelty.py index 573d538..4b4cf6f 100644 --- a/backend/autonomous/core/proof_novelty.py +++ b/backend/autonomous/core/proof_novelty.py @@ -21,7 +21,13 @@ VALID_NOVELTY_TIERS = frozenset( - {"not_novel", "novel_formulation", "novel_variant", "mathematical_discovery"} + { + "not_novel", + "novel_formulation", + "novel_variant", + "mathematical_discovery", + "major_mathematical_discovery", + } ) @@ -37,7 +43,7 @@ async def assess_proof_novelty( task_id: str, 
role_id: str = "autonomous_proof_novelty", ) -> Tuple[str, str]: - """Classify a Lean-4-verified theorem into one of four novelty tiers. + """Classify a Lean-4-verified theorem into one of five novelty tiers. Args: user_prompt: Top-level research prompt for context. @@ -55,7 +61,8 @@ async def assess_proof_novelty( Returns: Tuple of (novelty_tier, reasoning) where novelty_tier is one of: - "not_novel", "novel_formulation", "novel_variant", "mathematical_discovery". + "not_novel", "novel_formulation", "novel_variant", + "mathematical_discovery", "major_mathematical_discovery". Falls back to ("not_novel", ) when the validator returns no usable response or an unrecognised tier string. """ diff --git a/backend/autonomous/core/proof_registration.py b/backend/autonomous/core/proof_registration.py new file mode 100644 index 0000000..ab02d8f --- /dev/null +++ b/backend/autonomous/core/proof_registration.py @@ -0,0 +1,216 @@ +""" +Shared registration for Lean-verified proofs. + +Callers that already have Lean-accepted code use this module to classify the +proof with the validator novelty prompt and store the resulting ProofRecord in +the central proof database. 
+"""
+from __future__ import annotations
+
+import logging
+from dataclasses import dataclass
+from typing import Any, Awaitable, Callable, Optional
+
+from backend.autonomous.core.proof_novelty import assess_proof_novelty
+from backend.shared.models import ProofAttemptFeedback, ProofDependency, ProofRecord
+
+logger = logging.getLogger(__name__)
+
+BroadcastFn = Optional[Callable[[str, dict[str, Any]], Awaitable[None]]]
+
+
+@dataclass
+class RegisteredProof:
+    """Result of registering or reusing a verified proof record."""
+
+    record: ProofRecord
+    duplicate: bool = False
+
+
+def _normalize_for_duplicate_check(value: str) -> str:
+    return "\n".join((value or "").strip().splitlines())
+
+
+async def _find_existing_proof(
+    proof_database,
+    *,
+    source_type: str,
+    source_id: str,
+    theorem_statement: str,
+    lean_code: str,
+) -> Optional[ProofRecord]:
+    """Return an existing proof for the same source/theorem/code if present."""
+    normalized_statement = " ".join((theorem_statement or "").split())
+    normalized_code = _normalize_for_duplicate_check(lean_code)
+    try:
+        for proof in await proof_database.get_all_proofs():
+            if proof.source_type != source_type or proof.source_id != source_id:
+                continue
+            if " ".join((proof.theorem_statement or "").split()) != normalized_statement:
+                continue
+            if _normalize_for_duplicate_check(proof.lean_code) != normalized_code:
+                continue
+            return proof
+    except Exception as exc:
+        logger.debug("Existing proof lookup failed for %s %s: %s", source_type, source_id, exc)
+    return None
+
+
+async def _broadcast_registered_proof(
+    *,
+    broadcast_fn: BroadcastFn,
+    record: ProofRecord,
+    base_event: Optional[dict[str, Any]],
+    proof_label: str = "",
+    retry_origin_source_id: str = "",
+) -> None:
+    if not broadcast_fn:
+        return
+
+    event_payload = {
+        **(base_event or {}),
+        "proof_id": record.proof_id,
+        "theorem_statement": record.theorem_statement,
+        "solver": record.solver,
+        "novelty_tier": record.novelty_tier,
+        "retry_origin_source_id": retry_origin_source_id,
+    }
+    if proof_label:
+        event_payload["proof_label"] = proof_label
+
+    if record.novel:
+        await broadcast_fn("novel_proof_discovered", event_payload)
+    else:
+        await broadcast_fn("known_proof_verified", event_payload)
+
+
+async def _broadcast_duplicate_proof(
+    *,
+    broadcast_fn: BroadcastFn,
+    record: ProofRecord,
+    base_event: Optional[dict[str, Any]],
+    proof_label: str = "",
+) -> None:
+    if not broadcast_fn:
+        return
+    event_payload = {
+        **(base_event or {}),
+        "proof_id": record.proof_id,
+        "theorem_statement": record.theorem_statement,
+        "solver": record.solver,
+        "novelty_tier": record.novelty_tier,
+        "duplicate": True,
+    }
+    if proof_label:
+        event_payload["proof_label"] = proof_label
+    await broadcast_fn("proof_registration_duplicate", event_payload)
+
+
+async def register_verified_lean_proof(
+    *,
+    proof_database,
+    user_prompt: str,
+    theorem_statement: str,
+    lean_code: str,
+    validator_model: str,
+    validator_context: int,
+    validator_max_tokens: int,
+    task_id: str,
+    role_id: str,
+    source_type: str,
+    source_id: str,
+    source_title: str = "",
+    theorem_id: str = "",
+    theorem_name: str = "",
+    formal_sketch: str = "",
+    solver: str = "Lean 4",
+    verification_notes: str = "Lean 4 accepted the submitted proof.",
+    attempt_count: int = 0,
+    attempts: Optional[list[ProofAttemptFeedback]] = None,
+    dependencies: Optional[list[ProofDependency]] = None,
+    solver_hints: Optional[list[str]] = None,
+    broadcast_fn: BroadcastFn = None,
+    base_event: Optional[dict[str, Any]] = None,
+    proof_label: str = "",
+    retry_origin_source_id: str = "",
+) -> RegisteredProof:
+    """
+    Classify and store Lean-verified proof code using the shared novelty tiers.
+
+    Duplicate detection is scoped to source type/id, theorem statement, and
+    Lean code. When a duplicate is found, the existing record is returned and
+    no novelty API call is made.
+    """
+    existing = await _find_existing_proof(
+        proof_database,
+        source_type=source_type,
+        source_id=source_id,
+        theorem_statement=theorem_statement,
+        lean_code=lean_code,
+    )
+    if existing is not None:
+        await _broadcast_duplicate_proof(
+            broadcast_fn=broadcast_fn,
+            record=existing,
+            base_event=base_event,
+            proof_label=proof_label,
+        )
+        return RegisteredProof(record=existing, duplicate=True)
+
+    existing_novel_proofs = proof_database.get_novel_proofs_for_injection()
+    novelty_tier, novelty_reasoning = await assess_proof_novelty(
+        user_prompt=user_prompt,
+        theorem_statement=theorem_statement,
+        lean_code=lean_code,
+        validator_model=validator_model,
+        validator_context=validator_context,
+        validator_max_tokens=validator_max_tokens,
+        existing_novel_proofs=existing_novel_proofs,
+        task_id=task_id,
+        role_id=role_id,
+    )
+    is_novel = novelty_tier != "not_novel"
+
+    record = ProofRecord(
+        proof_id="",
+        theorem_id=theorem_id,
+        theorem_statement=theorem_statement,
+        theorem_name=theorem_name,
+        formal_sketch=formal_sketch,
+        source_type=source_type,
+        source_id=source_id,
+        source_title=source_title,
+        solver=solver,
+        lean_code=lean_code,
+        novel=is_novel,
+        novelty_tier=novelty_tier,
+        novelty_reasoning=novelty_reasoning,
+        verification_notes=verification_notes,
+        attempt_count=attempt_count,
+        attempts=list(attempts or []),
+        dependencies=list(dependencies or []),
+        solver_hints=list(solver_hints or []),
+    )
+    if hasattr(proof_database, "add_proof_if_absent"):
+        stored, duplicate = await proof_database.add_proof_if_absent(record)
+    else:
+        stored = await proof_database.add_proof(record)
+        duplicate = stored.proof_id != record.proof_id and record.proof_id != ""
+
+    if duplicate:
+        await _broadcast_duplicate_proof(
+            broadcast_fn=broadcast_fn,
+            record=stored,
+            base_event=base_event,
+            proof_label=proof_label,
+        )
+        return RegisteredProof(record=stored, duplicate=True)
+
+    await _broadcast_registered_proof(
+        broadcast_fn=broadcast_fn,
+        record=stored,
+ base_event=base_event, + proof_label=proof_label, + retry_origin_source_id=retry_origin_source_id, + ) + return RegisteredProof(record=stored, duplicate=False) diff --git a/backend/autonomous/core/proof_verification_stage.py b/backend/autonomous/core/proof_verification_stage.py index c19178c..bd0e19a 100644 --- a/backend/autonomous/core/proof_verification_stage.py +++ b/backend/autonomous/core/proof_verification_stage.py @@ -15,8 +15,15 @@ from backend.autonomous.agents.proof_identification_agent import ProofIdentificationAgent from backend.autonomous.memory.brainstorm_memory import brainstorm_memory from backend.autonomous.memory.paper_library import paper_library +from backend.autonomous.core.proof_registration import register_verified_lean_proof +from backend.aggregator.prompts.validator_prompts import build_validator_prompt +from backend.shared.api_client_manager import api_client_manager +from backend.shared.brainstorm_proof_gate import BRAINSTORM_LEAN_PROOF_MARKER from backend.shared.config import system_config -from backend.shared.models import ProofAttemptFeedback, ProofAttemptResult, ProofCandidate, ProofRecord, ProofStageResult, SmtHint +from backend.shared.json_parser import parse_json +from backend.shared.lean_proof_integrity import validate_full_lean_proof_integrity +from backend.shared.model_error_utils import is_non_retryable_model_error +from backend.shared.models import ProofAttemptFeedback, ProofAttemptResult, ProofCandidate, ProofStageResult, SmtHint from backend.shared.openrouter_client import FreeModelExhaustedError from backend.shared.smt_client import get_smt_client from .proof_dependency_extractor import ProofDependencyExtractor @@ -32,6 +39,7 @@ class _LeanVerificationOutcome: """Outcome of a single candidate's Lean 4 formalization pipeline (Phase A).""" candidate: ProofCandidate + proof_label: str success: bool theorem_name: str lean_code: str @@ -46,6 +54,7 @@ class ProofVerificationStage: def __init__(self) -> None: 
self._novelty_task_sequence = 0 + self._integrity_task_sequence = 0 self._dependency_extractor = ProofDependencyExtractor() @classmethod @@ -136,6 +145,22 @@ def _summarize_error(error_text: str, limit: int = 500) -> str: cleaned = " ".join(raw.split()) return cleaned[:limit] + ("..." if len(cleaned) > limit else "") + @staticmethod + def _proof_label_for_index(index: int) -> str: + """Return Proof A..Z, then AA..ZZ, then AAA.. for a 1-based index.""" + safe_index = max(1, int(index or 1)) + letter = chr(ord("A") + ((safe_index - 1) % 26)) + repeat_count = ((safe_index - 1) // 26) + 1 + return letter * repeat_count + + def _lean_response_summary(self, feedback: ProofAttemptFeedback) -> str: + if feedback.success: + return "Lean 4 response: proof verified." + error_summary = self._summarize_error(feedback.error_output, limit=960) + if error_summary: + return f"Lean 4 response: {error_summary} - proof not verified." + return "Lean 4 response: proof not verified." + @staticmethod def _extract_suggested_lemma_targets(error_text: str) -> list[str]: targets: list[str] = [] @@ -245,6 +270,7 @@ async def _run_smt_check( source_id: str, base_event: dict[str, Any], candidate: ProofCandidate, + proof_label: str, source_content: str, identification_agent: ProofIdentificationAgent, broadcast_fn: BroadcastFn, @@ -252,16 +278,6 @@ async def _run_smt_check( if not system_config.smt_enabled or not self._is_smt_amenable(candidate): return None - await self._broadcast( - broadcast_fn, - "smt_check_started", - { - **base_event, - "theorem_id": candidate.theorem_id, - "theorem_statement": candidate.statement, - }, - ) - started_at = time.monotonic() result_name = "unknown" try: @@ -288,49 +304,23 @@ async def _run_smt_check( z3_output=z3_raw[:2000], ) except Exception as exc: + if is_non_retryable_model_error(exc): + raise logger.debug("SMT check failed for theorem %s in %s %s: %s", candidate.theorem_id, source_type, source_id, exc) - return SmtHint(result="unknown", 
suggested_tactics=[], smtlib="") - finally: elapsed_ms = int((time.monotonic() - started_at) * 1000) await self._broadcast( broadcast_fn, - "smt_check_complete", + "smt_check_error", { **base_event, "theorem_id": candidate.theorem_id, "theorem_statement": candidate.statement, - "result": result_name, + "proof_label": proof_label, + "error_summary": self._summarize_error(str(exc), limit=960), "elapsed_ms": elapsed_ms, }, ) - - async def _assess_novelty( - self, - *, - user_prompt: str, - theorem_statement: str, - lean_code: str, - validator_model: str, - validator_context: int, - validator_max_tokens: int, - existing_novel_proofs: str, - ) -> tuple[str, str]: - from .proof_novelty import assess_proof_novelty - - task_id = f"proof_novelty_{self._novelty_task_sequence:03d}" - self._novelty_task_sequence += 1 - - return await assess_proof_novelty( - user_prompt=user_prompt, - theorem_statement=theorem_statement, - lean_code=lean_code, - validator_model=validator_model, - validator_context=validator_context, - validator_max_tokens=validator_max_tokens, - existing_novel_proofs=existing_novel_proofs, - task_id=task_id, - role_id="autonomous_proof_novelty", - ) + return SmtHint(result="unknown", suggested_tactics=[], smtlib="") async def _resolve_candidates( self, @@ -377,6 +367,103 @@ async def _prepare_candidate( candidate = candidate.model_copy(update={"relevant_lemmas": relevant_lemmas}) return candidate + @staticmethod + def _format_verified_proof_for_brainstorm_validation( + *, + theorem_statement: str, + formal_sketch: str, + lean_code: str, + attempt_count: int, + ) -> str: + sections = [ + BRAINSTORM_LEAN_PROOF_MARKER, + "", + "Lean 4 has accepted the following proof. 
Decide whether it is useful, non-redundant brainstorm progress before it is appended to the brainstorm database.", + "", + f"Theorem statement: {theorem_statement}", + ] + if formal_sketch: + sections.extend(["", f"Formalization notes: {formal_sketch}"]) + sections.extend( + [ + "", + f"Lean verification: accepted after {attempt_count} attempt{'s' if attempt_count != 1 else ''}.", + "", + "Lean 4 code:", + "```lean", + lean_code, + "```", + ] + ) + return "\n".join(sections).strip() + + async def _validate_brainstorm_verified_proof_addition( + self, + *, + user_prompt: str, + source_content: str, + proof_submission: str, + validator_model: str, + validator_context: int, + validator_max_tokens: int, + task_id: str, + role_id: str, + broadcast_fn: BroadcastFn, + base_event: dict[str, Any], + ) -> bool: + """Run the normal brainstorm usefulness gate before appending verified proofs.""" + context = source_content or "" + while len(context) > 24000: + context = context[: max(len(context) // 2, 24000)] + prompt = build_validator_prompt( + user_prompt=user_prompt, + submission_content=proof_submission, + context=f"CURRENT BRAINSTORM DATABASE:\n{context}", + ) + try: + response = await api_client_manager.generate_completion( + task_id=task_id, + role_id=role_id, + model=validator_model, + messages=[{"role": "user", "content": prompt}], + max_tokens=validator_max_tokens, + temperature=0.0, + ) + if not response or not response.get("choices"): + raise ValueError("Proof brainstorm validator returned no choices.") + message = response["choices"][0].get("message", {}) + content = message.get("content") or message.get("reasoning") or "" + raw = parse_json(content) + if isinstance(raw, list): + raw = raw[0] if raw else {} + if not isinstance(raw, dict): + raw = {} + accepted = str(raw.get("decision") or "").strip().lower() == "accept" + await self._broadcast( + broadcast_fn, + "proof_brainstorm_validation_complete", + { + **base_event, + "accepted": accepted, + "reasoning": 
str(raw.get("reasoning") or raw.get("summary") or ""), + }, + ) + return accepted + except Exception as exc: + if is_non_retryable_model_error(exc): + raise + logger.warning("Verified brainstorm proof usefulness validation failed: %s", exc) + await self._broadcast( + broadcast_fn, + "proof_brainstorm_validation_complete", + { + **base_event, + "accepted": False, + "reasoning": f"Validator failed before producing a usable decision: {exc}", + }, + ) + return False + async def run( self, content: str, @@ -397,6 +484,7 @@ async def run( trigger: str = "automatic", source_reserved: bool = False, should_stop: ShouldStopFn = None, + append_to_source: bool = True, ) -> ProofStageResult: """Run proof identification, formalization, Lean 4 checking, and novelty review.""" result = ProofStageResult(source_type=source_type, source_id=source_id) @@ -488,18 +576,22 @@ def _stop_requested() -> bool: { **base_event, "count": len(resolved_candidates), - "theorems_preview": [candidate.statement[:180] for candidate in resolved_candidates], + "theorems_preview": [ + f"Proof {self._proof_label_for_index(index)}: {candidate.statement[:180]}" + for index, candidate in enumerate(resolved_candidates, start=1) + ], }, ) max_parallel = max(1, int(getattr(system_config, "proof_max_parallel_candidates", 6) or 1)) semaphore = asyncio.Semaphore(max_parallel) - async def run_phase_a(theorem_candidate: ProofCandidate) -> _LeanVerificationOutcome: + async def run_phase_a(theorem_candidate: ProofCandidate, proof_label: str) -> _LeanVerificationOutcome: async with semaphore: if _stop_requested(): return _LeanVerificationOutcome( candidate=theorem_candidate, + proof_label=proof_label, success=False, theorem_name="", lean_code="", @@ -508,6 +600,7 @@ async def run_phase_a(theorem_candidate: ProofCandidate) -> _LeanVerificationOut return await self._run_lean_pipeline_for_candidate( theorem_candidate=theorem_candidate, base_event=base_event, + proof_label=proof_label, user_prompt=user_prompt, 
source_type=source_type, source_id=source_id, @@ -523,8 +616,8 @@ async def run_phase_a(theorem_candidate: ProofCandidate) -> _LeanVerificationOut ) verification_tasks = [ - asyncio.create_task(run_phase_a(candidate)) - for candidate in resolved_candidates + asyncio.create_task(run_phase_a(candidate, self._proof_label_for_index(index))) + for index, candidate in enumerate(resolved_candidates, start=1) ] pending_tasks = set(verification_tasks) @@ -586,6 +679,7 @@ async def run_phase_a(theorem_candidate: ProofCandidate) -> _LeanVerificationOut break candidate = outcome.candidate + proof_label = outcome.proof_label attempts = outcome.attempts lean_code = outcome.lean_code @@ -614,44 +708,118 @@ async def run_phase_a(theorem_candidate: ProofCandidate) -> _LeanVerificationOut ) continue - result.verified_count += 1 - existing_novel_proofs = novel_proofs_db.get_novel_proofs_for_injection() - novelty_tier, novelty_reasoning = await self._assess_novelty( + integrity_task_id = f"proof_integrity_{self._integrity_task_sequence:03d}" + self._integrity_task_sequence += 1 + integrity = await validate_full_lean_proof_integrity( user_prompt=user_prompt, theorem_statement=candidate.statement, + formal_sketch=candidate.formal_sketch, lean_code=lean_code, + source_excerpt=candidate.source_excerpt or content, + allowed_baseline="", validator_model=validator_model, validator_context=validator_context, validator_max_tokens=validator_max_tokens, - existing_novel_proofs=existing_novel_proofs, + task_id=integrity_task_id, + role_id="autonomous_proof_novelty", + require_statement_alignment=True, ) - is_novel = novelty_tier != "not_novel" + if not integrity.valid: + integrity_feedback = ProofAttemptFeedback( + attempt=(attempts[-1].attempt + 1 if attempts else 1), + theorem_id=candidate.theorem_id, + reasoning="Post-Lean proof integrity check failed.", + lean_code=lean_code, + error_output=integrity.reason, + strategy="full_script", + success=False, + ) + attempts = list(attempts) + 
[integrity_feedback] + error_summary = self._summarize_error(integrity.reason) + suggested_targets = self._extract_suggested_lemma_targets(integrity.reason) + if source_type == "brainstorm" and trigger != "retry": + await novel_proofs_db.record_failed_candidate( + source_id, + candidate, + error_summary, + suggested_lemma_targets=suggested_targets, + ) + await self._broadcast( + broadcast_fn, + "proof_integrity_rejected", + { + **base_event, + "theorem_id": candidate.theorem_id, + "theorem_statement": candidate.statement, + "proof_label": proof_label, + "category": integrity.category, + "reason": integrity.reason, + }, + ) + result.results.append( + ProofAttemptResult( + theorem_id=candidate.theorem_id, + theorem_statement=candidate.statement, + lean_code=lean_code, + success=False, + novel=False, + attempts_used=len(attempts), + error_summary=error_summary, + ) + ) + continue + + novelty_task_id = f"proof_novelty_{self._novelty_task_sequence:03d}" + self._novelty_task_sequence += 1 solver_hints = [] if self._first_attempt_used_smt_hint(attempts, candidate.smt_hint): solver_hints.append("smt-z3") - proof_record = ProofRecord( - proof_id="", - theorem_id=candidate.theorem_id, + registration = await register_verified_lean_proof( + proof_database=novel_proofs_db, + user_prompt=user_prompt, theorem_statement=candidate.statement, - theorem_name=outcome.theorem_name, - formal_sketch=candidate.formal_sketch, + lean_code=lean_code, + validator_model=validator_model, + validator_context=validator_context, + validator_max_tokens=validator_max_tokens, + task_id=novelty_task_id, + role_id="autonomous_proof_novelty", source_type=source_type, source_id=source_id, source_title=source_title, + theorem_id=candidate.theorem_id, + theorem_name=outcome.theorem_name, + formal_sketch=candidate.formal_sketch, solver="Lean 4", - lean_code=lean_code, - novel=is_novel, - novelty_tier=novelty_tier, - novelty_reasoning=novelty_reasoning, verification_notes="Lean 4 accepted the submitted 
proof.", attempt_count=len(attempts), attempts=attempts, - dependencies=[], solver_hints=solver_hints, + broadcast_fn=broadcast_fn, + base_event=base_event, + proof_label=proof_label, + retry_origin_source_id=candidate.origin_source_id, + ) + stored_record = registration.record + is_novel = stored_record.novel + novelty_tier = stored_record.novelty_tier + result.verified_count += 1 + + await self._broadcast( + broadcast_fn, + "proof_verified", + { + **base_event, + "proof_id": stored_record.proof_id, + "theorem_id": candidate.theorem_id, + "theorem_statement": candidate.statement, + "proof_label": proof_label, + "strategy": attempts[-1].strategy if attempts else "full_script", + "retry_origin_source_id": candidate.origin_source_id, + }, ) - stored_record = await novel_proofs_db.add_proof(proof_record) # Dependency extraction runs in Phase B so later candidates # in the same paper can see earlier proofs. We instantiate @@ -687,6 +855,7 @@ async def run_phase_a(theorem_candidate: ProofCandidate) -> _LeanVerificationOut **base_event, "proof_id": stored_record.proof_id, "theorem_name": stored_record.theorem_name, + "proof_label": proof_label, "dependencies": [ dependency.model_dump(mode="json") for dependency in dependencies @@ -707,43 +876,44 @@ async def run_phase_a(theorem_candidate: ProofCandidate) -> _LeanVerificationOut stored_record.proof_id, ) - if is_novel: + if is_novel and not registration.duplicate: result.novel_count += 1 # Novel proofs are appended to their source document so the # paper/brainstorm they came from retains a record of them. # They are also stored in ProofDatabase and direct-injected # into all prompts via inject_into_prompt(). 
- if source_type == "brainstorm": - await brainstorm_memory.append_proofs_section(source_id, stored_record) - elif source_type == "paper": + if append_to_source and source_type == "brainstorm": + validator_accepted = await self._validate_brainstorm_verified_proof_addition( + user_prompt=user_prompt, + source_content=content, + proof_submission=self._format_verified_proof_for_brainstorm_validation( + theorem_statement=candidate.statement, + formal_sketch=candidate.formal_sketch, + lean_code=lean_code, + attempt_count=len(attempts), + ), + validator_model=validator_model, + validator_context=validator_context, + validator_max_tokens=validator_max_tokens, + task_id=f"proof_brainstorm_val_{self._novelty_task_sequence:03d}", + role_id="autonomous_proof_novelty", + broadcast_fn=broadcast_fn, + base_event={ + **base_event, + "theorem_id": candidate.theorem_id, + "theorem_statement": candidate.statement, + "proof_id": stored_record.proof_id, + "proof_label": proof_label, + }, + ) + if validator_accepted: + await brainstorm_memory.append_proofs_section(source_id, stored_record) + elif append_to_source and source_type == "paper": await paper_library.append_proofs_section(source_id, stored_record) - await self._broadcast( - broadcast_fn, - "novel_proof_discovered", - { - **base_event, - "proof_id": stored_record.proof_id, - "theorem_statement": stored_record.theorem_statement, - "solver": "Lean 4", - "novelty_tier": novelty_tier, - "retry_origin_source_id": candidate.origin_source_id, - }, - ) - else: - # Non-novel (known) proofs are stored in ProofDatabase only. - # They are NOT appended to brainstorm/paper files to avoid - # polluting compiler and RAG context with standard Lean 4 code. - # They remain browsable via proof_database.get_known_proofs_summary_for_browsing(). 
- await self._broadcast( - broadcast_fn, - "known_proof_verified", - { - **base_event, - "proof_id": stored_record.proof_id, - "theorem_statement": stored_record.theorem_statement, - "retry_origin_source_id": candidate.origin_source_id, - }, - ) + # Non-novel (known) proofs are stored in ProofDatabase only. + # They are NOT appended to brainstorm/paper files to avoid + # polluting compiler and RAG context with standard Lean 4 code. + # They remain browsable via proof_database.get_known_proofs_summary_for_browsing(). result.results.append( ProofAttemptResult( @@ -780,6 +950,8 @@ async def run_phase_a(theorem_candidate: ProofCandidate) -> _LeanVerificationOut except FreeModelExhaustedError: raise except Exception as exc: + if is_non_retryable_model_error(exc): + raise logger.error( "Proof verification stage failed for %s %s: %s", source_type, @@ -809,6 +981,7 @@ async def _run_lean_pipeline_for_candidate( *, theorem_candidate: ProofCandidate, base_event: dict[str, Any], + proof_label: str, user_prompt: str, source_type: str, source_id: str, @@ -861,6 +1034,7 @@ async def _run_lean_pipeline_for_candidate( source_id=source_id, base_event=base_event, candidate=candidate, + proof_label=proof_label, source_content=source_content, identification_agent=identification_agent, broadcast_fn=broadcast_fn, @@ -886,6 +1060,7 @@ async def on_attempt_started( **base_event, "theorem_id": current_candidate.theorem_id, "theorem_statement": current_candidate.statement, + "proof_label": proof_label, "attempt": attempt_number, "strategy": strategy, "retry_origin_source_id": current_candidate.origin_source_id, @@ -896,16 +1071,21 @@ async def on_attempt_feedback(feedback, current_candidate=candidate) -> None: if feedback.success: await self._broadcast( broadcast_fn, - "proof_verified", + "proof_lean_accepted", { **base_event, "theorem_id": current_candidate.theorem_id, "theorem_statement": current_candidate.statement, + "proof_label": proof_label, + "attempt": feedback.attempt, "strategy": 
feedback.strategy, + "lean_response": self._lean_response_summary(feedback), + "proof_verified": True, "retry_origin_source_id": current_candidate.origin_source_id, }, ) else: + lean_response = self._lean_response_summary(feedback) await self._broadcast( broadcast_fn, "proof_attempt_failed", @@ -913,9 +1093,12 @@ async def on_attempt_feedback(feedback, current_candidate=candidate) -> None: **base_event, "theorem_id": current_candidate.theorem_id, "theorem_statement": current_candidate.statement, + "proof_label": proof_label, "attempt": feedback.attempt, "strategy": feedback.strategy, "error_summary": self._summarize_error(feedback.error_output), + "lean_response": lean_response, + "proof_verified": False, "retry_origin_source_id": current_candidate.origin_source_id, }, ) @@ -961,12 +1144,14 @@ async def on_attempt_feedback(feedback, current_candidate=candidate) -> None: **base_event, "theorem_id": candidate.theorem_id, "theorem_statement": candidate.statement, + "proof_label": proof_label, "retry_origin_source_id": candidate.origin_source_id, }, ) return _LeanVerificationOutcome( candidate=candidate, + proof_label=proof_label, success=success, theorem_name=theorem_name, lean_code=lean_code, diff --git a/backend/autonomous/memory/autonomous_api_logger.py b/backend/autonomous/memory/autonomous_api_logger.py index 723857a..fb2e649 100644 --- a/backend/autonomous/memory/autonomous_api_logger.py +++ b/backend/autonomous/memory/autonomous_api_logger.py @@ -3,18 +3,32 @@ Stores logs in a persistent file for viewing in the Autonomous Logs tab. 
""" import asyncio +import hashlib import json import logging import os +from collections import deque from datetime import datetime from typing import Dict, Any, List, Optional from pathlib import Path from backend.shared.config import system_config +from backend.shared.log_redaction import redact_log_text logger = logging.getLogger(__name__) +def _payload_metadata(value: str, preview_chars: int) -> Dict[str, Any]: + """Return safe log metadata for a prompt/response payload.""" + text = value or "" + preview = redact_log_text(text, preview_chars) + return { + "preview": preview, + "size": len(text), + "sha256": hashlib.sha256(text.encode("utf-8", errors="replace")).hexdigest() if text else "", + } + + class AutonomousAPILogger: """ Logger for autonomous research API call outputs. @@ -38,6 +52,7 @@ def __init__(self): self._initialized = True self._ensure_log_file() + self._scrub_persisted_full_payloads() logger.info("AutonomousAPILogger initialized") def _ensure_log_file(self) -> None: @@ -51,6 +66,64 @@ def _ensure_log_file(self) -> None: def _get_log_path(self) -> Path: """Return the instance-scoped autonomous API log path.""" return Path(system_config.data_dir) / "auto_api_log.txt" + + def _scrub_persisted_full_payloads(self) -> None: + """Remove legacy full prompt/response bodies from the on-disk JSONL log.""" + log_path = self._get_log_path() + if not log_path.exists(): + return + + changed = False + scrubbed_lines: List[str] = [] + + try: + with open(log_path, "r", encoding="utf-8") as f: + lines = f.readlines() + + for line in lines: + stripped = line.strip() + if not stripped: + continue + try: + entry = json.loads(stripped) + except json.JSONDecodeError: + scrubbed_lines.append(line) + continue + + original_entry = dict(entry) + prompt_full = str(entry.pop("prompt_full", "") or "") + response_full = str(entry.pop("response_full", "") or "") + prompt_source = prompt_full or str(entry.get("prompt_preview") or "") + response_source = response_full or 
str(entry.get("response_preview") or "") + + if prompt_source: + prompt_meta = _payload_metadata(prompt_source, 1000) + entry["prompt_preview"] = prompt_meta["preview"] + entry["prompt_size"] = int(entry.get("prompt_size") or prompt_meta["size"]) + entry.setdefault("prompt_sha256", prompt_meta["sha256"]) + if response_source: + response_meta = _payload_metadata(response_source, 2000) + entry["response_preview"] = response_meta["preview"] + entry["response_size"] = int(entry.get("response_size") or response_meta["size"]) + entry.setdefault("response_sha256", response_meta["sha256"]) + + entry["prompt_redacted"] = True + entry["response_redacted"] = True + entry["has_full_prompt"] = False + entry["has_full_response"] = False + if entry.get("error"): + entry["error"] = redact_log_text(entry["error"], 1000) + + if prompt_full or response_full or entry != original_entry: + changed = True + scrubbed_lines.append(json.dumps(entry) + "\n") + + if changed: + with open(log_path, "w", encoding="utf-8") as f: + f.writelines(scrubbed_lines) + logger.info("Scrubbed legacy full prompt/response payloads from autonomous API log") + except Exception as e: + logger.warning(f"Failed to scrub legacy autonomous API log payloads: {e}") async def log_api_call( self, @@ -64,7 +137,8 @@ async def log_api_call( duration_ms: Optional[float] = None, success: bool = True, error: Optional[str] = None, - phase: str = "unknown" + phase: str = "unknown", + workflow: str = "autonomous", ) -> None: """ Log an autonomous research API call. 
@@ -81,9 +155,14 @@ async def log_api_call( success: Whether the call succeeded error: Error message if call failed phase: Research phase ("topic_selection", "brainstorm", "paper_compilation", "tier3") + workflow: Workflow namespace for this call ("autonomous" or "leanoj") """ async with self._lock: try: + prompt_meta = _payload_metadata(prompt, 1000) + response_meta = _payload_metadata(response_content, 2000) + store_full_payloads = bool(system_config.api_log_store_full_payloads) + log_entry = { "timestamp": datetime.now().isoformat(), "task_id": task_id, @@ -91,15 +170,25 @@ async def log_api_call( "model": model, "provider": provider, "phase": phase, - "prompt_preview": prompt[:1000] if prompt else "", - "prompt_full": prompt, - "response_preview": response_content[:2000] if response_content else "", - "response_full": response_content, + "workflow": workflow, + "prompt_preview": prompt_meta["preview"], + "prompt_size": prompt_meta["size"], + "prompt_sha256": prompt_meta["sha256"], + "prompt_redacted": not store_full_payloads, + "has_full_prompt": store_full_payloads and bool(prompt), + "response_preview": response_meta["preview"], + "response_size": response_meta["size"], + "response_sha256": response_meta["sha256"], + "response_redacted": not store_full_payloads, + "has_full_response": store_full_payloads and bool(response_content), "tokens_used": tokens_used, "duration_ms": duration_ms, "success": success, - "error": error + "error": redact_log_text(error, 1000) } + if store_full_payloads: + log_entry["prompt_full"] = prompt + log_entry["response_full"] = response_content # Append to log file with open(self._get_log_path(), "a", encoding="utf-8") as f: @@ -129,7 +218,7 @@ async def _trim_log_if_needed(self) -> None: except Exception as e: logger.error(f"Failed to trim autonomous API log: {e}") - async def get_logs(self, limit: int = 100) -> List[Dict[str, Any]]: + async def get_logs(self, limit: int = 100, include_full: bool = True) -> List[Dict[str, Any]]: 
""" Get recent autonomous API call logs. @@ -146,7 +235,7 @@ async def get_logs(self, limit: int = 100) -> List[Dict[str, Any]]: return [] with open(log_path, "r", encoding="utf-8") as f: - lines = f.readlines() + lines = deque(f, maxlen=max(1, limit)) logs = [] for line in lines: @@ -154,6 +243,17 @@ async def get_logs(self, limit: int = 100) -> List[Dict[str, Any]]: if line: try: log_entry = json.loads(line) + if not include_full or not system_config.api_log_store_full_payloads: + prompt_full = str(log_entry.pop("prompt_full", "") or "") + response_full = str(log_entry.pop("response_full", "") or "") + log_entry["prompt_size"] = int(log_entry.get("prompt_size") or len(prompt_full)) + log_entry["response_size"] = int(log_entry.get("response_size") or len(response_full)) + log_entry["has_full_prompt"] = False + log_entry["has_full_response"] = False + if prompt_full and not log_entry.get("prompt_sha256"): + log_entry["prompt_sha256"] = hashlib.sha256(prompt_full.encode("utf-8", errors="replace")).hexdigest() + if response_full and not log_entry.get("response_sha256"): + log_entry["response_sha256"] = hashlib.sha256(response_full.encode("utf-8", errors="replace")).hexdigest() logs.append(log_entry) except json.JSONDecodeError: continue @@ -165,11 +265,49 @@ async def get_logs(self, limit: int = 100) -> List[Dict[str, Any]]: except Exception as e: logger.error(f"Failed to get autonomous API logs: {e}") return [] - - async def clear_logs(self) -> None: - """Clear all autonomous API logs.""" + + @staticmethod + def _entry_workflow(entry: Dict[str, Any]) -> str: + workflow = str(entry.get("workflow") or "").strip().lower() + if workflow: + return workflow + + role_id = str(entry.get("role_id") or "") + task_id = str(entry.get("task_id") or "") + if role_id.startswith("leanoj_") or task_id.startswith("leanoj_"): + return "leanoj" + return "autonomous" + + async def clear_logs(self, workflow: Optional[str] = None) -> None: + """Clear autonomous API logs, optionally scoped 
to one workflow.""" async with self._lock: try: + if workflow: + log_path = self._get_log_path() + if not os.path.exists(log_path): + return + + with open(log_path, "r", encoding="utf-8") as f: + lines = f.readlines() + + retained_lines: List[str] = [] + for line in lines: + stripped = line.strip() + if not stripped: + continue + try: + entry = json.loads(stripped) + except json.JSONDecodeError: + retained_lines.append(line) + continue + if self._entry_workflow(entry) != workflow: + retained_lines.append(line) + + with open(log_path, "w", encoding="utf-8") as f: + f.writelines(retained_lines) + logger.info("Autonomous API logs cleared for workflow %s", workflow) + return + with open(self._get_log_path(), "w", encoding="utf-8") as f: f.write("") logger.info("Autonomous API logs cleared") diff --git a/backend/autonomous/memory/final_answer_memory.py b/backend/autonomous/memory/final_answer_memory.py index 44b827b..b0b4d5a 100644 --- a/backend/autonomous/memory/final_answer_memory.py +++ b/backend/autonomous/memory/final_answer_memory.py @@ -750,6 +750,26 @@ def _get_source_papers_dir(self) -> Path: def _get_source_brainstorms_dir(self) -> Path: """Get path to source brainstorms archive directory.""" return self._base_dir / "source_brainstorms" + + def _get_archived_paper_paths(self, paper_id: str) -> Dict[str, Path]: + """Return root-confined paths for one archived paper ID.""" + safe_id = validate_single_path_component(paper_id, "paper ID") + source_papers_dir = self._get_source_papers_dir() + return { + "content": resolve_path_within_root(source_papers_dir, f"paper_{safe_id}.txt"), + "abstract": resolve_path_within_root(source_papers_dir, f"paper_{safe_id}_abstract.txt"), + "outline": resolve_path_within_root(source_papers_dir, f"paper_{safe_id}_outline.txt"), + "metadata": resolve_path_within_root(source_papers_dir, f"paper_{safe_id}_metadata.json"), + } + + def _get_archived_brainstorm_paths(self, topic_id: str) -> Dict[str, Path]: + """Return root-confined paths 
for one archived brainstorm ID.""" + safe_id = validate_single_path_component(topic_id, "topic ID") + source_brainstorms_dir = self._get_source_brainstorms_dir() + return { + "content": resolve_path_within_root(source_brainstorms_dir, f"brainstorm_{safe_id}.txt"), + "metadata": resolve_path_within_root(source_brainstorms_dir, f"brainstorm_{safe_id}_metadata.json"), + } async def save_chapter_paper( self, @@ -804,26 +824,24 @@ async def _archive_paper(self, paper_id: str) -> bool: try: source_papers_dir = self._get_source_papers_dir() source_papers_dir.mkdir(parents=True, exist_ok=True) + archive_paths = self._get_archived_paper_paths(paper_id) # Copy paper content content = await paper_library.get_paper_content(paper_id) if content: - paper_path = source_papers_dir / f"paper_{paper_id}.txt" - async with aiofiles.open(paper_path, 'w', encoding='utf-8') as f: + async with aiofiles.open(archive_paths["content"], 'w', encoding='utf-8') as f: await f.write(content) # Copy abstract abstract = await paper_library.get_abstract(paper_id) if abstract: - abstract_path = source_papers_dir / f"paper_{paper_id}_abstract.txt" - async with aiofiles.open(abstract_path, 'w', encoding='utf-8') as f: + async with aiofiles.open(archive_paths["abstract"], 'w', encoding='utf-8') as f: await f.write(abstract) # Copy outline outline = await paper_library.get_outline(paper_id) if outline: - outline_path = source_papers_dir / f"paper_{paper_id}_outline.txt" - async with aiofiles.open(outline_path, 'w', encoding='utf-8') as f: + async with aiofiles.open(archive_paths["outline"], 'w', encoding='utf-8') as f: await f.write(outline) # Copy metadata @@ -834,8 +852,7 @@ async def _archive_paper(self, paper_id: str) -> bool: if isinstance(value, datetime): metadata_data[key] = value.isoformat() - metadata_path = source_papers_dir / f"paper_{paper_id}_metadata.json" - async with aiofiles.open(metadata_path, 'w', encoding='utf-8') as f: + async with aiofiles.open(archive_paths["metadata"], 'w', 
encoding='utf-8') as f: await f.write(json.dumps(metadata_data, indent=2)) logger.info(f"Archived paper {paper_id} to final answer source_papers") @@ -860,12 +877,12 @@ async def _archive_brainstorm(self, topic_id: str) -> bool: try: source_brainstorms_dir = self._get_source_brainstorms_dir() source_brainstorms_dir.mkdir(parents=True, exist_ok=True) + archive_paths = self._get_archived_brainstorm_paths(topic_id) # Copy brainstorm database content = await brainstorm_memory.get_database_content(topic_id) if content: - db_path = source_brainstorms_dir / f"brainstorm_{topic_id}.txt" - async with aiofiles.open(db_path, 'w', encoding='utf-8') as f: + async with aiofiles.open(archive_paths["content"], 'w', encoding='utf-8') as f: await f.write(content) # Copy metadata @@ -876,8 +893,7 @@ async def _archive_brainstorm(self, topic_id: str) -> bool: if isinstance(value, datetime): metadata_data[key] = value.isoformat() - metadata_path = source_brainstorms_dir / f"brainstorm_{topic_id}_metadata.json" - async with aiofiles.open(metadata_path, 'w', encoding='utf-8') as f: + async with aiofiles.open(archive_paths["metadata"], 'w', encoding='utf-8') as f: await f.write(json.dumps(metadata_data, indent=2)) logger.info(f"Archived brainstorm {topic_id} to final answer source_brainstorms") @@ -975,11 +991,11 @@ async def get_archived_paper(self, paper_id: str) -> Optional[Dict[str, Any]]: Returns: Dictionary with paper content, abstract, outline, metadata """ - source_papers_dir = self._get_source_papers_dir() + archive_paths = self._get_archived_paper_paths(paper_id) try: # Read content - paper_path = source_papers_dir / f"paper_{paper_id}.txt" + paper_path = archive_paths["content"] if not paper_path.exists(): return None @@ -987,21 +1003,21 @@ async def get_archived_paper(self, paper_id: str) -> Optional[Dict[str, Any]]: content = await f.read() # Read abstract - abstract_path = source_papers_dir / f"paper_{paper_id}_abstract.txt" + abstract_path = archive_paths["abstract"] 
abstract = "" if abstract_path.exists(): async with aiofiles.open(abstract_path, 'r', encoding='utf-8') as f: abstract = await f.read() # Read outline - outline_path = source_papers_dir / f"paper_{paper_id}_outline.txt" + outline_path = archive_paths["outline"] outline = "" if outline_path.exists(): async with aiofiles.open(outline_path, 'r', encoding='utf-8') as f: outline = await f.read() # Read metadata - metadata_path = source_papers_dir / f"paper_{paper_id}_metadata.json" + metadata_path = archive_paths["metadata"] metadata = {} if metadata_path.exists(): async with aiofiles.open(metadata_path, 'r', encoding='utf-8') as f: @@ -1054,11 +1070,11 @@ async def get_archived_brainstorm(self, topic_id: str) -> Optional[Dict[str, Any Returns: Dictionary with brainstorm content and metadata """ - source_brainstorms_dir = self._get_source_brainstorms_dir() + archive_paths = self._get_archived_brainstorm_paths(topic_id) try: # Read database content - db_path = source_brainstorms_dir / f"brainstorm_{topic_id}.txt" + db_path = archive_paths["content"] if not db_path.exists(): return None @@ -1066,7 +1082,7 @@ async def get_archived_brainstorm(self, topic_id: str) -> Optional[Dict[str, Any content = await f.read() # Read metadata - metadata_path = source_brainstorms_dir / f"brainstorm_{topic_id}_metadata.json" + metadata_path = archive_paths["metadata"] metadata = {} if metadata_path.exists(): async with aiofiles.open(metadata_path, 'r', encoding='utf-8') as f: @@ -1390,7 +1406,7 @@ async def list_all_final_answers(self) -> List[Dict[str, Any]]: - word_count: Total words - chapter_count: Number of chapters (long form only) - completion_date: When it was completed - - location: Path to the answer + - location: Logical answer scope (never an absolute filesystem path) - session_id: Session identifier (or "legacy" for old format) """ final_answers = [] @@ -1455,7 +1471,7 @@ async def list_all_final_answers(self) -> List[Dict[str, Any]]: "word_count": word_count, 
"chapter_count": chapter_count, "completion_date": completion_date, - "location": str(legacy_dir), + "location": "legacy", "session_id": "legacy" }) except Exception as e: @@ -1534,7 +1550,7 @@ async def list_all_final_answers(self) -> List[Dict[str, Any]]: "word_count": word_count, "chapter_count": chapter_count, "completion_date": completion_date, - "location": str(final_answer_dir), + "location": session_folder.name, "session_id": session_folder.name }) except Exception as e: @@ -1622,7 +1638,7 @@ async def get_final_answer_by_id(self, answer_id: str) -> Optional[Dict[str, Any "word_count": len(full_content.split()), "chapter_count": len(chapters), "completion_date": completion_date, - "location": str(base_dir), + "location": answer_id, "session_id": answer_id }, "content": full_content, diff --git a/backend/autonomous/memory/paper_library.py b/backend/autonomous/memory/paper_library.py index 415d752..c8b6bb1 100644 --- a/backend/autonomous/memory/paper_library.py +++ b/backend/autonomous/memory/paper_library.py @@ -18,6 +18,7 @@ resolve_path_within_root, validate_single_path_component, ) +from backend.shared.log_redaction import redact_log_text logger = logging.getLogger(__name__) @@ -36,6 +37,7 @@ def __init__(self): self._lock = asyncio.Lock() self._base_dir = Path(system_config.auto_papers_dir) self._archive_dir = Path(system_config.auto_papers_archive_dir) + self._pruned_dir = self._base_dir / "pruned" self._session_manager = None def set_session_manager(self, session_manager) -> None: @@ -44,7 +46,8 @@ def set_session_manager(self, session_manager) -> None: if session_manager and session_manager.is_session_active: self._base_dir = session_manager.get_papers_dir() self._archive_dir = session_manager.get_papers_dir() / "archive" - logger.info(f"Paper library using session path: {self._base_dir}") + self._pruned_dir = session_manager.get_papers_dir() / "pruned" + logger.info("Paper library using session path: %s", redact_log_text(self._base_dir, 240)) async 
def initialize(self) -> None: """Initialize the paper library directories.""" @@ -52,18 +55,39 @@ async def initialize(self) -> None: if self._session_manager and self._session_manager.is_session_active: self._base_dir = self._session_manager.get_papers_dir() self._archive_dir = self._base_dir / "archive" + self._pruned_dir = self._base_dir / "pruned" self._base_dir.mkdir(parents=True, exist_ok=True) self._archive_dir.mkdir(parents=True, exist_ok=True) - logger.info(f"Paper library initialized at {self._base_dir}") + self._pruned_dir.mkdir(parents=True, exist_ok=True) + logger.info("Paper library initialized at %s", redact_log_text(self._base_dir, 240)) def _safe_paper_id(self, paper_id: str) -> str: """Validate paper_id as a single path component.""" return validate_single_path_component(paper_id, "paper ID") + def _paper_path(self, root: Path, paper_id: str, suffix: str, *, prefix: str = "paper_") -> Path: + """Build a paper-related path inside a trusted library root.""" + safe_id = self._safe_paper_id(paper_id) + return resolve_path_within_root(root, f"{prefix}{safe_id}{suffix}") + + def _ensure_library_path(self, path: Path, label: str = "library path") -> Path: + """Verify an existing helper-built path is still under this library's roots.""" + candidate = Path(path) + for root in (self._base_dir, self._archive_dir, self._pruned_dir): + try: + return resolve_path_within_root(root, str(candidate)) + except ValueError: + continue + raise ValueError(f"{label} escapes paper library roots") + def _get_paper_path(self, paper_id: str) -> Path: """Get path to paper file.""" - return self._base_dir / f"paper_{self._safe_paper_id(paper_id)}.txt" + return self._paper_path(self._base_dir, paper_id, ".txt") + + def _get_pruned_paper_path(self, paper_id: str) -> Path: + """Get path to a pruned paper file.""" + return self._paper_path(self._pruned_dir, paper_id, ".txt", prefix="pruned_paper_") def get_paper_path(self, paper_id: str) -> str: """ @@ -87,23 +111,55 @@ def 
get_outline_path(self, paper_id: str) -> str: def _get_abstract_path(self, paper_id: str) -> Path: """Get path to abstract file.""" - return self._base_dir / f"paper_{self._safe_paper_id(paper_id)}_abstract.txt" + return self._paper_path(self._base_dir, paper_id, "_abstract.txt") + + def _get_pruned_abstract_path(self, paper_id: str) -> Path: + """Get path to pruned paper abstract file.""" + return self._paper_path(self._pruned_dir, paper_id, "_abstract.txt", prefix="pruned_paper_") def _get_source_brainstorm_path(self, paper_id: str) -> Path: """Get path to cached source brainstorm file.""" - return self._base_dir / f"paper_{self._safe_paper_id(paper_id)}_source_brainstorm.txt" + return self._paper_path(self._base_dir, paper_id, "_source_brainstorm.txt") + + def _get_pruned_source_brainstorm_path(self, paper_id: str) -> Path: + """Get path to pruned cached source brainstorm file.""" + return self._paper_path(self._pruned_dir, paper_id, "_source_brainstorm.txt", prefix="pruned_paper_") def _get_outline_path(self, paper_id: str) -> Path: """Get path to paper outline file.""" - return self._base_dir / f"paper_{self._safe_paper_id(paper_id)}_outline.txt" + return self._paper_path(self._base_dir, paper_id, "_outline.txt") + + def _get_pruned_outline_path(self, paper_id: str) -> Path: + """Get path to pruned paper outline file.""" + return self._paper_path(self._pruned_dir, paper_id, "_outline.txt", prefix="pruned_paper_") def _get_metadata_path(self, paper_id: str) -> Path: """Get path to paper metadata JSON file.""" - return self._base_dir / f"paper_{self._safe_paper_id(paper_id)}_metadata.json" + return self._paper_path(self._base_dir, paper_id, "_metadata.json") + + def _get_pruned_metadata_path(self, paper_id: str) -> Path: + """Get path to pruned paper metadata JSON file.""" + return self._paper_path(self._pruned_dir, paper_id, "_metadata.json", prefix="pruned_paper_") def _get_rejections_path(self, paper_id: str) -> Path: """Get path to paper compiler rejections 
file.""" - return self._base_dir / f"paper_{self._safe_paper_id(paper_id)}_last_10_rejections.txt" + return self._paper_path(self._base_dir, paper_id, "_last_10_rejections.txt") + + def _get_pruned_rejections_path(self, paper_id: str) -> Path: + """Get path to pruned paper compiler rejections file.""" + return self._paper_path(self._pruned_dir, paper_id, "_last_10_rejections.txt", prefix="pruned_paper_") + + def _get_archive_paper_path(self, paper_id: str) -> Path: + """Get path to a legacy archived paper file.""" + return self._paper_path(self._archive_dir, paper_id, ".txt") + + def _get_archive_outline_path(self, paper_id: str) -> Path: + """Get path to a legacy archived outline file.""" + return self._paper_path(self._archive_dir, paper_id, "_outline.txt") + + def _get_archive_metadata_path(self, paper_id: str) -> Path: + """Get path to a legacy archived metadata file.""" + return self._paper_path(self._archive_dir, paper_id, "_metadata.json") # ======================================================================== # HISTORY HELPERS @@ -115,6 +171,7 @@ def _build_scoped_library(base_dir: Path) -> "PaperLibrary": scoped_library = PaperLibrary() scoped_library._base_dir = base_dir scoped_library._archive_dir = base_dir / "archive" + scoped_library._pruned_dir = base_dir / "pruned" return scoped_library @staticmethod @@ -188,7 +245,11 @@ async def _get_history_user_prompt(self, session_id: str) -> str: metadata.get("user_research_prompt"), ) except Exception as e: - logger.warning(f"Failed to read history prompt for session {session_id}: {e}") + logger.warning( + "Failed to read history prompt for session %s: %s", + redact_log_text(session_id, 120), + redact_log_text(e, 240), + ) return self._derive_history_prompt_from_session_id(session_id) @staticmethod @@ -222,6 +283,7 @@ def _format_verified_proof_entry(cls, proof: Any, source_context: str = "") -> s novelty_tier = str(cls._proof_value(proof, "novelty_tier", "") or "").strip() tier_labels = { + 
"major_mathematical_discovery": "Major Mathematical Discovery", "mathematical_discovery": "Mathematical Discovery", "novel_variant": "Novel Reformulation", "novel_formulation": "Novel Formalization", @@ -294,9 +356,27 @@ def attach_verified_proofs_to_content( return before + appendix_block + after fallback_header = "=== PROOFS ATTACHED TO THIS PAPER (Lean 4 Verified) ===" + self_review_match = re.search( + r"(?:^|\n)\s*(?:#+\s*)?AI Self-Review and Limitations\s*\n", + existing_content, + re.IGNORECASE, + ) + + def insert_before_self_review(proof_block: str) -> str: + if not self_review_match: + return existing_content.rstrip() + "\n\n" + proof_block.rstrip() + "\n" + insert_at = self_review_match.start() + if insert_at > 0 and existing_content[insert_at] == "\n": + insert_at += 1 + before_review = existing_content[:insert_at].rstrip() + review_and_after = existing_content[insert_at:].lstrip() + return f"{before_review}\n\n{proof_block.rstrip()}\n\n{review_and_after}" + if fallback_header in existing_content: - return existing_content.rstrip() + "\n\n" + new_entries + "\n" - return existing_content.rstrip() + "\n\n" + fallback_header + "\n\n" + new_entries + "\n" + return insert_before_self_review(new_entries) + + proof_section = f"{fallback_header}\n\n{new_entries}" + return insert_before_self_review(proof_section) @staticmethod def strip_verified_proofs_from_content(content: str) -> str: @@ -328,40 +408,168 @@ def strip_verified_proofs_from_content(content: str) -> str: return stripped.rstrip() - async def _list_history_papers_from_directory(self, papers_dir: Path, session_id: str) -> List[Dict[str, Any]]: - """List complete, non-archived papers from one legacy/session papers directory.""" + @staticmethod + def _pruned_banner( + *, + paper_id: str, + pruned_at: datetime, + pruned_by: str, + reason: str, + ) -> str: + """Build the raw-file banner that identifies a preserved pruned paper.""" + actor_note = ( + "The system decided autonomously that this paper hurt 
context cumulation." + if pruned_by in {"system", "legacy"} + else "The user removed this paper from model context accumulation." + ) + return ( + "PRUNED PAPER - REMOVED FROM MODEL CONTEXT\n\n" + f"{actor_note}\n" + "This file is preserved for user review and download only. " + "It must not be used as future model context.\n\n" + f"Original Paper ID: {paper_id}\n" + f"Pruned At: {pruned_at.isoformat()}\n" + f"Pruned By: {pruned_by}\n" + f"Prune Reason: {reason or 'No reason recorded.'}\n" + f"{'=' * 80}\n\n" + ) + + @staticmethod + def _strip_existing_pruned_banner(content: str) -> str: + """Avoid duplicating the pruned banner if a paper is pruned twice.""" + if not content.startswith("PRUNED PAPER - REMOVED FROM MODEL CONTEXT"): + return content + marker = f"{'=' * 80}\n\n" + marker_index = content.find(marker) + if marker_index < 0: + return content + return content[marker_index + len(marker):] + + @staticmethod + def _metadata_to_dict(metadata: PaperMetadata) -> Dict[str, Any]: + """Serialize metadata for JSON files across pydantic versions.""" + if hasattr(metadata, "model_dump"): + return metadata.model_dump() + return metadata.dict() + + async def _read_metadata_file(self, metadata_path: Path) -> Optional[PaperMetadata]: + """Read a metadata file into PaperMetadata.""" + try: + metadata_path = self._ensure_library_path(metadata_path, "metadata path") + except ValueError as exc: + logger.warning("Rejected unsafe metadata path: %s", exc) + return None + + # codeql[py/path-injection]: metadata_path is constrained to this paper library's roots. + if not metadata_path.exists(): + return None + try: + # codeql[py/path-injection]: metadata_path is constrained to this paper library's roots. 
+ async with aiofiles.open(metadata_path, 'r', encoding='utf-8') as f: + content = await f.read() + return PaperMetadata(**json.loads(content)) + except Exception as e: + logger.error( + "Failed to load paper metadata from %s: %s", + redact_log_text(metadata_path, 240), + redact_log_text(e, 240), + ) + return None + + async def _save_metadata_to_path(self, metadata: PaperMetadata, metadata_path: Path) -> None: + """Save paper metadata to a specific path.""" + metadata_path = self._ensure_library_path(metadata_path, "metadata path") + # codeql[py/path-injection]: metadata_path is constrained to this paper library's roots. + metadata_path.parent.mkdir(parents=True, exist_ok=True) + # codeql[py/path-injection]: metadata_path is constrained to this paper library's roots. + async with aiofiles.open(metadata_path, 'w', encoding='utf-8') as f: + await f.write(json.dumps(self._metadata_to_dict(metadata), indent=2, default=str)) + + async def _read_text_file(self, path: Path) -> str: + """Read a text file if it exists.""" + try: + path = self._ensure_library_path(path, "text path") + except ValueError as exc: + logger.warning("Rejected unsafe text path: %s", exc) + return "" + + # codeql[py/path-injection]: path is constrained to this paper library's roots. + if not path.exists(): + return "" + try: + # codeql[py/path-injection]: path is constrained to this paper library's roots. + async with aiofiles.open(path, 'r', encoding='utf-8') as f: + return await f.read() + except Exception as e: + logger.error( + "Failed to read %s: %s", + redact_log_text(path, 240), + redact_log_text(e, 240), + ) + return "" + + def _pruned_note_for(self, metadata: PaperMetadata) -> str: + """Return the user-facing pruned-paper note.""" + if metadata.pruned_by == "system" or metadata.status == "archived": + return "The system decided autonomously that this paper hurt context cumulation." + return "The user removed this paper from model context accumulation." 
+ + async def _build_history_entry( + self, + *, + metadata: PaperMetadata, + session_id: str, + papers_dir: Path, + pruned: bool = False, + ) -> Dict[str, Any]: + """Build the shared Stage 2 history response shape.""" from backend.shared.critique_memory import get_latest_critique + latest_critique = await get_latest_critique( + paper_type="autonomous_paper", + paper_id=metadata.paper_id, + base_dir=papers_dir + ) + + entry = { + "history_id": f"{session_id}:{metadata.paper_id}", + "session_id": session_id, + "paper_id": metadata.paper_id, + "title": metadata.title, + "abstract": metadata.abstract, + "word_count": metadata.word_count, + "source_brainstorm_ids": metadata.source_brainstorm_ids, + "referenced_papers": metadata.referenced_papers, + "status": metadata.status, + "created_at": metadata.created_at.isoformat() if metadata.created_at else None, + "model_usage": metadata.model_usage, + "user_prompt": await self._get_history_user_prompt(session_id), + "critique_avg": self._calculate_critique_average(latest_critique), + } + if pruned: + entry.update({ + "is_pruned": True, + "pruned_at": metadata.pruned_at.isoformat() if metadata.pruned_at else None, + "pruned_reason": metadata.pruned_reason, + "pruned_by": metadata.pruned_by or ("legacy" if metadata.status == "archived" else None), + "pruned_note": self._pruned_note_for(metadata), + }) + return entry + + async def _list_history_papers_from_directory(self, papers_dir: Path, session_id: str) -> List[Dict[str, Any]]: + """List complete, non-archived papers from one legacy/session papers directory.""" scoped_library = self._build_scoped_library(papers_dir) - user_prompt = await self._get_history_user_prompt(session_id) papers = await scoped_library.get_all_papers(validate_completeness=True) history_papers = [] for metadata in papers: if metadata.status != "complete": continue - - latest_critique = await get_latest_critique( - paper_type="autonomous_paper", - paper_id=metadata.paper_id, - base_dir=papers_dir - ) - - 
history_papers.append({ - "history_id": f"{session_id}:{metadata.paper_id}", - "session_id": session_id, - "paper_id": metadata.paper_id, - "title": metadata.title, - "abstract": metadata.abstract, - "word_count": metadata.word_count, - "source_brainstorm_ids": metadata.source_brainstorm_ids, - "referenced_papers": metadata.referenced_papers, - "status": metadata.status, - "created_at": metadata.created_at.isoformat() if metadata.created_at else None, - "model_usage": metadata.model_usage, - "user_prompt": user_prompt, - "critique_avg": self._calculate_critique_average(latest_critique), - }) + history_papers.append(await scoped_library._build_history_entry( + metadata=metadata, + session_id=session_id, + papers_dir=papers_dir, + )) return history_papers @@ -389,10 +597,71 @@ async def list_history_papers(self) -> List[Dict[str, Any]]: history_papers.sort(key=lambda paper: paper.get("created_at") or "", reverse=True) return history_papers + async def _list_pruned_history_papers_from_directory(self, papers_dir: Path, session_id: str) -> List[Dict[str, Any]]: + """List pruned papers from one legacy/session papers directory.""" + scoped_library = self._build_scoped_library(papers_dir) + pruned_papers: List[Dict[str, Any]] = [] + + if scoped_library._pruned_dir.exists(): + for metadata_path in scoped_library._pruned_dir.glob("pruned_paper_*_metadata.json"): + metadata = await scoped_library._read_metadata_file(metadata_path) + if not metadata: + continue + metadata.status = "pruned" + pruned_papers.append(await scoped_library._build_history_entry( + metadata=metadata, + session_id=session_id, + papers_dir=papers_dir, + pruned=True, + )) + + # Legacy archived papers are exposed as pruned history for user access. 
+ if scoped_library._archive_dir.exists(): + for metadata_path in scoped_library._archive_dir.glob("paper_*_metadata.json"): + metadata = await scoped_library._read_metadata_file(metadata_path) + if not metadata: + continue + metadata.status = "archived" + metadata.pruned_by = metadata.pruned_by or "legacy" + metadata.pruned_reason = metadata.pruned_reason or "Legacy archived paper preserved as pruned history." + pruned_papers.append(await scoped_library._build_history_entry( + metadata=metadata, + session_id=session_id, + papers_dir=papers_dir, + pruned=True, + )) + + return pruned_papers + + async def list_pruned_history_papers(self) -> List[Dict[str, Any]]: + """List all pruned Stage 2 papers from legacy and session storage.""" + pruned_papers: List[Dict[str, Any]] = [] + + legacy_papers_dir = Path(system_config.auto_papers_dir) + if legacy_papers_dir.exists(): + pruned_papers.extend( + await self._list_pruned_history_papers_from_directory(legacy_papers_dir, "legacy") + ) + + sessions_dir = Path(system_config.auto_sessions_base_dir) + if sessions_dir.exists(): + for session_dir in sorted((p for p in sessions_dir.iterdir() if p.is_dir()), reverse=True): + papers_dir = session_dir / "papers" + if not papers_dir.exists(): + continue + + pruned_papers.extend( + await self._list_pruned_history_papers_from_directory(papers_dir, session_dir.name) + ) + + pruned_papers.sort( + key=lambda paper: paper.get("pruned_at") or paper.get("created_at") or "", + reverse=True, + ) + return pruned_papers + async def get_history_paper(self, session_id: str, paper_id: str) -> Optional[Dict[str, Any]]: """Get one complete, non-archived Stage 2 paper from legacy/session history.""" - from backend.shared.critique_memory import get_latest_critique - papers_dir = self.get_history_papers_dir(session_id) if papers_dir is None: return None @@ -407,26 +676,64 @@ async def get_history_paper(self, session_id: str, paper_id: str) -> Optional[Di content = await 
scoped_library.get_paper_content(paper_id) outline = await scoped_library.get_outline(paper_id) - latest_critique = await get_latest_critique( - paper_type="autonomous_paper", - paper_id=paper_id, - base_dir=papers_dir + entry = await scoped_library._build_history_entry( + metadata=metadata, + session_id=session_id, + papers_dir=papers_dir, ) return { - "history_id": f"{session_id}:{paper_id}", - "session_id": session_id, - "paper_id": metadata.paper_id, - "title": metadata.title, - "abstract": metadata.abstract, - "word_count": metadata.word_count, - "source_brainstorm_ids": metadata.source_brainstorm_ids, - "referenced_papers": metadata.referenced_papers, - "status": metadata.status, - "created_at": metadata.created_at.isoformat() if metadata.created_at else None, - "model_usage": metadata.model_usage, - "user_prompt": await self._get_history_user_prompt(session_id), - "critique_avg": self._calculate_critique_average(latest_critique), + **entry, + "content": content, + "outline": outline, + } + + async def get_pruned_history_paper(self, session_id: str, paper_id: str) -> Optional[Dict[str, Any]]: + """Get one pruned Stage 2 paper from legacy/session history.""" + papers_dir = self.get_history_papers_dir(session_id) + if papers_dir is None: + return None + + scoped_library = self._build_scoped_library(papers_dir) + metadata = await scoped_library._read_metadata_file( + scoped_library._get_pruned_metadata_path(paper_id) + ) + content_path = scoped_library._get_pruned_paper_path(paper_id) + outline_path = scoped_library._get_pruned_outline_path(paper_id) + is_legacy_archive = False + + # Legacy archives used the old paper_ prefix inside archive/. 
+ if metadata is None: + archive_metadata_path = scoped_library._get_archive_metadata_path(paper_id) + metadata = await scoped_library._read_metadata_file(archive_metadata_path) + content_path = scoped_library._get_archive_paper_path(paper_id) + outline_path = scoped_library._get_archive_outline_path(paper_id) + if metadata: + is_legacy_archive = True + metadata.status = "archived" + metadata.pruned_by = metadata.pruned_by or "legacy" + metadata.pruned_reason = metadata.pruned_reason or "Legacy archived paper preserved as pruned history." + + if metadata is None: + return None + + content = await scoped_library._read_text_file(content_path) + if is_legacy_archive and content and not content.startswith("PRUNED PAPER - REMOVED FROM MODEL CONTEXT"): + content = scoped_library._pruned_banner( + paper_id=paper_id, + pruned_at=metadata.pruned_at or metadata.created_at or datetime.now(), + pruned_by=metadata.pruned_by or "legacy", + reason=metadata.pruned_reason or "Legacy archived paper preserved as pruned history.", + ) + content + outline = await scoped_library._read_text_file(outline_path) + entry = await scoped_library._build_history_entry( + metadata=metadata, + session_id=session_id, + papers_dir=papers_dir, + pruned=True, + ) + return { + **entry, "content": content, "outline": outline, } @@ -597,7 +904,11 @@ async def _is_paper_complete(self, paper_id: str) -> bool: return True except Exception as e: - logger.error(f"Failed to validate paper {paper_id}: {e}") + logger.error( + "Failed to validate paper %s: %s", + redact_log_text(paper_id, 120), + redact_log_text(e, 240), + ) return False # ======================================================================== @@ -661,19 +972,19 @@ async def save_paper( paper_path = self._get_paper_path(paper_id) async with aiofiles.open(paper_path, 'w', encoding='utf-8') as f: await f.write(content) - logger.info(f"Paper saved: {paper_path}") + logger.info("Paper saved: %s", redact_log_text(paper_path, 240)) # Save outline 
outline_path = self._get_outline_path(paper_id) async with aiofiles.open(outline_path, 'w', encoding='utf-8') as f: await f.write(outline) - logger.info(f"Outline saved: {outline_path}") + logger.info("Outline saved: %s", redact_log_text(outline_path, 240)) # Save abstract abstract_path = self._get_abstract_path(paper_id) async with aiofiles.open(abstract_path, 'w', encoding='utf-8') as f: await f.write(abstract) - logger.info(f"Abstract saved: {abstract_path}") + logger.info("Abstract saved: %s", redact_log_text(abstract_path, 240)) # Save source brainstorm cache source_path = self._get_source_brainstorm_path(paper_id) @@ -689,7 +1000,13 @@ async def save_paper( await self._save_metadata(metadata) model_count = len(model_usage) if model_usage else 0 - logger.info(f"Saved paper {paper_id}: '{title}' ({word_count} words, {model_count} models tracked)") + logger.info( + "Saved paper %s: '%s' (%s words, %s models tracked)", + redact_log_text(paper_id, 120), + redact_log_text(title, 240), + word_count, + model_count, + ) return metadata async def get_paper_content(self, paper_id: str, *, strip_proofs: bool = False) -> str: @@ -709,13 +1026,16 @@ async def get_paper_content(self, paper_id: str, *, strip_proofs: bool = False) return "" try: - async with aiofiles.open(paper_path, 'r', encoding='utf-8') as f: - content = await f.read() + content = await self._read_text_file(paper_path) if strip_proofs and content: content = self.strip_verified_proofs_from_content(content) return content except Exception as e: - logger.error(f"Failed to read paper {paper_id}: {e}") + logger.error( + "Failed to read paper %s: %s", + redact_log_text(paper_id, 120), + redact_log_text(e, 240), + ) return "" async def append_proofs_section(self, paper_id: str, proofs_data: Any) -> bool: @@ -724,7 +1044,10 @@ async def append_proofs_section(self, paper_id: str, proofs_data: Any) -> bool: session_id, scoped_paper_id = paper_id.split(":", 1) papers_dir = self.get_history_papers_dir(session_id) if 
papers_dir is None: - logger.error(f"History paper directory not found for proof append: {paper_id}") + logger.error( + "History paper directory not found for proof append: %s", + redact_log_text(paper_id, 120), + ) return False scoped_library = self._build_scoped_library(papers_dir) return await scoped_library.append_proofs_section(scoped_paper_id, proofs_data) @@ -732,7 +1055,7 @@ async def append_proofs_section(self, paper_id: str, proofs_data: Any) -> bool: async with self._lock: paper_path = self._get_paper_path(paper_id) if not paper_path.exists(): - logger.error(f"Paper not found for proof append: {paper_id}") + logger.error("Paper not found for proof append: %s", redact_log_text(paper_id, 120)) return False proofs = proofs_data if isinstance(proofs_data, list) else [proofs_data] @@ -747,93 +1070,69 @@ async def append_proofs_section(self, paper_id: str, proofs_data: Any) -> bool: "this paper", ) if updated_content == existing_content: - logger.info("No new proof entries to append to paper %s", paper_id) + logger.info("No new proof entries to append to paper %s", redact_log_text(paper_id, 120)) return True async with aiofiles.open(paper_path, "w", encoding="utf-8") as handle: await handle.write(updated_content) - logger.info("Appended %s proof(s) to paper %s", len(proofs), paper_id) + logger.info("Appended %s proof(s) to paper %s", len(proofs), redact_log_text(paper_id, 120)) return True except Exception as exc: - logger.error(f"Failed to append proofs to paper {paper_id}: {exc}") + logger.error( + "Failed to append proofs to paper %s: %s", + redact_log_text(paper_id, 120), + redact_log_text(exc, 240), + ) return False async def get_abstract(self, paper_id: str) -> str: """Get paper abstract.""" abstract_path = self._get_abstract_path(paper_id) - - if not abstract_path.exists(): - return "" - - try: - async with aiofiles.open(abstract_path, 'r', encoding='utf-8') as f: - return await f.read() - except Exception as e: - logger.error(f"Failed to read abstract 
for {paper_id}: {e}") - return "" + return await self._read_text_file(abstract_path) async def get_outline(self, paper_id: str) -> str: """Get paper outline.""" outline_path = self._get_outline_path(paper_id) - - if not outline_path.exists(): - return "" - - try: - async with aiofiles.open(outline_path, 'r', encoding='utf-8') as f: - return await f.read() - except Exception as e: - logger.error(f"Failed to read outline for {paper_id}: {e}") - return "" + return await self._read_text_file(outline_path) async def get_source_brainstorm(self, paper_id: str) -> str: """Get cached source brainstorm content.""" source_path = self._get_source_brainstorm_path(paper_id) - - if not source_path.exists(): - return "" - - try: - async with aiofiles.open(source_path, 'r', encoding='utf-8') as f: - return await f.read() - except Exception as e: - logger.error(f"Failed to read source brainstorm for {paper_id}: {e}") - return "" + return await self._read_text_file(source_path) async def _save_metadata(self, metadata: PaperMetadata) -> None: """Save paper metadata to JSON file.""" metadata_path = self._get_metadata_path(metadata.paper_id) try: - async with aiofiles.open(metadata_path, 'w', encoding='utf-8') as f: - await f.write(json.dumps(metadata.dict(), indent=2, default=str)) + await self._save_metadata_to_path(metadata, metadata_path) except Exception as e: - logger.error(f"Failed to save metadata for {metadata.paper_id}: {e}") + logger.error( + "Failed to save metadata for %s: %s", + redact_log_text(metadata.paper_id, 120), + redact_log_text(e, 240), + ) async def get_metadata(self, paper_id: str) -> Optional[PaperMetadata]: """Get paper metadata.""" metadata_path = self._get_metadata_path(paper_id) - - if not metadata_path.exists(): - return None - - try: - async with aiofiles.open(metadata_path, 'r', encoding='utf-8') as f: - content = await f.read() - data = json.loads(content) - return PaperMetadata(**data) - except Exception as e: - logger.error(f"Failed to load metadata 
for {paper_id}: {e}") - return None + return await self._read_metadata_file(metadata_path) - async def get_all_papers(self, include_archived: bool = False, include_in_progress: bool = False, validate_completeness: bool = True) -> List[PaperMetadata]: + async def get_all_papers( + self, + include_archived: bool = False, + include_in_progress: bool = False, + include_pruned: bool = False, + validate_completeness: bool = True, + ) -> List[PaperMetadata]: """ Get metadata for all papers. Args: - include_archived: If True, include archived papers + include_archived: If True, include legacy archived papers include_in_progress: If True, include papers with status="in_progress" (default False) + include_pruned: If True, include pruned papers from active metadata (legacy compatibility) validate_completeness: If True, only return papers with all required sections (default True) Returns: @@ -852,11 +1151,10 @@ async def get_all_papers(self, include_archived: bool = False, include_in_progre data = json.loads(content) metadata = PaperMetadata(**data) - # Filter by archive status if metadata.status == "archived" and not include_archived: continue - - # Filter by in_progress status + if metadata.status == "pruned" and not include_pruned: + continue if metadata.status == "in_progress" and not include_in_progress: logger.debug(f"Skipping in_progress paper {metadata.paper_id}") continue @@ -870,7 +1168,11 @@ async def get_all_papers(self, include_archived: bool = False, include_in_progre papers.append(metadata) except Exception as e: - logger.error(f"Failed to load paper metadata from {path}: {e}") + logger.error( + "Failed to load paper metadata from %s: %s", + redact_log_text(path, 240), + redact_log_text(e, 240), + ) # Sort by creation time (most recent first) papers.sort(key=lambda x: x.created_at, reverse=True) @@ -903,8 +1205,8 @@ async def get_most_recent_incomplete_paper(self) -> Optional[PaperMetadata]: data = json.loads(content) metadata = PaperMetadata(**data) - # Skip 
archived papers - if metadata.status == "archived": + # Skip archived/pruned papers + if metadata.status in {"archived", "pruned"}: continue # Check if paper is incomplete @@ -913,7 +1215,11 @@ async def get_most_recent_incomplete_paper(self) -> Optional[PaperMetadata]: incomplete_papers.append(metadata) logger.debug(f"Found incomplete paper: {metadata.paper_id}") except Exception as e: - logger.error(f"Failed to check paper completeness from {path}: {e}") + logger.error( + "Failed to check paper completeness from %s: %s", + redact_log_text(path, 240), + redact_log_text(e, 240), + ) if not incomplete_papers: return None @@ -935,46 +1241,102 @@ async def is_paper_complete(self, paper_id: str) -> bool: return await self._is_paper_complete(paper_id) # ======================================================================== - # ARCHIVE OPERATIONS + # PRUNE OPERATIONS # ======================================================================== - - async def archive_paper(self, paper_id: str) -> bool: - """ - Archive a paper (move to archive directory). - Used when paper is marked as redundant. - """ + + async def prune_paper( + self, + paper_id: str, + *, + reason: str = "", + pruned_by: str = "system", + ) -> bool: + """Soft-prune a paper from model context while preserving it for users.""" async with self._lock: try: - # Get metadata metadata = await self.get_metadata(paper_id) + pruned_metadata_path = self._get_pruned_metadata_path(paper_id) if metadata is None: - logger.error(f"Cannot archive paper {paper_id}: metadata not found") + # codeql[py/path-injection]: paper_id is validated by _get_pruned_metadata_path. 
+ if pruned_metadata_path.exists(): + logger.info("Paper %s is already pruned", redact_log_text(paper_id, 120)) + return True + logger.error("Cannot prune paper %s: metadata not found", redact_log_text(paper_id, 120)) return False - - # Update status - metadata.status = "archived" - await self._save_metadata(metadata) - - # Move files to archive directory + + self._pruned_dir.mkdir(parents=True, exist_ok=True) + + pruned_at = datetime.now() + metadata.status = "pruned" + metadata.pruned_at = pruned_at + metadata.pruned_reason = reason or "No pruning reason recorded." + metadata.pruned_by = pruned_by if pruned_by in {"system", "user", "legacy"} else "system" + + paper_path = self._ensure_library_path( + self._get_paper_path(paper_id), + "paper path", + ) + if paper_path.exists(): + content = await self._read_text_file(paper_path) + clean_content = self._strip_existing_pruned_banner(content) + pruned_content = self._pruned_banner( + paper_id=paper_id, + pruned_at=pruned_at, + pruned_by=metadata.pruned_by, + reason=metadata.pruned_reason, + ) + clean_content + pruned_paper_path = self._ensure_library_path( + self._get_pruned_paper_path(paper_id), + "pruned paper path", + ) + async with aiofiles.open(pruned_paper_path, 'w', encoding='utf-8') as f: + await f.write(pruned_content) + paper_path.unlink(missing_ok=True) + files_to_move = [ - (self._get_paper_path(paper_id), self._archive_dir / f"paper_{paper_id}.txt"), - (self._get_abstract_path(paper_id), self._archive_dir / f"paper_{paper_id}_abstract.txt"), - (self._get_outline_path(paper_id), self._archive_dir / f"paper_{paper_id}_outline.txt"), - (self._get_source_brainstorm_path(paper_id), self._archive_dir / f"paper_{paper_id}_source_brainstorm.txt"), - (self._get_metadata_path(paper_id), self._archive_dir / f"paper_{paper_id}_metadata.json"), - (self._get_rejections_path(paper_id), self._archive_dir / f"paper_{paper_id}_last_10_rejections.txt") + (self._get_abstract_path(paper_id), 
self._get_pruned_abstract_path(paper_id)), + (self._get_outline_path(paper_id), self._get_pruned_outline_path(paper_id)), + (self._get_source_brainstorm_path(paper_id), self._get_pruned_source_brainstorm_path(paper_id)), + (self._get_rejections_path(paper_id), self._get_pruned_rejections_path(paper_id)), ] - + for source, dest in files_to_move: if source.exists(): + source = self._ensure_library_path(source, "paper source path") + dest = self._ensure_library_path(dest, "pruned destination path") + dest.parent.mkdir(parents=True, exist_ok=True) shutil.move(str(source), str(dest)) - - logger.info(f"Paper {paper_id} archived successfully") + + await self._save_metadata_to_path(metadata, pruned_metadata_path) + metadata_path = self._ensure_library_path( + self._get_metadata_path(paper_id), + "metadata path", + ) + metadata_path.unlink(missing_ok=True) + + logger.info("Paper %s pruned successfully", redact_log_text(paper_id, 120)) return True - + except Exception as e: - logger.error(f"Failed to archive paper {paper_id}: {e}") + logger.error( + "Failed to prune paper %s: %s", + redact_log_text(paper_id, 120), + redact_log_text(e, 240), + ) return False + + async def archive_paper(self, paper_id: str) -> bool: + """ + Legacy compatibility wrapper. + + Redundancy removal is now a prune operation: the paper leaves model + context but remains downloadable and visibly labeled for users. 
+ """ + return await self.prune_paper( + paper_id, + reason="Legacy archive request treated as a pruned paper.", + pruned_by="system", + ) async def get_papers_summary(self) -> List[Dict[str, Any]]: """ @@ -1021,24 +1383,66 @@ async def get_all_papers_with_outlines(self) -> List[Dict[str, Any]]: return summaries async def count_papers(self) -> Dict[str, int]: - """Count total, archived, in_progress, and active (complete) papers.""" - all_papers = await self.get_all_papers(include_archived=True, include_in_progress=True, validate_completeness=False) + """Count total, pruned, archived, in_progress, and active (complete) papers.""" + all_papers = await self.get_all_papers( + include_archived=True, + include_in_progress=True, + include_pruned=True, + validate_completeness=False, + ) total = len(all_papers) archived = sum(1 for p in all_papers if p.status == "archived") + pruned = sum(1 for p in all_papers if p.status == "pruned") in_progress = sum(1 for p in all_papers if p.status == "in_progress") - active = total - archived - in_progress # Only "complete" papers are active + + if self._pruned_dir.exists(): + pruned += len(list(self._pruned_dir.glob("pruned_paper_*_metadata.json"))) + if self._archive_dir.exists(): + archived += len(list(self._archive_dir.glob("paper_*_metadata.json"))) + + total += pruned + archived + active = sum(1 for p in all_papers if p.status == "complete") return { "total": total, "active": active, "in_progress": in_progress, - "archived": archived + "archived": archived, + "pruned": pruned } # ======================================================================== # DELETE OPERATIONS # ======================================================================== + + async def delete_all_pruned_papers(self) -> int: + """Permanently delete all pruned and legacy archived paper files in this scope.""" + async with self._lock: + deleted_count = 0 + try: + for directory in (self._pruned_dir, self._archive_dir): + if not directory.exists(): + continue + 
for metadata_path in directory.glob("*paper_*_metadata.json"): + deleted_count += 1 + for path in directory.glob("*"): + if path.is_file(): + path.unlink() + # Leave the directory itself in place for future prunes. + logger.info( + "Deleted %s pruned/archived paper records from %s", + deleted_count, + redact_log_text(self._base_dir, 240), + ) + return deleted_count + except Exception as e: + logger.error( + "Failed to delete pruned papers from %s: %s", + redact_log_text(self._base_dir, 240), + redact_log_text(e, 240), + ) + return deleted_count async def delete_paper(self, paper_id: str) -> bool: """ @@ -1084,16 +1488,38 @@ async def delete_paper(self, paper_id: str) -> bool: path.unlink() deleted_any = True logger.debug(f"Deleted from archive: {path}") + + pruned_files = [ + self._get_pruned_paper_path(paper_id), + self._get_pruned_abstract_path(paper_id), + self._get_pruned_outline_path(paper_id), + self._get_pruned_source_brainstorm_path(paper_id), + self._get_pruned_metadata_path(paper_id), + self._get_pruned_rejections_path(paper_id), + ] + + for path in pruned_files: + if path.exists(): + path.unlink() + deleted_any = True + logger.debug(f"Deleted from pruned papers: {path}") if deleted_any: - logger.info(f"Paper {paper_id} deleted successfully") + logger.info("Paper %s deleted successfully", redact_log_text(paper_id, 120)) return True else: - logger.warning(f"Paper {paper_id} not found in active or archive directories") + logger.warning( + "Paper %s not found in active or archive directories", + redact_log_text(paper_id, 120), + ) return False except Exception as e: - logger.error(f"Failed to delete paper {paper_id}: {e}") + logger.error( + "Failed to delete paper %s: %s", + redact_log_text(paper_id, 120), + redact_log_text(e, 240), + ) return False diff --git a/backend/autonomous/memory/proof_database.py b/backend/autonomous/memory/proof_database.py index b2bff96..fd88121 100644 --- a/backend/autonomous/memory/proof_database.py +++ 
b/backend/autonomous/memory/proof_database.py @@ -7,6 +7,7 @@ import asyncio import json import logging +import re import shutil from datetime import datetime from pathlib import Path @@ -100,6 +101,31 @@ def _rebuild_reverse_indexes(self) -> None: if proof_id not in self._mathlib_reverse_short_index[short_name]: self._mathlib_reverse_short_index[short_name].append(proof_id) + def _rebuild_index_from_record_files_sync(self) -> Dict[str, Any]: + proofs: List[Dict[str, Any]] = [] + for record_path in self._base_dir.glob("proof_*.json"): + if record_path.name.endswith("_metadata.json"): + continue + try: + data = json.loads(record_path.read_text(encoding="utf-8")) + if not isinstance(data, dict) or not data.get("proof_id"): + continue + proofs.append(data) + except Exception as exc: + logger.warning("Skipping unreadable proof record during index rebuild: %s (%s)", record_path, exc) + + proofs.sort(key=lambda proof: proof.get("created_at", ""), reverse=True) + max_numeric_id = 0 + for proof in proofs: + proof_id = str(proof.get("proof_id", "")) + match = re.search(r"(\d+)$", proof_id) + if match: + max_numeric_id = max(max_numeric_id, int(match.group(1))) + return { + "next_proof_id": max(max_numeric_id + 1, len(proofs) + 1, 1), + "proofs": proofs, + } + async def initialize(self) -> None: """Ensure storage exists and load the index.""" if self._session_manager and self._session_manager.is_session_active: @@ -117,8 +143,11 @@ async def _load_index(self) -> None: self._index_data = json.loads(await handle.read()) except Exception as exc: logger.error("Failed to load proofs index: %s", exc) - self._index_data = self._default_index() - await self._save_index() + self._index_data = await asyncio.to_thread(self._rebuild_index_from_record_files_sync) + logger.warning( + "Rebuilt proofs index from %s record file(s) after index load failure", + len(self._index_data.get("proofs", [])), + ) else: self._index_data = self._default_index() await self._save_index() @@ -140,7 +169,7 
@@ def _ensure_index_loaded_sync(self) -> None: self._index_data = json.loads(index_path.read_text(encoding="utf-8")) except Exception as exc: logger.error("Failed to synchronously load proofs index: %s", exc) - self._index_data = self._default_index() + self._index_data = self._rebuild_index_from_record_files_sync() else: self._index_data = self._default_index() @@ -206,10 +235,26 @@ async def _save_failed_candidates( async def add_proof(self, record: ProofRecord) -> ProofRecord: """Persist a proof record and return the stored copy.""" + stored_record, _duplicate = await self.add_proof_if_absent(record) + return stored_record + + async def add_proof_if_absent(self, record: ProofRecord) -> tuple[ProofRecord, bool]: + """Persist a proof record unless an identical source/theorem/code exists.""" async with self._lock: if self._index_data is None: await self._load_index() + normalized_statement = " ".join((record.theorem_statement or "").split()) + normalized_code = "\n".join((record.lean_code or "").strip().splitlines()) + for existing in self._index_data.get("proofs", []): + if existing.get("source_type") != record.source_type or existing.get("source_id") != record.source_id: + continue + if " ".join(str(existing.get("theorem_statement") or "").split()) != normalized_statement: + continue + if "\n".join(str(existing.get("lean_code") or "").strip().splitlines()) != normalized_code: + continue + return self._deserialize_record(existing), True + proof_id = record.proof_id or f"proof_{self._index_data['next_proof_id']:03d}" stored_record = record.model_copy(update={"proof_id": proof_id}) serialized = self._serialize_record(stored_record) @@ -241,7 +286,7 @@ async def add_proof(self, record: ProofRecord) -> ProofRecord: stored_record.source_type, stored_record.source_id, ) - return stored_record + return stored_record, False async def record_failed_candidate( self, @@ -627,12 +672,13 @@ def get_novel_proofs_for_injection(self) -> str: lines = [ "=== VERIFIED NOVEL 
MATHEMATICAL PROOFS (Lean 4 Verified) ===", "[These proofs have been formally verified. They represent proven mathematical truths.", - "Novelty tiers: Mathematical Discovery (highest — new result), Novel Reformulation (novel reformulation of known proof), Novel Formalization (first Lean 4 formalization of known result).]", + "Novelty tiers: Major Mathematical Discovery (highest — possible prize-level discovery), Mathematical Discovery (new result), Novel Reformulation (novel reformulation of known proof), Novel Formalization (first Lean 4 formalization of known result).]", "", ] for index, proof in enumerate(novel_proofs, start=1): tier = proof.get("novelty_tier", "") tier_label = { + "major_mathematical_discovery": "Major Mathematical Discovery", "mathematical_discovery": "Mathematical Discovery", "novel_variant": "Novel Reformulation", "novel_formulation": "Novel Formalization", diff --git a/backend/autonomous/memory/research_metadata.py b/backend/autonomous/memory/research_metadata.py index c3c6732..ee076f5 100644 --- a/backend/autonomous/memory/research_metadata.py +++ b/backend/autonomous/memory/research_metadata.py @@ -58,6 +58,7 @@ def _get_default_stats(self) -> Dict[str, Any]: "total_brainstorms_completed": 0, "total_papers_completed": 0, "total_papers_archived": 0, + "total_papers_pruned": 0, "total_submissions_accepted": 0, "total_submissions_rejected": 0, "topic_selection_rejections": 0, @@ -108,6 +109,9 @@ async def _ensure_initialized(self) -> None: if self._stats is None: self._stats = self._get_default_stats() await self._save_stats() + else: + for key, value in self._get_default_stats().items(): + self._stats.setdefault(key, value) async def initialize(self, user_research_prompt: str = "") -> None: """Initialize or load research metadata.""" @@ -506,11 +510,33 @@ async def register_paper(self, metadata: PaperMetadata) -> None: async def archive_paper(self, paper_id: str) -> None: """Mark a paper as archived in central metadata.""" + await 
self.prune_paper( + paper_id, + reason="Legacy archive request treated as a pruned paper.", + pruned_by="system", + ) + + async def prune_paper( + self, + paper_id: str, + *, + reason: str = "", + pruned_by: str = "system", + ) -> None: + """Mark a paper as pruned in central metadata.""" async with self._lock: for i, p in enumerate(self._data.get("papers", [])): if p.get("paper_id") == paper_id: - self._data["papers"][i]["status"] = "archived" + self._data["papers"][i]["status"] = "pruned" + self._data["papers"][i]["pruned_at"] = datetime.now().isoformat() + self._data["papers"][i]["pruned_reason"] = reason or "No pruning reason recorded." + self._data["papers"][i]["pruned_by"] = pruned_by break + for i, b in enumerate(self._data.get("brainstorms", [])): + papers_generated = b.get("papers_generated", []) + if paper_id in papers_generated: + papers_generated.remove(paper_id) + self._data["brainstorms"][i]["papers_generated"] = papers_generated await self._save_metadata() # Update stats @@ -518,6 +544,14 @@ async def archive_paper(self, paper_id: str) -> None: 1 for p in self._data.get("papers", []) if p.get("status") == "archived" ) + self._stats["total_papers_pruned"] = sum( + 1 for p in self._data.get("papers", []) + if p.get("status") == "pruned" + ) + self._stats["total_papers_completed"] = sum( + 1 for p in self._data.get("papers", []) + if p.get("status") == "complete" + ) await self._save_stats() def _paper_to_dict(self, metadata: PaperMetadata) -> Dict[str, Any]: @@ -530,7 +564,10 @@ def _paper_to_dict(self, metadata: PaperMetadata) -> Dict[str, Any]: "source_brainstorm_ids": metadata.source_brainstorm_ids, "referenced_papers": metadata.referenced_papers, "status": metadata.status, - "created_at": metadata.created_at.isoformat() if metadata.created_at else None + "created_at": metadata.created_at.isoformat() if metadata.created_at else None, + "pruned_at": metadata.pruned_at.isoformat() if metadata.pruned_at else None, + "pruned_reason": 
metadata.pruned_reason, + "pruned_by": metadata.pruned_by, } # ======================================================================== @@ -700,6 +737,10 @@ async def delete_paper(self, paper_id: str) -> bool: 1 for p in self._data.get("papers", []) if p.get("status") == "archived" ) + self._stats["total_papers_pruned"] = sum( + 1 for p in self._data.get("papers", []) + if p.get("status") == "pruned" + ) await self._save_stats() logger.info(f"Removed paper {paper_id} from central metadata") diff --git a/backend/autonomous/memory/session_manager.py b/backend/autonomous/memory/session_manager.py index 5b334b1..f166c98 100644 --- a/backend/autonomous/memory/session_manager.py +++ b/backend/autonomous/memory/session_manager.py @@ -21,6 +21,42 @@ logger = logging.getLogger(__name__) +def _session_paper_has_section(content: str, section_name: str) -> bool: + base_patterns = [ + rf"##\s*{section_name}", + rf"#\s*{section_name}", + rf"\*\*{section_name}\*\*", + rf"^{section_name}\s*$", + rf"^\\(?:section|chapter)\*?\{{{section_name}\}}\s*$", + ] + if section_name == "Introduction": + base_patterns.append(rf"^I\.\s*{section_name}") + base_patterns.append(rf"^\\(?:section|chapter)\*?\{{I\.?\s*{section_name}\}}\s*$") + elif section_name == "Conclusion": + base_patterns.append(rf"^[IVXLC]+\.\s*{section_name}") + + return any(re.search(pattern, content, re.IGNORECASE | re.MULTILINE) for pattern in base_patterns) + + +def _detect_session_paper_phase(paper_content: str) -> str: + has_abstract = _session_paper_has_section(paper_content, "Abstract") + has_intro = _session_paper_has_section(paper_content, "Introduction") + has_conclusion = _session_paper_has_section(paper_content, "Conclusion") + + has_abstract_placeholder = "[HARD CODED PLACEHOLDER FOR THE ABSTRACT SECTION" in paper_content + has_intro_placeholder = "[HARD CODED PLACEHOLDER FOR INTRODUCTION SECTION" in paper_content + has_conclusion_placeholder = "[HARD CODED PLACEHOLDER FOR THE CONCLUSION SECTION" in paper_content 
+ has_body_content = bool(re.search(r"^[IVX]+\.\s+\w", paper_content or "", re.MULTILINE)) + + if not has_conclusion or has_conclusion_placeholder: + return "conclusion" if has_body_content else "body" + if not has_intro or has_intro_placeholder: + return "introduction" + if not has_abstract or has_abstract_placeholder: + return "abstract" + return "abstract" + + class SessionManager: """ Manages prompt-based session folder organization. @@ -308,21 +344,31 @@ async def find_interrupted_session(self, base_dir: Optional[str] = None) -> Opti continue workflow_state_path = session_dir / "workflow_state.json" - if not workflow_state_path.exists(): - continue - + workflow_state = None try: - async with aiofiles.open(workflow_state_path, 'r', encoding='utf-8') as f: - raw = await f.read() - if not raw.strip().strip('\x00'): - continue # Empty or null-padded file — skip silently - workflow_state = json.loads(raw) - - # Check if this session is resumable - # Resumable means: has a tier AND (has a topic OR has completed papers) - has_tier = workflow_state.get("current_tier") is not None - has_topic = workflow_state.get("current_topic_id") is not None - has_papers = workflow_state.get("papers_completed_count", 0) > 0 + if workflow_state_path.exists(): + async with aiofiles.open(workflow_state_path, 'r', encoding='utf-8') as f: + raw = await f.read() + if raw.strip().strip('\x00'): + workflow_state = json.loads(raw) + # Check if this session is resumable. + # Resumable means: has a tier AND (has a topic OR has completed papers). + has_tier = bool(workflow_state and workflow_state.get("current_tier") is not None) + has_topic = bool(workflow_state and workflow_state.get("current_topic_id") is not None) + has_papers = bool(workflow_state and workflow_state.get("papers_completed_count", 0) > 0) + + # A stale idle workflow_state.json can coexist with valid session + # stats/brainstorm files. Try the durable-file recovery before + # deciding the session is not resumable. 
+ if not (has_tier and (has_topic or has_papers)): + recovered_state = await self._recover_workflow_state_from_session_files(session_dir) + if recovered_state is not None: + workflow_state = recovered_state + has_tier = workflow_state.get("current_tier") is not None + has_topic = workflow_state.get("current_topic_id") is not None + has_papers = workflow_state.get("papers_completed_count", 0) > 0 + if workflow_state is None: + continue if has_tier and (has_topic or has_papers): # Load session metadata for user prompt @@ -357,6 +403,147 @@ async def find_interrupted_session(self, base_dir: Optional[str] = None) -> Opti return most_recent + async def _recover_workflow_state_from_session_files(self, session_dir: Path) -> Optional[Dict[str, Any]]: + """Build a conservative resume state from session stats/brainstorm files. + + This protects sessions where the workflow checkpoint was stale or absent + but durable brainstorm metadata still shows work in progress. It only + resumes a current stats pointer, an in-progress brainstorm, or a completed + brainstorm that has not produced a paper yet. 
+ """ + try: + stats = {} + stats_path = session_dir / "session_stats.json" + if stats_path.exists(): + async with aiofiles.open(stats_path, 'r', encoding='utf-8') as f: + stats = json.loads(await f.read()) + + topic_id = stats.get("current_brainstorm_id") + paper_id = stats.get("current_paper_id") + topic_metadata = None + paper_metadata = None + paper_title = None + reference_paper_ids = [] + + brainstorms_dir = session_dir / "brainstorms" + papers_dir = session_dir / "papers" + if paper_id and papers_dir.exists(): + paper_metadata_path = papers_dir / f"paper_{paper_id}_metadata.json" + if paper_metadata_path.exists(): + async with aiofiles.open(paper_metadata_path, 'r', encoding='utf-8') as f: + paper_metadata = json.loads(await f.read()) + if paper_metadata.get("status") == "in_progress": + paper_title = paper_metadata.get("title") + reference_paper_ids = paper_metadata.get("referenced_papers") or [] + if not topic_id: + source_ids = paper_metadata.get("source_brainstorm_ids") or [] + topic_id = source_ids[0] if source_ids else None + else: + # `current_paper_id` is sticky in stats; a completed paper + # must not make a stale/idle session look like active paper writing. 
+ paper_id = None + else: + paper_id = None + + if not paper_id and papers_dir.exists(): + paper_candidates = [] + for paper_metadata_path in papers_dir.glob("paper_*_metadata.json"): + try: + async with aiofiles.open(paper_metadata_path, 'r', encoding='utf-8') as f: + data = json.loads(await f.read()) + if data.get("status") == "in_progress": + paper_candidates.append(data) + except Exception: + continue + if paper_candidates: + paper_candidates.sort(key=lambda item: item.get("created_at", ""), reverse=True) + paper_metadata = paper_candidates[0] + paper_id = paper_metadata.get("paper_id") + paper_title = paper_metadata.get("title") + reference_paper_ids = paper_metadata.get("referenced_papers") or [] + if not topic_id: + source_ids = paper_metadata.get("source_brainstorm_ids") or [] + topic_id = source_ids[0] if source_ids else None + + if topic_id and brainstorms_dir.exists(): + metadata_path = brainstorms_dir / f"brainstorm_{topic_id}_metadata.json" + if metadata_path.exists(): + async with aiofiles.open(metadata_path, 'r', encoding='utf-8') as f: + topic_metadata = json.loads(await f.read()) + + if topic_metadata is None and brainstorms_dir.exists(): + candidates = [] + for metadata_path in brainstorms_dir.glob("brainstorm_*_metadata.json"): + try: + async with aiofiles.open(metadata_path, 'r', encoding='utf-8') as f: + data = json.loads(await f.read()) + status = data.get("status") + papers_generated = data.get("papers_generated") or [] + if status == "in_progress" or (status == "complete" and not papers_generated): + candidates.append(data) + except Exception: + continue + if candidates: + candidates.sort(key=lambda item: item.get("last_activity", ""), reverse=True) + topic_metadata = candidates[0] + topic_id = topic_metadata.get("topic_id") + + if not topic_id and not paper_id: + return None + + current_tier = "tier2_paper_writing" if paper_id else "tier1_aggregation" + paper_phase = None + if paper_id: + paper_path = papers_dir / f"paper_{paper_id}.txt" + 
if paper_path.exists(): + async with aiofiles.open(paper_path, 'r', encoding='utf-8') as f: + paper_phase = _detect_session_paper_phase(await f.read()) + else: + paper_phase = "body" + acceptance_count = int((topic_metadata or {}).get("submission_count") or 0) + if ( + topic_metadata + and topic_metadata.get("status") == "complete" + and not paper_id + and not (topic_metadata.get("papers_generated") or []) + ): + current_tier = "tier2_paper_writing" + paper_phase = "brainstorm_proof_verification" + elif topic_metadata and topic_metadata.get("status") == "complete" and not paper_id: + return None + + return { + "is_running": False, + "current_tier": current_tier, + "current_topic_id": topic_id, + "current_paper_id": paper_id, + "current_paper_title": paper_title, + "paper_phase": paper_phase, + "reference_paper_ids": reference_paper_ids, + "acceptance_count": acceptance_count, + "rejection_count": 0, + "consecutive_rejections": 0, + "exhaustion_signals": 0, + "papers_completed_count": stats.get("total_papers_completed", 0), + "last_redundancy_check_at": 0, + "last_completion_review_at": 0, + "last_tier3_check_at": 0, + "brainstorm_paper_count": 0, + "current_brainstorm_paper_ids": [], + "proof_framing_active": False, + "proof_framing_context": "", + "proof_framing_reasoning": "", + "tier3_active": False, + "tier3_enabled": False, + "tier3_format": None, + "tier3_phase": None, + "model_config": {}, + "last_updated": stats.get("last_updated") or (topic_metadata or {}).get("last_activity", ""), + } + except Exception as exc: + logger.debug(f"Failed to recover workflow state from session files {session_dir.name}: {exc}") + return None + async def list_all_sessions(self, base_dir: Optional[str] = None) -> List[Dict[str, Any]]: """ List all research sessions. 
@@ -377,7 +564,7 @@ async def list_all_sessions(self, base_dir: Optional[str] = None) -> List[Dict[s try: async with aiofiles.open(metadata_path, 'r', encoding='utf-8') as f: metadata = json.loads(await f.read()) - metadata["path"] = str(session_dir) + metadata["path"] = session_dir.name # Count items in subdirectories brainstorms_dir = session_dir / "brainstorms" diff --git a/backend/autonomous/prompts/completion_prompts.py b/backend/autonomous/prompts/completion_prompts.py index 8ac3c59..c0f7bca 100644 --- a/backend/autonomous/prompts/completion_prompts.py +++ b/backend/autonomous/prompts/completion_prompts.py @@ -37,18 +37,23 @@ def get_completion_review_system_prompt() -> str: CRITICAL UNDERSTANDING: This is an assessment of topic exploration completeness using all resources at your disposal. Consider whether you can contribute more valuable mathematical insights using your knowledge, web search capabilities (if available), and analysis of what's been covered. +DIRECT-SOLUTION PREFERENCE: +- Prefer moving to paper writing once the brainstorm can support the strongest rigorous direct answer currently justified +- Continue brainstorming only when you can identify concrete additional work that is likely to produce a more direct solution, stronger partial solution, impossibility result, or sharper constraint +- Do not extend brainstorming merely for breadth if the best direct answer is already ready to synthesize + DECISION CRITERIA: Choose CONTINUE_BRAINSTORM if: -- You can identify specific mathematical areas not yet covered in the submissions -- You have additional theorems, proofs, or techniques relevant to the topic (from your knowledge or discoverable via web search) -- The brainstorm would benefit from deeper exploration in specific directions -- You can still contribute valuable insights using available resources (base knowledge, web search if available) +- You can identify specific mathematical areas not yet covered in the submissions that are likely to 
improve the direct answer +- You have additional theorems, proofs, techniques, constructions, or impossibility arguments relevant to the topic (from your knowledge or discoverable via web search) +- The brainstorm would benefit from deeper exploration in specific directions that materially strengthen direct resolution +- You can still contribute valuable direct-progress insights using available resources (base knowledge, web search if available) Choose WRITE_PAPER if: - All major mathematical avenues for this topic have been explored - Additional submissions would likely be redundant with existing content -- The brainstorm database is comprehensive enough for a quality paper +- The brainstorm database is comprehensive enough for a quality paper that gives the strongest currently justified direct answer - Available resources (base knowledge, web search if available) have been sufficiently utilized for this topic - You genuinely cannot think of significant new contributions using available resources @@ -57,6 +62,7 @@ def get_completion_review_system_prompt() -> str: - Don't artificially extend brainstorming if exhausted - Don't prematurely end if valuable knowledge remains - Consider the mathematical depth achieved, not just submission count +- Prefer best-answer readiness over breadth for breadth's sake CRITICAL JSON ESCAPE RULES: 1. 
Backslashes: ALWAYS use double backslash (\\\\) for any backslash in your text @@ -131,14 +137,14 @@ def get_completion_self_validation_system_prompt() -> str: Validate as TRUE (confirm your assessment) if: - Your assessment accurately reflects the current state of the brainstorm using all available resources (base knowledge, web search if available) -- If you said "continue_brainstorm": You genuinely have more valuable insights to contribute using available resources -- If you said "write_paper": You genuinely cannot think of significant new contributions +- If you said "continue_brainstorm": You genuinely have more valuable direct-progress insights to contribute using available resources +- If you said "write_paper": You genuinely cannot think of significant new contributions that would materially strengthen the direct answer - The reasoning in your assessment is sound and honest Validate as FALSE if: - Upon reflection, the assessment was CLEARLY incorrect -- If "continue_brainstorm": The suggested additions are trivial, irrelevant, or already extensively covered -- If "write_paper": You have CONCRETE, SPECIFIC valuable additions you overlooked (not vague possibilities) +- If "continue_brainstorm": The suggested additions are trivial, irrelevant, already extensively covered, or too indirect to justify delay +- If "write_paper": You have CONCRETE, SPECIFIC valuable additions you overlooked that would materially improve direct resolution (not vague possibilities) - The reasoning contains obvious flawed logic BALANCED VALIDATION APPROACH: diff --git a/backend/autonomous/prompts/final_answer_prompts.py b/backend/autonomous/prompts/final_answer_prompts.py index 72bf2f4..82ea213 100644 --- a/backend/autonomous/prompts/final_answer_prompts.py +++ b/backend/autonomous/prompts/final_answer_prompts.py @@ -44,6 +44,10 @@ def get_certainty_assessment_system_prompt() -> str: YOUR TASK: Review all existing research papers and determine what can be answered WITH CERTAINTY - 
without speculation or theoretical hand-waving. +DIRECT-ANSWER-FIRST REQUIREMENT: +- Identify the strongest direct answer the papers justify, not just nearby facts +- Prefer a precise answer, partial answer, impossibility result, or sharp limitation statement over broad summary + ASSESSMENT CRITERIA: 1. TOTAL_ANSWER - The user's question can be FULLY answered with high confidence @@ -74,6 +78,7 @@ def get_certainty_assessment_system_prompt() -> str: - Identify what is KNOWN WITH CERTAINTY vs what is SPECULATIVE - Do not claim certainty where uncertainty exists - Summarize the key certainties that have been established +- State the best direct answer those certainties support CRITICAL JSON ESCAPE RULES: 1. Backslashes: ALWAYS use double backslash (\\\\) for any backslash in your text @@ -144,6 +149,7 @@ def get_certainty_validator_system_prompt() -> str: - The reasoning properly references the papers - No overclaiming certainty where uncertainty exists - No underclaiming (missing obvious certainties) +- The assessment captures the strongest direct answer the papers justify REJECT the assessment if: - Certainty level doesn't match the evidence @@ -222,6 +228,8 @@ def get_format_selection_system_prompt() -> str: - Whether a single coherent narrative is possible - Whether the papers naturally form a cohesive volume - The certainty level from Phase 1 +- Prefer short form whenever one paper can honestly provide the strongest direct answer +- Choose long form only when multiple chapters are genuinely necessary to deliver that answer well CRITICAL JSON ESCAPE RULES: 1. 
Backslashes: ALWAYS use double backslash (\\\\) for any backslash in your text @@ -287,12 +295,14 @@ def get_format_validator_system_prompt() -> str: - The reasoning is sound - Short form is chosen only when a single paper suffices - Long form is chosen when multiple perspectives are needed +- The choice preserves the clearest path to a direct answer REJECT the selection if: - Short form is chosen for a question requiring extensive treatment - Long form is chosen unnecessarily for a focused question - The reasoning doesn't support the choice - The selection ignores important factors +- The selection adds unnecessary structural breadth instead of optimizing for a direct answer CRITICAL JSON ESCAPE RULES: 1. Backslashes: ALWAYS use double backslash (\\\\) for any backslash in your text @@ -415,6 +425,10 @@ def get_volume_organization_system_prompt() -> str: 3. Plans an INTRODUCTION paper that frames the collection 4. Plans a CONCLUSION paper that synthesizes findings and answers the question +DIRECT-ANSWER-FIRST REQUIREMENT: +- Include only the chapters needed to deliver the strongest rigorous direct answer +- Do not add gap papers for breadth alone; add them only when they are necessary to close a real answer gap + VOLUME STRUCTURE REQUIREMENTS: BODY CHAPTERS (from existing papers or gaps): @@ -422,6 +436,7 @@ def get_volume_organization_system_prompt() -> str: - Order them logically (foundations → main results → applications) - Identify gaps: topics that need coverage but no paper exists - Gap papers will be written before introduction/conclusion +- Exclude chapters that are merely adjacent if they do not materially strengthen the answer INTRODUCTION PAPER: - Frames the user's question @@ -565,6 +580,7 @@ def get_volume_validator_system_prompt() -> str: - Introduction and conclusion are properly planned - The reasoning is sound - If outline_complete=true, the structure is ready for writing +- The structure stays focused on the strongest rigorous direct answer 
without unnecessary breadth REJECT the organization if: - Important existing papers are missing @@ -573,6 +589,7 @@ def get_volume_validator_system_prompt() -> str: - Introduction/conclusion are missing or poorly planned - The structure doesn't effectively answer the question - outline_complete=true but structure has issues +- The structure includes chapters that broaden scope without materially improving the answer Provide specific feedback for rejected organizations. @@ -615,6 +632,7 @@ def get_gap_paper_context_prompt() -> str: - Use ONLY existing Tier 2 papers as references (no brainstorm databases) - The paper must integrate with the volume's other chapters - Focus on the specific gap identified in the chapter description +- Write only the material needed to close that answer gap directly and rigorously REFERENCE PAPERS: The papers listed are from the existing Tier 2 library. Use them as context and references. @@ -641,6 +659,7 @@ def get_volume_intro_paper_context_prompt() -> str: - You have access to ALL chapter content to accurately describe them - The introduction should make the volume's value clear - Frame the answer that will be provided +- Keep the framing centered on the direct answer, not on exploratory wanderings REFERENCE: Use the chapter papers as context for accurate descriptions.""" @@ -665,6 +684,7 @@ def get_volume_conclusion_paper_context_prompt() -> str: - All body chapters exist, so you can reference their content - Be definitive about certainties, honest about uncertainties - This is the climactic answer to the user's question +- Make the direct answer explicit as early and clearly as the evidence allows REFERENCE: Use the body chapter papers to inform the synthesis.""" diff --git a/backend/autonomous/prompts/paper_continuation_prompts.py b/backend/autonomous/prompts/paper_continuation_prompts.py index 93420e5..c8062a9 100644 --- a/backend/autonomous/prompts/paper_continuation_prompts.py +++ 
b/backend/autonomous/prompts/paper_continuation_prompts.py @@ -35,16 +35,21 @@ def get_continuation_decision_system_prompt() -> str: YOUR TASK: Decide whether the brainstorm database contains enough distinct, unexplored material to warrant writing ANOTHER paper, or whether the user's research goal is better served by moving on to a new brainstorm topic. +DIRECT-SOLUTION PREFERENCE: +- Write another paper only if it would materially strengthen the best rigorous direct answer to the user's goal +- Move on when remaining material is mostly supportive, repetitive, or too indirect to justify another paper + DECISION OPTIONS: 1. WRITE_ANOTHER_PAPER - The brainstorm has significant material that the existing paper(s) did NOT cover, and another paper would meaningfully advance the user's research goal 2. MOVE_ON - The existing paper(s) adequately cover this brainstorm, or a new topic would better serve the user's goal WRITE ANOTHER PAPER if: - The brainstorm database contains substantial material not covered by existing paper(s) -- Another paper would address a meaningfully DIFFERENT angle, perspective, or subset of the brainstorm +- Another paper would address a meaningfully DIFFERENT angle, perspective, or subset of the brainstorm that improves direct resolution of the user's goal - The uncovered material is rich enough for a complete, distinct paper (not just leftover fragments) - Writing another paper from this brainstorm advances the user's goal MORE than starting a new topic - The existing paper(s) focused on specific aspects, leaving other important aspects unexplored +- Another paper would provide a stronger direct partial answer, tighter impossibility result, or sharper constraint MOVE ON if: - The existing paper(s) adequately cover the brainstorm's valuable content @@ -52,6 +57,7 @@ def get_continuation_decision_system_prompt() -> str: - A new brainstorm topic would better advance the user's research goal - Another paper would largely duplicate content already in 
the existing paper(s) - The brainstorm's unique contributions have been captured +- The remaining material is mostly indirect support rather than meaningful direct progress CRITICAL JSON ESCAPE RULES: 1. Backslashes: ALWAYS use double backslash (\\\\) for any backslash in your text @@ -114,10 +120,10 @@ def get_continuation_validator_system_prompt() -> str: --- YOUR TASK: -Validate whether the proposed continuation decision is the best use of research resources. +Validate whether the proposed continuation decision is the best use of research resources for improving the strongest rigorous direct answer. ACCEPT the decision if: -1. WRITE_ANOTHER_PAPER: The brainstorm genuinely has enough distinct unexplored material for another paper AND the reasoning correctly identifies what material remains +1. WRITE_ANOTHER_PAPER: The brainstorm genuinely has enough distinct unexplored material for another paper AND the reasoning correctly identifies what material remains AND why it materially strengthens direct resolution 2. MOVE_ON: The existing papers adequately cover the brainstorm OR a new topic would genuinely better serve the goal AND the reasoning is sound REJECT the decision if: @@ -126,6 +132,7 @@ def get_continuation_validator_system_prompt() -> str: 3. MOVE_ON: There is clearly substantial uncovered material that warrants another paper 4. MOVE_ON: The reasoning ignores valuable unexplored content in the brainstorm 5. The reasoning is flawed, vague, or contradicts the evidence +6. 
The decision prefers indirect leftover material over a clearly stronger direct-answer path REJECTION FEEDBACK FORMAT: If rejecting, provide CONCRETE, ACTIONABLE guidance: diff --git a/backend/autonomous/prompts/paper_redundancy_prompts.py b/backend/autonomous/prompts/paper_redundancy_prompts.py index eb57753..c0c710d 100644 --- a/backend/autonomous/prompts/paper_redundancy_prompts.py +++ b/backend/autonomous/prompts/paper_redundancy_prompts.py @@ -46,13 +46,15 @@ def get_paper_redundancy_system_prompt() -> str: 3. Contains information SUPERSEDED by better, more complete papers 4. Was MARGINALLY useful initially but provides no unique value given current library 5. Covers the same mathematical territory as a newer, superior paper +6. Is more indirect or auxiliary while another paper provides a stronger rigorous direct answer on the same territory REASONS TO KEEP - A paper should be kept if it: -1. Provides ANY unique mathematical content not covered elsewhere -2. Offers a different perspective or approach even if related to other papers -3. Contains specific proofs, theorems, or techniques not present elsewhere -4. Contributes to research diversity in any meaningful way -5. Covers distinct mathematical subtopics within a broader area +1. Provides a stronger direct answer, sharper impossibility result, or tighter constraint than overlapping papers +2. Provides ANY unique mathematical content not covered elsewhere +3. Offers a different perspective or approach even if related to other papers +4. Contains specific proofs, theorems, or techniques not present elsewhere +5. Contributes to research diversity in any meaningful way +6. Covers distinct mathematical subtopics within a broader area CONSERVATIVE APPROACH: - When in doubt, DO NOT recommend removal @@ -63,6 +65,9 @@ def get_paper_redundancy_system_prompt() -> str: CRITICAL SELECTION RULE: When multiple papers overlap, select the WEAKEST one for removal - the one that provides the LEAST unique value. 
NEVER remove a more comprehensive paper in favor of keeping a less comprehensive one. +DIRECT-SOLUTION PRIORITY: +If overlapping papers differ in how directly they answer the user's research goal, preserve the paper with the strongest rigorous direct answer and remove the more auxiliary one first when all else is equal. + CRITICAL JSON ESCAPE RULES: 1. Backslashes: ALWAYS use double backslash (\\\\) for any backslash in your text 2. Quotes: Escape double quotes inside strings as \\" diff --git a/backend/autonomous/prompts/paper_reference_prompts.py b/backend/autonomous/prompts/paper_reference_prompts.py index 19bbb11..9d0450b 100644 --- a/backend/autonomous/prompts/paper_reference_prompts.py +++ b/backend/autonomous/prompts/paper_reference_prompts.py @@ -52,6 +52,10 @@ def get_pre_brainstorm_expansion_system_prompt(max_papers: int) -> str: YOUR TASK: Determine which papers (if any) would be VERY USEFUL to inform and enhance your brainstorm exploration. +DIRECT-SOLUTION PREFERENCE: +- Prefer papers that most directly help produce a rigorous direct answer, direct partial answer, impossibility result, explicit construction, exact reduction, or sharp constraint +- Do not select papers merely because they are broadly related if they do not materially strengthen the most direct route to the goal + WHY THIS MATTERS - COMPOUNDING KNOWLEDGE: This is the crucial mechanism that allows the system to compound knowledge across research cycles. 
By selecting reference papers BEFORE brainstorming, you can: @@ -61,9 +65,9 @@ def get_pre_brainstorm_expansion_system_prompt(max_papers: int) -> str: - Accelerate convergence on valuable insights by standing on prior work THRESHOLD: "VERY USEFUL FOR BRAINSTORMING" -- Papers that provide mathematical foundations you'll build upon -- Papers that cover related concepts you can extend or connect to -- Papers that offer techniques or methods relevant to your topic +- Papers that provide mathematical foundations you'll directly build upon +- Papers that cover related concepts you can extend or connect to in service of a more direct answer +- Papers that offer techniques or methods that materially strengthen the most direct route to your topic - Don't request papers that are merely tangentially related OPTIONS: @@ -114,15 +118,19 @@ def get_additional_reference_expansion_system_prompt(max_total_papers: int) -> s YOUR TASK: Determine if any ADDITIONAL papers would be valuable for paper compilation, based on what you learned during brainstorming. 
+DIRECT-SOLUTION PREFERENCE: +- Add papers only when they materially strengthen the best rigorous direct answer you can now write +- Do not add broadly relevant papers that do not improve direct resolution of the user's goal + CONTEXT: - You already selected reference papers before brainstorming (shown as "ALREADY SELECTED") - During brainstorming, you may have discovered new connections or topics - This is your chance to add more relevant papers (if any) THRESHOLD: "VALUABLE BASED ON BRAINSTORM INSIGHTS" -- Papers that address topics that emerged during brainstorming -- Papers that provide additional techniques you now realize are relevant -- Papers that cover connections you discovered during exploration +- Papers that address topics that emerged during brainstorming and materially strengthen direct resolution +- Papers that provide additional techniques you now realize are relevant to the strongest direct answer +- Papers that cover connections you discovered during exploration only when those connections improve direct progress - Don't add papers just to fill slots OPTIONS: @@ -170,8 +178,12 @@ def get_reference_expansion_system_prompt(max_papers: int = 6) -> str: YOUR TASK: Determine which papers (if any) would be VERY USEFUL for writing your upcoming paper, and request to see their full content before making final selection. 
+DIRECT-SOLUTION PREFERENCE: +- Prefer papers that will help you write the strongest rigorous direct answer to the user's goal +- Do not expand papers that are merely adjacent background unless they are needed for direct resolution + THRESHOLD: "VERY USEFUL" -- A paper is "very useful" if it provides substantial mathematical context, techniques, or insights directly relevant to your brainstorm topic +- A paper is "very useful" if it provides substantial mathematical context, techniques, or insights that materially strengthen the most direct answer to your brainstorm topic - Don't request papers that are merely tangentially related - Quality over quantity - only request papers you genuinely need to evaluate @@ -251,11 +263,15 @@ def get_reference_selection_system_prompt(max_papers: int) -> str: YOUR TASK: Make your final selection of reference papers (maximum {max_papers}) that will be included in your context during paper compilation. +DIRECT-SOLUTION PREFERENCE: +- Select papers that most directly strengthen the answer you intend to write +- Prefer papers that support the core proof, construction, impossibility argument, or key reduction over broader background + SELECTION CRITERIA: -- Papers that provide essential mathematical background -- Papers that offer techniques or methods relevant to your topic -- Papers that establish theoretical foundations you'll build upon -- Papers that present related results you'll reference or extend +- Papers that provide essential mathematical background for the direct answer +- Papers that offer techniques or methods central to your topic's strongest resolution path +- Papers that establish theoretical foundations you'll directly build upon +- Papers that present related results you'll reference or extend in order to answer the question more directly CONSTRAINT: - Maximum {max_papers} papers can be selected (hard limit for context budget) diff --git a/backend/autonomous/prompts/paper_title_exploration_prompts.py 
b/backend/autonomous/prompts/paper_title_exploration_prompts.py index 9193ec0..cd40cd3 100644 --- a/backend/autonomous/prompts/paper_title_exploration_prompts.py +++ b/backend/autonomous/prompts/paper_title_exploration_prompts.py @@ -38,6 +38,10 @@ def build_title_exploration_user_prompt( parts.append("Instead, your task is to propose ONE CANDIDATE PAPER TITLE per submission.") parts.append("The system will collect 5 validated candidate titles before a later final") parts.append("selection chooses the actual title.\n") + parts.append("Prefer titles that make the paper's direct answer-bearing contribution clear") + parts.append("when the source material supports one. Do not use generic exploratory titles") + parts.append("when a theorem, construction, impossibility result, or sharp constraint can be") + parts.append("accurately foregrounded.\n") parts.append("Each submission should contain:") parts.append("- One candidate paper title") parts.append("- Brief reasoning for why the title is strong, accurate, and distinct\n") @@ -49,6 +53,7 @@ def build_title_exploration_user_prompt( parts.append("WHAT MAKES A GOOD CANDIDATE TITLE:") parts.append("- Accurately captures the paper's likely mathematical content") parts.append("- Specific enough to communicate the core focus") + parts.append("- Foregrounds the direct answer, core result, or limitation when justified") parts.append("- Professional and suitable for a mathematical research paper") parts.append("- Distinct from already-accepted candidate titles") parts.append("- Distinct from related completed papers listed below") diff --git a/backend/autonomous/prompts/paper_title_prompts.py b/backend/autonomous/prompts/paper_title_prompts.py index fa25fc3..f77b3b5 100644 --- a/backend/autonomous/prompts/paper_title_prompts.py +++ b/backend/autonomous/prompts/paper_title_prompts.py @@ -35,6 +35,10 @@ def get_paper_title_system_prompt() -> str: YOUR TASK: Choose a title that accurately captures the mathematical content and 
scope of the planned paper. +DIRECT-SOLUTION PREFERENCE: +- When the paper reaches a direct conclusion, theorem, impossibility result, or explicit construction, let the title foreground that result rather than sounding like generic exploration +- Prefer titles that make the paper's answer-bearing content clear, while staying accurate to the actual scope + IMPORTANT CLARIFICATION: - The brainstorm submissions are the SOURCE MATERIAL for your paper - Your title SHOULD reflect what's in the brainstorm - that's expected and correct! @@ -47,6 +51,7 @@ def get_paper_title_system_prompt() -> str: - Is professional and suitable for a mathematical research paper - Differentiates from EXISTING COMPLETED PAPERS from the same brainstorm (if any exist - check the list below) - Avoids being overly broad or generic +- Makes the paper's strongest direct contribution clear when the content justifies it TITLE STYLE: - Use standard mathematical paper title conventions @@ -139,6 +144,7 @@ def get_paper_title_validator_system_prompt() -> str: - It follows mathematical paper title conventions - The reasoning is sound - If "EXISTING PAPERS FROM THIS BRAINSTORM: None" - there's nothing to differentiate from, so accept if other criteria are met +- It makes any justified direct conclusion or core result clear rather than sounding needlessly exploratory REJECT the title if: - It is too similar to an EXISTING COMPLETED PAPER from the same brainstorm (NOT brainstorm submissions - those are the source material!) @@ -146,6 +152,7 @@ def get_paper_title_validator_system_prompt() -> str: - It is too vague or generic - It doesn't follow professional conventions - The reasoning is flawed +- It obscures a clear direct result behind generic exploratory wording DO NOT REJECT simply because the title reflects brainstorm submission content - that is the INTENDED behavior. 
diff --git a/backend/autonomous/prompts/proof_prompts.py b/backend/autonomous/prompts/proof_prompts.py index de9e4a7..f39d782 100644 --- a/backend/autonomous/prompts/proof_prompts.py +++ b/backend/autonomous/prompts/proof_prompts.py @@ -9,11 +9,14 @@ PROOF_FRAMING_CONTEXT = """[PROOF FRAMING CONTEXT -- This research prompt targets formal mathematical proof. -Submissions should aggressively pursue NOVEL, NON-TRIVIAL theorems that push the -boundaries of what is known. The Lean 4 proof assistant is available for formal -verification. Prioritize ambitious conjectures, original results, and theorems that -would represent genuine mathematical contributions over safe restatements of textbook -facts. Standard identities and well-known Mathlib lemmas are NOT valuable targets.]""" +All proof work must serve the user's research prompt. Submissions should pursue +theorems, lemmas, and formalizations that directly help answer, support, or advance +that prompt. Novel/non-trivial results are valuable only when they are relevant to +the user's goal. The Lean 4 proof assistant is available for formal verification. +Prioritize ambitious conjectures, original results, and theorems that would represent +genuine mathematical contributions toward the prompt over safe restatements of +textbook facts. Standard identities, irrelevant curiosities, and well-known Mathlib +lemmas are NOT valuable targets.]""" def _json_only_footer(example: str) -> str: @@ -159,7 +162,7 @@ def format_failure_hints_for_injection(failure_hints: Iterable[Any]) -> str: lines = [ "=== OPEN LEMMA TARGETS LEAN 4 COULD NOT YET CLOSE ===", - "[These are recent proof attempts that failed. Prefer brainstorms that generate missing lemmas, stronger assumptions, or cleaner formal theorem statements.]", + "[These are recent proof attempts that failed. 
Prefer brainstorms that generate missing lemmas, stronger assumptions, or cleaner formal theorem statements only when they directly support the user's research prompt.]", "", ] for index, hint in enumerate(hints, start=1): @@ -188,7 +191,8 @@ def format_failure_hints_for_injection(failure_hints: Iterable[Any]) -> str: "Note: the previous formalization attempt was rejected because " "it used `sorry`/`admit` or axiomatized the theorem's concepts " "to make the goal trivial. Prefer brainstorms that state a " - "narrower, concretely provable lemma instead of the full claim." + "narrower, concretely provable lemma that still supports the " + "user's research prompt instead of the full claim." ) lines.extend( [ @@ -206,20 +210,20 @@ def format_failure_hints_for_injection(failure_hints: Iterable[Any]) -> str: def build_proof_framing_gate_prompt(user_prompt: str) -> str: """Ask whether the research goal should be framed toward formal proof.""" - return f"""You are deciding whether a research program should be explicitly framed toward formal mathematical proof and novel theorem discovery. + return f"""You are deciding whether a research program should be explicitly framed toward formal mathematical proof and novel theorem discovery that helps answer the user's prompt. USER RESEARCH PROMPT: {user_prompt} -Return TRUE if the prompt would benefit from working toward formally provable theorems in Lean 4, especially novel or non-trivial ones. +Return TRUE if the prompt would benefit from working toward Lean 4-formalized theorems that directly help answer, support, or advance the user's research goal. Return FALSE only if the prompt is purely empirical, engineering-focused, descriptive, or has no meaningful mathematical content. Consider: - Does the research involve mathematical structures, proofs, bounds, or formal reasoning? -- Could novel theorems or formalizations emerge from this research direction? -- Would formal verification add rigor or uncover new results? 
+- Could prompt-relevant theorems, lemmas, or formalizations emerge from this research direction? +- Would formal verification add rigor or uncover new results that matter for the user's goal? -Err on the side of TRUE -- if there is any mathematical substance worth formalizing, enable the proof pipeline. +Err on the side of TRUE when there is mathematical substance worth formalizing for the prompt. Do not enable proof framing solely for off-topic mathematical curiosities. {_json_only_footer('{"is_proof_amenable": true, "reasoning": "brief explanation"}')} """ @@ -231,7 +235,7 @@ def build_proof_identification_prompt( source_id: str, source_content: str, ) -> str: - """Identify novel, non-trivial theorem candidates from a brainstorm or paper.""" + """Identify prompt-relevant theorem candidates from a brainstorm or paper.""" example_json = """{ "has_provable_theorems": true, "theorems": [ @@ -239,22 +243,24 @@ def build_proof_identification_prompt( "theorem_id": "thm_1", "statement": "natural-language theorem statement", "formal_sketch": "optional note about assumptions, notation, or likely Lean formalization strategy", - "novelty_rationale": "why this theorem is non-trivial and worth formalizing" + "novelty_rationale": "why this theorem helps the user prompt and is worth formalizing" } ] }""" - return f"""You are a theorem-discovery agent for MOTO. Your mission is to find NOVEL, NON-TRIVIAL mathematical claims in the source below that deserve formal verification in Lean 4. + return f"""You are a theorem-discovery agent for MOTO. Your mission is to find mathematical claims in the source below that directly help answer, support, or advance the USER RESEARCH PROMPT and deserve formal verification in Lean 4. -MOTO's goal is to push the frontier of mathematical knowledge. You are the gatekeeper that decides which theorems are worth the cost of formal verification. Be ambitious -- seek out the most original, surprising, or substantive results the source offers. 
+MOTO's goal is to push the frontier of mathematical knowledge in service of the user's stated problem. You are the gatekeeper that decides which theorems are worth the cost of formal verification. Be ambitious, but do not chase unrelated mathematical curiosities: a proof candidate must be useful for the user's prompt, not merely non-trivial in isolation. WHAT TO EXTRACT (prioritize these): -- Novel theorems, lemmas, or propositions that represent genuine mathematical insight -- Bold conjectures that can be sharpened into provable statements -- Non-obvious connections, bounds, inequalities, or structural results -- Original formalizations of results not yet in Mathlib -- Ambitious claims even if they need narrowing -- the formalization agent can refine them +- Theorems, lemmas, or propositions that directly help answer or advance the USER RESEARCH PROMPT +- Supporting lemmas needed to prove prompt-central claims +- Novel mathematical insights only when they are relevant to the user's stated goal +- Non-obvious connections, bounds, inequalities, or structural results that strengthen the prompt's argument +- Original formalizations of prompt-relevant results not yet in Mathlib +- Ambitious prompt-relevant claims even if they need narrowing -- the formalization agent can refine them WHAT TO REJECT (never extract these): +- Mathematically interesting claims that do not materially help the USER RESEARCH PROMPT - Trivial identities (e.g. n + 0 = n, a * 1 = a, commutativity of addition) - Direct restatements of well-known Mathlib lemmas or standard textbook results - Results closable by a single tactic like `simp`, `omega`, `norm_num`, `decide`, or `rfl` @@ -262,11 +268,12 @@ def build_proof_identification_prompt( - Routine algebraic manipulations with no conceptual content Rules: -- Return TRUE when at least one non-trivial, novel-potential theorem is found. -- Return FALSE only if the source genuinely contains nothing beyond trivial or well-known results. 
-- Rank candidates by novelty potential. Return at most 5 of the most promising theorems. -- For each candidate, include a brief novelty_rationale explaining why it is worth formalizing. -- Welcome bold or speculative claims -- if the source proposes something ambitious that might be provable with the right formalization, extract it. The downstream formalization agent will handle narrowing if needed. +- Return TRUE when at least one prompt-relevant, non-trivial theorem is found. +- Return FALSE if the source contains no theorem that would materially help answer, support, or advance the USER RESEARCH PROMPT. +- Order candidates by direct usefulness to the USER RESEARCH PROMPT first, then by novelty/formalization value. This ordering is not a cap. +- Return every prompt-relevant theorem that is non-trivial and worth attempting. +- For each candidate, include a brief novelty_rationale explaining both why it helps the USER RESEARCH PROMPT and why it is worth formalizing. +- Welcome bold or speculative claims only when they are prompt-relevant -- if the source proposes something ambitious that might be provable with the right formalization, extract it. The downstream formalization agent will handle narrowing if needed. - Use theorem IDs that are stable strings such as "thm_1", "thm_2", etc. USER RESEARCH PROMPT: @@ -305,6 +312,7 @@ def build_lemma_search_prompt( - Return 5-10 candidate lemma/theorem names when possible. - Prefer concrete declaration names over descriptions. - Use familiar Mathlib naming when possible (for example `Nat.add_comm`, `mul_assoc`, `Finset.card_union_add_card_inter`). +- Keep suggestions tied to the target theorem and the USER RESEARCH PROMPT; do not drift toward merely adjacent or interesting Mathlib facts. - If the theorem is too vague or no good candidates are evident, return an empty list. USER RESEARCH PROMPT: @@ -347,6 +355,7 @@ def build_smt_translation_prompt( - Prefer quantifier-free arithmetic fragments when possible. 
- If the theorem is underspecified, only encode the part that is clearly justified by the theorem statement and notes. - Do not invent new assumptions that are not strongly implied by the theorem. +- Do not translate a different or weaker theorem merely because it is easier; the SMT check must still support the USER RESEARCH PROMPT through the selected target theorem. - Return an empty `smtlib` string if you cannot produce a faithful SMT translation. - Use only SMT-LIB text in the `smtlib` field. @@ -397,6 +406,9 @@ def build_proof_formalization_prompt( - Include needed imports. - State assumptions explicitly. - Prefer correct, minimal, compilable code over stylistic elegance. +- Keep the USER RESEARCH PROMPT as the relevance boundary. If you narrow an + underspecified theorem, the narrowed lemma must still help answer, support, + or advance the user's prompt. - PRESERVE the theorem's non-trivial content. Do not simplify or weaken the statement into a trivial identity just to make it compile. The goal is to formalize the ACTUAL claim, not a watered-down version of it. @@ -474,6 +486,9 @@ def build_proof_tactic_script_prompt( - Return a short, ordered list of tactics that can be appended under a `by` block. - Each tactic entry must include the Lean tactic string and one short reasoning note. - Prefer small, composable tactics over a single opaque script. +- Keep the USER RESEARCH PROMPT as the relevance boundary. If you narrow an + underspecified theorem, the narrowed lemma must still help answer, support, + or advance the user's prompt. - PRESERVE the theorem's non-trivial content. Do not simplify or weaken the statement into a trivial identity just to make it compile. - NEVER include `sorry` or `admit` in the tactic list. 
A script that uses @@ -522,7 +537,7 @@ def build_proof_novelty_prompt( lean_code: str, existing_novel_proofs: str, ) -> str: - """Ask the validator to classify a Lean-verified theorem into one of four novelty tiers.""" + """Ask the validator to classify a Lean-verified theorem into one of five novelty tiers.""" existing_proofs_block = existing_novel_proofs or "[No previously stored novel proofs.]" return f"""This proof has been FORMALLY VERIFIED by Lean 4. It is mathematically valid. @@ -555,10 +570,17 @@ def build_proof_novelty_prompt( - It constitutes a novel alternative proof of an existing result whose existence changes mathematical understanding (e.g., a constructive proof where only non-constructive proofs were known). - Assign this tier when the proof would be a publishable or citable contribution in its own right. +"major_mathematical_discovery" +- The result appears to be an exceptional mathematical breakthrough, not merely a publishable or citable new result. +- It may be competitive for a major prize or medal in a related field if confirmed, contextualized, and accepted by domain experts. +- It resolves an important open problem, creates a powerful new theory or framework, or proves a result with unusually broad consequences. +- Assign this tier only when the proof's significance appears field-level or prize-level, above an ordinary mathematical discovery. + Rules: - Do NOT re-check validity. Lean 4 already verified it. - Choose the single best-fitting tier. When a proof could fit multiple tiers, choose the highest applicable one. -- Consider the research prompt context. A result textbook-standard in one field may qualify as "novel_formulation" if it is the first mechanized Lean 4 proof of that result for this research program. +- Consider the research prompt context. 
A result textbook-standard in one field may qualify as "novel_formulation" if it is the first mechanized Lean 4 proof of that result for this research program and it helps the USER RESEARCH PROMPT. +- Do not assign a high novelty tier to a theorem that is mathematically interesting but irrelevant to the USER RESEARCH PROMPT. - Err toward recognizing higher tiers for results that required multi-step reasoning, non-trivial formalization work, or original proof strategy. USER RESEARCH PROMPT: @@ -575,3 +597,48 @@ def build_proof_novelty_prompt( {_json_only_footer('{"novelty_tier": "mathematical_discovery", "reasoning": "brief explanation"}')} """ + + +def build_proof_statement_alignment_prompt( + user_prompt: str, + theorem_statement: str, + formal_sketch: str, + lean_code: str, + source_excerpt: str, +) -> str: + """Validate that Lean-accepted code proves the intended theorem candidate.""" + return f"""You are validating a Lean 4 proof candidate after Lean 4 has accepted the code. + +Lean 4 already verified that the code is logically valid. Your task is narrower: +decide whether the accepted Lean code actually corresponds to the intended theorem +candidate below. Reject code that proves an unrelated trivial theorem, proves only a +weakened/irrelevant result, or avoids the intended statement by changing the target. + +Accept if the Lean code formalizes the same mathematical claim, a clearly equivalent +claim, or a faithful narrowed form explicitly justified by the formal sketch and still +useful for the USER RESEARCH PROMPT. + +USER RESEARCH PROMPT: +{user_prompt} + +INTENDED THEOREM CANDIDATE: +{theorem_statement} + +FORMAL SKETCH / EXPECTED SHAPE: +{formal_sketch or '[none provided]'} + +SOURCE EXCERPT: +{source_excerpt or '[none provided]'} + +LEAN 4-ACCEPTED CODE: +{lean_code} + +Reject examples: +- The code proves only `True`, `1 = 1`, or a routine identity unrelated to the candidate. 
+- The theorem name/statement in Lean bears no relationship to the intended theorem. +- The proof introduces a different result and ignores the claimed theorem. +- The result is materially weaker than the intended theorem without being a useful, explicitly scoped lemma. +- The result may be mathematically valid but does not help answer, support, or advance the USER RESEARCH PROMPT. + +{_json_only_footer('{"decision": "accept", "reasoning": "why the Lean code matches or does not match the intended theorem", "summary": "short rejection feedback if rejected"}')} +""" diff --git a/backend/autonomous/prompts/topic_exploration_prompts.py b/backend/autonomous/prompts/topic_exploration_prompts.py index f367fe6..98f025b 100644 --- a/backend/autonomous/prompts/topic_exploration_prompts.py +++ b/backend/autonomous/prompts/topic_exploration_prompts.py @@ -30,13 +30,16 @@ def build_exploration_user_prompt( parts = [] parts.append("=== TOPIC EXPLORATION PHASE ===\n") - parts.append("You are in a TOPIC EXPLORATION phase. You are NOT solving a mathematical problem directly.") - parts.append("Instead, your task is to propose CANDIDATE BRAINSTORM QUESTIONS — specific mathematical") - parts.append("avenues worth exploring for the research goal below.\n") + parts.append("You are in a TOPIC EXPLORATION phase. Your task is to propose CANDIDATE BRAINSTORM QUESTIONS") + parts.append("that maximize the chance of a rigorous DIRECT answer to the research goal below.\n") + parts.append("Prefer candidate questions aimed at direct solutions, direct partial solutions, impossibility") + parts.append("results, exact reductions, explicit constructions, or sharp constraints. Use indirect/support") + parts.append("avenues only when no stronger direct path is currently available.\n") parts.append("Each submission should contain ONE candidate brainstorm question and reasoning for why") parts.append("it is a valuable, distinct direction. 
The validator will check quality and DIVERSITY —") parts.append("candidates that overlap with already-accepted ones will be REJECTED.\n") parts.append("WHAT MAKES A GOOD CANDIDATE QUESTION:") + parts.append("- Most directly targets answering the user's problem or a clearly necessary subproblem") parts.append("- Specific enough to guide focused mathematical exploration (not vague)") parts.append("- Novel relative to already-accepted candidates and existing brainstorms") parts.append("- Relevant to the research goal below") @@ -45,8 +48,9 @@ def build_exploration_user_prompt( parts.append("- Actionable — a brainstorm session could produce meaningful insights from it\n") parts.append("DIVERSITY IS PARAMOUNT:") parts.append("Your candidate MUST be SUBSTANTIVELY DIFFERENT from already-accepted candidates.") - parts.append("The goal is to map the exploration landscape BROADLY before committing to a direction.") - parts.append("Do not propose variations of existing candidates — propose genuinely different avenues.\n") + parts.append("The goal is to compare the BEST direct-answer paths before committing to one.") + parts.append("Do not propose shallow variations of existing candidates — propose genuinely different,") + parts.append("high-value avenues with a preference for the most direct rigorous routes.\n") parts.append("FORMAT YOUR SUBMISSION AS:") parts.append("State the candidate brainstorm question clearly, then explain why it is valuable and") parts.append("distinct from any existing candidates.\n") diff --git a/backend/autonomous/prompts/topic_prompts.py b/backend/autonomous/prompts/topic_prompts.py index 449fef2..48fb61b 100644 --- a/backend/autonomous/prompts/topic_prompts.py +++ b/backend/autonomous/prompts/topic_prompts.py @@ -31,7 +31,12 @@ def get_topic_selection_system_prompt() -> str: --- YOUR TASK: -Select the optimal research avenue that best advances the user's research goal. 
+Select the optimal research avenue that most directly advances the user's research goal toward a rigorous answer. + +DIRECT-SOLUTION PREFERENCE: +- Prefer avenues likely to produce a direct solution, direct partial solution, impossibility result, explicit construction, exact reduction, or sharp constraint +- Use broader exploratory or background-heavy avenues only when no stronger direct path is currently available +- Do not choose an avenue merely because it is broad or interesting if a more direct rigorous path exists DECISION OPTIONS: 1. NEW_TOPIC - Create a brand new brainstorm topic to explore @@ -42,25 +47,29 @@ def get_topic_selection_system_prompt() -> str: When to choose NEW_TOPIC: - All existing topics are complete OR -- A genuinely new mathematical avenue would provide more research value than continuing existing work +- A genuinely new mathematical avenue would provide more direct-answer value than continuing existing work - The new topic addresses an unexplored area relevant to the research goal - Existing papers don't adequately cover this mathematical territory +- The new topic offers a stronger direct route to resolving the user's question than current options When to choose CONTINUE_EXISTING: - An incomplete brainstorm has significant untapped mathematical depth - The brainstorm has few submissions relative to its mathematical richness -- Continuing would yield more valuable insights than starting fresh +- Continuing would yield more valuable direct progress than starting fresh +- The unfinished topic still contains a realistic path to a stronger direct answer When to choose COMBINE_TOPICS: - Multiple existing brainstorms are deeply interconnected - A unified exploration would reveal insights neither topic could provide alone - The mathematical concepts naturally bridge multiple brainstorms +- The combination produces a more direct route to answering the user's question than keeping them separate CRITICAL REQUIREMENTS: - Focus on mathematical 
rigor and logical soundness - Avoid redundancy with existing work - Ensure topic selection serves the user's research goal - Consider the existing paper library to avoid redundant explorations +- Prefer the avenue with the strongest justified direct-answer potential CRITICAL JSON ESCAPE RULES: 1. Backslashes: ALWAYS use double backslash (\\\\) for any backslash in your text @@ -143,7 +152,7 @@ def get_topic_validator_system_prompt() -> str: --- YOUR TASK: -Validate whether the proposed topic selection represents the best use of research resources. +Validate whether the proposed topic selection represents the best use of research resources for obtaining the strongest rigorous direct answer. VALIDATION CRITERIA: @@ -154,6 +163,7 @@ def get_topic_validator_system_prompt() -> str: 4. The choice is relevant to the user's research goal 5. The reasoning is sound and mathematically grounded 6. The topic doesn't duplicate existing completed work +7. The choice is at least as direct a route to answering the user's question as the available alternatives REJECT the topic selection if: 1. NEW_TOPIC: The topic duplicates an existing brainstorm or completed paper @@ -162,6 +172,7 @@ def get_topic_validator_system_prompt() -> str: 4. The choice ignores more valuable research avenues 5. The reasoning is flawed or lacks mathematical rigor 6. The selection would lead to redundant work +7. 
A clearly more direct rigorous avenue was available and unjustifiably ignored REJECTION FEEDBACK FORMAT: If rejecting, provide CONCRETE, ACTIONABLE guidance: diff --git a/backend/autonomous/validation/paper_redundancy_checker.py b/backend/autonomous/validation/paper_redundancy_checker.py index 276e827..c5591ac 100644 --- a/backend/autonomous/validation/paper_redundancy_checker.py +++ b/backend/autonomous/validation/paper_redundancy_checker.py @@ -164,9 +164,9 @@ async def check_redundancy( self.task_tracking_callback("completed", task_id) return self._create_no_removal(f"Error: {str(e)}") - async def execute_removal(self, paper_id: str) -> bool: + async def execute_removal(self, paper_id: str, reason: str = "") -> bool: """ - Execute paper removal by archiving it. + Execute paper removal by pruning it from model context. Args: paper_id: ID of paper to remove @@ -175,15 +175,21 @@ async def execute_removal(self, paper_id: str) -> bool: True if removal successful """ try: - # Archive the paper - success = await paper_library.archive_paper(paper_id) + success = await paper_library.prune_paper( + paper_id, + reason=reason, + pruned_by="system", + ) if success: - # Update central metadata - await research_metadata.archive_paper(paper_id) - logger.info(f"PaperRedundancyChecker: Successfully archived paper {paper_id}") + await research_metadata.prune_paper( + paper_id, + reason=reason, + pruned_by="system", + ) + logger.info(f"PaperRedundancyChecker: Successfully pruned paper {paper_id}") else: - logger.error(f"PaperRedundancyChecker: Failed to archive paper {paper_id}") + logger.error(f"PaperRedundancyChecker: Failed to prune paper {paper_id}") return success diff --git a/backend/compiler/agents/critique_submitter.py b/backend/compiler/agents/critique_submitter.py index 45b74f0..7040399 100644 --- a/backend/compiler/agents/critique_submitter.py +++ b/backend/compiler/agents/critique_submitter.py @@ -1,14 +1,12 @@ """ Critique Submitter - generates peer review feedback on 
body section. -Also makes rewrite vs continue decision after 5 critiques received. """ -import asyncio -from typing import Optional, Dict, Callable, List +from typing import Optional, Callable import logging import uuid from datetime import datetime -from backend.shared.config import rag_config, system_config +from backend.shared.config import rag_config from backend.shared.models import Submission from backend.shared.api_client_manager import api_client_manager from backend.shared.openrouter_client import FreeModelExhaustedError @@ -16,8 +14,6 @@ from backend.shared.utils import count_tokens from backend.compiler.prompts.critique_prompts import ( build_critique_prompt, - build_rewrite_decision_prompt, - build_iterative_edit_prompt ) from backend.compiler.memory.critique_rejection_memory import CritiqueRejectionMemory @@ -27,7 +23,7 @@ class CritiqueSubmitterAgent: """ Critique submitter agent for peer review aggregation phase. - Generates critiques of body section and makes rewrite vs continue decisions. + Generates critiques of the body section for the final self-review. """ def __init__( @@ -226,290 +222,6 @@ async def submit_critique( logger.error(f"Error generating critique: {e}", exc_info=True) return None - async def submit_rewrite_decision( - self, - user_prompt: str, - current_body: str, - current_outline: str, - current_title: str, - aggregator_db: str, - critique_feedback: str, - pre_critique_paper: str, - reference_papers: Optional[str] = None, - accumulated_history: Optional[str] = None - ) -> Optional[Dict]: - """ - Decide whether to rewrite body or continue to conclusion. 
- - Args: - user_prompt: User's compiler-directing prompt - current_body: Body section being evaluated - current_outline: Paper outline - current_title: Current paper title - aggregator_db: Aggregator database content - critique_feedback: All accepted critiques (typically 1-3 out of 5 total attempts) - pre_critique_paper: Paper snapshot from START of critique phase (for context) - reference_papers: Optional reference paper content - accumulated_history: Optional accumulated critique history from previous failed versions - - Returns: - Dict with decision details or None if generation failed - Format: { - "decision": "total_rewrite" | "partial_revision" | "continue", - "new_title": str or None, - "new_outline": str or None, - "reasoning": str - } - Note: For partial_revision, edit operations are proposed iteratively (not upfront) - """ - try: - # Build prompt - prompt = build_rewrite_decision_prompt( - user_prompt=user_prompt, - current_body=current_body, - current_outline=current_outline, - current_title=current_title, - aggregator_db=aggregator_db, - critique_feedback=critique_feedback, - pre_critique_paper=pre_critique_paper, - reference_papers=reference_papers, - accumulated_history=accumulated_history - ) - - # Validate prompt size - prompt_tokens = count_tokens(prompt) - max_allowed = rag_config.get_available_input_tokens( - self.context_window, - self.max_tokens - ) - - if prompt_tokens > max_allowed: - logger.error( - f"Rewrite decision prompt ({prompt_tokens} tokens) exceeds context window " - f"({max_allowed} tokens available)" - ) - return None - - logger.debug(f"Rewrite decision prompt: {prompt_tokens} tokens (max: {max_allowed})") - - # Generate task ID and notify start - task_id = f"critique_decision_{self.task_sequence:03d}" - self.task_sequence += 1 - - if self.task_tracking_callback: - self.task_tracking_callback("started", task_id) - - # Call LLM (uses same role as critique generation) - response = await api_client_manager.generate_completion( - 
task_id=task_id, - role_id=self.role_id, # Use same role config as critique generation - model=self.model, - messages=[{"role": "user", "content": prompt}], - temperature=0.0, - max_tokens=self.max_tokens - ) - - # Notify completion - if self.task_tracking_callback: - self.task_tracking_callback("completed", task_id) - - # Extract content from API response - # Some reasoning models output JSON in 'reasoning' field instead of 'content' - if not response.get("choices") or not response["choices"][0].get("message"): - logger.error("Rewrite decision: LLM returned empty response structure") - return None - - message = response["choices"][0]["message"] - llm_output = message.get("content") or message.get("reasoning") or "" - - # Parse JSON response - data = parse_json(llm_output) - - if data is None: - logger.error("Failed to parse rewrite decision JSON response") - return None - - # Handle array responses (extract first element) - if isinstance(data, list): - logger.warning("Rewrite decision returned array instead of object - using first element") - if not data: - logger.error("Empty array response from rewrite decision") - return None - data = data[0] - - # Validate required fields - required_fields = ["decision", "reasoning"] - for field in required_fields: - if field not in data: - logger.error(f"Rewrite decision response missing '{field}' field") - return None - - # Validate decision value - if data["decision"] not in ["total_rewrite", "partial_revision", "continue"]: - logger.error(f"Invalid decision value: {data['decision']} (must be 'total_rewrite', 'partial_revision', or 'continue')") - return None - - # Note: For partial_revision, edit_operations are now proposed iteratively (not upfront) - # So we no longer validate edit_operations field here - - logger.info(f"Rewrite decision generated: {data['decision']}") - - return data - - except FreeModelExhaustedError: - raise - except RuntimeError as e: - if "credits exhausted" in str(e).lower(): - raise - 
logger.error(f"Error generating rewrite decision: {e}", exc_info=True) - return None - except Exception as e: - logger.error(f"Error generating rewrite decision: {e}", exc_info=True) - return None - - async def submit_iterative_edit( - self, - user_prompt: str, - pre_critique_paper: str, - current_paper: str, - current_outline: str, - critique_feedback: str, - edits_applied: List[Dict], - reference_papers: Optional[str] = None, - accumulated_history: Optional[str] = None - ) -> Optional[Dict]: - """ - Propose ONE edit for iterative partial revision. - - Called repeatedly until more_edits_needed=false or max iterations reached. - Each call sees the updated paper after previous edits were applied. - - Args: - user_prompt: User's compiler-directing prompt - pre_critique_paper: Paper snapshot from START of critique phase - current_paper: Current paper body (after any edits applied so far) - current_outline: Paper outline - critique_feedback: All accepted critiques from this revision cycle - edits_applied: List of edits already applied in this iteration - reference_papers: Optional reference paper content - accumulated_history: Optional accumulated critique history from previous failed versions - - Returns: - Dict with edit details or None if generation failed - Format: { - "operation": "replace" | "insert_after" | "delete", - "old_string": str, - "new_string": str, - "reasoning": str, - "more_edits_needed": bool - } - """ - try: - # Build prompt - prompt = build_iterative_edit_prompt( - user_prompt=user_prompt, - pre_critique_paper=pre_critique_paper, - current_paper=current_paper, - current_outline=current_outline, - critique_feedback=critique_feedback, - edits_applied=edits_applied, - reference_papers=reference_papers, - accumulated_critique_history=accumulated_history or "" - ) - - # Validate prompt size - prompt_tokens = count_tokens(prompt) - max_allowed = rag_config.get_available_input_tokens( - self.context_window, - self.max_tokens - ) - - if prompt_tokens > 
max_allowed: - logger.error( - f"Iterative edit prompt ({prompt_tokens} tokens) exceeds context window " - f"({max_allowed} tokens available)" - ) - return None - - logger.debug(f"Iterative edit prompt: {prompt_tokens} tokens (max: {max_allowed})") - - # Generate task ID and notify start - task_id = f"partial_edit_{self.task_sequence:03d}" - self.task_sequence += 1 - - if self.task_tracking_callback: - self.task_tracking_callback("started", task_id) - - # Call LLM - response = await api_client_manager.generate_completion( - task_id=task_id, - role_id=self.role_id, - model=self.model, - messages=[{"role": "user", "content": prompt}], - temperature=0.0, - max_tokens=self.max_tokens - ) - - # Notify completion - if self.task_tracking_callback: - self.task_tracking_callback("completed", task_id) - - # Extract content from API response - if not response.get("choices") or not response["choices"][0].get("message"): - logger.error("Iterative edit: LLM returned empty response structure") - return None - - message = response["choices"][0]["message"] - llm_output = message.get("content") or message.get("reasoning") or "" - - # Parse JSON response - data = parse_json(llm_output) - - if data is None: - logger.error("Failed to parse iterative edit JSON response") - return None - - # Handle array responses - if isinstance(data, list): - logger.warning("Iterative edit returned array instead of object - using first element") - if not data: - logger.error("Empty array response from iterative edit") - return None - data = data[0] - - # Validate required fields - required_fields = ["operation", "old_string", "new_string", "reasoning", "more_edits_needed"] - for field in required_fields: - if field not in data: - logger.error(f"Iterative edit response missing '{field}' field") - return None - - # Validate operation type - if data["operation"] not in ["replace", "insert_after", "delete"]: - logger.error(f"Invalid operation: {data['operation']} (must be 'replace', 'insert_after', or 
'delete')") - return None - - # Validate more_edits_needed is boolean - if not isinstance(data["more_edits_needed"], bool): - logger.warning(f"more_edits_needed is not boolean: {data['more_edits_needed']}, converting to bool") - data["more_edits_needed"] = bool(data["more_edits_needed"]) - - edit_num = len(edits_applied) + 1 - logger.info(f"Iterative edit #{edit_num} proposed: {data['operation']} (more_edits_needed={data['more_edits_needed']})") - - return data - - except FreeModelExhaustedError: - raise - except RuntimeError as e: - if "credits exhausted" in str(e).lower(): - raise - logger.error(f"Error generating iterative edit: {e}", exc_info=True) - return None - except Exception as e: - logger.error(f"Error generating iterative edit: {e}", exc_info=True) - return None - async def handle_acceptance(self) -> None: """Handle critique acceptance (for compatibility with aggregator interface).""" # No special action needed for critique acceptances diff --git a/backend/compiler/agents/high_context_submitter.py b/backend/compiler/agents/high_context_submitter.py index fa1cb0d..f224824 100644 --- a/backend/compiler/agents/high_context_submitter.py +++ b/backend/compiler/agents/high_context_submitter.py @@ -3,6 +3,7 @@ Handles 3 modes: construction, outline update, and review. 
""" import asyncio +import hashlib import json import logging import uuid @@ -14,7 +15,7 @@ from backend.shared.models import CompilerSubmission from backend.shared.config import system_config, rag_config from backend.shared.utils import count_tokens -from backend.shared.json_parser import parse_json +from backend.shared.json_parser import parse_json, sanitize_model_output_for_retry_context from backend.autonomous.memory.proof_database import proof_database from backend.aggregator.validation.json_validator import json_validator from backend.compiler.prompts.outline_prompts import ( @@ -46,11 +47,34 @@ # ============================================================================= # The main writer may invoke Wolfram Alpha as a real OpenAI-style tool during # construction mode. Each submission gets a budget of 20 calls; the loop -# forces finalization once the budget is exhausted. Callers attach the full -# audit trail to `CompilerSubmission.metadata["wolfram_calls"]`. +# forces finalization once the budget is exhausted. Tool results are returned +# to the model, while logs/WebSocket events only expose redacted metadata. 
WOLFRAM_MAX_CALLS_PER_SUBMISSION = 20 + +def _hash_text_for_audit(value: str) -> str: + text = value or "" + return hashlib.sha256(text.encode("utf-8", errors="replace")).hexdigest() if text else "" + + +def _redacted_wolfram_audit_entry(query: str, purpose: str, result: str) -> Dict[str, Any]: + """Store non-sensitive Wolfram audit metadata while preserving call counts.""" + return { + "query": "[redacted]", + "purpose": "[redacted]" if purpose else "", + "result": "[redacted]", + "query_redacted": True, + "purpose_redacted": True, + "result_redacted": True, + "query_length": len(query or ""), + "purpose_length": len(purpose or ""), + "result_length": len(result or ""), + "query_sha256": _hash_text_for_audit(query), + "purpose_sha256": _hash_text_for_audit(purpose), + "result_sha256": _hash_text_for_audit(result), + } + WOLFRAM_TOOL_SCHEMA: Dict[str, Any] = { "type": "function", "function": { @@ -489,8 +513,6 @@ async def submit_construction( is_first_portion: bool = False, section_phase: Optional[str] = None, rejection_feedback: Optional[str] = None, - critique_feedback: Optional[str] = None, - pre_critique_paper: Optional[str] = None, brainstorm_content: Optional[str] = None, brainstorm_source_name: Optional[str] = None ) -> Optional[CompilerSubmission]: @@ -502,8 +524,6 @@ async def submit_construction( section_phase: Phase constraint for construction ("body", "conclusion", "introduction", "abstract") When provided, uses phase-specific prompts with explicit section_complete feedback. 
rejection_feedback: Feedback from a previous rejection to guide the model (e.g., "Introduction not found in document") - critique_feedback: Accepted critique feedback from peer review (for body rewrites only) - pre_critique_paper: Paper state before critique phase (for body rewrites - shows what failed) brainstorm_content: Full brainstorm database with submission numbers (for retroactive corrections) brainstorm_source_name: RAG source name for brainstorm (e.g., "brainstorm_abc123.txt") to exclude from retrieval @@ -512,8 +532,7 @@ async def submit_construction( """ phase_info = f", phase={section_phase}" if section_phase else "" feedback_info = f", retry with feedback" if rejection_feedback else "" - critique_info = f", rewrite with critique" if critique_feedback else "" - logger.info(f"Starting construction submission generation (first={is_first_portion}{phase_info}{feedback_info}{critique_info})") + logger.info(f"Starting construction submission generation (first={is_first_portion}{phase_info}{feedback_info})") try: # Get current outline and paper @@ -577,8 +596,6 @@ async def submit_construction( rag_evidence=context_pack.text, is_first_portion=is_first_portion, rejection_feedback=rejection_feedback, - critique_feedback=critique_feedback, - pre_critique_paper=pre_critique_paper, brainstorm_content=brainstorm_content ) elif section_phase == "conclusion": @@ -617,9 +634,7 @@ async def submit_construction( rag_evidence=context_pack.text, is_first_portion=is_first_portion, section_phase=section_phase, - rejection_feedback=rejection_feedback, - critique_feedback=critique_feedback, - pre_critique_paper=pre_critique_paper + rejection_feedback=rejection_feedback ) logger.info(f"Prompt built: {len(prompt)} chars") @@ -993,7 +1008,7 @@ async def _generate_completion_with_wolfram_tool( to the single-shot path. Websocket events: - - `compiler_wolfram_call` broadcast per call with query + preview. + - `compiler_wolfram_call` broadcast per call with redacted metadata. 
""" wolfram_enabled = _wolfram_tool_available() @@ -1115,16 +1130,14 @@ async def _generate_completion_with_wolfram_tool( logger.warning(f"Wolfram query raised: {exc}") result_text = None result_text = result_text or "Wolfram Alpha returned no result." - wolfram_calls.append({ - "query": query, - "purpose": purpose, - "result": result_text, - }) + wolfram_calls.append(_redacted_wolfram_audit_entry(query, purpose, result_text)) logger.info( - "Wolfram Alpha call %d/%d: %s", + "Wolfram Alpha call %d/%d completed (query_len=%d, purpose_len=%d, result_len=%d)", len(wolfram_calls), WOLFRAM_MAX_CALLS_PER_SUBMISSION, - query[:120], + len(query), + len(purpose), + len(result_text), ) try: await self._broadcast_wolfram_event( @@ -1181,9 +1194,14 @@ async def _broadcast_wolfram_event( "compiler_wolfram_call", { "task_id": task_id, - "query": query, - "purpose": purpose, - "result_preview": (result or "")[:200], + "query": "[redacted]", + "purpose": "[redacted]" if purpose else "", + "result_preview": "", + "query_redacted": True, + "result_redacted": True, + "query_length": len(query or ""), + "purpose_length": len(purpose or ""), + "result_length": len(result or ""), "calls_used": calls_used, "calls_remaining": max(0, WOLFRAM_MAX_CALLS_PER_SUBMISSION - calls_used), "max_calls": WOLFRAM_MAX_CALLS_PER_SUBMISSION, @@ -1229,6 +1247,7 @@ async def _parse_json_response_with_retry( try: # Generate a retry task ID (append _retry to distinguish from original) retry_task_id = f"{self.get_current_task_id()}_retry" + retry_context = sanitize_model_output_for_retry_context(response) retry_response = await api_client_manager.generate_completion( task_id=retry_task_id, @@ -1236,7 +1255,7 @@ async def _parse_json_response_with_retry( model=self.model_name, messages=[ {"role": "user", "content": original_prompt}, - {"role": "assistant", "content": response}, + {"role": "assistant", "content": retry_context}, {"role": "user", "content": retry_prompt} ], temperature=0.0, # Deterministic JSON 
formatting diff --git a/backend/compiler/agents/high_param_submitter.py b/backend/compiler/agents/high_param_submitter.py index ec03eda..b46bf3e 100644 --- a/backend/compiler/agents/high_param_submitter.py +++ b/backend/compiler/agents/high_param_submitter.py @@ -10,10 +10,11 @@ for up to 5 Lean 4 attempts with error-feedback chaining. Stage 3 (novelty): classify the verified proof and persist it via proof_database.add_proof. - Stage 4 (placement): propose an inline edit that introduces the + Stage 4 (placement): either propose an inline edit that introduces the theorem with a "verified in Lean 4" marker and an appendix - reference. The coordinator owns the 2-attempt validator retry loop - and the appendix fallback. + reference, or explicitly request appendix-only storage for extension + theorems. The coordinator owns the 2-attempt validator retry loop + and appendix insertion. The Wolfram sub-mode that used to live here has been removed in Phase 2. Wolfram Alpha is now a tool available to HighContextSubmitter.submit_construction @@ -39,13 +40,13 @@ ) from backend.shared.api_client_manager import api_client_manager from backend.shared.config import rag_config, system_config -from backend.shared.json_parser import parse_json +from backend.shared.json_parser import parse_json, sanitize_model_output_for_retry_context +from backend.shared.lean_proof_integrity import validate_full_lean_proof_integrity from backend.shared.lm_studio_client import lm_studio_client from backend.shared.models import ( CompilerSubmission, ProofAttemptFeedback, ProofCandidate, - ProofRecord, ) from backend.shared.utils import count_tokens @@ -96,6 +97,7 @@ def format_theorem_appendix_entry( """ header_name = theorem_name.strip() or proof_id tier_labels = { + "major_mathematical_discovery": "Major Mathematical Discovery", "mathematical_discovery": "Mathematical Discovery", "novel_variant": "Novel Reformulation", "novel_formulation": "Novel Formalization", @@ -103,6 +105,7 @@ def 
format_theorem_appendix_entry( novelty_label = tier_labels.get(novelty_tier, "Novel" if is_novel else "Known") status_suffix = { "appendix_fallback": "inline placement rejected; preserved here because Lean 4 verified the math", + "appendix_requested": "stored here by rigor discovery request", "inline": "also placed inline in the body", }.get(placement_outcome, placement_outcome) @@ -138,6 +141,8 @@ class RigorTheoremResult: # Retained for retry-prompt assembly formal_sketch: str = "" source_excerpt: str = "" + theorem_origin: str = "existing_paper_claim" + placement_preference: str = "inline" # Metadata pass-through metadata: Dict[str, Any] = field(default_factory=dict) @@ -156,6 +161,10 @@ def __init__( model_name: str, user_prompt: str, websocket_broadcaster: Optional[Callable[[str, Dict[str, Any]], Awaitable[None]]] = None, + *, + validator_model: str = "", + validator_context_window: Optional[int] = None, + validator_max_tokens: Optional[int] = None, ): self.model_name = model_name # NOTE: proof_database.inject_into_prompt prepends all novel proofs @@ -163,6 +172,9 @@ def __init__( self.user_prompt = proof_database.inject_into_prompt(user_prompt) self.raw_user_prompt = user_prompt self.websocket_broadcaster = websocket_broadcaster + self.validator_model = validator_model or model_name + self.validator_context_window = validator_context_window or system_config.compiler_validator_context_window + self.validator_max_tokens = validator_max_tokens or system_config.compiler_validator_max_output_tokens self._initialized = False self._standalone_session_id = f"standalone_{uuid.uuid4().hex[:12]}" @@ -192,6 +204,8 @@ async def initialize(self) -> None: self.context_window = system_config.compiler_high_param_context_window self.max_output_tokens = system_config.compiler_high_param_max_output_tokens + self.validator_context_window = self.validator_context_window or system_config.compiler_validator_context_window + self.validator_max_tokens = self.validator_max_tokens or 
system_config.compiler_validator_max_output_tokens self.available_input_tokens = rag_config.get_available_input_tokens( self.context_window, self.max_output_tokens ) @@ -298,11 +312,31 @@ async def submit_rigor_lean_theorem(self) -> Optional[RigorTheoremResult]: formal_sketch = str(discovery.get("formal_sketch") or "").strip() source_excerpt = str(discovery.get("source_excerpt") or "").strip() retry_failure_id = str(discovery.get("retry_existing_failure_id") or "").strip() + theorem_origin = str(discovery.get("theorem_origin") or "").strip() + placement_preference = str(discovery.get("placement_preference") or "").strip() if not theorem_statement: logger.info("Rigor cycle: discovery returned empty theorem_statement; declining") return None + if theorem_origin not in { + "existing_paper_claim", + "extension_from_partial_work", + "extension_from_user_prompt", + }: + theorem_origin = "existing_paper_claim" + + if placement_preference not in {"inline", "appendix_only"}: + placement_preference = "inline" + + if theorem_origin in { + "extension_from_partial_work", + "extension_from_user_prompt", + }: + # Extension proofs are useful evidence for the paper, but they + # should not silently mutate the main body narrative. 
+ placement_preference = "appendix_only" + logger.info( "Rigor cycle: Stage 2 - Lean 4 formalization (up to 5 attempts), " f"retry_failure_id={retry_failure_id or 'none'}" @@ -323,13 +357,16 @@ async def submit_rigor_lean_theorem(self) -> Optional[RigorTheoremResult]: theorem_name, lean_code, attempts = formalizer_result logger.info("Rigor cycle: Stage 3 - novelty classification + persistence") - is_novel, novelty_reasoning, stored_record = await self._step_assess_novelty_and_store( + novelty_result = await self._step_assess_novelty_and_store( theorem_statement=theorem_statement, theorem_name=theorem_name, lean_code=lean_code, formal_sketch=formal_sketch, attempts=attempts, ) + if novelty_result is None: + return None + is_novel, novelty_reasoning, stored_record = novelty_result await self._broadcast( "proof_verified", @@ -355,14 +392,22 @@ async def submit_rigor_lean_theorem(self) -> Optional[RigorTheoremResult]: except Exception as exc: logger.debug("mark_resolved_retry failed (non-fatal): %s", exc) - logger.info("Rigor cycle: Stage 4 - initial placement proposal") - initial_submission = await self._step_initial_placement( - proof_id=stored_record.proof_id, - theorem_statement=theorem_statement, - theorem_name=theorem_name, - lean_code=lean_code, - is_novel=is_novel, - ) + initial_submission = None + if placement_preference == "appendix_only": + logger.info( + "Rigor cycle: discovery requested appendix-only placement " + "(origin=%s)", + theorem_origin, + ) + else: + logger.info("Rigor cycle: Stage 4 - initial placement proposal") + initial_submission = await self._step_initial_placement( + proof_id=stored_record.proof_id, + theorem_statement=theorem_statement, + theorem_name=theorem_name, + lean_code=lean_code, + is_novel=is_novel, + ) return RigorTheoremResult( proof_id=stored_record.proof_id, @@ -370,16 +415,20 @@ async def submit_rigor_lean_theorem(self) -> Optional[RigorTheoremResult]: theorem_name=theorem_name, lean_code=lean_code, is_novel=is_novel, - 
novelty_tier=novelty_tier, + novelty_tier=stored_record.novelty_tier, novelty_reasoning=novelty_reasoning, attempts=attempts, source_id=self._compiler_source_id(), initial_placement_submission=initial_submission, formal_sketch=formal_sketch, source_excerpt=source_excerpt, + theorem_origin=theorem_origin, + placement_preference=placement_preference, metadata={ "retry_failure_id": retry_failure_id, "attempt_count": len(attempts), + "theorem_origin": theorem_origin, + "placement_preference": placement_preference, }, ) @@ -489,6 +538,17 @@ async def _step_formalize( max_output_tokens=self.max_output_tokens, role_id="compiler_rigor_formalization", ) + proof_label = "A" + + def _lean_response_summary(feedback: ProofAttemptFeedback) -> str: + if feedback.success: + return "Lean 4 response: proof verified." + error = " ".join((feedback.error_output or "").split()) + if len(error) > 960: + error = f"{error[:960]}..." + if error: + return f"Lean 4 response: {error} - proof not verified." + return "Lean 4 response: proof not verified." 
async def _on_attempt_started(attempt_number: int, strategy: str) -> None: await self._broadcast( @@ -498,13 +558,14 @@ async def _on_attempt_started(attempt_number: int, strategy: str) -> None: "source_id": self._compiler_source_id(), "theorem_id": candidate.theorem_id, "theorem_statement": theorem_statement, + "proof_label": proof_label, "attempt": attempt_number, "strategy": strategy, }, ) async def _on_attempt_feedback(feedback: ProofAttemptFeedback) -> None: - event = "proof_verified" if feedback.success else "proof_attempt_failed" + event = "proof_lean_accepted" if feedback.success else "proof_attempt_failed" await self._broadcast( event, { @@ -512,9 +573,12 @@ async def _on_attempt_feedback(feedback: ProofAttemptFeedback) -> None: "source_id": self._compiler_source_id(), "theorem_id": candidate.theorem_id, "theorem_statement": theorem_statement, + "proof_label": proof_label, "attempt": feedback.attempt, "strategy": feedback.strategy, "error_output": feedback.error_output[:500] if feedback.error_output else "", + "lean_response": _lean_response_summary(feedback), + "proof_verified": feedback.success, }, ) @@ -574,6 +638,61 @@ async def _on_attempt_feedback(feedback: ProofAttemptFeedback) -> None: ) return None + integrity = await validate_full_lean_proof_integrity( + user_prompt=self.raw_user_prompt, + theorem_statement=theorem_statement, + formal_sketch=candidate.formal_sketch, + lean_code=lean_code, + source_excerpt=candidate.source_excerpt or current_paper, + allowed_baseline="", + validator_model=self.validator_model, + validator_context=self.validator_context_window, + validator_max_tokens=self.validator_max_tokens, + task_id=f"{self.get_current_task_id()}_integrity", + role_id="compiler_rigor_novelty", + require_statement_alignment=True, + ) + if not integrity.valid: + integrity_feedback = ProofAttemptFeedback( + attempt=(attempts[-1].attempt + 1 if attempts else 1), + theorem_id=candidate.theorem_id, + reasoning="Post-Lean proof integrity check 
failed.", + lean_code=lean_code, + error_output=integrity.reason, + strategy="full_script", + success=False, + ) + attempts = list(attempts) + [integrity_feedback] + try: + await proof_database.record_failed_candidate( + source_brainstorm_id=self._compiler_source_id(), + theorem_candidate=candidate, + error_summary=integrity.reason[:2000], + ) + except Exception as exc: + logger.debug("record_failed_candidate failed after integrity rejection: %s", exc) + await self._broadcast( + "proof_integrity_rejected", + { + "source_type": "compiler_rigor", + "source_id": self._compiler_source_id(), + "theorem_id": candidate.theorem_id, + "theorem_statement": theorem_statement, + "category": integrity.category, + "reason": integrity.reason, + }, + ) + await self._broadcast( + "proof_check_complete", + { + "source_type": "compiler_rigor", + "source_id": self._compiler_source_id(), + "verified_count": 0, + "message": "Lean proof failed post-verification integrity checks", + }, + ) + return None + return theorem_name, lean_code, attempts # --------------------------------------------------------- stage 3 @@ -586,60 +705,58 @@ async def _step_assess_novelty_and_store( lean_code: str, formal_sketch: str, attempts: List[ProofAttemptFeedback], - ) -> tuple: + ) -> Optional[tuple]: """Classify the verified proof and persist it via proof_database. Returns (is_novel, novelty_reasoning, stored_record). """ - # Lazy import to break an early-load circular chain through the - # autonomous.core package __init__. - from backend.autonomous.core.proof_novelty import assess_proof_novelty - - existing_block = proof_database.get_novel_proofs_for_injection() - task_id = f"{self.get_current_task_id()}_novelty" self.task_sequence += 1 try: - novelty_tier, novelty_reasoning = await assess_proof_novelty( + # Lazy import avoids an early-load cycle through autonomous.core. 
+ from backend.autonomous.core.proof_registration import register_verified_lean_proof + + registration = await register_verified_lean_proof( + proof_database=proof_database, user_prompt=self.raw_user_prompt, theorem_statement=theorem_statement, lean_code=lean_code, - validator_model=self.model_name, - validator_context=self.context_window, - validator_max_tokens=self.max_output_tokens, - existing_novel_proofs=existing_block, + validator_model=self.validator_model, + validator_context=self.validator_context_window, + validator_max_tokens=self.validator_max_tokens, task_id=task_id, role_id="compiler_rigor_novelty", + source_type="paper", + source_id=self._compiler_source_id(), + source_title="Compiler Rigor Theorem", + theorem_name=theorem_name, + formal_sketch=formal_sketch, + solver="Lean 4", + verification_notes="Produced by compiler rigor loop (HighParamSubmitter).", + attempt_count=len(attempts), + attempts=list(attempts), + broadcast_fn=self.websocket_broadcaster, + base_event={ + "source_type": "compiler_rigor", + "source_id": self._compiler_source_id(), + "trigger": "rigor_loop", + }, ) - is_novel = novelty_tier != "not_novel" + stored = registration.record + return stored.novel, stored.novelty_reasoning, stored except Exception as exc: - logger.warning("Novelty assessment failed (%s); defaulting to non-novel", exc) - novelty_tier, novelty_reasoning, is_novel = "not_novel", f"Novelty assessment error: {exc}", False - - record = ProofRecord( - proof_id="", # proof_database assigns proof_XXX on add_proof - theorem_id="", - theorem_statement=theorem_statement, - theorem_name=theorem_name, - formal_sketch=formal_sketch, - source_type="paper", # compiler rigor proofs live under the "paper" channel - source_id=self._compiler_source_id(), - source_title="Compiler Rigor Theorem", - solver="Lean 4", - lean_code=lean_code, - novel=is_novel, - novelty_tier=novelty_tier, - novelty_reasoning=novelty_reasoning, - verification_notes="Produced by compiler rigor loop 
(HighParamSubmitter).", - attempt_count=len(attempts), - attempts=list(attempts), - dependencies=[], - solver_hints=[], - ) - - stored = await proof_database.add_proof(record) - return is_novel, novelty_reasoning, stored + logger.warning("Novelty assessment failed; rigor proof will not be stored: %s", exc) + await self._broadcast( + "proof_check_complete", + { + "source_type": "compiler_rigor", + "source_id": self._compiler_source_id(), + "verified_count": 0, + "message": f"novelty validation failed: {exc}", + }, + ) + return None # --------------------------------------------------------- stage 4 @@ -870,9 +987,7 @@ async def _call_llm_and_parse( ) try: - truncated_preview = llm_output[:2000] + ( - "\n[...truncated...]" if len(llm_output) > 2000 else "" - ) + truncated_preview = sanitize_model_output_for_retry_context(llm_output, max_chars=2000) retry_response = await api_client_manager.generate_completion( task_id=f"{task_id}_retry", role_id=self.role_id, diff --git a/backend/compiler/core/compiler_coordinator.py b/backend/compiler/core/compiler_coordinator.py index 9d0964d..6acd695 100644 --- a/backend/compiler/core/compiler_coordinator.py +++ b/backend/compiler/core/compiler_coordinator.py @@ -17,6 +17,7 @@ from backend.shared.workflow_predictor import workflow_predictor from backend.shared.api_client_manager import api_client_manager from backend.shared.openrouter_client import FreeModelExhaustedError, OpenRouterInvalidResponseError +from backend.shared.brainstorm_proof_gate import BRAINSTORM_LEAN_PROOF_MARKER from backend.shared.free_model_manager import free_model_manager from backend.shared.json_parser import parse_json from backend.shared.utils import count_tokens @@ -39,6 +40,21 @@ logger = logging.getLogger(__name__) +CRITIQUE_ATTEMPT_TARGET = 3 + +LEAN_PROOF_EDIT_DENIAL_REASON = ( + "REJECTION REASON: Protected Lean 4 Proof\n\n" + "ISSUE: The paper-writing retroactive brainstorm operation attempted to edit, delete, " + "or add context to a Lean 4 
verified proof in the brainstorm database.\n\n" + "WHY THIS IS AN ISSUE: Lean 4 proof blocks are immutable from paper-writing modes. " + "Paper writing may cite or discuss verified proofs in the paper, but it cannot mutate " + "the proof text or attach context to the proof record. Only the normal brainstorm prune " + "system may remove Lean 4 proof entries.\n\n" + "FIX REQUIRED: Do not target Lean 4 proof submissions with brainstorm_operation. If a proof " + "is unhelpful, let the scheduled brainstorm prune system handle removal. If the paper needs " + "commentary, write that commentary in the paper prose instead of editing the proof." +) + def _classify_submitter_error(err: BaseException) -> tuple[str, str]: """ @@ -113,7 +129,7 @@ def __init__(self): self.autonomous_mode = False self.autonomous_section_phase = None # "body", "conclusion", "introduction", "abstract" self._current_topic_id = None # Set by autonomous coordinator for retroactive brainstorm corrections - self._current_reference_paper_ids: List[str] = [] # Autonomous/Tier 3 references preserved for critique and rewrite context + self._current_reference_paper_ids: List[str] = [] # Autonomous/Tier 3 references preserved for critique context # Critique phase state (post-body peer review) self.critique_submitter = None # CritiqueSubmitterAgent instance @@ -121,15 +137,8 @@ def __init__(self): self.in_critique_phase = False self.critique_acceptances = 0 self.paper_version = 1 # Track version number - self.rewrite_count = 0 # Track COMPLETED rewrites (max 1) - self.rewrite_pending = False # Track if rewrite initiated but not yet succeeded - self.accumulated_critique_history: List[Dict] = [] # Store all critiques from all versions - self.previous_body_versions: List[Dict] = [] # Store prior versions - self.needs_critique_after_rewrite = False # Flag to trigger another critique round self.paper_title: Optional[str] = None # Track current paper title self._skip_critique_requested = False # Pre-emptive skip flag 
(user can set before critique phase) - self.pre_critique_paper: Optional[str] = None # Snapshot of paper at critique phase start - self.current_critique_feedback: Optional[str] = None # Accepted critiques for current version (for rewrite context) # Aggregator monitoring for incremental re-RAG self.aggregator_acceptances_last_rag = 0 @@ -162,19 +171,27 @@ async def initialize( # OpenRouter provider config for validator validator_provider: str = "lm_studio", validator_openrouter_provider: Optional[str] = None, + validator_openrouter_reasoning_effort: str = "auto", validator_lm_studio_fallback: Optional[str] = None, # OpenRouter provider config for high-context submitter high_context_provider: str = "lm_studio", high_context_openrouter_provider: Optional[str] = None, + high_context_openrouter_reasoning_effort: str = "auto", high_context_lm_studio_fallback: Optional[str] = None, # OpenRouter provider config for high-param submitter high_param_provider: str = "lm_studio", high_param_openrouter_provider: Optional[str] = None, + high_param_openrouter_reasoning_effort: str = "auto", high_param_lm_studio_fallback: Optional[str] = None, # OpenRouter provider config for critique submitter critique_submitter_provider: str = "lm_studio", critique_submitter_openrouter_provider: Optional[str] = None, - critique_submitter_lm_studio_fallback: Optional[str] = None + critique_submitter_openrouter_reasoning_effort: str = "auto", + critique_submitter_lm_studio_fallback: Optional[str] = None, + validator_supercharge_enabled: bool = False, + high_context_supercharge_enabled: bool = False, + high_param_supercharge_enabled: bool = False, + critique_submitter_supercharge_enabled: bool = False ) -> None: """ Initialize the compiler coordinator. 
@@ -184,19 +201,23 @@ async def initialize( validator_model: Model for validator high_context_model: Model for high-context submitter high_param_model: Model for high-param submitter - critique_submitter_model: Model for critique generation and rewrite decisions + critique_submitter_model: Model for critique generation skip_aggregator_db: If True, don't load Part 1 aggregator database (for autonomous mode) validator_provider: Provider for validator ("lm_studio" or "openrouter") validator_openrouter_provider: OpenRouter host provider for validator + validator_openrouter_reasoning_effort: OpenRouter reasoning effort for validator validator_lm_studio_fallback: LM Studio fallback model for validator high_context_provider: Provider for high-context submitter high_context_openrouter_provider: OpenRouter host provider for high-context submitter + high_context_openrouter_reasoning_effort: OpenRouter reasoning effort for high-context submitter high_context_lm_studio_fallback: LM Studio fallback model for high-context submitter high_param_provider: Provider for high-param submitter high_param_openrouter_provider: OpenRouter host provider for high-param submitter + high_param_openrouter_reasoning_effort: OpenRouter reasoning effort for high-param submitter high_param_lm_studio_fallback: LM Studio fallback model for high-param submitter critique_submitter_provider: Provider for critique submitter critique_submitter_openrouter_provider: OpenRouter host provider for critique submitter + critique_submitter_openrouter_reasoning_effort: OpenRouter reasoning effort for critique submitter critique_submitter_lm_studio_fallback: LM Studio fallback model for critique submitter """ logger.info("Initializing compiler coordinator...") @@ -212,16 +233,24 @@ async def initialize( # Store OpenRouter provider configs for all roles self.validator_provider = validator_provider self.validator_openrouter_provider = validator_openrouter_provider + self.validator_openrouter_reasoning_effort = 
validator_openrouter_reasoning_effort self.validator_lm_studio_fallback = validator_lm_studio_fallback self.high_context_provider = high_context_provider self.high_context_openrouter_provider = high_context_openrouter_provider + self.high_context_openrouter_reasoning_effort = high_context_openrouter_reasoning_effort self.high_context_lm_studio_fallback = high_context_lm_studio_fallback self.high_param_provider = high_param_provider self.high_param_openrouter_provider = high_param_openrouter_provider + self.high_param_openrouter_reasoning_effort = high_param_openrouter_reasoning_effort self.high_param_lm_studio_fallback = high_param_lm_studio_fallback self.critique_submitter_provider = critique_submitter_provider self.critique_submitter_openrouter_provider = critique_submitter_openrouter_provider + self.critique_submitter_openrouter_reasoning_effort = critique_submitter_openrouter_reasoning_effort self.critique_submitter_lm_studio_fallback = critique_submitter_lm_studio_fallback + self.validator_supercharge_enabled = validator_supercharge_enabled + self.high_context_supercharge_enabled = high_context_supercharge_enabled + self.high_param_supercharge_enabled = high_param_supercharge_enabled + self.critique_submitter_supercharge_enabled = critique_submitter_supercharge_enabled # Reset workflow state for fresh start self.outline_accepted = False @@ -321,16 +350,21 @@ async def initialize( provider=self.high_context_provider, model_id=high_context_model, openrouter_provider=self.high_context_openrouter_provider, + openrouter_reasoning_effort=self.high_context_openrouter_reasoning_effort, lm_studio_fallback_id=self.high_context_lm_studio_fallback, context_window=system_config.compiler_high_context_context_window, - max_output_tokens=system_config.compiler_high_context_max_output_tokens + max_output_tokens=system_config.compiler_high_context_max_output_tokens, + supercharge_enabled=self.high_context_supercharge_enabled ) ) self.high_param_submitter = HighParamSubmitter( 
high_param_model, compiler_prompt, - websocket_broadcaster=self.websocket_broadcaster + websocket_broadcaster=self.websocket_broadcaster, + validator_model=validator_model, + validator_context_window=self.validator_context_window, + validator_max_tokens=self.validator_max_tokens, ) await self.high_param_submitter.initialize() # Set up task tracking callback for workflow panel integration @@ -342,11 +376,41 @@ async def initialize( provider=self.high_param_provider, model_id=high_param_model, openrouter_provider=self.high_param_openrouter_provider, + openrouter_reasoning_effort=self.high_param_openrouter_reasoning_effort, lm_studio_fallback_id=self.high_param_lm_studio_fallback, context_window=system_config.compiler_high_param_context_window, - max_output_tokens=system_config.compiler_high_param_max_output_tokens + max_output_tokens=system_config.compiler_high_param_max_output_tokens, + supercharge_enabled=self.high_param_supercharge_enabled ) ) + high_param_role_config = ModelConfig( + provider=self.high_param_provider, + model_id=high_param_model, + openrouter_provider=self.high_param_openrouter_provider, + openrouter_reasoning_effort=self.high_param_openrouter_reasoning_effort, + lm_studio_fallback_id=self.high_param_lm_studio_fallback, + context_window=system_config.compiler_high_param_context_window, + max_output_tokens=system_config.compiler_high_param_max_output_tokens, + supercharge_enabled=self.high_param_supercharge_enabled + ) + api_client_manager.configure_role( + role_id="compiler_rigor_formalization", + config=high_param_role_config, + ) + validator_role_config = ModelConfig( + provider=self.validator_provider, + model_id=validator_model, + openrouter_provider=self.validator_openrouter_provider, + openrouter_reasoning_effort=self.validator_openrouter_reasoning_effort, + lm_studio_fallback_id=self.validator_lm_studio_fallback, + context_window=self.validator_context_window, + max_output_tokens=self.validator_max_tokens, + 
supercharge_enabled=self.validator_supercharge_enabled + ) + api_client_manager.configure_role( + role_id="compiler_rigor_novelty", + config=validator_role_config, + ) self.validator = CompilerValidator( validator_model, @@ -363,9 +427,11 @@ async def initialize( provider=self.validator_provider, model_id=validator_model, openrouter_provider=self.validator_openrouter_provider, + openrouter_reasoning_effort=self.validator_openrouter_reasoning_effort, lm_studio_fallback_id=self.validator_lm_studio_fallback, context_window=self.validator_context_window, - max_output_tokens=self.validator_max_tokens + max_output_tokens=self.validator_max_tokens, + supercharge_enabled=self.validator_supercharge_enabled ) ) @@ -577,15 +643,6 @@ def _is_body_complete(self, paper: str) -> bool: Returns: True if body is complete (should skip rigor/outline updates), False otherwise """ - # If rewrite is pending (initiated but not yet succeeded), body is NOT complete - if self.rewrite_pending: - return False - - # Check if max rewrites completed - skip critique entirely - if self.rewrite_count >= 1: - logger.info("Max rewrites completed (1) - treating body as complete") - return True - # Autonomous mode: use explicit phase tracking if self.autonomous_mode: return self.autonomous_section_phase != "body" @@ -615,8 +672,12 @@ async def start(self) -> None: # Start main workflow loop self._main_task = asyncio.create_task(self._main_workflow()) - # Start aggregator monitoring for incremental re-RAG - self._aggregator_monitor_task = asyncio.create_task(self._monitor_aggregator_for_rerag()) + # Manual Part 2 can watch Part 1 for incremental context. Autonomous/Tier 3 + # paper writing owns its brainstorm context and must not spawn child monitors. 
+ if not self.autonomous_mode: + self._aggregator_monitor_task = asyncio.create_task(self._monitor_aggregator_for_rerag()) + else: + self._aggregator_monitor_task = None await self._broadcast("compiler_started", {"message": "Compiler started"}) logger.info("Compiler started successfully") @@ -711,7 +772,7 @@ async def _main_workflow(self) -> None: }) await asyncio.sleep(120) # Wait before retrying (all models exhausted) if self.is_running: - asyncio.create_task(self._main_workflow()) + self._main_task = asyncio.create_task(self._main_workflow()) except Exception as e: logger.error(f"Compiler workflow error: {e}", exc_info=True) self.is_running = False @@ -1244,14 +1305,6 @@ async def _submit_and_validate_construction(self, rejection_feedback: Optional[s # Single attempt - None means no work needed, not error section_phase = self.autonomous_section_phase if self.autonomous_mode else None - # Pass critique context during body rewrite (when critique_feedback is set) - critique_feedback_for_construction = None - pre_critique_paper_for_construction = None - if section_phase == "body" and self.current_critique_feedback: - critique_feedback_for_construction = self.current_critique_feedback - pre_critique_paper_for_construction = self.pre_critique_paper - logger.info("Body construction with critique context (rewrite mode)") - # Load brainstorm content for retroactive corrections (autonomous mode only) brainstorm_content_for_submitter = None brainstorm_source_for_submitter = None @@ -1271,8 +1324,6 @@ async def _submit_and_validate_construction(self, rejection_feedback: Optional[s is_first_portion=False, section_phase=section_phase, rejection_feedback=rejection_feedback, - critique_feedback=critique_feedback_for_construction, - pre_critique_paper=pre_critique_paper_for_construction, brainstorm_content=brainstorm_content_for_submitter, brainstorm_source_name=brainstorm_source_for_submitter ) @@ -1608,12 +1659,6 @@ def has_real_section_content(section_pattern: str, 
paper_text: str) -> bool: self.construction_acceptances += 1 self._track_submission_wolfram_calls(submission) - # If rewrite was pending, mark it as completed now (first successful acceptance) - if self.rewrite_pending: - self.rewrite_count += 1 - self.rewrite_pending = False - logger.info(f"Rewrite #{self.rewrite_count} completed successfully (first acceptance after rewrite)") - await compiler_rejection_log.add_acceptance( submission.submission_id, "construction", @@ -1684,6 +1729,18 @@ async def _handle_brainstorm_retroactive_operation(self, brainstorm_op) -> None: logger.warning(f"Brainstorm {topic_id} is empty, skipping retroactive operation") return + denial_reason = self._get_lean_proof_retroactive_denial_reason( + brainstorm_op, + brainstorm_content, + ) + if denial_reason: + await self._reject_brainstorm_retroactive_operation( + brainstorm_op, + topic_id, + denial_reason, + ) + return + result = await self.validator.validate_brainstorm_operation( brainstorm_op, brainstorm_content ) @@ -1738,6 +1795,109 @@ async def _handle_brainstorm_retroactive_operation(self, brainstorm_op) -> None: }) except Exception as e: logger.error(f"Error handling retroactive brainstorm operation: {e}") + + @staticmethod + def _parse_brainstorm_submission_content(brainstorm_content: str, submission_number: int) -> str: + separator = "=" * 80 + parts = (brainstorm_content or "").split(separator) + for index, part in enumerate(parts): + if "SUBMISSION #" not in part: + continue + match = re.search(r"SUBMISSION #(\d+)\s*\|", part) + if not match or int(match.group(1)) != submission_number: + continue + if index + 1 < len(parts): + return parts[index + 1].strip() + return "" + + @staticmethod + def _is_lean_verified_brainstorm_content(content: str) -> bool: + text = content or "" + if BRAINSTORM_LEAN_PROOF_MARKER in text: + return True + lowered = text.lower() + return ( + "lean 4 code:" in lowered + and "lean verification: accepted" in lowered + and "theorem statement:" in lowered + ) 
+ + @staticmethod + def _looks_like_lean_proof_annotation_attempt(brainstorm_op) -> bool: + text = f"{getattr(brainstorm_op, 'new_content', '')}\n{getattr(brainstorm_op, 'reasoning', '')}".lower() + if BRAINSTORM_LEAN_PROOF_MARKER.lower() in text: + return True + proof_reference = ( + "lean 4 proof" in text + or "lean-verified proof" in text + or "lean verified proof" in text + or "verified proof" in text + or "proof id:" in text + ) + annotation_intent = any( + phrase in text + for phrase in ( + "add context", + "additional context", + "annotate", + "annotation", + "clarify proof", + "context for proof", + "context to proof", + "update proof", + "edit proof", + ) + ) + return proof_reference and annotation_intent + + def _get_lean_proof_retroactive_denial_reason(self, brainstorm_op, brainstorm_content: str) -> str: + action = getattr(brainstorm_op, "action", "") + if action in {"edit", "delete"}: + submission_number = getattr(brainstorm_op, "submission_number", None) + if submission_number is None: + return "" + target_content = self._parse_brainstorm_submission_content( + brainstorm_content, + int(submission_number), + ) + if self._is_lean_verified_brainstorm_content(target_content): + return LEAN_PROOF_EDIT_DENIAL_REASON + if action == "add" and self._looks_like_lean_proof_annotation_attempt(brainstorm_op): + return LEAN_PROOF_EDIT_DENIAL_REASON + return "" + + async def _reject_brainstorm_retroactive_operation( + self, + brainstorm_op, + topic_id: str, + reasoning: str, + ) -> None: + logger.info( + "Retroactive brainstorm %s automatically rejected for protected Lean 4 proof mutation: %s", + getattr(brainstorm_op, "action", ""), + getattr(brainstorm_op, "submission_number", None), + ) + result = CompilerValidationResult( + submission_id=str(uuid.uuid4()), + decision="reject", + reasoning=reasoning, + summary=reasoning[:750], + json_valid=True, + validation_stage="pre-validation", + ) + await compiler_rejection_log.add_rejection( + result, + 
"brainstorm_retroactive", + getattr(brainstorm_op, "new_content", "") or getattr(brainstorm_op, "reasoning", ""), + ) + await self._broadcast("brainstorm_retroactive_rejected", { + "action": getattr(brainstorm_op, "action", ""), + "topic_id": topic_id, + "submission_number": getattr(brainstorm_op, "submission_number", None), + "reasoning": reasoning[:500], + "automatic": True, + "protected_lean_proof": True, + }) async def _submit_and_validate_outline_update(self) -> bool: """Submit and validate outline update. Returns True if accepted.""" @@ -2089,9 +2249,9 @@ async def _submit_and_validate_rigor(self) -> bool: async def _place_or_appendix_fallback(self, lean_result) -> bool: """Drive the 2-attempt placement validator loop. - On double rejection (or when the submitter never produced a legal - attempt), the theorem is appended to the Theorems Appendix and the - cycle is counted as a rigor_acceptance. + On explicit appendix-only discovery, double rejection, or when the + submitter never produced a legal attempt, the theorem is appended to + the Theorems Appendix and the cycle is counted as a rigor_acceptance. 
""" from backend.compiler.agents.high_param_submitter import ( format_theorem_appendix_entry, @@ -2099,13 +2259,26 @@ async def _place_or_appendix_fallback(self, lean_result) -> bool: submission = lean_result.initial_placement_submission validator_feedback = "" + requested_appendix_only = ( + getattr(lean_result, "placement_preference", "") == "appendix_only" + or (getattr(lean_result, "metadata", {}) or {}).get("placement_preference") + == "appendix_only" + ) + appendix_outcome = ( + "appendix_requested" if requested_appendix_only else "appendix_fallback" + ) for placement_attempt in (1, 2): if submission is None: + route_reason = ( + "discovery requested appendix-only placement" + if requested_appendix_only + else "submitter returned no placement submission" + ) logger.info( - "Rigor placement attempt %s: submitter returned no placement submission; " - "routing directly to appendix fallback", + "Rigor placement attempt %s: %s; routing directly to appendix", placement_attempt, + route_reason, ) break @@ -2251,9 +2424,10 @@ async def _place_or_appendix_fallback(self, lean_result) -> bool: lean_result, validator_feedback ) - # Appendix fallback: both placement attempts failed (or attempt 1 was - # impossible). The math is already Lean-verified, so the theorem is - # preserved in the Theorems Appendix and counted as a rigor_acceptance. + # Appendix storage: either explicitly requested by discovery or used + # as fallback when inline placement failed / was impossible. The math + # is already Lean-verified, so the theorem is preserved and counted as + # a rigor_acceptance. 
appendix_entry = format_theorem_appendix_entry( proof_id=lean_result.proof_id, theorem_statement=lean_result.theorem_statement, @@ -2261,7 +2435,7 @@ async def _place_or_appendix_fallback(self, lean_result) -> bool: is_novel=lean_result.is_novel, theorem_name=lean_result.theorem_name, novelty_tier=lean_result.novelty_tier, - placement_outcome="appendix_fallback", + placement_outcome=appendix_outcome, ) appended = await paper_memory.append_to_theorems_appendix(appendix_entry) if not appended: @@ -2281,18 +2455,18 @@ async def _place_or_appendix_fallback(self, lean_result) -> bool: "submission_id": ( lean_result.initial_placement_submission.submission_id if lean_result.initial_placement_submission - else f"rigor_appendix_{lean_result.proof_id}" + else f"rigor_{appendix_outcome}_{lean_result.proof_id}" ), - "placement_outcome": "appendix_fallback", + "placement_outcome": appendix_outcome, "lean_proof_id": lean_result.proof_id, "is_novel": lean_result.is_novel, }, ) await self._broadcast("paper_updated", {"word_count": word_count}) logger.info( - "Rigor theorem %s stored in Theorems Appendix (both placement attempts " - "failed or unavailable)", + "Rigor theorem %s stored in Theorems Appendix (%s)", lean_result.proof_id, + appendix_outcome, ) return True @@ -2614,31 +2788,11 @@ async def _start_critique_phase(self) -> None: self.in_critique_phase = True self.critique_acceptances = 0 - # Snapshot paper at critique phase start (for rewrite context) - self.pre_critique_paper = await paper_memory.get_paper() - logger.info(f"Snapshot pre-critique paper: {len(self.pre_critique_paper)} chars") - - # Clear current critique feedback for this round - self.current_critique_feedback = None - # Initialize critique memory paper_id = f"paper_v{self.paper_version}" critique_memory.initialize(paper_id) - - # Before clearing, accumulate any existing critiques from previous phases - existing = await critique_memory.get_all_critiques() - if existing.strip(): - 
self.accumulated_critique_history.append({ - "version": self.paper_version, - "critiques": existing - }) - logger.info(f"Accumulated {len(self.accumulated_critique_history)} critique history version(s)") - await critique_memory.clear() - - # Load from file for crash recovery (if file exists) - await critique_memory.load_from_file() - + logger.info(f"Critique memory initialized for {paper_id}") # Create critique submitter agent @@ -2668,9 +2822,11 @@ async def _start_critique_phase(self) -> None: provider=self.critique_submitter_provider, model_id=self.critique_submitter_model, openrouter_provider=self.critique_submitter_openrouter_provider, + openrouter_reasoning_effort=self.critique_submitter_openrouter_reasoning_effort, lm_studio_fallback_id=self.critique_submitter_lm_studio_fallback, context_window=system_config.compiler_critique_submitter_context_window, - max_output_tokens=system_config.compiler_critique_submitter_max_tokens + max_output_tokens=system_config.compiler_critique_submitter_max_tokens, + supercharge_enabled=self.critique_submitter_supercharge_enabled ) ) @@ -2681,9 +2837,11 @@ async def _start_critique_phase(self) -> None: provider=self.validator_provider, model_id=self.validator_model, openrouter_provider=self.validator_openrouter_provider, + openrouter_reasoning_effort=self.validator_openrouter_reasoning_effort, lm_studio_fallback_id=self.validator_lm_studio_fallback, context_window=self.validator_context_window, - max_output_tokens=self.validator_max_tokens + max_output_tokens=self.validator_max_tokens, + supercharge_enabled=self.validator_supercharge_enabled ) ) @@ -2694,16 +2852,18 @@ async def _start_critique_phase(self) -> None: provider=self.validator_provider, model_id=self.validator_model, openrouter_provider=self.validator_openrouter_provider, + openrouter_reasoning_effort=self.validator_openrouter_reasoning_effort, lm_studio_fallback_id=self.validator_lm_studio_fallback, context_window=self.validator_context_window, - 
max_output_tokens=self.validator_max_tokens + max_output_tokens=self.validator_max_tokens, + supercharge_enabled=self.validator_supercharge_enabled ) ) # Broadcast critique phase started await self._broadcast("critique_phase_started", { "paper_version": self.paper_version, - "target_critiques": 5 + "target_critiques": CRITIQUE_ATTEMPT_TARGET }) # Start critique aggregation loop @@ -2714,12 +2874,10 @@ async def _get_reference_papers_context_for_critique( current_outline: str = "", current_body: str = "", aggregator_db: str = "", - critique_feedback: str = "", - pre_critique_paper: str = "", - accumulated_history: str = "" + critique_feedback: str = "" ) -> Optional[str]: """ - Prepare reference-paper context for critique/rewrite prompts in autonomous mode. + Prepare reference-paper context for critique prompts in autonomous mode. This preserves the reference papers selected for the paper instead of silently dropping them once the critique phase begins. @@ -2744,8 +2902,6 @@ async def _get_reference_papers_context_for_critique( current_body or "", aggregator_db or "", critique_feedback or "", - pre_critique_paper or "", - accumulated_history or "", ] if part ) @@ -2773,7 +2929,6 @@ async def _get_reference_papers_context_for_critique( current_outline or "", current_body or "", critique_feedback or "", - pre_critique_paper or "", ] if part ) @@ -2792,10 +2947,13 @@ async def _get_reference_papers_context_for_critique( async def _run_critique_aggregation(self) -> None: """ - Run critique aggregation until 5 total attempts. + Run critique aggregation until the configured total attempt target. Uses simple generate-validate loop similar to aggregator workflow. 
""" - logger.info("Starting critique aggregation loop (target: 5 total attempts, accepted OR rejected)") + logger.info( + "Starting critique aggregation loop " + f"(target: {CRITIQUE_ATTEMPT_TARGET} total attempts, accepted OR rejected)" + ) rejection_count = 0 consecutive_rejections = 0 @@ -2812,25 +2970,26 @@ async def _run_critique_aggregation(self) -> None: "acceptances": critique_count, "rejections": rejection_count, "total_attempts": total_attempts, - "target": 5, # Now means total attempts, not just acceptances + "target": CRITIQUE_ATTEMPT_TARGET, "version": self.paper_version }) # Check if target reached - if total_attempts >= 5: + if total_attempts >= CRITIQUE_ATTEMPT_TARGET: logger.info(f"Critique phase complete: {total_attempts} total attempts ({critique_count} accepted, {rejection_count} rejected)") - - # If 0 acceptances, skip rewrite and continue + if critique_count == 0: - logger.info("No critiques accepted - skipping rewrite phase, moving to next section") - await self._skip_rewrite_and_continue() + logger.info("No critiques accepted - moving to next section") + await self._continue_without_self_review() else: - # Trigger rewrite decision with accepted critiques - await self._trigger_rewrite_decision() + await self._append_accepted_critiques_as_self_review() break # Generate critique - logger.info(f"Generating critique (attempts: {total_attempts}/5, accepted: {critique_count}, rejected: {rejection_count})") + logger.info( + f"Generating critique (attempts: {total_attempts}/{CRITIQUE_ATTEMPT_TARGET}, " + f"accepted: {critique_count}, rejected: {rejection_count})" + ) current_body = await paper_memory.get_paper() current_outline = await outline_memory.get_outline() @@ -2842,16 +3001,12 @@ async def _run_critique_aggregation(self) -> None: # Get existing critiques existing_critiques = await critique_memory.get_all_critiques() - # Format accumulated critique history from previous failed versions - accumulated_history = 
self._format_accumulated_critique_history() - - # Keep autonomous reference papers available during critique/rewrite. + # Keep autonomous reference papers available during critique. reference_papers = await self._get_reference_papers_context_for_critique( current_outline=current_outline, current_body=current_body, aggregator_db=aggregator_db, - critique_feedback=existing_critiques, - accumulated_history=accumulated_history + critique_feedback=existing_critiques ) # Generate critique submission @@ -2861,8 +3016,7 @@ async def _run_critique_aggregation(self) -> None: current_outline=current_outline, aggregator_db=aggregator_db, reference_papers=reference_papers, - existing_critiques=existing_critiques, - accumulated_history=accumulated_history + existing_critiques=existing_critiques ) if submission is None: @@ -2872,13 +3026,7 @@ async def _run_critique_aggregation(self) -> None: logger.info(f"Critique generated: {submission.submission_id}") - # Validate critique using aggregator validator prompts - from backend.aggregator.agents.validator import ValidatorAgent - from backend.aggregator.memory.shared_training import shared_training_memory - from backend.aggregator.prompts.validator_prompts import build_validator_prompt - - # Build critique validation prompt (reuses aggregator validator structure) - # We'll use the validator's validate method but with critique-specific context + # Validate critique using critique-specific validator prompts. 
validation_result = await self._validate_critique(submission) # Handle decline submissions differently @@ -2894,7 +3042,7 @@ async def _run_critique_aggregation(self) -> None: "reasoning": submission.reasoning, "version": self.paper_version, "total_attempts": total_attempts, - "target": 5 + "target": CRITIQUE_ATTEMPT_TARGET }) else: # Validator disagrees - there ARE issues that need critique @@ -2907,7 +3055,7 @@ async def _run_critique_aggregation(self) -> None: "reasoning": validation_result.reasoning if validation_result else "Unknown", "consecutive": consecutive_rejections, "total_attempts": total_attempts, - "target": 5 + "target": CRITIQUE_ATTEMPT_TARGET }) else: # Regular critique submission @@ -2919,12 +3067,12 @@ async def _run_critique_aggregation(self) -> None: await critique_memory.add_accepted_critique(submission.content) new_count = await critique_memory.get_critique_count() - logger.info(f"Critique ACCEPTED ({new_count}/5): {submission.submission_id}") + logger.info(f"Critique ACCEPTED ({new_count}/{CRITIQUE_ATTEMPT_TARGET}): {submission.submission_id}") await self._broadcast("critique_accepted", { "critique_id": submission.submission_id, "count": new_count, - "target": 5, + "target": CRITIQUE_ATTEMPT_TARGET, "version": self.paper_version, "total_attempts": total_attempts, "rejections": rejection_count @@ -2953,7 +3101,7 @@ async def _run_critique_aggregation(self) -> None: "reasoning": validation_result.reasoning if validation_result else "Unknown", "consecutive": consecutive_rejections, "total_attempts": total_attempts, - "target": 5 + "target": CRITIQUE_ATTEMPT_TARGET }) # Brief delay between critiques @@ -2975,9 +3123,6 @@ async def _validate_critique(self, submission) -> Optional[ValidationResult]: ValidationResult or None """ try: - # Import prompt builders - from backend.aggregator.prompts.validator_prompts import build_validator_prompt - # Build validation prompt for critique # We pass the critique as "submission" and existing critiques as 
"context" current_body = await paper_memory.get_paper() @@ -3132,480 +3277,29 @@ async def _perform_critique_cleanup(self) -> None: except Exception as e: logger.error(f"Error in critique cleanup: {e}", exc_info=True) - async def _trigger_rewrite_decision(self) -> None: - """ - Trigger rewrite vs continue decision after 5 critiques. - Includes retry logic if decision is rejected by validator. - """ - max_retries = 5 - retry_count = 0 - - while retry_count < max_retries: - try: - logger.info("=" * 80) - logger.info(f"Critique phase complete (5 total attempts) - triggering rewrite decision (attempt {retry_count + 1})") - logger.info("=" * 80) - - # Get all critiques - critique_feedback = await critique_memory.get_all_critiques() - current_body = await paper_memory.get_paper() - current_outline = await outline_memory.get_outline() - current_title = self.paper_title if self.paper_title else self.user_prompt - - # Get context (aggregator DB, reference papers, etc.) - from backend.aggregator.memory.shared_training import shared_training_memory - aggregator_db = await shared_training_memory.get_all_content() - # Format accumulated critique history from previous failed versions - accumulated_history = self._format_accumulated_critique_history() - - reference_papers = await self._get_reference_papers_context_for_critique( - current_outline=current_outline, - current_body=current_body, - aggregator_db=aggregator_db, - critique_feedback=critique_feedback, - pre_critique_paper=self.pre_critique_paper or "", - accumulated_history=accumulated_history - ) - - # Critique submitter makes decision - logger.info("Critique submitter generating rewrite decision...") - decision_result = await self.critique_submitter.submit_rewrite_decision( - user_prompt=self.user_prompt, - current_body=current_body, - current_outline=current_outline, - current_title=current_title, - aggregator_db=aggregator_db, - critique_feedback=critique_feedback, - pre_critique_paper=self.pre_critique_paper, # Paper 
snapshot from start of critique phase - reference_papers=reference_papers, - accumulated_history=accumulated_history - ) - - if decision_result is None: - logger.error("Rewrite decision generation returned None") - retry_count += 1 - await asyncio.sleep(5) - continue - - logger.info(f"Rewrite decision: {decision_result['decision']}") - - # Validator reviews decision - logger.info("Validator reviewing rewrite decision...") - validated = await self.validator.validate_rewrite_decision( - decision_result=decision_result, - user_prompt=self.user_prompt, - current_body=current_body, - current_outline=current_outline, - current_title=current_title, - critique_feedback=critique_feedback, - aggregator_db=aggregator_db - ) - - if not validated: - # Decision rejected - retry - logger.warning("Rewrite decision rejected by validator - retrying") - await self._broadcast("rewrite_decision_rejected", { - "attempt": retry_count + 1, - "max_retries": max_retries - }) - retry_count += 1 - await asyncio.sleep(5) - continue - - # Decision validated - execute it - logger.info("Rewrite decision validated - executing") - - # Execute decision - if decision_result["decision"] == "continue": - logger.info("Decision: CONTINUE to conclusion (critiques minor/incorrect)") - await self._end_critique_phase(rewrite=False) - - elif decision_result["decision"] == "partial_revision": - logger.info("Decision: PARTIAL REVISION (iterative targeted edits)") - await self._execute_partial_revision( - new_title=decision_result.get("new_title"), - new_outline=decision_result.get("new_outline"), - critique_feedback=critique_feedback, - accumulated_history=accumulated_history - ) - - elif decision_result["decision"] == "total_rewrite": - logger.info("Decision: TOTAL REWRITE body section") - await self._execute_body_rewrite( - new_title=decision_result.get("new_title"), - new_outline=decision_result.get("new_outline"), - critique_feedback=critique_feedback - ) - - # Success - break out of retry loop - break - - 
except Exception as e: - logger.error(f"Error in rewrite decision (attempt {retry_count + 1}): {e}", exc_info=True) - retry_count += 1 - if retry_count < max_retries: - await asyncio.sleep(5) - # Note: Don't call _end_critique_phase here - let it fall through to unified fallback below - - # Unified fallback if while loop exited due to retry exhaustion - # Handles both: validation failures (returned False 5 times) OR exceptions (5 exceptions occurred) - if retry_count >= max_retries: - logger.error("Rewrite decision validation failed after max retries - defaulting to CONTINUE") - await self._broadcast("rewrite_decision_max_retries_exceeded", { - "action": "continue_to_conclusion" - }) - await self._end_critique_phase(rewrite=False) - - async def _execute_body_rewrite( - self, - new_title: Optional[str], - new_outline: Optional[str], - critique_feedback: str - ) -> None: - """ - Execute full body section rewrite. - - Args: - new_title: New paper title (or None to keep current) - new_outline: Updated outline (or None to keep current) - critique_feedback: All accepted critiques - """ - logger.info("=" * 80) - logger.info("EXECUTING BODY REWRITE") - logger.info("=" * 80) - - # Mark rewrite as pending (will count as completed only after first successful body acceptance) - self.rewrite_pending = True - logger.info(f"Rewrite initiated (pending successful completion, max: 1)") - - # Store previous version - current_body = await paper_memory.get_paper() - old_title = self.paper_title if self.paper_title else self.user_prompt - - await paper_memory.store_previous_version( - version=self.paper_version, - title=old_title, - body=current_body, - critique_feedback=critique_feedback - ) - - logger.info(f"Stored Version {self.paper_version}: {old_title}") - - # Update title if changed - title_changed = False - if new_title and new_title != old_title: - self.paper_title = new_title - self.paper_version += 1 - title_changed = True - logger.info(f"Paper title changed: {new_title} 
(Version {self.paper_version})") - else: - logger.info("Paper title unchanged") - - # Update outline if provided - if new_outline: - await outline_memory.update_outline(new_outline) - logger.info("Outline updated with new structure") - - # Clear paper body (keep only placeholders) - await paper_memory.clear_body_section() - logger.info("Body section cleared - preserving placeholders") - - # Broadcast rewrite started - await self._broadcast("body_rewrite_started", { - "version": self.paper_version, - "title": self.paper_title if self.paper_title else self.user_prompt, - "title_changed": title_changed, - "critique_feedback_preview": critique_feedback[:500] - }) - - # End critique phase - await self._end_critique_phase(rewrite=True) - - # Reset to body phase with new context - self.autonomous_section_phase = "body" - - # Store critique feedback for passing to construction prompts - # This provides rewrite context so the model knows what to fix - self.current_critique_feedback = critique_feedback - logger.info(f"Stored critique feedback for construction: {len(critique_feedback)} chars") - - # Set flag for re-critique if title changed - if title_changed: - logger.info("Title changed - will run critique phase again after rewrite completes") - self.needs_critique_after_rewrite = True - else: - logger.info("Title unchanged - will continue to conclusion after rewrite completes") - self.needs_critique_after_rewrite = False - - logger.info("=" * 80) - logger.info(f"BODY REWRITE PREPARED - Starting body construction for Version {self.paper_version}") - logger.info("Body reconstruction will have: pre_critique_paper + accepted critique feedback") - logger.info("=" * 80) - - async def _execute_partial_revision( - self, - new_title: Optional[str], - new_outline: Optional[str], - critique_feedback: str, - accumulated_history: Optional[str] = None - ) -> None: - """ - Execute partial revision using ITERATIVE targeted edit operations. 
- - Proposes edits one at a time, validates each, applies, and shows updated paper - before proposing the next edit. - - Args: - new_title: New paper title (or None to keep current) - new_outline: Updated outline (or None to keep current) - critique_feedback: All accepted critiques - accumulated_history: Optional accumulated critique history from previous versions - """ - logger.info("=" * 80) - logger.info("EXECUTING PARTIAL REVISION (ITERATIVE EDITS)") - logger.info("=" * 80) - - # Mark rewrite as pending (will count as completed only after first successful edit acceptance) - self.rewrite_pending = True - logger.info(f"Partial revision initiated (pending successful completion, max: 1)") - - # Store current state (for history tracking) - old_title = self.paper_title if self.paper_title else self.user_prompt - - # Update title if changed - title_changed = False - if new_title and new_title != old_title: - self.paper_title = new_title - self.paper_version += 1 - title_changed = True - logger.info(f"Paper title changed: {new_title} (Version {self.paper_version})") - else: - logger.info("Paper title unchanged") - - # Update outline if provided - if new_outline: - await outline_memory.update_outline(new_outline) - logger.info("Outline updated with new structure") - - # Get current outline - current_outline = await outline_memory.get_outline() + async def _append_accepted_critiques_as_self_review(self) -> None: + """Append validator-accepted critiques as an honest AI self-review section.""" + critique_feedback = await critique_memory.get_all_critiques() + if not critique_feedback.strip(): + await self._continue_without_self_review() + return - reference_papers = await self._get_reference_papers_context_for_critique( - current_outline=current_outline, - current_body=self.pre_critique_paper or "", - critique_feedback=critique_feedback, - pre_critique_paper=self.pre_critique_paper or "", - accumulated_history=accumulated_history or "" - ) - - # ITERATIVE EDIT LOOP - 
MAX_EDITS = 20 # Safety limit to prevent infinite loops - edits_applied: List[Dict] = [] - successful_edits = 0 - failed_edits = 0 - consecutive_failures = 0 - MAX_CONSECUTIVE_FAILURES = 3 - - logger.info("Starting iterative edit loop...") - - more_edits_needed = True - while more_edits_needed and len(edits_applied) < MAX_EDITS: - try: - # Get current paper state - current_paper = await paper_memory.get_paper() - - # Ask critique submitter for next edit - logger.info(f"Requesting edit #{len(edits_applied) + 1}...") - edit_proposal = await self.critique_submitter.submit_iterative_edit( - user_prompt=self.user_prompt, - pre_critique_paper=self.pre_critique_paper, - current_paper=current_paper, - current_outline=current_outline, - critique_feedback=critique_feedback, - edits_applied=edits_applied, - reference_papers=reference_papers, - accumulated_history=accumulated_history - ) - - if edit_proposal is None: - logger.error("Failed to get edit proposal - stopping iterative loop") - break - - operation = edit_proposal.get("operation") - old_string = edit_proposal.get("old_string", "") - new_string = edit_proposal.get("new_string", "") - reasoning = edit_proposal.get("reasoning", "") - more_edits_needed = edit_proposal.get("more_edits_needed", False) - - logger.info(f"Edit proposal: {operation} - {reasoning[:100]}...") - - # Validate the edit via validator - is_valid, validation_reason = await self._validate_partial_revision_edit( - edit_proposal=edit_proposal, - current_paper=current_paper, - current_outline=current_outline, - critique_feedback=critique_feedback - ) - - if not is_valid: - logger.warning(f"Edit #{len(edits_applied) + 1} rejected by validator: {validation_reason}") - consecutive_failures += 1 - - if consecutive_failures >= MAX_CONSECUTIVE_FAILURES: - logger.error(f"Max consecutive failures ({MAX_CONSECUTIVE_FAILURES}) reached - stopping iterative loop") - break - - # Don't add to edits_applied, loop will retry with same state - failed_edits += 1 - 
continue - - # Apply the validated edit - edit_submission = CompilerSubmission( - submission_id=f"partial_revision_edit_{len(edits_applied) + 1}", - mode="review", - content=new_string, - operation=operation, - old_string=old_string, - new_string=new_string, - reasoning=reasoning - ) - updated_paper = self._apply_edit(current_paper, edit_submission) - - if updated_paper is not None: - await paper_memory.update_paper(updated_paper) - logger.info(f"Edit #{len(edits_applied) + 1} applied successfully") - edits_applied.append(edit_proposal) - successful_edits += 1 - consecutive_failures = 0 # Reset on success - - # Broadcast progress - await self._broadcast("partial_revision_edit_applied", { - "edit_number": len(edits_applied), - "operation": operation, - "reasoning": reasoning[:200], - "more_edits_needed": more_edits_needed - }) - else: - logger.warning(f"Edit #{len(edits_applied) + 1} failed to apply (old_string not found)") - consecutive_failures += 1 - failed_edits += 1 - - if consecutive_failures >= MAX_CONSECUTIVE_FAILURES: - logger.error(f"Max consecutive failures ({MAX_CONSECUTIVE_FAILURES}) reached - stopping iterative loop") - break - - except Exception as e: - logger.error(f"Error in iterative edit loop: {e}", exc_info=True) - consecutive_failures += 1 - failed_edits += 1 - - if consecutive_failures >= MAX_CONSECUTIVE_FAILURES: - logger.error(f"Max consecutive failures ({MAX_CONSECUTIVE_FAILURES}) reached - stopping iterative loop") - break - - logger.info(f"Iterative edit loop complete: {successful_edits} successful, {failed_edits} failed") - - if len(edits_applied) >= MAX_EDITS: - logger.warning(f"Reached max edit limit ({MAX_EDITS}) - stopping iterative loop") - - # Mark rewrite completion on first successful edit - if self.rewrite_pending: - if successful_edits > 0: - self.rewrite_count += 1 - logger.info(f"Rewrite #{self.rewrite_count} completed successfully (first accepted partial edit)") - self.rewrite_pending = False - - # Broadcast partial revision 
complete - await self._broadcast("partial_revision_complete", { + appended = await paper_memory.append_self_review_section(critique_feedback) + await self._broadcast("self_review_appended", { "version": self.paper_version, - "title": self.paper_title if self.paper_title else self.user_prompt, - "title_changed": title_changed, - "edits_applied": successful_edits, - "edits_failed": failed_edits, - "critique_feedback_preview": critique_feedback[:500] + "critique_count": await critique_memory.get_critique_count(), + "appended": appended, }) - - # End critique phase - await self._end_critique_phase(rewrite=False) - - # Set flag for re-critique if title changed - if title_changed: - logger.info("Title changed - would run critique phase again, but max rewrites reached") - # Note: With max 1 rewrite, title changes won't trigger re-critique - self.needs_critique_after_rewrite = False - else: - logger.info("Title unchanged - continuing to conclusion") - self.needs_critique_after_rewrite = False - - # Continue to conclusion (partial revision doesn't loop back to body phase) - # Clear critique context (no longer needed after body phase) - self.current_critique_feedback = None - self.autonomous_section_phase = "conclusion" - - logger.info("=" * 80) - logger.info("PARTIAL REVISION COMPLETE - Continuing to CONCLUSION") - logger.info("=" * 80) - - async def _validate_partial_revision_edit( - self, - edit_proposal: Dict, - current_paper: str, - current_outline: str, - critique_feedback: str - ) -> Tuple[bool, str]: - """ - Validate a single partial revision edit using the compiler validator. 
- - Args: - edit_proposal: The proposed edit with operation, old_string, new_string, reasoning - current_paper: Current paper content - current_outline: Paper outline - critique_feedback: All accepted critiques - - Returns: - Tuple of (is_valid: bool, rejection_reason: str) - """ - try: - # Delegate to the compiler validator which has comprehensive validation logic - return await self.validator.validate_partial_revision_edit( - edit_proposal=edit_proposal, - current_paper=current_paper, - current_outline=current_outline, - critique_feedback=critique_feedback - ) - - except Exception as e: - logger.error(f"Error validating partial revision edit: {e}", exc_info=True) - return False, f"Validation error: {str(e)}" + await self._end_critique_phase(self_review_appended=appended) - def _format_accumulated_critique_history(self) -> str: - """ - Format all historical critiques from previous failed versions. - Returns formatted string with clear version labeling. - """ - if not self.accumulated_critique_history: - return "" - - parts = ["=" * 80] - parts.append("CRITIQUE HISTORY FROM PREVIOUS FAILED VERSIONS") - parts.append("(These critiques are from earlier attempts that were rewritten)") - parts.append("=" * 80 + "\n") - - for i, entry in enumerate(self.accumulated_critique_history, 1): - parts.append(f"--- FAILED VERSION #{i} (REWRITTEN) ---") - parts.append(entry['critiques']) - parts.append("") - - return "\n".join(parts) - - async def _end_critique_phase(self, rewrite: bool) -> None: + async def _end_critique_phase(self, self_review_appended: bool = False) -> None: """ End critique phase and clean up. 
Args: - rewrite: Whether a rewrite was approved + self_review_appended: Whether accepted critiques were appended to the paper """ - logger.info(f"Ending critique phase (rewrite={rewrite})") + logger.info(f"Ending critique phase (self_review_appended={self_review_appended})") self.in_critique_phase = False @@ -3620,32 +3314,25 @@ async def _end_critique_phase(self, rewrite: bool) -> None: # Broadcast end await self._broadcast("critique_phase_ended", { - "rewrite": rewrite, + "self_review_appended": self_review_appended, "version": self.paper_version }) - - if not rewrite: - # Continue to conclusion - # Clear critique context (no longer needed after body phase) - self.current_critique_feedback = None - self.autonomous_section_phase = "conclusion" - logger.info("Critique phase complete - transitioning to CONCLUSION phase") - await self._broadcast("phase_transition", { - "from_phase": "critique", - "to_phase": "conclusion", - "trigger": "critiques_reviewed", - "paper_word_count": await paper_memory.get_word_count() - }) - else: - logger.info("Critique phase complete - body will be rewritten") + + self.autonomous_section_phase = "conclusion" + logger.info("Critique phase complete - transitioning to CONCLUSION phase") + await self._broadcast("phase_transition", { + "from_phase": "critique", + "to_phase": "conclusion", + "trigger": "critiques_reviewed", + "paper_word_count": await paper_memory.get_word_count() + }) - async def _skip_rewrite_and_continue(self) -> None: + async def _continue_without_self_review(self) -> None: """ - Skip rewrite phase when body is academically acceptable. - Called when 5 total attempts complete with 0 accepted critiques. + Continue when no critiques were accepted for the self-review section. 
""" logger.info("=" * 80) - logger.info("SKIPPING REWRITE - No critiques accepted, body is acceptable") + logger.info("NO SELF-REVIEW APPENDED - No critiques accepted") logger.info("=" * 80) await self._broadcast("critique_phase_skipped", { @@ -3653,16 +3340,15 @@ async def _skip_rewrite_and_continue(self) -> None: "version": self.paper_version }) - # End critique phase without rewrite - await self._end_critique_phase(rewrite=False) + await self._end_critique_phase(self_review_appended=False) - # The _end_critique_phase already transitions to conclusion when rewrite=False + # The _end_critique_phase already transitions to conclusion. logger.info("Transitioning to CONCLUSION phase (body accepted as-is)") async def skip_critique_phase(self) -> bool: """ Skip the critique phase and continue to conclusion. - User override to bypass peer review and rewrite cycle. + User override to bypass peer review and self-review appending. Can be called: - During critique phase: immediately skips @@ -3682,7 +3368,7 @@ async def skip_critique_phase(self) -> bool: "version": self.paper_version }) - await self._end_critique_phase(rewrite=False) + await self._end_critique_phase(self_review_appended=False) return True else: # Not in critique phase yet - set flag to skip when reached @@ -3764,57 +3450,7 @@ async def _check_phase_transition(self, section_complete: bool = False) -> bool: # Phase transition logic based on explicit completion signal if current_phase == "body": - # Check if max rewrites reached - skip critique phase entirely - if self.rewrite_count >= 1: - logger.info(f"Max rewrites ({self.rewrite_count}) reached - skipping critique phase, proceeding to conclusion") - # Clear critique context (no longer needed after body phase) - self.current_critique_feedback = None - self.autonomous_section_phase = "conclusion" - await self._broadcast("phase_transition", { - "from_phase": "body", - "to_phase": "conclusion", - "trigger": "section_complete", - "reason": "max_rewrites_reached", 
- "rewrite_count": self.rewrite_count, - "paper_word_count": word_count - }) - return False - - # Check if this is a rewrite completion that needs another critique round - if self.needs_critique_after_rewrite: - # Body rewrite complete, title changed - run critique phase again - logger.info(f"Body rewrite complete (Version {self.paper_version}) - triggering ANOTHER critique phase (title changed)") - self.needs_critique_after_rewrite = False # Reset flag - - await self._broadcast("phase_transition", { - "from_phase": "body", - "to_phase": "critique", - "trigger": "rewrite_complete_title_changed", - "paper_word_count": word_count, - "version": self.paper_version - }) - - # Start critique aggregation sub-workflow again - await self._start_critique_phase() - return False - - # Check if this is a rewrite completion with unchanged title - skip to conclusion - if self.rewrite_count > 0: - # Rewrite completed but title unchanged - critique loop ends, proceed to conclusion - logger.info(f"Rewrite #{self.rewrite_count} complete (title unchanged) - skipping additional critique, proceeding to conclusion") - # Clear critique context (no longer needed after body phase) - self.current_critique_feedback = None - self.autonomous_section_phase = "conclusion" - await self._broadcast("phase_transition", { - "from_phase": "body", - "to_phase": "conclusion", - "trigger": "rewrite_complete_title_unchanged", - "rewrite_count": self.rewrite_count, - "paper_word_count": word_count - }) - return False - - # BODY COMPLETE - TRIGGER CRITIQUE PHASE BEFORE CONCLUSION (first time only) + # BODY COMPLETE - TRIGGER CRITIQUE PHASE BEFORE CONCLUSION logger.info("Body section complete - transitioning to CRITIQUE PHASE") await self._broadcast("phase_transition", { "from_phase": "body", @@ -4005,15 +3641,8 @@ async def clear_paper(self) -> None: self.in_critique_phase = False self.critique_acceptances = 0 self.paper_version = 1 - self.rewrite_count = 0 - self.rewrite_pending = False - 
self.accumulated_critique_history.clear() - self.previous_body_versions.clear() - self.needs_critique_after_rewrite = False self.paper_title = None self._skip_critique_requested = False - self.pre_critique_paper = None - self.current_critique_feedback = None logger.info("Reset critique phase state") logger.info("Paper and outline cleared - system reset to fresh start") diff --git a/backend/compiler/core/compiler_rag_manager.py b/backend/compiler/core/compiler_rag_manager.py index c7b2532..da189c9 100644 --- a/backend/compiler/core/compiler_rag_manager.py +++ b/backend/compiler/core/compiler_rag_manager.py @@ -151,7 +151,11 @@ async def load_aggregator_database(self) -> None: chunks_by_size = await ingestion_pipeline.ingest_file( aggregator_file_path, rag_config.submitter_chunk_intervals, # All 4 configs - is_user_file=True + is_user_file=True, + trusted_roots=[ + system_config.data_dir, + system_config.user_uploads_dir, + ], ) # Add all chunks while holding the lock diff --git a/backend/compiler/memory/paper_memory.py b/backend/compiler/memory/paper_memory.py index e107b7e..e30ecf0 100644 --- a/backend/compiler/memory/paper_memory.py +++ b/backend/compiler/memory/paper_memory.py @@ -7,6 +7,7 @@ from typing import Optional, Callable, List, Dict from pathlib import Path import logging +import re from backend.shared.config import system_config @@ -27,6 +28,8 @@ THEOREMS_APPENDIX_START = "[HARD CODED THEOREMS APPENDIX START -- LEAN 4 VERIFIED THEOREMS BELOW]" THEOREMS_APPENDIX_END = "[HARD CODED THEOREMS APPENDIX END -- ALL APPENDIX CONTENT SHOULD BE ABOVE THIS LINE]" APPENDIX_EMPTY_PLACEHOLDER = "[Theorems appendix - verified Lean 4 theorems not placed inline will appear here]" +AI_SELF_REVIEW_SECTION_TITLE = "AI Self-Review and Limitations" +AI_SELF_REVIEW_SECTION_HEADER = f"## {AI_SELF_REVIEW_SECTION_TITLE}" class PaperMemory: @@ -45,7 +48,6 @@ def __init__(self): self.rechunk_callback: Optional[Callable] = None self._lock = asyncio.Lock() self._initialized = False 
- self.previous_versions = [] # Store previous body versions for UI display async def initialize(self) -> None: """Initialize paper memory.""" @@ -339,54 +341,14 @@ async def clear_body_section(self) -> None: except Exception as e: logger.error(f"Re-chunking callback failed after clearing body: {e}") - async def store_previous_version( - self, - version: int, - title: str, - body: str, - critique_feedback: str - ) -> None: - """ - Store previous body version for UI display. - - Args: - version: Version number - title: Paper title for this version - body: Body section content - critique_feedback: Critique feedback that triggered rewrite - """ - async with self._lock: - # Add to in-memory list - version_data = { - "version": version, - "title": title, - "body": body, - "critique_feedback": critique_feedback - } - self.previous_versions.append(version_data) - - # Save to file - version_file = Path(system_config.data_dir) / f"paper_version_{version}.txt" - version_file.parent.mkdir(parents=True, exist_ok=True) - - async with aiofiles.open(version_file, 'w', encoding='utf-8') as f: - await f.write(f"VERSION {version}: {title}\n") - await f.write(f"{'=' * 80}\n\n") - await f.write(f"BODY SECTION:\n{body}\n\n") - await f.write(f"{'=' * 80}\n\n") - await f.write(f"CRITIQUE FEEDBACK THAT TRIGGERED REWRITE:\n{critique_feedback}\n") - - logger.info(f"Stored previous version {version} to {version_file}") - async def get_previous_versions(self) -> list: """ - Get all previous versions for UI display. - - Returns: - List of version dicts with version, title, body, critique_feedback + Compatibility endpoint for older clients. + + Rewrites are no longer performed, so there are no previous body + versions to expose. 
""" - async with self._lock: - return self.previous_versions.copy() + return [] def _extract_body_and_appendix(self, paper: str) -> tuple[str, str]: """ @@ -533,6 +495,82 @@ async def append_to_theorems_appendix(self, theorem_entry: str) -> bool: logger.error(f"Re-chunking callback failed after appendix append: {e}") return True + + def _remove_self_review_section(self, content: str) -> str: + """Remove an existing AI self-review section before replacing it.""" + if not content: + return "" + + title_pattern = re.escape(AI_SELF_REVIEW_SECTION_TITLE) + header_pattern = ( + rf"(?:^|\n)\s*(?:#+\s*)?{title_pattern}\s*\n" + ) + match = re.search(header_pattern, content, re.IGNORECASE) + if not match: + return content + + start = match.start() + if start > 0 and content[start] == "\n": + start += 1 + + anchor_match = re.search(re.escape(PAPER_ANCHOR), content[match.end():]) + end = len(content) + if anchor_match: + end = match.end() + anchor_match.start() + + return (content[:start].rstrip() + "\n\n" + content[end:].lstrip()).strip() + + def _build_self_review_section(self, critique_feedback: str) -> str: + """Build the final transparent self-review section from accepted critiques.""" + return ( + f"{AI_SELF_REVIEW_SECTION_HEADER}\n\n" + "The following self-review notes were generated during the AI critique phase " + "and accepted by the validator as substantive concerns, limitations, or " + "improvement points. They are preserved transparently rather than being used " + "to rewrite the paper.\n\n" + f"{critique_feedback.strip()}" + ).strip() + + async def append_self_review_section(self, critique_feedback: str) -> bool: + """ + Append accepted critique feedback as the final AI self-review section. + + The section is placed after the Theorems Appendix when those markers + exist, otherwise before the paper anchor. Existing self-review content + is replaced so retries do not duplicate the section. 
+ """ + if not critique_feedback or not critique_feedback.strip(): + logger.info("No critique feedback supplied; skipping self-review append") + return False + + final_content = None + async with self._lock: + paper = await self._get_paper_unlocked() + if not paper.strip(): + logger.warning("Cannot append self-review section: paper is empty") + return False + + cleaned = self._remove_self_review_section(paper) + section = self._build_self_review_section(critique_feedback) + + anchor_idx = cleaned.find(PAPER_ANCHOR) + if anchor_idx >= 0: + before_anchor = cleaned[:anchor_idx].rstrip() + anchor_and_after = cleaned[anchor_idx:].lstrip() + new_paper = f"{before_anchor}\n\n{section}\n\n{anchor_and_after}" + else: + new_paper = f"{cleaned.rstrip()}\n\n{section}" + + final_content = await self._update_paper_unlocked(new_paper) + logger.info("AI self-review section appended to paper") + + if final_content and self.rechunk_callback: + try: + await self.rechunk_callback(final_content) + except Exception as e: + logger.error(f"Re-chunking callback failed after self-review append: {e}") + + return True async def ensure_placeholders_exist(self) -> bool: """ diff --git a/backend/compiler/prompts/construction_prompts.py b/backend/compiler/prompts/construction_prompts.py index 322dd03..b5907db 100644 --- a/backend/compiler/prompts/construction_prompts.py +++ b/backend/compiler/prompts/construction_prompts.py @@ -109,6 +109,11 @@ def get_body_construction_system_prompt() -> str: - Do NOT force coverage of every source entry - Do NOT ignore clearly crucial source material for the scope you are writing +DIRECT-ANSWER-FIRST PRINCIPLE: +- Prefer writing sections that directly solve, partially solve, refute, or sharply constrain the user's question +- Use background and supporting exposition only to the extent needed to support the strongest rigorous direct answer +- Do not broaden the paper with side material that weakens answer focus + CRITICAL - SYSTEM-MANAGED MARKERS (NOT YOUR 
OUTPUT): The paper uses placeholder markers that the SYSTEM adds automatically (you did NOT create these): @@ -178,6 +183,7 @@ def get_body_construction_system_prompt() -> str: - Follow the outline structure for body sections - Build upon what's already written - Use brainstorm/aggregator content when it helps, but you are not required to cover every source entry +- Prioritize the strongest direct rigorous route to answering the user's prompt - Do not repeat content already in the document - Check for existing section headers before creating new ones - Write clear, rigorous mathematical exposition @@ -361,8 +367,13 @@ def get_conclusion_construction_system_prompt() -> str: 4. SET section_complete=true (because writing the conclusion completes this phase) 5. PROVIDE the actual Conclusion text in the "content" field +DIRECT-ANSWER-FIRST REQUIREMENT: +- Make the paper's strongest justified answer, partial answer, impossibility result, or sharp constraint explicit +- Do not hide the core answer behind generic summary language + WHAT TO INCLUDE IN CONCLUSION: - Summary of main results and theorems proven +- Clear statement of the strongest direct answer the paper has established - Significance of the mathematical contributions - Connections between results - Brief mention of limitations or open questions (optional) @@ -526,10 +537,15 @@ def get_introduction_construction_system_prompt() -> str: 5. SET section_complete=true (because writing the introduction completes this phase) 6. 
PROVIDE the actual Introduction text in the "content" field +DIRECT-ANSWER-FIRST REQUIREMENT: +- Frame the paper around the main question and the direct answer the body/conclusion establish +- Keep preliminaries and motivation in service of that answer rather than drifting into generic survey exposition + WHAT TO INCLUDE IN INTRODUCTION: - Context and motivation for the mathematical problem - Brief overview of what the paper covers - Statement of main results (high-level, not full proofs) +- Clear framing of the paper's answer-bearing contribution - Roadmap of the paper structure - Historical context or prior work (if relevant) @@ -688,9 +704,14 @@ def get_abstract_construction_system_prompt() -> str: 5. ALWAYS SET section_complete=true (THIS IS THE FINAL PHASE - no exceptions) 6. PROVIDE the actual Abstract text in the "content" field +DIRECT-ANSWER-FIRST REQUIREMENT: +- State the strongest direct answer, partial answer, impossibility result, or sharp constraint up front when the paper justifies it +- Avoid generic "this paper explores" wording when the paper actually proves or establishes something sharper + WHAT TO INCLUDE IN ABSTRACT: - Brief statement of the problem addressed - Main results and contributions (1-2 sentences) +- Explicit statement of the paper's answer-bearing result when justified - Key methods or approaches used - Significance of the results - Typically 150-300 words @@ -801,10 +822,15 @@ def get_construction_system_prompt() -> str: 3. Maintain coherence with the outline and existing draft 4. 
Set section_complete=true when the current phase is done +DIRECT-ANSWER-FIRST PRINCIPLE: +- In every phase, prefer content that most directly answers the user's prompt +- Use supporting background only when it materially strengthens that direct answer + CRITICAL REQUIREMENTS: - Follow the outline structure - Build upon what's already written - Use brainstorm/aggregator content when it helps, but you are not required to cover every source entry +- Prioritize the strongest direct rigorous route to answering the user's prompt - Maintain coherent narrative flow - Write clear, rigorous mathematical exposition - Do not repeat content already in the document @@ -932,6 +958,12 @@ def get_construction_json_schema() -> str: - NEVER write paper content that depends on a simultaneous brainstorm correction for correctness - NEVER propose a brainstorm correction that is only justified by what you're writing in the paper +PROTECTED LEAN 4 PROOFS: +- You must NEVER edit, delete, annotate, or add context to a Lean 4 verified proof in the brainstorm database using `brainstorm_operation`. +- If a brainstorm submission is marked as a Lean 4 verified proof, treat it as immutable proof evidence. You may cite or discuss it in the paper prose, but you cannot mutate the proof text or attach explanatory context to the proof record. +- Only the normal brainstorm prune system is allowed to remove Lean 4 proof entries. Paper-writing retroactive brainstorm operations are not a proof-pruning mechanism. +- If you try to edit/delete/add context to a Lean 4 proof, the system will automatically reject the brainstorm_operation and feed that rejection back to you. + Add this OPTIONAL field to your JSON response: { ... 
(all standard fields above) ..., @@ -964,8 +996,6 @@ async def build_construction_prompt( is_first_portion: bool = False, section_phase: Optional[str] = None, rejection_feedback: Optional[str] = None, - critique_feedback: Optional[str] = None, - pre_critique_paper: Optional[str] = None, brainstorm_content: Optional[str] = None ) -> str: """ @@ -979,8 +1009,6 @@ async def build_construction_prompt( is_first_portion: Whether this is the first portion of the document section_phase: Phase for construction ("body", "conclusion", "introduction", "abstract", or None for legacy) rejection_feedback: Feedback from a previous rejection to guide the model - critique_feedback: Accepted critique feedback from peer review (for rewrites) - pre_critique_paper: Paper state before critique phase (for rewrites - shows what failed) brainstorm_content: Full brainstorm database with submission numbers (for retroactive corrections, autonomous mode) Returns: @@ -1033,34 +1061,6 @@ async def build_construction_prompt( --- """) - # Add critique context for rewrites (body reconstruction after critique phase) - if critique_feedback or pre_critique_paper: - parts.append("=" * 80 + "\n") - parts.append("⚠️ REWRITE CONTEXT - THIS IS A POST-CRITIQUE RECONSTRUCTION ⚠️\n") - parts.append("=" * 80 + "\n\n") - - if pre_critique_paper: - parts.append("""PREVIOUS VERSION (This version received critiques and needs rebuilding): -The body section below was reviewed by peer critique. You must now rebuild it from scratch, -addressing the critique issues while maintaining the mathematical rigor and content that was correct. - ----BEGIN PREVIOUS VERSION--- -""") - parts.append(pre_critique_paper) - parts.append("\n---END PREVIOUS VERSION---\n\n") - - if critique_feedback: - parts.append("""ACCEPTED CRITIQUE FEEDBACK (Address these issues in your rewrite): -These critiques were validated as legitimate issues that need to be fixed. 
Your rewrite MUST address -each of these critique points while preserving the mathematical content that was correct. - -""") - parts.append(critique_feedback) - parts.append("\n---\n\n") - - parts.append("YOUR TASK: Rebuild the body section from scratch, addressing ALL critique feedback above.\n") - parts.append("=" * 80 + "\n---\n") - parts.append(f"USER COMPILER-DIRECTING PROMPT:\n{user_prompt}") parts.append("\n---\n") parts.append(f"CURRENT OUTLINE:\n{current_outline}") @@ -1090,6 +1090,7 @@ async def build_construction_prompt( - Use them if they help you achieve the strongest rigorous paper toward the user's prompt. - You may synthesize beyond them using sound mathematical reasoning. - Do NOT force coverage of every source entry. +- Prefer material that strengthens the paper's direct answer over broader auxiliary coverage. """) parts.append("\n---\n") @@ -1112,8 +1113,6 @@ async def build_phase_construction_prompt( phase: str, is_first_in_phase: bool = False, rejection_feedback: Optional[str] = None, - critique_feedback: Optional[str] = None, - pre_critique_paper: Optional[str] = None, brainstorm_content: Optional[str] = None ) -> str: """ @@ -1129,8 +1128,6 @@ async def build_phase_construction_prompt( phase: One of "body", "conclusion", "introduction", "abstract" is_first_in_phase: Whether this is the first submission in this phase rejection_feedback: Feedback from a previous rejection to guide the model - critique_feedback: Accepted critique feedback from peer review (for rewrites) - pre_critique_paper: Paper state before critique phase (for rewrites) brainstorm_content: Full brainstorm database with submission numbers (autonomous mode) Returns: @@ -1144,8 +1141,6 @@ async def build_phase_construction_prompt( is_first_portion=is_first_in_phase, section_phase=phase, rejection_feedback=rejection_feedback, - critique_feedback=critique_feedback, - pre_critique_paper=pre_critique_paper, brainstorm_content=brainstorm_content ) @@ -1161,8 +1156,6 @@ async def 
build_body_construction_prompt( rag_evidence: str, is_first_portion: bool = False, rejection_feedback: Optional[str] = None, - critique_feedback: Optional[str] = None, - pre_critique_paper: Optional[str] = None, brainstorm_content: Optional[str] = None ) -> str: """ @@ -1175,8 +1168,6 @@ async def build_body_construction_prompt( rag_evidence: RAG-retrieved evidence from aggregator database is_first_portion: Whether this is the first portion of the document rejection_feedback: Feedback from a previous rejection to guide the model - critique_feedback: Accepted critique feedback from peer review (for rewrites only) - pre_critique_paper: Paper state before critique phase (for rewrites - shows what failed) brainstorm_content: Full brainstorm database with submission numbers (autonomous mode) """ return await build_phase_construction_prompt( @@ -1187,8 +1178,6 @@ async def build_body_construction_prompt( phase="body", is_first_in_phase=is_first_portion, rejection_feedback=rejection_feedback, - critique_feedback=critique_feedback, - pre_critique_paper=pre_critique_paper, brainstorm_content=brainstorm_content ) diff --git a/backend/compiler/prompts/critique_prompts.py b/backend/compiler/prompts/critique_prompts.py index d5066fa..6679d3e 100644 --- a/backend/compiler/prompts/critique_prompts.py +++ b/backend/compiler/prompts/critique_prompts.py @@ -1,105 +1,52 @@ """ -Critique prompts for peer review aggregation phase. -Used after body section is complete to collect feedback before proceeding to conclusion. +Prompts for the compiler critique phase. + +The critique phase now collects validator-approved self-review notes and appends +them to the paper. It does not rewrite paper content. """ -from typing import Optional, Dict, List +from typing import Optional -CRITIQUE_EMPIRICAL_PROVENANCE_RULES = """EMPIRICAL PROVENANCE RULES: -- Classify substantive claims as one of: theoretical claim, literature claim, empirical claim, or artifact claim. 
-- Theoretical claims must be supported by sound derivation, proof, or explicit assumptions inside the document. -- Literature claims must identify the external source in-text. -- Empirical claims include benchmark numbers, latency, throughput, speedups, accuracy, perplexity, hardware metrics, ablations, and measured outcomes. +CRITIQUE_EMPIRICAL_PROVENANCE_RULES = """EMPIRICAL / ARTIFACT CLAIM POLICY: - Artifact claims include statements about code, kernels, experiments, logs, reproductions, or accompanying implementations. - Empirical or artifact claims may be accepted as factual ONLY when backed by an explicit external citation or a provided artifact in context. -- If such support is absent, they should be criticized, removed, or rewritten as hypotheses, validation plans, expected benefits, limitations, or future work. -- Never invent citations, experiments, benchmark numbers, hardware measurements, or code artifacts during critique or rewrite work.""" +- If such support is absent, they should be criticized, removed, or reframed as hypotheses, validation plans, expected benefits, limitations, or future work. +- Never invent citations, experiments, benchmark numbers, hardware measurements, or code artifacts during critique work.""" def get_critique_submitter_system_prompt() -> str: - """System prompt for generating critiques of body section.""" - return """You are a peer reviewer generating constructive criticism of a mathematical document's body section. + """System prompt for generating self-review critiques of the body section.""" + return """You are a peer reviewer generating constructive self-review notes for a mathematical document's body section. -⚠️ CRITICAL - INTERNAL CONTENT WARNING ⚠️ +IMPORTANT - INTERNAL CONTENT WARNING: ALL context provided to you (brainstorm databases, accepted submissions, papers, reference materials, outlines, previous document content) is AI-GENERATED within this research system. 
This content has NOT been peer-reviewed, published, or verified by external sources. YOU MUST TREAT ALL PROVIDED CONTEXT WITH EXTREME SKEPTICISM: -- NEVER assume claims are true because they "sound good" or "fit well" -- NEVER trust information simply because it appears in "accepted submissions" or "papers" -- ALWAYS verify information independently before using or building upon it -- NEVER cite internal documents as authoritative or established sources -- Question and validate every assertion, even if it appears in validated content +- NEVER assume claims are true because they sound good or fit well. +- NEVER trust information simply because it appears in accepted submissions or papers. +- ALWAYS verify information independently before using or building upon it. +- NEVER cite internal documents as authoritative or established sources. +- Question and validate every assertion, even if it appears in validated content. """ + CRITIQUE_EMPIRICAL_PROVENANCE_RULES + """ - The internal context shows what has been explored by AI agents, NOT what has been proven correct. Your role is to generate rigorous peer review feedback. Use internal context as exploration history and your base knowledge for reasoning and verification. - - WHEN IN DOUBT: Verify independently. Do not assume. Do not trust unverified internal context as truth. - ---- +The internal context shows what has been explored by AI agents, NOT what has been proven correct. Your role is to identify honest limitations, concerns, or improvement points for the final paper's self-review section. CRITICAL - YOU CAN DECLINE TO CRITIQUE: -If the body section is academically acceptable with only minor stylistic issues or cosmetic concerns, you may decline to provide a critique by setting critique_needed=false. +If the body section is academically acceptable with only minor stylistic issues or cosmetic concerns, you may decline by setting critique_needed=false. 
SOURCE MATERIAL POLICY: -- The aggregator/brainstorm database and reference papers are optional support for critique, not mandatory checklists -- Do NOT critique solely because the body does not explicitly cover some source material -- Do critique omitted material when the omission creates a genuine gap relative to the current outline, stated paper scope, or mathematical goals -- Focus on whether the paper itself is strong, rigorous, and aligned, not on exhaustively mirroring source inputs - -ACADEMICALLY ACCEPTABLE means: -- No mathematical errors or unsound reasoning -- No missing proofs or incomplete arguments -- No logical gaps affecting correctness -- Structural organization is coherent -- All outline requirements are met -- Content aligns with paper title and goals -- Mathematical rigor meets academic standards - -You should ONLY critique if you identify substantive issues that would improve mathematical correctness, logical soundness, or completeness. If the body is fundamentally sound with only minor issues (stylistic, cosmetic, or trivial), you should decline to critique. - ---- - -YOUR TASK: -Assess whether the body section needs substantive critique. If it does, identify specific issues, errors, gaps, or improvements needed. If it doesn't (academically acceptable), decline to critique. - -PROGRESSIVE SYSTEM: You will be called multiple times (up to 5 total attempts). Focus on identifying ONE specific, well-substantiated critique per turn. Do not try to list every issue at once — address the most important issue thoroughly this turn, and you will have further opportunities to raise additional issues. 
- -WHAT TO CRITIQUE - Focus on: -- Mathematical errors or unsound reasoning -- Missing proofs or incomplete arguments -- Logical gaps or unclear transitions between ideas -- Redundancy or unnecessary verbosity -- Structural issues (sections out of logical order, poor organization) -- Missing content that should be covered per the outline -- Content that doesn't align with the paper title/goal -- Unfounded claims or logical fallacies -- Insufficient mathematical rigor for an academic paper -- Fabricated experiments, unsupported benchmark numbers, uncited literature claims, or nonexistent code/artifact claims - -WHAT NOT TO CRITIQUE - Avoid: -- The conclusion, introduction, or abstract (not written yet) -- Stylistic preferences (focus on substance) -- Minor formatting or cosmetic issues -- Personal preferences about notation (unless causing confusion) +- The aggregator/brainstorm database and reference papers are optional support for critique, not mandatory checklists. +- Do NOT critique solely because the body does not explicitly cover some source material. +- Do critique omitted material when the omission creates a genuine gap relative to the current outline, stated paper scope, or mathematical goals. +- Focus on whether the paper itself is strong, rigorous, and aligned, not on exhaustively mirroring source inputs. -CRITICAL REQUIREMENTS: -- Be SPECIFIC: Point to exact sections, paragraphs, or claims -- Be CONSTRUCTIVE: Explain what should change and why -- Be ACTIONABLE: Provide clear direction for improvement -- Focus on SUBSTANCE: Mathematical correctness, logical soundness, completeness -- Explicitly call out unsupported empirical or artifact claims rather than treating them as minor issues - -Your critique will be validated against these criteria: -- Does it identify a legitimate issue that would improve the paper? -- Is it specific enough to be actionable? -- Is it constructive and substantive (not stylistic)? 
-- Is it non-redundant with existing accepted critiques? - -Or if declining to critique, your assessment will be validated against: -- Is the body indeed academically acceptable? -- Is your reasoning for declining sound? +CRITIQUE QUALITY REQUIREMENTS: +- Identify only substantive mathematical, logical, structural, or provenance issues. +- Be specific enough that a reader understands the limitation or concern. +- Do not propose direct edits or rewrites. The critique will be appended transparently as self-review. +- Do not list every possible issue. You will be called up to 3 total attempts, so focus on one important point per turn. Output your response ONLY as JSON in this exact format: { @@ -121,118 +68,69 @@ def get_critique_json_schema() -> str: } CRITICAL JSON ESCAPE RULES: -1. Backslashes: ALWAYS use double backslash (\\\\) for any backslash in your text - - Example: Write "\\\\tau" not "\\tau", write "\\\\(" not "\\(" -2. Quotes: Escape double quotes inside strings as \\" - - Example: "He said \\"hello\\"" -3. Newlines/Tabs: Use \\n for newlines (NOT \\\\n), \\t for tabs (NOT \\\\t) - - Example: "Line 1\\nLine 2" creates two lines -4. DO NOT use single backslashes except for: \\", \\\\, \\/, \\b, \\f, \\n, \\r, \\t, \\uXXXX -5. LaTeX notation: If your content contains mathematical expressions like \\Delta, \\tau, etc., - you MUST escape the backslash: write "\\\\Delta", "\\\\tau", "\\\\[", "\\\\]" - -Example (critique of mathematical error): -{ - "critique_needed": true, - "submission": "Section III contains a flawed proof of the convergence claim. The proof assumes uniform convergence without establishing the necessary conditions. Specifically, the argument on page 3 states 'the sequence converges' but does not verify the Cauchy criterion or provide bounds. 
This should be corrected by adding a lemma establishing uniform convergence via the Weierstrass M-test, with explicit bounds on the sequence terms.", - "reasoning": "This is a critical mathematical error that undermines the validity of the main theorem. Without establishing uniform convergence properly, the subsequent results are not rigorously justified." -} +1. Backslashes: ALWAYS use double backslash (\\\\) for any backslash in your text. +2. Quotes: Escape double quotes inside strings as \\\". +3. Newlines/Tabs: Use \\n for newlines, \\t for tabs. +4. DO NOT use single backslashes except for: \\\", \\\\, \\/, \\b, \\f, \\n, \\r, \\t, \\uXXXX. +5. LaTeX notation: If your content contains mathematical expressions like \\Delta, \\tau, etc., + you MUST escape the backslash: write "\\\\Delta", "\\\\tau", "\\\\[", "\\\\]". -Example (missing content per outline): +Example critique: { "critique_needed": true, - "submission": "The outline specifies a subsection on 'Baker's Theorem Applications' under Section IV, but this content is completely missing from the current body. The outline indicates this should cover explicit applications to transcendence problems, but the body jumps from Baker's Theorem statement directly to unrelated topics. This gap should be filled with concrete applications showing how Baker's theorem applies to specific transcendence questions.", - "reasoning": "Following the outline structure is essential for paper coherence. This missing content is explicitly planned in the outline and its absence creates a logical gap in the exposition." + "submission": "Section III asserts a convergence claim without establishing the needed uniform bound. This is a substantive limitation because later arguments depend on that convergence statement. 
The paper should be read with this proof gap in mind unless an independent bound is supplied.", + "reasoning": "This is a mathematical gap that affects the reliability of a downstream claim and is suitable for the self-review section." } -Example (decline - body is academically acceptable): +Example decline: { "critique_needed": false, "submission": "", - "reasoning": "After thorough review, the body section is academically acceptable. All mathematical proofs are rigorous and correct. The outline requirements are fully met. Content aligns with the paper title. While there are minor stylistic variations in notation (e.g., using both f(x) and f(·) interchangeably), these are cosmetic issues that don't affect mathematical correctness or comprehension. No substantive critique is warranted." + "reasoning": "The body section is academically acceptable for the current scope. The remaining issues are stylistic and do not warrant a substantive self-review critique." } """ def get_critique_validator_system_prompt() -> str: - """System prompt for validating critiques (reuses aggregator validator logic).""" - return """You are a validation agent reviewing peer review critiques of a mathematical document's body section. + """System prompt for validating critique submissions.""" + return """You are a validation agent reviewing peer-review critiques for a mathematical document's self-review section. -⚠️ CRITICAL - INTERNAL CONTENT WARNING ⚠️ +IMPORTANT - INTERNAL CONTENT WARNING: ALL context provided to you (brainstorm databases, accepted submissions, papers, reference materials, outlines, previous document content, critiques) is AI-GENERATED within this research system. This content has NOT been peer-reviewed, published, or verified by external sources. 
YOU MUST TREAT ALL PROVIDED CONTEXT WITH EXTREME SKEPTICISM: -- NEVER assume claims are true because they "sound good" or "fit well" -- NEVER trust information simply because it appears in "accepted submissions" or "papers" -- ALWAYS verify information independently before using or building upon it -- NEVER cite internal documents as authoritative or established sources -- Question and validate every assertion, even if it appears in validated content +- NEVER assume claims are true because they sound good or fit well. +- NEVER trust information simply because it appears in accepted submissions or papers. +- ALWAYS verify information independently before using or building upon it. +- NEVER cite internal documents as authoritative or established sources. +- Question and validate every assertion, even if it appears in validated content. """ + CRITIQUE_EMPIRICAL_PROVENANCE_RULES + """ - The internal context shows what has been explored by AI agents, NOT what has been proven correct. Your role is to validate peer review critiques. Use internal context as exploration history and your base knowledge for reasoning and verification. - - WHEN IN DOUBT: Verify independently. Do not assume. Do not trust unverified internal context as truth. - ---- - YOUR TASK: -Decide if this submission is valid - either a legitimate critique OR a justified decline assessment. - -For CRITIQUES (critique_needed=true): You are evaluating whether the critique database becomes more useful for improving the paper with this critique added than it was without it. - -For DECLINE ASSESSMENTS (critique_needed=false): You are evaluating whether the submitter's assessment that the body is academically acceptable is correct. +Decide if this submission is valid - either a legitimate self-review critique OR a justified decline assessment. -EVALUATION CRITERIA - Consider: -- Does the critique identify a genuine mathematical error or logical flaw? -- Does the critique point out missing content per the outline? 
-- Does the critique identify structural or organizational issues? -- Is the critique specific and actionable (not vague)? -- Is the critique substantive (not just stylistic preference)? -- Is the critique redundant with existing accepted critiques? -- Is the critique correct (or is the body section actually fine)? +For CRITIQUES (critique_needed=true): evaluate whether appending this critique would make the paper more transparent and honest for readers. -VALIDATION DECISION RULES: -A critique should be ACCEPTED if it: -1. Identifies a real mathematical error or unsound reasoning -2. Points out missing content explicitly planned in the outline -3. Identifies structural issues affecting coherence -4. Provides specific, actionable guidance for improvement -5. Is non-redundant with existing critiques -6. Correctly flags fabricated experiments, unsupported metrics, uncited external results, or nonexistent artifacts +For DECLINE ASSESSMENTS (critique_needed=false): evaluate whether the submitter's assessment that no substantive critique is needed is correct. -A critique should be REJECTED if it: -1. Is vague or unhelpful ("could be better" without specifics) -2. Is redundant with existing accepted critiques -3. Focuses on stylistic preferences, not substance -4. Is incorrect (the body section is actually correct) -5. Suggests changes that would reduce clarity or rigor -6. Is trivial or pedantic without meaningful impact +ACCEPT a critique if it: +1. Identifies a real mathematical error, proof gap, unsupported claim, structural problem, or material limitation. +2. Is specific and useful to readers. +3. Is substantive rather than stylistic. +4. Is non-redundant with existing accepted critiques. +5. Correctly flags fabricated experiments, unsupported metrics, uncited external results, or nonexistent artifacts. -VALIDATING DECLINE ASSESSMENTS (critique_needed=false): +REJECT a critique if it: +1. Is vague or unhelpful. +2. Is redundant with existing accepted critiques. +3. 
Focuses on stylistic preferences, not substance. +4. Is incorrect. +5. Criticizes selective non-use of optional source material without a real gap in the paper's stated scope. +6. Is trivial or pedantic without meaningful impact. -ACCEPT the decline if: -- Body is indeed academically acceptable (only minor stylistic or cosmetic issues) -- No substantive mathematical errors exist -- No logical gaps affecting correctness -- All outline requirements are met -- Submitter's reasoning for declining is sound and accurate -- Body meets required criteria for academic mathematical paper -- There are no unsupported empirical or artifact claims being presented as established fact -- The body is strong for its chosen scope even if some source material remains unused - -REJECT the decline if: -- Submitter missed substantive issues you can identify -- Body has mathematical errors or unsound reasoning -- Body has logical gaps or incomplete arguments -- Missing content required by outline -- Body misaligned with paper title or goals -- Decline reasoning is weak, incorrect, or fails to recognize real issues - -For critiques, ask yourself: "Does adding this critique to our feedback database make us more capable of improving the paper than we were without it?" - -For declines, ask yourself: "Is the body indeed academically acceptable with only minor issues, or did the submitter miss substantive problems?" +For declines, ACCEPT only if the body is academically acceptable and any remaining issues are minor. REJECT if a substantive issue was missed. Output your decision ONLY as JSON in this exact format: { @@ -249,334 +147,20 @@ def get_critique_validation_json_schema() -> str: REQUIRED JSON FORMAT: { "decision": "accept" OR "reject", - "reasoning": "string - detailed explanation of your decision", - "summary": "string - brief summary (max 750 chars, used for rejection feedback)" -} - -CRITICAL JSON ESCAPE RULES: -1. 
Backslashes: ALWAYS use double backslash (\\\\) for any backslash in your text - - Example: Write "\\\\tau" not "\\tau", write "\\\\(" not "\\(" -2. Quotes: Escape double quotes inside strings as \\" - - Example: "He said \\"hello\\"" -3. Newlines/Tabs: Use \\n for newlines (NOT \\\\n), \\t for tabs (NOT \\\\t) - - Example: "Line 1\\nLine 2" creates two lines -4. DO NOT use single backslashes except for: \\", \\\\, \\/, \\b, \\f, \\n, \\r, \\t, \\uXXXX -5. LaTeX notation: If your content contains mathematical expressions like \\Delta, \\tau, etc., - you MUST escape the backslash: write "\\\\Delta", "\\\\tau", "\\\\[", "\\\\]" - -Example (Accept): -{ - "decision": "accept", - "reasoning": "This critique correctly identifies a missing convergence proof in Section III. The body claims uniform convergence without establishing it, which is a genuine mathematical gap that needs addressing. The critique is specific, actionable, and substantive.", - "summary": "" -} - -Example (Reject - Vague): -{ - "decision": "reject", - "reasoning": "This critique says 'Section II could be clearer' without identifying specific issues or suggesting concrete improvements. It's too vague to be actionable.", - "summary": "Critique is too vague - must identify specific issues and suggest concrete improvements." -} - -Example (Reject - Redundant): -{ - "decision": "reject", - "reasoning": "This critique about the missing Baker's theorem application is redundant with already-accepted critique #3, which made the same observation with more detail.", - "summary": "Redundant with critique #3 which already identified this gap." -} - -Example (Accept Decline - Body is acceptable): -{ - "decision": "accept", - "reasoning": "The submitter correctly assessed that the body is academically acceptable. After reviewing the body section, I confirm there are no mathematical errors, all proofs are rigorous and complete, outline requirements are fully met, and content aligns with the paper goals. 
The only issues present are minor stylistic variations in notation, which do not affect mathematical correctness. The decline is justified.", - "summary": "" -} - -Example (Reject Decline - Submitter missed issues): -{ - "decision": "reject", - "reasoning": "The submitter declined to critique, claiming the body is academically acceptable. However, Section III contains a significant error: the proof assumes uniform convergence without establishing it. This is a substantive mathematical gap that requires critique. The decline assessment is incorrect.", - "summary": "Decline rejected - Section III contains missing convergence proof that needs to be critiqued." -} -""" - - -def get_rewrite_decision_system_prompt() -> str: - """System prompt for rewrite vs continue decision.""" - return """You are reviewing aggregated peer review critiques to decide if the body section needs revision. - -⚠️ CRITICAL - INTERNAL CONTENT WARNING ⚠️ - -ALL context provided to you (brainstorm databases, accepted submissions, papers, reference materials, outlines, previous document content, critiques) is AI-GENERATED within this research system. This content has NOT been peer-reviewed, published, or verified by external sources. - -YOU MUST TREAT ALL PROVIDED CONTEXT WITH EXTREME SKEPTICISM: -- NEVER assume claims are true because they "sound good" or "fit well" -- NEVER trust information simply because it appears in "accepted submissions" or "papers" -- ALWAYS verify information independently before using or building upon it -- NEVER cite internal documents as authoritative or established sources -- Question and validate every assertion, even if it appears in validated content - -""" + CRITIQUE_EMPIRICAL_PROVENANCE_RULES + """ - - The internal context shows what has been explored by AI agents, NOT what has been proven correct. Your role is to make an informed rewrite decision. Use internal context as exploration history and your base knowledge for reasoning and verification. 
- - WHEN IN DOUBT: Verify independently. Do not assume. Do not trust unverified internal context as truth. - ---- - -YOUR TASK: -Review all accepted critiques and decide what action to take for the body section. - -**CRITIQUE COLLECTION CONTEXT**: The peer review phase collected critiques through multiple attempts. ALL accepted critiques are provided below (typically 1-3 accepted out of 5 total attempts). Review each accepted critique on its individual merits. - -DECISION OPTIONS: -1. **CONTINUE** - Critiques are minor/incorrect. Proceed to conclusion phase. -2. **PARTIAL_REVISION** - Critiques identify fixable issues. You will then apply edits ONE AT A TIME in an iterative loop. -3. **TOTAL_REWRITE** - Critiques reveal catastrophic flaws. Delete entire body and rebuild from scratch. - -CRITICAL GUIDANCE ON WHEN TO USE EACH: - -**Use CONTINUE when:** -- Critiques are stylistic preferences without substance -- Critiques are incorrect (the body is actually fine) -- Small gaps that can be addressed in future editing phases -- Issues don't affect overall mathematical correctness - -**Use PARTIAL_REVISION when:** -- Specific sections have errors that can be fixed with targeted edits -- Missing content can be inserted at specific locations -- Redundant paragraphs need removal -- Most of the body is sound, only specific parts need correction -- Critiques point to fixable issues in isolated sections - -IMPORTANT - PARTIAL_REVISION IS ITERATIVE: -If you choose PARTIAL_REVISION, you will then be prompted to propose edits ONE AT A TIME. -Each edit will be validated and applied, then you will see the updated paper and propose the next edit. -This continues until you indicate all edits are complete. -You do NOT specify edit_operations in this decision - that happens in the iterative loop. 
- -**Use TOTAL_REWRITE when (ONLY AS LAST RESORT):** -- Fundamental mathematical errors pervasive throughout the body -- Body is fundamentally misaligned with paper title/stated goal -- Structural problems require complete reorganization -- Multiple critical gaps that can't be addressed with isolated edits -- The body fundamentally doesn't achieve what the paper claims - -**IMPORTANT - NEXT STEPS CONTEXT:** - -If you choose PARTIAL_REVISION or TOTAL_REWRITE, you can also: -1. Change the paper title (if body reveals scope drift) -2. Update the outline (if structure needs changes) - -This means you have FULL control to revise the paper comprehensively. However: -- TOTAL_REWRITE should ONLY be used when absolutely necessary -- Total rewrites are difficult and can introduce errors in areas that were previously correct -- Even with feedback, rewriting from scratch can lose coherence -- Prefer PARTIAL_REVISION whenever the issues are localized and fixable - -CRITICAL - REWRITE SCOPE: -If you choose TOTAL_REWRITE, the ENTIRE body section will be deleted and rewritten from scratch. The rewrite will have access to: -- All original context (aggregator database, reference papers, etc.) -- The PRE-CRITIQUE PAPER (what the body looked like before this revision cycle) -- ALL critiques from ALL previous failed versions (accumulated feedback history) -- Current version's accepted critiques - -ACCUMULATED CRITIQUE HISTORY: -If this is not the first critique phase, you will see critiques from ALL previous failed versions. -These are labeled clearly as "FAILED - REWRITTEN" versions. Use this accumulated feedback -to understand what went wrong in past attempts and avoid repeating the same mistakes. 
- -SOURCE MATERIAL POLICY: -- The aggregator/brainstorm database and reference papers are optional supports during rewrite decisions, not mandatory checklists -- Do NOT choose PARTIAL_REVISION or TOTAL_REWRITE solely to force coverage of unused source material -- Do choose revision when the current body is genuinely weaker, incomplete for its chosen scope, misaligned with the outline/title, or mathematically unsound - -Output your decision ONLY as JSON in this exact format: -{ - "decision": "continue | partial_revision | total_rewrite", - "new_title": "New paper title (or null if keeping current)", - "new_outline": "Updated outline content (or null if keeping current)", - "reasoning": "Detailed explanation of your decision and rationale for any title/outline changes" -} -""" - - -def get_rewrite_decision_json_schema() -> str: - """Get JSON schema specification for rewrite decision.""" - return """ -REQUIRED JSON FORMAT: -{ - "decision": "continue" OR "partial_revision" OR "total_rewrite", - "new_title": "string (new paper title) OR null (keep current)", - "new_outline": "string (updated outline) OR null (keep current)", - "reasoning": "string - detailed explanation of decision" -} - -NOTE ON PARTIAL_REVISION: -If you choose "partial_revision", you will NOT specify edit operations here. -Instead, you will be prompted to propose edits ONE AT A TIME in an iterative loop. -Each edit will be validated and applied, then you'll see the updated paper before proposing the next edit. - -CRITICAL JSON ESCAPE RULES: -1. Backslashes: ALWAYS use double backslash (\\\\) for any backslash in your text - - Example: Write "\\\\tau" not "\\tau", write "\\\\(" not "\\(" -2. Quotes: Escape double quotes inside strings as \\" - - Example: "He said \\"hello\\"" -3. Newlines/Tabs: Use \\n for newlines (NOT \\\\n), \\t for tabs (NOT \\\\t) - - Example: "Line 1\\nLine 2" creates two lines -4. DO NOT use single backslashes except for: \\", \\\\, \\/, \\b, \\f, \\n, \\r, \\t, \\uXXXX -5. 
LaTeX notation: If your content contains mathematical expressions like \\Delta, \\tau, etc., - you MUST escape the backslash: write "\\\\Delta", "\\\\tau", "\\\\[", "\\\\]" - -Example (CONTINUE - Minor Issues): -{ - "decision": "continue", - "new_title": null, - "new_outline": null, - "reasoning": "After reviewing the accepted critiques, the issues identified are minor and do not warrant any revision. Critiques #1 and #3 point out small notation inconsistencies that can be addressed in review phase. Critique #2 suggests stylistic changes without substantive mathematical impact. The body section is fundamentally sound and aligned with the paper title. Proceeding to conclusion phase." -} - -Example (PARTIAL_REVISION - Triggers Iterative Edit Loop): -{ - "decision": "partial_revision", - "new_title": null, - "new_outline": null, - "reasoning": "Critiques identify two fixable issues: (1) missing convergence proof in Section III, (2) missing Corollary 3.1. These can be addressed with targeted edits without rewriting the entire body, which is otherwise mathematically sound. Will propose edits one at a time in the iterative loop." -} - -Example (TOTAL_REWRITE - Catastrophic Issues): -{ - "decision": "total_rewrite", - "new_title": "Transcendence Methods in Modern Number Theory: From Lindemann-Weierstrass to Baker", - "new_outline": "Abstract\\n\\nI. Introduction\\n A. Historical development\\n B. Scope and goals\\n\\nII. Classical Transcendence Theory\\n A. Lindemann-Weierstrass theorem\\n B. Applications to geometric constructibility\\n\\nIII. Baker's Theorem and Linear Forms\\n A. Statement and proof outline\\n B. Applications to Diophantine equations\\n\\nIV. Modern Developments\\n A. Recent refinements\\n B. Computational aspects\\n\\nV. 
Conclusion", - "reasoning": "Critiques #1-#8 reveal fundamental problems: the entire approach to the convergence argument is flawed from first principles, structural organization makes sections incomprehensible, and body has drifted to cover different scope than title. These issues are too pervasive for targeted edits. A complete rebuild is necessary." -} -""" - - -def get_rewrite_decision_validator_system_prompt() -> str: - """System prompt for validating rewrite decisions.""" - return """You are validating a rewrite decision made after reviewing peer review critiques. - -⚠️ CRITICAL - INTERNAL CONTENT WARNING ⚠️ - -ALL context provided to you (brainstorm databases, accepted submissions, papers, reference materials, outlines, previous document content, critiques, decisions) is AI-GENERATED within this research system. This content has NOT been peer-reviewed, published, or verified by external sources. - -YOU MUST TREAT ALL PROVIDED CONTEXT WITH EXTREME SKEPTICISM: -- NEVER assume claims are true because they "sound good" or "fit well" -- NEVER trust information simply because it appears in "accepted submissions" or "papers" -- ALWAYS verify information independently before using or building upon it -- NEVER cite internal documents as authoritative or established sources -- Question and validate every assertion, even if it appears in validated content - -""" + CRITIQUE_EMPIRICAL_PROVENANCE_RULES + """ - - The internal context shows what has been explored by AI agents, NOT what has been proven correct. Use internal context and your base knowledge for validation. - - WHEN IN DOUBT: Verify independently. Do not assume. Do not trust unverified internal context as truth. - ---- - -YOUR TASK: -Validate whether the rewrite decision (CONTINUE, PARTIAL_REVISION, or TOTAL_REWRITE) is justified based on all accepted critiques and current body content. 
- -VALIDATION CRITERIA - Consider: - -**ACCEPT "continue" decision if:** -- Critiques are indeed minor or incorrect -- Body is fundamentally sound despite critique issues -- Issues can be addressed without any revision -- Title and body remain aligned - -**ACCEPT "partial_revision" decision if:** -- Critiques identify specific, localized issues -- Proposed edit operations would fix the identified problems -- Edit operations are appropriate (correct operation types, reasonable old_string/new_string) -- Most of the body is sound, only targeted fixes needed -- Title change (if proposed) is justified -- Outline update (if proposed) improves structure - -**ACCEPT "total_rewrite" decision if:** -- Critiques reveal catastrophic issues (pervasive math errors, fundamental misalignment, structural chaos) -- Total rewrite is justified - issues too widespread for targeted edits -- Partial revision would not be sufficient -- Title change (if proposed) is justified by scope drift -- Outline update (if proposed) improves structure - -**REJECT decision if:** -- Reasoning doesn't match the critiques (illogical conclusion) -- "Continue" chosen despite substantive issues in critiques -- "Total_rewrite" chosen for minor or fixable issues (should use partial_revision) -- "Partial_revision" chosen but edit operations are vague or incorrect -- Title change proposed without justification from critiques -- Decision appears arbitrary or not evidence-based - -SOURCE MATERIAL POLICY: -- The source database is optional support, not a mandatory checklist -- Do NOT reject a decision solely because it leaves some source material unused -- Do reject if the decision ignores source material only when that omission clearly makes the chosen scope weaker, incoherent, or misaligned with the outline/title - -Ask yourself: "Is this decision the right response to the accepted critiques? Is the chosen level of revision appropriate?" 
- -Output your decision ONLY as JSON in this exact format: -{ - "decision": "accept or reject", - "reasoning": "Detailed explanation of your validation decision" -} -""" - - -def get_rewrite_decision_validation_json_schema() -> str: - """Get JSON schema specification for rewrite decision validation.""" - return """ -REQUIRED JSON FORMAT: -{ - "decision": "accept" OR "reject", - "reasoning": "string - detailed explanation of your validation decision" + "reasoning": "string - detailed explanation", + "summary": "string - rejection summary if rejected, empty string if accepted" } CRITICAL JSON ESCAPE RULES: -1. Backslashes: ALWAYS use double backslash (\\\\) for any backslash in your text - - Example: Write "\\\\tau" not "\\tau", write "\\\\(" not "\\(" -2. Quotes: Escape double quotes inside strings as \\" - - Example: "He said \\"hello\\"" -3. Newlines/Tabs: Use \\n for newlines (NOT \\\\n), \\t for tabs (NOT \\\\t) - - Example: "Line 1\\nLine 2" creates two lines -4. DO NOT use single backslashes except for: \\", \\\\, \\/, \\b, \\f, \\n, \\r, \\t, \\uXXXX -5. LaTeX notation: If your content contains mathematical expressions like \\Delta, \\tau, etc., - you MUST escape the backslash: write "\\\\Delta", "\\\\tau", "\\\\[", "\\\\]" - -Example (Accept continue decision): -{ - "decision": "accept", - "reasoning": "The decision to CONTINUE is justified. The critiques are indeed minor issues: 3 stylistic suggestions, 4 notation clarifications, 2 incorrect critiques (the proofs are actually valid), and 1 small gap that can be filled in review phase. No fundamental mathematical errors were identified. Proceeding to conclusion is appropriate." -} - -Example (Accept partial_revision decision): -{ - "decision": "accept", - "reasoning": "The decision to use PARTIAL_REVISION is justified. The critiques identify 2 specific, fixable issues: missing convergence proof in Section III and missing Corollary 3.1. 
The proposed edit operations correctly target these issues with appropriate old_string/new_string replacements. Most of the body is mathematically sound - targeted edits are more appropriate than a complete rewrite." -} - -Example (Accept total_rewrite decision): -{ - "decision": "accept", - "reasoning": "The decision to use TOTAL_REWRITE is justified. The critiques reveal catastrophic problems: 4 critiques identify fundamental errors in the convergence arguments that permeate multiple sections, 3 point out missing content explicitly in the outline, 2 show the body has drifted to cover different scope than the title. These issues are too pervasive for targeted edits. A complete rebuild is necessary." -} - -Example (Reject - should use partial_revision instead): -{ - "decision": "reject", - "reasoning": "The decision to use TOTAL_REWRITE is NOT justified. The critiques identify only 2 specific issues: missing convergence proof in Section III and missing corollary. These are localized problems that can be fixed with targeted edits. The rest of the body is mathematically sound. The decision should be PARTIAL_REVISION, not TOTAL_REWRITE." -} +1. Backslashes: ALWAYS use double backslash (\\\\) for any backslash in your text. +2. Quotes: Escape double quotes inside strings as \\\". +3. Newlines/Tabs: Use \\n for newlines, \\t for tabs. +4. DO NOT use single backslashes except for: \\\", \\\\, \\/, \\b, \\f, \\n, \\r, \\t, \\uXXXX. +5. LaTeX notation: If your content contains mathematical expressions like \\Delta, \\tau, etc., + you MUST escape the backslash: write "\\\\Delta", "\\\\tau", "\\\\[", "\\\\]". 
""" -# ============================================================================= -# PROMPT BUILDERS -# ============================================================================= - - def build_critique_prompt( user_prompt: str, current_body: str, @@ -587,22 +171,7 @@ def build_critique_prompt( rejection_feedback: Optional[str] = None, accumulated_history: Optional[str] = None ) -> str: - """ - Build complete prompt for critique generation. - - Args: - user_prompt: The user's compiler-directing prompt - current_body: The body section to critique - current_outline: The paper outline - aggregator_db: The aggregator database content - reference_papers: Optional reference paper content - critique_feedback: Optional existing critiques (for context) - rejection_feedback: Optional rejection feedback (last 5 rejections) - accumulated_history: Optional accumulated critique history from previous failed versions - - Returns: - Complete assembled prompt - """ + """Build complete prompt for critique generation.""" parts = [ get_critique_submitter_system_prompt(), "\n---\n", @@ -610,7 +179,7 @@ def build_critique_prompt( "\n---\n", f"USER COMPILER-DIRECTING PROMPT:\n{user_prompt}", "\n---\n", - f"PAPER TITLE:\n{user_prompt}", # Using compiler prompt as title context + f"PAPER TITLE:\n{user_prompt}", "\n---\n", f"CURRENT OUTLINE:\n{current_outline}", "\n---\n", @@ -620,480 +189,39 @@ def build_critique_prompt( - The source database below is optional support, not a mandatory checklist. - Use it to identify genuine gaps or contradictions if helpful. - Do NOT critique solely because some source entries were not used. +- Do use it if it reveals that the body missed a stronger direct-answer path. 
""", "\n---\n", f"SOURCE DATABASE (optional support - use if helpful):\n{aggregator_db}", ] - + if reference_papers: parts.extend([ "\n---\n", f"REFERENCE PAPERS:\n{reference_papers}" ]) - + if accumulated_history: parts.extend([ "\n---\n", accumulated_history ]) - + if critique_feedback: parts.extend([ "\n---\n", f"EXISTING ACCEPTED CRITIQUES (CURRENT VERSION):\n{critique_feedback}" ]) - + if rejection_feedback: parts.extend([ "\n---\n", f"YOUR LAST 5 REJECTIONS (Learn from these):\n{rejection_feedback}" ]) - - parts.extend([ - "\n---\n", - "Now generate your critique as JSON:" - ]) - - return ''.join(parts) - -def build_rewrite_decision_prompt( - user_prompt: str, - current_body: str, - current_outline: str, - current_title: str, - aggregator_db: str, - critique_feedback: str, - pre_critique_paper: str, - reference_papers: Optional[str] = None, - accumulated_history: Optional[str] = None -) -> str: - """ - Build complete prompt for rewrite vs continue decision. - - Args: - user_prompt: The user's compiler-directing prompt - current_body: The body section being evaluated - current_outline: The paper outline - current_title: The current paper title - aggregator_db: The aggregator database content - critique_feedback: All accepted critiques (typically 1-3 out of 5 total attempts) - pre_critique_paper: Paper snapshot from START of critique phase (for rewrite context) - reference_papers: Optional reference paper content - accumulated_history: Optional accumulated critique history from previous failed versions - - Returns: - Complete assembled prompt - """ - parts = [ - get_rewrite_decision_system_prompt(), - "\n---\n", - get_rewrite_decision_json_schema(), - "\n---\n", - f"USER COMPILER-DIRECTING PROMPT:\n{user_prompt}", - "\n---\n", - f"CURRENT PAPER TITLE:\n{current_title}", - "\n---\n", - f"CURRENT OUTLINE:\n{current_outline}", - "\n---\n", - f"PRE-CRITIQUE PAPER (body at START of this revision cycle):\n{pre_critique_paper}", - "\n---\n", - f"CURRENT BODY SECTION 
(after critique phase):\n{current_body}", - ] - - if accumulated_history: - parts.extend([ - "\n---\n", - accumulated_history - ]) - - parts.extend([ - "\n---\n", - f"ALL ACCEPTED CRITIQUES (CURRENT VERSION):\n{critique_feedback}", - "\n---\n", - """OPTIONAL SOURCE MATERIAL POLICY: -- The source database below is optional support, not a mandatory checklist. -- Use it if it helps judge whether the body's chosen scope is genuinely weak, incomplete, or misaligned. -- Do NOT force rewrite solely to cover unused source material. -""", - "\n---\n", - f"SOURCE DATABASE (optional support - use if helpful):\n{aggregator_db}", - ]) - - if reference_papers: - parts.extend([ - "\n---\n", - f"REFERENCE PAPERS:\n{reference_papers}" - ]) - parts.extend([ "\n---\n", - "Review all critiques and decide whether to REWRITE the body or CONTINUE to conclusion. Respond as JSON:" - ]) - - return ''.join(parts) - - -def build_rewrite_decision_validation_prompt( - user_prompt: str, - current_body: str, - current_outline: str, - current_title: str, - critique_feedback: str, - decision_result: Dict, - aggregator_db: str -) -> str: - """ - Build complete prompt for validating the rewrite decision. 
- - Args: - user_prompt: The user's compiler-directing prompt - current_body: The body section - current_outline: The paper outline - current_title: Current paper title - critique_feedback: All accepted critiques (typically 1-3 out of 5 total attempts) - decision_result: The decision being validated - aggregator_db: The aggregator database content - - Returns: - Complete assembled prompt - """ - decision = decision_result.get('decision', 'unknown') - new_title = decision_result.get('new_title', None) - new_outline = decision_result.get('new_outline', None) - reasoning = decision_result.get('reasoning', '') - - parts = [ - get_rewrite_decision_validator_system_prompt(), - "\n---\n", - get_rewrite_decision_validation_json_schema(), - "\n---\n", - f"USER COMPILER-DIRECTING PROMPT:\n{user_prompt}", - "\n---\n", - f"CURRENT PAPER TITLE:\n{current_title}", - "\n---\n", - f"CURRENT OUTLINE:\n{current_outline}", - "\n---\n", - f"CURRENT BODY SECTION:\n{current_body}", - "\n---\n", - f"ALL ACCEPTED CRITIQUES:\n{critique_feedback}", - "\n---\n", - """OPTIONAL SOURCE MATERIAL POLICY: -- The source database below is optional support, not a mandatory checklist. -- Use it if needed to judge whether the proposed decision is genuinely stronger or weaker. -- Do NOT reject solely because not all source material is being used. -""", - "\n---\n", - f"SOURCE DATABASE (optional support - use if helpful):\n{aggregator_db}", - "\n---\n", - f"PROPOSED DECISION:\n", - f"Decision: {decision}\n", - f"New Title: {new_title if new_title else '(keep current)'}\n", - f"New Outline: {new_outline if new_outline else '(keep current)'}\n", - f"Reasoning: {reasoning}", - "\n---\n", - "Validate whether this decision is justified based on the critiques. 
Respond as JSON:" - ] - - return ''.join(parts) - - -# ============================================================================ -# ITERATIVE PARTIAL REVISION PROMPTS -# ============================================================================ - -def get_iterative_edit_system_prompt() -> str: - """System prompt for iterative partial revision - proposing one edit at a time.""" - return """You are making targeted edits to a mathematical document body to address peer review critiques. - -⚠️ CRITICAL - INTERNAL CONTENT WARNING ⚠️ - -ALL context provided to you (papers, outlines, critiques) is AI-GENERATED within this research system. -This content has NOT been peer-reviewed, published, or verified by external sources. -Treat all provided context with extreme skepticism. - -YOUR TASK: -You are in an ITERATIVE EDIT LOOP. You have been shown: -1. The PRE-CRITIQUE PAPER (how the body looked before this revision cycle started) -2. The CURRENT PAPER (the body after any edits applied so far in this loop) -3. The ACCEPTED CRITIQUES (problems identified that need fixing) -4. The EDITS ALREADY APPLIED (what has been changed so far) - -Your job is to propose ONE EDIT at a time to address the remaining critique issues. -After each edit is validated and applied, you will see the updated paper and can propose the next edit. 
- -EDIT OPERATIONS USE EXACT STRING MATCHING: -- old_string must exist VERBATIM and UNIQUELY in the CURRENT paper body -- Include enough context (3-5 lines) to ensure uniqueness -- If the exact string is not found or is ambiguous, the edit will be rejected - -OPERATION TYPES: -- **replace**: Find old_string, replace with new_string -- **insert_after**: Find old_string, insert new_string immediately after it -- **delete**: Find old_string, remove it (new_string should be empty) - -WHEN TO SET more_edits_needed: -- TRUE: More critique issues remain to be addressed -- FALSE: All critique issues have been addressed (or best effort has been made) - -IMPORTANT: -- Focus on ONE edit at a time -- Address the most critical issues first -- Each edit should be substantial and address specific critique feedback -- Do NOT make cosmetic changes - focus on mathematical/structural issues identified in critiques -- If you believe all issues are addressed, set more_edits_needed to false -- If critique issues involve unsupported empirical or artifact claims, remove them or rewrite them as hypotheses, validation plans, expected benefits, limitations, or future work -- Never preserve fabricated experiments, unsupported benchmark numbers, or nonexistent code claims as if they were verified - -Output your response ONLY as JSON in the exact format specified. -""" - - -def get_iterative_edit_json_schema() -> str: - """Get JSON schema for iterative edit response.""" - return """ -REQUIRED JSON FORMAT: -{ - "operation": "replace | insert_after | delete", - "old_string": "Exact text to find in the CURRENT paper body (must be unique)", - "new_string": "Replacement/insertion text (empty string for delete)", - "reasoning": "Which critique issue this edit addresses and why this change fixes it", - "more_edits_needed": true OR false -} - -CRITICAL JSON ESCAPE RULES: -1. 
Backslashes: ALWAYS use double backslash (\\\\) for any backslash in your text - - Example: Write "\\\\tau" not "\\tau", write "\\\\(" not "\\(" -2. Quotes: Escape double quotes inside strings as \\" - - Example: "He said \\"hello\\"" -3. Newlines/Tabs: Use \\n for newlines (NOT \\\\n), \\t for tabs (NOT \\\\t) - - Example: "Line 1\\nLine 2" creates two lines -4. DO NOT use single backslashes except for: \\", \\\\, \\/, \\b, \\f, \\n, \\r, \\t, \\uXXXX -5. LaTeX notation: MUST escape backslash: write "\\\\Delta", "\\\\tau", "\\\\[", "\\\\]" - -Example (Replace - Fix missing proof): -{ - "operation": "replace", - "old_string": "The proof assumes uniform convergence without establishing the necessary conditions.", - "new_string": "We establish uniform convergence via the Weierstrass M-test. The series satisfies |f_n(x)| ≤ M_n with ∑M_n < ∞, therefore uniform convergence follows immediately.", - "reasoning": "Critique #1 identified that the convergence proof was assumed rather than proven. Adding the rigorous justification using Weierstrass M-test.", - "more_edits_needed": true -} - -Example (Insert After - Add missing corollary): -{ - "operation": "insert_after", - "old_string": "This completes the proof of Theorem 3. ∎", - "new_string": "\\n\\nCorollary 3.1. As an immediate consequence of Theorem 3, we obtain the following bound on the error term:\\n\\n|R_n(x)| ≤ C · n^{-α}\\n\\nfor some constant C > 0 independent of n.", - "reasoning": "Critique #3 noted that Corollary 3.1 from the outline was missing. Adding it directly after the proof of Theorem 3 where it logically belongs.", - "more_edits_needed": false -} - -Example (Delete - Remove redundant section): -{ - "operation": "delete", - "old_string": "We pause to note that this result is analogous to several classical results in the literature, including the work of Smith (1995), Jones (2001), and Brown (2010). 
While a full comparison is beyond the scope of this paper, the interested reader may consult these references for additional context.", - "new_string": "", - "reasoning": "Critique #2 identified this paragraph as redundant filler that doesn't add mathematical substance. Removing to improve focus.", - "more_edits_needed": true -} -""" - - -def build_iterative_edit_prompt( - user_prompt: str, - pre_critique_paper: str, - current_paper: str, - current_outline: str, - critique_feedback: str, - edits_applied: List[Dict], - reference_papers: Optional[str] = None, - accumulated_critique_history: str = "" -) -> str: - """ - Build prompt for iterative partial revision edit. - - Args: - user_prompt: The user's compiler-directing prompt - pre_critique_paper: Paper snapshot from START of critique phase - current_paper: Current paper body (after any edits applied so far) - current_outline: The paper outline - critique_feedback: All accepted critiques from this revision cycle - edits_applied: List of edits already applied in this iteration - reference_papers: Optional reference paper content - accumulated_critique_history: Critiques from previous failed versions (if any) - - Returns: - Complete assembled prompt - """ - parts = [ - get_iterative_edit_system_prompt(), - "\n---\n", - get_iterative_edit_json_schema(), - "\n---\n", - f"USER COMPILER-DIRECTING PROMPT:\n{user_prompt}", - "\n---\n", - f"CURRENT OUTLINE:\n{current_outline}", - "\n---\n", - ] - - # Add accumulated history if present - if accumulated_critique_history: - parts.extend([ - f"ACCUMULATED CRITIQUE HISTORY (from previous failed versions):\n{accumulated_critique_history}", - "\n---\n", - ]) - - parts.extend([ - f"ACCEPTED CRITIQUES (issues to address):\n{critique_feedback}", - "\n---\n", - f"PRE-CRITIQUE PAPER (how the body looked before this revision cycle):\n{pre_critique_paper}", - "\n---\n", - f"CURRENT PAPER (after {len(edits_applied)} edit(s) applied):\n{current_paper}", - "\n---\n", + "Now generate your 
critique as JSON:" ]) - if reference_papers: - parts.extend([ - f"REFERENCE PAPERS:\n{reference_papers}", - "\n---\n", - ]) - - # Show edits already applied - if edits_applied: - edits_str = "\n".join([ - f"Edit {i+1}: {e['operation']} - {e.get('reasoning', 'N/A')[:100]}..." - for i, e in enumerate(edits_applied) - ]) - parts.extend([ - f"EDITS ALREADY APPLIED:\n{edits_str}", - "\n---\n", - ]) - else: - parts.extend([ - "EDITS ALREADY APPLIED: None yet - this is the first edit.", - "\n---\n", - ]) - - parts.append( - "Propose your NEXT edit to address remaining critique issues, or set more_edits_needed=false if all issues are resolved. Respond as JSON:" - ) - - return ''.join(parts) - - -# ============================================================================ -# PARTIAL REVISION EDIT VALIDATION PROMPTS -# ============================================================================ - -def get_partial_revision_validation_system_prompt() -> str: - """System prompt for validating individual partial revision edits.""" - return """You are validating a proposed edit to a mathematical document. - -The edit is part of an iterative partial revision to address peer review critiques. - -EMPIRICAL PROVENANCE RULES: -- Empirical claims (benchmarks, speedups, latency, accuracy, perplexity, hardware measurements) must not remain stated as fact unless backed by explicit citation or provided artifact support. -- Artifact claims (code, kernels, experiments, logs, accompanying implementations) must not remain stated as fact unless backed by explicit citation or provided artifact support. -- If the edit rewrites unsupported empirical/artifact claims into hypotheses, validation plans, expected benefits, limitations, or future work, that is a valid improvement. - -YOUR TASK: -Validate whether this specific edit should be ACCEPTED or REJECTED. - -ACCEPT the edit if: -1. It addresses one or more issues identified in the accepted critiques -2. 
The old_string exists in the current paper and is unambiguous -3. The new_string improves the mathematical content or addresses critique feedback -4. The edit maintains coherence with the surrounding text -5. The edit is mathematically sound - -REJECT the edit if: -1. The edit does NOT address any critique issues -2. The old_string does not exist or is ambiguous in the current paper -3. The new_string introduces errors or reduces quality -4. The edit breaks coherence with surrounding content -5. The edit is mathematically unsound or introduces logical errors -6. The edit is purely cosmetic and doesn't address critiques -7. The edit preserves fabricated experiments, unsupported metrics, or nonexistent artifact claims as established fact - -Output your decision as JSON. -""" - - -def get_partial_revision_validation_json_schema() -> str: - """Get JSON schema for partial revision edit validation.""" - return """ -REQUIRED JSON FORMAT: -{ - "decision": "accept" OR "reject", - "reasoning": "string - explanation of why the edit should or should not be accepted" -} - -CRITICAL JSON ESCAPE RULES: -1. Backslashes: ALWAYS use double backslash (\\\\) for any backslash in your text -2. Quotes: Escape double quotes inside strings as \\" -3. Newlines/Tabs: Use \\n for newlines (NOT \\\\n), \\t for tabs - -Example (Accept): -{ - "decision": "accept", - "reasoning": "The edit correctly addresses critique #1 which identified a missing convergence proof. The new text adds a rigorous Weierstrass M-test argument that establishes uniform convergence. The old_string exists exactly as specified in the current paper." -} - -Example (Reject): -{ - "decision": "reject", - "reasoning": "The proposed edit does not address any of the accepted critiques. It appears to be a stylistic change (rewording a sentence) rather than fixing the mathematical issues identified. Additionally, the old_string appears twice in the document, making it ambiguous." 
-} -""" - - -def build_partial_revision_validation_prompt( - current_paper: str, - current_outline: str, - critique_feedback: str, - edit_proposal: Dict -) -> str: - """ - Build prompt for validating a single partial revision edit. - - Args: - current_paper: Current paper body - current_outline: Paper outline - critique_feedback: All accepted critiques - edit_proposal: Dict with operation, old_string, new_string, reasoning - - Returns: - Complete assembled validation prompt - """ - operation = edit_proposal.get("operation", "") - old_string = edit_proposal.get("old_string", "") - new_string = edit_proposal.get("new_string", "") - reasoning = edit_proposal.get("reasoning", "") - - # Truncate long strings for prompt - old_str_display = old_string[:500] + "..." if len(old_string) > 500 else old_string - new_str_display = new_string[:500] + "..." if len(new_string) > 500 else new_string - - parts = [ - get_partial_revision_validation_system_prompt(), - "\n---\n", - get_partial_revision_validation_json_schema(), - "\n---\n", - f"CURRENT OUTLINE:\n{current_outline}", - "\n---\n", - f"ACCEPTED CRITIQUES (issues being addressed):\n{critique_feedback}", - "\n---\n", - f"CURRENT PAPER:\n{current_paper}", - "\n---\n", - f"PROPOSED EDIT:\n", - f"Operation: {operation}\n", - f"Old String: {old_str_display}\n", - f"New String: {new_str_display}\n", - f"Reasoning: {reasoning}", - "\n---\n", - "Validate whether this edit should be accepted. 
Respond as JSON:" - ] - - return ''.join(parts) - + return "".join(parts) diff --git a/backend/compiler/prompts/outline_prompts.py b/backend/compiler/prompts/outline_prompts.py index 55920f9..ffc4fbd 100644 --- a/backend/compiler/prompts/outline_prompts.py +++ b/backend/compiler/prompts/outline_prompts.py @@ -158,6 +158,11 @@ def get_outline_create_system_prompt() -> str: - Do NOT force coverage of every brainstorm/database entry - Do NOT ignore clearly crucial source material for the scope you choose +DIRECT-ANSWER-FIRST PRINCIPLE: +- Organize the paper around the strongest rigorous direct answer the paper can justify +- Prefer sections that directly solve, partially solve, refute, or sharply constrain the user's question over broad background accumulation +- Include background and preliminaries only to the extent needed to support the direct answer cleanly and rigorously + - Produce a numbered outline with major sections and subsections - Incorporate the strongest helpful source ideas where appropriate - Flag gaps explicitly if the evidence is insufficient @@ -185,6 +190,7 @@ def get_outline_create_system_prompt() -> str: - Required sections (Abstract, Introduction, Body, Conclusion) present with exact names - Sections follow logical mathematical progression (definitions → theorems → proofs) - The outline optimally serves the paper title and user's compiler-directing prompt +- The outline is focused on the strongest rigorous direct answer available, without unnecessary detours - No further refinement would meaningfully improve the outline - You are confident this outline will guide excellent paper construction @@ -218,6 +224,7 @@ def get_outline_create_system_prompt() -> str: - The outline should support a coherent, logical flow for the final document - Sections should build upon each other logically (definitions → theorems → proofs) - The outline should align with the user's compiler-directing prompt goals +- The outline should prioritize the strongest direct 
rigorous route to answering the prompt - DO NOT include a separate References or Citations section in the outline - All content must be rooted in sound mathematical reasoning; aggregator/brainstorm material is optional support, not a mandatory checklist - NO unfounded claims or logical fallacies @@ -308,11 +315,16 @@ def get_outline_update_system_prompt() -> str: - Do NOT force additions just because a brainstorm/database entry exists - Do NOT ignore clearly crucial source material for the scope you are keeping +DIRECT-ANSWER-FIRST PRINCIPLE: +- Update the outline when doing so materially strengthens the paper's most direct rigorous answer to the user's prompt +- Do not add side sections that broaden scope without improving the direct answer, partial answer, impossibility result, or key constraint + Decide if the outline requires updates. Consider: - Relevance to current source content when it helps the paper - Missing content that should be included in outline to better serve the user prompt - Structural issues in current outline - Alignment with document construction progress +- Whether the outline still reflects the strongest direct-answer path revealed by the draft CRITICAL - NO PLACEHOLDER TEXT: You must NEVER include placeholder markers like "[HARD CODED PLACEHOLDER FOR...]" in your outline submissions. 
@@ -348,6 +360,7 @@ def get_outline_update_system_prompt() -> str: - All added content must be rooted in sound mathematical reasoning; source database material is optional support, not a mandatory checklist - NO unfounded claims or logical fallacies - Focus on rigorous mathematical arguments +- Prefer additions that materially strengthen the paper's direct answer rather than merely broadening coverage - NEVER change the names of Abstract, Introduction, or Conclusion sections - New body sections must be inserted between Introduction and Conclusion - DO NOT add unsupported numeric empirical claims in section or subsection headings diff --git a/backend/compiler/prompts/review_prompts.py b/backend/compiler/prompts/review_prompts.py index 0b700e8..0c2ca1b 100644 --- a/backend/compiler/prompts/review_prompts.py +++ b/backend/compiler/prompts/review_prompts.py @@ -74,6 +74,7 @@ def get_review_system_prompt() -> str: - Structural issues - Redundancy - Forward-looking structural previews +- Places where the draft drifts away from the strongest direct answer to the user's prompt - Other improvements CRITICAL - SYSTEM-MANAGED MARKERS (NOT YOUR OUTPUT): @@ -116,6 +117,7 @@ def get_review_system_prompt() -> str: - Unsupported empirical claims, unsupported artifact/code claims, or uncited literature claims - Numeric benchmark-style claims in narrative text that are not explicitly sourced - Statements implying experiments, measurements, or implementations that are not actually evidenced +- Generic exploratory wording that obscures a stronger justified direct answer already present in the draft WHEN NOT TO MAKE AN EDIT: - Document is acceptable for a draft in progress diff --git a/backend/compiler/prompts/rigor_prompts.py b/backend/compiler/prompts/rigor_prompts.py index 5491eb6..37bb345 100644 --- a/backend/compiler/prompts/rigor_prompts.py +++ b/backend/compiler/prompts/rigor_prompts.py @@ -6,9 +6,11 @@ Stage 1 - Theorem discovery (build_rigor_theorem_discovery_prompt): Using 
the full writing context, the submitter asks itself whether the - paper contains a theorem worth formalizing and proving in Lean 4 that - has not already been verified. Output is a candidate theorem JSON (or - a decline). + paper, outline, support context, or user prompt expose a theorem worth + formalizing and proving in Lean 4. Candidate theorems may verify + existing paper claims or extend partial work when that helps the paper + construction / user prompt. Output is a candidate theorem JSON (or a + decline). Stage 2 - Placement (build_rigor_placement_prompt): Given a Lean-4-verified theorem, the submitter proposes an inline @@ -57,29 +59,37 @@ # STAGE 1: THEOREM DISCOVERY # ============================================================================= -_DISCOVERY_SYSTEM_PROMPT = f"""You are the rigor agent for a mathematical-paper compiler. Your job during the rigor loop is to look at the paper-in-progress together with the full research context and decide whether there is a theorem worth formalizing and proving in Lean 4. +_DISCOVERY_SYSTEM_PROMPT = f"""You are the rigor agent for a mathematical-paper compiler. Your job during the rigor loop is to look at the paper-in-progress together with the full research context and decide whether there is a theorem worth formalizing and proving in Lean 4 because it helps answer, support, or advance the USER RESEARCH PROMPT and/or materially improves the paper under construction. {INTERNAL_CONTENT_WARNING} YOUR TASK - STAGE 1 (DISCOVERY) 1. Read the current outline and the current paper text. -2. Read the list of theorems that have ALREADY been verified by Lean 4 (EXISTING VERIFIED PROOFS block). -3. Read the list of theorems that PREVIOUSLY FAILED Lean 4 verification (OPEN LEMMA TARGETS block, if present). -4. Decide exactly one of: - (A) `needs_theorem_work=false` - no theorem worth trying right now. 
Good reasons: all interesting claims in the paper are already covered by existing verified proofs; the paper is in too early a state; there is no claim a Lean 4 proof could close usefully. - (B) `needs_theorem_work=true` - propose a single candidate theorem to formalize. +2. Read the USER RESEARCH PROMPT and treat it as the relevance boundary for all theorem work. +3. Read the list of theorems that have ALREADY been verified by Lean 4 (EXISTING VERIFIED PROOFS block). +4. Read the list of theorems that PREVIOUSLY FAILED Lean 4 verification (OPEN LEMMA TARGETS block, if present). +5. Decide exactly one of: + (A) `needs_theorem_work=false` - no prompt-relevant theorem worth trying right now. Good reasons: all useful claims for the user's prompt are already covered by existing verified proofs; the paper is in too early a state; there is no claim a Lean 4 proof could close usefully; or the only available claims are mathematically interesting but off-topic. + (B) `needs_theorem_work=true` - propose a single prompt-relevant candidate theorem to formalize. RULES FOR PROPOSING A THEOREM: +- The theorem must directly help answer, support, or advance the USER RESEARCH PROMPT. Do not propose a theorem merely because it is non-trivial or mathematically interesting. - The theorem must be provable in Lean 4 with Mathlib. - You MUST NOT re-propose a theorem that is already in EXISTING VERIFIED PROOFS. Look for theorems that are DIFFERENT - new results, missed lemmas, or sharper versions that are not yet on the list. -- You MAY retry a theorem from OPEN LEMMA TARGETS when the paper now gives you a better angle on it. When you do, set `retry_existing_failure_id` to the failed `theorem_id`. +- You MAY retry a theorem from OPEN LEMMA TARGETS when it is still prompt-relevant and the paper now gives you a better angle on it. When you do, set `retry_existing_failure_id` to the failed `theorem_id`. 
+- EXTENSION IS EXPLICITLY ALLOWED AND ENCOURAGED WHERE HELPFUL: you are NOT limited to exact claims already present in the current paper. You may construct a Lean-verifiable theorem by extending partial paper work, the current outline, supporting context, or the USER RESEARCH PROMPT when that theorem would materially help the paper construction and/or the user's requested goal. +- Set `theorem_origin="existing_paper_claim"` only when the theorem directly formalizes a claim already present in the current paper text. +- Set `theorem_origin="extension_from_partial_work"` when the theorem is constructed by extending the current paper, outline, or supporting context beyond the exact written claim. +- Set `theorem_origin="extension_from_user_prompt"` when the theorem is prompted primarily by the USER RESEARCH PROMPT and helps the paper even if the current paper has not yet written the claim. +- Extension-derived theorems (`extension_from_partial_work` or `extension_from_user_prompt`) MUST set `placement_preference="appendix_only"`. These proofs belong at the end of the paper in the Theorems Appendix, not inline in the main body. +- Existing-paper-claim theorems may set `placement_preference="inline"` when a local body insertion would strengthen the existing argument, or `placement_preference="appendix_only"` when the proof is useful but would distract from the prose. - Prefer theorems whose statements are tight enough that Lean 4 can actually close them (arithmetic facts, concrete inequalities, specific algebraic identities, small group/ring/field lemmas, concrete combinatorial identities) over large open conjectures. - The `theorem_statement` is for a human reader. It should be precise, self-contained, and include the hypotheses. -- The `formal_sketch` tells the formalization agent what tactics or lemmas look promising in Lean 4 / Mathlib. Keep it concrete. 
-- The `source_excerpt` is 2-6 sentences of surrounding paper text that motivates why this theorem is a natural target here. It must be a direct paraphrase or quote from the current paper. +- The `formal_sketch` tells the formalization agent what tactics or lemmas look promising in Lean 4 / Mathlib and why this theorem helps the user's prompt. Keep it concrete. +- The `source_excerpt` is 2-6 sentences of motivating context. For `existing_paper_claim`, it must be a direct paraphrase or quote from the current paper. For extension-derived theorems, it may explain the partial paper work, outline item, supporting evidence, and/or user-prompt need that the theorem extends. -If Stage 1 guesses wrong, Stage 2 cannot recover - 5 Lean 4 attempts will be spent on the wrong target. Prefer declining over a weak proposal. +If Stage 1 guesses wrong, Stage 2 cannot recover - 5 Lean 4 attempts will be spent on the wrong target. Prefer declining over a weak or off-prompt proposal. Output your response ONLY as JSON in this exact format: {{{{ @@ -87,8 +97,10 @@ "theorem_statement": "precise theorem statement with explicit hypotheses and conclusion (empty if needs_theorem_work=false)", "formal_sketch": "concrete sketch: what tactics / Mathlib lemmas you expect to work (empty if needs_theorem_work=false)", "source_excerpt": "2-6 sentences of surrounding paper text that motivates this theorem (empty if needs_theorem_work=false)", + "theorem_origin": "existing_paper_claim | extension_from_partial_work | extension_from_user_prompt (empty if needs_theorem_work=false)", + "placement_preference": "inline | appendix_only (empty if needs_theorem_work=false)", "retry_existing_failure_id": "theorem_id from OPEN LEMMA TARGETS if retrying a prior failure, empty string otherwise", - "reasoning": "why this theorem is the best target right now OR why no theorem should be attempted" + "reasoning": "why this theorem is the best prompt-relevant target right now OR why no theorem should be attempted" 
}}}}""" @@ -98,6 +110,8 @@ "theorem_statement": "string", "formal_sketch": "string", "source_excerpt": "string", + "theorem_origin": "existing_paper_claim OR extension_from_partial_work OR extension_from_user_prompt", + "placement_preference": "inline OR appendix_only", "retry_existing_failure_id": "string (may be empty)", "reasoning": "string" } @@ -108,8 +122,22 @@ "theorem_statement": "For every natural number n, the sum of the first n positive integers equals n*(n+1)/2.", "formal_sketch": "Induction on n. Base: n=0 both sides are 0. Step: use Finset.sum_range_succ and Nat.succ_mul; close with omega / ring. Mathlib has Finset.sum_range_id which may finish it outright.", "source_excerpt": "In Section 2 we reasoned about partial sums of the form 1 + 2 + ... + n...", + "theorem_origin": "existing_paper_claim", + "placement_preference": "inline", "retry_existing_failure_id": "", - "reasoning": "Section 2 relies on the closed form but currently presents it without a verified proof. Lean 4 can close this cleanly; it does not duplicate any existing verified proof." + "reasoning": "Section 2 uses this closed form to support the user's requested argument but currently presents it without a verified proof. Lean 4 can close this cleanly; it does not duplicate any existing verified proof." +} + +Example (propose an extension theorem for the appendix): +{ + "needs_theorem_work": true, + "theorem_statement": "For every natural number n, n*(n+1) is even.", + "formal_sketch": "Use Nat.even_mul_succ_self or prove by parity cases / omega. This lemma can support a later divisibility argument about triangular numbers.", + "source_excerpt": "The outline asks for arithmetic constraints on triangular-number expressions, but the current paper has not yet isolated the parity lemma needed for the clean construction. 
This theorem extends the partial plan into a Lean-checkable support result.", + "theorem_origin": "extension_from_partial_work", + "placement_preference": "appendix_only", + "retry_existing_failure_id": "", + "reasoning": "This is not an exact written claim in the current paper; it extends the partial outline into a useful verified lemma. Because it is extension-derived, it should be stored in the Theorems Appendix rather than inserted inline." } Example (decline): @@ -118,8 +146,10 @@ "theorem_statement": "", "formal_sketch": "", "source_excerpt": "", + "theorem_origin": "", + "placement_preference": "", "retry_existing_failure_id": "", - "reasoning": "The paper currently contains only outline scaffolding and the one verified theorem (proof_002). Attempting another Lean 4 proof right now would either duplicate proof_002 or target claims that are too vague to formalize." + "reasoning": "The paper currently contains only outline scaffolding and the one verified theorem (proof_002). Attempting another Lean 4 proof right now would either duplicate proof_002, target claims that are too vague to formalize, or chase claims that do not help the user's prompt." } """ @@ -151,6 +181,7 @@ PLACEMENT GUIDELINES: - Put the theorem where it strengthens the local argument. Prefer insertion points inside a relevant body section (near the discussion it closes) over dumping it in a new section. +- The inline placement should make clear why this verified theorem helps the paper answer or advance the USER RESEARCH PROMPT. - The paper has a Theorems Appendix block already; do NOT try to edit the appendix directly. - Keep `old_string` short but unique (3-5 lines of surrounding context is usually enough). 
diff --git a/backend/compiler/validation/compiler_validator.py b/backend/compiler/validation/compiler_validator.py
index 0afe2fa..c10b3b6 100644
--- a/backend/compiler/validation/compiler_validator.py
+++ b/backend/compiler/validation/compiler_validator.py
@@ -11,7 +11,7 @@
 from backend.shared.api_client_manager import api_client_manager
 from backend.shared.openrouter_client import FreeModelExhaustedError
 from backend.shared.models import CompilerSubmission, CompilerValidationResult
-from backend.shared.json_parser import parse_json
+from backend.shared.json_parser import parse_json, sanitize_model_output_for_retry_context
 from backend.shared.utils import count_tokens
 from backend.autonomous.memory.proof_database import proof_database
 from backend.aggregator.validation.json_validator import json_validator
@@ -477,11 +477,16 @@ async def _parse_json_with_retry(
             logger.info("CompilerValidator: Already in retry, using fallback parser")
             return self._fallback_parse(response)
 
-        # Build retry prompt asking for reformatted JSON
-        # Note: response is already truncated to 2000 chars in the prompt text
+        # Build retry prompt asking for reformatted JSON. Keep failed-output
+        # context, but sanitize it before any replay in prompt or assistant turn.
+        max_failed_output_chars = 2000  # ~500 tokens - enough for error context
+        failed_output_preview = sanitize_model_output_for_retry_context(
+            response,
+            max_chars=max_failed_output_chars,
+        )
         reparse_prompt = (
             "Your previous response could not be parsed as valid JSON.\n\n"
-            f"YOUR PREVIOUS RESPONSE:\n{response[:2000]}{'...'
if len(response) > 2000 else ''}\n\n" + f"YOUR PREVIOUS RESPONSE:\n{failed_output_preview}\n\n" f"PARSE ERROR: {str(parse_error)}\n\n" "Please provide the exact same validation decision in valid JSON format.\n" "CRITICAL: Properly escape backslashes (use \\\\) and quotes (use \\\").\n" @@ -492,13 +497,6 @@ async def _parse_json_with_retry( try: retry_task_id = f"{self.get_current_task_id()}_retry" - # CRITICAL FIX: Truncate failed output to prevent context overflow during retry - max_failed_output_chars = 2000 # ~500 tokens - enough for error context - if len(response) > max_failed_output_chars: - failed_output_preview = response[:max_failed_output_chars] + "\n[...output truncated for retry...]" - else: - failed_output_preview = response - # Calculate if conversation fits in context window from backend.shared.config import system_config, rag_config prompt_tokens = count_tokens(original_prompt) @@ -1295,6 +1293,9 @@ def _build_brainstorm_validation_prompt( You see ONLY the brainstorm database and the proposed operation. You do NOT see the paper or any paper edits. Your decision must be based solely on whether this operation improves the brainstorm database. +PROTECTED LEAN 4 PROOFS: +Lean 4 verified proof entries in the brainstorm database are immutable to paper-writing retroactive operations. If a proposed operation edits, deletes, annotates, or adds context to a Lean 4 verified proof, reject it. Only the normal brainstorm prune system may remove Lean 4 proof entries. 
+ OPERATION TYPE: {action.upper()} """ @@ -1561,6 +1562,7 @@ def _get_outline_validation_system_prompt(self, mode: str) -> str: - Do NOT reject solely because an outline does not explicitly use or cover database material - Do reject if the outline ignores clearly crucial source material in a way that makes its chosen scope weak, incoherent, or misaligned with the user prompt - Accept selective or divergent outline structures when they better serve the user's prompt and remain rigorous +- Prefer outlines that organize the strongest rigorous direct answer to the user's prompt, rather than broad exploratory coverage YOUR TASK: Verify the submission meets ALL criteria above. Accept only if ALL criteria pass. Reject if ANY criterion fails. @@ -1789,6 +1791,7 @@ def _get_paper_validation_system_prompt(self, mode: str) -> str: - The brainstorm database is optional support, not a mandatory checklist - Do NOT reject solely because the submission does not explicitly use brainstorm content or because it departs from brainstorm phrasing - Reject only if the submission ignores clearly necessary established content for its claimed scope, conflicts with the outline, or becomes weaker/less rigorous as a result +- Prefer submissions that strengthen the paper's most direct rigorous answer to the prompt rather than adding indirect breadth YOUR TASK: Verify the submission meets ALL criteria above. If even ONE criterion fails, reject the submission. @@ -2025,233 +2028,4 @@ def _parse_validation_response(self, response: str) -> Optional[dict]: except Exception as e: logger.error(f"Failed to parse validation response: {e}") - return None - - async def validate_rewrite_decision( - self, - decision_result: Dict, - user_prompt: str, - current_body: str, - current_outline: str, - current_title: str, - critique_feedback: str, - aggregator_db: str - ) -> bool: - """ - Validate a rewrite vs continue decision made after critique phase. 
- - Args: - decision_result: The decision dict from critique submitter - user_prompt: User's compiler-directing prompt - current_body: Body section being evaluated - current_outline: Paper outline - current_title: Current paper title - critique_feedback: All accepted critiques (typically 1-3 out of 5 total attempts) - aggregator_db: Aggregator database content - - Returns: - True if decision is valid, False if should be retried - """ - try: - logger.info("Validating rewrite decision...") - - # Import prompt builder - from backend.compiler.prompts.critique_prompts import build_rewrite_decision_validation_prompt - - # Build validation prompt - prompt = build_rewrite_decision_validation_prompt( - user_prompt=user_prompt, - current_body=current_body, - current_outline=current_outline, - current_title=current_title, - critique_feedback=critique_feedback, - decision_result=decision_result, - aggregator_db=aggregator_db - ) - - # Generate task ID - task_id = self.get_current_task_id() - self.task_sequence += 1 - - # Notify task started - if self.task_tracking_callback: - self.task_tracking_callback("started", task_id) - - # Call LLM - from backend.shared.config import system_config - response = await api_client_manager.generate_completion( - task_id=task_id, - role_id=self.role_id, - model=self.model_name, - messages=[{"role": "user", "content": prompt}], - temperature=0.0, - max_tokens=system_config.compiler_validator_max_output_tokens - ) - - # Notify task completed - if self.task_tracking_callback: - self.task_tracking_callback("completed", task_id) - - # Extract content from response (handles both 'content' and 'reasoning' fields) - message = response.get("choices", [{}])[0].get("message", {}) - llm_output = message.get("content") or message.get("reasoning") or "" - - # Parse the extracted string - data = parse_json(llm_output) - - if data is None: - logger.error("Failed to parse rewrite decision validation response") - return False - - # Handle array responses - if 
isinstance(data, list): - logger.warning("Validator returned array instead of object - using first element") - if not data: - return False - data = data[0] - - # Check decision - decision = data.get("decision", "").lower() - reasoning = data.get("reasoning", "") - - if decision == "accept": - logger.info(f"Rewrite decision VALIDATED: {reasoning[:200]}...") - return True - else: - logger.info(f"Rewrite decision REJECTED: {reasoning[:200]}...") - return False - - except FreeModelExhaustedError: - raise - except Exception as e: - logger.error(f"Error validating rewrite decision: {e}", exc_info=True) - return False - - async def validate_partial_revision_edit( - self, - edit_proposal: Dict, - current_paper: str, - current_outline: str, - critique_feedback: str - ) -> Tuple[bool, str]: - """ - Validate a single edit proposed during iterative partial revision. - - This validates that an edit: - 1. Uses exact string matching correctly - 2. Addresses critique feedback appropriately - 3. Maintains document coherence - 4. 
Preserves mathematical rigor - - Args: - edit_proposal: Dict with operation, old_string, new_string, reasoning - current_paper: Current paper state - current_outline: Paper outline - critique_feedback: The accepted critique feedback being addressed - - Returns: - Tuple of (is_valid: bool, rejection_reason: str) - """ - try: - logger.info("Validating partial revision edit...") - - operation = edit_proposal.get("operation", "") - old_string = edit_proposal.get("old_string", "") - new_string = edit_proposal.get("new_string", "") - reasoning = edit_proposal.get("reasoning", "") - - # CRITICAL: Ensure markers are intact BEFORE any old_string validation - # Partial revision operates on paper (not outline) - markers_repaired = await paper_memory.ensure_markers_intact() - if markers_repaired: - logger.info("Paper markers were missing and have been repaired during partial revision validation") - # Re-fetch paper after repair - current_paper = await paper_memory.get_paper() - - # Pre-validation: Check exact string match for non-full_content operations - if operation in ("replace", "insert_after", "delete"): - if not old_string: - return False, "old_string cannot be empty for this operation" - - # Normalize and check - normalized_paper = normalize_unicode_hyphens(current_paper) - normalized_old = normalize_unicode_hyphens(old_string) - - if normalized_old not in normalized_paper: - # Try to find similar text for better error message - logger.warning(f"Exact string not found in document: '{old_string[:100]}...'") - return False, f"EXACT_STRING_NOT_FOUND: The old_string was not found in the document. Ensure you use text that exists verbatim in CURRENT PAPER." - - # Check uniqueness - count = normalized_paper.count(normalized_old) - if count > 1: - return False, f"STRING_NOT_UNIQUE: The old_string appears {count} times in the document. Include more context to make it unique." 
- - # Import prompt builder for LLM validation - from backend.compiler.prompts.critique_prompts import build_partial_revision_validation_prompt - - # Build validation prompt - prompt = build_partial_revision_validation_prompt( - current_paper=current_paper, - current_outline=current_outline, - critique_feedback=critique_feedback, - edit_proposal=edit_proposal - ) - - # Generate task ID - task_id = self.get_current_task_id() - self.task_sequence += 1 - - # Notify task started - if self.task_tracking_callback: - self.task_tracking_callback("started", task_id) - - # Call LLM - from backend.shared.config import system_config - response = await api_client_manager.generate_completion( - task_id=task_id, - role_id=self.role_id, - model=self.model_name, - messages=[{"role": "user", "content": prompt}], - temperature=0.0, - max_tokens=system_config.compiler_validator_max_output_tokens - ) - - # Notify task completed - if self.task_tracking_callback: - self.task_tracking_callback("completed", task_id) - - # Extract content from response - message = response.get("choices", [{}])[0].get("message", {}) - llm_output = message.get("content") or message.get("reasoning") or "" - - # Parse the response - data = parse_json(llm_output) - - if data is None: - logger.error("Failed to parse partial revision validation response") - return False, "Failed to parse validation response" - - # Handle array responses - if isinstance(data, list): - logger.warning("Validator returned array - using first element") - if not data: - return False, "Empty validation response" - data = data[0] - - # Check decision - decision = data.get("decision", "").lower() - val_reasoning = data.get("reasoning", "No reason provided") - - if decision == "accept": - logger.info(f"Partial revision edit VALIDATED: {val_reasoning[:150]}...") - return True, "" - else: - logger.info(f"Partial revision edit REJECTED: {val_reasoning[:150]}...") - return False, val_reasoning - - except FreeModelExhaustedError: - raise - 
except Exception as e: - logger.error(f"Error validating partial revision edit: {e}", exc_info=True) - return False, f"Validation error: {str(e)}" \ No newline at end of file + return None \ No newline at end of file diff --git a/backend/leanoj/__init__.py b/backend/leanoj/__init__.py new file mode 100644 index 0000000..b8f612a --- /dev/null +++ b/backend/leanoj/__init__.py @@ -0,0 +1 @@ +"""LeanOJ proof-solver mode.""" diff --git a/backend/leanoj/core/__init__.py b/backend/leanoj/core/__init__.py new file mode 100644 index 0000000..cbd5da7 --- /dev/null +++ b/backend/leanoj/core/__init__.py @@ -0,0 +1 @@ +"""Core orchestration for LeanOJ proof solving.""" diff --git a/backend/leanoj/core/leanoj_context.py b/backend/leanoj/core/leanoj_context.py new file mode 100644 index 0000000..7c2465b --- /dev/null +++ b/backend/leanoj/core/leanoj_context.py @@ -0,0 +1,829 @@ +"""LeanOJ proof-memory persistence and direct/RAG context allocation.""" +from __future__ import annotations + +import asyncio +import hashlib +import json +import logging +import re +import shutil +from dataclasses import dataclass, field +from datetime import datetime +from pathlib import Path +from typing import Any + +import aiofiles + +from backend.aggregator.core.rag_manager import rag_manager +from backend.shared.config import rag_config, system_config +from backend.shared.utils import count_tokens + +logger = logging.getLogger(__name__) + + +ARTIFACT_ACCEPTED_IDEAS = "accepted_ideas" +ARTIFACT_RECURSIVE_TOPICS = "recursive_topics" +ARTIFACT_VERIFIED_SUBPROOFS = "verified_subproofs" +ARTIFACT_PARTIAL_PROOFS = "partial_proofs" +ARTIFACT_FINAL_ATTEMPTS = "final_attempts" +ARTIFACT_FINAL_CYCLE_PACKETS = "final_cycle_packets" +ARTIFACT_FAILED_SUBPROOFS = "failed_subproofs" + + +def _remove_attempt_count_language(value: Any) -> str: + text = str(value or "") + replacements = ( + ( + r"\bfailed\s+\d+\s+consecutive\s+verification/edit\s+attempts?\b", + "encountered repeated verification/edit failures", + 
), + (r"\bfailed\s+\d+\s+consecutive\s+attempts?\b", "encountered repeated failures"), + (r"\bfailed\s+\d+\s+attempts?\b", "encountered repeated failures"), + (r"\bfailed\s+\d+\s+times\b", "encountered repeated failures"), + (r"\bafter\s+failed\s+attempts\b", "after recent proof-check failures"), + (r"\bfailed\s+attempts\b", "proof-check failures"), + (r"\battempts\s+\d+\s*-\s*\d+\b", "recent final-loop feedback"), + (r"\bwith\s+exactly\s+\d+\s+failed\s+attempts?\b", "with recent proof-check failures"), + (r"\bUse this exact failed-attempt count[^.]*\.", ""), + (r"\bfailed-attempt count\b", "failure context"), + ) + for pattern, replacement in replacements: + text = re.sub(pattern, replacement, text, flags=re.IGNORECASE) + return re.sub(r" {2,}", " ", text).strip() + +USEFUL_ARTIFACTS = ( + ARTIFACT_ACCEPTED_IDEAS, + ARTIFACT_RECURSIVE_TOPICS, + ARTIFACT_VERIFIED_SUBPROOFS, + ARTIFACT_PARTIAL_PROOFS, + ARTIFACT_FINAL_ATTEMPTS, + ARTIFACT_FINAL_CYCLE_PACKETS, + ARTIFACT_FAILED_SUBPROOFS, +) + + +@dataclass +class LeanOJMemoryItem: + """One optional proof-memory source eligible for direct injection or RAG.""" + + artifact: str + title: str + text: str + priority: int + source_name: str + rag_only: bool = False + + +@dataclass +class LeanOJContextAllocation: + """Prepared context blocks consumed by LeanOJ prompt builders.""" + + direct_proof_context: str = "" + rag_evidence_context: str = "" + refuted_construction_warnings: str = "" + capped_rejection_feedback: str = "" + current_final_cycle_packet: str = "" + current_working_proof_attempt: str = "" + direct_sources: list[str] = field(default_factory=list) + rag_sources: list[str] = field(default_factory=list) + + def as_prompt_blocks(self) -> dict[str, str]: + return { + "direct_proof_context": self.direct_proof_context, + "rag_evidence_context": self.rag_evidence_context, + "refuted_construction_warnings": self.refuted_construction_warnings, + "capped_rejection_feedback": self.capped_rejection_feedback, + 
"current_final_cycle_packet": self.current_final_cycle_packet, + "current_working_proof_attempt": self.current_working_proof_attempt, + } + + +class LeanOJContextManager: + """Session-scoped LeanOJ artifact storage and RAG/offload routing.""" + + def __init__(self) -> None: + self._indexed_hashes: dict[str, str] = {} + self._index_locks: dict[str, asyncio.Lock] = {} + self._artifact_sync_counts: dict[tuple[str, str], int] = {} + self._artifact_sync_digests: dict[tuple[str, str], str] = {} + + @staticmethod + def artifacts_base_dir() -> Path: + return Path(system_config.data_dir) / "leanoj_artifacts" + + def session_artifact_dir(self, session_id: str) -> Path: + return self.artifacts_base_dir() / (session_id or "latest") + + @staticmethod + def source_prefix(session_id: str) -> str: + return f"leanoj_{session_id or 'latest'}_" + + def source_name(self, session_id: str, artifact: str) -> str: + return f"{self.source_prefix(session_id)}{artifact}" + + def source_names_for_session(self, session_id: str) -> list[str]: + return [self.source_name(session_id, artifact) for artifact in USEFUL_ARTIFACTS] + + async def write_session_artifacts( + self, + *, + session_id: str, + accepted_ideas: list[str], + accepted_idea_records: list[dict[str, Any]] | None = None, + recursive_topics: list[str] | None = None, + verified_subproofs: list[dict[str, Any]], + partial_proofs: list[dict[str, Any]], + failed_subproofs: list[dict[str, Any]], + final_attempts: list[dict[str, Any]], + final_cycle_packets: list[dict[str, Any]], + ) -> None: + """Persist full LeanOJ proof memory independently from trimmed UI state.""" + if not session_id: + return + + base = self.session_artifact_dir(session_id) + base.mkdir(parents=True, exist_ok=True) + accepted_records = [ + dict(record) + for record in (accepted_idea_records or []) + if isinstance(record, dict) and str(record.get("content") or "").strip() + ] + recorded_contents = {str(record.get("content") or "") for record in accepted_records} + 
accepted_records.extend( + {"content": item} + for item in accepted_ideas + if str(item).strip() and str(item) not in recorded_contents + ) + if not accepted_records: + accepted_records = [{"content": item} for item in accepted_ideas] + await self._sync_jsonl(base / f"{ARTIFACT_ACCEPTED_IDEAS}.jsonl", session_id, ARTIFACT_ACCEPTED_IDEAS, accepted_records) + await self._sync_jsonl(base / f"{ARTIFACT_RECURSIVE_TOPICS}.jsonl", session_id, ARTIFACT_RECURSIVE_TOPICS, [{"content": item} for item in (recursive_topics or [])]) + await self._sync_jsonl(base / f"{ARTIFACT_VERIFIED_SUBPROOFS}.jsonl", session_id, ARTIFACT_VERIFIED_SUBPROOFS, verified_subproofs) + await self._sync_jsonl(base / f"{ARTIFACT_PARTIAL_PROOFS}.jsonl", session_id, ARTIFACT_PARTIAL_PROOFS, partial_proofs) + await self._sync_jsonl(base / f"{ARTIFACT_FAILED_SUBPROOFS}.jsonl", session_id, ARTIFACT_FAILED_SUBPROOFS, failed_subproofs) + await self._sync_jsonl(base / f"{ARTIFACT_FINAL_ATTEMPTS}.jsonl", session_id, ARTIFACT_FINAL_ATTEMPTS, final_attempts) + await self._sync_jsonl(base / f"{ARTIFACT_FINAL_CYCLE_PACKETS}.jsonl", session_id, ARTIFACT_FINAL_CYCLE_PACKETS, final_cycle_packets) + + async def append_record(self, session_id: str, artifact: str, record: dict[str, Any]) -> None: + """Append one record to a full-memory artifact log.""" + if not session_id: + return + path = self.session_artifact_dir(session_id) / f"{artifact}.jsonl" + path.parent.mkdir(parents=True, exist_ok=True) + async with aiofiles.open(path, "a", encoding="utf-8") as f: + await f.write(json.dumps(record, ensure_ascii=False) + "\n") + key = (session_id, artifact) + self._artifact_sync_counts[key] = self._artifact_sync_counts.get(key, self._count_jsonl_records(path) - 1) + 1 + self._artifact_sync_digests.pop(key, None) + + def load_session_artifacts(self, session_id: str) -> dict[str, list[Any]]: + """Load full LeanOJ artifact logs for resume.""" + base = self.session_artifact_dir(session_id) + return { + ARTIFACT_ACCEPTED_IDEAS: 
self._records_to_strings(self._read_jsonl(base / f"{ARTIFACT_ACCEPTED_IDEAS}.jsonl")), + "accepted_idea_records": self._read_jsonl(base / f"{ARTIFACT_ACCEPTED_IDEAS}.jsonl"), + ARTIFACT_RECURSIVE_TOPICS: self._records_to_strings(self._read_jsonl(base / f"{ARTIFACT_RECURSIVE_TOPICS}.jsonl")), + ARTIFACT_VERIFIED_SUBPROOFS: self._read_jsonl(base / f"{ARTIFACT_VERIFIED_SUBPROOFS}.jsonl"), + ARTIFACT_PARTIAL_PROOFS: self._read_jsonl(base / f"{ARTIFACT_PARTIAL_PROOFS}.jsonl"), + ARTIFACT_FAILED_SUBPROOFS: self._read_jsonl(base / f"{ARTIFACT_FAILED_SUBPROOFS}.jsonl"), + ARTIFACT_FINAL_ATTEMPTS: self._read_jsonl(base / f"{ARTIFACT_FINAL_ATTEMPTS}.jsonl"), + ARTIFACT_FINAL_CYCLE_PACKETS: self._read_jsonl(base / f"{ARTIFACT_FINAL_CYCLE_PACKETS}.jsonl"), + } + + async def allocate_context( + self, + *, + session_id: str, + mode: str, + user_prompt: str, + lean_template: str, + task_request: str, + context_window: int, + max_output_tokens: int, + accepted_ideas: list[str], + recursive_topics: list[str] | None = None, + verified_subproofs: list[dict[str, Any]], + partial_proofs: list[dict[str, Any]], + failed_subproofs: list[dict[str, Any]], + final_attempts: list[dict[str, Any]], + final_cycle_packets: list[dict[str, Any]] | None = None, + refuted_constructions: list[dict[str, Any]] | None = None, + current_final_cycle_packet: dict[str, Any] | None = None, + current_working_proof_attempt: dict[str, Any] | None = None, + capped_rejection_feedback: str = "", + ) -> LeanOJContextAllocation: + """Allocate optional LeanOJ memory direct first, then through scoped RAG.""" + normalized_mode = mode if mode in {"brainstorm", "recursive_brainstorm", "subproof", "final_solver"} else "brainstorm" + allocation = LeanOJContextAllocation( + capped_rejection_feedback=capped_rejection_feedback.strip(), + current_final_cycle_packet=self._format_final_cycle_packet(current_final_cycle_packet) + if current_final_cycle_packet + else "", + 
current_working_proof_attempt=self._format_working_proof_attempt(current_working_proof_attempt) + if current_working_proof_attempt + else "", + refuted_construction_warnings=self._format_refuted_construction_warnings(refuted_constructions or []) + if normalized_mode == "final_solver" + else "", + ) + + available_tokens = rag_config.get_available_input_tokens(context_window, max_output_tokens) + mandatory_tokens = count_tokens(user_prompt) + count_tokens(lean_template) + count_tokens(task_request) + mandatory_tokens += rag_config.get_prompt_assembly_overhead_estimate() + mandatory_tokens += count_tokens(allocation.current_final_cycle_packet) + mandatory_tokens += count_tokens(allocation.current_working_proof_attempt) + mandatory_tokens += count_tokens(allocation.refuted_construction_warnings) + mandatory_tokens += count_tokens(allocation.capped_rejection_feedback) + remaining_tokens = available_tokens - mandatory_tokens + if remaining_tokens < 0: + raise RuntimeError( + "LeanOJ mandatory context overflow before optional proof memory allocation. " + f"Mandatory tokens: {mandatory_tokens}. Available input tokens: {available_tokens}. " + f"Context mode: {normalized_mode}. Increase the role context window or reduce mandatory context." 
+ ) + + direct_parts: list[str] = [] + offloaded_items: list[LeanOJMemoryItem] = [] + minimum_rag_reserve = min(5000, max(1000, int(available_tokens * 0.05))) + + for item in self._memory_items( + session_id=session_id, + mode=normalized_mode, + accepted_ideas=accepted_ideas, + recursive_topics=recursive_topics or [], + verified_subproofs=verified_subproofs, + partial_proofs=partial_proofs, + failed_subproofs=failed_subproofs, + final_attempts=final_attempts, + final_cycle_packets=final_cycle_packets or [], + current_final_cycle_packet=current_final_cycle_packet, + has_current_working_proof_attempt=current_working_proof_attempt is not None, + ): + formatted = f"{item.title}\n{item.text}".strip() + tokens = count_tokens(formatted) + if ( + not item.rag_only + and tokens <= remaining_tokens + and remaining_tokens - tokens >= minimum_rag_reserve + ): + direct_parts.append(formatted) + allocation.direct_sources.append(item.source_name) + remaining_tokens -= tokens + else: + offloaded_items.append(item) + allocation.rag_sources.append(item.source_name) + + allocation.direct_proof_context = "\n\n".join(direct_parts).strip() + + if offloaded_items and remaining_tokens <= 500: + offloaded_titles = ", ".join(item.artifact for item in offloaded_items) + raise RuntimeError( + "LeanOJ context allocation could not preserve useful proof memory. " + f"Mandatory context left only {remaining_tokens} tokens for RAG/offload; " + f"offloaded sources would be silently dropped: {offloaded_titles}." 
+ ) + + if offloaded_items: + for item in offloaded_items: + await self._ensure_source_indexed(item.source_name, f"{item.title}\n{item.text}".strip()) + + rag_pack = await rag_manager.retrieve( + query="\n\n".join([user_prompt, lean_template, task_request]), + chunk_size=rag_config.validator_chunk_size, + max_tokens=max(0, remaining_tokens - 200), + exclude_sources=allocation.direct_sources or None, + include_sources=allocation.rag_sources, + include_source_prefixes=[self.source_prefix(session_id)], + ) + allocation.rag_evidence_context = rag_pack.text or "" + + return allocation + + async def remove_session(self, session_id: str) -> None: + """Remove persisted LeanOJ artifacts and their RAG sources for one session.""" + base = self.session_artifact_dir(session_id) + if base.exists(): + shutil.rmtree(base) + self._clear_sync_counts(session_id) + await self.remove_session_rag_sources(session_id) + + async def clear_all(self) -> None: + """Remove all LeanOJ artifact stores and LeanOJ RAG sources.""" + base = self.artifacts_base_dir() + session_ids = [path.name for path in base.iterdir() if path.is_dir()] if base.exists() else [] + if base.exists(): + shutil.rmtree(base) + self._artifact_sync_counts.clear() + self._artifact_sync_digests.clear() + await self.remove_all_leanoj_rag_sources(session_ids=session_ids) + + async def remove_session_rag_sources(self, session_id: str) -> None: + await self._remove_rag_sources(self.source_names_for_session(session_id)) + + async def remove_all_leanoj_rag_sources(self, session_ids: list[str] | None = None) -> None: + sources: set[str] = set(self._indexed_hashes.keys()) + for session_id in session_ids or []: + sources.update(self.source_names_for_session(session_id)) + if session_ids is None: + base = self.artifacts_base_dir() + if base.exists(): + for path in base.iterdir(): + if path.is_dir(): + sources.update(self.source_names_for_session(path.name)) + await self._remove_rag_sources(sources) + + async def 
_remove_rag_sources(self, sources: list[str] | set[str]) -> None: + for source_name in sorted({source for source in sources if source}): + try: + await rag_manager.remove_document(source_name) + except Exception as exc: + logger.warning("Failed to remove LeanOJ RAG source %s: %s", source_name, exc) + self._indexed_hashes.pop(source_name, None) + + async def _ensure_source_indexed(self, source_name: str, text: str) -> None: + if not text.strip(): + return + lock = self._index_locks.setdefault(source_name, asyncio.Lock()) + async with lock: + digest = hashlib.sha256(text.encode("utf-8")).hexdigest() + has_chunks = any( + chunk.source_file == source_name + for chunk in rag_manager.chunks_by_size[rag_config.validator_chunk_size] + ) + if self._indexed_hashes.get(source_name) == digest and has_chunks: + return + + await rag_manager.remove_document(source_name) + + await rag_manager.add_text( + text, + source_name, + chunk_sizes=rag_config.submitter_chunk_intervals, + is_permanent=False, + ) + self._indexed_hashes[source_name] = digest + + async def _sync_jsonl( + self, + path: Path, + session_id: str, + artifact: str, + records: list[Any], + ) -> None: + """Append new records; rewrite when records shrink or same-length content changes.""" + key = (session_id, artifact) + new_digest = self._records_digest(records) + persisted_count = self._artifact_sync_counts.get(key) + if persisted_count is None: + persisted_count = self._count_jsonl_records(path) + + if len(records) < persisted_count: + await self._write_jsonl(path, records) + self._artifact_sync_counts[key] = len(records) + self._artifact_sync_digests[key] = new_digest + return + + if len(records) == persisted_count: + known_digest = self._artifact_sync_digests.get(key) + if known_digest == new_digest: + self._artifact_sync_counts[key] = persisted_count + return + if known_digest is None and self._jsonl_digest(path) == new_digest: + self._artifact_sync_counts[key] = persisted_count + self._artifact_sync_digests[key] 
= new_digest + return + await self._write_jsonl(path, records) + self._artifact_sync_counts[key] = persisted_count + self._artifact_sync_digests[key] = new_digest + return + + path.parent.mkdir(parents=True, exist_ok=True) + async with aiofiles.open(path, "a", encoding="utf-8") as f: + for record in records[persisted_count:]: + await f.write(json.dumps(record, ensure_ascii=False) + "\n") + self._artifact_sync_counts[key] = len(records) + self._artifact_sync_digests[key] = new_digest + + @staticmethod + def _records_digest(records: list[Any]) -> str: + return hashlib.sha256( + "\n".join(json.dumps(record, ensure_ascii=False, sort_keys=True, default=str) for record in records).encode( + "utf-8" + ) + ).hexdigest() + + def _jsonl_digest(self, path: Path) -> str: + if not path.exists(): + return self._records_digest([]) + try: + records = self._read_jsonl(path) + except Exception as exc: + logger.warning("Failed to digest LeanOJ artifact log %s: %s", path, exc) + return "" + return self._records_digest(records) + + @staticmethod + def _count_jsonl_records(path: Path) -> int: + if not path.exists(): + return 0 + try: + return sum(1 for line in path.read_text(encoding="utf-8").splitlines() if line.strip()) + except Exception as exc: + logger.warning("Failed to count LeanOJ artifact log %s: %s", path, exc) + return 0 + + def _clear_sync_counts(self, session_id: str) -> None: + stale_keys = [key for key in self._artifact_sync_counts if key[0] == session_id] + for key in stale_keys: + self._artifact_sync_counts.pop(key, None) + self._artifact_sync_digests.pop(key, None) + + def _memory_items( + self, + *, + session_id: str, + mode: str, + accepted_ideas: list[str], + recursive_topics: list[str] | None = None, + verified_subproofs: list[dict[str, Any]], + partial_proofs: list[dict[str, Any]], + failed_subproofs: list[dict[str, Any]], + final_attempts: list[dict[str, Any]], + final_cycle_packets: list[dict[str, Any]], + current_final_cycle_packet: dict[str, Any] | None, + 
has_current_working_proof_attempt: bool = False, + ) -> list[LeanOJMemoryItem]: + recent_failed_subproofs = failed_subproofs[-10:] + raw_items = { + ARTIFACT_FINAL_CYCLE_PACKETS: ( + "HISTORICAL FINAL-CYCLE FAILURE PACKETS", + self._format_final_cycle_packets(final_cycle_packets), + ), + ARTIFACT_VERIFIED_SUBPROOFS: ( + "VERIFIED SUBPROOFS / HELPER LEMMAS", + self._format_verified_subproofs_for_final(verified_subproofs) + if mode == "final_solver" + else self._format_verified_subproofs(verified_subproofs), + ), + ARTIFACT_PARTIAL_PROOFS: ( + "LEAN-ACCEPTED PARTIAL PROOF SCAFFOLDS", + self._format_partial_proofs_for_final(partial_proofs) + if mode == "final_solver" + else self._format_partial_proofs(partial_proofs), + ), + ARTIFACT_ACCEPTED_IDEAS: ( + "ACTIVE PROOF-PLAN NOTES" if mode == "final_solver" else "ACCEPTED BRAINSTORM IDEAS", + self._format_strings_for_final(accepted_ideas) + if mode == "final_solver" + else self._format_strings(accepted_ideas), + ), + ARTIFACT_RECURSIVE_TOPICS: ( + "RECURSIVE PROOF-REPAIR TOPICS", + self._format_strings_for_final(recursive_topics) + if mode == "final_solver" + else self._format_strings(recursive_topics), + ), + ARTIFACT_FAILED_SUBPROOFS: ( + "FAILED SUBPROOF FEEDBACK", + self._format_attempts(recent_failed_subproofs), + ), + } + + brainstorm_priority = [ + ARTIFACT_ACCEPTED_IDEAS, + ARTIFACT_PARTIAL_PROOFS, + ARTIFACT_VERIFIED_SUBPROOFS, + ARTIFACT_FINAL_CYCLE_PACKETS, + ] + if has_current_working_proof_attempt: + brainstorm_priority = [ + ARTIFACT_PARTIAL_PROOFS, + ARTIFACT_VERIFIED_SUBPROOFS, + ARTIFACT_ACCEPTED_IDEAS, + ARTIFACT_FINAL_CYCLE_PACKETS, + ] + + priority_by_mode = { + "final_solver": [ + ARTIFACT_VERIFIED_SUBPROOFS, + ARTIFACT_ACCEPTED_IDEAS, + ], + "brainstorm": brainstorm_priority, + "recursive_brainstorm": brainstorm_priority, + "subproof": [ + ARTIFACT_FAILED_SUBPROOFS, + ARTIFACT_VERIFIED_SUBPROOFS, + ARTIFACT_PARTIAL_PROOFS, + ARTIFACT_ACCEPTED_IDEAS, + ARTIFACT_FINAL_CYCLE_PACKETS, + ], + } + order = 
priority_by_mode.get(mode, priority_by_mode["brainstorm"]) + + items: list[LeanOJMemoryItem] = [] + for priority, artifact in enumerate(order): + title, text = raw_items[artifact] + if not text: + continue + items.append( + LeanOJMemoryItem( + artifact=artifact, + title=title, + text=text, + priority=priority, + source_name=self.source_name(session_id, artifact), + rag_only=artifact == ARTIFACT_FINAL_CYCLE_PACKETS, + ) + ) + return items + + @staticmethod + def _record_key(record: dict[str, Any] | None) -> str: + if not record: + return "" + try: + return json.dumps(record, sort_keys=True, default=str) + except TypeError: + return str(record) + + @staticmethod + async def _write_jsonl(path: Path, records: list[Any]) -> None: + async with aiofiles.open(path, "w", encoding="utf-8") as f: + for record in records: + await f.write(json.dumps(record, ensure_ascii=False) + "\n") + + @staticmethod + def _read_jsonl(path: Path) -> list[dict[str, Any]]: + if not path.exists(): + return [] + records: list[dict[str, Any]] = [] + try: + for line in path.read_text(encoding="utf-8").splitlines(): + if not line.strip(): + continue + item = json.loads(line) + if isinstance(item, dict): + records.append(item) + except Exception as exc: + logger.warning("Failed to load LeanOJ artifact log %s: %s", path, exc) + return records + + @staticmethod + def _records_to_strings(records: list[dict[str, Any]]) -> list[str]: + values: list[str] = [] + for record in records: + value = record.get("content", record) + if isinstance(value, str) and value.strip(): + values.append(value) + return values + + @staticmethod + def _format_strings(values: list[str]) -> str: + clean = [str(value).strip() for value in values if str(value).strip()] + return "\n".join(f"{index}. 
{value}" for index, value in enumerate(clean, start=1)) + + @staticmethod + def _format_strings_for_final(values: list[str]) -> str: + clean = [LeanOJContextManager._final_mode_text(value).strip() for value in values if str(value).strip()] + return "\n".join(f"{index}. {value}" for index, value in enumerate(clean, start=1)) + + @staticmethod + def _format_attempts(records: list[dict[str, Any]]) -> str: + blocks: list[str] = [] + for index, record in enumerate(records, start=1): + lean_code = str(record.get("lean_code") or "").strip() + lean_feedback = _remove_attempt_count_language(record.get("lean_feedback") or "") + feedback_lines = ["Lean pass feedback:", lean_feedback] if lean_feedback else [] + blocks.append( + "\n".join( + [ + f"FEEDBACK ITEM {index}: {_remove_attempt_count_language(record.get('request', 'proof feedback'))}", + "Error summary: " + f"{_remove_attempt_count_language(record.get('error_summary', record.get('error_output', '')))}", + *feedback_lines, + "Lean code:", + lean_code or "[not recorded]", + "---", + ] + ) + ) + return "\n".join(blocks) + + @staticmethod + def _format_refuted_construction_warnings( + records: list[dict[str, Any]], + *, + limit: int = 5, + max_chars: int = 1500, + ) -> str: + """Compact final-mode warnings for failed routes, kept separate from proof evidence.""" + clean_records = [record for record in records if isinstance(record, dict)] + blocks: list[str] = [] + for record in clean_records[-limit:]: + content = str(record.get("content") or record.get("summary") or record.get("error_summary") or "").strip() + if not content: + continue + reason = str( + record.get("reasoning") + or record.get("validator_summary") + or record.get("validator_reasoning") + or record.get("edit_reasoning") + or "" + ).strip() + line = LeanOJContextManager._final_mode_text(content) + if reason: + line = f"{line} Reason: {LeanOJContextManager._final_mode_text(reason)}" + blocks.append(line) + + if not blocks: + return "" + + text = 
"\n".join(f"{index}. {block}" for index, block in enumerate(blocks, start=1)) + if len(text) <= max_chars: + return text + return text[: max_chars - 20].rstrip() + "\n[truncated]" + + @staticmethod + def _format_final_cycle_packets(packets: list[dict[str, Any]]) -> str: + blocks: list[str] = [] + for index, packet in enumerate(packets, start=1): + attempts = packet.get("attempts") if isinstance(packet.get("attempts"), list) else [] + blocks.append( + "\n".join( + [ + f"FINAL-CYCLE FEEDBACK {index}", + f"Summary: {_remove_attempt_count_language(packet.get('summary', ''))}", + "Recent verification/edit feedback:", + LeanOJContextManager._format_attempts([dict(item) for item in attempts if isinstance(item, dict)]), + "---", + ] + ) + ) + return "\n".join(blocks) + + @staticmethod + def _format_verified_subproofs(subproofs: list[dict[str, Any]]) -> str: + blocks: list[str] = [] + for index, subproof in enumerate(subproofs, start=1): + lean_feedback = str(subproof.get("lean_feedback") or "").strip() + feedback_lines = ["Lean verifier feedback:", lean_feedback] if lean_feedback else [] + blocks.append( + "\n".join( + [ + f"SUBPROOF {index}: {subproof.get('request', '')}", + f"Role: {subproof.get('role', '')}", + f"Theorem/Lemma: {subproof.get('theorem_or_lemma', '')}", + *feedback_lines, + "Verified Lean 4 code:", + str(subproof.get("lean_code") or ""), + "---", + ] + ) + ) + return "\n".join(blocks) + + @staticmethod + def _final_mode_text(value: Any) -> str: + text = str(value or "") + cleaned = ( + text.replace("need_more_brainstorming", "additional proof context") + .replace("Brainstorm", "Proof memory") + .replace("brainstorm", "proof memory") + .replace("BRAINSTORM", "PROOF MEMORY") + ) + return _remove_attempt_count_language(cleaned) + + @classmethod + def _format_verified_subproofs_for_final(cls, subproofs: list[dict[str, Any]]) -> str: + blocks: list[str] = [] + for index, subproof in enumerate(subproofs, start=1): + lean_feedback = 
cls._final_mode_text(subproof.get("lean_feedback") or "").strip() + feedback_lines = ["Lean verifier feedback:", lean_feedback] if lean_feedback else [] + blocks.append( + "\n".join( + [ + f"SUBPROOF {index}: {cls._final_mode_text(subproof.get('request', ''))}", + f"Theorem/Lemma: {cls._final_mode_text(subproof.get('theorem_or_lemma', ''))}", + *feedback_lines, + "Verified Lean 4 code:", + str(subproof.get("lean_code") or ""), + "---", + ] + ) + ) + return "\n".join(blocks) + + @staticmethod + def _format_partial_proofs(partial_proofs: list[dict[str, Any]]) -> str: + blocks: list[str] = [] + for index, proof in enumerate(partial_proofs, start=1): + placeholders = ", ".join(proof.get("placeholder_tokens") or []) or "unknown" + blocks.append( + "\n".join( + [ + f"PARTIAL PROOF {index}: {proof.get('request', '')}", + f"Target: {proof.get('target', '')}; placeholders: {placeholders}", + f"Summary: {proof.get('summary', '')}", + "Lean-accepted incomplete scaffold:", + str(proof.get("lean_code") or ""), + "---", + ] + ) + ) + return "\n".join(blocks) + + @classmethod + def _format_partial_proofs_for_final(cls, partial_proofs: list[dict[str, Any]]) -> str: + blocks: list[str] = [] + for index, proof in enumerate(partial_proofs, start=1): + placeholders = ", ".join(proof.get("placeholder_tokens") or []) or "unknown" + blocks.append( + "\n".join( + [ + f"PARTIAL PROOF {index}: {cls._final_mode_text(proof.get('request', ''))}", + f"Placeholders: {placeholders}", + f"Summary: {cls._final_mode_text(proof.get('summary', ''))}", + "Lean-accepted incomplete scaffold:", + str(proof.get("lean_code") or ""), + "---", + ] + ) + ) + return "\n".join(blocks) + + @staticmethod + def _format_final_cycle_packet(packet: dict[str, Any] | None) -> str: + if not packet: + return "" + attempts = packet.get("attempts") if isinstance(packet.get("attempts"), list) else [] + partial_proofs = packet.get("partial_proofs") if isinstance(packet.get("partial_proofs"), list) else [] + lines = [ + 
"CURRENT FINAL-CYCLE FEEDBACK", + "This is the immediate final-loop feedback to use for repairing the current proof.", + LeanOJContextManager._format_attempts([dict(item) for item in attempts if isinstance(item, dict)]), + "Partial final scaffolds captured during this cycle:", + LeanOJContextManager._format_partial_proofs( + [dict(item) for item in partial_proofs if isinstance(item, dict)] + ) + or "[none recorded]", + ] + return "\n".join(lines).strip() + + @staticmethod + def _format_working_proof_attempt(packet: dict[str, Any] | None) -> str: + if not packet: + return "" + verified = packet.get("verified_subproofs") if isinstance(packet.get("verified_subproofs"), list) else [] + partials = packet.get("partial_final_proofs") if isinstance(packet.get("partial_final_proofs"), list) else [] + parts = [ + "CURRENT WORKING PROOF ATTEMPT", + "This is the proof attempt the next LeanOJ brainstorm must repair or complete directly.", + f"Trigger: {packet.get('trigger', '')}", + f"Requested path: {packet.get('requested_path', '')}", + f"Stuck reason: {_remove_attempt_count_language(packet.get('stuck_reason', ''))}", + ( + "Master proof metadata: " + f"version={packet.get('master_proof_version', 0)}, " + f"lines={packet.get('master_proof_line_count', 0)}, " + f"sha256={packet.get('master_proof_hash', '')}" + ), + f"Last edit summary: {packet.get('master_proof_last_edit_summary', '')}", + "Latest master_proof.lean:", + str(packet.get("master_proof") or "[not initialized]").strip(), + "Recent final solver feedback:", + str(packet.get("recent_final_attempts") or "[none recorded]").strip(), + "Verified helper subproofs available to reuse:", + LeanOJContextManager._format_verified_subproofs([dict(item) for item in verified if isinstance(item, dict)]) + or "[none recorded]", + "Lean-accepted partial final scaffolds:", + LeanOJContextManager._format_partial_proofs([dict(item) for item in partials if isinstance(item, dict)]) + or "[none recorded]", + ] + old_attempt = 
str(packet.get("old_attempt_before_redo") or "").strip() + if old_attempt: + validator_justification = str( + packet.get("old_attempt_before_redo_validator_justification") or "" + ).strip() + apparent_issue = str(packet.get("old_attempt_before_redo_apparent_issue") or "").strip() + parts += [ + "", + "OLD ATTEMPT THE SUBMITTER DECIDED TO REDO (preserved for reference only; do NOT revert to this):", + f"Original version: v{packet.get('old_attempt_before_redo_version', '?')}", + ( + "Old attempt metadata: " + f"lines={packet.get('old_attempt_before_redo_line_count', 0)}, " + f"chars={packet.get('old_attempt_before_redo_char_count', 0)}, " + f"sha256={packet.get('old_attempt_before_redo_hash', '')}" + ), + f"Summary: {packet.get('old_attempt_before_redo_summary', '')}", + "WHY THE VALIDATOR ALLOWED THIS REDO/SHORTENING:", + validator_justification or "[No validator justification was recorded.]", + "APPARENT ISSUE WITH THIS OLD LONGER ATTEMPT:", + apparent_issue or "[No apparent issue was recorded.]", + "Old proof content:", + old_attempt, + ] + return "\n".join(parts).strip() + + +leanoj_context_manager = LeanOJContextManager() \ No newline at end of file diff --git a/backend/leanoj/core/leanoj_coordinator.py b/backend/leanoj/core/leanoj_coordinator.py new file mode 100644 index 0000000..52cc4e8 --- /dev/null +++ b/backend/leanoj/core/leanoj_coordinator.py @@ -0,0 +1,5084 @@ +"""Coordinator for the additive LeanOJ proof-solver mode.""" +from __future__ import annotations + +import asyncio +import hashlib +import json +import logging +import re +import shutil +import time +import uuid +from datetime import datetime +from pathlib import Path +from typing import Any, Awaitable, Callable, Optional + +import aiofiles + +from backend.leanoj.core.leanoj_context import ( + ARTIFACT_ACCEPTED_IDEAS, + ARTIFACT_FAILED_SUBPROOFS, + ARTIFACT_FINAL_ATTEMPTS, + ARTIFACT_FINAL_CYCLE_PACKETS, + ARTIFACT_PARTIAL_PROOFS, + ARTIFACT_RECURSIVE_TOPICS, + ARTIFACT_VERIFIED_SUBPROOFS, + 
_remove_attempt_count_language, + leanoj_context_manager, +) +from backend.leanoj.prompts import ( + build_brainstorm_batch_validation_prompt, + build_brainstorm_prompt, + build_brainstorm_prune_review_prompt, + build_brainstorm_prune_validation_prompt, + build_brainstorm_validation_prompt, + build_final_solution_review_prompt, + build_final_solver_prompt, + build_master_proof_edit_validation_prompt, + build_path_decision_prompt, + build_path_validation_prompt, + build_sufficiency_prompt, + build_topic_batch_validation_prompt, + build_topic_candidate_prompt, + build_topic_selection_prompt, + build_topic_validation_prompt, +) +from backend.autonomous.memory.autonomous_api_logger import autonomous_api_logger +from backend.autonomous.memory.proof_database import proof_database +from backend.shared.api_client_manager import api_client_manager +from backend.shared.brainstorm_proof_gate import is_lean_proof_submission, verify_brainstorm_proof_candidate +from backend.shared.config import rag_config, system_config +from backend.shared.json_parser import parse_json +from backend.shared.lean4_client import Lean4Result, get_lean4_client +from backend.shared.lean_proof_integrity import strip_lean_comments_and_strings, validate_lean_proof_integrity +from backend.shared.model_error_utils import is_non_retryable_model_error +from backend.shared.models import ( + LeanOJAttemptRecord, + LeanOJRoleConfig, + LeanOJStartRequest, + LeanOJState, + LeanOJSubproofRecord, + ModelConfig, + ProofAttemptFeedback, + ProofRecord, + WorkflowTask, +) +from backend.shared.token_tracker import token_tracker +from backend.shared.utils import count_tokens + +logger = logging.getLogger(__name__) + +BroadcastFn = Optional[Callable[[str, dict[str, Any]], Awaitable[None]]] +_LEAN_PLACEHOLDER_RE = re.compile(r"(?{_LEAN_TOP_LEVEL_DECL_KIND_PATTERN})\s+(?P[A-Za-z_][A-Za-z0-9_'.]*|«[^»]+»)?", + re.DOTALL, +) +_TERMINAL_PHASES = {"verified"} +_ACTIVE_PHASES = { + "initial_topic_candidates", + 
"initial_brainstorm", + "path_decision", + "recursive_brainstorm", + "final_proof_loop", +} +_PHASE_PROGRESS_RANK = { + "idle": 0, + "initial_topic_candidates": 1, + "initial_brainstorm": 2, + "recursive_brainstorm": 3, + "path_decision": 4, + "final_proof_loop": 5, + "error": 7, + "stopped": 7, +} +_LEANOJ_PATH_OPTIONS = ("solve_final_now", "need_more_brainstorming") +_LEANOJ_PATH_OPTIONS_SET = set(_LEANOJ_PATH_OPTIONS) +_LEANOJ_PROOF_EDIT_ACTIONS = {"edit_proof"} +_LEANOJ_PROOF_EDIT_OPERATIONS = {"full_content", "replace", "insert_after", "delete"} +_MASTER_PROOF_EDIT_LOG_COMPACT_RECORD_LIMIT = 500 +_MASTER_PROOF_EDIT_LOG_RECENT_RECORDS_TO_KEEP = 150 +_MASTER_PROOF_NO_PROGRESS_LIMIT = 8 +_MASTER_PROOF_STALE_EDIT_FAILURE_HANDOFF_COUNT = 3 +_MASTER_PROOF_EDIT_SUMMARY_LIMIT = 1000 +_MASTER_PROOF_SHORTENING_CHAR_THRESHOLD = 80 +_LEANOJ_CONTEXT_ROLES = {"active_plan", "verified_hint", "refuted_construction", "scratch"} +_LEANOJ_FINAL_ACTIVE_CONTEXT_ROLES = {"active_plan"} +_LEANOJ_REFUTED_CONTEXT_TERMS = ( + "counterexample", + "refuted", + "do not use", + "fails at", + "fails for", + "falsified", + "false construction", + "invalid construction", + "construction is invalid", + "candidate is invalid", +) +_LEANOJ_ACTIVE_PLAN_CONTEXT_TERMS = ( + "active proof plan", + "current proof route", + "chosen proof route", + "current chosen proof route", + "master proof route", + "next obligation", +) + + +class LeanOJConfigurationError(RuntimeError): + """Non-retryable LeanOJ configuration problem.""" + + +_BrainstormSubmission = tuple[int, str, dict[str, Any]] + + +class _LeanOJBrainstormSubmissionQueue: + """LeanOJ-local pending queue with aggregator-style fairness accounting.""" + + def __init__(self, submitter_count: int) -> None: + self.queue: asyncio.Queue[_BrainstormSubmission] = asyncio.Queue() + self.submitter_count = submitter_count + self.pending_by_submitter: dict[int, int] = {} + self.global_paused = False + self.paused_submitters: set[int] = set() + + def 
qsize(self) -> int: + return self.queue.qsize() + + def count_for_submitter(self, submitter_index: int) -> int: + return self.pending_by_submitter.get(submitter_index, 0) + + def should_pause_submitter(self, submitter_index: int) -> bool: + if self.qsize() >= system_config.queue_overflow_threshold: + return True + if self.submitter_count <= 1: + return False + return ( + self.count_for_submitter(submitter_index) + > system_config.per_submitter_queue_threshold + ) + + async def put(self, item: _BrainstormSubmission) -> None: + await self.queue.put(item) + submitter_index = item[0] + self.pending_by_submitter[submitter_index] = ( + self.pending_by_submitter.get(submitter_index, 0) + 1 + ) + + async def dequeue_batch( + self, + *, + max_count: int = 3, + timeout: float = 1.0, + collect_window: float = 0.25, + ) -> list[_BrainstormSubmission]: + try: + first = await asyncio.wait_for(self.queue.get(), timeout=timeout) + except asyncio.TimeoutError: + return [] + + batch = [first] + self._decrement_submitter(first[0]) + deadline = time.monotonic() + collect_window + while len(batch) < max_count: + try: + item = self.queue.get_nowait() + batch.append(item) + self._decrement_submitter(item[0]) + continue + except asyncio.QueueEmpty: + pass + + remaining = deadline - time.monotonic() + if remaining <= 0: + break + try: + item = await asyncio.wait_for(self.queue.get(), timeout=remaining) + batch.append(item) + self._decrement_submitter(item[0]) + except asyncio.TimeoutError: + break + return batch + + def refresh_pause_transitions(self) -> dict[str, Any]: + queue_size = self.qsize() + next_global_paused = queue_size >= system_config.queue_overflow_threshold + next_paused_submitters = self._current_paused_submitters() + + transitions = { + "queue_size": queue_size, + "global_paused": next_global_paused, + "global_changed": next_global_paused != self.global_paused, + "submitters_paused": next_paused_submitters - self.paused_submitters, + "submitters_resumed": 
self.paused_submitters - next_paused_submitters, + } + self.global_paused = next_global_paused + self.paused_submitters = next_paused_submitters + return transitions + + def _current_paused_submitters(self) -> set[int]: + if self.submitter_count <= 1: + return set() + return { + submitter_index + for submitter_index, pending_count in self.pending_by_submitter.items() + if pending_count > system_config.per_submitter_queue_threshold + } + + def _decrement_submitter(self, submitter_index: int) -> None: + pending_count = self.pending_by_submitter.get(submitter_index, 0) + if pending_count <= 1: + self.pending_by_submitter.pop(submitter_index, None) + return + self.pending_by_submitter[submitter_index] = pending_count - 1 + + +class LeanOJCoordinator: + """Run the proof-only LeanOJ workflow as an isolated third mode.""" + + def __init__(self) -> None: + self._running = False + self._state = LeanOJState() + self._request: Optional[LeanOJStartRequest] = None + self._stop_event = asyncio.Event() + self._main_task: Optional[asyncio.Task] = None + self._broadcast_callback: BroadcastFn = None + self._task_sequences: dict[str, int] = {} + + self._validated_topics: list[str] = [] + self._recursive_topics: list[str] = [] + self._accepted_ideas: list[str] = [] + self._accepted_idea_records: list[dict[str, Any]] = [] + self._failed_feedback: list[dict[str, Any]] = [] + self._last_brainstorm_validation_decisions: list[dict[str, Any]] = [] + self._final_attempts: list[dict[str, Any]] = [] + self._final_context_events: list[dict[str, Any]] = [] + self._partial_proofs: list[dict[str, Any]] = [] + self._final_cycle_packets: list[dict[str, Any]] = [] + self._current_final_cycle_packet: Optional[dict[str, Any]] = None + self._current_working_proof_attempt: Optional[dict[str, Any]] = None + + self.workflow_tasks: list[WorkflowTask] = [] + self.completed_task_ids: set[str] = set() + self.current_task_id: Optional[str] = None + self._restored_from_disk = False + 
self._master_proof_no_progress_count = 0 + self._last_master_proof_edit_signature = "" + + @property + def is_running(self) -> bool: + return self._running + + @property + def is_active(self) -> bool: + return self._running or (self._main_task is not None and not self._main_task.done()) + + def set_broadcast_callback(self, callback: Callable[[str, dict[str, Any]], Awaitable[None]]) -> None: + self._broadcast_callback = callback + + async def _broadcast(self, event: str, data: Optional[dict[str, Any]] = None) -> None: + if self._broadcast_callback: + await self._broadcast_callback(event, data or {}) + + def get_state(self) -> LeanOJState: + return self._state + + def get_status(self) -> dict[str, Any]: + self._ensure_accepted_idea_records() + payload = self._state.model_dump(mode="json") + payload.update( + { + "validated_topics": list(self._validated_topics), + "accepted_ideas": list(self._accepted_ideas), + "accepted_idea_records": list(self._accepted_idea_records), + "failed_feedback": list(self._failed_feedback[-20:]), + "final_attempts": list(self._final_attempts[-20:]), + "final_context_events": list(self._final_context_events[-20:]), + "partial_proofs": list(self._partial_proofs[-20:]), + "final_cycle_packets": list(self._final_cycle_packets[-5:]), + "current_final_cycle_packet": self._current_final_cycle_packet, + "current_working_proof_attempt": self._current_working_proof_attempt, + "workflow_tasks": [task.model_dump(mode="json") for task in self.workflow_tasks], + "resume_available": self._request is not None + and self._state.phase not in _TERMINAL_PHASES + and not self._state.final_solution, + } + ) + return payload + + async def restore_latest_session(self, *, auto_resume: bool = False) -> bool: + """Restore the latest saved LeanOJ session and optionally resume it.""" + if self.is_active: + return False + + state_file = self._find_best_resumable_state_file() or self._find_latest_state_file() + if state_file is None: + return False + + try: + async with 
aiofiles.open(state_file, "r", encoding="utf-8") as f: + payload = json.loads(await f.read()) + self._restore_from_payload(payload) + except Exception as exc: + logger.warning("Failed to restore LeanOJ session from %s: %s", state_file, exc) + return False + + logger.info( + "Restored LeanOJ session %s (phase=%s, accepted=%s, final_attempts=%s)", + self._state.session_id, + self._state.phase, + len(self._accepted_ideas), + self._state.final_attempt_count, + ) + + if ( + auto_resume + and self._request is not None + and self._state.phase not in _TERMINAL_PHASES + and not self._state.final_solution + ): + logger.info("Auto-resuming interrupted Proof Solver session %s", self._state.session_id) + self.start_in_background() + + return True + + async def initialize(self, request: LeanOJStartRequest) -> None: + if self.is_active: + raise RuntimeError("Proof Solver is already running") + if not request.user_prompt.strip(): + raise ValueError("Proof Solver user prompt is required") + if not request.lean_template.strip(): + raise ValueError("Proof Solver template is required") + if not request.brainstorm_submitters: + raise ValueError("At least one Proof Solver brainstorm submitter is required") + missing_roles = self._missing_model_roles(request) + if missing_roles: + raise ValueError( + "Proof Solver role model configuration is incomplete. 
Missing model for: " + + ", ".join(missing_roles) + ) + + self._request = request + self._stop_event = asyncio.Event() + self._task_sequences = {} + self._validated_topics = [] + self._recursive_topics = [] + self._accepted_ideas = [] + self._accepted_idea_records = [] + self._failed_feedback = [] + self._last_brainstorm_validation_decisions = [] + self._final_attempts = [] + self._final_context_events = [] + self._partial_proofs = [] + self._final_cycle_packets = [] + self._current_final_cycle_packet = None + self._current_working_proof_attempt = None + self.workflow_tasks = [] + self.completed_task_ids = set() + self.current_task_id = None + self._restored_from_disk = False + self._master_proof_no_progress_count = 0 + self._last_master_proof_edit_signature = "" + + self._state = LeanOJState( + is_running=False, + phase="idle", + session_id=f"leanoj_{datetime.utcnow().strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:8]}", + ) + + self._configure_roles(request) + await self._persist_state() + + async def resume_or_initialize(self, request: LeanOJStartRequest) -> bool: + """Resume matching saved progress when possible, otherwise create a new run.""" + if self.is_active: + raise RuntimeError("Proof Solver is already running") + if not request.user_prompt.strip(): + raise ValueError("Proof Solver user prompt is required") + if not request.lean_template.strip(): + raise ValueError("Proof Solver template is required") + if not request.brainstorm_submitters: + raise ValueError("At least one Proof Solver brainstorm submitter is required") + missing_roles = self._missing_model_roles(request) + if missing_roles: + raise ValueError( + "Proof Solver role model configuration is incomplete. 
Missing model for: " + + ", ".join(missing_roles) + ) + + matching_state_file = self._find_best_matching_state_file(request) + if matching_state_file is None: + await self.initialize(request) + return False + + try: + async with aiofiles.open(matching_state_file, "r", encoding="utf-8") as f: + payload = json.loads(await f.read()) + self._restore_from_payload(payload) + except Exception as exc: + logger.warning("Failed to restore matching LeanOJ session from %s: %s", matching_state_file, exc) + await self.initialize(request) + return False + + # Keep accumulated proof context, but let the restarted run use the + # latest model/fallback settings from the UI. + self._request = request + self._stop_event = asyncio.Event() + self._configure_roles(request) + self._restored_from_disk = True + await self._persist_state() + logger.info( + "Prepared LeanOJ session %s for resume from %s", + self._state.session_id, + matching_state_file, + ) + return True + + def start_in_background(self) -> bool: + if self._main_task and not self._main_task.done(): + return False + self._main_task = asyncio.create_task(self.start()) + self._main_task.add_done_callback(self._on_task_done) + return True + + def _on_task_done(self, task: asyncio.Task) -> None: + try: + task.result() + except asyncio.CancelledError: + logger.info("LeanOJ coordinator task cancelled") + except Exception: + logger.exception("LeanOJ coordinator task failed") + finally: + if self._main_task is task: + self._main_task = None + + def _enable_api_logging(self) -> None: + async def log_callback( + task_id, + role_id, + model, + provider, + prompt, + response, + tokens_used, + duration_ms, + success, + error, + phase, + ): + try: + await autonomous_api_logger.log_api_call( + task_id=task_id, + role_id=role_id, + model=model, + provider=provider, + prompt=prompt, + response_content=response, + tokens_used=tokens_used, + duration_ms=duration_ms, + success=success, + error=error, + phase=phase or self._state.phase or 
"leanoj", + workflow="leanoj", + ) + except Exception as exc: + logger.error("Failed to log LeanOJ API call: %s", exc) + + api_client_manager.set_autonomous_logger_callback(log_callback) + logger.info("LeanOJ API logging enabled") + + async def start(self) -> None: + if self._request is None: + raise RuntimeError("LeanOJ coordinator must be initialized before start") + if self._running: + return + + self._running = True + self._state.is_running = True + if self._state.phase == "idle": + self._state.phase = "initial_topic_candidates" + elif self._state.phase in {"stopped", "error"}: + self._state.phase = self._infer_resume_phase() + self._remember_active_phase() + self._state.updated_at = datetime.now() + self._state.last_error = "" + token_tracker.reset() + token_tracker.start_timer() + self._enable_api_logging() + await self._persist_and_broadcast("leanoj_started") + + try: + await self._run_workflow(self._request) + except asyncio.CancelledError: + raise + except Exception as exc: + logger.exception("LeanOJ workflow failed") + self._state.phase = "error" + self._state.last_error = str(exc) + await self._persist_and_broadcast("leanoj_error", {"message": str(exc)}) + finally: + self._running = False + self._state.is_running = False + if self._state.phase not in {"verified", "error"}: + self._remember_active_phase() + self._state.updated_at = datetime.now() + token_tracker.stop_timer() + api_client_manager.set_autonomous_logger_callback(None) + await self._persist_and_broadcast("leanoj_stopped") + + async def stop(self) -> None: + if not self.is_active and not self._state.session_id: + return + self._stop_event.set() + task = self._main_task + if task and not task.done(): + try: + await asyncio.wait_for(asyncio.shield(task), timeout=5) + except asyncio.TimeoutError: + task.cancel() + await asyncio.gather(task, return_exceptions=True) + if not self._running: + self._state.is_running = False + if self._state.phase not in {"verified", "error"}: + 
self._remember_active_phase() + await self._persist_and_broadcast("leanoj_stopped") + + async def clear(self) -> None: + """Clear Proof Solver progress. This is the explicit reset path.""" + if self.is_active: + await self.stop() + base = self._sessions_base_dir() + if base.exists(): + shutil.rmtree(base) + partial_base = self._partial_proofs_base_dir() + if partial_base.exists(): + shutil.rmtree(partial_base) + await leanoj_context_manager.clear_all() + + self._running = False + self._state = LeanOJState() + self._request = None + self._stop_event = asyncio.Event() + self._main_task = None + self._task_sequences = {} + self._validated_topics = [] + self._recursive_topics = [] + self._accepted_ideas = [] + self._accepted_idea_records = [] + self._failed_feedback = [] + self._last_brainstorm_validation_decisions = [] + self._final_attempts = [] + self._final_context_events = [] + self._partial_proofs = [] + self._final_cycle_packets = [] + self._current_final_cycle_packet = None + self._current_working_proof_attempt = None + self.workflow_tasks = [] + self.completed_task_ids = set() + self.current_task_id = None + self._restored_from_disk = False + self._master_proof_no_progress_count = 0 + self._last_master_proof_edit_signature = "" + await self._broadcast("leanoj_cleared", self.get_status()) + + async def skip_brainstorm(self) -> None: + self._state.skip_brainstorm_requested = True + await self._persist_and_broadcast("leanoj_skip_brainstorm_requested") + + async def force_brainstorm(self) -> None: + self._state.force_brainstorm_requested = True + await self._persist_and_broadcast("leanoj_force_brainstorm_requested") + + async def _consume_skip_brainstorm(self) -> bool: + if not self._state.skip_brainstorm_requested: + return False + self._state.skip_brainstorm_requested = False + self._state.current_path_decision = "solve_final_now" + self._state.user_forced_final_cycle = True + self._state.phase = "final_proof_loop" + await self._persist_and_broadcast("leanoj_brainstorm_skipped") + return True + + async def _consume_force_brainstorm(self) -> 
bool: + if not self._state.force_brainstorm_requested: + return False + self._state.force_brainstorm_requested = False + self._state.skip_brainstorm_requested = False + self._state.user_forced_final_cycle = False + self._state.current_path_decision = "need_more_brainstorming" + # A user-forced recursive brainstorm is a fresh acceptance window. + # Otherwise the recursive sufficiency modulo can reuse the prior cycle's + # start count and fire before five new accepted brainstorms arrive. + self._state.active_brainstorm_phase = "" + self._state.active_brainstorm_start_count = self._state.brainstorm_acceptance_events + await self._set_current_working_proof_attempt( + trigger="user_force_brainstorm", + requested_path="need_more_brainstorming", + stuck_reason="User requested recursive brainstorming while preserving the current master proof draft.", + ) + self._state.phase = "recursive_brainstorm" + await self._persist_and_broadcast("leanoj_brainstorm_forced") + return True + + async def _run_workflow(self, request: LeanOJStartRequest) -> None: + if self._state.phase in {"idle", "initial_topic_candidates"}: + selected_topic = await self._initial_topic_phase(request) + if self._should_stop(): + return + self._state.selected_topic = selected_topic + + if await self._consume_force_brainstorm(): + pass + elif self._state.phase == "initial_brainstorm" or ( + self._state.phase == "initial_topic_candidates" and self._state.selected_topic + ): + await self._initial_brainstorm_phase(request) + + if self._state.phase == "recursive_brainstorm": + await self._recursive_brainstorm_phase(request) + + if self._state.phase == "proof_storm": + # Legacy sessions may resume from the removed proof-only brainstorm path. + # Continue with recursive brainstorming because verified proofs can now + # be generated directly inside any brainstorm phase. 
+ await self._recursive_brainstorm_phase(request) + + if self._state.phase == "final_proof_loop": + await self._final_proof_loop(request) + + while not self._should_stop() and self._state.phase != "verified": + if await self._consume_force_brainstorm(): + continue + + if self._state.phase == "final_proof_loop" or self._state.user_forced_final_cycle: + await self._final_proof_loop(request) + continue + + if self._state.phase == "recursive_brainstorm": + await self._recursive_brainstorm_phase(request) + continue + + if self._state.phase == "proof_storm": + await self._recursive_brainstorm_phase(request) + continue + + decision = await self._path_decision_phase(request) + if self._should_stop(): + return + + if await self._consume_force_brainstorm(): + continue + + if decision == "solve_final_now": + await self._final_proof_loop(request) + elif decision == "need_more_brainstorming": + await self._recursive_brainstorm_phase(request) + else: + logger.warning("Unknown Proof Solver path decision %s; falling back to recursive brainstorming", decision) + await self._recursive_brainstorm_phase(request) + + async def _initial_topic_phase(self, request: LeanOJStartRequest) -> str: + self._state.phase = "initial_topic_candidates" + await self._persist_and_broadcast("leanoj_phase_changed") + + if not await self._collect_initial_topics(request, target_topics=5): + return "Direct Proof Solver proof solving from the user's template" + + if self._should_stop(): + return "" + if not self._validated_topics: + return "Direct Proof Solver proof solving from the user's template" + + selected_raw = await self._call_json( + request.topic_generator, + "leanoj_topic", + "leanoj_topic_selector", + build_topic_selection_prompt(request.user_prompt, request.lean_template, self._validated_topics), + ) + selected = str(selected_raw.get("topic") or "").strip() or self._validated_topics[0] + if not await self._validate_topic(request, selected): + selected = self._validated_topics[0] + + await 
self._persist_and_broadcast("leanoj_initial_topic_selected", {"topic": selected}) + return selected + + async def _validate_topic( + self, + request: LeanOJStartRequest, + topic: str, + accepted_topics: Optional[list[str]] = None, + ) -> bool: + raw = await self._call_json( + request.topic_validator, + "leanoj_topic_val", + "leanoj_topic_validator", + build_topic_validation_prompt( + request.user_prompt, + request.lean_template, + topic, + accepted_topics if accepted_topics is not None else self._validated_topics, + ), + ) + accepted = str(raw.get("decision") or "").strip().lower() == "accept" + self._last_brainstorm_validation_decisions = [ + { + "accepted": accepted, + "reasoning": str(raw.get("reasoning") or "").strip(), + "summary": str(raw.get("summary") or "").strip(), + } + ] + return accepted + + async def _validate_topic_batch( + self, + request: LeanOJStartRequest, + topics: list[str], + accepted_topics: Optional[list[str]] = None, + ) -> list[bool]: + if not topics: + return [] + if len(topics) == 1: + return [await self._validate_topic(request, topics[0], accepted_topics)] + + raw = await self._call_json( + request.topic_validator, + "leanoj_topic_val", + "leanoj_topic_validator", + build_topic_batch_validation_prompt( + request.user_prompt, + request.lean_template, + topics, + accepted_topics if accepted_topics is not None else self._validated_topics, + ), + ) + decisions = raw.get("decisions") + if not isinstance(decisions, list) or len(decisions) != len(topics): + logger.warning( + "LeanOJ topic batch validator returned %s decisions for %s topics", + len(decisions) if isinstance(decisions, list) else "non-list", + len(topics), + ) + return [False for _ in topics] + + accepted: list[bool] = [] + for expected_number, decision_payload in enumerate(decisions, start=1): + if not isinstance(decision_payload, dict): + accepted.append(False) + continue + if decision_payload.get("topic_number") != expected_number: + logger.warning( + "LeanOJ topic batch 
validator returned out-of-order decision: expected %s, got %s", + expected_number, + decision_payload.get("topic_number"), + ) + return [False for _ in topics] + accepted.append(str(decision_payload.get("decision") or "").strip().lower() == "accept") + return accepted + + async def _collect_initial_topics(self, request: LeanOJStartRequest, *, target_topics: int) -> bool: + if self._state.skip_brainstorm_requested: + await self._persist_and_broadcast("leanoj_brainstorm_skip_deferred") + return False + + topic_queue: asyncio.Queue[tuple[int, str]] = asyncio.Queue( + maxsize=max(3, len(request.brainstorm_submitters) * 2) + ) + submitter_tasks = [ + asyncio.create_task( + self._topic_submitter_loop( + request, + index, + submitter, + topic_queue, + target_topics=target_topics, + ) + ) + for index, submitter in enumerate(request.brainstorm_submitters, start=1) + ] + logger.info( + "LeanOJ initial topic submitters started (submitters=%s, target_topics=%s)", + len(submitter_tasks), + target_topics, + ) + await self._broadcast( + "leanoj_topic_submitters_started", + { + "submitter_count": len(submitter_tasks), + "target_topics": target_topics, + }, + ) + + try: + while len(self._validated_topics) < target_topics and not self._should_stop(): + if self._state.skip_brainstorm_requested: + await self._persist_and_broadcast("leanoj_brainstorm_skip_deferred") + return False + + remaining_topics = target_topics - len(self._validated_topics) + batch = await self._dequeue_topic_batch(topic_queue, max_count=min(3, remaining_topics)) + if not batch: + if all(task.done() for task in submitter_tasks): + errors = [ + task.exception() + for task in submitter_tasks + if task.done() and not task.cancelled() and task.exception() is not None + ] + if errors: + raise RuntimeError(f"All LeanOJ topic submitters stopped: {errors[0]}") + return bool(self._validated_topics) + continue + + topics = [topic for _, topic in batch] + logger.info( + "LeanOJ topic batch validation started (batch_size=%s, 
submitters=%s)", + len(batch), + [submitter_index for submitter_index, _ in batch], + ) + await self._broadcast( + "leanoj_topic_batch_validation_started", + { + "batch_size": len(batch), + "submitters": [submitter_index for submitter_index, _ in batch], + "accepted_topics": len(self._validated_topics), + "target_topics": target_topics, + }, + ) + decisions = await self._validate_topic_batch( + request, + topics, + accepted_topics=list(self._validated_topics), + ) + for (submitter_index, topic), accepted in zip(batch, decisions): + submitter_config = request.brainstorm_submitters[submitter_index - 1] + if accepted: + self._validated_topics.append(topic) + await self._persist_and_broadcast( + "leanoj_topic_validated", + { + "topic": topic, + "submitter": submitter_index, + "submitter_id": submitter_index, + "submitter_model": submitter_config.model_id, + "submitter_provider": submitter_config.provider, + "accepted_topics": len(self._validated_topics), + "target_topics": target_topics, + }, + ) + else: + await self._broadcast( + "leanoj_topic_rejected", + { + "topic": topic, + "submitter": submitter_index, + "submitter_id": submitter_index, + "submitter_model": submitter_config.model_id, + "submitter_provider": submitter_config.provider, + "accepted_topics": len(self._validated_topics), + "target_topics": target_topics, + }, + ) + return bool(self._validated_topics) + finally: + for task in submitter_tasks: + task.cancel() + await asyncio.gather(*submitter_tasks, return_exceptions=True) + + async def _topic_submitter_loop( + self, + request: LeanOJStartRequest, + submitter_index: int, + submitter: LeanOJRoleConfig, + topic_queue: asyncio.Queue[tuple[int, str]], + *, + target_topics: int, + ) -> None: + task_prefix = f"leanoj_topic_sub{submitter_index}" + role_id = f"leanoj_topic_submitter_{submitter_index}" + attempt = 0 + while not self._should_stop(): + try: + attempt += 1 + topic_index = min(target_topics, len(self._validated_topics) + topic_queue.qsize() + 1) + 
await self._broadcast( + "leanoj_topic_generation_started", + { + "attempt": attempt, + "topic_index": topic_index, + "target_topics": target_topics, + "accepted_topics": len(self._validated_topics), + "submitter": submitter_index, + "submitter_id": submitter_index, + "submitter_model": submitter.model_id, + "submitter_provider": submitter.provider, + }, + ) + raw = await self._call_json( + submitter, + task_prefix, + role_id, + build_topic_candidate_prompt( + request.user_prompt, + request.lean_template, + self._validated_topics, + ), + temperature=api_client_manager.parallel_brainstorm_submitter_temperature(submitter_index), + ) + + topic = str(raw.get("topic") or "").strip() + if not topic: + await self._broadcast( + "leanoj_topic_empty", + { + "attempt": attempt, + "submitter": submitter_index, + "submitter_id": submitter_index, + }, + ) + continue + + await topic_queue.put((submitter_index, topic)) + await self._broadcast( + "leanoj_topic_candidate_queued", + { + "submitter": submitter_index, + "submitter_id": submitter_index, + "submitter_model": submitter.model_id, + "submitter_provider": submitter.provider, + "queue_size": topic_queue.qsize(), + "topic_preview": self._summarize_error(topic, limit=220), + }, + ) + except asyncio.CancelledError: + raise + except LeanOJConfigurationError: + raise + except Exception as exc: + logger.warning("LeanOJ topic submitter %s failed: %s", submitter_index, exc) + await self._broadcast( + "leanoj_topic_submitter_failed", + { + "submitter": submitter_index, + "submitter_id": submitter_index, + "message": str(exc), + }, + ) + await asyncio.sleep(2) + + async def _initial_brainstorm_phase(self, request: LeanOJStartRequest) -> None: + self._state.phase = "initial_brainstorm" + self._begin_brainstorm_acceptance_phase("initial_brainstorm") + await self._persist_and_broadcast("leanoj_phase_changed") + await self._brainstorm_until_path_check( + request, + phase_key="initial_brainstorm", + 
max_accepts=request.max_initial_brainstorm_accepts, + sufficiency_interval=10, + force_after_max=True, + ) + + async def _recursive_brainstorm_phase(self, request: LeanOJStartRequest) -> None: + if await self._consume_force_brainstorm(): + return + if await self._consume_skip_brainstorm(): + return + + resuming_recursive_phase = self._state.phase == "recursive_brainstorm" + if not resuming_recursive_phase: + self._state.recursive_cycle_count += 1 + self._state.active_brainstorm_phase = "" + + self._state.phase = "recursive_brainstorm" + self._begin_brainstorm_acceptance_phase("recursive_brainstorm") + await self._persist_and_broadcast("leanoj_phase_changed") + accepted_at_phase_entry = self._state.brainstorm_acceptance_events + logger.info( + "LeanOJ recursive brainstorm cycle %s %s (accepted_events=%s)", + self._state.recursive_cycle_count, + "resumed" if resuming_recursive_phase else "started", + accepted_at_phase_entry, + ) + await self._persist_and_broadcast( + "leanoj_recursive_brainstorm_started", + { + "cycle": self._state.recursive_cycle_count, + "resumed": resuming_recursive_phase, + "accepted_events": accepted_at_phase_entry, + }, + ) + + try: + if await self._consume_skip_brainstorm(): + return + await self._brainstorm_until_path_check( + request, + phase_key="recursive_brainstorm", + max_accepts=request.max_recursive_brainstorm_accepts, + sufficiency_interval=5, + force_after_max=True, + ) + if not self._should_stop(): + accepted_delta = self._state.brainstorm_acceptance_events - accepted_at_phase_entry + logger.info( + "LeanOJ recursive brainstorm cycle %s completed (accepted_delta=%s, total_acceptances=%s)", + self._state.recursive_cycle_count, + accepted_delta, + self._state.accepted_brainstorm_count, + ) + await self._persist_and_broadcast( + "leanoj_recursive_brainstorm_completed", + { + "cycle": self._state.recursive_cycle_count, + "accepted_delta": accepted_delta, + "total_acceptances": self._state.accepted_brainstorm_count, + 
"total_brainstorm_acceptance_events": self._state.brainstorm_acceptance_events, + }, + ) + finally: + if not self._should_stop(): + self._clear_current_final_cycle_packet() + + async def _brainstorm_until_path_check( + self, + request: LeanOJStartRequest, + *, + phase_key: str = "initial_brainstorm", + max_accepts: int, + sufficiency_interval: int, + force_after_max: bool, + ) -> None: + accepted_at_start = self._get_brainstorm_acceptance_start(phase_key) + run_exit_review = False + submission_queue = _LeanOJBrainstormSubmissionQueue( + submitter_count=len(request.brainstorm_submitters) + ) + submitter_tasks = [ + asyncio.create_task( + self._brainstorm_submitter_loop(request, index, submitter, submission_queue) + ) + for index, submitter in enumerate(request.brainstorm_submitters, start=1) + ] + logger.info( + "LeanOJ brainstorm submitters started (phase=%s, submitters=%s, max_accepts=%s, accepted_at_start=%s)", + phase_key, + len(submitter_tasks), + max_accepts, + accepted_at_start, + ) + await self._broadcast( + "leanoj_brainstorm_submitters_started", + { + "phase": phase_key, + "submitter_count": len(submitter_tasks), + "max_accepts": max_accepts, + "accepted_at_start": accepted_at_start, + }, + ) + + try: + while not self._should_stop(): + if await self._consume_force_brainstorm(): + run_exit_review = False + return + if await self._consume_skip_brainstorm(): + run_exit_review = False + return + + accepted_delta = self._state.brainstorm_acceptance_events - accepted_at_start + if accepted_delta >= max_accepts and force_after_max: + run_exit_review = True + logger.info( + "LeanOJ brainstorm phase limit reached (phase=%s, accepted_delta=%s, max_accepts=%s)", + phase_key, + accepted_delta, + max_accepts, + ) + await self._broadcast( + "leanoj_brainstorm_phase_limit_reached", + { + "phase": phase_key, + "accepted_delta": accepted_delta, + "max_accepts": max_accepts, + "total_acceptances": self._state.accepted_brainstorm_count, + }, + ) + 
self._finish_brainstorm_acceptance_phase_for_path_decision() + return + if ( + accepted_delta > 0 + and accepted_delta % sufficiency_interval == 0 + and self._state.brainstorm_acceptance_events + != self._state.active_brainstorm_last_sufficiency_check_count + ): + self._state.active_brainstorm_last_sufficiency_check_count = ( + self._state.brainstorm_acceptance_events + ) + logger.info( + "LeanOJ brainstorm sufficiency check started (phase=%s, accepted_delta=%s)", + phase_key, + accepted_delta, + ) + await self._broadcast( + "leanoj_sufficiency_check_started", + { + "phase": phase_key, + "accepted_delta": accepted_delta, + "total_acceptances": self._state.accepted_brainstorm_count, + }, + ) + enough = await self._sufficiency_check(request) + await self._persist_and_broadcast("leanoj_sufficiency_checked", {"enough": enough}) + if enough: + run_exit_review = True + self._finish_brainstorm_acceptance_phase_for_path_decision() + return + + batch = await self._dequeue_brainstorm_batch(submission_queue) + await self._sync_brainstorm_queue_pause_state(submission_queue, phase_key) + if not batch: + if all(task.done() for task in submitter_tasks): + errors = [ + task.exception() + for task in submitter_tasks + if task.done() and not task.cancelled() and task.exception() is not None + ] + if errors: + raise RuntimeError(f"All LeanOJ brainstorm submitters stopped: {errors[0]}") + run_exit_review = True + self._finish_brainstorm_acceptance_phase_for_path_decision() + return + continue + + submissions = [submission for _, submission, _ in batch] + logger.info( + "LeanOJ brainstorm batch validation started (phase=%s, batch_size=%s, submitters=%s)", + phase_key, + len(batch), + [submitter_index for submitter_index, _, _ in batch], + ) + await self._broadcast( + "leanoj_brainstorm_batch_validation_started", + { + "phase": phase_key, + "batch_size": len(batch), + "submitters": [submitter_index for submitter_index, _, _ in batch], + }, + ) + decisions = await 
self._validate_brainstorm_batch(request, submissions) + validation_decisions = list(self._last_brainstorm_validation_decisions) + for batch_index, ((submitter_index, submission, metadata), accepted) in enumerate( + zip(batch, decisions) + ): + submitter_config = request.brainstorm_submitters[submitter_index - 1] + if accepted: + await self._record_accepted_brainstorm_proof(request, submitter_index, metadata) + validation_feedback = ( + validation_decisions[batch_index] + if batch_index < len(validation_decisions) + else {} + ) + self._record_accepted_brainstorm_idea( + submission, + submitter_index, + phase_key, + validation_feedback, + ) + self._state.accepted_brainstorm_count = len(self._accepted_ideas) + submission_preview = self._summarize_error(submission, limit=220) + logger.info( + "LeanOJ brainstorm ACCEPTED: Submitter %s [%s] (phase=%s, total_acceptances=%s, event=%s) - %s", + submitter_index, + submitter_config.model_id, + phase_key, + self._state.accepted_brainstorm_count, + self._state.brainstorm_acceptance_events, + submission_preview, + ) + await self._persist_and_broadcast( + "leanoj_brainstorm_accepted", + { + "submitter": submitter_index, + "submitter_id": submitter_index, + "submitter_model": submitter_config.model_id, + "submitter_provider": submitter_config.provider, + "submission": submission, + "submission_preview": submission_preview, + "phase": phase_key, + "total_acceptances": self._state.accepted_brainstorm_count, + "total_brainstorm_acceptance_events": self._state.brainstorm_acceptance_events, + }, + ) + accepted_delta = self._state.brainstorm_acceptance_events - accepted_at_start + if ( + accepted_delta > 0 + and accepted_delta % 7 == 0 + and self._state.brainstorm_acceptance_events + != self._state.active_brainstorm_last_prune_review_count + ): + self._state.active_brainstorm_last_prune_review_count = ( + self._state.brainstorm_acceptance_events + ) + await self._perform_brainstorm_prune_review( + request, + phase_key, + 
reason=f"scheduled review after {accepted_delta} accepted brainstorm events", + ) + if ( + force_after_max + and self._state.brainstorm_acceptance_events - accepted_at_start >= max_accepts + ): + run_exit_review = True + self._finish_brainstorm_acceptance_phase_for_path_decision() + return + else: + validation_feedback = ( + validation_decisions[batch_index] + if batch_index < len(validation_decisions) + else {} + ) + self._state.rejected_brainstorm_count += 1 + self._record_brainstorm_rejection_feedback( + submitter_index, + submission, + validation_feedback, + ) + submission_preview = self._summarize_error(submission, limit=220) + rejection_reason = self._summarize_error( + validation_feedback.get("summary") + or validation_feedback.get("reasoning") + or "Rejected by brainstorm validator.", + limit=220, + ) + logger.info( + "LeanOJ brainstorm REJECTED: Submitter %s [%s] (phase=%s, total_rejections=%s) - %s", + submitter_index, + submitter_config.model_id, + phase_key, + self._state.rejected_brainstorm_count, + rejection_reason, + ) + await self._persist_and_broadcast( + "leanoj_brainstorm_rejected", + { + "submitter": submitter_index, + "submitter_id": submitter_index, + "submitter_model": submitter_config.model_id, + "submitter_provider": submitter_config.provider, + "submission": submission, + "submission_preview": submission_preview, + "validator_reasoning": validation_feedback.get("reasoning", ""), + "validator_summary": validation_feedback.get("summary", ""), + "rejection_reason": rejection_reason, + "phase": phase_key, + "total_acceptances": self._state.accepted_brainstorm_count, + "total_rejections": self._state.rejected_brainstorm_count, + }, + ) + finally: + for task in submitter_tasks: + task.cancel() + await asyncio.gather(*submitter_tasks, return_exceptions=True) + accepted_delta = self._state.brainstorm_acceptance_events - accepted_at_start + if ( + run_exit_review + and not self._should_stop() + and accepted_delta > 0 + and 
self._state.brainstorm_acceptance_events + != self._state.active_brainstorm_last_prune_review_count + ): + self._state.active_brainstorm_last_prune_review_count = ( + self._state.brainstorm_acceptance_events + ) + await self._perform_brainstorm_prune_review( + request, + phase_key, + reason=f"phase-exit review after {accepted_delta} accepted brainstorm events", + ) + + async def _wait_for_brainstorm_queue_turn( + self, + submission_queue: _LeanOJBrainstormSubmissionQueue, + submitter_index: int, + ) -> None: + while not self._should_stop() and submission_queue.should_pause_submitter(submitter_index): + await self._sync_brainstorm_queue_pause_state( + submission_queue, + self._state.active_brainstorm_phase or self._state.phase, + ) + await asyncio.sleep(2) + await self._sync_brainstorm_queue_pause_state( + submission_queue, + self._state.active_brainstorm_phase or self._state.phase, + ) + + async def _sync_brainstorm_queue_pause_state( + self, + submission_queue: _LeanOJBrainstormSubmissionQueue, + phase_key: str, + ) -> None: + transitions = submission_queue.refresh_pause_transitions() + queue_size = transitions["queue_size"] + if transitions["global_changed"]: + if transitions["global_paused"]: + logger.info( + "LeanOJ brainstorm queue size (%s) >= threshold (%s). Pausing all submitters.", + queue_size, + system_config.queue_overflow_threshold, + ) + await self._broadcast( + "leanoj_brainstorm_submitters_paused", + { + "phase": phase_key, + "queue_size": queue_size, + "threshold": system_config.queue_overflow_threshold, + }, + ) + else: + logger.info( + "LeanOJ brainstorm queue size (%s) < threshold (%s). 
Resuming all submitters.", + queue_size, + system_config.queue_overflow_threshold, + ) + await self._broadcast( + "leanoj_brainstorm_submitters_resumed", + { + "phase": phase_key, + "queue_size": queue_size, + "threshold": system_config.queue_overflow_threshold, + }, + ) + + for paused_submitter in sorted(transitions["submitters_paused"]): + pending_count = submission_queue.count_for_submitter(paused_submitter) + logger.info( + "LeanOJ brainstorm submitter %s paused for fairness (pending=%s, threshold=%s).", + paused_submitter, + pending_count, + system_config.per_submitter_queue_threshold, + ) + await self._broadcast( + "leanoj_brainstorm_submitter_paused", + { + "phase": phase_key, + "queue_size": queue_size, + "submitter": paused_submitter, + "submitter_id": paused_submitter, + "submitter_pending": pending_count, + "threshold": system_config.per_submitter_queue_threshold, + }, + ) + + for resumed_submitter in sorted(transitions["submitters_resumed"]): + pending_count = submission_queue.count_for_submitter(resumed_submitter) + logger.info( + "LeanOJ brainstorm submitter %s resumed for fairness (pending=%s, threshold=%s).", + resumed_submitter, + pending_count, + system_config.per_submitter_queue_threshold, + ) + await self._broadcast( + "leanoj_brainstorm_submitter_resumed", + { + "phase": phase_key, + "queue_size": queue_size, + "submitter": resumed_submitter, + "submitter_id": resumed_submitter, + "submitter_pending": pending_count, + "threshold": system_config.per_submitter_queue_threshold, + }, + ) + + async def _brainstorm_submitter_loop( + self, + request: LeanOJStartRequest, + submitter_index: int, + submitter: LeanOJRoleConfig, + submission_queue: _LeanOJBrainstormSubmissionQueue, + ) -> None: + task_prefix = f"leanoj_brainstorm_sub{submitter_index}" + role_id = f"leanoj_brainstorm_submitter_{submitter_index}" + while not self._should_stop(): + try: + await self._wait_for_brainstorm_queue_turn(submission_queue, submitter_index) + if self._should_stop(): + 
break + active_topic = self._active_brainstorm_topic() + prompt_failed_feedback = self._general_brainstorm_feedback_records() + context_blocks = await self._build_context_blocks( + request, + submitter, + mode="brainstorm", + task_request=( + "Generate one concrete proof-solving brainstorm idea for the active LeanOJ topic: " + f"{active_topic}" + ), + include_current_final_cycle_packet=True, + capped_rejection_feedback=self._format_capped_rejection_feedback( + "RECENT FAILED / REJECTION FEEDBACK SUMMARIES", + prompt_failed_feedback, + limit=10, + ), + ) + raw = await self._call_json( + submitter, + task_prefix, + role_id, + build_brainstorm_prompt( + request.user_prompt, + request.lean_template, + active_topic, + self._accepted_ideas, + [item.model_dump(mode="json") for item in self._state.verified_subproofs], + prompt_failed_feedback, + context_blocks=context_blocks, + ), + temperature=api_client_manager.parallel_brainstorm_submitter_temperature(submitter_index), + ) + metadata: dict[str, Any] = {} + if is_lean_proof_submission(raw): + source_context = "\n\n".join( + part + for part in [ + request.lean_template, + active_topic, + "\n\n".join(self._accepted_ideas), + "\n\n".join(str(value) for value in context_blocks.values() if value), + ] + if part + ) + gate_result = await verify_brainstorm_proof_candidate( + parsed=raw, + user_prompt=request.user_prompt, + source_context=source_context, + model_id=submitter.model_id, + role_id=role_id, + task_id_prefix=f"{task_prefix}_lean", + max_tokens=submitter.max_output_tokens, + validator_model=request.brainstorm_validator.model_id, + validator_context=request.brainstorm_validator.context_window, + validator_max_tokens=request.brainstorm_validator.max_output_tokens, + validator_role_id="leanoj_brainstorm_validator", + allowed_baseline=request.lean_template, + max_attempts=5, + ) + if not gate_result.accepted: + feedback = { + "request": str(raw.get("theorem_statement") or raw.get("submission") or active_topic), + 
"error_summary": self._summarize_error(gate_result.failure_feedback, limit=1200), + "lean_code": gate_result.lean_code, + } + self._failed_feedback.append(feedback) + await self._persist_and_broadcast( + "leanoj_brainstorm_proof_failed", + { + "submitter": submitter_index, + "submitter_id": submitter_index, + "submitter_model": submitter.model_id, + "submitter_provider": submitter.provider, + "feedback": feedback, + }, + ) + continue + raw = { + **raw, + "submission": gate_result.submission_content, + "reasoning": gate_result.reasoning or raw.get("reasoning", ""), + } + metadata["brainstorm_lean_proof"] = { + "theorem_statement": gate_result.theorem_statement, + "theorem_name": gate_result.theorem_name, + "formal_sketch": gate_result.formal_sketch, + "lean_code": gate_result.lean_code, + "lean_feedback": gate_result.lean_feedback, + "reasoning": gate_result.reasoning, + "attempts": [ + attempt.model_dump(mode="json") + for attempt in (gate_result.attempts or []) + ], + "attempt_count": len(gate_result.attempts or []), + } + submission = str(raw.get("submission") or "").strip() + if submission: + await self._wait_for_brainstorm_queue_turn(submission_queue, submitter_index) + if self._should_stop(): + break + await submission_queue.put((submitter_index, submission, metadata)) + await self._sync_brainstorm_queue_pause_state( + submission_queue, + self._state.active_brainstorm_phase or self._state.phase, + ) + logger.info( + "LeanOJ brainstorm submission queued (phase=%s, submitter=%s, queue_size=%s)", + self._state.active_brainstorm_phase or self._state.phase, + submitter_index, + submission_queue.qsize(), + ) + await self._broadcast( + "leanoj_brainstorm_submission_queued", + { + "phase": self._state.active_brainstorm_phase or self._state.phase, + "submitter": submitter_index, + "submitter_id": submitter_index, + "submitter_model": submitter.model_id, + "submitter_provider": submitter.provider, + "queue_size": submission_queue.qsize(), + "submission_preview": 
self._summarize_error(submission, limit=220), + }, + ) + except asyncio.CancelledError: + raise + except LeanOJConfigurationError: + raise + except Exception as exc: + logger.warning("LeanOJ brainstorm submitter %s failed: %s", submitter_index, exc) + await self._broadcast( + "leanoj_brainstorm_submitter_failed", + {"submitter": submitter_index, "message": str(exc)}, + ) + await asyncio.sleep(2) + + async def _dequeue_brainstorm_batch( + self, + submission_queue: _LeanOJBrainstormSubmissionQueue, + *, + max_count: int = 3, + ) -> list[_BrainstormSubmission]: + return await submission_queue.dequeue_batch(max_count=max_count) + + async def _dequeue_topic_batch( + self, + topic_queue: asyncio.Queue[tuple[int, str]], + *, + max_count: int = 3, + ) -> list[tuple[int, str]]: + try: + first = await asyncio.wait_for(topic_queue.get(), timeout=1.0) + except asyncio.TimeoutError: + return [] + + batch = [first] + deadline = time.monotonic() + 0.25 + while len(batch) < max_count: + try: + batch.append(topic_queue.get_nowait()) + continue + except asyncio.QueueEmpty: + pass + + remaining = deadline - time.monotonic() + if remaining <= 0: + break + try: + batch.append(await asyncio.wait_for(topic_queue.get(), timeout=remaining)) + except asyncio.TimeoutError: + break + return batch + + def _topic_validation_context(self) -> list[str]: + topics: list[str] = [] + seen: set[str] = set() + for topic in self._validated_topics: + normalized = topic.strip() + if not normalized or normalized in seen: + continue + topics.append(normalized) + seen.add(normalized) + return topics + + async def _record_accepted_brainstorm_proof( + self, + request: LeanOJStartRequest, + submitter_index: int, + metadata: dict[str, Any], + ) -> None: + proof_payload = (metadata or {}).get("brainstorm_lean_proof") + if not isinstance(proof_payload, dict): + return + + theorem_statement = str(proof_payload.get("theorem_statement") or "").strip() + lean_code = str(proof_payload.get("lean_code") or "").strip() + 
if not theorem_statement or not lean_code: + return + + subproof_id = f"brainstorm_proof_{self._state.brainstorm_acceptance_events + 1}_{uuid.uuid4().hex[:6]}" + lean_feedback = str(proof_payload.get("lean_feedback") or "").strip() + proof_attempts = [ + item if isinstance(item, ProofAttemptFeedback) else ProofAttemptFeedback.model_validate(item) + for item in (proof_payload.get("attempts") or []) + ] + proof_record: Optional[ProofRecord] = None + try: + proof_record = await self._register_verified_leanoj_proof( + request, + proof_kind="subproof", + theorem_statement=theorem_statement, + theorem_name=str(proof_payload.get("theorem_name") or subproof_id), + lean_code=lean_code, + attempt_count=int(proof_payload.get("attempt_count") or 1), + formal_sketch=str(proof_payload.get("formal_sketch") or "LeanOJ brainstorm proof candidate"), + theorem_id=subproof_id, + source_title=f"LeanOJ brainstorm proof from submitter {submitter_index}", + verification_notes=( + lean_feedback + or "Proof Solver verified this brainstorm subproof with Lean 4 and template/device checks." 
+ ), + attempts=proof_attempts, + ) + except Exception as exc: + logger.warning("LeanOJ accepted brainstorm proof registration failed: %s", exc, exc_info=True) + await self._broadcast( + "leanoj_brainstorm_proof_registration_failed", + { + "subproof_id": subproof_id, + "submitter": submitter_index, + "error": str(exc), + }, + ) + + record = LeanOJSubproofRecord( + subproof_id=subproof_id, + request=theorem_statement, + role="Verified during brainstorm before validator acceptance.", + theorem_or_lemma=str(proof_payload.get("theorem_name") or theorem_statement), + verified=True, + lean_code=lean_code, + lean_feedback=lean_feedback, + attempts_used=int(proof_payload.get("attempt_count") or 1), + proof_id=proof_record.proof_id if proof_record else "", + novel=proof_record.novel if proof_record else False, + novelty_tier=proof_record.novelty_tier if proof_record else "not_novel", + novelty_reasoning=proof_record.novelty_reasoning if proof_record else "", + ) + self._state.verified_subproofs.append(record) + await self._persist_and_broadcast( + "leanoj_brainstorm_proof_verified", + { + "subproof": record.model_dump(mode="json"), + "submitter": submitter_index, + "submitter_id": submitter_index, + }, + ) + + async def _validate_brainstorm(self, request: LeanOJStartRequest, submission: str) -> bool: + raw = await self._call_json( + request.brainstorm_validator, + "leanoj_brainstorm_val", + "leanoj_brainstorm_validator", + build_brainstorm_validation_prompt( + request.user_prompt, + request.lean_template, + submission, + self._accepted_ideas, + context_blocks=await self._build_context_blocks( + request, + request.brainstorm_validator, + mode="brainstorm", + task_request="Validate whether a LeanOJ brainstorm submission is useful and non-redundant.", + include_current_final_cycle_packet=True, + ), + ), + ) + accepted = str(raw.get("decision") or "").strip().lower() == "accept" + self._last_brainstorm_validation_decisions = [ + { + "accepted": accepted, + "context_role": 
self._normalize_brainstorm_context_role(raw, submission), + "reasoning": str(raw.get("reasoning") or "").strip(), + "summary": str(raw.get("summary") or "").strip(), + } + ] + return accepted + + async def _validate_brainstorm_batch(self, request: LeanOJStartRequest, submissions: list[str]) -> list[bool]: + if not submissions: + self._last_brainstorm_validation_decisions = [] + return [] + if len(submissions) == 1: + return [await self._validate_brainstorm(request, submissions[0])] + + raw = await self._call_json( + request.brainstorm_validator, + "leanoj_brainstorm_val", + "leanoj_brainstorm_validator", + build_brainstorm_batch_validation_prompt( + request.user_prompt, + request.lean_template, + submissions, + self._accepted_ideas, + context_blocks=await self._build_context_blocks( + request, + request.brainstorm_validator, + mode="brainstorm", + task_request="Batch-validate LeanOJ brainstorm submissions for usefulness and redundancy.", + include_current_final_cycle_packet=True, + ), + ), + ) + decisions = raw.get("decisions") + if not isinstance(decisions, list) or len(decisions) != len(submissions): + logger.warning( + "LeanOJ brainstorm batch validator returned %s decisions for %s submissions", + len(decisions) if isinstance(decisions, list) else "non-list", + len(submissions), + ) + self._last_brainstorm_validation_decisions = [ + { + "accepted": False, + "reasoning": "Brainstorm validator returned malformed decision payload.", + "summary": "Validator did not return one ordered decision per submission.", + } + for _ in submissions + ] + return [False for _ in submissions] + + accepted: list[bool] = [] + validation_decisions: list[dict[str, Any]] = [] + for expected_number, decision_payload in enumerate(decisions, start=1): + if not isinstance(decision_payload, dict): + accepted.append(False) + validation_decisions.append( + { + "accepted": False, + "reasoning": "Decision payload was not an object.", + "summary": "Validator returned a malformed decision 
entry.", + } + ) + continue + if decision_payload.get("submission_number") != expected_number: + logger.warning( + "LeanOJ brainstorm batch validator returned out-of-order decision: expected %s, got %s", + expected_number, + decision_payload.get("submission_number"), + ) + self._last_brainstorm_validation_decisions = [ + { + "accepted": False, + "reasoning": "Brainstorm validator returned out-of-order decisions.", + "summary": "Validator decisions could not be matched to submissions.", + } + for _ in submissions + ] + return [False for _ in submissions] + is_accepted = str(decision_payload.get("decision") or "").strip().lower() == "accept" + accepted.append(is_accepted) + validation_decisions.append( + { + "accepted": is_accepted, + "context_role": self._normalize_brainstorm_context_role(decision_payload, submissions[expected_number - 1]), + "reasoning": str(decision_payload.get("reasoning") or "").strip(), + "summary": str(decision_payload.get("summary") or "").strip(), + } + ) + self._last_brainstorm_validation_decisions = validation_decisions + return accepted + + def _record_brainstorm_rejection_feedback( + self, + submitter_index: int, + submission: str, + validation_feedback: dict[str, Any], + ) -> None: + summary = str(validation_feedback.get("summary") or "").strip() + reasoning = str(validation_feedback.get("reasoning") or "").strip() + feedback_parts = [ + "VALIDATOR REJECTED BRAINSTORM SUBMISSION", + f"Summary: {summary}" if summary else "", + f"Reasoning: {reasoning}" if reasoning else "", + f"Rejected submission: {self._summarize_error(submission, limit=500)}", + ] + error_summary = "\n".join(part for part in feedback_parts if part) + self._failed_feedback.append( + { + "request": f"brainstorm submitter {submitter_index} rejected submission", + "error_summary": self._summarize_error(error_summary, limit=1200), + "submission": self._summarize_error(submission, limit=500), + "submitter_index": submitter_index, + "source": "brainstorm_validator", + } + ) 
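Reviewer note: the batch path in `_validate_brainstorm_batch` above pairs validator decisions with submissions strictly by position. A wrong count or an out-of-order `submission_number` rejects the whole batch, while a single non-dict entry rejects only its own submission. The following is a minimal standalone sketch of that pairing rule, not the coordinator's actual code; the function name `match_batch_decisions` is hypothetical, and the logging and the `context_role`/`reasoning`/`summary` bookkeeping are omitted.

```python
from typing import Any


def match_batch_decisions(raw: dict[str, Any], submissions: list[str]) -> list[bool]:
    """Return one accept/reject flag per submission (hypothetical helper)."""
    decisions = raw.get("decisions")
    # A missing list or a count mismatch means decisions cannot be paired
    # with submissions, so every submission is rejected.
    if not isinstance(decisions, list) or len(decisions) != len(submissions):
        return [False] * len(submissions)

    accepted: list[bool] = []
    for expected_number, payload in enumerate(decisions, start=1):
        if not isinstance(payload, dict):
            # A malformed entry rejects only its own submission.
            accepted.append(False)
            continue
        if payload.get("submission_number") != expected_number:
            # Out-of-order numbering makes every pairing untrustworthy,
            # so the whole batch is rejected.
            return [False] * len(submissions)
        decision = str(payload.get("decision") or "").strip().lower()
        accepted.append(decision == "accept")
    return accepted
```

Rejecting the full batch on a numbering mismatch trades throughput for safety: a shifted decision list can never cause the wrong submission to be accepted.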
+ + def _record_accepted_brainstorm_idea( + self, + submission: str, + submitter_index: int, + phase_key: str, + validation_feedback: dict[str, Any] | None = None, + ) -> None: + validation_feedback = validation_feedback or {} + context_role = self._normalize_brainstorm_context_role(validation_feedback, submission) + self._accepted_ideas.append(submission) + self._state.brainstorm_acceptance_events += 1 + self._accepted_idea_records.append( + { + "content": submission, + "context_role": context_role, + "submitter_index": submitter_index, + "phase": phase_key, + "validator_summary": str(validation_feedback.get("summary") or "").strip(), + "validator_reasoning": str(validation_feedback.get("reasoning") or "").strip(), + "created_at": datetime.now().isoformat(), + "acceptance_event": self._state.brainstorm_acceptance_events, + } + ) + + def _ensure_accepted_idea_records(self) -> None: + existing = [ + dict(record) + for record in self._accepted_idea_records + if isinstance(record, dict) and str(record.get("content") or "").strip() + ] + used_existing_indices: set[int] = set() + existing_by_content: dict[str, list[tuple[int, dict[str, Any]]]] = {} + for record_index, record in enumerate(existing): + existing_by_content.setdefault(str(record.get("content") or ""), []).append((record_index, record)) + + def take_existing_record(content: str) -> dict[str, Any] | None: + candidates = existing_by_content.get(content, []) + while candidates: + record_index, record = candidates.pop(0) + if record_index not in used_existing_indices: + used_existing_indices.add(record_index) + return dict(record) + return None + + records: list[dict[str, Any]] = [] + for index, idea in enumerate(self._accepted_ideas): + content = str(idea) + aligned_record = ( + existing[index] + if index < len(existing) + and str(existing[index].get("content") or "") == content + and index not in used_existing_indices + else None + ) + if aligned_record is not None: + used_existing_indices.add(index) + record 
= dict(aligned_record) + else: + record = take_existing_record(content) or {} + if not record: + record = { + "content": content, + "submitter_index": 1, + "phase": "legacy", + "created_at": "", + "acceptance_event": index + 1, + "legacy": True, + } + record["content"] = content + record["context_role"] = self._normalize_brainstorm_context_role(record, idea) + records.append(record) + self._accepted_idea_records = records + + @staticmethod + def _normalize_brainstorm_context_role(record: dict[str, Any] | None, text: str = "") -> str: + role = str((record or {}).get("context_role") or "").strip().lower() + if role in _LEANOJ_CONTEXT_ROLES: + return role + + combined = " ".join( + part.lower() + for part in [ + text, + str((record or {}).get("content") or ""), + str((record or {}).get("summary") or ""), + str((record or {}).get("reasoning") or ""), + str((record or {}).get("validator_summary") or ""), + str((record or {}).get("validator_reasoning") or ""), + ] + if part + ) + if any(term in combined for term in _LEANOJ_REFUTED_CONTEXT_TERMS): + return "refuted_construction" + if "[lean 4 verified brainstorm proof]" in combined: + return "verified_hint" + if any(term in combined for term in _LEANOJ_ACTIVE_PLAN_CONTEXT_TERMS): + return "active_plan" + return "scratch" + + def _final_solver_active_plan_items(self) -> list[str]: + self._ensure_accepted_idea_records() + return [ + str(record.get("content") or "").strip() + for record in self._accepted_idea_records + if str(record.get("context_role") or "") in _LEANOJ_FINAL_ACTIVE_CONTEXT_ROLES + and str(record.get("content") or "").strip() + ] + + def _final_solver_refuted_construction_records(self) -> list[dict[str, Any]]: + self._ensure_accepted_idea_records() + accepted_refutations = [ + record + for record in self._accepted_idea_records + if str(record.get("context_role") or "") == "refuted_construction" + ] + verified_refutations = [ + { + "content": record.get("request", record.get("theorem_or_lemma", "")), + 
"reasoning": record.get("theorem_or_lemma", record.get("role", "")), + "source": "verified_subproof", + } + for record in self._verified_subproof_dicts() + if self._record_mentions_refuted_construction(record) + ] + failure_refutations = [ + { + "content": record.get("error_summary", record.get("summary", "")), + "reasoning": record.get("reasoning", record.get("lean_feedback", "")), + "source": record.get("request", "failure feedback"), + } + for record in [*self._failed_feedback, *self._failed_context_dicts(), *self._final_attempts] + if self._record_mentions_refuted_construction(record) + ] + return self._dedupe_dict_records([*accepted_refutations, *verified_refutations, *failure_refutations]) + + def _final_solver_verified_subproof_dicts(self) -> list[dict[str, Any]]: + return [ + record + for record in self._verified_subproof_dicts() + if not self._record_mentions_refuted_construction(record) + ] + + @staticmethod + def _record_mentions_refuted_construction(record: dict[str, Any]) -> bool: + combined = " ".join( + str(record.get(key) or "").lower() + for key in ( + "request", + "theorem_or_lemma", + "role", + "error_summary", + "summary", + "reasoning", + "lean_feedback", + "submission", + ) + ) + return any(term in combined for term in _LEANOJ_REFUTED_CONTEXT_TERMS) + + def _active_brainstorm_topic(self, phase_key: str = "") -> str: + phase = phase_key or self._state.phase + if phase == "recursive_brainstorm": + if self._current_working_proof_attempt: + summary = _remove_attempt_count_language( + self._current_working_proof_attempt.get("summary") or "" + ).strip() + base = "Repair and complete the current Proof Solver master proof attempt." + return f"{base} {summary}".strip() if summary else base + return "Continue the recursive Proof Solver brainstorm from the current proof state and accepted proof memory." + if phase == "initial_brainstorm": + return self._state.selected_topic or "Solve the user's Proof Solver template." 
+ return self._state.selected_topic or "Solve the user's Proof Solver template." + + def _select_brainstorm_prune_reviewer( + self, + request: LeanOJStartRequest, + phase_key: str, + ) -> tuple[LeanOJRoleConfig, int]: + self._ensure_accepted_idea_records() + phase_records = [ + record + for record in self._accepted_idea_records + if str(record.get("phase") or "") == phase_key + ] + submitter_index = 1 + if phase_records: + try: + submitter_index = int(phase_records[-1].get("submitter_index") or 1) + except (TypeError, ValueError): + submitter_index = 1 + submitter_index = max(1, min(submitter_index, len(request.brainstorm_submitters))) + return request.brainstorm_submitters[submitter_index - 1], submitter_index + + async def _perform_brainstorm_prune_review( + self, + request: LeanOJStartRequest, + phase_key: str, + *, + reason: str, + ) -> None: + if not self._accepted_ideas: + return + self._state.brainstorm_prune_reviews_performed += 1 + reviewer, reviewer_index = self._select_brainstorm_prune_reviewer(request, phase_key) + active_topic = self._active_brainstorm_topic(phase_key) + try: + context_blocks = await self._build_context_blocks( + request, + reviewer, + mode="brainstorm", + task_request=f"Review LeanOJ brainstorm memory for one conservative prune operation: {reason}.", + include_current_final_cycle_packet=True, + ) + raw = await self._call_json( + reviewer, + "leanoj_brainstorm_prune", + f"leanoj_brainstorm_prune_reviewer_{reviewer_index}", + build_brainstorm_prune_review_prompt( + request.user_prompt, + request.lean_template, + active_topic, + self._accepted_ideas, + context_blocks=context_blocks, + ), + ) + operation = self._normalize_brainstorm_prune_operation(raw) + if operation["action"] == "none": + await self._persist_and_broadcast( + "leanoj_brainstorm_prune_review_complete", + {"action": "none", "reason": reason, "reviewer": reviewer_index}, + ) + return + + validator_context = await self._build_context_blocks( + request, + 
request.brainstorm_validator, + mode="brainstorm", + task_request="Validate one proposed LeanOJ brainstorm prune operation.", + include_current_final_cycle_packet=True, + ) + validation = await self._call_json( + request.brainstorm_validator, + "leanoj_brainstorm_prune_val", + "leanoj_brainstorm_validator", + build_brainstorm_prune_validation_prompt( + request.user_prompt, + request.lean_template, + active_topic, + self._accepted_ideas, + operation, + context_blocks=validator_context, + ), + ) + if str(validation.get("decision") or "").strip().lower() != "accept": + await self._persist_and_broadcast( + "leanoj_brainstorm_prune_rejected", + { + "operation": operation, + "reasoning": validation.get("reasoning", ""), + "reviewer": reviewer_index, + }, + ) + return + applied = self._apply_brainstorm_prune_operation(operation, reviewer_index, phase_key) + await self._persist_and_broadcast( + "leanoj_brainstorm_prune_applied" if applied else "leanoj_brainstorm_prune_apply_failed", + { + "operation": operation, + "reasoning": validation.get("reasoning", ""), + "reviewer": reviewer_index, + }, + ) + except asyncio.CancelledError: + raise + except LeanOJConfigurationError: + raise + except Exception as exc: + logger.warning("LeanOJ brainstorm prune review failed: %s", exc, exc_info=True) + await self._persist_and_broadcast( + "leanoj_brainstorm_prune_error", + {"message": str(exc), "reason": reason}, + ) + + def _normalize_brainstorm_prune_operation(self, raw: dict[str, Any]) -> dict[str, Any]: + action = str(raw.get("action") or "none").strip().lower() + if action not in {"none", "delete", "edit", "add"}: + action = "none" + idea_index: Optional[int] = None + try: + if raw.get("idea_index") is not None: + idea_index = int(raw.get("idea_index")) + except (TypeError, ValueError): + idea_index = None + new_content = str(raw.get("new_content") or "").strip() + reasoning = str(raw.get("reasoning") or "").strip() + if action in {"delete", "edit"} and (idea_index is None or 
idea_index < 1 or idea_index > len(self._accepted_ideas)): + action = "none" + reasoning = f"Invalid idea_index for prune operation. {reasoning}".strip() + if action in {"edit", "add"} and not new_content: + action = "none" + reasoning = f"Missing new_content for prune operation. {reasoning}".strip() + return { + "action": action, + "idea_index": idea_index, + "new_content": new_content, + "reasoning": reasoning, + } + + def _apply_brainstorm_prune_operation(self, operation: dict[str, Any], reviewer_index: int, phase_key: str) -> bool: + self._ensure_accepted_idea_records() + action = operation["action"] + idea_index = operation.get("idea_index") + if action == "delete": + if not isinstance(idea_index, int) or idea_index < 1 or idea_index > len(self._accepted_ideas): + return False + del self._accepted_ideas[idea_index - 1] + del self._accepted_idea_records[idea_index - 1] + elif action == "edit": + if not isinstance(idea_index, int) or idea_index < 1 or idea_index > len(self._accepted_ideas): + return False + self._accepted_ideas[idea_index - 1] = operation["new_content"] + self._accepted_idea_records[idea_index - 1]["content"] = operation["new_content"] + self._accepted_idea_records[idea_index - 1]["context_role"] = self._normalize_brainstorm_context_role( + {"reasoning": operation.get("reasoning", "")}, + operation["new_content"], + ) + self._accepted_idea_records[idea_index - 1]["edited_at"] = datetime.now().isoformat() + self._accepted_idea_records[idea_index - 1]["edit_reasoning"] = operation.get("reasoning", "") + elif action == "add": + self._accepted_ideas.append(operation["new_content"]) + self._accepted_idea_records.append( + { + "content": operation["new_content"], + "context_role": self._normalize_brainstorm_context_role( + {"reasoning": operation.get("reasoning", "")}, + operation["new_content"], + ), + "submitter_index": reviewer_index, + "phase": phase_key, + "created_at": datetime.now().isoformat(), + "acceptance_event": 
self._state.brainstorm_acceptance_events, + "prune_add": True, + "reasoning": operation.get("reasoning", ""), + } + ) + else: + return False + self._state.accepted_brainstorm_count = len(self._accepted_ideas) + self._state.brainstorm_prune_operations_applied += 1 + return True + + async def _sufficiency_check(self, request: LeanOJStartRequest) -> bool: + raw = await self._call_json( + request.brainstorm_validator, + "leanoj_sufficiency", + "leanoj_brainstorm_validator", + build_sufficiency_prompt( + request.user_prompt, + request.lean_template, + self._accepted_ideas, + [item.model_dump(mode="json") for item in self._state.verified_subproofs], + context_blocks=await self._build_context_blocks( + request, + request.brainstorm_validator, + mode="brainstorm", + task_request="Decide whether the accumulated Proof Solver context is sufficient for the final loop.", + include_current_final_cycle_packet=True, + ), + ), + ) + return bool(raw.get("enough")) + + async def _path_decision_phase(self, request: LeanOJStartRequest) -> str: + self._state.phase = "path_decision" + await self._persist_and_broadcast("leanoj_phase_changed") + decision_actor, decision_role_id = self._path_decision_actor(request) + prompt_failed_feedback = self._general_brainstorm_feedback_records() + raw = await self._call_json( + decision_actor, + "leanoj_path", + decision_role_id, + build_path_decision_prompt( + request.user_prompt, + request.lean_template, + self._accepted_ideas, + [item.model_dump(mode="json") for item in self._state.verified_subproofs], + prompt_failed_feedback, + context_blocks=await self._build_context_blocks( + request, + decision_actor, + mode="brainstorm", + task_request="Choose the next LeanOJ path after reviewing accumulated proof memory.", + include_current_final_cycle_packet=True, + capped_rejection_feedback=self._format_capped_rejection_feedback( + "RECENT FAILED / REJECTION FEEDBACK SUMMARIES", + prompt_failed_feedback, + limit=10, + ), + ), + ), + ) + decision = 
str(raw.get("path") or "").strip() + if decision not in _LEANOJ_PATH_OPTIONS_SET: + decision = "need_more_brainstorming" + path_valid, corrected_path = await self._validate_path_decision(request, decision, str(raw.get("reasoning") or "")) + if not path_valid: + decision = corrected_path or "need_more_brainstorming" + self._state.current_path_decision = decision + await self._persist_and_broadcast("leanoj_path_decided", {"decision": decision, "reasoning": raw.get("reasoning", "")}) + return decision + + @staticmethod + def _path_decision_actor( + request: LeanOJStartRequest, + valid_paths: tuple[str, ...] = _LEANOJ_PATH_OPTIONS, + ) -> tuple[LeanOJRoleConfig, str]: + if "solve_final_now" in valid_paths: + return request.final_solver, "leanoj_final_solver" + return request.topic_generator, "leanoj_topic_generator" + + async def _validate_path_decision(self, request: LeanOJStartRequest, decision: str, reasoning: str) -> tuple[bool, str]: + raw = await self._call_json( + request.topic_validator, + "leanoj_path_val", + "leanoj_path_validator", + build_path_validation_prompt( + request.user_prompt, + request.lean_template, + decision, + reasoning, + self._accepted_ideas, + [item.model_dump(mode="json") for item in self._state.verified_subproofs], + context_blocks=await self._build_context_blocks( + request, + request.topic_validator, + mode="brainstorm", + task_request="Validate the proposed LeanOJ path decision.", + include_current_final_cycle_packet=True, + ), + ), + ) + accepted = str(raw.get("decision") or "").strip().lower() == "accept" + corrected_path = str(raw.get("corrected_path") or "").strip() + if corrected_path not in _LEANOJ_PATH_OPTIONS_SET: + corrected_path = "" + await self._persist_and_broadcast( + "leanoj_path_validated", + { + "decision": decision, + "validated": accepted, + "corrected_path": corrected_path, + "reasoning": raw.get("reasoning", ""), + }, + ) + return accepted, corrected_path + + async def _register_verified_leanoj_proof( + self, + 
request: LeanOJStartRequest, + *, + proof_kind: str, + theorem_statement: str, + theorem_name: str, + lean_code: str, + attempt_count: int, + formal_sketch: str = "", + theorem_id: str = "", + source_title: str = "", + verification_notes: str = "", + attempts: Optional[list[ProofAttemptFeedback]] = None, + ) -> Optional[ProofRecord]: + """Register a Proof Solver verified proof in the shared proof database.""" + if not request.topic_validator.model_id: + raise LeanOJConfigurationError("Proof Solver proof novelty validator model is unavailable") + + source_type = "leanoj_final" if proof_kind == "final" else "leanoj_subproof" + task_id = self._next_task_id(f"leanoj_{proof_kind}_novelty") + self.current_task_id = task_id + self._refresh_workflow_tasks(f"leanoj_{proof_kind}_novelty", "Proof Novelty Validator") + api_client_manager.set_autonomous_phase(self._state.phase or "leanoj") + try: + # Lazy import avoids pulling autonomous coordinator into LeanOJ module load. + from backend.autonomous.core.proof_registration import register_verified_lean_proof + + registration = await register_verified_lean_proof( + proof_database=proof_database, + user_prompt=request.user_prompt, + theorem_statement=theorem_statement, + lean_code=lean_code, + validator_model=request.topic_validator.model_id, + validator_context=request.topic_validator.context_window, + validator_max_tokens=request.topic_validator.max_output_tokens, + task_id=task_id, + role_id="leanoj_proof_novelty", + source_type=source_type, + source_id=self._state.session_id, + source_title=source_title or self._state.selected_topic or request.user_prompt, + theorem_id=theorem_id, + theorem_name=theorem_name, + formal_sketch=formal_sketch, + solver="Proof Solver", + verification_notes=( + verification_notes + or "Proof Solver verified this proof with Lean 4 and template/device checks." 
+ ), + attempt_count=attempt_count, + attempts=attempts, + broadcast_fn=self._broadcast, + base_event={ + "source_type": source_type, + "source_id": self._state.session_id, + "source_title": source_title or self._state.selected_topic or request.user_prompt, + "trigger": "leanoj_verified", + }, + ) + self.completed_task_ids.add(task_id) + return registration.record + except Exception as exc: + logger.warning("Proof Solver proof registration failed for %s: %s", proof_kind, exc) + raise + finally: + self.current_task_id = None + self._refresh_workflow_tasks(f"leanoj_{proof_kind}_novelty", "Proof Novelty Validator") + + async def _check_proof_and_capture_partial( + self, + request: LeanOJStartRequest, + lean_code: str, + *, + target: str, + attempt_number: int, + proof_request: str, + reasoning: str, + theorem_or_lemma: str = "", + ) -> Lean4Result: + placeholder_tokens = self._placeholder_tokens(lean_code) + if not placeholder_tokens: + return await get_lean4_client().check_proof(lean_code, timeout=system_config.lean4_proof_timeout) + + lean_result = await get_lean4_client().check_proof( + lean_code, + timeout=system_config.lean4_proof_timeout, + allow_placeholders=True, + ) + if not lean_result.success: + return lean_result + + device_error = self._validate_no_new_declaration_devices( + request.lean_template, + lean_code, + target=f"partial {target}", + ) + if device_error: + return Lean4Result( + success=False, + error_output=device_error, + goal_states=lean_result.goal_states, + raw_stderr=lean_result.raw_stderr, + ) + if target == "final": + template_error = self._validate_final_solution_integrity( + request.lean_template, + lean_code, + ) + if template_error: + return Lean4Result( + success=False, + error_output=template_error, + goal_states=lean_result.goal_states, + raw_stderr=lean_result.raw_stderr, + ) + + partial_record = { + "session_id": self._state.session_id, + "attempt": attempt_number, + "target": target, + "request": proof_request, + 
"theorem_or_lemma": theorem_or_lemma, + "placeholder_tokens": sorted(set(placeholder_tokens)), + "lean_code": lean_code, + "reasoning": reasoning, + "high_value_scaffold": False, + "master_seed_eligible": False, + "created_at": datetime.now().isoformat(), + "summary": ( + "Lean accepted this incomplete scaffold with placeholders. " + "It is stored for future reference, but it is not a verified proof and is not eligible " + "to seed the master proof unless a validator explicitly marks it high-value." + ), + } + await self._record_partial_proof(partial_record) + return Lean4Result( + success=False, + error_output=( + "PARTIAL PROOF SAVED: Lean accepted this scaffold with placeholder token(s) " + f"{', '.join(partial_record['placeholder_tokens'])}. It has been stored in the " + "LeanOJ partial-proof database for future reference, but final verification must " + "continue until every `sorry`/`admit` is replaced by a complete proof." + ), + goal_states=lean_result.goal_states, + raw_stderr=lean_result.raw_stderr, + ) + + async def _record_partial_proof(self, partial_record: dict[str, Any]) -> None: + self._partial_proofs.append(partial_record) + await self._append_partial_proof_database(partial_record) + await self._persist_and_broadcast( + "leanoj_partial_proof_saved", + {"partial_proof": partial_record}, + ) + + async def _append_partial_proof_database(self, partial_record: dict[str, Any]) -> None: + path = self._partial_proof_database_path() + path.parent.mkdir(parents=True, exist_ok=True) + async with aiofiles.open(path, "a", encoding="utf-8") as f: + await f.write(json.dumps(partial_record, ensure_ascii=False) + "\n") + await leanoj_context_manager.append_record( + self._state.session_id, + ARTIFACT_PARTIAL_PROOFS, + partial_record, + ) + + @staticmethod + def _placeholder_tokens(lean_code: str) -> list[str]: + stripped = strip_lean_comments_and_strings(lean_code or "") + return [match.group(1) for match in _LEAN_PLACEHOLDER_RE.finditer(stripped)] + + @staticmethod 
+ def _partial_proofs_base_dir() -> Path: + return Path(system_config.data_dir) / "leanoj_partial_proofs" + + def _partial_proof_database_path(self, session_id: str = "") -> Path: + return self._partial_proofs_base_dir() / f"{session_id or self._state.session_id or 'latest'}.jsonl" + + def _load_partial_proof_database(self, session_id: str) -> list[dict[str, Any]]: + path = self._partial_proof_database_path(session_id) + if not path.exists(): + return [] + records: list[dict[str, Any]] = [] + try: + for line in path.read_text(encoding="utf-8").splitlines(): + if not line.strip(): + continue + item = json.loads(line) + if isinstance(item, dict): + records.append(item) + except Exception as exc: + logger.warning("Failed to load LeanOJ partial proof database from %s: %s", path, exc) + return records + + @staticmethod + def _dedupe_partial_proofs(records: list[dict[str, Any]]) -> list[dict[str, Any]]: + deduped: list[dict[str, Any]] = [] + seen: set[tuple[str, str, str, str, str]] = set() + for record in records: + key = ( + str(record.get("session_id") or ""), + str(record.get("target") or ""), + str(record.get("attempt") or ""), + str(record.get("request") or ""), + str(record.get("lean_code") or ""), + ) + if key in seen: + continue + seen.add(key) + deduped.append(record) + return deduped + + @staticmethod + def _dedupe_strings(records: list[str]) -> list[str]: + deduped: list[str] = [] + seen: set[str] = set() + for record in records: + value = str(record).strip() + if not value or value in seen: + continue + seen.add(value) + deduped.append(value) + return deduped + + @staticmethod + def _dedupe_dict_records(records: list[dict[str, Any]]) -> list[dict[str, Any]]: + deduped: list[dict[str, Any]] = [] + seen: set[str] = set() + for record in records: + key = LeanOJCoordinator._dict_record_key(record) + if key in seen: + continue + seen.add(key) + deduped.append(record) + return deduped + + @staticmethod + def _dict_record_key(record: dict[str, Any]) -> str: + try: 
+ return json.dumps(record, sort_keys=True, default=str) + except TypeError: + return str(record) + + def _verified_subproof_dicts(self) -> list[dict[str, Any]]: + return [item.model_dump(mode="json") for item in self._state.verified_subproofs] + + def _failed_context_dicts(self) -> list[dict[str, Any]]: + return self._dedupe_dict_records( + [ + *[item.model_dump(mode="json") for item in self._state.failed_subproofs], + *self._failed_feedback, + ] + ) + + @staticmethod + def _is_subproof_or_final_failure_feedback(record: dict[str, Any]) -> bool: + request = str(record.get("request") or "").lower() + return bool(record.get("lean_code")) or "subproof" in request or "final proof solver" in request + + def _general_brainstorm_feedback_records(self) -> list[dict[str, Any]]: + if self._state.phase == "recursive_brainstorm": + return [] + return [ + record + for record in self._failed_feedback + if isinstance(record, dict) and not self._is_subproof_or_final_failure_feedback(record) + ] + + async def _build_context_blocks( + self, + request: LeanOJStartRequest, + role_config: LeanOJRoleConfig, + *, + mode: str, + task_request: str, + include_current_final_cycle_packet: bool = False, + capped_rejection_feedback: str = "", + context_scope: str = "", + ) -> dict[str, str]: + resolved_scope = context_scope or self._infer_context_scope(mode) + current_packet = self._current_final_cycle_packet if include_current_final_cycle_packet else None + working_proof_attempt = None + if resolved_scope == "recursive_brainstorm": + working_proof_attempt = await self._working_proof_attempt_context_packet() + capped_rejection_feedback = "" + include_failed_subproofs = resolved_scope == "subproof" + accepted_context = ( + self._final_solver_active_plan_items() + if resolved_scope == "final_solver" + else self._accepted_ideas + ) + refuted_constructions = ( + self._final_solver_refuted_construction_records() + if resolved_scope == "final_solver" + else [] + ) + allocation = await 
leanoj_context_manager.allocate_context( + session_id=self._state.session_id, + mode=resolved_scope, + user_prompt=request.user_prompt, + lean_template=request.lean_template, + task_request=task_request, + context_window=role_config.context_window, + max_output_tokens=role_config.max_output_tokens, + accepted_ideas=accepted_context, + recursive_topics=self._recursive_topics, + verified_subproofs=( + self._final_solver_verified_subproof_dicts() + if resolved_scope == "final_solver" + else self._verified_subproof_dicts() + ), + partial_proofs=self._partial_proofs, + failed_subproofs=self._failed_context_dicts() if include_failed_subproofs else [], + final_attempts=self._final_attempts[-5:] if resolved_scope == "final_solver" else [], + final_cycle_packets=[], + refuted_constructions=refuted_constructions, + current_final_cycle_packet=current_packet, + current_working_proof_attempt=working_proof_attempt, + capped_rejection_feedback=capped_rejection_feedback, + ) + return allocation.as_prompt_blocks() + + def _infer_context_scope(self, mode: str) -> str: + if mode == "final_solver": + return "final_solver" + if mode == "subproof": + return "subproof" + if self._state.phase == "recursive_brainstorm" or self._current_working_proof_attempt: + return "recursive_brainstorm" + return "brainstorm" + + async def _set_current_working_proof_attempt( + self, + *, + trigger: str, + requested_path: str, + stuck_reason: str, + ) -> None: + master_proof = await self._read_master_proof() + if master_proof: + self._set_master_proof_metadata(master_proof) + prompt_safe_stuck_reason = _remove_attempt_count_language( + stuck_reason or self._state.master_proof_last_stuck_reason or "Final proof needs more context." 
+ ) + summary = self._summarize_error( + f"{trigger}: {prompt_safe_stuck_reason}", + limit=500, + ) + self._current_working_proof_attempt = { + "session_id": self._state.session_id, + "trigger": trigger, + "requested_path": requested_path, + "stuck_reason": self._summarize_error(prompt_safe_stuck_reason, limit=1200), + "summary": summary, + "master_proof_version": self._state.master_proof_version, + "master_proof_hash": self._state.master_proof_hash, + "master_proof_line_count": self._state.master_proof_line_count, + "master_proof_char_count": self._state.master_proof_char_count, + "master_proof_last_edit_summary": self._state.master_proof_last_edit_summary, + "created_at": datetime.now().isoformat(), + } + + async def _working_proof_attempt_context_packet(self) -> Optional[dict[str, Any]]: + if not self._current_working_proof_attempt: + return None + master_proof = await self._read_master_proof() + if master_proof: + self._set_master_proof_metadata(master_proof) + old_attempt_before_redo = await self._read_master_proof_old_attempt_before_redo() + packet = dict(self._current_working_proof_attempt) + packet.update( + { + "master_proof": master_proof, + "master_proof_version": self._state.master_proof_version, + "master_proof_hash": self._state.master_proof_hash, + "master_proof_line_count": self._state.master_proof_line_count, + "master_proof_char_count": self._state.master_proof_char_count, + "master_proof_last_edit_summary": self._state.master_proof_last_edit_summary, + "old_attempt_before_redo": old_attempt_before_redo, + "old_attempt_before_redo_version": self._state.master_proof_old_attempt_before_redo_version, + "old_attempt_before_redo_hash": self._state.master_proof_old_attempt_before_redo_hash, + "old_attempt_before_redo_line_count": self._state.master_proof_old_attempt_before_redo_line_count, + "old_attempt_before_redo_char_count": self._state.master_proof_old_attempt_before_redo_char_count, + "old_attempt_before_redo_summary": 
self._state.master_proof_old_attempt_before_redo_summary, + "old_attempt_before_redo_validator_justification": ( + self._state.master_proof_old_attempt_before_redo_validator_justification + ), + "old_attempt_before_redo_apparent_issue": ( + self._state.master_proof_old_attempt_before_redo_apparent_issue + ), + "recent_final_attempts": leanoj_context_manager._format_attempts(self._final_attempts[-10:]), + "verified_subproofs": self._verified_subproof_dicts(), + "partial_final_proofs": [ + proof for proof in self._partial_proofs[-10:] if str(proof.get("target") or "") == "final" + ], + } + ) + return packet + + def _clear_current_final_cycle_packet(self) -> None: + """Clear one-shot direct final-cycle context after its next phase has completed.""" + self._current_final_cycle_packet = None + + @staticmethod + def _format_capped_rejection_feedback( + title: str, + records: list[dict[str, Any]], + *, + limit: int, + ) -> str: + visible = [record for record in records[-limit:] if isinstance(record, dict)] + if not visible: + return "" + lines = [title] + for index, record in enumerate(visible, start=1): + lines.append( + f"{index}. 
{_remove_attempt_count_language(record.get('request', 'proof feedback'))} :: " + f"{_remove_attempt_count_language(record.get('error_summary', record.get('error_output', '')))}" + ) + lean_feedback = _remove_attempt_count_language(record.get("lean_feedback") or "") + if lean_feedback: + lines.append(f" Lean feedback: {lean_feedback}") + return "\n".join(lines) + + @staticmethod + def _is_final_prompt_feedback_safe(record: dict[str, Any]) -> bool: + text = "\n".join( + str(record.get(key) or "") + for key in ("request", "error_summary", "error_output", "lean_feedback", "reasoning") + ).lower() + if not text.strip(): + return False + blocked_terms = ( + "brainstorm", + "need_more_brainstorming", + "stuck_needs_brainstorm", + "final proof solver proof cycle", + "failed-attempt count", + "failed attempts", + ) + if not any(term in text for term in blocked_terms): + return True + concrete_terms = ( + "old_string", + "unexpected token", + "missing cases", + "unsolved goals", + "error:", + "rejected", + "invalid", + "json", + "max_tokens", + "lean", + "verification", + "watchdog", + ) + return any(term in text for term in concrete_terms) + + def _record_final_context_event( + self, + event_type: str, + *, + request: str, + error_summary: str = "", + lean_feedback: str = "", + reasoning: str = "", + ) -> None: + record = { + "event_type": event_type, + "request": self._summarize_error(request, limit=300), + "error_summary": self._summarize_error(error_summary, limit=1200), + "lean_feedback": self._summarize_error(lean_feedback, limit=1200), + "reasoning": self._summarize_error(reasoning, limit=800), + "created_at": datetime.now().isoformat(), + } + self._final_context_events.append(record) + self._final_context_events = self._final_context_events[-50:] + + def _final_solver_failure_window(self) -> list[dict[str, Any]]: + recent_events = [ + event + for event in self._final_context_events[-5:] + if isinstance(event, dict) + ] + return [ + event + for event in recent_events 
+ if event.get("event_type") == "failure" and self._is_final_prompt_feedback_safe(event) + ] + + def _master_proof_path(self, session_id: str = "") -> Path: + resolved_session_id = session_id or self._state.session_id or "latest" + return self._sessions_base_dir() / resolved_session_id / "master_proof.lean" + + def _master_proof_old_attempt_before_redo_path(self, session_id: str = "") -> Path: + resolved_session_id = session_id or self._state.session_id or "latest" + return self._sessions_base_dir() / resolved_session_id / "master_proof_old_attempt_before_redo.lean" + + def _master_proof_edit_log_path(self, session_id: str = "") -> Path: + resolved_session_id = session_id or self._state.session_id or "latest" + return self._sessions_base_dir() / resolved_session_id / "master_proof_edits.jsonl" + + def _master_proof_snapshot_log_path(self, session_id: str = "") -> Path: + resolved_session_id = session_id or self._state.session_id or "latest" + return self._sessions_base_dir() / resolved_session_id / "master_proof_snapshots.jsonl" + + @staticmethod + def _hash_master_proof(content: str) -> str: + return hashlib.sha256((content or "").encode("utf-8")).hexdigest() if content else "" + + def _set_master_proof_metadata( + self, + content: str, + *, + summary: str = "", + increment_version: bool = False, + ) -> None: + if increment_version: + self._state.master_proof_version += 1 + self._state.master_proof_initialized = bool((content or "").strip()) + self._state.master_proof_hash = self._hash_master_proof(content) + self._state.master_proof_char_count = len(content or "") + self._state.master_proof_line_count = len((content or "").splitlines()) if content else 0 + if summary: + self._state.master_proof_last_edit_summary = self._summarize_error(summary, limit=500) + + async def _read_master_proof(self) -> str: + path = self._master_proof_path() + if not path.exists(): + return "" + try: + async with aiofiles.open(path, "r", encoding="utf-8") as f: + return await f.read() 
+ except Exception as exc: + logger.warning("Failed to read Proof Solver master proof from %s: %s", path, exc) + return "" + + async def _write_master_proof(self, content: str, *, summary: str = "") -> None: + path = self._master_proof_path() + path.parent.mkdir(parents=True, exist_ok=True) + async with aiofiles.open(path, "w", encoding="utf-8") as f: + await f.write(content or "") + self._set_master_proof_metadata(content or "", summary=summary, increment_version=True) + + async def _read_master_proof_old_attempt_before_redo(self) -> str: + path = self._master_proof_old_attempt_before_redo_path() + if not path.exists(): + return "" + try: + async with aiofiles.open(path, "r", encoding="utf-8") as f: + return await f.read() + except Exception as exc: + logger.warning("Failed to read Proof Solver old attempt before redo from %s: %s", path, exc) + return "" + + async def _write_master_proof_old_attempt_before_redo(self, content: str) -> None: + path = self._master_proof_old_attempt_before_redo_path() + path.parent.mkdir(parents=True, exist_ok=True) + async with aiofiles.open(path, "w", encoding="utf-8") as f: + await f.write(content or "") + + async def _append_master_proof_edit(self, record: dict[str, Any]) -> None: + path = self._master_proof_edit_log_path() + path.parent.mkdir(parents=True, exist_ok=True) + payload = { + "session_id": self._state.session_id, + "master_proof_version": self._state.master_proof_version, + "created_at": datetime.now().isoformat(), + **record, + } + async with aiofiles.open(path, "a", encoding="utf-8") as f: + await f.write(json.dumps(payload, ensure_ascii=False) + "\n") + await self._compact_master_proof_edit_log_if_needed() + + async def get_master_proof_draft(self) -> dict[str, Any]: + content = await self._read_master_proof() + if content: + self._set_master_proof_metadata(content) + return { + "session_id": self._state.session_id, + "exists": bool(content.strip()), + "content": content, + "metadata": 
self._master_proof_metadata_payload(), + } + + async def get_master_proof_edit_summaries(self, *, limit: int = 50) -> dict[str, Any]: + safe_limit = max(1, min(500, int(limit or 50))) + records = self._read_master_proof_edit_records() + visible = records[-safe_limit:] + return { + "session_id": self._state.session_id, + "total_edits": len(records), + "limit": safe_limit, + "edits": [self._summarize_master_proof_edit_record(record) for record in visible], + "metadata": self._master_proof_metadata_payload(), + } + + def _master_proof_metadata_payload(self) -> dict[str, Any]: + return { + "initialized": self._state.master_proof_initialized, + "version": self._state.master_proof_version, + "sha256": self._state.master_proof_hash, + "line_count": self._state.master_proof_line_count, + "char_count": self._state.master_proof_char_count, + "last_edit_summary": self._state.master_proof_last_edit_summary, + "last_stuck_reason": self._state.master_proof_last_stuck_reason, + } + + def _read_master_proof_edit_records(self, session_id: str = "") -> list[dict[str, Any]]: + path = self._master_proof_edit_log_path(session_id) + if not path.exists(): + return [] + records: list[dict[str, Any]] = [] + try: + for line in path.read_text(encoding="utf-8").splitlines(): + if not line.strip(): + continue + item = json.loads(line) + if isinstance(item, dict): + records.append(item) + except Exception as exc: + logger.warning("Failed to read Proof Solver master proof edit log %s: %s", path, exc) + return records + + def _summarize_master_proof_edit_record(self, record: dict[str, Any]) -> dict[str, Any]: + summary_keys = [ + "session_id", + "master_proof_version", + "created_at", + "action", + "operation", + "accepted", + "needs_more_time", + "requested_path", + "master_proof_hash", + "master_proof_line_count", + "master_proof_char_count", + ] + summary = {key: record.get(key) for key in summary_keys if key in record} + for key in ("reasoning", "stuck_reason", "error_summary", 
"validator_feedback", "validator_reasoning"): + if record.get(key): + summary[key] = self._summarize_error(str(record.get(key)), limit=500) + if isinstance(record.get("shortening_metrics"), dict): + summary["shortening_metrics"] = record.get("shortening_metrics") + if record.get("old_string"): + summary["old_string_preview"] = self._summarize_error(str(record.get("old_string")), limit=240) + if record.get("new_string"): + summary["new_string_preview"] = self._summarize_error(str(record.get("new_string")), limit=240) + summary["new_string_char_count"] = len(str(record.get("new_string") or "")) + return summary + + async def _compact_master_proof_edit_log_if_needed(self) -> None: + path = self._master_proof_edit_log_path() + records = self._read_master_proof_edit_records() + if len(records) <= _MASTER_PROOF_EDIT_LOG_COMPACT_RECORD_LIMIT: + return + + keep_count = min(_MASTER_PROOF_EDIT_LOG_RECENT_RECORDS_TO_KEEP, len(records)) + retained = records[-keep_count:] + compacted_count = len(records) - len(retained) + current_proof = await self._read_master_proof() + snapshot = { + "session_id": self._state.session_id, + "created_at": datetime.now().isoformat(), + "snapshot_kind": "master_proof_edit_log_compaction", + "compacted_edit_count": compacted_count, + "retained_edit_count": len(retained), + "master_proof_version": self._state.master_proof_version, + "master_proof_hash": self._hash_master_proof(current_proof), + "master_proof_line_count": len(current_proof.splitlines()) if current_proof else 0, + "master_proof_char_count": len(current_proof or ""), + "first_compacted_edit": self._summarize_master_proof_edit_record(records[0]), + "last_compacted_edit": self._summarize_master_proof_edit_record(records[compacted_count - 1]), + } + snapshot_path = self._master_proof_snapshot_log_path() + snapshot_path.parent.mkdir(parents=True, exist_ok=True) + async with aiofiles.open(snapshot_path, "a", encoding="utf-8") as f: + await f.write(json.dumps(snapshot, ensure_ascii=False) + 
"\n") + await self._write_jsonl_records(path, retained) + + @staticmethod + async def _write_jsonl_records(path: Path, records: list[dict[str, Any]]) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + async with aiofiles.open(path, "w", encoding="utf-8") as f: + for record in records: + await f.write(json.dumps(record, ensure_ascii=False) + "\n") + + def _select_master_proof_seed(self, request: LeanOJStartRequest) -> str: + for proof in reversed(self._partial_proofs): + if str(proof.get("target") or "") != "final": + continue + lean_code = str(proof.get("lean_code") or "").strip() + if ( + lean_code + and self._is_high_value_master_seed_partial(request, proof, lean_code) + and not self._validate_final_solution_integrity(request.lean_template, lean_code) + ): + return lean_code + + for attempt in reversed(self._final_attempts): + lean_code = str(attempt.get("lean_code") or "").strip() + if not lean_code: + continue + if not self._is_explicit_master_seed_candidate(request, attempt, lean_code): + continue + if lean_code and not self._validate_final_solution_integrity(request.lean_template, lean_code): + return lean_code + + return request.lean_template.strip() + + def _is_high_value_master_seed_partial( + self, + request: LeanOJStartRequest, + proof: dict[str, Any], + lean_code: str, + ) -> bool: + """Only explicitly elevated partials may seed the durable master proof.""" + return self._is_explicit_master_seed_candidate( + request, + proof, + lean_code, + require_placeholders=True, + ) + + def _is_explicit_master_seed_candidate( + self, + request: LeanOJStartRequest, + record: dict[str, Any], + lean_code: str, + *, + require_placeholders: bool = False, + ) -> bool: + """Require an explicit validator/metadata signal before seeding from prior attempts.""" + if not (record.get("high_value_scaffold") is True or record.get("master_seed_eligible") is True): + return False + if not lean_code.strip(): + return False + if require_placeholders and not 
self._placeholder_tokens(lean_code): + return False + normalized_code = self._normalize_lean_for_template_check(lean_code) + normalized_template = self._normalize_lean_for_template_check(request.lean_template) + if normalized_code == normalized_template: + return False + text = " ".join( + str(record.get(key) or "").lower() + for key in ("request", "reasoning", "error_summary") + ) + blocked_terms = ( + "template unchanged", + "minimal scaffold", + "best achievable", + "infeasible", + "cannot be completed", + "sorry placeholders", + ) + return not any(term in text for term in blocked_terms) + + async def _ensure_master_proof_initialized(self, request: LeanOJStartRequest) -> str: + current = await self._read_master_proof() + if current.strip(): + self._set_master_proof_metadata(current) + return current + + seed = self._select_master_proof_seed(request) + await self._write_master_proof(seed, summary="Initialized Proof Solver master proof draft") + await self._append_master_proof_edit( + { + "action": "initialize_master_proof", + "operation": "full_content", + "reasoning": "Seeded the durable master proof draft from existing Proof Solver context.", + "new_string": seed, + } + ) + await self._persist_and_broadcast("leanoj_master_proof_initialized") + return seed + + @staticmethod + def _normalize_final_solver_edit(raw: dict[str, Any]) -> dict[str, Any]: + if raw.get("lean_code") and not raw.get("action"): + return { + "action": "edit_proof", + "operation": "full_content", + "old_string": "", + "new_string": str(raw.get("lean_code") or ""), + "needs_more_time": False, + "reasoning": str(raw.get("reasoning") or "Legacy whole-file final proof response."), + } + + action = str(raw.get("action") or "").strip() + if not action and raw.get("operation"): + action = "edit_proof" + needs_more_time = bool(raw.get("needs_more_time")) + return { + "action": action, + "operation": str(raw.get("operation") or "").strip(), + "old_string": str(raw.get("old_string") or ""), + 
"new_string": str(raw.get("new_string") or ""), + "needs_more_time": needs_more_time, + "reasoning": str(raw.get("reasoning") or raw.get("summary") or "").strip(), + "stuck_reason": str(raw.get("stuck_reason") or raw.get("reasoning") or "").strip(), + "requested_path": str(raw.get("requested_path") or raw.get("path") or "").strip(), + } + + def _apply_master_proof_edit(self, current_proof: str, edit: dict[str, Any]) -> tuple[Optional[str], str]: + action = str(edit.get("action") or "").strip() + if action not in _LEANOJ_PROOF_EDIT_ACTIONS: + return None, ( + f"Invalid final solver action `{action}`. Final proof mode accepts only `edit_proof`; " + "phase transitions are selected by the separate path-decision mode." + ) + + operation = str(edit.get("operation") or "").strip() + old_string = str(edit.get("old_string") or "") + new_string = str(edit.get("new_string") or "") + if operation not in _LEANOJ_PROOF_EDIT_OPERATIONS: + return None, ( + f"Invalid master proof edit operation `{operation}`. " + "Use full_content, replace, insert_after, or delete." + ) + + if operation == "full_content": + if not new_string.strip(): + return None, "full_content requires non-empty new_string Lean code." + return new_string.strip(), "" + + if not current_proof.strip(): + return None, f"Master proof is empty; operation `{operation}` must be full_content." + if not old_string: + return None, f"Operation `{operation}` requires a non-empty old_string copied from the current master proof." + match_count = current_proof.count(old_string) + if match_count == 0: + return None, "old_string was not found verbatim in the current master proof." + if match_count > 1: + return None, f"old_string appears {match_count} times in the current master proof; include more context." + + if operation == "replace": + return current_proof.replace(old_string, new_string, 1), "" + if operation == "insert_after": + if not new_string.strip(): + return None, "insert_after requires non-empty new_string Lean code." 
+ insert_pos = current_proof.find(old_string) + len(old_string) + return ( + current_proof[:insert_pos].rstrip() + + "\n\n" + + new_string.strip() + + "\n\n" + + current_proof[insert_pos:].lstrip() + ), "" + if operation == "delete": + updated = current_proof.replace(old_string, "", 1) + while "\n\n\n" in updated: + updated = updated.replace("\n\n\n", "\n\n") + return updated, "" + + return None, f"Unsupported master proof edit operation `{operation}`." + + @classmethod + def _master_proof_shortening_metrics(cls, before_proof: str, after_proof: str) -> dict[str, Any]: + before = before_proof or "" + after = after_proof or "" + before_chars = len(before) + after_chars = len(after) + before_lines = len(before.splitlines()) if before else 0 + after_lines = len(after.splitlines()) if after else 0 + before_placeholders = len(cls._placeholder_tokens(before)) + after_placeholders = len(cls._placeholder_tokens(after)) + return { + "before_char_count": before_chars, + "after_char_count": after_chars, + "char_delta_removed": max(0, before_chars - after_chars), + "before_line_count": before_lines, + "after_line_count": after_lines, + "line_delta_removed": max(0, before_lines - after_lines), + "before_placeholder_count": before_placeholders, + "after_placeholder_count": after_placeholders, + "placeholder_delta_added": max(0, after_placeholders - before_placeholders), + "after_to_before_char_ratio": round(after_chars / before_chars, 4) if before_chars else 1.0, + } + + @staticmethod + def _should_validate_master_proof_shortening_edit(edit: dict[str, Any], metrics: dict[str, Any]) -> bool: + char_delta = int(metrics.get("char_delta_removed") or 0) + line_delta = int(metrics.get("line_delta_removed") or 0) + placeholder_delta = int(metrics.get("placeholder_delta_added") or 0) + if char_delta <= 0: + return False + operation = str(edit.get("operation") or "").strip() + return ( + line_delta > 0 + or char_delta >= _MASTER_PROOF_SHORTENING_CHAR_THRESHOLD + or placeholder_delta > 0 + 
or operation == "delete" + ) + + async def _validate_master_proof_shortening_edit( + self, + request: LeanOJStartRequest, + edit: dict[str, Any], + before_proof: str, + after_proof: str, + metrics: dict[str, Any], + ) -> tuple[bool, str, str, str, str]: + await self._broadcast( + "leanoj_master_proof_edit_validation_started", + { + "master_proof_version": self._state.master_proof_version, + "operation": str(edit.get("operation") or ""), + "char_delta_removed": metrics.get("char_delta_removed", 0), + "line_delta_removed": metrics.get("line_delta_removed", 0), + }, + ) + raw = await self._call_json( + request.brainstorm_validator, + "leanoj_master_proof_edit_val", + "leanoj_master_proof_edit_validator", + build_master_proof_edit_validation_prompt( + request.user_prompt, + request.lean_template, + before_proof, + after_proof, + edit, + metrics, + ), + ) + decision = str(raw.get("decision") or "").strip().lower() + reasoning = str(raw.get("reasoning") or raw.get("summary") or "").strip() + feedback = str(raw.get("feedback_to_submitter") or raw.get("summary") or reasoning).strip() + approval_justification = str( + raw.get("shortening_approval_justification") + or raw.get("approval_justification") + or reasoning + ).strip() + apparent_issue = str( + raw.get("apparent_issue_with_old_attempt") + or raw.get("old_attempt_apparent_issue") + or raw.get("old_attempt_issue") + or "" + ).strip() + if decision == "accept": + accepted_reasoning = reasoning or "Master proof edit validator accepted the shortening as progressive." 
+ return ( + True, + feedback, + accepted_reasoning, + approval_justification or accepted_reasoning, + apparent_issue + or "Validator judged the removed material redundant, superseded, or less progressive than the shorter edit.", + ) + return ( + False, + feedback or "Restore the deleted proof progress or replace it with an equivalent stronger proof before shortening.", + reasoning or "Master proof edit validator rejected the shortening as non-progressive.", + "", + "", + ) + + def _build_master_proof_direct_context( + self, + master_proof: str, + request: LeanOJStartRequest, + context_blocks: dict[str, str] | None, + ) -> tuple[str, dict[str, Any]]: + proof = master_proof or request.lean_template + proof_tokens = count_tokens(proof) + available_input = rag_config.get_available_input_tokens( + request.final_solver.context_window, + request.final_solver.max_output_tokens, + ) + nonproof_parts = [ + request.user_prompt, + request.lean_template, + "\n\n".join(str(value) for value in (context_blocks or {}).values() if value), + ] + nonproof_tokens = sum(count_tokens(part) for part in nonproof_parts) + nonproof_tokens += rag_config.get_prompt_assembly_overhead_estimate() + 2500 + proof_token_budget = available_input - nonproof_tokens + + if proof_tokens > proof_token_budget: + raise LeanOJConfigurationError( + "PROOF SOLVER MANDATORY DIRECT CONTEXT OVERFLOW: The full master proof is mandatory direct-inject " + "context and cannot be truncated, summarized, windowed, or RAG-substituted. " + f"Full master proof tokens: {proof_tokens}. Available mandatory direct-inject proof budget after " + f"user prompt, Lean template, proof memory, schema, and output reserve: {proof_token_budget}. " + f"Configured final-solver context window: {request.final_solver.context_window}. " + f"Configured final-solver max output tokens: {request.final_solver.max_output_tokens}. " + "Increase the final solver context window or reduce other mandatory prompt context before resuming." 
+ ) + + return proof, { + "direct_context_mode": "full_mandatory", + "master_proof_tokens": proof_tokens, + "mandatory_direct_proof_token_budget": proof_token_budget, + } + + @classmethod + def _normalize_master_proof_for_progress(cls, content: str) -> str: + return cls._normalize_lean_for_template_check(strip_lean_comments_and_strings(content or "")) + + def _record_master_proof_progress( + self, + edit: dict[str, Any], + before_proof: str, + after_proof: str, + ) -> str: + before_hash = self._hash_master_proof(before_proof) + after_hash = self._hash_master_proof(after_proof) + before_semantic = self._normalize_master_proof_for_progress(before_proof) + after_semantic = self._normalize_master_proof_for_progress(after_proof) + signature = self._master_proof_edit_signature(edit) + no_hash_change = before_hash == after_hash + no_semantic_change = before_semantic == after_semantic + repeated_region = bool(signature and signature == self._last_master_proof_edit_signature) + + if no_hash_change or no_semantic_change or repeated_region: + self._master_proof_no_progress_count += 1 + else: + self._master_proof_no_progress_count = 0 + + self._last_master_proof_edit_signature = signature + + if self._master_proof_no_progress_count < _MASTER_PROOF_NO_PROGRESS_LIMIT: + return "" + + reason_parts = [ + f"LeanOJ final solver made {_MASTER_PROOF_NO_PROGRESS_LIMIT} consecutive edit-only steps", + ] + if no_semantic_change: + reason_parts.append("without changing non-comment Lean code") + elif no_hash_change: + reason_parts.append("without changing the master proof hash") + if repeated_region: + reason_parts.append("while repeatedly editing/inserting at the same proof region") + reason_parts.append("so the run is returning to recursive brainstorming for fresh context instead of looping indefinitely.") + return "; ".join(reason_parts) + + def _reset_master_proof_progress_watchdog(self) -> None: + self._master_proof_no_progress_count = 0 + self._last_master_proof_edit_signature = "" + 
+ @classmethod + def _master_proof_edit_signature(cls, edit: dict[str, Any]) -> str: + operation = str(edit.get("operation") or "") + old_string = str(edit.get("old_string") or "") + if operation == "full_content": + new_string = str(edit.get("new_string") or "") + normalized_new = cls._normalize_master_proof_for_progress(new_string) + return f"full_content:{hashlib.sha256(normalized_new[:1000].encode('utf-8')).hexdigest()}" + if not old_string: + return operation + normalized_old = cls._normalize_master_proof_for_progress(old_string) + return f"{operation}:{hashlib.sha256(normalized_old.encode('utf-8')).hexdigest()}" + + @staticmethod + def _final_cycle_should_handoff_to_recursive(cycle_attempts: list[dict[str, Any]]) -> bool: + if any( + str(attempt.get("request") or "") == "final Proof Solver master proof progress watchdog" + for attempt in cycle_attempts + ): + return True + stale_edit_failures = sum( + 1 + for attempt in cycle_attempts + if "old_string was not found verbatim" in str(attempt.get("error_summary") or "") + ) + return stale_edit_failures >= _MASTER_PROOF_STALE_EDIT_FAILURE_HANDOFF_COUNT + + @staticmethod + def _format_lean_success_feedback(lean_result: Lean4Result) -> str: + diagnostics = str(getattr(lean_result, "diagnostic_output", "") or "").strip() + if not diagnostics: + diagnostics = str(getattr(lean_result, "raw_stderr", "") or "").strip() + goal_states = str(getattr(lean_result, "goal_states", "") or "").strip() + parts = [] + if diagnostics: + parts.append(diagnostics) + if goal_states: + parts.append(f"Goal state output:\n{goal_states}") + return "\n\n".join(parts).strip() or "Lean 4 accepted with no diagnostics." 
+ + async def _review_final_solution_completion( + self, + request: LeanOJStartRequest, + *, + lean_code: str, + final_solver_reasoning: str, + lean_result: Lean4Result, + ) -> tuple[bool, str, str]: + lean_feedback = self._format_lean_success_feedback(lean_result) + raw = await self._call_json( + request.final_solver, + "leanoj_final_review", + "leanoj_final_solver", + build_final_solution_review_prompt( + request.user_prompt, + request.lean_template, + lean_code, + final_solver_reasoning, + lean_feedback, + ), + ) + raw_solved = raw.get("solved") + solved = raw_solved if isinstance(raw_solved, bool) else str(raw_solved).strip().lower() == "true" + reasoning = str(raw.get("reasoning") or raw.get("summary") or "").strip() + continuation_feedback = str(raw.get("continuation_feedback") or "").strip() + if solved: + return True, reasoning or "Final solver review accepted the Lean-verified solution.", lean_feedback + return ( + False, + continuation_feedback or reasoning or "Final solver review rejected this Lean-accepted code as not complete.", + lean_feedback, + ) + + async def _check_master_proof_edit_before_persist( + self, + request: LeanOJStartRequest, + *, + lean_code: str, + needs_more_time: bool, + attempt_number: int, + reasoning: str, + final_solver_metadata: dict[str, Any], + ) -> tuple[Lean4Result, str]: + if needs_more_time: + lean_result = await get_lean4_client().check_proof( + lean_code, + timeout=system_config.lean4_proof_timeout, + allow_placeholders=True, + ) + lean_pass_feedback = self._format_lean_success_feedback(lean_result) if lean_result.success else "" + if lean_result.success: + template_error = self._validate_final_solution_integrity( + request.lean_template, + lean_code, + ) + if template_error: + lean_result.success = False + lean_result.error_output = template_error + return lean_result, lean_pass_feedback + + lean_result = await self._check_proof_and_capture_partial( + request, + lean_code, + target="final", + 
attempt_number=attempt_number, + proof_request="final Proof Solver solution", + reasoning=reasoning, + ) + lean_pass_feedback = self._format_lean_success_feedback(lean_result) if lean_result.success else "" + if lean_result.success: + template_error = self._validate_final_solution_integrity( + request.lean_template, + lean_code, + ) + if template_error: + lean_result.success = False + lean_result.error_output = template_error + else: + adequacy_error = self._validate_final_answer_adequacy( + request.lean_template, + lean_code, + ) + if adequacy_error: + lean_result.success = False + lean_result.error_output = adequacy_error + if lean_result.success: + review_solved, review_feedback, lean_pass_feedback = await self._review_final_solution_completion( + request, + lean_code=lean_code, + final_solver_reasoning=reasoning, + lean_result=lean_result, + ) + if not review_solved: + lean_result.success = False + lean_result.error_output = ( + "PROOF SOLVER FINAL SOLUTION REVIEW REJECTED: Lean 4 accepted the code, but the " + "Final Proof Solver judged that it does not yet solve the actual Proof Solver problem. 
" + f"Continuation feedback: {review_feedback}" + ) + self._failed_feedback.append( + { + "request": "final Proof Solver solution semantic review", + "error_summary": self._summarize_error(lean_result.error_output, limit=1200), + "lean_feedback": self._summarize_error(lean_pass_feedback, limit=1200), + "lean_code": lean_code, + } + ) + await self._persist_and_broadcast( + "leanoj_final_solution_review_rejected", + { + "attempt": attempt_number, + "continuation_feedback": self._summarize_error(review_feedback, limit=1200), + "lean_feedback": self._summarize_error(lean_pass_feedback, limit=1200), + **final_solver_metadata, + }, + ) + return lean_result, lean_pass_feedback + + async def _final_proof_loop(self, request: LeanOJStartRequest) -> None: + if await self._consume_force_brainstorm(): + return + + self._state.phase = "final_proof_loop" + await self._persist_and_broadcast("leanoj_phase_changed") + + await self._ensure_master_proof_initialized(request) + final_solver_metadata = { + "solver_model": request.final_solver.model_id, + "solver_provider": request.final_solver.provider, + } + failed_attempts_this_cycle = 0 + cycle_start_attempt = self._state.final_attempt_count + 1 + max_failed_attempts = max(1, request.final_attempts_per_cycle) + while not self._should_stop() and failed_attempts_this_cycle < max_failed_attempts: + if await self._consume_force_brainstorm(): + return + + current_master_proof = await self._read_master_proof() + self._set_master_proof_metadata(current_master_proof) + final_prompt_feedback = self._final_solver_failure_window() + await self._broadcast( + "leanoj_master_proof_edit_started", + { + "next_verification_attempt": self._state.final_attempt_count + 1, + "master_proof_version": self._state.master_proof_version, + }, + ) + try: + context_blocks = await self._build_context_blocks( + request, + request.final_solver, + mode="final_solver", + task_request="Edit the durable Proof Solver master proof and decide whether it is ready for Lean 
verification.", + capped_rejection_feedback=self._format_capped_rejection_feedback( + "RECENT PROOF FEEDBACK SUMMARIES", + final_prompt_feedback, + limit=10, + ), + ) + master_proof_direct_context, direct_context_metadata = self._build_master_proof_direct_context( + current_master_proof, + request, + context_blocks, + ) + raw = await self._call_json( + request.final_solver, + "leanoj_final", + "leanoj_final_solver", + build_final_solver_prompt( + request.user_prompt, + request.lean_template, + master_proof_direct_context, + { + "version": self._state.master_proof_version, + "line_count": self._state.master_proof_line_count, + "char_count": self._state.master_proof_char_count, + "sha256": self._state.master_proof_hash, + "last_edit_summary": self._state.master_proof_last_edit_summary, + "last_shortening_approval_justification": ( + self._state.master_proof_last_shortening_approval_justification + ), + "last_shortening_apparent_issue": self._state.master_proof_last_shortening_apparent_issue, + **direct_context_metadata, + }, + self._final_solver_active_plan_items(), + self._final_solver_verified_subproof_dicts(), + self._partial_proofs, + final_prompt_feedback, + self._final_attempts[-5:], + context_blocks=context_blocks, + ), + ) + except asyncio.CancelledError: + raise + except LeanOJConfigurationError: + raise + except Exception as exc: + attempt_number = self._state.final_attempt_count + 1 + self._state.final_attempt_count = attempt_number + failed_attempts_this_cycle += 1 + error_text = f"Final solver failed before Lean verification: {type(exc).__name__}: {exc}" + attempt = LeanOJAttemptRecord( + attempt=attempt_number, + target="final", + request="final Proof Solver master proof edit", + success=False, + error_output=error_text, + reasoning="Model/API output could not be parsed or generated; retrying in the final loop.", + ) + self._final_attempts.append( + { + "request": "final Proof Solver master proof edit", + "error_summary": 
self._summarize_error(error_text, limit=1200), + "lean_code": current_master_proof, + } + ) + self._record_final_context_event( + "failure", + request="final Proof Solver master proof edit", + error_summary=error_text, + ) + await self._persist_and_broadcast( + "leanoj_final_attempt_failed", + {"attempt": attempt.model_dump(mode="json"), **final_solver_metadata}, + ) + continue + + edit = self._normalize_final_solver_edit(raw) + action = str(edit.get("action") or "") + reasoning = str(edit.get("reasoning") or "") + updated_master_proof, edit_error = self._apply_master_proof_edit(current_master_proof, edit) + if edit_error or updated_master_proof is None: + attempt_number = self._state.final_attempt_count + 1 + self._state.final_attempt_count = attempt_number + failed_attempts_this_cycle += 1 + error_text = f"MASTER PROOF EDIT REJECTED: {edit_error}" + attempt = LeanOJAttemptRecord( + attempt=attempt_number, + target="final", + request="final Proof Solver master proof edit", + lean_code=current_master_proof, + success=False, + error_output=error_text, + reasoning=reasoning, + ) + self._final_attempts.append( + { + "request": "final Proof Solver master proof edit", + "error_summary": self._summarize_error(error_text, limit=1200), + "lean_code": current_master_proof, + } + ) + self._record_final_context_event( + "failure", + request="final Proof Solver master proof edit", + error_summary=error_text, + reasoning=reasoning, + ) + await self._append_master_proof_edit( + { + **edit, + "accepted": False, + "error_summary": self._summarize_error(error_text, limit=1200), + } + ) + await self._persist_and_broadcast( + "leanoj_final_attempt_failed", + {"attempt": attempt.model_dump(mode="json"), **final_solver_metadata}, + ) + continue + + shortening_metrics = self._master_proof_shortening_metrics(current_master_proof, updated_master_proof) + shortening_approval_justification = "" + old_attempt_apparent_issue = "" + if self._should_validate_master_proof_shortening_edit(edit, 
shortening_metrics): + ( + edit_valid, + validator_feedback, + validator_reasoning, + shortening_approval_justification, + old_attempt_apparent_issue, + ) = await self._validate_master_proof_shortening_edit( + request, + edit, + current_master_proof, + updated_master_proof, + shortening_metrics, + ) + if not edit_valid: + attempt_number = self._state.final_attempt_count + 1 + self._state.final_attempt_count = attempt_number + failed_attempts_this_cycle += 1 + error_text = ( + "MASTER PROOF EDIT VALIDATOR REJECTED SHORTENING: " + f"{validator_feedback}" + ) + error_summary = self._summarize_error(error_text, limit=1200) + self._failed_feedback.append( + { + "request": "final Proof Solver master proof edit validator", + "error_summary": error_summary, + "reasoning": self._summarize_error(validator_reasoning, limit=1200), + } + ) + attempt = LeanOJAttemptRecord( + attempt=attempt_number, + target="final", + request="final Proof Solver master proof edit validator", + lean_code=current_master_proof, + success=False, + error_output=error_text, + reasoning=reasoning, + ) + self._final_attempts.append( + { + "request": "final Proof Solver master proof edit validator", + "error_summary": error_summary, + "lean_code": current_master_proof, + "validator_feedback": self._summarize_error(validator_feedback, limit=1200), + "validator_reasoning": self._summarize_error(validator_reasoning, limit=1200), + } + ) + self._record_final_context_event( + "failure", + request="final Proof Solver master proof edit validator", + error_summary=error_summary, + reasoning=validator_reasoning, + ) + await self._append_master_proof_edit( + { + **edit, + "accepted": False, + "error_summary": error_summary, + "validator_feedback": self._summarize_error(validator_feedback, limit=1200), + "validator_reasoning": self._summarize_error(validator_reasoning, limit=1200), + "shortening_metrics": shortening_metrics, + } + ) + await self._persist_and_broadcast( + "leanoj_master_proof_edit_rejected", + { + 
"attempt": attempt_number, + "error_summary": error_summary, + "validator_feedback": self._summarize_error(validator_feedback, limit=1200), + "validator_reasoning": self._summarize_error(validator_reasoning, limit=1200), + "shortening_metrics": shortening_metrics, + **final_solver_metadata, + }, + ) + await self._persist_and_broadcast( + "leanoj_final_attempt_failed", + {"attempt": attempt.model_dump(mode="json"), **final_solver_metadata}, + ) + continue + + needs_more_time = bool(edit.get("needs_more_time")) + lean_code = updated_master_proof.strip() + attempt_number = self._state.final_attempt_count + 1 + if not needs_more_time: + await self._broadcast( + "leanoj_final_attempt_started", + {"attempt": attempt_number, **final_solver_metadata}, + ) + lean_result, lean_pass_feedback = await self._check_master_proof_edit_before_persist( + request, + lean_code=lean_code, + needs_more_time=needs_more_time, + attempt_number=attempt_number, + reasoning=reasoning, + final_solver_metadata=final_solver_metadata, + ) + if not lean_result.success: + self._state.final_attempt_count = attempt_number + failed_attempts_this_cycle += 1 + failure_request = ( + "final Proof Solver master proof edit Lean gate" + if needs_more_time + else "final Proof Solver solution from master proof" + ) + error_summary = self._summarize_error(lean_result.error_output, limit=1200) + attempt = LeanOJAttemptRecord( + attempt=attempt_number, + target="final", + request=failure_request, + lean_code=lean_code, + success=False, + error_output=lean_result.error_output, + reasoning=reasoning, + ) + failure = { + "request": failure_request, + "error_summary": error_summary, + "lean_code": lean_code, + } + if lean_pass_feedback: + failure["lean_feedback"] = self._summarize_error(lean_pass_feedback, limit=1200) + lean_diagnostics = { + key: self._summarize_error(str(value), limit=1200) + for key, value in { + "diagnostic_output": getattr(lean_result, "diagnostic_output", ""), + "goal_states": 
getattr(lean_result, "goal_states", ""), + "raw_stderr": getattr(lean_result, "raw_stderr", ""), + }.items() + if str(value or "").strip() + } + failure.update(lean_diagnostics) + self._final_attempts.append(failure) + self._record_final_context_event( + "failure", + request=failure_request, + error_summary=error_summary, + lean_feedback=str(failure.get("lean_feedback") or ""), + reasoning=reasoning, + ) + await self._append_master_proof_edit( + { + **edit, + "accepted": False, + "error_summary": error_summary, + "lean_code": lean_code, + **lean_diagnostics, + **({"lean_feedback": failure["lean_feedback"]} if "lean_feedback" in failure else {}), + } + ) + await self._persist_and_broadcast( + "leanoj_master_proof_edit_rejected", + { + "attempt": attempt_number, + "error_summary": error_summary, + **lean_diagnostics, + **({"lean_feedback": failure["lean_feedback"]} if "lean_feedback" in failure else {}), + **final_solver_metadata, + }, + ) + await self._persist_and_broadcast( + "leanoj_final_attempt_failed", + {"attempt": attempt.model_dump(mode="json"), **final_solver_metadata}, + ) + continue + + if shortening_approval_justification or old_attempt_apparent_issue: + self._state.master_proof_last_shortening_approval_justification = self._summarize_error( + shortening_approval_justification, + limit=1200, + ) + self._state.master_proof_last_shortening_apparent_issue = self._summarize_error( + old_attempt_apparent_issue, + limit=1200, + ) + old_char_count = len(current_master_proof or "") + stored_old_char_count = self._state.master_proof_old_attempt_before_redo_char_count + if old_char_count > stored_old_char_count: + await self._write_master_proof_old_attempt_before_redo(current_master_proof) + self._state.master_proof_old_attempt_before_redo_version = self._state.master_proof_version + self._state.master_proof_old_attempt_before_redo_hash = self._hash_master_proof(current_master_proof) + self._state.master_proof_old_attempt_before_redo_line_count = ( + 
len(current_master_proof.splitlines()) if current_master_proof else 0 + ) + self._state.master_proof_old_attempt_before_redo_char_count = old_char_count + self._state.master_proof_old_attempt_before_redo_summary = ( + f"Submitter chose to redo/shorten this v{self._state.master_proof_version} attempt " + f"({old_char_count} chars, " + f"{self._state.master_proof_old_attempt_before_redo_line_count} lines)." + ) + self._state.master_proof_old_attempt_before_redo_validator_justification = ( + self._summarize_error(shortening_approval_justification, limit=1200) + ) + self._state.master_proof_old_attempt_before_redo_apparent_issue = self._summarize_error( + old_attempt_apparent_issue, + limit=1200, + ) + + edit_summary = reasoning or f"Applied {edit.get('operation')} edit to Proof Solver master proof." + if shortening_approval_justification or old_attempt_apparent_issue: + edit_summary = " ".join( + part + for part in ( + edit_summary, + ( + f"Validator allowed shortening because: {shortening_approval_justification}" + if shortening_approval_justification + else "" + ), + ( + f"Apparent issue with old longer attempt: {old_attempt_apparent_issue}" + if old_attempt_apparent_issue + else "" + ), + ) + if part + ) + shortening_audit = {} + if shortening_approval_justification: + shortening_audit["shortening_approval_justification"] = self._summarize_error( + shortening_approval_justification, + limit=1200, + ) + if old_attempt_apparent_issue: + shortening_audit["old_attempt_apparent_issue"] = self._summarize_error( + old_attempt_apparent_issue, + limit=1200, + ) + await self._write_master_proof(updated_master_proof, summary=edit_summary) + await self._append_master_proof_edit( + { + **edit, + "accepted": True, + "master_proof_hash": self._state.master_proof_hash, + "master_proof_line_count": self._state.master_proof_line_count, + "master_proof_char_count": self._state.master_proof_char_count, + **shortening_audit, + } + ) + await self._persist_and_broadcast( + 
"leanoj_master_proof_edit_applied", + { + "master_proof_version": self._state.master_proof_version, + "needs_more_time": needs_more_time, + "reasoning": self._summarize_error(edit_summary, limit=500), + }, + ) + self._record_final_context_event( + "acceptance", + request="final Proof Solver master proof edit accepted", + reasoning=edit_summary, + ) + + if needs_more_time: + watchdog_reason = self._record_master_proof_progress(edit, current_master_proof, updated_master_proof) + if watchdog_reason: + self._state.master_proof_last_stuck_reason = self._summarize_error(watchdog_reason, limit=500) + self._failed_feedback.append( + { + "request": "final Proof Solver master proof progress watchdog", + "error_summary": self._summarize_error(watchdog_reason, limit=1200), + } + ) + await self._append_master_proof_edit( + { + "action": "progress_watchdog", + "reasoning": watchdog_reason, + "master_proof_hash": self._state.master_proof_hash, + "master_proof_line_count": self._state.master_proof_line_count, + "master_proof_char_count": self._state.master_proof_char_count, + } + ) + self._reset_master_proof_progress_watchdog() + attempt_number = self._state.final_attempt_count + 1 + self._state.final_attempt_count = attempt_number + failed_attempts_this_cycle += 1 + self._final_attempts.append( + { + "request": "final Proof Solver master proof progress watchdog", + "error_summary": self._summarize_error(watchdog_reason, limit=1200), + "lean_code": updated_master_proof, + } + ) + self._record_final_context_event( + "failure", + request="final Proof Solver master proof progress watchdog", + error_summary=watchdog_reason, + reasoning=reasoning, + ) + await self._persist_and_broadcast( + "leanoj_final_attempt_failed", + { + "attempt": LeanOJAttemptRecord( + attempt=attempt_number, + target="final", + request="final Proof Solver master proof progress watchdog", + lean_code=updated_master_proof, + success=False, + error_output=watchdog_reason, + reasoning=reasoning, + 
).model_dump(mode="json"), + **final_solver_metadata, + }, + ) + await self._persist_and_broadcast( + "leanoj_master_proof_progress_watchdog", + { + "reasoning": watchdog_reason, + "continuing_final_cycle": ( + failed_attempts_this_cycle < max_failed_attempts + and self._state.user_forced_final_cycle + ), + }, + ) + if failed_attempts_this_cycle < max_failed_attempts and self._state.user_forced_final_cycle: + logger.info( + "LeanOJ final cycle continuing after progress watchdog", + ) + self._state.phase = "final_proof_loop" + self._state.current_path_decision = "solve_final_now" + continue + break + continue + self._reset_master_proof_progress_watchdog() + + self._state.final_attempt_count = attempt_number + attempt = LeanOJAttemptRecord( + attempt=attempt_number, + target="final", + request="final Proof Solver solution from master proof", + lean_code=lean_code, + success=lean_result.success, + error_output=lean_result.error_output, + reasoning=reasoning, + ) + + try: + proof_record = await self._register_verified_leanoj_proof( + request, + proof_kind="final", + theorem_statement=request.user_prompt, + theorem_name="Final Proof Solver Submission", + lean_code=lean_code, + attempt_count=attempt_number, + formal_sketch="Final Proof Solver solution for the user's template.", + theorem_id=f"{self._state.session_id}_final", + source_title=self._state.selected_topic or request.user_prompt, + ) + except Exception as exc: + if self._is_non_retryable_model_error(exc): + raise LeanOJConfigurationError(str(exc)) from exc + lean_result.success = False + lean_result.error_output = f"PROOF SOLVER PROOF REGISTRATION FAILED: {exc}" + attempt.success = False + attempt.error_output = lean_result.error_output + + if lean_result.success: + self._state.phase = "verified" + self._state.user_forced_final_cycle = False + self._state.final_solution = lean_code + self._state.final_proof_id = proof_record.proof_id if proof_record else "" + self._state.final_novel = proof_record.novel if 
proof_record else False + self._state.final_novelty_tier = proof_record.novelty_tier if proof_record else "not_novel" + self._state.final_novelty_reasoning = proof_record.novelty_reasoning if proof_record else "" + self._current_final_cycle_packet = None + self._current_working_proof_attempt = None + await self._persist_and_broadcast( + "leanoj_final_verified", + {"attempt": attempt.model_dump(mode="json"), **final_solver_metadata}, + ) + return + + failure = { + "request": "final Proof Solver solution from master proof", + "error_summary": self._summarize_error(lean_result.error_output, limit=1200), + "lean_code": lean_code, + } + if lean_pass_feedback: + failure["lean_feedback"] = self._summarize_error(lean_pass_feedback, limit=1200) + self._final_attempts.append(failure) + self._record_final_context_event( + "failure", + request=str(failure.get("request") or "final Proof Solver solution from master proof"), + error_summary=str(failure.get("error_summary") or ""), + lean_feedback=str(failure.get("lean_feedback") or ""), + reasoning=reasoning, + ) + failed_attempts_this_cycle += 1 + await self._persist_and_broadcast( + "leanoj_final_attempt_failed", + {"attempt": attempt.model_dump(mode="json"), **final_solver_metadata}, + ) + + if self._should_stop() or self._state.phase == "verified": + return + + cycle_end_attempt = self._state.final_attempt_count + last_error = "" + if self._final_attempts: + last_error = str(self._final_attempts[-1].get("error_summary") or "") + cycle_summary = ( + "The final master proof loop did not verify yet. " + f"Latest blocker: {last_error or 'No final attempt error was recorded.'} " + "Use the concrete Lean/edit feedback to choose the next proof action." 
+ ) + cycle_attempts = list(self._final_attempts[-failed_attempts_this_cycle:]) + cycle_partials = [ + proof + for proof in self._partial_proofs + if str(proof.get("target") or "") == "final" + and cycle_start_attempt <= int(proof.get("attempt") or 0) <= cycle_end_attempt + ] + cycle_packet = { + "session_id": self._state.session_id, + "cycle_start_attempt": cycle_start_attempt, + "cycle_end_attempt": cycle_end_attempt, + "failed_attempt_count": failed_attempts_this_cycle, + "attempts": cycle_attempts, + "partial_proofs": cycle_partials, + "created_at": datetime.now().isoformat(), + "summary": self._summarize_error(cycle_summary, limit=1200), + } + self._final_cycle_packets.append(cycle_packet) + self._current_final_cycle_packet = cycle_packet + self._failed_feedback.append( + { + "request": "final Proof Solver proof cycle", + "error_summary": self._summarize_error(cycle_summary, limit=1200), + } + ) + handoff_to_recursive = self._final_cycle_should_handoff_to_recursive(cycle_attempts) + self._state.user_forced_final_cycle = False + self._state.phase = "recursive_brainstorm" if handoff_to_recursive else "path_decision" + self._state.current_path_decision = "need_more_brainstorming" + await self._set_current_working_proof_attempt( + trigger="final_attempt_cycle_exhausted", + requested_path="need_more_brainstorming", + stuck_reason=cycle_summary, + ) + await self._persist_and_broadcast( + "leanoj_final_attempt_cycle_exhausted", + { + "attempts_in_cycle": failed_attempts_this_cycle, + "cycle_start_attempt": cycle_start_attempt, + "cycle_end_attempt": cycle_end_attempt, + "message": self._summarize_error(cycle_summary, limit=500), + }, + ) + + @staticmethod + def _normalize_lean_for_template_check(code: str) -> str: + return " ".join((code or "").split()) + + @classmethod + def _validate_final_solution_integrity(cls, lean_template: str, lean_code: str) -> str: + device_error = cls._validate_no_new_declaration_devices(lean_template, lean_code, target="solution") + if 
device_error: + return device_error + template_error = cls._validate_final_solution_matches_template(lean_template, lean_code) + if template_error: + return template_error + return "" + + @classmethod + def _validate_final_answer_adequacy(cls, lean_template: str, lean_code: str) -> str: + """Reject final-only answer definitions that restate the extremal target.""" + if cls._placeholder_tokens(lean_code): + return "" + template_answer = cls._find_declaration_block(lean_template, "def answer") + if not template_answer or not cls._declaration_has_placeholder(template_answer): + return "" + candidate_answer = cls._find_declaration_block(lean_code, "def answer") + if not candidate_answer: + return "" + + body = cls._normalize_lean_for_semantic_scan(cls._lean_declaration_body(candidate_answer)) + if not body: + return "" + + extremal_markers = ( + "ssup", + "csup", + "nat.ssup", + "sup ", + "isgreatest", + "bddabove", + "upperbounds", + ) + self_reference_markers = ( + "s n", + "set ", + "finset", + "card", + "exists", + "u in", + "v in", + "divides", + " ∣ ", + "∣", + ) + if any(marker in body for marker in extremal_markers) and any( + marker in body for marker in self_reference_markers + ): + return ( + "PROOF SOLVER ANSWER ADEQUACY REJECTED: Lean accepted the final code, but `answer` is defined " + "using an extremal/supremum construction over the same feasible-cardinality problem instead " + "of determining the requested largest size in terms of n. This may remain in the durable " + "master proof as intermediate context, but it is not final-ready. Continue the Proof Solver loop, " + "derive an explicit formula for `answer n`, and then prove `IsGreatest (S n)` for that formula." 
+ ) + return "" + + @classmethod + def _validate_no_new_declaration_devices(cls, lean_template: str, lean_code: str, *, target: str) -> str: + integrity = validate_lean_proof_integrity( + lean_code=lean_code, + allowed_baseline=lean_template, + ) + if integrity.valid: + return "" + return ( + "PROOF SOLVER FORBIDDEN PROOF DEVICE: Lean accepted the submitted code, but the " + f"{target} introduced new axiom/constant/opaque declarations not present in the original template: " + f"{', '.join(integrity.introduced_devices[:8])}. Do not solve Proof Solver problems by adding fake assumptions; " + "preserve the template and fill the proof using constructive Lean/Mathlib proof terms or tactics." + ) + + @classmethod + def _validate_final_solution_matches_template(cls, lean_template: str, lean_code: str) -> str: + """Return an error message when a compiling final answer does not solve the template.""" + template = lean_template or "" + candidate = lean_code or "" + hole_aware_error = cls._validate_final_solution_matches_template_declarations(template, candidate) + if hole_aware_error is not None: + return hole_aware_error + + template_parts = [ + cls._normalize_lean_for_template_check(part) + for part in _LEAN_PLACEHOLDER_RE.split(template) + if part not in {"sorry", "admit"} + ] + significant_parts = [part for part in template_parts if len(part) >= 12] + normalized_candidate = cls._normalize_lean_for_template_check(candidate) + + if not significant_parts: + normalized_template = cls._normalize_lean_for_template_check(template) + if normalized_template and normalized_template not in normalized_candidate: + return ( + "PROOF SOLVER TEMPLATE MISMATCH: Lean accepted the submitted code, but the code does not preserve " + "the user's original Proof Solver template/declaration. Return the complete original template with " + "only the proof holes filled unless a template change is explicitly required." 
+ ) + return "" + + search_from = 0 + for part in significant_parts: + found_at = normalized_candidate.find(part, search_from) + if found_at < 0: + return ( + "PROOF SOLVER TEMPLATE MISMATCH: Lean accepted the submitted code, but the code does not contain " + "the original Proof Solver template structure around the proof hole. Do not replace the task with " + "an unrelated theorem; preserve the user's declarations and fill the required proof." + ) + search_from = found_at + len(part) + return "" + + @classmethod + def _validate_final_solution_matches_template_declarations(cls, lean_template: str, lean_code: str) -> Optional[str]: + """Validate LeanOJ submissions by preserving declarations while allowing hole bodies to change.""" + template_decls = cls._lean_declaration_blocks(lean_template) + candidate_decls = cls._lean_declaration_blocks(lean_code) + if not template_decls or not candidate_decls: + return None + + candidate_by_key: dict[str, str] = {} + for declaration in candidate_decls: + key = cls._lean_declaration_key(declaration) + if key and key not in candidate_by_key: + candidate_by_key[key] = declaration + + candidate_imports = set(cls._lean_imports(lean_code)) + for import_line in cls._lean_imports(lean_template): + if import_line not in candidate_imports: + return ( + "PROOF SOLVER TEMPLATE MISMATCH: Lean accepted the submitted code, but the code removed an " + f"original Proof Solver import required by the template: {import_line}. Preserve original imports; " + "additional imports are allowed when Lean needs them." + ) + + for template_decl in template_decls: + key = cls._lean_declaration_key(template_decl) + if not key: + continue + candidate_decl = candidate_by_key.get(key) + if not candidate_decl: + return ( + "PROOF SOLVER TEMPLATE MISMATCH: Lean accepted the submitted code, but it does not preserve " + f"the original Proof Solver declaration `{key}`. Do not replace the task with unrelated declarations." 
+ ) + + if cls._declaration_has_placeholder(template_decl): + template_header = cls._normalize_lean_declaration_header(template_decl) + candidate_header = cls._normalize_lean_declaration_header(candidate_decl) + if template_header != candidate_header: + return ( + "PROOF SOLVER TEMPLATE MISMATCH: Lean accepted the submitted code, but it changed the " + f"signature/target of original declaration `{key}`. Fill only the `sorry`/`admit` body." + ) + continue + + normalized_template_decl = cls._normalize_lean_for_template_check(template_decl) + normalized_candidate_decl = cls._normalize_lean_for_template_check(candidate_decl) + if normalized_template_decl != normalized_candidate_decl: + return ( + "PROOF SOLVER TEMPLATE MISMATCH: Lean accepted the submitted code, but it changed a fixed " + f"non-hole declaration `{key}` from the original template. Preserve fixed definitions exactly." + ) + + return "" + + @staticmethod + def _lean_imports(code: str) -> list[str]: + return [ + line.strip() + for line in (code or "").splitlines() + if line.strip().startswith("import ") + ] + + @staticmethod + def _lean_declaration_blocks(code: str) -> list[str]: + matches = list(_LEAN_TOP_LEVEL_DECL_RE.finditer(code or "")) + blocks: list[str] = [] + for index, match in enumerate(matches): + end = matches[index + 1].start() if index + 1 < len(matches) else len(code or "") + block = (code or "")[match.start() : end].strip() + if block: + blocks.append(block) + return blocks + + @staticmethod + def _lean_declaration_key(declaration: str) -> str: + match = _LEAN_DECL_KEY_RE.search(declaration or "") + if not match: + return "" + kind = match.group("kind") or "" + name = match.group("name") or "" + return f"{kind} {name}".strip() + + @classmethod + def _find_declaration_block(cls, code: str, declaration_key: str) -> str: + for declaration in cls._lean_declaration_blocks(code): + if cls._lean_declaration_key(declaration) == declaration_key: + return declaration + return "" + + @staticmethod + 
def _lean_declaration_body(declaration: str) -> str: + if ":=" not in (declaration or ""): + return "" + return (declaration or "").split(":=", 1)[1].strip() + + @classmethod + def _normalize_lean_for_semantic_scan(cls, code: str) -> str: + return cls._normalize_lean_for_template_check(strip_lean_comments_and_strings(code or "")).lower() + + @staticmethod + def _declaration_has_placeholder(declaration: str) -> bool: + return bool(_LEAN_PLACEHOLDER_RE.search(strip_lean_comments_and_strings(declaration or ""))) + + @classmethod + def _normalize_lean_declaration_header(cls, declaration: str) -> str: + header = (declaration or "").split(":=", 1)[0] + normalized = cls._normalize_lean_for_template_check(header) + while True: + previous = normalized + normalized = re.sub(r"^open\s+Classical\s+in\s+", "", normalized) + normalized = re.sub(r"^(?:@\[[^\]]+\]\s*)+", "", normalized) + normalized = re.sub(r"^(?:(?:private|protected|noncomputable|unsafe)\s+)+", "", normalized) + normalized = normalized.strip() + if normalized == previous: + break + return normalized.strip() + + @staticmethod + def _json_retry_schema_hint(role_id: str) -> str: + if role_id.startswith("leanoj_brainstorm_submitter_"): + return ( + "ROLE-SPECIFIC COMPACT RETRY CONTRACT:\n" + "- For this retry, use the normal idea schema only; do not choose `lean_proof`.\n" + "- Return exactly: " + "{\"submission_type\":\"idea\",\"submission\":\"...\",\"reasoning\":\"...\"}\n" + "- Keep `submission` under 600 characters and `reasoning` under 200 characters.\n" + "- Do not quote the Lean template, prior proof, accepted ideas, or failure log." 
+ ) + if role_id == "leanoj_brainstorm_validator": + return ( + "ROLE-SPECIFIC COMPACT RETRY CONTRACT:\n" + "- If the original prompt requested batch validation, return exactly one compact " + "`decisions` entry per submission in the original order.\n" + "- If the original prompt requested single validation, return one compact " + "`decision` object.\n" + "- Keep every `reasoning` and `summary` string under 160 characters.\n" + "- Do not quote submissions, Lean code, accepted ideas, or proof context." + ) + if role_id == "leanoj_topic_validator": + return ( + "ROLE-SPECIFIC COMPACT RETRY CONTRACT:\n" + "- Keep every topic, reasoning, and summary field under 200 characters.\n" + "- Do not quote the Lean template or prior topics." + ) + return ( + "ROLE-SPECIFIC COMPACT RETRY CONTRACT:\n" + "- Keep every string field short; do not quote large context blocks or code." + ) + + @classmethod + def _summarize_model_call_result( + cls, + role_id: str, + task_id: str, + parsed: dict[str, Any], + *, + limit: int = 700, + ) -> str: + """Return a compact outcome summary for live logs and INFO output.""" + if not isinstance(parsed, dict): + return cls._summarize_error(str(parsed), limit=limit) + + def clean(value: Any, text_limit: int = 320) -> str: + return cls._summarize_error(str(value or ""), limit=text_limit) + + def first_text(*keys: str, text_limit: int = 320) -> str: + for key in keys: + value = parsed.get(key) + if value: + return clean(value, text_limit) + return "" + + decisions = parsed.get("decisions") + if isinstance(decisions, list): + accepted = 0 + rejected = 0 + samples: list[str] = [] + for index, decision in enumerate(decisions, start=1): + if not isinstance(decision, dict): + continue + verdict = clean(decision.get("decision") or decision.get("verdict"), 40).lower() + if verdict == "accept": + accepted += 1 + elif verdict == "reject": + rejected += 1 + reason = clean( + decision.get("summary") or decision.get("reasoning") or decision.get("feedback"), + 160, 
+ ) + if reason and len(samples) < 2: + samples.append(f"{index}: {verdict or 'decision'} - {reason}") + prefix = f"batch result: {accepted} accepted, {rejected} rejected" + return cls._summarize_error( + f"{prefix}; {'; '.join(samples)}" if samples else prefix, + limit=limit, + ) + + if "enough" in parsed: + status = "ready for path decision" if bool(parsed.get("enough")) else "continue brainstorming" + reason = first_text("reasoning", "summary", "feedback", text_limit=260) + return cls._summarize_error( + f"sufficiency result: {status}{f' - {reason}' if reason else ''}", + limit=limit, + ) + + if parsed.get("path"): + reason = first_text("reasoning", "summary", text_limit=300) + return cls._summarize_error( + f"path result: {clean(parsed.get('path'), 80)}{f' - {reason}' if reason else ''}", + limit=limit, + ) + + if parsed.get("decision"): + reason = first_text("summary", "reasoning", "feedback_to_submitter", text_limit=300) + return cls._summarize_error( + f"decision: {clean(parsed.get('decision'), 80)}{f' - {reason}' if reason else ''}", + limit=limit, + ) + + if parsed.get("action") or parsed.get("operation"): + action = clean(parsed.get("action") or "action", 80) + operation = clean(parsed.get("operation"), 80) + reason = first_text("reasoning", "summary", "stuck_reason", text_limit=300) + label = f"{action}{f'/{operation}' if operation else ''}" + return cls._summarize_error( + f"action result: {label}{f' - {reason}' if reason else ''}", + limit=limit, + ) + + if parsed.get("topic"): + reason = first_text("reasoning", "summary", text_limit=220) + return cls._summarize_error( + f"topic: {clean(parsed.get('topic'), 360)}{f' - {reason}' if reason else ''}", + limit=limit, + ) + + if parsed.get("submission"): + submission_type = clean(parsed.get("submission_type") or "idea", 60) + reason = first_text("reasoning", "formal_sketch", text_limit=220) + return cls._summarize_error( + f"{submission_type}: {clean(parsed.get('submission'), 420)}{f' - {reason}' if reason 
else ''}", + limit=limit, + ) + + if parsed.get("theorem_statement"): + theorem = clean(parsed.get("theorem_name") or parsed.get("theorem_statement"), 360) + sketch = first_text("formal_sketch", "reasoning", text_limit=220) + return cls._summarize_error( + f"lean proof: {theorem}{f' - {sketch}' if sketch else ''}", + limit=limit, + ) + + if "solved" in parsed: + reason = first_text("reasoning", "summary", "continuation_feedback", text_limit=320) + return cls._summarize_error( + f"solver review: {'solved' if bool(parsed.get('solved')) else 'not solved'}{f' - {reason}' if reason else ''}", + limit=limit, + ) + + summary = first_text( + "summary", + "reasoning", + "feedback", + "message", + "answer", + text_limit=500, + ) + if summary: + return summary + + keys = ", ".join(sorted(str(key) for key in parsed.keys())[:8]) + return f"{role_id or task_id} returned JSON fields: {keys or 'none'}" + + async def _call_json( + self, + config: LeanOJRoleConfig, + task_prefix: str, + role_id: str, + prompt: str, + temperature: float = 0.0, + ) -> dict[str, Any]: + if not config.model_id: + raise LeanOJConfigurationError(f"Proof Solver role {role_id} has no configured model") + current_prompt = prompt + attempt_index = 0 + while not self._should_stop(): + attempt_index += 1 + if self._should_stop(): + raise asyncio.CancelledError() + task_id = self._next_task_id(task_prefix) + self.current_task_id = task_id + self._refresh_workflow_tasks(task_prefix, role_id) + api_client_manager.set_autonomous_phase(self._state.phase or "leanoj") + started = time.monotonic() + call_payload = { + "role_id": role_id, + "task_id": task_id, + "phase": self._state.phase or "leanoj", + "attempt": attempt_index, + "provider": config.provider, + "model": config.model_id, + "context_window": config.context_window, + "max_output_tokens": config.max_output_tokens, + "temperature": temperature, + } + logger.debug( + "Proof Solver model call started (role=%s, task=%s, phase=%s, provider=%s, model=%s, 
attempt=%s)", + role_id, + task_id, + call_payload["phase"], + config.provider, + config.model_id, + attempt_index, + ) + try: + prompt_tokens = count_tokens(current_prompt) + max_input_tokens = rag_config.get_available_input_tokens( + config.context_window, + config.max_output_tokens, + ) + call_payload["prompt_tokens"] = prompt_tokens + call_payload["max_input_tokens"] = max_input_tokens + if prompt_tokens > max_input_tokens: + raise LeanOJConfigurationError( + "PROOF SOLVER PROMPT CONTEXT OVERFLOW: assembled prompt exceeds the configured " + f"input budget for role {role_id}. Prompt tokens: {prompt_tokens}. " + f"Available input tokens: {max_input_tokens}. Context window: {config.context_window}. " + f"Max output tokens: {config.max_output_tokens}." + ) + response = await api_client_manager.generate_completion( + task_id=task_id, + role_id=role_id, + model=config.model_id, + messages=[{"role": "user", "content": current_prompt}], + max_tokens=config.max_output_tokens, + temperature=temperature, + ) + self.completed_task_ids.add(task_id) + + choices = response.get("choices") or [] + content = "" + if choices: + message = choices[0].get("message") or {} + content = message.get("content") or message.get("reasoning") or "" + parsed = parse_json(content) + if isinstance(parsed, list): + parsed = parsed[0] if parsed else {} + if isinstance(parsed, dict): + duration_ms = round((time.monotonic() - started) * 1000) + result_summary = self._summarize_model_call_result(role_id, task_id, parsed) + logger.info( + "Proof Solver model call result (role=%s, task=%s, phase=%s, duration_ms=%s, response_chars=%s): %s", + role_id, + task_id, + call_payload["phase"], + duration_ms, + len(content), + result_summary, + ) + await self._broadcast( + "leanoj_model_call_completed", + { + **call_payload, + "duration_ms": duration_ms, + "response_chars": len(content), + "result_summary": result_summary, + }, + ) + return parsed + raise ValueError("Proof Solver role returned JSON that was 
not an object.") + except asyncio.CancelledError: + raise + except LeanOJConfigurationError: + raise + except Exception as exc: + duration_ms = round((time.monotonic() - started) * 1000) + if self._is_non_retryable_model_error(exc): + logger.error( + "Proof Solver model call failed with non-retryable error (role=%s, task=%s, phase=%s, duration_ms=%s): %s", + role_id, + task_id, + call_payload["phase"], + duration_ms, + exc, + ) + await self._broadcast( + "leanoj_model_call_failed", + { + **call_payload, + "duration_ms": duration_ms, + "retryable": False, + "message": self._summarize_error(str(exc), limit=700), + }, + ) + raise LeanOJConfigurationError(str(exc)) from exc + logger.warning( + "Proof Solver role %s task %s failed to produce valid JSON on retryable attempt %s: %s", + role_id, + task_id, + attempt_index, + exc, + ) + error_summary = self._summarize_error( + f"Proof Solver role {role_id} returned unusable JSON on retryable attempt {attempt_index}: " + f"{type(exc).__name__}: {exc}", + limit=1200, + ) + await self._broadcast( + "leanoj_model_call_failed", + { + **call_payload, + "duration_ms": duration_ms, + "retryable": True, + "message": error_summary, + }, + ) + if attempt_index == 1 or attempt_index % 3 == 0: + self._failed_feedback.append( + { + "request": f"{role_id} JSON generation", + "error_summary": error_summary, + "role_id": role_id, + "attempt": attempt_index, + } + ) + await self._persist_and_broadcast( + "leanoj_role_json_retrying", + { + "role_id": role_id, + "task_id": task_id, + "attempt": attempt_index, + "message": error_summary, + }, + ) + current_prompt = ( + f"{prompt}\n\n" + "IMPORTANT - YOUR PREVIOUS RESPONSE WAS REJECTED BY THE JSON PARSER:\n" + "REJECTION REASON: INVALID_OR_TRUNCATED_JSON\n" + f"ISSUE: {type(exc).__name__}: {self._summarize_error(str(exc), limit=700)}\n" + "FIX REQUIRED:\n" + "- Return raw JSON only, with no markdown fences, commentary, or analysis.\n" + "- Start with `{` and end with `}`.\n" + "- Keep every 
string field concise enough to finish before max_tokens.\n" + "- Preserve the requested schema exactly.\n" + "- Escape Lean/LaTeX backslashes so the result is valid JSON.\n\n" + f"{self._json_retry_schema_hint(role_id)}" + ) + if attempt_index % 3 == 0: + await asyncio.sleep(min(5.0, 0.5 * (attempt_index // 3))) + finally: + self.current_task_id = None + self._refresh_workflow_tasks(task_prefix, role_id) + + raise asyncio.CancelledError() + + @staticmethod + def _missing_model_roles(request: LeanOJStartRequest) -> list[str]: + role_configs: list[tuple[str, LeanOJRoleConfig]] = [ + ("topic_generator", request.topic_generator), + ("topic_validator", request.topic_validator), + ("brainstorm_validator", request.brainstorm_validator), + ("final_solver", request.final_solver), + ] + role_configs.extend( + (f"brainstorm_submitter_{index}", submitter) + for index, submitter in enumerate(request.brainstorm_submitters, start=1) + ) + return [role_name for role_name, config in role_configs if not (config.model_id or "").strip()] + + def _next_task_id(self, prefix: str) -> str: + current = self._task_sequences.get(prefix, 0) + self._task_sequences[prefix] = current + 1 + return f"{prefix}_{current:03d}" + + def _refresh_workflow_tasks(self, active_prefix: str = "leanoj_topic", active_role: str = "LeanOJ") -> None: + submitter_count = max(1, len(self._request.brainstorm_submitters) if self._request else 1) + brainstorm_submitter_patterns = [ + (f"leanoj_brainstorm_sub{index}", f"Brainstorm Submitter {index}", "Cumulative Brainstorm") + for index in range(1, submitter_count + 1) + ] + pattern = [ + ("leanoj_topic", "Topic Generator", "Topic Selection"), + ("leanoj_topic_val", "Topic Validator", "Topic Validation"), + *brainstorm_submitter_patterns, + ("leanoj_brainstorm_val", "Brainstorm Validator", "Brainstorm Validation"), + ("leanoj_brainstorm_prune", "Brainstorm Prune Reviewer", "Brainstorm Pruning"), + ("leanoj_brainstorm_prune_val", "Brainstorm Prune Validator", 
"Brainstorm Pruning"), + ("leanoj_path", "Final Proof Solver", "Path Decision"), + ("leanoj_path_val", "Path Validator", "Path Validation"), + ("leanoj_final", "Final Solver", "Final Lean Loop"), + ("leanoj_master_proof_edit_val", "Master Proof Edit Validator", "Final Lean Loop"), + ("leanoj_final_review", "Final Solver Review", "Final Lean Loop"), + ] + tasks: list[WorkflowTask] = [] + start_seq = sum(self._task_sequences.values()) + for offset in range(20): + prefix, role, mode = pattern[offset % len(pattern)] + seq = self._task_sequences.get(prefix, 0) + offset + task_id = f"{prefix}_{seq:03d}" + tasks.append( + WorkflowTask( + task_id=task_id, + sequence_number=start_seq + offset + 1, + role=active_role if prefix == active_prefix else role, + mode=mode, + provider="lm_studio", + active=prefix == active_prefix, + completed=task_id in self.completed_task_ids, + ) + ) + self.workflow_tasks = tasks + + def _configure_roles(self, request: LeanOJStartRequest) -> None: + self._configure_role("leanoj_topic_generator", request.topic_generator) + self._configure_role("leanoj_topic_selector", request.topic_generator) + self._configure_role("leanoj_topic_validator", request.topic_validator) + self._configure_role("leanoj_path_validator", request.topic_validator) + self._configure_role("leanoj_proof_novelty", request.topic_validator) + self._configure_role("leanoj_brainstorm_validator", request.brainstorm_validator) + self._configure_role("leanoj_master_proof_edit_validator", request.brainstorm_validator) + self._configure_role("leanoj_final_solver", request.final_solver) + for index, submitter in enumerate(request.brainstorm_submitters, start=1): + self._configure_role(f"leanoj_topic_submitter_{index}", submitter) + self._configure_role(f"leanoj_brainstorm_submitter_{index}", submitter) + self._configure_role(f"leanoj_brainstorm_prune_reviewer_{index}", submitter) + + @staticmethod + def _configure_role(role_id: str, config: LeanOJRoleConfig) -> None: + 
api_client_manager.configure_role( + role_id, + ModelConfig( + provider=config.provider, + model_id=config.model_id, + openrouter_model_id=config.model_id if config.provider == "openrouter" else None, + openrouter_provider=config.openrouter_provider, + openrouter_reasoning_effort=config.openrouter_reasoning_effort, + lm_studio_fallback_id=config.lm_studio_fallback_id, + context_window=config.context_window, + max_output_tokens=config.max_output_tokens, + supercharge_enabled=config.supercharge_enabled, + ), + ) + + async def _persist_and_broadcast(self, event: str, data: Optional[dict[str, Any]] = None) -> None: + self._state.updated_at = datetime.now() + self._remember_active_phase() + await self._persist_state() + await self._broadcast(event, data or self.get_status()) + await self._broadcast("leanoj_status_updated", self.get_status()) + + async def _persist_state(self) -> None: + session_dir = self._session_dir() + session_dir.mkdir(parents=True, exist_ok=True) + self._ensure_accepted_idea_records() + payload = self.get_status() + if self._request is not None: + payload["request"] = self._request.model_dump(mode="json") + payload["task_sequences"] = dict(self._task_sequences) + payload["completed_task_ids"] = sorted(self.completed_task_ids) + await leanoj_context_manager.write_session_artifacts( + session_id=self._state.session_id, + accepted_ideas=self._accepted_ideas, + accepted_idea_records=self._accepted_idea_records, + recursive_topics=self._recursive_topics, + verified_subproofs=self._verified_subproof_dicts(), + partial_proofs=self._partial_proofs, + failed_subproofs=self._failed_context_dicts(), + final_attempts=self._final_attempts, + final_cycle_packets=self._final_cycle_packets, + ) + async with aiofiles.open(session_dir / "state.json", "w", encoding="utf-8") as f: + await f.write(json.dumps(payload, indent=2)) + + def _session_dir(self) -> Path: + session_id = self._state.session_id or "latest" + return self._sessions_base_dir() / session_id + + 
@staticmethod + def _sessions_base_dir() -> Path: + return Path(system_config.data_dir) / "leanoj_sessions" + + def _find_latest_state_file(self) -> Optional[Path]: + base = self._sessions_base_dir() + if not base.exists(): + return None + state_files = [path for path in base.glob("*/state.json") if path.is_file()] + if not state_files: + return None + return max(state_files, key=lambda path: path.stat().st_mtime) + + def _find_best_resumable_state_file(self) -> Optional[Path]: + """Prefer the most valuable interrupted session after process restart.""" + return self._find_best_state_file() + + def _find_best_matching_state_file(self, request: LeanOJStartRequest) -> Optional[Path]: + """Prefer the most-progressed saved session for this exact Proof Solver problem.""" + return self._find_best_state_file(request) + + def _find_best_state_file(self, request: Optional[LeanOJStartRequest] = None) -> Optional[Path]: + base = self._sessions_base_dir() + if not base.exists(): + return None + + candidates: list[tuple[tuple[int, int, int, int, int, int, int, float], Path]] = [] + for path in base.glob("*/state.json"): + if not path.is_file(): + continue + try: + payload = json.loads(path.read_text(encoding="utf-8")) + except Exception: + continue + if request is not None and not self._payload_matches_request(payload, request): + continue + if payload.get("final_solution"): + continue + phase = str(payload.get("phase") or "") + if phase in _TERMINAL_PHASES: + continue + candidates.append((self._payload_progress_score(payload, path), path)) + + if not candidates: + return None + return max(candidates, key=lambda item: item[0])[1] + + @staticmethod + def _payload_matches_request(payload: dict[str, Any], request: LeanOJStartRequest) -> bool: + request_payload = payload.get("request") + if not isinstance(request_payload, dict): + return False + return ( + str(request_payload.get("user_prompt") or "").strip() == request.user_prompt.strip() + and 
str(request_payload.get("lean_template") or "").strip() == request.lean_template.strip() + ) + + @staticmethod + def _payload_progress_score(payload: dict[str, Any], path: Path) -> tuple[int, int, int, int, int, int, int, float]: + verified_subproofs = payload.get("verified_subproofs") or [] + validated_topics = payload.get("validated_topics") or [] + failed_subproofs = payload.get("failed_subproofs") or [] + accepted_count = int(payload.get("accepted_brainstorm_count") or 0) + topic_count = len(validated_topics) if isinstance(validated_topics, list) else 0 + final_attempt_count = int(payload.get("final_attempt_count") or 0) + master_proof_version = int(payload.get("master_proof_version") or 0) + return ( + LeanOJCoordinator._payload_phase_rank(payload), + master_proof_version, + final_attempt_count, + len(verified_subproofs) if isinstance(verified_subproofs, list) else 0, + accepted_count, + topic_count, + len(failed_subproofs) if isinstance(failed_subproofs, list) else 0, + path.stat().st_mtime, + ) + + @staticmethod + def _payload_phase_rank(payload: dict[str, Any]) -> int: + phase = str(payload.get("phase") or "") + if phase in {"stopped", "error"}: + last_active_phase = str(payload.get("last_active_phase") or "") + inferred_rank = LeanOJCoordinator._infer_payload_phase_rank(payload) + last_active_rank = _PHASE_PROGRESS_RANK.get(last_active_phase, 0) + return max(inferred_rank, last_active_rank) + return _PHASE_PROGRESS_RANK.get(phase, 0) + + @staticmethod + def _infer_payload_phase_rank(payload: dict[str, Any]) -> int: + if payload.get("master_proof_initialized") or int(payload.get("master_proof_version") or 0) > 0: + return _PHASE_PROGRESS_RANK["final_proof_loop"] + if int(payload.get("final_attempt_count") or 0) > 0 or payload.get("final_attempts"): + return _PHASE_PROGRESS_RANK["final_proof_loop"] + if payload.get("current_path_decision") == "solve_final_now": + return _PHASE_PROGRESS_RANK["final_proof_loop"] + if payload.get("verified_subproofs") or 
payload.get("failed_subproofs"): + return _PHASE_PROGRESS_RANK["path_decision"] + if int(payload.get("accepted_brainstorm_count") or 0) > 0 or payload.get("accepted_ideas"): + return _PHASE_PROGRESS_RANK["initial_brainstorm"] + if payload.get("selected_topic"): + return _PHASE_PROGRESS_RANK["initial_brainstorm"] + if payload.get("validated_topics"): + return _PHASE_PROGRESS_RANK["initial_topic_candidates"] + return _PHASE_PROGRESS_RANK["idle"] + + def _restore_from_payload(self, payload: dict[str, Any]) -> None: + request_payload = payload.get("request") + restored_request = LeanOJStartRequest.model_validate(request_payload) if request_payload else None + + self._state = LeanOJState.model_validate(payload) + self._state.is_running = False + master_proof_path = self._master_proof_path(self._state.session_id) + if master_proof_path.exists(): + try: + self._set_master_proof_metadata(master_proof_path.read_text(encoding="utf-8")) + except Exception as exc: + logger.warning("Failed to restore Proof Solver master proof metadata from %s: %s", master_proof_path, exc) + artifacts = leanoj_context_manager.load_session_artifacts(self._state.session_id) + self._validated_topics = [str(item) for item in payload.get("validated_topics") or []] + restored_accepted_ideas = [ + *[str(item) for item in payload.get("accepted_ideas") or []], + *[str(item) for item in artifacts.get(ARTIFACT_ACCEPTED_IDEAS, [])], + ] + self._accepted_idea_records = [ + dict(item) for item in (payload.get("accepted_idea_records") or []) if isinstance(item, dict) + ] + artifact_idea_records = [ + dict(item) + for item in artifacts.get("accepted_idea_records", []) + if isinstance(item, dict) + ] + if artifact_idea_records: + record_keys = { + self._dict_record_key(record) + for record in self._accepted_idea_records + } + for record in artifact_idea_records: + content = str(record.get("content") or "") + record_key = self._dict_record_key(record) + if content.strip() and record_key not in record_keys: + 
self._accepted_idea_records.append(record) + record_keys.add(record_key) + if self._accepted_idea_records: + self._accepted_ideas = [ + str(record.get("content") or "") + for record in self._accepted_idea_records + if str(record.get("content") or "").strip() + ] + recorded_contents = set(self._accepted_ideas) + self._accepted_ideas.extend( + idea + for idea in restored_accepted_ideas + if str(idea).strip() and idea not in recorded_contents + ) + else: + self._accepted_ideas = self._dedupe_strings(restored_accepted_ideas) + self._ensure_accepted_idea_records() + if self._state.brainstorm_acceptance_events < len(self._accepted_ideas): + self._state.brainstorm_acceptance_events = max( + int(payload.get("brainstorm_acceptance_events") or 0), + len(self._accepted_ideas), + ) + self._failed_feedback = [ + dict(item) for item in (payload.get("failed_feedback") or []) if isinstance(item, dict) + ] + self._failed_feedback = self._dedupe_dict_records( + [ + *self._failed_feedback, + *[ + dict(item) + for item in artifacts.get(ARTIFACT_FAILED_SUBPROOFS, []) + if isinstance(item, dict) + ], + ] + ) + self._final_attempts = [ + dict(item) for item in (payload.get("final_attempts") or []) if isinstance(item, dict) + ] + self._final_attempts = self._dedupe_dict_records( + [ + *[ + dict(item) + for item in artifacts.get(ARTIFACT_FINAL_ATTEMPTS, []) + if isinstance(item, dict) + ], + *self._final_attempts, + ] + ) + self._final_context_events = [ + dict(item) + for item in payload.get("final_context_events") or [] + if isinstance(item, dict) + ][-50:] + partial_proofs = [ + dict(item) for item in (payload.get("partial_proofs") or []) if isinstance(item, dict) + ] + persisted_partial_proofs = self._load_partial_proof_database(self._state.session_id) + self._partial_proofs = self._dedupe_partial_proofs( + [ + *partial_proofs, + *persisted_partial_proofs, + *[ + dict(item) + for item in artifacts.get(ARTIFACT_PARTIAL_PROOFS, []) + if isinstance(item, dict) + ], + ] + ) + 
verified_records = self._dedupe_dict_records( + [ + *[item.model_dump(mode="json") for item in self._state.verified_subproofs], + *[ + dict(item) + for item in artifacts.get(ARTIFACT_VERIFIED_SUBPROOFS, []) + if isinstance(item, dict) + ], + ] + ) + self._state.verified_subproofs = [ + LeanOJSubproofRecord.model_validate(item) + for item in verified_records + ] + self._final_cycle_packets = self._dedupe_dict_records( + [ + *[ + dict(item) + for item in artifacts.get(ARTIFACT_FINAL_CYCLE_PACKETS, []) + if isinstance(item, dict) + ], + *[ + dict(item) + for item in payload.get("final_cycle_packets") or [] + if isinstance(item, dict) + ], + ] + ) + current_packet = payload.get("current_final_cycle_packet") + self._current_final_cycle_packet = dict(current_packet) if isinstance(current_packet, dict) else None + working_packet = payload.get("current_working_proof_attempt") + self._current_working_proof_attempt = dict(working_packet) if isinstance(working_packet, dict) else None + self._task_sequences = { + str(key): int(value) + for key, value in (payload.get("task_sequences") or {}).items() + if isinstance(value, int) or str(value).isdigit() + } + self.completed_task_ids = {str(item) for item in payload.get("completed_task_ids") or []} + self.workflow_tasks = [] + self.current_task_id = None + self._stop_event = asyncio.Event() + self._request = restored_request + self._running = False + self._restored_from_disk = True + self._reset_master_proof_progress_watchdog() + + if self._request is not None: + self._configure_roles(self._request) + + def _should_stop(self) -> bool: + return self._stop_event.is_set() + + def _begin_brainstorm_acceptance_phase(self, phase_key: str) -> None: + if self._state.active_brainstorm_phase != phase_key: + self._state.active_brainstorm_phase = phase_key + self._state.active_brainstorm_start_count = self._state.brainstorm_acceptance_events + self._state.active_brainstorm_last_sufficiency_check_count = 0 + 
self._state.active_brainstorm_last_prune_review_count = 0 + + def _get_brainstorm_acceptance_start(self, phase_key: str) -> int: + if self._state.active_brainstorm_phase != phase_key: + self._begin_brainstorm_acceptance_phase(phase_key) + if self._state.active_brainstorm_start_count > self._state.brainstorm_acceptance_events: + self._state.active_brainstorm_start_count = self._state.brainstorm_acceptance_events + return self._state.active_brainstorm_start_count + + def _finish_brainstorm_acceptance_phase_for_path_decision(self) -> None: + self._state.phase = "path_decision" + self._state.active_brainstorm_phase = "" + self._state.active_brainstorm_start_count = self._state.brainstorm_acceptance_events + self._state.active_brainstorm_last_sufficiency_check_count = 0 + self._state.active_brainstorm_last_prune_review_count = 0 + + @staticmethod + def _is_non_retryable_model_error(exc: Exception) -> bool: + return is_non_retryable_model_error(exc) + + def _remember_active_phase(self) -> None: + if self._state.phase in _ACTIVE_PHASES: + self._state.last_active_phase = self._state.phase + + def _infer_resume_phase(self) -> str: + if ( + self._current_working_proof_attempt + or self._state.master_proof_initialized + or self._state.master_proof_version > 0 + ): + return "final_proof_loop" + if self._state.final_attempt_count > 0 or self._final_attempts: + return "final_proof_loop" + if self._state.current_path_decision == "solve_final_now": + return "final_proof_loop" + if self._state.last_active_phase in _ACTIVE_PHASES: + return self._state.last_active_phase + if self._state.verified_subproofs or self._state.failed_subproofs: + return "path_decision" + if self._accepted_ideas or self._state.accepted_brainstorm_count > 0 or self._state.selected_topic: + return "initial_brainstorm" + if self._validated_topics: + return "initial_topic_candidates" + return "initial_topic_candidates" + + @staticmethod + def _summarize_error(error_text: str, limit: int = 800) -> str: + cleaned 
= " ".join((error_text or "").split()) + return cleaned[:limit] + ("..." if len(cleaned) > limit else "") + + +leanoj_coordinator = LeanOJCoordinator() diff --git a/backend/leanoj/prompts.py b/backend/leanoj/prompts.py new file mode 100644 index 0000000..e4e9ad5 --- /dev/null +++ b/backend/leanoj/prompts.py @@ -0,0 +1,1009 @@ +"""Prompt builders for the LeanOJ proof-solver mode.""" +from __future__ import annotations + +import re +from typing import Any, Iterable + + +JSON_RULES = ( + "Respond with ONLY valid JSON. Do not use markdown fences. " + "Escape Lean backslashes and newlines correctly for JSON strings." +) + +LEANOJ_FORMALIZATION_GUARDRAILS = """LEANOJ FORMALIZATION GUARDRAILS: +- Treat the LeanOJ template as the source of truth for formal semantics. Do not silently reinterpret template operations to match informal olympiad intuition. For example, in a template over `Nat`, `a - b` is truncated natural subtraction, not signed integer subtraction. +- Before committing to a closed-form `answer`, test proposed formulas and constructions against the exact Lean predicate on small cases when feasible. Counterexamples to the exact template override informal expectations. +- Lean acceptance is necessary but not sufficient for final success. A Lean-verified file proves the formal statement it encodes; it does not automatically prove the user's informal problem statement if the template or chosen definitions exploit or mismatch the natural-language task. +- If the template semantics and informal statement appear to conflict, make the mismatch explicit in reasoning and do not claim that a Lean-verified template proof settles the informal statement unless that correspondence has also been justified.""" + + +def _format_items(items: Iterable[Any], *, empty: str = "[none]") -> str: + values = [str(item).strip() for item in (items or []) if str(item).strip()] + if not values: + return empty + return "\n".join(f"{index}. 
{value}" for index, value in enumerate(values, start=1)) + + +def _format_brainstorm(ideas: list[str], limit: int = 80) -> str: + if not ideas: + return "[No accepted brainstorm ideas yet.]" + visible = ideas[-limit:] + prefix = "" if len(visible) == len(ideas) else f"[Showing most recent {len(visible)} of {len(ideas)} accepted ideas.]\n" + return prefix + "\n".join(f"{index}. {idea}" for index, idea in enumerate(visible, start=1)) + + +def _final_mode_text(value: Any) -> str: + text = str(value or "") + cleaned = ( + text.replace("need_more_brainstorming", "additional proof context") + .replace("Brainstorm", "Proof memory") + .replace("brainstorm", "proof memory") + .replace("BRAINSTORM", "PROOF MEMORY") + ) + return _remove_attempt_count_language(cleaned) + + +def _remove_attempt_count_language(value: Any) -> str: + text = str(value or "") + replacements = ( + ( + r"\bfailed\s+\d+\s+consecutive\s+verification/edit\s+attempts?\b", + "encountered repeated verification/edit failures", + ), + (r"\bfailed\s+\d+\s+consecutive\s+attempts?\b", "encountered repeated failures"), + (r"\bfailed\s+\d+\s+attempts?\b", "encountered repeated failures"), + (r"\bfailed\s+\d+\s+times\b", "encountered repeated failures"), + (r"\bafter\s+failed\s+attempts\b", "after recent proof-check failures"), + (r"\bfailed\s+attempts\b", "proof-check failures"), + (r"\battempts\s+\d+\s*-\s*\d+\b", "recent final-loop feedback"), + (r"\bwith\s+exactly\s+\d+\s+failed\s+attempts?\b", "with recent proof-check failures"), + (r"\bUse this exact failed-attempt count[^.]*\.", ""), + (r"\bfailed-attempt count\b", "failure context"), + ) + for pattern, replacement in replacements: + text = re.sub(pattern, replacement, text, flags=re.IGNORECASE) + return re.sub(r" {2,}", " ", text).strip() + + +def _format_proof_memory_notes(ideas: list[str], limit: int = 80) -> str: + if not ideas: + return "[No accepted proof memory notes yet.]" + visible = ideas[-limit:] + prefix = "" if len(visible) == len(ideas) else 
f"[Showing most recent {len(visible)} accepted proof memory notes.]\n" + return prefix + "\n".join(f"{index}. {_final_mode_text(idea)}" for index, idea in enumerate(visible, start=1)) + + +def _format_verified_subproofs(subproofs: list[dict[str, Any]]) -> str: + if not subproofs: + return "[No verified subproofs yet.]" + blocks = [] + for index, subproof in enumerate(subproofs, start=1): + lean_feedback = str(subproof.get("lean_feedback") or "").strip() + feedback_lines = ["Lean verifier feedback:", lean_feedback] if lean_feedback else [] + blocks.append( + "\n".join( + [ + f"SUBPROOF {index}: {subproof.get('request', '')}", + f"Role: {subproof.get('role', '')}", + f"Theorem/Lemma: {subproof.get('theorem_or_lemma', '')}", + *feedback_lines, + "Verified Lean 4 code:", + subproof.get("lean_code", ""), + "---", + ] + ) + ) + return "\n".join(blocks) + + +def _format_verified_subproofs_for_final(subproofs: list[dict[str, Any]]) -> str: + if not subproofs: + return "[No verified subproofs yet.]" + blocks = [] + for index, subproof in enumerate(subproofs, start=1): + lean_feedback = _final_mode_text(subproof.get("lean_feedback") or "").strip() + feedback_lines = ["Lean verifier feedback:", lean_feedback] if lean_feedback else [] + blocks.append( + "\n".join( + [ + f"SUBPROOF {index}: {_final_mode_text(subproof.get('request', ''))}", + f"Theorem/Lemma: {_final_mode_text(subproof.get('theorem_or_lemma', ''))}", + *feedback_lines, + "Verified Lean 4 code:", + subproof.get("lean_code", ""), + "---", + ] + ) + ) + return "\n".join(blocks) + + +def _format_partial_proofs(partial_proofs: list[dict[str, Any]], limit: int = 8) -> str: + if not partial_proofs: + return "[No accepted partial proof scaffolds yet.]" + blocks = [] + for index, proof in enumerate(partial_proofs[-limit:], start=1): + placeholders = ", ".join(proof.get("placeholder_tokens") or []) or "unknown" + blocks.append( + "\n".join( + [ + f"PARTIAL PROOF {index}: {proof.get('request', '')}", + f"Target: 
{proof.get('target', '')}; placeholders: {placeholders}", + f"Summary: {proof.get('summary', '')}", + "Lean-accepted incomplete scaffold:", + proof.get("lean_code", ""), + "---", + ] + ) + ) + return "\n".join(blocks) + + +def _format_partial_proofs_for_final(partial_proofs: list[dict[str, Any]], limit: int = 8) -> str: + if not partial_proofs: + return "[No accepted partial proof scaffolds yet.]" + blocks = [] + for index, proof in enumerate(partial_proofs[-limit:], start=1): + placeholders = ", ".join(proof.get("placeholder_tokens") or []) or "unknown" + blocks.append( + "\n".join( + [ + f"PARTIAL PROOF {index}: {_final_mode_text(proof.get('request', ''))}", + f"Placeholders: {placeholders}", + f"Summary: {_final_mode_text(proof.get('summary', ''))}", + "Lean-accepted incomplete scaffold:", + proof.get("lean_code", ""), + "---", + ] + ) + ) + return "\n".join(blocks) + + +def _format_failures(failures: list[dict[str, Any]], limit: int = 10) -> str: + if not failures: + return "[No useful failed proof feedback yet.]" + visible = failures[-limit:] + blocks = [] + for index, failure in enumerate(visible, start=1): + block = ( + f"{index}. 
{_remove_attempt_count_language(failure.get('request', 'final proof'))} :: " + f"{_remove_attempt_count_language(failure.get('error_summary', ''))}" + ) + lean_feedback = str(failure.get("lean_feedback") or "").strip() + if lean_feedback: + block += f"\n Lean feedback: {_remove_attempt_count_language(lean_feedback)}" + blocks.append(block) + return "\n".join(blocks) + + +def _format_feedback_notes(failures: list[dict[str, Any]], limit: int = 10) -> str: + if not failures: + return "[No recent proof feedback available.]" + visible = failures[-limit:] + blocks = [] + for failure in visible: + request = str(failure.get("request") or "").strip() + error_summary = str(failure.get("error_summary") or failure.get("error_output") or "").strip() + lean_feedback = str(failure.get("lean_feedback") or "").strip() + combined = "\n".join(part for part in [request, error_summary, lean_feedback] if part).lower() + phase_noise = "need_more_brainstorming" in combined or "stuck_needs_brainstorm" in combined + if phase_noise and not _has_concrete_execution_feedback(combined): + continue + pieces = [ + part + for part in [ + _final_mode_text(error_summary), + f"Lean feedback: {_final_mode_text(lean_feedback)}" if lean_feedback else "", + ] + if part + ] + if pieces: + blocks.append("\n".join(pieces)) + return "\n\n---\n\n".join(blocks) if blocks else "[No recent proof feedback available.]" + + +def _has_concrete_execution_feedback(text: str) -> bool: + concrete_terms = ( + "old_string", + "unexpected token", + "missing cases", + "unsolved goals", + "error:", + "rejected", + "invalid", + "json", + "max_tokens", + "lean", + "verification", + "watchdog", + ) + lowered = str(text or "").lower() + return any(term in lowered for term in concrete_terms) + + +def _clip_prompt_field(value: Any, limit: int = 1200) -> str: + text = _final_mode_text(value).strip() + if len(text) <= limit: + return text + return text[: limit - 20].rstrip() + " ... 
[truncated]" + + +def _format_recent_final_attempts(attempts: list[dict[str, Any]], limit: int = 5) -> str: + visible = [record for record in (attempts or [])[-limit:] if isinstance(record, dict)] + if not visible: + return "[No recent final feedback recorded.]" + blocks = [] + for index, record in enumerate(visible, start=1): + request = _clip_prompt_field(record.get("request") or "final proof feedback", limit=300) + error_summary = _clip_prompt_field( + record.get("error_summary") or record.get("error_output") or "", + limit=1400, + ) + lean_feedback = _clip_prompt_field(record.get("lean_feedback") or "", limit=1000) + reasoning = _clip_prompt_field(record.get("reasoning") or "", limit=800) + lines = [f"FEEDBACK ITEM {index}: {request}"] + if error_summary: + lines.append(f"Result/error: {error_summary}") + if lean_feedback: + lines.append(f"Lean feedback: {lean_feedback}") + if reasoning: + lines.append(f"Prior solver reasoning: {reasoning}") + blocks.append("\n".join(lines)) + return "\n\n---\n\n".join(blocks) + + +def _format_context_blocks(context_blocks: dict[str, str] | None, fallback: str) -> str: + if not context_blocks: + return fallback + sections = [] + working_proof = (context_blocks.get("current_working_proof_attempt") or "").strip() + current_packet = (context_blocks.get("current_final_cycle_packet") or "").strip() + direct_context = (context_blocks.get("direct_proof_context") or "").strip() + rag_context = (context_blocks.get("rag_evidence_context") or "").strip() + refuted_warnings = (context_blocks.get("refuted_construction_warnings") or "").strip() + capped_feedback = (context_blocks.get("capped_rejection_feedback") or "").strip() + if working_proof: + sections.append(working_proof) + if current_packet: + sections.append(current_packet) + if direct_context: + sections.append(f"DIRECT PROOF CONTEXT:\n{direct_context}") + if rag_context: + sections.append(f"RETRIEVED LEANOJ RAG EVIDENCE:\n{rag_context}") + if refuted_warnings: + sections.append( + 
"REFUTED CONSTRUCTIONS - DO NOT USE AS PROOF EVIDENCE:\n" + f"{refuted_warnings}" + ) + if capped_feedback: + sections.append(f"CAPPED REJECTION FEEDBACK:\n{capped_feedback}") + return "\n\n".join(sections) if sections else fallback + + +def build_topic_candidate_prompt(user_prompt: str, lean_template: str, prior_topics: list[str]) -> str: + return f"""You are generating one candidate root foundation question for a LeanOJ proof-solving run. + +The system must solve the user's Lean 4 template completely. Propose a broad initial foundation question that can guide the entire session before recursive brainstorms add details. This is not a local sublemma target: it should set the durable direction for finding the complete solution. + +The topic must address ALL major solution obligations: +- Determine an explicit formula/value for `answer n`. +- Find or verify the extremal lower-bound construction. +- Prove the matching upper bound. +- Respect the exact LeanOJ template semantics, including Lean/Nat behavior. +- Identify a Mathlib-compatible Lean 4 formalization route for `IsGreatest (S n) (answer n)`. + +Reject narrow framing in your own generation. Do not return a topic that is only about one lemma, one tactic, one bound, one construction, small-case testing alone, or repairing a current proof attempt. + +USER PROBLEM: +{user_prompt} + +LEANOJ TEMPLATE: +{lean_template} + +{LEANOJ_FORMALIZATION_GUARDRAILS} + +PRIOR VALIDATED TOPICS: +{_format_items(prior_topics)} + +Return a new non-duplicative broad foundation topic. It should read like a general question that addresses the whole problem and can remain locked as the initial session foundation. If prior topics already cover the same root framing, choose a distinct foundation angle that still covers all obligations, such as exact-template semantics first, extremal-combinatorics first, or Lean-formalization architecture first. 
+ +Correct topic style: +{{"topic": "Determine a complete Lean 4 solution strategy for the exact LeanOJ template, including the explicit answer formula, extremal construction, upper-bound proof, template-semantics checks, and Mathlib formalization route.", "reasoning": "This covers every obligation needed for the final LeanOJ proof."}} + +Wrong topic style: +{{"topic": "Find a useful divisibility lemma for complex numbers.", "reasoning": "This is too narrow because it targets only one possible lemma and does not address the full solution foundation."}} + +{JSON_RULES} +JSON format: +{{"topic": "broad foundation topic", "reasoning": "why this topic sets the best foundation for solving the whole Lean template"}} +""" + + +def build_topic_validation_prompt(user_prompt: str, lean_template: str, topic: str, accepted_topics: list[str]) -> str: + return f"""You are validating a proposed LeanOJ initial foundation topic. + +Accept only if the topic is relevant to solving the user's exact Lean 4 template, non-duplicative, and broad enough to serve as the locked initial session foundation. + +The topic must address ALL major solution obligations: +- An explicit formula/value for `answer n`. +- A lower-bound construction. +- A matching upper-bound proof. +- Exact LeanOJ template semantics, including Lean/Nat behavior. +- A Lean 4 / Mathlib formalization route for `IsGreatest (S n) (answer n)`. + +Reject topics that are narrow, partial, or local: one sublemma, one tactic, one bound, one construction, small-case testing alone, or current-proof repair. Those belong in recursive brainstorms after the foundation exists, not in initial topic selection. 
+ +USER PROBLEM: +{user_prompt} + +LEANOJ TEMPLATE: +{lean_template} + +{LEANOJ_FORMALIZATION_GUARDRAILS} + +ACCEPTED TOPICS: +{_format_items(accepted_topics)} + +PROPOSED TOPIC: +{topic} + +Correct acceptance target: +{{"decision": "accept", "reasoning": "The topic covers answer formula, construction, upper bound, template semantics, and Lean formalization.", "summary": "Broad foundation topic."}} + +Required rejection target for narrow topics: +{{"decision": "reject", "reasoning": "The topic asks for one divisibility lemma.", "summary": "Invalid because this is a narrow sublemma topic, not a whole-problem foundation."}} + +{JSON_RULES} +JSON format: +{{"decision": "accept or reject", "reasoning": "brief validation reasoning", "summary": "short feedback if rejected"}} +""" + + +def build_topic_batch_validation_prompt( + user_prompt: str, + lean_template: str, + topics: list[str], + accepted_topics: list[str], +) -> str: + formatted_topics = "\n\n---\n\n".join( + f"TOPIC {index}:\n{topic}" + for index, topic in enumerate(topics, start=1) + ) + return f"""You are the single validator for cumulative LeanOJ initial foundation topics. + +Evaluate EACH proposed topic independently against the current accepted topic context, then check accepted topics for intra-batch redundancy. Accept only topics that are relevant to solving the user's exact Lean 4 template, non-duplicative, and broad enough to serve as the locked initial session foundation. + +Each accepted topic must address ALL major solution obligations: +- An explicit formula/value for `answer n`. +- A lower-bound construction. +- A matching upper-bound proof. +- Exact LeanOJ template semantics, including Lean/Nat behavior. +- A Lean 4 / Mathlib formalization route for `IsGreatest (S n) (answer n)`. + +CRITICAL: +- Judge each topic against CURRENT ACCEPTED TOPICS first, not against the other topics in this batch. +- Only after independent decisions, compare independently accepted topics against each other. 
+- If two accepted topics are redundant with each other, keep the stronger/more concrete one and reject the weaker one with an intra-batch redundancy summary. +- Reject narrow or partial initial topics even if they would be useful later: one sublemma, one tactic, one bound, one construction, small-case testing alone, or current-proof repair. +- Return exactly one decision object per topic, in the same order. + +USER PROBLEM: +{user_prompt} + +LEANOJ TEMPLATE: +{lean_template} + +{LEANOJ_FORMALIZATION_GUARDRAILS} + +CURRENT ACCEPTED TOPICS: +{_format_items(accepted_topics)} + +TOPICS TO VALIDATE: +{formatted_topics} + +Correct acceptance target: +{{"topic_number": 1, "decision": "accept", "reasoning": "The topic covers the full answer formula, construction, upper bound, exact template semantics, and Lean formalization route.", "summary": "Broad foundation topic."}} + +Required rejection target for narrow topics: +{{"topic_number": 1, "decision": "reject", "reasoning": "The topic asks for one useful tactic or helper lemma.", "summary": "Invalid because this is too narrow for initial topic selection."}} + +{JSON_RULES} +JSON format: +{{"decisions": [{{"topic_number": 1, "decision": "accept or reject", "reasoning": "validation reasoning", "summary": "short rejection or acceptance summary"}}]}} +""" + + +def build_topic_selection_prompt(user_prompt: str, lean_template: str, topics: list[str]) -> str: + return f"""You are selecting the locked initial foundation topic for a LeanOJ proof-solving run. + +Choose exactly one of the validated topics below, or propose a clearly better replacement topic. The chosen topic must maximize the chance of solving the Lean 4 template by setting a broad root direction for the whole session. + +The selected topic will be treated as the initial frozen foundation that recursive brainstorms build on. 
It must not be a narrow sublemma, tactic-only investigation, one-bound-only question, one-construction-only question, small-case-only check, or current-proof repair target. + +The selected topic must address ALL major solution obligations: +- Determine an explicit formula/value for `answer n`. +- Establish the extremal lower-bound construction. +- Prove the matching upper bound. +- Respect exact LeanOJ template semantics, including Lean/Nat behavior. +- Set a Mathlib-compatible Lean 4 formalization route for `IsGreatest (S n) (answer n)`. + +USER PROBLEM: +{user_prompt} + +LEANOJ TEMPLATE: +{lean_template} + +{LEANOJ_FORMALIZATION_GUARDRAILS} + +VALIDATED TOPICS: +{_format_items(topics)} + +Correct selected-topic style: +{{"topic": "Determine a complete Lean 4 solution strategy for the exact LeanOJ template, including the explicit answer formula, extremal construction, upper-bound proof, template-semantics checks, and Mathlib formalization route.", "reasoning": "This is broad enough to anchor the session and leaves recursive brainstorms to fill in details."}} + +Wrong selected-topic style: +{{"topic": "Prove one helper divisibility lemma.", "reasoning": "This is too narrow because it cannot serve as the locked foundation for the whole problem."}} + +{JSON_RULES} +JSON format: +{{"topic": "selected or improved broad foundation topic", "reasoning": "why this is the best locked initial foundation for solving the whole Lean template"}} +""" + + +def build_brainstorm_prompt( + user_prompt: str, + lean_template: str, + active_topic: str, + accepted_ideas: list[str], + verified_subproofs: list[dict[str, Any]], + failed_feedback: list[dict[str, Any]], + context_blocks: dict[str, str] | None = None, +) -> str: + fallback_context = f"""ACCEPTED BRAINSTORM CONTEXT: +{_format_brainstorm(accepted_ideas)} + +VERIFIED SUBPROOFS: +{_format_verified_subproofs(verified_subproofs)} + +USEFUL FAILED PROOF FEEDBACK: +{_format_failures(failed_feedback)}""" + return f"""You are a LeanOJ 
proof brainstorm submitter. Generate one concrete idea that helps solve the user's Lean 4 template. + +Focus on exact Lean tactics, Mathlib lemmas, theorem-shaping, induction/cases structure, or mathematical transformations. If a current working proof attempt is provided, treat ACTIVE TOPIC as that exact proof-repair target. Brainstorm only information that directly helps complete or repair it; if a direct solution is unavailable, give the nearest concrete step that works toward solving that exact proof. + +If you can produce a complete Lean 4 proof for a useful sublemma or proof fragment, you may choose `submission_type: "lean_proof"`. The system will run Lean 4 first, give you up to 5 repair attempts with Lean feedback, and only then send the Lean-verified proof to the normal brainstorm validator. Do not use `sorry`, `admit`, or fake `axiom`/`constant`/`opaque` devices. + +Do not write a whole final proof unless the idea is directly useful as context. Final template solving still happens in the final loop. 
+ +USER PROBLEM: +{user_prompt} + +LEANOJ TEMPLATE: +{lean_template} + +{LEANOJ_FORMALIZATION_GUARDRAILS} + +ACTIVE TOPIC: +{active_topic} + +ALLOCATED LEANOJ PROOF MEMORY: +{_format_context_blocks(context_blocks, fallback_context)} + +{JSON_RULES} +JSON format for a normal idea: +{{"submission_type": "idea", "submission": "one concrete proof-solving idea", "reasoning": "why it advances the LeanOJ solution"}} + +JSON format for a Lean proof candidate: +{{"submission_type": "lean_proof", "theorem_statement": "natural-language statement proved", "formal_sketch": "why this proof fragment helps the LeanOJ template", "theorem_name": "optional Lean declaration name", "lean_code": "complete Lean 4 code", "reasoning": "why this verified proof would help"}} +""" + + +def build_brainstorm_validation_prompt( + user_prompt: str, + lean_template: str, + submission: str, + accepted_ideas: list[str], + context_blocks: dict[str, str] | None = None, +) -> str: + fallback_context = f"CURRENT ACCEPTED IDEAS:\n{_format_brainstorm(accepted_ideas)}" + return f"""You are the single validator for a cumulative LeanOJ proof-solving brainstorm. + +Accept the submission only if it adds useful, non-redundant information for solving the exact Lean template. Reject vague encouragement, duplicate ideas, or claims unrelated to Lean verification. + +If the submission contains [LEAN 4 VERIFIED BRAINSTORM PROOF], Lean 4 and MOTO integrity checks already accepted the code. Your job is still to decide whether the verified proof is useful, relevant, and non-redundant for this LeanOJ brainstorm. Do not re-prove Lean correctness, and do not accept irrelevant/trivial proofs merely because Lean verified them. + +Classify accepted submissions for later final-proof context: +- active_plan: a concrete current proof route, decomposition plan, or next obligation that should guide `master_proof.lean`. +- verified_hint: a reusable verified lemma or exact Lean tactic fact. 
+- refuted_construction: a failed construction/counterexample/route warning. This is useful only as "do not use" feedback and must not be treated as proof evidence. +- scratch: useful exploratory context that should not be direct final-proof context. + +Use `scratch` unless the submission clearly fits one of the narrower roles. Do not default to `active_plan`. + +USER PROBLEM: +{user_prompt} + +LEANOJ TEMPLATE: +{lean_template} + +{LEANOJ_FORMALIZATION_GUARDRAILS} + +ALLOCATED LEANOJ PROOF MEMORY: +{_format_context_blocks(context_blocks, fallback_context)} + +SUBMISSION: +{submission} + +{JSON_RULES} +JSON format: +{{"decision": "accept", "context_role": "scratch", "reasoning": "validation reasoning", "summary": "short rejection or acceptance summary"}} +""" + + +def build_brainstorm_batch_validation_prompt( + user_prompt: str, + lean_template: str, + submissions: list[str], + accepted_ideas: list[str], + context_blocks: dict[str, str] | None = None, +) -> str: + formatted_submissions = "\n\n---\n\n".join( + f"SUBMISSION {index}:\n{submission}" + for index, submission in enumerate(submissions, start=1) + ) + fallback_context = f"CURRENT ACCEPTED IDEAS:\n{_format_brainstorm(accepted_ideas)}" + return f"""You are the single validator for a cumulative LeanOJ proof-solving brainstorm. + +Evaluate EACH submission independently against the current accepted brainstorm context, then check accepted submissions for intra-batch redundancy. Accept only submissions that add useful, non-redundant information for solving the exact Lean template. Reject vague encouragement, duplicate ideas, or claims unrelated to Lean verification. + +If a submission contains [LEAN 4 VERIFIED BRAINSTORM PROOF], Lean 4 and MOTO integrity checks already accepted the code. Still decide whether that verified proof is useful, relevant, and non-redundant for this LeanOJ brainstorm. 
+ +For each accepted submission, classify how it may be used later: +- active_plan: a concrete current proof route, decomposition plan, or next obligation that should guide `master_proof.lean`. +- verified_hint: a reusable verified lemma or exact Lean tactic fact. +- refuted_construction: a failed construction/counterexample/route warning. This is useful only as "do not use" feedback and must not be treated as proof evidence. +- scratch: useful exploratory context that should not be direct final-proof context. + +Use `scratch` unless the submission clearly fits one of the narrower roles. Do not default to `active_plan`. + +CRITICAL: +- Judge each submission against CURRENT ACCEPTED IDEAS first, not against the other submissions in this batch. +- Only after independent decisions, compare independently accepted submissions against each other. +- If two accepted submissions are redundant with each other, keep the stronger/more concrete one and reject the weaker one with an intra-batch redundancy summary. +- Return exactly one decision object per submission, in the same order. 
+ +USER PROBLEM: +{user_prompt} + +LEANOJ TEMPLATE: +{lean_template} + +{LEANOJ_FORMALIZATION_GUARDRAILS} + +ALLOCATED LEANOJ PROOF MEMORY: +{_format_context_blocks(context_blocks, fallback_context)} + +SUBMISSIONS TO VALIDATE: +{formatted_submissions} + +{JSON_RULES} +JSON format: +{{"decisions": [{{"submission_number": 1, "decision": "accept", "context_role": "scratch", "reasoning": "validation reasoning", "summary": "short rejection or acceptance summary"}}]}} +""" + + +def build_brainstorm_prune_review_prompt( + user_prompt: str, + lean_template: str, + active_topic: str, + accepted_ideas: list[str], + context_blocks: dict[str, str] | None = None, +) -> str: + fallback_context = f"CURRENT ACCEPTED IDEAS:\n{_format_brainstorm(accepted_ideas)}" + return f"""You are checking whether any LeanOJ brainstorm memory should be removed or updated because it is outdated, redundant, wrong, harmful, superseded, or missing proof-solving information. + +You may propose AT MOST ONE operation. Do not force a removal: choose "none" unless one operation clearly improves the proof-solving database. + +Allowed actions: +- "none": no change is needed. +- "delete": remove one accepted idea that is outdated, wrong, harmful, redundant with stronger retained context, or now superseded. +- "edit": replace one accepted idea with a more accurate version, especially when it removes outdated or redundant content while preserving unique proof-solving value. +- "add": add one compact corrective insight that is now clearly needed. + +Do not prune merely for style. Keep any idea that still provides unique proof-solving value. The question is whether any single idea should be removed or updated due to being outdated, redundant, wrong, harmful, or superseded; if not, return "none". 
+ +USER PROBLEM: +{user_prompt} + +LEANOJ TEMPLATE: +{lean_template} + +{LEANOJ_FORMALIZATION_GUARDRAILS} + +ACTIVE TOPIC: +{active_topic} + +ALLOCATED LEANOJ PROOF MEMORY: +{_format_context_blocks(context_blocks, fallback_context)} + +ACCEPTED BRAINSTORM IDEAS TO REVIEW: +{_format_brainstorm(accepted_ideas)} + +{JSON_RULES} +JSON format: +{{"action": "none", "idea_index": null, "new_content": "", "reasoning": "why no prune is needed or why this one operation improves the database"}} +""" + + +def build_brainstorm_prune_validation_prompt( + user_prompt: str, + lean_template: str, + active_topic: str, + accepted_ideas: list[str], + operation: dict[str, Any], + context_blocks: dict[str, str] | None = None, +) -> str: + fallback_context = f"CURRENT ACCEPTED IDEAS:\n{_format_brainstorm(accepted_ideas)}" + return f"""You are the single validator for a proposed LeanOJ brainstorm prune operation. + +Validate ONLY whether this operation improves the proof-solving brainstorm database for the exact Lean template and active topic. Use a conservative default: reject if uncertain. + +ACCEPT delete only if the selected idea is outdated, wrong, harmful, redundant with stronger retained context, or superseded by stronger retained context. +ACCEPT edit only if the replacement is materially more accurate and still useful, including when it removes outdated or redundant content while preserving unique proof-solving value. +ACCEPT add only if the new content is concrete, non-redundant, and directly useful for the proof. +REJECT vague, stylistic, speculative, or risky changes. 
+ +USER PROBLEM: +{user_prompt} + +LEANOJ TEMPLATE: +{lean_template} + +{LEANOJ_FORMALIZATION_GUARDRAILS} + +ACTIVE TOPIC: +{active_topic} + +ALLOCATED LEANOJ PROOF MEMORY: +{_format_context_blocks(context_blocks, fallback_context)} + +CURRENT ACCEPTED IDEAS: +{_format_brainstorm(accepted_ideas)} + +PROPOSED OPERATION: +{operation} + +{JSON_RULES} +JSON format: +{{"decision": "reject", "reasoning": "why this prune operation should be accepted or rejected"}} +""" + + +def build_sufficiency_prompt( + user_prompt: str, + lean_template: str, + accepted_ideas: list[str], + verified_subproofs: list[dict[str, Any]], + context_blocks: dict[str, str] | None = None, +) -> str: + fallback_context = f"""ACCEPTED BRAINSTORM CONTEXT: +{_format_brainstorm(accepted_ideas)} + +VERIFIED SUBPROOFS: +{_format_verified_subproofs(verified_subproofs)}""" + return f"""You are deciding whether there is enough context to attempt solving the user's LeanOJ template now. + +This is not final proof validation. Lean 4 will validate the actual proof. Decide whether the accumulated context is likely sufficient to start the final proof loop. 
+ +USER PROBLEM: +{user_prompt} + +LEANOJ TEMPLATE: +{lean_template} + +{LEANOJ_FORMALIZATION_GUARDRAILS} + +ALLOCATED LEANOJ PROOF MEMORY: +{_format_context_blocks(context_blocks, fallback_context)} + +{JSON_RULES} +JSON format: +{{"enough": true, "reasoning": "why the final loop should or should not start"}} +""" + + +def build_path_decision_prompt( + user_prompt: str, + lean_template: str, + accepted_ideas: list[str], + verified_subproofs: list[dict[str, Any]], + failed_feedback: list[dict[str, Any]], + context_blocks: dict[str, str] | None = None, +) -> str: + fallback_context = f"""ACCEPTED BRAINSTORM CONTEXT: +{_format_brainstorm(accepted_ideas)} + +VERIFIED SUBPROOFS: +{_format_verified_subproofs(verified_subproofs)} + +USEFUL FAILED PROOF FEEDBACK: +{_format_failures(failed_feedback)}""" + return f"""You are choosing the next path in a LeanOJ proof-solving state machine. + +There is no give-up state. Choose one: +- solve_final_now: the system should attempt the final full Lean 4 solution. +- need_more_brainstorming: more cumulative brainstorm context is needed. + +When solve_final_now is available, make this decision from the final proof solver's perspective: decide whether the dominant next move toward a solution is to enter the final Lean proof loop now. Since Lean-verified subproofs can now be generated during any brainstorm, defer to more brainstorming only when the final proof path is not yet the strongest next move. + +If the current proof memory includes a recent final-cycle packet or working-proof attempt caused by repeated stale `old_string` edits, no-progress watchdog feedback, placeholder/comment churn, or an unresolved missing lemma, choose `need_more_brainstorming` unless the allocated memory already contains fresh concrete proof content that directly resolves that blocker. 
+ +USER PROBLEM: +{user_prompt} + +LEANOJ TEMPLATE: +{lean_template} + +{LEANOJ_FORMALIZATION_GUARDRAILS} + +ALLOCATED LEANOJ PROOF MEMORY: +{_format_context_blocks(context_blocks, fallback_context)} + +{JSON_RULES} +JSON format: +{{"path": "solve_final_now", "reasoning": "why this path is required", "remaining_questions": ["optional missing questions"]}} +""" + + +def build_path_validation_prompt( + user_prompt: str, + lean_template: str, + proposed_path: str, + proposed_reasoning: str, + accepted_ideas: list[str], + verified_subproofs: list[dict[str, Any]], + context_blocks: dict[str, str] | None = None, +) -> str: + fallback_context = f"""ACCEPTED BRAINSTORM CONTEXT: +{_format_brainstorm(accepted_ideas)} + +VERIFIED SUBPROOFS: +{_format_verified_subproofs(verified_subproofs)}""" + return f"""You are validating a LeanOJ path decision. + +Accept only if the proposed path is justified by the current proof-solving context. Reject decisions that try the final proof too early or request more brainstorming when the next proof action is already clear. 
+ +VALID PATHS: +- solve_final_now +- need_more_brainstorming + +USER PROBLEM: +{user_prompt} + +LEANOJ TEMPLATE: +{lean_template} + +{LEANOJ_FORMALIZATION_GUARDRAILS} + +ALLOCATED LEANOJ PROOF MEMORY: +{_format_context_blocks(context_blocks, fallback_context)} + +PROPOSED PATH: +{proposed_path} + +PROPOSED REASONING: +{proposed_reasoning} + +{JSON_RULES} +JSON format: +{{"decision": "accept", "reasoning": "validation reasoning", "summary": "short rejection feedback if rejected", "corrected_path": "solve_final_now or need_more_brainstorming if rejected"}} +""" + +def build_final_solver_prompt( + user_prompt: str, + lean_template: str, + current_master_proof: str, + master_proof_metadata: dict[str, Any], + accepted_ideas: list[str], + verified_subproofs: list[dict[str, Any]], + partial_proofs: list[dict[str, Any]], + failed_feedback: list[dict[str, Any]], + final_attempts: list[dict[str, Any]], + context_blocks: dict[str, str] | None = None, +) -> str: + metadata_lines = "\n".join( + f"- {key}: {value}" + for key, value in (master_proof_metadata or {}).items() + if value not in (None, "") + ) or "[No master proof metadata available.]" + recent_final_feedback = _format_recent_final_attempts(final_attempts, limit=5) + fallback_context = f"""ACTIVE PROOF-PLAN NOTES: +{_format_proof_memory_notes(accepted_ideas)} + +VERIFIED SUBPROOFS: +{_format_verified_subproofs_for_final(verified_subproofs)} + +RECENT EXECUTION FEEDBACK - USE TO CHOOSE THE NEXT EDIT; DO NOT TREAT FAILED CODE AS PROVEN: +{_format_feedback_notes(failed_feedback)}""" + return f"""You are in the final LeanOJ master-proof editing loop. + +Your task is to edit the durable master Lean 4 proof like a paper draft. Preserve the original imports and declarations unless changing them is necessary and allowed by the problem template. Replace required `sorry` holes with real Lean proofs over as many edit prompts as needed. 
+ +Master proof route discipline: +- `master_proof.lean` must contain the current chosen proof route only. +- Do not append multiple competing constructions or abandoned approaches into the master proof. +- If a route is refuted or superseded, replace it with the chosen route instead of keeping both. +- Failed constructions may appear only as compact comments when they directly explain an active invariant; otherwise keep them out of the Lean file. +- Use verified standalone lemmas and active proof-plan notes as positive context. Treat refuted-construction warnings only as "do not use" constraints, never as evidence for a proof route. + +Correction priority: +- Required corrections take priority over new additions. Treat recent final feedback, Lean errors, exact-string edit rejections, edit-validator feedback, and semantic-review continuation feedback as the next correction targets. +- If any correction is pending, your next edit must address that correction before attempting unrelated new lemmas, fresh proof routes, or speculative additions. +- New additions are allowed only when they directly implement the required correction or provide helper code needed for that correction. +- In your reasoning, name the correction you addressed. If no correction is pending, state which next unsolved proof obligation your edit advances. + +You must choose exactly one action: edit_proof. +This final mode cannot request phase transitions, cannot delegate to planning, and cannot stop early. If the proof is incomplete, make the best concrete edit available and set "needs_more_time": true. + +Binary verification gate: +- The system runs Lean after every proposed master proof edit before accepting it into the durable master proof. +- If your edit is useful but the proof still needs more editing time, set "needs_more_time": true. Lean will check the edited file with placeholders allowed, and syntax/type errors will reject the edit. 
+- If your edit should make the current master proof final-ready, set "needs_more_time": false. Lean will check the edited file with no placeholders allowed, then final integrity/review checks will run. +- A master proof edit is not accepted merely because the string edit applies; it must pass the appropriate Lean gate first. +- A Lean-accepted loophole may be useful intermediate progress, but it is not final-ready. For LeanOJ `answer` definitions, do not terminate with `answer` defined as `sSup`, `csSup`, `Nat.sSup`, `Sup`, or an equivalent maximum over the same feasible set. Final readiness requires an explicit formula/value for `answer n` and a proof that this formula is greatest. +- A continuing edit must change non-comment Lean proof content in a way that discharges, splits, or materially advances an obligation. Do not spend an edit only rewriting comments, TODOs, placeholders, or "prepare for next edit" wording. + +Exact-string editing rules: +- Use operation "full_content" only when replacing the whole master proof. +- Use operation "replace", "insert_after", or "delete" for targeted edits. +- For targeted edits, old_string must be copied verbatim from the CURRENT FULL MASTER PROOF and must appear exactly once in that full master proof. +- Include enough surrounding Lean lines in old_string to make the match unique. +- new_string must contain the replacement/insertion Lean code, except delete uses an empty new_string. +- Never introduce fake `axiom`, `constant`, or `opaque` proof devices. +- Final verification requires no `sorry`/`admit`, but intermediate master proof edits may preserve placeholders while you continue working. 
+ +USER PROBLEM: +{user_prompt} + +LEANOJ TEMPLATE TO SOLVE: +{lean_template} + +{LEANOJ_FORMALIZATION_GUARDRAILS} + +CURRENT MASTER PROOF METADATA: +{metadata_lines} + +RECENT FINAL FEEDBACK (USE TO AVOID REPEATING FAILED EDITS): +{recent_final_feedback} + +CURRENT FULL MASTER PROOF TO EDIT (MANDATORY DIRECT-INJECT CONTEXT; NEVER TRUNCATED): +{current_master_proof or lean_template} + +ALLOCATED LEANOJ PROOF MEMORY: +{_format_context_blocks(context_blocks, fallback_context)} + +{JSON_RULES} +JSON format for continuing edits: +{{"action": "edit_proof", "needs_more_time": true, "operation": "replace", "old_string": "exact unique text from CURRENT MASTER PROOF", "new_string": "updated Lean code", "reasoning": "why this edit advances the proof and what remains"}} + +JSON format for final verification after this edit: +{{"action": "edit_proof", "needs_more_time": false, "operation": "replace", "old_string": "exact unique text from CURRENT MASTER PROOF", "new_string": "updated Lean code expected to verify", "reasoning": "why the edited master proof should now pass Lean"}} +""" + + +def build_master_proof_edit_validation_prompt( + user_prompt: str, + lean_template: str, + current_master_proof: str, + proposed_master_proof: str, + edit: dict[str, Any], + metrics: dict[str, Any], +) -> str: + metrics_lines = "\n".join( + f"- {key}: {value}" + for key, value in (metrics or {}).items() + if value not in (None, "") + ) or "[No shortening metrics available.]" + return f"""You are the independent LeanOJ master-proof edit validator. + +The final Proof Solver proposed an edit that shortens the durable master proof. Your job is to decide whether this shortening is real proof progress or whether it deletes useful work because the solver is stuck, frustrated, restarting, or giving up. 
+ +Accept only if the proposed shorter proof is genuinely progressive for solving the exact LeanOJ template: +- It preserves or strengthens useful solved Lean content, definitions, lemmas, and proof structure. +- It replaces removed material with equivalent or stronger proof content, or removes only clearly redundant/noisy material. +- It still moves toward a complete Lean 4 proof of the original template. + +Reject if the edit goes backward: +- It deletes useful proof progress, helper lemmas, explicit formulas, or developed argument structure without a stronger replacement. +- It replaces concrete work with `sorry`, `admit`, comments, vague plans, or a reset toward the original template. +- It looks like abandonment, frustration, a restart, or an attempt to make the file shorter by discarding hard obligations. +- It bloats the master proof by accumulating multiple competing/refuted proof routes instead of maintaining one current chosen route. +- It ignores a required correction in the proof and instead prioritizes unrelated new additions, fresh routes, or speculative helper material. + +If you reject, give precise feedback to the proof submitter. Name the content that must be restored or the exact kind of progressive replacement required. +If corrections are required, your feedback must say that those corrections must be fixed before any new addition attempts. New additions are acceptable only when they directly implement the required correction. + +If you accept, give a clear justification that can be shown later alongside the old longer proof. This justification must explain: +- WHY the validator allowed the shortening instead of requiring the longer attempt to be restored. +- What the apparent issue was with the old longer attempt, such as redundant code, noisy scaffolding, a weaker route, or content replaced by stronger proof structure. 
+ +USER PROBLEM: +{user_prompt} + +LEANOJ TEMPLATE: +{lean_template} + +{LEANOJ_FORMALIZATION_GUARDRAILS} + +PROPOSED EDIT: +- operation: {edit.get("operation", "")} +- needs_more_time: {edit.get("needs_more_time", "")} +- solver_reasoning: {edit.get("reasoning", "")} +- old_string: +{edit.get("old_string", "")} + +- new_string: +{edit.get("new_string", "")} + +SHORTENING METRICS: +{metrics_lines} + +CURRENT MASTER PROOF BEFORE EDIT: +{current_master_proof} + +PROPOSED MASTER PROOF AFTER EDIT: +{proposed_master_proof} + +{JSON_RULES} +JSON format if this shortening is progressive: +{{"decision": "accept", "reasoning": "why this shorter edit preserves or improves proof progress", "shortening_approval_justification": "clear reason the validator allowed this shortening", "apparent_issue_with_old_attempt": "what was apparently wrong, redundant, noisy, or superseded in the old longer attempt", "feedback_to_submitter": ""}} + +JSON format if this shortening goes backward: +{{"decision": "reject", "reasoning": "why this deletes progress or gives up", "feedback_to_submitter": "precise correction for the final solver"}} +""" + + +def build_final_solution_review_prompt( + user_prompt: str, + lean_template: str, + lean_code: str, + final_solver_reasoning: str, + lean_feedback: str, +) -> str: + return f"""You are the final LeanOJ proof checker for a Lean-accepted submission. + +Lean 4 has already checked the code. Your job is NOT to re-run Lean or act as a planning validator. Your job is to decide whether this Lean-accepted file actually solves the user's LeanOJ problem prompt and template in the intended sense. + +Lean acceptance is necessary but not sufficient. Reject loopholes that satisfy the weak formal theorem while evading the natural-language task, such as defining an answer by taking a maximum/supremum over the same feasible set instead of determining the requested value in terms of n. + +Accept only if the code: +- Preserves and solves the user's LeanOJ template. 
+- Fully addresses the actual problem prompt, not merely a different formal statement. +- Uses an answer/formulation that genuinely determines the requested object when the problem asks for an explicit value or formula. +- Contains no placeholder proof devices or semantic shortcuts that should remain continuation context instead of the final stop condition. + +USER PROBLEM: +{user_prompt} + +LEANOJ TEMPLATE: +{lean_template} + +{LEANOJ_FORMALIZATION_GUARDRAILS} + +LEAN 4 FEEDBACK FROM THE ACCEPTED RUN: +{lean_feedback or "Lean 4 accepted with no diagnostics."} + +FINAL SOLVER REASONING BEFORE LEAN CHECK: +{final_solver_reasoning or "[No final solver reasoning provided.]"} + +LEAN-ACCEPTED FINAL CODE: +{lean_code} + +{JSON_RULES} +JSON format if this is truly solved: +{{"solved": true, "reasoning": "why this Lean-accepted code completely solves the LeanOJ problem prompt and template"}} + +JSON format if this is not done: +{{"solved": false, "continuation_feedback": "specific feedback explaining what is missing and what the next final solver attempt should fix", "reasoning": "why Lean acceptance is not enough here"}} +""" diff --git a/backend/shared/api_client_manager.py b/backend/shared/api_client_manager.py index cfbbe7e..57cd383 100644 --- a/backend/shared/api_client_manager.py +++ b/backend/shared/api_client_manager.py @@ -9,6 +9,7 @@ 4. 
Per-task Toggle - Task ID based (legacy) """ import asyncio +import json import logging import time from typing import Dict, Any, List, Optional, Callable @@ -26,18 +27,43 @@ from backend.shared.config import rag_config, system_config from backend.shared.fastembed_provider import FASTEMBED_MODEL_NAME, FastEmbedProvider from backend.shared.free_model_manager import free_model_manager +from backend.shared.json_parser import sanitize_model_output_for_retry_context from backend.shared.models import ModelConfig from backend.shared.token_tracker import token_tracker logger = logging.getLogger(__name__) +def _response_shape_for_logging(response: Any) -> str: + """Summarize an upstream response shape without logging provider/model text.""" + if isinstance(response, dict): + keys = sorted(str(key) for key in response.keys()) + usage = response.get("usage") if isinstance(response.get("usage"), dict) else {} + return ( + f"type=dict, keys={keys}, choices_present={bool(response.get('choices'))}, " + f"error_present={'error' in response}, usage_keys={sorted(str(key) for key in usage.keys())}" + ) + if isinstance(response, list): + return f"type=list, length={len(response)}" + return f"type={type(response).__name__}" + + class APIClientManager: """ Central manager for routing API calls to OpenRouter or LM Studio. Handles fallback on credit exhaustion and boost integration. """ CALL_METADATA_KEY = "_moto_call_metadata" + # Supercharge intentionally breaks the default 0.0 temperature policy for + # candidate attempts so parallel completions produce meaningfully different answers. + SUPERCHARGE_ATTEMPT_TEMPERATURES = (0.0, 0.2, 0.4, 0.8) + SUPERCHARGE_CANDIDATE_MAX_CHARS = 20000 + # Parallel brainstorm submitters use a lane-based ladder: submitter 1 stays + # deterministic, later lanes get increasing exploration pressure. 
+ PARALLEL_BRAINSTORM_SUBMITTER_TEMPERATURES = ( + 0.0, 0.1, 0.2, 0.3, 0.4, + 0.5, 0.6, 0.7, 0.8, 0.9, + ) def __init__(self): self._openrouter_client: Optional[OpenRouterClient] = None @@ -60,10 +86,11 @@ def __init__(self): # Signature: async callback(model_id: str) self._model_tracking_callback: Optional[Callable] = None - # Autonomous API logger callback - # Called after each API call (success or failure) with full details - # Signature: async callback(task_id, role_id, model, provider, prompt, response, duration_ms, success, error, phase) - self._autonomous_logger_callback: Optional[Callable] = None + # API logger callback. Workflows can override this to add namespace-specific + # metadata; otherwise the manager still logs every model call by default. + # Signature: async callback(task_id, role_id, model, provider, prompt, response, + # tokens_used, duration_ms, success, error, phase) + self._autonomous_logger_callback: Optional[Callable] = self._default_api_logger_callback # Current autonomous phase (set by autonomous coordinator) self._current_autonomous_phase: str = "unknown" @@ -73,6 +100,17 @@ def __init__(self): # Lock for thread-safe state updates self._state_lock = asyncio.Lock() + + @classmethod + def parallel_brainstorm_submitter_temperature(cls, submitter_index: int) -> float: + """Return the deterministic temperature lane for a parallel brainstorm submitter.""" + try: + index = int(submitter_index) + except (TypeError, ValueError): + index = 1 + index = max(1, index) + ladder_index = min(index - 1, len(cls.PARALLEL_BRAINSTORM_SUBMITTER_TEMPERATURES) - 1) + return cls.PARALLEL_BRAINSTORM_SUBMITTER_TEMPERATURES[ladder_index] def set_broadcast_callback(self, callback: Callable) -> None: """Set callback for broadcasting WebSocket events.""" @@ -136,6 +174,78 @@ def set_model_tracking_callback(self, callback: Optional[Callable]) -> None: else: logger.info("Model tracking callback cleared") + @staticmethod + def _infer_api_log_workflow(task_id: str, 
role_id: str) -> str: + """Infer the API-log namespace used by the shared log tab.""" + task = (task_id or "").strip().lower() + role = (role_id or "").strip().lower() + if role.startswith("leanoj_") or task.startswith("leanoj_"): + return "leanoj" + return "autonomous" + + @staticmethod + def _prompt_for_logging(messages: Optional[List[Dict[str, Any]]]) -> str: + """Return a safe prompt preview source without raw tool-result content.""" + if not messages: + return "" + + message = messages[-1] + role = str(message.get("role") or "") + content = message.get("content", "") + + if role == "tool": + tool_name = str(message.get("name") or "") + tool_call_id = str(message.get("tool_call_id") or "") + content_len = len(content) if isinstance(content, str) else len(str(content or "")) + return ( + "[tool message redacted for API logging; " + f"name={tool_name or 'unknown'}, " + f"tool_call_id_present={bool(tool_call_id)}, " + f"content_length={content_len}]" + ) + + if isinstance(content, str): + return content + try: + return json.dumps(content, ensure_ascii=False) + except Exception: + return str(content or "") + + async def _default_api_logger_callback( + self, + task_id, + role_id, + model, + provider, + prompt, + response, + tokens_used, + duration_ms, + success, + error, + phase, + ) -> None: + """Persist API calls even when no workflow-specific logger is active.""" + try: + from backend.autonomous.memory.autonomous_api_logger import autonomous_api_logger + + await autonomous_api_logger.log_api_call( + task_id=task_id, + role_id=role_id, + model=model, + provider=provider, + prompt=prompt, + response_content=response, + tokens_used=tokens_used, + duration_ms=duration_ms, + success=success, + error=error, + phase=phase or self._current_autonomous_phase, + workflow=self._infer_api_log_workflow(task_id, role_id), + ) + except Exception as e: + logger.error(f"Failed to log API call in default logger: {e}") + def set_autonomous_logger_callback(self, callback: 
Optional[Callable]) -> None: """ Set callback for autonomous API logging. @@ -143,16 +253,16 @@ def set_autonomous_logger_callback(self, callback: Optional[Callable]) -> None: The callback is called after each API call with full details for logging. Args: - callback: Async function with signature: + callback: Async function with signature: callback(task_id, role_id, model, provider, prompt, response, - duration_ms, success, error, phase) - or None to disable + tokens_used, duration_ms, success, error, phase) + or None to restore default all-call logging """ - self._autonomous_logger_callback = callback + self._autonomous_logger_callback = callback or self._default_api_logger_callback if callback: logger.info("Autonomous API logger callback set") else: - logger.info("Autonomous API logger callback cleared") + logger.info("Autonomous API logger callback restored to default") def set_autonomous_phase(self, phase: str) -> None: """ @@ -189,6 +299,7 @@ def _annotate_response_with_call_metadata( boosted: bool, boost_mode: Optional[str] = None, openrouter_provider: Optional[str] = None, + openrouter_reasoning_effort: Optional[str] = None, ) -> Dict[str, Any]: """Attach effective routing details to a successful API response.""" if not isinstance(response, dict): @@ -205,6 +316,7 @@ def _annotate_response_with_call_metadata( "boosted": boosted, "boost_mode": boost_mode, "openrouter_provider": openrouter_provider, + "openrouter_reasoning_effort": openrouter_reasoning_effort, } return response @@ -249,6 +361,27 @@ def configure_role(self, role_id: str, config: ModelConfig) -> None: config: Model configuration (includes provider, model_id, openrouter_model_id, lm_studio_fallback_id, and optionally openrouter_provider) """ + if system_config.generic_mode: + if config.provider != "openrouter": + logger.warning( + "Generic mode is OpenRouter-only. 
Normalizing role '%s' from provider=%s to OpenRouter.", + role_id, + config.provider, + ) + config = config.model_copy( + update={ + "provider": "openrouter", + "openrouter_model_id": config.openrouter_model_id or config.model_id, + "lm_studio_fallback_id": None, + } + ) + elif config.lm_studio_fallback_id: + logger.warning( + "Generic mode is OpenRouter-only. Dropping LM Studio fallback for role '%s'.", + role_id, + ) + config = config.model_copy(update={"lm_studio_fallback_id": None}) + self._role_model_configs[role_id] = config # Set initial fallback state based on provider @@ -294,7 +427,7 @@ def _determine_boost_mode(self, task_id: str) -> Optional[str]: return "task_id" return None - + async def generate_completion( self, task_id: str, @@ -307,6 +440,184 @@ async def generate_completion( tools: Optional[List[Dict[str, Any]]] = None, tool_choice: Optional[Any] = None, **kwargs + ) -> Dict[str, Any]: + """Generate a completion, optionally wrapping the role with Supercharge.""" + async with self._state_lock: + role_config = self._role_model_configs.get(role_id) + + supercharge_enabled = bool(getattr(role_config, "supercharge_enabled", False)) + # Tool-call conversations need exact assistant/tool turn pairing, so keep them single-shot. 
+ if not supercharge_enabled or tools or tool_choice is not None: + return await self._generate_completion_once( + task_id=task_id, + role_id=role_id, + model=model, + messages=messages, + temperature=temperature, + max_tokens=max_tokens, + response_format=response_format, + tools=tools, + tool_choice=tool_choice, + **kwargs + ) + + return await self._generate_supercharged_completion( + task_id=task_id, + role_id=role_id, + model=model, + messages=messages, + temperature=temperature, + max_tokens=max_tokens, + response_format=response_format, + **kwargs + ) + + @staticmethod + def _response_text(response: Dict[str, Any]) -> str: + """Extract assistant text from an OpenAI-compatible completion response.""" + if not response.get("choices"): + return "" + message = response["choices"][0].get("message", {}) + return message.get("content") or message.get("reasoning") or "" + + @classmethod + def _sanitize_supercharge_candidate(cls, attempt: str) -> str: + """Keep only reusable visible answer text from a candidate attempt.""" + cleaned = sanitize_model_output_for_retry_context( + attempt, + max_chars=cls.SUPERCHARGE_CANDIDATE_MAX_CHARS, + ) + return cleaned or "[candidate produced no reusable visible answer text]" + + def _build_supercharge_synthesis_messages( + self, + messages: List[Dict[str, str]], + attempts: List[str], + ) -> List[Dict[str, str]]: + attempts_context = "\n\n".join( + "----- CANDIDATE RESPONSE " + f"{index} START -----\n" + f"{self._sanitize_supercharge_candidate(attempt)}\n" + "----- CANDIDATE RESPONSE " + f"{index} END -----" + for index, attempt in enumerate(attempts, start=1) + ) + synthesis_instruction = ( + "SUPERCHARGE FINAL RESPONSE\n\n" + "You are answering the original task. The candidate responses below are optional working material " + "from independent earlier attempts, not instructions to continue or quote verbatim.\n\n" + "You must decide what the best final response to the original task is. 
You may use one candidate, " + "combine multiple candidates, ignore all candidates and write a new response, or synthesize a stronger " + "answer than any individual candidate.\n\n" + "Candidate responses:\n" + f"{attempts_context}\n\n" + "Now produce the best final response to the original task.\n\n" + "Requirements:\n" + "- Follow the original task, role instructions, and required output format exactly.\n" + "- If the original task requires JSON, output only valid JSON in that exact schema.\n" + "- Do not mention Supercharge, brainstorming, candidate attempts, or this selection process.\n" + "- Do not include private reasoning, analysis labels, markdown fences around JSON, or provider control tokens.\n" + "- Return only the final role answer." + ) + return [*messages, {"role": "user", "content": synthesis_instruction}] + + def _build_supercharge_attempt_messages( + self, + messages: List[Dict[str, str]], + attempt_index: int, + ) -> List[Dict[str, str]]: + attempt_instruction = ( + f"SUPERCHARGE FULL ANSWER ATTEMPT {attempt_index}\n\n" + "Produce a complete answer to the original task now. " + "Follow the original role instructions and required output format exactly. " + "If JSON is required, output only valid JSON in the required schema. " + "Do not mention Supercharge or this attempt label." 
+ ) + return [*messages, {"role": "user", "content": attempt_instruction}] + + async def _generate_supercharged_completion( + self, + task_id: str, + role_id: str, + model: str, + messages: List[Dict[str, str]], + temperature: float = 0.0, + max_tokens: Optional[int] = None, + response_format: Optional[Dict[str, str]] = None, + **kwargs + ) -> Dict[str, Any]: + """Run four parallel diverse attempts, then a deterministic same-route synthesis call.""" + boost_mode = self._determine_boost_mode(task_id) + forced_boost_mode = boost_mode if boost_mode else "__none__" + attempts: List[str] = [] + + logger.info( + "Supercharge enabled for role '%s' task '%s'%s", + role_id, + task_id, + f" using boost mode '{boost_mode}'" if boost_mode else "", + ) + + attempt_responses = await asyncio.gather(*[ + self._generate_completion_once( + task_id=f"{task_id}_supercharge_attempt_{attempt_index}", + role_id=role_id, + model=model, + messages=self._build_supercharge_attempt_messages(messages, attempt_index), + temperature=attempt_temperature, + max_tokens=max_tokens, + response_format=response_format, + _moto_force_boost_mode=forced_boost_mode, + _moto_consume_boost_count=False, + _moto_strict_boost=bool(boost_mode), + **kwargs + ) + for attempt_index, attempt_temperature in enumerate( + self.SUPERCHARGE_ATTEMPT_TEMPERATURES, + start=1, + ) + ]) + attempts = [self._response_text(response) for response in attempt_responses] + + synthesis_response = await self._generate_completion_once( + task_id=f"{task_id}_supercharge_final", + role_id=role_id, + model=model, + messages=self._build_supercharge_synthesis_messages(messages, attempts), + temperature=0.0, + max_tokens=max_tokens, + response_format=response_format, + _moto_force_boost_mode=forced_boost_mode, + _moto_consume_boost_count=False, + _moto_strict_boost=bool(boost_mode), + **kwargs + ) + + metadata = self.extract_call_metadata(synthesis_response) + if boost_mode == "next_count" and metadata.get("boosted"): + await 
boost_manager.consume_boost_count() + + if isinstance(synthesis_response, dict): + synthesis_response[self.CALL_METADATA_KEY] = { + **metadata, + "supercharged": True, + "supercharge_attempts": 4, + "supercharge_attempt_temperatures": list(self.SUPERCHARGE_ATTEMPT_TEMPERATURES), + } + return synthesis_response + + async def _generate_completion_once( + self, + task_id: str, + role_id: str, + model: str, + messages: List[Dict[str, str]], + temperature: float = 0.0, + max_tokens: Optional[int] = None, + response_format: Optional[Dict[str, str]] = None, + tools: Optional[List[Dict[str, Any]]] = None, + tool_choice: Optional[Any] = None, + **kwargs ) -> Dict[str, Any]: """ Generate a completion using the appropriate API. @@ -331,13 +642,21 @@ async def generate_completion( Returns: API response dict """ + forced_boost_mode = kwargs.pop("_moto_force_boost_mode", None) + consume_boost_count = kwargs.pop("_moto_consume_boost_count", True) + strict_boost = kwargs.pop("_moto_strict_boost", False) requested_model = model async with self._state_lock: initial_role_config = self._role_model_configs.get(role_id) configured_provider = initial_role_config.provider if initial_role_config else None # Check if task should use boost (unified check for all boost modes) - boost_mode = self._determine_boost_mode(task_id) + if forced_boost_mode == "__none__": + boost_mode = None + elif forced_boost_mode is not None: + boost_mode = forced_boost_mode + else: + boost_mode = self._determine_boost_mode(task_id) if boost_mode and boost_manager.boost_config: boost_model = boost_manager.boost_config.boost_model_id @@ -348,8 +667,8 @@ async def generate_completion( # Get prompt preview for logging prompt_preview = "" if messages: - last_message = messages[-1].get("content", "") - prompt_preview = last_message[:500] if last_message else "" + last_message = self._prompt_for_logging(messages) + prompt_preview = last_message or "" start_time = time.time() @@ -373,6 +692,7 @@ async def 
generate_completion( max_tokens=max_tokens or boost_manager.boost_config.boost_max_output_tokens, response_format=response_format, provider=boost_provider, + reasoning_effort=boost_manager.boost_config.boost_reasoning_effort, tools=tools, tool_choice=tool_choice, ), @@ -386,9 +706,11 @@ async def generate_completion( # Check for missing choices (upstream provider timeout/error) if not result.get("choices"): - import json as _json - raw_response = _json.dumps(result)[:2000] - logger.error(f"OpenRouter boost response missing 'choices' after {duration_ms:.0f}ms - raw: {raw_response}") + logger.error( + "OpenRouter boost response missing 'choices' after %.0fms - %s", + duration_ms, + _response_shape_for_logging(result), + ) # Log as failure await boost_logger.log_boost_call( @@ -433,6 +755,7 @@ async def generate_completion( boosted=True, boost_mode=boost_mode, openrouter_provider=boost_provider, + openrouter_reasoning_effort=boost_manager.boost_config.boost_reasoning_effort, ) # Log the boost call @@ -450,7 +773,7 @@ async def generate_completion( # Log to autonomous API logger if callback set if self._autonomous_logger_callback: - full_prompt = messages[-1].get("content", "") if messages else "" + full_prompt = self._prompt_for_logging(messages) await self._autonomous_logger_callback( task_id=task_id, role_id=role_id, @@ -469,7 +792,7 @@ async def generate_completion( await self._track_model_usage(boost_model) # Consume boost count if using next_count mode - if boost_mode == "next_count": + if boost_mode == "next_count" and consume_boost_count: await boost_manager.consume_boost_count() return result @@ -493,7 +816,7 @@ async def generate_completion( # Log to autonomous API logger if callback set if self._autonomous_logger_callback: - full_prompt = messages[-1].get("content", "") if messages else "" + full_prompt = self._prompt_for_logging(messages) await self._autonomous_logger_callback( task_id=task_id, role_id=role_id, @@ -519,6 +842,8 @@ async def 
generate_completion( # Fall through to primary model (boost has no fallback concept) logger.info(f"Boost rate limited, using primary model for task {task_id}") + if strict_boost: + raise RuntimeError(f"Strict boost call failed for task {task_id}: {e}") from e except OpenRouterPrivacyPolicyError as e: # Privacy policy error - log and crash (boost has no fallback concept) @@ -537,7 +862,7 @@ async def generate_completion( # Log to autonomous API logger if callback set if self._autonomous_logger_callback: - full_prompt = messages[-1].get("content", "") if messages else "" + full_prompt = self._prompt_for_logging(messages) await self._autonomous_logger_callback( task_id=task_id, role_id=role_id, @@ -597,7 +922,7 @@ async def generate_completion( # Log to autonomous API logger if callback set if self._autonomous_logger_callback: - full_prompt = messages[-1].get("content", "") if messages else "" + full_prompt = self._prompt_for_logging(messages) await self._autonomous_logger_callback( task_id=task_id, role_id=role_id, @@ -618,6 +943,8 @@ async def generate_completion( "task_id": task_id, "message": "Boost credits exhausted, falling back to primary model" }) + if strict_boost: + raise RuntimeError(f"Strict boost call credits exhausted for task {task_id}: {e}") from e # Continue to primary model routing below except Exception as e: @@ -637,7 +964,7 @@ async def generate_completion( # Log to autonomous API logger if callback set if self._autonomous_logger_callback: - full_prompt = messages[-1].get("content", "") if messages else "" + full_prompt = self._prompt_for_logging(messages) await self._autonomous_logger_callback( task_id=task_id, role_id=role_id, @@ -653,12 +980,23 @@ async def generate_completion( ) logger.error(f"Boost API error for task {task_id}: {e}, using primary model") + if strict_boost: + raise RuntimeError(f"Strict boost call failed for task {task_id}: {e}") from e # Fall through to primary model # Check role fallback state async with self._state_lock: 
fallback_state = self._role_fallback_state.get(role_id, "lm_studio") role_config = self._role_model_configs.get(role_id) + + if system_config.generic_mode and role_config and fallback_state != "openrouter": + logger.warning( + "Generic mode reset role '%s' fallback state from %s to OpenRouter.", + role_id, + fallback_state, + ) + fallback_state = "openrouter" + self._role_fallback_state[role_id] = "openrouter" # If OpenRouter configured and not fallen back, try OpenRouter if fallback_state == "openrouter" and role_config: @@ -719,6 +1057,7 @@ async def generate_completion( max_tokens=max_tokens or role_config.max_output_tokens, response_format=response_format, provider=openrouter_provider, + reasoning_effort=role_config.openrouter_reasoning_effort, tools=tools, tool_choice=tool_choice, ), @@ -732,9 +1071,11 @@ async def generate_completion( # Check for missing choices (upstream provider timeout/error) if not result.get("choices"): - import json as _json - raw_response = _json.dumps(result)[:2000] - logger.error(f"OpenRouter response missing 'choices' after {duration_ms:.0f}ms - raw: {raw_response}") + logger.error( + "OpenRouter response missing 'choices' after %.0fms - %s", + duration_ms, + _response_shape_for_logging(result), + ) raise ValueError(f"OpenRouter response missing 'choices' after {duration_ms:.0f}ms (upstream provider timeout)") response_content = "" @@ -761,11 +1102,12 @@ async def generate_completion( boosted=False, boost_mode=None, openrouter_provider=openrouter_provider, + openrouter_reasoning_effort=role_config.openrouter_reasoning_effort, ) # Log to autonomous API logger if callback set if self._autonomous_logger_callback: - full_prompt = messages[-1].get("content", "") if messages else "" + full_prompt = self._prompt_for_logging(messages) await self._autonomous_logger_callback( task_id=task_id, role_id=role_id, @@ -790,7 +1132,7 @@ async def generate_completion( duration_ms = (time.time() - start_time) * 1000 if 
self._autonomous_logger_callback: - full_prompt = messages[-1].get("content", "") if messages else "" + full_prompt = self._prompt_for_logging(messages) await self._autonomous_logger_callback( task_id=task_id, role_id=role_id, @@ -827,6 +1169,9 @@ async def generate_completion( temperature=temperature, max_tokens=max_tokens or role_config.max_output_tokens, response_format=response_format, + reasoning_effort=role_config.openrouter_reasoning_effort, + tools=tools, + tool_choice=tool_choice, ) if rotated_result is not None: free_model_manager.clear_failed_models() # Success - clear failures @@ -852,7 +1197,7 @@ async def generate_completion( # Log to autonomous API logger if callback set if self._autonomous_logger_callback: - full_prompt = messages[-1].get("content", "") if messages else "" + full_prompt = self._prompt_for_logging(messages) await self._autonomous_logger_callback( task_id=task_id, role_id=role_id, @@ -918,7 +1263,7 @@ async def generate_completion( # Log to autonomous API logger if callback set if self._autonomous_logger_callback: - full_prompt = messages[-1].get("content", "") if messages else "" + full_prompt = self._prompt_for_logging(messages) await self._autonomous_logger_callback( task_id=task_id, role_id=role_id, @@ -979,7 +1324,7 @@ async def generate_completion( # Log to autonomous API logger if callback set if self._autonomous_logger_callback: - full_prompt = messages[-1].get("content", "") if messages else "" + full_prompt = self._prompt_for_logging(messages) await self._autonomous_logger_callback( task_id=task_id, role_id=role_id, @@ -1010,6 +1355,12 @@ async def generate_completion( ) raise + if system_config.generic_mode: + raise RuntimeError( + f"Generic mode is OpenRouter-only; role '{role_id}' cannot use LM Studio. " + "Configure the role with provider='openrouter' and a valid OpenRouter model/key." 
+ ) + # Use LM Studio (either configured as primary or fallen back) logger.debug(f"Role {role_id} using LM Studio: {model}") start_time = time.time() @@ -1036,13 +1387,17 @@ async def generate_completion( # Check for missing choices if not result.get("choices"): - import json as _json - raw_response = _json.dumps(result)[:2000] - logger.error(f"LM Studio response missing 'choices' after {duration_ms:.0f}ms - raw: {raw_response}") + logger.error( + "LM Studio response missing 'choices' after %.0fms - %s", + duration_ms, + _response_shape_for_logging(result), + ) raise ValueError(f"LM Studio response missing 'choices' after {duration_ms:.0f}ms") response_content = "" tokens_used = None + lm_routing_metadata = lm_studio_client.extract_routing_metadata(result) + actual_lm_studio_model = lm_routing_metadata.get("actual_model") or model if result.get("choices"): message = result["choices"][0].get("message", {}) response_content = message.get("content") or message.get("reasoning") or "" @@ -1051,7 +1406,7 @@ async def generate_completion( _pt = result["usage"].get("prompt_tokens") _ct = result["usage"].get("completion_tokens") if _pt is not None and _ct is not None: - token_tracker.track(model, _pt, _ct) + token_tracker.track(actual_lm_studio_model, _pt, _ct) await self._broadcast("token_usage_updated", token_tracker.get_stats()) result = self._annotate_response_with_call_metadata( @@ -1059,7 +1414,7 @@ async def generate_completion( task_id=task_id, role_id=role_id, configured_model=requested_model, - actual_model=model, + actual_model=actual_lm_studio_model, configured_provider=role_config.provider if role_config else configured_provider or "lm_studio", actual_provider="lm_studio", boosted=False, @@ -1068,11 +1423,11 @@ async def generate_completion( # Log to autonomous API logger if callback set if self._autonomous_logger_callback: - full_prompt = messages[-1].get("content", "") if messages else "" + full_prompt = self._prompt_for_logging(messages) await 
self._autonomous_logger_callback( task_id=task_id, role_id=role_id, - model=model, + model=actual_lm_studio_model, provider="lm_studio", prompt=full_prompt, response=response_content, @@ -1084,7 +1439,7 @@ async def generate_completion( ) # Track model usage for Tier 3 - await self._track_model_usage(model) + await self._track_model_usage(actual_lm_studio_model) return result @@ -1092,7 +1447,7 @@ async def generate_completion( # Log LM Studio error to autonomous logger if callback set duration_ms = (time.time() - start_time) * 1000 if self._autonomous_logger_callback: - full_prompt = messages[-1].get("content", "") if messages else "" + full_prompt = self._prompt_for_logging(messages) await self._autonomous_logger_callback( task_id=task_id, role_id=role_id, @@ -1120,6 +1475,9 @@ async def _try_free_model_rotation( temperature: float, max_tokens: int, response_format: Optional[Dict[str, str]], + reasoning_effort: Optional[str] = None, + tools: Optional[List[Dict[str, Any]]] = None, + tool_choice: Optional[Any] = None, ) -> Optional[Dict[str, Any]]: """ Attempt free model rotation chain: looping -> auto-selector. 
@@ -1153,6 +1511,7 @@ async def _try_free_model_rotation( temperature=temperature, max_tokens=max_tokens, response_format=response_format, + reasoning_effort=reasoning_effort, tools=tools, tool_choice=tool_choice, ), @@ -1177,6 +1536,7 @@ async def _try_free_model_rotation( actual_provider="openrouter", boosted=False, boost_mode=None, + openrouter_reasoning_effort=reasoning_effort, ) if free_model_manager.is_account_exhausted(): free_model_manager.clear_account_exhaustion() @@ -1204,6 +1564,7 @@ async def _try_free_model_rotation( temperature=temperature, max_tokens=max_tokens, response_format=response_format, + reasoning_effort=reasoning_effort, tools=tools, tool_choice=tool_choice, ), @@ -1228,6 +1589,7 @@ async def _try_free_model_rotation( actual_provider="openrouter", boosted=False, boost_mode=None, + openrouter_reasoning_effort=reasoning_effort, ) if free_model_manager.is_account_exhausted(): free_model_manager.clear_account_exhaustion() diff --git a/backend/shared/boost_logger.py b/backend/shared/boost_logger.py index 257cb59..13df4ff 100644 --- a/backend/shared/boost_logger.py +++ b/backend/shared/boost_logger.py @@ -4,18 +4,32 @@ main API call log view. """ import asyncio +import hashlib import json import logging import os +from collections import deque from datetime import datetime from typing import Dict, Any, List, Optional from pathlib import Path from backend.shared.config import system_config +from backend.shared.log_redaction import redact_log_text logger = logging.getLogger(__name__) +def _payload_metadata(value: str, preview_chars: int) -> Dict[str, Any]: + """Return safe log metadata for a boost response payload.""" + text = value or "" + preview = redact_log_text(text, preview_chars) + return { + "preview": preview, + "size": len(text), + "sha256": hashlib.sha256(text.encode("utf-8", errors="replace")).hexdigest() if text else "", + } + + class BoostLogger: """ Logger for boost API call outputs. 
@@ -39,6 +53,7 @@ def __init__(self): self._initialized = True self._ensure_log_file() + self._scrub_persisted_full_payloads() logger.info("BoostLogger initialized") def _ensure_log_file(self) -> None: @@ -52,6 +67,61 @@ def _ensure_log_file(self) -> None: def _get_log_path(self) -> Path: """Return the instance-scoped boost log path.""" return Path(system_config.data_dir) / "boost_api_log.txt" + + def _scrub_persisted_full_payloads(self) -> None: + """Remove legacy raw prompt/response bodies from the on-disk JSONL log.""" + log_path = self._get_log_path() + if not log_path.exists(): + return + + changed = False + scrubbed_lines: List[str] = [] + + try: + with open(log_path, "r", encoding="utf-8") as f: + lines = f.readlines() + + for line in lines: + stripped = line.strip() + if not stripped: + continue + try: + entry = json.loads(stripped) + except json.JSONDecodeError: + scrubbed_lines.append(line) + continue + + original_entry = dict(entry) + prompt_full = str(entry.pop("prompt_full", "") or "") + response_full = str(entry.pop("response_full", "") or "") + prompt_source = prompt_full or str(entry.get("prompt_preview") or "") + response_source = response_full or str(entry.get("response_preview") or "") + + if prompt_source: + entry["prompt_preview"] = redact_log_text(prompt_source, 500) + entry["prompt_size"] = int(entry.get("prompt_size") or len(prompt_source)) + if response_source: + response_meta = _payload_metadata(response_source, 2000) + entry["response_preview"] = response_meta["preview"] + entry["response_size"] = int(entry.get("response_size") or response_meta["size"]) + entry.setdefault("response_sha256", response_meta["sha256"]) + + entry["has_full_prompt"] = False + entry["has_full_response"] = False + entry["response_redacted"] = True + if entry.get("error"): + entry["error"] = redact_log_text(entry["error"], 1000) + + if prompt_full or response_full or entry != original_entry: + changed = True + scrubbed_lines.append(json.dumps(entry) + "\n") + + if 
changed: + with open(log_path, "w", encoding="utf-8") as f: + f.writelines(scrubbed_lines) + logger.info("Scrubbed legacy full payloads from boost API log") + except Exception as e: + logger.warning(f"Failed to scrub legacy boost API log payloads: {e}") async def log_boost_call( self, @@ -83,20 +153,27 @@ async def log_boost_call( """ async with self._lock: try: + response_meta = _payload_metadata(response_content, 2000) + store_full_payloads = bool(system_config.api_log_store_full_payloads) log_entry = { "timestamp": datetime.now().isoformat(), "task_id": task_id, "role_id": role_id, "model": model, "boost_mode": boost_mode, - "prompt_preview": prompt_preview[:500] if prompt_preview else "", - "response_preview": response_content[:2000] if response_content else "", - "response_full": response_content, + "prompt_preview": redact_log_text(prompt_preview, 500) if prompt_preview else "", + "response_preview": response_meta["preview"], + "response_size": response_meta["size"], + "response_sha256": response_meta["sha256"], + "response_redacted": not store_full_payloads, + "has_full_response": store_full_payloads and bool(response_content), "tokens_used": tokens_used, "duration_ms": duration_ms, "success": success, - "error": error + "error": redact_log_text(error, 1000) } + if store_full_payloads: + log_entry["response_full"] = response_content # Append to log file with open(self._get_log_path(), "a", encoding="utf-8") as f: @@ -126,7 +203,7 @@ async def _trim_log_if_needed(self) -> None: except Exception as e: logger.error(f"Failed to trim boost log: {e}") - async def get_logs(self, limit: int = 100) -> List[Dict[str, Any]]: + async def get_logs(self, limit: int = 100, include_full: bool = True) -> List[Dict[str, Any]]: """ Get recent boost API call logs. 
@@ -143,7 +220,7 @@ async def get_logs(self, limit: int = 100) -> List[Dict[str, Any]]: return [] with open(log_path, "r", encoding="utf-8") as f: - lines = f.readlines() + lines = deque(f, maxlen=max(1, limit)) logs = [] for line in lines: @@ -151,6 +228,15 @@ async def get_logs(self, limit: int = 100) -> List[Dict[str, Any]]: if line: try: log_entry = json.loads(line) + if not include_full or not system_config.api_log_store_full_payloads: + prompt_full = str(log_entry.pop("prompt_full", "") or "") + response_full = str(log_entry.pop("response_full", "") or "") + log_entry["prompt_size"] = len(prompt_full) + log_entry["response_size"] = int(log_entry.get("response_size") or len(response_full)) + log_entry["has_full_prompt"] = False + log_entry["has_full_response"] = False + if response_full and not log_entry.get("response_sha256"): + log_entry["response_sha256"] = hashlib.sha256(response_full.encode("utf-8", errors="replace")).hexdigest() logs.append(log_entry) except json.JSONDecodeError: continue @@ -163,7 +249,7 @@ async def get_logs(self, limit: int = 100) -> List[Dict[str, Any]]: logger.error(f"Failed to get boost logs: {e}") return [] - async def get_log_entry(self, index: int) -> Optional[Dict[str, Any]]: + async def get_log_entry(self, index: int, include_full: bool = True) -> Optional[Dict[str, Any]]: """ Get a specific log entry by index (0 = most recent). 
@@ -173,15 +259,53 @@ async def get_log_entry(self, index: int) -> Optional[Dict[str, Any]]: Returns: Log entry dict or None if not found """ - logs = await self.get_logs(limit=index + 1) + logs = await self.get_logs(limit=index + 1, include_full=include_full) if index < len(logs): return logs[index] return None - async def clear_logs(self) -> None: - """Clear all boost API logs.""" + @staticmethod + def _entry_workflow(entry: Dict[str, Any]) -> str: + workflow = str(entry.get("workflow") or "").strip().lower() + if workflow: + return workflow + + role_id = str(entry.get("role_id") or "") + task_id = str(entry.get("task_id") or "") + if role_id.startswith("leanoj_") or task_id.startswith("leanoj_"): + return "leanoj" + return "autonomous" + + async def clear_logs(self, workflow: Optional[str] = None) -> None: + """Clear boost API logs, optionally scoped to one workflow.""" async with self._lock: try: + if workflow: + log_path = self._get_log_path() + if not os.path.exists(log_path): + return + + with open(log_path, "r", encoding="utf-8") as f: + lines = f.readlines() + + retained_lines: List[str] = [] + for line in lines: + stripped = line.strip() + if not stripped: + continue + try: + entry = json.loads(stripped) + except json.JSONDecodeError: + retained_lines.append(line) + continue + if self._entry_workflow(entry) != workflow: + retained_lines.append(line) + + with open(log_path, "w", encoding="utf-8") as f: + f.writelines(retained_lines) + logger.info("Boost logs cleared for workflow %s", workflow) + return + with open(self._get_log_path(), "w", encoding="utf-8") as f: f.write("") logger.info("Boost logs cleared") diff --git a/backend/shared/boost_manager.py b/backend/shared/boost_manager.py index d26f883..e967b50 100644 --- a/backend/shared/boost_manager.py +++ b/backend/shared/boost_manager.py @@ -13,6 +13,8 @@ - Topic Validator, Redundancy Checker → agg_val (Agg Validator) - Brainstorm aggregation submitters/validator → agg_sub1..10, agg_val (via 
Coordinator) - Paper compilation → comp_hc, comp_hp, comp_val, comp_crit (via CompilerCoordinator) +- LeanOJ path-decision calls use `leanoj_path_*` task IDs for workflow display, but belong to the + Final Solver boost category (`leanoj_final`) because that role owns final-readiness decisions. State is persisted to backend/data/boost_state.json for crash recovery. """ @@ -22,7 +24,7 @@ import os from typing import Optional, Set, Callable, Any, Dict, List -from backend.shared.config import system_config +from backend.shared.config import rag_config, system_config from backend.shared.models import BoostConfig logger = logging.getLogger(__name__) @@ -48,6 +50,38 @@ "comp_hp": "High-Param Model", "comp_val": "Compiler Validator", "comp_crit": "Critique Submitter", + # LeanOJ + "leanoj_topic": "Proof Solver Topic Generator", + "leanoj_topic_val": "Proof Solver Topic Validator", + "leanoj_topic_sub1": "Proof Solver Topic Submitter 1", + "leanoj_topic_sub2": "Proof Solver Topic Submitter 2", + "leanoj_topic_sub3": "Proof Solver Topic Submitter 3", + "leanoj_topic_sub4": "Proof Solver Topic Submitter 4", + "leanoj_topic_sub5": "Proof Solver Topic Submitter 5", + "leanoj_topic_sub6": "Proof Solver Topic Submitter 6", + "leanoj_topic_sub7": "Proof Solver Topic Submitter 7", + "leanoj_topic_sub8": "Proof Solver Topic Submitter 8", + "leanoj_topic_sub9": "Proof Solver Topic Submitter 9", + "leanoj_topic_sub10": "Proof Solver Topic Submitter 10", + "leanoj_brainstorm_sub1": "Proof Solver Brainstorm Submitter 1", + "leanoj_brainstorm_sub2": "Proof Solver Brainstorm Submitter 2", + "leanoj_brainstorm_sub3": "Proof Solver Brainstorm Submitter 3", + "leanoj_brainstorm_sub4": "Proof Solver Brainstorm Submitter 4", + "leanoj_brainstorm_sub5": "Proof Solver Brainstorm Submitter 5", + "leanoj_brainstorm_sub6": "Proof Solver Brainstorm Submitter 6", + "leanoj_brainstorm_sub7": "Proof Solver Brainstorm Submitter 7", + "leanoj_brainstorm_sub8": "Proof Solver Brainstorm Submitter 8", + 
"leanoj_brainstorm_sub9": "Proof Solver Brainstorm Submitter 9", + "leanoj_brainstorm_sub10": "Proof Solver Brainstorm Submitter 10", + "leanoj_brainstorm_val": "Proof Solver Brainstorm Validator", + "leanoj_sufficiency": "Proof Solver Sufficiency Check", + "leanoj_path_val": "Proof Solver Path Validator", + "leanoj_final": "Proof Solver Final Solver", +} + +CATEGORY_ALIASES = { + # Path decisions are absorbed into the dominant Final Solver role. + "leanoj_path": "leanoj_final", } @@ -110,26 +144,36 @@ def _load_state(self) -> None: with open(state_file, 'r', encoding='utf-8') as f: state = json.load(f) - # Restore boost config if it was enabled + legacy_key_present = bool(state.get('api_key')) + + # Restore boost config if it was enabled. Legacy plaintext + # `api_key` values are intentionally ignored and scrubbed below. if state.get('enabled') and state.get('model_id'): self.boost_config = BoostConfig( enabled=True, - openrouter_api_key=state.get('api_key', ''), + openrouter_api_key='', boost_model_id=state.get('model_id'), boost_provider=state.get('provider'), + boost_reasoning_effort=state.get('reasoning_effort', 'auto'), boost_context_window=state.get('context_window', 131072), boost_max_output_tokens=state.get('max_output_tokens', 25000) ) # Restore boost modes self.boost_next_count = state.get('boost_next_count', 0) - self.boosted_categories = set(state.get('boosted_categories', [])) + self.boosted_categories = { + self._canonical_category(category) + for category in state.get('boosted_categories', []) + } self.boost_always_prefer = state.get('boost_always_prefer', False) self.boosted_task_ids = set(state.get('boosted_task_ids', [])) logger.info(f"Loaded boost state: enabled={state.get('enabled')}, model={state.get('model_id')}, " f"next_count={self.boost_next_count}, categories={len(self.boosted_categories)}, " f"always_prefer={self.boost_always_prefer}") + if legacy_key_present: + self._save_state() + logger.info("Scrubbed legacy plaintext boost API key 
from persisted state") except Exception as e: logger.warning(f"Failed to load boost state: {e}") @@ -144,9 +188,9 @@ def _save_state(self) -> None: 'enabled': self.boost_config is not None and self.boost_config.enabled, 'model_id': self.boost_config.boost_model_id if self.boost_config else None, 'provider': self.boost_config.boost_provider if self.boost_config else None, + 'reasoning_effort': self.boost_config.boost_reasoning_effort if self.boost_config else 'auto', 'context_window': self.boost_config.boost_context_window if self.boost_config else 131072, 'max_output_tokens': self.boost_config.boost_max_output_tokens if self.boost_config else 25000, - 'api_key': self.boost_config.openrouter_api_key if self.boost_config else '', 'boost_next_count': self.boost_next_count, 'boosted_categories': list(self.boosted_categories), 'boost_always_prefer': self.boost_always_prefer, @@ -181,6 +225,7 @@ async def set_boost_config(self, config: BoostConfig) -> None: provider_info = f", provider={config.boost_provider}" if config.boost_provider else " (auto-routing)" logger.info( f"Boost enabled: model={config.boost_model_id}{provider_info}, " + f"reasoning={config.boost_reasoning_effort}, " f"context={config.boost_context_window}, " f"max_tokens={config.boost_max_output_tokens}" ) @@ -191,6 +236,7 @@ async def set_boost_config(self, config: BoostConfig) -> None: await self._broadcast("boost_enabled", { "model_id": config.boost_model_id, "provider": config.boost_provider, + "reasoning_effort": config.boost_reasoning_effort, "context_window": config.boost_context_window, "max_output_tokens": config.boost_max_output_tokens }) @@ -307,6 +353,7 @@ async def toggle_category_boost(self, category: str) -> bool: Returns: True if category is now boosted, False if unboosted """ + category = self._canonical_category(category) async with self._lock: if category in self.boosted_categories: self.boosted_categories.remove(category) @@ -327,6 +374,11 @@ async def toggle_category_boost(self, 
category: str) -> bool: }) return boosted + + @staticmethod + def _canonical_category(category: str) -> str: + """Map absorbed/legacy category prefixes to their owning role category.""" + return CATEGORY_ALIASES.get(category, category) def _extract_role_prefix(self, task_id: str) -> str: """ @@ -340,8 +392,8 @@ def _extract_role_prefix(self, task_id: str) -> str: # Split on last underscore and take everything before it parts = task_id.rsplit('_', 1) if len(parts) == 2: - return parts[0] - return task_id + return self._canonical_category(parts[0]) + return self._canonical_category(task_id) def should_use_boost(self, task_id: str) -> bool: """ @@ -410,6 +462,8 @@ def get_boost_status(self) -> Dict[str, Any]: return { "enabled": False, "model_id": None, + "has_available_key": bool(rag_config.openrouter_api_key), + "needs_key": False, "boosted_task_count": 0, "boost_next_count": 0, "boosted_categories": [], @@ -417,12 +471,19 @@ def get_boost_status(self) -> Dict[str, Any]: "boosted_tasks": [] } + has_available_key = bool( + (self.boost_config.openrouter_api_key or "").strip() + or (rag_config.openrouter_api_key or "").strip() + ) return { "enabled": self.boost_config.enabled, "model_id": self.boost_config.boost_model_id, "provider": self.boost_config.boost_provider, + "reasoning_effort": self.boost_config.boost_reasoning_effort, "context_window": self.boost_config.boost_context_window, "max_output_tokens": self.boost_config.boost_max_output_tokens, + "has_available_key": has_available_key, + "needs_key": bool(self.boost_config.enabled and not has_available_key), "boosted_task_count": len(self.boosted_task_ids), "boosted_tasks": list(self.boosted_task_ids), "boost_next_count": self.boost_next_count, @@ -470,6 +531,26 @@ def get_available_categories(self, mode: str = "all") -> List[Dict[str, str]]: {"id": "comp_hp", "label": "High-Param Model", "group": "Compiler"}, {"id": "comp_crit", "label": "Critique Submitter", "group": "Compiler"}, ]) + + categories.extend([ + 
{"id": "leanoj_topic", "label": "Topic Generator", "group": "Proof Solver"}, + {"id": "leanoj_topic_val", "label": "Topic Validator", "group": "Proof Solver"}, + {"id": "leanoj_brainstorm_val", "label": "Brainstorm Validator", "group": "Proof Solver"}, + {"id": "leanoj_sufficiency", "label": "Sufficiency Check", "group": "Proof Solver"}, + {"id": "leanoj_path_val", "label": "Path Validator", "group": "Proof Solver"}, + {"id": "leanoj_final", "label": "Final Solver", "group": "Proof Solver"}, + ]) + for i in range(1, 11): + categories.append({ + "id": f"leanoj_topic_sub{i}", + "label": f"Topic Submitter {i}", + "group": "Proof Solver", + }) + categories.append({ + "id": f"leanoj_brainstorm_sub{i}", + "label": f"Brainstorm Submitter {i}", + "group": "Proof Solver", + }) return categories diff --git a/backend/shared/brainstorm_proof_gate.py b/backend/shared/brainstorm_proof_gate.py new file mode 100644 index 0000000..1d2a26e --- /dev/null +++ b/backend/shared/brainstorm_proof_gate.py @@ -0,0 +1,341 @@ +"""Shared Lean 4 gate for brainstorm proof candidates.""" +from __future__ import annotations + +import logging +from dataclasses import dataclass +from typing import Any, Optional + +from backend.autonomous.prompts.proof_prompts import LEAN4_COMMON_PITFALLS +from backend.shared.api_client_manager import api_client_manager +from backend.shared.config import system_config +from backend.shared.json_parser import parse_json +from backend.shared.lean4_client import get_lean4_client +from backend.shared.lean_proof_integrity import validate_full_lean_proof_integrity +from backend.shared.model_error_utils import is_non_retryable_model_error +from backend.shared.models import ProofAttemptFeedback + +logger = logging.getLogger(__name__) + +BRAINSTORM_LEAN_PROOF_MARKER = "[LEAN 4 VERIFIED BRAINSTORM PROOF]" + + +@dataclass +class BrainstormProofGateResult: + """Result of checking a proof candidate before normal brainstorm validation.""" + + accepted: bool + submission_content: 
str = "" + theorem_statement: str = "" + theorem_name: str = "" + formal_sketch: str = "" + lean_code: str = "" + reasoning: str = "" + lean_feedback: str = "" + attempts: list[ProofAttemptFeedback] | None = None + failure_feedback: str = "" + + +def is_lean_proof_submission(parsed: dict[str, Any]) -> bool: + """Return True when a submitter chose the optional Lean proof route.""" + submission_type = str(parsed.get("submission_type") or parsed.get("type") or "").strip().lower() + if submission_type in {"lean_proof", "proof", "lean4_proof"}: + return True + return bool(parsed.get("lean_code")) and bool(parsed.get("theorem_statement") or parsed.get("theorem_or_lemma")) + + +def _summarize_error(error_output: str, limit: int = 1400) -> str: + text = " ".join((error_output or "").split()) + return text[:limit] + ("..." if len(text) > limit else "") + + +def _format_attempts(attempts: list[ProofAttemptFeedback]) -> str: + if not attempts: + return "No prior Lean attempts." + blocks: list[str] = [] + for attempt in attempts[-5:]: + lean_feedback = ( + attempt.error_output + or attempt.diagnostic_output + or attempt.raw_stderr + or ("Lean accepted this attempt with no diagnostics." 
if attempt.success else "[none]") + ) + blocks.extend( + [ + f"ATTEMPT {attempt.attempt}:", + f"Reasoning: {attempt.reasoning or '[none]'}", + "Lean code:", + attempt.lean_code or "[none]", + "Lean / integrity feedback:", + lean_feedback, + f"Goal states: {attempt.goal_states or '[none]'}", + "---", + ] + ) + return "\n".join(blocks) + + +def _format_lean_feedback(lean_result: Any) -> str: + diagnostics = str(getattr(lean_result, "diagnostic_output", "") or "").strip() + if not diagnostics: + diagnostics = str(getattr(lean_result, "raw_stderr", "") or "").strip() + goal_states = str(getattr(lean_result, "goal_states", "") or "").strip() + parts = [] + if diagnostics: + parts.append(diagnostics) + if goal_states: + parts.append(f"Goal state output:\n{goal_states}") + return "\n\n".join(parts).strip() or "Lean 4 accepted with no diagnostics." + + +def _build_retry_prompt( + *, + user_prompt: str, + source_context: str, + theorem_statement: str, + formal_sketch: str, + prior_attempts: list[ProofAttemptFeedback], +) -> str: + context_excerpt = (source_context or "").strip() + if len(context_excerpt) > 12000: + context_excerpt = context_excerpt[:12000] + "\n...[context truncated for proof retry]..." + return f"""You are repairing a Lean 4 proof candidate for a brainstorm submission. + +The previous proof candidate was rejected by Lean 4 or by MOTO's post-Lean integrity gate. Produce a corrected complete Lean 4 proof. Do not use `sorry`, `admit`, or fake `axiom`/`constant`/`opaque` proof devices. 
+ +{LEAN4_COMMON_PITFALLS} + +USER PROMPT: +{user_prompt} + +INTENDED THEOREM STATEMENT: +{theorem_statement} + +FORMALIZATION NOTES: +{formal_sketch or "[none]"} + +BRAINSTORM CONTEXT EXCERPT: +{context_excerpt or "[none]"} + +PRIOR ATTEMPTS AND FEEDBACK: +{_format_attempts(prior_attempts)} + +Respond with ONLY valid JSON: +{{ + "theorem_name": "Lean declaration name, if named", + "theorem_statement": "natural-language theorem statement being proved", + "formal_sketch": "updated formalization notes", + "lean_code": "complete Lean 4 code", + "reasoning": "brief explanation of the repair" +}} +""" + + +def _build_submission_content( + *, + theorem_statement: str, + formal_sketch: str, + lean_code: str, + reasoning: str, + lean_feedback: str, + attempts: list[ProofAttemptFeedback], +) -> str: + attempt_count = len(attempts) + sections = [ + BRAINSTORM_LEAN_PROOF_MARKER, + "", + "Lean 4 has accepted the following proof before this submission reached the brainstorm validator. The validator should still decide whether it is useful, non-redundant brainstorm progress.", + "", + f"Theorem statement: {theorem_statement}", + ] + if formal_sketch: + sections.extend(["", f"Formalization notes: {formal_sketch}"]) + if reasoning: + sections.extend(["", f"Submitter reasoning: {reasoning}"]) + sections.extend( + [ + "", + f"Lean verification: accepted after {attempt_count} attempt{'s' if attempt_count != 1 else ''}.", + f"Lean verifier feedback: {lean_feedback}", + "", + "Lean 4 code:", + "```lean", + lean_code, + "```", + ] + ) + return "\n".join(sections).strip() + + +async def verify_brainstorm_proof_candidate( + *, + parsed: dict[str, Any], + user_prompt: str, + source_context: str, + model_id: str, + role_id: str, + task_id_prefix: str, + max_tokens: int, + validator_model: Optional[str], + validator_context: int, + validator_max_tokens: int, + validator_role_id: str, + allowed_baseline: str = "", + max_attempts: int = 5, +) -> BrainstormProofGateResult: + """Lean-check a 
brainstorm proof candidate before it reaches the validator.""" + theorem_statement = str(parsed.get("theorem_statement") or parsed.get("theorem_or_lemma") or parsed.get("submission") or "").strip() + formal_sketch = str(parsed.get("formal_sketch") or parsed.get("proof_sketch") or "").strip() + theorem_name = str(parsed.get("theorem_name") or "").strip() + lean_code = str(parsed.get("lean_code") or "").strip() + reasoning = str(parsed.get("reasoning") or "").strip() + + if not theorem_statement or not lean_code: + return BrainstormProofGateResult( + accepted=False, + theorem_statement=theorem_statement, + lean_code=lean_code, + reasoning=reasoning, + failure_feedback=( + "Lean proof candidate was malformed: both `theorem_statement` and `lean_code` " + "are required. Start the next brainstorm attempt fresh." + ), + attempts=[], + ) + + attempts: list[ProofAttemptFeedback] = [] + current = { + "theorem_statement": theorem_statement, + "formal_sketch": formal_sketch, + "theorem_name": theorem_name, + "lean_code": lean_code, + "reasoning": reasoning, + } + + for attempt_number in range(1, max(1, max_attempts) + 1): + theorem_statement = str(current.get("theorem_statement") or theorem_statement).strip() + formal_sketch = str(current.get("formal_sketch") or formal_sketch).strip() + theorem_name = str(current.get("theorem_name") or theorem_name).strip() + lean_code = str(current.get("lean_code") or "").strip() + reasoning = str(current.get("reasoning") or reasoning).strip() + + lean_result = await get_lean4_client().check_proof( + lean_code, + timeout=system_config.lean4_proof_timeout, + ) + feedback = ProofAttemptFeedback( + attempt=attempt_number, + theorem_id="brainstorm_inline_proof", + reasoning=reasoning, + lean_code=lean_code, + error_output=lean_result.error_output, + diagnostic_output=str(getattr(lean_result, "diagnostic_output", "") or ""), + goal_states=lean_result.goal_states, + raw_stderr=str(getattr(lean_result, "raw_stderr", "") or ""), + 
strategy="full_script", + success=lean_result.success, + ) + + if lean_result.success: + lean_feedback = _format_lean_feedback(lean_result) + integrity = await validate_full_lean_proof_integrity( + user_prompt=user_prompt, + theorem_statement=theorem_statement, + formal_sketch=formal_sketch, + lean_code=lean_code, + source_excerpt=source_context or theorem_statement, + allowed_baseline=allowed_baseline, + validator_model=validator_model, + validator_context=validator_context, + validator_max_tokens=validator_max_tokens, + task_id=f"{task_id_prefix}_integrity_{attempt_number}", + role_id=validator_role_id, + require_statement_alignment=True, + ) + if integrity.valid: + feedback.success = True + feedback.error_output = "" + attempts.append(feedback) + return BrainstormProofGateResult( + accepted=True, + submission_content=_build_submission_content( + theorem_statement=theorem_statement, + formal_sketch=formal_sketch, + lean_code=lean_code, + reasoning=reasoning, + lean_feedback=lean_feedback, + attempts=attempts, + ), + theorem_statement=theorem_statement, + theorem_name=theorem_name, + formal_sketch=formal_sketch, + lean_code=lean_code, + reasoning=reasoning, + lean_feedback=lean_feedback, + attempts=attempts, + ) + + feedback.success = False + feedback.error_output = integrity.reason + + attempts.append(feedback) + if attempt_number >= max_attempts: + break + + prompt = _build_retry_prompt( + user_prompt=user_prompt, + source_context=source_context, + theorem_statement=theorem_statement, + formal_sketch=formal_sketch, + prior_attempts=attempts, + ) + try: + response = await api_client_manager.generate_completion( + task_id=f"{task_id_prefix}_repair_{attempt_number + 1}", + role_id=role_id, + model=model_id, + messages=[{"role": "user", "content": prompt}], + temperature=0.0, + max_tokens=max_tokens, + ) + if not response or not response.get("choices"): + raise ValueError("Proof repair model returned no choices.") + message = response["choices"][0].get("message", 
{}) + content = message.get("content") or message.get("reasoning") or "" + repaired = parse_json(content) + if isinstance(repaired, list): + repaired = repaired[0] if repaired else {} + if not isinstance(repaired, dict): + raise ValueError("Proof repair response was not a JSON object.") + current = { + "theorem_statement": str(repaired.get("theorem_statement") or theorem_statement).strip(), + "formal_sketch": str(repaired.get("formal_sketch") or formal_sketch).strip(), + "theorem_name": str(repaired.get("theorem_name") or theorem_name).strip(), + "lean_code": str(repaired.get("lean_code") or "").strip(), + "reasoning": str(repaired.get("reasoning") or "").strip(), + } + except Exception as exc: + if is_non_retryable_model_error(exc): + raise + logger.warning("Brainstorm proof repair attempt setup failed: %s", exc) + current = { + "theorem_statement": theorem_statement, + "formal_sketch": formal_sketch, + "theorem_name": theorem_name, + "lean_code": lean_code, + "reasoning": f"Prior proof repair call failed before Lean verification: {exc}", + } + + last_error = attempts[-1].error_output if attempts else "No Lean attempts completed." + return BrainstormProofGateResult( + accepted=False, + theorem_statement=theorem_statement, + theorem_name=theorem_name, + formal_sketch=formal_sketch, + lean_code=lean_code, + reasoning=reasoning, + attempts=attempts, + failure_feedback=( + "Lean proof candidate failed the 5-attempt brainstorm proof gate. " + f"Last feedback: {_summarize_error(last_error)}. Start the next brainstorm attempt with a fresh useful question or idea." 
+ ), + ) diff --git a/backend/shared/build_info.py b/backend/shared/build_info.py index 94ae74c..bda5d73 100644 --- a/backend/shared/build_info.py +++ b/backend/shared/build_info.py @@ -22,7 +22,7 @@ "version": "0.0.0-dev", "build_commit": "dev", "update_channel": "main", - "api_contract_version": "build5-v1", + "api_contract_version": "build5-v12", } _ENV_OVERRIDES = { diff --git a/backend/shared/config.py b/backend/shared/config.py index 1f965f5..257eefd 100644 --- a/backend/shared/config.py +++ b/backend/shared/config.py @@ -138,7 +138,7 @@ class SystemConfig(BaseSettings): validation_alias=AliasChoices("MOTO_INSTANCE_ID", "INSTANCE_ID"), ) backend_host: str = Field( - default="0.0.0.0", + default="127.0.0.1", validation_alias=AliasChoices("MOTO_BACKEND_HOST", "HOST"), ) backend_port: int = Field( @@ -157,6 +157,30 @@ class SystemConfig(BaseSettings): default=None, validation_alias=AliasChoices("MOTO_INTERNAL_PROXY_SECRET", "INTERNAL_PROXY_SECRET"), ) + generic_max_request_bytes: int = Field( + default=16 * 1024 * 1024, + validation_alias=AliasChoices("MOTO_GENERIC_MAX_REQUEST_BYTES", "GENERIC_MAX_REQUEST_BYTES"), + ) + pdf_max_html_bytes: int = Field( + default=2 * 1024 * 1024, + validation_alias=AliasChoices("MOTO_PDF_MAX_HTML_BYTES", "PDF_MAX_HTML_BYTES"), + ) + pdf_max_outline_bytes: int = Field( + default=1 * 1024 * 1024, + validation_alias=AliasChoices("MOTO_PDF_MAX_OUTLINE_BYTES", "PDF_MAX_OUTLINE_BYTES"), + ) + pdf_max_metadata_bytes: int = Field( + default=64 * 1024, + validation_alias=AliasChoices("MOTO_PDF_MAX_METADATA_BYTES", "PDF_MAX_METADATA_BYTES"), + ) + api_log_store_full_payloads: bool = Field( + default=False, + validation_alias=AliasChoices("MOTO_API_LOG_STORE_FULL_PAYLOADS", "API_LOG_STORE_FULL_PAYLOADS"), + ) + desktop_api_token: Optional[str] = Field( + default=None, + validation_alias=AliasChoices("MOTO_DESKTOP_API_TOKEN", "DESKTOP_API_TOKEN"), + ) frontend_storage_prefix: Optional[str] = Field( default=None, 
validation_alias=AliasChoices("MOTO_FRONTEND_STORAGE_PREFIX", "FRONTEND_STORAGE_PREFIX"), @@ -169,6 +193,13 @@ class SystemConfig(BaseSettings): consecutive_rejection_reset_threshold: int = 15 queue_overflow_threshold: int = 10 per_submitter_queue_threshold: int = 4 # Pause an individual submitter when it already has more than this many submissions queued (fairness cap) + max_model_concurrency_per_model: int = Field( + default=3, + validation_alias=AliasChoices( + "MOTO_MAX_MODEL_CONCURRENCY_PER_MODEL", + "MAX_MODEL_CONCURRENCY_PER_MODEL", + ), + ) # Compiler settings (Phase 2) # NOTE: Compiler contexts are set by user in GUI, these are just default fallbacks @@ -235,6 +266,10 @@ class SystemConfig(BaseSettings): default=600, validation_alias=AliasChoices("MOTO_LEAN4_LSP_IDLE_TIMEOUT", "LEAN4_LSP_IDLE_TIMEOUT"), ) + leanoj_auto_resume_enabled: bool = Field( + default=False, + validation_alias=AliasChoices("MOTO_LEANOJ_AUTO_RESUME_ENABLED", "LEANOJ_AUTO_RESUME_ENABLED"), + ) # Maximum number of theorem candidates whose Lean 4 formalization attempts # may run concurrently within a single proof-verification stage. 
Novelty # assessment and proof-database persistence remain serialized after each @@ -374,6 +409,9 @@ def _join_data_path(*parts: str) -> str: if self.internal_proxy_secret is not None: self.internal_proxy_secret = self.internal_proxy_secret.strip() or None + if self.desktop_api_token is not None: + self.desktop_api_token = self.desktop_api_token.strip() or None + if self.frontend_storage_prefix is not None: self.frontend_storage_prefix = self.frontend_storage_prefix.strip() or None diff --git a/backend/shared/json_parser.py b/backend/shared/json_parser.py index befbd08..5f8f4c8 100644 --- a/backend/shared/json_parser.py +++ b/backend/shared/json_parser.py @@ -10,11 +10,277 @@ import json import logging import re +import hashlib from typing import Any logger = logging.getLogger(__name__) +RETRY_CONTEXT_EMPTY_PLACEHOLDER = "[previous output contained no reusable visible answer text]" + + +def _content_diagnostics(value: str) -> str: + """Return parse diagnostics without logging raw model output.""" + text = value or "" + return f"length={len(text)}, sha256={hashlib.sha256(text.encode('utf-8', errors='replace')).hexdigest() if text else ''}" + +_PRIVATE_REASONING_OPEN_TAG_PATTERN = re.compile(r"^\s*<(?:think|thought)\b[^>]*>", re.IGNORECASE) +_FINAL_CHANNEL_PATTERN = re.compile(r"<\|channel\|?>\s*final\b", re.IGNORECASE) +_PRIVATE_CHANNEL_PATTERN = re.compile(r"<\|channel\|?>\s*(?:analysis|thought|commentary)\b", re.IGNORECASE) +_LEGACY_CHANNEL_BOUNDARY_PATTERN = re.compile(r"", re.IGNORECASE) +_KNOWN_CONTROL_TOKEN_NAMES = ( + "channel", + "message", + "end", + "constrain", + "start", + "return", + "call", + "recipient", +) +_KNOWN_CONTROL_TOKEN_ALTERNATION = "|".join(_KNOWN_CONTROL_TOKEN_NAMES) +_BROAD_CONTROL_TOKEN_PATTERN = re.compile(r"<\|[A-Za-z0-9_:-]+(?:\|>|>)", re.IGNORECASE) +_LEGACY_CONTROL_TOKEN_PATTERN = re.compile(r"<(?:channel|message|end|constrain)\|>", re.IGNORECASE) +_PARTIAL_CONTROL_TOKEN_PATTERN = re.compile( + 
rf"<\|(?:{_KNOWN_CONTROL_TOKEN_ALTERNATION})[A-Za-z_:-]*$|" + r"<(?:channel|message|end|constrain)\|?$", + re.IGNORECASE, +) + + +def _first_likely_visible_boundary(content: str) -> int: + """Return the first likely user-visible answer boundary, or -1 if absent.""" + candidates = [] + for marker in ("```", "{"): + idx = content.find(marker) + if idx >= 0: + candidates.append(idx) + if candidates: + return min(candidates) + + for match in re.finditer(r"\[", content): + after_bracket = content[match.end():].lstrip() + if after_bracket and after_bracket[0] in '{["]-0123456789tfn': + return match.start() + + return -1 + + +def _find_matches_outside_json_strings(pattern: re.Pattern, content: str) -> list[re.Match]: + """Find regex matches that start outside JSON-style quoted strings.""" + matches = [] + i = 0 + in_string = False + escape_next = False + + while i < len(content): + char = content[i] + + if escape_next: + escape_next = False + i += 1 + continue + + if char == "\\" and in_string: + escape_next = True + i += 1 + continue + + if char == '"': + in_string = not in_string + i += 1 + continue + + if not in_string: + match = pattern.match(content, i) + if match: + matches.append(match) + i = max(match.end(), i + 1) + continue + + i += 1 + + return matches + + +def _has_match_outside_json_strings(pattern: re.Pattern, content: str) -> bool: + """Return True when pattern matches outside JSON-style quoted strings.""" + return bool(_find_matches_outside_json_strings(pattern, content)) + + +def _strip_control_tokens_outside_json_strings(content: str) -> str: + """Strip provider control tokens without touching visible JSON string values.""" + result = [] + i = 0 + in_string = False + escape_next = False + + while i < len(content): + char = content[i] + + if escape_next: + result.append(char) + escape_next = False + i += 1 + continue + + if char == "\\" and in_string: + result.append(char) + escape_next = True + i += 1 + continue + + if char == '"': + in_string = not 
in_string + result.append(char) + i += 1 + continue + + if not in_string: + if content.startswith("<|", i): + pipe_close = content.find("|>", i + 2) + angle_close = content.find(">", i + 2) + token_end = -1 + + if pipe_close >= 0 and (angle_close < 0 or pipe_close + 1 <= angle_close): + token_end = pipe_close + 2 + token_body = content[i + 2:pipe_close] + elif angle_close >= 0: + token_end = angle_close + 1 + token_body = content[i + 2:angle_close] + else: + token_body = content[i + 2:] + if re.fullmatch(r"[A-Za-z0-9_:-]+", token_body): + i = len(content) + continue + + if token_end > 0 and re.fullmatch(r"[A-Za-z0-9_:-]+", token_body.strip()): + i = token_end + continue + + for legacy_token in ("<channel|>", "<message|>", "<end|>", "<constrain|>"): + if content[i:i + len(legacy_token)].lower() == legacy_token: + i += len(legacy_token) + break + else: + result.append(char) + i += 1 + continue + + result.append(char) + i += 1 + + return "".join(result) + + +def _strip_leading_private_reasoning_blocks(content: str) -> str: + """ + Remove leading private reasoning transcript blocks without touching visible content. + + Some providers expose private reasoning as a leading `<think>` or `<thought>` + block before the actual answer. Treat only leading blocks as transcript + scaffolding; preserve literal tags that appear inside visible JSON, code, or + prose because those may be the user's/model's actual content. + """ + while True: + match = _PRIVATE_REASONING_OPEN_TAG_PATTERN.match(content) + if not match: + return content.strip() + + tag_match = re.match(r"\s*<(?P<tag>think|thought)\b", content, re.IGNORECASE) + if not tag_match: + return content.strip() + + tag_name = tag_match.group("tag") + close_match = re.search(rf"</{tag_name}\s*>", content[match.end():], re.IGNORECASE) + if close_match: + content = content[match.end() + close_match.end():].strip() + continue + + # Unclosed private block: keep later likely answer text if it exists. 
+ after_open_tag = content[match.end():] + boundary = _first_likely_visible_boundary(after_open_tag) + if boundary >= 0: + content = after_open_tag[boundary:].strip() + continue + + return RETRY_CONTEXT_EMPTY_PLACEHOLDER + + +def sanitize_model_output_for_retry_context(raw: str, max_chars: int = 2000) -> str: + """ + Sanitize raw model output before replaying it as retry context. + + This preserves useful visible failed-output excerpts for conversational retries + while stripping private reasoning/channel/control tokens that provider chat + templates may reject or that should not enter MOTO memory/context surfaces. + """ + if raw is None: + return RETRY_CONTEXT_EMPTY_PLACEHOLDER + + content = str(raw).replace("\r\n", "\n").replace("\r", "\n").strip() + if not content: + return RETRY_CONTEXT_EMPTY_PLACEHOLDER + + original_content = content + private_marker_seen = bool( + _has_match_outside_json_strings(_PRIVATE_CHANNEL_PATTERN, content) + or _has_match_outside_json_strings(_FINAL_CHANNEL_PATTERN, content) + or _has_match_outside_json_strings(_LEGACY_CHANNEL_BOUNDARY_PATTERN, content) + or _has_match_outside_json_strings(_BROAD_CONTROL_TOKEN_PATTERN, content) + or _has_match_outside_json_strings(_LEGACY_CONTROL_TOKEN_PATTERN, content) + or _has_match_outside_json_strings(_PARTIAL_CONTROL_TOKEN_PATTERN, content) + ) + + # If a Harmony-style final channel is present, only the final-channel payload + # is reusable answer text. Earlier analysis/thought channels are private. 
+ final_matches = _find_matches_outside_json_strings(_FINAL_CHANNEL_PATTERN, content) + if final_matches: + content = content[final_matches[-1].end():] + elif _has_match_outside_json_strings(_PRIVATE_CHANNEL_PATTERN, content): + legacy_boundaries = _find_matches_outside_json_strings(_LEGACY_CHANNEL_BOUNDARY_PATTERN, content) + if legacy_boundaries: + content = content[legacy_boundaries[-1].end():] + else: + boundary = _first_likely_visible_boundary(content) + if boundary >= 0: + content = content[boundary:] + else: + return RETRY_CONTEXT_EMPTY_PLACEHOLDER + + content = _strip_leading_private_reasoning_blocks(content) + if content == RETRY_CONTEXT_EMPTY_PLACEHOLDER: + return content + + # Remove complete and partial provider/private control tokens. Channel labels + # directly following token shapes are not user-visible answer content. + content = _strip_control_tokens_outside_json_strings(content) + content = re.sub(r"(?im)^\s*(analysis|thought|commentary|final)\s*$", "", content) + content = content.strip() + + if private_marker_seen: + boundary = _first_likely_visible_boundary(content) + if boundary > 0: + content = content[boundary:].strip() + + content = re.sub(r"\n{3,}", "\n\n", content).strip() + if not content: + logger.debug("Retry-context sanitizer removed private-only model output") + return RETRY_CONTEXT_EMPTY_PLACEHOLDER + + if max_chars and max_chars > 0 and len(content) > max_chars: + content = content[:max_chars].rstrip() + "\n[...sanitized output truncated for retry...]" + + if content != original_content: + logger.debug( + "Sanitized retry context output (%d -> %d chars)", + len(original_content), + len(content), + ) + + return content or RETRY_CONTEXT_EMPTY_PLACEHOLDER + + def sanitize_json_response(raw_content: str) -> str: """ Sanitize JSON response to handle LaTeX expressions and invalid escape sequences. @@ -68,7 +334,7 @@ def sanitize_json_response(raw_content: str) -> str: if len(content) < original_len: logger.debug(f"Stripped ... 
reasoning tokens ({original_len} -> {len(content)} chars)") - logger.debug(f"Content after think removal (first 300 chars): {repr(content[:300])}") + logger.debug("Content after think removal redacted (%s)", _content_diagnostics(content)) # Extra safety: Remove any remaining thinking-related tags content = re.sub(r'', '', content, flags=re.IGNORECASE).strip() @@ -122,8 +388,9 @@ def sanitize_json_response(raw_content: str) -> str: original_content = content content = re.sub(control_token_pattern, '', content).strip() logger.debug( - f"Stripped control tokens: " - f"'{original_content[:150]}...' -> '{content[:150]}...'" + "Stripped control tokens: before=(%s), after=(%s)", + _content_diagnostics(original_content), + _content_diagnostics(content), ) # Additional cleanup: Remove any remaining angle bracket artifacts @@ -146,14 +413,14 @@ def sanitize_json_response(raw_content: str) -> str: # If no JSON start found, raise explicit error if json_start < 0: logger.warning(f"No JSON start character found in content (length={len(content)})") - logger.warning(f"Content preview: {repr(content[:200])}...") + logger.warning("Content preview redacted (%s)", _content_diagnostics(content)) # NEW: Don't continue - this is pure reasoning text with no JSON # Raise explicit error for retry mechanism raise ValueError( f"No JSON found in response - only conversational reasoning text " f"({len(content)} chars). Model likely hit max_tokens before writing JSON. " - f"Content starts with: {repr(content[:200])}" + "Raw content preview is withheld from retry prompts; use logs for diagnostics." 
) else: # Strip everything before the JSON start (handles reasoning models that output @@ -163,7 +430,7 @@ def sanitize_json_response(raw_content: str) -> str: content = content[json_start:] json_start = 0 # Reset to 0 since we stripped the prefix logger.debug(f"Stripped {len(stripped_prefix)} chars of non-JSON prefix") - logger.debug(f"Stripped prefix preview: {repr(stripped_prefix[:200])}...") + logger.debug("Stripped prefix preview redacted (%s)", _content_diagnostics(stripped_prefix)) if json_start >= 0: try: @@ -220,7 +487,7 @@ def sanitize_json_response(raw_content: str) -> str: f"JSON response truncated at max_tokens: {brace_count} unclosed braces, " f"in_string={in_string}, response length {len(content)} chars. " f"Model needs to generate more concise output that fits within token limits. " - f"{last_complete_context}" + "Raw content preview is withheld from retry prompts; use logs for diagnostics." ) elif start_char == '[': @@ -277,7 +544,7 @@ def sanitize_json_response(raw_content: str) -> str: # Safety check: ensure content is not empty after preprocessing if not content or not content.strip(): logger.error(f"Sanitization resulted in empty content! 
Original length: {len(raw_content)}") - logger.error(f"Original content preview: {raw_content[:500]}...") + logger.error("Original content preview redacted (%s)", _content_diagnostics(raw_content)) # Return original content and let the caller handle the error return raw_content.strip() @@ -683,7 +950,7 @@ def parse_json(response_content: str) -> dict: # Check for anomalously short response if len(response_content.strip()) < 10: logger.error(f"parse_json: Response too short ({len(response_content)} chars)") - logger.error(f"Short response content: {repr(response_content)}") + logger.error("Short response content redacted (%s)", _content_diagnostics(response_content)) raise ValueError(f"Response too short ({len(response_content)} chars)") # Sanitize and parse @@ -718,7 +985,7 @@ def parse_json(response_content: str) -> dict: stripped = sanitized_content.rstrip() if stripped and stripped[-1] not in '}]': is_likely_truncated = True - truncation_hints.append(f"JSON doesn't end with }} or ] (ends with: {repr(stripped[-20:])})") + truncation_hints.append("JSON doesn't end with } or ]") # Count unclosed braces/brackets (rough check) open_braces = sanitized_content.count('{') - sanitized_content.count('}') @@ -737,23 +1004,14 @@ def parse_json(response_content: str) -> dict: logger.error(f"🚨 LIKELY TRUNCATED LLM OUTPUT: {', '.join(truncation_hints)}") logger.error("This usually means the LLM hit max_tokens limit before completing the JSON response") - logger.error(f"Original response length: {len(response_content)} chars") - logger.error(f"Original response (first 500 chars): {repr(response_content[:500])}") - logger.error(f"Original response (last 200 chars): {repr(response_content[-200:])}") - logger.error(f"Sanitized content length: {len(sanitized_content)} chars") - logger.error(f"Sanitized content (first 500 chars): {repr(sanitized_content[:500])}") - logger.error(f"Sanitized content (last 200 chars): {repr(sanitized_content[-200:])}") + logger.error("Original response 
content redacted (%s)", _content_diagnostics(response_content)) + logger.error("Sanitized content redacted (%s)", _content_diagnostics(sanitized_content)) logger.error(f"Error position: line {e.lineno}, column {e.colno}, char {e.pos}") - if e.pos is not None and e.pos < len(sanitized_content): - # Show context around error position - start = max(0, e.pos - 50) - end = min(len(sanitized_content), e.pos + 50) - logger.error(f"Error context: ...{repr(sanitized_content[start:end])}...") raise except Exception as e: # Catch any other parsing errors logger.error(f"parse_json: Unexpected error during parsing - {type(e).__name__}: {e}") - logger.error(f"Response content: {repr(response_content[:1000])}") + logger.error("Response content redacted (%s)", _content_diagnostics(response_content)) raise # Handle array responses - extract first element diff --git a/backend/shared/lean4_client.py b/backend/shared/lean4_client.py index 310d4b3..e8d0fcf 100644 --- a/backend/shared/lean4_client.py +++ b/backend/shared/lean4_client.py @@ -52,6 +52,7 @@ class Lean4Result: """Result of one Lean 4 proof check.""" success: bool error_output: str = "" + diagnostic_output: str = "" goal_states: str = "" raw_stderr: str = "" tactic_error_slice: str = "" @@ -370,7 +371,7 @@ async def _run_lean_file_once( ) -> tuple[int, str, str]: temp_path = self.workspace_dir / temp_filename try: - temp_path.write_text(prepared_code, encoding="utf-8") + await asyncio.to_thread(temp_path.write_text, prepared_code, encoding="utf-8") return await self._run_process( [self.lake_path, "env", self.lean_path or self._resolve_executable("lean"), temp_filename], cwd=self.workspace_dir, @@ -379,7 +380,7 @@ async def _run_lean_file_once( finally: try: if temp_path.exists(): - temp_path.unlink() + await asyncio.to_thread(temp_path.unlink) except OSError: logger.debug("Could not remove temporary Lean file %s", temp_path) @@ -478,7 +479,7 @@ async def _repair_workspace_after_infrastructure_error(self, output: str) -> boo 
async with self._workspace_lock: self._workspace_unhealthy_error = "" self._workspace_ready = False - self._wipe_lake_directory() + await asyncio.to_thread(self._wipe_lake_directory) repaired = await self._ensure_workspace_locked() if not repaired: self._mark_workspace_unhealthy(output) @@ -940,7 +941,7 @@ def _extract_tactic_error_slice( ).strip() return error_slice, failing_tactic_index - async def check_proof(self, lean_code: str, timeout: int = 120) -> Lean4Result: + async def check_proof(self, lean_code: str, timeout: int = 120, *, allow_placeholders: bool = False) -> Lean4Result: """Write a temp Lean file, run Lean 4, and return structured feedback.""" if not system_config.lean4_enabled: return Lean4Result(success=False, error_output="Lean 4 is disabled in system configuration.") @@ -951,9 +952,10 @@ async def check_proof(self, lean_code: str, timeout: int = 120) -> Lean4Result: # Fast pre-check: reject placeholder proofs before invoking Lean so # the model learns the rejection reason even when Lean would have - # compiled the file with only a warning. + # compiled the file with only a warning. LeanOJ can opt out when it + # intentionally wants to harvest a compiling incomplete scaffold. 
placeholder = _detect_forbidden_placeholder(prepared_code) - if placeholder: + if placeholder and not allow_placeholders: return Lean4Result( success=False, error_output=_format_placeholder_rejection(placeholder, from_lean_diagnostic=False), @@ -1002,16 +1004,13 @@ async def check_proof(self, lean_code: str, timeout: int = 120) -> Lean4Result: has_error_diagnostic = "error:" in lowered has_sorry_warning = _output_contains_sorry_warning(combined_output) lean_exited_cleanly = returncode == 0 - positive_pass = ( - lean_exited_cleanly - and not has_error_diagnostic - and not has_sorry_warning - ) + positive_pass = lean_exited_cleanly and not has_error_diagnostic and (allow_placeholders or not has_sorry_warning) if positive_pass: return Lean4Result( success=True, error_output="", + diagnostic_output=combined_output, goal_states=goal_states, raw_stderr=stderr.strip(), ) @@ -1173,6 +1172,7 @@ async def _run_tactic_script_once( return Lean4Result( success=True, error_output="", + diagnostic_output=combined_output, goal_states=goal_states, raw_stderr=stderr.strip(), tactic_error_slice="", @@ -1586,6 +1586,7 @@ def _result_from_diagnostics( return Lean4Result( success=True, error_output="", + diagnostic_output=combined_output, goal_states=goal_states, raw_stderr=raw_stderr, ) @@ -1649,7 +1650,7 @@ async def _check_via_lsp( self._open_document_versions[uri] = version try: - temp_path.write_text(prepared_code, encoding="utf-8") + await asyncio.to_thread(temp_path.write_text, prepared_code, encoding="utf-8") await self._send_notification( "textDocument/didOpen", { @@ -1697,9 +1698,9 @@ async def _check_via_lsp( self._open_document_versions.pop(uri, None) with suppress(OSError): if temp_path.exists(): - temp_path.unlink() + await asyncio.to_thread(temp_path.unlink) - async def check_proof(self, lean_code: str, timeout: int = 120) -> Lean4Result: + async def check_proof(self, lean_code: str, timeout: int = 120, *, allow_placeholders: bool = False) -> Lean4Result: """Check a proof 
through the persistent Lean LSP when healthy, otherwise fall back.""" if not system_config.lean4_enabled: return Lean4Result(success=False, error_output="Lean 4 is disabled in system configuration.") @@ -1709,6 +1710,12 @@ async def check_proof(self, lean_code: str, timeout: int = 120) -> Lean4Result: return Lean4Result(success=False, error_output="No Lean 4 code was provided.") placeholder = _detect_forbidden_placeholder(prepared_code) + if placeholder and allow_placeholders: + return await self._subprocess_fallback.check_proof( + lean_code, + timeout=timeout, + allow_placeholders=True, + ) if placeholder: return Lean4Result( success=False, diff --git a/backend/shared/lean_proof_integrity.py b/backend/shared/lean_proof_integrity.py new file mode 100644 index 0000000..ee81e74 --- /dev/null +++ b/backend/shared/lean_proof_integrity.py @@ -0,0 +1,233 @@ +"""Shared integrity checks for Lean 4 proof outputs.""" +from __future__ import annotations + +import logging +import re +from dataclasses import dataclass, field +from typing import Optional + +from backend.autonomous.prompts.proof_prompts import build_proof_statement_alignment_prompt +from backend.shared.api_client_manager import api_client_manager +from backend.shared.json_parser import parse_json +from backend.shared.model_error_utils import is_non_retryable_model_error +from backend.shared.utils import count_tokens + +logger = logging.getLogger(__name__) + +_LEAN_DECL_NAME = r"(?:[A-Za-z_][A-Za-z0-9_'.]*|«[^»]+»)" + +_DECLARATION_DEVICE_COMMAND_RE = re.compile( + r"^\s*(?:@\[[^\]]+\]\s*)*(?:private\s+|protected\s+|noncomputable\s+|unsafe\s+)*" + r"(axiom|constant|opaque)\b(?P<body>.*?)" + r"(?=^\s*(?:@\[[^\]]+\]\s*)*(?:private\s+|protected\s+|noncomputable\s+|unsafe\s+)*" + r"(?:axiom|constant|opaque|theorem|lemma|def|example|import|namespace|section|end|open|" + r"variable|variables|structure|class|inductive|instance|abbrev)\b|\Z)", + re.MULTILINE | re.DOTALL, +) +_DECLARATION_NAME_RE = re.compile(_LEAN_DECL_NAME) 
+_DECLARATION_BINDER_RE = re.compile(rf"\(\s*({_LEAN_DECL_NAME}(?:\s+{_LEAN_DECL_NAME})*)\s*:") +_DECLARATION_LEADING_NAMES_RE = re.compile(rf"^\s*({_LEAN_DECL_NAME}(?:\s+{_LEAN_DECL_NAME})*)\s*(?::|:=|where\b|$)") + + +@dataclass +class LeanProofIntegrityResult: + """Result of non-Lean integrity checks applied after Lean accepts code.""" + valid: bool + reason: str = "" + category: str = "ok" + introduced_devices: list[str] = field(default_factory=list) + + +def strip_lean_comments_and_strings(code: str) -> str: + """Best-effort removal of comments and string literals before source scanning.""" + without_block_comments = re.sub(r"/-.*?-/", " ", code or "", flags=re.DOTALL) + without_line_comments = re.sub(r"--[^\n]*", " ", without_block_comments) + return re.sub(r'"(?:\\.|[^"\\])*"', ' "" ', without_line_comments) + + +def find_declaration_devices(code: str) -> set[tuple[str, str]]: + """Return axiom/constant/opaque declarations found in Lean source.""" + devices: set[tuple[str, str]] = set() + for match in _DECLARATION_DEVICE_COMMAND_RE.finditer(strip_lean_comments_and_strings(code)): + kind = match.group(1) + body = match.group("body") or "" + names: list[str] = [] + + for binder_match in _DECLARATION_BINDER_RE.finditer(body): + names.extend(name.group(0) for name in _DECLARATION_NAME_RE.finditer(binder_match.group(1))) + + if not names: + leading_match = _DECLARATION_LEADING_NAMES_RE.match(body) + if leading_match: + names.extend(name.group(0) for name in _DECLARATION_NAME_RE.finditer(leading_match.group(1))) + + for name in names: + devices.add((kind, name)) + return devices + + +def find_introduced_declaration_devices(lean_code: str, allowed_baseline: str = "") -> list[str]: + """Return declaration devices present in ``lean_code`` but absent from baseline.""" + allowed = find_declaration_devices(allowed_baseline) + introduced: list[str] = [] + for kind, name in sorted(find_declaration_devices(lean_code)): + if (kind, name) not in allowed: + 
introduced.append(f"{kind} {name}") + return introduced + + +def validate_lean_proof_integrity( + *, + lean_code: str, + allowed_baseline: str = "", +) -> LeanProofIntegrityResult: + """Reject fake declaration devices that Lean accepts but MOTO does not.""" + introduced = find_introduced_declaration_devices( + lean_code=lean_code, + allowed_baseline=allowed_baseline, + ) + if introduced: + return LeanProofIntegrityResult( + valid=False, + category="forbidden_declaration_device", + introduced_devices=introduced, + reason=( + "LEAN PROOF INTEGRITY REJECTED: the submitted Lean code introduces new " + "axiom/constant/opaque declarations not present in the allowed baseline: " + f"{', '.join(introduced[:8])}. Do not prove results by adding fake assumptions; " + "use constructive Lean/Mathlib proof terms or tactics." + ), + ) + return LeanProofIntegrityResult(valid=True) + + +async def validate_lean_statement_alignment( + *, + user_prompt: str, + theorem_statement: str, + formal_sketch: str, + lean_code: str, + source_excerpt: str, + validator_model: str, + validator_context: int, + validator_max_tokens: int, + task_id: str, + role_id: str, +) -> LeanProofIntegrityResult: + """Use an LLM validator to ensure accepted Lean code matches the intended claim.""" + prompt = build_proof_statement_alignment_prompt( + user_prompt=user_prompt, + theorem_statement=theorem_statement, + formal_sketch=formal_sketch, + lean_code=lean_code, + source_excerpt=source_excerpt, + ) + max_input_tokens = validator_context - validator_max_tokens + trimmed_excerpt = source_excerpt or "" + while count_tokens(prompt) > max_input_tokens and len(trimmed_excerpt) > 1500: + trimmed_excerpt = trimmed_excerpt[: max(len(trimmed_excerpt) // 2, 1500)] + prompt = build_proof_statement_alignment_prompt( + user_prompt=user_prompt, + theorem_statement=theorem_statement, + formal_sketch=formal_sketch, + lean_code=lean_code, + source_excerpt=trimmed_excerpt, + ) + + try: + response = await 
api_client_manager.generate_completion( + task_id=task_id, + role_id=role_id, + model=validator_model, + messages=[{"role": "user", "content": prompt}], + max_tokens=validator_max_tokens, + temperature=0.0, + ) + if not response or not response.get("choices"): + return LeanProofIntegrityResult( + valid=False, + category="statement_alignment_unavailable", + reason="LEAN PROOF INTEGRITY REJECTED: statement-alignment validator returned no response.", + ) + message = response["choices"][0].get("message", {}) + content = message.get("content") or message.get("reasoning") or "" + if not content: + return LeanProofIntegrityResult( + valid=False, + category="statement_alignment_unavailable", + reason="LEAN PROOF INTEGRITY REJECTED: statement-alignment validator returned empty content.", + ) + data = parse_json(content) + if isinstance(data, list): + data = data[0] if data else {} + if not isinstance(data, dict): + data = {} + except Exception as exc: + if is_non_retryable_model_error(exc): + raise + logger.warning("Lean statement alignment validation failed: %s", exc) + return LeanProofIntegrityResult( + valid=False, + category="statement_alignment_unavailable", + reason=( + "LEAN PROOF INTEGRITY REJECTED: statement-alignment validation failed before " + f"a usable decision was produced: {type(exc).__name__}: {exc}" + ), + ) + + decision = str(data.get("decision") or "").strip().lower() + reasoning = str(data.get("reasoning") or data.get("summary") or "").strip() + if decision != "accept": + return LeanProofIntegrityResult( + valid=False, + category="statement_alignment_rejected", + reason=( + "LEAN PROOF INTEGRITY REJECTED: Lean accepted the code, but the statement-alignment " + f"validator rejected it as unrelated or insufficient. 
{reasoning}" + ).strip(), + ) + return LeanProofIntegrityResult(valid=True, reason=reasoning, category="statement_alignment") + + +async def validate_full_lean_proof_integrity( + *, + user_prompt: str, + theorem_statement: str, + formal_sketch: str, + lean_code: str, + source_excerpt: str, + allowed_baseline: str, + validator_model: Optional[str] = None, + validator_context: int = 131072, + validator_max_tokens: int = 25000, + task_id: str = "proof_integrity_000", + role_id: str = "proof_integrity_validator", + require_statement_alignment: bool = True, +) -> LeanProofIntegrityResult: + """Run all post-Lean integrity checks used by proof-producing systems.""" + structural = validate_lean_proof_integrity( + lean_code=lean_code, + allowed_baseline=allowed_baseline, + ) + if not structural.valid: + return structural + if not require_statement_alignment: + return structural + if not validator_model: + return LeanProofIntegrityResult( + valid=False, + category="statement_alignment_unavailable", + reason="LEAN PROOF INTEGRITY REJECTED: no validator model was configured for statement alignment.", + ) + return await validate_lean_statement_alignment( + user_prompt=user_prompt, + theorem_statement=theorem_statement, + formal_sketch=formal_sketch, + lean_code=lean_code, + source_excerpt=source_excerpt, + validator_model=validator_model, + validator_context=validator_context, + validator_max_tokens=validator_max_tokens, + task_id=task_id, + role_id=role_id, + ) diff --git a/backend/shared/lm_studio_client.py b/backend/shared/lm_studio_client.py index ebc0e1a..3b0b8fe 100644 --- a/backend/shared/lm_studio_client.py +++ b/backend/shared/lm_studio_client.py @@ -15,9 +15,10 @@ import asyncio import time import os +import re from pathlib import Path from datetime import datetime -from typing import List, Dict, Any, Optional +from typing import List, Dict, Any, Optional, Tuple from backend.shared.config import rag_config, system_config import logging @@ -27,8 +28,23 @@ 
Path(system_config.logs_dir).mkdir(parents=True, exist_ok=True) +def _sanitize_lm_studio_error_text(value: Any, max_chars: int = 500) -> str: + """Return a bounded LM Studio diagnostic without echoed prompts or secrets.""" + text = str(value or "") + text = re.sub(r"(Bearer\s+)[A-Za-z0-9._~+\-/=]+", r"\1[redacted]", text, flags=re.IGNORECASE) + text = re.sub(r'("api[_-]?key"\s*:\s*)"[^"]*"', r'\1"[redacted]"', text, flags=re.IGNORECASE) + text = re.sub(r'("messages"\s*:\s*)\[[\s\S]*?\]', r'\1[redacted]', text, flags=re.IGNORECASE) + text = re.sub(r'("prompt"\s*:\s*)"[\s\S]*?"', r'\1"[redacted]"', text, flags=re.IGNORECASE) + if len(text) > max_chars: + return text[:max_chars] + "...[truncated]" + return text + + class LMStudioClient: """Client for LM Studio API.""" + ROUTING_METADATA_KEY = "_moto_lm_studio_routing" + INSTANCE_REGISTRY_TTL_SECONDS = 5.0 + _NUMERIC_INSTANCE_SUFFIX_RE = re.compile(r"^(?P<base>.+):(?P<instance>\d+)$") # Embedding performance settings EMBEDDING_BATCH_SIZE = 100 # Process embeddings in batches of 100 @@ -56,11 +72,76 @@ def __init__(self, base_url: str = None): keepalive_expiry=30.0 ) ) + self._loaded_instance_groups: Dict[str, List[str]] = {} + self._loaded_instance_cache_at = 0.0 + self._instance_registry_lock = asyncio.Lock() + self._inflight_by_model: Dict[str, int] = {} + self._inflight_lock = asyncio.Lock() + + @classmethod + def split_numeric_instance_suffix(cls, model: str) -> Tuple[str, Optional[int]]: + """Split LM Studio's final numeric `:#` instance suffix, if present.""" + model_id = (model or "").strip() + match = cls._NUMERIC_INSTANCE_SUFFIX_RE.match(model_id) + if not match: + return model_id, None + return match.group("base"), int(match.group("instance")) + + @classmethod + def normalize_instance_base(cls, model: str) -> str: + """Return the same-base model key used to group LM Studio sibling instances.""" + base, _ = cls.split_numeric_instance_suffix(model) + return base + + @classmethod + def has_numeric_instance_suffix(cls, model: 
str) -> bool: + """Return True only for LM Studio-style numeric instance IDs like `model:2`.""" + _, instance = cls.split_numeric_instance_suffix(model) + return instance is not None + + @classmethod + def build_instance_groups(cls, loaded_models: List[str]) -> Dict[str, List[str]]: + """Group loaded LM Studio model IDs by same-base instance family.""" + groups: Dict[str, List[str]] = {} + for model in loaded_models or []: + model_id = (model or "").strip() + if not model_id: + continue + base = cls.normalize_instance_base(model_id) + groups.setdefault(base, []).append(model_id) + + for base, models in groups.items(): + groups[base] = sorted( + dict.fromkeys(models), + key=lambda item: ( + cls.split_numeric_instance_suffix(item)[1] + if cls.split_numeric_instance_suffix(item)[1] is not None + else 0, + item, + ), + ) + return groups + + @classmethod + def get_sibling_instances_from_loaded(cls, model: str, loaded_models: List[str]) -> List[str]: + """Return same-base loaded instances for a requested model.""" + base = cls.normalize_instance_base(model) + siblings = cls.build_instance_groups(loaded_models).get(base, []) + if len(siblings) < 2: + return [] + if not any(cls.has_numeric_instance_suffix(candidate) for candidate in siblings): + return [] + return siblings + + @classmethod + def count_sibling_instances_from_loaded(cls, model: str, loaded_models: List[str]) -> int: + """Count same-base loaded LM Studio instances for scheduler decisions.""" + return len(cls.get_sibling_instances_from_loaded(model, loaded_models)) async def _get_model_semaphore(self, model: str) -> asyncio.Semaphore: """ Get or create semaphore for a specific model. - Each model gets its own semaphore (limit=1) to prevent concurrent requests. + Each model gets its own semaphore to bound concurrent requests. Different models can run in parallel. 
Args: @@ -71,8 +152,9 @@ async def _get_model_semaphore(self, model: str) -> asyncio.Semaphore: """ async with self._semaphore_lock: if model not in self._model_semaphores: - self._model_semaphores[model] = asyncio.Semaphore(1) - logger.debug(f"Created semaphore for model: {model}") + limit = max(1, int(system_config.max_model_concurrency_per_model or 1)) + self._model_semaphores[model] = asyncio.Semaphore(limit) + logger.debug(f"Created semaphore for model: {model} (limit={limit})") return self._model_semaphores[model] async def list_models(self) -> List[Dict[str, Any]]: @@ -151,6 +233,142 @@ async def get_loaded_models(self) -> List[str]: except Exception as e: logger.error(f"Failed to get loaded models: {e}") return [] + + async def get_loaded_instance_groups(self, force_refresh: bool = False) -> Dict[str, List[str]]: + """Return cached same-base groups for loaded LM Studio instances.""" + now = time.monotonic() + async with self._instance_registry_lock: + cache_is_fresh = ( + not force_refresh + and self._loaded_instance_cache_at > 0 + and now - self._loaded_instance_cache_at < self.INSTANCE_REGISTRY_TTL_SECONDS + ) + if cache_is_fresh: + return {base: list(models) for base, models in self._loaded_instance_groups.items()} + + try: + loaded_models = await self.get_loaded_models() + except Exception as exc: + logger.debug(f"LM Studio instance registry refresh failed: {exc}") + loaded_models = [] + + self._loaded_instance_groups = self.build_instance_groups(loaded_models) + self._loaded_instance_cache_at = time.monotonic() + return {base: list(models) for base, models in self._loaded_instance_groups.items()} + + async def count_loaded_sibling_instances(self, model: str, loaded_models: Optional[List[str]] = None) -> int: + """Count loaded same-base instances for a requested model.""" + if loaded_models is not None: + return self.count_sibling_instances_from_loaded(model, loaded_models) + groups = await self.get_loaded_instance_groups() + siblings = 
groups.get(self.normalize_instance_base(model), []) + if len(siblings) < 2: + return 0 + if not any(self.has_numeric_instance_suffix(candidate) for candidate in siblings): + return 0 + return len(siblings) + + @classmethod + def extract_routing_metadata(cls, response: Optional[Dict[str, Any]]) -> Dict[str, Any]: + """Return LM Studio instance-routing metadata attached to a response.""" + if not isinstance(response, dict): + return {} + metadata = response.get(cls.ROUTING_METADATA_KEY) + if isinstance(metadata, dict): + return metadata.copy() + return {} + + def _attach_routing_metadata( + self, + response: Dict[str, Any], + *, + requested_model: str, + actual_model: str, + sibling_instances: Optional[List[str]] = None, + ) -> Dict[str, Any]: + """Attach requested/actual LM Studio instance details to a response.""" + if not isinstance(response, dict): + return response + base_model = self.normalize_instance_base(requested_model) + response[self.ROUTING_METADATA_KEY] = { + "requested_model": requested_model, + "actual_model": actual_model, + "base_model": base_model, + "shared_instance": actual_model != requested_model, + "sibling_instances": list(sibling_instances or []), + } + return response + + async def _select_completion_model(self, requested_model: str) -> Tuple[str, List[str]]: + """ + Choose an idle same-base LM Studio instance for a completion. + + Discovery failures are fail-closed: return the requested model. + """ + groups = await self.get_loaded_instance_groups() + base_model = self.normalize_instance_base(requested_model) + siblings = groups.get(base_model, []) + if len(siblings) < 2 or not any(self.has_numeric_instance_suffix(candidate) for candidate in siblings): + siblings = [] + + # If discovery found loaded siblings for this base, only dispatch to + # those concrete loaded IDs. This also supports callers configured with + # the unsuffixed base model while LM Studio exposes `base:1`, `base:2`. 
+ candidates = siblings or [requested_model] + + requested_base = self.normalize_instance_base(requested_model) + candidates = [ + candidate + for candidate in dict.fromkeys(candidates) + if self.normalize_instance_base(candidate) == requested_base + ] + if not candidates: + candidates = [requested_model] + + async with self._inflight_lock: + idle_candidates = [ + candidate + for candidate in candidates + if self._inflight_by_model.get(candidate, 0) <= 0 + ] + if idle_candidates: + selected_model = min( + idle_candidates, + key=lambda candidate: ( + candidate != requested_model, + self.split_numeric_instance_suffix(candidate)[1] + if self.split_numeric_instance_suffix(candidate)[1] is not None + else 0, + candidate, + ), + ) + elif requested_model in candidates: + selected_model = requested_model + else: + # Unsuffixed configs may only have concrete loaded `base:#` + # instances. If they are all busy, queue on the least-loaded + # concrete instance rather than sending an unloaded base ID. + selected_model = min( + candidates, + key=lambda candidate: ( + self._inflight_by_model.get(candidate, 0), + self.split_numeric_instance_suffix(candidate)[1] + if self.split_numeric_instance_suffix(candidate)[1] is not None + else 0, + candidate, + ), + ) + self._inflight_by_model[selected_model] = self._inflight_by_model.get(selected_model, 0) + 1 + return selected_model, siblings + + async def _release_completion_model(self, actual_model: str) -> None: + """Release in-flight accounting for a selected LM Studio instance.""" + async with self._inflight_lock: + current = self._inflight_by_model.get(actual_model, 0) + if current <= 1: + self._inflight_by_model.pop(actual_model, None) + else: + self._inflight_by_model[actual_model] = current - 1 async def generate_completion( self, @@ -171,22 +389,39 @@ async def generate_completion( tools: Optional OpenAI-compatible tool schemas (LM Studio 0.3+). tool_choice: Optional tool-choice directive. 
""" + requested_model = model # Get model-specific semaphore (allows different models to run in parallel) if skip_semaphore: # Direct execution without semaphore - return await self._execute_completion_request( + response = await self._execute_completion_request( model, messages, temperature, max_tokens, response_format, tools=tools, tool_choice=tool_choice, ) + return self._attach_routing_metadata( + response, + requested_model=requested_model, + actual_model=model, + sibling_instances=[], + ) - model_semaphore = await self._get_model_semaphore(model) + actual_model, sibling_instances = await self._select_completion_model(requested_model) + model_semaphore = await self._get_model_semaphore(actual_model) - # ACQUIRE THIS MODEL'S SEMAPHORE to prevent concurrent requests to same model - async with model_semaphore: - return await self._execute_completion_request( - model, messages, temperature, max_tokens, response_format, - tools=tools, tool_choice=tool_choice, - ) + # Bound same-model parallelism so multi-submitter phases can overlap without unbounded fanout. 
+ try: + async with model_semaphore: + response = await self._execute_completion_request( + actual_model, messages, temperature, max_tokens, response_format, + tools=tools, tool_choice=tool_choice, + ) + return self._attach_routing_metadata( + response, + requested_model=requested_model, + actual_model=actual_model, + sibling_instances=sibling_instances, + ) + finally: + await self._release_completion_model(actual_model) async def _execute_completion_request( self, @@ -247,7 +482,8 @@ async def _execute_completion_request( except httpx.HTTPStatusError as e: if e.response.status_code == 400: - error_detail = e.response.text if hasattr(e.response, 'text') else str(e) + raw_error_detail = e.response.text if hasattr(e.response, 'text') else str(e) + error_detail = _sanitize_lm_studio_error_text(raw_error_detail) logger.error( f"LM Studio 400 Bad Request (attempt {attempt + 1}/{max_retries + 1}): " f"model={model}, approx_tokens={approx_tokens}, " @@ -444,7 +680,7 @@ async def test_connection(self) -> bool: logger.error(f"Failed to connect to LM Studio: {e}") return False - async def check_availability(self) -> Dict[str, Any]: + async def check_availability(self, include_cli_models: bool = False) -> Dict[str, Any]: """ Check if LM Studio server is reachable and has models loaded. @@ -472,13 +708,9 @@ async def check_availability(self) -> Dict[str, Any]: # Server is reachable result["available"] = True - # Extract models from the /v1/models response as a reliable fallback. - # The `lms ps` CLI is preferred (it returns instance IDs), but the CLI - # may be missing from PATH or slow/timing out during startup while - # nomic is still loading. In either case we must NOT downgrade a - # successful /v1/models response to "no models" — that produces a - # phantom "LM Studio Offline" state even though embedding calls - # are succeeding. + # Extract models from the /v1/models response. 
Routine availability + # checks use HTTP only; the `lms ps` CLI can hang or crash under load + # on Windows, so reserve it for explicit diagnostics. http_models: List[str] = [] try: data = response.json() @@ -490,7 +722,7 @@ async def check_availability(self) -> Dict[str, Any]: except Exception as parse_err: logger.debug(f"Could not parse /v1/models response body: {parse_err}") - cli_models = await self.get_loaded_models() + cli_models = await self.get_loaded_models() if include_cli_models else [] if cli_models: models = cli_models @@ -557,7 +789,6 @@ async def test_model_compatibility(self, model_name: str) -> tuple[bool, str, di "completion_tokens": completion_tokens, "prompt_tokens": prompt_tokens, "content_length": len(content), - "content_preview": content[:100] if content else "(empty)" } # Check 1: Empty or whitespace-only response @@ -581,11 +812,15 @@ async def test_model_compatibility(self, model_name: str) -> tuple[bool, str, di sanitized_content = sanitize_json_response(content) parsed_json = json.loads(sanitized_content) - logger.info(f"Model '{model_name}' produced valid JSON: {parsed_json}") + logger.info( + "Model '%s' produced valid JSON with keys: %s", + model_name, + sorted(parsed_json.keys()) if isinstance(parsed_json, dict) else type(parsed_json).__name__, + ) except json.JSONDecodeError as json_err: error = f"Model '{model_name}' FAILED to produce valid JSON: {json_err}" logger.error(f"Compatibility test FAILED: {error}") - logger.error(f"Response content: {content}") + logger.error("Response content redacted (length=%d)", len(content or "")) logger.error(f"Details: {details}") return (False, error, details) diff --git a/backend/shared/log_redaction.py b/backend/shared/log_redaction.py new file mode 100644 index 0000000..06a797b --- /dev/null +++ b/backend/shared/log_redaction.py @@ -0,0 +1,37 @@ +""" +Small helpers for removing obvious secrets from locally persisted log previews. 
+"""
+from __future__ import annotations
+
+import re
+from typing import Any
+
+
+_SECRET_PATTERNS = (
+    re.compile(r"(Bearer\s+)[A-Za-z0-9._~+/=-]+", re.IGNORECASE),
+    re.compile(r'("(?:api[_-]?key|appid|authorization|password|token|secret)"\s*:\s*)"[^"]*"', re.IGNORECASE),
+    re.compile(r"((?:api[_-]?key|appid|authorization|password|token|secret)\s*[=:]\s*)[^\s,&}\]]+", re.IGNORECASE),
+    re.compile(r"\bsk-or-v1-[A-Za-z0-9._~+/=-]+", re.IGNORECASE),
+)
+
+
+def redact_log_text(value: Any, max_chars: int | None = None) -> str:
+    """Return text with common credential shapes redacted and optionally capped."""
+    text = str(value or "")
+    for pattern in _SECRET_PATTERNS:
+        text = pattern.sub(
+            lambda match: f"{match.group(1) if match.lastindex else ''}[redacted]",
+            text,
+        )
+
+    # Prevent log forging by keeping caller-controlled values on one line.
+    text = (
+        text
+        .replace("\r", "\\r")
+        .replace("\n", "\\n")
+        .replace("\t", "\\t")
+    )
+
+    if max_chars is not None and max_chars >= 0 and len(text) > max_chars:
+        return text[:max_chars] + "...[truncated]"
+    return text
diff --git a/backend/shared/model_error_utils.py b/backend/shared/model_error_utils.py
new file mode 100644
index 0000000..d6455b9
--- /dev/null
+++ b/backend/shared/model_error_utils.py
@@ -0,0 +1,39 @@
+"""Helpers for distinguishing model availability failures from ordinary output errors."""
+from __future__ import annotations
+
+from backend.shared.openrouter_client import (
+    CreditExhaustionError,
+    FreeModelExhaustedError,
+    OpenRouterPrivacyPolicyError,
+)
+
+
+_NON_RETRYABLE_MODEL_ERROR_MARKERS = (
+    "account free credits exhausted",
+    "all free model options exhausted",
+    "and no fallback configured",
+    "and no lm studio fallback",
+    "boost requested but no openrouter api key",
+    "free credits exhausted",
+    "no api key is set",
+    "no fallback configured",
+    "no lm studio fallback",
+    "no openrouter api key is available",
+    "openrouter privacy settings are
blocking", +) + + +def is_non_retryable_model_error(exc: Exception) -> bool: + """Return true when a model/API failure should halt workflow progress.""" + if isinstance( + exc, + ( + CreditExhaustionError, + FreeModelExhaustedError, + OpenRouterPrivacyPolicyError, + ), + ): + return True + message = str(exc).lower() + return any(marker in message for marker in _NON_RETRYABLE_MODEL_ERROR_MARKERS) diff --git a/backend/shared/models.py b/backend/shared/models.py index 2532b4e..dee2328 100644 --- a/backend/shared/models.py +++ b/backend/shared/models.py @@ -5,7 +5,12 @@ from datetime import datetime from typing import List, Dict, Optional, Any, Literal -from pydantic import BaseModel, Field +from pydantic import BaseModel, ConfigDict, Field + +DEFAULT_CONTEXT_WINDOW = 131072 +DEFAULT_MAX_OUTPUT_TOKENS = 25000 +DEFAULT_OPENROUTER_REASONING_EFFORT = "auto" +OpenRouterReasoningEffort = Literal["auto", "xhigh", "high", "medium", "low", "minimal", "none"] class DocumentChunk(BaseModel): @@ -105,9 +110,11 @@ class ModelConfig(BaseModel): model_id: str openrouter_model_id: Optional[str] = None # For OpenRouter (different naming) openrouter_provider: Optional[str] = None # Specific OpenRouter provider (e.g., "Anthropic") + openrouter_reasoning_effort: OpenRouterReasoningEffort = DEFAULT_OPENROUTER_REASONING_EFFORT lm_studio_fallback_id: Optional[str] = None # Fallback LM Studio model if OpenRouter fails - context_window: int = 131072 - max_output_tokens: int = 25000 + context_window: int = DEFAULT_CONTEXT_WINDOW + max_output_tokens: int = DEFAULT_MAX_OUTPUT_TOKENS + supercharge_enabled: bool = False class BoostConfig(BaseModel): @@ -116,8 +123,9 @@ class BoostConfig(BaseModel): openrouter_api_key: str = "" boost_model_id: str = "" # OpenRouter model to use for boost boost_provider: Optional[str] = None # Specific provider, or None to let OpenRouter choose - boost_context_window: int = 131072 - boost_max_output_tokens: int = 25000 + boost_reasoning_effort: 
OpenRouterReasoningEffort = DEFAULT_OPENROUTER_REASONING_EFFORT + boost_context_window: int = DEFAULT_CONTEXT_WINDOW + boost_max_output_tokens: int = DEFAULT_MAX_OUTPUT_TOKENS class FreeModelSettings(BaseModel): @@ -144,9 +152,11 @@ class SubmitterConfig(BaseModel): provider: Literal["lm_studio", "openrouter"] = "lm_studio" model_id: str # LM Studio model OR OpenRouter model based on provider openrouter_provider: Optional[str] = None # Specific OpenRouter provider (e.g., "Anthropic") + openrouter_reasoning_effort: OpenRouterReasoningEffort = DEFAULT_OPENROUTER_REASONING_EFFORT lm_studio_fallback_id: Optional[str] = None # Fallback LM Studio model if OpenRouter fails - context_window: int = 131072 - max_output_tokens: int = 25000 + context_window: int = DEFAULT_CONTEXT_WINDOW + max_output_tokens: int = DEFAULT_MAX_OUTPUT_TOKENS + supercharge_enabled: bool = False class AggregatorStartRequest(BaseModel): @@ -157,9 +167,11 @@ class AggregatorStartRequest(BaseModel): validator_provider: Literal["lm_studio", "openrouter"] = "lm_studio" validator_model: str # LM Studio model OR OpenRouter model based on provider validator_openrouter_provider: Optional[str] = None # Specific OpenRouter provider + validator_openrouter_reasoning_effort: OpenRouterReasoningEffort = DEFAULT_OPENROUTER_REASONING_EFFORT validator_lm_studio_fallback: Optional[str] = None # Fallback if OpenRouter fails - validator_context_size: int = 131072 - validator_max_output_tokens: int = 25000 + validator_context_size: int = DEFAULT_CONTEXT_WINDOW + validator_max_output_tokens: int = DEFAULT_MAX_OUTPUT_TOKENS + validator_supercharge_enabled: bool = False uploaded_files: List[str] = Field(default_factory=list) @@ -280,30 +292,38 @@ class CompilerStartRequest(BaseModel): validator_provider: Literal["lm_studio", "openrouter"] = "lm_studio" validator_model: str validator_openrouter_provider: Optional[str] = None + validator_openrouter_reasoning_effort: OpenRouterReasoningEffort = 
DEFAULT_OPENROUTER_REASONING_EFFORT validator_lm_studio_fallback: Optional[str] = None - validator_context_size: int = 131072 - validator_max_output_tokens: int = 25000 + validator_context_size: int = DEFAULT_CONTEXT_WINDOW + validator_max_output_tokens: int = DEFAULT_MAX_OUTPUT_TOKENS + validator_supercharge_enabled: bool = False # High-context submitter config high_context_provider: Literal["lm_studio", "openrouter"] = "lm_studio" high_context_model: str high_context_openrouter_provider: Optional[str] = None + high_context_openrouter_reasoning_effort: OpenRouterReasoningEffort = DEFAULT_OPENROUTER_REASONING_EFFORT high_context_lm_studio_fallback: Optional[str] = None - high_context_context_size: int = 131072 - high_context_max_output_tokens: int = 25000 + high_context_context_size: int = DEFAULT_CONTEXT_WINDOW + high_context_max_output_tokens: int = DEFAULT_MAX_OUTPUT_TOKENS + high_context_supercharge_enabled: bool = False # High-param submitter config high_param_provider: Literal["lm_studio", "openrouter"] = "lm_studio" high_param_model: str high_param_openrouter_provider: Optional[str] = None + high_param_openrouter_reasoning_effort: OpenRouterReasoningEffort = DEFAULT_OPENROUTER_REASONING_EFFORT high_param_lm_studio_fallback: Optional[str] = None - high_param_context_size: int = 131072 - high_param_max_output_tokens: int = 25000 + high_param_context_size: int = DEFAULT_CONTEXT_WINDOW + high_param_max_output_tokens: int = DEFAULT_MAX_OUTPUT_TOKENS + high_param_supercharge_enabled: bool = False # Critique submitter config critique_submitter_provider: Literal["lm_studio", "openrouter"] = "lm_studio" critique_submitter_model: str critique_submitter_openrouter_provider: Optional[str] = None + critique_submitter_openrouter_reasoning_effort: OpenRouterReasoningEffort = DEFAULT_OPENROUTER_REASONING_EFFORT critique_submitter_lm_studio_fallback: Optional[str] = None - critique_submitter_context_window: int = 131072 - critique_submitter_max_tokens: int = 25000 + 
critique_submitter_context_window: int = DEFAULT_CONTEXT_WINDOW + critique_submitter_max_tokens: int = DEFAULT_MAX_OUTPUT_TOKENS + critique_submitter_supercharge_enabled: bool = False # ============================================================================ @@ -331,7 +351,7 @@ class PaperMetadata(BaseModel): word_count: int = 0 source_brainstorm_ids: List[str] = Field(default_factory=list) referenced_papers: List[str] = Field(default_factory=list) - status: Literal["in_progress", "complete", "archived"] = "complete" + status: Literal["in_progress", "complete", "archived", "pruned"] = "complete" created_at: datetime = Field(default_factory=datetime.now) # Per-paper model tracking: model_id -> API call count model_usage: Optional[Dict[str, int]] = None @@ -339,6 +359,10 @@ class PaperMetadata(BaseModel): generation_date: Optional[datetime] = None # Wolfram Alpha verification count (tracked separately from LLM API calls) wolfram_calls: Optional[int] = None + # Pruned papers are preserved for users but excluded from all model context. 
+ pruned_at: Optional[datetime] = None + pruned_reason: Optional[str] = None + pruned_by: Optional[Literal["system", "user", "legacy"]] = None class TopicSelectionSubmission(BaseModel): @@ -422,6 +446,7 @@ class AutonomousResearchState(BaseModel): total_brainstorms_completed: int = 0 total_papers_completed: int = 0 total_papers_archived: int = 0 + total_papers_pruned: int = 0 total_submissions_accepted: int = 0 total_submissions_rejected: int = 0 topic_selection_rejections: int = 0 @@ -438,30 +463,38 @@ class AutonomousResearchStartRequest(BaseModel): validator_provider: Literal["lm_studio", "openrouter"] = "lm_studio" validator_model: str validator_openrouter_provider: Optional[str] = None + validator_openrouter_reasoning_effort: OpenRouterReasoningEffort = DEFAULT_OPENROUTER_REASONING_EFFORT validator_lm_studio_fallback: Optional[str] = None - validator_context_window: int = 131072 - validator_max_tokens: int = 25000 + validator_context_window: int = DEFAULT_CONTEXT_WINDOW + validator_max_tokens: int = DEFAULT_MAX_OUTPUT_TOKENS + validator_supercharge_enabled: bool = False # Compiler high-context settings (separate from aggregator submitters) high_context_provider: Literal["lm_studio", "openrouter"] = "lm_studio" high_context_model: str = "" # Empty string allowed, will use submitter model as fallback high_context_openrouter_provider: Optional[str] = None + high_context_openrouter_reasoning_effort: OpenRouterReasoningEffort = DEFAULT_OPENROUTER_REASONING_EFFORT high_context_lm_studio_fallback: Optional[str] = None - high_context_context_window: int = 131072 - high_context_max_tokens: int = 25000 + high_context_context_window: int = DEFAULT_CONTEXT_WINDOW + high_context_max_tokens: int = DEFAULT_MAX_OUTPUT_TOKENS + high_context_supercharge_enabled: bool = False # Compiler high-param settings high_param_provider: Literal["lm_studio", "openrouter"] = "lm_studio" high_param_model: str = "" # Empty string allowed, will use submitter model as fallback 
high_param_openrouter_provider: Optional[str] = None + high_param_openrouter_reasoning_effort: OpenRouterReasoningEffort = DEFAULT_OPENROUTER_REASONING_EFFORT high_param_lm_studio_fallback: Optional[str] = None - high_param_context_window: int = 131072 - high_param_max_tokens: int = 25000 + high_param_context_window: int = DEFAULT_CONTEXT_WINDOW + high_param_max_tokens: int = DEFAULT_MAX_OUTPUT_TOKENS + high_param_supercharge_enabled: bool = False # Critique submitter settings critique_submitter_provider: Literal["lm_studio", "openrouter"] = "lm_studio" critique_submitter_model: str = "" # For critique generation and rewrite decisions (uses high_context if empty) critique_submitter_openrouter_provider: Optional[str] = None + critique_submitter_openrouter_reasoning_effort: OpenRouterReasoningEffort = DEFAULT_OPENROUTER_REASONING_EFFORT critique_submitter_lm_studio_fallback: Optional[str] = None - critique_submitter_context_window: int = 131072 - critique_submitter_max_tokens: int = 25000 + critique_submitter_context_window: int = DEFAULT_CONTEXT_WINDOW + critique_submitter_max_tokens: int = DEFAULT_MAX_OUTPUT_TOKENS + critique_submitter_supercharge_enabled: bool = False # Tier 3 Final Answer settings tier3_enabled: bool = False # Default OFF — system stops at Tier 2 paper library @@ -520,9 +553,11 @@ class ProofRoleConfigSnapshot(BaseModel): provider: Literal["lm_studio", "openrouter"] = "lm_studio" model_id: str = "" openrouter_provider: Optional[str] = None + openrouter_reasoning_effort: OpenRouterReasoningEffort = DEFAULT_OPENROUTER_REASONING_EFFORT lm_studio_fallback_id: Optional[str] = None - context_window: int = 131072 - max_output_tokens: int = 25000 + context_window: int = DEFAULT_CONTEXT_WINDOW + max_output_tokens: int = DEFAULT_MAX_OUTPUT_TOKENS + supercharge_enabled: bool = False class ProofRuntimeConfigSnapshot(BaseModel): @@ -555,7 +590,9 @@ class ProofAttemptFeedback(BaseModel): reasoning: str = "" lean_code: str = "" error_output: str = "" + 
diagnostic_output: str = "" goal_states: str = "" + raw_stderr: str = "" strategy: Literal["full_script", "tactic_script"] = "full_script" tactic_trace: List[str] = Field(default_factory=list) success: bool = False @@ -568,7 +605,7 @@ class ProofRecord(BaseModel): theorem_statement: str theorem_name: str = "" formal_sketch: str = "" - source_type: Literal["brainstorm", "paper"] + source_type: Literal["brainstorm", "paper", "leanoj_subproof", "leanoj_final"] source_id: str source_title: str = "" solver: str = "Lean 4" @@ -610,19 +647,143 @@ class ProofCheckRequest(BaseModel): """Request body for manually triggering a proof check.""" source_type: Literal["brainstorm", "paper"] source_id: str + proof_runtime_config: Optional[Dict[str, Any]] = None class ProofSettingsUpdateRequest(BaseModel): """Request body for updating runtime Lean 4 proof settings.""" + model_config = ConfigDict(extra="forbid") + enabled: bool timeout: int = Field(default=120, ge=10, le=3600) lean4_lsp_enabled: Optional[bool] = None lean4_lsp_idle_timeout: Optional[int] = Field(default=None, ge=60, le=7200) smt_enabled: Optional[bool] = None - z3_path: Optional[str] = None smt_timeout: Optional[int] = Field(default=None, ge=1, le=600) +# ============================================================================ +# LEANOJ PROOF SOLVER MODELS +# ============================================================================ + + +class LeanOJRoleConfig(BaseModel): + """Model/runtime configuration for one LeanOJ proof-solver role.""" + provider: Literal["lm_studio", "openrouter"] = "lm_studio" + model_id: str = "" + openrouter_provider: Optional[str] = None + openrouter_reasoning_effort: OpenRouterReasoningEffort = DEFAULT_OPENROUTER_REASONING_EFFORT + lm_studio_fallback_id: Optional[str] = None + context_window: int = DEFAULT_CONTEXT_WINDOW + max_output_tokens: int = DEFAULT_MAX_OUTPUT_TOKENS + supercharge_enabled: bool = False + + +class LeanOJStartRequest(BaseModel): + """Request to start the LeanOJ 
proof-solver mode.""" + user_prompt: str + lean_template: str + topic_generator: LeanOJRoleConfig + topic_validator: LeanOJRoleConfig + brainstorm_submitters: List[LeanOJRoleConfig] = Field(default_factory=list, min_length=1, max_length=10) + brainstorm_validator: LeanOJRoleConfig + path_decider: LeanOJRoleConfig = Field(default_factory=LeanOJRoleConfig) + final_solver: LeanOJRoleConfig + max_initial_brainstorm_accepts: int = Field(default=30, ge=1, le=200) + max_recursive_brainstorm_accepts: int = Field(default=10, ge=1, le=100) + final_attempts_per_cycle: int = Field(default=30, ge=30, le=200) + + +class LeanOJAttemptRecord(BaseModel): + """One Lean 4 attempt made by the LeanOJ solver.""" + attempt: int + target: Literal["subproof", "final"] + request: str = "" + lean_code: str = "" + success: bool = False + error_output: str = "" + reasoning: str = "" + created_at: datetime = Field(default_factory=datetime.now) + + +class LeanOJSubproofRecord(BaseModel): + """Verified or exhausted subproof produced during one LeanOJ run.""" + subproof_id: str + request: str + role: str = "" + theorem_or_lemma: str = "" + verified: bool = False + lean_code: str = "" + lean_feedback: str = "" + attempts_used: int = 0 + error_summary: str = "" + proof_id: str = "" + novel: bool = False + novelty_tier: str = "not_novel" + novelty_reasoning: str = "" + created_at: datetime = Field(default_factory=datetime.now) + + +class LeanOJState(BaseModel): + """Current state snapshot for LeanOJ proof-solver mode.""" + is_running: bool = False + phase: Literal[ + "idle", + "initial_topic_candidates", + "initial_brainstorm", + "path_decision", + "recursive_brainstorm", + "proof_storm", + "final_proof_loop", + "verified", + "stopped", + "error", + ] = "idle" + last_active_phase: str = "" + active_brainstorm_phase: str = "" + active_brainstorm_start_count: int = 0 + session_id: str = "" + selected_topic: str = "" + current_path_decision: str = "" + accepted_brainstorm_count: int = 0 + 
rejected_brainstorm_count: int = 0 + brainstorm_acceptance_events: int = 0 + active_brainstorm_last_sufficiency_check_count: int = 0 + active_brainstorm_last_prune_review_count: int = 0 + brainstorm_prune_reviews_performed: int = 0 + brainstorm_prune_operations_applied: int = 0 + recursive_cycle_count: int = 0 + verified_subproofs: List[LeanOJSubproofRecord] = Field(default_factory=list) + failed_subproofs: List[LeanOJSubproofRecord] = Field(default_factory=list) + final_attempt_count: int = 0 + final_solution: str = "" + final_proof_id: str = "" + final_novel: bool = False + final_novelty_tier: str = "not_novel" + final_novelty_reasoning: str = "" + master_proof_initialized: bool = False + master_proof_version: int = 0 + master_proof_hash: str = "" + master_proof_line_count: int = 0 + master_proof_char_count: int = 0 + master_proof_last_edit_summary: str = "" + master_proof_last_stuck_reason: str = "" + master_proof_old_attempt_before_redo_version: int = 0 + master_proof_old_attempt_before_redo_hash: str = "" + master_proof_old_attempt_before_redo_line_count: int = 0 + master_proof_old_attempt_before_redo_char_count: int = 0 + master_proof_old_attempt_before_redo_summary: str = "" + master_proof_old_attempt_before_redo_validator_justification: str = "" + master_proof_old_attempt_before_redo_apparent_issue: str = "" + master_proof_last_shortening_approval_justification: str = "" + master_proof_last_shortening_apparent_issue: str = "" + last_error: str = "" + skip_brainstorm_requested: bool = False + force_brainstorm_requested: bool = False + user_forced_final_cycle: bool = False + updated_at: datetime = Field(default_factory=datetime.now) + + # ============================================================================ # TIER 3: FINAL ANSWER MODELS (Part 3 - Final Answer Generation) # ============================================================================ @@ -829,4 +990,6 @@ class CritiqueRequest(BaseModel): validator_context_window: Optional[int] = None 
validator_max_tokens: Optional[int] = None validator_provider: Optional[str] = None # "lm_studio" or "openrouter" - validator_openrouter_provider: Optional[str] = None # Specific provider like "Anthropic" \ No newline at end of file + validator_openrouter_provider: Optional[str] = None # Specific provider like "Anthropic" + validator_openrouter_reasoning_effort: OpenRouterReasoningEffort = DEFAULT_OPENROUTER_REASONING_EFFORT + validator_supercharge_enabled: bool = False \ No newline at end of file diff --git a/backend/shared/openrouter_client.py b/backend/shared/openrouter_client.py index 60c4bdf..d034b11 100644 --- a/backend/shared/openrouter_client.py +++ b/backend/shared/openrouter_client.py @@ -8,12 +8,38 @@ import asyncio import json import logging +import re import time from typing import List, Dict, Any, Optional +from backend.shared.config import system_config + logger = logging.getLogger(__name__) +_PROVIDER_SECRET_PATTERNS = ( + re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+", re.IGNORECASE), + re.compile(r'("(?:api[_-]?key|appid|authorization|token|secret)"\s*:\s*)"[^"]*"', re.IGNORECASE), + re.compile(r"((?:api[_-]?key|appid|authorization|token|secret)\s*[=:]\s*)[^\s,&}]+", re.IGNORECASE), +) + + +def sanitize_provider_error_text(value: Any, max_chars: int = 500) -> str: + """Return a capped provider error preview with obvious secrets/body fields redacted.""" + text = str(value or "") + for pattern in _PROVIDER_SECRET_PATTERNS: + text = pattern.sub(lambda match: f"{match.group(1) if match.lastindex else 'Bearer '}[redacted]", text) + + # Provider error pages occasionally echo request JSON. Drop large message + # arrays rather than persisting prompt/user-file content in local logs. 
+ text = re.sub(r'("messages"\s*:\s*)\[[\s\S]*?\]', r'\1[redacted]', text, flags=re.IGNORECASE) + text = re.sub(r'("prompt"\s*:\s*)"[\s\S]*?"', r'\1"[redacted]"', text, flags=re.IGNORECASE) + + if len(text) > max_chars: + return text[:max_chars] + "...[truncated]" + return text + + class OpenRouterClient: """Client for OpenRouter API.""" @@ -21,6 +47,9 @@ class OpenRouterClient: MAX_RETRIES = 3 RETRY_DELAY = 2.0 # seconds RATE_LIMIT_COOLDOWN = 3600.0 # 1 hour in seconds + AUTO_IGNORED_PROVIDERS = ("Venice",) + HIGHEST_REASONING_EFFORT = "xhigh" + REASONING_EFFORT_LEVELS = {"xhigh", "high", "medium", "low", "minimal", "none"} # Per-model semaphores for rate limiting _model_semaphores: Dict[str, asyncio.Semaphore] = {} @@ -50,7 +79,7 @@ def __init__(self, api_key: str): async def _get_model_semaphore(self, model: str) -> asyncio.Semaphore: """ Get or create semaphore for a specific model. - Each model gets its own semaphore (limit=1) to prevent concurrent requests. + Each model gets its own semaphore to bound concurrent requests. 
Args: model: Model name/identifier @@ -60,8 +89,9 @@ async def _get_model_semaphore(self, model: str) -> asyncio.Semaphore: """ async with self._semaphore_lock: if model not in self._model_semaphores: - self._model_semaphores[model] = asyncio.Semaphore(1) - logger.debug(f"Created semaphore for OpenRouter model: {model}") + limit = max(1, int(system_config.max_model_concurrency_per_model or 1)) + self._model_semaphores[model] = asyncio.Semaphore(limit) + logger.debug(f"Created semaphore for OpenRouter model: {model} (limit={limit})") return self._model_semaphores[model] def _is_free_model(self, model: str) -> bool: @@ -329,6 +359,7 @@ async def generate_completion( max_tokens: Optional[int] = None, response_format: Optional[Dict[str, str]] = None, provider: Optional[str] = None, + reasoning_effort: Optional[str] = None, tools: Optional[List[Dict[str, Any]]] = None, tool_choice: Optional[Any] = None, ) -> Dict[str, Any]: @@ -342,6 +373,7 @@ async def generate_completion( max_tokens: Maximum tokens to generate response_format: Optional response format constraints provider: Optional specific provider to use (None lets OpenRouter choose) + reasoning_effort: Optional OpenRouter reasoning effort (auto/xhigh/high/medium/low/minimal/none). tools: Optional OpenAI-compatible tool schemas the model may call. tool_choice: Optional tool-choice directive (e.g. "auto", "none", or ``{"type": "function", "function": {"name": "..."}}``). @@ -355,7 +387,7 @@ async def generate_completion( """ model_semaphore = await self._get_model_semaphore(model) - # ACQUIRE THIS MODEL'S SEMAPHORE to prevent concurrent requests + # Bound same-model parallelism so multi-submitter phases can overlap without unbounded fanout. 
async with model_semaphore: return await self._execute_completion_request( model, @@ -364,6 +396,7 @@ async def generate_completion( max_tokens, response_format, provider, + reasoning_effort, tools=tools, tool_choice=tool_choice, ) @@ -392,6 +425,31 @@ def _is_reasoning_model_without_temperature(self, model: str) -> bool: ] return any(pattern in model_lower for pattern in reasoning_model_patterns) + + def _build_reasoning_config(self, reasoning_effort: Optional[str]) -> Optional[Dict[str, str]]: + """ + Build OpenRouter's normalized reasoning config. + + ``auto`` intentionally means maximum reasoning for this app: OpenRouter + maps the normalized effort field onto provider-specific reasoning knobs + where supported and ignores unsupported parameters by default. + """ + if reasoning_effort is None: + return None + + effort = str(reasoning_effort).strip().lower() + if not effort: + return None + if effort in {"auto", "max", "maximum", "highest"}: + effort = self.HIGHEST_REASONING_EFFORT + elif effort in {"off", "disabled", "disable"}: + effort = "none" + + if effort not in self.REASONING_EFFORT_LEVELS: + logger.warning("Unknown OpenRouter reasoning effort '%s'; defaulting to max", reasoning_effort) + effort = self.HIGHEST_REASONING_EFFORT + + return {"effort": effort} async def _execute_completion_request( self, @@ -401,6 +459,7 @@ async def _execute_completion_request( max_tokens: Optional[int], response_format: Optional[Dict[str, str]], provider: Optional[str] = None, + reasoning_effort: Optional[str] = None, tools: Optional[List[Dict[str, Any]]] = None, tool_choice: Optional[Any] = None, ) -> Dict[str, Any]: @@ -433,15 +492,17 @@ async def _execute_completion_request( else: logger.debug(f"Skipping temperature parameter for reasoning model: {model}") - # Set max_tokens if provided - if max_tokens is None: - max_tokens = 25000 # Default for reasoning models - logger.debug(f"Auto-limiting max_tokens to {max_tokens}") - - payload["max_tokens"] = max_tokens + if 
max_tokens is not None: + payload["max_tokens"] = max_tokens + else: + logger.debug("No max_tokens supplied; letting OpenRouter/model defaults apply") if response_format: payload["response_format"] = response_format + + reasoning_config = self._build_reasoning_config(reasoning_effort) + if reasoning_config: + payload["reasoning"] = reasoning_config # OpenAI-compatible tool calling: pass tools + tool_choice straight # through to OpenRouter. Providers that do not support tools tend to @@ -454,8 +515,16 @@ async def _execute_completion_request( # Add provider routing if specified if provider: - payload["provider"] = {"order": [provider]} + payload["provider"] = { + "order": [provider], + "allow_fallbacks": False, + } logger.debug(f"Using specific provider: {provider}") + elif self.AUTO_IGNORED_PROVIDERS: + payload["provider"] = { + "ignore": list(self.AUTO_IGNORED_PROVIDERS), + } + logger.debug(f"Ignoring weak OpenRouter auto-routing providers: {self.AUTO_IGNORED_PROVIDERS}") # NOTE: Stop sequences were removed because they caused premature truncation # with certain models (e.g., Grok 4.1). 
Models will now generate until max_tokens @@ -472,7 +541,7 @@ async def _execute_completion_request( # Check for credit exhaustion (402 Payment Required) if response.status_code == 402: - error_text = response.text + error_text = sanitize_provider_error_text(response.text) logger.error( f"OpenRouter credit exhaustion detected (402): {error_text}" ) @@ -496,7 +565,7 @@ async def _execute_completion_request( body_text = response.text or "" except Exception: body_text = "" - body_preview = body_text[:500] + body_preview = sanitize_provider_error_text(body_text) content_type = response.headers.get("content-type", "") if hasattr(response, "headers") else "" logger.error( f"OpenRouter returned non-JSON body (status={response.status_code}, " @@ -529,7 +598,7 @@ async def _execute_completion_request( raise except httpx.HTTPStatusError as e: - error_detail = e.response.text if hasattr(e.response, 'text') else str(e) + error_detail = sanitize_provider_error_text(e.response.text if hasattr(e.response, 'text') else str(e)) # Check for rate limit (429 Too Many Requests) if e.response.status_code == 429: @@ -667,7 +736,7 @@ async def get_embeddings(self, texts: List[str], model: str = None) -> List[List # Check for credit exhaustion (402 Payment Required) if response.status_code == 402: - error_text = response.text + error_text = sanitize_provider_error_text(response.text) logger.error(f"OpenRouter credit exhaustion for embeddings (402): {error_text}") raise CreditExhaustionError("OpenRouter credits exhausted for embeddings") @@ -680,7 +749,7 @@ async def get_embeddings(self, texts: List[str], model: str = None) -> List[List body_text = response.text or "" except Exception: body_text = "" - body_preview = body_text[:500] + body_preview = sanitize_provider_error_text(body_text) content_type = response.headers.get("content-type", "") if hasattr(response, "headers") else "" logger.error( f"OpenRouter embeddings returned non-JSON body (status={response.status_code}, " @@ -712,7 
+781,7 @@ async def get_embeddings(self, texts: List[str], model: str = None) -> List[List raise except httpx.HTTPStatusError as e: - error_detail = e.response.text if hasattr(e.response, 'text') else str(e) + error_detail = sanitize_provider_error_text(e.response.text if hasattr(e.response, 'text') else str(e)) # Check for rate limit (429 Too Many Requests) if e.response.status_code == 429: diff --git a/backend/shared/smt_client.py b/backend/shared/smt_client.py index e056829..1429c89 100644 --- a/backend/shared/smt_client.py +++ b/backend/shared/smt_client.py @@ -18,6 +18,8 @@ class SmtClient: """Thin async wrapper around an external Z3 binary.""" + _ALLOWED_EXECUTABLE_NAMES = {"z3", "z3.exe"} + def __init__(self, z3_path: str, timeout: int) -> None: self.z3_path = str(z3_path or "").strip() self.timeout = max(int(timeout or 0), 1) @@ -25,14 +27,15 @@ def __init__(self, z3_path: str, timeout: int) -> None: def _resolve_executable(self) -> str: if self.z3_path: candidate = Path(self.z3_path).resolve() - if candidate.exists(): + if candidate.exists() and candidate.name.lower() in self._ALLOWED_EXECUTABLE_NAMES: return str(candidate) + raise RuntimeError("Configured Z3 path must point to a z3 executable.") for name in ("z3", "z3.exe"): resolved = shutil.which(name) if resolved: return resolved - return self.z3_path or "z3" + return "z3" async def _run_process( self, diff --git a/backend/shared/wolfram_alpha_client.py b/backend/shared/wolfram_alpha_client.py index 5acabde..982710e 100644 --- a/backend/shared/wolfram_alpha_client.py +++ b/backend/shared/wolfram_alpha_client.py @@ -48,12 +48,12 @@ async def query(self, question: str) -> Optional[str]: "i": question } - logger.info(f"Querying Wolfram Alpha: {question[:100]}") + logger.info("Querying Wolfram Alpha (query_len=%d)", len(question or "")) response = await self.client.get(self.BASE_URL, params=params) if response.status_code == 200: result = response.text.strip() - logger.info(f"Wolfram Alpha success: 
{result[:200]}") + logger.info("Wolfram Alpha success (result_len=%d)", len(result)) return result elif response.status_code == 401: logger.warning("Wolfram Alpha: Invalid API key (401)") @@ -62,14 +62,14 @@ async def query(self, question: str) -> Optional[str]: logger.warning("Wolfram Alpha: API key forbidden or rate limited (403)") return None elif response.status_code == 501: - logger.warning(f"Wolfram Alpha: Could not interpret query (501): {question}") + logger.warning("Wolfram Alpha: Could not interpret query (501; query_len=%d)", len(question or "")) return None else: logger.warning(f"Wolfram Alpha query failed: status {response.status_code}") return None except httpx.TimeoutException: - logger.warning(f"Wolfram Alpha query timeout after 30s: {question[:100]}") + logger.warning("Wolfram Alpha query timeout after 30s (query_len=%d)", len(question or "")) return None except Exception as e: logger.error(f"Wolfram Alpha API error: {e}", exc_info=True) diff --git a/backend/shared/workflow_start_guard.py b/backend/shared/workflow_start_guard.py new file mode 100644 index 0000000..1b11437 --- /dev/null +++ b/backend/shared/workflow_start_guard.py @@ -0,0 +1,23 @@ +""" +Process-wide guard for mutually exclusive top-level workflow starts. +""" +from __future__ import annotations + +import asyncio +from contextlib import asynccontextmanager +from typing import AsyncIterator + + +class WorkflowStartGuard: + """Serialize conflict checks and startup side effects across top-level modes.""" + + def __init__(self) -> None: + self._lock = asyncio.Lock() + + @asynccontextmanager + async def reserve(self) -> AsyncIterator[None]: + async with self._lock: + yield + + +workflow_start_guard = WorkflowStartGuard() diff --git a/frontend/index.html b/frontend/index.html index 5b7a03d..096b2b6 100644 --- a/frontend/index.html +++ b/frontend/index.html @@ -3,7 +3,7 @@ - ASI Aggregator System + MOTO Autonomous ASI
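Review note on `backend/shared/workflow_start_guard.py` above: the new `WorkflowStartGuard.reserve()` only serializes callers, so the conflict check and the state change both have to happen inside the `async with` block for the guard to help. The following is a minimal sketch of that usage, not the project's actual start handlers. The `start_workflow` function, the `state` dict, and the mode names used as arguments are hypothetical; only the guard class mirrors the diff.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, Dict, List, Optional


class WorkflowStartGuard:
    """Same shape as the guard added in workflow_start_guard.py."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()

    @asynccontextmanager
    async def reserve(self) -> AsyncIterator[None]:
        async with self._lock:
            yield


async def demo() -> List[bool]:
    # Created inside the running loop so the lock is bound to it.
    guard = WorkflowStartGuard()
    # Hypothetical shared state: which top-level mode currently owns the slot.
    state: Dict[str, Optional[str]] = {"active_mode": None}

    async def start_workflow(mode: str) -> bool:
        """Hypothetical handler: check for a conflict and claim the slot atomically."""
        async with guard.reserve():
            if state["active_mode"] is not None:
                return False  # another mode already started
            # Simulate async startup side effects; without the guard, a second
            # caller could pass the check above during this await.
            await asyncio.sleep(0)
            state["active_mode"] = mode
            return True

    return await asyncio.gather(
        start_workflow("autonomous"),
        start_workflow("leanoj"),
        start_workflow("manual"),
    )


results = asyncio.run(demo())
print(results)
```

Under these assumptions exactly one of the three concurrent start requests succeeds. If the `await` sat outside `reserve()`, or the check and the assignment were in separate `reserve()` blocks, more than one caller could pass the check before any of them recorded the active mode.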
diff --git a/frontend/package-lock.json b/frontend/package-lock.json index 389caa3..2bee959 100644 --- a/frontend/package-lock.json +++ b/frontend/package-lock.json @@ -1,12 +1,12 @@ { "name": "asi-aggregator-frontend", - "version": "1.0.7", + "version": "1.0.8", "lockfileVersion": 3, "requires": true, "packages": { "": { "name": "asi-aggregator-frontend", - "version": "1.0.7", + "version": "1.0.8", "license": "MIT", "dependencies": { "dompurify": "^3.2.4", diff --git a/frontend/package.json b/frontend/package.json index 90b3c08..6f9e4e4 100644 --- a/frontend/package.json +++ b/frontend/package.json @@ -1,6 +1,6 @@ { "name": "asi-aggregator-frontend", - "version": "1.0.7", + "version": "1.0.8", "description": "Frontend UI for MOTO S.T.E.M. Mathematics Variant - Autonomous ASI Research System for Novel S.T.E.M. Mathematical Paper Generation", "author": "Intrafere LLC", "license": "MIT", diff --git a/frontend/src/App.jsx b/frontend/src/App.jsx index 56a16aa..ff19b56 100644 --- a/frontend/src/App.jsx +++ b/frontend/src/App.jsx @@ -19,6 +19,15 @@ import { MathematicalProofs, ProofLibrary } from './components/autonomous'; +import { + LeanOJBrainstorms, + LeanOJInterface, + LeanOJLogs, + LeanOJMasterProof, + LeanOJMathematicalProofs, + LeanOJProofLibrary, + LeanOJSettings, +} from './components/leanoj'; import WorkflowPanel from './components/WorkflowPanel'; import BoostControlModal from './components/BoostControlModal'; import StartupProviderSetupModal from './components/StartupProviderSetupModal'; @@ -31,7 +40,7 @@ import HungConnectionNotificationStack from './components/HungConnectionNotifica import UpdateNotificationBanner from './components/UpdateNotificationBanner'; import PaperCritiqueModal from './components/PaperCritiqueModal'; import { websocket } from './services/websocket'; -import { api, autonomousAPI, openRouterAPI } from './services/api'; +import { api, autonomousAPI, leanojAPI, openRouterAPI } from './services/api'; import { LM_STUDIO_STARTUP_CHOICE, 
RECOMMENDED_PROFILE_KEY, @@ -42,14 +51,27 @@ import { settingsToAutonomousConfig, persistAutonomousSettings, } from './utils/autonomousProfiles'; +import { + getStoredLeanOJSettings, + persistLeanOJSettings, +} from './utils/leanojProfiles'; +import { + DEFAULT_CONTEXT_WINDOW, + DEFAULT_MAX_OUTPUT_TOKENS, +} from './utils/openRouterSelection'; const APP_MODE_STORAGE_KEY = 'appMode'; const AUTONOMOUS_TAB_STORAGE_KEY = 'autonomousActiveTab'; const MANUAL_TAB_STORAGE_KEY = 'manualActiveTab'; +const LEANOJ_TAB_STORAGE_KEY = 'leanojActiveTab'; const COMPLETED_WORKS_SUB_TAB_STORAGE_KEY = 'completedWorksSubTab'; const LEGACY_SINGLE_PAPER_WRITER_STORAGE_KEY = 'singlePaperWriterExpanded'; +const DEVELOPER_MODE_STORAGE_KEY = 'developerModeSettingsEnabled'; const EMBEDDING_MODEL_HINTS = ['embed', 'embedding', 'nomic', 'bge', 'e5', 'gte']; const AUTONOMOUS_ROLE_PREFIXES = ['validator', 'high_context', 'high_param', 'critique_submitter']; +const HIGH_SCORE_CRITIQUE_THRESHOLD = 6.25; +const SEEN_HIGH_SCORE_CRITIQUES_STORAGE_KEY = 'seenHighScoreCritiqueNotifications'; +const MAX_SEEN_HIGH_SCORE_CRITIQUES = 500; const DEFAULT_CAPABILITIES = Object.freeze({ genericMode: false, lmStudioEnabled: true, @@ -60,6 +82,60 @@ const DEFAULT_CAPABILITIES = Object.freeze({ apiContractVersion: '', }); +function readDeveloperModeEnabled() { + return localStorage.getItem(DEVELOPER_MODE_STORAGE_KEY) === 'true'; +} + +function getHighScoreCritiqueNotificationKey(paperId, averageRating) { + const rating = Number(averageRating); + if (!paperId || !Number.isFinite(rating)) { + return null; + } + return `${paperId}:${rating.toFixed(1)}`; +} + +function readSeenHighScoreCritiques() { + if (typeof window === 'undefined') { + return new Set(); + } + + try { + const raw = window.localStorage.getItem(SEEN_HIGH_SCORE_CRITIQUES_STORAGE_KEY); + const values = raw ? JSON.parse(raw) : []; + return new Set(Array.isArray(values) ? 
values.filter(value => typeof value === 'string') : []); + } catch (error) { + console.warn('Could not read seen high-score critique notifications:', error); + return new Set(); + } +} + +function persistSeenHighScoreCritiques(seenSet) { + if (typeof window === 'undefined') { + return; + } + + try { + const values = Array.from(seenSet).slice(-MAX_SEEN_HIGH_SCORE_CRITIQUES); + window.localStorage.setItem(SEEN_HIGH_SCORE_CRITIQUES_STORAGE_KEY, JSON.stringify(values)); + } catch (error) { + console.warn('Could not save seen high-score critique notifications:', error); + } +} + +const createDefaultAggregatorSubmitterConfigs = () => ( + [1, 2, 3].map((submitterId) => ({ + submitterId, + provider: 'lm_studio', + modelId: '', + openrouterProvider: null, + openrouterReasoningEffort: 'auto', + lmStudioFallbackId: null, + contextWindow: DEFAULT_CONTEXT_WINDOW, + maxOutputTokens: DEFAULT_MAX_OUTPUT_TOKENS, + superchargeEnabled: false, + })) +); + function normalizeLoadedLmStudioModelId(modelId = '') { return String(modelId).replace(/:\d+$/, ''); } @@ -106,6 +182,7 @@ function normalizeRuntimeModelConfig(config = {}, lmStudioEnabled) { provider: normalizeRuntimeProvider(config.provider, lmStudioEnabled), modelId: shouldResetLmState ? '' : (config.modelId || ''), openrouterProvider: shouldResetLmState ? null : (config.openrouterProvider || null), + openrouterReasoningEffort: config.openrouterReasoningEffort || 'auto', lmStudioFallbackId: lmStudioEnabled ? (config.lmStudioFallbackId || null) : null, }; } @@ -124,6 +201,7 @@ function normalizeAggregatorConfigForCapabilities(config, lmStudioEnabled) { validatorOpenrouterProvider: shouldResetValidator ? null : (config.validatorOpenrouterProvider || null), + validatorOpenrouterReasoningEffort: config.validatorOpenrouterReasoningEffort || 'auto', validatorLmStudioFallback: lmStudioEnabled ? 
(config.validatorLmStudioFallback || null) : null, }; } @@ -149,6 +227,7 @@ function normalizeAutonomousConfigForCapabilities(config, lmStudioEnabled) { nextConfig[openRouterProviderKey] = shouldResetRole ? null : (nextConfig[openRouterProviderKey] || null); + nextConfig[`${rolePrefix}_openrouter_reasoning_effort`] = nextConfig[`${rolePrefix}_openrouter_reasoning_effort`] || 'auto'; nextConfig[fallbackKey] = lmStudioEnabled ? (nextConfig[fallbackKey] || null) : null; }); @@ -158,7 +237,10 @@ function normalizeAutonomousConfigForCapabilities(config, lmStudioEnabled) { function App() { const [appMode, setAppMode] = useState(() => { const savedMode = localStorage.getItem(APP_MODE_STORAGE_KEY); - if (savedMode === 'autonomous' || savedMode === 'manual') { + if (savedMode === 'leanoj' && !readDeveloperModeEnabled()) { + return 'autonomous'; + } + if (savedMode === 'autonomous' || savedMode === 'manual' || savedMode === 'leanoj') { return savedMode; } @@ -181,6 +263,9 @@ function App() { return saved || 'auto-interface'; }); const [manualActiveTab, setManualActiveTab] = useState('aggregator-interface'); + const [leanojActiveTab, setLeanojActiveTab] = useState(() => { + return localStorage.getItem(LEANOJ_TAB_STORAGE_KEY) || 'leanoj-interface'; + }); const [completedWorksSubTab, setCompletedWorksSubTab] = useState(() => { const savedSubTab = localStorage.getItem(COMPLETED_WORKS_SUB_TAB_STORAGE_KEY); if (savedSubTab) return savedSubTab; @@ -189,7 +274,11 @@ function App() { if (savedTab === 'auto-final-answer-library') return 'stage3-history'; return 'stage2-history'; }); - const activeTab = appMode === 'manual' ? manualActiveTab : autonomousActiveTab; + const activeTab = appMode === 'manual' + ? manualActiveTab + : appMode === 'leanoj' + ? leanojActiveTab + : autonomousActiveTab; const shimmerAccentsEnabled = (() => { const saved = localStorage.getItem('banner_shimmer_enabled'); return saved !== null ? 
JSON.parse(saved) : true; @@ -233,6 +322,9 @@ function App() { const savedState = localStorage.getItem('workflow_panel_collapsed'); return savedState !== 'false'; }); + const [developerModeEnabled, setDeveloperModeEnabled] = useState(() => { + return readDeveloperModeEnabled(); + }); // Update notice banner state (dismissible per session, re-appears on restart) const [updateNotice, setUpdateNotice] = useState(null); @@ -254,9 +346,90 @@ function App() { localStorage.setItem(MANUAL_TAB_STORAGE_KEY, manualActiveTab); }, [manualActiveTab]); + useEffect(() => { + localStorage.setItem(LEANOJ_TAB_STORAGE_KEY, leanojActiveTab); + }, [leanojActiveTab]); + useEffect(() => { localStorage.setItem(COMPLETED_WORKS_SUB_TAB_STORAGE_KEY, completedWorksSubTab); }, [completedWorksSubTab]); + + useEffect(() => { + if (!developerModeEnabled && appMode === 'leanoj') { + setAppMode('autonomous'); + } + }, [developerModeEnabled, appMode]); + + useEffect(() => { + const pressedCodes = new Set(); + let shortcutChordActive = false; + + const toggleDeveloperMode = () => { + setDeveloperModeEnabled((currentValue) => { + const nextValue = !currentValue; + localStorage.setItem(DEVELOPER_MODE_STORAGE_KEY, String(nextValue)); + return nextValue; + }); + }; + + const getShortcutCode = (event) => { + if (event.code?.startsWith('Shift') || event.key === 'Shift') { + return 'Shift'; + } + if (event.code === 'KeyZ' || event.key?.toLowerCase() === 'z') { + return 'KeyZ'; + } + if (event.code === 'KeyX' || event.key?.toLowerCase() === 'x') { + return 'KeyX'; + } + return null; + }; + + const hasDeveloperShortcutChord = () => ( + pressedCodes.has('Shift') && + pressedCodes.has('KeyZ') && + pressedCodes.has('KeyX') + ); + + const handleKeyDown = (event) => { + const shortcutCode = getShortcutCode(event); + if (!shortcutCode) { + return; + } + + pressedCodes.add(shortcutCode); + if (hasDeveloperShortcutChord() && !shortcutChordActive) { + shortcutChordActive = true; + event.preventDefault(); + 
toggleDeveloperMode(); + } + }; + + const handleKeyUp = (event) => { + const shortcutCode = getShortcutCode(event); + if (shortcutCode) { + pressedCodes.delete(shortcutCode); + } + if (!hasDeveloperShortcutChord()) { + shortcutChordActive = false; + } + }; + + const clearPressedCodes = () => { + pressedCodes.clear(); + shortcutChordActive = false; + }; + + window.addEventListener('keydown', handleKeyDown, true); + window.addEventListener('keyup', handleKeyUp, true); + window.addEventListener('blur', clearPressedCodes); + + return () => { + window.removeEventListener('keydown', handleKeyDown, true); + window.removeEventListener('keyup', handleKeyUp, true); + window.removeEventListener('blur', clearPressedCodes); + }; + }, []); // Initialize config from localStorage or use defaults // CRITICAL: Read from 'aggregator_settings' (used by AggregatorSettings component) @@ -268,17 +441,15 @@ function App() { const settings = JSON.parse(settingsConfig); return { userPrompt: settings.userPrompt || '', - submitterConfigs: settings.submitterConfigs || [ - { submitterId: 1, provider: 'lm_studio', modelId: '', openrouterProvider: null, lmStudioFallbackId: null, contextWindow: 131072, maxOutputTokens: 25000 }, - { submitterId: 2, provider: 'lm_studio', modelId: '', openrouterProvider: null, lmStudioFallbackId: null, contextWindow: 131072, maxOutputTokens: 25000 }, - { submitterId: 3, provider: 'lm_studio', modelId: '', openrouterProvider: null, lmStudioFallbackId: null, contextWindow: 131072, maxOutputTokens: 25000 } - ], + submitterConfigs: settings.submitterConfigs || createDefaultAggregatorSubmitterConfigs(), validatorModel: settings.validatorModel || '', validatorProvider: settings.validatorProvider || 'lm_studio', validatorOpenrouterProvider: settings.validatorOpenrouterProvider || null, + validatorOpenrouterReasoningEffort: settings.validatorOpenrouterReasoningEffort || 'auto', validatorLmStudioFallback: settings.validatorLmStudioFallback || null, - validatorContextSize: 
settings.validatorContextSize || 131072, - validatorMaxOutput: settings.validatorMaxOutput || 25000, + validatorContextSize: settings.validatorContextSize || DEFAULT_CONTEXT_WINDOW, + validatorMaxOutput: settings.validatorMaxOutput || DEFAULT_MAX_OUTPUT_TOKENS, + validatorSuperchargeEnabled: Boolean(settings.validatorSuperchargeEnabled), uploadedFiles: [], }; } catch (e) { @@ -293,17 +464,15 @@ function App() { const parsed = JSON.parse(savedConfig); return { userPrompt: parsed.userPrompt || '', - submitterConfigs: parsed.submitterConfigs || [ - { submitterId: 1, provider: 'lm_studio', modelId: '', openrouterProvider: null, lmStudioFallbackId: null, contextWindow: 131072, maxOutputTokens: 25000 }, - { submitterId: 2, provider: 'lm_studio', modelId: '', openrouterProvider: null, lmStudioFallbackId: null, contextWindow: 131072, maxOutputTokens: 25000 }, - { submitterId: 3, provider: 'lm_studio', modelId: '', openrouterProvider: null, lmStudioFallbackId: null, contextWindow: 131072, maxOutputTokens: 25000 } - ], + submitterConfigs: parsed.submitterConfigs || createDefaultAggregatorSubmitterConfigs(), validatorModel: parsed.validatorModel || '', validatorProvider: parsed.validatorProvider || 'lm_studio', validatorOpenrouterProvider: parsed.validatorOpenrouterProvider || null, + validatorOpenrouterReasoningEffort: parsed.validatorOpenrouterReasoningEffort || 'auto', validatorLmStudioFallback: parsed.validatorLmStudioFallback || null, - validatorContextSize: parsed.validatorContextSize || 131072, - validatorMaxOutput: parsed.validatorMaxOutput || 25000, + validatorContextSize: parsed.validatorContextSize || DEFAULT_CONTEXT_WINDOW, + validatorMaxOutput: parsed.validatorMaxOutput || DEFAULT_MAX_OUTPUT_TOKENS, + validatorSuperchargeEnabled: Boolean(parsed.validatorSuperchargeEnabled), uploadedFiles: [], }; } catch (e) { @@ -312,17 +481,15 @@ function App() { } return { userPrompt: '', - submitterConfigs: [ - { submitterId: 1, provider: 'lm_studio', modelId: '', 
openrouterProvider: null, lmStudioFallbackId: null, contextWindow: 131072, maxOutputTokens: 25000 }, - { submitterId: 2, provider: 'lm_studio', modelId: '', openrouterProvider: null, lmStudioFallbackId: null, contextWindow: 131072, maxOutputTokens: 25000 }, - { submitterId: 3, provider: 'lm_studio', modelId: '', openrouterProvider: null, lmStudioFallbackId: null, contextWindow: 131072, maxOutputTokens: 25000 } - ], + submitterConfigs: createDefaultAggregatorSubmitterConfigs(), validatorModel: '', validatorProvider: 'lm_studio', validatorOpenrouterProvider: null, + validatorOpenrouterReasoningEffort: 'auto', validatorLmStudioFallback: null, - validatorContextSize: 131072, - validatorMaxOutput: 25000, + validatorContextSize: DEFAULT_CONTEXT_WINDOW, + validatorMaxOutput: DEFAULT_MAX_OUTPUT_TOKENS, + validatorSuperchargeEnabled: false, uploadedFiles: [], }; }); @@ -336,14 +503,16 @@ function App() { validatorModel: config.validatorModel, validatorProvider: config.validatorProvider, validatorOpenrouterProvider: config.validatorOpenrouterProvider, + validatorOpenrouterReasoningEffort: config.validatorOpenrouterReasoningEffort, validatorLmStudioFallback: config.validatorLmStudioFallback, validatorContextSize: config.validatorContextSize, validatorMaxOutput: config.validatorMaxOutput, + validatorSuperchargeEnabled: config.validatorSuperchargeEnabled, }; // Save to both old and new keys localStorage.setItem('aggregatorConfig', JSON.stringify(configToSave)); localStorage.setItem('aggregator_settings', JSON.stringify(configToSave)); - }, [config.userPrompt, config.submitterConfigs, config.validatorModel, config.validatorProvider, config.validatorOpenrouterProvider, config.validatorLmStudioFallback, config.validatorContextSize, config.validatorMaxOutput]); + }, [config.userPrompt, config.submitterConfigs, config.validatorModel, config.validatorProvider, config.validatorOpenrouterProvider, config.validatorOpenrouterReasoningEffort, config.validatorLmStudioFallback, 
config.validatorContextSize, config.validatorMaxOutput, config.validatorSuperchargeEnabled]); // Autonomous mode state const [autonomousRunning, setAutonomousRunning] = useState(false); @@ -353,6 +522,13 @@ function App() { const [brainstorms, setBrainstorms] = useState([]); const [papers, setPapers] = useState([]); const [autonomousStats, setAutonomousStats] = useState(null); + + // LeanOJ mode state + const [leanojRunning, setLeanojRunning] = useState(false); + const [leanojStatus, setLeanojStatus] = useState(null); + const [leanojActivity, setLeanojActivity] = useState([]); + const [leanojSettings, setLeanojSettings] = useState(() => getStoredLeanOJSettings()); + const [leanojProofRefreshToken, setLeanojProofRefreshToken] = useState(0); // Disclaimer modal state (shows on every app load) const [showDisclaimer, setShowDisclaimer] = useState(true); @@ -386,6 +562,12 @@ function App() { const autonomousRunningRef = useRef(autonomousRunning); const autonomousTierRef = useRef(autonomousStatus?.current_tier || null); const openRouterKeyJustSavedRef = useRef(false); + const seenHighScoreCritiquesRef = useRef(null); + const shownHighScoreCritiquesRef = useRef(null); + if (seenHighScoreCritiquesRef.current === null) { + seenHighScoreCritiquesRef.current = readSeenHighScoreCritiques(); + shownHighScoreCritiquesRef.current = new Set(seenHighScoreCritiquesRef.current); + } useEffect(() => { autonomousRunningRef.current = autonomousRunning; @@ -395,6 +577,20 @@ function App() { autonomousTierRef.current = autonomousStatus?.current_tier || null; }, [autonomousStatus]); + const markHighScoreCritiqueSeen = useCallback((seenKey) => { + if (!seenKey) { + return; + } + + const seen = seenHighScoreCritiquesRef.current; + if (seen.has(seenKey)) { + return; + } + + seen.add(seenKey); + persistSeenHighScoreCritiques(seen); + }, []); + // Autonomous config with localStorage persistence // CRITICAL: Read from 'autonomous_research_settings' (used by AutonomousResearchSettings component) 
const [autonomousConfig, setAutonomousConfig] = useState(() => { @@ -416,29 +612,37 @@ function App() { validator_lm_studio_fallback: autonomousConfig.validator_lm_studio_fallback, validator_context_window: autonomousConfig.validator_context_window, validator_max_tokens: autonomousConfig.validator_max_tokens, + validator_supercharge_enabled: autonomousConfig.validator_supercharge_enabled, high_context_provider: autonomousConfig.high_context_provider, high_context_model: autonomousConfig.high_context_model, high_context_openrouter_provider: autonomousConfig.high_context_openrouter_provider, high_context_lm_studio_fallback: autonomousConfig.high_context_lm_studio_fallback, high_context_context_window: autonomousConfig.high_context_context_window, high_context_max_tokens: autonomousConfig.high_context_max_tokens, + high_context_supercharge_enabled: autonomousConfig.high_context_supercharge_enabled, high_param_provider: autonomousConfig.high_param_provider, high_param_model: autonomousConfig.high_param_model, high_param_openrouter_provider: autonomousConfig.high_param_openrouter_provider, high_param_lm_studio_fallback: autonomousConfig.high_param_lm_studio_fallback, high_param_context_window: autonomousConfig.high_param_context_window, high_param_max_tokens: autonomousConfig.high_param_max_tokens, + high_param_supercharge_enabled: autonomousConfig.high_param_supercharge_enabled, critique_submitter_provider: autonomousConfig.critique_submitter_provider, critique_submitter_model: autonomousConfig.critique_submitter_model, critique_submitter_openrouter_provider: autonomousConfig.critique_submitter_openrouter_provider, critique_submitter_lm_studio_fallback: autonomousConfig.critique_submitter_lm_studio_fallback, critique_submitter_context_window: autonomousConfig.critique_submitter_context_window, critique_submitter_max_tokens: autonomousConfig.critique_submitter_max_tokens, + critique_submitter_supercharge_enabled: autonomousConfig.critique_submitter_supercharge_enabled, 
}, tier3Enabled: autonomousConfig.tier3_enabled ?? existingSettings.tier3Enabled ?? false, }); }, [autonomousConfig]); + useEffect(() => { + persistLeanOJSettings(leanojSettings); + }, [leanojSettings]); + const syncProviderAvailability = useCallback(async () => { let nextCapabilities = DEFAULT_CAPABILITIES; try { @@ -665,6 +869,7 @@ function App() { if (status.is_running) { console.log('Autonomous research detected as running, syncing state...'); setAutonomousRunning(true); + setAnyWorkflowRunning(true); } } catch (error) { console.error('Failed to check initial autonomous status:', error); @@ -674,6 +879,69 @@ function App() { checkInitialStatus(); }, []); + // Recover high-score critique popups from persisted paper metadata. WebSocket + // events are best-effort, so a sleeping/closed browser can miss the live event. + useEffect(() => { + if (!papers || papers.length === 0) { + return; + } + + const recoveredNotifications = []; + for (const paper of papers) { + const averageRating = Number(paper.critique_avg); + if (!Number.isFinite(averageRating) || averageRating < HIGH_SCORE_CRITIQUE_THRESHOLD) { + continue; + } + + const seenKey = getHighScoreCritiqueNotificationKey(paper.paper_id, averageRating); + if (!seenKey || shownHighScoreCritiquesRef.current.has(seenKey)) { + continue; + } + + shownHighScoreCritiquesRef.current.add(seenKey); + recoveredNotifications.push({ + id: `critique_recovered_${seenKey}_${Date.now()}`, + paper_id: paper.paper_id, + paper_title: paper.title || paper.paper_title || paper.paper_id, + average_rating: averageRating, + timestamp: paper.created_at || new Date().toISOString(), + seenKey, + recovered: true, + }); + } + + if (recoveredNotifications.length === 0) { + return; + } + + setCritiqueNotifications(prev => { + const existingSeenKeys = new Set(prev.map(notification => notification.seenKey).filter(Boolean)); + const newNotifications = recoveredNotifications.filter(notification => !existingSeenKeys.has(notification.seenKey)); + if 
(newNotifications.length === 0) { + return prev; + } + + const newStack = [...prev, ...newNotifications]; + return newStack.length > 3 ? newStack.slice(-3) : newStack; + }); + }, [papers]); + + useEffect(() => { + const checkLeanOJStatus = async () => { + try { + const status = await leanojAPI.getStatus(); + setLeanojStatus(status); + if (status.is_running) { + setLeanojRunning(true); + setAnyWorkflowRunning(true); + } + } catch (error) { + console.error('Failed to check initial Proof Solver status:', error); + } + }; + checkLeanOJStatus(); + }, []); + // WebSocket connection useEffect(() => { // Connect to WebSocket @@ -719,6 +987,33 @@ function App() { if (!cleaned) return ''; return cleaned.length > maxLen ? `${cleaned.slice(0, maxLen)}...` : cleaned; }; + const proofName = (data = {}) => (data.proof_label ? `Proof ${data.proof_label}` : 'Proof'); + const proofTarget = (data = {}) => data.theorem_statement || data.theorem_id || ''; + const proofLeanResponse = (data = {}) => { + if (data.lean_response) return data.lean_response; + if (data.proof_verified === true) return 'Lean 4 response: proof verified.'; + const error = formatReason(data.error_summary || data.error_output || data.reason || '', 960); + return error ? `Lean 4 response: ${error} - proof not verified.` : 'Lean 4 response: proof not verified.'; + }; + const isLeanOJProofEvent = (data = {}) => { + const sourceType = String(data.source_type || ''); + const sourceId = String(data.source_id || ''); + const trigger = String(data.trigger || ''); + return sourceType === 'leanoj_final' + || sourceType === 'leanoj_subproof' + || sourceId.startsWith('leanoj_') + || trigger.startsWith('leanoj'); + }; + const formatProofCheckCompleteMessage = (data = {}) => { + const verified = data.verified_count ?? 0; + const novel = data.novel_count ?? 0; + const hasTotal = data.total_candidates !== undefined && data.total_candidates !== null; + const base = hasTotal + ? 
`Proof check complete: ${verified}/${data.total_candidates} candidates verified, ${novel} novel` + : `Proof check complete: ${verified} verified`; + const detail = formatReason(data.message, 220); + return detail ? `${base} - ${detail}` : base; + }; // Topic exploration events (pre-brainstorm candidate collection) unsubscribers.push(websocket.on('topic_exploration_started', (data) => { @@ -908,7 +1203,7 @@ function App() { addActivity({ event: 'critique_phase_started', timestamp: getTimestamp(data), - message: `Critique phase started (Paper v${data.paper_version || '?'}, target: ${data.target_critiques || 5} critiques)`, + message: `Critique phase started (Paper v${data.paper_version || '?'}, target: ${data.target_critiques || 3} attempts)`, data }); })); @@ -925,20 +1220,11 @@ function App() { } })); - unsubscribers.push(websocket.on('body_rewrite_started', (data) => { + unsubscribers.push(websocket.on('self_review_appended', (data) => { addActivity({ - event: 'body_rewrite_started', + event: 'self_review_appended', timestamp: getTimestamp(data), - message: `REWRITE PHASE: Total rewrite started for Paper v${data.version || '?'}${data.title_changed ? ' (Title updated)' : ''}`, - data - }); - })); - - unsubscribers.push(websocket.on('partial_revision_complete', (data) => { - addActivity({ - event: 'partial_revision_complete', - timestamp: getTimestamp(data), - message: `PARTIAL REVISION: Applied ${data.edits_applied || 0} targeted edits (Paper v${data.version || '?'})${data.title_changed ? ' (Title updated)' : ''}`, + message: `AI self-review appended (${data.critique_count || 0} accepted critique${data.critique_count === 1 ? '' : 's'})`, data }); })); @@ -947,7 +1233,7 @@ function App() { addActivity({ event: 'critique_phase_ended', timestamp: getTimestamp(data), - message: `Critique phase complete (${data.decision || 'unknown'})`, + message: `Critique phase complete (self-review appended: ${data.self_review_appended ? 
'yes' : 'no'})`, data }); })); @@ -1006,96 +1292,70 @@ function App() { })); unsubscribers.push(websocket.on('proof_check_started', (data) => { - const prefix = data.trigger === 'manual' - ? 'Manual proof check started' - : data.trigger === 'retry' - ? 'Paper-stage proof retry started' - : 'Proof check started'; - addActivity({ - event: 'proof_check_started', - timestamp: getTimestamp(data), - message: `${prefix} for ${data.source_type} ${data.source_id}`, - data - }); + setProofRefreshToken((prev) => prev + 1); })); unsubscribers.push(websocket.on('proof_retry_scheduled', (data) => { - addActivity({ - event: 'proof_retry_scheduled', - timestamp: getTimestamp(data), - message: `Scheduled ${data.count || 0} proof retry candidate(s) for paper ${data.source_id}`, - data - }); + setProofRefreshToken((prev) => prev + 1); })); unsubscribers.push(websocket.on('proof_retry_started', (data) => { - addActivity({ - event: 'proof_retry_started', - timestamp: getTimestamp(data), - message: `Retrying ${data.count || 0} failed proof candidate(s) against paper ${data.source_id}`, - data - }); + setProofRefreshToken((prev) => prev + 1); })); unsubscribers.push(websocket.on('proof_check_no_candidates', (data) => { - addActivity({ - event: 'proof_check_no_candidates', - timestamp: getTimestamp(data), - message: `No formal proof candidates found in ${data.source_type} ${data.source_id}`, - data - }); + setProofRefreshToken((prev) => prev + 1); })); unsubscribers.push(websocket.on('proof_check_candidates_found', (data) => { - addActivity({ - event: 'proof_check_candidates_found', - timestamp: getTimestamp(data), - message: `Proof check found ${data.count || 0} theorem candidate(s)`, - data - }); + setProofRefreshToken((prev) => prev + 1); })); unsubscribers.push(websocket.on('proof_attempt_started', (data) => { addActivity({ event: 'proof_attempt_started', timestamp: getTimestamp(data), - message: `Proof attempt ${data.attempt || 1} started: ${data.theorem_statement || 
data.theorem_id}`, + message: `${proofName(data)}, Attempt ${data.attempt || 1} started: ${proofTarget(data)}`, data }); })); - unsubscribers.push(websocket.on('smt_check_started', (data) => { + unsubscribers.push(websocket.on('smt_check_error', (data) => { addActivity({ - event: 'smt_check_started', + event: 'smt_check_error', timestamp: getTimestamp(data), - message: `SMT check started: ${data.theorem_statement || data.theorem_id}`, + message: `${proofName(data)} SMT error: ${formatReason(data.error_summary, 960) || proofTarget(data)}`, data }); })); - unsubscribers.push(websocket.on('smt_check_complete', (data) => { + unsubscribers.push(websocket.on('proof_attempt_failed', (data) => { addActivity({ - event: 'smt_check_complete', + event: 'proof_attempt_failed', timestamp: getTimestamp(data), - message: `SMT check complete (${data.result || 'unknown'}): ${data.theorem_statement || data.theorem_id}`, + message: `${proofName(data)}, Attempt ${data.attempt || '?'} final: ${proofLeanResponse(data)}`, data }); })); - unsubscribers.push(websocket.on('proof_attempt_failed', (data) => { + unsubscribers.push(websocket.on('proof_verified', (data) => { + setProofRefreshToken((prev) => prev + 1); + })); + + unsubscribers.push(websocket.on('proof_lean_accepted', (data) => { addActivity({ - event: 'proof_attempt_failed', + event: 'proof_lean_accepted', timestamp: getTimestamp(data), - message: `Proof attempt ${data.attempt || '?'} failed: ${formatReason(data.error_summary, 960) || data.theorem_statement || data.theorem_id}`, + message: `${proofName(data)}, Attempt ${data.attempt || '?'} final: ${proofLeanResponse(data)}`, data }); })); - unsubscribers.push(websocket.on('proof_verified', (data) => { + unsubscribers.push(websocket.on('proof_integrity_rejected', (data) => { addActivity({ - event: 'proof_verified', + event: 'proof_integrity_rejected', timestamp: getTimestamp(data), - message: `Lean 4 verified: ${data.theorem_statement || data.theorem_id}`, + message: 
`${proofName(data)} error: integrity rejected - ${formatReason(data.reason, 960) || proofTarget(data)}`, data }); })); @@ -1104,7 +1364,7 @@ function App() { addActivity({ event: 'proof_attempts_exhausted', timestamp: getTimestamp(data), - message: `Proof attempts exhausted: ${data.theorem_statement || data.theorem_id}`, + message: `${proofName(data)} terminated: proof attempts exhausted for ${proofTarget(data)}`, data }); })); @@ -1126,41 +1386,28 @@ function App() { ]; return next.length > 3 ? next.slice(-3) : next; }); - addActivity({ - event: 'novel_proof_discovered', - timestamp: getTimestamp(data), - message: `Novel proof discovered: ${data.theorem_statement}`, - data - }); })); unsubscribers.push(websocket.on('known_proof_verified', (data) => { setProofRefreshToken((prev) => prev + 1); - addActivity({ - event: 'known_proof_verified', - timestamp: getTimestamp(data), - message: `Verified known proof recorded for ${data.source_type} ${data.source_id}`, - data - }); })); unsubscribers.push(websocket.on('proof_dependency_added', (data) => { setLatestProofDependencyEvent(data); setProofRefreshToken((prev) => prev + 1); - addActivity({ - event: 'proof_dependency_added', - timestamp: getTimestamp(data), - message: `Dependency graph updated for ${data.theorem_name || data.proof_id}`, - data - }); })); unsubscribers.push(websocket.on('proof_check_complete', (data) => { + if (isLeanOJProofEvent(data)) return; + if (data.source_type === 'compiler_rigor' && !isAutonomousTier2Active()) return; + setProofRefreshToken((prev) => prev + 1); + const message = formatProofCheckCompleteMessage(data); + addActivity({ event: 'proof_check_complete', timestamp: getTimestamp(data), - message: `Proof check complete: ${data.verified_count || 0} verified, ${data.novel_count || 0} novel`, + message, data }); })); @@ -1168,6 +1415,7 @@ function App() { unsubscribers.push(websocket.on('auto_research_started', () => { setAutonomousActivity([]); setAutonomousRunning(true); + 
setAnyWorkflowRunning(true); setAutonomousStopping(false); })); @@ -1175,6 +1423,7 @@ function App() { // Handle resume after crash/restart - sync running state console.log('Autonomous research resumed:', data); setAutonomousRunning(true); + setAnyWorkflowRunning(true); setAutonomousStopping(false); if (data?.tier) { autonomousTierRef.current = data.tier; @@ -1585,6 +1834,14 @@ function App() { // Add to notification stack (max 3, FIFO) setCritiqueNotifications(prev => { + const seenKey = getHighScoreCritiqueNotificationKey(data.paper_id, data.average_rating); + if (seenKey && (seenHighScoreCritiquesRef.current.has(seenKey) || prev.some(notification => notification.seenKey === seenKey))) { + return prev; + } + if (seenKey) { + shownHighScoreCritiquesRef.current.add(seenKey); + } + const newNotification = { id: `critique_${data.paper_id}_${Date.now()}`, paper_id: data.paper_id, @@ -1593,7 +1850,8 @@ function App() { novelty_rating: data.novelty_rating, correctness_rating: data.correctness_rating, impact_rating: data.impact_rating, - timestamp: data.timestamp + timestamp: data.timestamp, + seenKey }; // Add to stack, keep max 3 (remove oldest if full) @@ -1618,6 +1876,233 @@ function App() { }; }, []); + useEffect(() => { + const MAX_LEANOJ_ACTIVITY_EVENTS = 500; + const getTimestamp = (data = {}) => data?._serverTimestamp || data?.timestamp || new Date().toISOString(); + const shouldTrackLeanOJModelCall = (data = {}) => { + const taskId = String(data.task_id || ''); + const roleId = String(data.role_id || ''); + const summary = String(data.result_summary || data.message || '').toLowerCase(); + return !( + taskId === 'leanoj_sufficiency' || + taskId === 'leanoj_path' || + taskId === 'leanoj_path_val' || + summary.startsWith('sufficiency result:') || + summary.startsWith('path result:') || + ( + roleId === 'leanoj_path_validator' && + (summary.startsWith('decision: accept') || summary.startsWith('decision: reject')) + ) + ); + }; + const addLeanOJActivity = (event, 
data = {}, message = '') => { + setLeanojActivity(prev => [ + ...prev, + { + event, + timestamp: getTimestamp(data), + message: message || data.message || data.reasoning || data.decision || data.phase || 'Proof Solver update', + data, + }, + ].slice(-MAX_LEANOJ_ACTIVITY_EVENTS)); + }; + const summarizeLeanOJText = (text = '', limit = 220) => { + const cleaned = String(text || '').replace(/\s+/g, ' ').trim(); + return cleaned.length > limit ? `${cleaned.slice(0, limit)}...` : cleaned; + }; + const formatModelName = (modelId = '') => { + const cleaned = String(modelId || '').trim(); + if (!cleaned) return ''; + const displayName = cleaned.split('/').pop() || cleaned; + return displayName.length > 32 ? `${displayName.slice(0, 32)}...` : displayName; + }; + const formatLeanOJRole = (roleId = '') => { + const cleaned = String(roleId || '').replace(/^leanoj_/, '').replace(/_/g, ' ').trim(); + return cleaned ? cleaned.replace(/\b\w/g, (char) => char.toUpperCase()) : 'Proof Solver Model'; + }; + const formatLeanOJDuration = (durationMs) => { + if (durationMs === null || durationMs === undefined || Number.isNaN(Number(durationMs))) return ''; + const seconds = Number(durationMs) / 1000; + return seconds >= 60 ? `${(seconds / 60).toFixed(1)}m` : `${seconds.toFixed(1)}s`; + }; + const formatLeanOJCallResult = (data = {}) => { + const role = formatLeanOJRole(data.role_id); + const modelName = formatModelName(data.model) || 'model'; + const summary = summarizeLeanOJText(data.result_summary || data.message || '', 220); + const attemptSuffix = Number(data.attempt || 1) > 1 ? `, attempt ${data.attempt}` : ''; + const duration = formatLeanOJDuration(data.duration_ms); + const durationSuffix = duration ? `, ${duration}` : ''; + return `${role} [${modelName}]: ✓ RESULT${attemptSuffix}${durationSuffix}${summary ? ` - ${summary}` : ''}`; + }; + const formatLeanOJBrainstormMessage = (data = {}, accepted = true) => { + const submitterId = data.submitter_id ?? data.submitter ?? 
'?'; + const modelName = formatModelName(data.submitter_model || data.model) || 'N/A'; + const totalValue = accepted ? data.total_acceptances : data.total_rejections; + const total = totalValue !== undefined ? ` (total: ${totalValue})` : ''; + const detail = accepted + ? summarizeLeanOJText(data.submission_preview || data.submission, 160) + : summarizeLeanOJText( + data.rejection_reason + || data.validator_summary + || data.validator_reasoning + || data.submission_preview + || data.submission, + 160 + ); + return `Brainstorm Submitter ${submitterId} [${modelName}]: ${accepted ? '✓ ACCEPTED' : '✗ REJECTED'}${total}${detail ? ` - ${detail}` : ''}`; + }; + const leanOJProofName = (data = {}) => { + const attempt = data.attempt || {}; + if (data.proof_label) return `Proof ${data.proof_label}`; + if (data.source_type === 'leanoj_final' || attempt.target === 'final') return 'Final proof'; + if (data.source_type === 'leanoj_subproof' || data.subproof_id || data.subproof || attempt.target === 'subproof') return 'Proof fragment'; + return 'Proof'; + }; + const leanOJProofTarget = (data = {}) => { + const attempt = data.attempt || {}; + const subproof = data.subproof || {}; + return data.theorem_statement + || data.theorem_id + || subproof.theorem_or_lemma + || subproof.request + || attempt.request + || data.subproof_id + || ''; + }; + const leanOJLeanResponse = (data = {}) => { + const attempt = data.attempt || {}; + if (data.lean_response) return data.lean_response; + if (data.proof_verified === true || attempt.success === true) return 'Lean 4 response: proof verified.'; + const error = summarizeLeanOJText( + attempt.error_output || data.error_summary || data.error_output || data.reason || data.message || '', + 960 + ); + return error ? 
`Lean 4 response: ${error} - proof not verified.` : 'Lean 4 response: proof not verified.'; + }; + const leanOJAttemptStartedMessage = (data = {}) => { + const attemptNumber = data.attempt?.attempt || data.attempt || 1; + const target = leanOJProofTarget(data); + return `${leanOJProofName(data)}, Attempt ${attemptNumber} started${target ? `: ${target}` : ''}`; + }; + const leanOJAttemptFinalMessage = (data = {}) => { + const attemptNumber = data.attempt?.attempt || data.attempt || '?'; + return `${leanOJProofName(data)}, Attempt ${attemptNumber} final: ${leanOJLeanResponse(data)}`; + }; + const isLeanOJProofEvent = (data = {}) => { + const sourceType = String(data.source_type || ''); + const sourceId = String(data.source_id || ''); + const trigger = String(data.trigger || ''); + return sourceType === 'leanoj_final' + || sourceType === 'leanoj_subproof' + || sourceId.startsWith('leanoj_') + || trigger.startsWith('leanoj'); + }; + const addLeanOJSharedProofActivity = (event, data = {}, messageFactory) => { + if (!isLeanOJProofEvent(data)) return; + setLeanojProofRefreshToken((prev) => prev + 1); + addLeanOJActivity(event, data, messageFactory(data)); + }; + + const handlers = [ + ['leanoj_started', (data) => { + setLeanojRunning(true); + addLeanOJActivity('leanoj_started', data, 'Proof Solver started'); + }], + ['leanoj_stopped', (data) => { + setLeanojRunning(false); + setAnyWorkflowRunning(false); + addLeanOJActivity('leanoj_stopped', data, 'Proof Solver stopped'); + leanojAPI.getStatus().then(setLeanojStatus).catch(console.error); + }], + ['leanoj_status_updated', (data) => setLeanojStatus(data)], + ['leanoj_phase_changed', (data) => addLeanOJActivity('leanoj_phase_changed', data, `Proof Solver phase: ${data.phase || 'unknown'}`)], + ['leanoj_model_call_completed', (data) => { + if (shouldTrackLeanOJModelCall(data)) { + addLeanOJActivity('leanoj_model_call_completed', data, formatLeanOJCallResult(data)); + } + }], + ['leanoj_model_call_failed', (data) => 
addLeanOJActivity('leanoj_model_call_failed', data, `${formatLeanOJRole(data.role_id)} call failed${data.retryable ? '; retrying' : ''}: ${summarizeLeanOJText(data.message, 160)}`)], + ['leanoj_role_json_retrying', (data) => addLeanOJActivity('leanoj_role_json_retrying', data, `Proof Solver role ${data.role_id || 'model'} returned invalid JSON; retrying attempt ${data.attempt || '?'}`)], + ['leanoj_skip_brainstorm_requested', (data) => addLeanOJActivity('leanoj_skip_brainstorm_requested', data, 'Skip brainstorm requested')], + ['leanoj_brainstorm_skip_deferred', (data) => addLeanOJActivity('leanoj_brainstorm_skip_deferred', data, 'Brainstorm skip queued after topic setup')], + ['leanoj_brainstorm_skipped', (data) => addLeanOJActivity('leanoj_brainstorm_skipped', data, 'Brainstorm skipped; proceeding directly to proof solving')], + ['leanoj_force_brainstorm_requested', (data) => addLeanOJActivity('leanoj_force_brainstorm_requested', data, 'Force recursive brainstorm requested')], + ['leanoj_brainstorm_forced', (data) => addLeanOJActivity('leanoj_brainstorm_forced', data, 'Returning to recursive brainstorm with the current proof preserved')], + ['leanoj_topic_submitters_started', (data) => addLeanOJActivity('leanoj_topic_submitters_started', data, `Topic submitters started (${data.submitter_count || 0} parallel submitters)`)], + ['leanoj_topic_generation_started', (data) => addLeanOJActivity('leanoj_topic_generation_started', data, `Submitter ${data.submitter_id ?? data.submitter ?? '?'} generating topic ${data.topic_index || '?'}/${data.target_topics || 5}`)], + ['leanoj_topic_empty', (data) => addLeanOJActivity('leanoj_topic_empty', data, `Topic submitter ${data.submitter_id ?? data.submitter ?? '?'} returned empty output on attempt ${data.attempt || '?'}`)], + ['leanoj_topic_candidate_queued', (data) => addLeanOJActivity('leanoj_topic_candidate_queued', data, `Submitter ${data.submitter_id ?? data.submitter ?? 
'?'} queued topic for validation: ${summarizeLeanOJText(data.topic_preview, 140)}`)], + ['leanoj_topic_batch_validation_started', (data) => addLeanOJActivity('leanoj_topic_batch_validation_started', data, `Topic validator reviewing batch of ${data.batch_size || 0} topic(s)`)], + ['leanoj_topic_validated', (data) => addLeanOJActivity('leanoj_topic_validated', data, `Topic accepted: ${summarizeLeanOJText(data.topic, 140)}`)], + ['leanoj_topic_rejected', (data) => addLeanOJActivity('leanoj_topic_rejected', data, `Topic rejected: ${summarizeLeanOJText(data.topic, 140)}`)], + ['leanoj_recursive_brainstorm_started', (data) => addLeanOJActivity('leanoj_recursive_brainstorm_started', data, `Recursive brainstorm cycle ${data.cycle || '?'} ${data.resumed ? 'resumed' : 'started'}; targeting the current proof attempt`)], + ['leanoj_topic_submitter_failed', (data) => addLeanOJActivity('leanoj_topic_submitter_failed', data, `Topic submitter ${data.submitter || '?'} failed: ${summarizeLeanOJText(data.message, 160)}`)], + ['leanoj_recursive_brainstorm_completed', (data) => addLeanOJActivity('leanoj_recursive_brainstorm_completed', data, `Recursive brainstorm cycle ${data.cycle || '?'} completed with ${data.accepted_delta || 0} new accepted ideas`)], + ['leanoj_initial_topic_selected', (data) => addLeanOJActivity('leanoj_initial_topic_selected', data, `Initial topic: ${summarizeLeanOJText(data.topic, 140)}`)], + ['leanoj_brainstorm_submitters_started', (data) => addLeanOJActivity('leanoj_brainstorm_submitters_started', data, `Brainstorm submitters started for ${data.phase || 'brainstorm'} (${data.submitter_count || 0} parallel submitters)`)], + ['leanoj_brainstorm_submission_queued', (data) => addLeanOJActivity('leanoj_brainstorm_submission_queued', data, `Submitter ${data.submitter_id ?? data.submitter ?? 
'?'} queued brainstorm idea for validation: ${summarizeLeanOJText(data.submission_preview, 140)}`)], + ['leanoj_brainstorm_submitter_failed', (data) => addLeanOJActivity('leanoj_brainstorm_submitter_failed', data, `Brainstorm submitter ${data.submitter || '?'} failed: ${summarizeLeanOJText(data.message, 160)}`)], + ['leanoj_brainstorm_batch_validation_started', (data) => addLeanOJActivity('leanoj_brainstorm_batch_validation_started', data, `Brainstorm validator reviewing batch of ${data.batch_size || 0} submission(s)`)], + ['leanoj_brainstorm_accepted', (data) => addLeanOJActivity('leanoj_brainstorm_accepted', data, formatLeanOJBrainstormMessage(data, true))], + ['leanoj_brainstorm_rejected', (data) => addLeanOJActivity('leanoj_brainstorm_rejected', data, formatLeanOJBrainstormMessage(data, false))], + ['leanoj_brainstorm_phase_limit_reached', (data) => addLeanOJActivity('leanoj_brainstorm_phase_limit_reached', data, `Brainstorm phase limit reached for ${data.phase || 'brainstorm'} (${data.accepted_delta || 0}/${data.max_accepts || '?'})`)], + ['leanoj_brainstorm_prune_review_complete', (data) => addLeanOJActivity('leanoj_brainstorm_prune_review_complete', data, 'Brainstorm prune review complete: no removal needed')], + ['leanoj_brainstorm_prune_rejected', (data) => addLeanOJActivity('leanoj_brainstorm_prune_rejected', data, `Brainstorm prune rejected: ${summarizeLeanOJText(data.reasoning || data.reason, 140)}`)], + ['leanoj_brainstorm_prune_applied', (data) => addLeanOJActivity('leanoj_brainstorm_prune_applied', data, `Brainstorm prune applied: ${summarizeLeanOJText(data.reasoning || data.reason, 140)}`)], + ['leanoj_brainstorm_prune_apply_failed', (data) => addLeanOJActivity('leanoj_brainstorm_prune_apply_failed', data, 'Brainstorm prune apply failed')], + ['leanoj_brainstorm_prune_error', (data) => addLeanOJActivity('leanoj_brainstorm_prune_error', data, data.message || 'Brainstorm prune review error')], + ['leanoj_brainstorm_proof_failed', (data) => 
addLeanOJActivity('leanoj_brainstorm_proof_failed', data, `Brainstorm proof failed Lean gate: ${summarizeLeanOJText(data.feedback?.error_summary, 180)}`)], + ['leanoj_brainstorm_proof_registration_failed', (data) => addLeanOJActivity('leanoj_brainstorm_proof_registration_failed', data, `Brainstorm proof registration failed: ${summarizeLeanOJText(data.error, 180)}`)], + ['leanoj_brainstorm_proof_verified', (data) => { + setLeanojProofRefreshToken((prev) => prev + 1); + addLeanOJActivity('leanoj_brainstorm_proof_verified', data, `Brainstorm proof verified and accepted: ${leanOJProofTarget(data)}`); + }], + ['leanoj_path_decided', (data) => addLeanOJActivity('leanoj_path_decided', data, `Path decision: ${data.decision || ''}`)], + ['leanoj_partial_proof_saved', (data) => addLeanOJActivity('leanoj_partial_proof_saved', data, `Partial proof saved: ${data.partial_proof?.request || data.partial_proof?.target || ''}`)], + ['leanoj_master_proof_initialized', (data) => addLeanOJActivity('leanoj_master_proof_initialized', data, 'Proof Solver master proof initialized')], + ['leanoj_master_proof_edit_started', (data) => addLeanOJActivity('leanoj_master_proof_edit_started', data, `Master proof edit started for final attempt ${data.next_verification_attempt || '?'}`)], + ['leanoj_master_proof_edit_validation_started', (data) => addLeanOJActivity('leanoj_master_proof_edit_validation_started', data, `Master proof shortening validation started (${data.line_delta_removed || 0} line(s), ${data.char_delta_removed || 0} char(s) removed)`)], + ['leanoj_master_proof_edit_applied', (data) => addLeanOJActivity('leanoj_master_proof_edit_applied', data, `Master proof edit accepted (version ${data.master_proof_version || '?'})`)], + ['leanoj_master_proof_edit_rejected', (data) => addLeanOJActivity('leanoj_master_proof_edit_rejected', data, `Master proof edit rejected: ${summarizeLeanOJText(data.validator_feedback || data.error_summary || data.message, 180)}`)], + ['leanoj_master_proof_stuck', 
(data) => addLeanOJActivity('leanoj_master_proof_stuck', data, data.continuing_final_cycle ? `Master proof stuck; continuing final cycle (${data.attempts_in_cycle || '?'} / ${data.max_attempts || '?'})` : `Master proof stuck; path requested: ${data.requested_path || 'unknown'}`)], + ['leanoj_master_proof_progress_watchdog', (data) => addLeanOJActivity('leanoj_master_proof_progress_watchdog', data, data.continuing_final_cycle ? `Master proof watchdog fired; continuing final cycle (${data.attempts_in_cycle || '?'} / ${data.max_attempts || '?'})` : `Master proof watchdog returned to ${data.requested_path || 'path planning'}`)], + ['leanoj_final_attempt_started', (data) => addLeanOJActivity('leanoj_final_attempt_started', data, leanOJAttemptStartedMessage(data))], + ['leanoj_final_attempt_failed', (data) => addLeanOJActivity('leanoj_final_attempt_failed', data, leanOJAttemptFinalMessage(data))], + ['leanoj_final_attempt_cycle_exhausted', (data) => addLeanOJActivity('leanoj_final_attempt_cycle_exhausted', data, data.message || 'Final attempt cycle exhausted; returning to path planning')], + ['leanoj_final_verified', (data) => { + setLeanojRunning(false); + setAnyWorkflowRunning(false); + setLeanojProofRefreshToken((prev) => prev + 1); + addLeanOJActivity('leanoj_final_verified', data, `${leanOJProofName(data)} verified and accepted: ${leanOJProofTarget(data) || 'final Proof Solver submission'}`); + leanojAPI.getStatus().then(setLeanojStatus).catch(console.error); + }], + ['proof_check_started', (data) => addLeanOJSharedProofActivity('proof_check_started', data, (eventData) => `Proof check started for ${eventData.source_type} ${eventData.source_id}`)], + ['proof_check_no_candidates', (data) => addLeanOJSharedProofActivity('proof_check_no_candidates', data, (eventData) => `No formal theorem candidates found in ${eventData.source_type} ${eventData.source_id}`)], + ['proof_check_candidates_found', (data) => addLeanOJSharedProofActivity('proof_check_candidates_found', data, 
(eventData) => `Proof candidates found: ${eventData.count || 0}`)], + ['proof_attempt_started', (data) => addLeanOJSharedProofActivity('proof_attempt_started', data, leanOJAttemptStartedMessage)], + ['proof_attempt_failed', (data) => addLeanOJSharedProofActivity('proof_attempt_failed', data, leanOJAttemptFinalMessage)], + ['proof_lean_accepted', (data) => addLeanOJSharedProofActivity('proof_lean_accepted', data, leanOJAttemptFinalMessage)], + ['proof_integrity_rejected', (data) => addLeanOJSharedProofActivity('proof_integrity_rejected', data, (eventData) => `${leanOJProofName(eventData)} error: integrity rejected - ${summarizeLeanOJText(eventData.reason || leanOJProofTarget(eventData), 960)}`)], + ['proof_verified', (data) => addLeanOJSharedProofActivity('proof_verified', data, (eventData) => `${leanOJProofName(eventData)} verified and accepted: ${leanOJProofTarget(eventData)}`)], + ['proof_attempts_exhausted', (data) => addLeanOJSharedProofActivity('proof_attempts_exhausted', data, (eventData) => `${leanOJProofName(eventData)} terminated: proof attempts exhausted for ${leanOJProofTarget(eventData)}`)], + ['novel_proof_discovered', (data) => addLeanOJSharedProofActivity('novel_proof_discovered', data, (eventData) => `${leanOJProofName(eventData)} novel proof discovered: ${eventData.theorem_statement || leanOJProofTarget(eventData)}`)], + ['known_proof_verified', (data) => addLeanOJSharedProofActivity('known_proof_verified', data, (eventData) => `${leanOJProofName(eventData)} known proof verified for ${eventData.source_type} ${eventData.source_id}`)], + ['proof_dependency_added', (data) => addLeanOJSharedProofActivity('proof_dependency_added', data, () => 'Proof Solver proof dependency added')], + ['proof_check_complete', (data) => addLeanOJSharedProofActivity('proof_check_complete', data, (eventData) => `Proof check complete: ${eventData.verified_count || 0} verified, ${eventData.novel_count || 0} novel`)], + ['leanoj_error', (data) => 
addLeanOJActivity('leanoj_error', data, data.message || 'Proof Solver error')], + ['leanoj_cleared', (data) => { + setLeanojRunning(false); + setAnyWorkflowRunning(false); + setLeanojActivity([]); + setLeanojStatus(data); + setLeanojProofRefreshToken((prev) => prev + 1); + }], + ]; + + handlers.forEach(([event, handler]) => websocket.on(event, handler)); + return () => handlers.forEach(([event, handler]) => websocket.off(event, handler)); + }, []); + // Poll for autonomous data while running useEffect(() => { if (!autonomousRunning) return; @@ -1642,6 +2127,24 @@ function App() { return () => clearInterval(interval); }, [autonomousRunning]); + + useEffect(() => { + if (!leanojRunning) return; + + const interval = setInterval(async () => { + try { + const status = await leanojAPI.getStatus(); + setLeanojStatus(status); + if (!status.is_running) { + setLeanojRunning(false); + } + } catch (error) { + console.error('Failed to poll Proof Solver status:', error); + } + }, 3000); + + return () => clearInterval(interval); + }, [leanojRunning]); // Clean up expired rate limits every minute useEffect(() => { @@ -1667,6 +2170,7 @@ function App() { const handleAutonomousStart = async (researchPrompt) => { try { const lmStudioEnabled = capabilities.lmStudioEnabled; + const superchargeAllowed = developerModeEnabled; // Convert frontend camelCase to backend snake_case for submitter_configs (includes OpenRouter fields) const submitterConfigs = autonomousConfig.submitter_configs?.map(cfg => ({ @@ -1674,9 +2178,11 @@ function App() { provider: normalizeRuntimeProvider(cfg.provider, lmStudioEnabled), model_id: cfg.modelId, openrouter_provider: cfg.openrouterProvider || null, + openrouter_reasoning_effort: cfg.openrouterReasoningEffort || 'auto', lm_studio_fallback_id: lmStudioEnabled ? 
(cfg.lmStudioFallbackId || null) : null, context_window: cfg.contextWindow, - max_output_tokens: cfg.maxOutputTokens + max_output_tokens: cfg.maxOutputTokens, + supercharge_enabled: superchargeAllowed && Boolean(cfg.superchargeEnabled || cfg.supercharge_enabled) })) || []; await autonomousAPI.start({ @@ -1689,11 +2195,13 @@ function App() { ), validator_model: autonomousConfig.validator_model, validator_openrouter_provider: autonomousConfig.validator_openrouter_provider, + validator_openrouter_reasoning_effort: autonomousConfig.validator_openrouter_reasoning_effort || 'auto', validator_lm_studio_fallback: lmStudioEnabled ? autonomousConfig.validator_lm_studio_fallback : null, validator_context_window: autonomousConfig.validator_context_window, validator_max_tokens: autonomousConfig.validator_max_tokens, + validator_supercharge_enabled: superchargeAllowed && Boolean(autonomousConfig.validator_supercharge_enabled), // High-context submitter config with OpenRouter support high_context_provider: normalizeRuntimeProvider( autonomousConfig.high_context_provider, @@ -1701,11 +2209,13 @@ function App() { ), high_context_model: autonomousConfig.high_context_model, high_context_openrouter_provider: autonomousConfig.high_context_openrouter_provider, + high_context_openrouter_reasoning_effort: autonomousConfig.high_context_openrouter_reasoning_effort || 'auto', high_context_lm_studio_fallback: lmStudioEnabled ? 
autonomousConfig.high_context_lm_studio_fallback : null, high_context_context_window: autonomousConfig.high_context_context_window, high_context_max_tokens: autonomousConfig.high_context_max_tokens, + high_context_supercharge_enabled: superchargeAllowed && Boolean(autonomousConfig.high_context_supercharge_enabled), // High-param submitter config with OpenRouter support high_param_provider: normalizeRuntimeProvider( autonomousConfig.high_param_provider, @@ -1713,11 +2223,13 @@ function App() { ), high_param_model: autonomousConfig.high_param_model, high_param_openrouter_provider: autonomousConfig.high_param_openrouter_provider, + high_param_openrouter_reasoning_effort: autonomousConfig.high_param_openrouter_reasoning_effort || 'auto', high_param_lm_studio_fallback: lmStudioEnabled ? autonomousConfig.high_param_lm_studio_fallback : null, high_param_context_window: autonomousConfig.high_param_context_window, high_param_max_tokens: autonomousConfig.high_param_max_tokens, + high_param_supercharge_enabled: superchargeAllowed && Boolean(autonomousConfig.high_param_supercharge_enabled), // Critique submitter config with OpenRouter support critique_submitter_provider: normalizeRuntimeProvider( autonomousConfig.critique_submitter_provider, @@ -1725,16 +2237,19 @@ function App() { ), critique_submitter_model: autonomousConfig.critique_submitter_model, critique_submitter_openrouter_provider: autonomousConfig.critique_submitter_openrouter_provider, + critique_submitter_openrouter_reasoning_effort: autonomousConfig.critique_submitter_openrouter_reasoning_effort || 'auto', critique_submitter_lm_studio_fallback: lmStudioEnabled ? 
autonomousConfig.critique_submitter_lm_studio_fallback : null, critique_submitter_context_window: autonomousConfig.critique_submitter_context_window, critique_submitter_max_tokens: autonomousConfig.critique_submitter_max_tokens, + critique_submitter_supercharge_enabled: superchargeAllowed && Boolean(autonomousConfig.critique_submitter_supercharge_enabled), tier3_enabled: autonomousConfig.tier3_enabled ?? false }); setAutonomousRunning(true); setAutonomousStopping(false); setAutonomousActivity([]); + setAnyWorkflowRunning(true); } catch (error) { alert(`Failed to start autonomous research: ${error.details || error.message}`); } @@ -1785,6 +2300,94 @@ function App() { } }; + const normalizeLeanOJRoleForCapabilities = (roleConfig = {}) => { + const lmStudioEnabled = capabilities.lmStudioEnabled; + const provider = normalizeRuntimeProvider(roleConfig.provider, lmStudioEnabled); + const shouldResetLmState = !lmStudioEnabled && roleConfig.provider !== 'openrouter'; + return { + ...roleConfig, + provider, + model_id: shouldResetLmState ? '' : (roleConfig.model_id || ''), + openrouter_provider: shouldResetLmState ? null : (roleConfig.openrouter_provider || null), + lm_studio_fallback_id: lmStudioEnabled ? 
(roleConfig.lm_studio_fallback_id || null) : null,
+      supercharge_enabled: developerModeEnabled && Boolean(roleConfig.supercharge_enabled),
+    };
+  };
+
+  const normalizeLeanOJRequestForCapabilities = (request) => ({
+    ...request,
+    topic_generator: normalizeLeanOJRoleForCapabilities(request.topic_generator),
+    topic_validator: normalizeLeanOJRoleForCapabilities(request.topic_validator),
+    brainstorm_submitters: (request.brainstorm_submitters || []).map(normalizeLeanOJRoleForCapabilities),
+    brainstorm_validator: normalizeLeanOJRoleForCapabilities(request.brainstorm_validator),
+    path_decider: normalizeLeanOJRoleForCapabilities(request.path_decider || request.final_solver),
+    final_solver: normalizeLeanOJRoleForCapabilities(request.final_solver),
+  });
+
+  const handleLeanOJStart = async (request) => {
+    try {
+      await leanojAPI.start(normalizeLeanOJRequestForCapabilities(request));
+      setLeanojRunning(true);
+      setLeanojActivity([]);
+      const status = await leanojAPI.getStatus();
+      setLeanojStatus(status);
+      setLeanojProofRefreshToken((prev) => prev + 1);
+      setAnyWorkflowRunning(true);
+    } catch (error) {
+      alert(`Failed to start Proof Solver: ${error.details || error.message}`);
+    }
+  };
+
+  const handleLeanOJStop = async () => {
+    try {
+      await leanojAPI.stop();
+      setLeanojRunning(false);
+      setAnyWorkflowRunning(false);
+      const status = await leanojAPI.getStatus();
+      setLeanojStatus(status);
+    } catch (error) {
+      alert(`Failed to stop Proof Solver: ${error.message}`);
+    }
+  };
+
+  const handleLeanOJClear = async () => {
+    if (!window.confirm('Clear all saved Proof Solver progress?')) {
+      return;
+    }
+    try {
+      const result = await leanojAPI.clear();
+      setLeanojRunning(false);
+      setAnyWorkflowRunning(false);
+      setLeanojActivity([]);
+      setLeanojStatus(result.status || null);
+      setLeanojProofRefreshToken((prev) => prev + 1);
+    } catch (error) {
+      alert(`Failed to clear Proof Solver progress: ${error.message}`);
+    }
+  };
+
+  const handleLeanOJSkipBrainstorm = async () => {
+    try {
+      const result = await leanojAPI.skipBrainstorm();
+      if (result.status) {
+        setLeanojStatus(result.status);
+      }
+    } catch (error) {
+      alert(`Failed to skip Proof Solver brainstorming: ${error.message}`);
+    }
+  };
+
+  const handleLeanOJForceBrainstorm = async () => {
+    try {
+      const result = await leanojAPI.forceBrainstorm();
+      if (result.status) {
+        setLeanojStatus(result.status);
+      }
+    } catch (error) {
+      alert(`Failed to force Proof Solver recursive brainstorming: ${error.message}`);
+    }
+  };
+
   const refreshBrainstorms = async () => {
     try {
       const data = await autonomousAPI.getBrainstorms();
@@ -1796,8 +2399,12 @@ function App() {
   const refreshPapers = async () => {
     try {
-      const data = await autonomousAPI.getPapers();
+      const [data, stats] = await Promise.all([
+        autonomousAPI.getPapers(),
+        autonomousAPI.getStats(),
+      ]);
       setPapers(data.papers || []);
+      setAutonomousStats(stats);
     } catch (error) {
       console.error('Failed to refresh papers:', error);
     }
@@ -1816,10 +2423,13 @@ function App() {
   // Critique notification handlers
   const handleDismissNotification = (notificationId) => {
+    const notification = critiqueNotifications.find(item => item.id === notificationId);
+    markHighScoreCritiqueSeen(notification?.seenKey);
     setCritiqueNotifications(prev => prev.filter(n => n.id !== notificationId));
   };
-  const handleClickNotification = (paperId, paperTitle) => {
+  const handleClickNotification = (paperId, paperTitle, seenKey) => {
+    markHighScoreCritiqueSeen(seenKey);
     setSelectedCritiquePaper({ paper_id: paperId, paper_title: paperTitle });
     setShowCritiqueModal(true);
   };
@@ -1856,6 +2466,13 @@ function App() {
     }
   };
+  const handleLeanOJTabSelect = (tabId) => {
+    setLeanojActiveTab(tabId);
+    if (appMode !== 'leanoj') {
+      setAppMode('leanoj');
+    }
+  };
+
   // Credit exhaustion notification handler
   const handleDismissCreditNotification = (notificationId) => {
     setCreditExhaustionNotifications(prev => prev.filter(n => n.id !== notificationId));
@@ -2029,6 +2646,19 @@ function App() {
     { id: 'compiler-live-paper', label: 'Live Paper', subtext: 'Part 2 Live Results', subtextClass: 'green', group: 'compiler' },
   ];
+  const leanojMainTabs = [
+    { id: 'leanoj-interface', label: 'Proof Solver', group: 'leanoj-main' },
+    { id: 'leanoj-brainstorms', label: 'Brainstorms', group: 'leanoj-main' },
+    { id: 'leanoj-master-proof', label: 'Master Proof Draft', group: 'leanoj-main' },
+    { id: 'leanoj-proofs', label: 'Mathematical Proofs', group: 'leanoj-main' },
+  ];
+
+  const leanojSettingsTabs = [
+    { id: 'leanoj-completed-proof-works', label: 'Your Completed Proof Works Library', group: 'leanoj-settings' },
+    { id: 'leanoj-logs', label: 'API Call Logs', group: 'leanoj-settings' },
+    { id: 'leanoj-settings', label: 'Proof Solver Model Profiles & Settings', group: 'leanoj-settings' },
+  ];
+
   useEffect(() => {
     if (!autonomousConfig.tier3_enabled && autonomousActiveTab === 'auto-final-answer') {
       setAutonomousActiveTab('auto-interface');
@@ -2050,13 +2680,14 @@ function App() {
   useEffect(() => {
     const checkWorkflowStatus = async () => {
       try {
-        const [aggStatus, compStatus, autoStatus] = await Promise.all([
+        const [aggStatus, compStatus, autoStatus, leanojCurrentStatus] = await Promise.all([
           api.get('/api/aggregator/status').catch(() => ({ is_running: false })),
           api.get('/api/compiler/status').catch(() => ({ is_running: false })),
-          autonomousAPI.getStatus().catch(() => ({ is_running: false }))
+          autonomousAPI.getStatus().catch(() => ({ is_running: false })),
+          leanojAPI.getStatus().catch(() => ({ is_running: false }))
         ]);
-        const running = aggStatus.is_running || compStatus.is_running || autoStatus.is_running;
+        const running = aggStatus.is_running || compStatus.is_running || autoStatus.is_running || leanojCurrentStatus.is_running;
         setAnyWorkflowRunning(running);
       } catch (error) {
         console.error('Failed to check workflow status:', error);
@@ -2079,6 +2710,12 @@ function App() {

By Intrafere Research Group

A Prototype Artificial Superintelligence - Novelty Seeking Autonomous S.T.E.M. Researcher For Automated Theorem Generation

+

+ {appMode === 'manual' ? 'MANUAL S.T.E.M. WRITER' : 'Proof Solver Mode'} +

@@ -2107,6 +2744,9 @@ function App() { > + {developerModeEnabled && ( + + )}
@@ -2178,9 +2818,17 @@ function App() { Hosted Web Mode )} + {developerModeEnabled && ( + + Developer Mode + + )}
-
+
{appMode === 'autonomous' ? ( <> {mainTabs.map((tab, index) => { @@ -2227,6 +2875,32 @@ function App() { ); })} + ) : appMode === 'leanoj' ? ( + <> + {leanojMainTabs.map((tab) => ( + + + + ))} + +
+ + {leanojSettingsTabs.map((tab) => ( + + + + ))} + ) : ( <> {manualTabs.map((tab, index) => { @@ -2309,10 +2983,11 @@ function App() { )} @@ -2382,6 +3057,49 @@ function App() { events={autonomousActivity} /> )} + + {activeTab === 'leanoj-interface' && ( + + )} + {activeTab === 'leanoj-brainstorms' && ( + + )} + {activeTab === 'leanoj-proofs' && ( + + )} + {activeTab === 'leanoj-master-proof' && ( + + )} + {activeTab === 'leanoj-completed-proof-works' && ( + + )} + {activeTab === 'leanoj-logs' && ( + + )} + {/* Full-width settings screens with model sidebars are rendered outside the padded tab container. */} {activeTab === 'aggregator-interface' && ( )} - {activeTab === 'aggregator-settings' && ( - - )} + {/* Full-width settings screens with model sidebars are rendered outside the padded tab container. */} {activeTab === 'aggregator-logs' && } {activeTab === 'aggregator-results' && } @@ -2406,11 +3120,11 @@ function App() { activeTab={activeTab} capabilities={capabilities} anyWorkflowRunning={anyWorkflowRunning} + onWorkflowRunningChange={setAnyWorkflowRunning} + developerModeEnabled={developerModeEnabled} /> )} - {activeTab === 'compiler-settings' && ( - - )} + {/* Full-width settings screens with model sidebars are rendered outside the padded tab container. */} {activeTab === 'compiler-logs' && } {activeTab === 'compiler-live-paper' && }
@@ -2424,6 +3138,33 @@ function App() { models={models} capabilities={capabilities} isRunning={autonomousRunning} + developerModeEnabled={developerModeEnabled} + /> + )} + + {activeTab === 'leanoj-settings' && ( + + )} + + {activeTab === 'aggregator-settings' && ( + + )} + + {activeTab === 'compiler-settings' && ( + )} @@ -2509,6 +3250,7 @@ function App() { isOpen={showBoostModal} onClose={() => setShowBoostModal(false)} capabilities={capabilities} + developerModeEnabled={developerModeEnabled} /> {/* OpenRouter API Key Modal */} @@ -2566,6 +3308,7 @@ function App() { paperTitle={selectedCritiquePaper.paper_title} onGenerateCritique={handleGenerateCritique} onGetCritiques={handleGetCritiques} + developerModeEnabled={developerModeEnabled} /> )} @@ -2585,7 +3328,6 @@ function App() { rel="noopener noreferrer" className="footer-link footer-link-github" > - ℹ️ How MOTO's Superintelligence Works - Purchase a Custom ASI Program + Purchase Custom Industrial-Grade ASI Programs - - Visit MOTO's GitHub (Star Us for More ASI Programs) + Intrafere GitHub
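Reviewer note: the App.jsx hunks above gate `supercharge_enabled` on `developerModeEnabled` and fall back from `path_decider` to `final_solver` when normalizing a Proof Solver request. The sketch below shows that normalization pattern in isolation. It is not the code from this PR: `normalizeRole`, `normalizeRequest`, and the sample payload are hypothetical names chosen for illustration, and the real `normalizeLeanOJRoleForCapabilities` appears to handle additional provider and fallback fields that are only partly visible in this diff.

```javascript
// Hypothetical sketch; function names and payload shape are illustrative assumptions.
function normalizeRole(roleConfig, developerModeEnabled) {
  const config = roleConfig || {};
  return {
    ...config,
    // Developer-only flag: forced off unless developer mode is enabled.
    supercharge_enabled: Boolean(developerModeEnabled) && Boolean(config.supercharge_enabled),
  };
}

function normalizeRequest(request, developerModeEnabled) {
  const normalize = (role) => normalizeRole(role, developerModeEnabled);
  return {
    ...request,
    brainstorm_submitters: (request.brainstorm_submitters || []).map(normalize),
    // Requests without a path_decider reuse the final_solver config.
    path_decider: normalize(request.path_decider || request.final_solver),
    final_solver: normalize(request.final_solver),
  };
}

const sample = {
  brainstorm_submitters: [{ model_id: 'model-a', supercharge_enabled: true }],
  final_solver: { model_id: 'model-b', supercharge_enabled: true },
};

const withoutDevMode = normalizeRequest(sample, false);
const withDevMode = normalizeRequest(sample, true);

console.log(withoutDevMode.final_solver.supercharge_enabled); // false
console.log(withDevMode.path_decider.model_id); // model-b
```

Wrapping the role normalizer in an arrow function before passing it to `map` keeps the array index from being mistaken for a second argument, which matters once the normalizer takes more than one parameter.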
diff --git a/frontend/src/components/ApiCallLogs.jsx b/frontend/src/components/ApiCallLogs.jsx
new file mode 100644
index 0000000..1b4bffd
--- /dev/null
+++ b/frontend/src/components/ApiCallLogs.jsx
@@ -0,0 +1,412 @@
+import React, { useCallback, useEffect, useRef, useState } from 'react';
+import './autonomous/AutonomousResearch.css';
+
+const EMPTY_API_STATS = Object.freeze({
+  total_calls: 0,
+  successful_calls: 0,
+  failed_calls: 0,
+  success_rate: 0,
+  boosted_calls: 0,
+  by_phase: {},
+  by_model: {},
+  by_provider: {},
+  by_source: {},
+  by_boost_mode: {},
+});
+
+function formatDuration(ms) {
+  if (ms === null || ms === undefined) return '-';
+  if (ms < 1000) return `${Math.round(ms)}ms`;
+  return `${(ms / 1000).toFixed(1)}s`;
+}
+
+function formatTimestamp(timestamp) {
+  try {
+    return new Date(timestamp).toLocaleString();
+  } catch {
+    return timestamp;
+  }
+}
+
+function getPhaseLabel(phase) {
+  switch (phase) {
+    case 'topic_selection': return 'Topic';
+    case 'brainstorm': return 'Brainstorm';
+    case 'paper_compilation': return 'Paper';
+    case 'tier3': return 'Tier 3';
+    case 'boost': return 'Boost';
+    case 'initial_topic_candidates': return 'Initial Topics';
+    case 'initial_brainstorm': return 'Initial Brainstorm';
+    case 'recursive_brainstorm': return 'Recursive Brainstorm';
+    case 'proof_storm': return 'Legacy Proof Storm';
+    case 'path_decision': return 'Path Decision';
+    case 'final_proof_loop': return 'Final Proof Loop';
+    default: return phase || 'Unknown';
+  }
+}
+
+function getSourceLabel(source) {
+  switch (source) {
+    case 'api+boost': return 'Boosted';
+    case 'boost': return 'Boost Only';
+    default: return 'Standard';
+  }
+}
+
+function getBoostModeLabel(mode) {
+  switch (mode) {
+    case 'next_count': return 'Next X';
+    case 'category': return 'Category';
+    case 'task_id': return 'Task ID';
+    default: return mode || 'Unknown';
+  }
+}
+
+function getProviderLabel(provider) {
+  switch (provider) {
+    case 'openrouter': return 'OR';
+    case 'lm_studio': return 'LMS';
+    default: return provider || 'UNK';
+  }
+}
+
+export default function ApiCallLogs({
+  api,
+  workflow = null,
+  title = 'API Call Logs',
+  emptyHint = 'Run a workflow and make API calls to see the combined logs here.',
+  style,
+}) {
+  const [apiLogs, setApiLogs] = useState([]);
+  const [apiStats, setApiStats] = useState(null);
+  const [apiLogsLoading, setApiLogsLoading] = useState(true);
+  const [expandedApiLogIdx, setExpandedApiLogIdx] = useState(null);
+  const [apiAutoRefresh, setApiAutoRefresh] = useState(true);
+  const [apiLogDetails, setApiLogDetails] = useState({});
+  const abortControllerRef = useRef(null);
+
+  const fetchApiLogs = useCallback(async () => {
+    if (abortControllerRef.current) {
+      abortControllerRef.current.abort();
+    }
+
+    const controller = new AbortController();
+    abortControllerRef.current = controller;
+
+    try {
+      const response = await api.getApiLogs(100, { signal: controller.signal, workflow });
+      if (abortControllerRef.current !== controller) {
+        return;
+      }
+
+      if (response.success) {
+        const logs = response.logs || [];
+        setApiLogs(logs);
+        setApiLogDetails((prev) => {
+          const visibleKeys = new Set(logs.map((log) => log.log_key).filter(Boolean));
+          return Object.fromEntries(
+            Object.entries(prev).filter(([key]) => visibleKeys.has(key))
+          );
+        });
+        setApiStats(response.stats || EMPTY_API_STATS);
+      }
+    } catch (error) {
+      if (abortControllerRef.current !== controller) {
+        return;
+      }
+
+      if (error.name !== 'AbortError') {
+        console.error('Failed to fetch API logs:', error);
+      }
+    } finally {
+      if (abortControllerRef.current === controller) {
+        setApiLogsLoading(false);
+      }
+    }
+  }, [api, workflow]);
+
+  const fetchApiLogDetail = useCallback(async (log) => {
+    const logKey = log?.log_key;
+    if (!logKey || typeof api.getApiLogDetail !== 'function') {
+      return log;
+    }
+
+    if (apiLogDetails[logKey]) {
+      return apiLogDetails[logKey];
+    }
+
+    try {
+      const response = await api.getApiLogDetail(logKey, { workflow });
+      const detailedLog = response.log || log;
+      setApiLogDetails((prev) => ({
+        ...prev,
+        [logKey]: detailedLog,
+      }));
+      return detailedLog;
+    } catch (error) {
+      console.error('Failed to fetch API log detail:', error);
+      return log;
+    }
+  }, [api, apiLogDetails, workflow]);
+
+  useEffect(() => {
+    fetchApiLogs();
+
+    let interval;
+    if (apiAutoRefresh) {
+      interval = setInterval(fetchApiLogs, 5000);
+    }
+
+    return () => {
+      if (interval) clearInterval(interval);
+      if (abortControllerRef.current) {
+        abortControllerRef.current.abort();
+        abortControllerRef.current = null;
+      }
+    };
+  }, [fetchApiLogs, apiAutoRefresh]);
+
+  const handleClearApiLogs = async () => {
+    if (!window.confirm('Are you sure you want to clear these API logs?')) {
+      return;
+    }
+
+    try {
+      if (abortControllerRef.current) {
+        abortControllerRef.current.abort();
+        abortControllerRef.current = null;
+      }
+
+      await api.clearApiLogs({ workflow });
+      setApiLogs([]);
+      setApiLogDetails({});
+      setApiStats(EMPTY_API_STATS);
+      setExpandedApiLogIdx(null);
+      setApiLogsLoading(false);
+    } catch (error) {
+      console.error('Failed to clear API logs:', error);
+    }
+  };
+
+  const copyToClipboard = async (text) => {
+    try {
+      await navigator.clipboard.writeText(text);
+    } catch (error) {
+      console.error('Failed to copy to clipboard:', error);
+    }
+  };
+
+  const handleToggleApiLog = (log, index) => {
+    const nextIndex = expandedApiLogIdx === index ? null : index;
+    setExpandedApiLogIdx(nextIndex);
+    if (nextIndex !== null) {
+      fetchApiLogDetail(log);
+    }
+  };
+
+  const handleCopyLogText = async (log, fullField, previewField) => {
+    const detailedLog = await fetchApiLogDetail(log);
+    copyToClipboard(detailedLog?.[fullField] || log?.[previewField] || '');
+  };
+
+  return (
+
+
+

{title}

+
+ + + +
+
+ + {apiStats && ( +
+
+ {apiStats.total_calls} + Total API Calls +
+
+ {apiStats.successful_calls} + Successful +
+
+ {apiStats.failed_calls} + Failed +
+
+ + {(apiStats.success_rate * 100).toFixed(1)}% + + Success Rate +
+
+ {apiStats.boosted_calls || 0} + Boosted Calls +
+
+ )} + + {apiStats && apiStats.by_phase && Object.keys(apiStats.by_phase).length > 0 && ( +
+ By Phase: + {Object.entries(apiStats.by_phase).map(([phase, count]) => ( + + {getPhaseLabel(phase)}: {count} + + ))} +
+ )} + + {apiStats && apiStats.by_source && Object.keys(apiStats.by_source).length > 0 && ( +
+ By Source: + {Object.entries(apiStats.by_source).map(([source, count]) => ( + + {getSourceLabel(source)}: {count} + + ))} +
+ )} + + {apiStats && apiStats.by_boost_mode && Object.keys(apiStats.by_boost_mode).length > 0 && ( +
+ Boost Modes: + {Object.entries(apiStats.by_boost_mode).map(([mode, count]) => ( + + {getBoostModeLabel(mode)}: {count} + + ))} +
+ )} + +
+ {apiLogsLoading ? ( +
Loading API logs...
+ ) : apiLogs.length === 0 ? ( +
+

No API calls logged yet.

+

{emptyHint}

+
+ ) : ( + apiLogs.map((log, index) => { + const detailedLog = log.log_key ? (apiLogDetails[log.log_key] || log) : log; + return ( +
+
handleToggleApiLog(log, index)} + > +
+ {log.success ? '✓' : '✗'} +
+
+
+ {log.task_id} + {getPhaseLabel(log.phase)} + + {getSourceLabel(log.source)} + + {log.boost_mode && ( + {getBoostModeLabel(log.boost_mode)} + )} +
+
+ {log.model} + {getProviderLabel(log.provider)} + {formatDuration(log.duration_ms)} + {log.tokens_used && ( + {log.tokens_used} tokens + )} +
+
+
{formatTimestamp(log.timestamp)}
+
{expandedApiLogIdx === index ? '▼' : '▶'}
+
+ + {expandedApiLogIdx === index && ( +
+
+

Role

+
{log.role_id}
+
+ +
+

Source

+
{getSourceLabel(log.source)}{log.boost_mode ? ` (${getBoostModeLabel(log.boost_mode)})` : ''}
+
+ + {log.error && ( +
+

Error

+
{log.error}
+
+ )} + +
+
+

Sent Prompt

+ +
+ {detailedLog.prompt_redacted && ( +
Full prompt redacted; preview and size/hash metadata are retained.
+ )} +
{log.prompt_preview || '(empty)'}
+
+ +
+
+

Received Response

+ +
+ {detailedLog.response_redacted && ( +
Full response redacted; preview and size/hash metadata are retained.
+ )} +
{detailedLog.response_preview || log.response_preview || '(empty)'}
+
+
+ )} +
+ ); + }) + )} +
+
+  );
+}
diff --git a/frontend/src/components/BoostControlModal.jsx b/frontend/src/components/BoostControlModal.jsx
index 7aafed3..7364d6d 100644
--- a/frontend/src/components/BoostControlModal.jsx
+++ b/frontend/src/components/BoostControlModal.jsx
@@ -2,19 +2,31 @@ import React, { useState, useEffect, useRef } from 'react';
 import { boostAPI, openRouterAPI } from '../services/api';
 import {
   computeOpenRouterAutoSettings,
+  DEFAULT_CONTEXT_WINDOW,
+  DEFAULT_MAX_OUTPUT_TOKENS,
+  DEFAULT_OPENROUTER_REASONING_EFFORT,
   findOpenRouterModel,
   getProviderNames,
+  getReasoningSupportInfo,
+  normalizeOpenRouterReasoningEffort,
+  OPENROUTER_REASONING_EFFORT_OPTIONS,
 } from '../utils/openRouterSelection';
 import './BoostControlModal.css';
 const BOOST_SETTINGS_STORAGE_KEY = 'boost_modal_settings';
-export default function BoostControlModal({ isOpen, onClose, capabilities }) {
+export default function BoostControlModal({
+  isOpen,
+  onClose,
+  capabilities,
+  developerModeEnabled = false,
+}) {
   const [apiKey, setApiKey] = useState('');
   const [boostModel, setBoostModel] = useState('');
   const [selectedProvider, setSelectedProvider] = useState('');
-  const [contextWindow, setContextWindow] = useState(131072);
-  const [maxOutputTokens, setMaxOutputTokens] = useState(25000);
+  const [reasoningEffort, setReasoningEffort] = useState(DEFAULT_OPENROUTER_REASONING_EFFORT);
+  const [contextWindow, setContextWindow] = useState(DEFAULT_CONTEXT_WINDOW);
+  const [maxOutputTokens, setMaxOutputTokens] = useState(DEFAULT_MAX_OUTPUT_TOKENS);
   const [models, setModels] = useState([]);
   const [providerData, setProviderData] = useState(null);
   const [loading, setLoading] = useState(false);
@@ -31,6 +43,7 @@ export default function BoostControlModal({ isOpen, onClose, capabilities }) {
   const hasAvailableKey = Boolean(apiKey.trim() || hasGlobalKey);
   const providers = getProviderNames(providerData);
+  const reasoningInfo = getReasoningSupportInfo(providerData, selectedProvider || null);
   const lmStudioEnabled = capabilities?.lmStudioEnabled !== false;
   // Load saved settings from localStorage on mount
@@ -41,6 +54,7 @@ export default function BoostControlModal({ isOpen, onClose, capabilities }) {
         const settings = JSON.parse(saved);
         if (settings.boostModel) setBoostModel(settings.boostModel);
         if (settings.selectedProvider) setSelectedProvider(settings.selectedProvider);
+        if (settings.reasoningEffort) setReasoningEffort(normalizeOpenRouterReasoningEffort(settings.reasoningEffort));
         if (settings.contextWindow) setContextWindow(settings.contextWindow);
         if (settings.maxOutputTokens) setMaxOutputTokens(settings.maxOutputTokens);
         if (settings.freeOnly !== undefined) setFreeOnly(settings.freeOnly);
@@ -53,11 +67,12 @@ export default function BoostControlModal({ isOpen, onClose, capabilities }) {
   // Save settings to localStorage whenever they change
   useEffect(() => {
     // Only save if we have meaningful values (not initial empty state)
-    if (boostModel || selectedProvider || contextWindow !== 131072 || maxOutputTokens !== 25000 || freeOnly) {
+    if (boostModel || selectedProvider || reasoningEffort !== DEFAULT_OPENROUTER_REASONING_EFFORT || contextWindow !== DEFAULT_CONTEXT_WINDOW || maxOutputTokens !== DEFAULT_MAX_OUTPUT_TOKENS || freeOnly) {
       try {
         const settings = {
           boostModel,
           selectedProvider,
+          reasoningEffort,
           contextWindow,
           maxOutputTokens,
           freeOnly
@@ -67,7 +82,7 @@ export default function BoostControlModal({ isOpen, onClose, capabilities }) {
         console.error('Failed to save boost settings to localStorage:', e);
       }
     }
-  }, [boostModel, selectedProvider, contextWindow, maxOutputTokens, freeOnly]);
+  }, [boostModel, selectedProvider, reasoningEffort, contextWindow, maxOutputTokens, freeOnly]);
   const fetchProviders = async (modelId, keyOverride = undefined) => {
     if (!modelId) {
@@ -128,6 +143,7 @@ export default function BoostControlModal({ isOpen, onClose, capabilities }) {
         // Boost is enabled - use backend values (they're authoritative)
         setBoostModel(response.status.model_id);
         setSelectedProvider(response.status.provider || '');
+        setReasoningEffort(normalizeOpenRouterReasoningEffort(response.status.reasoning_effort));
         setContextWindow(response.status.context_window);
         setMaxOutputTokens(response.status.max_output_tokens);
         if (response.status.model_id) {
@@ -139,6 +155,7 @@ export default function BoostControlModal({ isOpen, onClose, capabilities }) {
         const settings = {
           boostModel: response.status.model_id,
           selectedProvider: response.status.provider || '',
+          reasoningEffort: normalizeOpenRouterReasoningEffort(response.status.reasoning_effort),
           contextWindow: response.status.context_window,
           maxOutputTokens: response.status.max_output_tokens,
           freeOnly
@@ -161,6 +178,7 @@ export default function BoostControlModal({ isOpen, onClose, capabilities }) {
   const handleModelChange = async (modelId) => {
     setBoostModel(modelId);
     setSelectedProvider(''); // Reset provider when model changes
+    setReasoningEffort(DEFAULT_OPENROUTER_REASONING_EFFORT);
     if (modelId) {
       const autoSettings = await getAutoSettingsForModel(modelId, null);
       if (autoSettings) {
@@ -300,6 +318,7 @@ export default function BoostControlModal({ isOpen, onClose, capabilities }) {
         openrouter_api_key: trimmedApiKey,
         boost_model_id: boostModel,
         boost_provider: selectedProvider || null,
+        boost_reasoning_effort: reasoningEffort,
         boost_context_window: contextWindow,
         boost_max_output_tokens: maxOutputTokens
       };
@@ -504,6 +523,26 @@ export default function BoostControlModal({ isOpen, onClose, capabilities }) {
           )}
+          {boostModel && (
+
+ + + + {reasoningInfo.hasEndpointMetadata && !reasoningInfo.supportsReasoning + ? 'This selected provider does not advertise reasoning support; OpenRouter may ignore the setting.' + : 'Auto sends OpenRouter max reasoning effort by default.'} + +
+ )} +
@@ -559,6 +598,9 @@ export default function BoostControlModal({ isOpen, onClose, capabilities }) {
  • Click tasks in the MOTO Workflow panel to toggle boost
  • Boosted tasks use your selected OpenRouter model and optional host provider
  • + {developerModeEnabled && ( +
  • Supercharge is enabled per role in each settings panel; when Boost also applies, all 5 Supercharge calls use this Boost model
  • + )}
  • {lmStudioEnabled ? 'If boost credits or provider capacity fail, the task falls back to its primary model path for that call' diff --git a/frontend/src/components/CritiqueNotificationStack.jsx b/frontend/src/components/CritiqueNotificationStack.jsx index b37a9fc..5b5f15f 100644 --- a/frontend/src/components/CritiqueNotificationStack.jsx +++ b/frontend/src/components/CritiqueNotificationStack.jsx @@ -41,12 +41,12 @@ function getRatingColor(rating) { * - Max 3 notifications (FIFO queue) * - Click to open critique modal * - X button to dismiss - * - Persists across screens (not localStorage) + * - Seen notification keys are tracked by the parent to avoid replay loops * * Props: - * - notifications: Array of notification objects { id, paper_id, paper_title, average_rating, timestamp } + * - notifications: Array of notification objects { id, paper_id, paper_title, average_rating, timestamp, seenKey } * - onDismiss: (id) => void - callback when notification is dismissed - * - onClickNotification: (paper_id, paper_title) => void - callback when notification is clicked + * - onClickNotification: (paper_id, paper_title, seenKey) => void - callback when notification is clicked */ export default function CritiqueNotificationStack({ notifications, onDismiss, onClickNotification, panelCollapsed }) { if (!notifications || notifications.length === 0) { @@ -98,7 +98,7 @@ function CritiqueNotification({ notification, index, onDismiss, onClickNotificat }; const handleClick = () => { - onClickNotification(notification.paper_id, notification.paper_title); + onClickNotification(notification.paper_id, notification.paper_title, notification.seenKey); }; return ( diff --git a/frontend/src/components/HelpTooltip.jsx b/frontend/src/components/HelpTooltip.jsx index 7a0eac1..d84d253 100644 --- a/frontend/src/components/HelpTooltip.jsx +++ b/frontend/src/components/HelpTooltip.jsx @@ -10,6 +10,7 @@ export default function HelpTooltip({ popupStyle, buttonContent = '?', useFixedPosition = 
false, + fixedPlacement = 'above-right', }) { const [isOpen, setIsOpen] = useState(false); const [fixedPopupStyle, setFixedPopupStyle] = useState(null); @@ -31,11 +32,22 @@ export default function HelpTooltip({ const viewportPadding = 16; let left = buttonRect.right + gap; - if (left + popupRect.width > window.innerWidth - viewportPadding) { + let top = buttonRect.top - popupRect.height - gap; + + if (fixedPlacement === 'side-right') { + if (left + popupRect.width > window.innerWidth - viewportPadding) { + left = buttonRect.left - popupRect.width - gap; + } + left = Math.max(viewportPadding, left); + top = buttonRect.top + (buttonRect.height - popupRect.height) / 2; + } else if (left + popupRect.width > window.innerWidth - viewportPadding) { left = Math.max(viewportPadding, window.innerWidth - popupRect.width - viewportPadding); } - const top = Math.max(viewportPadding, buttonRect.top - popupRect.height - gap); + top = Math.min( + Math.max(viewportPadding, top), + Math.max(viewportPadding, window.innerHeight - popupRect.height - viewportPadding) + ); setFixedPopupStyle({ position: 'fixed', @@ -45,7 +57,7 @@ export default function HelpTooltip({ bottom: 'auto', zIndex: 100000, }); - }, [useFixedPosition]); + }, [fixedPlacement, useFixedPosition]); const showTooltip = () => { setIsOpen(true); diff --git a/frontend/src/components/HighlightedModelsSidebar.jsx b/frontend/src/components/HighlightedModelsSidebar.jsx new file mode 100644 index 0000000..7b2a459 --- /dev/null +++ b/frontend/src/components/HighlightedModelsSidebar.jsx @@ -0,0 +1,200 @@ +import React, { useState } from 'react'; +import HelpTooltip from './HelpTooltip'; +import ProofStrengthBadge from './ProofStrengthBadge'; +import './settings-common.css'; +import './autonomous/AutonomousResearch.css'; + +const OsTag = () => ( + + OS + + Open source — weights available on Hugging Face for local use with LM Studio. 
+ + +); + +export default function HighlightedModelsSidebar() { + const [showKothTooltip, setShowKothTooltip] = useState(false); + + return ( +
    +
    +

    + Highlighted Models + + The models and hosts listed here are not affiliated with MOTO or Intrafere LLC. This chart reflects developer-tested configurations intended to help guide model selection. All statements regarding pricing, performance, roles, rankings, or capabilities are speculative and based on individual testing experience. Intrafere LLC and the MOTO development team make no guarantees about the accuracy of this chart. MOTO is compatible with the majority of models, including many not listed here. + +

    +

    + Note: Most models over 20 billion parameters are compatible with MOTO. +

    +
    +
    +
    Leaderboard
    +
    + +
    +
    Kimi K2.6
    +
    setShowKothTooltip(true)} + onMouseLeave={() => setShowKothTooltip(false)} + onFocus={() => setShowKothTooltip(true)} + onBlur={() => setShowKothTooltip(false)} + tabIndex={0} + > +
    👑 KING OF THE HILL
    + {showKothTooltip && ( +
    + This model was chosen by the Intrafere developers as the best overall performer in the MOTO harness, optimized for cost, speed, and knowledge. +
    + )} +
    +
    +
    Highly knowledgeable and balanced cost
    +
    + +
    +
    +
    Gemini 3.1 Flash Lite
    +
    🥈 SILVER
    +
    +
    Highly Knowledgeable, Fast
    +
    + +
    + +
    +
    GPT OSS 120B
    +
    🥉 BRONZE
    +
    +
    Balanced knowledge and speed at low cost
    +
    +
    + +
    +
    Arcee AI's Trinity Large
    +
    Highly knowledgeable
    +
    + +
    +
    Amazon Nova Pro/Premier
    +
    Highly knowledgeable
    +
    + +
    + +
    Claude Opus/Sonnet/Haiku
    +
    Highly knowledgeable
    +
    + +
    + +
    DeepSeek
    +
    Highly knowledgeable
    +
    + +
    +
    Gemini Flash
    +
    Fast validator
    +
    + +
    +
    Gemini Pro
    +
    Highly knowledgeable
    +
    + +
    + +
    Google's Gemma
    +
    Balanced knowledge and speed
    +
    + +
    + +
    GLM
    +
    Highly knowledgeable
    +
    + +
    + +
    GLM Turbo
    +
    Fast validator
    +
    + +
    + +
    OpenAI's GPT OSS
    +
    Balanced knowledge and speed
    +
    + +
    +
    Grok
    +
    Highly knowledgeable
    +
    + +
    + +
    ChatGPT
    +
    Highly knowledgeable
    +
    + +
    +
    Inception's Mercury
    +
    Rapid knowledge
    +
    + +
    + +
    Nemotron Super
    +
    Balanced knowledge and speed
    +
    + +
    + +
    Nous Hermes
    +
    Highly knowledgeable
    +
    + +
    +
    Perplexity's Sonar
    +
    Native internet search capability
    +
    + +
    + +
    Microsoft's Phi
    +
    Balanced knowledge and speed
    +
    + +
    +
    MiniMax
    +
    Highly knowledgeable
    +
    + +
    + +
    Qwen Coder
    +
    Computer science
    +
    + +
    + +
    Qwen
    +
    Highly knowledgeable
    +
    +
    +
    +
    + ); +} diff --git a/frontend/src/components/LatexRenderer.jsx b/frontend/src/components/LatexRenderer.jsx index f746ac5..3589061 100644 --- a/frontend/src/components/LatexRenderer.jsx +++ b/frontend/src/components/LatexRenderer.jsx @@ -451,22 +451,11 @@ const replaceSectionCommand = (text, command, tag, endTag) => { const decodeHtmlEntities = (text) => { if (!text) return text; - // Use a textarea to decode HTML entities properly + // Decode exactly one entity layer. Repeated unescaping can turn literal + // escaped HTML into active markup before DOMPurify sees it. const textarea = document.createElement('textarea'); textarea.innerHTML = text; - let decoded = textarea.textContent; - - // Also handle common named entities that might not be decoded - decoded = decoded - .replace(/&/g, '&') - .replace(/</g, '<') - .replace(/>/g, '>') - .replace(/"/g, '"') - .replace(/'/g, "'") - .replace(/'/g, "'") - .replace(/'/g, "'"); - - return decoded; + return textarea.textContent || ''; }; /** @@ -475,8 +464,6 @@ const decodeHtmlEntities = (text) => { const cleanTikzContent = (content) => { return content.trim() .replace(/<br\/>/g, '\n') // Fix HTML-encoded line breaks - .replace(/&amp;/g, '&') // Fix double-encoded ampersands - .replace(/&/g, '&') // Fix encoded ampersands .replace(//g, '\n'); // Fix actual HTML line breaks }; @@ -521,6 +508,59 @@ const renderKatexSafely = (latex, displayMode, originalMatch) => { } }; +const isEscapedAt = (text, index) => { + let backslashCount = 0; + for (let i = index - 1; i >= 0 && text[i] === '\\'; i--) { + backslashCount += 1; + } + return backslashCount % 2 === 1; +}; + +const isSingleDollarDelimiter = (text, index) => ( + text[index] === '$' + && text[index - 1] !== '$' + && text[index + 1] !== '$' + && !isEscapedAt(text, index) +); + +const renderInlineDollarMath = (text) => { + let output = ''; + let segmentStart = 0; + let index = 0; + + while (index < text.length) { + if (!isSingleDollarDelimiter(text, index)) { + index += 1; + 
continue; + } + + const openIndex = index; + let closeIndex = -1; + for (let scan = openIndex + 1; scan < text.length; scan++) { + if (isSingleDollarDelimiter(text, scan)) { + closeIndex = scan; + break; + } + } + + if (closeIndex === -1) { + break; + } + + const latex = text.slice(openIndex + 1, closeIndex); + const match = text.slice(openIndex, closeIndex + 1); + output += text.slice(segmentStart, openIndex); + output += (latex.includes(' { + if (!timestamp) return ''; + + const parsed = new Date(timestamp); + if (!Number.isNaN(parsed.getTime())) { + return parsed.toLocaleTimeString(); + } + + return timestamp; +}; + +export default function LiveActivityFeed({ + title = 'Live Activity', + items = [], + emptyMessage = 'No activity yet.', + maxItems, + getEventName = (item) => item?.event || item?.type || '', + getMessage = (item) => item?.message || item?.data?.message || '', + getTimestamp = (item) => item?.timestamp || item?.fullTimestamp || '', + getIcon = getActivityIcon, + getClassName = getActivityClass, + headerAction = null, +}) { + const feedRef = useRef(null); + const prevLengthRef = useRef(0); + const visibleItems = maxItems ? items.slice(-maxItems) : items; + + useEffect(() => { + if (visibleItems.length > prevLengthRef.current && feedRef.current) { + feedRef.current.scrollTop = feedRef.current.scrollHeight; + } + prevLengthRef.current = visibleItems.length; + }, [visibleItems.length]); + + return ( +
    +
    +

    {title}

    + {headerAction} +
    +
    + {visibleItems.length === 0 ? ( +
    {emptyMessage}
    + ) : ( + visibleItems.map((item, index) => { + const eventName = getEventName(item); + const timestamp = getTimestamp(item); + const message = getMessage(item) || eventName; + + return ( +
    + {getIcon(eventName, item)} + {formatActivityTime(timestamp)} + {message} +
    + ); + }) + )} +
    +
+  );
+}
diff --git a/frontend/src/components/PaperCritiqueModal.jsx b/frontend/src/components/PaperCritiqueModal.jsx
index a0386f7..ee5326a 100644
--- a/frontend/src/components/PaperCritiqueModal.jsx
+++ b/frontend/src/components/PaperCritiqueModal.jsx
@@ -59,7 +59,7 @@ function getRatingBgColor(rating) {
 const AUTONOMOUS_SETTINGS_STORAGE_KEY = 'autonomous_research_settings';
 const COMPILER_SETTINGS_STORAGE_KEY = 'compiler_settings';
 
-function readStoredValidatorConfig(paperType) {
+function readStoredValidatorConfig(paperType, developerModeEnabled = false) {
   try {
     if (paperType === 'compiler_paper') {
       const raw = localStorage.getItem(COMPILER_SETTINGS_STORAGE_KEY);
@@ -79,6 +79,8 @@ function readStoredValidatorConfig(paperType) {
         validator_max_tokens: config.validatorMaxOutput,
         validator_provider: config.validatorProvider,
         validator_openrouter_provider: config.validatorOpenrouterProvider,
+        validator_openrouter_reasoning_effort: config.validatorOpenrouterReasoningEffort || 'auto',
+        validator_supercharge_enabled: developerModeEnabled && Boolean(config.validatorSuperchargeEnabled),
       };
     }
 
@@ -102,6 +104,8 @@ function readStoredValidatorConfig(paperType) {
       validator_max_tokens: localConfig.validator_max_tokens,
       validator_provider: localConfig.validator_provider,
       validator_openrouter_provider: localConfig.validator_openrouter_provider,
+      validator_openrouter_reasoning_effort: localConfig.validator_openrouter_reasoning_effort || 'auto',
+      validator_supercharge_enabled: developerModeEnabled && Boolean(localConfig.validator_supercharge_enabled),
     };
   } catch (error) {
     console.warn('Could not read validator config from localStorage:', error);
@@ -129,6 +133,7 @@ export default function PaperCritiqueModal({
   paperTitle,
   onGenerateCritique,
   onGetCritiques,
+  developerModeEnabled = false,
 }) {
   const [loading, setLoading] = useState(false);
   const [generating, setGenerating] = useState(false);
@@ -177,7 +182,7 @@ export default function PaperCritiqueModal({
         : 'autonomous_critique_custom_prompt';
       const customPrompt = localStorage.getItem(storageKey);
 
-      const validatorConfig = readStoredValidatorConfig(paperType);
+      const validatorConfig = readStoredValidatorConfig(paperType, developerModeEnabled);
       const result = await onGenerateCritique(customPrompt, validatorConfig);
 
       // Reload critiques to get the updated list
diff --git a/frontend/src/components/ProofStrengthBadge.jsx b/frontend/src/components/ProofStrengthBadge.jsx
new file mode 100644
index 0000000..e7402ec
--- /dev/null
+++ b/frontend/src/components/ProofStrengthBadge.jsx
@@ -0,0 +1,18 @@
+import React from 'react';
+import './settings-common.css';
+
+const LEADERBOARD_TOOLTIP = 'This company\'s state-of-the-art model has been seen in MOTO testing to solve complex mathematical proofs and perform well in Submitter 1 (Main Submitter), High-Context Submitter, and High-Parameter Submitter, the three primary proof-creation roles.';
+
+const ROLE_TOOLTIP = 'These are the three roles that submit proofs: Submitter 1 (Main Submitter), High-Context Submitter, and High-Parameter Submitter. For the best chance of creating novel proofs, use models comparable to those marked PS in the Highlighted Models list.';
+
+export default function ProofStrengthBadge({ variant = 'role', className = '' }) {
+  const tooltip = variant === 'leaderboard' ? LEADERBOARD_TOOLTIP : ROLE_TOOLTIP;
+  const variantClass = variant === 'leaderboard' ? 'ps-badge-anchor--leaderboard' : 'ps-badge-anchor--role';
+
+  return (
+
+      PS
+      {tooltip}
+
+  );
+}
diff --git a/frontend/src/components/RawSettingsEditor.jsx b/frontend/src/components/RawSettingsEditor.jsx
new file mode 100644
index 0000000..111d14e
--- /dev/null
+++ b/frontend/src/components/RawSettingsEditor.jsx
@@ -0,0 +1,39 @@
+import React from 'react';
+
+export default function RawSettingsEditor({
+  value,
+  onChange,
+  onSave,
+  message,
+  disabled = false,
+}) {
+  return (
+
    +

    Raw Settings JSON

    +

    + Edit the full settings payload directly. Save only valid JSON. +

    +