| 1 | +# Survey: Ornith-1.0 — Self-Scaffolding LLMs for Agentic Coding |
| 2 | + |
| 3 | +> Source survey (verified 2026-06-25): |
| 4 | +> - Blog: https://deep-reinforce.com/ornith_1_0.html |
| 5 | +> - HF org: https://huggingface.co/deepreinforce-ai |
| 6 | +> - HF model card: https://huggingface.co/deepreinforce-ai/Ornith-1.0-397B |
| 7 | +> |
| 8 | +> Status of benchmark numbers: **maintainer-reported, not independently reproduced.** |
| 9 | + |
| 10 | +--- |
| 11 | + |
| 12 | +## 1. Abstract Concepts (the reusable ideas) |
| 13 | + |
| 14 | +These are the transferable patterns worth remembering independent of this specific release. |
| 15 | + |
| 16 | +### 1.1 Scaffold-as-learnable-object |
| 17 | +The agent **harness/scaffold is treated as a first-class object that is trained**, not hand-engineered and frozen. Instead of humans writing the orchestration (memory, retry, error-handling, search trajectory, tool plan), the model learns to author it. |
| 18 | + |
| 19 | +### 1.2 Self-scaffolding RL (scaffold–policy co-evolution) |
| 20 | +A two-stage RL step in which a single model both **proposes the scaffold** and **executes the rollout** under it, followed by a shared reward update: |
| 21 | +1. Given a task + the previous scaffold → propose a **refined scaffold**. |
| 22 | +2. Given that scaffold + the task → generate a **solution rollout**. |
| 23 | +3. Reward from the rollout is **propagated to both stages** (scaffold-authoring and solution-generating). |
| 24 | + |
| 25 | +Effect: per-task-category orchestration strategies emerge automatically via mutation + selection toward higher-reward trajectories. The scaffold co-evolves with the policy. |
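The three steps in 1.2 can be sketched as a single function. This is a minimal illustration under stated assumptions, not the release's training code: `generate` and `score` are hypothetical callables standing in for the shared policy and the task verifier, and the prompt layout is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    stage: str      # "scaffold" or "solution"
    prompt: str
    output: str
    reward: float


def self_scaffolding_step(
    task: str,
    prev_scaffold: str,
    generate: Callable[[str], str],      # hypothetical: the single shared policy
    score: Callable[[str, str], float],  # hypothetical: verifier(task, rollout)
) -> List[Sample]:
    # Stage 1: task + previous scaffold -> refined scaffold.
    scaffold_prompt = f"TASK:\n{task}\n\nPREVIOUS SCAFFOLD:\n{prev_scaffold}"
    scaffold = generate(scaffold_prompt)

    # Stage 2: refined scaffold + task -> solution rollout.
    solution_prompt = f"SCAFFOLD:\n{scaffold}\n\nTASK:\n{task}"
    rollout = generate(solution_prompt)

    # The rollout's reward is attached to both stages, so both the
    # scaffold-authoring and solution-generating tokens receive credit.
    reward = score(task, rollout)
    return [
        Sample("scaffold", scaffold_prompt, scaffold, reward),
        Sample("solution", solution_prompt, rollout, reward),
    ]
```

Both returned samples would then feed the same policy-gradient update, which is what lets the scaffold and the policy co-evolve.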
| 26 | + |
| 27 | +### 1.3 Reward-hacking defense as a layered trust architecture |
| 28 | +Letting the model author its own scaffold invites verifier-gaming (e.g. reading hidden test files, touching the checked-for artifact, copying an oracle solution). Defense generalizes as **three layers**: |
| 29 | +1. **Fixed outer trust boundary** — environment, tool surface, test isolation are immutable; only the *inner* policy scaffold is mutable. |
| 30 | +2. **Deterministic monitor** — exact-spec rule checker; flags forbidden reads / verification-script edits / out-of-surface tool calls → zero reward + excluded from advantage. |
| 31 | +3. **Frozen LLM-judge veto** — catches *intent-level* gaming that stays within the allowed tool surface; sits **on top of** the verifier, not as the primary reward. |
| 32 | + |
| 33 | +Abstract takeaway: **specify exactly what you can (deterministic), judge what you cannot (frozen LLM), and make the boundary itself unreachable by the learner.** |
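A rough sketch of how layers 2 and 3 could gate the verifier's reward. The paths, tool names, and `judge_vetoes` callable are all hypothetical, since the release does not publish its monitor rules or judge prompts. Whether a judge-vetoed rollout stays in the advantage computation is also an assumption here (it does in this sketch).

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical rule set; the real monitor rules are not published.
FORBIDDEN_READ_PREFIXES = ("/tests/hidden/",)
PROTECTED_WRITE_PATHS = {"/verify.sh"}
ALLOWED_TOOLS = {"read", "write", "shell"}


@dataclass(frozen=True)
class ToolCall:
    tool: str
    path: str = ""


def monitor_flags(calls: Sequence[ToolCall]) -> bool:
    """Layer 2: deterministic, exact-spec rule check."""
    for call in calls:
        if call.tool not in ALLOWED_TOOLS:
            return True  # out-of-surface tool call
        if call.tool == "read" and call.path.startswith(FORBIDDEN_READ_PREFIXES):
            return True  # forbidden read
        if call.tool == "write" and call.path in PROTECTED_WRITE_PATHS:
            return True  # verification-script edit
    return False


def gated_reward(
    calls: Sequence[ToolCall],
    verifier_reward: float,
    judge_vetoes: Callable[[Sequence[ToolCall]], bool],  # layer 3, frozen judge
) -> Tuple[float, bool]:
    """Return (reward, include_in_advantage)."""
    if monitor_flags(calls):
        return 0.0, False  # zero reward and excluded from advantage
    if judge_vetoes(calls):
        return 0.0, True   # assumption: vetoed rollouts still count as samples
    return verifier_reward, True
```

The judge only ever lowers a reward the verifier already granted, which matches the "veto on top of the verifier" framing rather than using the judge as the primary reward.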
| 34 | + |
| 35 | +### 1.4 Staleness-weighted off-policy correction (pipeline-RL) |
| 36 | +For long rollouts in asynchronous/pipeline RL, weight each token by its age $d_t$, with thresholds $K_1 < K_2$ and decay rate $\lambda > 0$: |
| 37 | +$$ w(d_t)= \begin{cases} 1, & d_t \le K_1 \\ \exp(-\lambda(d_t-K_1)), & K_1 < d_t \le K_2 \\ 0, & d_t > K_2 \end{cases} $$ |
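| 38 | +Applied to a token-level GRPO objective, where $r_t$ is the importance ratio, $A_t$ the advantage, and $\epsilon^{-}, \epsilon^{+}$ the lower and upper clip ranges: |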
| 39 | +$$ L_t=\min\!\big(r_t A_t,\ \mathrm{clip}(r_t,1-\epsilon^{-},1+\epsilon^{+})A_t\big)\cdot w(d_t) $$ |
| 40 | +Abstract takeaway: **fresh tokens count fully, mid-age tokens decay exponentially, stale tokens are dropped.** The decay band between $K_1$ and $K_2$ makes this a graceful off-policy discount rather than a single hard cutoff. |
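The two formulas in 1.4 translate directly into a few lines of Python. The function and parameter names are illustrative, and the numeric values in the usage note are made up; the source does not state the unit of token age or the values of $K_1$, $K_2$, $\lambda$.

```python
import math


def staleness_weight(d: float, k1: float, k2: float, lam: float) -> float:
    """w(d): 1 up to K1, exponential decay on (K1, K2], 0 beyond K2."""
    if d <= k1:
        return 1.0
    if d <= k2:
        return math.exp(-lam * (d - k1))
    return 0.0


def token_objective(
    ratio: float,      # r_t, importance ratio
    advantage: float,  # A_t
    d: float,          # token age d_t
    k1: float,
    k2: float,
    lam: float,
    eps_low: float,    # epsilon^-
    eps_high: float,   # epsilon^+
) -> float:
    """Clipped per-token term scaled by the staleness weight."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    surrogate = min(ratio * advantage, clipped * advantage)
    return surrogate * staleness_weight(d, k1, k2, lam)
```

With the made-up settings `k1=4, k2=8, lam=0.5`, a token of age 2 keeps full weight, a token of age 6 is scaled by `exp(-1)`, and a token of age 9 contributes nothing.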
| 41 | + |
| 42 | +--- |
| 43 | + |
| 44 | +## 2. What Is Being Announced |
| 45 | + |
| 46 | +DeepReinforce released **Ornith-1.0**, an MIT-licensed open-source family of **agentic-coding** LLMs, post-trained on **Gemma 4** and **Qwen 3.5** bases. |
| 47 | + |
| 48 | +| Variant | Type | Positioning | |
| 49 | +|---|---|---| |
| 50 | +| Ornith-1.0-9B | Dense | edge / efficient | |
| 51 | +| Ornith-1.0-31B | Dense | mid-size dense (named in blog; not seen in HF org listing) | |
| 52 | +| Ornith-1.0-35B | MoE | mid-tier | |
| 53 | +| Ornith-1.0-397B | MoE | flagship | |
| 54 | + |
| 55 | +HF repos observed: 397B, 397B-FP8, 35B, 35B-GGUF, 9B, 9B-GGUF. |
| 56 | + |
| 57 | +--- |
| 58 | + |
| 59 | +## 3. Benchmark Claims (maintainer-reported) |
| 60 | + |
| 61 | +### 397B flagship |
| 62 | +| Benchmark | Ornith-397B | Context | |
| 63 | +|---|---|---| |
| 64 | +| Terminal-Bench 2.1 (Terminus-2) | 77.5 | > Opus 4.7 (70.3); < Opus 4.8 (85), GLM-5.2 (81.0) | |
| 65 | +| Terminal-Bench 2.1 (Claude Code) | 78.2 | ≈ Opus 4.8 (78.9); < GLM-5.2 (82.7) | |
| 66 | +| SWE-Bench Verified | 82.4 | > Opus 4.7 (80.8); < Opus 4.8 (87.6) | |
| 67 | +| SWE-Bench Pro | 62.2 | ≈ GLM-5.2 (62.1); < Opus | |
| 68 | +| NL2Repo | 48.2 | ≈ GLM-5.2 (48.9); < Opus 4.8 (69.7) | |
| 69 | +| ClawEval Avg | 77.1 | ≈ Opus 4.7 (78.2) | |
| 70 | + |
| 71 | +Defensible framing: strong among open/comparable-size models, competitive with **Claude Opus 4.7** on several agentic benchmarks — but **not** the table's top model (Opus 4.8 and GLM-5.2 lead several rows). |
| 72 | + |
| 73 | +### 9B (efficiency story) |
| 74 | +Terminal-Bench 2.1 (Terminus-2) 43.1 · SWE-Bench Verified 69.4 · ClawEval Avg 63.1 — claimed to match/exceed much larger Gemma/Qwen models on some rows. |
| 75 | + |
| 76 | +--- |
| 77 | + |
| 78 | +## 4. Serving / Release Facts (verified on HF) |
| 79 | +- License **MIT**; `text-generation`; tags include `qwen3_5_moe`, `safetensors`, `transformers`. |
| 80 | +- Reasoning model: emits `<think>...</think>`; serve with Qwen3 reasoning parser + Qwen XML/tool-call parser. |
| 81 | +- Recipes: vLLM / SGLang, TP=8 on 8×80GB, 262k context. |
| 82 | + |
| 83 | +--- |
| 84 | + |
| 85 | +## 5. Discrepancies / Unknowns |
| 86 | +- **35B score mismatch**: prose 64.4 vs table 64.2 (Terminal-Bench 2.1). |
| 87 | +- **397B doc error**: card calls 397B "lightweight ... single-GPU" yet the recipe is 8×80GB TP=8, likely a copy/paste error. |
| 88 | +- **Typo** "scallfold" on the model card. |
| 89 | +- **31B** named in blog, not visible in the HF org listing at the time of checking. |
| 90 | +- No paper-grade methodology: no ablations, datasets, compute, monitor rules, judge prompts, or attack evals. |
| 91 | +- Benchmark comparisons depend on harness, tool budget, context, temperature, retries, timeout. |
| 92 | + |
| 93 | +--- |
| 94 | + |
| 95 | +## 6. Bottom Line |
| 96 | +Credible maintainer release with real HF artifacts and one genuinely distinctive thesis: **train the model to author the scaffold that guides its own coding rollouts.** Treat leaderboard claims as **strong vendor claims awaiting independent reproduction.** |