Commit 6b83ae1

Do further surveys for Ornith RL self-scaffolding
1 parent 20a1cf2 commit 6b83ae1

1 file changed

Lines changed: 96 additions & 0 deletions

surveys/ornith-1.0-survey.md

@@ -0,0 +1,96 @@
# Survey: Ornith-1.0 — Self-Scaffolding LLMs for Agentic Coding

> Source survey (verified 2026-06-25):
> - Blog: https://deep-reinforce.com/ornith_1_0.html
> - HF org: https://huggingface.co/deepreinforce-ai
> - HF model card: https://huggingface.co/deepreinforce-ai/Ornith-1.0-397B
>
> Status of benchmark numbers: **maintainer-reported, not independently reproduced.**

---

## 1. Abstract Concepts (the reusable ideas)

These are the transferable patterns worth remembering independent of this specific release.

### 1.1 Scaffold-as-learnable-object
The agent **harness/scaffold is treated as a first-class object that is trained**, not hand-engineered and frozen. Instead of humans writing the orchestration (memory, retry, error-handling, search trajectory, tool plan), the model learns to author it.

### 1.2 Self-scaffolding RL (scaffold–policy co-evolution)
An RL step with two generation stages, in which a single model both **proposes the scaffold** and **executes the rollout** under it; the rollout's reward then updates both stages:
1. Given a task + the previous scaffold → propose a **refined scaffold**.
2. Given that scaffold + the task → generate a **solution rollout**.
3. Reward from the rollout is **propagated to both stages** (scaffold-authoring and solution-generating).

Effect: per-task-category orchestration strategies emerge automatically via mutation + selection toward higher-reward trajectories. The scaffold co-evolves with the policy.
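
The three steps above can be sketched as a single function. This is a toy illustration, not the maintainers' training code: `Sample`, `self_scaffolding_step`, `propose_scaffold`, and `run_rollout` are hypothetical names, the model is replaced by plain callables, and crediting both stages by giving each sample the same scalar reward is an assumption, since the source does not say how the reward is propagated.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    """One training sample: the stage that produced it, its prompt/output, and its reward."""
    stage: str      # "scaffold" or "solution"
    prompt: str
    output: str
    reward: float


def self_scaffolding_step(
    task: str,
    prev_scaffold: str,
    propose_scaffold: Callable[[str, str], str],            # hypothetical: (task, prev_scaffold) -> scaffold
    run_rollout: Callable[[str, str], Tuple[str, float]],   # hypothetical: (task, scaffold) -> (solution, reward)
) -> Tuple[str, List[Sample]]:
    # Stage 1: the model refines the previous scaffold for this task.
    scaffold = propose_scaffold(task, prev_scaffold)

    # Stage 2: the same model solves the task under the refined scaffold.
    solution, reward = run_rollout(task, scaffold)

    # Assumption: both stages receive the same rollout reward.
    samples = [
        Sample("scaffold", f"{task}\n---\n{prev_scaffold}", scaffold, reward),
        Sample("solution", f"{task}\n---\n{scaffold}", solution, reward),
    ]
    return scaffold, samples
```

Returning the refined scaffold lets the caller feed it back in as `prev_scaffold` on the next step, which is where the mutation-and-selection dynamic would come from.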

### 1.3 Reward-hacking defense as a layered trust architecture
Letting the model author its own scaffold invites verifier-gaming (e.g. reading hidden test files, touching the checked-for artifact, copying an oracle solution). Defense generalizes as **three layers**:
1. **Fixed outer trust boundary** — environment, tool surface, test isolation are immutable; only the *inner* policy scaffold is mutable.
2. **Deterministic monitor** — exact-spec rule checker; flags forbidden reads / verification-script edits / out-of-surface tool calls → zero reward + excluded from advantage.
3. **Frozen LLM-judge veto** — catches *intent-level* gaming that stays within the allowed tool surface; sits **on top of** the verifier, not as the primary reward.

Abstract takeaway: **specify exactly what you can (deterministic), judge what you cannot (frozen LLM), and make the boundary itself unreachable by the learner.**
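
A minimal sketch of how layers 2 and 3 might gate the verifier reward. Everything here is illustrative: the `Rollout` fields, the path and tool names in the three constant sets, and `gated_reward` are hypothetical, and the real monitor rules and judge prompt are not published. The monitor branch follows the text (zero reward, excluded from advantage); treating a judge veto as zero reward that still counts toward the advantage is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Set, Tuple


@dataclass
class Rollout:
    files_read: Set[str]
    files_written: Set[str]
    tools_called: Set[str]
    verifier_reward: float


# Hypothetical rule set; it lives outside anything the learner can edit (layer 1).
FORBIDDEN_READS = {"tests/hidden_test.py", "oracle/solution.py"}
PROTECTED_WRITES = {"verify.sh"}
ALLOWED_TOOLS = {"shell", "read_file", "write_file"}


def monitor_violation(r: Rollout) -> bool:
    """Layer 2: deterministic, exact-spec checks on the rollout trace."""
    return bool(
        r.files_read & FORBIDDEN_READS
        or r.files_written & PROTECTED_WRITES
        or r.tools_called - ALLOWED_TOOLS
    )


def gated_reward(
    r: Rollout, judge_vetoes: Callable[[Rollout], bool]
) -> Tuple[float, bool]:
    """Return (reward, include_in_advantage)."""
    if monitor_violation(r):
        return 0.0, False          # zero reward and dropped from the advantage batch
    if judge_vetoes(r):            # Layer 3: frozen judge can only veto, never award
        return 0.0, True           # assumption: vetoed rollouts stay in the batch
    return r.verifier_reward, True
```

The judge is passed in as a callable so that it can only lower a reward the verifier already produced, which mirrors the "on top of the verifier" placement.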

### 1.4 Staleness-weighted off-policy correction (pipeline-RL)
For long rollouts in asynchronous/pipeline RL, weight each token by its age $d_t$:
$$ w(d_t)= \begin{cases} 1, & d_t \le K_1 \\ \exp(-\lambda(d_t-K_1)), & K_1 < d_t \le K_2 \\ 0, & d_t > K_2 \end{cases} $$
Applied to a token-level GRPO objective, where (in the usual PPO/GRPO notation) $r_t$ is the importance ratio, $A_t$ the advantage, and $\epsilon^{-}$, $\epsilon^{+}$ the lower and upper clip bounds:
$$ L_t=\min\!\big(r_t A_t,\ \mathrm{clip}(r_t,1-\epsilon^{-},1+\epsilon^{+})A_t\big)\cdot w(d_t) $$
Abstract takeaway: **fresh tokens count fully, mid-age tokens decay exponentially, stale tokens are dropped**, so the off-policy discount is graded up to $K_2$ instead of being a single hard cutoff.
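
The two formulas above translate directly into a few lines of Python. This is a sketch for a single token using scalar inputs; the function names and the example values of $K_1$, $K_2$, $\lambda$, and the clip bounds used when calling it are illustrative assumptions, and the source does not say what unit the age $d_t$ is measured in.

```python
import math


def staleness_weight(d: float, k1: float, k2: float, lam: float) -> float:
    """w(d): 1 up to K1, exponential decay on (K1, K2], 0 beyond K2."""
    if d <= k1:
        return 1.0
    if d <= k2:
        return math.exp(-lam * (d - k1))
    return 0.0


def token_objective(
    ratio: float,
    advantage: float,
    d: float,
    *,
    k1: float,
    k2: float,
    lam: float,
    eps_lo: float,
    eps_hi: float,
) -> float:
    """Clipped surrogate for one token, scaled by its staleness weight."""
    # clip(r_t, 1 - eps_lo, 1 + eps_hi)
    clipped = min(max(ratio, 1.0 - eps_lo), 1.0 + eps_hi)
    # min(r_t * A_t, clip(r_t) * A_t)
    surrogate = min(ratio * advantage, clipped * advantage)
    return surrogate * staleness_weight(d, k1, k2, lam)
```

With, say, `k1=2, k2=6, lam=0.5`, a token of age 4 is weighted by `exp(-1.0)`, while a token of age 7 contributes nothing to the objective.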

---

## 2. What Is Being Announced

DeepReinforce released **Ornith-1.0**, an MIT-licensed open-source family of **agentic-coding** LLMs, post-trained on **Gemma 4** and **Qwen 3.5** bases.

| Variant | Type | Positioning |
|---|---|---|
| Ornith-1.0-9B | Dense | edge / efficient |
| Ornith-1.0-31B | Dense | mid-size dense (named in blog; not seen in HF org listing) |
| Ornith-1.0-35B | MoE | mid-tier |
| Ornith-1.0-397B | MoE | flagship |

HF repos observed: 397B, 397B-FP8, 35B, 35B-GGUF, 9B, 9B-GGUF.

---

## 3. Benchmark Claims (maintainer-reported)

### 397B flagship
| Benchmark | Ornith-397B | Context |
|---|---|---|
| Terminal-Bench 2.1 (Terminus-2) | 77.5 | > Opus 4.7 (70.3); < Opus 4.8 (85), GLM-5.2 (81.0) |
| Terminal-Bench 2.1 (Claude Code) | 78.2 | ≈ Opus 4.8 (78.9); < GLM-5.2 (82.7) |
| SWE-Bench Verified | 82.4 | > Opus 4.7 (80.8); < Opus 4.8 (87.6) |
| SWE-Bench Pro | 62.2 | ≈ GLM-5.2 (62.1); < Opus |
| NL2Repo | 48.2 | ≈ GLM-5.2 (48.9); < Opus 4.8 (69.7) |
| ClawEval Avg | 77.1 | ≈ Opus 4.7 (78.2) |

Defensible framing: strong among open/comparable-size models, competitive with **Claude Opus 4.7** on several agentic benchmarks — but **not** the table's top model (Opus 4.8 and GLM-5.2 lead several rows).
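
The framing above can be checked by computing the point margins implied by the table. The sketch below copies only the comparator scores that the table states numerically (the unquantified "< Opus" entry for SWE-Bench Pro is left out); `ROWS`, `margins`, and `win_loss` are names introduced here for illustration.

```python
# (Ornith-397B score, {comparator: score}) per benchmark, copied from the table.
ROWS = {
    "Terminal-Bench 2.1 (Terminus-2)": (77.5, {"Opus 4.7": 70.3, "Opus 4.8": 85.0, "GLM-5.2": 81.0}),
    "Terminal-Bench 2.1 (Claude Code)": (78.2, {"Opus 4.8": 78.9, "GLM-5.2": 82.7}),
    "SWE-Bench Verified": (82.4, {"Opus 4.7": 80.8, "Opus 4.8": 87.6}),
    "SWE-Bench Pro": (62.2, {"GLM-5.2": 62.1}),
    "NL2Repo": (48.2, {"GLM-5.2": 48.9, "Opus 4.8": 69.7}),
    "ClawEval Avg": (77.1, {"Opus 4.7": 78.2}),
}


def margins(rows):
    """Ornith score minus each comparator score, rounded to one decimal."""
    return {
        bench: {model: round(score - other, 1) for model, other in others.items()}
        for bench, (score, others) in rows.items()
    }


def win_loss(rows):
    """Count pairwise comparisons where Ornith is ahead versus not ahead."""
    deltas = [d for per_bench in margins(rows).values() for d in per_bench.values()]
    wins = sum(1 for d in deltas if d > 0)
    return wins, len(deltas) - wins
```

Several of the "≈" rows are sub-point gaps in either direction, which is well within what harness or retry settings could plausibly move, so the margins are best read as rough ordering rather than precise ranking.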

### 9B (efficiency story)
Terminal-Bench 2.1 (Terminus-2) 43.1 · SWE-Bench Verified 69.4 · ClawEval Avg 63.1 — claimed to match/exceed much larger Gemma/Qwen models on some rows.

---

## 4. Serving / Release Facts (verified on HF)
- License **MIT**; `text-generation`; tags include `qwen3_5_moe`, `safetensors`, `transformers`.
- Reasoning model: emits `<think>...</think>`; serve with Qwen3 reasoning parser + Qwen XML/tool-call parser.
- Recipes: vLLM / SGLang, TP=8 on 8×80GB, 262k context.
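
When raw completions are consumed without a server-side reasoning parser, the `<think>...</think>` span has to be separated from the answer on the client. The helper below is a minimal sketch of that step, not part of any serving stack; `split_reasoning` is a hypothetical name, and it assumes the tags are well-formed and closed.

```python
import re
from typing import Tuple

# Non-greedy match so multiple <think> blocks are handled separately;
# DOTALL lets the reasoning span multiple lines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from a raw completion."""
    reasoning = "\n".join(block.strip() for block in _THINK_RE.findall(text))
    answer = _THINK_RE.sub("", text).strip()
    return reasoning, answer
```

A completion that was truncated before `</think>` would not match the pattern and would be returned entirely as the answer, so callers that care about that case need an extra check.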

---

## 5. Discrepancies / Unknowns
- **35B score mismatch**: the source's prose says 64.4 but its table says 64.2 (Terminal-Bench 2.1).
- **397B doc error**: the card calls 397B "lightweight ... single-GPU", yet the recipe is 8×80GB with TP=8; likely a copy/paste error.
- **Typo** "scallfold" on the model card.
- **31B** is named in the blog but was not visible in the HF org listing when checked.
- No paper-grade methodology: no ablations, datasets, compute, monitor rules, judge prompts, or attack evals.
- Benchmark comparisons depend on harness, tool budget, context, temperature, retries, timeout.
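
The "single-GPU" discrepancy noted above can be sanity-checked with back-of-the-envelope arithmetic on weight storage alone. The sketch assumes decimal gigabytes, 1 byte per parameter for FP8, 0.5 byte for a 4-bit quantization, and ignores KV cache, activations, and runtime overhead, so real requirements would be higher; the function names are introduced here for illustration.

```python
import math


def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight storage in GB: (params_billion * 1e9 params) * bytes / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def min_gpus_for_weights(
    params_billion: float, bytes_per_param: float, gpu_gb: float = 80.0
) -> int:
    """Smallest GPU count whose combined memory holds the weights alone."""
    return math.ceil(weight_memory_gb(params_billion, bytes_per_param) / gpu_gb)
```

Under these assumptions, 397B parameters at FP8 occupy about 397 GB, which already needs at least five 80 GB GPUs for the weights, and even a 4-bit copy (about 198.5 GB) needs three. That supports reading the "single-GPU" wording as a documentation error.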

---

## 6. Bottom Line
Credible maintainer release with real HF artifacts and one genuinely distinctive thesis: **train the model to author the scaffold that guides its own coding rollouts.** Treat leaderboard claims as **strong vendor claims awaiting independent reproduction.**
