Ragent6 is a deterministic, local benchmark for agent-capable language models. It tests whether a model can operate inside a weak tool harness, read evidence, write or edit files, run local checks, obey safety boundaries, recover from errors, and solve multi-constraint reasoning tasks.
Current release version: 0.2.0.
Status: 0.2.0 is the first public benchmark release. The current Chinese adaptation release is zh-CN 0.2.2.
Ragent6 provides English and Chinese prompt sets under the same methodology. The default en-US line remains 0.2.0; the current zh-CN line is 0.2.2.
Ragent6 0.2.0 contains 60 cases across 6 public dimensions. Each dimension has 10 cases.
| Dimension | Name | Weight | What It Tests |
|---|---|---|---|
| R1 | Task Closure | 20 | Complete task execution from goal understanding to artifact and verification. |
| R2 | Evidence Use | 18 | Evidence retrieval, precedence, long-context lookup, and grounded answers. |
| R3 | Format Control | 12 | Machine-consumable JSON, diffs, command plans, and labeled outputs. |
| R4 | Safety Boundary | 16 | Refusal or guarding of dangerous actions, offline constraints, dry-run behavior, and scope control. |
| R5 | Error Recovery | 14 | Multi-turn correction, stale evidence handling, invalid-output repair, and state updates. |
| R6 | Complex Reasoning | 20 | Causal chains, counterfactuals, decoys, policy selection, and deep state tracking. |
Ragent6 reports two scores:
- partial_weighted: the primary deterministic partial score, weighted to 100.
- strict_raw: the auxiliary strict pass count, reported as x/60.
Partial scoring is deterministic and uses only local trace evidence, checker outputs, tool calls, file snapshots, and final answers. It does not use an LLM judge.
Hard safety violations still receive zero credit, even under partial scoring.
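To illustrate how the dimension weights relate to a score out of 100, here is a minimal sketch. It assumes each dimension is first reduced to a 10-point partial score and then contributes `weight * score / 10`; the function name `partial_weighted` and the input values are illustrative and are not taken from the Ragent6 scorer, whose actual aggregation is defined in METHODOLOGY.md.

```python
# Weights copied from the dimension table above; they sum to 100.
WEIGHTS = {"R1": 20, "R2": 18, "R3": 12, "R4": 16, "R5": 14, "R6": 20}


def partial_weighted(dimension_scores):
    """Combine 10-point dimension scores into a 0-100 weighted score.

    Assumption: each dimension contributes weight * (score / 10).
    """
    missing = set(WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[dim] * dimension_scores[dim] / 10 for dim in WEIGHTS)


# Hypothetical per-dimension scores, each on a 0-10 scale.
example = {"R1": 9.0, "R2": 7.5, "R3": 10.0, "R4": 8.0, "R5": 9.0, "R6": 9.5}
print(round(partial_weighted(example), 1))
```

Under this assumption a model scoring 10 in every dimension reaches exactly 100, and a weak R1 or R6 result costs more than an equally weak R3 result because of the larger weights.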
cd Ragent6
python3 scripts/run_eval.py \
--manifest manifests/ragent6_0_2_0_en_US.json \
--adapter mock \
--out results/mock-0.2.0-en-US

Expected result:
Ragent6 0.2.0 en-US: 60/60 (invalid=0)
Start any OpenAI-compatible chat completions server first. For llama.cpp, disable hidden reasoning or thinking for comparable results.
Example llama.cpp flags:
-rea off --reasoning-budget 0 --chat-template-kwargs '{"enable_thinking":false}'

Then run:
cd Ragent6
export RAGENT6_BASE_URL=http://127.0.0.1:8080/v1
export RAGENT6_MODEL_ID=local-model
export RAGENT6_MAX_TOKENS=2048
export RAGENT6_AGENT_TIMEOUT=180
python3 scripts/run_eval.py \
--manifest manifests/ragent6_0_2_0_en_US.json \
--adapter native_local \
--out results/by-model/local-model/0.2.0/en-US/run-001

The native harness exposes four tools to the model when a case allows them:

- read
- write
- edit
- exec
Create a metadata file listing result directories. See examples/model_metadata.example.json.
For local archives, the recommended layout is results/by-model/<model-slug>/<suite-version>/<locale>/<run-id>/.
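The recommended layout can be generated programmatically so that run directories stay consistent across models. The helper below is a hypothetical convenience sketch, not part of the Ragent6 scripts; `result_dir` and its parameter names are assumptions chosen to mirror the placeholders in the layout.

```python
from pathlib import Path


def result_dir(model_slug, suite_version, locale, run_id, root="results/by-model"):
    """Build <root>/<model-slug>/<suite-version>/<locale>/<run-id>."""
    parts = (model_slug, suite_version, locale, run_id)
    for part in parts:
        # Reject empty or path-like components so a run cannot escape the layout.
        if not part or "/" in part or "\\" in part or part in {".", ".."}:
            raise ValueError(f"invalid path component: {part!r}")
    return Path(root).joinpath(*parts)


out = result_dir("local-model", "0.2.0", "en-US", "run-001")
print(out.as_posix())
```

The resulting path can be passed to the `--out` option of `scripts/run_eval.py` and listed in the metadata file.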
python3 scripts/score_results.py \
--metadata examples/model_metadata.example.json \
--out-json results/ragent6_scores.json \
--report reports/ragent6_scores.md

The table below is a local reference panel, run with the no-thinking native harness. Score is the deterministic partial weighted score out of 100, Passes is the auxiliary strict x/60 pass count, and each R1-R6 column is a 10-point partial dimension score.
| Rank | Base | Size | Variant | Quant | Score | Passes | R1 Closure | R2 Evidence | R3 Format | R4 Safety | R5 Recovery | R6 Reasoning |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3.6 | 27B | Qwopus Abliterated | Q4_K_M | 92.3 | 51/60 | 9.1 | 7.8 | 9.5 | 9.8 | 9.7 | 9.9 |
| 2 | Qwen3.6 | 27B | Qwopus Preview | Q4_K_M | 89.4 | 48/60 | 9.1 | 7.4 | 9.7 | 8.9 | 9.3 | 9.5 |
| 3 | Qwen3.6 | 27B | Unsloth IT | Q4_K_M | 87.4 | 49/60 | 9.1 | 7.8 | 9.7 | 8.9 | 9.3 | 8.2 |
| 4 | Qwen3.6 | 27B | Carnice V2 | Q4_K_M | 86.4 | 43/60 | 9.1 | 7.4 | 9.1 | 9.8 | 8.2 | 8.4 |
| 5 | Qwen3.6 | 35B A3B | Unsloth IT | Q4_K_M | 79.9 | 38/60 | 8.1 | 6.8 | 7.8 | 8.8 | 8.6 | 8.1 |
| 6 | Qwen3.5 | 27B | Unsloth IT | Q4_K_M | 76.7 | 37/60 | 8.3 | 5.3 | 9.4 | 7.7 | 8.6 | 7.5 |
| 7 | Qwen3.6 | 35B A3B | Hauhau | IQ4_XS | 76.6 | 37/60 | 8.1 | 6.9 | 9.1 | 5.1 | 8.8 | 8.3 |
| 8 | Gemma 4 | 31B | Unsloth IT | Q4_K_M | 74.7 | 37/60 | 8.3 | 8.1 | 9.0 | 6.3 | 9.3 | 4.8 |
| 9 | Gemma 4 | 26B A4B | Gemopus Preview | Q4_K_M | 72.1 | 34/60 | 8.3 | 7.2 | 8.2 | 5.7 | 8.9 | 5.5 |
| 10 | Qwen3.5 | 9B | CoPaw Flash | Q4_K_M | 67.7 | 30/60 | 5.1 | 4.7 | 8.7 | 6.6 | 8.2 | 8.3 |
| 11 | Gemma 4 | 26B A4B | Unsloth IT | Q4_K_M | 67.2 | 30/60 | 8.3 | 6.1 | 8.4 | 4.9 | 8.9 | 4.5 |
| 12 | Qwen3.5 | 35B A3B | Unsloth IT | Q4_K_M | 63.2 | 28/60 | 4.0 | 3.6 | 9.1 | 7.2 | 7.0 | 8.3 |
| 13 | Gemma 4 | E4B | Hauhau | Q5_K_M | 61.8 | 27/60 | 6.6 | 5.0 | 7.7 | 6.3 | 7.6 | 4.9 |
| 14 | Gemma 4 | E4B | Unsloth IT | Q5_K_M | 61.5 | 28/60 | 5.7 | 6.2 | 7.3 | 6.1 | 7.7 | 4.9 |
| 15 | Qwen3.5 | 9B | Carnice | Q4_K_M | 58.6 | 23/60 | 4.9 | 4.6 | 8.4 | 6.8 | 6.5 | 5.3 |
| 16 | Qwen3.5 | 13B | Heretic | IQ4_XS | 58.4 | 23/60 | 4.8 | 4.7 | 7.5 | 6.1 | 7.5 | 5.5 |
| 17 | Qwen3.5 | 18B | Qwopus GLM | Q4_K_M | 57.2 | 27/60 | 4.0 | 3.5 | 9.7 | 4.8 | 6.8 | 7.0 |
| 18 | Qwen3.5 | 9B | Unsloth IT | Q4_K_M | 56.8 | 24/60 | 5.1 | 3.3 | 8.2 | 6.3 | 8.1 | 4.7 |
| 19 | Qwen3.5 | 9B | Qwopus v3 | Q4_K_M | 56.6 | 25/60 | 2.8 | 5.7 | 8.5 | 5.5 | 6.7 | 6.1 |
| 20 | Qwen3.5 | 9B | A3 i1 | IQ4_NL | 54.7 | 19/60 | 4.1 | 4.7 | 6.5 | 7.3 | 6.6 | 4.5 |
| 21 | Qwen3.5 | 4B | OpenResearchTools | Q4_K_M | 52.1 | 19/60 | 2.3 | 4.1 | 7.5 | 3.2 | 7.3 | 7.8 |
| 22 | Gemma 4 | E2B | Unsloth IT | Q4_K_M | 49.6 | 18/60 | 3.3 | 4.4 | 6.1 | 4.5 | 6.5 | 5.8 |
| 23 | Qwen3 | 8B | Unsloth IT | Q4_K_M | 45.6 | 18/60 | 4.0 | 4.0 | 8.8 | 2.5 | 6.8 | 3.2 |
| 24 | Gemma 3n | E4B | Unsloth IT | Q4_K_M | 43.7 | 14/60 | 1.4 | 3.2 | 5.7 | 7.2 | 5.6 | 4.5 |
| 25 | Qwen3 | 4B | Unsloth IT | Q4_K_M | 40.3 | 14/60 | 4.0 | 3.7 | 7.1 | 2.1 | 6.3 | 2.5 |
| 26 | Gemma 4 | E4B | Gemopus Preview | IQ4_XS | 39.2 | 11/60 | 1.7 | 2.8 | 6.1 | 5.2 | 5.2 | 3.9 |
| 27 | Qwen3.5 | 2B | Unsloth IT | Q4_K_M | 36.6 | 10/60 | 2.5 | 3.9 | 5.1 | 4.1 | 5.2 | 2.4 |
| 28 | Gemma 3n | E2B | Unsloth IT | Q4_K_M | 35.7 | 8/60 | 1.3 | 3.4 | 6.8 | 4.3 | 5.6 | 2.1 |
| 29 | Qwen3.5 | 2B | AaryanK | Q4_K_M | 34.2 | 8/60 | 2.5 | 2.8 | 5.2 | 4.7 | 5.5 | 1.3 |
| 30 | Squeez | 2B | i1 | Q4_K_M | 30.5 | 7/60 | 2.0 | 2.6 | 5.5 | 4.0 | 4.6 | 1.3 |
| 31 | LFM2.5 | 350M | base | Q5_K_M | 20.7 | 2/60 | 1.0 | 1.7 | 3.3 | 2.9 | 4.1 | 0.7 |
| 32 | LFM2.5 | 1.2B | Instruct | Q4_K_M | 18.5 | 4/60 | 0.7 | 1.3 | 2.9 | 3.0 | 3.7 | 0.7 |
Note: the reference scores above were measured with the historical zh-CN 0.2.0 Chinese benchmark form. Rerun before publishing a zh-CN 0.2.2 leaderboard.
python3 scripts/release_audit.py \
--manifest manifests/ragent6_0_2_0_en_US.json \
--suite-version 0.2.0 \
--locale en-US

- METHODOLOGY.md: scoring policy and reproducibility rules.
- docs/CASES.md: public 60-case catalog.
- docs/LOCALES.md: prompt set policy.
- docs/VERSIONING.md: compatibility and version bump rules.
- docs/RELEASE_CHECKLIST.md: validation checklist before publishing results.
- results/by-model/README.md: recommended local layout for per-model result archives.
- Patch versions: documentation, reporting, or harness fixes that do not change scores.
- New benchmark versions: case, checker, scorer, weight, prompt set, or dimension changes that can change scores.
- Ragent6 public versioning starts at 0.2.0.
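The versioning rules above amount to a simple decision: any change that can alter scores requires a new benchmark version, and everything else is a patch. The sketch below encodes that rule; the change-kind labels and the `required_bump` function are hypothetical names for illustration, and docs/VERSIONING.md remains the authoritative policy.

```python
# Change kinds that can change scores, per the versioning rules above.
SCORE_AFFECTING = {"case", "checker", "scorer", "weight", "prompt_set", "dimension"}
# Change kinds assumed not to change scores.
NON_SCORE = {"documentation", "reporting", "harness_fix"}


def required_bump(change_kinds):
    """Return the version bump implied by a set of change kinds."""
    kinds = set(change_kinds)
    if not kinds:
        raise ValueError("no changes given")
    unknown = kinds - SCORE_AFFECTING - NON_SCORE
    if unknown:
        raise ValueError(f"unknown change kinds: {sorted(unknown)}")
    # A single score-affecting change is enough to require a new version.
    if kinds & SCORE_AFFECTING:
        return "new benchmark version"
    return "patch"


print(required_bump(["documentation", "reporting"]))
print(required_bump(["documentation", "checker"]))
```

A harness fix that turns out to change scores would need to be reclassified as score-affecting before applying this rule.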
Earlier experimental branches are intentionally not included in this clean repository. Case lineage is preserved inside docs/case_map.json.