Ragent6

Ragent6 is a deterministic, local benchmark for agent-capable language models. It tests whether a model can operate inside a weak tool harness, read evidence, write or edit files, run local checks, obey safety boundaries, recover from errors, and solve multi-constraint reasoning tasks.

Current release version: 0.2.0.

Status: 0.2.0 is the first public benchmark release. The current Chinese adaptation release is zh-CN 0.2.2.

Ragent6 provides English and Chinese prompt sets under the same methodology. The default en-US line remains 0.2.0; the current zh-CN line is 0.2.2.

What It Measures

Ragent6 0.2.0 contains 60 cases across 6 public dimensions. Each dimension has 10 cases.

Dimension	Name	Weight	What It Tests
R1	Task Closure	20	Complete task execution from goal understanding to artifact and verification.
R2	Evidence Use	18	Evidence retrieval, precedence, long-context lookup, and grounded answers.
R3	Format Control	12	Machine-consumable JSON, diffs, command plans, and labeled outputs.
R4	Safety Boundary	16	Refusal or guarding of dangerous actions, offline constraints, dry-run behavior, and scope control.
R5	Error Recovery	14	Multi-turn correction, stale evidence handling, invalid-output repair, and state updates.
R6	Complex Reasoning	20	Causal chains, counterfactuals, decoys, policy selection, and deep state tracking.

Scoring

Ragent6 reports two scores:

partial_weighted: the primary deterministic partial score, weighted to 100.
strict_raw: the auxiliary strict pass count, reported as x/60.

Partial scoring is deterministic and uses only local trace evidence, checker outputs, tool calls, file snapshots, and final answers. It does not use an LLM judge.

Safety hard violations still receive zero credit.

Run The Mock Smoke Test

cd Ragent6
python3 scripts/run_eval.py \
  --manifest manifests/ragent6_0_2_0_en_US.json \
  --adapter mock \
  --out results/mock-0.2.0-en-US

Expected result:

Ragent6 0.2.0 en-US: 60/60 (invalid=0)

Run A Local Model

Start any OpenAI-compatible chat completions server first. For llama.cpp, disable hidden reasoning or thinking for comparable results.

Example llama.cpp flags:

-rea off --reasoning-budget 0 --chat-template-kwargs '{"enable_thinking":false}'

Then run:

cd Ragent6
export RAGENT6_BASE_URL=http://127.0.0.1:8080/v1
export RAGENT6_MODEL_ID=local-model
export RAGENT6_MAX_TOKENS=2048
export RAGENT6_AGENT_TIMEOUT=180

python3 scripts/run_eval.py \
  --manifest manifests/ragent6_0_2_0_en_US.json \
  --adapter native_local \
  --out results/by-model/local-model/0.2.0/en-US/run-001

The native harness exposes four tools to the model when a case allows them:

read
write
edit
exec

Compute Partial Scores

Create a metadata file listing result directories. See examples/model_metadata.example.json. For local archives, the recommended layout is results/by-model/<model-slug>/<suite-version>/<locale>/<run-id>/.

python3 scripts/score_results.py \
  --metadata examples/model_metadata.example.json \
  --out-json results/ragent6_scores.json \
  --report reports/ragent6_scores.md

Reference Scores

The table below is a local reference panel, run with the no-thinking native harness. Score is the deterministic partial weighted score out of 100, Passes is the auxiliary strict x/60 pass count, and each R1-R6 column is a 10-point partial dimension score.

Rank	Base	Size	Variant	Quant	Score	Passes	R1 Closure	R2 Evidence	R3 Format	R4 Safety	R5 Recovery	R6 Reasoning
1	Qwen3.6	27B	Qwopus Abliterated	Q4_K_M	92.3	51/60	9.1	7.8	9.5	9.8	9.7	9.9
2	Qwen3.6	27B	Qwopus Preview	Q4_K_M	89.4	48/60	9.1	7.4	9.7	8.9	9.3	9.5
3	Qwen3.6	27B	Unsloth IT	Q4_K_M	87.4	49/60	9.1	7.8	9.7	8.9	9.3	8.2
4	Qwen3.6	27B	Carnice V2	Q4_K_M	86.4	43/60	9.1	7.4	9.1	9.8	8.2	8.4
5	Qwen3.6	35B A3B	Unsloth IT	Q4_K_M	79.9	38/60	8.1	6.8	7.8	8.8	8.6	8.1
6	Qwen3.5	27B	Unsloth IT	Q4_K_M	76.7	37/60	8.3	5.3	9.4	7.7	8.6	7.5
7	Qwen3.6	35B A3B	Hauhau	IQ4_XS	76.6	37/60	8.1	6.9	9.1	5.1	8.8	8.3
8	Gemma 4	31B	Unsloth IT	Q4_K_M	74.7	37/60	8.3	8.1	9.0	6.3	9.3	4.8
9	Gemma 4	26B A4B	Gemopus Preview	Q4_K_M	72.1	34/60	8.3	7.2	8.2	5.7	8.9	5.5
10	Qwen3.5	9B	CoPaw Flash	Q4_K_M	67.7	30/60	5.1	4.7	8.7	6.6	8.2	8.3
11	Gemma 4	26B A4B	Unsloth IT	Q4_K_M	67.2	30/60	8.3	6.1	8.4	4.9	8.9	4.5
12	Qwen3.5	35B A3B	Unsloth IT	Q4_K_M	63.2	28/60	4.0	3.6	9.1	7.2	7.0	8.3
13	Gemma 4	E4B	Hauhau	Q5_K_M	61.8	27/60	6.6	5.0	7.7	6.3	7.6	4.9
14	Gemma 4	E4B	Unsloth IT	Q5_K_M	61.5	28/60	5.7	6.2	7.3	6.1	7.7	4.9
15	Qwen3.5	9B	Carnice	Q4_K_M	58.6	23/60	4.9	4.6	8.4	6.8	6.5	5.3
16	Qwen3.5	13B	Heretic	IQ4_XS	58.4	23/60	4.8	4.7	7.5	6.1	7.5	5.5
17	Qwen3.5	18B	Qwopus GLM	Q4_K_M	57.2	27/60	4.0	3.5	9.7	4.8	6.8	7.0
18	Qwen3.5	9B	Unsloth IT	Q4_K_M	56.8	24/60	5.1	3.3	8.2	6.3	8.1	4.7
19	Qwen3.5	9B	Qwopus v3	Q4_K_M	56.6	25/60	2.8	5.7	8.5	5.5	6.7	6.1
20	Qwen3.5	9B	A3 i1	IQ4_NL	54.7	19/60	4.1	4.7	6.5	7.3	6.6	4.5
21	Qwen3.5	4B	OpenResearchTools	Q4_K_M	52.1	19/60	2.3	4.1	7.5	3.2	7.3	7.8
22	Gemma 4	E2B	Unsloth IT	Q4_K_M	49.6	18/60	3.3	4.4	6.1	4.5	6.5	5.8
23	Qwen3	8B	Unsloth IT	Q4_K_M	45.6	18/60	4.0	4.0	8.8	2.5	6.8	3.2
24	Gemma 3n	E4B	Unsloth IT	Q4_K_M	43.7	14/60	1.4	3.2	5.7	7.2	5.6	4.5
25	Qwen3	4B	Unsloth IT	Q4_K_M	40.3	14/60	4.0	3.7	7.1	2.1	6.3	2.5
26	Gemma 4	E4B	Gemopus Preview	IQ4_XS	39.2	11/60	1.7	2.8	6.1	5.2	5.2	3.9
27	Qwen3.5	2B	Unsloth IT	Q4_K_M	36.6	10/60	2.5	3.9	5.1	4.1	5.2	2.4
28	Gemma 3n	E2B	Unsloth IT	Q4_K_M	35.7	8/60	1.3	3.4	6.8	4.3	5.6	2.1
29	Qwen3.5	2B	AaryanK	Q4_K_M	34.2	8/60	2.5	2.8	5.2	4.7	5.5	1.3
30	Squeez	2B	i1	Q4_K_M	30.5	7/60	2.0	2.6	5.5	4.0	4.6	1.3
31	LFM2.5	350M	base	Q5_K_M	20.7	2/60	1.0	1.7	3.3	2.9	4.1	0.7
32	LFM2.5	1.2B	Instruct	Q4_K_M	18.5	4/60	0.7	1.3	2.9	3.0	3.7	0.7

Note: the reference scores above were measured with the historical zh-CN 0.2.0 Chinese benchmark form. Rerun before publishing a zh-CN 0.2.2 leaderboard.

Audit Release Assets

python3 scripts/release_audit.py \
  --manifest manifests/ragent6_0_2_0_en_US.json \
  --suite-version 0.2.0 \
  --locale en-US

Documentation

METHODOLOGY.md: scoring policy and reproducibility rules.
docs/CASES.md: public 60-case catalog.
docs/LOCALES.md: prompt set policy.
docs/VERSIONING.md: compatibility and version bump rules.
docs/RELEASE_CHECKLIST.md: validation checklist before publishing results.
results/by-model/README.md: recommended local layout for per-model result archives.

Versioning

Patch versions: documentation, reporting, or harness fixes that do not change scores.
New benchmark versions: case, checker, scorer, weight, prompt set, or dimension changes that can change scores.
Ragent6 public versioning starts at 0.2.0.

Earlier experimental branches are intentionally not included in this clean repository. Case lineage is preserved inside docs/case_map.json.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Ragent6

What It Measures

Scoring

Run The Mock Smoke Test

Run A Local Model

Compute Partial Scores

Reference Scores

Audit Release Assets

Documentation

Versioning

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
cases		cases
docs		docs
examples		examples
manifests		manifests
metadata		metadata
ragent6		ragent6
reports		reports
results		results
schema		schema
scripts		scripts
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
METHODOLOGY.md		METHODOLOGY.md
Makefile		Makefile
README.md		README.md
README.zh-CN.md		README.zh-CN.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

Ragent6

What It Measures

Scoring

Run The Mock Smoke Test

Run A Local Model

Compute Partial Scores

Reference Scores

Audit Release Assets

Documentation

Versioning

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages