Commit 1036ac5

Add Agentic harness engineering concepts
1 parent d57b882 commit 1036ac5

11 files changed

Lines changed: 341 additions & 9 deletions

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -5,6 +5,13 @@ All notable changes to this project are documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.1.8] - 2026-05-01
+
+### Added
+- AHE-style layered trace evidence corpus export from `TraceStore`.
+- New `trace_analysis` action `export_evidence_corpus` for writing `overview.md`, per-trace detail reports, `index.json`, and optional processed raw JSONL spans.
+- Evidence corpus tests covering direct store export and environment action export.
+
 ## [0.1.7] - 2026-04-30

 ### Added
@@ -69,4 +76,5 @@ Initial public release of **RLM Code**.

 [0.1.5]: https://github.com/SuperagenticAI/rlm-code/releases/tag/v0.1.5
 [0.1.6]: https://github.com/SuperagenticAI/rlm-code/releases/tag/v0.1.6
+[0.1.8]: https://github.com/SuperagenticAI/rlm-code/releases/tag/v0.1.8
 [0.1.7]: https://github.com/SuperagenticAI/rlm-code/releases/tag/v0.1.7

README.md

Lines changed: 3 additions & 2 deletions
@@ -25,12 +25,13 @@ RLM Code implements the [Recursive Language Models](https://arxiv.org/abs/2502.0

 RLM Code wraps this algorithm in an interactive terminal UI with built-in benchmarks, trajectory replay, and observability.

-## Release v0.1.7
+## Release v0.1.8

-This release adds HALO-style trace analysis as a new RLM environment.
+This release extends HALO/AHE-style trace analysis with layered evidence export.

 - New `trace_analysis` environment for diagnosing agent harness failures from OTel-shaped JSONL traces
 - Sidecar trace indexing with dataset overview, query, count, search, full-trace view, and selected-span view actions
+- AHE-style evidence corpus export with `overview.md`, per-trace detail reports, `index.json`, and optional processed raw JSONL spans
 - Bounded payload handling for large traces, including oversized summaries and higher-cap surgical span reads
 - `/rlm` help/docs updated for `env=trace_analysis`
 - Dedicated trace analysis docs under the Core Engine section

docs/core/trace-analysis.md

Lines changed: 37 additions & 0 deletions
@@ -8,6 +8,11 @@ sidecar cache, exposes bounded trace-inspection actions to the RLM planner, and
 keeps large payloads under control by returning summaries or selected spans
 instead of blindly loading full traces into context.

+It can also export an AHE-style layered evidence corpus for downstream coding
+agents or `meta-harness`: a benchmark-level `overview.md`, one detail report per
+selected trace, an `index.json`, and optional processed raw JSONL span files for
+drill-down.
+
 ## Usage

 ```text
@@ -30,11 +35,43 @@ The environment supports these planner actions:
 | `view_trace` | Read all spans for a small trace, or return an oversized summary |
 | `search_trace` | Search one trace for a literal substring |
 | `view_spans` | Read selected spans at a higher per-attribute cap |
+| `export_evidence_corpus` | Write layered evidence files for downstream harness optimization |
 | `final` | Return the final evidence report |

 Supported filters are `has_errors`, `model_names`, `service_names`,
 `agent_names`, and `project_id`.

+## Evidence Corpus Export
+
+Use `export_evidence_corpus` when a report should be handed to another coding
+agent or to `meta-harness --trace-evidence`.
+
+Planner action shape:
+
+```json
+{
+  "action": "export_evidence_corpus",
+  "output_dir": "./trace-evidence",
+  "filters": {"has_errors": true},
+  "limit": 100,
+  "include_raw": true
+}
+```
+
+The output directory contains:
+
+- `overview.md`: compact entry point with dataset counts and links to detail files
+- `detail/<trace-id>.md`: per-trace summary, task ids, error spans, and tool-like spans
+- `raw/<trace-id>.jsonl`: processed selected raw spans for drill-down when `include_raw` is true
+- `index.json`: machine-readable corpus metadata and trace file references
+
+For MetaHarness, pass the generated overview directly:
+
+```bash
+uv run metaharness run ./my-harness \
+  --trace-evidence ./trace-evidence/overview.md
+```
+
 ## Trace Shape

 The first implementation expects one JSON object per line. Each line should
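
Reviewer sketch (not part of the commit): a downstream consumer may want to sanity-check an exported corpus against the layout documented in the hunk above before handing `overview.md` to another tool. The helper name `check_corpus_layout` is hypothetical, and nothing is assumed about the contents of `index.json` beyond it being valid JSON.

```python
import json
from pathlib import Path


def _stems(directory: Path, pattern: str) -> list[str]:
    """Return sorted file stems matching pattern, or [] if the directory is absent."""
    if not directory.is_dir():
        return []
    return sorted(p.stem for p in directory.glob(pattern))


def check_corpus_layout(output_dir: str) -> dict:
    """Summarize an exported evidence corpus directory (hypothetical helper)."""
    root = Path(output_dir)
    required = [root / "overview.md", root / "index.json"]
    missing = [p.name for p in required if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"missing corpus files: {', '.join(missing)}")

    # Only verifies that index.json parses; its schema is not assumed here.
    json.loads((root / "index.json").read_text(encoding="utf-8"))

    detail_ids = _stems(root / "detail", "*.md")
    raw_ids = _stems(root / "raw", "*.jsonl")  # empty when include_raw was false
    return {
        "detail_traces": detail_ids,
        "raw_traces": raw_ids,
        "detail_without_raw": sorted(set(detail_ids) - set(raw_ids)),
    }
```

Calling `check_corpus_layout("./trace-evidence")` after an export returns the trace ids found under `detail/` and `raw/`, which makes it easy to see whether raw spans were written for every selected trace.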

docs/index.md

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@

 <p class="rlm-tagline">Research Playground & Evaluation OS for Recursive Language Model Agentic Systems</p>

-<span class="rlm-badge rlm-badge--purple">v0.1.7</span>
+<span class="rlm-badge rlm-badge--purple">v0.1.8</span>
 <span class="rlm-badge rlm-badge--green">Python 3.11+</span>
 <span class="rlm-badge rlm-badge--blue">Apache 2.0</span>

@@ -47,7 +47,7 @@ Run **Pure RLM** (paper-compliant with context-as-variable), **CodeAct** (contex
 <div class="rlm-feature-card" markdown>

 ### 🔎 Trace Analysis
-Run HALO-style trace diagnosis with `env=trace_analysis` over OTel-shaped JSONL traces to find repeated harness failure modes.
+Run HALO/AHE-style trace diagnosis with `env=trace_analysis` over OTel-shaped JSONL traces, then export layered evidence for MetaHarness.

 </div>


pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

 [project]
 name = "rlm-code"
-version = "0.1.7"
+version = "0.1.8"
 description = "RLM Code: Research Playground & Evaluation OS for Recursive Language Model Agentic Systems"
 readme = "README.md"
 license = "Apache-2.0"

rlm_code/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -5,5 +5,5 @@
 through natural language interactions.
 """

-__version__ = "0.1.7"
+__version__ = "0.1.8"
 __author__ = "Super Agentic AI"

rlm_code/mcp/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
 )
 from .session_wrapper import MCPSessionWrapper

-__version__ = "0.1.7"
+__version__ = "0.1.8"

 __all__ = [
     "MCPClientManager",

rlm_code/rlm/environments.py

Lines changed: 32 additions & 1 deletion
@@ -306,8 +306,10 @@ def system_prompt(self) -> str:
             "Return ONLY valid JSON object with keys:\n"
             "{"
             '"action": "set_trace_path" | "get_dataset_overview" | "query_traces" | '
-            '"count_traces" | "view_trace" | "search_trace" | "view_spans" | "final", '
+            '"count_traces" | "view_trace" | "search_trace" | "view_spans" | '
+            '"export_evidence_corpus" | "final", '
             '"trace_path": "<path to JSONL traces>", '
+            '"output_dir": "<directory for exported evidence corpus>", '
             '"filters": {"has_errors": true, "model_names": ["..."], "service_names": ["..."], '
             '"agent_names": ["..."], "project_id": "..."}, '
             '"trace_id": "<trace id>", '
@@ -324,6 +326,7 @@ def system_prompt(self) -> str:
             "- Always begin analysis with get_dataset_overview.\n"
             "- Use query_traces to choose real trace ids; never invent trace ids.\n"
             "- For large traces, prefer search_trace followed by view_spans.\n"
+            "- Use export_evidence_corpus when the caller needs files for MetaHarness or another coding agent.\n"
             "- Identify systemic harness failures, not one-off anomalies.\n"
             "- Output JSON only."
         )
@@ -448,6 +451,21 @@ def execute_action(
                     reward=0.7,
                     memory_note=f"Viewed selected spans for trace {trace_id}.",
                 )
+            if action_name == "export_evidence_corpus":
+                output_dir = self._required_str(action, "output_dir")
+                resolved_output = Path(output_dir).expanduser()
+                if not resolved_output.is_absolute():
+                    resolved_output = self.workdir / resolved_output
+                return self._ok(
+                    observation=store.export_evidence_corpus(
+                        resolved_output,
+                        filters,
+                        limit=self._int_arg(action, "limit", 100, minimum=1, maximum=1000),
+                        include_raw=self._bool_arg(action, "include_raw", True),
+                    ),
+                    reward=0.75,
+                    memory_note="Exported layered trace evidence corpus.",
+                )
         except Exception as exc:
             return EnvironmentActionResult(
                 observation={"success": False, "error": f"{type(exc).__name__}: {exc}"},
@@ -530,6 +548,19 @@ def _int_arg(
             parsed = default
         return max(minimum, min(maximum, parsed))

+    @staticmethod
+    def _bool_arg(action: dict[str, Any], key: str, default: bool) -> bool:
+        value = action.get(key, default)
+        if isinstance(value, bool):
+            return value
+        if isinstance(value, str):
+            normalized = value.strip().lower()
+            if normalized in {"1", "true", "yes", "on"}:
+                return True
+            if normalized in {"0", "false", "no", "off"}:
+                return False
+        return default
+

 class DSPyCodingRLMEnvironment(GenericRLMEnvironment):
     """DSPy-focused environment with file edit + tests + DSPy-aware scoring."""
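
Reviewer sketch (not part of the commit): the argument handling in the `export_evidence_corpus` branch can be restated as three standalone functions to make its edge cases easier to see. The names `coerce_bool`, `resolve_output_dir`, and `clamp_int` are hypothetical. `coerce_bool` and `resolve_output_dir` mirror the added lines shown above, while the parsing step in `clamp_int` is an assumption, since only the tail of `_int_arg` is visible in the diff.

```python
from pathlib import Path
from typing import Any

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def coerce_bool(action: dict[str, Any], key: str, default: bool) -> bool:
    """Mirror of _bool_arg: accept real bools and a fixed set of strings."""
    value = action.get(key, default)
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        normalized = value.strip().lower()
        if normalized in _TRUE:
            return True
        if normalized in _FALSE:
            return False
    # Anything else (unknown strings, ints, None) falls back to the default.
    return default


def resolve_output_dir(output_dir: str, workdir: Path) -> Path:
    """Expand '~' and anchor relative paths at workdir, as in the new branch."""
    resolved = Path(output_dir).expanduser()
    return resolved if resolved.is_absolute() else workdir / resolved


def clamp_int(value: Any, default: int, minimum: int, maximum: int) -> int:
    """Clamp to [minimum, maximum]; the int() parsing here is an assumption."""
    try:
        parsed = int(value)
    except (TypeError, ValueError):
        parsed = default
    return max(minimum, min(maximum, parsed))


print(coerce_bool({"include_raw": " No "}, "include_raw", True))  # False
print(coerce_bool({"include_raw": 0}, "include_raw", True))       # True (int is ignored)
print(clamp_int(5000, 100, 1, 1000))                              # 1000
```

One consequence worth noting for reviewers: a planner that emits `"include_raw": 0` gets the default (`True`), because integers are neither `bool` nor `str` in this check.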
