We trained and open-sourced **Squeez-2B**, a compact model for pruning tool output in coding agents. Given a focused query and one raw tool observation, it returns the smallest verbatim evidence block that the agent should inspect next. On our held-out benchmark it reaches **0.86 recall at 92% compression**, outperforming a zero-shot **Qwen 3.5 35B A3B** baseline by **11 recall points** at essentially the same compression level. The model, dataset, and code are released on [Hugging Face](https://huggingface.co/KRLabsOrg/squeez-2b), [the dataset hub](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench), and [GitHub](https://github.com/KRLabsOrg/squeez).
This post explains the problem, describes how we built the benchmark, and shows why narrow supervision works better here than larger zero-shot models or simple retrieval heuristics.
## The Problem
The overall pipeline is shown below:

<p align="center">
<img src="./assets/squeez_overview.svg" alt="Squeez pipeline: from raw tool output through span annotation to generative model" width="920">
</p>
The benchmark is built from two sources. The first is [SWE-bench](https://openreview.net/forum?id=VTF8yNQM66), which provides real GitHub issue-resolution tasks over real repositories. We clone repository snapshots and execute 14 tool types against them — file reads, grep, Git log and blame, test runners, linters, type checkers, package installation, curl, and others — collecting **10,713** raw observations that reflect the kind of output a coding agent encounters during issue resolution.
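To make the collection step concrete, here is a minimal sketch of how one raw observation could be captured from a cloned snapshot. The helper name, the repository path, and the specific invocations are illustrative assumptions, not the exact harness; the real pipeline covers all 14 tool types and stores the output unmodified.

```python
import subprocess
from pathlib import Path

def capture_observation(repo_dir: Path, argv: list[str], timeout: int = 300) -> str:
    """Run one tool inside a repository snapshot and return its raw output."""
    result = subprocess.run(argv, cwd=repo_dir, capture_output=True, text=True, timeout=timeout)
    # Keep stdout and stderr together: tracebacks, linter messages, and test
    # failures often arrive on stderr, and the model is trained on the raw text.
    return result.stdout + result.stderr

# Illustrative invocations against a hypothetical local snapshot.
repo = Path("repos/pydata__xarray")
observations = {
    "grep": capture_observation(repo, ["grep", "-rn", "polyval", "xarray/core"]),
    "git_log": capture_observation(repo, ["git", "log", "--oneline", "-n", "25"]),
    "test_output": capture_observation(repo, ["python", "-m", "pytest", "-x", "-q"]),
}
```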
The second source is synthetic multi-ecosystem tool output, which extends coverage beyond SWE-bench's Python-heavy distribution. We use `openai/gpt-oss-120b` to generate **2,039** realistic tool outputs for representative tasks in TypeScript, Go, Rust, Java, Docker, Terraform, and Kubernetes workflows. We also construct explicit negatives by pairing mismatched queries and tool outputs, where the correct pruning decision is to return nothing.
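The negative construction can be pictured as a mismatched pairing step. The sketch below is illustrative: it assumes each synthetic example carries `query`, `tool_output`, and `tool` fields, which is our naming here rather than the released schema.

```python
import random

def make_negatives(examples: list[dict], n: int, seed: int = 0) -> list[dict]:
    """Pair a query with the tool output of a *different* example, so the
    correct extraction target is empty."""
    rng = random.Random(seed)
    negatives = []
    while len(negatives) < n:
        a, b = rng.sample(examples, 2)
        # Same-tool outputs could accidentally still satisfy the foreign query,
        # so only cross-tool pairs are kept in this sketch.
        if a["tool"] == b["tool"]:
            continue
        negatives.append({"query": a["query"], "tool_output": b["tool_output"], "target": ""})
    return negatives
```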
The executed SWE-derived subset covers 14 tool types; the full released benchmark reaches 27 tool families once the synthetic multi-ecosystem portion is added.
Each released positive example is labeled with the same two-stage teacher pipeline, again using `openai/gpt-oss-120b`, regardless of whether it comes from SWE-bench or from the synthetic portion. First, the teacher writes a focused extraction query for one observation. Second, it selects the smallest contiguous span, or small set of spans, that answers that query. The teacher sees a numbered rendering of the output for stable span selection, but the released labels are always mapped back onto the original raw text. Positive examples whose query cannot be supported by the observation are dropped rather than retained as accidental empty outputs. Explicit negatives are created separately in the synthetic portion, where the correct target is an empty extraction.
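The numbered-rendering step and the mapping back to raw text can be sketched in a few lines. This is a minimal illustration under the assumption that the teacher returns 1-based `(start, end)` line ranges; the function names are ours.

```python
def numbered_rendering(raw: str) -> str:
    """Show the observation to the teacher with stable 1-based line numbers."""
    return "\n".join(f"{i:>5} | {line}" for i, line in enumerate(raw.splitlines(), start=1))

def spans_to_verbatim(raw: str, spans: list[tuple[int, int]]) -> str:
    """Map teacher-selected line ranges back onto the original raw text.

    The released label is always verbatim source text, never the numbered view.
    """
    lines = raw.splitlines()
    blocks = ["\n".join(lines[start - 1:end]) for start, end in sorted(spans)]
    return "\n\n".join(blocks)

raw = "collected 3 items\n\ntest_polyval.py::test_order FAILED\nE   AssertionError: dims differ"
print(spans_to_verbatim(raw, [(3, 4)]))  # keeps only the failing test lines
```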
The held-out set was also manually curated. Starting from **729** candidate test examples, we removed **111** cases (15.2%) that were near-duplicates, trivial 1–2 line outputs, overly broad spans, or incorrect annotations. The final test set contains **618** manually reviewed examples.
The released benchmark contains **11,477** examples in total: **9,205** SWE-derived examples, **1,697** synthetic positives, and **575** synthetic negatives. SWE-derived examples are split by repository and synthetic examples by tool family.
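What the split protocol means in practice: entire repositories (or, for the synthetic portion, entire tool families) are held out, so no group leaks between train and test. A minimal sketch, with placeholder group names rather than the released split assignment:

```python
from collections import defaultdict

def grouped_split(examples: list[dict], key: str, test_groups: set[str]):
    """Hold out whole groups so that no repository or tool family appears in
    both train and test."""
    by_group = defaultdict(list)
    for ex in examples:
        by_group[ex[key]].append(ex)
    train, test = [], []
    for group, items in by_group.items():
        (test if group in test_groups else train).extend(items)
    return train, test

examples = [
    {"repo": "pydata/xarray", "query": "...", "tool_output": "..."},
    {"repo": "django/django", "query": "...", "tool_output": "..."},
]
train, test = grouped_split(examples, key="repo", test_groups={"django/django"})
```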
The benchmark covers **27** tool types. The largest families are shown below.
| `pip_install` | 441 | 438 | 79 |
| `type_check` | 317 | 3418 | 39 |
| `git_blame` | 291 | 4210 | 139 |
| remaining tools | 2873 | 688 | 47 |
The distribution is intentionally heterogeneous. `python` and `test_output` rows are short; `read_file`, `type_check`, and `git_blame` can be extremely long. This matters because the useful evidence does not follow one structural pattern, and it may occur at the beginning, middle, or end of the observation. That is also why simple truncation and lexical retrieval remain weak baselines here.
## Training a Small Model for a Narrow Task
We chose **Qwen 3.5 2B** as the base model ([Qwen3.5 blog post](https://qwen.ai/blog?id=qwen3.5)). The goal here is not to maximize zero-shot reasoning with the largest possible decoder. It is to learn a narrow supervised extraction policy that can run cheaply inside an agent loop. A 2B model is large enough to benefit from supervision, but still small enough to be practical for local serving and repeated tool use.
We fine-tuned the model with **LoRA** ([Hu et al., 2022](https://openreview.net/forum?id=nZeVKeeFYf9); [Dettmers et al., 2023](https://proceedings.neurips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html)) using the **Unsloth** stack. The model receives a focused extraction query and the raw tool observation, and is trained to emit the extracted evidence wrapped in `<relevant_lines>` tags. In other words, the supervision target is not a classification label and not a summary. It is the exact evidence block the model should keep.
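A rough sketch of what one supervision pair might look like under this setup. The prompt wording and field names are assumptions for illustration; the released training code defines the exact template.

```python
def build_training_pair(query: str, observation: str, evidence: str) -> dict:
    """Assemble one supervised example: the input carries the focused query and
    the raw tool observation, and the target is the verbatim evidence block
    wrapped in <relevant_lines> tags (empty tags when nothing should be kept)."""
    prompt = (
        "Extract the lines from the tool output that answer the query. "
        "Return them verbatim inside <relevant_lines> tags.\n\n"
        f"Query: {query}\n\nTool output:\n{observation}"
    )
    target = (
        f"<relevant_lines>\n{evidence}\n</relevant_lines>"
        if evidence
        else "<relevant_lines></relevant_lines>"
    )
    return {"prompt": prompt, "completion": target}
```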
We compare Squeez-2B against three zero-shot generative baselines and four heuristic baselines.
| Model | Recall | F1 | Compression |
|---|---:|---:|---:|
| **Squeez-2B** | **0.86** | **0.80** | 0.92 |
| Qwen 3.5 35B A3B | 0.75 | 0.73 | 0.92 |
| Kimi K2 | 0.53 | 0.68 | 0.94 |
| Qwen 3.5 2B (base) | 0.53 | 0.55 | 0.82 |
| BM25 (10%) | 0.22 | 0.23 | 0.90 |
| First-N (10%) | 0.14 | 0.16 | 0.91 |
| Random (10%) | 0.10 | 0.20 | 0.91 |
| Last-N (10%) | 0.05 | 0.14 | 0.91 |
Three results matter most. First, **task-specific training matters**: a fine-tuned 2B model outperforms the 18x larger Qwen 3.5 35B A3B by **11 recall points** at almost the same compression level. Second, **heuristics are not sufficient**: BM25 reaches only **0.22 recall**, because lexical overlap is a poor proxy for relevance in stack traces, logs, and mixed-format observations. Third, **aggressive compression alone is not enough**: Kimi K2 removes the largest fraction of tokens, but pays for that compression with a large recall drop.
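For readers who want to reproduce the numbers, one plausible way to score a prediction is sketched below, assuming line-level matching against the gold evidence block and compression measured over characters of the original observation; the benchmark's exact definitions may differ in detail.

```python
def score(predicted: str, gold: str, full_output: str) -> dict:
    """Line-level recall/precision/F1 against the gold evidence, plus compression
    as the fraction of the original observation that was discarded."""
    pred_lines, gold_lines = set(predicted.splitlines()), set(gold.splitlines())
    hit = len(pred_lines & gold_lines)
    recall = hit / len(gold_lines) if gold_lines else float(not pred_lines)
    precision = hit / len(pred_lines) if pred_lines else float(not gold_lines)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    compression = 1.0 - len(predicted) / max(len(full_output), 1)
    return {"recall": recall, "precision": precision, "f1": f1, "compression": compression}
```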
The recall-compression trade-off is shown below. Squeez-2B occupies the upper-left region: high recall with strong compression.
<p align="center">
<img src="./assets/squeez_results_chart.svg" alt="Recall vs compression across all models" width="920">
</p>
The aggregate numbers are only part of the story. Below are four qualitative patterns from the held-out test set.
**Precise selection in structured output.** In `grep` and `git_log`, the fine-tuned model learns to return the single relevant hit. Here is a 21-line `git_log` where the task is to find the commit that changed the dimension order of `xr.polyval` output:
```
fc282d59 re-add timedelta support for polyval (#6599)
cad4474a Fix polyval overloads (#6593)
6fbeb131 polyval: Use Horner's algorithm + support chunked inputs (#6548) ← gold
07de257c Simplify transpose in xr.dot (#5849)
... 17 more lines ...
```
| Model | Prediction | Correct? |
|---|---|---|
| **Squeez-2B** | `6fbeb131 polyval: Use Horner's algorithm...` | Yes |
| Qwen 3.5 35B A3B | `07de257c Simplify transpose in xr.dot` | No (wrong commit) |
Squeez picks the exact commit. Qwen 35B picks a plausible but wrong commit about transpose — right neighborhood, wrong entry.
**Failure-block extraction in logs.** This 176-line service log contains **two** separate TLS handshake failures at different timestamps. The query asks for the health-check failure:
| Model | Prediction | Correct? |
|---|---|---|
| Kimi K2 | Health-check TLS block (10:00:00) | Partial (3 of 5 lines) |
Qwen 35B selects a semantically similar but wrong block from a later request. This "right pattern, wrong instance" failure is common among zero-shot models on repetitive log output.
**Correct empty predictions.** On negative examples where the tool output does not contain the requested evidence, Squeez correctly returns nothing. In a 316-line `docker_logs` output, the query asks about a numpy version conflict between torch and tensorflow — but no such conflict exists. Squeez returns empty output; Qwen 35B generates "No relevant lines found..." (not verbatim tool output); the 2B base returns unrelated database errors. On the 59 negative examples in the test set, Squeez-2B correctly returns empty 80% of the time. Kimi K2 matches this (81%), likely because its aggressive compression tends toward empty output. Qwen 35B returns empty only 7% of the time.
**The kubectl example** illustrates the intended use case at a glance. The full observation contains 250 lines of pod description; the relevant evidence is a two-line block reporting `OOMKilled` and the exit code.
<p align="center">
<img src="./assets/squeez_qualitative_example.svg" alt="kubectl example: 2 relevant lines from 250" width="920">
</p>
**Remaining errors.** The strongest failures of Squeez-2B are semantically adjacent but incorrect selections. In a build log containing both a Dockerfile syntax error and a Python `SyntaxError`, Squeez correctly finds the Dockerfile error but also includes the nearby Python error. Qwen 35B picks *only* the Python error and misses the Dockerfile error entirely. This pattern — correct evidence plus some extra noise — accounts for most of the gap between Squeez's 0.86 recall and its 0.80 F1.
## Using Squeez
Examples:
The same pattern works with Codex and other agent setups that accept system-level instructions or shell wrappers.
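For programmatic use outside an agent harness, a minimal inference sketch with `transformers` might look like the following. The chat formatting, generation settings, and tag parsing here are assumptions; the repository's CLI is the documented interface.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "KRLabsOrg/squeez-2b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def squeeze(query: str, observation: str, max_new_tokens: int = 512) -> str:
    """Return the verbatim evidence block for one (query, tool output) pair."""
    messages = [{"role": "user", "content": f"Query: {query}\n\nTool output:\n{observation}"}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)
    text = tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
    # The model emits the evidence wrapped in <relevant_lines> tags; an empty
    # pair of tags means nothing in the observation is worth keeping.
    start, end = text.find("<relevant_lines>"), text.find("</relevant_lines>")
    if start == -1 or end == -1:
        return ""
    return text[start + len("<relevant_lines>"):end].strip()
```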
## Closing Remarks
One recurring bottleneck in coding agents is deciding what to keep from a single tool observation. Our results suggest that this bottleneck is both measurable and learnable: mixed-format tool output is not handled well by simple heuristics or larger zero-shot models alone, but it responds well to narrow supervision. That is the main claim behind Squeez. It is a small model for a small problem, but the problem turns out to matter.
## References
- Jiang, H., et al. (2023). *LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models*. [EMNLP](https://aclanthology.org/2023.emnlp-main.825/)
- Jiang, H., et al. (2024). *LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression*. [ACL](https://aclanthology.org/2024.acl-long.91/)
- Hwang, T., et al. (2025). *EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation*. [Findings of ACL](https://aclanthology.org/2025.findings-acl.253/)
- Chirkova, N., et al. (2025). *Provence: Efficient and Robust Context Pruning for Retrieval-Augmented Generation*. [ICLR](https://arxiv.org/abs/2501.16214)
- Kerboua, I., et al. (2025). *FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents*. [arXiv](https://arxiv.org/abs/2510.03204)
- Wang, Y., et al. (2026). *SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents*. [arXiv](https://arxiv.org/abs/2601.16746)
- Kovacs, A., Schmitt, P., Recski, G. (2025). *KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering*. [BioNLP Workshop](https://aclanthology.org/2025.bionlp-share.8/)
- Jimenez, C. E., et al. (2024). *SWE-bench: Can Language Models Resolve Real-World GitHub Issues?* [ICLR](https://openreview.net/forum?id=VTF8yNQM66)
- Qwen Team. (2026). *Qwen3.5: Towards Native Multimodal Agents*. [Blog post](https://qwen.ai/blog?id=qwen3.5)