Commit 5b58d19: Blogpost update
1 parent 6a92ab1
2 files changed: 74 additions & 37 deletions

File tree

- blog/assets/squeez_overview.svg (1 addition & 1 deletion)
- blog/huggingface_blogpost.md (73 additions & 36 deletions)
@@ -8,21 +8,15 @@ pinned: false
license: "apache-2.0"
---

-# How We Built a Tool-Output Pruner for Coding Agents
+# Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents

<p align="center">
  <img src="./assets/squeez_mascot.png" alt="Squeez mascot" width="180">
</p>

-We trained and open-sourced **Squeez-2B**, a compact model for pruning tool output in coding agents. Given a focused query and one raw tool observation, it returns the smallest verbatim evidence block that the agent should inspect next. On our held-out benchmark it reaches **0.862 recall at 91.5% compression**, outperforming a zero-shot **Qwen 3.5 35B A3B** baseline by **11.3 recall points** while operating at essentially the same compression level.
+We trained and open-sourced **Squeez-2B**, a compact model for pruning tool output in coding agents. Given a focused query and one raw tool observation, it returns the smallest verbatim evidence block that the agent should inspect next. On our held-out benchmark it reaches **0.86 recall at 92% compression**, outperforming a zero-shot **Qwen 3.5 35B A3B** baseline by **11 recall points** at essentially the same compression level. The model, dataset, and code are released on [Hugging Face](https://huggingface.co/KRLabsOrg/squeez-2b), [the dataset hub](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench), and [GitHub](https://github.com/KRLabsOrg/squeez).

-The release consists of three parts:
-
-- Model: [KRLabsOrg/squeez-2b](https://huggingface.co/KRLabsOrg/squeez-2b)
-- Dataset: [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench)
-- Code and CLI: [github.com/KRLabsOrg/squeez](https://github.com/KRLabsOrg/squeez)
-
-This post describes the problem, explains how we built a benchmark for it, and shows that dedicated supervision works substantially better than larger zero-shot models or simple retrieval heuristics.
+This post explains the problem, describes how we built the benchmark, and shows why narrow supervision works better here than larger zero-shot models or simple retrieval heuristics.

## The Problem

@@ -99,13 +93,15 @@ The overall pipeline is shown below:
  <img src="./assets/squeez_overview.svg" alt="Squeez pipeline: from raw tool output through span annotation to generative model" width="920">
</p>

-The benchmark is built from two sources. The first is [SWE-bench](https://openreview.net/forum?id=VTF8yNQM66), which provides real GitHub issue-resolution tasks over real repositories. We do not use SWE-bench as another patch-generation benchmark. Instead, we use it as a source of realistic repository snapshots, issue contexts, and raw tool observations. Starting from cloned SWE-bench repositories, we collected or reused **10,713** raw tool observations, including file reads, grep hits, Git history, shell output, test results, Python exceptions, and package-manager traces.
+The benchmark is built from two sources. The first is [SWE-bench](https://openreview.net/forum?id=VTF8yNQM66), which provides real GitHub issue-resolution tasks over real repositories. We clone repository snapshots and execute 14 tool types against them (file reads, grep, Git log and blame, test runners, linters, type checkers, package installation, curl, and others), collecting **10,713** raw observations that reflect the kind of output a coding agent encounters during issue resolution.

-The second source is synthetic multi-ecosystem tool output. Its role is to broaden coverage where SWE-bench is thin, especially outside the Python-heavy distribution of repository-level issue fixing. Starting from **2,039** raw synthetic observations, we add examples from TypeScript, Go, Rust, Java, Docker, Terraform, and Kubernetes workflows, and we also construct explicit negatives where the correct pruning decision is to return nothing.
+The second source is synthetic multi-ecosystem tool output, which extends coverage beyond SWE-bench's Python-heavy distribution. We use `openai/gpt-oss-120b` to generate **2,039** realistic tool outputs for representative tasks in TypeScript, Go, Rust, Java, Docker, Terraform, and Kubernetes workflows. We also construct explicit negatives by pairing mismatched queries and tool outputs, where the correct pruning decision is to return nothing.
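The mismatched-pairing construction for negatives can be sketched as follows. This is illustrative code, not the released pipeline; `make_negatives` and the field names are hypothetical, and it assumes examples are stored as query/output pairs:

```python
import random

def make_negatives(examples, seed=0):
    """Pair each query with a tool output from a *different* example.

    `examples` is a list of {"query": ..., "output": ...} dicts; the
    correct extraction target for every produced negative is empty.
    """
    rng = random.Random(seed)
    # Rotate by a random non-zero offset so no query keeps its own output.
    shift = rng.randrange(1, len(examples))
    negatives = []
    for i in range(len(examples)):
        j = (i + shift) % len(examples)
        negatives.append({
            "query": examples[i]["query"],
            "output": examples[j]["output"],
            "target": "",  # correct pruning decision: return nothing
        })
    return negatives
```

The fixed rotation guarantees every pairing is a mismatch, which a naive shuffle does not.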

-Each released example is built with a two-stage teacher-labeling pipeline using `openai/gpt-oss-120b`. First, the teacher writes a focused extraction query for one observation. Second, it selects the smallest contiguous span or set of spans that answers that query. The teacher sees a numbered rendering of the output for stable span selection, but the released labels are always mapped back onto the original raw text. This is a deliberate design choice: the benchmark stores a pruning decision over the source observation, not a free-form textual explanation of it. Positive examples whose query cannot be supported by the observation are dropped rather than retained as accidental empty outputs. Explicit negatives are created separately in the synthetic portion.
+The executed SWE-derived subset covers 14 tool types; the full released benchmark reaches 27 tool families once the synthetic multi-ecosystem portion is added.

-The held-out set was manually curated. Starting from **729** candidate test examples, we removed **111** cases (15.2%) that were near-duplicates, trivial 1-2 line outputs, overly broad spans, or incorrect annotations. The final test set contains **618** manually reviewed examples.
+Each released positive example is labeled with the same two-stage teacher pipeline, again using `openai/gpt-oss-120b`, regardless of whether it comes from SWE-bench or from the synthetic portion. First, the teacher writes a focused extraction query for one observation. Second, it selects the smallest contiguous span, or small set of spans, that answers that query. The teacher sees a numbered rendering of the output for stable span selection, but the released labels are always mapped back onto the original raw text. Positive examples whose query cannot be supported by the observation are dropped rather than retained as accidental empty outputs. Explicit negatives are created separately in the synthetic portion, where the correct target is an empty extraction.
+
+The held-out set was also manually curated. Starting from **729** candidate test examples, we removed **111** cases (15.2%) that were near-duplicates, trivial 1-2 line outputs, overly broad spans, or incorrect annotations. The final test set contains **618** manually reviewed examples.
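The numbered-rendering trick described above can be sketched in a few lines (a simplified illustration, not the released labeling code): the teacher selects stable line numbers, and those numbers are mapped back to verbatim lines of the raw observation.

```python
def number_lines(raw: str) -> str:
    """Render the observation with 1-based line numbers for the teacher."""
    return "\n".join(
        f"{i}: {line}" for i, line in enumerate(raw.splitlines(), start=1)
    )

def spans_to_text(raw: str, selected: list[int]) -> str:
    """Map teacher-selected line numbers back onto the original raw text,
    so the stored label is verbatim source, never a paraphrase."""
    lines = raw.splitlines()
    return "\n".join(lines[i - 1] for i in selected)
```

Selecting over numbers rather than quoted text avoids labels that subtly rewrite the observation.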

The released benchmark contains **11,477** examples in total: **9,205** SWE-derived examples, **1,697** synthetic positives, and **575** synthetic negatives. SWE-derived examples are split by repository and synthetic examples by tool family.

@@ -129,13 +125,13 @@ The benchmark covers **27** tool types. The largest families are shown below.
| `pip_install` | 441 | 438 | 79 |
| `type_check` | 317 | 3418 | 39 |
| `git_blame` | 291 | 4210 | 139 |
-| remaining tools | 3873 | 688 | 47 |
+| remaining tools | 2873 | 688 | 47 |

-The distribution is intentionally heterogeneous. `python` and `test_output` rows are short; `read_file`, `type_check`, and `git_blame` can be extremely long. This variation is one reason simple truncation and lexical retrieval perform poorly: the useful evidence does not follow one structural pattern, and it may occur at the beginning, middle, or end of the observation.
+The distribution is intentionally heterogeneous. `python` and `test_output` rows are short; `read_file`, `type_check`, and `git_blame` can be extremely long. This matters because the useful evidence does not follow one structural pattern, and it may occur at the beginning, middle, or end of the observation. That is also why simple truncation and lexical retrieval remain weak baselines here.

## Training a Small Model for a Narrow Task

-We chose **Qwen 3.5 2B** as the base model ([Qwen3 Technical Report](https://arxiv.org/abs/2505.09388)). The choice was deliberate. The goal here is not to maximize zero-shot reasoning with the largest possible decoder. It is to learn a narrow supervised extraction policy that can run cheaply inside an agent loop. A dense 2B model is large enough to benefit from supervision, but still small enough to be practical for local serving and repeated tool use.
+We chose **Qwen 3.5 2B** as the base model ([Qwen3.5 blog post](https://qwen.ai/blog?id=qwen3.5)). The goal here is not to maximize zero-shot reasoning with the largest possible decoder. It is to learn a narrow supervised extraction policy that can run cheaply inside an agent loop. A 2B model is large enough to benefit from supervision, but still small enough to be practical for local serving and repeated tool use.

We fine-tuned the model with **LoRA** ([Hu et al., 2022](https://openreview.net/forum?id=nZeVKeeFYf9); [Dettmers et al., 2023](https://proceedings.neurips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html)) using the **Unsloth** stack. The model receives a focused extraction query and the raw tool observation, and is trained to emit the extracted evidence wrapped in `<relevant_lines>` tags. In other words, the supervision target is not a classification label and not a summary. It is the exact evidence block the model should keep.
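In sketch form, a supervised pair might look like the following. The exact prompt template is internal to the release; only the `<relevant_lines>` tag format follows the description above, and `build_example`/`parse_prediction` are illustrative names:

```python
def build_example(query: str, observation: str, evidence: str) -> dict:
    """One supervised pair: the target is the verbatim evidence block
    wrapped in <relevant_lines> tags, not a summary or a label."""
    prompt = (
        f"Query: {query}\n\n"
        f"Tool output:\n{observation}\n\n"
        "Return only the relevant lines."
    )
    target = f"<relevant_lines>\n{evidence}\n</relevant_lines>"
    return {"prompt": prompt, "target": target}

def parse_prediction(text: str) -> str:
    """Extract the evidence block from a model completion; an empty or
    missing block means 'keep nothing' (the negative-example case)."""
    start, end = "<relevant_lines>", "</relevant_lines>"
    i, j = text.find(start), text.find(end)
    if i == -1 or j == -1 or j <= i:
        return ""
    return text[i + len(start):j].strip("\n")
```

Treating a missing or empty block as "keep nothing" is what lets the same format cover both positives and explicit negatives.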

@@ -147,32 +143,73 @@ We compare Squeez-2B against three zero-shot generative baselines and four heuri
| Model | Recall | F1 | Compression |
|---|---:|---:|---:|
-| **Squeez-2B** | **0.8624** | **0.8035** | 0.9150 |
-| Qwen 3.5 35B A3B | 0.7498 | 0.7254 | 0.9177 |
-| Kimi K2 | 0.5286 | 0.6827 | 0.9425 |
-| Qwen 3.5 2B (base) | 0.5299 | 0.5482 | 0.8197 |
-| BM25 (10%) | 0.2172 | 0.2314 | 0.9036 |
-| First-N (10%) | 0.1445 | 0.1570 | 0.9055 |
-| Random (10%) | 0.1009 | 0.1966 | 0.9067 |
-| Last-N (10%) | 0.0503 | 0.1393 | 0.9130 |
+| **Squeez-2B** | **0.86** | **0.80** | 0.92 |
+| Qwen 3.5 35B A3B | 0.75 | 0.73 | 0.92 |
+| Kimi K2 | 0.53 | 0.68 | 0.94 |
+| Qwen 3.5 2B (base) | 0.53 | 0.55 | 0.82 |
+| BM25 (10%) | 0.22 | 0.23 | 0.90 |
+| First-N (10%) | 0.14 | 0.16 | 0.91 |
+| Random (10%) | 0.10 | 0.20 | 0.91 |
+| Last-N (10%) | 0.05 | 0.14 | 0.91 |

-Three results matter most. First, **task-specific training matters**: a fine-tuned 2B model outperforms the 18x larger Qwen 3.5 35B A3B by **11.3 recall points** at almost the same compression level. Second, **heuristics are not sufficient**: BM25 reaches only **0.22 recall**, because lexical overlap is a poor proxy for relevance in stack traces, logs, and mixed-format observations. Third, **aggressive compression alone is not enough**: Kimi K2 removes the largest fraction of tokens, but pays for that compression with a large recall drop.
+Three results matter most. First, **task-specific training matters**: a fine-tuned 2B model outperforms the 18x larger Qwen 3.5 35B A3B by **11 recall points** at almost the same compression level. Second, **heuristics are not sufficient**: BM25 reaches only **0.22 recall**, because lexical overlap is a poor proxy for relevance in stack traces, logs, and mixed-format observations. Third, **aggressive compression alone is not enough**: Kimi K2 removes the largest fraction of tokens, but pays for that compression with a large recall drop.
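For concreteness, here is one plausible line-level reading of the table's three columns. The released evaluation may define them differently (for example over tokens rather than lines); `line_metrics` is an illustrative sketch:

```python
def line_metrics(gold: str, pred: str, source: str) -> dict:
    """Line-level recall/precision/F1 of predicted vs. gold evidence,
    plus compression = fraction of source lines removed."""
    g, p = set(gold.splitlines()), set(pred.splitlines())
    tp = len(g & p)
    recall = tp / len(g) if g else 1.0       # empty gold: only emptiness is correct
    precision = tp / len(p) if p else 1.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    n_src = len(source.splitlines())
    compression = 1 - len(p) / n_src if n_src else 0.0
    return {"recall": recall, "precision": precision,
            "f1": f1, "compression": compression}
```

Under this reading, a pruner that keeps two gold lines plus one extra line out of a ten-line observation scores recall 1.0, precision 2/3, and compression 0.7.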

The recall-compression trade-off is shown below. Squeez-2B occupies the upper-left region: high recall with strong compression.

<p align="center">
  <img src="./assets/squeez_results_chart.svg" alt="Recall vs compression across all models" width="920">
</p>

-The aggregate numbers are only part of the story. Qualitatively, the model appears to learn tool-specific pruning regularities. In `grep` and `git_log`, it tends to return the single relevant hit rather than a broader lexical neighborhood. In `test_output`, `build_output`, and package-manager logs, it keeps the failure block and drops surrounding boilerplate. In `read_file`, it often retains the smallest contiguous code block that answers the query instead of an entire surrounding function or class.
+The aggregate numbers are only part of the story. Below are four qualitative patterns from the held-out test set.
+
+**Precise selection in structured output.** In `grep` and `git_log`, the fine-tuned model learns to return the single relevant hit. Here is a 21-line `git_log` where the task is to find the commit that changed the dimension order of `xr.polyval` output:
+
+```
+fc282d59 re-add timedelta support for polyval (#6599)
+cad4474a Fix polyval overloads (#6593)
+6fbeb131 polyval: Use Horner's algorithm + support chunked inputs (#6548) ← gold
+07de257c Simplify transpose in xr.dot (#5849)
+... 17 more lines ...
+```
+
+| Model | Prediction | Correct? |
+|---|---|---|
+| **Squeez-2B** | `6fbeb131 polyval: Use Horner's algorithm...` | Yes |
+| Qwen 3.5 35B A3B | `07de257c Simplify transpose in xr.dot` | No (wrong commit) |
+| Qwen 3.5 2B (base) | 3 polyval commits (over-selects) | Partial |
+
+Squeez picks the exact commit. Qwen 35B picks a plausible but wrong commit about transpose: right neighborhood, wrong entry.
+
+**Failure-block extraction in logs.** This 176-line service log contains **two** separate TLS handshake failures at different timestamps. The query asks for the health-check failure:
+
+```
+... 40 lines of startup logs ...
+10:00:00.240 [ERROR] TLS handshake failed: certificate verify failed ← gold
+10:00:00.241 [ERROR] node-fetch: request to .../status failed ← gold
+10:00:00.260 [WARN] Health check #1 failed (TLS error) ← gold
+... 80 lines of normal operation ...
+10:00:21.165 [ERROR] TLS handshake failed: certificate verify failed ← wrong block
+10:00:21.166 [ERROR] node-fetch: request to .../pay failed ← wrong block
+... 50 more lines ...
+```
+
+| Model | Prediction | Correct? |
+|---|---|---|
+| **Squeez-2B** | Health-check TLS block (10:00:00) | Yes |
+| Qwen 3.5 35B A3B | Payment TLS block (10:00:21) | No (wrong timestamp) |
+| Kimi K2 | Health-check TLS block (10:00:00) | Partial (3 of 5 lines) |
+
+Qwen 35B selects a semantically similar but wrong block from a later request. This "right pattern, wrong instance" failure is common among zero-shot models on repetitive log output.
+
+**Correct empty predictions.** On negative examples where the tool output does not contain the requested evidence, Squeez correctly returns nothing. In a 316-line `docker_logs` output, the query asks about a numpy version conflict between torch and tensorflow, but no such conflict exists. Squeez returns empty output; Qwen 35B generates "No relevant lines found..." (not verbatim tool output); the 2B base returns unrelated database errors. On the 59 negative examples in the test set, Squeez-2B correctly returns empty 80% of the time. Kimi K2 matches this (81%), likely because its aggressive compression tends toward empty output. Qwen 35B returns empty only 7% of the time.

-The following `kubectl` example illustrates the intended use case. The full observation contains 250 lines of pod description; the relevant evidence is a two-line block reporting `OOMKilled` and the exit code.
+**The kubectl example** illustrates the intended use case at a glance. The full observation contains 250 lines of pod description; the relevant evidence is a two-line block reporting `OOMKilled` and the exit code.

<p align="center">
  <img src="./assets/squeez_qualitative_example.svg" alt="kubectl example: 2 relevant lines from 250" width="920">
</p>

-The strongest remaining failures are usually semantically adjacent but incorrect selections: choosing the wrong file from an `ls` listing, or returning a related commit that touches the same module without directly answering the query.
+**Remaining errors.** The strongest failures of Squeez-2B are semantically adjacent but incorrect selections. In a build log containing both a Dockerfile syntax error and a Python `SyntaxError`, Squeez correctly finds the Dockerfile error but also includes the nearby Python error. Qwen 35B picks *only* the Python error and misses the Dockerfile error entirely. This pattern, correct evidence plus some extra noise, accounts for most of the gap between Squeez's 0.86 recall and its 0.80 F1.

## Using Squeez

@@ -203,16 +240,15 @@ Examples:
The same pattern works with Codex and other agent setups that accept system-level instructions or shell wrappers.

-## Takeaway
+## Closing Remarks

-One recurring bottleneck in coding agents is deciding what to keep from a single tool observation. Our results suggest this is learnable, practically useful, and not handled well by simple heuristics or larger zero-shot models alone. Squeez is our attempt at a focused solution: a narrow model for a narrow problem.
+One recurring bottleneck in coding agents is deciding what to keep from a single tool observation. Our results suggest that this bottleneck is both measurable and learnable: mixed-format tool output is not handled well by simple heuristics or larger zero-shot models alone, but it responds well to narrow supervision. That is the main claim behind Squeez. It is a small model for a small problem, but the problem turns out to matter.

## Resources

-- **Model:** [KRLabsOrg/squeez-2b](https://huggingface.co/KRLabsOrg/squeez-2b)
-- **Dataset:** [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench)
-- **Code:** [github.com/KRLabsOrg/squeez](https://github.com/KRLabsOrg/squeez)
-- **Paper:** [arXiv (coming soon)]()
+- **Model:** [KRLabsOrg/squeez-2b](https://huggingface.co/KRLabsOrg/squeez-2b) (Apache 2.0)
+- **Dataset:** [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) (Apache 2.0)
+- **Code & CLI:** [github.com/KRLabsOrg/squeez](https://github.com/KRLabsOrg/squeez) (Apache 2.0)

## References

@@ -221,9 +257,10 @@ One recurring bottleneck in coding agents is deciding what to keep from a single
- Jiang, H., et al. (2023). *LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models*. [EMNLP](https://aclanthology.org/2023.emnlp-main.825/)
- Jiang, H., et al. (2024). *LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression*. [ACL](https://aclanthology.org/2024.acl-long.91/)
- Hwang, T., et al. (2025). *EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation*. [Findings of ACL](https://aclanthology.org/2025.findings-acl.253/)
-- Chirkova, N., et al. (2025). *Provence: Efficient and Robust Context Pruning for Retrieval-Augmented Generation*. [arXiv](https://arxiv.org/abs/2501.16214)
+- Chirkova, N., et al. (2025). *Provence: Efficient and Robust Context Pruning for Retrieval-Augmented Generation*. [ICLR](https://arxiv.org/abs/2501.16214)
- Zilliz. (2025). *Semantic Highlight Bilingual v1*. [Model card](https://huggingface.co/zilliz/semantic-highlight-bilingual-v1)
- Kerboua, I., et al. (2025). *FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents*. [arXiv](https://arxiv.org/abs/2510.03204)
- Wang, Y., et al. (2026). *SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents*. [arXiv](https://arxiv.org/abs/2601.16746)
+- Kovacs, A., Schmitt, P., Recski, G. (2025). *KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering*. [BioNLP Workshop](https://aclanthology.org/2025.bionlp-share.8/)
- Jimenez, C. E., et al. (2024). *SWE-bench: Can Language Models Resolve Real-World GitHub Issues?* [ICLR](https://openreview.net/forum?id=VTF8yNQM66)
-- Yang, A., et al. (2025). *Qwen3 Technical Report*. [arXiv](https://arxiv.org/abs/2505.09388)
+- Qwen Team. (2026). *Qwen3.5: Towards Native Multimodal Agents*. [Blog post](https://qwen.ai/blog?id=qwen3.5)
