We trained and open-sourced **Squeez-2B**, a compact model for pruning tool output in coding agents. Given a focused query and one raw tool observation, it returns the smallest verbatim evidence block that the agent should inspect next. On our held-out benchmark it reaches **0.86 recall at 92% compression**, outperforming a zero-shot **Qwen 3.5 35B A3B** baseline by **11 recall points** at essentially the same compression level. The model, dataset, and code are released on [Hugging Face](https://huggingface.co/KRLabsOrg/squeez-2b), [the dataset hub](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench), and [GitHub](https://github.com/KRLabsOrg/squeez).
This post explains the problem, describes how we built the benchmark, and shows why narrow supervision works better here than larger zero-shot models or simple retrieval heuristics.
## The Problem
The overall pipeline is shown below:

<p align="center">
<img src="./assets/squeez_overview.svg" alt="Squeez pipeline: from raw tool output through span annotation to generative model" width="920">
</p>
The benchmark is built from two sources. The first is [SWE-bench](https://openreview.net/forum?id=VTF8yNQM66), which provides real GitHub issue-resolution tasks over real repositories. We clone repository snapshots and execute 14 tool types against them — file reads, grep, Git log and blame, test runners, linters, type checkers, package installation, curl, and others — collecting **10,713** raw observations that reflect the kind of output a coding agent encounters during issue resolution.
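To make the collection step concrete, here is a minimal sketch of how one raw observation could be captured from a cloned snapshot. The helper name, the repository path, and the specific invocations are illustrative assumptions, not the exact harness; the real pipeline covers all 14 tool types and stores the output unmodified.

```python
import subprocess
from pathlib import Path

def capture_observation(repo_dir: Path, argv: list[str], timeout: int = 300) -> str:
    """Run one tool inside a repository snapshot and return its raw output."""
    result = subprocess.run(argv, cwd=repo_dir, capture_output=True, text=True, timeout=timeout)
    # Keep stdout and stderr together: tracebacks, linter messages, and test
    # failures often arrive on stderr, and the model is trained on the raw text.
    return result.stdout + result.stderr

# Illustrative invocations against a hypothetical local snapshot.
repo = Path("repos/pydata__xarray")
observations = {
    "grep": capture_observation(repo, ["grep", "-rn", "polyval", "xarray/core"]),
    "git_log": capture_observation(repo, ["git", "log", "--oneline", "-n", "25"]),
    "test_output": capture_observation(repo, ["python", "-m", "pytest", "-x", "-q"]),
}
```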
The second source is synthetic multi-ecosystem tool output, which extends coverage beyond SWE-bench's Python-heavy distribution. We use `openai/gpt-oss-120b` to generate **2,039** realistic tool outputs for representative tasks in TypeScript, Go, Rust, Java, Docker, Terraform, and Kubernetes workflows. We also construct explicit negatives by pairing mismatched queries and tool outputs, where the correct pruning decision is to return nothing.
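The negative construction can be pictured as a mismatched pairing step. The sketch below is illustrative: it assumes each synthetic example carries `query`, `tool_output`, and `tool` fields, which is our naming here rather than the released schema.

```python
import random

def make_negatives(examples: list[dict], n: int, seed: int = 0) -> list[dict]:
    """Pair a query with the tool output of a *different* example, so the
    correct extraction target is empty."""
    rng = random.Random(seed)
    negatives = []
    while len(negatives) < n:
        a, b = rng.sample(examples, 2)
        # Same-tool outputs could accidentally still satisfy the foreign query,
        # so only cross-tool pairs are kept in this sketch.
        if a["tool"] == b["tool"]:
            continue
        negatives.append({"query": a["query"], "tool_output": b["tool_output"], "target": ""})
    return negatives
```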
The executed SWE-derived subset covers 14 tool types; the full released benchmark reaches 27 tool families once the synthetic multi-ecosystem portion is added.
Each released positive example is labeled with the same two-stage teacher pipeline, again using `openai/gpt-oss-120b`, regardless of whether it comes from SWE-bench or from the synthetic portion. First, the teacher writes a focused extraction query for one observation. Second, it selects the smallest contiguous span, or small set of spans, that answers that query. The teacher sees a numbered rendering of the output for stable span selection, but the released labels are always mapped back onto the original raw text. Positive examples whose query cannot be supported by the observation are dropped rather than retained as accidental empty outputs. Explicit negatives are created separately in the synthetic portion, where the correct target is an empty extraction.
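The numbered-rendering step and the mapping back to raw text can be sketched in a few lines. This is a minimal illustration under the assumption that the teacher returns 1-based `(start, end)` line ranges; the function names are ours.

```python
def numbered_rendering(raw: str) -> str:
    """Show the observation to the teacher with stable 1-based line numbers."""
    return "\n".join(f"{i:>5} | {line}" for i, line in enumerate(raw.splitlines(), start=1))

def spans_to_verbatim(raw: str, spans: list[tuple[int, int]]) -> str:
    """Map teacher-selected line ranges back onto the original raw text.

    The released label is always verbatim source text, never the numbered view.
    """
    lines = raw.splitlines()
    blocks = ["\n".join(lines[start - 1:end]) for start, end in sorted(spans)]
    return "\n\n".join(blocks)

raw = "collected 3 items\n\ntest_polyval.py::test_order FAILED\nE   AssertionError: dims differ"
print(spans_to_verbatim(raw, [(3, 4)]))  # keeps only the failing test lines
```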
The held-out set was also manually curated. Starting from **729** candidate test examples, we removed **111** cases (15.2%) that were near-duplicates, trivial 1–2 line outputs, overly broad spans, or incorrect annotations. The final test set contains **618** manually reviewed examples.
The released benchmark contains **11,477** examples in total: **9,205** SWE-derived examples, **1,697** synthetic positives, and **575** synthetic negatives. SWE-derived examples are split by repository and synthetic examples by tool family.
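What the split protocol means in practice: entire repositories (or, for the synthetic portion, entire tool families) are held out, so no group leaks between train and test. A minimal sketch, with placeholder group names rather than the released split assignment:

```python
from collections import defaultdict

def grouped_split(examples: list[dict], key: str, test_groups: set[str]):
    """Hold out whole groups so that no repository or tool family appears in
    both train and test."""
    by_group = defaultdict(list)
    for ex in examples:
        by_group[ex[key]].append(ex)
    train, test = [], []
    for group, items in by_group.items():
        (test if group in test_groups else train).extend(items)
    return train, test

examples = [
    {"repo": "pydata/xarray", "query": "...", "tool_output": "..."},
    {"repo": "django/django", "query": "...", "tool_output": "..."},
]
train, test = grouped_split(examples, key="repo", test_groups={"django/django"})
```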
The benchmark covers **27** tool types. The largest families are shown below.
| `pip_install` | 441 | 438 | 79 |
| `type_check` | 317 | 3418 | 39 |
| `git_blame` | 291 | 4210 | 139 |
| remaining tools | 2873 | 688 | 47 |
The distribution is intentionally heterogeneous. `python` and `test_output` rows are short; `read_file`, `type_check`, and `git_blame` can be extremely long. This matters because the useful evidence does not follow one structural pattern, and it may occur at the beginning, middle, or end of the observation. That is also why simple truncation and lexical retrieval remain weak baselines here.
## Training a Small Model for a Narrow Task
We chose **Qwen 3.5 2B** as the base model ([Qwen3.5 blog post](https://qwen.ai/blog?id=qwen3.5)). The goal here is not to maximize zero-shot reasoning with the largest possible decoder. It is to learn a narrow supervised extraction policy that can run cheaply inside an agent loop. A 2B model is large enough to benefit from supervision, but still small enough to be practical for local serving and repeated tool use.
We fine-tuned the model with **LoRA** ([Hu et al., 2022](https://openreview.net/forum?id=nZeVKeeFYf9); [Dettmers et al., 2023](https://proceedings.neurips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html)) using the **Unsloth** stack. The model receives a focused extraction query and the raw tool observation, and is trained to emit the extracted evidence wrapped in `<relevant_lines>` tags. In other words, the supervision target is not a classification label and not a summary. It is the exact evidence block the model should keep.
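A rough sketch of what one supervision pair might look like under this setup. The prompt wording and field names are assumptions for illustration; the released training code defines the exact template.

```python
def build_training_pair(query: str, observation: str, evidence: str) -> dict:
    """Assemble one supervised example: the input carries the focused query and
    the raw tool observation, and the target is the verbatim evidence block
    wrapped in <relevant_lines> tags (empty tags when nothing should be kept)."""
    prompt = (
        "Extract the lines from the tool output that answer the query. "
        "Return them verbatim inside <relevant_lines> tags.\n\n"
        f"Query: {query}\n\nTool output:\n{observation}"
    )
    target = (
        f"<relevant_lines>\n{evidence}\n</relevant_lines>"
        if evidence
        else "<relevant_lines></relevant_lines>"
    )
    return {"prompt": prompt, "completion": target}
```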
We compare Squeez-2B against three zero-shot generative baselines and four heuristic baselines.
| Model | Recall | F1 | Compression |
|---|---:|---:|---:|
| **Squeez-2B** | **0.86** | **0.80** | 0.92 |
| Qwen 3.5 35B A3B | 0.75 | 0.73 | 0.92 |
| Kimi K2 | 0.53 | 0.68 | 0.94 |
| Qwen 3.5 2B (base) | 0.53 | 0.55 | 0.82 |
| BM25 (10%) | 0.22 | 0.23 | 0.90 |
| First-N (10%) | 0.14 | 0.16 | 0.91 |
| Random (10%) | 0.10 | 0.20 | 0.91 |
| Last-N (10%) | 0.05 | 0.14 | 0.91 |
Three results matter most. First, **task-specific training matters**: a fine-tuned 2B model outperforms the 18x larger Qwen 3.5 35B A3B by **11 recall points** at almost the same compression level. Second, **heuristics are not sufficient**: BM25 reaches only **0.22 recall**, because lexical overlap is a poor proxy for relevance in stack traces, logs, and mixed-format observations. Third, **aggressive compression alone is not enough**: Kimi K2 removes the largest fraction of tokens, but pays for that compression with a large recall drop.
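For readers who want to reproduce the numbers, one plausible way to score a prediction is sketched below, assuming line-level matching against the gold evidence block and compression measured over characters of the original observation; the benchmark's exact definitions may differ in detail.

```python
def score(predicted: str, gold: str, full_output: str) -> dict:
    """Line-level recall/precision/F1 against the gold evidence, plus compression
    as the fraction of the original observation that was discarded."""
    pred_lines, gold_lines = set(predicted.splitlines()), set(gold.splitlines())
    hit = len(pred_lines & gold_lines)
    recall = hit / len(gold_lines) if gold_lines else float(not pred_lines)
    precision = hit / len(pred_lines) if pred_lines else float(not gold_lines)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    compression = 1.0 - len(predicted) / max(len(full_output), 1)
    return {"recall": recall, "precision": precision, "f1": f1, "compression": compression}
```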
The recall-compression trade-off is shown below. Squeez-2B occupies the upper-left region: high recall with strong compression.
<p align="center">
<img src="./assets/squeez_results_chart.svg" alt="Recall vs compression across all models" width="920">
</p>
The aggregate numbers are only part of the story. Below are four qualitative patterns from the held-out test set.
**Precise selection in structured output.** In `grep` and `git_log`, the fine-tuned model learns to return the single relevant hit. Here is a 21-line `git_log` where the task is to find the commit that changed the dimension order of `xr.polyval` output:
```
fc282d59 re-add timedelta support for polyval (#6599)
cad4474a Fix polyval overloads (#6593)
6fbeb131 polyval: Use Horner's algorithm + support chunked inputs (#6548) ← gold
07de257c Simplify transpose in xr.dot (#5849)
... 17 more lines ...
```
| Model | Prediction | Correct? |
|---|---|---|
| **Squeez-2B** | `6fbeb131 polyval: Use Horner's algorithm...` | Yes |
| Qwen 3.5 35B A3B | `07de257c Simplify transpose in xr.dot` | No (wrong commit) |
Squeez picks the exact commit. Qwen 35B picks a plausible but wrong commit about transpose — right neighborhood, wrong entry.
**Failure-block extraction in logs.** This 176-line service log contains **two** separate TLS handshake failures at different timestamps. The query asks for the health-check failure:
| Model | Prediction | Correct? |
|---|---|---|
| Kimi K2 | Health-check TLS block (10:00:00) | Partial (3 of 5 lines) |
Qwen 35B selects a semantically similar but wrong block from a later request. This "right pattern, wrong instance" failure is common among zero-shot models on repetitive log output.
**Correct empty predictions.** On negative examples where the tool output does not contain the requested evidence, Squeez correctly returns nothing. In a 316-line `docker_logs` output, the query asks about a numpy version conflict between torch and tensorflow — but no such conflict exists. Squeez returns empty output; Qwen 35B generates "No relevant lines found..." (not verbatim tool output); the 2B base returns unrelated database errors. On the 59 negative examples in the test set, Squeez-2B correctly returns empty 80% of the time. Kimi K2 matches this (81%), likely because its aggressive compression tends toward empty output. Qwen 35B returns empty only 7% of the time.
**The kubectl example** illustrates the intended use case at a glance. The full observation contains 250 lines of pod description; the relevant evidence is a two-line block reporting `OOMKilled` and the exit code.
<p align="center">
<img src="./assets/squeez_qualitative_example.svg" alt="kubectl example: 2 relevant lines from 250" width="920">
</p>
**Remaining errors.** The strongest failures of Squeez-2B are semantically adjacent but incorrect selections. In a build log containing both a Dockerfile syntax error and a Python `SyntaxError`, Squeez correctly finds the Dockerfile error but also includes the nearby Python error. Qwen 35B picks *only* the Python error and misses the Dockerfile error entirely. This pattern — correct evidence plus some extra noise — accounts for most of the gap between Squeez's 0.86 recall and its 0.80 F1.
## Using Squeez
Examples:
The same pattern works with Codex and other agent setups that accept system-level instructions or shell wrappers.
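For programmatic use outside an agent harness, a minimal inference sketch with `transformers` might look like the following. The chat formatting, generation settings, and tag parsing here are assumptions; the repository's CLI is the documented interface.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "KRLabsOrg/squeez-2b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def squeeze(query: str, observation: str, max_new_tokens: int = 512) -> str:
    """Return the verbatim evidence block for one (query, tool output) pair."""
    messages = [{"role": "user", "content": f"Query: {query}\n\nTool output:\n{observation}"}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)
    text = tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
    # The model emits the evidence wrapped in <relevant_lines> tags; an empty
    # pair of tags means nothing in the observation is worth keeping.
    start, end = text.find("<relevant_lines>"), text.find("</relevant_lines>")
    if start == -1 or end == -1:
        return ""
    return text[start + len("<relevant_lines>"):end].strip()
```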
## Closing Remarks
One recurring bottleneck in coding agents is deciding what to keep from a single tool observation. Our results suggest that this bottleneck is both measurable and learnable: mixed-format tool output is not handled well by simple heuristics or larger zero-shot models alone, but it responds well to narrow supervision. That is the main claim behind Squeez. It is a small model for a small problem, but the problem turns out to matter.
## References
- Jiang, H., et al. (2023). *LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models*. [EMNLP](https://aclanthology.org/2023.emnlp-main.825/)
- Jiang, H., et al. (2024). *LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression*. [ACL](https://aclanthology.org/2024.acl-long.91/)
- Hwang, T., et al. (2025). *EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation*. [Findings of ACL](https://aclanthology.org/2025.findings-acl.253/)
- Chirkova, N., et al. (2025). *Provence: Efficient and Robust Context Pruning for Retrieval-Augmented Generation*. [ICLR](https://arxiv.org/abs/2501.16214)
- Kerboua, I., et al. (2025). *FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents*. [arXiv](https://arxiv.org/abs/2510.03204)
- Wang, Y., et al. (2026). *SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents*. [arXiv](https://arxiv.org/abs/2601.16746)
- Kovacs, A., Schmitt, P., Recski, G. (2025). *KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering*. [BioNLP Workshop](https://aclanthology.org/2025.bionlp-share.8/)
- Jimenez, C. E., et al. (2024). *SWE-bench: Can Language Models Resolve Real-World GitHub Issues?* [ICLR](https://openreview.net/forum?id=VTF8yNQM66)
- Qwen Team. (2026). *Qwen3.5: Towards Native Multimodal Agents*. [Blog post](https://qwen.ai/blog?id=qwen3.5)