
Commit 5bf0b86: Better readme
1 parent: 5c95128

1 file changed: README.md (11 additions, 11 deletions)
````diff
@@ -12,10 +12,10 @@
 
 - Tool output pruner for LLM coding agents
 - Pipe any tool output (pytest, grep, git log, npm build, kubectl, ...) through squeez with a task description, get back only the relevant lines
-- Fine-tuned Qwen 3.5 2B, 0.79 F1, ~91% compression
+- Fine-tuned Qwen 3.5 2B, 0.80 F1, 92% compression
 - CLI pipe, Python library, or vLLM server
 
-Existing context pruning tools ([SWE-Pruner](https://github.com/Ayanami1314/swe-pruner), [Zilliz Semantic Highlight](https://huggingface.co/zilliz/semantic-highlight-bilingual-v1), [Provence](https://arxiv.org/abs/2501.16214)) are built for source code or document paragraphs. They don't handle the mixed, unstructured format of tool output (stack traces interleaved with passing tests, grep matches with context lines, build logs with timestamps). Squeez is trained specifically on 14 types of tool output from real SWE-bench workflows.
+Existing context pruning tools ([SWE-Pruner](https://github.com/Ayanami1314/swe-pruner), [Zilliz Semantic Highlight](https://huggingface.co/zilliz/semantic-highlight-bilingual-v1), [Provence](https://arxiv.org/abs/2501.16214)) are built for source code or document paragraphs. They don't handle the mixed, unstructured format of tool output (stack traces interleaved with passing tests, grep matches with context lines, build logs with timestamps). Squeez is trained on 27 types of tool output from real SWE-bench workflows and synthetic multi-ecosystem observations.
 
 ```bash
 pip install squeez
````
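The pipe contract described in the README (raw tool output on stdin, a task description as the argument, pruned lines on stdout) can be sketched with a trivial keyword filter standing in for the fine-tuned model. This is only an illustration of the interface; `prune` is a hypothetical name, not squeez's actual API:

```python
import sys


def prune(tool_output: str, task: str) -> str:
    """Keep lines sharing a keyword with the task (stand-in for the model)."""
    keywords = set(task.lower().split())
    kept = [line for line in tool_output.splitlines()
            if keywords & set(line.lower().split())]
    return "\n".join(kept)


if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g.  pytest 2>&1 | python prune.py "why did the login test fail"
    sys.stdout.write(prune(sys.stdin.read(), sys.argv[1]))
```

The real extractor scores relevance with the model rather than keyword overlap, but the stdin/argument/stdout shape is the same.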
````diff
@@ -124,18 +124,18 @@ $ kubectl describe pod api-server-7d4b | squeez "why is the pod failing"
 
 ## Results
 
-Evaluated on 617 held-out test samples from SWE-bench, across 14 tool types:
+Evaluated on 618 manually curated held-out examples spanning 27 tool types:
 
 | Model | Precision | Recall | F1 | Compression |
 |-------|-----------|--------|------|-------------|
-| **Squeez-2B** | **0.8043** | **0.8624** | **0.7895** | 0.9150 |
-| Qwen 3.5 35B A3B (zero-shot) | 0.7402 | 0.7498 | 0.7000 | 0.9177 |
-| Kimi K2 (zero-shot) | 0.6128 | 0.5286 | 0.5344 | 0.9425 |
-| Qwen 3.5 2B (untrained) | 0.4154 | 0.5299 | 0.4075 | 0.8197 |
-| BM25 (10%) | 0.1277 | 0.2172 | 0.1314 | 0.9036 |
-| Random (10%) | 0.0738 | 0.1009 | 0.0697 | 0.9067 |
+| **Squeez-2B** | **0.80** | **0.86** | **0.80** | 0.92 |
+| Qwen 3.5 35B A3B (zero-shot) | 0.74 | 0.75 | 0.73 | 0.92 |
+| Kimi K2 (zero-shot) | 0.61 | 0.53 | 0.53 | 0.94 |
+| Qwen 3.5 2B (untrained) | 0.42 | 0.53 | 0.41 | 0.82 |
+| BM25 (10%) | 0.13 | 0.22 | 0.13 | 0.90 |
+| Random (10%) | 0.07 | 0.10 | 0.07 | 0.91 |
 
-Squeez-2B (2B params) outperforms a 35B MoE model at zero-shot and is 6x better than BM25 on Span F1.
+Squeez-2B (2B params) outperforms the 18x larger Qwen 3.5 35B A3B by 11 recall points at the same compression level.
 
 ## Quick start
 
````
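The span precision/recall/F1 and compression figures in the table above can be sketched at line granularity as below. The repo's actual matching granularity (tokens vs. lines) and its averaging scheme across samples are assumptions; this only shows the shape of the metrics:

```python
def line_metrics(gold_lines, pred_lines, total_lines):
    """Span precision/recall/F1 plus compression, with lines as the unit."""
    gold, pred = set(gold_lines), set(pred_lines)
    tp = len(gold & pred)                      # correctly kept lines
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    compression = 1.0 - len(pred) / total_lines  # fraction of output discarded
    return precision, recall, f1, compression
```

For example, keeping lines {3, 4} of a 50-line output when the gold spans cover {3, 4, 5} gives precision 1.0, recall 2/3, F1 0.8, compression 0.96. Note that F1 can never exceed both precision and recall, which is a useful sanity check on reported numbers.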
````diff
@@ -305,7 +305,7 @@ squeez eval --extractor-model output/squeez_qwen --eval-file data/test.jsonl
 
 Training data: [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench)
 
-Built from SWE-bench repositories. Each sample has:
+Built from SWE-bench repositories and synthetic multi-ecosystem tool outputs. Each sample has:
 - `query`: a focused extraction request or agent subgoal
 - `tool_output`: raw tool output as seen by the agent
 - `gold_spans`: contiguous spans over the raw output
````
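A record in this schema might look like the following sketch. The sample content is invented for illustration, and the exact span encoding (character offsets vs. line ranges) is an assumption:

```python
import json

# Hypothetical JSONL record in the query / tool_output / gold_spans schema.
# Assumes gold_spans are [start, end) character offsets into tool_output.
sample = {
    "query": "why did test_login fail",
    "tool_output": "test_home PASSED\ntest_login FAILED\nAssertionError: 401 != 200\n",
    "gold_spans": [[17, 62]],
}
line = json.dumps(sample)  # one line of the .jsonl file

# Recover the gold span from the raw output.
start, end = sample["gold_spans"][0]
extracted = sample["tool_output"][start:end]
```

Here the gold span keeps the failing test and its assertion error while dropping the passing test, which is exactly the pruning behavior the model is trained to reproduce.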
