Squeeze verbose LLM agent tool output down to only the relevant lines.

LLM coding agents waste **80-95% of context tokens** on irrelevant tool output. When an agent reads a 500-line file to find one function, or runs `git log` to find a specific commit, most of the output is noise.

Squeez trains small models to identify and extract only the lines that matter for the task at hand, compressing tool output by ~86% on average.
The model returns JSON: `{"relevant_lines": ["line1", "line2", ...]}` and the `extract()` method joins them into filtered text.

Both model types use the same `extract()` API. The generative model returns JSON (`{"relevant_lines": [...]}`); the encoder classifies each line directly. Both return filtered text.
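
A minimal usage sketch of that shared API, written against an assumed interface: the `Extractor` class name, its constructor arguments, and the `extract()` signature below are guesses based on the description above, not the project's documented API.

```python
# Hypothetical usage; only the extract() method name comes from the README.
from squeez import Extractor  # assumed entry point

extractor = Extractor(model="output/squeez_qwen")  # assumed constructor

tool_output = open("git_log.txt").read()           # verbose tool output
task = "Find the commit that introduced the bug"   # what the agent needs

# The model proposes {"relevant_lines": [...]}; extract() joins those lines
# into compressed, filtered text.
filtered = extractor.extract(task=task, tool_output=tool_output)
print(filtered)
```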
### Configuration
Backend is resolved in order: CLI args > env vars > config file (`squeez.yaml` or `configs/default.yaml`).
```yaml
# squeez.yaml
backend: null  # auto-detect from model; or "transformers", "vllm", "encoder"
```
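
The precedence above could be sketched like this; `resolve_backend` and its arguments are illustrative, not the project's actual code.

```python
import os
import yaml  # PyYAML

def resolve_backend(cli_backend=None, config_path="squeez.yaml"):
    """Illustrative resolution order: CLI args > env vars > config file."""
    if cli_backend:                               # 1. explicit CLI argument wins
        return cli_backend
    env_backend = os.environ.get("SQUEEZ_BACKEND")
    if env_backend:                               # 2. environment variable
        return env_backend
    try:
        with open(config_path) as f:              # 3. squeez.yaml / configs/default.yaml
            cfg = yaml.safe_load(f) or {}
        return cfg.get("backend")                 # may be null -> auto-detect from model
    except FileNotFoundError:
        return None
```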
This pulls the [SWE-bench tool output dataset](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) (7,148 train + 436 eval samples) from HuggingFace.
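
To inspect the data directly, the same dataset can also be loaded straight from the Hub with the `datasets` library; the split names are not confirmed by this excerpt, so check the dataset card.

```python
from datasets import load_dataset

# Pull the SWE-bench tool-output extraction dataset from the HuggingFace Hub.
ds = load_dataset("KRLabsOrg/tool-output-extraction-swebench")
print(ds)  # expect roughly 7,148 train and 436 eval samples
```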
### 2a. Train generative model (Qwen + LoRA)
```bash
squeez train \
  --train-file data/train.jsonl \
  --eval-file data/dev.jsonl
```
Default: Qwen 3.5 2B with LoRA (r=16, alpha=32). See `configs/default.yaml` for all hyperparameters.
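
In PEFT terms, the defaults above correspond roughly to the configuration below; only r=16 and alpha=32 come from the README, while the target modules, dropout, and task type are assumptions (the authoritative values live in `configs/default.yaml`).

```python
from peft import LoraConfig

# Sketch of the LoRA setup implied by the stated defaults (r=16, alpha=32).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,                                         # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumed
    task_type="CAUSAL_LM",
)
```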
### 2b. Train encoder model (mmBERT)
```bash
# Prepare encoder-format data from the ChatML training data
python scripts/prepare_encoder_data.py

# Train the encoder
python -m squeez.encoder.train \
  --train-file data/encoder_train.jsonl \
  --eval-file data/encoder_dev.jsonl \
  --base-model jhu-clsp/mmBERT-base \
  --output-dir output/squeez_encoder
```
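
The prep script converts the ChatML training data into a line-labelled format; a hypothetical record is shown below, and the field names are assumptions rather than the script's actual output schema.

```python
# Hypothetical encoder_train.jsonl record; field names are assumptions.
example = {
    "task": "Find where the config file is parsed",
    "lines": [
        "def load_config(path):",
        "    return yaml.safe_load(open(path))",
        "README badge updated",
    ],
    "labels": [1, 1, 0],  # one label per line: 1 = relevant, 0 = irrelevant
}
```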
The encoder is a 307M parameter mmBERT with a token classification head. It classifies each line as relevant/irrelevant and uses sliding windows to handle outputs longer than the 8K context.
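
As a rough sketch, and assuming the checkpoint loads with the standard `transformers` token-classification head, the setup looks something like the following; the window and stride sizes are illustrative assumptions.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Two labels: irrelevant vs. relevant.
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
model = AutoModelForTokenClassification.from_pretrained(
    "jhu-clsp/mmBERT-base", num_labels=2
)

def sliding_windows(token_ids, window=8192, stride=4096):
    """Assumed chunking: overlapping 8K-token windows over long tool output."""
    start = 0
    while True:
        yield token_ids[start:start + window]
        if start + window >= len(token_ids):
            break
        start += stride
```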
### 3. Evaluate
```bash
# Generative model
squeez eval \
  --extractor-model output/squeez_qwen \
  --eval-file data/test.jsonl

# Encoder model
python -m squeez.encoder.evaluate \
  --model-path output/squeez_encoder \
  --eval-file data/encoder_test.jsonl
```
Both produce the same metrics format (span F1, ROUGE-L, compression ratio) for direct comparison.
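
As a point of reference for the compression metric, a toy definition is sketched below; the eval scripts' exact formulas for span F1, ROUGE-L, and compression are not spelled out in this excerpt and may differ.

```python
def compression_ratio(original: str, filtered: str) -> float:
    """Toy line-count definition, for illustration only."""
    kept = len(filtered.splitlines())
    total = max(len(original.splitlines()), 1)
    return 1 - kept / total

# e.g. a 500-line tool output squeezed to 70 lines -> 0.86 compression
```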

The two extraction modes differ in how they produce the filtered text:

- **Generative** (default): The model returns a JSON object `{"relevant_lines": [...]}` and `extract()` joins them into text.
- **Encoder**: A token classifier labels each line as relevant/irrelevant. Uses sliding windows for outputs longer than the context window.

The backend is auto-detected from the model's `config.json`. You can also set it explicitly via `SQUEEZ_BACKEND=encoder` or `backend: "encoder"` in config.
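
What that auto-detection might look like is sketched below; the `architectures` heuristic is an assumption, not the project's actual rule.

```python
import json
from pathlib import Path

def detect_backend(model_dir: str) -> str:
    """Illustrative backend guess from a model's config.json."""
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    archs = cfg.get("architectures", [])
    if any("ForTokenClassification" in a for a in archs):
        return "encoder"       # mmBERT-style line classifier
    return "transformers"      # causal LM -> generative backend
```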

## Dataset

Training data: [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench)