Commit 14cd05e

committed
Added encoder
1 parent 91f6e05 commit 14cd05e

20 files changed

Lines changed: 1526 additions & 47 deletions

README.md

Lines changed: 46 additions & 8 deletions
@@ -15,7 +15,12 @@ Squeeze verbose LLM agent tool output down to only the relevant lines.
 
 LLM coding agents waste **80-95% of context tokens** on irrelevant tool output. When an agent reads a 500-line file to find one function, or runs `git log` to find a specific commit, most of the output is noise.
 
-Squeez trains a small (2-3B) generative model to identify and extract only the lines that matter for the task at hand — compressing tool output by ~86% on average.
+Squeez trains small models to identify and extract only the lines that matter for the task at hand — compressing tool output by ~86% on average.
+
+Two approaches are available:
+
+- **Generative** (Qwen 3.5 2B + LoRA) — high-quality extraction via JSON generation
+- **Encoder** (mmBERT 307M) — fast line-level binary classification, sliding window over long outputs
 
 ## Example
 
@@ -133,12 +138,18 @@ $ git log --oneline -25 | squeez "find the commit that changed the authenticatio
 pip install squeez
 ```
 
-For local model training, use the pinned stack in [requirements-train.txt](/Users/adamkovacs/projects/squeez/requirements-train.txt):
+For generative model training (Qwen + LoRA):
 
 ```bash
 pip install -r requirements-train.txt
 ```
 
+For encoder model training (mmBERT):
+
+```bash
+pip install -r requirements-encoder.txt
+```
+
 ## Quick Start
 
 ### CLI
@@ -162,9 +173,12 @@ from squeez.inference.extractor import ToolOutputExtractor
 # Load model from config/env
 extractor = ToolOutputExtractor()
 
-# Or load model locally
+# Or load a generative model locally
 extractor = ToolOutputExtractor(model_path="./output/squeez_qwen")
 
+# Or load an encoder model (auto-detected from config.json)
+extractor = ToolOutputExtractor(model_path="./output/squeez_encoder")
+
 # Or connect to a server explicitly
 extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1", model_name="squeez")
 
@@ -175,15 +189,15 @@ filtered = extractor.extract(
 print(filtered) # Only the relevant lines
 ```
 
-The model returns JSON: `{"relevant_lines": ["line1", "line2", ...]}` and the `extract()` method joins them into filtered text.
+Both model types use the same `extract()` API. The generative model returns JSON (`{"relevant_lines": [...]}`); the encoder classifies each line directly. Both return filtered text.
 
 ### Configuration
 
 Backend is resolved in order: CLI args > env vars > config file (`squeez.yaml` or `configs/default.yaml`).
 
 ```yaml
 # squeez.yaml
-backend: "transformers" # optional preference
+backend: null # auto-detect from model; or "transformers", "vllm", "encoder"
 local_model_path: "./output/squeez_qwen"
 # server_url: "https://api.groq.com/openai/v1"
 # server_model: "squeez"
@@ -235,24 +249,48 @@ python scripts/download_data.py
 
 This pulls the [SWE-bench tool output dataset](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) (7,148 train + 436 eval samples) from HuggingFace.
 
-### 2. Train with LoRA
+### 2a. Train generative model (Qwen + LoRA)
 
 ```bash
 squeez train \
   --train-file data/train.jsonl \
-  --eval-file data/eval.jsonl
+  --eval-file data/dev.jsonl
 ```
 
 Default: Qwen 3.5 2B with LoRA (r=16, alpha=32). See `configs/default.yaml` for all hyperparameters.
 
+### 2b. Train encoder model (mmBERT)
+
+```bash
+# Prepare encoder-format data from the ChatML training data
+python scripts/prepare_encoder_data.py
+
+# Train the encoder
+python -m squeez.encoder.train \
+  --train-file data/encoder_train.jsonl \
+  --eval-file data/encoder_dev.jsonl \
+  --base-model jhu-clsp/mmBERT-base \
+  --output-dir output/squeez_encoder
+```
+
+The encoder is a 307M-parameter mmBERT with a token classification head. It classifies each line as relevant/irrelevant and uses sliding windows to handle outputs longer than the 8K context.
+
 ### 3. Evaluate
 
 ```bash
+# Generative model
 squeez eval \
   --extractor-model output/squeez_qwen \
-  --eval-file data/eval.jsonl
+  --eval-file data/test.jsonl
+
+# Encoder model
+python -m squeez.encoder.evaluate \
+  --model-path output/squeez_encoder \
+  --eval-file data/encoder_test.jsonl
 ```
 
+Both produce the same metrics format (span F1, ROUGE-L, compression ratio) for direct comparison.
+
 ## Dataset
 
 Training data: [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench)
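
The sliding-window behavior described in the README diff above can be sketched in a few lines. This is an illustration only: `windowed_line_scores`, its `window`/`stride` defaults, and the toy scorer are assumed names, not squeez's actual API; the real model assigns per-line relevance logits.

```python
# Sketch of the sliding-window idea behind the encoder backend: split a long
# tool output into overlapping line windows, score each line in every window
# that covers it, and keep the max score, so lines beyond the context limit
# still get a prediction.
from typing import Callable


def windowed_line_scores(
    lines: list[str],
    score_window: Callable[[list[str]], list[float]],
    window: int = 64,
    stride: int = 48,
) -> list[float]:
    """Score every line, taking the max over overlapping windows."""
    scores = [float("-inf")] * len(lines)
    start = 0
    while start < len(lines):
        chunk = lines[start : start + window]
        for offset, s in enumerate(score_window(chunk)):
            scores[start + offset] = max(scores[start + offset], s)
        if start + window >= len(lines):
            break
        start += stride
    return scores


def keep_relevant(lines: list[str], scores: list[float], threshold: float = 0.5) -> str:
    """Join only the lines whose score clears the threshold."""
    return "\n".join(l for l, s in zip(lines, scores) if s >= threshold)


# Toy scorer standing in for the model: mark lines containing "def " relevant.
toy = lambda chunk: [1.0 if "def " in l else 0.0 for l in chunk]
lines = [f"line {i}" for i in range(100)] + ["def target():"]
print(keep_relevant(lines, windowed_line_scores(lines, toy)))  # → def target():
```

With a stride smaller than the window, every line is covered by at least one window, and lines in overlap regions get the more confident of their two scores.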

configs/default.yaml

Lines changed: 10 additions & 1 deletion
@@ -1,5 +1,5 @@
 # Inference
-backend: "transformers" # optional preference: "transformers" or "vllm"
+backend: null # auto-detect from model; set to "transformers", "vllm", or "encoder"
 local_model_path: "./output/squeez_qwen"
 server_url: null # OpenAI-compatible API, e.g. "http://localhost:8000/v1"
 server_model: null # optional remote model id when using a server
@@ -23,6 +23,15 @@ lora_r: 16
 lora_alpha: 32
 lora_dropout: 0
 
+# Encoder training hyperparameters
+encoder_base_model: "jhu-clsp/mmBERT-base"
+encoder_max_length: 8192
+encoder_batch_size: 16
+encoder_learning_rate: 2.0e-5
+encoder_num_epochs: 5
+encoder_warmup_ratio: 0.1
+encoder_output_dir: "./output/squeez_encoder"
+
 # Data generation
 distillation_model: "gpt-5.4"
 max_tool_output_lines: 500

docs/api/encoder.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+# Encoder
+
+The encoder is a line classifier for fast, discriminative tool output extraction.
+
+## Model
+
+::: squeez.encoder.model.SqueezEncoderConfig
+
+::: squeez.encoder.model.SqueezEncoderForLineClassification
+
+## Dataset
+
+::: squeez.encoder.dataset.LineClassificationDataset
+
+## Evaluation
+
+::: squeez.encoder.evaluate.evaluate_encoder
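
The evaluation module documented above reports span F1 among its metrics. As a rough illustration of what an F1 over relevant lines computes (a hypothetical simplification; `evaluate_encoder`'s actual span matching may differ):

```python
# Hypothetical sketch of a line-level F1 between predicted and gold relevant
# lines, using multiset overlap so duplicate lines are counted correctly.
from collections import Counter


def line_f1(predicted: list[str], gold: list[str]) -> float:
    """Harmonic mean of precision and recall over relevant lines."""
    if not predicted and not gold:
        return 1.0  # both empty: perfect agreement
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


print(line_f1(["a", "b", "c"], ["b", "c", "d"]))  # 2/3 precision, 2/3 recall → ~0.667
```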

docs/getting-started/configuration.md

Lines changed: 2 additions & 2 deletions
@@ -7,7 +7,7 @@ Squeez resolves backend configuration in order: **CLI args > env vars > config f
 Create a `squeez.yaml` in your project root, or use `configs/default.yaml`:
 
 ```yaml
-backend: "transformers" # optional preference
+backend: null # auto-detect from model; or "transformers", "vllm", "encoder"
 local_model_path: "./output/squeez_qwen"
 
 # Or remote API backend
@@ -29,7 +29,7 @@ Config files are searched in order:
 | `SQUEEZ_SERVER_URL` | OpenAI-compatible API URL |
 | `SQUEEZ_SERVER_MODEL` | Remote model ID on that server |
 | `SQUEEZ_API_KEY` | API key (also checks `OPENAI_API_KEY`) |
-| `SQUEEZ_BACKEND` | Optional backend preference: `transformers` or `vllm` |
+| `SQUEEZ_BACKEND` | Optional backend preference: `transformers`, `vllm`, or `encoder` |
 
 ```bash
 export SQUEEZ_LOCAL_MODEL=./output/squeez_qwen

docs/getting-started/installation.md

Lines changed: 21 additions & 7 deletions
@@ -22,13 +22,27 @@ pip install -e ".[dev]"
 
 This adds `pytest` and `ruff` for testing and linting.
 
+## Training dependencies
+
+For generative model training (Qwen + LoRA):
+
+```bash
+pip install -r requirements-train.txt
+```
+
+For encoder model training (mmBERT):
+
+```bash
+pip install -r requirements-encoder.txt
+```
+
 ## Dependencies
 
-Squeez requires Python 3.10+ and depends on:
+Squeez requires Python 3.10+. Base install only needs `openai` and `pyyaml`.
+
+Optional dependency groups:
 
-- `torch` — model inference and training
-- `transformers` — model loading and tokenization
-- `peft` — LoRA adapters
-- `datasets` — HuggingFace dataset loading
-- `openai` — vLLM/API backend
-- `pyyaml` — config file parsing
+- `pip install squeez[local]` — `torch`, `transformers`, `peft` for local inference
+- `pip install squeez[encoder]` — `torch`, `transformers`, `datasets` for encoder training
+- `pip install squeez[train]` — adds `trl`, `unsloth` for generative training
+- `pip install squeez[dev]` — adds `pytest`, `ruff`

docs/getting-started/quickstart.md

Lines changed: 8 additions & 6 deletions
@@ -29,9 +29,12 @@ from squeez.inference.extractor import ToolOutputExtractor
 # Load backend from config or env
 extractor = ToolOutputExtractor()
 
-# Or load model locally
+# Or load a generative model locally
 extractor = ToolOutputExtractor(model_path="./output/squeez_qwen")
 
+# Or load an encoder model (auto-detected from config.json)
+extractor = ToolOutputExtractor(model_path="./output/squeez_encoder")
+
 # Or connect to a server explicitly
 extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1", model_name="squeez")
 
@@ -44,10 +47,9 @@ print(filtered) # Only the relevant lines
 
 ## How it works
 
-The model receives the task description and raw tool output, then returns a JSON object:
+Squeez supports two model types behind the same `extract()` API:
 
-```json
-{"relevant_lines": ["class CsrfViewMiddleware(MiddlewareMixin):", "    def _check_referer(self, request):", ...]}
-```
+- **Generative** (default): The model returns a JSON object `{"relevant_lines": [...]}` and `extract()` joins them into text.
+- **Encoder**: A token classifier labels each line as relevant/irrelevant. Uses sliding windows for outputs longer than the context window.
 
-The `extract()` method parses this JSON and joins the lines into filtered text.
+The backend is auto-detected from the model's `config.json`. You can also set it explicitly via `SQUEEZ_BACKEND=encoder` or `backend: "encoder"` in config.
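
The `config.json` auto-detection mentioned in the quickstart diff above could work roughly as follows. `detect_backend` is a hypothetical sketch, not squeez's actual resolution code; the architecture name it checks is the class documented in `docs/api/encoder.md`.

```python
# Sketch of config.json-based backend detection: an encoder checkpoint's
# config lists its line-classification architecture, so the presence of that
# name selects the "encoder" backend; anything else falls back to the
# generative "transformers" backend.
import json
from pathlib import Path


def detect_backend(model_path: str) -> str:
    """Return "encoder" for an encoder checkpoint, else "transformers"."""
    config = json.loads((Path(model_path) / "config.json").read_text())
    architectures = config.get("architectures", [])
    if any("LineClassification" in arch for arch in architectures):
        return "encoder"
    return "transformers"
```

An explicit `SQUEEZ_BACKEND` or `backend:` setting would simply bypass a check like this.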
