Commit 29e595e ("Better docs, encoding")

1 parent bc25f7d, 5 files changed: 254 additions & 44 deletions

README.md: 57 additions & 18 deletions
@@ -17,10 +17,11 @@ LLM coding agents waste **80-95% of context tokens** on irrelevant tool output.
 
 Squeez trains small models to identify and extract only the lines that matter for the task at hand — compressing tool output by ~86% on average.
 
-Two approaches are available:
+Three approaches are available:
 
 - **Generative** (Qwen 3.5 2B + LoRA) — high-quality extraction via XML-wrapped verbatim output
-- **Encoder** (mmBERT 307M) — fast line-level binary classification, sliding window over long outputs
+- **Pooled encoder** (ModernBERT / ettin) — single-pass encoder with line-level mean-pool classification, works with any HuggingFace encoder
+- **Token encoder** (mmBERT) — per-token binary classification with sliding window
 
 ## Example
 
@@ -144,7 +145,7 @@ For generative model training (Qwen + LoRA):
 pip install -r requirements-train.txt
 ```
 
-For encoder model training (mmBERT):
+For encoder model training:
 
 ```bash
 pip install -r requirements-encoder.txt
@@ -176,8 +177,8 @@ extractor = ToolOutputExtractor()
 # Or load a generative model locally
 extractor = ToolOutputExtractor(model_path="./output/squeez_qwen")
 
-# Or load an encoder model (auto-detected from config.json)
-extractor = ToolOutputExtractor(model_path="./output/squeez_encoder")
+# Or load an encoder model (pooled or token, auto-detected from config.json)
+extractor = ToolOutputExtractor(model_path="./output/squeez_pooled")
 
 # Or connect to a server explicitly
 extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1", model_name="squeez")
@@ -259,37 +260,75 @@ squeez train \
 
 Default: Qwen 3.5 2B with LoRA (r=16, alpha=32). See `configs/default.yaml` for all hyperparameters.
 
-### 2b. Train encoder model (mmBERT)
+### 2b. Train pooled encoder (recommended)
 
 ```bash
-# Prepare encoder-format data from the downloaded splits
-python scripts/prepare_encoder_data.py --data-dir data
+python -m squeez.encoder.train \
+  --classifier-type pooled \
+  --train-file data/encoder_train.jsonl \
+  --eval-file data/encoder_dev.jsonl \
+  --base-model answerdotai/ModernBERT-base \
+  --output-dir output/squeez_pooled \
+  --batch-size 96 \
+  --gradient-accumulation-steps 2 \
+  --max-length 4096 \
+  --learning-rate 2e-5 \
+  --num-epochs 4
+```
+
+The pooled encoder runs a single forward pass over the full input, mean-pools hidden states per line, and classifies each line as relevant/irrelevant. Works with any HuggingFace encoder model (ModernBERT, ettin, DeBERTa, etc.) and uses sliding windows for outputs longer than `--max-length`.
+
+After training, the model can be loaded standalone without squeez installed:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+model = AutoModel.from_pretrained("output/squeez_pooled", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("output/squeez_pooled")
 
-# Train the encoder
+result = model.process(
+    task="Find the traceback that shows the import error",
+    tool_output=open("output.log").read(),
+    tokenizer=tokenizer,
+)
+print(result["highlighted_lines"])
+```
+
+### 2c. Train token encoder (alternative)
+
+```bash
 python -m squeez.encoder.train \
+  --classifier-type token \
   --train-file data/encoder_train.jsonl \
   --eval-file data/encoder_dev.jsonl \
-  --base-model jhu-clsp/mmBERT-base \
+  --base-model answerdotai/ModernBERT-base \
   --output-dir output/squeez_encoder
 ```
 
-The encoder is a 307M parameter mmBERT with a token classification head. It classifies each line as relevant/irrelevant and uses sliding windows to handle outputs longer than the 8K context.
-
 ### 3. Evaluate
 
 ```bash
-# Generative model
+# Generative model (local)
 squeez eval \
   --extractor-model output/squeez_qwen \
-  --eval-file data/test.jsonl
+  --eval-file data/test.jsonl \
+  --max-new-tokens 4096
+
+# Generative model (remote vLLM server)
+squeez eval \
+  --server-url http://localhost:8000/v1 \
+  --eval-file data/test.jsonl \
+  --max-new-tokens 4096 \
+  --request-concurrency 8
 
-# Encoder model
+# Encoder model (pooled or token, auto-detected)
 python -m squeez.encoder.evaluate \
-  --model-path output/squeez_encoder \
-  --eval-file data/encoder_test.jsonl
+  --model-path output/squeez_pooled \
+  --eval-file data/encoder_test.jsonl \
+  --examples-output eval_examples_pooled.json
 ```
 
-Both produce the same metrics format (strict and fuzzy line overlap, ROUGE-L, compression ratio) for direct comparison.
+All produce the same metrics format (strict and fuzzy line overlap, ROUGE-L, compression ratio) for direct comparison.
 
 ## Dataset
 
TRAINING.md: 168 additions & 0 deletions (new file)

@@ -0,0 +1,168 @@
+# Training & Evaluation Commands
+
+## 1. Download data
+
+```bash
+python scripts/download_data.py
+```
+
+Downloads from [HuggingFace](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) to `data/`:
+- `train.jsonl`, `dev.jsonl`, `test.jsonl` (generative format)
+- `encoder_train.jsonl`, `encoder_dev.jsonl`, `encoder_test.jsonl` (encoder format)
+- `canonical_train.jsonl`, `canonical_dev.jsonl`, `canonical_test.jsonl` (span-based ground truth)
+
+## 2. Train
+
+### Pooled encoder (recommended)
+
+Single-pass encoder + line-level mean-pool classifier. Works with any HuggingFace encoder.
+
+```bash
+# ModernBERT-base on A100 (~75 min)
+python -m squeez.encoder.train \
+  --classifier-type pooled \
+  --train-file data/encoder_train.jsonl \
+  --eval-file data/encoder_dev.jsonl \
+  --base-model answerdotai/ModernBERT-base \
+  --output-dir output/squeez_pooled \
+  --batch-size 96 \
+  --gradient-accumulation-steps 2 \
+  --max-length 4096 \
+  --learning-rate 2e-5 \
+  --num-epochs 4
+
+# ModernBERT-large (higher capacity, slower)
+python -m squeez.encoder.train \
+  --classifier-type pooled \
+  --train-file data/encoder_train.jsonl \
+  --eval-file data/encoder_dev.jsonl \
+  --base-model answerdotai/ModernBERT-large \
+  --output-dir output/squeez_pooled_large \
+  --batch-size 24 \
+  --gradient-accumulation-steps 4 \
+  --max-length 4096 \
+  --learning-rate 2e-5 \
+  --num-epochs 4
+
+# Other encoder models work too
+# --base-model jhu-clsp/ettin-encoder-32m
+# --base-model microsoft/deberta-v3-large
+# --base-model BAAI/bge-large-en-v1.5
+```
+
+### Token encoder
+
+Per-token binary classification (alternative approach).
+
+```bash
+python -m squeez.encoder.train \
+  --classifier-type token \
+  --train-file data/encoder_train.jsonl \
+  --eval-file data/encoder_dev.jsonl \
+  --base-model answerdotai/ModernBERT-base \
+  --output-dir output/squeez_encoder \
+  --batch-size 2 \
+  --max-length 8192
+```
+
+### Generative model (Qwen + LoRA)
+
+```bash
+squeez train \
+  --train-file data/train.jsonl \
+  --eval-file data/dev.jsonl \
+  --output-dir output/squeez_qwen
+```
+
+To merge LoRA weights and serve:
+
+```bash
+# Merge
+python scripts/merge_lora.py \
+  --checkpoint output/squeez_qwen/checkpoint-500 \
+  --output output/squeez_qwen_merged
+
+# Serve with vLLM
+vllm serve output/squeez_qwen_merged \
+  --max-model-len 32768 \
+  --trust-remote-code
+```
+
+## 3. Evaluate
+
+### Encoder (pooled or token, auto-detected)
+
+```bash
+python -m squeez.encoder.evaluate \
+  --model-path output/squeez_pooled \
+  --eval-file data/encoder_test.jsonl \
+  --examples-output eval_examples_pooled.json
+```
+
+Optional flags:
+- `--threshold 0.5` — relevance probability cutoff (default 0.5)
+- `--max-samples 100` — evaluate on a subset
+
+### Generative (local model)
+
+```bash
+squeez eval \
+  --extractor-model output/squeez_qwen_merged \
+  --eval-file data/test.jsonl \
+  --max-new-tokens 4096 \
+  --examples-output eval_examples.json
+```
+
+### Generative (remote vLLM server)
+
+```bash
+squeez eval \
+  --server-url http://localhost:8000/v1 \
+  --eval-file data/test.jsonl \
+  --max-new-tokens 4096 \
+  --request-concurrency 8 \
+  --examples-output eval_examples.json
+```
+
+## 4. Standalone inference (no squeez install)
+
+After training the pooled encoder, the output directory contains `modeling_squeez_pooled.py` so `AutoModel` works directly:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+model = AutoModel.from_pretrained("output/squeez_pooled", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("output/squeez_pooled")
+
+result = model.process(
+    task="Find the traceback that shows the import error",
+    tool_output=open("output.log").read(),
+    tokenizer=tokenizer,
+    threshold=0.5,
+    return_line_probabilities=True,
+)
+print(result["highlighted_lines"])
+print(result["highlighted_indices"])
+```
+
+## 5. Upload to HuggingFace
+
+### Dataset
+
+```bash
+python scripts/upload_to_hf.py --data-dir data/v3
+```
+
+### Model
+
+Push the trained model directory (includes `modeling_squeez_pooled.py` for standalone loading):
+
+```python
+from huggingface_hub import HfApi
+api = HfApi()
+api.upload_folder(
+    folder_path="output/squeez_pooled",
+    repo_id="KRLabsOrg/squeez-pooled-modernbert",
+    repo_type="model",
+)
+```
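The evaluate commands in TRAINING.md report strict and fuzzy line-overlap metrics. As intuition for what a strict line-overlap score measures, here is a plausible set-based F1 over predicted vs. gold line indices; this is a reconstruction for illustration only, and the actual metric definitions live in squeez's eval code.

```python
# Illustrative strict line-overlap F1: treat predicted and gold relevant
# line indices as sets and score their agreement.

def line_overlap_f1(predicted, gold):
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)  # lines both extracted and labeled relevant
    precision = tp / len(pred)
    recall = tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

score = line_overlap_f1([3, 4, 5, 9], [4, 5, 6])
# two correct lines out of four predicted (P=0.5) and three gold (R=2/3)
```

A "fuzzy" variant would additionally credit near-miss lines (e.g. off-by-one indices or high textual similarity); the strict version demands exact index matches.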

squeez/encoder/modeling_squeez_pooled.py: 1 addition & 4 deletions
@@ -232,10 +232,7 @@ def _pool_lines(
    if max_lines == 0:
        max_lines = 1
 
-    flat_idx = (
-        torch.arange(batch_size, device=device).unsqueeze(1) * max_lines
-        + segment_ids
-    )
+    flat_idx = torch.arange(batch_size, device=device).unsqueeze(1) * max_lines + segment_ids
    flat_idx = flat_idx * valid_token.long()
 
    pooled_flat = torch.zeros(batch_size * max_lines, hidden, device=device)
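The `flat_idx` expression being collapsed onto one line above gives every (batch item, line) pair a unique bucket id, so a single scatter-add over `batch_size * max_lines` buckets can pool all lines at once. A minimal pure-Python sketch of that indexing (illustrative stand-in for the torch code; the function name is made up):

```python
# Bucket id = batch_index * max_lines + per-token line (segment) id,
# so tokens from different batch items can never collide in one bucket.
# Mirrors: flat_idx = arange(batch).unsqueeze(1) * max_lines + segment_ids

def flat_bucket_ids(segment_ids, max_lines):
    """segment_ids: per-batch-item list of per-token line ids."""
    return [
        [b * max_lines + seg for seg in row]
        for b, row in enumerate(segment_ids)
    ]

buckets = flat_bucket_ids([[0, 0, 1], [0, 1, 1]], max_lines=2)
# batch 0 tokens land in buckets 0, 0, 1; batch 1 tokens in 2, 3, 3
```

With these ids, summing hidden states into a flat `[batch * max_lines, hidden]` buffer and reshaping recovers per-(batch, line) sums in one vectorized pass.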

squeez/encoder/sentence.py: 13 additions & 22 deletions
@@ -216,8 +216,7 @@ def _pool_lines(
    # Use scatter_add to sum hidden states per (batch, segment)
    # Flatten to [batch * max_lines] buckets
    flat_idx = (
-        torch.arange(batch_size, device=device).unsqueeze(1) * max_lines
-        + segment_ids
+        torch.arange(batch_size, device=device).unsqueeze(1) * max_lines + segment_ids
    )  # [batch, seq_len]
 
    # Zero out invalid positions
@@ -559,32 +558,24 @@ def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
 
 def collate_pooled_lines(batch: list[dict]) -> dict[str, torch.Tensor]:
     """Custom collator: pad input_ids and line_labels separately."""
+    batch_size = len(batch)
     max_seq_len = max(b["input_ids"].shape[0] for b in batch)
     max_lines = max(b["line_labels"].shape[0] for b in batch)
 
-    input_ids = []
-    attention_mask = []
-    line_labels = []
+    # Pre-allocate padded tensors
+    input_ids = torch.zeros(batch_size, max_seq_len, dtype=torch.long)
+    attention_mask = torch.zeros(batch_size, max_seq_len, dtype=torch.long)
+    line_labels = torch.full((batch_size, max_lines), -100, dtype=torch.long)
 
-    for b in batch:
+    for i, b in enumerate(batch):
         seq_len = b["input_ids"].shape[0]
         n_lines = b["line_labels"].shape[0]
-
-        # Pad sequences
-        pad_len = max_seq_len - seq_len
-        input_ids.append(torch.cat([b["input_ids"], torch.zeros(pad_len, dtype=torch.long)]))
-        attention_mask.append(
-            torch.cat([b["attention_mask"], torch.zeros(pad_len, dtype=torch.long)])
-        )
-
-        # Pad line labels with -100
-        label_pad = max_lines - n_lines
-        line_labels.append(
-            torch.cat([b["line_labels"], torch.full((label_pad,), -100, dtype=torch.long)])
-        )
+        input_ids[i, :seq_len] = b["input_ids"]
+        attention_mask[i, :seq_len] = b["attention_mask"]
+        line_labels[i, :n_lines] = b["line_labels"]
 
     return {
-        "input_ids": torch.stack(input_ids),
-        "attention_mask": torch.stack(attention_mask),
-        "line_labels": torch.stack(line_labels),
+        "input_ids": input_ids,
+        "attention_mask": attention_mask,
+        "line_labels": line_labels,
    }
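The rewritten collator in sentence.py pre-allocates padded tensors and copies each example into a slice, instead of concatenating per-example pads and stacking. The padding scheme itself (sequences right-padded with 0, line labels with the -100 ignore index) can be shown in plain Python; this is an illustrative stand-in, not the torch collator:

```python
# Plain-Python sketch of the collator's padding scheme: pad token ids
# with 0 and line labels with -100 (the loss-ignore index) to the
# batch maxima, so every example in the batch has the same shape.

def collate(batch, pad_id=0, ignore_label=-100):
    max_seq = max(len(b["input_ids"]) for b in batch)
    max_lines = max(len(b["line_labels"]) for b in batch)
    return {
        "input_ids": [
            b["input_ids"] + [pad_id] * (max_seq - len(b["input_ids"]))
            for b in batch
        ],
        "line_labels": [
            b["line_labels"] + [ignore_label] * (max_lines - len(b["line_labels"]))
            for b in batch
        ],
    }

out = collate([
    {"input_ids": [5, 6, 7], "line_labels": [1]},
    {"input_ids": [8], "line_labels": [0, 1]},
])
# short sequence is padded with 0s; short label row is padded with -100
```

Pre-allocating and slicing, as the diff does with `torch.zeros`/`torch.full`, avoids building an intermediate list of padded tensors before stacking, but produces the same padded batch.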
