# Training & Evaluation Commands

## 1. Download data

```bash
python scripts/download_data.py
```

Downloads from [HuggingFace](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) to `data/`:
- `train.jsonl`, `dev.jsonl`, `test.jsonl` (generative format)
- `encoder_train.jsonl`, `encoder_dev.jsonl`, `encoder_test.jsonl` (encoder format)
- `canonical_train.jsonl`, `canonical_dev.jsonl`, `canonical_test.jsonl` (span-based ground truth)
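
Each split is JSON Lines: one JSON object per line. A quick way to sanity-check a download is to parse a record; the `"task"` and `"tool_output"` field names below are illustrative stand-ins, not the dataset's actual schema:

```python
import io
import json

# Parse one JSONL record from an in-memory sample; in practice you would
# open e.g. data/encoder_train.jsonl instead. Field names are hypothetical.
sample = io.StringIO('{"task": "find the traceback", "tool_output": "line 1\\nline 2"}\n')
records = [json.loads(line) for line in sample if line.strip()]
print(records[0]["task"])
```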

## 2. Train

### Pooled encoder (recommended)

Single-pass encoder + line-level mean-pool classifier. Works with any HuggingFace encoder.

```bash
# ModernBERT-base on A100 (~75 min)
python -m squeez.encoder.train \
  --classifier-type pooled \
  --train-file data/encoder_train.jsonl \
  --eval-file data/encoder_dev.jsonl \
  --base-model answerdotai/ModernBERT-base \
  --output-dir output/squeez_pooled \
  --batch-size 96 \
  --gradient-accumulation-steps 2 \
  --max-length 4096 \
  --learning-rate 2e-5 \
  --num-epochs 4

# ModernBERT-large (higher capacity, slower)
python -m squeez.encoder.train \
  --classifier-type pooled \
  --train-file data/encoder_train.jsonl \
  --eval-file data/encoder_dev.jsonl \
  --base-model answerdotai/ModernBERT-large \
  --output-dir output/squeez_pooled_large \
  --batch-size 24 \
  --gradient-accumulation-steps 4 \
  --max-length 4096 \
  --learning-rate 2e-5 \
  --num-epochs 4

# Other encoder models work too:
# --base-model jhu-clsp/ettin-encoder-32m
# --base-model microsoft/deberta-v3-large
# --base-model BAAI/bge-large-en-v1.5
```
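
The core idea of the pooled head can be sketched in a few lines of NumPy. This is an assumed-mechanics illustration, not the squeez implementation: average the token embeddings that fall inside each line of the tool output, then score each line vector with a linear head.

```python
import numpy as np

# Hypothetical encoder output: 10 tokens with hidden size 8.
rng = np.random.default_rng(0)
token_embeddings = rng.standard_normal((10, 8))
line_spans = [(0, 4), (4, 7), (7, 10)]  # token ranges belonging to each line

# Mean-pool tokens per line, then apply a (hypothetical) linear head.
line_vectors = np.stack([token_embeddings[s:e].mean(axis=0) for s, e in line_spans])
w, b = rng.standard_normal(8), 0.0
logits = line_vectors @ w + b
probs = 1.0 / (1.0 + np.exp(-logits))  # per-line relevance probability
print(probs.shape)
```

The pooling makes the classification decision per line rather than per token, which is why a single forward pass over the whole output suffices.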

### Token encoder

Per-token binary classification (alternative approach).

```bash
python -m squeez.encoder.train \
  --classifier-type token \
  --train-file data/encoder_train.jsonl \
  --eval-file data/encoder_dev.jsonl \
  --base-model answerdotai/ModernBERT-base \
  --output-dir output/squeez_encoder \
  --batch-size 2 \
  --max-length 8192
```
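
For contrast with the pooled head, the token classifier scores every token independently; a minimal sketch of the assumed mechanics (again, not the actual squeez code):

```python
import numpy as np

# One relevance logit per token, sigmoid, then a 0.5 cutoff.
rng = np.random.default_rng(0)
token_logits = rng.standard_normal(10)           # hypothetical head output
token_probs = 1.0 / (1.0 + np.exp(-token_logits))
relevant = token_probs >= 0.5                    # boolean mask over tokens
print(relevant.shape)
```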

### Generative model (Qwen + LoRA)

```bash
squeez train \
  --train-file data/train.jsonl \
  --eval-file data/dev.jsonl \
  --output-dir output/squeez_qwen
```

To merge LoRA weights and serve:

```bash
# Merge
python scripts/merge_lora.py \
  --checkpoint output/squeez_qwen/checkpoint-500 \
  --output output/squeez_qwen_merged

# Serve with vLLM
vllm serve output/squeez_qwen_merged \
  --max-model-len 32768 \
  --trust-remote-code
```
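
The merged model is then reachable through vLLM's OpenAI-compatible API at `http://localhost:8000/v1`. A sketch of the request body (the endpoint shape follows the OpenAI chat-completions API; the prompt content is illustrative):

```python
import json

# Body for POST http://localhost:8000/v1/chat/completions
payload = {
    "model": "output/squeez_qwen_merged",
    "messages": [
        {"role": "user", "content": "Extract the relevant lines from this tool output."}
    ],
    "max_tokens": 4096,
    "temperature": 0.0,
}
body = json.dumps(payload)
print(len(body) > 0)
```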

## 3. Evaluate

### Encoder (pooled or token, auto-detected)

```bash
python -m squeez.encoder.evaluate \
  --model-path output/squeez_pooled \
  --eval-file data/encoder_test.jsonl \
  --examples-output eval_examples_pooled.json
```

Optional flags:
- `--threshold 0.5` — relevance probability cutoff (default 0.5)
- `--max-samples 100` — evaluate on a subset
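
Assuming the straightforward semantics for the cutoff, a line is kept when its relevance probability meets or exceeds the threshold:

```python
# Hypothetical per-line probabilities; keep indices at or above the cutoff.
probs = [0.92, 0.31, 0.55, 0.08]
threshold = 0.5
kept = [i for i, p in enumerate(probs) if p >= threshold]
print(kept)  # → [0, 2]
```

Raising the threshold trades recall for precision: fewer lines survive, but those that do are higher-confidence.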

### Generative (local model)

```bash
squeez eval \
  --extractor-model output/squeez_qwen_merged \
  --eval-file data/test.jsonl \
  --max-new-tokens 4096 \
  --examples-output eval_examples.json
```

### Generative (remote vLLM server)

```bash
squeez eval \
  --server-url http://localhost:8000/v1 \
  --eval-file data/test.jsonl \
  --max-new-tokens 4096 \
  --request-concurrency 8 \
  --examples-output eval_examples.json
```

## 4. Standalone inference (no squeez install)

After training the pooled encoder, the output directory contains `modeling_squeez_pooled.py`, so `AutoModel` works directly:

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("output/squeez_pooled", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("output/squeez_pooled")

result = model.process(
    task="Find the traceback that shows the import error",
    tool_output=open("output.log").read(),
    tokenizer=tokenizer,
    threshold=0.5,
    return_line_probabilities=True,
)
print(result["highlighted_lines"])
print(result["highlighted_indices"])
```
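
The returned dict can be consumed directly, e.g. to build a compressed view of the log; the values below are fabricated for illustration (real ones come from `model.process`):

```python
# Fabricated stand-in for a model.process() result.
result = {
    "highlighted_lines": [
        "Traceback (most recent call last):",
        "ImportError: No module named foo",
    ],
    "highlighted_indices": [12, 14],
}

# Join the kept lines into a compressed extract of the original output.
compressed = "\n".join(result["highlighted_lines"])
print(compressed.splitlines()[0])
```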

## 5. Upload to HuggingFace

### Dataset

```bash
python scripts/upload_to_hf.py --data-dir data/v3
```

### Model

Push the trained model directory (includes `modeling_squeez_pooled.py` for standalone loading):

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="output/squeez_pooled",
    repo_id="KRLabsOrg/squeez-pooled-modernbert",
    repo_type="model",
)
```