Commit 9cfbbb1

Some polish
1 parent 6d5e8e5 commit 9cfbbb1

9 files changed

Lines changed: 529 additions & 18 deletions

File tree

.github/workflows/python-tests.yml

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+name: Python Tests
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install ruff
+        run: pip install ruff
+
+      - name: Lint
+        run: |
+          ruff format --check --diff squeez/ tests/
+          ruff check squeez/ tests/
+
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+          cache: 'pip'
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+
+      - name: Test with pytest
+        run: |
+          pytest tests/ -v

CHANGELOG.md

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+# Changelog
+
+All notable changes to Squeez are documented here.
+
+## [0.1.0] - 2026-03-07
+
+### Added
+- Initial release
+- CLI tool: `cat output.txt | squeez "task description"`
+- Python API: `ToolOutputExtractor` with vLLM and transformers backends
+- Config file support (`squeez.yaml`, env vars, CLI args)
+- LoRA fine-tuning pipeline for Qwen 3.5 2B
+- SFT dataset with proper label masking
+- Evaluation metrics: line-level F1, ROUGE-L, compression ratio
+- Full data generation pipeline from SWE-bench
+- Dataset download script for HuggingFace
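
The changelog lists line-level F1 and compression ratio among the evaluation metrics, but the diff does not show how they are computed. The following is a minimal sketch under stated assumptions: lines are compared as whitespace-stripped sets, and compression is measured in characters. The function names `line_f1` and `compression_ratio` are hypothetical and may not match `squeez/training/evaluate.py`.

```python
def line_f1(predicted: list[str], reference: list[str]) -> float:
    """Set-based F1 between predicted and reference relevant lines.

    Assumption: lines are compared after stripping whitespace, and
    duplicates or ordering are ignored.
    """
    pred = {line.strip() for line in predicted if line.strip()}
    ref = {line.strip() for line in reference if line.strip()}
    if not pred and not ref:
        return 1.0  # both empty: treat as perfect agreement
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def compression_ratio(original: str, extracted: str) -> float:
    """Fraction of the original characters that were removed."""
    if not original:
        return 0.0
    return 1.0 - len(extracted) / len(original)
```

A token-based ratio (as the README's "~86%" figure may be) would need a tokenizer; character counts are used here only to keep the sketch dependency-free.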

CONTRIBUTING.md

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
+# Contributing to Squeez
+
+Thanks for your interest in contributing! This guide will get you up and running.
+
+## Setup
+
+```bash
+# Clone the repo
+git clone https://github.com/KRLabsOrg/squeez.git
+cd squeez
+
+# Create a virtual environment (Python 3.10+)
+python -m venv .venv
+source .venv/bin/activate
+
+# Install in development mode
+pip install -e ".[dev]"
+```
+
+## Code style
+
+We use [Ruff](https://docs.astral.sh/ruff/) for both linting and formatting, with a line length of 100.
+
+```bash
+# Check formatting
+ruff format --check squeez/ tests/
+
+# Auto-format
+ruff format squeez/ tests/
+
+# Lint
+ruff check squeez/ tests/
+
+# Lint with auto-fix
+ruff check --fix squeez/ tests/
+```
+
+Key conventions:
+- Use modern type hints (`list[str]`, `dict[str, Any]`, `str | None`)
+- Add docstrings to public classes and methods
+- Use `logging` instead of `print()` for runtime messages
+- Use `pathlib.Path` instead of `os.path`
+
+## Running tests
+
+```bash
+pytest tests/ -v
+```
+
+## Project structure
+
+```
+squeez/
+  inference/              # Runtime extractor (CLI + Python API)
+    extractor.py          # ToolOutputExtractor with vLLM/transformers backends
+  training/               # Model training pipeline
+    train.py              # LoRA fine-tuning script
+    dataset.py            # SFT dataset with label masking
+    evaluate.py           # Evaluation metrics
+  data/                   # Data generation pipeline
+    pipeline.py           # Main pipeline orchestrator
+    config.py             # Configuration and system prompt
+    swebench_loader.py
+    source_fetcher.py
+    tool_call_generator.py
+    tool_call_executor.py
+    auto_labeler.py
+    llm_distiller.py
+    sample_assembler.py
+    validator.py
+configs/                  # YAML configuration files
+scripts/                  # Utility scripts
+tests/                    # Pytest test suite
+```
+
+## Making changes
+
+1. **Create a branch** from `main`:
+   ```bash
+   git checkout -b my-feature
+   ```
+
+2. **Make your changes.** Keep PRs focused: one feature or fix per PR.
+
+3. **Run lint and tests** before committing:
+   ```bash
+   ruff format squeez/ tests/
+   ruff check squeez/ tests/
+   pytest tests/ -v
+   ```
+
+4. **Open a pull request** against `main`. CI will run lint and tests automatically.
+
+## What to work on
+
+Good areas for contribution:
+- **New tool types**: add support for more tool output formats in the data generation pipeline
+- **Model backends**: add new inference backends (e.g. GGUF, TensorRT)
+- **Evaluation**: improve metrics or add new evaluation methods
+- **Tests**: increase coverage
+- **Bug fixes**: check [open issues](https://github.com/KRLabsOrg/squeez/issues)
+
+## License
+
+By contributing, you agree that your contributions will be licensed under the Apache 2.0 License.

README.md

Lines changed: 146 additions & 5 deletions
@@ -1,5 +1,10 @@
 # Squeez
 
+<p align="center">
+<img src="https://github.com/KRLabsOrg/squeez/blob/main/assets/squeez_mascot.png?raw=true" alt="Squeez Logo" width="300"/>
+<br><em>Squeeze out the juice, leave the pulp behind.</em>
+</p>
+
 Squeeze verbose LLM agent tool output down to only the relevant lines.
 
 [![PyPI](https://img.shields.io/pypi/v/squeez)](https://pypi.org/project/squeez/)
@@ -12,6 +17,116 @@ LLM coding agents waste **80-95% of context tokens** on irrelevant tool output.
 
 Squeez trains a small (2-3B) generative model to identify and extract only the lines that matter for the task at hand, compressing tool output by ~86% on average.
 
+## Example
+
+Task: *"Fix the CSRF validation bug in the referer check"*
+
+<table>
+<tr>
+<th>Before: 42 lines, ~1,200 tokens</th>
+<th>After: 8 lines, ~150 tokens</th>
+</tr>
+<tr>
+<td>
+
+```python
+class CsrfViewMiddleware(MiddlewareMixin):
+    def _check_referer(self, request):
+        referer = request.META.get('HTTP_REFERER')
+        if referer is None:
+            raise RejectRequest('No referer')
+        good_referer = request.get_host()
+        if not same_origin(referer, good_referer):
+            raise RejectRequest('Bad referer')
+
+    def process_view(self, request, callback, ...):
+        if getattr(request, 'csrf_processing_done', False):
+            return None
+        csrf_token = request.META.get('CSRF_COOKIE')
+        if csrf_token is None:
+            return self._reject(request, 'No CSRF cookie')
+        return self._accept(request)
+
+class SessionMiddleware(MiddlewareMixin):
+    def process_request(self, request):
+        session_key = request.COOKIES.get(...)
+        request.session = self.SessionStore(session_key)
+
+    def process_response(self, request, response):
+        if request.session.modified:
+            request.session.save()
+        return response
+
+class CommonMiddleware(MiddlewareMixin):
+    def process_request(self, request):
+        host = request.get_host()
+        if settings.PREPEND_WWW and ...:
+            return redirect(...)
+
+    def process_response(self, request, response):
+        if settings.USE_ETAGS:
+            response['ETag'] = hashlib.md5(...)
+        return response
+
+class SecurityMiddleware(MiddlewareMixin):
+    def process_request(self, request):
+        if settings.SECURE_SSL_REDIRECT and ...:
+            return redirect(...)
+```
+
+</td>
+<td>
+
+```python
+class CsrfViewMiddleware(MiddlewareMixin):
+    def _check_referer(self, request):
+        referer = request.META.get('HTTP_REFERER')
+        if referer is None:
+            raise RejectRequest('No referer')
+        good_referer = request.get_host()
+        if not same_origin(referer, good_referer):
+            raise RejectRequest('Bad referer')
+```
+
+**87% compression**: only the CSRF referer logic survives. Session, Common, and Security middleware are irrelevant to the task and get dropped.
+
+</td>
+</tr>
+</table>
+
+```bash
+$ cat django/middleware.py | squeez "Fix the CSRF validation bug in the referer check"
+```
+
+<details>
+<summary><b>Another example: filtering git log</b></summary>
+
+Task: *"Find the commit that changed the authentication timeout"*
+
+**Before** (25 commits of noise):
+```
+a1b2c3d Fix typo in README
+e4f5g6h Update CI pipeline
+i7j8k9l Bump version to 2.3.1
+m0n1o2p Add docker-compose.yml
+q3r4s5t Refactor database migrations
+u6v7w8x Change auth timeout from 30m to 1h
+y9z0a1b Fix linting warnings
+c2d3e4f Update dependencies
+...
+```
+
+**After** (the one commit that matters):
+```
+u6v7w8x Change auth timeout from 30m to 1h
+```
+
+```bash
+$ git log --oneline -25 | squeez "find the commit that changed the authentication timeout"
+```
+
+</details>
+
 ## Installation
 
 ```bash
@@ -52,15 +167,41 @@ The model returns JSON: `{"relevant_lines": ["line1", "line2", ...]}` and the `e
 
 ### Configuration
 
-By default, squeez connects to a vLLM server. Configure with:
+The backend is resolved in order of precedence: CLI args > env vars > config file (`squeez.yaml` or `configs/default.yaml`).
+
+```yaml
+# squeez.yaml
+model_path: "./output/squeez_qwen" # local transformers
+# base_url: "https://api.groq.com/openai/v1" # or remote API
+```
 
 ```bash
-# Environment variable
-export TOE_BASE_URL=http://localhost:8000/v1
+# Or via environment variables
+export SQUEEZ_MODEL_PATH=./output/squeez_qwen
+export SQUEEZ_BASE_URL=https://api.groq.com/openai/v1
+export SQUEEZ_API_KEY=gsk_...
+```
+
+### Use with Claude Code
+
+Add this to your project's `CLAUDE.md` (or `~/.claude/CLAUDE.md` for global):
 
-# Or pass directly
-extractor = ToolOutputExtractor(base_url="http://my-server:8000/v1")
 ```
+Whenever you invoke a shell command, pipe its output through `squeez` and state exactly what you want to know.
+
+Examples:
+- `bun test 2>&1 | squeez "did the tests pass?"`
+- `git log --oneline -50 | squeez "find the commit that broke CSRF"`
+- `cat src/auth/middleware.py | squeez "find the referer validation logic"`
+
+Do NOT use squeez when:
+- You need exact, uncompressed output (e.g. writing a patch)
+- The command is interactive
+```
+
+This saves context tokens by replacing verbose tool output with only the relevant lines.
+
+This also works with other coding agents (Codex CLI, OpenCode, etc.) via their equivalent instruction files.
 
 ## Training
 
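
The hunk header above notes that the model returns JSON of the form `{"relevant_lines": ["line1", "line2", ...]}`. How Squeez post-processes that response is not shown in this diff, so the following is only a sketch under assumptions: the helper name `filter_relevant` is hypothetical, matching is done on whitespace-stripped lines, and falling back to the unfiltered output on malformed JSON is a design choice made for this illustration.

```python
import json


def filter_relevant(tool_output: str, model_response: str) -> str:
    """Keep only the lines of `tool_output` that the model marked as relevant."""
    try:
        payload = json.loads(model_response)
    except json.JSONDecodeError:
        # Assumption: if the response is not valid JSON, return the input unchanged.
        return tool_output
    relevant = payload.get("relevant_lines", []) if isinstance(payload, dict) else []
    wanted = {line.strip() for line in relevant if isinstance(line, str)}
    # Preserve the original order and text of the matching lines.
    kept = [line for line in tool_output.splitlines() if line.strip() in wanted]
    return "\n".join(kept)
```

Filtering the original text, rather than echoing the model's strings, means a line the model paraphrased or invented is dropped instead of being passed on to the agent.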

assets/squeez_mascot.png

581 KB

configs/default.yaml

Lines changed: 5 additions & 0 deletions
@@ -1,3 +1,8 @@
+# Inference
+backend: "transformers" # "transformers" or "vllm"
+model_path: "./output/squeez_qwen"
+base_url: null # e.g. "http://localhost:8000/v1"
+
 # Training hyperparameters
 model: "Qwen/Qwen3.5-2B"
 max_length: 32768
