KRLabsOrg
diff --git a/.github/workflows/python-tests.yml b/.github/workflows/python-tests.yml
Lines changed: 46 additions & 0 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 16 additions & 0 deletions
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 105 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 146 additions & 5 deletions
diff --git a/assets/squeez_mascot.png b/assets/squeez_mascot.png
581 KB
diff --git a/configs/default.yaml b/configs/default.yaml
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,46 @@
+name: Python Tests
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v4
+
+    - name: Set up Python
+      uses: actions/setup-python@v5
+      with:
+        python-version: "3.12"
+
+    - name: Install ruff
+      run: pip install ruff
+
+    - name: Lint
+      run: |
+        ruff format --check --diff squeez/ tests/
+        ruff check squeez/ tests/
+
+  test:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v4
+
+    - name: Set up Python
+      uses: actions/setup-python@v5
+      with:
+        python-version: "3.12"
+        cache: 'pip'
+
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install -e ".[dev]"
+
+    - name: Test with pytest
+      run: |
+        pytest tests/ -v
@@ -0,0 +1,16 @@
+# Changelog
+
+All notable changes to Squeez are documented here.
+
+## [0.1.0] - 2026-03-07
+
+### Added
+- Initial release
+- CLI tool: `cat output.txt | squeez "task description"`
+- Python API: `ToolOutputExtractor` with vLLM and transformers backends
+- Config file support (`squeez.yaml`, env vars, CLI args)
+- LoRA fine-tuning pipeline for Qwen 3.5 2B
+- SFT dataset with proper label masking
+- Evaluation metrics: line-level F1, ROUGE-L, compression ratio
+- Full data generation pipeline from SWE-bench
+- Dataset download script for HuggingFace
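The changelog above lists line-level F1 and compression ratio among the evaluation metrics, but this diff does not show how `evaluate.py` computes them. The following is a minimal sketch under stated assumptions, not the project's implementation: the function names `line_f1` and `compression_ratio` are hypothetical, lines are compared after stripping whitespace, and compression is measured in characters rather than tokens.

```python
from collections import Counter


def line_f1(predicted: list[str], reference: list[str]) -> float:
    """Line-level F1 over stripped, non-empty lines (multiset overlap).

    Assumption: two lines match when they are equal after stripping
    surrounding whitespace. The real metric may normalize differently.
    """
    pred = Counter(line.strip() for line in predicted if line.strip())
    ref = Counter(line.strip() for line in reference if line.strip())
    if not pred or not ref:
        # Both empty counts as a perfect match; only one empty is a total miss.
        return 1.0 if not pred and not ref else 0.0
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def compression_ratio(original: str, extracted: str) -> float:
    """Fraction of the original removed, measured in characters (assumption)."""
    if not original:
        return 0.0
    return 1.0 - len(extracted) / len(original)
```

With this definition, a prediction that shares one of its two lines with a two-line reference has precision and recall of 0.5 each, so its F1 is 0.5.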
@@ -0,0 +1,105 @@
+# Contributing to Squeez
+
+Thanks for your interest in contributing! This guide will get you up and running.
+
+## Setup
+
+```bash
+# Clone the repo
+git clone https://github.com/KRLabsOrg/squeez.git
+cd squeez
+
+# Create a virtual environment (Python 3.10+)
+python -m venv .venv
+source .venv/bin/activate
+
+# Install in development mode
+pip install -e ".[dev]"
+```
+
+## Code style
+
+We use [Ruff](https://docs.astral.sh/ruff/) for both linting and formatting, with a line length of 100.
+
+```bash
+# Check formatting
+ruff format --check squeez/ tests/
+
+# Auto-format
+ruff format squeez/ tests/
+
+# Lint
+ruff check squeez/ tests/
+
+# Lint with auto-fix
+ruff check --fix squeez/ tests/
+```
+
+Key conventions:
+- Use modern type hints (`list[str]`, `dict[str, Any]`, `str | None`)
+- Add docstrings to public classes and methods
+- Use `logging` instead of `print()` for runtime messages
+- Use `pathlib.Path` instead of `os.path`
+
+## Running tests
+
+```bash
+pytest tests/ -v
+```
+
+## Project structure
+
+```
+squeez/
+  inference/         # Runtime extractor (CLI + Python API)
+    extractor.py      # ToolOutputExtractor with vLLM/transformers backends
+  training/          # Model training pipeline
+    train.py          # LoRA fine-tuning script
+    dataset.py        # SFT dataset with label masking
+    evaluate.py       # Evaluation metrics
+  data/              # Data generation pipeline
+    pipeline.py       # Main pipeline orchestrator
+    config.py         # Configuration and system prompt
+    swebench_loader.py
+    source_fetcher.py
+    tool_call_generator.py
+    tool_call_executor.py
+    auto_labeler.py
+    llm_distiller.py
+    sample_assembler.py
+    validator.py
+configs/             # YAML configuration files
+scripts/             # Utility scripts
+tests/               # Pytest test suite
+```
+
+## Making changes
+
+1. **Create a branch** from `main`:
+   ```bash
+   git checkout -b my-feature
+   ```
+
+2. **Make your changes.** Keep PRs focused — one feature or fix per PR.
+
+3. **Run lint and tests** before committing:
+   ```bash
+   ruff format squeez/ tests/
+   ruff check squeez/ tests/
+   pytest tests/ -v
+   ```
+
+4. **Open a pull request** against `main`. CI will run lint and tests automatically.
+
+## What to work on
+
+Good areas for contribution:
+- **New tool types** — add support for more tool output formats in the data generation pipeline
+- **Model backends** — add new inference backends (e.g. GGUF, TensorRT)
+- **Evaluation** — improve metrics or add new evaluation methods
+- **Tests** — increase coverage
+- **Bug fixes** — check [open issues](https://github.com/KRLabsOrg/squeez/issues)
+
+## License
+
+By contributing, you agree that your contributions will be licensed under the Apache 2.0 License.
@@ -1,5 +1,10 @@
 # Squeez
 
+<p align="center">
+  <img src="https://github.com/KRLabsOrg/squeez/blob/main/assets/squeez_mascot.png?raw=true" alt="Squeez Logo" width="300"/>
+  <br><em>Squeeze out the juice, leave the pulp behind.</em>
+</p>
+
 Squeeze verbose LLM agent tool output down to only the relevant lines.
 
 [![PyPI](https://img.shields.io/pypi/v/squeez)](https://pypi.org/project/squeez/)
@@ -12,6 +17,116 @@ LLM coding agents waste **80-95% of context tokens** on irrelevant tool output.
 
 Squeez trains a small (2-3B) generative model to identify and extract only the lines that matter for the task at hand — compressing tool output by ~86% on average.
 
+## Example
+
+Task: *"Fix the CSRF validation bug in the referer check"*
+
+<table>
+<tr>
+<th>Before — 42 lines, ~1,200 tokens</th>
+<th>After — 8 lines, ~150 tokens</th>
+</tr>
+<tr>
+<td>
+
+```python
+class CsrfViewMiddleware(MiddlewareMixin):
+    def _check_referer(self, request):
+        referer = request.META.get('HTTP_REFERER')
+        if referer is None:
+            raise RejectRequest('No referer')
+        good_referer = request.get_host()
+        if not same_origin(referer, good_referer):
+            raise RejectRequest('Bad referer')
+
+    def process_view(self, request, callback, ...):
+        if getattr(request, 'csrf_processing_done', False):
+            return None
+        csrf_token = request.META.get('CSRF_COOKIE')
+        if csrf_token is None:
+            return self._reject(request, 'No CSRF cookie')
+        return self._accept(request)
+
+class SessionMiddleware(MiddlewareMixin):
+    def process_request(self, request):
+        session_key = request.COOKIES.get(...)
+        request.session = self.SessionStore(session_key)
+
+    def process_response(self, request, response):
+        if request.session.modified:
+            request.session.save()
+        return response
+
+class CommonMiddleware(MiddlewareMixin):
+    def process_request(self, request):
+        host = request.get_host()
+        if settings.PREPEND_WWW and ...:
+            return redirect(...)
+
+    def process_response(self, request, response):
+        if settings.USE_ETAGS:
+            response['ETag'] = hashlib.md5(...)
+        return response
+
+class SecurityMiddleware(MiddlewareMixin):
+    def process_request(self, request):
+        if settings.SECURE_SSL_REDIRECT and ...:
+            return redirect(...)
+```
+
+</td>
+<td>
+
+```python
+class CsrfViewMiddleware(MiddlewareMixin):
+    def _check_referer(self, request):
+        referer = request.META.get('HTTP_REFERER')
+        if referer is None:
+            raise RejectRequest('No referer')
+        good_referer = request.get_host()
+        if not same_origin(referer, good_referer):
+            raise RejectRequest('Bad referer')
+```
+
+**87% compression** (by token count): only the CSRF referer logic survives. Session, Common, and Security middleware are irrelevant to the task and are dropped.
+
+</td>
+</tr>
+</table>
+
+```bash
+$ cat django/middleware.py | squeez "Fix the CSRF validation bug in the referer check"
+```
+
+<details>
+<summary><b>Another example — filtering git log</b></summary>
+
+Task: *"Find the commit that changed the authentication timeout"*
+
+**Before** — 25 commits of noise:
+```
+a1b2c3d Fix typo in README
+e4f5g6h Update CI pipeline
+i7j8k9l Bump version to 2.3.1
+m0n1o2p Add docker-compose.yml
+q3r4s5t Refactor database migrations
+u6v7w8x Change auth timeout from 30m to 1h
+y9z0a1b Fix linting warnings
+c2d3e4f Update dependencies
+...
+```
+
+**After** — the one commit that matters:
+```
+u6v7w8x Change auth timeout from 30m to 1h
+```
+
+```bash
+$ git log --oneline -25 | squeez "find the commit that changed the authentication timeout"
+```
+
+</details>
+
 ## Installation
 
 ```bash
@@ -52,15 +167,41 @@ The model returns JSON: `{"relevant_lines": ["line1", "line2", ...]}` and the `e
 
 ### Configuration
 
-By default, squeez connects to a vLLM server. Configure with:
+The backend is resolved in order of precedence: CLI args > env vars > config file (`squeez.yaml` or `configs/default.yaml`).
+
+```yaml
+# squeez.yaml
+model_path: "./output/squeez_qwen"     # local transformers
+# base_url: "https://api.groq.com/openai/v1"  # or remote API
+```
 
 ```bash
-# Environment variable
-export TOE_BASE_URL=http://localhost:8000/v1
+# Or via environment variables
+export SQUEEZ_MODEL_PATH=./output/squeez_qwen
+export SQUEEZ_BASE_URL=https://api.groq.com/openai/v1
+export SQUEEZ_API_KEY=gsk_...
+```
+
+### Use with Claude Code
+
+Add this to your project's `CLAUDE.md` (or to `~/.claude/CLAUDE.md` to apply it globally):
 
-# Or pass directly
-extractor = ToolOutputExtractor(base_url="http://my-server:8000/v1")
 ```
+Whenever you invoke a shell command, pipe its output through `squeez` and state exactly what you want to know.
+
+Examples:
+- `bun test 2>&1 | squeez "did the tests pass?"`
+- `git log --oneline -50 | squeez "find the commit that broke CSRF"`
+- `cat src/auth/middleware.py | squeez "find the referer validation logic"`
+
+Do NOT use squeez when:
+- You need exact, uncompressed output (e.g. writing a patch)
+- The command is interactive
+```
+
+This saves context tokens by replacing verbose tool output with only the relevant lines.
+
+The same approach works with other coding agents (Codex CLI, OpenCode, etc.) via their equivalent instruction files.
 
 ## Training
 
 
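The README hunk header above notes that the model returns JSON of the form `{"relevant_lines": ["line1", "line2", ...]}`; the rest of that sentence is truncated in this view, so the extractor's actual post-processing is not shown. The sketch below illustrates one way such a response could be handled. The helper names `parse_relevant_lines` and `keep_original_order` are hypothetical, as is the choice to fall back to an empty list on malformed output.

```python
import json


def parse_relevant_lines(raw_response: str) -> list[str]:
    """Return the `relevant_lines` list, or [] if the response is malformed."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    lines = payload.get("relevant_lines", [])
    if not isinstance(lines, list):
        return []
    # Keep only string entries; ignore anything else the model emitted.
    return [line for line in lines if isinstance(line, str)]


def keep_original_order(tool_output: str, relevant: list[str]) -> str:
    """Keep lines of the tool output that the model marked relevant.

    Lines are matched after stripping whitespace and emitted in the order
    (and with the indentation) they have in the original output.
    """
    wanted = {line.strip() for line in relevant if line.strip()}
    kept = [line for line in tool_output.splitlines() if line.strip() in wanted]
    return "\n".join(kept)
```

Filtering the original output, rather than printing the model's strings directly, guards against the model paraphrasing or reordering lines.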
@@ -1,3 +1,8 @@
+# Inference
+backend: "transformers"  # "transformers" or "vllm"
+model_path: "./output/squeez_qwen"
+base_url: null  # e.g. "http://localhost:8000/v1"
+
 # Training hyperparameters
 model: "Qwen/Qwen3.5-2B"
 max_length: 32768
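The README hunk states that settings are resolved in the order CLI args > env vars > config file. The resolution code itself is not part of this diff, so the following is only a sketch of that precedence rule. The function `resolve_settings` and its signature are hypothetical; the environment variable names `SQUEEZ_MODEL_PATH`, `SQUEEZ_BASE_URL`, and `SQUEEZ_API_KEY` are taken from the README, while the assumption that an `api_key` field may also appear in the config file is mine.

```python
from collections.abc import Mapping
from typing import Any

# Fields with a documented SQUEEZ_* environment variable in the README.
FIELDS = ("model_path", "base_url", "api_key")


def resolve_settings(
    cli_args: Mapping[str, Any],
    env: Mapping[str, str],
    file_config: Mapping[str, Any],
) -> dict[str, Any]:
    """Resolve each field with precedence CLI args > env vars > config file."""
    resolved: dict[str, Any] = {}
    for field in FIELDS:
        env_key = f"SQUEEZ_{field.upper()}"
        if cli_args.get(field) is not None:
            resolved[field] = cli_args[field]       # highest priority
        elif env.get(env_key):
            resolved[field] = env[env_key]          # non-empty env var
        else:
            resolved[field] = file_config.get(field)  # may be None
    return resolved
```

In use, `env` would typically be `os.environ` and `file_config` the parsed contents of `squeez.yaml` or `configs/default.yaml`; passing them in as plain mappings keeps the function easy to test.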