Add README with decisions

abrookins · abrookins · commit 353d0314cfa6 · 2025-12-23T15:35:57.000-08:00
diff --git a/README.md b/README.md
@@ -0,0 +1,191 @@
+# sql-redis
+
+A proof-of-concept SQL-to-Redis translator that converts SQL SELECT statements into Redis `FT.SEARCH` and `FT.AGGREGATE` commands.
+
+## Status
+
+This is an **early POC** demonstrating feasibility, not a production-ready library. The goal is to explore design decisions and validate the approach before committing to a full implementation.
+
+## Quick Example
+
+```python
+from redis import Redis
+from sql_redis import Translator
+from sql_redis.schema import SchemaRegistry
+from sql_redis.executor import Executor
+
+client = Redis()
+registry = SchemaRegistry(client)
+registry.load_all()  # Loads index schemas from Redis
+
+executor = Executor(client, registry)
+
+# Simple query
+result = executor.execute("""
+    SELECT title, price 
+    FROM products 
+    WHERE category = 'electronics' AND price < 500
+    ORDER BY price ASC
+    LIMIT 10
+""")
+
+for row in result.rows:
+    print(row["title"], row["price"])
+
+# Vector search with params
+result = executor.execute("""
+    SELECT title, vector_distance(embedding, :vec) AS score
+    FROM products
+    LIMIT 5
+""", params={"vec": vector_bytes})
+```
+
+## Design Decisions
+
+### Why SQL instead of a pandas-like Python DSL?
+
+We considered several interface options:
+
+| Approach | Example | Trade-offs |
+|----------|---------|------------|
+| **SQL** | `SELECT * FROM products WHERE price > 100` | Universal, well-understood, tooling exists |
+| **Pandas-like** | `df[df.price > 100]` | Pythonic but limited to Python, no standard |
+| **Builder pattern** | `query.select("*").where(price__gt=100)` | Type-safe but verbose, learning curve |
+
+**We chose SQL because:**
+
+1. **Universality** — SQL is the lingua franca of data. Developers, analysts, and tools all speak it.
+2. **No new DSL to learn** — Users already know SQL. A pandas-like API requires learning our specific dialect.
+3. **Tooling compatibility** — SQL strings can be generated by ORMs, query builders, or AI assistants.
+4. **Clear mapping** — SQL semantics map reasonably well to RediSearch operations (SELECT→LOAD, WHERE→filter, GROUP BY→GROUPBY).
+
+The downside is losing Python's type checking and IDE support, but for a query interface, the universality trade-off is worth it.
+
+### Why sqlglot instead of writing a custom parser?
+
+**Options considered:**
+- **Custom parser** (regex, hand-rolled recursive descent)
+- **PLY/Lark** (parser generators)
+- **sqlglot** (production SQL parser)
+- **sqlparse** (tokenizer, not a full parser)
+
+**We chose sqlglot because:**
+
+1. **Battle-tested** — Used in production by companies like Tobiko (SQLMesh). Handles edge cases we'd miss.
+2. **Full AST** — Provides a complete abstract syntax tree, not just tokens. We can traverse and analyze queries properly.
+3. **Dialect support** — Handles SQL variations. Users can write MySQL-style or PostgreSQL-style queries.
+4. **Active maintenance** — Regular releases, responsive maintainers, good documentation.
+
+The alternative was writing a custom parser, which would be error-prone and time-consuming for a POC. sqlglot lets us focus on the translation logic rather than parsing edge cases.
+
+### Why schema-aware translation?
+
+Redis field types determine query syntax:
+
+| Field Type | Redis Syntax | Example |
+|------------|--------------|---------|
+| TEXT | `@field:term` | `@title:laptop` |
+| NUMERIC | `@field:[min max]` | `@price:[100 500]` |
+| TAG | `@field:{value}` | `@category:{books}` |
+
+**Without schema knowledge**, we can't translate `category = 'books'` correctly — it could be `@category:books` (TEXT search) or `@category:{books}` (TAG exact match).
+
+**Our approach:** The `SchemaRegistry` fetches index schemas via `FT.INFO` at startup. The translator uses this to generate correct syntax per field type.
+
+This adds a Redis round-trip at initialization but ensures correct query generation.
+
+### Architecture: Why this layered design?
+
+```
+SQL String
+    ↓
+┌─────────────────┐
+│   SQLParser     │  Parse SQL → ParsedQuery dataclass
+└────────┬────────┘
+         ↓
+┌─────────────────┐
+│ SchemaRegistry  │  Load field types from Redis
+└────────┬────────┘
+         ↓
+┌─────────────────┐
+│    Analyzer     │  Classify conditions by field type
+└────────┬────────┘
+         ↓
+┌─────────────────┐
+│  QueryBuilder   │  Generate RediSearch syntax per type
+└────────┬────────┘
+         ↓
+┌─────────────────┐
+│   Translator    │  Orchestrate pipeline, build command
+└────────┬────────┘
+         ↓
+┌─────────────────┐
+│    Executor     │  Execute command, parse results
+└────────┬────────┘
+         ↓
+QueryResult(rows, count)
+```
+
+**Why separate components?**
+
+1. **Testability** — Each layer has focused unit tests. 100% coverage is achievable because responsibilities are clear.
+2. **Single responsibility** — Parser doesn't know about Redis. QueryBuilder doesn't know about SQL. Changes are localized.
+3. **Extensibility** — Adding a new field type (e.g., GEO) means updating Analyzer and QueryBuilder, not rewriting everything.
+
+**Why not a single monolithic translator?**
+
+Early prototypes combined parsing and translation. This led to:
+- Tests that required Redis connections for simple SQL parsing tests
+- Difficulty testing edge cases in isolation
+- Tangled code that was hard to modify
+
+The layered approach emerged from TDD — writing tests first revealed natural boundaries.
+
+## What's Implemented
+
+- [x] Basic SELECT with field selection
+- [x] WHERE with TEXT, NUMERIC, TAG field types
+- [x] Comparison operators: `=`, `!=`, `<`, `<=`, `>`, `>=`, `BETWEEN`, `IN`
+- [x] Boolean operators: `AND`, `OR`
+- [x] Aggregations: `COUNT`, `SUM`, `AVG`, `MIN`, `MAX`
+- [x] `GROUP BY` with multiple aggregations
+- [x] `ORDER BY` with ASC/DESC
+- [x] `LIMIT` and `OFFSET` pagination
+- [x] Computed fields: `price * 0.9 AS discounted`
+- [x] Vector KNN search: `vector_distance(field, :param)`
+- [x] Hybrid search (filters + vector)
+- [x] Full-text search: `LIKE 'prefix%'` (prefix), `fulltext(field, 'terms')` function
+
+## What's Not Implemented (Yet...)
+
+- [ ] JOINs (Redis doesn't support cross-index joins)
+- [ ] Subqueries
+- [ ] HAVING clause
+- [ ] DISTINCT
+- [ ] GEO field queries
+- [ ] Index creation from SQL (CREATE INDEX)
+
+## Development
+
+```bash
+# Install dependencies
+uv sync --all-extras
+
+# Run tests (requires Docker for testcontainers)
+uv run pytest
+
+# Run with coverage
+uv run pytest --cov=sql_redis --cov-report=html
+```
+
+## Testing Philosophy
+
+This project uses strict TDD with 100% test coverage as a hard requirement. The approach:
+
+1. **Write failing tests first** — Define expected behavior before implementation
+2. **One test at a time** — Implement just enough to pass each test
+3. **No untestable code** — If we can't test it, we don't write it
+4. **Integration tests mirror raw Redis** — `test_sql_queries.py` verifies SQL produces same results as equivalent `FT.AGGREGATE` commands in `test_redis_queries.py`
+
+Coverage is enforced in CI. Pragmas (`# pragma: no cover`) are forbidden — if code can't be tested, it shouldn't exist.
+
diff --git a/pyproject.toml b/pyproject.toml
@@ -3,13 +3,16 @@ name = "sql-redis"
 version = "0.1.0"
 description = "SQL to Redis command translation utility"
 requires-python = ">=3.11"
-dependencies = []
+dependencies = [
+    "redis>=5.0.0",
+    "sqlglot>=26.0.0",
+]
 
 [project.optional-dependencies]
 dev = [
     "pytest>=8.0.0",
+    "pytest-cov>=4.0.0",
     "testcontainers[redis]>=4.0.0",
-    "redis>=5.0.0",
 ]
 
 [build-system]
diff --git a/uv.lock b/uv.lock