# ChromaDB Code Search UI

A Flask web application for searching, browsing, and visualizing Python code using semantic embeddings and ChromaDB. Built as a companion app for the **Context Engineering with Chroma** course.

## Purpose

This app serves two roles in the course:

1. **Teaching material** — Students ingest this codebase into ChromaDB using AST-based chunking pipelines they build in the labs. The well-structured Python code (models, services, routes, utils) makes it an ideal target for practicing chunking strategies.
2. **Interactive tool** — Once ingested, students launch this app to explore their collections, run searches, and see how their chunking and metadata decisions affect retrieval quality.

## Features

- **Semantic search** — Natural language queries over code using OpenAI embeddings (`text-embedding-3-small`)
- **Regex search** — Structural pattern matching across the codebase with analysis and explanation
- **Collection explorer** — Paginated chunk browser with filters by file path, chunk type, and symbol name
- **Code statistics** — Construct detection, size distributions, and symbol rankings
- **Embedding visualizer** — 2D PCA projections of chunk embeddings to explore clustering
- **Smart suggestions** — Context-aware query suggestions based on collection metadata
- **Query history and bookmarks** — Persistent search history with color-coded bookmarks
- **Interactive tutorials** — Guided tours with spotlight overlays for onboarding

## Project Structure

```
app/
├── app.py                        # Flask application factory and entry point
├── config.py                     # Dataclass-based configuration (env vars, defaults)
├── requirements.txt              # Python dependencies
├── .env.example                  # Environment variable template
│
├── models/                       # Data models
│   ├── chunk.py                  # Chunk, ChunkMetadata, ChunkType
│   ├── search_result.py          # SearchResult, SearchResultSet, ResultFormatter
│   └── query_history.py          # QueryRecord, Bookmark, HistoryManager
│
├── routes/                       # Flask blueprints (one per feature)
│   ├── search.py                 # Semantic and regex search endpoints
│   ├── collections.py            # Collection CRUD and ingestion triggers
│   ├── explorer.py               # Paginated chunk browsing with filters
│   ├── similarity.py             # Pairwise similarity matrix computation
│   ├── history.py                # Query history and bookmarks API
│   ├── regex_tester.py           # Regex testing and analysis
│   ├── suggestions.py            # Smart query suggestions
│   ├── statistics.py             # Code metrics and analytics
│   ├── visualizer.py             # 2D embedding visualization
│   └── tutorial.py               # Interactive guided tours
│
├── services/                     # Business logic layer
│   ├── chroma_client.py          # ChromaDB connection manager (singleton)
│   ├── search_service.py         # Search strategies (semantic + regex)
│   ├── collection_service.py     # Collection management and stats
│   ├── ingestion_service.py      # AST parsing and code chunking pipeline
│   ├── similarity_service.py     # Vector similarity computations
│   ├── statistics_service.py     # Code metrics and analysis
│   ├── visualization_service.py  # PCA and random projection reducers
│   ├── suggestion_service.py     # Multi-strategy suggestion generator
│   └── tutorial_service.py       # Tutorial builder and manager
│
├── utils/                        # Utilities and helpers
│   ├── validators.py             # Input validation (queries, paths, regex)
│   ├── regex_engine.py           # Regex analysis and human-readable explanation
│   ├── code_parser.py            # Lightweight regex-based Python parser
│   ├── text_splitter.py          # Token-based text splitting
│   └── formatters.py             # Display formatting (scores, code, paths)
│
├── templates/                    # Jinja2 HTML templates
│   ├── base.html                 # Base layout with navbar and tutorial engine
│   ├── index.html                # Dashboard (collection cards)
│   ├── search.html               # Search interface
│   ├── explorer.html             # Chunk browser
│   └── collection.html           # Collection detail page
│
└── static/
    └── css/style.css             # Custom styles
```

## Design Patterns

The codebase intentionally demonstrates several software design patterns, making it a richer target for code search exercises:

- **Strategy** — `SearchStrategy`, `SimilarityComputer`, `DimensionReducer`, `SuggestionStrategy`
- **Singleton** — `ChromaClientManager` for a single DB connection
- **Factory** — `get_reducer()`, `get_similarity_computer()`, `get_tutorial_builder()`
- **Builder** — Tutorial builders (`DashboardTutorialBuilder`, `CollectionTutorialBuilder`)
- **Facade** — `SearchService`, `SuggestionService`, `StatisticsService` wrapping multiple strategies

## Setup

1. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

2. Configure environment variables (copy `.env.example` to `.env`):
   ```
   OPENAI_API_KEY=sk-your-key-here
   CHROMA_PERSIST_DIR=./chroma_data
   ```

3. Run the app:
   ```bash
   python app.py
   ```

## Dependencies

| Package | Purpose |
|---------|---------|
| flask | Web framework |
| chromadb | Vector database |
| openai | Embedding API |
| tiktoken | Token counting |
| tree-sitter | AST parsing |
| tree-sitter-python | Python grammar for tree-sitter |
| python-dotenv | Environment variable management |
| pathspec | `.gitignore` pattern matching |
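
To make the Strategy and Factory entries in the Design Patterns section more concrete, here is a minimal sketch of how a `SimilarityComputer` strategy and a `get_similarity_computer()` factory could fit together. Only those two names come from this README; the concrete classes (`CosineSimilarity`, `DotProductSimilarity`), the `compute()` method, the metric names, and the `pairwise_matrix()` helper are assumptions for illustration and may not match the actual code in `services/similarity_service.py`.

```python
import math
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence, Type


class SimilarityComputer(ABC):
    """Strategy interface: one way of scoring two embedding vectors."""

    @abstractmethod
    def compute(self, a: Sequence[float], b: Sequence[float]) -> float:
        ...


class CosineSimilarity(SimilarityComputer):  # hypothetical concrete strategy
    def compute(self, a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        # A zero vector has no direction, so report zero similarity.
        return dot / norm if norm else 0.0


class DotProductSimilarity(SimilarityComputer):  # hypothetical concrete strategy
    def compute(self, a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * y for x, y in zip(a, b))


# Registry used by the factory; the metric names are assumptions.
_COMPUTERS: Dict[str, Type[SimilarityComputer]] = {
    "cosine": CosineSimilarity,
    "dot": DotProductSimilarity,
}


def get_similarity_computer(name: str = "cosine") -> SimilarityComputer:
    """Factory: map a metric name to a strategy instance."""
    try:
        return _COMPUTERS[name]()
    except KeyError:
        raise ValueError(f"unknown similarity metric: {name!r}") from None


def pairwise_matrix(
    vectors: Sequence[Sequence[float]], computer: SimilarityComputer
) -> List[List[float]]:
    """Score every vector against every other using the chosen strategy."""
    return [[computer.compute(u, v) for v in vectors] for u in vectors]
```

Callers depend only on the `SimilarityComputer` interface, so swapping metrics is a one-argument change, for example `pairwise_matrix(embeddings, get_similarity_computer("dot"))`.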
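
The project structure describes `config.py` as dataclass-based configuration driven by environment variables, and the Setup section names two of those variables. The sketch below shows one way such a config object could be built. The class name `AppConfig`, its field names, the `from_env()` method, and the choice of `./chroma_data` as the fallback are assumptions, not the actual contents of `config.py`. Loading the `.env` file itself (for instance with python-dotenv) is left out, so the sketch reads only from the process environment or from a mapping passed in.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed fallback, taken from the sample value shown in the Setup section.
DEFAULT_PERSIST_DIR = "./chroma_data"


@dataclass(frozen=True)
class AppConfig:  # hypothetical name
    openai_api_key: str
    chroma_persist_dir: str = DEFAULT_PERSIST_DIR
    embedding_model: str = "text-embedding-3-small"

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "AppConfig":
        """Build a config from environment variables, failing early if the key is missing."""
        env = os.environ if env is None else env
        api_key = env.get("OPENAI_API_KEY", "")
        if not api_key:
            raise ValueError("OPENAI_API_KEY is not set")
        return cls(
            openai_api_key=api_key,
            chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", DEFAULT_PERSIST_DIR),
        )
```

Accepting an optional mapping keeps the loader easy to exercise without touching the real environment, for example `AppConfig.from_env({"OPENAI_API_KEY": "sk-your-key-here"})`.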