AI-powered repository reverse engineering platform. Paste a GitHub URL and get structure, dependencies, code quality analysis, and generated documentation — all from static analysis, with an optional AI layer for natural-language Q&A.
Built as a static-analysis platform with an optional AI interface: everything through Phase 6 below works with zero LLM API keys.
| Phase | What it does | Requires LLM? |
|---|---|---|
| 1. Repository Ingestion | Validate URL, clone repo, extract metadata | No |
| 2. Static Parsing | Tree-sitter extraction of files/classes/functions/imports (Python, JS, TS) | No |
| 3. Dependency Graph | File-level import resolution, cycle detection | No |
| 4. Code Quality Analysis | Complexity, large functions, deep nesting, god classes, duplicates, unused imports, circular deps | No |
| 5. Structural Documentation | Auto-generated README/Architecture/FolderStructure/CodeQuality docs from stored data | No |
| 6. Frontend / Dashboard | Next.js UI — submit a repo, watch live analysis, browse results | No |
| 7. AI Layer | Embeddings + vector search + chat assistant | Yes (optional — mock LLM works without one) |
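To make Phase 3's cycle detection concrete, here is a minimal sketch of finding a circular import in a file-level dependency graph. It is not the project's implementation: it uses a plain depth-first search from the standard library rather than NetworkX, and the function name `find_import_cycle` and the file names are made up for illustration.

```python
from typing import Dict, List, Optional

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished


def find_import_cycle(imports: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one import cycle as a list of files, or None if there is none.

    `imports` maps each file to the files it imports (hypothetical input shape).
    """
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for target in imports.get(node, []):
            state = color.get(target, WHITE)
            if state == GRAY:
                # Back edge: the cycle runs from `target` to the current node.
                return path[path.index(target):] + [target]
            if state == WHITE:
                found = visit(target)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in imports:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


cyclic = {"a.py": ["b.py"], "b.py": ["c.py"], "c.py": ["a.py"], "d.py": ["a.py"]}
acyclic = {"a.py": ["b.py"], "b.py": ["c.py"], "c.py": []}
cycle = find_import_cycle(cyclic)
```

For the sample graph, `cycle` starts and ends at the same file, which is a convenient shape for reporting a circular dependency to the user.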
Install these before doing anything else:
| Tool | Version | Check with |
|---|---|---|
| Python | 3.11+ | python --version |
| Node.js | 20+ | node --version |
| Docker Desktop | any recent | docker --version |
| Git | any recent | git --version |
Docker Desktop must be running before you start Postgres/Redis.
git clone <your-repo-url> codeatlas
cd codeatlas

Windows (PowerShell):
python -m venv venv
.\venv\Scripts\Activate.ps1

If PowerShell blocks the activation script, run this once first:
Set-ExecutionPolicy -Scope CurrentUser RemoteSigned

macOS / Linux:
python3 -m venv venv
source venv/bin/activate

pip install -r backend/requirements.txt

This installs FastAPI, SQLAlchemy, Tree-sitter (Python/JS/TS grammars), GitPython, NetworkX, sentence-transformers (for embeddings), and pgvector.
Heads up: sentence-transformers pulls in torch, which is a large download (several hundred MB) — this step can take a few minutes on first install. Let it finish; it isn't stuck.
cp backend/.env.example backend/.env

The defaults in .env.example already match the Docker Compose setup below, so no editing is required for local development.
docker compose up -d
docker compose ps # confirm both containers show "running"

This uses the pgvector/pgvector:pg16 image (Postgres 16 with the pgvector extension pre-installed) and redis:7-alpine.
python scripts/create_tables.py

This drops and recreates all tables: safe for local development, but destructive. It also enables the vector Postgres extension automatically.
uvicorn backend.main:app --port 8000

Visit http://localhost:8000/health. You should see "database_connected": true.
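If you would rather check the health endpoint from a script than a browser, the sketch below shows one way to do it with the standard library. Only the URL and the `database_connected` field come from this README; the function names are hypothetical, and any other fields in the response are ignored because their shape is not documented here.

```python
import json
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8000/health"  # URL from the step above


def database_connected(payload: dict) -> bool:
    """True only when the payload explicitly reports a working database."""
    return payload.get("database_connected") is True


def check_health(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Fetch the health endpoint; treat an unreachable server as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return database_connected(json.load(response))
    except (urllib.error.URLError, ValueError):
        # Server not running yet, or the body was not valid JSON.
        return False
```

Calling `check_health()` while the backend is running should return `True`; a `False` result usually means either uvicorn is not up or Postgres is not reachable.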
Open a second terminal (leave the backend running in the first).
cd frontend
npm install
cp .env.local.example .env.local # if present, otherwise create it manually

frontend/.env.local should contain:
NEXT_PUBLIC_API_BASE_URL=http://localhost:8000/api/v1
npm audit

This should report 0 vulnerabilities. The overrides field in package.json pins postcss to a patched version regardless of what Next.js bundles internally. If npm audit ever shows something new, check it before ignoring it: this project has already been through one CVSS 10.0 Next.js RCE advisory, so don't assume npm audit fix --force is safe without first reading what it actually changes.
npm run dev

Visit http://localhost:3000.
- Open http://localhost:3000
- Paste a GitHub repository URL (e.g. https://github.com/octocat/Hello-World)
- Click Analyze
- Watch it move through parsing → dependency graph → quality analysis → documentation generation, live
- Browse the Structure/Dependency/Quality stat cards and the four generated documents (README, Architecture, Folder Structure, Code Quality Report)
Re-submitting the same URL is safe (idempotent) — it re-runs the full analysis pipeline against the existing repository record rather than cloning a duplicate.
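The idempotent behaviour described above boils down to a get-or-create lookup keyed on the repository URL. The sketch below illustrates the idea with an in-memory store. Everything in it is hypothetical: `RepositoryStore`, `RepositoryRecord`, and the URL normalization rules are assumptions for illustration, not the backend's actual models or logic.

```python
from dataclasses import dataclass, field
from typing import Dict


def normalize(url: str) -> str:
    """Assumed normalization: trim whitespace, trailing slash, and '.git'."""
    cleaned = url.strip().rstrip("/")
    if cleaned.endswith(".git"):
        cleaned = cleaned[: -len(".git")]
    return cleaned


@dataclass
class RepositoryRecord:
    url: str
    analysis_runs: int = 0


@dataclass
class RepositoryStore:
    records: Dict[str, RepositoryRecord] = field(default_factory=dict)

    def submit(self, url: str) -> RepositoryRecord:
        """Reuse the existing record for a known URL, then re-run analysis."""
        key = normalize(url)
        record = self.records.get(key)
        if record is None:
            record = RepositoryRecord(url=key)
            self.records[key] = record
        record.analysis_runs += 1  # stand-in for re-running the full pipeline
        return record


store = RepositoryStore()
first = store.submit("https://github.com/octocat/Hello-World")
second = store.submit("https://github.com/octocat/Hello-World.git/")
```

Both submissions resolve to the same record, so the second call re-analyzes rather than creating a duplicate.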
codeatlas/
├── backend/
│ ├── main.py # FastAPI entrypoint
│ ├── core/ # Settings, security
│ ├── database/ # SQLAlchemy engine/session/base
│ ├── models/ # ORM models (one per table)
│ ├── schemas/ # Pydantic request/response models
│ ├── api/v1/ # FastAPI routers
│ ├── services/ # Business logic (parsing, dependency graph, quality, docs, chat)
│ ├── parsers/ # Tree-sitter based language parsers
│ ├── analyzers/ # Quality analyzers (complexity, dead code, duplicates, cycles)
│ ├── ai/ # Embeddings, vector store, retriever, LLM client, prompts
│ └── utils/ # Git operations
├── frontend/
│ ├── app/ # Next.js App Router pages
│ ├── components/ # React components
│ ├── lib/ # API client
│ └── types/ # Shared TypeScript types
├── docker-compose.yml # Postgres (pgvector) + Redis
├── scripts/ # One-off/manual test scripts
└── storage/repos/ # Cloned repositories (gitignored)
All scripts live in scripts/ and are PowerShell (.ps1) — Windows was the primary dev environment for this project. macOS/Linux users can adapt them to bash easily; the underlying Python commands are the same either way.
| Script | What it tests |
|---|---|
| test_backend_skeleton.ps1 | Config, DB session, FastAPI app boot |
| test_models.ps1 | ORM model definitions (SQLite smoke test, no Postgres needed) |
| test_git_utils.ps1 | URL validation, cloning, Windows-safe cleanup |
| test_repository_endpoint.ps1 | Full POST/GET repository API against real Postgres |
| test_parsers.ps1 | Tree-sitter parser correctness (no DB needed) |
| test_parsing_pipeline.ps1 | Full parse pipeline against fixture files |
| test_dependency_graph.ps1 | Import resolution and cycle detection |
| test_quality_analysis.ps1 | All 8 quality analyzers against a fixture designed to trigger each one |
| test_documentation.ps1 | Document generation and persistence |
| test_ai_pipeline.ps1 | Real embeddings + retrieval + mock LLM |
| test_frontend.ps1 | Frontend dev server smoke test |
ModuleNotFoundError for any package — you likely edited requirements.txt without reinstalling. Run pip install -r backend/requirements.txt again.
ImportError: cannot import name 'runtime_version' from 'google.protobuf' — caused by a stale/conflicting TensorFlow install on your system interfering with sentence-transformers. This project only uses the PyTorch backend; backend/ai/embeddings.py sets USE_TF=0 before importing to avoid this entirely. If you still see it, confirm that environment variable is set at the very top of that file, before any other imports.
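The ordering requirement for that fix looks roughly like the sketch below. It is an illustration of the pattern, not a copy of backend/ai/embeddings.py; the `try`/`except` is only there so the snippet also runs on a machine without sentence-transformers installed.

```python
import os

# Must run before anything imports transformers or sentence_transformers,
# otherwise the TensorFlow backend may already have been probed.
os.environ["USE_TF"] = "0"

try:
    from sentence_transformers import SentenceTransformer  # noqa: E402
except ImportError:
    # Library not installed in this environment; the ordering is the point.
    SentenceTransformer = None
```

If another module imports sentence-transformers earlier in the process, setting the variable here is too late, which is why it belongs at the very top of the file.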
ForeignKeyViolationError when re-parsing a repository — fixed as of this version; _clear_previous_analysis in parsing_service.py now clears Relationship and CodeEmbedding rows (which reference File.id) before deleting files. If you see this again, check that function includes both deletes.
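To see why the delete order matters, the sketch below reproduces the failure and the fix in miniature. It uses an in-memory SQLite database from the standard library as a stand-in for Postgres, and the table and column names are simplified assumptions rather than the project's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript(
    """
    CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT);
    CREATE TABLE relationships (
        id INTEGER PRIMARY KEY,
        file_id INTEGER NOT NULL REFERENCES files(id)
    );
    CREATE TABLE code_embeddings (
        id INTEGER PRIMARY KEY,
        file_id INTEGER NOT NULL REFERENCES files(id)
    );
    INSERT INTO files (id, path) VALUES (1, 'app.py');
    INSERT INTO relationships (file_id) VALUES (1);
    INSERT INTO code_embeddings (file_id) VALUES (1);
    """
)


def clear_children_then_files(db: sqlite3.Connection) -> None:
    """Delete rows that reference files first, then the files themselves."""
    db.execute("DELETE FROM relationships")
    db.execute("DELETE FROM code_embeddings")
    db.execute("DELETE FROM files")


# Wrong order: deleting parents while children still reference them.
try:
    conn.execute("DELETE FROM files")
    wrong_order_failed = False
except sqlite3.IntegrityError:
    wrong_order_failed = True
    conn.rollback()

# Right order: children first.
clear_children_then_files(conn)
conn.commit()
remaining_files = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
```

The first delete fails with an integrity error, while the children-first version clears everything, which mirrors what the fixed function is described as doing.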
PowerShell: Start-Process -FilePath "npm" fails with "not a valid Win32 application" — npm on Windows is a .cmd wrapper, not a real executable. Scripts in this repo invoke it via cmd.exe /c npm ... instead of calling npm directly.
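The same wrapper issue can come up if you launch npm from Python instead of PowerShell. The helper below is a hypothetical sketch (`npm_command` is not part of this repo) that applies the same `cmd.exe /c npm ...` workaround on Windows and calls npm directly elsewhere.

```python
import sys
from typing import List, Sequence


def npm_command(args: Sequence[str], platform: str = sys.platform) -> List[str]:
    """Build an argv list for npm that works on Windows and POSIX systems."""
    if platform.startswith("win"):
        # npm is a .cmd wrapper on Windows, so let cmd.exe resolve it.
        return ["cmd.exe", "/c", "npm", *args]
    return ["npm", *args]


windows_argv = npm_command(["run", "dev"], platform="win32")
linux_argv = npm_command(["run", "dev"], platform="linux")
```

The resulting list can be passed to `subprocess.Popen(..., cwd="frontend")` to start the dev server.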
npm audit reports a postcss vulnerability nested inside next — do not run npm audit fix --force (it will suggest downgrading next to a version from 2020). Use the overrides field in package.json instead, which forces every copy of the dependency in the tree to the patched version.
Server "crashes" with no visible error when started via a background PowerShell process — Start-Process -WindowStyle Hidden swallows stderr by default. Always redirect it: -RedirectStandardError "err.log", and read that file if something fails silently.
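The same principle applies when starting a background process from Python: send stderr to a file so a silent failure leaves evidence. The sketch below is a self-contained demonstration with a deliberately failing child process; `run_with_error_log` is a hypothetical helper, and the temporary directory stands in for wherever you keep logs.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Sequence


def run_with_error_log(command: Sequence[str], log_path: Path) -> int:
    """Run a command with stdout discarded and stderr captured to a file."""
    with open(log_path, "w", encoding="utf-8") as log:
        completed = subprocess.run(
            list(command), stdout=subprocess.DEVNULL, stderr=log
        )
    return completed.returncode


log_file = Path(tempfile.mkdtemp()) / "err.log"
failing_command = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('boom'); sys.exit(1)",
]
exit_code = run_with_error_log(failing_command, log_file)
captured = log_file.read_text(encoding="utf-8")
```

After the run, `captured` holds whatever the child wrote to stderr, so a non-zero `exit_code` can be explained by reading the log.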
First embedding/chat request hangs for several minutes — expected on first run only. sentence-transformers downloads the ~1.3GB BGE model from Hugging Face the first time embed_documents/embed_query is called. Subsequent calls are fast (model is cached).
- Phase 7 (AI Layer): real embeddings and retrieval work today, but the LLM client (backend/ai/llm_client.py) only has a MockLLMClient. It proves the RAG pipeline is wired correctly without needing an API key, but doesn't generate real natural-language answers. Swapping in a real provider means writing one new class implementing BaseLLMClient.generate().
- Alembic migrations: schema changes currently require dropping and recreating all tables (scripts/create_tables.py), which is fine for local development but not how a production deployment would handle schema evolution.
- Route/endpoint extraction, DB model extraction: the PRD's Authentication.md/Database.md/API endpoint docs need framework-specific route detection (e.g. Flask/Express route decorators) that hasn't been built. Only the four documents backed by data we actually extract (README, Architecture, FolderStructure, CodeQuality) are generated.
- Celery workers: analysis currently runs synchronously per API call rather than as background jobs, so very large repositories will block the request until complete.
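As a rough picture of what adding a real provider could involve, the sketch below defines a stand-in `BaseLLMClient` and one subclass. The `generate(prompt) -> str` signature is an assumption, since the real interface in backend/ai/llm_client.py may take different arguments, and `CallableLLMClient` is a hypothetical name that wraps any function you supply instead of a specific vendor SDK.

```python
from abc import ABC, abstractmethod
from typing import Callable


class BaseLLMClient(ABC):
    """Stand-in for the project's interface; the real signature may differ."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return the model's answer for a fully assembled prompt."""


class CallableLLMClient(BaseLLMClient):
    """Adapter that delegates to any `prompt -> text` function."""

    def __init__(self, complete: Callable[[str], str]) -> None:
        self._complete = complete

    def generate(self, prompt: str) -> str:
        return self._complete(prompt).strip()


def fake_provider(prompt: str) -> str:
    # Placeholder for a real provider call (hypothetical).
    return f"  answer for: {prompt}  "


client = CallableLLMClient(fake_provider)
answer = client.generate("What does main.py do?")
```

Keeping the provider behind a single method means the retrieval code does not need to change when the mock is replaced.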