codeatlas

AI-powered repository reverse engineering platform. Paste a GitHub URL and get structure, dependencies, code quality analysis, and generated documentation — all from static analysis, with an optional AI layer for natural-language Q&A.

Built as a static-analysis platform with an optional AI interface: everything through Phase 6 below works with zero LLM API keys.

Project status

| Phase | What it does | Requires LLM? |
| --- | --- | --- |
| 1. Repository Ingestion | Validate URL, clone repo, extract metadata | No |
| 2. Static Parsing | Tree-sitter extraction of files/classes/functions/imports (Python, JS, TS) | No |
| 3. Dependency Graph | File-level import resolution, cycle detection | No |
| 4. Code Quality Analysis | Complexity, large functions, deep nesting, god classes, duplicates, unused imports, circular deps | No |
| 5. Structural Documentation | Auto-generated README/Architecture/FolderStructure/CodeQuality docs from stored data | No |
| 6. Frontend / Dashboard | Next.js UI: submit a repo, watch live analysis, browse results | No |
| 7. AI Layer | Embeddings + vector search + chat assistant | Yes (optional; the mock LLM works without one) |
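
To make the cycle detection in Phase 3 concrete, here is a minimal sketch of finding one import cycle in a file-level dependency graph. The project installs NetworkX, but this example swaps in a plain depth-first search from the standard library so it runs with no dependencies. The `find_cycle` function and the sample file names are hypothetical and not taken from the codebase.

```python
from typing import Dict, List, Optional


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one import cycle as a list of files, or None if there is none."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in graph.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so this edge closes a loop.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(graph):
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


# Each key imports the files in its list (hypothetical sample data).
imports = {
    "app.py": ["services.py"],
    "services.py": ["models.py"],
    "models.py": ["app.py"],  # closes the loop
    "utils.py": [],
}
print(find_cycle(imports))  # ['app.py', 'services.py', 'models.py', 'app.py']
```

The returned list repeats the first file at the end so the loop can be read left to right.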

Prerequisites

Install these before doing anything else:

| Tool | Version | Check with |
| --- | --- | --- |
| Python | 3.11+ | `python --version` |
| Node.js | 20+ | `node --version` |
| Docker Desktop | any recent | `docker --version` |
| Git | any recent | `git --version` |

Docker Desktop must be running before you start Postgres/Redis.

1. Clone and enter the project

git clone <your-repo-url> codeatlas
cd codeatlas

2. Backend setup

2.1 Create and activate a virtual environment

Windows (PowerShell):

python -m venv venv
.\venv\Scripts\Activate.ps1

If PowerShell blocks the activation script, run this once first:

Set-ExecutionPolicy -Scope CurrentUser RemoteSigned

macOS / Linux:

python3 -m venv venv
source venv/bin/activate

2.2 Install Python dependencies

pip install -r backend/requirements.txt

This installs FastAPI, SQLAlchemy, Tree-sitter (Python/JS/TS grammars), GitPython, NetworkX, sentence-transformers (for embeddings), and pgvector.

Heads up: sentence-transformers pulls in torch, which is a large download (several hundred MB) — this step can take a few minutes on first install. Let it finish; it isn't stuck.

2.3 Configure environment variables

cp backend/.env.example backend/.env

The defaults in .env.example already match the Docker Compose setup below, so no editing is required for local development.

2.4 Start Postgres and Redis

docker compose up -d
docker compose ps    # confirm both containers show "running"

This uses the pgvector/pgvector:pg16 image (Postgres 16 with the pgvector extension pre-installed) and redis:7-alpine.

2.5 Create database tables

python scripts/create_tables.py

This drops and recreates all tables — safe for local development, but destructive. It also enables the vector Postgres extension automatically.

2.6 Start the backend

uvicorn backend.main:app --port 8000

Visit http://localhost:8000/health — you should see "database_connected": true.
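
If you prefer to check the health endpoint from a script, the following sketch does the same thing with the standard library. It assumes the response body is a JSON object containing the `database_connected` key quoted above; the function names are hypothetical.

```python
import json
import urllib.request


def database_connected(payload: str) -> bool:
    """Return True only if the health JSON reports a live database connection."""
    data = json.loads(payload)
    return data.get("database_connected") is True


def check_health(url: str = "http://localhost:8000/health", timeout: float = 5.0) -> bool:
    """Fetch the health endpoint and report whether the database is reachable."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return database_connected(response.read().decode("utf-8"))
```

With the backend running, `check_health()` should return `True`; a connection error means uvicorn is not listening on port 8000.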

3. Frontend setup

Open a second terminal (leave the backend running in the first).

3.1 Install dependencies

cd frontend
npm install

3.2 Configure environment variables

cp .env.local.example .env.local   # if present, otherwise create it manually

frontend/.env.local should contain:

NEXT_PUBLIC_API_BASE_URL=http://localhost:8000/api/v1

3.3 Check for known vulnerabilities

npm audit

This should report 0 vulnerabilities. The overrides field in package.json pins postcss to a patched version regardless of what Next.js bundles internally. If npm audit ever reports something new, investigate it before dismissing it, and read what npm audit fix --force would actually change before running it: this project has already been through one CVSS 10.0 Next.js RCE advisory.

3.4 Start the frontend

npm run dev

Visit http://localhost:3000.

4. Using the app

1. Open http://localhost:3000
2. Paste a GitHub repository URL (e.g. https://github.com/octocat/Hello-World)
3. Click Analyze
4. Watch it move through parsing → dependency graph → quality analysis → documentation generation, live
5. Browse the Structure/Dependency/Quality stat cards and the four generated documents (README, Architecture, Folder Structure, Code Quality Report)

Re-submitting the same URL is safe (idempotent) — it re-runs the full analysis pipeline against the existing repository record rather than cloning a duplicate.
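
The idempotent behavior described above amounts to a get-or-create lookup keyed by repository URL. The sketch below illustrates that idea with an in-memory store. `RepositoryStore`, `RepositoryRecord`, and the URL normalization rule are assumptions made for illustration, not the project's actual service code.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RepositoryRecord:
    url: str
    analysis_runs: int = 0


class RepositoryStore:
    """In-memory stand-in for the repositories table (hypothetical)."""

    def __init__(self) -> None:
        self._by_url: Dict[str, RepositoryRecord] = {}

    @staticmethod
    def normalize(url: str) -> str:
        # Assumption: a trailing slash or ".git" suffix points at the same repo.
        url = url.strip().rstrip("/")
        if url.endswith(".git"):
            url = url[:-4]
        return url

    def submit(self, url: str) -> RepositoryRecord:
        key = self.normalize(url)
        record = self._by_url.get(key)
        if record is None:
            # First submission: create the record (the real app would clone here).
            record = RepositoryRecord(url=key)
            self._by_url[key] = record
        # Every submission re-runs the analysis against the same record.
        record.analysis_runs += 1
        return record


store = RepositoryStore()
first = store.submit("https://github.com/octocat/Hello-World")
second = store.submit("https://github.com/octocat/Hello-World.git")
print(first is second, second.analysis_runs)  # True 2
```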

5. Project structure

codeatlas/
├── backend/
│   ├── main.py         # FastAPI entrypoint
│   ├── core/           # Settings, security
│   ├── database/       # SQLAlchemy engine/session/base
│   ├── models/         # ORM models (one per table)
│   ├── schemas/        # Pydantic request/response models
│   ├── api/v1/         # FastAPI routers
│   ├── services/       # Business logic (parsing, dependency graph, quality, docs, chat)
│   ├── parsers/        # Tree-sitter based language parsers
│   ├── analyzers/      # Quality analyzers (complexity, dead code, duplicates, cycles)
│   ├── ai/             # Embeddings, vector store, retriever, LLM client, prompts
│   └── utils/          # Git operations
├── frontend/
│   ├── app/            # Next.js App Router pages
│   ├── components/     # React components
│   ├── lib/            # API client
│   └── types/          # Shared TypeScript types
├── docker-compose.yml  # Postgres (pgvector) + Redis
├── scripts/            # One-off/manual test scripts
└── storage/repos/      # Cloned repositories (gitignored)

6. Useful scripts

All scripts live in scripts/ and are PowerShell (.ps1) — Windows was the primary dev environment for this project. macOS/Linux users can adapt them to bash easily; the underlying Python commands are the same either way.

| Script | What it tests |
| --- | --- |
| `test_backend_skeleton.ps1` | Config, DB session, FastAPI app boot |
| `test_models.ps1` | ORM model definitions (SQLite smoke test, no Postgres needed) |
| `test_git_utils.ps1` | URL validation, cloning, Windows-safe cleanup |
| `test_repository_endpoint.ps1` | Full POST/GET repository API against real Postgres |
| `test_parsers.ps1` | Tree-sitter parser correctness (no DB needed) |
| `test_parsing_pipeline.ps1` | Full parse pipeline against fixture files |
| `test_dependency_graph.ps1` | Import resolution and cycle detection |
| `test_quality_analysis.ps1` | All 8 quality analyzers against a fixture designed to trigger each one |
| `test_documentation.ps1` | Document generation and persistence |
| `test_ai_pipeline.ps1` | Real embeddings + retrieval + mock LLM |
| `test_frontend.ps1` | Frontend dev server smoke test |
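
To give a feel for what "a fixture designed to trigger" an analyzer looks like, here is a rough sketch of an unused-import check run against a tiny fixture. The project's analyzers work on Tree-sitter output; this example swaps in Python's standard `ast` module so it is self-contained, and the `unused_imports` function and fixture are hypothetical.

```python
import ast
from typing import Dict, List


def unused_imports(source: str) -> List[str]:
    """Return imported names that are never referenced in the module."""
    tree = ast.parse(source)
    imported: Dict[str, int] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the name "a"; "import a.b as c" binds "c".
                name = alias.asname or alias.name.split(".")[0]
                imported[name] = node.lineno
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name != "*":  # star imports bind unknown names; skip them
                    imported[alias.asname or alias.name] = node.lineno
    # Every bare identifier that appears anywhere in the module.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(name for name in imported if name not in used)


# Fixture: "json" and "List" are imported but never used.
fixture = """
import os
import json
from typing import List

def cwd() -> str:
    return os.getcwd()
"""
print(unused_imports(fixture))  # ['List', 'json']
```

This simple version ignores names that are only re-exported through `__all__`, so treat it as an illustration rather than a complete analyzer.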

7. Troubleshooting

ModuleNotFoundError for any package — you likely edited requirements.txt without reinstalling. Run pip install -r backend/requirements.txt again.

ImportError: cannot import name 'runtime_version' from 'google.protobuf' — caused by a stale/conflicting TensorFlow install on your system interfering with sentence-transformers. This project only uses the PyTorch backend; backend/ai/embeddings.py sets USE_TF=0 before importing to avoid this entirely. If you still see it, confirm that environment variable is set at the very top of that file, before any other imports.
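
The ordering matters because the variable is read when the library is imported. A minimal sketch of what the top of such a module can look like follows; the actual contents of backend/ai/embeddings.py are not shown here, so treat the layout as an assumption.

```python
import os

# Must run before sentence_transformers (or transformers) is imported anywhere
# in the process, so the library does not try to load its TensorFlow backend.
os.environ["USE_TF"] = "0"

# Only import the embedding library after the variable is set, for example:
# from sentence_transformers import SentenceTransformer
```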

ForeignKeyViolationError when re-parsing a repository — fixed as of this version; _clear_previous_analysis in parsing_service.py now clears Relationship and CodeEmbedding rows (which reference File.id) before deleting files. If you see this again, check that function includes both deletes.
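
The underlying rule is that rows referencing a file must be deleted before the file itself. The sketch below reproduces the failure and the fix using SQLite from the standard library instead of Postgres. The table and column names are simplified assumptions, and `clear_previous_analysis` here is a stand-in for the real function, not a copy of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT);
    CREATE TABLE relationships (
        id INTEGER PRIMARY KEY,
        source_file_id INTEGER REFERENCES files(id)
    );
    CREATE TABLE code_embeddings (
        id INTEGER PRIMARY KEY,
        file_id INTEGER REFERENCES files(id)
    );
    INSERT INTO files VALUES (1, 'app.py');
    INSERT INTO relationships VALUES (1, 1);
    INSERT INTO code_embeddings VALUES (1, 1);
""")

# Deleting the parent first fails while child rows still point at it.
try:
    conn.execute("DELETE FROM files")
    parent_first_failed = False
except sqlite3.IntegrityError:
    parent_first_failed = True


def clear_previous_analysis(db: sqlite3.Connection) -> None:
    """Delete child rows before the files they reference."""
    db.execute("DELETE FROM relationships")
    db.execute("DELETE FROM code_embeddings")
    db.execute("DELETE FROM files")


clear_previous_analysis(conn)
remaining = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
print(parent_first_failed, remaining)  # True 0
```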

PowerShell: Start-Process -FilePath "npm" fails with "not a valid Win32 application" — npm on Windows is a .cmd wrapper, not a real executable. Scripts in this repo invoke it via cmd.exe /c npm ... instead of calling npm directly.
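
The same wrapper issue can come up when launching npm from Python without a shell. As a sketch under that assumption, the helper below builds a platform-appropriate command; `npm_command` and `run_npm` are hypothetical names and are not part of this repo's scripts.

```python
import os
import subprocess
from typing import List


def npm_command(*args: str) -> List[str]:
    """Build an argv list that launches npm on the current platform."""
    if os.name == "nt":
        # npm is a .cmd wrapper on Windows, so let cmd.exe resolve and run it.
        return ["cmd.exe", "/c", "npm", *args]
    return ["npm", *args]


def run_npm(*args: str, cwd: str = "frontend") -> int:
    """Run npm in the given directory and return its exit code."""
    return subprocess.run(npm_command(*args), cwd=cwd).returncode
```

For example, `run_npm("run", "dev")` would start the frontend dev server from the repository root.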

npm audit reports a postcss vulnerability nested inside next — do not run npm audit fix --force (it will suggest downgrading next to a version from 2020). Use the overrides field in package.json instead, which forces every copy of the dependency in the tree to the patched version.

Server "crashes" with no visible error when started via a background PowerShell process — Start-Process -WindowStyle Hidden swallows stderr by default. Always redirect it: -RedirectStandardError "err.log", and read that file if something fails silently.
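
The same principle applies to any background launcher: send stderr to a file you can read afterwards. Here is a small Python sketch of the idea; `start_with_error_log` is a hypothetical helper, and the failing child process exists only to demonstrate the capture.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


def start_with_error_log(command: List[str], log_path: str) -> subprocess.Popen:
    """Start a background process with stderr written to a log file."""
    log_file = open(log_path, "w", encoding="utf-8")
    process = subprocess.Popen(command, stdout=subprocess.DEVNULL, stderr=log_file)
    log_file.close()  # the child keeps its own handle to the file
    return process


# Demo: a child that fails and reports the reason only on stderr.
with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "err.log")
    child = start_with_error_log([sys.executable, "-c", "import sys; sys.exit('boom')"], log)
    exit_code = child.wait()
    captured = Path(log).read_text(encoding="utf-8").strip()

print(exit_code, captured)  # 1 boom
```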

First embedding/chat request hangs for several minutes — expected on first run only. sentence-transformers downloads the ~1.3GB BGE model from Hugging Face the first time embed_documents/embed_query is called. Subsequent calls are fast (model is cached).

8. What's not built yet

Phase 7 (AI Layer): real embeddings and retrieval work today, but the LLM client (backend/ai/llm_client.py) only has a MockLLMClient — it proves the RAG pipeline is wired correctly without needing an API key, but doesn't generate real natural-language answers. Swapping in a real provider means writing one new class implementing BaseLLMClient.generate().
Alembic migrations: schema changes currently require dropping and recreating all tables (scripts/create_tables.py), which is fine for local development but not how a production deployment would handle schema evolution.
Route/endpoint extraction, DB model extraction: the PRD's Authentication.md/Database.md/API endpoint docs need framework-specific route detection (e.g. Flask/Express route decorators) that hasn't been built — only the four documents backed by data we actually extract (README, Architecture, FolderStructure, CodeQuality) are generated.
Celery workers: analysis currently runs synchronously per API call rather than as background jobs, so very large repositories will block the request until complete.
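
For the Phase 7 item above, a new provider class might look roughly like the sketch below. The `BaseLLMClient` defined here is a stand-in: the real base class lives in backend/ai/llm_client.py, and its `generate()` signature may differ from the single `prompt` argument assumed here. `CallableLLMClient` is a hypothetical name.

```python
from abc import ABC, abstractmethod
from typing import Callable


class BaseLLMClient(ABC):
    """Stand-in for the project's base class; the real signature may differ."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        ...


class CallableLLMClient(BaseLLMClient):
    """Hypothetical adapter that delegates to any prompt -> text function."""

    def __init__(self, complete: Callable[[str], str]) -> None:
        self._complete = complete

    def generate(self, prompt: str) -> str:
        if not prompt.strip():
            raise ValueError("prompt must not be empty")
        return self._complete(prompt).strip()


# A real provider would pass a function that calls its SDK or HTTP API here.
client = CallableLLMClient(lambda prompt: f"(stub answer to: {prompt}) ")
print(client.generate("Where is cycle detection implemented?"))
```

Keeping the provider call behind a plain function makes it easy to swap providers or substitute a stub in tests without touching the retrieval code.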
