dvvrtmshra/codeatlas

codeatlas

AI-powered repository reverse engineering platform. Paste a GitHub URL and get structure, dependencies, code quality analysis, and generated documentation — all from static analysis, with an optional AI layer for natural-language Q&A.

Built as a static-analysis platform with an optional AI interface: everything through Phase 6 below works with zero LLM API keys.


Project status

| Phase | What it does | Requires LLM? |
| --- | --- | --- |
| 1. Repository Ingestion | Validate URL, clone repo, extract metadata | No |
| 2. Static Parsing | Tree-sitter extraction of files/classes/functions/imports (Python, JS, TS) | No |
| 3. Dependency Graph | File-level import resolution, cycle detection | No |
| 4. Code Quality Analysis | Complexity, large functions, deep nesting, god classes, duplicates, unused imports, circular deps | No |
| 5. Structural Documentation | Auto-generated README/Architecture/FolderStructure/CodeQuality docs from stored data | No |
| 6. Frontend / Dashboard | Next.js UI: submit a repo, watch live analysis, browse results | No |
| 7. AI Layer | Embeddings + vector search + chat assistant | Yes (optional; mock LLM works without one) |
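
To make the Phase 3 "cycle detection" step concrete, here is a minimal sketch of finding one circular import in a file-level graph. It uses only the standard library and a depth-first search. It is not the project's implementation, which may rely on NetworkX instead. The `find_cycle` name and the dict-of-lists graph shape are assumptions made for illustration.

```python
from typing import Dict, List


def find_cycle(graph: Dict[str, List[str]]) -> List[str]:
    """Return one import cycle as a list of files, or [] if none exists.

    `graph` maps each file to the files it imports (hypothetical shape).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> List[str]:
        color[node] = GRAY
        path.append(node)
        for dep in graph.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return []

    for node in list(graph):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return []


imports = {
    "a.py": ["b.py"],
    "b.py": ["c.py"],
    "c.py": ["a.py"],
    "d.py": ["a.py"],
}
cycle = find_cycle(imports)
```

Here `cycle` holds the path `a.py`, `b.py`, `c.py` and back to `a.py`. A recursive search like this is fine for a sketch, but very deep import chains would need an iterative version.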

Prerequisites

Install these before doing anything else:

| Tool | Version | Check with |
| --- | --- | --- |
| Python | 3.11+ | python --version |
| Node.js | 20+ | node --version |
| Docker Desktop | any recent | docker --version |
| Git | any recent | git --version |

Docker Desktop must be running before you start Postgres/Redis.


1. Clone and enter the project

git clone <your-repo-url> codeatlas
cd codeatlas

2. Backend setup

2.1 Create and activate a virtual environment

Windows (PowerShell):

python -m venv venv
.\venv\Scripts\Activate.ps1

If PowerShell blocks the activation script, run this once first:

Set-ExecutionPolicy -Scope CurrentUser RemoteSigned

macOS / Linux:

python3 -m venv venv
source venv/bin/activate

2.2 Install Python dependencies

pip install -r backend/requirements.txt

This installs FastAPI, SQLAlchemy, Tree-sitter (Python/JS/TS grammars), GitPython, NetworkX, sentence-transformers (for embeddings), and pgvector.

Heads up: sentence-transformers pulls in torch, which is a large download (several hundred MB) — this step can take a few minutes on first install. Let it finish; it isn't stuck.

2.3 Configure environment variables

cp backend/.env.example backend/.env

The defaults in .env.example already match the Docker Compose setup below, so no editing is required for local development.

2.4 Start Postgres and Redis

docker compose up -d
docker compose ps    # confirm both containers show "running"

This uses the pgvector/pgvector:pg16 image (Postgres 16 with the pgvector extension pre-installed) and redis:7-alpine.

2.5 Create database tables

python scripts/create_tables.py

This drops and recreates all tables — safe for local development, but destructive. It also enables the vector Postgres extension automatically.

2.6 Start the backend

uvicorn backend.main:app --port 8000

Visit http://localhost:8000/health — you should see "database_connected": true.
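
If you would rather script this check than open a browser, something along these lines should work. It is a sketch that assumes the endpoint returns a JSON object. Only the `database_connected` field comes from this README. The function names are made up for the example.

```python
import json
from urllib.request import urlopen


def database_connected(body: str) -> bool:
    """Return True only if the health payload reports a live database."""
    payload = json.loads(body)
    return payload.get("database_connected") is True


def check_health(url: str = "http://localhost:8000/health",
                 timeout: float = 5.0) -> bool:
    """Fetch the health endpoint and interpret its JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return database_connected(response.read().decode("utf-8"))
```

With the backend running, `check_health()` should return `True`. If the server is not up, `urlopen` raises `URLError` instead.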


3. Frontend setup

Open a second terminal (leave the backend running in the first).

3.1 Install dependencies

cd frontend
npm install

3.2 Configure environment variables

cp .env.local.example .env.local   # if present, otherwise create it manually

frontend/.env.local should contain:

NEXT_PUBLIC_API_BASE_URL=http://localhost:8000/api/v1

3.3 Check for known vulnerabilities

npm audit

Should report 0 vulnerabilities. The overrides field in package.json pins postcss to a patched version regardless of what Next.js bundles internally. If npm audit ever shows something new, investigate it before ignoring it. This project has already been through one CVSS 10.0 Next.js RCE advisory, so don't assume npm audit fix --force is safe without first reading what it actually changes.

3.4 Start the frontend

npm run dev

Visit http://localhost:3000.


4. Using the app

  1. Open http://localhost:3000
  2. Paste a GitHub repository URL (e.g. https://github.com/octocat/Hello-World)
  3. Click Analyze
  4. Watch it move through parsing → dependency graph → quality analysis → documentation generation, live
  5. Browse the Structure/Dependency/Quality stat cards and the four generated documents (README, Architecture, Folder Structure, Code Quality Report)

Re-submitting the same URL is safe (idempotent) — it re-runs the full analysis pipeline against the existing repository record rather than cloning a duplicate.
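
The idempotent behaviour described above can be pictured as a get-or-create keyed on a normalized URL. The sketch below is an in-memory illustration, not the backend's code. `normalize_github_url`, `RepositoryStore`, and the normalization rules (HTTPS only, lowercase owner and repo, optional `.git` suffix stripped) are all assumptions.

```python
from typing import Dict
from urllib.parse import urlparse


def normalize_github_url(url: str) -> str:
    """Validate a GitHub repo URL and reduce it to one canonical form."""
    parsed = urlparse(url.strip())
    if parsed.scheme != "https" or parsed.netloc.lower() != "github.com":
        raise ValueError(f"not a GitHub HTTPS URL: {url!r}")
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 2:
        raise ValueError(f"expected https://github.com/<owner>/<repo>: {url!r}")
    owner, repo = parts
    if repo.endswith(".git"):
        repo = repo[:-4]
    return f"https://github.com/{owner.lower()}/{repo.lower()}"


class RepositoryStore:
    """Hypothetical stand-in for the repository table."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self.analysis_runs: Dict[int, int] = {}

    def submit(self, url: str) -> int:
        key = normalize_github_url(url)
        # Reuse the existing record if there is one; otherwise create it.
        repo_id = self._ids.setdefault(key, len(self._ids) + 1)
        # Every submission re-runs the analysis against that record.
        self.analysis_runs[repo_id] = self.analysis_runs.get(repo_id, 0) + 1
        return repo_id


store = RepositoryStore()
first = store.submit("https://github.com/octocat/Hello-World")
second = store.submit("https://github.com/octocat/Hello-World.git")
```

Both submissions resolve to the same record, and the run counter for it ends at 2.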


5. Project structure

codeatlas/
├── backend/
│   ├── main.py                # FastAPI entrypoint
│   ├── core/                  # Settings, security
│   ├── database/               # SQLAlchemy engine/session/base
│   ├── models/                 # ORM models (one per table)
│   ├── schemas/                 # Pydantic request/response models
│   ├── api/v1/                  # FastAPI routers
│   ├── services/                # Business logic (parsing, dependency graph, quality, docs, chat)
│   ├── parsers/                 # Tree-sitter based language parsers
│   ├── analyzers/                # Quality analyzers (complexity, dead code, duplicates, cycles)
│   ├── ai/                       # Embeddings, vector store, retriever, LLM client, prompts
│   └── utils/                     # Git operations
├── frontend/
│   ├── app/                       # Next.js App Router pages
│   ├── components/                 # React components
│   ├── lib/                         # API client
│   └── types/                        # Shared TypeScript types
├── docker-compose.yml               # Postgres (pgvector) + Redis
├── scripts/                          # One-off/manual test scripts
└── storage/repos/                     # Cloned repositories (gitignored)

6. Useful scripts

The test scripts listed below live in scripts/ and are PowerShell (.ps1), because Windows was the primary dev environment for this project. macOS/Linux users can adapt them to bash easily; the underlying Python commands are the same either way.

| Script | What it tests |
| --- | --- |
| test_backend_skeleton.ps1 | Config, DB session, FastAPI app boot |
| test_models.ps1 | ORM model definitions (SQLite smoke test, no Postgres needed) |
| test_git_utils.ps1 | URL validation, cloning, Windows-safe cleanup |
| test_repository_endpoint.ps1 | Full POST/GET repository API against real Postgres |
| test_parsers.ps1 | Tree-sitter parser correctness (no DB needed) |
| test_parsing_pipeline.ps1 | Full parse pipeline against fixture files |
| test_dependency_graph.ps1 | Import resolution and cycle detection |
| test_quality_analysis.ps1 | All 8 quality analyzers against a fixture designed to trigger each one |
| test_documentation.ps1 | Document generation and persistence |
| test_ai_pipeline.ps1 | Real embeddings + retrieval + mock LLM |
| test_frontend.ps1 | Frontend dev server smoke test |

7. Troubleshooting

ModuleNotFoundError for any package — you likely edited requirements.txt without reinstalling. Run pip install -r backend/requirements.txt again.

ImportError: cannot import name 'runtime_version' from 'google.protobuf' — caused by a stale/conflicting TensorFlow install on your system interfering with sentence-transformers. This project only uses the PyTorch backend; backend/ai/embeddings.py sets USE_TF=0 before importing to avoid this entirely. If you still see it, confirm that environment variable is set at the very top of that file, before any other imports.
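
As a rough picture of what "at the very top of that file" means, the ordering looks something like the sketch below. It is not the actual contents of backend/ai/embeddings.py. The `import_sentence_transformers` helper is hypothetical and only exists to keep the heavy import lazy in this example.

```python
import os

# Set this before anything imports transformers or sentence-transformers,
# so the TensorFlow backend is never probed.
os.environ["USE_TF"] = "0"


def import_sentence_transformers():
    """Import the library only after USE_TF is in place (illustrative helper)."""
    import sentence_transformers

    return sentence_transformers
```

The ordering matters because the variable only helps if it is set before the library is first imported. Setting it after another module has already imported the library has no effect.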

ForeignKeyViolationError when re-parsing a repository — fixed as of this version; _clear_previous_analysis in parsing_service.py now clears Relationship and CodeEmbedding rows (which reference File.id) before deleting files. If you see this again, check that function includes both deletes.
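
The ordering issue is easy to reproduce outside Postgres. The sketch below uses SQLite from the standard library with foreign keys switched on. The table and column names (`files`, `relationships`, `code_embeddings`, `repository_id`) are stand-ins chosen for the example, not the project's actual schema, and `clear_previous_analysis` only mirrors the idea of the real function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript(
    """
    CREATE TABLE files (id INTEGER PRIMARY KEY, repository_id INTEGER NOT NULL);
    CREATE TABLE relationships (
        id INTEGER PRIMARY KEY,
        file_id INTEGER NOT NULL REFERENCES files(id)
    );
    CREATE TABLE code_embeddings (
        id INTEGER PRIMARY KEY,
        file_id INTEGER NOT NULL REFERENCES files(id)
    );
    INSERT INTO files VALUES (1, 42);
    INSERT INTO relationships VALUES (1, 1);
    INSERT INTO code_embeddings VALUES (1, 1);
    """
)

# Wrong order: deleting the parent rows first violates the foreign keys.
try:
    conn.execute("DELETE FROM files WHERE repository_id = 42")
    wrong_order_failed = False
except sqlite3.IntegrityError:
    wrong_order_failed = True


def clear_previous_analysis(conn: sqlite3.Connection, repository_id: int) -> None:
    """Delete child rows that reference files before deleting the files."""
    file_ids = "SELECT id FROM files WHERE repository_id = ?"
    with conn:  # commit on success, roll back on error
        conn.execute(
            f"DELETE FROM relationships WHERE file_id IN ({file_ids})",
            (repository_id,),
        )
        conn.execute(
            f"DELETE FROM code_embeddings WHERE file_id IN ({file_ids})",
            (repository_id,),
        )
        conn.execute("DELETE FROM files WHERE repository_id = ?", (repository_id,))


clear_previous_analysis(conn, 42)
remaining_files = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
```

If either child delete is missing, the final `DELETE FROM files` fails in the same way as the wrong-order attempt.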

PowerShell: Start-Process -FilePath "npm" fails with "not a valid Win32 application". On Windows, npm is a .cmd wrapper, not a real executable. Scripts in this repo invoke it via cmd.exe /c npm ... instead of calling npm directly.

npm audit reports a postcss vulnerability nested inside next — do not run npm audit fix --force (it will suggest downgrading next to a version from 2020). Use the overrides field in package.json instead, which forces every copy of the dependency in the tree to the patched version.

Server "crashes" with no visible error when started via a background PowerShell process. Start-Process -WindowStyle Hidden swallows stderr by default. Always redirect it with -RedirectStandardError "err.log", and read that file if something fails silently.
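
The same principle applies if you launch the server from Python rather than PowerShell: send stderr to a file so a silent failure leaves evidence. This is a Python analogue of the advice above, not something the repo's scripts do. `start_with_error_log` is a hypothetical helper, and the child command here is a throwaway one-liner that fails on purpose.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


def start_with_error_log(command: List[str], log_path: Path) -> subprocess.Popen:
    """Start a background process with stderr redirected to log_path."""
    log_file = open(log_path, "w", encoding="utf-8")
    try:
        return subprocess.Popen(
            command, stdout=subprocess.DEVNULL, stderr=log_file
        )
    finally:
        # The child holds its own copy of the handle, so the parent can close.
        log_file.close()


log_path = Path(tempfile.mkdtemp()) / "err.log"
process = start_with_error_log(
    [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(1)"],
    log_path,
)
process.wait()
captured = log_path.read_text(encoding="utf-8")
```

After the process exits with a non-zero code, `captured` contains whatever it wrote to stderr, which is the first place to look.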

First embedding/chat request hangs for several minutes — expected on first run only. sentence-transformers downloads the ~1.3GB BGE model from Hugging Face the first time embed_documents/embed_query is called. Subsequent calls are fast (model is cached).


8. What's not built yet

  • Phase 7 (AI Layer): real embeddings and retrieval work today, but the LLM client (backend/ai/llm_client.py) only has a MockLLMClient — it proves the RAG pipeline is wired correctly without needing an API key, but doesn't generate real natural-language answers. Swapping in a real provider means writing one new class implementing BaseLLMClient.generate().
  • Alembic migrations: schema changes currently require dropping and recreating all tables (scripts/create_tables.py), which is fine for local development but not how a production deployment would handle schema evolution.
  • Route/endpoint extraction, DB model extraction: the PRD's Authentication.md/Database.md/API endpoint docs need framework-specific route detection (e.g. Flask/Express route decorators) that hasn't been built — only the four documents backed by data we actually extract (README, Architecture, FolderStructure, CodeQuality) are generated.
  • Celery workers: analysis currently runs synchronously per API call rather than as background jobs, so very large repositories will block the request until complete.
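
To give a sense of what "one new class implementing BaseLLMClient.generate()" from the Phase 7 item might involve, here is a hedged sketch. The real base class lives in backend/ai/llm_client.py and its signature is not shown in this README. The `generate(prompt: str) -> str` shape, the stand-in `BaseLLMClient`, and `ProviderLLMClient` with its injected `send` callable are all assumptions. No real provider SDK is called.

```python
from abc import ABC, abstractmethod
from typing import Callable


class BaseLLMClient(ABC):
    """Stand-in for the project's base class; the real signature may differ."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        ...


class ProviderLLMClient(BaseLLMClient):
    """Hypothetical real-provider client.

    `send` is whatever function actually talks to the provider (an SDK call
    or an HTTP request). Injecting it keeps this class testable offline.
    """

    def __init__(self, send: Callable[[str], str]) -> None:
        self._send = send

    def generate(self, prompt: str) -> str:
        answer = self._send(prompt).strip()
        if not answer:
            raise RuntimeError("provider returned an empty completion")
        return answer


# Usage with a fake transport, standing in for a real API call.
client = ProviderLLMClient(send=lambda prompt: f"  echo: {prompt}  ")
reply = client.generate("What does main.py do?")
```

Passing the transport in as a callable keeps provider choice and API keys out of the class itself. That fits a setup where the mock client has to keep working with no key configured.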
