Internship deliverable for JetBrains: build an AI-powered IntelliJ plugin. CodeAtlas lets a developer ask questions like "where is authentication implemented?" or "how does the payment flow work?" and get a ranked list of classes/methods/top-level functions in the open project, each one-click navigable via the standard IntelliJ navigation APIs.
Problem it addresses. In unfamiliar codebases, "find the right symbol" is a discovery problem, not a keyword problem. The IDE's existing search (Ctrl+Shift+F, Navigate by Class/File) only matches what you already know to type. Semantic retrieval closes that gap by matching intent to implementation even when identifier names don't overlap with the question.
Phase 1 outcome (this plan). A working IntelliJ IDEA plugin that indexes Kotlin/Java sources of the open project, produces vector embeddings via a locally-bundled ONNX model, and exposes a tool window where natural-language queries return a ranked list of navigable code symbols. Works fully offline — no API key, no network.
Phase 2 (deferred, not built here). Retrieval-augmented generation: feed top-K chunks to an LLM and render a natural-language answer with citations. Architecture in Phase 1 leaves a clean seam for this.
| Decision | Choice |
|---|---|
| Scope | Semantic search + navigation (Phase 1); RAG chat seam for Phase 2 |
| Languages | Kotlin + Java concrete; pluggable LanguageAdapter interface for future Python/TS/Go |
| Embeddings | Local ONNX model bundled in plugin resources |
| UI | Dedicated tool window (right side); no Search Everywhere, no right-click actions in Phase 1 |
| Koog | Not used in Phase 1 (no LLM calls to orchestrate). Re-evaluate for Phase 2. |
| Timeline | 3–4 weeks focused work |
Five layers, each a single-purpose unit communicating through narrow interfaces.
┌─────────────────────────────────────────────┐
│              UI (Tool Window)               │
│  • Search field + debounced input           │
│  • Ranked result cards + progress indicator │
└───────────────────┬─────────────────────────┘
                    │ calls
┌───────────────────▼─────────────────────────┐
│                  Retriever                  │
│  embed(query) → top-K vectors → rerank      │
└───────────┬───────────────┬─────────────────┘
            │ query         │ chunk vectors
┌───────────▼───────┐   ┌───▼─────────────────┐
│ EmbeddingProvider │   │ VectorStore + Cache │
│ OnnxEmbedding…    │   │ in-mem + disk       │
└───────────▲───────┘   └───▲─────────────────┘
            │               │ written by
┌───────────┴───────────────┴─────────────────┐
│                IndexService                 │
│  • full index on startup                    │
│  • incremental re-index on PSI change       │
│  • uses LanguageAdapter per file            │
└───────────────────┬─────────────────────────┘
                    │ extracts via
┌───────────────────▼─────────────────────────┐
│       LanguageAdapter (Kotlin, Java)        │
│          PsiFile → List<CodeChunk>          │
└─────────────────────────────────────────────┘
Source root: src/main/kotlin/com/bugdigger/codeatlas/
- `CodeChunk.kt`: immutable data class `(id, qualifiedName, kind, signature, docComment, language, virtualFileUrl, startOffset, endOffset, containerFqn, contentHash)`. `kind` is an enum over `CLASS | INTERFACE | OBJECT | METHOD | FUNCTION | CONSTRUCTOR | DOC`.
- `CodeAtlasIndexService.kt`: `@Service(Service.Level.PROJECT)`; orchestrates indexing and exposes `suspend fun search(query: String, limit: Int): List<RankedResult>`. Owns the `VectorStore` and `PersistentCache` instances.
- `IndexBuilder.kt`: walks `ProjectFileIndex.iterateContent` in a `Task.Backgroundable`, routes files to the matching `LanguageAdapter`, batches chunks through the `EmbeddingProvider` (batch size 16), writes into the in-memory store and cache.
- `PersistentCache.kt`: single-file binary cache at `PathManager.getSystemPath()/CodeAtlas/<projectHash>/index.bin`. Header: `magic(4) | schemaVersion(2) | embeddingModelId(string) | dim(4)`. Record: `contentHash(32B) + vector(dim*4B) + chunk metadata (JSON line)`. On version mismatch → rebuild.
- `PsiChangeListener.kt`: implements `PsiTreeChangeListener`; debounces with a 1s coalescing window; pushes affected `PsiFile`s to a `CoroutineScope`-backed channel that the `IndexBuilder` drains in incremental mode.
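The cache header described for `PersistentCache.kt` could be written and validated roughly as follows. This is a minimal sketch under stated assumptions: the magic value, the use of `writeUTF` for the model id, and the names `CacheHeader`, `writeHeader`, and `readHeaderOrNull` are all hypothetical; the plan only fixes the field order and widths.

```kotlin
import java.io.DataInputStream
import java.io.DataOutputStream

// Hypothetical constants: the plan fixes the field widths, not the values.
const val CACHE_MAGIC: Int = 0x43415431 // ASCII "CAT1", an arbitrary 4-byte tag
const val SCHEMA_VERSION: Short = 1

data class CacheHeader(val schemaVersion: Short, val embeddingModelId: String, val dim: Int)

fun writeHeader(out: DataOutputStream, header: CacheHeader) {
    out.writeInt(CACHE_MAGIC)                    // magic(4)
    out.writeShort(header.schemaVersion.toInt()) // schemaVersion(2)
    out.writeUTF(header.embeddingModelId)        // length-prefixed string (assumed encoding)
    out.writeInt(header.dim)                     // dim(4)
}

/**
 * Returns the header when it matches the running plugin, or null when the
 * cache should be discarded and the index rebuilt. A truncated file surfaces
 * as an EOFException from the stream, which the caller can treat the same way.
 */
fun readHeaderOrNull(input: DataInputStream, expectedModelId: String, expectedDim: Int): CacheHeader? {
    if (input.readInt() != CACHE_MAGIC) return null
    // Constructor arguments are evaluated left to right, matching the write order.
    val header = CacheHeader(input.readShort(), input.readUTF(), input.readInt())
    val compatible = header.schemaVersion == SCHEMA_VERSION &&
        header.embeddingModelId == expectedModelId &&
        header.dim == expectedDim
    return if (compatible) header else null
}
```

Checking the model id and dimension alongside the schema version means that swapping the bundled model also invalidates old vectors, which is the stated purpose of `modelId` on the embedding provider.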
- `LanguageAdapter.kt`: interface: `interface LanguageAdapter { fun supports(file: PsiFile): Boolean; fun extract(file: PsiFile): List<CodeChunk> }`
- `KotlinLanguageAdapter.kt`: uses Kotlin PSI (`KtFile`, `KtClass`, `KtNamedFunction`, `KtDeclaration`). Extracts classes, objects, top-level functions, member functions, KDoc blocks.
- `JavaLanguageAdapter.kt`: uses `PsiJavaFile`, `PsiClass`, `PsiMethod`, JavaDoc. Same chunk shape.
- `LanguageAdapters.kt`: registry/lookup; iterated in order until `supports(file)` returns true.
Why this split: Phase-2 readiness without cost — adding Python later is a single-file addition.
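To illustrate the registry lookup, here is a small sketch that compiles outside the IDE. `SourceFile` and `Chunk` are hypothetical stand-ins for `PsiFile` and `CodeChunk`, and `ExtensionAdapter` and `forFile` are invented for the example; real adapters would walk the PSI tree rather than look at file names.

```kotlin
// Stand-ins so the sketch compiles outside the IDE; the plugin would use PsiFile and CodeChunk.
data class SourceFile(val name: String, val text: String)
data class Chunk(val qualifiedName: String, val language: String)

interface LanguageAdapter {
    fun supports(file: SourceFile): Boolean
    fun extract(file: SourceFile): List<Chunk>
}

// Hypothetical adapter that only illustrates the shape of an implementation.
class ExtensionAdapter(private val extension: String, private val language: String) : LanguageAdapter {
    override fun supports(file: SourceFile) = file.name.endsWith(".$extension")
    override fun extract(file: SourceFile) =
        listOf(Chunk(file.name.removeSuffix(".$extension"), language))
}

class LanguageAdapters(private val adapters: List<LanguageAdapter>) {
    /** First adapter whose supports() returns true, or null for unsupported files. */
    fun forFile(file: SourceFile): LanguageAdapter? = adapters.firstOrNull { it.supports(file) }
}
```

With this shape, supporting another language amounts to adding one adapter class and one entry in the list passed to `LanguageAdapters`.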
- `EmbeddingProvider.kt`: interface: `suspend fun embed(texts: List<String>): List<FloatArray>`; property `val dim: Int`; property `val modelId: String` (for cache invalidation).
- `OnnxEmbeddingProvider.kt`: wraps `ai.onnxruntime.OrtSession`. Loads model from plugin resources at first use. Tokenizes via `BertTokenizer`, runs inference, mean-pools the last hidden state, L2-normalizes. Single-threaded on a dedicated dispatcher (`Dispatchers.Default.limitedParallelism(1)`) to avoid ONNX contention.
- `tokenizer/BertTokenizer.kt`: WordPiece tokenizer. Either ship `vocab.txt` and hand-roll the ~80-line WordPiece loop, OR depend on `ai.djl.huggingface:tokenizers`. Preference: the DJL library if bundle size allows; else hand-roll.
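If the tokenizer is hand-rolled, its core is the greedy longest-match-first WordPiece loop. The sketch below covers only that loop for a single pre-split, lowercased word. The function name and the toy vocabulary in the usage note are assumptions, and a full tokenizer would also need punctuation splitting, the `[CLS]`/`[SEP]` markers, and id lookup from `vocab.txt`.

```kotlin
/**
 * Greedy longest-match-first WordPiece for one lowercase, whitespace-free word.
 * Continuation pieces carry the "##" prefix, as in BERT vocab files.
 */
fun wordPiece(word: String, vocab: Set<String>, unknown: String = "[UNK]"): List<String> {
    val pieces = mutableListOf<String>()
    var start = 0
    while (start < word.length) {
        var end = word.length
        var match: String? = null
        // Shrink the candidate from the right until it is found in the vocabulary.
        while (end > start) {
            val candidate = (if (start > 0) "##" else "") + word.substring(start, end)
            if (candidate in vocab) {
                match = candidate
                break
            }
            end--
        }
        if (match == null) return listOf(unknown) // no piece fits: the whole word maps to [UNK]
        pieces += match
        start = end
    }
    return pieces
}
```

For example, with the toy vocabulary `setOf("auth", "##ent", "##ication")`, the word `"authentication"` splits into `auth`, `##ent`, `##ication`.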
Model choice: BAAI/bge-small-en-v1.5 (384-dim, ~130MB fp32, ~33MB int8 quantized). Int8 quantized is preferred — quality loss is <2% on MTEB, and the plugin JAR stays lean.
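The pooling and normalization steps that `OnnxEmbeddingProvider` applies to the model output can be sketched in plain Kotlin. The function names and the `[tokens][dim]` array layout are assumptions for illustration; the real ONNX output tensor would first need to be unpacked into this shape.

```kotlin
import kotlin.math.sqrt

/**
 * Mean-pools token embeddings into one vector, skipping padding positions.
 * hiddenStates is [tokens][dim]; attentionMask holds 1 for real tokens and 0 for padding.
 */
fun meanPool(hiddenStates: Array<FloatArray>, attentionMask: IntArray): FloatArray {
    require(hiddenStates.isNotEmpty()) { "at least one token is required" }
    require(hiddenStates.size == attentionMask.size) { "one mask entry per token" }
    val dim = hiddenStates[0].size
    val pooled = FloatArray(dim)
    var realTokens = 0
    for (t in hiddenStates.indices) {
        if (attentionMask[t] == 0) continue
        realTokens++
        for (d in 0 until dim) pooled[d] += hiddenStates[t][d]
    }
    if (realTokens > 0) for (d in 0 until dim) pooled[d] /= realTokens
    return pooled
}

/** Scales the vector to unit length so a dot product equals cosine similarity. */
fun l2Normalize(vector: FloatArray): FloatArray {
    var sumSquares = 0.0
    for (v in vector) sumSquares += v * v
    val norm = sqrt(sumSquares).toFloat()
    return if (norm == 0f) vector.copyOf() else FloatArray(vector.size) { vector[it] / norm }
}
```

Masking matters when inputs are batched: shorter texts are padded to the longest one in the batch, and averaging over padding positions would skew their vectors.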
- `VectorStore.kt`: in-memory `FloatArray` of shape `(N, dim)` flattened, plus a parallel `List<CodeChunk>`. `topK(queryVec, k)` runs a SIMD-friendly dot-product scan; because all vectors are L2-normalized, the dot product equals cosine similarity. Linear scan is fine up to ~10k chunks (<20ms). HNSW is a later upgrade if profiling shows it's needed.
- `Retriever.kt`: pipeline:
  - embed query once
  - `VectorStore.topK(queryVec, 50)` → candidate set
  - `Reranker.rerank(candidates, query)` → final top 20
- `ScoringSignals.kt`: pure-function signal fusion. Final score = `w_vec * cosineSim + w_name * identifierSubstringMatch + w_kind * kindFitPrior + w_doc * hasDocBoost`. Weights start at `(0.7, 0.15, 0.05, 0.1)` and are plain constants (no learning). Tunable via a dev-mode setting.
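The linear scan and the signal fusion might look like the following sketch. The names `FlatVectorStore`, `Signals`, and `fusedScore` are hypothetical, and how `nameMatch` and `kindPrior` are computed is deliberately left open since the plan does not define them; only the weights and the formula come from the text above.

```kotlin
/** Flattened (N, dim) matrix of unit-length vectors; row i belongs to chunk i. */
class FlatVectorStore(private val dim: Int, private val vectors: FloatArray) {
    val size: Int get() = vectors.size / dim

    /** Linear scan: returns (row index, dot product) pairs, best first. */
    fun topK(query: FloatArray, k: Int): List<Pair<Int, Float>> {
        require(query.size == dim) { "query dimension mismatch" }
        val scored = ArrayList<Pair<Int, Float>>(size)
        for (row in 0 until size) {
            var dot = 0f
            val base = row * dim
            for (d in 0 until dim) dot += vectors[base + d] * query[d]
            scored += row to dot
        }
        // A full sort is the simplest option; a bounded heap would avoid sorting all N rows.
        return scored.sortedByDescending { it.second }.take(k)
    }
}

/** Per-candidate inputs to the fusion; each signal is assumed to lie in [0, 1]. */
data class Signals(val cosineSim: Float, val nameMatch: Float, val kindPrior: Float, val hasDoc: Boolean)

// Starting weights from the plan: (w_vec, w_name, w_kind, w_doc).
const val W_VEC = 0.7f
const val W_NAME = 0.15f
const val W_KIND = 0.05f
const val W_DOC = 0.1f

fun fusedScore(s: Signals): Float =
    W_VEC * s.cosineSim + W_NAME * s.nameMatch + W_KIND * s.kindPrior + W_DOC * (if (s.hasDoc) 1f else 0f)
```

Since the weights sum to 1.0, a candidate that maxes out every signal scores 1.0, which keeps fused scores on the same scale as cosine similarity and makes weight changes easier to reason about.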
- `CodeAtlasToolWindowFactory.kt`: replaces the template `MyToolWindowFactory`. Registered in `plugin.xml`.
- `CodeAtlasToolWindow.kt`: root panel (`BorderLayout`). North: `SearchBar`. Center: a `JBSplitter` (vertical, proportion=1.0 in Phase 1) with the top component = `ResultListPanel` inside a `JBScrollPane` and the bottom component = an empty placeholder reserved for the Phase 2 answer panel. South: thin indexing-status strip.
- `SearchBar.kt`: `JBTextField` with a 300ms debounce (`javax.swing.Timer`). On `ENTER` or debounce tick, calls `IndexService.search` off-EDT via `AppExecutorUtil`/coroutines and posts results back to EDT.
- `ResultListPanel.kt`: virtualized `JBList` with a custom `ListCellRenderer` that produces a `ResultCard` per row.
- `ResultCard.kt`: renders: icon (from `IconManager` via chunk `kind`), bold qualified name, muted signature line, small file-path:line suffix, two-line snippet from the PSI range. Enter / double-click / button click → `OpenFileDescriptor(project, virtualFile, startOffset).navigate(true)` then `FileEditorManager.getInstance(project).selectedTextEditor` caret set.
- Indexing status: bound to `MessageBusConnection` topic `CodeAtlasIndexTopic`; shows "Indexing 412/1834" during full indexing, collapses to a small dot when idle.
- `CodeAtlasSettings.kt`: `@State(name = "CodeAtlasSettings", storages = [@Storage("codeAtlas.xml")])`. Phase 1 fields: `includeTestSources: Boolean = false`, `cacheDirOverride: String? = null`.
- `SettingsConfigurable.kt`: registered as `com.intellij.applicationConfigurable` (or project-level) with a single-form panel.
- `CodeAtlasStartupActivity.kt`: implements `ProjectActivity` (coroutine-based startup, 2023.1+ API). Triggers initial index build; registers `PsiChangeListener`.
- Rename the existing `MyMessageBundle.kt` to `CodeAtlasBundle.kt`; keep all user-facing strings there for future i18n.
Startup, first time opening a project
1. `CodeAtlasStartupActivity.execute(project)` → `IndexService.ensureIndexed()`.
2. `PersistentCache.load(projectHash)` → miss.
3. `IndexBuilder.fullIndex()` launches `Task.Backgroundable`:
   - `ProjectFileIndex.iterateContent` yields source files.
   - Each file → matching `LanguageAdapter.extract` → list of `CodeChunk`.
   - Chunks batched (16) → `EmbeddingProvider.embed(chunkText)` where `chunkText = signature + "\n" + docComment + "\n" + containerFqn`.
   - Vectors + chunks written to `VectorStore` (in-memory) and `PersistentCache` (streamed to disk).
4. `PsiChangeListener` registered.
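The text composition and batching in step 3 can be sketched as below. `ChunkTextFields` is a hypothetical cut-down view of `CodeChunk`, treating `docComment` and `containerFqn` as nullable is an assumption, and the `embed` parameter is a plain function here rather than the `suspend` method on `EmbeddingProvider`, to keep the example free of coroutine setup.

```kotlin
// Only the CodeChunk fields that feed the embedding text; the full class has more.
data class ChunkTextFields(val signature: String, val docComment: String?, val containerFqn: String?)

/** chunkText = signature + "\n" + docComment + "\n" + containerFqn, with missing parts left empty. */
fun chunkText(c: ChunkTextFields): String =
    listOf(c.signature, c.docComment.orEmpty(), c.containerFqn.orEmpty()).joinToString("\n")

const val BATCH_SIZE = 16

/** Embeds all chunks in batches of BATCH_SIZE, preserving the input order. */
fun embedAll(chunks: List<ChunkTextFields>, embed: (List<String>) -> List<FloatArray>): List<FloatArray> =
    chunks.chunked(BATCH_SIZE).flatMap { batch ->
        val vectors = embed(batch.map(::chunkText))
        check(vectors.size == batch.size) { "provider must return one vector per input" }
        vectors
    }
```

Preserving order is what lets the flattened vector matrix and the parallel chunk list in `VectorStore` stay aligned by index.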
Startup, subsequent opens
1. Cache hit → `VectorStore.loadFromCache(cache)`. No embedding computation.
2. `PsiChangeListener` registered. Files modified while the IDE was closed are detected by PSI's own stamp comparison, and the index builder walks those modified files once on startup.
User query
1. User types in `SearchBar` → debounce 300ms.
2. Dispatch to background: `IndexService.search(query, 20)`.
3. `EmbeddingProvider.embed([query])` → `FloatArray(dim)`.
4. `VectorStore.topK(q, 50)` → 50 candidates.
5. `Reranker.rerank(…)` → top 20.
6. Back on EDT: `ResultListPanel` re-renders.
Navigation
1. Enter/double-click on a `ResultCard` → `OpenFileDescriptor(project, vfile, chunk.startOffset).navigate(true)`.
Incremental re-index
1. User edits a `.kt` file → `PsiChangeListener.childrenChanged`.
2. Debounce coalesces bursts into 1s windows.
3. For each changed file: re-run `LanguageAdapter.extract` → diff new chunks vs cached chunks by `contentHash` → embed only net-new or changed → replace in `VectorStore` → flush delta to `PersistentCache`.
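The diff step for one changed file could be sketched like this. `HashedChunk`, `ChunkDelta`, and `diffByContentHash` are hypothetical names, and the sketch assumes `contentHash` covers only the chunk's own text.

```kotlin
// Hypothetical minimal view of a chunk: just enough to diff on.
data class HashedChunk(val id: String, val contentHash: String)

data class ChunkDelta(
    val toEmbed: List<HashedChunk>,   // new or changed: needs a fresh vector
    val toRemove: List<HashedChunk>,  // present in the cache but gone from the file
    val unchanged: List<HashedChunk>, // vector can be reused as-is
)

/** Compares the cached chunks of one file against a fresh extraction of the same file. */
fun diffByContentHash(cached: List<HashedChunk>, fresh: List<HashedChunk>): ChunkDelta {
    val cachedHashes = cached.map { it.contentHash }.toSet()
    val freshHashes = fresh.map { it.contentHash }.toSet()
    return ChunkDelta(
        toEmbed = fresh.filter { it.contentHash !in cachedHashes },
        toRemove = cached.filter { it.contentHash !in freshHashes },
        unchanged = fresh.filter { it.contentHash in cachedHashes },
    )
}
```

Under that assumption, an edit near the top of a file leaves the hashes of the chunks below it intact while shifting their offsets, so `startOffset` and `endOffset` for the unchanged chunks would still need refreshing from the fresh extraction even though no embedding is recomputed.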
- `AnswerGenerator.kt` interface in a new `rag/` package, with only `NoopAnswerGenerator` wired in.
- `Retriever` already returns `RankedResult(chunk, score, snippet, lineRange)`, which is precisely the shape a RAG prompt needs.
- `CodeAtlasToolWindow` already uses a `JBSplitter` whose bottom component is empty in Phase 1; Phase 2 fills that bottom component with a streaming answer panel and animates the divider down.
- When Phase 2 starts, the decision is: single generative call (direct Anthropic/OpenAI SDK, ~50 LOC) versus agentic tool-calling (Koog). If the UX stays as "answer once with citations", the direct SDK is simpler and Koog is unnecessary. Koog becomes warranted only if the LLM should autonomously invoke IDE tools in sequence (search → read file → find usages → answer).
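One possible shape for the Phase 2 seam is sketched below. Only the names `AnswerGenerator`, `NoopAnswerGenerator`, and `RankedResult` come from the plan; the method signature, the `Answer` type, and the simplified `RankedResult` (with `qualifiedName` standing in for the full `chunk`) are assumptions, and a streaming implementation would likely need a `suspend` or `Flow`-based signature instead.

```kotlin
// Simplified stand-in: the plan's RankedResult carries a full CodeChunk rather than a name.
data class RankedResult(val qualifiedName: String, val score: Float, val snippet: String, val lineRange: IntRange)

/** One generated answer plus the results it cites; the shape is a guess for illustration. */
data class Answer(val text: String, val citations: List<RankedResult>)

interface AnswerGenerator {
    /** Returns an answer grounded in the given results, or null when generation is unavailable. */
    fun generate(query: String, context: List<RankedResult>): Answer?
}

/** Phase 1 wiring: never produces an answer, so the UI keeps the answer panel collapsed. */
object NoopAnswerGenerator : AnswerGenerator {
    override fun generate(query: String, context: List<RankedResult>): Answer? = null
}
```

Returning `null` rather than an empty answer lets the tool window distinguish "no generator configured" from "the generator had nothing to say".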
Modify
- `build.gradle.kts`: add `bundledPlugin("com.intellij.java")` and `bundledPlugin("org.jetbrains.kotlin")`; add `implementation("com.microsoft.onnxruntime:onnxruntime:<pin>")`; add `implementation("ai.djl.huggingface:tokenizers:<pin>")` (or skip if hand-rolled); add `testImplementation` for JUnit 5 & IntelliJ test framework extras as needed.
- `src/main/resources/META-INF/plugin.xml`: add `<depends>com.intellij.modules.java</depends>` and `<depends>org.jetbrains.kotlin</depends>`; register new tool window factory + startup activity + settings configurable + project-level service; update `<description>` and `<vendor>`.
- `src/main/kotlin/com/bugdigger/codeatlas/MyToolWindow.kt`: delete (replaced by `CodeAtlasToolWindow` + `CodeAtlasToolWindowFactory`).
- `src/main/kotlin/com/bugdigger/codeatlas/MyMessageBundle.kt` → rename to `CodeAtlasBundle.kt`.
- `src/main/resources/messages/MyMessageBundle.properties` → rename to `CodeAtlasBundle.properties`; seed with tool window title, search placeholder, indexing states.
Create (source)
- All files listed under "Component responsibilities & file layout" above.
Create (resources)
- `src/main/resources/model/bge-small-en-v1.5-int8.onnx` (or equivalent small ONNX file; document the source URL and license in a sibling `MODEL_CARD.md`)
- `src/main/resources/model/vocab.txt` (WordPiece vocab)
Create (tests) — under src/test/kotlin/com/bugdigger/codeatlas/
- `search/ScoringSignalsTest.kt`: pure-function tests of the signal fusion math.
- `search/VectorStoreTest.kt`: cosine-sim correctness, top-K ordering, cache round-trip.
- `language/KotlinLanguageAdapterTest.kt`: `LightJavaCodeInsightFixtureTestCase` subclass; fixture Kotlin file → assert extracted chunk shape.
- `language/JavaLanguageAdapterTest.kt`: same pattern for Java fixtures.
- `index/IndexBuilderIntegrationTest.kt`: end-to-end on a fixture project: index → run 5 known queries → assert expected chunks appear in top-3.
- `embedding/OnnxEmbeddingProviderTest.kt`: sanity test: output dim matches, identical input yields identical vector, vector is unit-normalized.
| Need | API |
|---|---|
| Walk project sources, respect excludes | ProjectFileIndex.iterateContent |
| Background task with progress | Task.Backgroundable, ProgressIndicator |
| Navigate to symbol | OpenFileDescriptor.navigate, FileEditorManager |
| Incremental PSI change events | PsiManager.getInstance().addPsiTreeChangeListener |
| Project-level service | @Service(Service.Level.PROJECT) |
| Startup hook | ProjectActivity (registered as com.intellij.postStartupActivity) |
| Persistent plugin data dir | PathManager.getSystemPath() |
| Settings UI | com.intellij.applicationConfigurable + @State/@Storage |
| Tool window | com.intellij.toolWindow extension + ToolWindowFactory |
| Event bus for index status | MessageBus + topic constant |
| Off-EDT dispatch | AppExecutorUtil or Kotlin coroutines on Dispatchers.Default |
Week 1 — Skeleton + extraction
1. Copy this plan to `devdocs/plan.md` (the user's requested location).
2. Rename `MyMessageBundle` → `CodeAtlasBundle`; remove template tool window.
3. Wire `build.gradle.kts`: add `com.intellij.java` + Kotlin bundled plugin dependencies; add ONNX + tokenizer dependencies.
4. Build `CodeChunk` + `LanguageAdapter` interface + `KotlinLanguageAdapter` + `JavaLanguageAdapter` with unit tests.
5. Build `CodeAtlasIndexService` stub + `IndexBuilder.fullIndex()` scaffold that extracts chunks but does not yet embed. Smoke test: count chunks on a fixture project.
Week 2 — Embeddings + retrieval
6. Add ONNX model file + tokenizer. Implement OnnxEmbeddingProvider + tests (dim, determinism, unit norm).
7. Implement VectorStore + PersistentCache + tests (round-trip, version invalidation).
8. Implement Retriever + ScoringSignals + Reranker + tests (pure-function signal math).
9. End-to-end wire-up: IndexBuilder.fullIndex() now actually embeds and populates the store.
Week 3 — UI + lifecycle
10. Build CodeAtlasToolWindowFactory, CodeAtlasToolWindow, SearchBar, ResultListPanel, ResultCard. Debounce, async search, EDT-safe rendering.
11. Implement navigation on result click (OpenFileDescriptor).
12. CodeAtlasStartupActivity triggers IndexService.ensureIndexed() on project open with Task.Backgroundable progress UI.
13. PsiChangeListener + incremental re-index with debounced coalescing.
14. Indexing status bus + UI indicator.
Week 4 — Polish + verification
15. Settings panel (includeTestSources).
16. Integration test: full index + 5 fixed queries against a fixture project.
17. Run against a real small Kotlin codebase (the plugin itself + one public repo) — tune scoring weights.
18. Run verifyPlugin Gradle task; fix any reported issues.
19. Write a README.md section with screenshots and usage; move devdocs/plan.md to the project root if appropriate.
- Build + run: `./gradlew runIde` (via the existing `Run IDE with Plugin` run configuration) opens a sandbox IDE with the plugin loaded.
- Manual smoke on a real project:
  - Open a small Kotlin/Java project (recommend the plugin's own repo or a simple Spring Boot sample).
  - Wait for "Indexing… N/M" status to finish.
  - Search: "where is authentication done" / "entry point" / "tool window" / "how is indexing triggered" → expect relevant symbols in top-5.
  - Click a result → IDE jumps to the exact PSI range.
  - Modify a file (rename a class) → within ~2s, the new name appears in search; the old one does not.
  - Close and reopen the project → no re-index runs (cache hit).
- Automated: `./gradlew test` runs unit + `BasePlatformTestCase` integration tests. `./gradlew verifyPlugin` must pass.
- JetBrains MCP verification (during development): use `mcp__jetbrains__build_project` after each change; `mcp__jetbrains__execute_run_configuration` on `Run IDE with Plugin`; `mcp__jetbrains__get_file_problems` to catch inspection issues before build.
- Answering questions with an LLM (Phase 2).
- Languages other than Kotlin/Java (interface exists; implementations later).
- Search Everywhere integration, right-click "ask about this" actions.
- Remote embedding providers (interface exists; implementations later).
- HNSW / ANN vector index (linear scan is sufficient for target scale).
- Multi-project/workspace-wide search.
- Commit-message / VCS history signal in ranking.
- Marketplace publishing.
| Risk | Mitigation |
|---|---|
| ONNX + tokenizer bundle is too large (>50MB plugin JAR) | Use int8-quantized bge-small (~33MB). If still too large, download-on-first-run to PathManager.getSystemPath() with a progress UI. |
| Embedding latency slows typing | Embed on a single dedicated dispatcher; debounce UI; cap query length. |
| Kotlin PSI API differences across IDE versions | Pin to intellijIdea("2025.2.4") for Phase 1; verify with verifyPlugin before any version bumps. |
| Initial index on a huge project blocks | Task.Backgroundable + progress + user can close the IDE; on reopen we resume from partial cache. |
| Ranking quality is weak on vague queries | Signal fusion weights are plain constants, easy to tune; keep a small hand-labeled eval set of 20 queries → expected top-3 for regression-testing tuning changes. |
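The hand-labeled eval set from the last row could be scored with a top-k hit rate, sketched below. `EvalCase`, `hitRateAtK`, and the `search` parameter (a function from a query to ranked qualified names) are hypothetical; the plan only specifies roughly 20 queries, each with expected top-3 symbols.

```kotlin
/** One labeled query: any of the expected symbols counts as a hit. */
data class EvalCase(val query: String, val expectedFqns: Set<String>)

/** Fraction of cases where at least one expected symbol appears in the first k results. */
fun hitRateAtK(cases: List<EvalCase>, k: Int, search: (String) -> List<String>): Double {
    if (cases.isEmpty()) return 0.0
    val hits = cases.count { case -> search(case.query).take(k).any { it in case.expectedFqns } }
    return hits.toDouble() / cases.size
}
```

Recording this number before and after each weight change gives a simple regression signal: a tuning change that lowers the top-3 hit rate on the labeled set deserves a second look before it is kept.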