feat: AST-aware chunking for Java (.java)#676

Open
Ciel2142 wants to merge 4 commits into
tobi:mainfrom
Ciel2142:feat/java-ast-chunking

Ciel2142 commented May 23, 2026

Summary

Adds Java (.java) to QMD's AST-aware chunking. With --chunk-strategy auto,
Java sources are split at class / interface / record / enum / method / field /
import boundaries via tree-sitter instead of arbitrary regex cuts, improving
chunk coherence and retrieval quality for Java code. Implements the Java
portion of #565.

No behavior change unless --chunk-strategy auto is requested — existing
languages and the default regex strategy are untouched, and src/store.ts
needs no change.

How it works

src/ast.ts is data-driven: a language is registered purely through lookup
tables, no new code paths.

  • SupportedLanguage += "java"; EXTENSION_MAP += .java
  • GRAMMAR_MAP += tree-sitter-java — official package, ships a prebuilt
    ~414 KB wasm, ABI-compatible with this repo's web-tree-sitter@0.26.8
  • LANGUAGE_QUERIES.java — tree-sitter query mapping Java nodes onto the
    existing cross-language capture vocabulary
  • SCORE_MAP += one new key, field: 60

getASTStatus() / qmd status pick up Java automatically.
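
As a rough illustration of the table-driven registration described above, here is a minimal sketch. The table names come from this description, but their exact types, the non-Java entries, and the `detectLanguage` helper are assumptions made for the example, not the contents of src/ast.ts.

```typescript
// Assumed shapes: only the table names and the Java entries come from the PR text.
type SupportedLanguage = "typescript" | "python" | "java";

// File extension -> language. Adding ".java" is the whole detection change.
const EXTENSION_MAP: Record<string, SupportedLanguage> = {
  ".ts": "typescript",
  ".py": "python",
  ".java": "java",
};

// Language -> grammar package that ships the wasm.
const GRAMMAR_MAP: Record<SupportedLanguage, string> = {
  typescript: "tree-sitter-typescript",
  python: "tree-sitter-python",
  java: "tree-sitter-java",
};

// Capture name -> break-point score; "field" is the one new key.
const SCORE_MAP: Record<string, number> = {
  class: 100,
  iface: 100,
  method: 90,
  enum: 80,
  type: 80,
  import: 60,
  field: 60,
};

// Hypothetical helper: case-insensitive lookup by the final extension.
function detectLanguage(path: string): SupportedLanguage | null {
  const dot = path.lastIndexOf(".");
  if (dot < 0) return null;
  return EXTENSION_MAP[path.slice(dot).toLowerCase()] ?? null;
}
```

Because detection only consults the lookup table, an uppercase extension or a qmd:// path resolves the same way as a plain lowercase file name in this sketch.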

Query & scoring

Capture   Java nodes                                   Score
@class    class, record                                100
@iface    interface                                    100
@enum     enum                                         80
@type     annotation type                              80
@method   method, constructor, compact constructor     90
@field    field, interface constant, enum constant     60 (new)
@import   package, import                              60

Reuses the established score vocabulary so Java stays consistent with the other
six languages; field is the only new key.
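
To make the mapping concrete, the sketch below shows what such a query could look like and how each capture resolves to a score. The node type names are my understanding of the tree-sitter-java grammar and the query text is an assumption, so the real LANGUAGE_QUERIES.java entry may differ; `captureNames` and `scoreFor` are hypothetical helpers.

```typescript
// Assumed query: node types as I understand the tree-sitter-java grammar,
// mapped onto the capture names listed in the table above.
const JAVA_QUERY = `
(class_declaration) @class
(record_declaration) @class
(interface_declaration) @iface
(enum_declaration) @enum
(annotation_type_declaration) @type
(method_declaration) @method
(constructor_declaration) @method
(compact_constructor_declaration) @method
(field_declaration) @field
(constant_declaration) @field
(enum_constant) @field
(package_declaration) @import
(import_declaration) @import
`;

// Scores taken from the table above.
const SCORES: Record<string, number> = {
  class: 100, iface: 100, method: 90, enum: 80, type: 80, field: 60, import: 60,
};

// Hypothetical helper: list the distinct capture names used by a query.
function captureNames(query: string): string[] {
  const names: string[] = [];
  const re = /@([a-z_]+)/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(query)) !== null) {
    if (names.indexOf(m[1]) === -1) names.push(m[1]);
  }
  return names;
}

// Hypothetical helper: unknown captures score 0 rather than throwing.
function scoreFor(capture: string): number {
  return SCORES[capture] ?? 0;
}
```

A check like `captureNames(JAVA_QUERY).every(n => scoreFor(n) > 0)` is one way to confirm that no capture in the query is missing from the score table.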

Annotations

tree-sitter-java includes a declaration's leading annotations in the
declaration node's span (the modifiers node is a child), so the node's
startIndex points at the first annotation. The break point for an
@Service-annotated class or an @Override method therefore lands at the
annotation, and a chunk boundary never splits an annotation from the
declaration it decorates. Pinned by a test.
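
The effect can be illustrated without running the parser. In the sketch below the break positions are hand-written stand-ins for the startIndex values the grammar would report (an assumption, not real parser output), and `splitAt` is a hypothetical helper that cuts at every break point, which is simpler than what the chunker actually does.

```typescript
interface BreakPoint {
  pos: number;   // stand-in for the declaration node's startIndex
  score: number;
  type: string;
}

const source = [
  "package demo;",
  "",
  "@Service",
  "public class Greeter {",
  "  @Override",
  "  public String toString() { return \"hi\"; }",
  "}",
].join("\n");

// Hand-written positions: each declaration starts at its first annotation,
// not at the "public" keyword that follows it.
const breakPoints: BreakPoint[] = [
  { pos: source.indexOf("package"), score: 60, type: "import" },
  { pos: source.indexOf("@Service"), score: 100, type: "class" },
  { pos: source.indexOf("@Override"), score: 90, type: "method" },
];

// Hypothetical helper: cut the text at every position, in ascending order.
function splitAt(text: string, positions: number[]): string[] {
  const sorted = [...positions].sort((a, b) => a - b);
  const chunks: string[] = [];
  for (let i = 0; i < sorted.length; i++) {
    chunks.push(text.slice(sorted[i], sorted[i + 1] ?? text.length));
  }
  return chunks;
}

const chunks = splitAt(source, breakPoints.map((bp) => bp.pos));
```

Here the second chunk begins with `@Service` and the third with `@Override`, so each annotation travels with the declaration it decorates.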

Scope — Kotlin

Java only, to keep this PR focused. Kotlin (.kt / .kts) AST chunking is
coming in a separate follow-up PR.

Testing

  • test/ast.test.ts: detection (incl. uppercase extension + qmd:// paths),
    break points across package/import/class/interface/enum/record/method/field/
    annotation-type, score hierarchy, position ordering, annotation-boundary
    positioning, and grammar availability via getASTStatus().
  • test/ast-chunking.test.ts: integration check that the new field score
    flows through the chunking pipeline.
  • Green on both test runners (vitest and bun test), and tsc type-checks cleanly.
  • Also exercised end-to-end on a real ~360-file Spring Boot codebase: indexed +
    embedded, Java chunks land on method/class/Javadoc boundaries and are
    retrievable via keyword and semantic search.

Packaging

The Java wasm (~414 KB) is comparable to the already-shipped TypeScript grammar
— no meaningful change to install size.

Commits

  • build: add tree-sitter-java grammar dependency
  • feat: register Java in the AST layer
  • docs: CLAUDE.md / README.md / CHANGELOG.md
  • test: integration coverage for the Java field score

Part of #565 (Java; Kotlin tracked separately — not auto-closing).

🤖 Generated with Claude Code

AT_VALukin and others added 4 commits May 23, 2026 23:10
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>