feat: AST-aware chunking for Java (.java) #676
Open
Ciel2142 wants to merge 4 commits into
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds Java (`.java`) to QMD's AST-aware chunking. With `--chunk-strategy auto`,
Java sources are split at class / interface / record / enum / method / field /
import boundaries via tree-sitter instead of arbitrary regex cuts, improving
chunk coherence and retrieval quality for Java code. Implements the Java
portion of #565.
No behavior change unless `--chunk-strategy auto` is requested: existing
languages and the default `regex` strategy are untouched, and `src/store.ts`
needs no change.
How it works
`src/ast.ts` is data-driven: a language is registered purely through lookup
tables, with no new code paths.

- `SupportedLanguage` += `"java"`
- `EXTENSION_MAP` += `.java`
- `GRAMMAR_MAP` += `tree-sitter-java`: official package, ships a prebuilt ~414 KB wasm, ABI-compatible with this repo's `web-tree-sitter@0.26.8`
- `LANGUAGE_QUERIES.java`: tree-sitter query mapping Java nodes onto the existing cross-language capture vocabulary
- `SCORE_MAP` += one new key, `field: 60`
- `getASTStatus()` / `qmd status` pick up Java automatically.

Query & scoring
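As a rough sketch of how capture names could be turned into scored break points: only the `field: 60` entry and the capture names come from this PR. The other score values, the `Capture` and `BreakPoint` shapes, and the `toBreakPoints` helper are placeholders for illustration, not the contents of `src/ast.ts`.

```typescript
// Placeholder scores, except `field: 60`, which is the key this PR adds.
const SCORE_MAP: Record<string, number> = {
  class: 100, // placeholder value
  method: 90, // placeholder value
  field: 60, // from this PR
  import: 50, // placeholder value
};

// Hypothetical shape for a tree-sitter query capture.
interface Capture {
  name: string;
  startIndex: number;
}

// Hypothetical shape for a candidate chunk boundary.
interface BreakPoint {
  pos: number;
  score: number;
  kind: string;
}

// Map captures to break points via the score table, in source order.
function toBreakPoints(captures: Capture[]): BreakPoint[] {
  const points: BreakPoint[] = [];
  for (const c of captures) {
    const score: number | undefined = SCORE_MAP[c.name];
    if (score === undefined) continue; // capture without a score: no break point
    points.push({ pos: c.startIndex, score, kind: c.name });
  }
  return points.sort((a, b) => a.pos - b.pos);
}

const points = toBreakPoints([
  { name: "method", startIndex: 120 },
  { name: "field", startIndex: 40 },
  { name: "class", startIndex: 0 },
  { name: "comment", startIndex: 10 }, // not in the score table, so it is skipped
]);
```

Under these assumptions, supporting a new declaration kind only needs a new key in the score table, which matches the one-key change described for `field`.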
Captures: `@class`, `@iface`, `@enum`, `@type`, `@method`, `@field`, `@import`.

Reuses the established score vocabulary so Java stays consistent with the other
six languages; `field` is the only new key.

Annotations
tree-sitter-java includes a declaration's leading annotations in its
`startIndex` (the `modifiers` node is a child), so an `@Service`-annotated class
or `@Override` method breaks at the annotation: a chunk boundary never
splits an annotation from the declaration it decorates. Pinned by a test.
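To illustrate the effect on a chunk boundary, here is a small sketch that does not call tree-sitter at all. `classStart` stands in for the `startIndex` that tree-sitter-java would report for the class declaration, and `splitAt` is a hypothetical helper, not a function from this repo.

```typescript
const source = [
  "import java.util.List;",
  "",
  "@Service",
  "public class UserService {",
  "  @Override",
  "  public String toString() { return \"UserService\"; }",
  "}",
].join("\n");

// Stand-in for the class declaration's startIndex. Per this PR, tree-sitter-java
// reports it at the leading annotation, not at `public class`.
const classStart = source.indexOf("@Service");

// Cut the text at a break position into the part before and the part after it.
function splitAt(text: string, pos: number): [string, string] {
  return [text.slice(0, pos), text.slice(pos)];
}

const [head, tail] = splitAt(source, classStart);
```

With the break placed at the annotation, `head` holds only the import and `tail` begins with `@Service` followed by the class it decorates.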
Scope — Kotlin
Java only, to keep this PR focused. Kotlin (`.kt` / `.kts`) AST chunking is
coming in a separate follow-up PR.
Testing
- `test/ast.test.ts`: detection (incl. uppercase extension + `qmd://` paths), break points across package/import/class/interface/enum/record/method/field/annotation-type, score hierarchy, position ordering, annotation-boundary positioning, and grammar availability via `getASTStatus()`.
- `test/ast-chunking.test.ts`: integration check that the new `field` score flows through the chunking pipeline.
- Run under `vitest` and `bun test`, plus `tsc`.
- Once embedded, Java chunks land on method/class/Javadoc boundaries and are retrievable via keyword and semantic search.
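The detection cases mentioned above (uppercase extension, `qmd://` paths) could be exercised along these lines. `detectLanguage`, the shape of `EXTENSION_MAP`, the `.ts` entry, and the example paths are assumptions for illustration; the real helper in `src/ast.ts` may have a different name and signature.

```typescript
// Hypothetical shapes: the real tables and helper in src/ast.ts may differ.
type SupportedLanguage = "typescript" | "java";

const EXTENSION_MAP: Record<string, SupportedLanguage> = {
  ".ts": "typescript",
  ".java": "java", // the entry this PR adds
};

// Returns the language for a path, or null when the extension is not registered.
function detectLanguage(path: string): SupportedLanguage | null {
  const dot = path.lastIndexOf(".");
  if (dot < 0) return null;
  const ext = path.slice(dot).toLowerCase(); // case-insensitive match
  return EXTENSION_MAP[ext] ?? null;
}

const cases = [
  "src/Main.java",
  "src/LEGACY.JAVA", // uppercase extension
  "qmd://docs/src/Main.java", // hypothetical qmd:// path
  "notes.md", // unregistered extension
];
const detected = cases.map(detectLanguage);
```

Because the lookup only inspects the text after the last dot, a virtual `qmd://` path and a plain filesystem path resolve the same way in this sketch.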
Packaging
The Java wasm (~414 KB) is comparable to the already-shipped TypeScript grammar
— no meaningful change to install size.
Commits
- `build:` add `tree-sitter-java` grammar dependency
- `feat:` register Java in the AST layer
- `docs:` CLAUDE.md / README.md / CHANGELOG.md
- `test:` integration coverage for the Java field score

Part of #565 (Java; Kotlin tracked separately, not auto-closing).
🤖 Generated with Claude Code