Avoid double-counting Automaton in CompiledAutomaton.ramBytesUsed #16046

Merged: javanna merged 6 commits into apache:main from reugn:fix-compiled-automaton-rambytes on May 28, 2026

@reugn reugn commented May 10, 2026

Description

CompiledAutomaton.ramBytesUsed() counts the underlying Automaton twice on the DFA path, over-reporting retained heap by 18–35% on non-trivial wildcard and regexp queries.

The automaton field is aliased to runAutomaton.automaton — a single Automaton instance referenced from two places:

// CompiledAutomaton.java:261-263
runAutomaton = new ByteRunAutomaton(binary, true);
this.automaton = runAutomaton.automaton;   // same reference

ramBytesUsed() accounts for it twice: once directly via sizeOfObject(automaton), and again through sizeOfObject(runAutomaton) which delegates to RunAutomaton.ramBytesUsed() and adds sizeOfObject(automaton) itself.

The fix is to drop the redundant sizeOfObject(automaton) from CompiledAutomaton.ramBytesUsed(). In the NFA branch both fields are null, so this is a no-op there.
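The double-count can be illustrated without Lucene. The following is a minimal sketch, assuming hypothetical stand-in classes (`Inner`, `Runner`, `Outer`) and made-up byte sizes; it only mirrors the aliasing pattern described above and is not the actual Lucene code.

```java
public class AliasedAccountingSketch {
    // Hypothetical stand-in for Automaton: reports a fixed size in bytes.
    static final class Inner {
        final long bytes;
        Inner(long bytes) { this.bytes = bytes; }
        long ramBytesUsed() { return bytes; }
    }

    // Hypothetical stand-in for RunAutomaton: already includes its Inner.
    static final class Runner {
        final Inner inner;
        final long ownBytes;
        Runner(Inner inner, long ownBytes) {
            this.inner = inner;
            this.ownBytes = ownBytes;
        }
        long ramBytesUsed() { return ownBytes + inner.ramBytesUsed(); }
    }

    // Hypothetical stand-in for CompiledAutomaton on the DFA path.
    static final class Outer {
        final Runner runner;
        final Inner inner;   // alias of runner.inner, not a second object
        final long ownBytes;
        Outer(long innerBytes, long runnerOwnBytes, long ownBytes) {
            this.runner = new Runner(new Inner(innerBytes), runnerOwnBytes);
            this.inner = runner.inner;
            this.ownBytes = ownBytes;
        }
        // Counts the shared Inner directly and again through the Runner.
        long ramBytesUsedBefore() {
            return ownBytes + inner.ramBytesUsed() + runner.ramBytesUsed();
        }
        // Counts the shared Inner once, through the Runner only.
        long ramBytesUsedAfter() {
            return ownBytes + runner.ramBytesUsed();
        }
    }

    public static void main(String[] args) {
        Outer o = new Outer(1000, 200, 50);
        System.out.println(o.ramBytesUsedBefore()); // 2250
        System.out.println(o.ramBytesUsedAfter());  // 1250
    }
}
```

The difference between the two totals is exactly the size of the shared object, which is why the over-report grows with the automaton.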

import org.apache.lucene.index.Term;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.util.automaton.*;

Automaton dfa = Operations.determinize(
    WildcardQuery.toAutomaton(new Term("f", "*" + "x".repeat(3000) + "*")),
    Integer.MAX_VALUE);
CompiledAutomaton ca = new CompiledAutomaton(dfa, false, true, false);

System.out.println(ca.ramBytesUsed());
// Before: 8_361_506   (over-reports by 2.15 MB, ratio ~1.35)
// After:  6_214_953   (matches retained heap within 159 bytes)

Cross-checked against org.openjdk.jol.info.GraphLayout.parseInstance(ca).totalSize(). In the table below, "reported" is CompiledAutomaton.ramBytesUsed() and "retained" is JOL's GraphLayout.totalSize().

| Pattern | Reported before (bytes) | Reported after (bytes) | Retained (bytes) |
| --- | ---: | ---: | ---: |
| `*foo*` | 13,634 | 10,593 | 10,752 |
| `*foo*bar*baz*` | 53,586 | 42,921 | 43,080 |
| `*a*b*c*d*e*f*g*` | 79,282 | 61,577 | 61,736 |
| `*` + `x`×1000 + `*` | 2,789,994 | 2,073,945 | 2,074,104 |
| `*` + `x`×3000 + `*` | 8,361,506 | 6,214,953 | 6,215,112 |
| `*a*b*…*j*` (depth 10) | 154,522 | 121,457 | 121,616 |
| `*a*b*…*t*` (depth 20) | 677,562 | 548,497 | 548,656 |
| `.*foo.*bar.*` | 34,258 | 27,161 | 27,320 |

After the fix, ramBytesUsed() is a constant 159 bytes below the JOL-measured retained heap in every case.
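The arithmetic behind the reported and retained figures above can be rechecked with a few lines of Java. This is a small sketch; the class and method names (`OverReportCheck`, `offsetAfter`, `ratioBefore`) are made up for illustration, and the numbers are copied from the table.

```java
public class OverReportCheck {
    // Each row: {reported before, reported after, retained (JOL)}, in bytes.
    static final long[][] ROWS = {
        {13_634, 10_593, 10_752},
        {53_586, 42_921, 43_080},
        {79_282, 61_577, 61_736},
        {2_789_994, 2_073_945, 2_074_104},
        {8_361_506, 6_214_953, 6_215_112},
        {154_522, 121_457, 121_616},
        {677_562, 548_497, 548_656},
        {34_258, 27_161, 27_320},
    };

    // Gap between JOL's retained size and the post-fix reported size.
    static long offsetAfter(long[] row) {
        return row[2] - row[1];
    }

    // How far the pre-fix value overshoots retained heap, as a ratio.
    static double ratioBefore(long[] row) {
        return (double) row[0] / row[2];
    }

    public static void main(String[] args) {
        for (long[] row : ROWS) {
            System.out.printf("before/retained=%.3f retained-after=%d%n",
                ratioBefore(row), offsetAfter(row));
        }
    }
}
```

Every row yields the same 159-byte gap after the fix, while the pre-fix ratio varies by pattern.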

@github-actions

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions Bot added the Stale label May 25, 2026
@github-actions github-actions Bot modified the milestones: 11.0.0, 10.5.0 May 25, 2026
@github-actions github-actions Bot removed the Stale label May 26, 2026

@javanna javanna left a comment

LGTM

@javanna javanna merged commit ef20aa5 into apache:main May 28, 2026
13 checks passed

javanna commented May 28, 2026

Thanks @reugn !
