
Commit f5f64ef

paper(v2): fill BM25 baseline comparison (Δ=+0.339, p=4e-116, n=1387 both-OK)
1 parent 04ebb74 commit f5f64ef

2 files changed


Binary file (3.34 KB) not shown.

docs/Context-Selection-for-Git-Diff/v2/main.tex

Lines changed: 15 additions & 13 deletions
@@ -33,7 +33,7 @@
\maketitle

\begin{abstract}
- Retrieving optimal context for Large Language Models to understand code changes is a critical challenge in automated software engineering. Current approaches rely on naive windowing or purely lexical retrieval, missing complex structural dependencies. We formulate diff-aware code context selection as \textbf{budgeted utility maximization} over multi-resolution fragments on a typed dependency graph. The deployed system uses a modular relevance objective with lazy-greedy budgeted selection and engineering heuristics; we additionally describe a submodular concept-coverage extension and identify clean analyzable variants whose approximation guarantees are known. Two structural components ground the framework: (i) a partition matroid over multi-resolution fragments enforcing structural non-redundancy, and (ii) a pluggable relevance signal $\hat{w}(f, \Delta)$ instantiated via three families---Personalized PageRank, bounded ego-network expansion, and BM25 lexical retrieval---together with a hybrid combiner. The framework is evaluated by file-level recall against annotated golden contexts at a fixed token budget across three multi-language software-engineering benchmarks (1500 instances total). At an $8000$-token budget under hybrid scoring, diffctx achieves pooled file recall $0.875$ [95\% CI $0.861, 0.889$], with per-benchmark recall $0.911$ on SWE-bench Verified, $0.922$ on PolyBench-500, and $0.793$ on ContextBench Verified. Baseline comparisons (BM25 over patch identifiers, Aider repo-map in fair-input and oracle-mentioned modes) and a four-mode scoring ablation are queued.
+ Retrieving optimal context for Large Language Models to understand code changes is a critical challenge in automated software engineering. Current approaches rely on naive windowing or purely lexical retrieval, missing complex structural dependencies. We formulate diff-aware code context selection as \textbf{budgeted utility maximization} over multi-resolution fragments on a typed dependency graph. The deployed system uses a modular relevance objective with lazy-greedy budgeted selection and engineering heuristics; we additionally describe a submodular concept-coverage extension and identify clean analyzable variants whose approximation guarantees are known. Two structural components ground the framework: (i) a partition matroid over multi-resolution fragments enforcing structural non-redundancy, and (ii) a pluggable relevance signal $\hat{w}(f, \Delta)$ instantiated via three families---Personalized PageRank, bounded ego-network expansion, and BM25 lexical retrieval---together with a hybrid combiner. The framework is evaluated by file-level recall against annotated golden contexts at a fixed token budget across three multi-language software-engineering benchmarks (1500 instances total). At an $8000$-token budget under hybrid scoring, diffctx achieves pooled file recall $0.875$ [95\% CI $0.861, 0.889$], with per-benchmark recall $0.911$ on SWE-bench Verified, $0.922$ on PolyBench-500, and $0.793$ on ContextBench Verified. A paired-bootstrap comparison against a same-budget BM25-over-patch-identifiers baseline shows pooled diffctx recall $0.878$ vs.\ BM25 $0.539$ (paired $\Delta = +0.339$, 95\% CI $[+0.318, +0.360]$, Wilcoxon $p = 4 \times 10^{-116}$ on $n=1387$ instances where both methods produced a valid selection). An Aider repo-map comparison (fair-input and oracle-mentioned modes) and a four-mode scoring ablation are queued.
\end{abstract}

\newpage
@@ -559,37 +559,39 @@ \subsection{Observations}
\subsection{Baseline Comparisons}
\label{sec:prelim-baselines}

- Two external file-level baselines were implemented to run the same protocol (same manifests, same budget, same metric) so that the comparison is apples-to-apples. Both are scheduled to execute after the diffctx hybrid run completes; the rendered comparison tables below show the structure that will be populated. Methodology follows the standard IR significance protocol~\cite{smucker2007comparison}: per-instance paired bootstrap on the recall delta ($B{=}10{,}000$, percentile CI) and Wilcoxon signed-rank $p$-value, computed via \texttt{benchmarks/adapters/final\_eval.py::render\_comparison\_table}.
+ Two external file-level baselines run the same protocol (same manifests, same budget, same metric) so that the comparison is apples-to-apples. The BM25 baseline has completed; the Aider runs are scheduled. Methodology follows the standard IR significance protocol~\cite{smucker2007comparison}: per-instance paired bootstrap on the recall delta ($10{,}000$ resamples, percentile CI) and Wilcoxon signed-rank $p$-value, computed via \texttt{benchmarks/adapters/final\_eval.py::render\_comparison\_table}.
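In outline, the protocol amounts to the following minimal Python sketch. It is illustrative only, not the repository's render_comparison_table; the array names, the percentile levels, and the seed handling are assumptions.

# Minimal sketch of the paired-bootstrap / Wilcoxon protocol described above.
# Illustrative only; not benchmarks/adapters/final_eval.py::render_comparison_table.
import numpy as np
from scipy.stats import wilcoxon

def paired_comparison(recall_a, recall_b, n_boot=10_000, seed=42):
    # recall_a, recall_b: per-instance file recall for the two methods,
    # aligned on the same instances (here, the both-OK subset).
    a = np.asarray(recall_a, dtype=float)
    b = np.asarray(recall_b, dtype=float)
    delta = a - b                                  # positive favors the first method
    rng = np.random.default_rng(seed)
    n = len(delta)
    # Percentile bootstrap CI on the mean per-instance delta (10,000 resamples).
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = delta[idx].mean(axis=1)
    ci_lo, ci_hi = np.percentile(boot_means, [2.5, 97.5])
    # Two-sided Wilcoxon signed-rank test on the paired recalls.
    _, p_value = wilcoxon(a, b, alternative="two-sided")
    return delta.mean(), (ci_lo, ci_hi), p_value

Applied per benchmark and to the pooled both-OK subset, this kind of procedure yields the delta, CI, and Wilcoxon p columns reported below, up to implementation details in the repository script.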
\paragraph{BM25 over patch identifiers.} \texttt{rank-bm25} indexes all repository files; the query is the set of identifiers extracted from the gold patch (camelCase + snake\_case decomposition, language-keyword filter). Files are packed in BM25 rank order under strict greedy budget compliance. Implementation in \texttt{benchmarks/baselines/bm25\_baseline.py}. This isolates pure lexical retrieval without any graph structure.
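For concreteness, a condensed sketch of such a baseline follows. It mirrors the description above but is not benchmarks/baselines/bm25_baseline.py; the keyword filter, the identifier tokenizer, and the character-based token count are simplifying assumptions.

# Sketch of a BM25-over-patch-identifiers file baseline, as described above.
# Illustrative only; tokenization and token counting are deliberately crude.
import re
from rank_bm25 import BM25Okapi

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
KEYWORDS = {"def", "class", "return", "import", "if", "else", "for", "while"}  # illustrative filter

def identifier_tokens(text):
    # camelCase + snake_case decomposition, lowercased, keywords dropped.
    toks = []
    for tok in IDENT.findall(text):
        if tok.lower() in KEYWORDS:
            continue
        parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", tok)
        toks.extend(p.lower() for p in parts if p)
    return toks

def bm25_select(repo_files, patch_text, budget_tokens, count_tokens=lambda s: len(s) // 4):
    # repo_files: {path: file_text}. Returns file paths packed in BM25 rank
    # order while staying under the token budget.
    paths = list(repo_files)
    bm25 = BM25Okapi([identifier_tokens(repo_files[p]) for p in paths])
    scores = bm25.get_scores(identifier_tokens(patch_text))
    selected, used = [], 0
    for path, _ in sorted(zip(paths, scores), key=lambda x: -x[1]):
        cost = count_tokens(repo_files[path])
        if used + cost <= budget_tokens:
            selected.append(path)
            used += cost
    return selected

Whether over-budget files are skipped (as here) or packing stops at the first overflow is a detail of the "strict greedy budget compliance" wording that this sketch does not pin down.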
- \paragraph{Aider repo-map.} The reference open-source approach using the same primitives (tree-sitter AST + Personalized PageRank + token-budgeted packing). Invoked through an isolated \texttt{uv tool} subprocess with the upstream \texttt{aider-chat==0.86.2} package; \texttt{mentioned\_fnames} populated from \texttt{instance.gold\_files} and \texttt{mentioned\_idents} from the gold patch. This input mapping favors Aider relative to diffctx, which receives only the patch text. Implementation in \texttt{benchmarks/baselines/aider\_baseline.py}.
+ \paragraph{Aider repo-map.} The reference open-source approach using the same primitives (tree-sitter AST + Personalized PageRank + token-budgeted packing). Invoked through an isolated \texttt{uv tool} subprocess with the upstream \texttt{aider-chat==0.86.2} package. We run Aider in two modes: \emph{fair-input} (\texttt{mentioned\_fnames} restricted to file paths visible in the input diff text, the same information diffctx receives) and \emph{oracle-mentioned} (\texttt{mentioned\_fnames} populated from \texttt{instance.gold\_files}, an upper-bound stress test rather than a baseline). Implementation in \texttt{benchmarks/baselines/aider\_baseline.py}.
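The two modes differ only in how mentioned_fnames is derived. A sketch of that input mapping is below; the diff-header parsing and the instance field names are illustrative assumptions, and the actual wiring lives in benchmarks/baselines/aider_baseline.py.

# Illustrative input mapping for the two Aider repo-map modes described above.
import re

DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)

def fair_input_fnames(patch_text):
    # Fair-input mode: only file paths visible in the diff text itself,
    # i.e. the same information diffctx receives.
    paths = set()
    for a_path, b_path in DIFF_HEADER.findall(patch_text):
        paths.update((a_path, b_path))
    return sorted(paths)

def oracle_fnames(instance):
    # Oracle-mentioned mode: gold files straight from the benchmark instance,
    # an upper-bound stress test rather than a baseline.
    return sorted(instance.gold_files)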
+
+ \paragraph{Both-OK fair filter.} The BM25 run was launched with \texttt{-{}-workers 2} on a shared bare-clone cache; this introduced concurrent \texttt{git} index-lock contention and produced 110 \texttt{clone\_fail} statuses for BM25 vs.\ 4 for diffctx (which ran with \texttt{-{}-workers 1}). To avoid penalizing BM25 for an infrastructure asymmetry, the comparison rows below are computed on the both-OK subset ($n{=}1387$): instances where both methods produced a valid selection. The naive intent-to-treat delta (with failed instances counted as recall $0$) is larger but is not the right comparison; we report the both-OK numbers as primary and state the asymmetry explicitly. A re-run of BM25 with \texttt{-{}-workers 1} (estimated 2--3\,h) is scheduled to remove this caveat in a future revision; the both-OK $\Delta$ is expected to remain within the reported CI.
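A sketch of the both-OK join on per-instance result rows follows; the status, recall, and instance_id field names are assumptions about the result schema rather than the repository's actual row format.

# Sketch of the both-OK fair filter: keep only instances where both methods
# produced a valid selection, then form paired per-instance recall deltas.
def both_ok_deltas(diffctx_rows, bm25_rows):
    bm25_by_id = {r["instance_id"]: r for r in bm25_rows}
    pairs = []
    for row in diffctx_rows:
        other = bm25_by_id.get(row["instance_id"])
        if other is None or row["status"] != "ok" or other["status"] != "ok":
            continue  # drops e.g. clone_fail rows on either side
        pairs.append((row["recall"], other["recall"]))
    deltas = [a - b for a, b in pairs]
    return pairs, deltas  # len(pairs) is the both-OK n (1387 in the reported run)

The paired recalls produced here are what the bootstrap and Wilcoxon sketch above consumes.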
\begin{table}[h]
\centering
\small
- \caption{diffctx (hybrid) vs.\ external baselines at $B{=}8000$ tokens. Δ is the per-instance paired delta in file recall (positive favors diffctx). 95\% paired-bootstrap percentile CI on Δ; $p$-value from Wilcoxon signed-rank. The \emph{Aider (oracle)} row is an upper-bound stress test, not a comparison baseline. \emph{Placeholder: cells to be filled when baseline runs complete.}}
+ \caption{diffctx (hybrid) vs.\ external baselines at $B{=}8000$ tokens, both-OK fair filter. $\Delta$ is the per-instance paired delta in file recall (positive favors diffctx). 95\% paired-bootstrap percentile CI on $\Delta$ ($10{,}000$ resamples, seed $=42$); $p$-value from two-sided Wilcoxon signed-rank. The \emph{Aider (oracle)} row is an upper-bound stress test, not a comparison baseline. Aider rows are pending and will be populated when the queued runs complete.}
\label{tab:prelim-baselines}
\resizebox{\textwidth}{!}{%
\begin{tabular}{lrlllc}
\toprule
- \textbf{Test set} & \textbf{n} & \textbf{diffctx} & \textbf{baseline} & \textbf{Δ recall [95\% CI]} & \textbf{Wilcoxon $p$} \\
+ \textbf{Test set} & \textbf{n} & \textbf{diffctx} & \textbf{baseline} & \textbf{$\Delta$ recall [95\% CI]} & \textbf{Wilcoxon $p$} \\
\midrule
\multicolumn{6}{l}{\emph{vs.\ BM25 over patch identifiers}} \\
\midrule
- ContextBench Verified & 500 & 0.793 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
- PolyBench-500 & 500 & 0.922 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
- SWE-bench Verified & 500 & 0.911 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
- \textbf{Pooled} & 1500 & \textbf{0.875} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
+ ContextBench Verified & 429 & 0.791 & 0.450 & $+0.341$ [$+0.307, +0.375$] & $5.1\mathrm{e}{-43}$ \\
+ PolyBench-500 & 464 & 0.922 & 0.625 & $+0.297$ [$+0.265, +0.329$] & $5.2\mathrm{e}{-39}$ \\
+ SWE-bench Verified & 494 & 0.912 & 0.535 & $+0.377$ [$+0.337, +0.418$] & $7.7\mathrm{e}{-41}$ \\
+ \textbf{Pooled} & \textbf{1387} & \textbf{0.878} & \textbf{0.539} & $\mathbf{+0.339\ [+0.318,\ +0.360]}$ & $\mathbf{4.2\mathrm{e}{-116}}$ \\
\midrule
- \multicolumn{6}{l}{\emph{vs.\ Aider repo-map (fair-input mode)}} \\
+ \multicolumn{6}{l}{\emph{vs.\ Aider repo-map (fair-input mode)} --- pending} \\
\midrule
ContextBench Verified & 500 & 0.793 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
PolyBench-500 & 500 & 0.922 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
SWE-bench Verified & 500 & 0.911 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
\textbf{Pooled} & 1500 & \textbf{0.875} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
\midrule
- \multicolumn{6}{l}{\emph{vs.\ Aider repo-map (oracle-mentioned, upper-bound stress test --- not a baseline)}} \\
+ \multicolumn{6}{l}{\emph{vs.\ Aider repo-map (oracle-mentioned, upper-bound stress test --- not a baseline)} --- pending} \\
\midrule
ContextBench Verified & 500 & 0.793 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
PolyBench-500 & 500 & 0.922 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
@@ -654,7 +656,7 @@ \subsection{Limitations of the Current Empirical Snapshot}
\begin{itemize}
\item \textbf{SWE-bench Verified completion.} \emph{Done.} All three test manifests have completed for the diffctx-hybrid run ($n{=}1500$ total).
- \item \textbf{Baseline comparisons.} BM25 over patch identifiers (\texttt{benchmarks/baselines/bm25\_baseline.py}) is queued to run after the diffctx run completes. Aider repo-map (\texttt{benchmarks/baselines/aider\_baseline.py}) is scheduled to follow. The comparison table will report paired-bootstrap CI on per-instance recall deltas and Wilcoxon signed-rank $p$-values via \texttt{render\_comparison\_table}.
+ \item \textbf{Baseline comparisons.} BM25 over patch identifiers (\texttt{benchmarks/baselines/bm25\_baseline.py}) has completed and is reported in Table~\ref{tab:prelim-baselines}; the BM25 run was launched with \texttt{-{}-workers 2}, which produced 110 \texttt{clone\_fail} statuses for BM25 vs.\ 4 for diffctx (\texttt{-{}-workers 1}), so the headline numbers use a both-OK fair filter ($n{=}1387$). A re-run of BM25 with \texttt{-{}-workers 1} (estimated 2--3\,h) is scheduled to remove this asymmetry. Aider repo-map (\texttt{benchmarks/baselines/aider\_baseline.py}, in fair-input and oracle-mentioned modes) is scheduled.
\item \textbf{Scoring-mode ablation across the four modes.} Section~\ref{sec:ego} predicts that PPR at high restart probability $\alpha$ approximates bounded-radius EGO. This requires running each of $\mathrm{scoring} \in \{\mathrm{hybrid}, \mathrm{ppr}, \mathrm{ego}, \mathrm{bm25}_{\text{internal}}\}$ on the same manifests. Only $\mathrm{hybrid}$ has been evaluated to date; the remaining three modes are queued. To control compute, the ablation is planned on a stratified subset rather than the full 1500 instances per mode; $(\tau, \beta_{\mathrm{core}})$ are held at the hybrid-optimal values, with per-mode re-tuning recorded as a sensitivity question.
\item \textbf{Budget curve.} v1 fixes $B{=}8000$. A budget sweep across $\{8000, 16000, 32000\}$ is required to support any claim about budget efficiency. The sweep will be paired with the \texttt{used\_tokens} fix below so the curve plots recall against actual measured token usage rather than nominal budget.
\item \textbf{Token-efficiency reporting.} The \texttt{used\_tokens} field is zero in v1 results due to a key-mapping bug in the eval-side wrapper (\texttt{benchmarks/diffctx\_eval\_fn.py} reads a key that the pipeline does not emit). Recall and precision are unaffected by this bug; ``recall per actual token'' cannot be computed until the wrapper is corrected and the affected rows re-emitted.
@@ -687,7 +689,7 @@ \section{Conclusion}

We have presented diffctx, a framework for diff-aware code context selection that formulates the problem as budgeted utility maximization over multi-resolution fragments on a typed dependency graph. The deployed system uses a modular relevance objective with lazy-greedy budgeted selection and engineering heuristics, including an adaptive stopping rule based on marginal utility. A pluggable scoring abstraction (PPR, EGO, BM25, Hybrid) allows the same selection algorithm to be instantiated with different relevance signals without changing the optimization machinery. We do not claim a new approximation algorithm; we use known optimization structure to make context selection explicit, analyzable, and extensible.

- Key contributions: (1) an explicit ``algorithm $\times$ constraint $\times$ guarantee'' map (Table~\ref{tab:algo-constraint-guarantee}) that separates the deployed heuristic from analyzable variants of the framework, with the submodular concept-coverage extension (Section~\ref{sec:utility}) treated as a generalization, not as the deployed default; (2) a typed-edge dependency graph with per-category treatment (hub-suppression exemption for semantic, structural, and test edges); (3) empirical results (Section~\ref{sec:prelim-results}) showing pooled file recall $0.875$ [95\% CI $0.861, 0.889$] at $B{=}8000$ on 1500 multi-language instances under the hybrid scoring mode (per-benchmark recall: SWE-bench Verified $0.911$, PolyBench-500 $0.922$, ContextBench Verified $0.793$), with the rest of the protocol --- baseline comparisons (BM25, Aider in fair and oracle-mentioned modes), scoring-mode ablation, and budget curve --- scheduled and tracked.
+ Key contributions: (1) an explicit ``algorithm $\times$ constraint $\times$ guarantee'' map (Table~\ref{tab:algo-constraint-guarantee}) that separates the deployed heuristic from analyzable variants of the framework, with the submodular concept-coverage extension (Section~\ref{sec:utility}) treated as a generalization, not as the deployed default; (2) a typed-edge dependency graph with per-category treatment (hub-suppression exemption for semantic, structural, and test edges); (3) empirical results (Section~\ref{sec:prelim-results}) showing pooled file recall $0.875$ [95\% CI $0.861, 0.889$] at $B{=}8000$ on 1500 multi-language instances under the hybrid scoring mode (per-benchmark recall: SWE-bench Verified $0.911$, PolyBench-500 $0.922$, ContextBench Verified $0.793$), and a paired-bootstrap comparison against a same-budget BM25-over-patch-identifiers baseline showing pooled $\Delta = +0.339$ [$+0.318, +0.360$], $p = 4 \times 10^{-116}$ ($n{=}1387$); the remainder of the protocol --- Aider repo-map comparison, scoring-mode ablation, and budget curve --- is scheduled and tracked.
\subsection{Future Work}
\label{sec:future-work}
