|
33 | 33 | \maketitle |
34 | 34 |
|
35 | 35 | \begin{abstract} |
36 | | -Retrieving optimal context for Large Language Models to understand code changes is a critical challenge in automated software engineering. Current approaches rely on naive windowing or purely lexical retrieval, missing complex structural dependencies. We formulate diff-aware code context selection as \textbf{budgeted utility maximization} over multi-resolution fragments on a typed dependency graph. The deployed system uses a modular relevance objective with lazy-greedy budgeted selection and engineering heuristics; we additionally describe a submodular concept-coverage extension and identify clean analyzable variants whose approximation guarantees are known. Two structural components ground the framework: (i) a partition matroid over multi-resolution fragments enforcing structural non-redundancy, and (ii) a pluggable relevance signal $\hat{w}(f, \Delta)$ instantiated via three families---Personalized PageRank, bounded ego-network expansion, and BM25 lexical retrieval---together with a hybrid combiner. The framework is evaluated by file-level recall against annotated golden contexts at a fixed token budget across three multi-language software-engineering benchmarks (1500 instances total). At an $8000$-token budget under hybrid scoring, diffctx achieves pooled file recall $0.875$ [95\% CI $0.861, 0.889$], with per-benchmark recall $0.911$ on SWE-bench Verified, $0.922$ on PolyBench-500, and $0.793$ on ContextBench Verified. Baseline comparisons (BM25 over patch identifiers, Aider repo-map in fair-input and oracle-mentioned modes) and a four-mode scoring ablation are queued. |
| 36 | +Retrieving optimal context for Large Language Models to understand code changes is a critical challenge in automated software engineering. Current approaches rely on naive windowing or purely lexical retrieval, missing complex structural dependencies. We formulate diff-aware code context selection as \textbf{budgeted utility maximization} over multi-resolution fragments on a typed dependency graph. The deployed system uses a modular relevance objective with lazy-greedy budgeted selection and engineering heuristics; we additionally describe a submodular concept-coverage extension and identify clean analyzable variants whose approximation guarantees are known. Two structural components ground the framework: (i) a partition matroid over multi-resolution fragments enforcing structural non-redundancy, and (ii) a pluggable relevance signal $\hat{w}(f, \Delta)$ instantiated via three families---Personalized PageRank, bounded ego-network expansion, and BM25 lexical retrieval---together with a hybrid combiner. The framework is evaluated by file-level recall against annotated golden contexts at a fixed token budget across three multi-language software-engineering benchmarks (1500 instances total). At an $8000$-token budget under hybrid scoring, diffctx achieves pooled file recall $0.875$ [95\% CI $0.861, 0.889$], with per-benchmark recall $0.911$ on SWE-bench Verified, $0.922$ on PolyBench-500, and $0.793$ on ContextBench Verified. A paired-bootstrap comparison against a same-budget BM25-over-patch-identifiers baseline shows pooled diffctx recall $0.878$ vs.\ BM25 $0.539$ (paired $\Delta = +0.339$, 95\% CI $[+0.318, +0.360]$, Wilcoxon $p = 4 \times 10^{-116}$ on $n=1387$ instances where both methods produced a valid selection). An Aider repo-map comparison (fair-input and oracle-mentioned modes) and a four-mode scoring ablation are queued. |
37 | 37 | \end{abstract} |
38 | 38 |
|
39 | 39 | \newpage |
@@ -559,37 +559,39 @@ \subsection{Observations} |
559 | 559 | \subsection{Baseline Comparisons} |
560 | 560 | \label{sec:prelim-baselines} |
561 | 561 |
|
562 | | -Two external file-level baselines were implemented to run the same protocol (same manifests, same budget, same metric) so that the comparison is apples-to-apples. Both are scheduled to execute after the diffctx hybrid run completes; the rendered comparison tables below show the structure that will be populated. Methodology follows the standard IR significance protocol~\cite{smucker2007comparison}: per-instance paired bootstrap on the recall delta ($B{=}10{,}000$, percentile CI) and Wilcoxon signed-rank $p$-value, computed via \texttt{benchmarks/adapters/final\_eval.py::render\_comparison\_table}. |
| 562 | +Two external file-level baselines are evaluated under the same protocol (same manifests, same budget, same metric) so that the comparison is apples-to-apples. The BM25 baseline has completed; the Aider runs are scheduled. Methodology follows the standard IR significance protocol~\cite{smucker2007comparison}: per-instance paired bootstrap on the recall delta ($B{=}10{,}000$ resamples, percentile CI) and Wilcoxon signed-rank $p$-value, computed via \texttt{benchmarks/adapters/final\_eval.py::render\_comparison\_table}.
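The paired-bootstrap percentile CI in the protocol above can be sketched as follows. This is an illustrative stand-in, not the actual `render_comparison_table` implementation; the function name and toy recall lists are assumptions. The Wilcoxon signed-rank test (not shown) is applied to the same per-instance deltas.

```python
# Illustrative sketch of a per-instance paired bootstrap on the recall delta.
# Not the repo's actual code; names and inputs are hypothetical.
import random

def paired_bootstrap_ci(recall_a, recall_b, B=10_000, alpha=0.05, seed=42):
    """Percentile CI on the mean per-instance recall delta (a - b)."""
    assert len(recall_a) == len(recall_b) and recall_a
    deltas = [a - b for a, b in zip(recall_a, recall_b)]
    n = len(deltas)
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(deltas, k=n)) / n  # resample instances with replacement
        for _ in range(B)
    )
    lo = means[int((alpha / 2) * B)]
    hi = means[int((1 - alpha / 2) * B)]
    return sum(deltas) / n, (lo, hi)

# Toy data: four instances, method A vs. method B.
delta, (lo, hi) = paired_bootstrap_ci([0.9, 0.8, 1.0, 0.7], [0.5, 0.6, 0.4, 0.5], B=2000)
```

Resampling is done at the instance level, so each bootstrap replicate preserves the pairing between the two methods.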
563 | 563 |
|
564 | 564 | \paragraph{BM25 over patch identifiers.} \texttt{rank-bm25} indexes all repository files; the query is the set of identifiers extracted from the gold patch (camelCase + snake\_case decomposition, language-keyword filter). Files are packed in BM25 rank order under strict greedy budget compliance. Implementation in \texttt{benchmarks/baselines/bm25\_baseline.py}. This isolates pure lexical retrieval without any graph structure. |
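The two mechanical steps of this baseline, identifier decomposition and strict greedy budget-compliant packing, can be sketched as below. Function names are hypothetical, not the `bm25_baseline.py` API, and whether the packer skips an over-budget file or stops outright is an assumption here (this sketch skips and continues).

```python
# Illustrative sketch only: camelCase + snake_case identifier decomposition,
# and greedy packing in rank order under a strict token budget.
import re

def decompose_identifier(ident: str) -> list[str]:
    """Split snake_case, then camelCase, into lowercase sub-tokens."""
    parts = []
    for chunk in ident.split("_"):
        # camelCase / ALLCAPS / digit runs within each snake_case chunk
        parts.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", chunk))
    return [p.lower() for p in parts if p]

def pack_greedy(ranked_files, token_counts, budget):
    """Pack files in rank order; skip any file that would overflow the budget."""
    selected, used = [], 0
    for f in ranked_files:
        cost = token_counts[f]
        if used + cost <= budget:
            selected.append(f)
            used += cost
    return selected, used

query = decompose_identifier("getHTTPResponse_code")
selected, used = pack_greedy(["a.py", "b.py", "c.py"], {"a.py": 50, "b.py": 60, "c.py": 30}, 100)
```

The decomposition handles the ALLCAPS-run edge case (`HTTP`) separately from ordinary camelCase segments so acronyms survive as single sub-tokens.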
565 | 565 |
|
566 | | -\paragraph{Aider repo-map.} The reference open-source approach using the same primitives (tree-sitter AST + Personalized PageRank + token-budgeted packing). Invoked through an isolated \texttt{uv tool} subprocess with the upstream \texttt{aider-chat==0.86.2} package; \texttt{mentioned\_fnames} populated from \texttt{instance.gold\_files} and \texttt{mentioned\_idents} from the gold patch. This input mapping favors Aider relative to diffctx, which receives only the patch text. Implementation in \texttt{benchmarks/baselines/aider\_baseline.py}. |
| 566 | +\paragraph{Aider repo-map.} The reference open-source approach using the same primitives (tree-sitter AST + Personalized PageRank + token-budgeted packing). Invoked through an isolated \texttt{uv tool} subprocess with the upstream \texttt{aider-chat==0.86.2} package. We run Aider in two modes: \emph{fair-input} (\texttt{mentioned\_fnames} restricted to file paths visible in the input diff text, the same information diffctx receives) and \emph{oracle-mentioned} (\texttt{mentioned\_fnames} populated from \texttt{instance.gold\_files}, an upper-bound stress test rather than a baseline). Implementation in \texttt{benchmarks/baselines/aider\_baseline.py}. |
| 567 | + |
| 568 | +\paragraph{Both-OK fair filter.} The BM25 run was launched with \texttt{-{}-workers 2} on a shared bare-clone cache; the resulting concurrent \texttt{git} index-lock contention produced 110 \texttt{clone\_fail} statuses for BM25 vs.\ 4 for diffctx (which ran with \texttt{-{}-workers 1}). To avoid penalizing BM25 for an infrastructure asymmetry, the comparison rows below are computed on the both-OK subset ($n{=}1387$): instances where both methods produced a valid selection. The naive intent-to-treat deltas (failed instances counted as recall $0$) are larger, but they conflate retrieval quality with infrastructure failures; we therefore report the both-OK numbers as primary and note the asymmetry explicitly. A re-run of BM25 with \texttt{-{}-workers 1} (estimated 2--3\,h) is scheduled to remove this caveat in a future revision; the both-OK $\Delta$ is expected to remain within the reported CI.
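The both-OK filter described above amounts to a simple inner join on instance status; a minimal sketch follows, assuming per-instance records of `(status, recall)` keyed by instance id (the record shape is illustrative, not the eval harness's actual schema).

```python
# Illustrative sketch of the both-OK fair filter: keep only instances where
# both methods produced a valid selection, yielding paired recall values.
def both_ok_pairs(results_a, results_b):
    """results_*: dict mapping instance_id -> (status, recall)."""
    pairs = []
    for iid, (status_a, rec_a) in results_a.items():
        status_b, rec_b = results_b.get(iid, ("missing", 0.0))
        if status_a == "ok" and status_b == "ok":
            pairs.append((rec_a, rec_b))
    return pairs

pairs = both_ok_pairs(
    {"i1": ("ok", 0.9), "i2": ("clone_fail", 0.0), "i3": ("ok", 1.0)},
    {"i1": ("ok", 0.5), "i2": ("ok", 0.6), "i3": ("clone_fail", 0.0)},
)
```

Only `i1` survives the filter: `i2` failed for one method and `i3` for the other, so neither contributes a paired delta.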
567 | 569 |
|
568 | 570 | \begin{table}[h] |
569 | 571 | \centering |
570 | 572 | \small |
571 | | -\caption{diffctx (hybrid) vs.\ external baselines at $B{=}8000$ tokens. Δ is the per-instance paired delta in file recall (positive favors diffctx). 95\% paired-bootstrap percentile CI on Δ; $p$-value from Wilcoxon signed-rank. The \emph{Aider (oracle)} row is an upper-bound stress test, not a comparison baseline. \emph{Placeholder: cells to be filled when baseline runs complete.}} |
| 573 | +\caption{diffctx (hybrid) vs.\ external baselines at $B{=}8000$ tokens, both-OK fair filter. $\Delta$ is the per-instance paired delta in file recall (positive favors diffctx). 95\% paired-bootstrap percentile CI on $\Delta$ ($B{=}10{,}000$ resamples, seed $=42$); $p$-value from two-sided Wilcoxon signed-rank. The \emph{Aider (oracle)} row is an upper-bound stress test, not a comparison baseline. Aider rows are pending and will be populated when the queued runs complete.}
572 | 574 | \label{tab:prelim-baselines} |
573 | 575 | \resizebox{\textwidth}{!}{% |
574 | 576 | \begin{tabular}{lrlllc} |
575 | 577 | \toprule |
576 | | -\textbf{Test set} & \textbf{n} & \textbf{diffctx} & \textbf{baseline} & \textbf{Δ recall [95\% CI]} & \textbf{Wilcoxon $p$} \\ |
| 578 | +\textbf{Test set} & \textbf{n} & \textbf{diffctx} & \textbf{baseline} & \textbf{$\Delta$ recall [95\% CI]} & \textbf{Wilcoxon $p$} \\ |
577 | 579 | \midrule |
578 | 580 | \multicolumn{6}{l}{\emph{vs.\ BM25 over patch identifiers}} \\ |
579 | 581 | \midrule |
580 | | -ContextBench Verified & 500 & 0.793 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\ |
581 | | -PolyBench-500 & 500 & 0.922 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\ |
582 | | -SWE-bench Verified & 500 & 0.911 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\ |
583 | | -\textbf{Pooled} & 1500 & \textbf{0.875} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\ |
| 582 | +ContextBench Verified & 429 & 0.791 & 0.450 & $+0.341$ [$+0.307, +0.375$] & $5.1 \times 10^{-43}$ \\
| 583 | +PolyBench-500 & 464 & 0.922 & 0.625 & $+0.297$ [$+0.265, +0.329$] & $5.2 \times 10^{-39}$ \\
| 584 | +SWE-bench Verified & 494 & 0.912 & 0.535 & $+0.377$ [$+0.337, +0.418$] & $7.7 \times 10^{-41}$ \\
| 585 | +\textbf{Pooled} & \textbf{1387} & \textbf{0.878} & \textbf{0.539} & $\mathbf{+0.339\ [+0.318,\ +0.360]}$ & $\mathbf{4.2 \times 10^{-116}}$ \\
584 | 586 | \midrule |
585 | | -\multicolumn{6}{l}{\emph{vs.\ Aider repo-map (fair-input mode)}} \\ |
| 587 | +\multicolumn{6}{l}{\emph{vs.\ Aider repo-map (fair-input mode)} --- pending} \\ |
586 | 588 | \midrule |
587 | 589 | ContextBench Verified & 500 & 0.793 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\ |
588 | 590 | PolyBench-500 & 500 & 0.922 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\ |
589 | 591 | SWE-bench Verified & 500 & 0.911 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\ |
590 | 592 | \textbf{Pooled} & 1500 & \textbf{0.875} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\ |
591 | 593 | \midrule |
592 | | -\multicolumn{6}{l}{\emph{vs.\ Aider repo-map (oracle-mentioned, upper-bound stress test --- not a baseline)}} \\ |
| 594 | +\multicolumn{6}{l}{\emph{vs.\ Aider repo-map (oracle-mentioned, upper-bound stress test --- not a baseline)} --- pending} \\ |
593 | 595 | \midrule |
594 | 596 | ContextBench Verified & 500 & 0.793 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\ |
595 | 597 | PolyBench-500 & 500 & 0.922 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\ |
@@ -654,7 +656,7 @@ \subsection{Limitations of the Current Empirical Snapshot} |
654 | 656 |
|
655 | 657 | \begin{itemize} |
656 | 658 | \item \textbf{SWE-bench Verified completion.} \emph{Done.} All three test manifests have completed for the diffctx-hybrid run ($n{=}1500$ total). |
657 | | - \item \textbf{Baseline comparisons.} BM25 over patch identifiers (\texttt{benchmarks/baselines/bm25\_baseline.py}) is queued to run after the diffctx run completes. Aider repo-map (\texttt{benchmarks/baselines/aider\_baseline.py}) is scheduled to follow. The comparison table will report paired-bootstrap CI on per-instance recall deltas and Wilcoxon signed-rank $p$-values via \texttt{render\_comparison\_table}. |
| 659 | + \item \textbf{Baseline comparisons.} BM25 over patch identifiers (\texttt{benchmarks/baselines/bm25\_baseline.py}) has completed and is reported in Table~\ref{tab:prelim-baselines}; the BM25 run was launched with \texttt{-{}-workers 2}, which produced 110 \texttt{clone\_fail} statuses for BM25 vs.\ 4 for diffctx (\texttt{-{}-workers 1}), so the headline numbers use a both-OK fair filter ($n{=}1387$). A re-run of BM25 with \texttt{-{}-workers 1} (estimated 2--3\,h) is scheduled to remove this asymmetry. Aider repo-map (\texttt{benchmarks/baselines/aider\_baseline.py}, in fair-input and oracle-mentioned modes) is scheduled. |
658 | 660 | \item \textbf{Scoring-mode ablation across the four modes.} Section~\ref{sec:ego} predicts that PPR at high restart probability $\alpha$ approximates bounded-radius EGO. This requires running each of $\mathrm{scoring} \in \{\mathrm{hybrid}, \mathrm{ppr}, \mathrm{ego}, \mathrm{bm25}_{\text{internal}}\}$ on the same manifests. Only $\mathrm{hybrid}$ has been evaluated to date; the remaining three modes are queued. To control compute, the ablation is planned on a stratified subset rather than the full 1500 instances per mode; $(\tau, \beta_{\mathrm{core}})$ are held at the hybrid-optimal values, with per-mode re-tuning recorded as a sensitivity question. |
659 | 661 | \item \textbf{Budget curve.} v1 fixes $B{=}8000$. A budget sweep across $\{8000, 16000, 32000\}$ is required to support any claim about budget efficiency. The sweep will be paired with the \texttt{used\_tokens} fix below so the curve plots recall against actual measured token usage rather than nominal budget. |
660 | 662 | \item \textbf{Token-efficiency reporting.} The \texttt{used\_tokens} field is zero in v1 results due to a key-mapping bug in the eval-side wrapper (\texttt{benchmarks/diffctx\_eval\_fn.py} reads a key that the pipeline does not emit). Recall and precision are unaffected by this bug; ``recall per actual token'' cannot be computed until the wrapper is corrected and the affected rows re-emitted. |
@@ -687,7 +689,7 @@ \section{Conclusion} |
687 | 689 |
|
688 | 690 | We have presented diffctx, a framework for diff-aware code context selection that formulates the problem as budgeted utility maximization over multi-resolution fragments on a typed dependency graph. The deployed system uses a modular relevance objective with lazy-greedy budgeted selection and engineering heuristics, including an adaptive stopping rule based on marginal utility. A pluggable scoring abstraction (PPR, EGO, BM25, Hybrid) allows the same selection algorithm to be instantiated with different relevance signals without changing the optimization machinery. We do not claim a new approximation algorithm; we use known optimization structure to make context selection explicit, analyzable, and extensible. |
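The lazy-greedy budgeted selection with an adaptive stopping rule, as summarized above, can be sketched as follows. With a modular objective the marginal gain of a fragment never changes, so lazy-greedy reduces to a single pass over fragments in utility-density order; the `min_density` threshold and the `(frag_id, utility, token_cost)` record shape are illustrative assumptions, not the deployed configuration.

```python
# Illustrative sketch of greedy budgeted selection by utility density with an
# adaptive marginal-utility stop. Not the deployed implementation.
import heapq

def lazy_greedy_select(fragments, budget, min_density=0.0):
    """fragments: iterable of (frag_id, utility, token_cost)."""
    # Max-heap keyed on utility density (utility per token); for a modular
    # objective no re-evaluation of marginal gains is needed after pops.
    heap = [(-u / c, fid, c) for fid, u, c in fragments if c > 0]
    heapq.heapify(heap)
    selected, used = [], 0
    while heap:
        neg_d, fid, c = heapq.heappop(heap)
        if -neg_d < min_density:   # adaptive stop: marginal utility too low
            break
        if used + c <= budget:     # strict budget compliance: skip overflow
            selected.append(fid)
            used += c
    return selected, used

# Toy fragments: (id, utility, token_cost). Densities: a=0.1, b=0.3, c=0.02.
selected, used = lazy_greedy_select(
    [("a", 10, 100), ("b", 9, 30), ("c", 1, 50)], budget=150, min_density=0.05
)
```

Here `b` is taken first (highest density), `a` fits within the remaining budget, and `c` triggers the adaptive stop because its density falls below the threshold.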
689 | 691 |
|
690 | | -Key contributions: (1) an explicit ``algorithm $\times$ constraint $\times$ guarantee'' map (Table~\ref{tab:algo-constraint-guarantee}) that separates the deployed heuristic from analyzable variants of the framework, with the submodular concept-coverage extension (Section~\ref{sec:utility}) treated as a generalization, not as the deployed default; (2) a typed-edge dependency graph with per-category treatment (hub-suppression exemption for semantic, structural, and test edges); (3) empirical results (Section~\ref{sec:prelim-results}) showing pooled file recall $0.875$ [95\% CI $0.861, 0.889$] at $B{=}8000$ on 1500 multi-language instances under the hybrid scoring mode (per-benchmark recall: SWE-bench Verified $0.911$, PolyBench-500 $0.922$, ContextBench Verified $0.793$), with the rest of the protocol --- baseline comparisons (BM25, Aider in fair and oracle-mentioned modes), scoring-mode ablation, and budget curve --- scheduled and tracked. |
| 692 | +Key contributions: (1) an explicit ``algorithm $\times$ constraint $\times$ guarantee'' map (Table~\ref{tab:algo-constraint-guarantee}) that separates the deployed heuristic from analyzable variants of the framework, with the submodular concept-coverage extension (Section~\ref{sec:utility}) treated as a generalization, not as the deployed default; (2) a typed-edge dependency graph with per-category treatment (hub-suppression exemption for semantic, structural, and test edges); (3) empirical results (Section~\ref{sec:prelim-results}) showing pooled file recall $0.875$ [95\% CI $0.861, 0.889$] at $B{=}8000$ on 1500 multi-language instances under the hybrid scoring mode (per-benchmark recall: SWE-bench Verified $0.911$, PolyBench-500 $0.922$, ContextBench Verified $0.793$), and a paired-bootstrap comparison against a same-budget BM25-over-patch-identifiers baseline showing pooled $\Delta = +0.339$ [$+0.318, +0.360$], $p = 4 \times 10^{-116}$ ($n{=}1387$); the remainder of the protocol --- Aider repo-map comparison, scoring-mode ablation, and budget curve --- is scheduled and tracked. |
691 | 693 |
|
692 | 694 | \subsection{Future Work} |
693 | 695 | \label{sec:future-work} |
|