docs/Context-Selection-for-Git-Diff/v2/main.tex — 27 additions, 27 deletions
@@ -33,7 +33,7 @@
\maketitle
\begin{abstract}
- Retrieving optimal context for Large Language Models to understand code changes is a critical challenge in automated software engineering. Current approaches rely on naive windowing or purely lexical retrieval, missing complex structural dependencies. We formulate diff-aware code context selection as \textbf{budgeted utility maximization} over multi-resolution fragments on a typed dependency graph. The deployed system uses a modular relevance objective with lazy-greedy budgeted selection and engineering heuristics; we additionally describe a submodular concept-coverage extension and identify clean analyzable variants whose approximation guarantees are known. Two structural components ground the framework: (i) a partition matroid over multi-resolution fragments enforcing structural non-redundancy, and (ii) a pluggable relevance signal $\hat{w}(f, \Delta)$ instantiated via three families---Personalized PageRank, bounded ego-network expansion, and BM25 lexical retrieval---together with a hybrid combiner. The framework is evaluated by file-level recall against annotated golden contexts at a fixed token budget across multi-language software-engineering benchmarks; interim results on 845 of $\sim$1500 instances yield pooled file recall $0.855$ at an $8000$-token budget, with baseline comparisons (BM25, Aider repo-map) and a four-mode scoring ablation queued.
+ Retrieving optimal context for Large Language Models to understand code changes is a critical challenge in automated software engineering. Current approaches rely on naive windowing or purely lexical retrieval, missing complex structural dependencies. We formulate diff-aware code context selection as \textbf{budgeted utility maximization} over multi-resolution fragments on a typed dependency graph. The deployed system uses a modular relevance objective with lazy-greedy budgeted selection and engineering heuristics; we additionally describe a submodular concept-coverage extension and identify clean analyzable variants whose approximation guarantees are known. Two structural components ground the framework: (i) a partition matroid over multi-resolution fragments enforcing structural non-redundancy, and (ii) a pluggable relevance signal $\hat{w}(f, \Delta)$ instantiated via three families---Personalized PageRank, bounded ego-network expansion, and BM25 lexical retrieval---together with a hybrid combiner. The framework is evaluated by file-level recall against annotated golden contexts at a fixed token budget across three multi-language software-engineering benchmarks (1500 instances total). At an $8000$-token budget under hybrid scoring, diffctx achieves pooled file recall $0.875$ [95\% CI $0.861, 0.889$], with per-benchmark recall $0.911$ on SWE-bench Verified, $0.922$ on PolyBench-500, and $0.793$ on ContextBench Verified. Baseline comparisons (BM25 over patch identifiers, Aider repo-map in fair-input and oracle-mentioned modes) and a four-mode scoring ablation are queued.
\end{abstract}
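The abstract's ``lazy-greedy budgeted selection'' can be made concrete with a short sketch. All names here are illustrative, not the paper's implementation: `score(f, chosen)` stands in for the relevance objective $\hat{w}(f, \Delta)$, `cost(f)` for token cost, and the gain-per-token rule is the standard cost-benefit greedy; the deployed system's engineering heuristics and adaptive stopping rule are not reproduced.

```python
import heapq

def lazy_greedy_select(fragments, score, cost, budget):
    """Greedy budgeted selection with lazy gain re-evaluation.

    With a modular score the cached densities are exact; with a submodular
    score this is the classic lazy-greedy speedup: pop the best stale upper
    bound, refresh it, and re-insert if it is no longer on top.
    """
    chosen = []
    spent = 0
    # heapq is a min-heap, so push negated gain densities (gain / cost).
    heap = [(-score(f, []) / cost(f), f) for f in fragments if cost(f) <= budget]
    heapq.heapify(heap)
    while heap:
        neg_density, f = heapq.heappop(heap)
        if spent + cost(f) > budget:
            continue  # does not fit; skip and keep scanning (greedy variant)
        fresh = score(f, chosen) / cost(f)
        if heap and fresh < -heap[0][0]:
            # Stale bound: another fragment may now be better; re-insert.
            heapq.heappush(heap, (-fresh, f))
            continue
        chosen.append(f)
        spent += cost(f)
    return chosen, spent
```

With a modular score (marginal gain independent of the chosen set), the density of each fragment never goes stale, so the lazy check is a no-op and the routine reduces to plain density-ordered greedy under the budget.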
\newpage
@@ -472,7 +472,7 @@ \subsection{Baselines}
\section{Empirical Evaluation}
\label{sec:prelim-results}
- This section instantiates the protocol of Section~\ref{sec:eval} and reports the v1 evaluation run. At the time of writing the run is partially complete; numbers are reported over the completed portion and the in-flight items are listed at the end of the section. The setup, calibration, and reproducibility statements (frozen manifests, pinned dataset revisions, fixed hardware) are unconditional on completion.
+ This section instantiates the protocol of Section~\ref{sec:eval} and reports the v1 evaluation run. The diffctx-hybrid run on the full 1500-instance test set has completed; baseline runs (BM25, Aider) and the scoring-mode ablation are scheduled and tracked at the end of the section.
- \caption{Per-benchmark file-level metrics, scoring=hybrid, $B{=}8000$ tokens, with 95\% percentile bootstrap CIs ($B{=}10{,}000$ resamples, seed=42). Status \texttt{ok} excludes \texttt{clone\_fail} (4 Java instances on ContextBench Verified) and pending instances. SWE-bench Verified row is a placeholder pending completion of the in-flight run; the full table will be re-emitted when $n=1500$.}
+ \caption{Per-benchmark file-level metrics, scoring=hybrid, $B{=}8000$ tokens, with 95\% percentile bootstrap CIs ($B{=}10{,}000$ resamples, seed=42). Status \texttt{ok} excludes \texttt{clone\_fail} (4 Java instances on ContextBench Verified). Macro and micro averages coincide because each test set has $n{=}500$.}
- \paragraph{Recall is high at a tight 8k-token budget.} Pooled mean file recall is $0.855$ across 845 completed instances at $B{=}8000$. Direct head-to-head against published baselines is deferred to the completed run; a BM25 baseline running the same protocol at the same budget on the same manifests is implemented (\texttt{benchmarks/baselines/bm25\_baseline.py}) and queued, and an Aider repo-map baseline (\texttt{benchmarks/baselines/aider\_baseline.py}) is scheduled to follow.
+ \paragraph{Recall is high at a tight 8k-token budget.} Pooled mean file recall is $0.875$ [95\% CI $0.861, 0.889$] across all 1500 instances at $B{=}8000$. Per-benchmark recall is highest on PolyBench-500 ($0.922$) and SWE-bench Verified ($0.911$), and lowest on ContextBench Verified ($0.793$); the per-set spread (about 13 percentage points) is much larger than each set's CI half-width and therefore is benchmark-driven, not noise. Direct head-to-head against published baselines is forthcoming: a BM25 baseline running the same protocol at the same budget on the same manifests is implemented (\texttt{benchmarks/baselines/bm25\_baseline.py}) and currently running, and an Aider repo-map baseline (\texttt{benchmarks/baselines/aider\_baseline.py}, in fair-input and oracle-mentioned modes) is scheduled to follow.
- \paragraph{Per-language pattern.} TypeScript ($0.944$) and Go ($0.931$) lead; Python ($0.780$), C ($0.703$), and C++ ($0.702$) trail. The C/C++ samples are too small ($n{=}23$ and $n{=}10$) to draw firm conclusions; the Python gap is consistent with the hypothesis that dynamic dispatch and metaprogramming weaken the symbol-reference heuristic (limitations discussed in Section~\ref{sec:scoring}). The TypeScript advantage tracks the language's static type information enriching the dependency graph.
+ \paragraph{Per-language pattern.} TypeScript ($n{=}191$, recall $0.950$) and Go ($n{=}40$, $0.931$) lead; Python ($n{=}891$, $0.862$), Java ($n{=}136$, $0.887$), and JavaScript ($n{=}185$, $0.886$) cluster in the upper-eighties; C ($n{=}23$, $0.703$) and C++ ($n{=}10$, $0.702$) trail with sample sizes too small to draw firm conclusions. The TypeScript advantage tracks the language's static type information enriching the dependency graph; the relatively narrow gap between Python and the upper-eighties cluster (roughly 9 percentage points despite Python carrying $\sim$60\% of the pooled instances and being the language most affected by dynamic dispatch and metaprogramming) is the more interesting signal --- the symbol-reference heuristic, while imperfect for dynamic dispatch (Section~\ref{sec:scoring}), nonetheless covers most cases in modern Python codebases.
\paragraph{Precision is uniformly low ($\approx0.12$).} Mean file precision is approximately $|G|/k$ where $|G|$ is the gold-file count per instance ($\sim1$--$3$) and $k$ is the number of files that fit the budget. Low file precision indicates that the selector retrieves many non-gold files under the fixed budget; downstream LLM context quality depends on both recall and noise, so file recall is a necessary but insufficient proxy. ContextBench Verified provides block-level (line-range) annotations beyond the file level; we plan to report block-level recall on that subset to characterize how concentrated the selected context is around the gold regions, and to bound the true noise floor implied by the file-precision number.
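The $|G|/k$ identity behind the $\approx0.12$ figure can be sanity-checked with illustrative numbers (hypothetical, chosen only to land near the reported value):

```python
# Hypothetical instance: |G| = 2 gold files, and the 8000-token budget
# admits k = 17 selected files.  At full recall every gold file is among
# the k selected, so file precision = |G| / k regardless of which
# non-gold files fill the rest of the budget.
gold_files = 2
selected_files = 17
precision = gold_files / selected_files
print(round(precision, 3))  # → 0.118
```

This is why low file precision at a fixed budget is structural rather than diagnostic on its own: it is pinned by the ratio of gold-set size to budget capacity, which motivates the planned block-level analysis.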
@@ -653,7 +653,7 @@ \subsection{Limitations of the Current Empirical Snapshot}
The numbers in this section reflect the v1 run; the remaining items below are scheduled work, not deferred future work, and will be incorporated into the next version of this section before submission:
\begin{itemize}
- \item\textbf{SWE-bench Verified completion.} The third test manifest is in queue; final pooled $n$ will be $\sim1500$.
+ \item\textbf{SWE-bench Verified completion.} \emph{Done.} All three test manifests have completed for the diffctx-hybrid run ($n{=}1500$ total).
\item\textbf{Baseline comparisons.} BM25 over patch identifiers (\texttt{benchmarks/baselines/bm25\_baseline.py}) is queued to run after the diffctx run completes. Aider repo-map (\texttt{benchmarks/baselines/aider\_baseline.py}) is scheduled to follow. The comparison table will report paired-bootstrap CI on per-instance recall deltas and Wilcoxon signed-rank $p$-values via \texttt{render\_comparison\_table}.
\item\textbf{Scoring-mode ablation across the four modes.} Section~\ref{sec:ego} predicts that PPR at high restart probability $\alpha$ approximates bounded-radius EGO. This requires running each of $\mathrm{scoring} \in\{\mathrm{hybrid}, \mathrm{ppr}, \mathrm{ego}, \mathrm{bm25}_{\text{internal}}\}$ on the same manifests. Only $\mathrm{hybrid}$ has been evaluated to date; the remaining three modes are queued. To control compute, the ablation is planned on a stratified subset rather than the full 1500 instances per mode; $(\tau, \beta_{\mathrm{core}})$ are held at the hybrid-optimal values, with per-mode re-tuning recorded as a sensitivity question.
\item\textbf{Budget curve.} v1 fixes $B{=}8000$. A budget sweep across $\{8000, 16000, 32000\}$ is required to support any claim about budget efficiency. The sweep will be paired with the \texttt{used\_tokens} fix below so the curve plots recall against actual measured token usage rather than nominal budget.
@@ -687,7 +687,7 @@ \section{Conclusion}
We have presented diffctx, a framework for diff-aware code context selection that formulates the problem as budgeted utility maximization over multi-resolution fragments on a typed dependency graph. The deployed system uses a modular relevance objective with lazy-greedy budgeted selection and engineering heuristics, including an adaptive stopping rule based on marginal utility. A pluggable scoring abstraction (PPR, EGO, BM25, Hybrid) allows the same selection algorithm to be instantiated with different relevance signals without changing the optimization machinery. We do not claim a new approximation algorithm; we use known optimization structure to make context selection explicit, analyzable, and extensible.
- Key contributions: (1) an explicit ``algorithm $\times$ constraint $\times$ guarantee'' map (Table~\ref{tab:algo-constraint-guarantee}) that separates the deployed heuristic from analyzable variants of the framework, with the submodular concept-coverage extension (Section~\ref{sec:utility}) treated as a generalization, not as the deployed default; (2) a typed-edge dependency graph with per-category treatment (hub-suppression exemption for semantic, structural, and test edges); (3) interim empirical results (Section~\ref{sec:prelim-results}) showing pooled file recall $0.855$ at $B{=}8000$ on 845 multi-language instances under the hybrid scoring mode, with the rest of the protocol --- baseline comparisons (BM25, Aider in fair and oracle-mentioned modes), scoring-mode ablation, and budget curve --- scheduled and tracked.
+ Key contributions: (1) an explicit ``algorithm $\times$ constraint $\times$ guarantee'' map (Table~\ref{tab:algo-constraint-guarantee}) that separates the deployed heuristic from analyzable variants of the framework, with the submodular concept-coverage extension (Section~\ref{sec:utility}) treated as a generalization, not as the deployed default; (2) a typed-edge dependency graph with per-category treatment (hub-suppression exemption for semantic, structural, and test edges); (3) empirical results (Section~\ref{sec:prelim-results}) showing pooled file recall $0.875$ [95\% CI $0.861, 0.889$] at $B{=}8000$ on 1500 multi-language instances under the hybrid scoring mode (per-benchmark recall: SWE-bench Verified $0.911$, PolyBench-500 $0.922$, ContextBench Verified $0.793$), with the rest of the protocol --- baseline comparisons (BM25, Aider in fair and oracle-mentioned modes), scoring-mode ablation, and budget curve --- scheduled and tracked.