docs/Context-Selection-for-Git-Diff/v2/main.tex
35 additions & 16 deletions
@@ -11,6 +11,13 @@
 \usepackage{algorithm}
 \usepackage{algpseudocode}
 \usepackage[margin=1in]{geometry}
+% Allow TeX to stretch interword space in emergencies to avoid overfull hboxes
+% caused by long unbreakable \texttt{} or compound identifiers.
+\setlength{\emergencystretch}{5em}
+% Permit \texttt{} and \url to break at common code separators (./_\-).
+\hyphenpenalty=200
+\exhyphenpenalty=200
+\sloppy
 
 \title{diffctx: Budgeted Typed-Graph Retrieval for Diff-Aware Code Context Selection}
 
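% Editorial sketch, not part of this commit; verify against the full preamble.
% Assumptions: (1) \sloppy is the stock LaTeX kernel macro, which itself sets
% \emergencystretch to 3em and would therefore override the 5em set before it;
% (2) the kernel defaults for \hyphenpenalty and \exhyphenpenalty are 50, so
% a value of 200 discourages hyphenation rather than permitting extra breaks;
% (3) \texttt does not hyphenate, so breaking at code separators needs a
% url-style command. \code below is a hypothetical macro name.
\usepackage[hyphens]{url}               % additionally allow breaks after "-"
\DeclareUrlCommand\code{\urlstyle{tt}}  % \code{...} breaks at url break characters
\sloppy
\setlength{\emergencystretch}{5em}      % placed after \sloppy so the 5em value survives
% If hyperref is loaded, \PassOptionsToPackage{hyphens}{url} placed before it
% may be needed instead of loading url with an option directly.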
@@ -121,17 +128,18 @@ \subsection{Inputs and Definitions}
 \small
 \caption{Algorithm $\times$ constraint $\times$ guarantee map. The deployed default is a heuristic; analyzable variants of the framework admit the listed guarantees on their respective surrogate problems, and a submodular concept-coverage extension is described in Section~\ref{sec:utility} but is not the deployed default.}
\paragraph{Objective.} The objective takes the same algebraic form in both modes but with different sources for the per-fragment weight $w(f, \Delta) \geq0$:
\caption{Per-benchmark file-level metrics, scoring=hybrid, $B{=}8000$ tokens, with 95\% percentile bootstrap CIs ($10{,}000$ resamples, seed=42). Status \texttt{ok} excludes \texttt{clone\_fail} (4 Java instances on ContextBench Verified) and pending instances. SWE-bench Verified row is a placeholder pending completion of the in-flight run; the full table will be re-emitted when $n=1500$.}
\caption{diffctx (hybrid) vs.\ external baselines at $B{=}8000$ tokens. Δ is the per-instance paired delta in file recall (positive favors diffctx). 95\% paired-bootstrap percentile CI on Δ; $p$-value from a Wilcoxon signed-rank test. The \emph{Aider (oracle)} row is an upper-bound stress test, not a comparison baseline. \emph{Placeholder: cells to be filled when baseline runs complete.}}
\caption{Scoring-mode ablation at $B{=}8000$, hybrid-optimal operational hyperparameters. File recall with 95\% percentile bootstrap CIs. \emph{Placeholder: cells to be filled once each non-hybrid mode completes on the ablation subset.}}
\caption{Budget curve: pooled file recall under hybrid mode at three budgets, on the full 1500-instance test set. \emph{Placeholder: $B{=}16{,}000$ and $B{=}32{,}000$ runs queued.}}
 \label{tab:prelim-budget}
+\resizebox{\textwidth}{!}{%
 \begin{tabular}{lrlll}
 \toprule
 \textbf{Budget $B$} & \textbf{n} & \textbf{Mean recall [95\% CI]} & \textbf{Mean used tokens} & \textbf{Recall / used token (k)} \\
$^{\dagger}$ Mean used tokens is presently zero in v1 result rows due to a key-mapping bug in \texttt{benchmarks/diffctx\_eval\_fn.py} that reads a key not emitted by the pipeline. Recall and precision are unaffected. The bug is documented in our project tracker and fixed runs are scheduled before the budget curve is finalized; once \texttt{used\_tokens} reflects the actual encoder count, recall-per-used-token (rather than recall-per-nominal-budget) becomes the reportable efficiency metric.
\caption{Symbol-to-code map. Paper symbols on the left correspond to the named code identifiers on the right, located in the listed source file under \texttt{diffctx/src/}. Implementation-only parameters are documented inline in Appendix~A and are not duplicated here.}