Commit 68070a7

paper(v2): finalize empirical section + placeholders for pending data
1 parent 72aebeb commit 68070a7

2 files changed

Lines changed: 94 additions & 6 deletions

File tree

docs/Context-Selection-for-Git-Diff/v2/main.tex

Lines changed: 85 additions & 6 deletions
@@ -474,17 +474,17 @@ \subsection{Interim Results: Hybrid Mode}
474474

475475
\begin{table}[h]
476476
\centering
477-
\caption{Per-benchmark interim file-level metrics, scoring=hybrid, $B{=}8000$ tokens. Recall and precision are means over completed instances. Status \texttt{ok} excludes \texttt{clone\_fail} (4 Java instances on ContextBench Verified) and pending instances.}
477+
\caption{Per-benchmark file-level metrics, scoring=hybrid, $B{=}8000$ tokens, with 95\% percentile bootstrap CIs (10{,}000 resamples, seed 42). Status \texttt{ok} excludes \texttt{clone\_fail} (4 Java instances on ContextBench Verified) and pending instances. The SWE-bench Verified row is a placeholder pending completion of the in-flight run; the full table will be regenerated once all $n{=}1500$ instances have completed.}
478478
\label{tab:prelim-bench}
479-
\begin{tabular}{lrrrr}
479+
\begin{tabular}{lrllr}
480480
\toprule
481481
\textbf{Test set} & \textbf{n} & \textbf{File recall} & \textbf{File precision} & \textbf{ok\%} \\
482482
\midrule
483-
ContextBench Verified & 500 & 0.793 & 0.121 & 99.2\% \\
484-
PolyBench-500 (in progress) & 345 & 0.946 & 0.124 & 100.0\% \\
485-
SWE-bench Verified & --- & --- & --- & not started \\
483+
ContextBench Verified & 500 & 0.793 [0.767, 0.818] & 0.121 [0.110, 0.133] & 99.2\% \\
484+
PolyBench-500 (in progress) & 345 & 0.946 [0.926, 0.964] & 0.124 [0.113, 0.137] & 100.0\% \\
485+
SWE-bench Verified \emph{(TBD)} & \emph{500} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
486486
\midrule
487-
Pooled (interim) & 845 & 0.855 & 0.122 & 99.5\% \\
487+
Pooled (interim) & 845 & 0.855 [0.837, 0.873] & 0.122 [0.114, 0.131] & 99.5\% \\
488488
\bottomrule
489489
\end{tabular}
490490
\end{table}
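The percentile bootstrap intervals in the caption can be sketched as follows. This is a minimal illustration that assumes per-instance recall values are available as a flat list; the function name and signature are hypothetical and are not taken from the project's evaluation code.

```python
import random


def percentile_bootstrap_ci(values, n_resamples=10_000, seed=42, level=0.95):
    """Return (mean, lower, upper) using a percentile bootstrap on the mean.

    Hypothetical helper for illustration; `values` holds one score per instance.
    """
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample instances with replacement and record the mean of each resample.
    resampled_means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower = resampled_means[int((1 - level) / 2 * n_resamples)]
    upper = resampled_means[int((1 + level) / 2 * n_resamples) - 1]
    return sum(values) / n, lower, upper
```

Resampling whole instances keeps the interval honest about between-instance variance, which is typically the dominant source of uncertainty for a mean recall.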
@@ -517,6 +517,85 @@ \subsection{Observations}
517517

518518
\paragraph{Precision is uniformly low ($\approx 0.12$).} The current selector packs files until the budget is exhausted; on small gold sets ($\sim 1$--$3$ files per instance) this produces many false-positive selections that depress precision. This is a methodological artifact, not a flaw: at fixed budget, recall is the load-bearing metric and precision is bounded above by $|G|/k$, where $|G|$ is the gold-set size and $k$ the number of selected files. We report precision for completeness but do not optimize against it.
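The $|G|/k$ constraint follows directly from the definitions of file precision and recall. The derivation below is a sketch; the symbol $S$ for the selected file set is an assumption introduced here for illustration, and the numbers in the comment are hypothetical rather than measured.

```latex
\[
  P \;=\; \frac{|S \cap G|}{|S|}
    \;=\; \frac{R\,|G|}{k}
    \;\le\; \frac{|G|}{k},
  \qquad
  R = \frac{|S \cap G|}{|G|}, \quad k = |S|.
\]
% Illustrative values only: |G| = 2 and k = 16 with perfect recall (R = 1)
% give P = 2/16 = 0.125, so precision near 0.12 is compatible with high recall.
```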
519519

520+
\subsection{Baseline Comparisons}
521+
\label{sec:prelim-baselines}
522+
523+
Two external file-level baselines implement the same protocol (same manifests, same budget, same metric) so that the comparison is like-for-like. Both are scheduled to run after the diffctx hybrid run completes; Table~\ref{tab:prelim-baselines} shows the structure that will be populated. Methodology follows the standard IR significance protocol~\cite{smucker2007comparison}: a per-instance paired bootstrap on the recall delta (10{,}000 resamples, percentile CI) and a Wilcoxon signed-rank $p$-value, both computed via \texttt{benchmarks/adapters/final\_eval.py::render\_comparison\_table}.
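The two statistics named in the protocol can be sketched with the standard library alone. The function names are hypothetical, and the Wilcoxon test below uses the normal approximation with tie correction, which is an assumption made for brevity and not necessarily what \texttt{final\_eval.py} does.

```python
import math
import random


def paired_bootstrap_delta(ours, baseline, n_resamples=10_000, seed=42, level=0.95):
    """Return (mean delta, lower, upper) from a paired percentile bootstrap."""
    if not ours or len(ours) != len(baseline):
        raise ValueError("inputs must be non-empty and paired per instance")
    deltas = [a - b for a, b in zip(ours, baseline)]
    n = len(deltas)
    rng = random.Random(seed)
    # Resampling deltas (not each system separately) preserves the pairing.
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    lower = means[int((1 - level) / 2 * n_resamples)]
    upper = means[int((1 + level) / 2 * n_resamples) - 1]
    return sum(deltas) / n, lower, upper


def wilcoxon_signed_rank_p(ours, baseline):
    """Two-sided p-value, normal approximation (assumption), zeros dropped."""
    diffs = [a - b for a, b in zip(ours, baseline) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mu) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))
```

For small samples an exact signed-rank distribution would be preferable to the normal approximation; at the sample sizes in the table the difference should be negligible.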
524+
525+
\paragraph{BM25 over patch identifiers.} \texttt{rank-bm25} indexes all repository files; the query is the set of identifiers extracted from the gold patch (camelCase + snake\_case decomposition, language-keyword filter). Files are packed in BM25 rank order under strict greedy budget compliance. The implementation is in \texttt{benchmarks/baselines/bm25\_baseline.py}. This baseline isolates pure lexical retrieval without any graph structure.
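The query construction and packing steps of this baseline can be sketched as below. All names are hypothetical, the keyword list is an illustrative subset, and the packing rule (skip a file that does not fit, then continue down the ranking) is one assumed reading of ``strict greedy budget compliance''. The BM25 ranking itself is not shown.

```python
import re

# Illustrative subset only; a real filter would be per-language.
KEYWORDS = frozenset({"def", "return", "if", "else", "for", "while", "import", "class"})

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Splits camelCase and keeps acronyms whole: parseHTTPResponse -> parse, HTTP, Response.
_CAMEL_PIECE = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def query_terms(patch_text):
    """Extract lower-cased identifier pieces from patch text, minus keywords."""
    terms = set()
    for identifier in _IDENTIFIER.findall(patch_text):
        if identifier in KEYWORDS:
            continue
        for part in identifier.split("_"):  # snake_case decomposition
            for piece in _CAMEL_PIECE.findall(part):  # camelCase decomposition
                piece = piece.lower()
                if piece not in KEYWORDS:
                    terms.add(piece)
    return terms


def pack_greedy(ranked_paths, token_counts, budget):
    """Walk the ranking once, adding each file that still fits the budget."""
    selected, used = [], 0
    for path in ranked_paths:
        cost = token_counts[path]
        if used + cost <= budget:  # never exceed the budget
            selected.append(path)
            used += cost
    return selected, used
```

Skipping oversized files rather than stopping at the first one that does not fit lets lower-ranked small files use the remaining budget, which tends to raise recall at some cost in precision.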
526+
527+
\paragraph{Aider repo-map.} Aider's repo-map is the reference open-source approach built on the same primitives (tree-sitter AST + Personalized PageRank + token-budgeted packing). It is invoked through an isolated \texttt{uv tool} subprocess with the upstream \texttt{aider-chat==0.86.2} package, with \texttt{mentioned\_fnames} populated from \texttt{instance.gold\_files} and \texttt{mentioned\_idents} from the gold patch. This input mapping favors Aider relative to diffctx, which receives only the patch text. The implementation is in \texttt{benchmarks/baselines/aider\_baseline.py}.
528+
529+
\begin{table}[h]
530+
\centering
531+
\caption{diffctx (hybrid) vs.\ external baselines at $B{=}8000$ tokens. $\Delta$ is the per-instance paired delta in file recall (positive favors diffctx), reported with a 95\% paired-bootstrap percentile CI; the $p$-value is from a Wilcoxon signed-rank test. \emph{Placeholder: cells to be filled when baseline runs complete.}}
532+
\label{tab:prelim-baselines}
533+
\begin{tabular}{lrlllc}
534+
\toprule
535+
\textbf{Test set} & \textbf{n} & \textbf{diffctx} & \textbf{baseline} & \textbf{$\Delta$ recall [95\% CI]} & \textbf{Wilcoxon $p$} \\
536+
\midrule
537+
\multicolumn{6}{l}{\emph{vs.\ BM25 over patch identifiers}} \\
538+
\midrule
539+
ContextBench Verified & \emph{500} & 0.793 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
540+
PolyBench-500 & \emph{500} & \emph{TBD} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
541+
SWE-bench Verified & \emph{500} & \emph{TBD} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
542+
\textbf{Pooled} & \emph{1500} & \emph{TBD} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
543+
\midrule
544+
\multicolumn{6}{l}{\emph{vs.\ Aider repo-map}} \\
545+
\midrule
546+
ContextBench Verified & \emph{500} & 0.793 & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
547+
PolyBench-500 & \emph{500} & \emph{TBD} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
548+
SWE-bench Verified & \emph{500} & \emph{TBD} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
549+
\textbf{Pooled} & \emph{1500} & \emph{TBD} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
550+
\bottomrule
551+
\end{tabular}
552+
\end{table}
553+
554+
\subsection{Scoring-Mode Ablation}
555+
\label{sec:prelim-ablation}
556+
557+
Section~\ref{sec:scoring} introduces four pluggable scoring modes (Hybrid, PPR, EGO, internal BM25); the final hybrid combiner adapts among them at query time. To attribute the source of recall, we run each of the four modes in isolation on a stratified subset of the test manifests, with operational hyperparameters held at their hybrid-optimal values ($\tau{=}0.12$, $\beta_{\mathrm{core}}{=}0.5$, $B{=}8000$). Per-mode re-tuning is deferred as a sensitivity question: if the hybrid hyperparameters disadvantage a single-mode run, that disadvantage is itself an argument against deploying the single mode.
558+
559+
Section~\ref{sec:ego} predicts that PPR with high restart probability $\alpha$ approximates bounded-radius EGO. The ablation tests this prediction empirically: a monotone ordering BM25 $\prec$ EGO $\prec$ PPR would give the strongest support for the structural-relevance hypothesis, whereas an EGO $\approx$ PPR result would instead inform conclusions about the radius of relevance for diff-aware context selection.
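The intuition behind the high-$\alpha$ prediction can be illustrated with a toy power iteration. This is a sketch on an assumed adjacency-list graph in which every node has at least one neighbor; it is not the diffctx implementation, and the function name is hypothetical.

```python
def personalized_pagerank(adjacency, seed, alpha, iterations=200):
    """Power iteration for p = alpha * e_seed + (1 - alpha) * (walk step of p).

    `alpha` is the restart probability; every node must have >= 1 neighbor.
    """
    nodes = list(adjacency)
    scores = {v: 1.0 if v == seed else 0.0 for v in nodes}
    for _ in range(iterations):
        # Restart mass goes to the seed; the rest follows outgoing edges.
        updated = {v: (alpha if v == seed else 0.0) for v in nodes}
        for u in nodes:
            share = (1 - alpha) * scores[u] / len(adjacency[u])
            for v in adjacency[u]:
                updated[v] += share
        scores = updated
    return scores


# Toy path graph 0 - 1 - 2 - 3 - 4 with the seed at node 0.
PATH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
high = personalized_pagerank(PATH, seed=0, alpha=0.9)
low = personalized_pagerank(PATH, seed=0, alpha=0.15)
mass_within_one_hop_high = high[0] + high[1]
mass_within_one_hop_low = low[0] + low[1]
```

With a high restart probability almost all score mass stays within one hop of the seed, which is the sense in which PPR behaves like a bounded-radius ego network; lowering $\alpha$ spreads mass further along the graph.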
560+
561+
\begin{table}[h]
562+
\centering
563+
\caption{Scoring-mode ablation at $B{=}8000$, hybrid-optimal operational hyperparameters. File recall with 95\% percentile bootstrap CIs. \emph{Placeholder: cells to be filled once each non-hybrid mode completes on the ablation subset.}}
564+
\label{tab:prelim-ablation}
565+
\begin{tabular}{lrllll}
566+
\toprule
567+
\textbf{Test set} & \textbf{n} & \textbf{Hybrid} & \textbf{PPR} & \textbf{EGO} & \textbf{BM25 (internal)} \\
568+
\midrule
569+
ContextBench Verified & \emph{TBD} & \emph{TBD} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
570+
PolyBench-500 & \emph{TBD} & \emph{TBD} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
571+
SWE-bench Verified & \emph{TBD} & \emph{TBD} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
572+
\textbf{Pooled} & \emph{TBD} & \emph{TBD} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
573+
\bottomrule
574+
\end{tabular}
575+
\end{table}
576+
577+
\subsection{Budget Curve}
578+
\label{sec:prelim-budget}
579+
580+
To support claims about budget efficiency, the same protocol is repeated at $B \in \{8000, 16000, 32000\}$ on the test manifests. A flat-or-sublinear recall curve would indicate diminishing returns from a larger budget, which is a load-bearing claim for the ``smaller is better'' framing of the comp-per-token framework. The curve is reported in Table~\ref{tab:prelim-budget}; the $B{=}8000$ row is identical to the pooled row of Table~\ref{tab:prelim-bench}.
581+
582+
\begin{table}[h]
583+
\centering
584+
\caption{Budget curve: pooled file recall under hybrid mode at three budgets, on the full 1500-instance test set. \emph{Placeholder: the $B{=}8000$ row is interim ($n{=}845$); the $B{=}16{,}000$ and $B{=}32{,}000$ runs are queued.}}
585+
\label{tab:prelim-budget}
586+
\begin{tabular}{lrlll}
587+
\toprule
588+
\textbf{Budget $B$} & \textbf{n} & \textbf{Mean recall [95\% CI]} & \textbf{Mean used tokens} & \textbf{Recall / used token (k)} \\
589+
\midrule
590+
$8{,}000$ & 845 (interim) & 0.855 [0.837, 0.873] & \emph{TBD$^{\dagger}$} & \emph{TBD$^{\dagger}$} \\
591+
$16{,}000$ & \emph{TBD} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
592+
$32{,}000$ & \emph{TBD} & \emph{TBD} & \emph{TBD} & \emph{TBD} \\
593+
\bottomrule
594+
\end{tabular}
595+
\end{table}
596+
597+
$^{\dagger}$ Mean used tokens is currently reported as zero in v1 result rows because of a key-mapping bug in \texttt{benchmarks/diffctx\_eval\_fn.py}, which reads a key that the pipeline does not emit. Recall and precision are unaffected. The bug is documented in our project tracker, and fixed runs are scheduled before the budget curve is finalized; once \texttt{used\_tokens} reflects the actual encoder count, recall per used token (rather than recall per nominal budget) becomes the reportable efficiency metric.
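The efficiency column can be computed as sketched below. The row keys \texttt{recall} and \texttt{used\_tokens}, the function name, and the choice of a ratio of means (rather than a mean of per-instance ratios) are assumptions made for illustration; the guard returns no ratio when used tokens are zero, as in the affected v1 rows.

```python
def recall_per_k_tokens(rows):
    """Return (mean recall, mean used tokens, recall per 1000 used tokens).

    The third value is None when mean used tokens is not positive, so rows
    affected by a zeroed token count never yield a misleading ratio.
    """
    if not rows:
        raise ValueError("rows must be non-empty")
    mean_recall = sum(row["recall"] for row in rows) / len(rows)
    mean_used = sum(row["used_tokens"] for row in rows) / len(rows)
    if mean_used <= 0:
        return mean_recall, mean_used, None
    return mean_recall, mean_used, mean_recall / (mean_used / 1000)
```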
598+
520599
\subsection{Limitations of the Current Empirical Snapshot}
521600

522601
The numbers in this section reflect a partial v1 run. The following items are scheduled work, not deferred future work, and will be incorporated into the next version of this section before submission:

docs/Context-Selection-for-Git-Diff/v2/references.bib

Lines changed: 9 additions & 0 deletions
@@ -253,3 +253,12 @@ @inproceedings{jimenez2024swebench
253253
booktitle={Proceedings of the International Conference on Learning Representations (ICLR)},
254254
year={2024}
255255
}
256+
257+
@inproceedings{smucker2007comparison,
258+
title={A comparison of statistical significance tests for information retrieval evaluation},
259+
author={Smucker, Mark D. and Allan, James and Carterette, Ben},
260+
booktitle={Proceedings of the ACM Conference on Information and Knowledge Management (CIKM)},
261+
pages={623--632},
262+
year={2007},
263+
doi={10.1145/1321440.1321528}
264+
}
