\caption{Per-benchmark interim file-level metrics, scoring=hybrid, $B{=}8000$ tokens. Recall and precision are means over completed instances. Status \texttt{ok} excludes \texttt{clone\_fail} (4 Java instances on ContextBench Verified) and pending instances.}
\caption{Per-benchmark file-level metrics, scoring=hybrid, $B{=}8000$ tokens, with 95\% percentile bootstrap CIs ($B{=}10{,}000$ resamples, seed=42). Status \texttt{ok} excludes \texttt{clone\_fail} (4 Java instances on ContextBench Verified) and pending instances. SWE-bench Verified row is a placeholder pending completion of the in-flight run; the full table will be re-emitted when $n=1500$.}
\paragraph{Precision is uniformly low ($\approx0.12$).} The current selector packs files until the budget is exhausted; on small gold sets ($\sim1$--$3$ files per instance) this produces many false-positive selections that depress precision. This is a methodological artifact, not a flaw: at fixed budget, recall is the load-bearing metric, and precision is bounded above by $|G|/k$, where $G$ is the gold file set and $k$ is the number of files selected. We report precision for completeness but do not optimize against it.
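
As an illustration of this bound, write $S$ for the selected file set with $|S| = k$ and $G$ for the gold file set; the symbol $S$ and the numbers below are illustrative assumptions, not measured values. Then
\[
  \mathrm{precision} \;=\; \frac{|S \cap G|}{|S|} \;\le\; \frac{\min(|G|,\, k)}{k},
\]
so an instance with $|G| = 2$ gold files and $k = 16$ selected files cannot exceed a precision of $2/16 = 0.125$, even at perfect recall.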
\subsection{Baseline Comparisons}
\label{sec:prelim-baselines}
Two external file-level baselines were implemented to run the same protocol (same manifests, same budget, same metric) so that the comparison is apples-to-apples. Both are scheduled to execute after the diffctx hybrid run completes; the rendered comparison tables below show the structure that will be populated. Methodology follows the standard IR significance protocol~\cite{smucker2007comparison}: per-instance paired bootstrap on the recall delta ($B{=}10{,}000$, percentile CI) and Wilcoxon signed-rank $p$-value, computed via \texttt{benchmarks/adapters/final\_eval.py::render\_comparison\_table}.
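
For concreteness, the paired statistic can be sketched as follows. The notation is an assumption introduced here for illustration and is not a transcription of \texttt{render\_comparison\_table}. Let $r_i^{\mathrm{d}}$ and $r_i^{\mathrm{b}}$ denote the file recall of diffctx and of a baseline on instance $i$, for $i = 1, \dots, n$:
\[
  \Delta_i \;=\; r_i^{\mathrm{d}} - r_i^{\mathrm{b}},
  \qquad
  \bar{\Delta} \;=\; \frac{1}{n} \sum_{i=1}^{n} \Delta_i .
\]
Each bootstrap replicate draws $n$ indices with replacement from $\{1, \dots, n\}$ and recomputes the mean of the resampled $\Delta_i$; the 95\% percentile CI is the interval between the 2.5th and 97.5th percentiles of the replicate means. The Wilcoxon signed-rank test is applied to the same per-instance $\Delta_i$.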
\paragraph{BM25 over patch identifiers.} \texttt{rank-bm25} indexes all repository files; the query is the set of identifiers extracted from the gold patch (camelCase + snake\_case decomposition, language-keyword filter). Files are packed in BM25 rank order under strict greedy budget compliance. Implementation in \texttt{benchmarks/baselines/bm25\_baseline.py}. This isolates pure lexical retrieval without any graph structure.
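
For reference, the textbook Okapi BM25 score of a file $D$ for a query $Q$ has the form below. Which BM25 variant and which values of $k_1$ and $b$ the baseline uses are not stated here, so this should be read as the standard form rather than as a description of \texttt{bm25\_baseline.py}:
\[
  \mathrm{score}(D, Q) \;=\; \sum_{q \in Q} \mathrm{IDF}(q)\,
  \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\]
where $f(q, D)$ is the frequency of identifier $q$ in $D$, $|D|$ is the length of the file in tokens, and $\mathrm{avgdl}$ is the mean file length over the indexed repository.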
\paragraph{Aider repo-map.} The reference open-source approach using the same primitives (tree-sitter AST + Personalized PageRank + token-budgeted packing). Invoked through an isolated \texttt{uv tool} subprocess with the upstream \texttt{aider-chat==0.86.2} package; \texttt{mentioned\_fnames} populated from \texttt{instance.gold\_files} and \texttt{mentioned\_idents} from the gold patch. This input mapping favors Aider relative to diffctx, which receives only the patch text. Implementation in \texttt{benchmarks/baselines/aider\_baseline.py}.
\begin{table}[h]
\centering
\caption{diffctx (hybrid) vs.\ external baselines at $B{=}8000$ tokens. $\Delta$ is the per-instance paired delta in file recall (positive favors diffctx). 95\% paired-bootstrap percentile CI on $\Delta$; $p$-value from Wilcoxon signed-rank. \emph{Placeholder: cells to be filled when baseline runs complete.}}
Section~\ref{sec:scoring} introduces four pluggable scoring modes (Hybrid, PPR, EGO, internal BM25); the final hybrid combiner adapts among them at query time. To attribute the source of recall, we run each of the four modes in isolation on a stratified subset of the test manifests with operational hyperparameters held at their hybrid-optimal values ($\tau{=}0.12$, $\beta_{\mathrm{core}}{=}0.5$, $B{=}8000$). Per-mode re-tuning is left as a sensitivity question: if the hybrid hyperparameters disadvantage a single-mode run, that disadvantage is itself an argument against the single mode in deployment.

Section~\ref{sec:ego} predicts that PPR with high restart probability $\alpha$ approximates bounded-radius EGO. The ablation tests this empirically: a monotone ordering BM25 $\prec$ EGO $\prec$ PPR would be the strongest support for the structural-relevance hypothesis, whereas an EGO $\approx$ PPR result would suggest that a bounded radius already captures most of the relevant context for diff-aware context selection.
\begin{table}[h]
\centering
\caption{Scoring-mode ablation at $B{=}8000$, hybrid-optimal operational hyperparameters. File recall with 95\% percentile bootstrap CIs. \emph{Placeholder: cells to be filled once each non-hybrid mode completes on the ablation subset.}}
To support claims about budget efficiency, the same protocol is repeated at $B \in\{8000, 16000, 32000\}$ on the test manifests. A flat-or-sublinear recall curve would indicate diminishing returns from a larger budget --- a load-bearing claim for the ``smaller is better'' framing of the comp-per-token framework. The curve is reported in Table~\ref{tab:prelim-budget}; the $B{=}8000$ column is identical to the pooled row of Table~\ref{tab:prelim-bench}.
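
One way to make the diminishing-returns criterion explicit, using notation introduced here for illustration only: let $\bar{R}(B)$ denote the pooled mean file recall at budget $B$. The marginal recall per additional token decreases across the three budgets when
\[
  \frac{\bar{R}(32000) - \bar{R}(16000)}{16000}
  \;<\;
  \frac{\bar{R}(16000) - \bar{R}(8000)}{8000},
\]
which is one sufficient reading of a sublinear curve; a flat curve is the limiting case in which both numerators are approximately zero.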
\begin{table}[h]
\centering
\caption{Budget curve: pooled file recall under hybrid mode at three budgets, on the full 1500-instance test set. \emph{Placeholder: $B{=}16{,}000$ and $B{=}32{,}000$ runs queued.}}
\label{tab:prelim-budget}
\begin{tabular}{lrlll}
\toprule
\textbf{Budget $B$} & \textbf{n} & \textbf{Mean recall [95\% CI]} & \textbf{Mean used tokens} & \textbf{Recall / used token (k)} \\
$^{\dagger}$ Mean used tokens is presently zero in v1 result rows due to a key-mapping bug in \texttt{benchmarks/diffctx\_eval\_fn.py} that reads a key not emitted by the pipeline. Recall and precision are unaffected. The bug is documented in our project tracker and fixed runs are scheduled before the budget curve is finalized; once \texttt{used\_tokens} reflects the actual encoder count, recall-per-used-token (rather than recall-per-nominal-budget) becomes the reportable efficiency metric.
\subsection{Limitations of the Current Empirical Snapshot}
The numbers in this section reflect a partial v1 run. The following items are scheduled work, not deferred future work, and will be incorporated into the next version of this section before submission: