Stabilize profit controller depth policy

Anbeeld · Anbeeld · commit 20e1ff940e3d · 2026-05-13T18:32:30.000+02:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,8 +3,9 @@
 ## v0.1.2
 
 - Fixed the adaptive `profit` controller's no-spec baseline path. Profit mode now seeds baseline samples before positive-depth warmup, can shut DFlash fully off when the measured baseline wins, and no longer makes speculative decisions from draft-only telemetry.
-- Added periodic profit-controller baseline reprobes with `--spec-dm-profit-baseline-interval` / `LLAMA_ARG_SPEC_DM_PROFIT_BASELINE_INTERVAL` so long-context runs can refresh target-only timing as context grows. Off-state probes now restart with the configured probe depth instead of jumping straight to full draft depth.
-- Made profit depth selection less coarse by scoring every integer draft depth up to the supported telemetry window, preserving the previous active depth across baseline reprobes, and avoiding off-probe counter starvation from repeated baseline cycles.
+- Fixed a profit-controller bucket-transition deadlock where telemetry reset could clear the no-spec baseline while preserving a positive active draft depth, causing all later cycles to run as unrecorded single-token baseline and leaving DFlash permanently disabled.
+- Added periodic profit-controller baseline reprobes with `--spec-dm-profit-baseline-interval` / `LLAMA_ARG_SPEC_DM_PROFIT_BASELINE_INTERVAL` so long-context runs can refresh target-only timing as context grows. The default interval is 512 active speculative cycles, and periodic reprobes start only in longer context buckets; bucket transitions still seed a fresh baseline. Off-state probes now restart with the configured probe depth instead of jumping straight to full draft depth.
+- Stabilized profit depth selection on the production ladder (`0`-`8`, `10`, `12`, `14`, `16`, and the configured max) while preserving the previous active depth across baseline reprobes and avoiding off-probe counter starvation from repeated baseline cycles.
 - Hardened active-reasoning EOS handling. When an end-of-generation token appears while reasoning output is still active, the sampler now forces the reasoning-end sequence through the normal full-logits path; reduced DFlash verification rejects that case instead of accepting an unsafe reduced candidate set.
 - Hardened DFlash on split CUDA / multi-GPU placement. GPU cross-ring setup, hidden capture, CUDA graph capture, K/V projection cache updates, recurrent replay, conv replay, and async tensor get/set paths now check buffer/backend ownership and fall back to safer CPU or owning-buffer paths instead of reading or writing recurrent state through the wrong CUDA backend.
 - Added clearer diagnostics and regression coverage for multi-GPU DFlash fallback decisions, CUDA graph buffer visibility, wrong-device async tensor access, active-reasoning reduced-sampling rejection, adaptive DM defaults, and profit-controller baseline behavior.
diff --git a/common/arg.cpp b/common/arg.cpp
@@ -3843,7 +3843,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
     ).set_spec().set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_SPEC_DM_PROFIT_WARMUP"));
     add_opt(common_arg(
         {"--spec-dm-profit-baseline-interval"}, "N",
-        string_format("active profit-controller cycles between no-spec baseline reprobes (default: %d, 0 = disabled)", params.speculative.dm_profit_baseline_interval),
+        string_format("active profit-controller cycles between long-context no-spec baseline reprobes (default: %d, 0 = disabled)", params.speculative.dm_profit_baseline_interval),
         [](common_params & params, int value) {
             if (value < 0 || value > 4096) {
                 throw std::invalid_argument("spec-dm-profit-baseline-interval must be in [0, 4096]");
diff --git a/common/common.h b/common/common.h
@@ -402,7 +402,7 @@ struct common_params_speculative {
     float   dm_profit_ewma_alpha   = 0.15f;
     int32_t dm_profit_min_samples  = 3;
     int32_t dm_profit_warmup       = 0;     // positive-depth warmup cycles after baseline seeding (0 = auto from min_samples)
-    int32_t dm_profit_baseline_interval = 128; // active spec cycles between no-spec baseline reprobes (0 = disabled)
+    int32_t dm_profit_baseline_interval = 512; // active spec cycles between long-context no-spec baseline reprobes (0 = disabled)
 
     // DFlash draft model (separate from upstream's draft.model)
     struct common_params_model mparams_dft;
diff --git a/docs/beellama-args.md b/docs/beellama-args.md
@@ -310,7 +310,7 @@ Adaptive Draft-Max is enabled by default for DFlash. It can reduce the active dr
 | `--spec-dm-profit-ewma-alpha F` | `0.15` | Smoothing factor for acceptance and timing running averages. |
 | `--spec-dm-profit-min-samples N` | `3` | Minimum observations per position/depth before scoring that depth as ready. |
 | `--spec-dm-profit-warmup N` | `0` | Positive-depth warmup cycles after the no-spec baseline is seeded (0 = use --spec-dm-profit-min-samples). |
-| `--spec-dm-profit-baseline-interval N` | `128` | Active speculative cycles between no-spec baseline reprobes (0 = disabled). |
+| `--spec-dm-profit-baseline-interval N` | `512` | Active speculative cycles between long-context no-spec baseline reprobes (0 = disabled). |
 
 Use `profit` for normal serving. Use `fringe` when you want behavior tied more directly to observed draft acceptance near the active tail. Use `--no-spec-dm-adaptive` only when comparing fixed `--spec-draft-n-max` values or reproducing a narrow benchmark.
 
diff --git a/docs/beellama-features.md b/docs/beellama-features.md
@@ -128,7 +128,7 @@ This is not the same as public buun's checked DFlash adaptive tracking. Bee adds
 --spec-dm-profit-ewma-alpha 0.15
 --spec-dm-profit-min-samples 3
 --spec-dm-profit-warmup 0
---spec-dm-profit-baseline-interval 128
+--spec-dm-profit-baseline-interval 512
 ```
 
 Use `--no-spec-dm-adaptive` when you need a fixed-depth benchmark. Otherwise, adaptive mode is the safer default for live serving because it can back away from weak drafts without changing the process command line.
diff --git a/tests/test-adaptive-dm.cpp b/tests/test-adaptive-dm.cpp
@@ -42,7 +42,7 @@ int main() {
     assert_candidates(1,  {0, 1});
     assert_candidates(2,  {0, 1, 2});
     assert_candidates(8,  {0, 1, 2, 3, 4, 5, 6, 7, 8});
-    assert_candidates(16, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16});
+    assert_candidates(16, {0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16});
 
     assert(server_adaptive_dm_next_explore_depth(0, 8, 0.25f) == 2);
     assert(server_adaptive_dm_next_explore_depth(2, 8, 0.25f) == 3);
@@ -99,6 +99,10 @@ int main() {
         assert(server_adaptive_dm_apply_profit_hysteresis(2, 4, 30.0f, 30.9f, 0.06f, 0.02f, cand, nc8) == 2);
         assert(server_adaptive_dm_apply_profit_hysteresis(2, 8, 30.0f, 33.0f, 0.06f, 0.02f, cand, nc8) == 4);
         assert(server_adaptive_dm_apply_profit_hysteresis(6, 2, 30.0f, 31.5f, 0.06f, 0.02f, cand, nc8) == 5);
+
+        int cand16[160];
+        const int nc16 = server_adaptive_dm_build_candidates(16, cand16, 160);
+        assert(server_adaptive_dm_apply_profit_hysteresis(16, 8, 30.0f, 36.0f, 0.06f, 0.02f, cand16, nc16) == 14);
     }
 
     assert(server_adaptive_dm_should_preserve_for_continuation(0.995f, 1.000f));
@@ -271,12 +275,40 @@ int main() {
     state.reset_profit_if_config_changed(spec, 8, 0);
     assert(state.profit_has_key);
     assert(state.profit_depth[2].samples == 0);
+    assert(state.adaptive_n_max == -1);
     state.observe_profit_timing(2, 10.0f, 20.0f, 5.0f, 35.0f);
     state.reset_profit_if_config_changed(spec, 8, 0);
     assert(state.profit_depth[2].samples == 1);
+    state.adaptive_n_max = 8;
     spec.dflash_cross_ctx = 2048;
     state.reset_profit_if_config_changed(spec, 8, 0);
     assert(state.profit_depth[2].samples == 0);
+    assert(state.adaptive_n_max == -1);
+
+    // test a bucket transition cannot leave baseline collection invisible
+    {
+        server_adaptive_dm_state bucket;
+        common_params_speculative bucket_spec;
+        bucket_spec.n_max = 16;
+        bucket_spec.branch_budget = 0;
+        bucket_spec.draft_topk = 1;
+        bucket_spec.dflash_cross_ctx = 1024;
+        bucket_spec.sample_temp = 0.0f;
+        bucket_spec.p_min = 0.0f;
+
+        bucket.reset_profit_if_config_changed(bucket_spec, 16, 1000);
+        bucket.observe_profit_timing(0, 0.0f, 30.0f, 0.0f, 30.0f);
+        bucket.observe_profit_timing(0, 0.0f, 31.0f, 0.0f, 31.0f);
+        bucket.observe_profit_timing(0, 0.0f, 32.0f, 0.0f, 32.0f);
+        bucket.apply_profit_recommendation(14);
+        assert(bucket.profit_baseline_ready());
+        assert(!bucket.profit_expects_baseline_sample());
+
+        bucket.reset_profit_if_config_changed(bucket_spec, 16, 9000);
+        assert(bucket.adaptive_n_max == -1);
+        assert(!bucket.profit_baseline_ready());
+        assert(bucket.profit_expects_baseline_sample());
+    }
 
     // test cross-depth estimation
     {
@@ -319,6 +351,14 @@ int main() {
         server_adaptive_dm_state reprobe;
         reprobe.dm_profit_min_samples = 1;
         reprobe.dm_profit_baseline_interval = 3;
+        common_params_speculative reprobe_spec;
+        reprobe_spec.n_max = 8;
+        reprobe_spec.branch_budget = 0;
+        reprobe_spec.draft_topk = 1;
+        reprobe_spec.dflash_cross_ctx = 1024;
+        reprobe_spec.sample_temp = 0.0f;
+        reprobe_spec.p_min = 0.0f;
+        reprobe.reset_profit_if_config_changed(reprobe_spec, 8, 40000);
         reprobe.adaptive_n_max = 8;
         reprobe.observe_profit_timing(0, 0.0f, 40.0f, 0.0f, 40.0f);
         reprobe.observe_profit_acceptance(8, 7);
@@ -335,5 +375,25 @@ int main() {
         assert(reprobe.decide_profit_n_max(8) == 8);
     }
 
+    // test periodic baseline reprobes wait until longer context buckets
+    {
+        server_adaptive_dm_state early;
+        early.dm_profit_min_samples = 1;
+        early.dm_profit_baseline_interval = 1;
+        common_params_speculative early_spec;
+        early_spec.n_max = 8;
+        early_spec.branch_budget = 0;
+        early_spec.draft_topk = 1;
+        early_spec.dflash_cross_ctx = 1024;
+        early_spec.sample_temp = 0.0f;
+        early_spec.p_min = 0.0f;
+        early.reset_profit_if_config_changed(early_spec, 8, 4096);
+        early.adaptive_n_max = 8;
+        early.observe_profit_timing(0, 0.0f, 40.0f, 0.0f, 40.0f);
+        early.observe_profit_acceptance(8, 7);
+        early.observe_profit_timing(8, 8.0f, 30.0f, 2.0f, 40.0f);
+        assert(!early.profit_should_probe_baseline());
+    }
+
     return 0;
 }
diff --git a/tests/test-arg-parser.cpp b/tests/test-arg-parser.cpp
@@ -145,7 +145,7 @@ int main(void) {
     assert(params.speculative.dm_controller == COMMON_SPECULATIVE_DM_CONTROLLER_PROFIT);
     assert(params.speculative.dm_profit_min_samples == 3);
     assert(params.speculative.dm_profit_warmup == 0);
-    assert(params.speculative.dm_profit_baseline_interval == 128);
+    assert(params.speculative.dm_profit_baseline_interval == 512);
 
     argv = {"binary_name", "--spec-draft-p-min", "0"};
     assert(true == common_params_parse(argv.size(), list_str_to_char(argv).data(), params, LLAMA_EXAMPLE_SERVER));
diff --git a/tools/server/server-adaptive-dm.h b/tools/server/server-adaptive-dm.h
@@ -11,6 +11,7 @@
 static constexpr int SERVER_ADAPTIVE_DM_PROFIT_POSITIONS  = 128;
 static constexpr int SERVER_ADAPTIVE_DM_PROFIT_DEPTHS     = SERVER_ADAPTIVE_DM_PROFIT_POSITIONS + 1;
 static constexpr int SERVER_ADAPTIVE_DM_PROFIT_CANDIDATES = SERVER_ADAPTIVE_DM_PROFIT_DEPTHS + 1;
+static constexpr int SERVER_ADAPTIVE_DM_PROFIT_BASELINE_REPROBE_MIN_BUCKET = 3;
 
 static inline int server_adaptive_dm_probe_n_max(int base_n_max, float probe_fraction) {
     if (base_n_max <= 0) {
@@ -75,24 +76,26 @@ static inline int server_adaptive_dm_build_candidates(int base_n_max, int * out,
     }
 
     int n = 0;
-
-    const int max_depth = std::clamp(base_n_max, 0, SERVER_ADAPTIVE_DM_PROFIT_DEPTHS - 1);
-    for (int candidate = 0; candidate <= max_depth && n < out_cap; ++candidate) {
-        out[n++] = candidate;
-    }
-
-    if (base_n_max > max_depth && n < out_cap) {
+    const int ladder[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, base_n_max};
+    for (const int candidate : ladder) {
+        if (candidate < 0 || candidate > base_n_max) {
+            continue;
+        }
         bool exists = false;
         for (int i = 0; i < n; ++i) {
-            if (out[i] == base_n_max) {
+            if (out[i] == candidate) {
                 exists = true;
                 break;
             }
         }
         if (!exists) {
-            out[n++] = base_n_max;
+            out[n++] = candidate;
+            if (n >= out_cap) {
+                break;
+            }
         }
     }
+    std::sort(out, out + n);
     return n;
 }
 
@@ -275,7 +278,7 @@ struct server_adaptive_dm_state {
     float   dm_profit_ewma_alpha   = 0.15f;
     int32_t dm_profit_min_samples  = 3;
     int32_t dm_profit_warmup       = 0;
-    int32_t dm_profit_baseline_interval = 128;
+    int32_t dm_profit_baseline_interval = 512;
 
     struct profit_depth_stats {
         int32_t samples = 0;
@@ -371,6 +374,10 @@ struct server_adaptive_dm_state {
         profit_has_key = false;
         profit_epoch++;
         reset_request_profit_state();
+        adaptive_n_max = -1;
+        adaptive_probe_counter = 0;
+        explore_counter = 0;
+        off_dwell = 0;
     }
 
     void reset_profit_if_config_changed(const common_params_speculative & spec, int base_n_max, int32_t n_past) {
@@ -420,9 +427,16 @@ struct server_adaptive_dm_state {
             profit_baseline_ready() &&
             !profit_baseline_probe_pending &&
             adaptive_n_max > 0 &&
+            profit_key.context_bucket >= SERVER_ADAPTIVE_DM_PROFIT_BASELINE_REPROBE_MIN_BUCKET &&
             profit_cycles_since_baseline >= dm_profit_baseline_interval;
     }
 
+    bool profit_expects_baseline_sample() const {
+        return !profit_baseline_ready() ||
+            profit_baseline_probe_pending ||
+            adaptive_n_max <= 0;
+    }
+
     void profit_mark_baseline_probe() {
         profit_baseline_probe_pending = true;
         profit_baseline_probe_resume_n = adaptive_n_max;
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
@@ -3541,7 +3541,7 @@ struct server_context_impl {
             } else {
                 if (slot.can_speculate() && slot.dm_adaptive &&
                         server_adaptive_dm_uses_profit_controller(slot.dm_controller) &&
-                        slot.adaptive_n_max == 0) {
+                        slot.profit_expects_baseline_sample()) {
                     profit_baseline_slot = &slot;
                     n_profit_baseline_slots++;
                 }
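
Reviewer note: the two behavioral changes in this commit (the sparse candidate ladder and the widened baseline-slot check) can be sketched in isolation. The block below is a minimal, hypothetical reconstruction and not the code in `server-adaptive-dm.h`. The names `build_profit_ladder`, `profit_gate`, `expects_baseline_sample`, and `should_probe_baseline` are made up for illustration. The handling of `base_n_max <= 0` and the `interval > 0` guard are assumptions, since those parts of the real functions fall outside the hunks shown.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for server_adaptive_dm_build_candidates: every depth
// from 0 to 8, then even depths up to 16, plus the configured max itself.
// Assumption: a negative base_n_max yields an empty ladder.
std::vector<int> build_profit_ladder(int base_n_max) {
    std::vector<int> out;
    if (base_n_max < 0) {
        return out;
    }
    const int ladder[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, base_n_max};
    for (const int candidate : ladder) {
        if (candidate > base_n_max) {
            continue; // never offer a depth above the configured max
        }
        if (std::find(out.begin(), out.end(), candidate) == out.end()) {
            out.push_back(candidate); // skip duplicates such as base_n_max == 8
        }
    }
    std::sort(out.begin(), out.end());
    return out;
}

// Hypothetical, flattened view of the controller fields the two predicates read.
struct profit_gate {
    bool baseline_ready        = false;
    bool probe_pending         = false;
    int  adaptive_n_max        = -1; // -1 = unset after a telemetry reset
    int  context_bucket        = 0;
    int  cycles_since_baseline = 0;
    int  interval              = 512; // 0 = periodic reprobes disabled
};

// Mirrors profit_expects_baseline_sample() from the diff: a slot should run as
// a recorded baseline cycle whenever the baseline is missing, a reprobe is
// pending, or the active depth is not positive (which covers both 0 and -1).
bool expects_baseline_sample(const profit_gate & g) {
    return !g.baseline_ready || g.probe_pending || g.adaptive_n_max <= 0;
}

// Mirrors the visible conditions of profit_should_probe_baseline(); the
// interval > 0 guard is an assumption based on the documented "0 = disabled".
bool should_probe_baseline(const profit_gate & g, int min_bucket = 3) {
    assert(g.interval >= 0);
    return g.interval > 0 &&
        g.baseline_ready &&
        !g.probe_pending &&
        g.adaptive_n_max > 0 &&
        g.context_bucket >= min_bucket &&
        g.cycles_since_baseline >= g.interval;
}
```

With these definitions, a freshly reset slot (`adaptive_n_max == -1`, no baseline) is picked up by `expects_baseline_sample`, whereas the previous `adaptive_n_max == 0` comparison would have skipped it. A slot in a context bucket below the minimum never triggers a periodic reprobe, however many cycles have elapsed.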
