Commit 20e1ff9

Stabilize profit controller depth policy
1 parent 10acb48 commit 20e1ff9

9 files changed

Lines changed: 94 additions & 19 deletions

CHANGELOG.md

Lines changed: 3 additions & 2 deletions
@@ -3,8 +3,9 @@
 ## v0.1.2
 
 - Fixed the adaptive `profit` controller's no-spec baseline path. Profit mode now seeds baseline samples before positive-depth warmup, can shut DFlash fully off when the measured baseline wins, and no longer makes speculative decisions from draft-only telemetry.
-- Added periodic profit-controller baseline reprobes with `--spec-dm-profit-baseline-interval` / `LLAMA_ARG_SPEC_DM_PROFIT_BASELINE_INTERVAL` so long-context runs can refresh target-only timing as context grows. Off-state probes now restart with the configured probe depth instead of jumping straight to full draft depth.
-- Made profit depth selection less coarse by scoring every integer draft depth up to the supported telemetry window, preserving the previous active depth across baseline reprobes, and avoiding off-probe counter starvation from repeated baseline cycles.
+- Fixed a profit-controller bucket-transition deadlock where telemetry reset could clear the no-spec baseline while preserving a positive active draft depth, causing all later cycles to run as unrecorded single-token baseline and leaving DFlash permanently disabled.
+- Added periodic profit-controller baseline reprobes with `--spec-dm-profit-baseline-interval` / `LLAMA_ARG_SPEC_DM_PROFIT_BASELINE_INTERVAL` so long-context runs can refresh target-only timing as context grows. The default interval is 512 active speculative cycles, and periodic reprobes start only in longer context buckets; bucket transitions still seed a fresh baseline. Off-state probes now restart with the configured probe depth instead of jumping straight to full draft depth.
+- Stabilized profit depth selection on the production ladder (`0`-`8`, `10`, `12`, `14`, `16`, and the configured max) while preserving the previous active depth across baseline reprobes and avoiding off-probe counter starvation from repeated baseline cycles.
 - Hardened active-reasoning EOS handling. When an end-of-generation token appears while reasoning output is still active, the sampler now forces the reasoning-end sequence through the normal full-logits path; reduced DFlash verification rejects that case instead of accepting an unsafe reduced candidate set.
 - Hardened DFlash on split CUDA / multi-GPU placement. GPU cross-ring setup, hidden capture, CUDA graph capture, K/V projection cache updates, recurrent replay, conv replay, and async tensor get/set paths now check buffer/backend ownership and fall back to safer CPU or owning-buffer paths instead of reading or writing recurrent state through the wrong CUDA backend.
 - Added clearer diagnostics and regression coverage for multi-GPU DFlash fallback decisions, CUDA graph buffer visibility, wrong-device async tensor access, active-reasoning reduced-sampling rejection, adaptive DM defaults, and profit-controller baseline behavior.
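
The "production ladder" named in the changelog entry above can be sketched as a small standalone function. This is an illustration only: `build_depth_ladder` is a hypothetical name, it returns a `std::vector` rather than filling a caller-provided array as the commit's `server_adaptive_dm_build_candidates` does, and the handling of negative input is an assumption.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper (not part of the commit): returns the candidate draft
// depths for a configured maximum, following the ladder in the changelog:
// every depth 0..8, then 10, 12, 14, 16, plus the configured max itself.
std::vector<int> build_depth_ladder(int base_n_max) {
    std::vector<int> out;
    if (base_n_max < 0) {
        return out; // assumption: negative maxima yield no candidates
    }
    const int ladder[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, base_n_max};
    for (const int candidate : ladder) {
        if (candidate > base_n_max) {
            continue; // never offer a depth above the configured max
        }
        // Skip duplicates, e.g. when base_n_max is already a ladder rung.
        if (std::find(out.begin(), out.end(), candidate) == out.end()) {
            out.push_back(candidate);
        }
    }
    // base_n_max is appended last, so sort to keep the list ascending.
    std::sort(out.begin(), out.end());
    return out;
}
```

With a configured max of 16 this yields 13 candidates instead of the previous 17, which matches the updated `assert_candidates(16, ...)` expectation in `tests/test-adaptive-dm.cpp`.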

common/arg.cpp

Lines changed: 1 addition & 1 deletion
@@ -3843,7 +3843,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         ).set_spec().set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_SPEC_DM_PROFIT_WARMUP"));
     add_opt(common_arg(
         {"--spec-dm-profit-baseline-interval"}, "N",
-        string_format("active profit-controller cycles between no-spec baseline reprobes (default: %d, 0 = disabled)", params.speculative.dm_profit_baseline_interval),
+        string_format("active profit-controller cycles between long-context no-spec baseline reprobes (default: %d, 0 = disabled)", params.speculative.dm_profit_baseline_interval),
         [](common_params & params, int value) {
             if (value < 0 || value > 4096) {
                 throw std::invalid_argument("spec-dm-profit-baseline-interval must be in [0, 4096]");

common/common.h

Lines changed: 1 addition & 1 deletion
@@ -402,7 +402,7 @@ struct common_params_speculative {
     float dm_profit_ewma_alpha = 0.15f;
     int32_t dm_profit_min_samples = 3;
     int32_t dm_profit_warmup = 0; // positive-depth warmup cycles after baseline seeding (0 = auto from min_samples)
-    int32_t dm_profit_baseline_interval = 128; // active spec cycles between no-spec baseline reprobes (0 = disabled)
+    int32_t dm_profit_baseline_interval = 512; // active spec cycles between long-context no-spec baseline reprobes (0 = disabled)
 
     // DFlash draft model (separate from upstream's draft.model)
     struct common_params_model mparams_dft;

docs/beellama-args.md

Lines changed: 1 addition & 1 deletion
@@ -310,7 +310,7 @@ Adaptive Draft-Max is enabled by default for DFlash. It can reduce the active dr
 | `--spec-dm-profit-ewma-alpha F` | `0.15` | Smoothing factor for acceptance and timing running averages. |
 | `--spec-dm-profit-min-samples N` | `3` | Minimum observations per position/depth before scoring that depth as ready. |
 | `--spec-dm-profit-warmup N` | `0` | Positive-depth warmup cycles after the no-spec baseline is seeded (0 = use --spec-dm-profit-min-samples). |
-| `--spec-dm-profit-baseline-interval N` | `128` | Active speculative cycles between no-spec baseline reprobes (0 = disabled). |
+| `--spec-dm-profit-baseline-interval N` | `512` | Active speculative cycles between long-context no-spec baseline reprobes (0 = disabled). |
 
 Use `profit` for normal serving. Use `fringe` when you want behavior tied more directly to observed draft acceptance near the active tail. Use `--no-spec-dm-adaptive` only when comparing fixed `--spec-draft-n-max` values or reproducing a narrow benchmark.
316316

docs/beellama-features.md

Lines changed: 1 addition & 1 deletion
@@ -128,7 +128,7 @@ This is not the same as public buun's checked DFlash adaptive tracking. Bee adds
 --spec-dm-profit-ewma-alpha 0.15
 --spec-dm-profit-min-samples 3
 --spec-dm-profit-warmup 0
---spec-dm-profit-baseline-interval 128
+--spec-dm-profit-baseline-interval 512
 ```
 
 Use `--no-spec-dm-adaptive` when you need a fixed-depth benchmark. Otherwise, adaptive mode is the safer default for live serving because it can back away from weak drafts without changing the process command line.

tests/test-adaptive-dm.cpp

Lines changed: 61 additions & 1 deletion
@@ -42,7 +42,7 @@ int main() {
     assert_candidates(1, {0, 1});
     assert_candidates(2, {0, 1, 2});
     assert_candidates(8, {0, 1, 2, 3, 4, 5, 6, 7, 8});
-    assert_candidates(16, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16});
+    assert_candidates(16, {0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16});
 
     assert(server_adaptive_dm_next_explore_depth(0, 8, 0.25f) == 2);
     assert(server_adaptive_dm_next_explore_depth(2, 8, 0.25f) == 3);
@@ -99,6 +99,10 @@ int main() {
         assert(server_adaptive_dm_apply_profit_hysteresis(2, 4, 30.0f, 30.9f, 0.06f, 0.02f, cand, nc8) == 2);
         assert(server_adaptive_dm_apply_profit_hysteresis(2, 8, 30.0f, 33.0f, 0.06f, 0.02f, cand, nc8) == 4);
         assert(server_adaptive_dm_apply_profit_hysteresis(6, 2, 30.0f, 31.5f, 0.06f, 0.02f, cand, nc8) == 5);
+
+        int cand16[160];
+        const int nc16 = server_adaptive_dm_build_candidates(16, cand16, 160);
+        assert(server_adaptive_dm_apply_profit_hysteresis(16, 8, 30.0f, 36.0f, 0.06f, 0.02f, cand16, nc16) == 14);
     }
 
     assert(server_adaptive_dm_should_preserve_for_continuation(0.995f, 1.000f));
@@ -271,12 +275,40 @@ int main() {
     state.reset_profit_if_config_changed(spec, 8, 0);
     assert(state.profit_has_key);
     assert(state.profit_depth[2].samples == 0);
+    assert(state.adaptive_n_max == -1);
     state.observe_profit_timing(2, 10.0f, 20.0f, 5.0f, 35.0f);
     state.reset_profit_if_config_changed(spec, 8, 0);
     assert(state.profit_depth[2].samples == 1);
+    state.adaptive_n_max = 8;
     spec.dflash_cross_ctx = 2048;
     state.reset_profit_if_config_changed(spec, 8, 0);
     assert(state.profit_depth[2].samples == 0);
+    assert(state.adaptive_n_max == -1);
+
+    // test a bucket transition cannot leave baseline collection invisible
+    {
+        server_adaptive_dm_state bucket;
+        common_params_speculative bucket_spec;
+        bucket_spec.n_max = 16;
+        bucket_spec.branch_budget = 0;
+        bucket_spec.draft_topk = 1;
+        bucket_spec.dflash_cross_ctx = 1024;
+        bucket_spec.sample_temp = 0.0f;
+        bucket_spec.p_min = 0.0f;
+
+        bucket.reset_profit_if_config_changed(bucket_spec, 16, 1000);
+        bucket.observe_profit_timing(0, 0.0f, 30.0f, 0.0f, 30.0f);
+        bucket.observe_profit_timing(0, 0.0f, 31.0f, 0.0f, 31.0f);
+        bucket.observe_profit_timing(0, 0.0f, 32.0f, 0.0f, 32.0f);
+        bucket.apply_profit_recommendation(14);
+        assert(bucket.profit_baseline_ready());
+        assert(!bucket.profit_expects_baseline_sample());
+
+        bucket.reset_profit_if_config_changed(bucket_spec, 16, 9000);
+        assert(bucket.adaptive_n_max == -1);
+        assert(!bucket.profit_baseline_ready());
+        assert(bucket.profit_expects_baseline_sample());
+    }
 
     // test cross-depth estimation
     {
@@ -319,6 +351,14 @@ int main() {
         server_adaptive_dm_state reprobe;
         reprobe.dm_profit_min_samples = 1;
         reprobe.dm_profit_baseline_interval = 3;
+        common_params_speculative reprobe_spec;
+        reprobe_spec.n_max = 8;
+        reprobe_spec.branch_budget = 0;
+        reprobe_spec.draft_topk = 1;
+        reprobe_spec.dflash_cross_ctx = 1024;
+        reprobe_spec.sample_temp = 0.0f;
+        reprobe_spec.p_min = 0.0f;
+        reprobe.reset_profit_if_config_changed(reprobe_spec, 8, 40000);
         reprobe.adaptive_n_max = 8;
         reprobe.observe_profit_timing(0, 0.0f, 40.0f, 0.0f, 40.0f);
         reprobe.observe_profit_acceptance(8, 7);
@@ -335,5 +375,25 @@ int main() {
         assert(reprobe.decide_profit_n_max(8) == 8);
     }
 
+    // test periodic baseline reprobes wait until longer context buckets
+    {
+        server_adaptive_dm_state early;
+        early.dm_profit_min_samples = 1;
+        early.dm_profit_baseline_interval = 1;
+        common_params_speculative early_spec;
+        early_spec.n_max = 8;
+        early_spec.branch_budget = 0;
+        early_spec.draft_topk = 1;
+        early_spec.dflash_cross_ctx = 1024;
+        early_spec.sample_temp = 0.0f;
+        early_spec.p_min = 0.0f;
+        early.reset_profit_if_config_changed(early_spec, 8, 4096);
+        early.adaptive_n_max = 8;
+        early.observe_profit_timing(0, 0.0f, 40.0f, 0.0f, 40.0f);
+        early.observe_profit_acceptance(8, 7);
+        early.observe_profit_timing(8, 8.0f, 30.0f, 2.0f, 40.0f);
+        assert(!early.profit_should_probe_baseline());
+    }
+
     return 0;
 }

tests/test-arg-parser.cpp

Lines changed: 1 addition & 1 deletion
@@ -145,7 +145,7 @@ int main(void) {
     assert(params.speculative.dm_controller == COMMON_SPECULATIVE_DM_CONTROLLER_PROFIT);
     assert(params.speculative.dm_profit_min_samples == 3);
     assert(params.speculative.dm_profit_warmup == 0);
-    assert(params.speculative.dm_profit_baseline_interval == 128);
+    assert(params.speculative.dm_profit_baseline_interval == 512);
 
     argv = {"binary_name", "--spec-draft-p-min", "0"};
     assert(true == common_params_parse(argv.size(), list_str_to_char(argv).data(), params, LLAMA_EXAMPLE_SERVER));

tools/server/server-adaptive-dm.h

Lines changed: 24 additions & 10 deletions
@@ -11,6 +11,7 @@
 static constexpr int SERVER_ADAPTIVE_DM_PROFIT_POSITIONS = 128;
 static constexpr int SERVER_ADAPTIVE_DM_PROFIT_DEPTHS = SERVER_ADAPTIVE_DM_PROFIT_POSITIONS + 1;
 static constexpr int SERVER_ADAPTIVE_DM_PROFIT_CANDIDATES = SERVER_ADAPTIVE_DM_PROFIT_DEPTHS + 1;
+static constexpr int SERVER_ADAPTIVE_DM_PROFIT_BASELINE_REPROBE_MIN_BUCKET = 3;
 
 static inline int server_adaptive_dm_probe_n_max(int base_n_max, float probe_fraction) {
     if (base_n_max <= 0) {
@@ -75,24 +76,26 @@ static inline int server_adaptive_dm_build_candidates(int base_n_max, int * out,
     }
 
     int n = 0;
-
-    const int max_depth = std::clamp(base_n_max, 0, SERVER_ADAPTIVE_DM_PROFIT_DEPTHS - 1);
-    for (int candidate = 0; candidate <= max_depth && n < out_cap; ++candidate) {
-        out[n++] = candidate;
-    }
-
-    if (base_n_max > max_depth && n < out_cap) {
+    const int ladder[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, base_n_max};
+    for (const int candidate : ladder) {
+        if (candidate < 0 || candidate > base_n_max) {
+            continue;
+        }
         bool exists = false;
         for (int i = 0; i < n; ++i) {
-            if (out[i] == base_n_max) {
+            if (out[i] == candidate) {
                 exists = true;
                 break;
             }
         }
         if (!exists) {
-            out[n++] = base_n_max;
+            out[n++] = candidate;
+            if (n >= out_cap) {
+                break;
+            }
         }
     }
+    std::sort(out, out + n);
     return n;
 }
 
@@ -275,7 +278,7 @@ struct server_adaptive_dm_state {
     float dm_profit_ewma_alpha = 0.15f;
     int32_t dm_profit_min_samples = 3;
     int32_t dm_profit_warmup = 0;
-    int32_t dm_profit_baseline_interval = 128;
+    int32_t dm_profit_baseline_interval = 512;
 
     struct profit_depth_stats {
         int32_t samples = 0;
@@ -371,6 +374,10 @@ struct server_adaptive_dm_state {
         profit_has_key = false;
         profit_epoch++;
         reset_request_profit_state();
+        adaptive_n_max = -1;
+        adaptive_probe_counter = 0;
+        explore_counter = 0;
+        off_dwell = 0;
     }
 
     void reset_profit_if_config_changed(const common_params_speculative & spec, int base_n_max, int32_t n_past) {
@@ -420,9 +427,16 @@ struct server_adaptive_dm_state {
             profit_baseline_ready() &&
             !profit_baseline_probe_pending &&
             adaptive_n_max > 0 &&
+            profit_key.context_bucket >= SERVER_ADAPTIVE_DM_PROFIT_BASELINE_REPROBE_MIN_BUCKET &&
             profit_cycles_since_baseline >= dm_profit_baseline_interval;
     }
 
+    bool profit_expects_baseline_sample() const {
+        return !profit_baseline_ready() ||
+            profit_baseline_probe_pending ||
+            adaptive_n_max <= 0;
+    }
+
     void profit_mark_baseline_probe() {
         profit_baseline_probe_pending = true;
         profit_baseline_probe_resume_n = adaptive_n_max;
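
The periodic reprobe gate changed in the hunk above can be read as a single boolean over a few state fields. The sketch below is a simplified model, not the commit's code: `ReprobeGate` and its field names are hypothetical, baseline readiness is reduced to a plain flag, and the `baseline_interval > 0` check is an assumption derived from the documented "0 = disabled" behaviour because the start of the real condition lies outside the hunk.

```cpp
#include <cstdint>

// Mirrors SERVER_ADAPTIVE_DM_PROFIT_BASELINE_REPROBE_MIN_BUCKET from the diff.
constexpr int kReprobeMinBucket = 3;

// Hypothetical, flattened stand-in for the relevant controller state.
struct ReprobeGate {
    bool    baseline_ready        = false; // stands in for profit_baseline_ready()
    bool    probe_pending         = false; // a baseline probe is already queued
    int     adaptive_n_max        = -1;    // active draft depth (<= 0 means not speculating)
    int     context_bucket        = 0;     // stands in for profit_key.context_bucket
    int32_t cycles_since_baseline = 0;
    int32_t baseline_interval     = 512;   // new default from this commit

    // True only when a periodic no-spec baseline reprobe is due.
    bool should_probe_baseline() const {
        return baseline_interval > 0 &&                // assumption: 0 disables reprobes
               baseline_ready &&
               !probe_pending &&
               adaptive_n_max > 0 &&
               context_bucket >= kReprobeMinBucket &&  // condition added in this commit
               cycles_since_baseline >= baseline_interval;
    }
};
```

Under this model a short-context slot never pays for periodic reprobes no matter how many cycles have elapsed, which is what the new `early` test case in `tests/test-adaptive-dm.cpp` asserts.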

tools/server/server-context.cpp

Lines changed: 1 addition & 1 deletion
@@ -3541,7 +3541,7 @@ struct server_context_impl {
             } else {
                 if (slot.can_speculate() && slot.dm_adaptive &&
                     server_adaptive_dm_uses_profit_controller(slot.dm_controller) &&
-                    slot.adaptive_n_max == 0) {
+                    slot.profit_expects_baseline_sample()) {
                     profit_baseline_slot = &slot;
                     n_profit_baseline_slots++;
                 }
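
To see why replacing `adaptive_n_max == 0` with `profit_expects_baseline_sample()` matters, the two checks can be compared on the state the changelog describes: baseline cleared while a positive draft depth survives. The sketch below is illustrative only; `SlotView` and both function names are hypothetical, and the other conditions of the real `if` (such as `slot.can_speculate()`) are omitted.

```cpp
// Hypothetical, reduced view of the fields the two checks read.
struct SlotView {
    bool baseline_ready; // stands in for profit_baseline_ready()
    bool probe_pending;  // stands in for profit_baseline_probe_pending
    int  adaptive_n_max; // active draft depth
};

// Old check: a slot counts as a baseline slot only when depth is exactly 0.
bool old_is_baseline_slot(const SlotView & s) {
    return s.adaptive_n_max == 0;
}

// New check: same logic as profit_expects_baseline_sample() in the diff.
bool new_is_baseline_slot(const SlotView & s) {
    return !s.baseline_ready || s.probe_pending || s.adaptive_n_max <= 0;
}
```

For `SlotView{false, false, 8}` the old check returns false, so the baseline cycles would go unrecorded, while the new check returns true. The same holds for the post-reset value `adaptive_n_max == -1`, which the old equality test also missed.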
