Commit 3109a0b

Refine profit baseline reprobes
1 parent 20e1ff9 commit 3109a0b

8 files changed

Lines changed: 9 additions & 11 deletions

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
 
 - Fixed the adaptive `profit` controller's no-spec baseline path. Profit mode now seeds baseline samples before positive-depth warmup, can shut DFlash fully off when the measured baseline wins, and no longer makes speculative decisions from draft-only telemetry.
 - Fixed a profit-controller bucket-transition deadlock where telemetry reset could clear the no-spec baseline while preserving a positive active draft depth, causing all later cycles to run as unrecorded single-token baseline and leaving DFlash permanently disabled.
-- Added periodic profit-controller baseline reprobes with `--spec-dm-profit-baseline-interval` / `LLAMA_ARG_SPEC_DM_PROFIT_BASELINE_INTERVAL` so long-context runs can refresh target-only timing as context grows. The default interval is 512 active speculative cycles, and periodic reprobes start only in longer context buckets; bucket transitions still seed a fresh baseline. Off-state probes now restart with the configured probe depth instead of jumping straight to full draft depth.
+- Added low-frequency profit-controller baseline reprobes with `--spec-dm-profit-baseline-interval` / `LLAMA_ARG_SPEC_DM_PROFIT_BASELINE_INTERVAL` so runs can refresh target-only timing as context grows. The default interval is 1024 active speculative cycles to keep probe overhead minimal; bucket transitions still seed a fresh baseline. Off-state probes now restart with the configured probe depth instead of jumping straight to full draft depth.
 - Stabilized profit depth selection on the production ladder (`0`-`8`, `10`, `12`, `14`, `16`, and the configured max) while preserving the previous active depth across baseline reprobes and avoiding off-probe counter starvation from repeated baseline cycles.
 - Hardened active-reasoning EOS handling. When an end-of-generation token appears while reasoning output is still active, the sampler now forces the reasoning-end sequence through the normal full-logits path; reduced DFlash verification rejects that case instead of accepting an unsafe reduced candidate set.
 - Hardened DFlash on split CUDA / multi-GPU placement. GPU cross-ring setup, hidden capture, CUDA graph capture, K/V projection cache updates, recurrent replay, conv replay, and async tensor get/set paths now check buffer/backend ownership and fall back to safer CPU or owning-buffer paths instead of reading or writing recurrent state through the wrong CUDA backend.
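
Reviewer note: to put the interval change from 512 to 1024 in perspective, here is a minimal sketch (not code from this commit) that counts how many periodic baseline reprobes would fire over a run of active speculative cycles. The helper names `count_baseline_reprobes` and `check_reprobe_counts` are hypothetical, and the sketch assumes the cycle counter resets to zero after each reprobe and that nothing else (such as a bucket transition) reseeds the baseline in between.

```cpp
#include <cassert>

// Hypothetical model: one counter tracks active speculative cycles since the
// last baseline sample. When it reaches `interval`, a single no-spec baseline
// cycle is assumed to run and the counter is assumed to reset.
inline int count_baseline_reprobes(int active_cycles, int interval) {
    if (interval <= 0) {
        return 0; // 0 = disabled, as documented for the flag
    }
    int probes = 0;
    int since_baseline = 0;
    for (int i = 0; i < active_cycles; ++i) {
        if (since_baseline >= interval) {
            ++probes;           // a baseline reprobe fires before this cycle
            since_baseline = 0; // assumed reset after the reprobe
        }
        ++since_baseline;       // this active speculative cycle is counted
    }
    return probes;
}

// Usage: compare the old and new defaults over the same run length.
inline void check_reprobe_counts() {
    assert(count_baseline_reprobes(4096, 512)  == 7); // old default
    assert(count_baseline_reprobes(4096, 1024) == 3); // new default
    assert(count_baseline_reprobes(4096, 0)    == 0); // disabled
}
```

Under these assumptions the new default roughly halves the number of target-only probe cycles for a given run length, which matches the changelog's stated goal of keeping probe overhead minimal.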

common/arg.cpp

Lines changed: 1 addition & 1 deletion
@@ -3843,7 +3843,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
     ).set_spec().set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_SPEC_DM_PROFIT_WARMUP"));
     add_opt(common_arg(
         {"--spec-dm-profit-baseline-interval"}, "N",
-        string_format("active profit-controller cycles between long-context no-spec baseline reprobes (default: %d, 0 = disabled)", params.speculative.dm_profit_baseline_interval),
+        string_format("active profit-controller cycles between no-spec baseline reprobes (default: %d, 0 = disabled)", params.speculative.dm_profit_baseline_interval),
         [](common_params & params, int value) {
             if (value < 0 || value > 4096) {
                 throw std::invalid_argument("spec-dm-profit-baseline-interval must be in [0, 4096]");
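
Reviewer note: the hunk is cut off inside the option's lambda, so the assignment that follows the range check is not visible. As a rough sketch of the accepted range only, the check can be isolated into a free function. `validate_baseline_interval` and `check_baseline_interval_range` are hypothetical names used for illustration; the bounds and the error message are the ones shown in the hunk.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-alone version of the range check shown in the hunk.
// Returns the value unchanged when it lies in [0, 4096]; 0 means "disabled".
inline int validate_baseline_interval(int value) {
    if (value < 0 || value > 4096) {
        throw std::invalid_argument("spec-dm-profit-baseline-interval must be in [0, 4096]");
    }
    return value;
}

// Usage: the new default (1024) and both bounds pass; out-of-range values throw.
inline void check_baseline_interval_range() {
    assert(validate_baseline_interval(0)    == 0);
    assert(validate_baseline_interval(1024) == 1024);
    assert(validate_baseline_interval(4096) == 4096);
    bool threw = false;
    try {
        validate_baseline_interval(4097);
    } catch (const std::invalid_argument &) {
        threw = true;
    }
    assert(threw);
}
```

The new default of 1024 sits well inside the accepted range, so the validation bounds themselves did not need to change in this commit.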

common/common.h

Lines changed: 1 addition & 1 deletion
@@ -402,7 +402,7 @@ struct common_params_speculative {
     float dm_profit_ewma_alpha = 0.15f;
     int32_t dm_profit_min_samples = 3;
     int32_t dm_profit_warmup = 0; // positive-depth warmup cycles after baseline seeding (0 = auto from min_samples)
-    int32_t dm_profit_baseline_interval = 512; // active spec cycles between long-context no-spec baseline reprobes (0 = disabled)
+    int32_t dm_profit_baseline_interval = 1024; // active spec cycles between no-spec baseline reprobes (0 = disabled)
 
     // DFlash draft model (separate from upstream's draft.model)
     struct common_params_model mparams_dft;

docs/beellama-args.md

Lines changed: 1 addition & 1 deletion
@@ -310,7 +310,7 @@ Adaptive Draft-Max is enabled by default for DFlash. It can reduce the active dr
 | `--spec-dm-profit-ewma-alpha F` | `0.15` | Smoothing factor for acceptance and timing running averages. |
 | `--spec-dm-profit-min-samples N` | `3` | Minimum observations per position/depth before scoring that depth as ready. |
 | `--spec-dm-profit-warmup N` | `0` | Positive-depth warmup cycles after the no-spec baseline is seeded (0 = use --spec-dm-profit-min-samples). |
-| `--spec-dm-profit-baseline-interval N` | `512` | Active speculative cycles between long-context no-spec baseline reprobes (0 = disabled). |
+| `--spec-dm-profit-baseline-interval N` | `1024` | Active speculative cycles between no-spec baseline reprobes (0 = disabled). |
 
 Use `profit` for normal serving. Use `fringe` when you want behavior tied more directly to observed draft acceptance near the active tail. Use `--no-spec-dm-adaptive` only when comparing fixed `--spec-draft-n-max` values or reproducing a narrow benchmark.

docs/beellama-features.md

Lines changed: 1 addition & 1 deletion
@@ -128,7 +128,7 @@ This is not the same as public buun's checked DFlash adaptive tracking. Bee adds
 --spec-dm-profit-ewma-alpha 0.15
 --spec-dm-profit-min-samples 3
 --spec-dm-profit-warmup 0
---spec-dm-profit-baseline-interval 512
+--spec-dm-profit-baseline-interval 1024
 ```
 
 Use `--no-spec-dm-adaptive` when you need a fixed-depth benchmark. Otherwise, adaptive mode is the safer default for live serving because it can back away from weak drafts without changing the process command line.

tests/test-adaptive-dm.cpp

Lines changed: 2 additions & 2 deletions
@@ -375,7 +375,7 @@ int main() {
         assert(reprobe.decide_profit_n_max(8) == 8);
     }
 
-    // test periodic baseline reprobes wait until longer context buckets
+    // test periodic baseline reprobes are interval-based, not context-bucket gated
     {
         server_adaptive_dm_state early;
         early.dm_profit_min_samples = 1;
@@ -392,7 +392,7 @@ int main() {
         early.observe_profit_timing(0, 0.0f, 40.0f, 0.0f, 40.0f);
         early.observe_profit_acceptance(8, 7);
         early.observe_profit_timing(8, 8.0f, 30.0f, 2.0f, 40.0f);
-        assert(!early.profit_should_probe_baseline());
+        assert(early.profit_should_probe_baseline());
     }
 
     return 0;

tests/test-arg-parser.cpp

Lines changed: 1 addition & 1 deletion
@@ -145,7 +145,7 @@ int main(void) {
     assert(params.speculative.dm_controller == COMMON_SPECULATIVE_DM_CONTROLLER_PROFIT);
     assert(params.speculative.dm_profit_min_samples == 3);
     assert(params.speculative.dm_profit_warmup == 0);
-    assert(params.speculative.dm_profit_baseline_interval == 512);
+    assert(params.speculative.dm_profit_baseline_interval == 1024);
 
     argv = {"binary_name", "--spec-draft-p-min", "0"};
     assert(true == common_params_parse(argv.size(), list_str_to_char(argv).data(), params, LLAMA_EXAMPLE_SERVER));

tools/server/server-adaptive-dm.h

Lines changed: 1 addition & 3 deletions
@@ -11,7 +11,6 @@
 static constexpr int SERVER_ADAPTIVE_DM_PROFIT_POSITIONS = 128;
 static constexpr int SERVER_ADAPTIVE_DM_PROFIT_DEPTHS = SERVER_ADAPTIVE_DM_PROFIT_POSITIONS + 1;
 static constexpr int SERVER_ADAPTIVE_DM_PROFIT_CANDIDATES = SERVER_ADAPTIVE_DM_PROFIT_DEPTHS + 1;
-static constexpr int SERVER_ADAPTIVE_DM_PROFIT_BASELINE_REPROBE_MIN_BUCKET = 3;
 
 static inline int server_adaptive_dm_probe_n_max(int base_n_max, float probe_fraction) {
     if (base_n_max <= 0) {
@@ -278,7 +277,7 @@ struct server_adaptive_dm_state {
     float dm_profit_ewma_alpha = 0.15f;
     int32_t dm_profit_min_samples = 3;
     int32_t dm_profit_warmup = 0;
-    int32_t dm_profit_baseline_interval = 512;
+    int32_t dm_profit_baseline_interval = 1024;
 
     struct profit_depth_stats {
         int32_t samples = 0;
@@ -427,7 +426,6 @@ struct server_adaptive_dm_state {
             profit_baseline_ready() &&
             !profit_baseline_probe_pending &&
             adaptive_n_max > 0 &&
-            profit_key.context_bucket >= SERVER_ADAPTIVE_DM_PROFIT_BASELINE_REPROBE_MIN_BUCKET &&
             profit_cycles_since_baseline >= dm_profit_baseline_interval;
     }
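
Reviewer note: the last hunk removes the context-bucket gate from the reprobe predicate. The sketch below contrasts the two forms on a trimmed-down stand-in. `profit_reprobe_sketch` and its members are hypothetical simplifications of `server_adaptive_dm_state`, and the `interval > 0` guard is an assumption about how "0 = disabled" is enforced, since the start of the real predicate lies outside the hunk.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the fields the predicate reads.
struct profit_reprobe_sketch {
    int32_t dm_profit_baseline_interval = 1024; // default after this commit
    bool    baseline_ready              = false; // stands in for profit_baseline_ready()
    bool    baseline_probe_pending      = false;
    int32_t adaptive_n_max              = 0;
    int32_t context_bucket              = 0;
    int32_t cycles_since_baseline       = 0;

    // After this commit: gated only by the interval counter.
    bool should_probe_baseline() const {
        return dm_profit_baseline_interval > 0 && // assumed "0 = disabled" guard
               baseline_ready &&
               !baseline_probe_pending &&
               adaptive_n_max > 0 &&
               cycles_since_baseline >= dm_profit_baseline_interval;
    }

    // Before this commit: additionally required a minimum context bucket
    // (SERVER_ADAPTIVE_DM_PROFIT_BASELINE_REPROBE_MIN_BUCKET was 3).
    bool should_probe_baseline_bucket_gated(int32_t min_bucket = 3) const {
        return should_probe_baseline() && context_bucket >= min_bucket;
    }
};

// Usage: a short-context state that has run a full interval of active cycles.
inline void check_reprobe_predicate() {
    profit_reprobe_sketch s;
    s.baseline_ready        = true;
    s.adaptive_n_max        = 8;
    s.context_bucket        = 0;    // early bucket
    s.cycles_since_baseline = 1024;
    assert(s.should_probe_baseline());               // new behavior: reprobe
    assert(!s.should_probe_baseline_bucket_gated()); // old behavior: suppressed
}
```

This mirrors the flipped assertion in `tests/test-adaptive-dm.cpp`, where the `early` state is now expected to request a baseline reprobe.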
