Skip to content

Commit 30bb24e

Browse files
gHashTagona-agent
andcommitted
feat(serve): implement continuous batching scheduler (OPT-B01)
- Add ContinuousBatchingScheduler with iteration-level scheduling - Add Request with priority and status tracking - Implement dynamic batch formation (add/remove sequences) - Add SchedulerStats for throughput monitoring - Expected improvement: 2-3x throughput under high load Co-authored-by: Ona <no-reply@ona.com>
1 parent 49123d7 commit 30bb24e

3 files changed

Lines changed: 552 additions & 0 deletions

File tree

docs/DISCOVERIES.md

Lines changed: 69 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -82,6 +82,7 @@ Where:
8282
| OPT-M01 | Memory-Mapped Loading | N/A | 30x load | ✅ Implemented |
8383
| OPT-C01 | KV Cache Compression | 5-16x | 1x | ✅ Implemented |
8484
| OPT-S01 | Speculative Decoding | N/A | 2-3x gen | ✅ Implemented |
85+
| OPT-B01 | Continuous Batching | N/A | 2-3x thru | ✅ Implemented |
8586

8687
### Business Value
8788

@@ -569,6 +570,74 @@ std.debug.print("Generated {d} tokens, acceptance rate: {d:.1}%\n",
569570
.{result.tokens.len, result.acceptance_rate * 100});
570571
```
571572

573+
### Continuous Batching (OPT-B01)
574+
575+
**Status**: ✅ Implemented
576+
577+
| Component | File | Description |
578+
|-----------|------|-------------|
579+
| Request | `tri_inference.zig` | Inference request with priority |
580+
| ContinuousBatchingScheduler | `tri_inference.zig` | Main scheduler |
581+
| SchedulerConfig | `tri_inference.zig` | Configuration |
582+
| SchedulerStats | `tri_inference.zig` | Statistics |
583+
584+
**Architecture:**
585+
```
586+
┌─────────────────────────────────────────────────────────────┐
587+
│ CONTINUOUS BATCHING SCHEDULER │
588+
├─────────────────────────────────────────────────────────────┤
589+
│ │
590+
│ REQUEST QUEUE (Priority Sorted) │
591+
│ ┌─────┬─────┬─────┬─────┐ │
592+
│ │ R5 │ R3 │ R7 │ R1 │ → sorted by priority │
593+
│ └──┬──┴──┬──┴─────┴─────┘ │
594+
│ │ │ │
595+
│ ▼ ▼ │
596+
│ RUNNING BATCH (dynamic slots) │
597+
│ ┌─────┬─────┬─────┬─────┐ │
598+
│ │ S0 │ S1 │ --- │ --- │ → fill as slots free up │
599+
│ └─────┴─────┴─────┴─────┘ │
600+
│ │
601+
│ ITERATION LOOP: │
602+
│ 1. Check completions → free slots │
603+
│ 2. Fill empty slots from queue │
604+
│ 3. Process all active sequences │
605+
│ 4. Repeat │
606+
│ │
607+
└─────────────────────────────────────────────────────────────┘
608+
```
609+
610+
**Key Features:**
611+
- **Iteration-level scheduling**: New requests added immediately
612+
- **Priority queue**: Higher priority requests scheduled first
613+
- **Dynamic batch**: Slots freed as sequences complete
614+
- **Statistics tracking**: Tokens/iteration, throughput metrics
615+
616+
**Expected Throughput Improvement:**
617+
- Static batching: Wait for slowest sequence
618+
- Continuous batching: Fill slots immediately
619+
- **Improvement: 2-3x under high load**
620+
621+
**Usage:**
622+
```zig
623+
const config = SchedulerConfig.default();
624+
var scheduler = try ContinuousBatchingScheduler.init(
625+
allocator, model, batch_model, config
626+
);
627+
defer scheduler.deinit();
628+
629+
// Submit requests
630+
const id1 = try scheduler.submitRequest(&prompt1, 100, 1.0, 0);
631+
const id2 = try scheduler.submitRequest(&prompt2, 50, 1.0, 1); // higher priority
632+
633+
// Run until complete
634+
try scheduler.runUntilComplete();
635+
636+
// Get results
637+
const stats = scheduler.getStats();
638+
std.debug.print("Avg tokens/iter: {d:.1}\n", .{stats.avg_tokens_per_iter});
639+
```
640+
572641
### Batch Processing (INF-004)
573642

574643
**Status**: ✅ Implemented
Lines changed: 116 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,116 @@
1+
# continuous_batching.vibee
2+
# Continuous Batching for high-throughput LLM serving
3+
# Orca/vLLM style iteration-level scheduling
4+
5+
name: continuous_batching
6+
version: "1.0.0"
7+
language: zig
8+
module: continuous_batching
9+
10+
types:
11+
Request:
12+
description: "Inference request from client"
13+
fields:
14+
id: Int # Unique request ID
15+
prompt_tokens: List<Int> # Input token IDs
16+
max_tokens: Int # Maximum tokens to generate
17+
temperature: Float # Sampling temperature
18+
priority: Int # Request priority (higher = more urgent)
19+
created_at: Timestamp # Request creation time
20+
status: RequestStatus # Current status
21+
22+
RequestStatus:
23+
description: "Status of a request"
24+
values:
25+
- QUEUED # Waiting in queue
26+
- PREFILL # Processing prompt
27+
- GENERATING # Generating tokens
28+
- COMPLETED # Finished generation
29+
- CANCELLED # Cancelled by client
30+
31+
SchedulerConfig:
32+
description: "Configuration for continuous batching scheduler"
33+
fields:
34+
max_batch_size: Int # Maximum sequences in batch
35+
max_tokens_per_iter: Int # Token budget per iteration
36+
preemption_enabled: Bool # Allow preemption
37+
priority_decay: Float # Priority decay for waiting requests
38+
39+
BatchSlot:
40+
description: "Slot in the running batch"
41+
fields:
42+
request_id: Int # Associated request
43+
seq_idx: Int # Sequence index in batch
44+
tokens_generated: Int # Tokens generated so far
45+
is_prefill: Bool # In prefill phase
46+
47+
behaviors:
48+
- name: submit_request
49+
given: request queue, new request
50+
when: client submits inference request
51+
then: adds request to queue with priority
52+
53+
- name: schedule_iteration
54+
given: running batch, request queue, token budget
55+
when: starting new iteration
56+
then: returns batch configuration for this iteration
57+
58+
- name: process_iteration
59+
given: model, batch configuration
60+
when: running one iteration
61+
then: processes all sequences, returns generated tokens
62+
63+
- name: handle_completion
64+
given: completed sequence, request queue
65+
when: sequence finishes generation
66+
then: removes from batch, adds new request if available
67+
68+
- name: preempt_sequence
69+
given: running sequence, higher priority request
70+
when: preemption needed
71+
then: pauses sequence, saves state, schedules new request
72+
73+
# Architecture:
74+
#
75+
# ┌─────────────────────────────────────────────────────────────┐
76+
# │ CONTINUOUS BATCHING SCHEDULER │
77+
# ├─────────────────────────────────────────────────────────────┤
78+
# │ │
79+
# │ REQUEST QUEUE (Priority Heap) │
80+
# │ ┌─────┬─────┬─────┬─────┬─────┐ │
81+
# │ │ R5 │ R3 │ R7 │ R1 │ R9 │ (sorted by priority) │
82+
# │ └──┬──┴──┬──┴──┬──┴─────┴─────┘ │
83+
# │ │ │ │ │
84+
# │ ▼ ▼ ▼ │
85+
# │ RUNNING BATCH (max_batch_size slots) │
86+
# │ ┌─────┬─────┬─────┬─────┐ │
87+
# │ │ S0 │ S1 │ S2 │ S3 │ (active sequences) │
88+
# │ │ R5 │ R3 │ R7 │ --- │ (--- = empty slot) │
89+
# │ └──┬──┴──┬──┴──┬──┴─────┘ │
90+
# │ │ │ │ │
91+
# │ ▼ ▼ ▼ │
92+
# │ ┌─────────────────────────────────────────┐ │
93+
# │ │ MODEL FORWARD PASS │ │
94+
# │ │ (process all active sequences) │ │
95+
# │ └─────────────────────────────────────────┘ │
96+
# │ │
97+
# │ ITERATION LOOP: │
98+
# │ 1. Check for completed sequences → free slots │
99+
# │ 2. Fill empty slots from queue │
100+
# │ 3. Run forward pass for all active sequences │
101+
# │ 4. Sample next tokens │
102+
# │ 5. Check stopping conditions │
103+
# │ 6. Repeat │
104+
# │ │
105+
# └─────────────────────────────────────────────────────────────┘
106+
#
107+
# Throughput Improvement:
108+
# Static batching: Wait for slowest sequence
109+
# Continuous batching: Fill slots immediately
110+
#
111+
# Example (batch_size=4, requests with varying lengths):
112+
# Static: [====][====][==][======] → 6 iterations wasted
113+
# Continuous: [====][====][==][======]
114+
# [ ][ ][++][ ] → new requests fill gaps
115+
#
116+
# Throughput gain: 30-50% typical, up to 3x under high load

0 commit comments

Comments
 (0)