
Commit 26daafe

Authored by viraatc, nvzhihanj, and Copilot
feat: add concurrency based scheduler (#17)
* rename: loadgen.py -> http_sample_issuer.py
* add concurrency scheduler
* add poisson gen test
* fix
* use scipy
* cleanup
* address comments
* update LOCAL_TESTING.md
* cleanup comments
* fix CI
* address comments
* address comments
* address comments
* update tests
* address comments, update concurrency test
* address comments, update docs
* Update src/inference_endpoint/endpoint_client/http_sample_issuer.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent: b004cc5 · commit: 26daafe

19 files changed: 413 additions, 64 deletions


README.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -46,11 +46,12 @@ inference-endpoint benchmark offline \
   --model Qwen/Qwen3-8B \
   --dataset tests/datasets/dummy_1k.pkl
 
-# Run online benchmark (sustained QPS - requires --target-qps)
+# Run online benchmark (sustained QPS - requires --target-qps, --load-pattern)
 inference-endpoint benchmark online \
   --endpoint http://your-endpoint:8000 \
   --model Qwen/Qwen3-8B \
   --dataset tests/datasets/dummy_1k.pkl \
+  --load-pattern poisson \
   --target-qps 100
 
 # With explicit sample count
```
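The `--load-pattern poisson` option added above schedules requests with exponentially distributed inter-arrival gaps, which is what makes the arrival stream a Poisson process at the target rate (the commit pins scipy, presumably for its generator tests). A minimal stdlib-only sketch of the idea; the function and parameter names here are illustrative, not the project's API:

```python
import random

def poisson_arrival_times(target_qps: float, num_samples: int, seed: int = 0) -> list[float]:
    """Cumulative issue times (seconds) for a Poisson arrival process.

    Inter-arrival gaps are drawn from an exponential distribution with
    mean 1 / target_qps, so the long-run request rate approaches target_qps.
    """
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_samples):
        t += rng.expovariate(target_qps)  # mean gap = 1 / target_qps seconds
        times.append(t)
    return times

times = poisson_arrival_times(target_qps=100.0, num_samples=10_000)
print(f"average rate: {len(times) / times[-1]:.1f} req/s over {times[-1]:.1f}s")
```

With 10,000 samples at 100 QPS the empirical rate lands within a few percent of the target, which is the property a scheduler test would check.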

docs/CLI_QUICK_REFERENCE.md

Lines changed: 12 additions & 7 deletions
```diff
@@ -11,11 +11,12 @@ inference-endpoint benchmark offline \
   --model Qwen/Qwen3-8B \
   --dataset tests/datasets/dummy_1k.pkl
 
-# Online (sustained QPS - CLI mode - requires --target-qps)
+# Online (sustained QPS - CLI mode - requires --target-qps, --load-pattern)
 inference-endpoint benchmark online \
   --endpoint URL \
   --model Qwen/Qwen3-8B \
   --dataset tests/datasets/dummy_1k.pkl \
+  --load-pattern poisson \
   --target-qps 100
 
 # With detailed report generation
@@ -75,7 +76,7 @@ inference-endpoint info
 ## Benchmark Options (CLI Mode Only)
 
 - `--api-key KEY` - API authentication
-- `--target-qps N` - Target queries per second (required for online mode with poisson pattern)
+- `--target-qps N` - Target queries per second (required when --load-pattern=poisson)
 - `--duration SEC` - Test duration in seconds (default: 0 - run until dataset exhausted)
 - `--num-samples N` - Number of samples to issue (overrides dataset size and duration calculation)
 - `--streaming MODE` - Streaming control: `auto` (default), `on`, or `off`. Streaming will enable token streaming in response.
@@ -86,8 +87,8 @@ inference-endpoint info
 
 ## Online-Specific Options
 
-- `--load-pattern TYPE` - Load pattern: `poisson` (default), `concurrency`
-- `--concurrency N` - Max concurrent requests (required when using concurrency load pattern, default: -1 unlimited for other patterns)
+- `--load-pattern TYPE` - Load pattern (required): `poisson`, `concurrency`
+- `--concurrency N` - Max concurrent requests (required when --load-pattern=concurrency)
 
 ## Dataset Formats
 
@@ -134,11 +135,11 @@ inference-endpoint info
 - Sustains target QPS
 - Use with `benchmark online --target-qps N`
 
-**concurrency** - Online mode (fixed concurrency) - NOT YET IMPLEMENTED
+**concurrency** - Online mode (fixed concurrency)
 
 - Maintains N concurrent requests
 - QPS emerges from concurrency/latency
-- Will be available in future release
+- Use with `benchmark online --load-pattern concurrency --concurrency N`
 
 ## Examples
 
@@ -159,6 +160,7 @@ inference-endpoint benchmark online \
   --endpoint https://api.production.com \
   --model Qwen/Qwen3-8B \
   --dataset prod_queries.pkl \
+  --load-pattern poisson \
   --target-qps 100 \
   --num-samples 10000 \
   --workers 16 \
@@ -171,6 +173,7 @@ inference-endpoint benchmark online \
   --endpoint https://api.production.com \
   --model Qwen/Qwen3-8B \
   --dataset prod_queries.pkl \
+  --load-pattern poisson \
   --target-qps 100 \
   --duration 300 \
   --workers 16 \
@@ -274,7 +277,9 @@ endpoint_config:
 
 **Mode Requirements:**
 
-- Online mode requires `--target-qps` (poisson) or `--concurrency` (concurrency pattern)
+- Online mode requires `--load-pattern` (poisson or concurrency)
+- `--load-pattern poisson` requires `--target-qps`
+- `--load-pattern concurrency` requires `--concurrency`
 - Use `--mode both` for combined perf + accuracy runs
 - Streaming: auto (default) enables streaming responses for online, disables for offline
```
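The "QPS emerges from concurrency/latency" behavior documented for the concurrency pattern can be sketched with a fixed pool of workers, each keeping one request in flight. This is an illustrative model only, not the project's scheduler; `fake_request` stands in for a real HTTP call:

```python
import asyncio
import time

async def fake_request(latency_s: float) -> None:
    """Stand-in for an HTTP call with a fixed service time."""
    await asyncio.sleep(latency_s)

async def run_fixed_concurrency(concurrency: int, total: int, latency_s: float) -> float:
    """Issue `total` requests while keeping up to `concurrency` in flight.

    No target rate is set anywhere: the achieved QPS converges to
    concurrency / latency, which is the concurrency-pattern behavior.
    """
    queue: asyncio.Queue[int] = asyncio.Queue()
    for i in range(total):
        queue.put_nowait(i)

    async def worker() -> None:
        while True:
            try:
                queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # no work left; this worker exits
            await fake_request(latency_s)

    start = time.perf_counter()
    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return total / (time.perf_counter() - start)

qps = asyncio.run(run_fixed_concurrency(concurrency=8, total=80, latency_s=0.05))
print(f"achieved ~{qps:.0f} QPS (theoretical cap: 8 / 0.05 = 160)")
```

Raising either the concurrency or lowering the per-request latency raises the achieved QPS proportionally, with concurrency / latency as the ceiling.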

docs/LOCAL_TESTING.md

Lines changed: 13 additions & 1 deletion
````diff
@@ -113,6 +113,7 @@ inference-endpoint -v benchmark online \
   --endpoint http://localhost:8765 \
   --model Qwen/Qwen3-8B \
   --dataset tests/datasets/dummy_1k.pkl \
+  --load-pattern poisson \
   --target-qps 100 \
   --report-path online_benchmark_report
 ```
@@ -286,6 +287,7 @@ inference-endpoint benchmark online \
   --endpoint http://localhost:8765 \
   --model Qwen/Qwen3-8B \
   --dataset tests/datasets/dummy_1k.pkl \
+  --load-pattern poisson \
   --target-qps 500 \
   --report-path online_report
 
@@ -302,14 +304,24 @@ inference-endpoint benchmark offline \
   --model Qwen/Qwen3-8B \
   --dataset tests/datasets/dummy_1k.pkl \
   --streaming on
+
+# Concurrency mode (fixed concurrent requests)
+inference-endpoint benchmark online \
+  --endpoint http://localhost:8765 \
+  --model Qwen/Qwen3-8B \
+  --dataset tests/datasets/dummy_1k.pkl \
+  --load-pattern concurrency \
+  --concurrency 32
 ```
 
 ## Tips
 
 **Key Requirements:**
 
 - Model name is **required** for all benchmark and probe commands
-- Online mode requires `--target-qps` (poisson) or `--concurrency` (concurrency pattern)
+- Online mode requires `--load-pattern` to specify the scheduler type (poisson or concurrency)
+- `--load-pattern poisson` requires `--target-qps`
+- `--load-pattern concurrency` requires `--concurrency`
 - Set `HF_TOKEN` environment variable for non-public models (public models like Qwen/Qwen3-8B don't need it)
 
 **Sample Count Control:**
````

requirements/test.txt

Lines changed: 1 addition & 0 deletions
```diff
@@ -14,6 +14,7 @@ pytest-benchmark>=4.0.0
 coverage>=7.0.0
 line-profiler==5.0.0
 Pympler==1.1
+scipy==1.16.3
 
 # HTTP server and client for mock server fixture
 aiohttp>=3.8.0
```

src/inference_endpoint/cli.py

Lines changed: 19 additions & 13 deletions
```diff
@@ -97,7 +97,7 @@ def create_parser() -> argparse.ArgumentParser:
     online_parser = benchmark_subparsers.add_parser(
         "online",
         help="Online benchmark (sustained QPS)",
-        description="Online mode: Issues queries at target QPS using Poisson distribution.",
+        description="Online mode: Issues queries using specified load pattern (--load-pattern required).",
     )
     _add_shared_benchmark_args(online_parser)
     _add_online_specific_args(online_parser)
@@ -189,17 +189,7 @@ def _add_shared_benchmark_args(parser):
         "--dataset", "-d", type=Path, required=True, help="Dataset file"
     )
     parser.add_argument("--api-key", type=str, help="API key")
-    parser.add_argument(
-        "--target-qps",
-        type=float,
-        help="Target queries per second (required for online mode with poisson pattern)",
-    )
     parser.add_argument("--workers", type=int, help="HTTP workers (default: 4)")
-    parser.add_argument(
-        "--concurrency",
-        type=int,
-        help="Max concurrent requests (required when using concurrency load pattern, default: -1 unlimited for other patterns)",
-    )
     parser.add_argument(
         "--duration",
         type=int,
@@ -230,8 +220,13 @@ def _add_shared_benchmark_args(parser):
 def _add_online_specific_args(parser):
     """Add online-specific arguments.
 
+    These arguments are only available for online mode and will be rejected
+    by argparse if used with offline mode.
+
     Currently adds:
-    - load-pattern: Scheduler type (poisson, etc.)
+    - load-pattern: Scheduler type (poisson, concurrency, etc.) - REQUIRED
+    - target-qps: Target QPS for poisson pattern
+    - concurrency: Max concurrent requests for concurrency pattern
 
     Load pattern choices are dynamically derived from registered Scheduler
     implementations to maintain a single source of truth.
@@ -244,7 +239,18 @@ def _add_online_specific_args(parser):
     parser.add_argument(
         "--load-pattern",
         choices=available_patterns,
-        help=f"Load pattern (default: poisson, available: {', '.join(available_patterns)})",
+        required=True,
+        help=f"Load pattern (required, available: {', '.join(available_patterns)})",
+    )
+    parser.add_argument(
+        "--target-qps",
+        type=float,
+        help="Target queries per second (required when --load-pattern=poisson)",
+    )
+    parser.add_argument(
+        "--concurrency",
+        type=int,
+        help="Max concurrent requests (required when --load-pattern=concurrency)",
     )
 
```
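The docstring above notes that `--load-pattern` choices are derived from registered Scheduler implementations rather than a hard-coded list. A sketch of that single-source-of-truth registry pattern; the registry and class names here are hypothetical, not the project's actual symbols:

```python
import argparse

# Hypothetical registry: each scheduler registers under its pattern name,
# and argparse choices are derived from the registry's keys.
SCHEDULER_REGISTRY: dict[str, type] = {}

def register_scheduler(name: str):
    def deco(cls: type) -> type:
        SCHEDULER_REGISTRY[name] = cls
        return cls
    return deco

@register_scheduler("poisson")
class PoissonScheduler: ...

@register_scheduler("concurrency")
class ConcurrencyScheduler: ...

def add_online_args(parser: argparse.ArgumentParser) -> None:
    patterns = sorted(SCHEDULER_REGISTRY)  # choices come from the registry
    parser.add_argument(
        "--load-pattern",
        choices=patterns,
        required=True,
        help=f"Load pattern (required, available: {', '.join(patterns)})",
    )
    parser.add_argument("--target-qps", type=float)
    parser.add_argument("--concurrency", type=int)

parser = argparse.ArgumentParser()
add_online_args(parser)
args = parser.parse_args(["--load-pattern", "poisson", "--target-qps", "100"])
print(args.load_pattern, args.target_qps)
```

Registering a new scheduler class automatically extends the CLI's valid `--load-pattern` values and its help text, so the two can never drift apart.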

src/inference_endpoint/commands/benchmark.py

Lines changed: 15 additions & 10 deletions
```diff
@@ -58,7 +58,7 @@
     ZMQConfig,
 )
 from inference_endpoint.endpoint_client.http_client import HTTPEndpointClient
-from inference_endpoint.endpoint_client.loadgen import HttpClientSampleIssuer
+from inference_endpoint.endpoint_client.http_sample_issuer import HttpClientSampleIssuer
 from inference_endpoint.exceptions import (
     ExecutionError,
     InputValidationError,
@@ -248,15 +248,16 @@ def _build_config_from_cli(
         InputValidationError: If required params missing
     """
     # Determine load pattern (CLI override or mode default)
-    load_pattern_arg = getattr(args, "load_pattern", None)
-    if load_pattern_arg:
+    if load_pattern_arg := getattr(args, "load_pattern", None):
         load_pattern_type = LoadPatternType(load_pattern_arg)
     else:
-        load_pattern_type = (
-            LoadPatternType.MAX_THROUGHPUT
-            if benchmark_mode == "offline"
-            else LoadPatternType.POISSON
-        )
+        match benchmark_mode:
+            case "offline":
+                load_pattern_type = LoadPatternType.MAX_THROUGHPUT
+            case "online" if getattr(args, "concurrency", None):
+                load_pattern_type = LoadPatternType.CONCURRENCY
+            case "online":
+                load_pattern_type = LoadPatternType.POISSON
 
     # Build BenchmarkConfig from CLI params
     return BenchmarkConfig(
@@ -274,7 +275,8 @@ def _build_config_from_cli(
         settings=Settings(
             load_pattern=LoadPattern(
                 type=load_pattern_type,
-                target_qps=args.target_qps if args.target_qps else None,
+                target_qps=getattr(args, "target_qps", None),
+                target_concurrency=getattr(args, "concurrency", None),
             ),
             runtime=RuntimeConfig(
                 min_duration_ms=args.duration * 1000
@@ -289,7 +291,7 @@ def _build_config_from_cli(
             ),
             client=ClientSettings(
                 workers=args.workers if args.workers else 4,
-                max_concurrency=args.concurrency if args.concurrency else -1,
+                max_concurrency=-1,  # client uses unlimited concurrency by default
             ),
         ),
         model_params=ModelParams(
@@ -540,6 +542,9 @@ def _run_benchmark(
     max_concurrency = config.settings.client.max_concurrency
 
     logger.info(f"Connecting: {endpoint}")
+    logger.info(
+        f"Client config: workers={num_workers}, max_concurrency={max_concurrency if max_concurrency > 0 else 'unlimited'}"
+    )
 
     tmp_dir = tempfile.mkdtemp(prefix="inference_endpoint_")
```

src/inference_endpoint/config/rulesets/mlcommons/rules.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -207,6 +207,7 @@ def apply_user_config(
             min_sample_count=min_sample_count,
             rng_sched=random.Random(self.scheduler_rng_seed),
             rng_sample_index=random.Random(self.sample_index_rng_seed),
+            load_pattern=None,  # not part of user config
             optimization_priority=opt_prio,
             model=model,
             rules=ruleset,
```

src/inference_endpoint/config/runtime_settings.py

Lines changed: 5 additions & 1 deletion
```diff
@@ -37,7 +37,7 @@
 
 if TYPE_CHECKING:
     from .ruleset_base import BenchmarkSuiteRuleset
-    from .schema import BenchmarkConfig
+    from .schema import BenchmarkConfig, LoadPattern
 
 
 @dataclass(frozen=True, slots=True)
@@ -81,6 +81,9 @@ class RuntimeSettings:
     rng_sample_index: random.Random
     """Random number generator for sample indexing"""
 
+    load_pattern: LoadPattern | None
+    """Load pattern configuration"""
+
     @classmethod
     def from_config(
         cls,
@@ -155,6 +158,7 @@ def _from_config_default(
         "min_sample_count": 1,
         "rng_sched": random.Random(runtime_cfg.scheduler_random_seed),
         "rng_sample_index": random.Random(runtime_cfg.dataloader_random_seed),
+        "load_pattern": load_pattern_cfg,
     }
 
     # Apply overrides
```

src/inference_endpoint/config/schema.py

Lines changed: 10 additions & 9 deletions
```diff
@@ -35,7 +35,7 @@ class LoadPatternType(str, Enum):
 
     MAX_THROUGHPUT = "max_throughput"  # Offline: all queries at t=0
     POISSON = "poisson"  # Online: fixed QPS with Poisson distribution
-    CONCURRENCY = "concurrency"  # Online: fixed concurrent requests (TODO)
+    CONCURRENCY = "concurrency"  # Online: fixed concurrent requests
     BURST = "burst"  # Burst pattern (TODO)
     STEP = "step"  # Step pattern (TODO)
 
@@ -217,14 +217,14 @@ class LoadPattern(BaseModel):
     Different patterns use target_qps differently:
     - max_throughput: target_qps used for calculating total queries (offline, optional with default)
     - poisson: target_qps sets scheduler rate (online, required - validated)
-    - concurrency: target_qps not used, concurrency limit dominates (TODO)
+    - concurrency: issue at fixed target_concurrency (online, required - validated)
     """
 
     type: LoadPatternType = LoadPatternType.MAX_THROUGHPUT
     target_qps: float | None = (
         None  # Target QPS - required for poisson pattern, optional otherwise
     )
-    target_concurrency: int | None = None  # For concurrency mode (TODO)
+    target_concurrency: int | None = None  # For concurrency mode, ignored otherwise
 
 
 class ClientSettings(BaseModel):
@@ -314,7 +314,8 @@ class BenchmarkConfig(BaseModel):
     version: str = "1.0"
     type: TestType
     submission_ref: SubmissionReference | None = None  # For SUBMISSION type configs
-    benchmark_mode: TestType | None = None  # For SUBMISSION: specify offline or online
+    # For SUBMISSION: specify offline or online
+    benchmark_mode: TestType | None = None
     model_params: ModelParams = Field(default_factory=ModelParams)
     datasets: list[Dataset]
     settings: Settings = Field(default_factory=Settings)
@@ -433,7 +434,7 @@ def validate_load_pattern(self, benchmark_mode: TestType) -> None:
         """
         load_pattern_type = self.settings.load_pattern.type
         target_qps = self.settings.load_pattern.target_qps
-        max_concurrency = self.settings.client.max_concurrency
+        target_concurrency = self.settings.load_pattern.target_concurrency
 
         if benchmark_mode == TestType.OFFLINE:
             if load_pattern_type != LoadPatternType.MAX_THROUGHPUT:
@@ -451,11 +452,11 @@ def validate_load_pattern(self, benchmark_mode: TestType) -> None:
                     "Specify target queries per second (e.g., target_qps: 100 in YAML or --target-qps 100 in CLI)"
                 )
             elif load_pattern_type == LoadPatternType.CONCURRENCY:
-                # Concurrency pattern requires max_concurrency > 0
-                if not max_concurrency or max_concurrency <= 0:
+                # Concurrency pattern requires target_concurrency > 0
+                if not target_concurrency or target_concurrency <= 0:
                     raise ValueError(
-                        "Concurrency load pattern requires max_concurrency > 0. "
-                        "Specify number of concurrent requests (e.g., max_concurrency: 10 in YAML or --concurrency 10 in CLI)"
+                        "Concurrency load pattern requires target_concurrency > 0. "
+                        "Specify number of concurrent requests (e.g., target_concurrency: 10 under load_pattern in YAML or --concurrency 10 in CLI)"
                     )
 
     def validate_client_settings(self) -> None:
```
