The goal of this proposal is to define a new standard CLI and config (previously known as scenario) format which solves many usability issues and allows us to implement new features which are prohibitively difficult to add to the existing design.
Usability Improvements
Here are a few issues that often crop up in the current design and how this proposal addresses them.
Better data configuration
One critical limitation of the existing (<=v0.6.0) design for the --data argument is that it is hard to determine what a user intended when something is entered incorrectly. Most users of GuideLLM are likely familiar with some variant of this error (truncated to avoid taking up an entire page):
This is because a data string like prompt_tokens=100,output_tokens=50 contains no information about what kind of dataset it is. GuideLLM attempts to parse it as every kind of dataset and, if they all fail, returns the errors from every dataset type. This proposal eliminates the ambiguity by adding a type field that explicitly configures which kind of dataset is being requested. This has the added benefit of making it possible to support multiple dataset formats that closely match, such as a Mooncake dataset, which would normally conflict with plain jsonl.
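As a rough illustration of why an explicit type field helps, here is a minimal Python sketch of dispatching on that field. It is not GuideLLM code: the class names, the DATA_TYPES table, and parse_data are all hypothetical, and plain dataclasses stand in for the real config models.

```python
from dataclasses import dataclass


@dataclass
class SyntheticData:
    prompt_tokens: int
    output_tokens: int


@dataclass
class JsonlData:
    path: str


# Hypothetical lookup table from the `type` value to a config class.
DATA_TYPES = {"synthetic": SyntheticData, "jsonl": JsonlData}


def parse_data(entry: dict):
    """Build a dataset config from a mapping that carries an explicit `type`."""
    options = dict(entry)  # copy so the caller's mapping is left untouched
    kind = options.pop("type", None)
    if kind not in DATA_TYPES:
        known = ", ".join(sorted(DATA_TYPES))
        raise ValueError(f"unknown data type {kind!r}; expected one of: {known}")
    try:
        return DATA_TYPES[kind](**options)
    except TypeError as exc:
        # Only the requested type is reported, not every dataset type.
        raise ValueError(f"invalid options for data type {kind!r}: {exc}") from exc
```

Because the type is known up front, a bad option produces a single error that names the requested dataset type instead of one error per candidate type.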
Built-in option documentation
The current GuideLLM CLI is very limited in what is documented, but internally every field has a description. Since the run CLI exposes users to the internal type lookup tables we can make a very intuitive help system for describing available options (note final format will differ):
$ guidellm describe backend openai_http
HTTP backend for OpenAI-compatible servers.

Supports OpenAI API, vLLM servers, and other compatible endpoints with
text/chat completions, streaming, authentication, and multimodal inputs.
Handles request formatting, response parsing, error handling, and token
usage tracking with flexible parameter customization.

Fields:
  - target: str
      Base URL of the OpenAI-compatible server
  - model: str | None
      Model identifier for generation requests
  - request_format: Literal["/v1/completions"] (default "/v1/completions")
      Request format for OpenAI-compatible server.
Config layering
While the GuideLLM CLI/Config format is very powerful, there are often cases where two or more options should always be configured together. For instance, with the recent addition of Geospatial model support, the existing CLI requires both --request-format /pooling and --data-column-mapper pooling_column_mapper. These multi-component workloads will most likely become more common with the Mooncake and tool-calling additions in the future. GuideLLM already has the concept of a builtin scenario to help address this problem; however, only one scenario (builtin or custom) can be passed.
By allowing multiple configs and implementing rules for layering, we can embed common use-cases as configs that can be layered into a benchmark. For example:
guidellm run \
  --config well-lit/geospatial \
  --config special/trace_data \
  --config custom.yaml
could enable the Geospatial model arguments, configure the profile for trace data, and run the user's custom config.
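One possible layering rule can be sketched in a few lines of Python. This is only an assumption about how layering might behave (the proposal leaves the rules open): mappings are merged recursively, while lists and scalars in a later layer fully replace earlier values. The function names are hypothetical.

```python
from copy import deepcopy


def _merge_into(base: dict, overlay: dict) -> None:
    """Merge `overlay` into `base` in place; later values win."""
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            _merge_into(base[key], value)  # recurse into nested mappings
        else:
            # Scalars and lists are replaced wholesale (an assumed default).
            base[key] = deepcopy(value)


def layer_configs(*layers: dict) -> dict:
    """Combine config layers in order, leaving the inputs unmodified."""
    merged: dict = {}
    for layer in layers:
        _merge_into(merged, layer)
    return merged


# Hypothetical layers standing in for a builtin config and a user config.
geospatial = {
    "backend": {"request_format": "/pooling"},
    "data_column_mapper": {"type": "pooling_column_mapper"},
}
custom = {"backend": {"target": "http://localhost:8000"}}
combined = layer_configs(geospatial, custom)
```

Here combined["backend"] carries both request_format from the first layer and target from the second, which is the behavior the geospatial example relies on.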
New Features
Here are a few new features this redesign will enable.
Per-benchmark randomness
Randomness plays a few different roles in GuideLLM; likely the most important role is in generating synthetic data. In the current design of GuideLLM the randomness of synthetic data suffers due to a compromise made with real datasets. If given a series of rates (e.g. --rate 1,2,4,8), GuideLLM always starts each rate at the same location in the dataset. For pre-created datasets this means replaying the exact same requests in the exact same order for each rate. For synthetic data this means reinitializing the dataset using the same random seed, which results in the same requests. This is fine for use-cases that don’t involve any sort of server-side caching and is necessary for use-cases where the goal is to evaluate response quality at each rate. However, once caching is introduced this can cause issues if the previous rate’s requests are not evicted from the cache. To work around this, the synthetic data generator inserts an index marker at the beginning of the prompt which matches the index of the current rate. This approach has at least 3 problems:
- prefix_tokens are unaffected by this index and are thus shared across all rates.
- If --data-samples is set, the dataset is generated once before all benchmarks, which results in the index marker being a static 1.
- The order of rates affects the values in the dataset even when the random seed is static.
To fix these problems (and a few others) the new configuration is designed to give control over randomness at a per-benchmark level. For example:
---
global:
  seed:
    type: increment
    start: 42
    step: 2
will start the first benchmark with a random seed of 42 and increment the seed by 2 for each subsequent benchmark whereas:
---
global:
  seed:
    type: static
    value: 42
will use a static seed for each benchmark. With static, the seed can be overwritten by each benchmark to allow more manual control.
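The two seed policies can be illustrated with a small Python sketch. The function name benchmark_seeds is hypothetical, and the keys (type, start, step, value) simply mirror the two YAML snippets in this section; the real implementation may differ.

```python
def benchmark_seeds(seed_config: dict, count: int) -> list[int]:
    """Return the seed used by each of `count` benchmarks."""
    kind = seed_config["type"]
    if kind == "increment":
        start = seed_config["start"]
        step = seed_config.get("step", 1)  # assumed default step of 1
        return [start + index * step for index in range(count)]
    if kind == "static":
        return [seed_config["value"]] * count
    raise ValueError(f"unknown seed type {kind!r}")


incrementing = benchmark_seeds({"type": "increment", "start": 42, "step": 2}, 4)
static = benchmark_seeds({"type": "static", "value": 42}, 3)
```

With the increment policy each benchmark receives a distinct seed (42, 44, 46, 48 for four benchmarks), so synthetic requests differ between rates without depending on an index marker in the prompt.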
Conditional constraint groups (Future work)
With the new design we can support more advanced combinations of constraints such as logical groups. For example
could be used to ensure a minimum number of requests are run and then stop either when a max is hit or when oversaturation occurs.
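To make the idea of logical groups concrete, here is a hedged Python sketch of the stop condition described above: a minimum must be met, and then either a maximum or oversaturation ends the run. Everything here is hypothetical (the state dictionary, the helper names, and the all_of/any_of combinators); it does not reflect a settled config syntax.

```python
from typing import Callable

# Hypothetical snapshot of benchmark progress, e.g.
# {"requests": 120, "saturated": False}.
State = dict
# A constraint returns True once its stop condition is satisfied.
Constraint = Callable[[State], bool]


def requests_at_least(count: int) -> Constraint:
    return lambda state: state["requests"] >= count


def over_saturated() -> Constraint:
    return lambda state: bool(state["saturated"])


def all_of(*constraints: Constraint) -> Constraint:
    """Logical AND group: every member must be satisfied."""
    return lambda state: all(check(state) for check in constraints)


def any_of(*constraints: Constraint) -> Constraint:
    """Logical OR group: one satisfied member is enough."""
    return lambda state: any(check(state) for check in constraints)


# Run at least 100 requests, then stop at 1000 requests or on oversaturation.
should_stop = all_of(
    requests_at_least(100),
    any_of(requests_at_least(1000), over_saturated()),
)
```

Oversaturation detected after only 50 requests would not stop the run in this sketch, because the minimum in the outer group has not yet been reached.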
Plugins (Future work)
Since v0.4 GuideLLM has been designed as an extensible architecture. Many internal components are implemented as registries. For example, the backend registry has the openai_http and python_vllm backends registered to it. However, the current CLI implementation limits the usefulness of adding to these registries externally. This is due to some static type checking as well as the inability of external code to extend the CLI with new options. With this new CLI design it will be easier to allow plugins to define their own options, and the separation of config from functional class will allow argument validation to be handled by plugins.
Examples
RHAIIS Regression Workload Example
Common use-case from the PSAP RHAIIS sub-team.
CLI
guidellm run \
  --output json path=./benchmarks.json \
  --backend openai_http target=http://host:8000,request_format=/v1/completions \
  --constraint max_seconds seconds=600 \
  --profile concurrent \
  --data synthetic prompt_tokens=1000,output_tokens=1000 \
  --seed auto start=42 \
  --override "profile.streams" 1,50,100,200,300,500,650 \
  --override "constraint[0].seconds" 60,120,120,120,,,
Exhaustive Example
Example with most of the options set.
---
global:
  backend:
    type: "openai_http"
    target: "http://localhost:8000"
    request_format: "/v1/chat/completions"
    model: "OpenAI/gpt-oss-120b"
    processor: "OpenAI/gpt-oss-120b"
    validate_backend: true
    verify: false
  profile:
    type: "concurrent"
    rampup: 3
    warmup: 10
    cooldown: 20
  constraints:
    - type: "max_seconds"
      seconds: 50
    - type: "over_saturation"
  data_loader:
    type: "generative"
    sampler: null  # Currently --data-sampler
    samples: 1000  # Currently --data-samples
    start_index: 0  # Index to start the dataset at
    num_workers: 10  # Currently --data-num-workers
  data_column_mapper:
    type: "generative_column_mapper"
    mappings:
      text_column: "article"
      output_tokens_count_column: "output_tokens"
  data_preprocessors:
    - type: "encode_media"
    - type: "custom_pre"
      max_len: 256
  data_finalizer:
    type: "generative"
  data:
    - type: synthetic
      prompt_tokens: 50
      output_tokens: 100
      load_args: ...  # Currently --data-args
  seed:
    type: increment  # Auto increment based on benchmark index
    start: 56
    by: 1
benchmarks:
  - profile.streams: 50
  - profile.streams: 100
    profile.rampup: 10
    profile.warmup: 0
    profile.cooldown: 0
    constraints[0].seconds: 100
outputs:
  - type: json
    exclude_requests: true
    path: benchmarks.json
  - type: csv
    path: benchmarks.csv
  - type: jsonl_requests
    sample_requests: 50
    path: benchmark.jsonl
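The per-benchmark entries use dotted keys such as profile.streams and indexed keys such as constraints[0].seconds. A minimal Python sketch of how such overrides might be applied on top of the global section follows. The path grammar and the function names are assumptions made for illustration, not the proposal's defined behavior.

```python
import re
from copy import deepcopy

# Matches either a plain key segment or a bracketed list index.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def parse_path(path: str) -> list:
    """Split 'constraints[0].seconds' into ['constraints', 0, 'seconds']."""
    tokens = []
    for name, index in _TOKEN.findall(path):
        tokens.append(int(index) if index else name)
    return tokens


def apply_overrides(global_config: dict, overrides: dict) -> dict:
    """Return a copy of `global_config` with each override path set."""
    result = deepcopy(global_config)  # every benchmark gets its own copy
    for path, value in overrides.items():
        *parents, last = parse_path(path)
        node = result
        for token in parents:
            node = node[token]  # works for dict keys and list indexes
        node[last] = value
    return result


base = {
    "profile": {"type": "concurrent", "rampup": 3},
    "constraints": [{"type": "max_seconds", "seconds": 50}],
}
second = apply_overrides(
    base,
    {"profile.streams": 100, "profile.rampup": 10, "constraints[0].seconds": 100},
)
```

Copying the global section before applying overrides keeps each benchmark independent, so a change made for one benchmark never leaks into the next.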
Notes
Open Questions
1. How will well-lit paths / layering work?
Example: guidellm run --config builtin/geospatial --config custom.yaml should enable the required options for geospatial models and then layer custom configs on top.
One problem to solve here is how to handle list options (such as data). By default they should probably be fully overwritten, but could (and should) we come up with a design that allows merging lists? The previous version of this proposal had a “merge_lists” key at the top of the config, but that seems too coarse. Also, how do we handle merging the merge option itself? Does the last one apply to all config layers? Does each “merge_lists” config only merge lists with the layer before it or the layer after it?
Another problem is how to merge incompatible types. For example, if one config has type: "openai_http" and the next has type: "vllm_python", what happens to all of the configured options, since they may not be valid for other types? What happens if a different type is layered between two compatible types? I think the solution here is to build a graph of every type while layering configs and then apply whichever one is seen last.
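One way to read the "apply whichever one is seen last" idea is sketched below in Python. This is a single interpretation offered for discussion, not a decided design: options are accumulated separately per type while layering, and only the options belonging to the last type seen are kept. The function name layer_typed is hypothetical.

```python
def layer_typed(*layers: dict) -> dict:
    """Layer typed configs, keeping only options for the last type seen."""
    options_by_type: dict[str, dict] = {}
    current = None
    for layer in layers:
        options = dict(layer)  # copy so the input layer is not modified
        # A layer without `type` is assumed to extend the most recent type.
        current = options.pop("type", current)
        if current is None:
            raise ValueError("the first layer must set a type")
        options_by_type.setdefault(current, {}).update(options)
    return {"type": current, **options_by_type[current]}


result = layer_typed(
    {"type": "openai_http", "target": "http://localhost:8000"},
    {"type": "vllm_python", "model": "OpenAI/gpt-oss-120b"},
    {"type": "openai_http", "verify": False},
)
```

In this sketch the vllm_python layer in the middle does not discard the earlier openai_http options, and its own options are dropped because a different type is applied last.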
Implementation Details (TBD)
Currently YAML and CLI arguments feed into a Pydantic model called BenchmarkGenerativeTextArgs. This config model is then passed to the benchmark_generative_text function, which spawns the required resources. In the new design BenchmarkGenerativeTextArgs will be split into multiple layers. For example:
Individual global args will be owned by the related component. For example, the backend component will have a BackendArgs (this already exists) Pydantic registry which is subclassed for each backend. The overlying BenchmarkGlobalArgs will implement helper validation and serialization methods that use the provided type field to create the appropriate subclass for each global arg. For example:
Note: the existing BackendArgs will have to be modified to be a registry and contain a type field.
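The registry-plus-type-field pattern described here can be sketched with the standard library alone. This is a simplified stand-in, not the planned implementation: it uses dataclasses rather than Pydantic, and the kind class keyword, the from_config helper, and the OpenAIHTTPArgs subclass are hypothetical names. Only BackendArgs and the type field come from the proposal.

```python
from dataclasses import dataclass
from typing import Optional


class BackendArgs:
    """Base class; each subclass registers itself under its type name."""

    registry: dict = {}

    def __init_subclass__(cls, kind: str, **kwargs):
        super().__init_subclass__(**kwargs)
        BackendArgs.registry[kind] = cls  # external plugins register the same way
        cls.type = kind

    @classmethod
    def from_config(cls, config: dict) -> "BackendArgs":
        """Create the subclass selected by the config's `type` field."""
        options = dict(config)
        kind = options.pop("type")
        if kind not in cls.registry:
            raise ValueError(f"unknown backend type {kind!r}")
        return cls.registry[kind](**options)


@dataclass
class OpenAIHTTPArgs(BackendArgs, kind="openai_http"):
    target: str
    model: Optional[str] = None


backend = BackendArgs.from_config(
    {"type": "openai_http", "target": "http://localhost:8000"}
)
```

Because registration happens when a subclass is defined, a plugin only needs to be imported for its type to become available to the config loader.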
Note
This proposal is human written.