Major refactor to move multi-threaded load generator to async event loops#282
Major refactor to move multi-threaded load generator to async event loops#282nv-alicheng merged 4 commits intomainfrom
Conversation
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
There was a problem hiding this comment.
Code Review
This pull request transitions the load generator to an asynchronous architecture and replaces the SQLite-based metrics pipeline with a high-performance, mmap-backed key-value store. It introduces a new BenchmarkSession for phase orchestration and optimized async strategies for Poisson, burst, and concurrency load patterns. The review feedback identifies a potential race condition caused by premature temporary directory deletion in the scoring artifact logic, suggests using generic placeholders for endpoint URLs to improve configuration portability, and recommends dynamically registering metric keys using enums to enhance code maintainability and prevent silent failures.
arekay-nv
left a comment
There was a problem hiding this comment.
Made a first pass. Will finish up tomorrow.
arekay-nv
left a comment
There was a problem hiding this comment.
Part 2 of reviews - next one should wrap it up
arekay-nv
left a comment
There was a problem hiding this comment.
Awesome work - thanks.
I will run some perf test offline, but looks good!
- docs(report_design): add "reports reproducible from event log" principle - refactor(metrics_table): rename subtract_field -> delta_start_fieldname - docs(metrics_table): reword ISL token_ids/text comments so SGLang/OpenAI are examples, not defining conditions - test(kv_store): pin empty SeriesStats min/max sentinels; add snapshot isolation regression test - test(aggregator): add explanatory messages to tracking-window asserts - test(report_builder): pin max/std_dev on empty compute_summary - test(sample_order): parametrize over [3, 100, 10_000] dataset sizes - test(zmq_pool_transport): collapse two pool transport classes into one parametrized over (num_workers, create_publisher) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses PR #282 thread T33: reframe the docstring so MetricsAggregator and Harmony read as examples rather than a closed list of consumers, and state explicitly that this function never loads or downloads the tokenizer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: address PR review comments — fix tmpfs cleanup, use metric enums, fix templates - Fix premature tmpfs deletion in _write_scoring_artifacts (cleanup now owned solely by run_benchmark's finally block) - Replace hardcoded metric key strings in _setup_kv_reader with MetricCounterKey/MetricSeriesKey enum iteration - Fix config templates: replace placeholder URLs with http://localhost:8000, remove nonexistent record_worker_events field, fix YAML indentation - Replace internal hostname in gptoss_test.yaml with localhost Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Disable warmup by default until rules are defined * Remove stray config file * Add TPOT_NS to _STREAMING_ONLY metrics set Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * chore: address second round of PR #282 review comments - docs(report_design): add "reports reproducible from event log" principle - refactor(metrics_table): rename subtract_field -> delta_start_fieldname - docs(metrics_table): reword ISL token_ids/text comments so SGLang/OpenAI are examples, not defining conditions - test(kv_store): pin empty SeriesStats min/max sentinels; add snapshot isolation regression test - test(aggregator): add explanatory messages to tracking-window asserts - test(report_builder): pin max/std_dev on empty compute_summary - test(sample_order): parametrize over [3, 100, 10_000] dataset sizes - test(zmq_pool_transport): collapse two pool transport classes into one parametrized over (num_workers, create_publisher) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: clarify _check_tokenizer_exists is a probe, consumers are examples Addresses PR #282 thread T33: reframe the docstring so MetricsAggregator and Harmony read as examples rather than a closed list of consumers, and state explicitly that this function never loads or downloads the tokenizer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Squashed combination of 8 commits developed on feat/alicheng-pubsub-integration: - Initial cleanup to remove all features that will be replaced and permanently removed (#214) - Add KVStore, ready-check mechanism, and ServiceLauncher (#215) - Add report building (#216) - Fix rebase errors - Make Loadgen Async (#255) - Do not use /dev/shm tmpfs for KVStore on ARM, fix rebase errors in e2e_oracle tests - Add batched publishing to pubsub (#281) - Disable warmup temporarily, bugfixes. (#288) Squashed locally prior to rebase onto origin/main to avoid a case-insensitive-filesystem conflict on docs/load_generator/DESIGN.md between intermediate commits. Full per-sub-PR history is preserved in the review threads on PR #282 and on the backup ref backup/pre-squash-pubsub-integration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4c785b2 to
aaf6f9b
Compare
Squashed combination of 8 commits developed on feat/alicheng-pubsub-integration: - Initial cleanup to remove all features that will be replaced and permanently removed (#214) - Add KVStore, ready-check mechanism, and ServiceLauncher (#215) - Add report building (#216) - Fix rebase errors - Make Loadgen Async (#255) - Do not use /dev/shm tmpfs for KVStore on ARM, fix rebase errors in e2e_oracle tests - Add batched publishing to pubsub (#281) - Disable warmup temporarily, bugfixes. (#288) Squashed locally prior to rebase onto origin/main to avoid a case-insensitive-filesystem conflict on docs/load_generator/DESIGN.md between intermediate commits. Full per-sub-PR history is preserved in the review threads on PR #282 and on the backup ref backup/pre-squash-pubsub-integration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aaf6f9b to
d44be10
Compare
What does this PR do?
Major refactor which moves the multi-threaded system to an event loop based one to reduced Python context switching for higher throughput and performance.
Type of change
Related issues
This merges from a side branch which consists of multiple PRs:
Testing
Checklist