[feat] Add a random data loader by arekay-nv · Pull Request #21 · mlcommons/endpoints

arekay-nv · 2025-11-10T20:33:54Z

Adds a random data loader that can generate arbitrary sequences of a given length. Useful for testing behavior over fixed sequence lengths.

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

gemini-code-assist · 2025-11-10T20:34:08Z

Summary of Changes

Hello @arekay-nv, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a RandomDataLoader into the system, designed to generate arbitrary sequences of a specified length. This new component is crucial for robust testing, enabling developers to evaluate model behavior consistently across fixed or varied sequence lengths, thereby improving the reliability of inference endpoints. It provides a flexible way to create synthetic datasets for development and testing purposes.

Highlights

New RandomDataLoader: Introduced a new RandomDataLoader class for generating synthetic data sequences, useful for testing model behavior over fixed or varied sequence lengths.
Configurable Data Generation: The RandomDataLoader supports configurable parameters such as number of sequences, input sequence length, vocabulary size, a range ratio for varying sequence lengths, and utilizes a transformers tokenizer for tokenization.
Unit Tests Added: Added comprehensive unit tests for the RandomDataLoader to ensure its functionality and correctness across different sequence length ranges, including checks for tokenization consistency.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

github-actions · 2025-11-10T20:34:09Z

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist · 2025-11-10T20:35:09Z

Summary of Changes

Hello @arekay-nv, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new RandomDataLoader feature designed to generate synthetic, tokenized data for testing purposes. This loader allows for the creation of arbitrary sequences with configurable lengths and quantities, which is particularly useful for evaluating model behavior across different input sequence sizes. The implementation includes integration with Hugging Face transformers tokenizers to produce realistic tokenized inputs, ensuring robust testing capabilities for sequence-based models.

Highlights

New RandomDataLoader: A new RandomDataLoader class has been introduced, inheriting from the base DataLoader, to generate synthetic data for testing purposes.
Configurable Sequence Generation: The RandomDataLoader can generate arbitrary sequences with configurable parameters such as the number of sequences, input sequence length, a range ratio for varying lengths, random seed, and vocabulary size.
Tokenizer Integration: The new data loader integrates with transformers.PreTrainedTokenizerBase to decode and encode the generated random sequences, producing tokenized data suitable for language model testing.
Unit Tests: Comprehensive unit tests have been added for the RandomDataLoader, including parameterized tests to verify its behavior across different range_ratio values and ensure correct sequence generation and length constraints.
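
As a rough illustration of the configurable generation described in the highlights above, here is a minimal sketch. It is not the code from this PR: the class name `RandomTokenLoader`, the method names `num_samples` and `load_sample`, the default values, and the reading of `range_ratio` (shortest length is `input_seq_len * range_ratio`) are all assumptions made for the example. It also uses the standard-library `random` module and skips the tokenizer decode/encode round trip that the PR performs.

```python
import random


class RandomTokenLoader:
    """Hypothetical stand-in for the PR's RandomDataLoader (not the actual code)."""

    def __init__(self, num_sequences, input_seq_len, range_ratio=1.0,
                 vocab_size=32000, seed=0):
        # Assumed semantics: range_ratio=1.0 means every sequence has exactly
        # input_seq_len tokens; smaller values allow shorter sequences.
        if not 0.0 < range_ratio <= 1.0:
            raise ValueError("range_ratio must be in (0, 1]")
        rng = random.Random(seed)  # seeded, so the data is reproducible
        # Convert to int explicitly: randint() expects integer bounds.
        min_len = max(1, int(input_seq_len * range_ratio))
        self.data = []
        for _ in range(num_sequences):
            length = rng.randint(min_len, input_seq_len)  # inclusive bounds
            # Token IDs are drawn uniformly from [0, vocab_size).
            self.data.append([rng.randrange(vocab_size) for _ in range(length)])

    def num_samples(self):
        return len(self.data)

    def load_sample(self, index):
        if not (0 <= index < self.num_samples()):
            raise IndexError("Index is out of range.")
        return self.data[index]
```

With `range_ratio=1.0` every sequence has exactly `input_seq_len` tokens, which matches the fixed-length testing use case from the PR description. Two loaders built with the same seed and parameters produce identical data.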

gemini-code-assist · 2025-11-10T20:35:26Z

Summary of Changes

Hello @arekay-nv, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new RandomDataLoader to the system, designed to create synthetic datasets with customizable sequence lengths and tokenization. This feature is crucial for robust testing of models, particularly those sensitive to input sequence dimensions, by providing a flexible way to generate diverse test cases without relying on external datasets.

Highlights

New Random Data Loader: Introduced a RandomDataLoader class capable of generating arbitrary sequences of specified lengths, useful for testing models with fixed sequence length requirements.
Tokenization Integration: The RandomDataLoader leverages the transformers library's PreTrainedTokenizerBase to generate and encode random sequences, allowing for realistic tokenized data.
Comprehensive Testing: Added unit tests for the RandomDataLoader using pytest.mark.parametrize to validate its behavior across different sequence length ranges.

gemini-code-assist

Code Review

This pull request introduces a RandomDataLoader for generating arbitrary sequences, which is useful for testing. The implementation is a good start, but I've identified a few areas for improvement. My feedback includes fixing a potential TypeError from a float being passed to an integer-expecting function, resolving an inconsistency in vocabulary size usage, making the tokenizer a required argument, removing redundant code, and improving the new tests for clarity and robustness. I've also suggested adding docstrings to improve maintainability.

gemini-code-assist

Code Review

This pull request introduces a RandomDataLoader for generating test data. The implementation is a good start, but there are several areas for improvement regarding correctness, consistency, and code clarity. My review includes suggestions to handle vocabulary size consistently, fix a bug in random number generation, improve constructor logic, and align the implementation with the base class API contract. I've also provided feedback on the accompanying tests to make them more robust and less dependent on external services.

gemini-code-assist · 2025-11-10T20:37:02Z

+        assert index < self.num_samples(), "Index is out of range."
+        return self.data[index]


Using assert for index validation in a public method is not ideal as assertions can be disabled with the -O flag. To conform with the base class DataLoader's contract, which implies raising IndexError for out-of-bounds access, it's better to perform an explicit check and raise an IndexError.

Suggested change

-        assert index < self.num_samples(), "Index is out of range."
-        return self.data[index]
+        if not (0 <= index < self.num_samples()):
+            raise IndexError("Index is out of range.")
+        return self.data[index]
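
The difference between the two checks can be seen in a small self-contained sketch. The class `ListBackedLoader` and the method name `load_sample` are hypothetical and chosen only for this illustration; the PR's actual method name is not shown in the thread.

```python
class ListBackedLoader:
    """Hypothetical loader used only to demonstrate the explicit bounds check."""

    def __init__(self, data):
        self.data = list(data)

    def num_samples(self):
        return len(self.data)

    def load_sample(self, index):
        # An explicit check still runs under `python -O`, where assert
        # statements are stripped, and it rejects negative indices too.
        if not (0 <= index < self.num_samples()):
            raise IndexError("Index is out of range.")
        return self.data[index]


loader = ListBackedLoader(["a", "b", "c"])
print(loader.load_sample(2))

for bad_index in (-1, 3):
    try:
        loader.load_sample(bad_index)
    except IndexError as exc:
        print(f"index {bad_index}: {exc}")
```

Note that the original `assert index < self.num_samples()` would let `-1` through, and Python list indexing would then silently return the last element. The lower bound in the suggested check closes that gap.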

gemini-code-assist

Code Review

This pull request introduces a RandomDataLoader, which is a useful addition for testing purposes. The implementation is solid, but I've identified a few areas for improvement to enhance correctness and clarity. My main feedback revolves around an inconsistency in using vocab_size, which could lead to bugs. I suggest removing the vocab_size parameter and consistently using tokenizer.vocab_size. I've also pointed out some minor issues in the __init__ method and opportunities to improve variable naming for clarity. Finally, I've suggested updates to the tests to align with these changes and improve maintainability by removing magic numbers.

nvzhihanj · 2025-11-12T04:33:12Z

+            self.num_sequences,
+        )
+        # Generate the input starts randomly from the vocab size
+        input_starts = self.rng.integers(


I assume we would like to reuse semi-analysis's random dataset as is? I would recommend properly quoting from (and adding attribution to) vLLM's benchmark_serving scripts (FYI, this is from a fork): https://github.com/kimbochen/bench_serving/blob/499c0b171b499b02a1fd546fb2326d2175a5d66e/benchmark_serving.py#L366 so that we have parity.

Add a random data loader

523bd2a

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

arekay-nv requested a review from a team as a code owner November 10, 2025 20:33

github-actions Bot requested a review from nvzhihanj November 10, 2025 20:34

gemini-code-assist Bot reviewed Nov 10, 2025

Comment thread src/inference_endpoint/dataset_manager/dataloader.py

Comment thread src/inference_endpoint/dataset_manager/dataloader.py Outdated

Comment thread tests/unit/dataset_manager/test_data_loader.py

Comment thread tests/unit/dataset_manager/test_data_loader.py

Merge branch 'main' into arekay/add_random_dataloader

726862a

nvzhihanj reviewed Nov 12, 2025

Comment thread src/inference_endpoint/dataset_manager/dataloader.py

nvzhihanj reviewed Nov 12, 2025

nvzhihanj approved these changes Nov 12, 2025

arekay-nv added 3 commits November 12, 2025 17:51

Merge branch 'main' into arekay/add_random_dataloader

bf25b51

Fix prompt sizes

d16d02e

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

mark test slow, add comments

3f7c57d

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

viraatc approved these changes Nov 14, 2025

arekay-nv merged commit b9fb4de into main Nov 19, 2025
4 checks passed

github-actions Bot locked and limited conversation to collaborators Nov 19, 2025

arekay-nv deleted the arekay/add_random_dataloader branch November 24, 2025 15:35
