
[feat] Add a random data loader #21

Merged
arekay-nv merged 5 commits into main from arekay/add_random_dataloader on Nov 19, 2025

Conversation

@arekay-nv
Collaborator

@arekay-nv arekay-nv commented Nov 10, 2025

Adds a random data loader that can generate arbitrary sequences of a given length. Useful for testing behavior over fixed sequence lengths.

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
@arekay-nv arekay-nv requested a review from a team as a code owner November 10, 2025 20:33
@gemini-code-assist

Summary of Changes

Hello @arekay-nv, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a RandomDataLoader into the system, designed to generate arbitrary sequences of a specified length. This new component is crucial for robust testing, enabling developers to evaluate model behavior consistently across fixed or varied sequence lengths, thereby improving the reliability of inference endpoints. It provides a flexible way to create synthetic datasets for development and testing purposes.

Highlights

  • New RandomDataLoader: Introduced a new RandomDataLoader class for generating synthetic data sequences, useful for testing model behavior over fixed or varied sequence lengths.
  • Configurable Data Generation: The RandomDataLoader supports configurable parameters such as number of sequences, input sequence length, vocabulary size, a range ratio for varying sequence lengths, and utilizes a transformers tokenizer for tokenization.
  • Unit Tests Added: Added comprehensive unit tests for the RandomDataLoader to ensure its functionality and correctness across different sequence length ranges, including checks for tokenization consistency.
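
The configurable parameters listed above can be illustrated with a minimal sketch. It is not the merged implementation: the function names `sample_lengths` and `random_sequences` are hypothetical, and it assumes that `range_ratio` sets the lower length bound to `int(input_seq_length * range_ratio)`, which the summary does not confirm. It uses the standard-library `random` module and skips the tokenizer step entirely.

```python
import random


def sample_lengths(num_sequences, input_seq_length, range_ratio, seed=0):
    """Draw one length per sequence from [int(L * range_ratio), L] (assumed semantics)."""
    if not 0.0 < range_ratio <= 1.0:
        raise ValueError("range_ratio must be in (0, 1]")
    rng = random.Random(seed)  # seeded so repeated runs give the same lengths
    # Cast to int: randint() expects integer bounds, and L * ratio is a float.
    low = max(1, int(input_seq_length * range_ratio))
    return [rng.randint(low, input_seq_length) for _ in range(num_sequences)]


def random_sequences(num_sequences, input_seq_length, vocab_size, range_ratio=1.0, seed=0):
    """Return lists of random token ids, each id in [0, vocab_size)."""
    rng = random.Random(seed + 1)  # separate stream for the token ids
    lengths = sample_lengths(num_sequences, input_seq_length, range_ratio, seed)
    return [[rng.randrange(vocab_size) for _ in range(n)] for n in lengths]


seqs = random_sequences(num_sequences=4, input_seq_length=16, vocab_size=100, range_ratio=0.5)
print([len(s) for s in seqs])  # four lengths, each between 8 and 16
```

With `range_ratio=1.0` every sequence has exactly `input_seq_length` tokens, which is the fixed-length case the PR description mentions.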

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions

github-actions Bot commented Nov 10, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions Bot requested a review from nvzhihanj November 10, 2025 20:34
@gemini-code-assist

Summary of Changes

This pull request introduces a new RandomDataLoader feature designed to generate synthetic, tokenized data for testing purposes. This loader allows for the creation of arbitrary sequences with configurable lengths and quantities, which is particularly useful for evaluating model behavior across different input sequence sizes. The implementation includes integration with Hugging Face transformers tokenizers to produce realistic tokenized inputs, ensuring robust testing capabilities for sequence-based models.

Highlights

  • New RandomDataLoader: A new RandomDataLoader class has been introduced, inheriting from the base DataLoader, to generate synthetic data for testing purposes.
  • Configurable Sequence Generation: The RandomDataLoader can generate arbitrary sequences with configurable parameters such as the number of sequences, input sequence length, a range ratio for varying lengths, random seed, and vocabulary size.
  • Tokenizer Integration: The new data loader integrates with transformers.PreTrainedTokenizerBase to decode and encode the generated random sequences, producing tokenized data suitable for language model testing.
  • Unit Tests: Comprehensive unit tests have been added for the RandomDataLoader, including parameterized tests to verify its behavior across different range_ratio values and ensure correct sequence generation and length constraints.
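
Decoding random token ids to text and then encoding that text again does not have to reproduce the original ids, or even the original count, which is one reason length checks on the re-encoded output are worth having. The sketch below shows the effect with a deliberately tiny, hypothetical `ToyTokenizer` that uses greedy longest-match encoding. It is not the `transformers` API and not the PR's code; a real tokenizer would take its place.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer; greedy longest-match encoding."""

    def __init__(self, vocab):
        self.id_to_piece = list(vocab)
        self.piece_to_id = {p: i for i, p in enumerate(self.id_to_piece)}
        self.vocab_size = len(self.id_to_piece)
        self._max_len = max(len(p) for p in self.id_to_piece)

    def decode(self, ids):
        return "".join(self.id_to_piece[i] for i in ids)

    def encode(self, text):
        ids, pos = [], 0
        while pos < len(text):
            # Try the longest vocabulary piece that matches at this position.
            for size in range(min(self._max_len, len(text) - pos), 0, -1):
                piece = text[pos:pos + size]
                if piece in self.piece_to_id:
                    ids.append(self.piece_to_id[piece])
                    pos += size
                    break
            else:
                raise ValueError(f"cannot encode {text[pos:]!r}")
        return ids


tok = ToyTokenizer(["a", "b", "ab"])
random_ids = [0, 1, 0]         # pieces "a", "b", "a"
text = tok.decode(random_ids)  # "aba"
round_trip = tok.encode(text)  # greedy matching merges "a" + "b" into "ab"
print(len(random_ids), len(round_trip))  # 3 2
```

Both id lists decode to the same text, yet the re-encoded list is shorter, so a loader that promises a token count has to measure it after the final encode step.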

@gemini-code-assist

Summary of Changes

This pull request introduces a new RandomDataLoader to the system, designed to create synthetic datasets with customizable sequence lengths and tokenization. This feature is crucial for robust testing of models, particularly those sensitive to input sequence dimensions, by providing a flexible way to generate diverse test cases without relying on external datasets.

Highlights

  • New Random Data Loader: Introduced a RandomDataLoader class capable of generating arbitrary sequences of specified lengths, useful for testing models with fixed sequence length requirements.
  • Tokenization Integration: The RandomDataLoader leverages the transformers library's PreTrainedTokenizerBase to generate and encode random sequences, allowing for realistic tokenized data.
  • Comprehensive Testing: Added unit tests for the RandomDataLoader using pytest.mark.parametrize to validate its behavior across different sequence length ranges.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a RandomDataLoader for generating arbitrary sequences, which is useful for testing. The implementation is a good start, but I've identified a few areas for improvement. My feedback includes fixing a potential TypeError from a float being passed to an integer-expecting function, resolving an inconsistency in vocabulary size usage, making the tokenizer a required argument, removing redundant code, and improving the new tests for clarity and robustness. I've also suggested adding docstrings to improve maintainability.

Comment thread src/inference_endpoint/dataset_manager/dataloader.py
Comment thread src/inference_endpoint/dataset_manager/dataloader.py
Comment thread src/inference_endpoint/dataset_manager/dataloader.py
Comment thread src/inference_endpoint/dataset_manager/dataloader.py Outdated
Comment thread src/inference_endpoint/dataset_manager/dataloader.py Outdated
Comment thread tests/unit/dataset_manager/test_data_loader.py Outdated
Comment thread tests/unit/dataset_manager/test_data_loader.py

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a RandomDataLoader for generating test data. The implementation is a good start, but there are several areas for improvement regarding correctness, consistency, and code clarity. My review includes suggestions to handle vocabulary size consistently, fix a bug in random number generation, improve constructor logic, and align the implementation with the base class API contract. I've also provided feedback on the accompanying tests to make them more robust and less dependent on external services.

Comment thread src/inference_endpoint/dataset_manager/dataloader.py Outdated
Comment thread src/inference_endpoint/dataset_manager/dataloader.py Outdated
Comment thread src/inference_endpoint/dataset_manager/dataloader.py Outdated
Comment on lines +392 to +393
        assert index < self.num_samples(), "Index is out of range."
        return self.data[index]

medium

Using assert for index validation in a public method is not ideal as assertions can be disabled with the -O flag. To conform with the base class DataLoader's contract, which implies raising IndexError for out-of-bounds access, it's better to perform an explicit check and raise an IndexError.

Suggested change
-        assert index < self.num_samples(), "Index is out of range."
-        return self.data[index]
+        if not (0 <= index < self.num_samples()):
+            raise IndexError("Index is out of range.")
+        return self.data[index]
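
The difference between the two patterns can be seen in a small standalone sketch. The class `ListBackedLoader` and the method names `load_sample` and `load_sample_with_assert` are hypothetical; only `num_samples()` and `self.data` appear in the reviewed snippet. Besides disappearing under `python -O`, the assert-based check only tests the upper bound, so a negative index silently wraps around.

```python
class ListBackedLoader:
    """Hypothetical loader used only to contrast the two index checks."""

    def __init__(self, data):
        self.data = list(data)

    def num_samples(self):
        return len(self.data)

    def load_sample_with_assert(self, index):
        # Original pattern: stripped under `python -O`, and negatives pass the check.
        assert index < self.num_samples(), "Index is out of range."
        return self.data[index]

    def load_sample(self, index):
        # Suggested pattern: always enforced, and negative indices are rejected too.
        if not (0 <= index < self.num_samples()):
            raise IndexError("Index is out of range.")
        return self.data[index]


loader = ListBackedLoader(["s0", "s1", "s2"])
print(loader.load_sample_with_assert(-1))  # s2 (wraps around to the last sample)
try:
    loader.load_sample(-1)
except IndexError as exc:
    print("rejected:", exc)
```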

Comment thread tests/unit/dataset_manager/test_data_loader.py
Comment thread tests/unit/dataset_manager/test_data_loader.py
Comment thread tests/unit/dataset_manager/test_data_loader.py Outdated
Comment thread tests/unit/dataset_manager/test_data_loader.py

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a RandomDataLoader, which is a useful addition for testing purposes. The implementation is solid, but I've identified a few areas for improvement to enhance correctness and clarity. My main feedback revolves around an inconsistency in using vocab_size, which could lead to bugs. I suggest removing the vocab_size parameter and consistently using tokenizer.vocab_size. I've also pointed out some minor issues in the __init__ method and opportunities to improve variable naming for clarity. Finally, I've suggested updates to the tests to align with these changes and improve maintainability by removing magic numbers.

Comment thread src/inference_endpoint/dataset_manager/dataloader.py
Comment thread src/inference_endpoint/dataset_manager/dataloader.py Outdated
Comment thread tests/unit/dataset_manager/test_data_loader.py
Comment thread tests/unit/dataset_manager/test_data_loader.py
Comment thread src/inference_endpoint/dataset_manager/dataloader.py
self.num_sequences,
)
# Generate the input starts randomly from the vocab size
input_starts = self.rng.integers(
Collaborator

@nvzhihanj nvzhihanj Nov 12, 2025

I assume we would like to reuse SemiAnalysis's random dataset as is? I would recommend properly quoting from (and adding attribution to) vLLM's benchmark_serving scripts (FYI, this is from a fork): https://github.com/kimbochen/bench_serving/blob/499c0b171b499b02a1fd546fb2326d2175a5d66e/benchmark_serving.py#L366 so we have parity.
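
For readers following the thread, the commented code draws a random start per sequence from the vocabulary range. One scheme of that general shape is sketched below: each sequence is a run of consecutive token ids beginning at its start and wrapped modulo the vocabulary size. Whether this matches the linked script line for line has not been verified here, and the names `offset_sequence` and `offset_dataset` are hypothetical.

```python
import random


def offset_sequence(start, length, vocab_size):
    """Token ids start, start + 1, ... wrapped into [0, vocab_size)."""
    return [(start + j) % vocab_size for j in range(length)]


def offset_dataset(num_sequences, length, vocab_size, seed=0):
    """Build num_sequences runs, each from its own random start (assumed scheme)."""
    rng = random.Random(seed)
    # One random start per sequence, drawn from the vocabulary range.
    input_starts = [rng.randrange(vocab_size) for _ in range(num_sequences)]
    return [offset_sequence(s, length, vocab_size) for s in input_starts]


print(offset_sequence(start=8, length=4, vocab_size=10))  # [8, 9, 0, 1]
```

Because the ids are always reduced modulo a single `vocab_size`, every id stays valid for the tokenizer as long as that value comes from the tokenizer itself.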

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
@arekay-nv arekay-nv merged commit b9fb4de into main Nov 19, 2025
4 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Nov 19, 2025
@arekay-nv arekay-nv deleted the arekay/add_random_dataloader branch November 24, 2025 15:35

4 participants