[feat] Add json parser and enable parser config by arekay-nv · Pull Request #19 · mlcommons/endpoints

arekay-nv · 2025-11-08T23:01:53Z

What does this PR do?

Adds a JSON parser to load JSON datasets.
Adds YAML config options that control parser behavior by mapping dataset fields.
Fixes the hardcoded pickle file type.
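
As a rough illustration of the field mapping described above, the sketch below reads a JSON Lines file and renames fields through a key map. The function name `load_jsonl`, the `key_map` argument, and the sample field names are hypothetical; they are not taken from this PR's code and only show the general idea.

```python
import json
from pathlib import Path


def load_jsonl(path, key_map=None):
    """Read one JSON object per line and rename fields via key_map.

    key_map maps dataset field names to the names the benchmark expects
    (a hypothetical convention for this sketch, not the PR's actual API).
    """
    key_map = key_map or {}
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            row = json.loads(line)
            # Rename mapped fields; leave unmapped fields unchanged.
            records.append({key_map.get(k, k): v for k, v in row.items()})
    return records
```

A call such as `load_jsonl("data.jsonl", {"question": "prompt", "answer": "output"})` would return records keyed by `prompt` and `output`, which is roughly what a YAML-driven mapping would need to produce.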

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

github-actions · 2025-11-08T23:02:02Z

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist · 2025-11-08T23:02:12Z

Summary of Changes

Hello @arekay-nv, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the data loading and configuration flexibility for benchmarking. It introduces a new JSON Lines parser, allowing a wider variety of datasets to be used. Crucially, it enables users to define how dataset fields are mapped to benchmark parameters through YAML configurations, moving away from rigid, hardcoded parsing logic. This change also refines the benchmark configuration schema to better support model identification and updates default templates to reflect these new capabilities.

Highlights

JSON Parser: Added a new JSON Lines (JSONL) parser to support loading datasets in this format, expanding data ingestion capabilities.
Configurable Dataset Parsing: Introduced the ability to specify dataset parsing behavior via YAML configuration, allowing users to map fields (e.g., 'prompt', 'output') from the dataset to internal structures using key_maps.
Flexible Dataset Format Handling: Removed the hardcoded 'pkl' format assumption for datasets, enabling the DataLoaderFactory to infer or be explicitly told the dataset format, supporting both pickle and JSONL.
Model Name in Configuration: Added an optional name field to ModelParams in the configuration schema, allowing the model name to be specified and used consistently across benchmarks.
Template Updates: Updated the offline_template.yaml and online_template.yaml files to include the new model_params.name and demonstrate the new parser configuration for datasets, along with adjusted runtime durations.
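
The "infer or be explicitly told" behavior mentioned in the highlights could look roughly like the sketch below. The helper name `resolve_format` and the extension table are assumptions made for illustration; the actual DataLoaderFactory logic is not shown in this thread and may differ.

```python
from pathlib import Path

# Hypothetical extension table; the real factory may recognize other suffixes.
_EXT_TO_FORMAT = {".pkl": "pkl", ".pickle": "pkl", ".jsonl": "jsonl"}


def resolve_format(path, explicit=None):
    """Return the explicit format if given, else infer it from the suffix."""
    if explicit is not None:
        return explicit
    suffix = Path(path).suffix.lower()
    try:
        return _EXT_TO_FORMAT[suffix]
    except KeyError:
        # Fail loudly instead of silently assuming pickle.
        raise ValueError(f"cannot infer dataset format from {path!r}") from None
```

Raising on an unknown suffix, rather than falling back to a default, avoids reintroducing the kind of hardcoded assumption this PR removes.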

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a JSON parser for datasets and makes the parsing behavior configurable through YAML, which is a great enhancement for flexibility. The changes also remove a hardcoded pickle file type, improving the tool's adaptability.

My review includes several points:

A critical issue in openai_adapter.py regarding the order of messages sent to the model, which will likely cause incorrect behavior.
Some high-severity issues in dataset_manager/factory.py related to incorrect type hints and potential TypeErrors that should be addressed for robustness.
A high-severity issue in dataset_manager/dataloader.py concerning memory usage in the new JsonlReader which could be problematic for large datasets.
A couple of medium-severity suggestions for code cleanup and to track a TODO for future improvements.

Overall, the changes are valuable, but the identified critical and high-severity issues should be fixed to ensure correctness and stability.
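
One common way to address the memory concern raised for JsonlReader is to yield records lazily instead of materializing the whole file. The sketch below is only an assumption about what such a change could look like; `iter_jsonl` is a hypothetical name, and the review does not show how JsonlReader is actually implemented.

```python
import json
from typing import Any, Dict, Iterator


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record at a time instead of building a full list."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # ignore blank lines
                yield json.loads(line)
```

Whether this fits depends on the loader's access pattern: a generator keeps memory roughly constant per record, but it does not support random access by index, which a benchmark sampler may require.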


Add json parser and enable parser config

6e5bc26

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

arekay-nv requested a review from a team as a code owner November 8, 2025 23:01

gemini-code-assist Bot reviewed Nov 8, 2025

arekay-nv added 2 commits November 8, 2025 18:48

Fix test.

ebc4b28

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

Fix reorder

1ecabbd

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

nvzhihanj reviewed Nov 9, 2025

Comment thread src/inference_endpoint/commands/benchmark.py Outdated

nvzhihanj reviewed Nov 9, 2025

Comment thread src/inference_endpoint/dataset_manager/dataloader.py

nvzhihanj approved these changes Nov 9, 2025

arekay-nv added 3 commits November 10, 2025 08:58

Merge branch 'main' into arekay/add_json_and_parser_config

68370df

Address comments

8e032c3

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

Fix test.

b05b07a

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

github-code-quality Bot found potential problems Nov 10, 2025

Comment thread tests/unit/openai/test_openai_types.py Fixed

Potential fix for pull request finding 'Commented-out code'

731b647

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

arekay-nv merged commit b004cc5 into main Nov 10, 2025
4 checks passed

arekay-nv deleted the arekay/add_json_and_parser_config branch November 10, 2025 20:22

github-actions Bot locked and limited conversation to collaborators Nov 10, 2025

arekay-nv merged 7 commits into main from arekay/add_json_and_parser_config
