
[feat] Add json parser and enable parser config #19

Merged
arekay-nv merged 7 commits into main from arekay/add_json_and_parser_config
Nov 10, 2025

@arekay-nv
Collaborator

What does this PR do?

Adds a JSON parser to load JSON datasets.
Adds YAML config options that specify parser behavior by mapping dataset fields.
Fixes the hardcoded pickle file type.
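
As a rough illustration of the field mapping described above, the sketch below parses JSON Lines text and renames dataset fields to the names a benchmark might expect. The function name `parse_jsonl`, the `key_map` argument, and the sample field names are hypothetical and are not taken from this PR's code; the direction of the mapping (dataset name to internal name) is also an assumption.

```python
import json
from typing import Any, Dict, List


def parse_jsonl(text: str, key_map: Dict[str, str]) -> List[Dict[str, Any]]:
    """Parse JSON Lines text and rename fields according to key_map.

    key_map maps a dataset field name to the name the consumer expects.
    This direction is an assumption made for the sketch.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        record = json.loads(line)
        # Rename mapped fields; fields without a mapping keep their name.
        samples.append({key_map.get(k, k): v for k, v in record.items()})
    return samples


# Hypothetical dataset with "question"/"answer" fields.
raw = (
    '{"question": "2+2?", "answer": "4"}\n'
    '{"question": "Capital of France?", "answer": "Paris"}\n'
)
samples = parse_jsonl(raw, {"question": "prompt", "answer": "output"})
```

In a YAML-driven setup the same mapping would presumably come from the dataset section of the config file rather than from a literal dict.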

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
@arekay-nv arekay-nv requested a review from a team as a code owner November 8, 2025 23:01
@github-actions

github-actions Bot commented Nov 8, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@gemini-code-assist

Summary of Changes

Hello @arekay-nv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the data loading and configuration flexibility for benchmarking. It introduces a new JSON Lines parser, allowing a wider variety of datasets to be used. Crucially, it enables users to define how dataset fields are mapped to benchmark parameters through YAML configurations, moving away from rigid, hardcoded parsing logic. This change also refines the benchmark configuration schema to better support model identification and updates default templates to reflect these new capabilities.

Highlights

  • JSON Parser: Added a new JSON Lines (JSONL) parser to support loading datasets in this format, expanding data ingestion capabilities.
  • Configurable Dataset Parsing: Introduced the ability to specify dataset parsing behavior via YAML configuration, allowing users to map fields (e.g., 'prompt', 'output') from the dataset to internal structures using key_maps.
  • Flexible Dataset Format Handling: Removed the hardcoded 'pkl' format assumption for datasets, enabling the DataLoaderFactory to infer or be explicitly told the dataset format, supporting both pickle and JSONL.
  • Model Name in Configuration: Added an optional name field to ModelParams in the configuration schema, allowing the model name to be specified and used consistently across benchmarks.
  • Template Updates: Updated the offline_template.yaml and online_template.yaml files to include the new model_params.name and demonstrate the new parser configuration for datasets, along with adjusted runtime durations.
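
The "infer or be explicitly told" behavior in the highlights can be pictured with a small helper like the one below. This is only a sketch under assumptions: `resolve_format`, the extension table, and the error message are invented for illustration and do not reflect how `DataLoaderFactory` is actually implemented.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension table covering the two formats named in the summary.
_EXTENSION_TO_FORMAT = {".pkl": "pkl", ".pickle": "pkl", ".jsonl": "jsonl"}


def resolve_format(path: str, explicit_format: Optional[str] = None) -> str:
    """Return the dataset format, preferring an explicit setting over inference."""
    if explicit_format is not None:
        return explicit_format
    suffix = Path(path).suffix.lower()
    try:
        return _EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        # Fail loudly instead of silently falling back to a default format.
        raise ValueError(f"Cannot infer dataset format from {path!r}") from None
```

Raising on an unknown extension, rather than defaulting to pickle, avoids reintroducing the hardcoded assumption this PR removes.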


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a JSON parser for datasets and makes the parsing behavior configurable through YAML, which is a great enhancement for flexibility. The changes also remove a hardcoded pickle file type, improving the tool's adaptability.

My review includes several points:

  • A critical issue in openai_adapter.py regarding the order of messages sent to the model, which will likely cause incorrect behavior.
  • Some high-severity issues in dataset_manager/factory.py related to incorrect type hints and potential TypeErrors that should be addressed for robustness.
  • A high-severity issue in dataset_manager/dataloader.py concerning memory usage in the new JsonlReader which could be problematic for large datasets.
  • A couple of medium-severity suggestions for code cleanup and to track a TODO for future improvements.

Overall, the changes are valuable, but the identified critical and high-severity issues should be fixed to ensure correctness and stability.
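
The review does not spell out a fix for the `JsonlReader` memory concern here. One common way to limit memory use with JSON Lines is to yield records lazily instead of building a full list. The sketch below shows that general technique with a hypothetical `iter_jsonl` generator; it is not the reviewer's suggested change or the code that was merged.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-blank line, reading lazily."""
    for line in stream:  # file objects iterate line by line
        line = line.strip()
        if line:
            yield json.loads(line)


# An in-memory stream stands in for an open dataset file.
stream = io.StringIO('{"prompt": "a"}\n\n{"prompt": "b"}\n')
records = iter_jsonl(stream)
first = next(records)  # only the first line has been parsed at this point
```

A benchmark that needs random access to samples would still have to materialize or index the data, so lazy iteration suits sequential passes best.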

Comment thread src/inference_endpoint/openai/openai_adapter.py Outdated
Comment thread src/inference_endpoint/commands/benchmark.py Outdated
Comment thread src/inference_endpoint/dataset_manager/dataloader.py
Comment thread src/inference_endpoint/dataset_manager/factory.py Outdated
Comment thread src/inference_endpoint/dataset_manager/factory.py Outdated
Comment thread src/inference_endpoint/dataset_manager/dataloader.py
Comment thread src/inference_endpoint/dataset_manager/factory.py
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Comment thread src/inference_endpoint/commands/benchmark.py Outdated
Comment thread src/inference_endpoint/dataset_manager/dataloader.py
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Comment thread tests/unit/openai/test_openai_types.py Fixed
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
@arekay-nv arekay-nv merged commit b004cc5 into main Nov 10, 2025
4 checks passed
@arekay-nv arekay-nv deleted the arekay/add_json_and_parser_config branch November 10, 2025 20:22
@github-actions github-actions Bot locked and limited conversation to collaborators Nov 10, 2025

2 participants