[feat] Add json parser and enable parser config #19
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Summary of Changes
Hello @arekay-nv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the data loading and configuration flexibility for benchmarking. It introduces a new JSON Lines parser, allowing a wider variety of datasets to be used. Crucially, it enables users to define how dataset fields are mapped to benchmark parameters through YAML configurations, moving away from rigid, hardcoded parsing logic. This change also refines the benchmark configuration schema to better support model identification and updates default templates to reflect these new capabilities.
Code Review
This pull request introduces a JSON parser for datasets and makes the parsing behavior configurable through YAML, which is a great enhancement for flexibility. The changes also remove a hardcoded pickle file type, improving the tool's adaptability.
My review includes several points:
- A critical issue in openai_adapter.py regarding the order of messages sent to the model, which will likely cause incorrect behavior.
- Some high-severity issues in dataset_manager/factory.py related to incorrect type hints and potential TypeErrors that should be addressed for robustness.
- A high-severity issue in dataset_manager/dataloader.py concerning memory usage in the new JsonlReader, which could be problematic for large datasets.
- A couple of medium-severity suggestions for code cleanup and to track a TODO for future improvements.
Overall, the changes are valuable, but the identified critical and high-severity issues should be fixed to ensure correctness and stability.
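For reference, the memory concern raised about JsonlReader is commonly addressed by reading the file lazily, one line at a time. The following is only a minimal sketch under that assumption; the function name iter_jsonl is hypothetical and is not taken from this PR's code.

```python
import json
from typing import Any, Iterator


def iter_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed record per non-empty line (hypothetical helper).

    The file is streamed line by line, so memory use is bounded by the
    largest single record rather than by the size of the whole dataset.
    """
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so bad records are easy to locate.
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON: {exc.msg}"
                ) from exc
```

A caller that needs random access can still materialize the records with list(iter_jsonl(path)), at the cost of holding the full dataset in memory.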
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
What does this PR do?
Adds a JSON parser to load JSON datasets.
Adds YAML config options to specify parser behavior by mapping dataset fields.
Fixes hardcoded pickle file type.
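As an illustration of the field-mapping idea, the sketch below applies a mapping to a single dataset record. It assumes the mapping has already been loaded from the YAML config into a plain dict; the names FIELD_MAP and remap_record and the example field names are hypothetical and do not reflect the actual schema in this PR.

```python
from typing import Any, Mapping

# Hypothetical mapping, as it might look once the YAML parser section is loaded:
# benchmark parameter name -> dataset field name.
FIELD_MAP = {"prompt": "question", "system_prompt": "instructions"}


def remap_record(
    record: Mapping[str, Any], field_map: Mapping[str, str]
) -> dict[str, Any]:
    """Return a new record keyed by benchmark parameter names.

    Fields not named in the mapping are dropped; a mapped field that is
    absent from the record raises KeyError instead of passing silently.
    """
    missing = [source for source in field_map.values() if source not in record]
    if missing:
        raise KeyError(f"record is missing mapped fields: {missing}")
    return {target: record[source] for target, source in field_map.items()}
```

Failing loudly on a missing field is one possible design choice; a real implementation might prefer defaults or optional fields.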
Type of change
Related issues
Testing
Checklist