WIP - Accuracy Checkers based on proposal#36
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Summary of Changes

Hello @nv-alicheng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request lays the groundwork for comprehensive accuracy checking within the inference endpoint. It introduces a modular system for defining and managing datasets and evaluators, enabling the integration of various benchmarks like GPQA. The changes also adapt the existing load generation and session management to support accuracy tests, ensuring that model outputs can be correctly linked back to their source data for evaluation and reporting.
Code Review
This pull request introduces the foundational components for accuracy checking, including base classes for datasets and evaluators, along with a concrete implementation for the GPQA dataset. The changes are a good step towards building out the accuracy evaluation framework.
My review includes a few key points:
- A critical bug fix in `session.py` for iterating over accuracy test generators.
- A high-severity bug fix in `gpqa.py` related to loading dataset variants from Hugging Face.
- Suggestions to improve the design of dataset loading by making automatic generation more accessible through the base classes.
- Minor fixes for logging and a typo.
Overall, the structure is well-thought-out, and with these fixes, it will be a solid foundation.
```python
for _, generator in accuracy_test_generators:
    for _ in generator:
        pass
```
There is a bug in how you iterate over `accuracy_test_generators`. The expression `for _, generator in accuracy_test_generators:` attempts to iterate over the keys of the dictionary and unpack each key, which will likely fail or produce incorrect behavior. To iterate over the `LoadGenerator` instances, you should iterate over the dictionary's values.
Suggested change:

```python
for generator in accuracy_test_generators.values():
    for _ in generator:
        pass
```
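For context, here is a minimal sketch of the assumed shape of `accuracy_test_generators`: a dictionary mapping a test name to a `LoadGenerator`-like generator. The literal values below are purely illustrative and not taken from the PR.

```python
# Purely illustrative: a dict whose keys are accuracy-test names and whose
# values are generators yielding one item per request.
accuracy_test_generators = {
    "gpqa_diamond": (f"request-{i}" for i in range(3)),
    "gpqa_main": (f"request-{i}" for i in range(3)),
}

# Iterating the dict directly would yield the string keys, not the generators.
# Iterating .values() drains each generator, which is the intended behavior.
for generator in accuracy_test_generators.values():
    for _ in generator:
        pass  # in the real code, each iteration drives one generated request
```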
```python
def generate(self, datasets_dir: Path):
    # Load the variant from HuggingFace
    try:
        raw_ds = hf_datasets.load_dataset("Idavidrein/gpqa", f"gpqa_{self.variant}")
```
There's a bug when loading the dataset from Hugging Face. When `max_samples` is provided, `self.variant` is modified to include the sample count (e.g., `"diamond_100"`). This modified variant name is then used to load the dataset, which will likely fail because the Hugging Face dataset variant does not include the sample count. You should use `self._variant_name`, which stores the original variant name (`"diamond"`, `"extended"`, or `"main"`).
Suggested change:

```python
raw_ds = hf_datasets.load_dataset("Idavidrein/gpqa", f"gpqa_{self._variant_name}")
```
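To make the failure mode concrete, here is a minimal sketch of the assumed initialization pattern; everything except the attribute names `variant` and `_variant_name` mentioned in the review is a guess, not code from the diff.

```python
# Hypothetical reconstruction of how the variant string gets mutated.
class GPQA:
    def __init__(self, variant: str = "diamond", max_samples: int | None = None):
        self._variant_name = variant  # original HF config name, e.g. "diamond"
        self.variant = variant
        if max_samples is not None:
            # e.g. "diamond" -> "diamond_100": useful for naming the local file,
            # but "gpqa_diamond_100" is not a valid Hugging Face config name.
            self.variant = f"{variant}_{max_samples}"
```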
```
variable.

If <variant> is not specified or not applicable, the default value is 'full'. The
variant should be specified as an instance variable: .variant.
```
```python
        raise NotImplementedError

    @abstractmethod
    def load(self, datasets_dir: Path) -> Any:
```
The signature of the abstract method `load` should include the `create_if_not_exists` parameter to match the implementation in subclasses like `GPQA`. This ensures consistency and allows callers using the `AccuracyDataset` interface to leverage this functionality.
Suggested change:

```python
    def load(self, datasets_dir: Path, create_if_not_exists: bool = False) -> Any:
```
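A minimal sketch of how the amended abstract signature and a subclass override might line up; the class bodies, file name, and helper logic below are illustrative assumptions, not the PR's code.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any


class AccuracyDataset(ABC):
    @abstractmethod
    def generate(self, datasets_dir: Path) -> None:
        """Create the on-disk dataset file under datasets_dir."""
        raise NotImplementedError

    @abstractmethod
    def load(self, datasets_dir: Path, create_if_not_exists: bool = False) -> Any:
        """Load the dataset, optionally generating it first if it is missing."""
        raise NotImplementedError


class GPQA(AccuracyDataset):
    def generate(self, datasets_dir: Path) -> None:
        ...  # download from Hugging Face and write the processed file

    def load(self, datasets_dir: Path, create_if_not_exists: bool = False) -> Any:
        path = datasets_dir / "gpqa.pkl"  # hypothetical file name
        if not path.exists() and create_if_not_exists:
            self.generate(datasets_dir)
        ...  # read and return the dataset
```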
| print(f"Error loading dataset: {e}") | ||
| print("Note: This dataset may require HuggingFace authentication.") | ||
| print("Run: huggingface-cli login") |
It's better to use the configured logger (`logger.error` or `logger.warning`) instead of `print()` for outputting error messages and instructions. This provides more consistent and controllable logging within the application.
| print(f"Error loading dataset: {e}") | |
| print("Note: This dataset may require HuggingFace authentication.") | |
| print("Run: huggingface-cli login") | |
| logger.error(f"Error loading dataset: {e}") | |
| logger.error("Note: This dataset may require HuggingFace authentication.") | |
| logger.error("Run: huggingface-cli login") |
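This assumes a module-level logger is already defined in the file; if it is not, the conventional setup would look like the following (handler and level configuration are left to the application).

```python
import logging

# Module-level logger; the application entry point configures handlers and levels.
logger = logging.getLogger(__name__)
```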
```python
self.accuracy_dataset = accuracy_dataset
self.dataset_dir = dataset_dir

self.ds = self.accuracy_dataset.load(dataset_dir)
```
To improve usability, consider allowing the `Evaluator` to create datasets that don't exist. You can do this by adding a `create_if_not_exists: bool = False` parameter to `Evaluator.__init__` and passing it to the `load` method here. This change assumes that the `AccuracyDataset.load` abstract method is also updated to accept this parameter.
Suggested change:

```python
self.ds = self.accuracy_dataset.load(dataset_dir, create_if_not_exists=create_if_not_exists)
```
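A sketch of how the amended constructor could look, reusing the parameter names visible elsewhere in the diff (`extractor`, `scorer`, `accuracy_dataset`, `dataset_dir`); the untyped parameters and overall shape are assumptions.

```python
from pathlib import Path


class Evaluator:
    def __init__(
        self,
        extractor,
        scorer,
        accuracy_dataset,
        dataset_dir: Path,
        create_if_not_exists: bool = False,  # new flag proposed in this review
    ):
        self.extractor = extractor
        self.scorer = scorer
        self.accuracy_dataset = accuracy_dataset
        self.dataset_dir = dataset_dir
        # Forward the flag so a missing dataset can be generated on first use.
        self.ds = self.accuracy_dataset.load(
            dataset_dir, create_if_not_exists=create_if_not_exists
        )
```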
```python
from typing import Any, ClassVar


class DatasetFormat(Enum):
```
I would recommend that we consolidate the dataset format to a single standard and provide a conversion script to convert into that standard format; otherwise, maintaining the behavior of so many formats is tough. Pickle / Parquet / JSONL seem like the most popular ones.
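As an illustration of what such a conversion script could look like, here is a sketch that normalizes Pickle or Parquet files to JSONL; the function name, pandas dependency, and choice of JSONL as the standard are all assumptions.

```python
import json
from pathlib import Path

import pandas as pd


def convert_to_jsonl(src: Path, dst: Path) -> None:
    """Convert a Pickle or Parquet dataset file to JSONL, one record per line."""
    if src.suffix == ".parquet":
        df = pd.read_parquet(src)
    elif src.suffix in (".pkl", ".pickle"):
        df = pd.read_pickle(src)
    else:
        raise ValueError(f"Unsupported source format: {src.suffix}")

    with dst.open("w", encoding="utf-8") as f:
        for record in df.to_dict(orient="records"):
            f.write(json.dumps(record) + "\n")
```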
```python
correct_answer = f"choice{correct_index + 1}"

# Create processed row
processed_row = {
```
It seems the common f-string for question formatting is missing here. Do you plan to add it somewhere else?
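For reference, a hypothetical example of the kind of question-formatting f-string the comment refers to; the prompt wording and field names are illustrative, though the `choice1`..`choice4` keys mirror the `correct_answer = f"choice{correct_index + 1}"` convention visible in the diff.

```python
# Hypothetical multiple-choice prompt template for a processed GPQA-style row.
def format_question(question: str, row: dict[str, str]) -> str:
    return (
        f"{question}\n\n"
        f"A) {row['choice1']}\n"
        f"B) {row['choice2']}\n"
        f"C) {row['choice3']}\n"
        f"D) {row['choice4']}\n\n"
        "Answer with the letter of the correct choice."
    )
```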
Force-pushed from aa7c46d to 067fddd

…ABCDExtractor for GPQA
…ile with support for reasoning sequences

Force-pushed from d8352d1 to 189ebd3
```python
evaluator = Evaluator(
    extractor=ABCDExtractor,
    scorer=PassAt1Scorer(),
    accuracy_dataset=mock_dataset,
    dataset_dir=Path("/fake/path"),
)
```
```python
evaluator = Evaluator(
    extractor=Mock(extract=lambda x: x),  # No-op extractor
    scorer=RougeScorer(use_stemmer=True),
    accuracy_dataset=mock_dataset,
    dataset_dir=Path("/fake/path"),
)
```
…cess function to generate v6.0 compatible dataset file
Closing because this was merged in a separate PR.
What does this PR do?
Type of change
Related issues
Testing
Checklist