Skip to content

Standardize columns + Data refactor#1113

Merged
bkorycki merged 15 commits into
mainfrom
standardize-columns
Jul 3, 2025
Merged

Standardize columns + Data refactor#1113
bkorycki merged 15 commits into
mainfrom
standardize-columns

Conversation

@bkorycki
Copy link
Copy Markdown
Contributor

@bkorycki bkorycki commented Jul 2, 2025

The overall goal of this PR is to standardize data formatting so that the outputs of one stage can be used as the input of another. And to avoid head aches.
Two new objects are instrumental in accomplishing this:

  1. The Data Schema objects, which define column names in one place.
  2. Dataset objects which handle both reading from input and writing to output.
    • The different types of datasets will read CSV rows and output the relevant object for each row e.g. TestItem for prompts, SUTInteraction for prompts-responses, and AnnotatedSUTInteraction for annotations.
    • The Annotation dataset serializes/deserializes annotations as json-strings. It can serialize pydantic objects or dictionaries. Annotations will always be deserialized as dictionaries.

Also in service of this goal, every output file (aside from metadata.json) is a CSV file in which one row corresponds to one prompt and one sut response and/or one annotation. No more annotations.jsonl.

@bkorycki bkorycki requested a review from a team as a code owner July 2, 2025 21:21
@bkorycki bkorycki temporarily deployed to Scheduled Testing July 2, 2025 21:21 — with GitHub Actions Inactive
@bkorycki bkorycki temporarily deployed to Scheduled Testing July 2, 2025 21:21 — with GitHub Actions Inactive
@bkorycki bkorycki temporarily deployed to Scheduled Testing July 2, 2025 21:21 — with GitHub Actions Inactive
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jul 2, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@bkorycki bkorycki requested a review from rogthefrog July 2, 2025 21:24
Copy link
Copy Markdown
Contributor

@rogthefrog rogthefrog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great and very clear.

Comment thread src/modelgauge/dataset.py

quoting = csv.QUOTE_ALL

def __init__(self, path: Union[str, Path], mode: str):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be useful to have a BaseDatasetReader and BaseDatasetWriter instead of handling both use cases in one class?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm that's interesting. But I don't like the idea of having 6 dataset classes instead of just 3. Do you think it would be worth it to split it up?

Comment thread src/modelgauge/dataset.py Outdated
@bkorycki bkorycki temporarily deployed to Scheduled Testing July 3, 2025 17:56 — with GitHub Actions Inactive
@bkorycki bkorycki temporarily deployed to Scheduled Testing July 3, 2025 17:56 — with GitHub Actions Inactive
@bkorycki bkorycki temporarily deployed to Scheduled Testing July 3, 2025 17:56 — with GitHub Actions Inactive
@bkorycki bkorycki merged commit afe1cbf into main Jul 3, 2025
4 checks passed
@bkorycki bkorycki deleted the standardize-columns branch July 3, 2025 21:17
@github-actions github-actions Bot locked and limited conversation to collaborators Jul 3, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants