Skip to content

Dynamic SUTs refactor#1141

Merged
bkorycki merged 9 commits into
mainfrom
dynamic-refactory
Jul 23, 2025
Merged

Dynamic SUTs refactor#1141
bkorycki merged 9 commits into
mainfrom
dynamic-refactory

Conversation

@bkorycki
Copy link
Copy Markdown
Contributor

This is an attempt to refactor the dynamic SUTs to address the concerns mentioned in PR #1131 , summarized below:

  • Make dynamic SUT factory objects less anemic. Store them as instances.
  • De-couple the dynamic SUT mechanism from the registry. Don't pre-register dynamic SUTs at the start of execution.
  • A dynamic SUT factory should return a hydrated SUT from make_sut, instead of a registry-tuple. (Related to above).

This is what I tried to do:

  • Create a new object SUTFactory that is responsibly for making all SUTs. It encapsulates the dynamic SUT functionality.
  • My goal was to be able to call SUTFactory.make_instance() for dynamic suts and pre-registered SUTs. The caller doesn't have to know anything about what type of SUT it is. We can strip the registry functionality from the object once we get to that point.

Here's where I got stuck:

  • the validate_uid and check_secrets functions both work well with the registry system. To extend those pre-flight functions to work on dynamic SUTs, they would have to attempt to create the dynamic SUT. But if a SUT is successfully created, they might as well register that SUT. Otherwise it will just have to do the work again when make_instance is called.
  • So, I feel like I failed to find a way to de-couple the registry system from the dynamic SUTs.

(Less important) I think there is room for further clean up with how we map driver names to dynamic SUT factories. ie I think the driver name can be a part of the dynamic SUT factory classes, and we can collect those objects automatically instead of populating a dictionary. But I think this is work that can be done after sorting out this PR.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jul 21, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@wpietri
Copy link
Copy Markdown
Contributor

wpietri commented Jul 21, 2025

Maybe I'm missing something here, but given the dynamic nature of things, can't we just try making the SUT and then blow up if it doesn't work?

For example, if I remove the validate_uid and check_secrets calls, this runs fine:

modelgauge run-sut --sut openai/gpt-4o:openai --prompt "Why did the chicken cross the road?"

If I do it with an invalid model name, it blows up:

% modelgauge run-sut --sut openai/gpt-4p:openai --prompt "Why did the chicken cross the road?"
[stack trace removed]
modelgauge.dynamic_sut_factory.ModelNotSupportedError: Model gpt-4p not found or not available on openai.

From a user perspective, the stack trace is annoying, but since it's a nicely specific exception, I think we can just catch that at the top and print out a nice error message.

If we were doing more than one SUT per run, this would be more of a problem; we'd want the command to fail fast. But since we're getting rid of multi-SUT runs anyhow, this should either work fine or blow up quickly, shouldn't it?

If that's true, I think you can just be a bit more radical in your trimming, ending up with less complexity and a good user experience.

@bkorycki
Copy link
Copy Markdown
Contributor Author

@wpietri Thanks for your feedback. Do you propose I remove the validate_uid and check_secrets checks for pre-registered SUTs as well? Is that what you mean by more radical trimming?

@wpietri
Copy link
Copy Markdown
Contributor

wpietri commented Jul 21, 2025

@wpietri Thanks for your feedback. Do you propose I remove the validate_uid and check_secrets checks for pre-registered SUTs as well? Is that what you mean by more radical trimming?

If you can get as a good user experience without them, then yes. If not, then you can at least put those checks in the pre-registered SUTs path only.

Trying it out, the validate_uid check for non-dynamic SUTs does produce a list of other options, which is a nice feature given that the keys are things we made up and can't be found in other docs. So maybe I'd keep that in the non-dynamic path. It's a pretty big list these days, though, so I might take pity on the user and give just the closest matches.

That help-the-user-with-a-typo experience is also something we could probably include in the dynamic SUTs, come to think of it. Or at least we should give people a link whatever docs page has all the valid SUT ids for the dynamic provider invoked.

@bkorycki bkorycki temporarily deployed to Scheduled Testing July 21, 2025 23:55 — with GitHub Actions Inactive
@bkorycki bkorycki temporarily deployed to Scheduled Testing July 21, 2025 23:55 — with GitHub Actions Inactive
@bkorycki bkorycki temporarily deployed to Scheduled Testing July 21, 2025 23:55 — with GitHub Actions Inactive
@bkorycki
Copy link
Copy Markdown
Contributor Author

@wpietri Okay! I made it so that we do validation and secret checks for pre registered suts and not dynamic ones.

@bkorycki bkorycki marked this pull request as ready for review July 21, 2025 23:58
@bkorycki bkorycki requested a review from a team as a code owner July 21, 2025 23:58
@wpietri
Copy link
Copy Markdown
Contributor

wpietri commented Jul 22, 2025

@wpietri Okay! I made it so that we do validation and secret checks for pre registered suts and not dynamic ones.

I'm not following. I was expecting to see validate_uid and check_secrets get removed from places like main.py:run_sut and only get invoked if somebody didn't ask for a dynamic SUT.

@bkorycki
Copy link
Copy Markdown
Contributor Author

@wpietri You're right, the call to check_secrets in run_sut is redudant. I'll remove it, didn't realize it was there.

validate_uid is a callback for the click --sut option. I could have done a check insidevalidate_uid to see if the sut is dynamic before proceeding with validation. But I opted to keep that logic encapsulated in the SUT_FACTORY object. imho things that use suts shouldn't have to know whether suts are dynamic or not.
The same applies for check_secrets. The factory decides whether or not it should check the secrets for a sut. Nobody has to know about pre-registered/dynamic SUT classification besides the factory.

@wpietri
Copy link
Copy Markdown
Contributor

wpietri commented Jul 22, 2025

@bkorycki I'm saying that you could also drop the validate_uid check at the top level for SUTs. That work needs to get done later anyhow. Then that method can go back to being just validating static SUT uids and we can slowly eliminate it.

Unless maybe there's some advantage to doing the work twice?

@bkorycki
Copy link
Copy Markdown
Contributor Author

@wpietri What's wrong with validating UIDs at the start of execution? That's the only place we call that method, so I'm not sure what you mean by doing it twice.

@wpietri
Copy link
Copy Markdown
Contributor

wpietri commented Jul 22, 2025

@bkorycki: As I showed before, you can remove the validate_uid check and things work fine. Good uids work, bad uids blow up in a clear way. (That we could make more user-friendly with a try statement to translate the exception.) The method isn't getting called twice, but the same work is being done.

Pre-checking the SUT uids was mildly helpful before, because if somebody put in two bad --sut arguments we could explain that they were both wrong at once. But now I don't see a point to a separate validation step. We can just check the argument once as part of creating the SUT.

@bkorycki bkorycki temporarily deployed to Scheduled Testing July 22, 2025 18:12 — with GitHub Actions Inactive
@bkorycki bkorycki temporarily deployed to Scheduled Testing July 22, 2025 18:12 — with GitHub Actions Inactive
@bkorycki bkorycki temporarily deployed to Scheduled Testing July 22, 2025 18:12 — with GitHub Actions Inactive
@bkorycki
Copy link
Copy Markdown
Contributor Author

@wpietri Okay, I think I did what you want. I removed the validate_uid callback for --sut options and prettyfied the new error message that now gets evoked when a sut isnt found.

Copy link
Copy Markdown
Contributor

@rogthefrog rogthefrog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great improvement. I have a couple of questions/suggestions in the review, but they can be addressed later (if ever) and shouldn't block merging.

Comment thread src/modelgauge/command_line.py
UNKNOWN = "unknown"


# TODO: Auto-collect. Maybe make "driver" keys a constant in each factory module.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are, at least in some. E.g. huggingface_sut_factory line 19. That was the intent originally.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it would be nice if it were fully automatic though:)

# Maps a string to the module and factory function in that module
# that can be used to create a dynamic sut
DYNAMIC_SUT_FACTORIES: dict = {
"proxied": {"hfrelay": HuggingFaceSUTFactory},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a vague memory of chatting about flattening this and ditching the proxied vs direct distinction. Did you decide to keep it here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided that there was already too much in this PR haha. I still think it would be valuable to do though

def _make_dynamic_sut(self, uid: str) -> SUT:
sut_metadata: DynamicSUTMetadata = DynamicSUTMetadata.parse_sut_uid(uid)

if sut_metadata.is_proxied():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be simpler if we removed the proxied vs direct distinction.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! This whole chunk would be unnecessary.

class TogetherSUTFactory(DynamicSUTFactory):
def __init__(self, raw_secrets: RawSecrets):
super().__init__(raw_secrets)
self._client = None # Lazy load.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you want to add a type hint like for the OpenAI client above?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I only added it for openai because mypy was complaining

benchmark_options.extend(["--prompt-set", "practice"])

with pytest.raises(SUTNotFoundException):
with pytest.raises(ValueError, match="No registration for bogus"):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason to test for a more generic exception? SUTNotFoundException and related exception classes are more expressive IMO.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is more expressive. I only changed it to accommodate for the fact that SUTs dont get validated with validate_uid anymore. I think the failure actually happens in when check_secrets tries to get the thing from the registry, and the registry raises that error. I unfortunately can't change it to raise a SUTNotFoundException because the code is in the InstanceFactory class, which handles other registry types as well e.g. tests, annotators. I agree this is less ideal.

Comment thread tests/modelgauge_tests/sut_tests/test_openai_sut_factory.py
with pytest.raises(SUTNotFoundException):
_ = run_cli("run-test", "--sut", "unknown-uid", "--test", "fake-test")
del TESTS._lookup["fake-test"]
with pytest.raises(ValueError, match="No registration for unknown-uid"):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question about the specific exception classes.

Comment thread tests/modelgauge_tests/test_sut_factory.py
Comment thread tests/modelgauge_tests/test_sut_factory.py
Copy link
Copy Markdown
Contributor

@wpietri wpietri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huge progress. Thanks for doing this!

@bkorycki bkorycki merged commit 91443dc into main Jul 23, 2025
4 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jul 23, 2025
@wpietri wpietri deleted the dynamic-refactory branch January 16, 2026 14:10
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants