fix(weave): populate structured scorer feedback for runnable scores #6986
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Preview this PR with FeatureBee: https://beta.wandb.ai/?betaVersion=a210a3262172406157419cc0a2b5535703b78fe1
jtschoonhoven left a comment
We should DRY this up with the code in this branch of agent_scoring_types.py
Apologies if you weren't aware of that branch, hate to duplicate work.
def _derive_scorer_fields_from_payload(
    feedback_req: tsi.FeedbackCreateReq,
    processed_payload: dict[str, Any],
) -> dict[str, Any]:
You should be able to replace most of this by reusing code from https://github.com/wandb/core/pull/44328/changes#diff-daee47bf3fb4333f5265212ed054429a7a15ad728e17225fa7a09993400d04f0
E.g. this is similar to ScorerLlmOutputGroup.from_agent_scorer_output()
It probably makes sense to pull some of those classes out of the agent scoring worker and into a shared module.
request_scorer_fields = {
    "scorer_tags": feedback_req.scorer_tags,
    "scorer_tag_reasons": feedback_req.scorer_tag_reasons,
    "scorer_tag_confidences": feedback_req.scorer_tag_confidences,
    "scorer_ratings": feedback_req.scorer_ratings,
    "scorer_rating_reasons": feedback_req.scorer_rating_reasons,
    "scorer_rating_confidences": feedback_req.scorer_rating_confidences,
}
In https://github.com/wandb/core/pull/44328/changes#diff-daee47bf3fb4333f5265212ed054429a7a15ad728e17225fa7a09993400d04f0 there's a pydantic model ScorerColumns you could use to parse and validate these.
@jtschoonhoven I was not aware, but sounds good to me, excited to simplify+reuse. I'll update this when that gets merged.

Description
For older runnable call scorers that use LLM-as-judge: if the LLM output matches the shape we expect for the typed outputs (the scorer_* columns), populate those columns in the feedback row that is about to be inserted.
This should let runnable scorers start adopting the new typed columns while remaining backwards compatible with the scorers that already exist.
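A rough sketch of the intended behavior, for illustration only. This is not the code in this PR: the function name `derive_scorer_fields`, the `"output"` payload key, the value shapes checked by the validators, and the rule that explicit request values win over payload-derived ones are all assumptions. Only the six `scorer_*` column names come from the diff.

```python
from typing import Any, Callable


def _is_str_list(value: Any) -> bool:
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def _is_str_map(value: Any) -> bool:
    return isinstance(value, dict) and all(
        isinstance(k, str) and isinstance(v, str) for k, v in value.items()
    )


def _is_num_map(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, dict) and all(
        isinstance(k, str)
        and isinstance(v, (int, float))
        and not isinstance(v, bool)
        for k, v in value.items()
    )


# Assumed (hypothetical) shapes for each typed column.
VALIDATORS: dict[str, Callable[[Any], bool]] = {
    "scorer_tags": _is_str_list,
    "scorer_tag_reasons": _is_str_map,
    "scorer_tag_confidences": _is_num_map,
    "scorer_ratings": _is_num_map,
    "scorer_rating_reasons": _is_str_map,
    "scorer_rating_confidences": _is_num_map,
}


def derive_scorer_fields(
    request_fields: dict[str, Any], payload: dict[str, Any]
) -> dict[str, Any]:
    """Return scorer_* values for the feedback row.

    Values set explicitly on the request are kept as-is. Otherwise a value
    is copied from payload["output"] only if it passes the shape check;
    anything that does not match is skipped, so legacy payloads are untouched.
    """
    output = payload.get("output")
    candidates = output if isinstance(output, dict) else {}
    result: dict[str, Any] = {}
    for name, is_valid in VALIDATORS.items():
        explicit = request_fields.get(name)
        if explicit is not None:
            result[name] = explicit
        elif name in candidates and is_valid(candidates[name]):
            result[name] = candidates[name]
    return result
```

With this sketch, a legacy payload whose output is free text yields no typed columns, and a payload with a malformed `scorer_ratings` value populates only the fields that validate.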
Testing
Unit tests