feat(seer): Add lightweight RCA clustering endpoint integration#112229
Call Seer's new /v0/issues/supergroups/cluster-lightweight endpoint on new issue creation, gated per-org via sentry-options. This sends issue event data to Seer for lightweight root cause analysis and clustering into supergroups.

Also renames the existing explorer-based lightweight RCA files to explorer_lightweight_rca to avoid confusion with the new direct endpoint-based clustering approach.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
a949cd4 to 1b7c082
Serialized event data from EventSerializer can contain non-string dict keys (integer keys in _meta.entries). Without this option orjson.dumps raises TypeError.
Without this registration the task won't be discovered by the taskworker in production.
src/sentry/options/defaults.py
Outdated
    )

    # Supergroups / Lightweight RCA
    register(

does it make sense to duplicate the options between here and seer? i thought the original plan was to have seer do this check
yea I was conflicted about it. I think that now that I have options in Seer I can have protections there, but I also don't want to queue tasks for all issues for nothing, which seems very wasteful. And now I can basically have both controlled from the same repo, so I think it's ok to protect from both sides using the same config
we could just do the check only in sentry then? not sure it makes sense to duplicate the options, especially if the options have exactly the same name + purpose
This option is used on the Seer side, not in Sentry. Remove it until it's actually needed here.
The instrumented_task decorator requires `namespace` not `taskworker_namespace`, and doesn't accept `queue` or `max_retries`.
GroupEvent.group is typed as non-optional, so the None check is unreachable and mypy flags it.
…uster The org eligibility check is already done in the pipeline step before scheduling the task, so there's no need to check again in the function itself.
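To make the gating concrete, here is a minimal sketch of a pipeline step that checks eligibility once and only then schedules the task. The `Job` dataclass, the `enabled_org_ids` set, and the injected `dispatch` callable are hypothetical stand-ins for the post_process job, the sentry-option value, and the task's `.delay`; the real step in this PR may be shaped differently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    """Hypothetical stand-in for the post_process job payload."""

    organization_id: int
    group_id: int
    is_new: bool


def kick_off_lightweight_rca_cluster(
    job: Job,
    enabled_org_ids: set[int],
    dispatch: Callable[[int], None],
) -> bool:
    """Schedule clustering only for new issues in enabled orgs.

    Returns True when a task was scheduled, False when the job was skipped.
    """
    if not job.is_new:
        return False
    if job.organization_id not in enabled_org_ids:
        return False
    # Eligibility is decided here, so the task body does not need to re-check it.
    dispatch(job.group_id)
    return True


scheduled: list[int] = []
enabled = {1}

assert kick_off_lightweight_rca_cluster(Job(1, 10, True), enabled, scheduled.append)
assert not kick_off_lightweight_rca_cluster(Job(1, 11, False), enabled, scheduled.append)
assert not kick_off_lightweight_rca_cluster(Job(2, 12, True), enabled, scheduled.append)
print(scheduled)  # [10]
```

Keeping the check in the step means ineligible issues never reach the queue, which is the waste the author wanted to avoid.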
    class LightweightRCAClusterRequest(TypedDict):
        group_id: int
        issue: dict[str, Any]

    -    issue: dict[str, Any]
    +    group: dict[str, Any]

Seer has this thing where issue is the word used in APIs. I am mimicking the IssueSummary endpoint here, and it's in other places as well. The model used in code there is IssueDetails, so there's this weird thing where Issue is the model and group_id is the number it gets...
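For reference, a small sketch of how a body matching this TypedDict could be assembled. Only `group_id`, `issue`, and the `events` list (asserted on in the tests later in this PR) come from the diff; the `build_request` helper and the `id` and `title` keys are assumptions made for illustration.

```python
from typing import Any, TypedDict


class LightweightRCAClusterRequest(TypedDict):
    group_id: int
    issue: dict[str, Any]


def build_request(
    group_id: int, title: str, serialized_event: dict[str, Any]
) -> LightweightRCAClusterRequest:
    """Build the request body; 'id' and 'title' are assumed keys, not confirmed by the PR."""
    return {
        "group_id": group_id,
        "issue": {
            "id": group_id,
            "title": title,
            # A newly created issue carries a single event.
            "events": [serialized_event],
        },
    }


body = build_request(7, "TypeError in checkout", {"event_id": "abc"})
assert len(body["issue"]["events"]) == 1
```

The naming split discussed above shows up directly here: the top-level number is a `group_id`, while the nested payload is an `issue`.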
        namespace=issues_tasks,
    )
    def trigger_lightweight_rca_cluster_task(group_id: int, **kwargs) -> None:
        from sentry.seer.supergroups.lightweight_rca_cluster import trigger_lightweight_rca_cluster

does this import need to be in the func?
kick_off_seer_automation does it as well, I think it's to avoid circular imports
are you sure? kick_off_seer_automation is defined in post process and everything there has imports inside the functions, but all the other tasks in this folder don't follow that pattern
oh yea you're right, I had in mind that it's in a different place, removing it
            trigger_lightweight_rca_cluster(self.group)

    @patch("sentry.seer.supergroups.lightweight_rca_cluster.make_lightweight_rca_cluster_request")
    def test_passes_viewer_context(self, mock_request):

do we have this type of test on the other seer endpoints?
hmm not really, guess it doesn't make sense, I'll remove
    from sentry.testutils.cases import TestCase


    class TriggerLightweightRCAClusterTest(TestCase):

shouldn't we have a test for the end to end actual flow? e.g. that triggering an event that creates a new issue will cause trigger_lightweight_rca_cluster to be sent?
looked into it a bit, none of the other tests actually do this, as it forces you to run the task synchronously and no tests do that right now. gonna follow the established pattern of asserting on the delay call being made, and letting the task tests verify that it does what it needs
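The "assert on the delay call" pattern mentioned above can be sketched with the standard library alone. `FakeTask`, `cluster_task`, and `on_issue_created` are hypothetical stand-ins, not Sentry code; the point is only that `.delay` is replaced by a mock so nothing is enqueued, and the test checks how it was called.

```python
from unittest.mock import MagicMock, patch


class FakeTask:
    """Hypothetical stand-in for a task object exposing .delay()."""

    def delay(self, group_id: int) -> None:
        raise RuntimeError("would enqueue real work")


cluster_task = FakeTask()


def on_issue_created(group_id: int, is_new: bool) -> None:
    # Code under test: only new issues schedule the task.
    if is_new:
        cluster_task.delay(group_id)


def check_dispatch(is_new: bool) -> MagicMock:
    # Swap .delay for a mock, run the code under test, and hand the mock back
    # so the caller can assert on the recorded calls.
    with patch.object(cluster_task, "delay") as mock_delay:
        on_issue_created(group_id=42, is_new=is_new)
    return mock_delay


check_dispatch(is_new=True).assert_called_once_with(42)
check_dispatch(is_new=False).assert_not_called()
```

What the task does once it runs is then covered separately by the task's own tests, as described in the reply above.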
        "sentry.workflow_engine.tasks.cleanup",
        "sentry.tasks.seer.explorer_index",
        "sentry.tasks.seer.context_engine_index",
        "sentry.tasks.seer.lightweight_rca_cluster",

we should probably name this lightweight_rca_embedding or just lightweight rca?
it's like the command - to trigger clustering. I basically treat the task not as a task that generates lightweight RCA with clustering as a side effect, but as one that purposely triggers clustering, because before we didn't even save the lightweight RCA.
The endpoint I added is even called /cluster-lightweight, so that's like the command here, so I think the name fits. Does that make sense? I don't feel strongly about it though, just want it all to be coherent
maybe it should be more explicit like just "trigger_supergroup_clustering_lightweight"
i also don't feel too strongly, i think a slightly more consistent name would be 'lightweight rca embedding' / 'lightweight rca generation' but i think cluster also fits
I prefer cluster to both of these I think. embedding is kinda technical, it's not what the caller really intends, and generation is sort of inaccurate because of the way we set it up, where the point is to cluster by lightweight RCA, not to generate it. we didn't even save the generated RCA until we realized we need it for resummarization, so it's like a side effect now.
Just so we're all on the same page - I am just trying to be consistent with the way I phrased and treated it up until now. I even leaned in the direction of making it all about lightweight RCA generation with the clustering being a side effect, but went the other way around in the API and flow, so I think we should stay consistent.
        Sends issue event data to Seer, which generates a lightweight root cause analysis
        and clusters the issue into supergroups based on embedding similarity.
        """
        event = group.get_recommended_event_for_environments()

shouldn't we always only have one event for this group, since we run this when the issue is created? just confused on any situation where we'd have >1 event
yea probably no need for this protection, will change to get_latest_event()
        from sentry.seer.supergroups.lightweight_rca_cluster import trigger_lightweight_rca_cluster

        try:
            group = Group.objects.get(id=group_id)

don't we have the group in post_process when we pass it here? do we need another fetch?
it's standard for these tasks, you can't pass a model into them, they gotta refetch
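A sketch of the refetch pattern described in this thread, presumably needed because task arguments have to be serialized onto a queue. The in-memory `GROUPS` dict is a hypothetical stand-in for the database, and the skip-on-missing behavior is an assumption about how a deleted group would be handled.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class Group:
    id: int
    title: str


# Hypothetical in-memory store standing in for the database.
GROUPS: dict[int, Group] = {1: Group(id=1, title="TypeError in checkout")}


def trigger_cluster_task(group_id: int) -> Optional[str]:
    """Task body: receives a plain ID and reloads the row itself."""
    group = GROUPS.get(group_id)
    if group is None:
        # The group may have been deleted between scheduling and execution.
        logger.info("group %s not found, skipping", group_id)
        return None
    return f"clustered:{group.title}"


assert trigger_cluster_task(1) == "clustered:TypeError in checkout"
assert trigger_cluster_task(999) is None
```

The extra lookup costs one query, but the task also sees the current state of the group rather than a snapshot taken at scheduling time.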
    @instrumented_task(
        name="sentry.tasks.seer.lightweight_rca_cluster.trigger_lightweight_rca_cluster_task",
        namespace=issues_tasks,

not necessarily blocking, is this the right namespace? i see that issues_tasks is technically labeled for 'issueplatform', while ingest_errors_postprocess_tasks exists and might fit slightly better
- Remove viewer_context test (not tested on other seer endpoints)
- Switch to get_latest_event() since this runs on new issue creation
- Change task namespace to ingest_errors_postprocess_tasks
- Add post_process pipeline tests verifying task dispatch gating: dispatched when org enabled + new issue, skipped otherwise
No circular dependency exists, so the function-level import is unnecessary.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 87ecf74.
    @patch("sentry.tasks.seer.lightweight_rca_cluster.trigger_lightweight_rca_cluster_task.delay")
    def test_kick_off_lightweight_rca_cluster_skips_when_not_new(self, mock_task):
        event = self.create_event(

i'm not sure, but isn't there a helper for creating an event and triggering the entire event lifecycle, including post process? basically, is there a way we remove the call_post_process_group while having the option enabled and have this test pass?
if there is, I (and claude) couldn't find it. there's this, which saves a new event and triggers some stuff related to the group save but not post processing.
From what I can tell call_post_process_group is the way it's done in other tests
        assert len(body["issue"]["events"]) == 1

    @patch("sentry.seer.supergroups.lightweight_rca_cluster.make_lightweight_rca_cluster_request")
    def test_raises_on_seer_error(self, mock_request):

similar to other test, do seer callers have this sort of test?
hmm actually they don't. this was generated as a response to some check I made, though it's sort of a good verification that if the request fails we throw something. however it did make me notice that the task itself just does logger.exception and swallows the error, which I think I will change - I want the task to fail outright, I think we agreed on this already in some other place.
So I am adding a raise there. regarding this test, I feel like it's harmless, don't mind removing
    return make_signed_seer_api_request(
        seer_autofix_default_connection_pool,
        "/v0/issues/supergroups/cluster-lightweight",
        body=orjson.dumps(body, option=orjson.OPT_NON_STR_KEYS),

curious, what is this orjson option?
it's to allow dict keys that are non-string, in this case integers. from what I understand, when we send event data it contains these kinds of keys and it's required that we allow it. the issue summary endpoint does the same thing for the same reason I believe
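A short sketch of what the option changes. The payload is made up to resemble the `_meta.entries` case mentioned earlier in this PR, and `dumps_event_body` is a hypothetical helper. orjson is a third-party package, so the sketch falls back to the standard library's `json`, which coerces integer keys to strings by default, when orjson is not installed.

```python
import json

try:
    import orjson  # third-party: pip install orjson
except ImportError:
    orjson = None


def dumps_event_body(body: dict) -> bytes:
    """Serialize a payload whose nested dicts may have integer keys."""
    if orjson is not None:
        # Without OPT_NON_STR_KEYS, orjson.dumps raises TypeError on non-str keys.
        return orjson.dumps(body, option=orjson.OPT_NON_STR_KEYS)
    # Fallback only: the stdlib encoder already turns int keys into strings.
    return json.dumps(body, separators=(",", ":")).encode()


# Made-up payload with an integer key nested under _meta.entries.
body = {"_meta": {"entries": {0: {"len": 12}}}}
print(dumps_event_body(body))  # b'{"_meta":{"entries":{"0":{"len":12}}}}'
```

Either way the integer key ends up as the string `"0"` on the wire, so the receiving side sees ordinary JSON object keys.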
Log the exception for Sentry visibility, then re-raise so the task is marked as failed in monitoring.
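A minimal sketch of the log-then-re-raise shape this commit describes. `run_cluster_task`, the injected `send` callable, `SeerRequestError`, and the log message are hypothetical names used for illustration only.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


class SeerRequestError(Exception):
    """Hypothetical error type for a failed Seer call."""


def run_cluster_task(group_id: int, send: Callable[[int], None]) -> None:
    try:
        send(group_id)
    except Exception:
        # Record the traceback for visibility, then propagate so the task
        # runner sees a failure instead of a silent success.
        logger.exception("lightweight_rca_cluster.failed", extra={"group_id": group_id})
        raise


def failing_send(group_id: int) -> None:
    raise SeerRequestError("Seer returned 500")


try:
    run_cluster_task(1, failing_send)
except SeerRequestError:
    print("task failed as expected")
```

A bare `raise` inside the `except` block keeps the original exception and traceback intact, which is what makes the failure visible to monitoring.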
…ight-rca-cluster # Conflicts: # src/sentry/seer/signed_seer_api.py
This task makes an external API call with a 30s timeout, so it shouldn't run on the postprocess worker pool. Use ingest_errors_tasks to match generate_summary_and_run_automation which has the same dispatch pattern.
…ting (#112436)

Add `rca_source` parameter to supergroup query APIs so Seer knows which embedding space (explorer vs lightweight) to query from. The source is determined by the `organizations:supergroups-lightweight-rca-clustering` feature flag.

Also replaces the `supergroups.lightweight-enabled-orgs` sentry-option with this feature flag for both:
- **Write path**: post_process task dispatch for lightweight RCA clustering
- **Read path**: supergroup query endpoints (details + by-group)

This is consistent with how all other supergroup features are gated (via feature flags, not options).

Depends on #112229 (merged).
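As an illustration of the follow-up's read path, here is a sketch of deriving `rca_source` from the feature flag. The flag name is taken from the commit message; the literal values `"explorer"` and `"lightweight"`, the `enabled_flags` set, and the `build_supergroup_query` helper are assumptions, not the actual Sentry or Seer API.

```python
from typing import Literal

RCASource = Literal["explorer", "lightweight"]

LIGHTWEIGHT_FLAG = "organizations:supergroups-lightweight-rca-clustering"


def resolve_rca_source(enabled_flags: set[str]) -> RCASource:
    """Pick the embedding space to query based on the org's enabled flags."""
    return "lightweight" if LIGHTWEIGHT_FLAG in enabled_flags else "explorer"


def build_supergroup_query(group_id: int, enabled_flags: set[str]) -> dict:
    # The "group_id" field name is assumed for illustration.
    return {"group_id": group_id, "rca_source": resolve_rca_source(enabled_flags)}


print(build_supergroup_query(7, {LIGHTWEIGHT_FLAG}))
# {'group_id': 7, 'rca_source': 'lightweight'}
```

Resolving the source from the same flag on both paths keeps writes and reads pointed at the same embedding space for a given org.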
) Integrate Seer's new `/v0/issues/supergroups/cluster-lightweight` endpoint for lightweight root cause analysis and supergroup clustering.

When a new error issue is created, if the org is in the `supergroups.lightweight-enabled-orgs` sentry-option, we send the issue's event data to Seer. Seer generates a lightweight RCA via a single LLM call and clusters the issue into supergroups based on embedding similarity. This is separate from the existing Explorer-based agentic RCA flow.

**Changes:**
- Register `supergroups.active-rca-source` and `supergroups.lightweight-enabled-orgs` sentry-options
- Add `LightweightRCAClusterRequest` type and `make_lightweight_rca_cluster_request()` API function
- Add `trigger_lightweight_rca_cluster()` core function and Celery task
- Add `kick_off_lightweight_rca_cluster` pipeline step in post_process for new error issues
- Rename existing `lightweight_rca.py` → `explorer_lightweight_rca.py` to clarify it's the Explorer-based flow

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
