Commit 8316bcc

Merge branch 'master' into feat/unity-metrics
2 parents 23bab8e + e434a0f commit 8316bcc

9,883 files changed

Lines changed: 301018 additions & 156136 deletions

.agents/skills/design-system/SKILL.md

Lines changed: 499 additions & 0 deletions
Large diffs are not rendered by default.

.agents/skills/generate-frontend-forms/SKILL.md

Lines changed: 963 additions & 0 deletions
Large diffs are not rendered by default.
File renamed without changes.

.agents/skills/hybrid-cloud-outboxes/SKILL.md

Lines changed: 404 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 162 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,162 @@
1+
# Outbox Backfill Reference
2+
3+
## Overview
4+
5+
When a model is migrated to use outboxes (or its replication logic changes), existing rows need outboxes created retroactively. The backfill system does this incrementally: it processes rows in batches, tracks the cursor position in Redis, and gates the target version through the Sentry options system.
6+
7+
**Source file**: `src/sentry/hybridcloud/tasks/backfill_outboxes.py`
8+
9+
## `replication_version` Mechanism
10+
11+
Every `CellOutboxProducingModel` and `ControlOutboxProducingModel` has a class variable:
12+
13+
```python
replication_version: int = 1  # Default
```
16+
17+
Two systems work together to control backfills:
18+
19+
1. **Sentry options** — gate the effective replication version (controls _whether_ a backfill runs)
20+
2. **Redis cursor** — track backfill progress as `(lower_bound_id, current_version)` (controls _where_ a backfill resumes)
21+
22+
### Version Resolution via Options
23+
24+
`find_replication_version()` determines the effective target version:
25+
26+
```python
from sentry import options


def find_replication_version(model, force_synchronous=False) -> int:
    coded_version = model.replication_version
    if force_synchronous:
        # Self-hosted path: ignore the option and trust the coded version.
        return coded_version
    model_key = f"outbox_replication.{model._meta.db_table}.replication_version"
    # The option can hold the version back but never push it past the code.
    return min(options.get(model_key), coded_version)
```
34+
35+
The effective version is `min(option_value, coded_version)`. This means:
36+
37+
- If the option is **not set or set lower** than the coded version, the backfill won't advance to the new version
37+
- If the option is **set equal to or higher** than the coded version, the coded version is used
39+
- If `force_synchronous=True` (self-hosted), the option is bypassed entirely
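
The three cases above can be checked with a small stand-in that needs no Sentry install. This is a sketch under stated assumptions, not the real implementation: `effective_version` and the plain-dict `option_store` are hypothetical, and the fallback of `1` for an unset option is assumed rather than taken from the options registry.

```python
def effective_version(coded_version, option_store, table,
                      force_synchronous=False, default=1):
    """Return the replication version a backfill should target.

    option_store is a plain dict standing in for the options system
    (hypothetical); default is an assumed value for an unset option.
    """
    if force_synchronous:
        # Self-hosted path: the option is bypassed entirely.
        return coded_version
    key = f"outbox_replication.{table}.replication_version"
    return min(option_store.get(key, default), coded_version)


table = "sentry_organizationmember"
key = f"outbox_replication.{table}.replication_version"
store = {}

# Code was bumped to 2 but the option is still unset: no advance.
before_option = effective_version(2, store, table)

# Option raised to match the code: the new version takes effect.
store[key] = 2
after_option = effective_version(2, store, table)

# Option set above the code: the coded version caps the result.
store[key] = 5
capped = effective_version(2, store, table)

# Synchronous mode ignores the (empty) option store.
forced = effective_version(2, {}, table, force_synchronous=True)
```

Because the result is a `min`, lowering the option is enough to stop a backfill from targeting a newer version without reverting code.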
40+
41+
### Cursor Tracking via Redis
42+
43+
Redis tracks `(lower_bound_id, current_version)` per model table:
44+
45+
```python
# Key format:
f"outbox_backfill.{model._meta.db_table}"

# Value: JSON-encoded tuple of (lower_bound_id, current_version)
```
51+
52+
`_chunk_processing_batch()` compares the Redis cursor's `version` against the options-resolved `target_version`:
53+
54+
- If `version > target_version`: backfill already complete, skip
55+
- If `version < target_version`: new version detected, reset cursor to 0 and start fresh
56+
- If `version == target_version`: continue from where we left off
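
The three-way comparison can be written as a pure function. This is a minimal sketch of the decision only; `next_cursor_action` and its string labels are hypothetical and do not appear in `backfill_outboxes.py`.

```python
def next_cursor_action(cursor, target_version):
    """Decide what a backfill cycle should do given the stored cursor.

    cursor is (lower_bound_id, current_version) as stored in Redis.
    Returns (action, cursor_to_use).
    """
    lower_bound, version = cursor
    if version > target_version:
        # The cursor already moved past the target: nothing to do.
        return "skip", cursor
    if version < target_version:
        # A newer target appeared: start over from the first row.
        return "restart", (0, target_version)
    # Same version: resume from the stored lower bound.
    return "continue", cursor


done = next_cursor_action((0, 3), target_version=2)
bumped = next_cursor_action((500, 1), target_version=2)
resumed = next_cursor_action((500, 2), target_version=2)
```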
57+
58+
**To trigger a backfill**: Bump `replication_version` on the model class:
59+
60+
```python
class MyModel(ReplicatedCellModel):
    replication_version = 2  # Was 1; bumping triggers backfill
```
64+
65+
## SaaS vs Self-Hosted Rollout
66+
67+
### SaaS (Gradual Rollout via Options)
68+
69+
The option key format is:
70+
71+
```python
f"outbox_replication.{model._meta.db_table}.replication_version"

# Example for OrganizationMember:
"outbox_replication.sentry_organizationmember.replication_version"
```
77+
78+
**Rollout procedure:**
79+
80+
1. Merge the code change with bumped `replication_version`
81+
2. At this point, `min(option_value, coded_version)` still returns the old version — no backfill runs yet
82+
3. Set the option to the new version value in the Sentry options system
83+
4. Now `min(option_value, coded_version)` returns the new version — backfill starts on the next `enqueue_outbox_jobs` cycle
84+
5. Monitor via Redis cursor state and task metrics
85+
86+
This two-step process allows deploying code first, then enabling the backfill separately — useful for coordinating with other changes or rolling back quickly by lowering the option.
87+
88+
### Self-Hosted (Synchronous)
89+
90+
On self-hosted instances, backfills run synchronously during `sentry upgrade` via the `run_outbox_replications_for_self_hosted` function (connected to the `post_upgrade` signal). This function:
91+
92+
1. Calls `backfill_outboxes_for(force_synchronous=True)` — bypasses options, uses `model.replication_version` directly
93+
2. Drains all pending outbox shards
94+
3. Ensures the instance is fully caught up after every upgrade
95+
96+
## Redis Cursor State Transitions
97+
98+
1. **Initial**: `(0, 1)` — no backfill has run (created on first `get_processing_state` call)
99+
2. **In progress**: `(last_processed_id + 1, target_version)` — backfill is processing rows
100+
3. **Complete**: `(0, replication_version + 1)` — all rows processed, version advanced past target
101+
4. **New version detected**: cursor resets to `(0, new_target_version)` and starts from the beginning
102+
103+
## Batch Processing
104+
105+
```python
OUTBOX_BACKFILLS_PER_MINUTE = 10_000
```
108+
109+
Each batch (via `process_outbox_backfill_batch`):
110+
111+
1. Calls `_chunk_processing_batch` to determine the ID range `(low, up)` for this batch
112+
2. For each instance in `model.objects.filter(id__gte=low, id__lte=up)`:
113+
- Region models: `inst.outbox_for_update().save()` inside `outbox_context(flush=False)`
114+
- Control models: saves all `inst.outboxes_for_update()` inside `outbox_context(flush=False)`
115+
3. If no more rows: sets cursor to `(0, replication_version + 1)` (marks complete)
116+
4. Otherwise: advances cursor to `(up + 1, version)`
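
The cursor movement in steps 1 to 4 can be simulated over a list of row IDs. This sketch is illustrative only: `run_backfill` is a hypothetical helper, it picks `up` as the last ID in a fixed-size batch (the real `_chunk_processing_batch` may choose the range differently), and it ignores rate limiting and the actual outbox saves.

```python
def run_backfill(row_ids, replication_version, batch_size):
    """Walk a cursor over row ids, recording every cursor state visited."""
    ordered = sorted(row_ids)
    cursor = (0, replication_version)
    history = [cursor]
    backfilled = []
    # Keep going until the cursor's version moves past the target.
    while cursor[1] == replication_version:
        low, version = cursor
        batch = [row_id for row_id in ordered if row_id >= low][:batch_size]
        if not batch:
            # No rows left: mark complete by advancing the version.
            cursor = (0, replication_version + 1)
        else:
            up = batch[-1]
            backfilled.extend(batch)  # stand-in for saving outboxes
            cursor = (up + 1, version)
        history.append(cursor)
    return history, backfilled


history, backfilled = run_backfill(
    [1, 2, 5, 9, 10], replication_version=2, batch_size=2
)
```

The final state in `history` has a lower bound of `0` and a version one above the target, which is the "complete" marker described in the state transitions.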
117+
118+
Rate is limited by `OUTBOX_BACKFILLS_PER_MINUTE` adjusted by the count of already-scheduled outboxes. The `backfill_outboxes_for` function iterates all registered models and processes batches until the rate limit is reached.
119+
120+
## Monitoring a Backfill
121+
122+
### Check Redis Cursor State
123+
124+
```python
from sentry.hybridcloud.tasks.backfill_outboxes import get_processing_state

lower_bound, version = get_processing_state("sentry_mymodel")
# lower_bound > 0 means backfill is in progress
# version == model.replication_version + 1 means backfill is complete
```
131+
132+
### Check Option Value
133+
134+
```python
from sentry import options

# See what version the option is gating to:
options.get("outbox_replication.sentry_mymodel.replication_version")
```
140+
141+
### Check Outbox Queue Depth
142+
143+
```sql
-- Region outboxes for a specific category
SELECT count(*) FROM sentry_regionoutbox
WHERE category = <category_value>;

-- Top shards by depth
SELECT shard_scope, shard_identifier, count(*) AS depth
FROM sentry_regionoutbox
GROUP BY shard_scope, shard_identifier
ORDER BY depth DESC
LIMIT 10;
```
155+
156+
### Metrics
157+
158+
- `backfill_outboxes.low_bound` — gauge of the current cursor position per table
159+
- `backfill_outboxes.backfilled` — counter of rows backfilled per cycle
160+
- `outbox.saved` — counter incremented each time an outbox is saved
161+
- `outbox.processed` — counter incremented each time a coalesced outbox is processed
162+
- `outbox.processing_lag` — histogram of time from outbox creation to processing
Lines changed: 158 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,158 @@
1+
# OutboxCategory and OutboxScope Reference
2+
3+
## Overview
4+
5+
Every outbox message has a **category** (what kind of change) and a **scope** (how it's sharded). Categories are members of the `OutboxCategory` IntEnum; scopes are members of `OutboxScope`. Each category must be registered to exactly one scope — an assertion at import time enforces this.
6+
7+
**Source file**: `src/sentry/hybridcloud/outbox/category.py`
8+
9+
## Scope-to-Category Mapping
10+
11+
Scope-to-category mappings are defined in `src/sentry/hybridcloud/outbox/category.py`.
12+
13+
When selecting a scope to use, consider which other operations the target outbox depends on.
14+
15+
### Retired Categories and Scopes
16+
17+
Categories and scopes should never be deleted. If a category is to be retired, simply add an inline comment denoting it as no longer in use.
18+
19+
If a scope is to be retired, remove all categories from its nested definition, and denote that it's no longer in use with a comment above the list.
20+
21+
## Sharding Pitfalls
22+
23+
Understanding how shards interact with processing is critical to choosing the right scope. Getting it wrong causes subtle, hard-to-diagnose production issues.
24+
25+
### Head-of-Line Blocking
26+
27+
A shard is processed **sequentially** — every category sharing the same `(scope, shard_identifier)` sits in one queue. If a handler for one category fails, **all other categories in that shard enter backoff together**. The entire shard's `scheduled_for` is bumped, not just the failing message's.
28+
29+
**Example**: `ORGANIZATION_SCOPE` groups ~21 categories per org. If the `AUTH_PROVIDER_UPDATE` handler crashes for org 42, then `ORGANIZATION_MEMBER_UPDATE`, `PROJECT_UPDATE`, and all other org-42 categories are blocked until the backoff expires and the failing handler either succeeds or is fixed.
30+
31+
This is why high-volume or failure-prone operations sometimes get their own dedicated scope (e.g., `AUDIT_LOG_SCOPE` and `USER_IP_SCOPE` are separate from `ORGANIZATION_SCOPE` and `USER_SCOPE` respectively) — isolating them prevents their failures from blocking unrelated replication work.
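
A toy model of one shard makes the blocking behavior concrete. Everything here is an assumption made for illustration: `drain_shard`, the handler functions, and the five-minute backoff are hypothetical, and the real processor's scheduling is more involved.

```python
from datetime import datetime, timedelta


def drain_shard(messages, handlers, now, backoff=timedelta(minutes=5)):
    """Process one shard's messages strictly in order.

    Returns (processed_categories, remaining_messages, scheduled_for).
    On the first handler failure the whole shard is rescheduled, so
    every later message waits regardless of its category.
    """
    processed = []
    for index, (category, payload) in enumerate(messages):
        try:
            handlers[category](payload)
        except Exception:
            return processed, messages[index:], now + backoff
        processed.append(category)
    return processed, [], now


def ok(payload):
    return None


def broken(payload):
    raise RuntimeError("handler crashed")


handlers = {
    "ORGANIZATION_MEMBER_UPDATE": ok,
    "AUTH_PROVIDER_UPDATE": broken,
    "PROJECT_UPDATE": ok,
}
org_42_shard = [
    ("ORGANIZATION_MEMBER_UPDATE", {"id": 1}),
    ("AUTH_PROVIDER_UPDATE", {"id": 2}),
    ("PROJECT_UPDATE", {"id": 3}),
]
now = datetime(2024, 1, 1)
processed, remaining, scheduled_for = drain_shard(org_42_shard, handlers, now)
```

`PROJECT_UPDATE` has a working handler but still ends up in `remaining`, because it sits behind the failing message in the same shard.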
32+
33+
### Harmful Coalescing
34+
35+
Outboxes with the same `(scope, shard_identifier, category, object_identifier)` are **coalesced**: only the row with the highest ID is processed, all others are deleted. This is correct for "latest state wins" replication (model sync) but destructive for event-style data where every occurrence matters.
36+
37+
**Bad**: Using a single category for audit log events with `object_identifier = org_id`. Multiple audit events for the same org would coalesce to just the latest one — losing audit history.
38+
39+
**Good**: `AUDIT_LOG_EVENT` uses its own scope and carries all data in the payload. Each event gets a unique `object_identifier` (or the coalescing is harmless because the payload is self-contained).
40+
41+
**Rule**: If every individual outbox message matters (not just the latest), either ensure `object_identifier` is unique per message, or use a payload-only pattern where coalescing the envelope is harmless because the signal receiver reads the payload, not the DB row.
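
The effect of the coalescing key can be seen with a few in-memory rows. This is a sketch under assumptions: the `Outbox` tuple and `coalesce` function are hypothetical stand-ins, and the identifier values are invented for the example.

```python
from collections import namedtuple

Outbox = namedtuple(
    "Outbox", "id scope shard_identifier category object_identifier payload"
)


def coalesce(outboxes):
    """Keep only the highest-id row for each coalescing key."""
    latest = {}
    for box in outboxes:
        key = (box.scope, box.shard_identifier, box.category, box.object_identifier)
        if key not in latest or box.id > latest[key].id:
            latest[key] = box
    return sorted(latest.values(), key=lambda box: box.id)


# Bad: both audit events share object_identifier = org id, so one is lost.
shared_key = [
    Outbox(1, "AUDIT_LOG_SCOPE", 42, "AUDIT_LOG_EVENT", 42, "login"),
    Outbox(2, "AUDIT_LOG_SCOPE", 42, "AUDIT_LOG_EVENT", 42, "role change"),
]

# Good: each event has its own object_identifier, so both survive.
unique_key = [
    Outbox(1, "AUDIT_LOG_SCOPE", 42, "AUDIT_LOG_EVENT", 1001, "login"),
    Outbox(2, "AUDIT_LOG_SCOPE", 42, "AUDIT_LOG_EVENT", 1002, "role change"),
]

lossy = coalesce(shared_key)
lossless = coalesce(unique_key)
```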
42+
43+
### Hot Shards
44+
45+
A "hot shard" is a single `(scope, shard_identifier)` with a disproportionate number of pending outboxes. Since one shard is processed sequentially, a hot shard becomes a bottleneck.
46+
47+
**Causes**:
48+
49+
- A large org with frequent updates across many categories in `ORGANIZATION_SCOPE`
50+
- A backfill that generates thousands of outboxes for a single shard
51+
- A handler that's slow (network calls, large queries), causing the shard to grow faster than it drains
52+
53+
**Mitigation**: The system has `should_skip_shard()` kill switches for disabling specific org/user shards, and the `get_shard_depths_descending()` method helps identify hot shards. The best fix, though, is choosing a scope with the right granularity; see "When to Create a New Scope vs Reuse an Existing One" below.
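
Ranking shards by depth is a simple count over pending rows. The helper below is a hypothetical in-memory illustration of that idea; it is not the `get_shard_depths_descending()` method, whose signature is not shown in this document.

```python
from collections import Counter


def rank_shards_by_depth(pending, limit=10):
    """Count pending outboxes per (scope, shard_identifier), deepest first."""
    depths = Counter(pending)
    return depths.most_common(limit)


# Each entry is the (scope, shard_identifier) of one pending outbox.
pending = (
    [("ORGANIZATION_SCOPE", 42)] * 3
    + [("USER_SCOPE", 7)]
    + [("ORGANIZATION_SCOPE", 9)] * 2
)
hottest = rank_shards_by_depth(pending, limit=2)
```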
54+
55+
### Wrong Shard Key
56+
57+
If your model's natural grouping doesn't match the scope's shard key, you get either unnecessary contention or broken ordering guarantees.
58+
59+
**Example**: Putting an integration-scoped model under `ORGANIZATION_SCOPE` means all integration changes for an org share a shard with org member updates, project updates, etc. — contention with no benefit. Worse, if the model doesn't have an `organization_id` at all, `infer_identifiers()` will fail at runtime.
60+
61+
## When to Create a New Category
62+
63+
**Always create a new category** when:
64+
65+
- You have a new model inheriting from `ReplicatedCellModel` or `ReplicatedControlModel`
66+
- You have a new type of event/signal that needs outbox delivery
67+
- The handler logic is distinct from all existing categories
68+
69+
**Do not reuse** an existing category for a different model or operation. Categories map 1:1 to signal receivers — reusing means both models' changes trigger the same handler.
70+
71+
## When to Create a New Scope vs Reuse an Existing One
72+
73+
**Reuse an existing scope** when:
74+
75+
- Your model naturally keys on the same identifier (e.g., has `organization_id` → use `ORGANIZATION_SCOPE`)
76+
- Head-of-line blocking with the other categories in that scope is acceptable (i.e., your handler is reliable and fast)
77+
- Coalescing with the existing shard granularity makes sense for your data
78+
79+
**Create a new scope** when:
80+
81+
- Your model's natural key doesn't match any existing scope (e.g., keyed on `integration_id` before `INTEGRATION_SCOPE` existed)
82+
- Your handler is high-volume or failure-prone, and blocking other categories is unacceptable
83+
- Your operation is event-style (every message matters) and you need isolation from "latest state wins" categories
84+
- You need a different shard key granularity (e.g., per-token rather than per-org)
85+
86+
**Examples of good scope isolation decisions**:
87+
88+
- `AUDIT_LOG_SCOPE` — high-volume, every event matters, failures shouldn't block org replication
89+
- `USER_IP_SCOPE` — very high-volume fire-and-forget, isolates from user profile replication
90+
- `PROVISION_SCOPE` — rare but critical, isolates from general org updates to avoid head-of-line blocking during provisioning
91+
- `API_TOKEN_SCOPE` — tokens aren't org-scoped or user-scoped in a way that fits existing scopes
92+
93+
**Rule of thumb**: Start with an existing scope that matches your shard key. Only create a new scope if you have a concrete concern about head-of-line blocking, harmful coalescing, or hot shards. Unnecessary scope proliferation adds operational complexity (more shards to monitor, more code paths to maintain).
94+
95+
## How to Pick a Scope
96+
97+
**Rules:**
98+
99+
1. If your model has an `organization_id` (or IS an Organization), use `ORGANIZATION_SCOPE`
100+
2. If your model has a `user_id` (or IS a User) and no org context, use `USER_SCOPE`
101+
3. If your model has an `integration_id`, use `INTEGRATION_SCOPE`
102+
4. If your model has an `api_application_id` or is a SentryApp, use `APP_SCOPE`
103+
5. If none of the above fit, or you have a concrete isolation concern (see above), create a new scope
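
The rules above are ordered, and the ordering matters when a model has several candidate fields. A minimal sketch of that precedence follows; `pick_scope` and its boolean flags are hypothetical, and it deliberately ignores the isolation concerns covered in the previous section.

```python
def pick_scope(field_names, is_organization=False, is_user=False,
               is_sentry_app=False):
    """Apply rules 1 to 4 in order; None means no existing scope fits."""
    fields = set(field_names)
    if is_organization or "organization_id" in fields:
        return "ORGANIZATION_SCOPE"
    # Reached only when there is no org context (rule 2).
    if is_user or "user_id" in fields:
        return "USER_SCOPE"
    if "integration_id" in fields:
        return "INTEGRATION_SCOPE"
    if is_sentry_app or "api_application_id" in fields:
        return "APP_SCOPE"
    return None  # Rule 5: consider creating a new scope.


member_scope = pick_scope(["id", "organization_id", "user_id"])
profile_scope = pick_scope(["id", "user_id"])
integration_scope = pick_scope(["id", "integration_id"])
unmatched = pick_scope(["id", "token_id"])
```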
104+
105+
The `infer_identifiers()` function in `category.py` auto-detects `shard_identifier` and `object_identifier` from model attributes based on the scope. Check its implementation to understand what field names it looks for.
106+
107+
## Registration Mechanics
108+
109+
### Adding a New Category
110+
111+
1. Add a new member to `OutboxCategory` with the next available integer value
112+
2. Add the category to the appropriate `OutboxScope` member's `scope_categories()` call
113+
3. The `scope_categories()` helper asserts no category is registered twice
114+
115+
```python
# In OutboxCategory enum:
MY_NEW_CATEGORY = 45  # Next available value

# In OutboxScope enum, add to the appropriate scope:
ORGANIZATION_SCOPE = scope_categories(0, {
    OutboxCategory.ORGANIZATION_UPDATE,
    # ... existing categories ...
    OutboxCategory.MY_NEW_CATEGORY,  # Add here
})
```
126+
127+
### Adding a New Scope
128+
129+
```python
# In OutboxScope enum:
MY_NEW_SCOPE = scope_categories(13, {  # Next available integer
    OutboxCategory.MY_NEW_CATEGORY,
})
```
135+
136+
Then update `infer_identifiers()` to handle the new scope — add a branch that maps the scope to the correct model attribute for `shard_identifier`.
137+
138+
### Retiring a Category
139+
140+
Categories that are no longer in use should:
141+
142+
1. Keep their enum value (never reuse integer values)
143+
2. Add a `# no longer in use` comment
144+
3. Stay in their `OutboxScope` registration (removing causes assertion failures for in-flight outboxes)
145+
146+
## Identifier Inference
147+
148+
`OutboxCategory.infer_identifiers(scope, model)` auto-detects identifiers by scope:
149+
150+
| Scope                | `shard_identifier` source                                             | `object_identifier` source |
| -------------------- | --------------------------------------------------------------------- | -------------------------- |
| `ORGANIZATION_SCOPE` | `model.organization_id` or `model.id` (if model IS Organization)      | `model.id`                 |
| `USER_SCOPE`         | `model.user_id` or `model.id` (if model IS User)                      | `model.id`                 |
| `INTEGRATION_SCOPE`  | `model.integration_id`                                                | `model.id`                 |
| `APP_SCOPE`          | `model.api_application_id` or `model.id` (if model IS ApiApplication) | `model.id`                 |
| `API_TOKEN_SCOPE`    | `model.api_token_id` or `model.id`                                    | `model.id`                 |
158+
If inference fails (model doesn't have the expected attribute), pass `shard_identifier` explicitly to `outbox_for_update()`.
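
The lookup in the table, including the explicit override, can be sketched as follows. This is a simplified illustration, not the code in `category.py`: `infer_ids`, the `is_scope_owner` flag (standing in for the "model IS ..." checks), and the `SimpleNamespace` models are all hypothetical.

```python
from types import SimpleNamespace

# Attribute consulted for shard_identifier, per the table above.
SHARD_FIELDS = {
    "ORGANIZATION_SCOPE": "organization_id",
    "USER_SCOPE": "user_id",
    "INTEGRATION_SCOPE": "integration_id",
    "APP_SCOPE": "api_application_id",
    "API_TOKEN_SCOPE": "api_token_id",
}


def infer_ids(scope, model, is_scope_owner=False, shard_identifier=None):
    """Return (shard_identifier, object_identifier) for a model."""
    if shard_identifier is None:
        if is_scope_owner:
            # e.g. the model IS the Organization for ORGANIZATION_SCOPE.
            shard_identifier = model.id
        else:
            shard_identifier = getattr(model, SHARD_FIELDS[scope], None)
    if shard_identifier is None:
        raise ValueError(f"cannot infer shard_identifier for {scope}")
    return shard_identifier, model.id


member = SimpleNamespace(id=7, organization_id=42)
org = SimpleNamespace(id=42)
orphan = SimpleNamespace(id=3)  # no organization_id attribute

member_ids = infer_ids("ORGANIZATION_SCOPE", member)
org_ids = infer_ids("ORGANIZATION_SCOPE", org, is_scope_owner=True)
explicit_ids = infer_ids("ORGANIZATION_SCOPE", orphan, shard_identifier=99)

try:
    infer_ids("ORGANIZATION_SCOPE", orphan)
    inference_failed = False
except ValueError:
    inference_failed = True
```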
