
Commit 5fb8214: rfc(informational): Escalating Issues (#78)

1 parent 56c6bcb commit 5fb8214

2 files changed: 106 additions & 0 deletions

README.md (1 addition & 0 deletions)

```diff
@@ -35,3 +35,4 @@ This repository contains RFCs and DACIs. Lost?
 - [0070-document-sensitive-data-collected](text/0070-document-sensitive-data-collected.md): Document sensitive data collected
 - [0071-continue-trace-over-process-boundaries](text/0071-continue-trace-over-process-boundaries.md): Continue trace over process boundaries
 - [0072-kafka-schema-registry](text/0072-kafka-schema-registry.md): Kafka Schema Registry
+- [0078-escalating-issues](text/0078-escalating-issues.md): Escalating Issues
```

text/0078-escalating-issues.md

Lines changed: 105 additions & 0 deletions
# Escalating Issues

- Start Date: 2023-03-06
- RFC Type: informational
- RFC PR: https://github.com/getsentry/rfcs/pull/78
- RFC Status: active
## Summary

Allow customers to mark issues as archived-until-escalating. Issues marked as such will not be shown in the issue stream. When an issue starts escalating, we will mark it as escalating and let it show up again in the issue stream (similar to issues going from ignored to unignored or from resolved to unresolved). This removes the need to do mental math to set an “Ignore until X count is reached” to clear the For Review tab.

This is a simpler version of the original tech spec, which you can read [here](https://www.notion.so/sentry/Tech-Spec-Escalating-issues-4e8cad11598f4c779407ca50bbe33e14?pvs=4) (internal link).

## Motivation

It makes it easier for customers to move known issues out of the Issue Stream without losing the ability to become aware when an issue starts getting worse.

## Background

The Sentry Issue Stream is often laden with issues that developers cannot actually fix, or issues that occur for very few users and are not a priority to fix.

However, developers do not ignore such issues because they worry they will miss escalations, and using Ignore Until involves a bit of mathematical juggling that most developers don’t wish to spend time on, especially given that high-traffic apps see a lot of issues come up.

We want to help developers identify an issue that is escalating.

## Supporting Data

Several Sentry customers ingest errors belonging to issues that have been ongoing for a month or longer. This leads to the issue stream showing many old issues.

## Option

Allow customers to mark issues as archived-until-escalating. Issues marked as such will not be shown in the issue stream. When an issue starts escalating, we will mark it as escalating and let it show up again in the issue stream (similar to issues going from ignored to unignored or from resolved to unresolved). This removes the need to do mental math to set an “Ignore until X count is reached” to clear the For Review tab.

![Workflow for user request](https://user-images.githubusercontent.com/44410/223194372-a1bfe61b-2e32-4279-9f02-6fd4605f0ad2.png) *Caption: Workflow for generating initial issue forecast*

Issue forecasts will be produced with the data team’s algorithm ([internal link](https://github.com/getsentry/data-analysis/tree/spike_protection) to the repo; [page](https://www.notion.so/Issue-Spiking-Algorithm-9c7be98895574f3b98c991deb0bbed9e) about the design of the algorithm). This algorithm can handle spiking and bursty issues. For V1 we will create a periodic task that queries for issues marked as archived-until-escalating and generates the forecasts. We may be able to use the same cron as the weekly email report, but we can’t adapt its queries for this work.

To be clear: every time an issue is marked as archived-until-escalating we will create an initial forecast, while the cron task will focus on updating the forecasts.
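
The two entry points (initial forecast on archive, periodic refresh) could be sketched as below. This is an illustrative sketch only: the function names are hypothetical, an in-memory dict stands in for real forecast storage, and the flat "busiest observed day" ceiling is a placeholder, not the data team's actual algorithm.

```python
from datetime import date, timedelta

def generate_forecast(hourly_counts, horizon_days=14):
    """Placeholder for the data team's forecasting algorithm: derive a
    daily ceiling for the next `horizon_days` from ~7 days of hourly counts."""
    # Roll hourly counts up into daily totals (24 buckets per day).
    daily = [sum(hourly_counts[i:i + 24]) for i in range(0, len(hourly_counts), 24)]
    ceiling = max(daily) if daily else 0
    start = date.today()
    return {str(start + timedelta(days=d)): ceiling for d in range(horizon_days)}

def on_archived_until_escalating(group_id, counts_by_group, forecast_store):
    """Create the initial forecast the moment an issue is archived."""
    forecast_store[group_id] = generate_forecast(counts_by_group.get(group_id, []))

def refresh_forecasts(forecast_store, counts_by_group):
    """Periodic (cron) task: update the forecast for every issue
    currently marked archived-until-escalating."""
    for group_id in forecast_store:
        forecast_store[group_id] = generate_forecast(counts_by_group.get(group_id, []))
```

The split mirrors the text: archiving an issue seeds its forecast immediately, and the cron only has to keep existing forecasts fresh.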

<img src="https://user-images.githubusercontent.com/44410/223194435-6ebd2337-19b2-45c7-85f0-137a1e7f2b0a.png" alt="Escalating Issues" width="360" height="180" border="10" />

*Caption: This image visualizes an escalating issue.*

<img src="https://user-images.githubusercontent.com/44410/223194456-3bf053be-fafc-4cc9-bb5c-c98101625e88.png" alt="Bursty Issues" width="360" height="180" border="10" />

*Caption: This image visualizes a bursty issue.*

For the pipeline to determine whether an issue needs to be marked as escalating, we need to evaluate the total count of events for the day as the events come in. We will compare day to day (e.g. Monday to Monday), using the cached forecast for the issue as the ceiling to blow through. The data team has produced an algorithm that can generate the forecast (see link above).
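
That check reduces to a comparison against the cached ceiling for today's date. A minimal sketch, assuming the cached forecast is a date-to-ceiling mapping like the storage format below (the function name and missing-date fallback are assumptions, not the shipped logic):

```python
def is_escalating(todays_count, forecast, today):
    """Compare today's running event count against the cached forecast
    ceiling for the same date; the issue escalates once it blows through."""
    ceiling = forecast.get(today)
    if ceiling is None:
        # No cached ceiling for this date; don't escalate on missing data.
        return False
    return todays_count > ceiling
```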

![image](https://user-images.githubusercontent.com/44410/223196089-c78aa64d-68a0-4fc9-8bc9-790b66b8c304.png)

*Caption: Periodic workflow to generate new forecasts.*

A forecast will be produced by looking at the last 7 days of data and generating a forecast for the next 14 days, stored as something like this:

```txt
{
    "date_created": "2022-10-19 22:00:00",
    "group_id": 12345689,  # primary key, index
    "forecast": [{
        "2022-10-19": 500,
        "2022-10-20": 600,
        ...
    }]
}
```

We should be cautious about the storage format: in V2 we would store a forecast for *every* issue older than 30 days, plus any issues the customer marks as archived-until-escalating. Issues with fewer than 7 days of data will get a flat ceiling forecast, which will be refreshed on the weekly forecast update.
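
For illustration, a flat ceiling for a young issue might look like the sketch below. The headroom multiplier is a made-up placeholder; the RFC does not specify how the flat ceiling is computed.

```python
def flat_ceiling_forecast(daily_counts, horizon_days=14, headroom=2.0):
    """Constant daily ceiling for issues with under 7 days of history:
    the busiest observed day, padded with some headroom (assumed factor)."""
    peak = max(daily_counts) if daily_counts else 0
    return [int(peak * headroom)] * horizon_days
```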

An analysis of how expensive querying Snuba can be is available [here](https://www.notion.so/Support-escalating-issues-detection-1267f6bda052438e9eb1a4ed6ec1f6de) (internal link).

The process to get the data for generating the forecast will look something like this:

- Get a list of all group IDs that have been marked to be monitored for escalation
- Bucket them per project, since we will have to do a query per project
- Ask Snuba to return the count for each issue, bucketed hourly
- Process the data and store it as a forecast

```sql
MATCH (errors_new_entity)
SELECT group_id, bucketed_hourly_timestamp, count(*)
WHERE group_id IN (1234, 5678, 902)  # IDs of archived-until-escalating issues
AND timestamp < some value
AND timestamp > some value
GROUP BY bucketed_hourly_timestamp, group_id
ORDER BY group_id, bucketed_hourly_timestamp DESC
LIMIT 10000 OFFSET 10000
```
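
The bucketing and result-folding steps above can be sketched in Python. This is a shape sketch only: the helpers are hypothetical and stand in for the real Snuba client and task code.

```python
from collections import defaultdict

def bucket_groups_by_project(project_group_pairs):
    """Bucket (project_id, group_id) pairs per project, since we will
    issue one Snuba query per project."""
    buckets = defaultdict(list)
    for project_id, group_id in project_group_pairs:
        buckets[project_id].append(group_id)
    return dict(buckets)

def hourly_series_per_group(rows):
    """Fold Snuba result rows of (group_id, hour_bucket, count) into a
    per-group list of (hour_bucket, count), ready for forecasting."""
    series = defaultdict(list)
    for group_id, hour_bucket, count in rows:
        series[group_id].append((hour_bucket, count))
    return dict(series)
```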

Note that this is a single paginated call covering issues across all orgs. It will return 168 hourly counts (7 days) for every issue.

Somewhere in the product, we will analyze today’s count for an issue, and if it blows through the ceiling we will mark the issue as escalating so it shows up in the customer’s issue stream.

We will support alerting when an issue starts escalating (e.g. “This issue changed state to escalating”).

## Drawbacks

Known issues:

- Teaching a new behavior to customers.
- Changing the terminology from ignoring issues to archiving issues.

## Unresolved questions

- For V2, we will automate the process of moving old issues from ongoing to archived-until-escalating.
