fix(replay): actionable recordings-list query errors and quieter timing header#67258
Draft
posthog[bot] wants to merge 1 commit into
…ng header

Split two tangled signals on the replay recordings-list path.

The Server-Timing header truncation now increments a counter and logs instead of reporting a handled exception on every hit. Truncation is expected behaviour, not a failure, and it was one of the highest-volume issues in error tracking.

The recordings list now distinguishes transient capacity/timeout failures (retryable) and too-expensive filter combinations (memory-limit) from real bugs. Heavy property/duration filter combinations that blow the ClickHouse memory ceiling return an actionable 400 telling the user to narrow their search, and are no longer reported as exceptions. The playlist surfaces a specific, retryable banner instead of the generic "unexpected error".

Generated-By: PostHog Code
Task-Id: 8a20139d-9829-484b-bfe9-beeda786367f
Size Change: +451 kB (+0.65%) Total Size: 69.9 MB
❌ Eager graph
How much code each root forces the browser to download and decode through static imports: the regression class that total bundle size can't see.
✅ Largest files eagerly reachable from
| Size | File |
|---|---|
| 927.9 KiB | src/styles/global.scss |
| 609.0 KiB | public/hedgehog/burning-money-hog.png |
| 541.9 KiB | public/hedgehog/waving-hog.png |
| 357.8 KiB | ../node_modules/.pnpm/@posthog+icons@0.37.4_react-dom@18.3.1_react@18.3.1__react@18.3.1/node_modules/@posthog/icons/dist/posthog-icons.es.js |
| 343.5 KiB | src/taxonomy/core-filter-definitions-by-group.json |
| 301.5 KiB | src/lib/api.ts |
| 279.2 KiB | ../node_modules/.pnpm/posthog-js@1.396.3/node_modules/posthog-js/dist/rrweb.js |
| 268.2 KiB | ../common/tailwind/tailwind.css |
| 264.9 KiB | src/queries/schema/schema-general.ts |
| 228.3 KiB | src/types.ts |
Largest files eagerly reachable from src/scenes/AuthenticatedShell.tsx
| Size | File |
|---|---|
| 1.92 MiB | ../node_modules/.pnpm/@posthog+brand@0.3.0_react@18.3.1/node_modules/@posthog/brand/dist/generated/hoggies/svg/code-bubble.mjs |
| 1.25 MiB | ../node_modules/.pnpm/@posthog+brand@0.3.0_react@18.3.1/node_modules/@posthog/brand/dist/generated/hoggies/svg/einstein-group.mjs |
| 1.07 MiB | ../node_modules/.pnpm/@posthog+brand@0.3.0_react@18.3.1/node_modules/@posthog/brand/dist/generated/hoggies/svg/evel.mjs |
| 927.9 KiB | src/styles/global.scss |
| 919.6 KiB | ../node_modules/.pnpm/@posthog+brand@0.3.0_react@18.3.1/node_modules/@posthog/brand/dist/generated/hoggies/svg/driving-hogzilla.mjs |
| 787.8 KiB | src/queries/validators.js |
| 739.3 KiB | ../node_modules/.pnpm/@posthog+brand@0.3.0_react@18.3.1/node_modules/@posthog/brand/dist/generated/hoggies/svg/coding-group.mjs |
| 692.5 KiB | ../node_modules/.pnpm/@posthog+brand@0.3.0_react@18.3.1/node_modules/@posthog/brand/dist/generated/hoggies/svg/wizard-hog.mjs |
| 677.6 KiB | ../node_modules/.pnpm/@posthog+brand@0.3.0_react@18.3.1/node_modules/@posthog/brand/dist/generated/hoggies/svg/lemonade.mjs |
| 640.1 KiB | ../node_modules/.pnpm/@posthog+brand@0.3.0_react@18.3.1/node_modules/@posthog/brand/dist/generated/hoggies/svg/scott-pilgrim.mjs |
Posted automatically by check-eager-graph · sizes are input-source bytes from the esbuild metafile · part of #32479
Problem
Two signals were tangled on the Session replay recordings-list path.
Heavy property/duration filter combinations push the recordings query past ClickHouse's memory ceiling (memory-limit-exceeded while reading the events `properties` column, plus simultaneous-query and timeout variants). The API caught all of these and returned the generic "An unexpected error has occurred. Please try again later." with no retry affordance, and reported each one as a handled exception.
Separately, every list request runs `ServerTimingsGathered.to_header_string()`, which already truncates the header safely to stay under the ALB size limit, but it also fired `capture_exception` on every truncation. That's expected behaviour, not a failure, and it was one of the highest-volume issues in error tracking, drowning real signals.
Changes
- Server-Timing header: replaced `capture_exception` on truncation with a counter (`server_timing_header_truncated_total`) and an info log. Truncation is expected, so it no longer pollutes error tracking.
- Capacity failures: `429` retryable (as before, now also covering the wrapped `ClickHouseAtCapacity` / cannot-schedule-task forms).
- Timeouts: `429` retryable (now also covering the wrapped `ClickHouseQueryTimeOut`).
- Memory-limit failures: `400` with code `query_too_expensive` telling the user to narrow their search. These are driven by user filter choice, so they're not reported as exceptions.
- Playlist banner: a specific message for `query_too_expensive`, a transient-capacity message for `429`, and the generic message otherwise.
I deliberately scoped this to graceful degradation + clear guidance rather than proactively rewriting the query cost model. Turning the OOM into an actionable, non-noisy error is the safe fix for the observed symptom; changing the ClickHouse query shape/settings for a core paid surface is riskier and better done separately.
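For reviewers who want the intended behaviour at a glance, here is a minimal Python sketch of the two changes described above. It is an illustration under stated assumptions, not the code in this PR: the exception classes are local stand-ins, `ClickHouseMemoryLimitExceeded`, `classify_query_error`, `ErrorResponse`, the `throttled` and `server_error` codes, and the dict-based counter are all hypothetical, and whether the `429` path is reported as an exception is not stated in this PR.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


# Local stand-ins for the wrapped ClickHouse errors named in this PR.
# The real classes live in the PostHog codebase and may differ.
class ClickHouseAtCapacity(Exception): ...
class ClickHouseQueryTimeOut(Exception): ...
class ClickHouseMemoryLimitExceeded(Exception): ...  # hypothetical name


@dataclass(frozen=True)
class ErrorResponse:
    status: int
    code: str  # only "query_too_expensive" comes from the PR text
    report_exception: bool


def classify_query_error(exc: Exception) -> ErrorResponse:
    """Map a recordings-list query failure to an HTTP status and code."""
    if isinstance(exc, (ClickHouseAtCapacity, ClickHouseQueryTimeOut)):
        # Transient: retrying the same query is expected to succeed.
        # Assumption: this sketch does not report the 429 path.
        return ErrorResponse(429, "throttled", report_exception=False)
    if isinstance(exc, ClickHouseMemoryLimitExceeded):
        # Driven by the user's filter choice, so not reported as an exception.
        return ErrorResponse(400, "query_too_expensive", report_exception=False)
    # Anything else is treated as a real bug: generic error, reported.
    return ErrorResponse(500, "server_error", report_exception=True)


# Stand-in for the server_timing_header_truncated_total counter.
counters = {"server_timing_header_truncated_total": 0}


def to_header_string(timings: dict[str, float], limit: int = 10_000) -> str:
    """Build a Server-Timing value, truncating quietly once `limit` is reached."""
    parts: list[str] = []
    length = 0
    for name, ms in timings.items():
        part = f"{name};dur={ms:.1f}"
        extra = len(part) + (2 if parts else 0)  # account for the ", " separator
        if length + extra > limit:
            # Expected behaviour: count and log instead of capturing an exception.
            counters["server_timing_header_truncated_total"] += 1
            logger.info("server timing header truncated at %d entries", len(parts))
            break
        parts.append(part)
        length += extra
    return ", ".join(parts)
```

The design point is that only the final branch of `classify_query_error` reaches error tracking, and truncation shows up as a metric rather than as an issue.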
How did you test this code?
I'm an agent. This sandbox has no Postgres/ClickHouse, so I could not run the Django test suite here — CI will. I ran:
- `ruff check` / `ruff format` on the backend files (clean).
- `typescript:check` (no errors in the changed files), `oxlint` and `oxfmt` (clean), and regenerated kea types for the playlist logic.
Automated tests added/changed:
- `test_session_recordings_query_errors`: extended with the wrapped capacity/timeout forms (`ClickHouseAtCapacity`, cannot-schedule-task, `ClickHouseQueryTimeOut`) that reach the handler in production; the pre-existing cases only covered the raw forms.
- `test_session_recordings_query_too_expensive` (new): locks in the `400` + `query_too_expensive` response and asserts the memory-limit path is not captured as an exception. This is the exact regression the change fixes (previously a generic 500 + captured exception).
- `test_server_timing_header_truncated_below_alb_limit` (new): guards that an oversized timings set is truncated below the 10k ALB limit without raising; there was no coverage of the truncation path before.
Automatic notifications
Docs update
No docs change needed.
🤖 Agent context
Autonomy: Fully autonomous
Authored by an agent (Claude Code) from a PostHog inbox report. Skills invoked while producing this PR:
- `/writing-tests` (test value gate).
The change splits the two error-tracking signals described in the report; the DRF viewset error handling was extended in place (the `list` handler already caught capacity/timeout), so no new serializer/schema surface was introduced.
Decisions: I chose to return `400 query_too_expensive` (rather than a 5xx) for memory-limit failures because the fix is user-side (narrow filters), and to keep the transient 429 path for capacity/timeout since retrying the same query succeeds there. I did not add a proactive query-cost guard; that would change ClickHouse behaviour on a core surface and is better evaluated on its own.
Created with PostHog Code from an inbox report.