Skip to content

ENG-1815 Add sync attempt telemetry and stale end-task handling#1094

Merged
mdroidian merged 6 commits into
mainfrom
eng-1815-add-sync-attempt-telemetry-and-stale-end-task-handling
Jun 9, 2026
Merged

ENG-1815 Add sync attempt telemetry and stale end-task handling#1094
mdroidian merged 6 commits into
mainfrom
eng-1815-add-sync-attempt-telemetry-and-stale-end-task-handling

Conversation

@mdroidian

@mdroidian mdroidian commented May 28, 2026

Copy link
Copy Markdown
Member

This is from a back and forth with codex 5.5 xhigh with linear/posthog mcp access.

Summary

Fixes misleading sync reporting around ENG-1322 and adds the telemetry needed to debug remaining “Wrong worker” sync failures.

Previously, Roam could hit an end_sync_task failure and still emit Sync complete, which made PostHog sessions look contradictory. This change makes end_sync_task return structured results, classifies stale completions explicitly, and only reports sync success after the database confirms the task was ended successfully.

What Changed

  • Changed Roam sync worker identity from the Roam user UID to a per-runtime random worker ID.
  • Added per-sync syncAttemptId, syncWorkerId, syncUserUid, space ID, status, and per-phase timing telemetry.
  • Changed endSyncTask to return a typed success/stale/error result instead of swallowing failures.
  • Sync complete now only fires after end_sync_task returns success.
  • Stale completions now emit Sync stale; end-task failures emit Sync error with status: "end_task_failed".

Database Changes

  • public.end_sync_task(...) now returns jsonb instead of void.
  • If an older sync attempt finishes after a newer task has already started, the function returns:
    • ok: false
    • stale: true
    • reason: "completed_by_newer_task"
  • Real worker mismatches still raise Wrong worker, but now include diagnostic detail in the Postgres error detail payload.
  • Added migration 20260528033000_end_sync_task_result.sql.
  • Updated generated DB types and sync function docs.

Why This Helps

  • Prevents false-positive Sync complete events after failed task cleanup.
  • Separates stale/overlapping sync attempts from true wrong-worker errors.
  • Gives PostHog enough context to debug remaining ENG-1322 cases by phase duration, worker, attempt, and DB task state.

Out of Scope

  • Fully resolving every root cause of ENG-1322.
  • Changing sync timeout/interval policy.
  • Reworking the broader sync scheduler.

Open in Devin Review

Summary by CodeRabbit

Release Notes

  • Improvements

    • Enhanced sync task reliability with improved error detection and recovery mechanisms
    • Better handling of concurrent sync operations to prevent duplicate or stale task processing
  • Documentation

    • Updated sync function documentation for clearer completion and worker behavior specifications

Review Change Stack

@linear-code

linear-code Bot commented May 28, 2026

Copy link
Copy Markdown

ENG-1815

@graphite-app

graphite-app Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

PR size/scope check

This PR is over our review-size guideline.

  • Recommended: ~200 lines changed
  • Acceptable limit: up to 400 lines when well-scoped/self-contained
  • Preferred file count: fewer than 5 files

Please split this into smaller PRs unless there is a clear reason the changes need to land together.

If keeping it as one PR, please add a brief justification covering:

  • What single problem this PR solves
  • Why the files/changes are coupled

@supabase

supabase Bot commented May 28, 2026

Copy link
Copy Markdown

Updates to Preview Branch (eng-1815-add-sync-attempt-telemetry-and-stale-end-task-handling) ↗︎

Deployments Status Updated
Database Tue, 09 Jun 2026 02:49:07 UTC
Services Tue, 09 Jun 2026 02:49:07 UTC
APIs Tue, 09 Jun 2026 02:49:07 UTC

Tasks are run on every commit but only new migration files are pushed.
Close and reopen this PR if you want to apply changes from existing seed or migration files.

Tasks Status Updated
Configurations Tue, 09 Jun 2026 02:49:08 UTC
Migrations Tue, 09 Jun 2026 02:49:08 UTC
Seeding Tue, 09 Jun 2026 02:49:08 UTC
Edge Functions Tue, 09 Jun 2026 02:49:10 UTC

View logs for this Workflow Run ↗︎.
Learn more about Supabase for Git ↗︎.

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.

Open in Devin Review

@mdroidian

This comment was marked as resolved.

@coderabbitai

This comment was marked as resolved.

@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@mdroidian mdroidian marked this pull request as draft May 28, 2026 04:40
@mdroidian

Copy link
Copy Markdown
Member Author

PR size/scope check

This PR is over our review-size guideline.

  • Recommended: ~200 lines changed
  • Acceptable limit: up to 400 lines when well-scoped/self-contained
  • Preferred file count: fewer than 5 files

Please split this into smaller PRs unless there is a clear reason the changes need to land together.

If keeping it as one PR, please add a brief justification covering:

  • What single problem this PR solves
  • Why the files/changes are coupled

This PR fixes one sync completion correctness issue: end_sync_task failures/stale completions were not distinguishable enough, and Roam could report Sync complete even when completion did not actually succeed.

The files are coupled by the same contract change. The database function now returns structured completion results, the generated DB type and website route need to reflect that return shape, and the Roam caller must consume the result to avoid false success telemetry/backoff behavior. Splitting would create intermediate states where the server exposes a new contract without the client using it, or the client expects behavior the DB has not yet shipped.

@mdroidian mdroidian marked this pull request as ready for review May 28, 2026 04:43
@mdroidian mdroidian assigned maparent and unassigned maparent May 28, 2026
@mdroidian mdroidian requested a review from maparent May 28, 2026 04:44
devin-ai-integration[bot]

This comment was marked as resolved.

@maparent maparent left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Nothing wrong, but one change I'd appreciate:
Add a version number to the JSON output, and check for it in the code. If the version number coming from the database is higher than the one in the code, be more lenient with json parsing errors. This gives you more leeway to change the database json format without breaking old versions of the plugins.
Otherwise lgtm.

@mdroidian mdroidian requested a review from maparent May 28, 2026 20:22
mdroidian added 6 commits June 8, 2026 20:39
…text

- Refactor end_sync_task to return a JSON object containing task status and metadata.
- Enhance error handling to include telemetry context for better debugging.
- Update related API routes and database schema to accommodate new return type.
- Document changes in sync_functions.md to reflect the new response structure.
…error responses

- Update EndSyncTaskRpcResult to require 'ok' and 'stale' fields.
- Enhance isEndSyncTaskRpcResult function to validate response structure.
- Modify endSyncTask function to handle claimed timestamps and improve error handling.
- Update API route to parse and validate request body for task status and timestamps.
- Document changes in sync_functions.md to reflect new requirements for end_sync_task.
…prove request body validation

- Made 'startedAt' in ParsedEndSyncTaskBody optional.
- Enhanced parseEndSyncTaskBody to accept a task status string directly.
- Updated error messages for better clarity on request body requirements.
- Adjusted RPC call to conditionally include 's_started_at' based on availability.
@mdroidian mdroidian force-pushed the eng-1815-add-sync-attempt-telemetry-and-stale-end-task-handling branch from aa3523a to 9a72a28 Compare June 9, 2026 02:48
@vercel

vercel Bot commented Jun 9, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
discourse-graph Ready Ready Preview, Comment Jun 9, 2026 2:50am

Request Review

@mdroidian mdroidian merged commit def6e82 into main Jun 9, 2026
13 checks passed
@mdroidian mdroidian deleted the eng-1815-add-sync-attempt-telemetry-and-stale-end-task-handling branch June 9, 2026 02:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants