
Fix: Remove status partitioning on v1_(runs|dags|tasks)_olap#3603

Merged
mrkaye97 merged 32 commits into main from mk/remove-status-partitioning
Apr 24, 2026
Conversation


@mrkaye97 mrkaye97 commented Apr 10, 2026

Description

Removes status partitioning, which we expect to improve performance in several parts of the system. Updates get simpler because we never need to move rows between partitions (which Postgres seems to discourage), and query performance should improve as well. Also removes the old tenant index on the runs table in favor of the new one that supersedes it.

Same basic process as before:

  1. Creates new tables that are basically copies of the existing ones, except that they have only (inserted_at, id) as the PK and are partitioned differently
  2. Creates triggers to insert data into those tables
  3. Backfills those tables
  4. Progresses to a second migration where we drop the old tables and rename the new ones to cut over
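
The four steps above can be sketched as an ordered list of SQL statements, split across the two migrations. This is a minimal illustration, not the migration shipped in this PR: the `_new` suffix, the trigger and function names, the `RANGE (inserted_at)` partitioning, and the `SELECT *` backfill are all assumptions made for brevity (the real backfill lists columns explicitly).

```go
package main

import "fmt"

// cutoverPlan returns the SQL for both migrations, in execution order.
// The "_new" suffix and the mirror trigger/function names are hypothetical.
func cutoverPlan(table string) (first, second []string) {
	shadow := table + "_new"
	first = []string{
		// Step 1: same columns, but keyed on (inserted_at, id) only and
		// partitioned by time alone (assumed). Child partitions are omitted.
		fmt.Sprintf("CREATE TABLE %s (LIKE %s INCLUDING DEFAULTS, PRIMARY KEY (inserted_at, id)) PARTITION BY RANGE (inserted_at)", shadow, table),
		// Step 2: mirror new writes; the trigger function is assumed to exist.
		fmt.Sprintf("CREATE TRIGGER %s_mirror AFTER INSERT ON %s FOR EACH ROW EXECUTE FUNCTION %s_mirror_fn()", table, table, table),
		// Step 3: backfill; rows the trigger already copied are skipped.
		fmt.Sprintf("INSERT INTO %s SELECT * FROM %s ON CONFLICT (inserted_at, id) DO NOTHING", shadow, table),
	}
	second = []string{
		// Step 4: cut over by dropping the old table and renaming the new one.
		fmt.Sprintf("DROP TABLE %s", table),
		fmt.Sprintf("ALTER TABLE %s RENAME TO %s", shadow, table),
	}
	return first, second
}
```

Calling `cutoverPlan("v1_runs_olap")` yields three statements for the first migration and two for the second. The `ON CONFLICT (inserted_at, id) DO NOTHING` clause is what lets the backfill run safely while the trigger is already writing to the new table.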

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Copilot AI review requested due to automatic review settings April 10, 2026 17:24

vercel Bot commented Apr 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
hatchet-docs | Ready | Preview, Comment | Apr 24, 2026 7:39pm


Copilot AI left a comment


Pull request overview

Updates OLAP schema/indexing for v1_runs_olap to replace the existing tenant-oriented index with a new (tenant_id, readable_status, inserted_at DESC) index, and provides a migration to roll the change out across partition levels.

Changes:

  • Add a new v1_runs_olap index optimized for tenant + status + recency lookups.
  • Introduce a no-tx Goose migration that drops the old ix_v1_runs_olap_tenant_id index and creates the new index across leaf and parent partitions.
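
For context, a common Postgres pattern for building an index across a partitioned table without long locks is: create the index on the parent with `ON ONLY`, build it `CONCURRENTLY` on each leaf, then attach each leaf index to the parent. `CREATE INDEX CONCURRENTLY` cannot run inside a transaction block, which is presumably why the migration is no-tx. The sketch below only generates the statements; it assumes a single partition level and a hypothetical leaf-index naming scheme, and is not the code from this PR.

```go
package main

import "fmt"

// partitionedIndexStatements returns the SQL needed to build indexName on
// parent and on every leaf partition. Leaf index names are hypothetical.
func partitionedIndexStatements(parent, indexName, columns string, leaves []string) []string {
	stmts := []string{
		// The parent index starts out invalid and becomes valid once every
		// partition has an attached index.
		fmt.Sprintf("CREATE INDEX IF NOT EXISTS %s ON ONLY %s (%s)", indexName, parent, columns),
	}
	for _, leaf := range leaves {
		leafIdx := indexName + "_" + leaf
		stmts = append(stmts,
			// Built without blocking writes on the leaf partition.
			fmt.Sprintf("CREATE INDEX CONCURRENTLY IF NOT EXISTS %s ON %s (%s)", leafIdx, leaf, columns),
			fmt.Sprintf("ALTER INDEX %s ATTACH PARTITION %s", indexName, leafIdx),
		)
	}
	return stmts
}
```

For two leaves this produces five statements: one for the parent and a create/attach pair per leaf. With nested partitioning, the same pattern would have to be repeated at each intermediate level.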

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

File | Description
sql/schema/v1-olap.sql | Adds the new ix_v1_runs_olap_tenant_status_ins_at index definition for v1_runs_olap.
cmd/hatchet-migrate/migrate/migrations/20260410190713_v1_0_97.go | Drops the old runs OLAP tenant index and builds the new tenant+status+inserted_at index across partitions.

Comment thread sql/schema/v1-olap.sql
Comment thread sql/schema/v1-olap.sql
Comment thread cmd/hatchet-migrate/migrate/migrations/20260424190713_v1_0_99.go
Comment thread cmd/hatchet-migrate/migrate/migrations/20260410190713_v1_0_97.go Outdated

github-actions Bot commented Apr 17, 2026

Benchmark results

goos: linux
goarch: amd64
pkg: github.com/hatchet-dev/hatchet/pkg/scheduling/v1
cpu: AMD Ryzen 9 7950X3D 16-Core Processor          
              │ /tmp/old.txt │         /tmp/new.txt         │
              │    sec/op    │   sec/op     vs base         │
RateLimiter-8    49.85µ ± 4%   51.84µ ± 8%  ~ (p=0.394 n=6)

              │ /tmp/old.txt │         /tmp/new.txt          │
              │     B/op     │     B/op      vs base         │
RateLimiter-8   137.7Ki ± 0%   137.7Ki ± 0%  ~ (p=0.734 n=6)

              │ /tmp/old.txt │          /tmp/new.txt          │
              │  allocs/op   │  allocs/op   vs base           │
RateLimiter-8    1.022k ± 0%   1.022k ± 0%  ~ (p=1.000 n=6) ¹
¹ all samples are equal

Compared against main (3acccfa)

FROM v1_dags_olap
ON CONFLICT (inserted_at, id) DO NOTHING`

func up20260410190713(ctx context.Context, db *sql.DB) error {
A contributor commented:

idea to speed things up: we can do runs, tasks and dags in parallel by starting 3 separate transactions in parallel and running the migration in non-tx mode

JOIN pg_inherits i ON c.oid = i.inhrelid
JOIN pg_class p ON p.oid = i.inhparent
WHERE
p.relname = '` + oldParent + `'
A contributor commented:

rare edge case, but what happens if this runs while we add or drop a partition? I'm guessing it won't cause much of an impact on the add partition since we have 24 hours before data starts getting written to the partition. But more curious about the case when we delete a partition.

The author (@mrkaye97) replied:

I guess we'd either hang trying to drop the partition or the migration would fail with a no such table error and we'd retry
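
To illustrate the "fail and retry" path, here is a minimal sketch of a backfill loop that re-lists partitions when one disappears mid-run. Everything here is hypothetical: `errNoSuchTable` stands in for Postgres SQLSTATE 42P01 (undefined_table), and `list` and `copyPartition` stand in for the `pg_inherits` query and the per-partition insert.

```go
package main

import "errors"

// errNoSuchTable is a stand-in for SQLSTATE 42P01 (undefined_table).
var errNoSuchTable = errors.New("no such table")

// backfillPartitions copies every listed partition once. If a partition was
// dropped between listing and copying, it re-lists and carries on, up to
// maxAttempts passes. Any other error aborts immediately.
func backfillPartitions(list func() ([]string, error), copyPartition func(string) error, maxAttempts int) error {
	done := map[string]bool{}
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		parts, err := list()
		if err != nil {
			return err
		}
		lastErr = nil
		for _, p := range parts {
			if done[p] {
				continue // already copied in an earlier pass
			}
			if err := copyPartition(p); err != nil {
				if errors.Is(err, errNoSuchTable) {
					lastErr = err
					break // partition vanished: re-list and try again
				}
				return err
			}
			done[p] = true
		}
		if lastErr == nil {
			return nil
		}
	}
	return lastErr
}
```

This does not cover the other outcome mentioned above, where the partition drop blocks on locks held by the migration; that case would depend on lock timeouts rather than retries.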

additional_metadata,
parent_task_external_id
FROM v1_runs_olap
ON CONFLICT (inserted_at, id) DO NOTHING`
A contributor commented:

since the old v1_runs_olap could technically have multiple statuses, do we want to pick the larger status here or just not worry about that edge case?

The author (@mrkaye97) replied:

yeah, initially I was thinking that since the runs table doesn't have a retry count we don't know how to do this, but maybe we can naively just pick the largest one in case of a conflict and it'll work...
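
The "naively pick the largest one" idea can be sketched in a few lines. The status names and their ordering below are assumptions for illustration only, as are the `run` and `runKey` types; the real ordering would come from however the status type compares in the database.

```go
package main

// runKey mirrors the new primary key (inserted_at, id).
type runKey struct {
	InsertedAt string // timestamp kept as a string for brevity
	ID         int64
}

type run struct {
	Key    runKey
	Status string
}

// statusRank is a hypothetical ordering; later states rank higher.
var statusRank = map[string]int{
	"QUEUED":    0,
	"RUNNING":   1,
	"COMPLETED": 2,
	"CANCELLED": 3,
	"FAILED":    4,
}

// dedupeByLargestStatus keeps one row per key, preferring the row whose
// status ranks highest when the old table holds several rows for a key.
func dedupeByLargestStatus(rows []run) map[runKey]run {
	out := make(map[runKey]run, len(rows))
	for _, r := range rows {
		cur, seen := out[r.Key]
		if !seen || statusRank[r.Status] > statusRank[cur.Status] {
			out[r.Key] = r
		}
	}
	return out
}
```

Note that this changes the backfill semantics: `ON CONFLICT ... DO NOTHING` keeps whichever row arrives first, whereas this approach keeps the highest-ranked status regardless of arrival order.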

@mrkaye97 mrkaye97 merged commit 0e0d083 into main Apr 24, 2026
44 of 49 checks passed
@mrkaye97 mrkaye97 deleted the mk/remove-status-partitioning branch April 24, 2026 19:35
3 participants