Fix: Remove status partitioning on v1_(runs|dags|tasks)_olap #3603
Pull request overview
Updates OLAP schema/indexing for v1_runs_olap to replace the existing tenant-oriented index with a new (tenant_id, readable_status, inserted_at DESC) index, and provides a migration to roll the change out across partition levels.
Changes:
- Add a new v1_runs_olap index optimized for tenant + status + recency lookups.
- Introduce a no-tx Goose migration that drops the old ix_v1_runs_olap_tenant_id index and creates the new index across leaf and parent partitions.
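As a rough illustration of why the migration has to run outside a transaction, here is a minimal sketch of the usual Postgres pattern for indexing a partitioned table: create the parent index with ON ONLY, build each leaf index with CONCURRENTLY (which cannot run inside a transaction block), then attach the leaf indexes. The helper `indexStatements`, the leaf index naming scheme, and the partition name in `main` are assumptions for illustration, not the PR's actual code, and a multi-level hierarchy would repeat the ON ONLY / ATTACH step at each intermediate level.

```go
package main

import "fmt"

// indexStatements returns DDL that builds one index across a single level of
// partitions. Hypothetical helper: the real migration may be structured
// differently.
func indexStatements(parent, indexName, columns string, leaves []string) []string {
	stmts := []string{
		// ON ONLY creates the parent index without recursing into partitions,
		// so it stays invalid until every partition index is attached.
		fmt.Sprintf("CREATE INDEX IF NOT EXISTS %s ON ONLY %s (%s)", indexName, parent, columns),
	}
	for _, leaf := range leaves {
		// Assumed naming scheme; Postgres truncates identifiers over 63 bytes.
		leafIndex := fmt.Sprintf("%s_%s", leaf, indexName)
		stmts = append(stmts,
			// CONCURRENTLY avoids blocking writes on the leaf while it builds.
			fmt.Sprintf("CREATE INDEX CONCURRENTLY IF NOT EXISTS %s ON %s (%s)", leafIndex, leaf, columns),
			fmt.Sprintf("ALTER INDEX %s ATTACH PARTITION %s", indexName, leafIndex),
		)
	}
	return stmts
}

func main() {
	stmts := indexStatements(
		"v1_runs_olap",
		"ix_v1_runs_olap_tenant_status_ins_at",
		"tenant_id, readable_status, inserted_at DESC",
		[]string{"v1_runs_olap_20260410"}, // hypothetical partition name
	)
	for _, s := range stmts {
		fmt.Println(s + ";")
	}
}
```

Generating the statements separately from executing them keeps the sketch testable without a database; in a real migration each statement would be sent on its own, outside any transaction.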
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| sql/schema/v1-olap.sql | Adds the new ix_v1_runs_olap_tenant_status_ins_at index definition for v1_runs_olap. |
| cmd/hatchet-migrate/migrate/migrations/20260410190713_v1_0_97.go | Drops the old runs OLAP tenant index and builds the new tenant+status+inserted_at index across partitions. |
FROM v1_dags_olap
ON CONFLICT (inserted_at, id) DO NOTHING`

func up20260410190713(ctx context.Context, db *sql.DB) error {
Idea to speed things up: we could migrate runs, tasks and dags concurrently by running the migration in non-tx mode and starting 3 separate transactions in parallel.
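A minimal sketch of what that could look like, assuming each table's copy is wrapped in its own function. The `copyStep` type, `runInParallel`, and `demo` are hypothetical names; in a real migration each step would open its own transaction on the shared `*sql.DB`, run its INSERT ... SELECT, and commit.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// copyStep is a hypothetical stand-in for one table's copy, e.g. begin a
// transaction, run the INSERT ... SELECT for that table, then commit.
type copyStep func(ctx context.Context) error

// runInParallel runs every step in its own goroutine and returns all errors
// joined together. The first failure cancels the shared context so the other
// steps can stop early if they honor it.
func runInParallel(ctx context.Context, steps map[string]copyStep) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for name, step := range steps {
		wg.Add(1)
		go func(name string, step copyStep) {
			defer wg.Done()
			if err := step(ctx); err != nil {
				mu.Lock()
				errs = append(errs, fmt.Errorf("%s: %w", name, err))
				mu.Unlock()
				cancel()
			}
		}(name, step)
	}
	wg.Wait()
	return errors.Join(errs...) // nil when no step failed
}

// demo wires up three fake steps; failDags makes the dags step fail.
func demo(failDags bool) error {
	noop := func(ctx context.Context) error { return nil }
	dags := noop
	if failDags {
		dags = func(ctx context.Context) error { return errors.New("boom") }
	}
	return runInParallel(context.Background(), map[string]copyStep{
		"runs":  noop,
		"tasks": noop,
		"dags":  dags,
	})
}

func main() {
	fmt.Println(demo(false))
	fmt.Println(demo(true))
}
```

Because the three copies commit independently, a failure in one leaves the others committed, so the steps would need to stay idempotent (as the ON CONFLICT ... DO NOTHING clauses suggest) for a retry to be safe.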
JOIN pg_inherits i ON c.oid = i.inhrelid
JOIN pg_class p ON p.oid = i.inhparent
WHERE
p.relname = '` + oldParent + `'
rare edge case, but what happens if this runs while we add or drop a partition? I'm guessing it won't cause much of an impact on the add partition since we have 24 hours before data starts getting written to the partition. But more curious about the case when we delete a partition.
I guess we'd either hang trying to drop the partition or the migration would fail with a no such table error and we'd retry
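To make the "fail and retry" path concrete, here is a hedged sketch of a retry loop that only retries when the error carries Postgres SQLSTATE 42P01 (undefined_table), which is what a statement against a just-dropped partition would report. The `sqlStateError` interface is an assumption about the driver's error type, and `retryOnDroppedPartition`, `fakePgError`, and `demoRetry` are hypothetical names; the PR does not show how retries are actually handled.

```go
package main

import (
	"errors"
	"fmt"
)

// undefinedTable is the Postgres SQLSTATE for undefined_table.
const undefinedTable = "42P01"

// sqlStateError is the assumed shape of a driver error that exposes SQLSTATE.
type sqlStateError interface {
	error
	SQLState() string
}

// isUndefinedTable reports whether err wraps an undefined_table error.
func isUndefinedTable(err error) bool {
	var se sqlStateError
	return errors.As(err, &se) && se.SQLState() == undefinedTable
}

// retryOnDroppedPartition calls attempt up to maxAttempts times, retrying
// only when a table vanished mid-run. The attempt is expected to re-list
// partitions each time so the dropped one is no longer visited.
func retryOnDroppedPartition(maxAttempts int, attempt func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for i := 0; i < maxAttempts; i++ {
		err = attempt()
		if err == nil || !isUndefinedTable(err) {
			return err
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

// fakePgError simulates a driver error for the demo below.
type fakePgError struct{ code string }

func (e fakePgError) Error() string    { return "pg error " + e.code }
func (e fakePgError) SQLState() string { return e.code }

// demoRetry fails once with undefined_table, then succeeds.
func demoRetry() (int, error) {
	calls := 0
	err := retryOnDroppedPartition(3, func() error {
		calls++
		if calls == 1 {
			return fakePgError{code: undefinedTable}
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := demoRetry()
	fmt.Println(calls, err)
}
```

This only covers the failure case; if the migration instead blocks on a lock held by the partition drop, a statement or lock timeout would be needed to turn the hang into a retryable error.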
additional_metadata,
parent_task_external_id
FROM v1_runs_olap
ON CONFLICT (inserted_at, id) DO NOTHING`
since the old v1_runs_olap could technically have multiple statuses, do we want to pick the larger status here or just not worry about that edge case?
yeah, initially I was thinking that since the runs table doesn't have a retry count we don't know how to do this, but maybe we can naively just pick the largest one in case of a conflict and it'll work...
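The naive "pick the largest status" idea can be sketched as follows. The status names and their ordering in `statusRank` are assumptions for illustration (the real ordering would come from the readable_status enum), and `runRow`, `runKey`, and `dedupeByLargestStatus` are hypothetical names. In the migration itself the same rule would more likely live in SQL, for example an ON CONFLICT ... DO UPDATE that keeps the greater status.

```go
package main

import "fmt"

// statusRank is an assumed ordering; a higher rank wins on conflict.
var statusRank = map[string]int{
	"QUEUED":    0,
	"RUNNING":   1,
	"COMPLETED": 2,
	"CANCELLED": 3,
	"FAILED":    4,
}

// runKey mirrors the (inserted_at, id) primary key of the new table.
type runKey struct {
	InsertedAt string
	ID         string
}

// runRow is a trimmed-down, hypothetical view of a v1_runs_olap row.
type runRow struct {
	Key    runKey
	Status string
}

// dedupeByLargestStatus keeps one row per key, preferring the row whose
// status has the highest rank.
func dedupeByLargestStatus(rows []runRow) map[runKey]runRow {
	best := make(map[runKey]runRow, len(rows))
	for _, r := range rows {
		cur, seen := best[r.Key]
		if !seen || statusRank[r.Status] > statusRank[cur.Status] {
			best[r.Key] = r
		}
	}
	return best
}

func main() {
	k := runKey{InsertedAt: "2026-04-10T19:07:13Z", ID: "run-1"}
	rows := []runRow{
		{Key: k, Status: "RUNNING"},
		{Key: k, Status: "COMPLETED"},
		{Key: k, Status: "QUEUED"},
	}
	fmt.Println(dedupeByLargestStatus(rows)[k].Status)
}
```

Without a retry count on the runs table this rule can still pick a stale status, for example a FAILED row from an earlier attempt over a later RUNNING one, which is the edge case raised above.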
Description
Removes status partitioning, which we think will noticeably improve performance in multiple parts of the system. It'll make updates easier since we won't ever need to move rows between partitions (which Postgres seems to not recommend), and it should improve query performance too. Also removed the old tenant index on the runs table in favor of the new one that supersedes it.
Same basic process as before:
(inserted_at, id) as the PK and are partitioned differently
Type of change