[BUG] Self hosted hatchet engine consistently corrupts its own database after around a week of being in use #3823

@fourslashw

Description

We have a recurring problem with the self-hosted Hatchet engine: it breaks completely after around 4–7 days of active use. When this happens, the engine stops assigning tasks to workers, stops updating workflow statuses, spams "context deadline exceeded" errors in the logs, and overloads the database. Disconnecting workers or restarting the engine does not help; the only fix we have found is to completely reset RabbitMQ and the PostgreSQL database and start from a clean slate. The issue also reproduces when we start a new Hatchet instance against the old database, with no workers connected and an empty RabbitMQ state.

We have two instances of the Hatchet engine in use: production and testing. The production one has only one tenant and a relatively static set of workers and workflows. The testing one has multiple tenants and a much bigger set of workers and workflows, separated with a namespace mechanism. This issue only happens on the testing instance.

Environment

  • SDK: TypeScript v1.10.3
  • Engine: Self-hosted v0.74.14
  • DB: db.m8g.2xlarge (8 vCPU, 32 GB memory), with work_mem set to 64 MB

Code to Reproduce, Logs, or Screenshots
We have four different "corrupted" databases on hand. Please suggest what we should investigate in them.

Top database load while the Hatchet engine tries to use a corrupted database:
Image

We have complete logs for the last couple of days, but it's difficult to find a specific point of interest: many different errors start to appear at different times. Sample of logs while the Hatchet engine is in the broken state:
Image
