bug: db migration + deployment race condition #1824

@krish-nvidia

Version

main

Describe the bug.

Migrations run once against the DB, but Kubernetes keeps old carbide-api pods alive until the rollout finishes. During that window, old code can run against the new schema (or new code against the old schema) and act on assumptions that no longer hold, leading to unexpected behavior.

Example

As a consequence of deploying #1610, many explored endpoints were deleted. This reset the preingestion state of these endpoints and led to mutating actions being performed on their assigned hosts.

The database migration sets machine_id on BMC machine_interfaces rows, but during a deployment, an old carbide-api pod still builds the site explorer underlay list with “only interfaces where machine_id is null.” After migration, those BMC rows no longer qualify, so the BMC IP drops out of the index. On the next tick, that pod treats the existing explored_endpoints row as orphaned, deletes it, and a later tick re-inserts it as a fresh endpoint with preingestion_state: initial.
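
The sequence above can be reproduced in a small simulation. This is a minimal sketch and not carbide-api code: the row shapes, the function names, the sample IP, and the `"complete"` state are hypothetical, and it assumes the later tick that re-inserts the endpoint is run by code whose index still includes the BMC address.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MachineInterface:
    ip: str
    machine_id: Optional[str]  # NULL for BMC rows before the migration


def build_underlay_index_old(interfaces):
    """Old-code filter from the report: only interfaces where machine_id is null."""
    return {i.ip for i in interfaces if i.machine_id is None}


def build_underlay_index_new(interfaces):
    """Assumed new-code behavior: BMC rows stay indexed once machine_id is set."""
    return {i.ip for i in interfaces}


def tick(index, explored_endpoints):
    """One site explorer pass over a dict that stands in for explored_endpoints."""
    # Rows whose address is no longer indexed are treated as orphaned and deleted.
    for ip in list(explored_endpoints):
        if ip not in index:
            del explored_endpoints[ip]
    # Indexed addresses without a row are inserted as fresh endpoints.
    for ip in index:
        explored_endpoints.setdefault(ip, {"preingestion_state": "initial"})


def simulate():
    """Return the BMC endpoint's preingestion state after each step (None = row deleted)."""
    bmc_ip = "10.0.0.5"  # hypothetical address
    interfaces = [MachineInterface(bmc_ip, None)]
    endpoints = {bmc_ip: {"preingestion_state": "complete"}}  # hypothetical state name
    history = []

    def record():
        row = endpoints.get(bmc_ip)
        history.append(row["preingestion_state"] if row else None)

    # Before the migration the old pod still indexes the BMC address.
    tick(build_underlay_index_old(interfaces), endpoints)
    record()

    # The migration sets machine_id on the BMC row.
    interfaces[0].machine_id = "machine-1"

    # The old pod ticks against the new schema: the row looks orphaned.
    tick(build_underlay_index_old(interfaces), endpoints)
    record()

    # A later tick that indexes the address re-inserts it from scratch.
    tick(build_underlay_index_new(interfaces), endpoints)
    record()
    return history


if __name__ == "__main__":
    print(simulate())
```

The point of the sketch is that no single step is wrong in isolation: the old filter was valid for the old schema, and the orphan cleanup is valid for a correct index. The data loss comes only from combining the old filter with the migrated rows.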

Next Steps

A potential mitigation is to adjust the Kubernetes rollout settings so that old and new carbide-api pods never run simultaneously during a deployment. However, this requires understanding how the timing of the database migration aligns with the bring-up of the new pod.
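
To illustrate why the migration timing matters, here is a rough sketch that lists the windows in which code and schema disagree. It assumes a simplified model in which the migration takes effect at a single instant, all old pods stop at one time, and all new pods start at one time. The function name and the timestamps (seconds since the deployment began) are hypothetical.

```python
def mismatch_windows(migration_at, old_pods_stop_at, new_pods_start_at):
    """Return (label, start, end) intervals where running code and schema disagree."""
    windows = []
    # Old pods that outlive the migration run old code against the new schema.
    if old_pods_stop_at > migration_at:
        windows.append(("old code, new schema", migration_at, old_pods_stop_at))
    # New pods that start before the migration run new code against the old schema.
    if new_pods_start_at < migration_at:
        windows.append(("new code, old schema", new_pods_start_at, migration_at))
    return windows


if __name__ == "__main__":
    # Rolling update: old pods linger well after the migration.
    print(mismatch_windows(migration_at=10, old_pods_stop_at=40, new_pods_start_at=15))
    # No pod overlap, but the migration runs before the old pods are gone.
    print(mismatch_windows(migration_at=10, old_pods_stop_at=20, new_pods_start_at=25))
    # No pod overlap, and the migration runs between teardown and bring-up.
    print(mismatch_windows(migration_at=22, old_pods_stop_at=20, new_pods_start_at=25))
```

In this model, preventing old and new pods from overlapping (for example with a `Recreate`-style rollout) is not sufficient on its own: the second case still leaves old code running against the new schema. Only the third ordering, where the migration runs after the old pods stop and before the new pods start, has no mismatch window.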

Code of Conduct

  • I agree to follow NCX Infra Controller's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report

Metadata

Assignees: No one assigned
Labels: bug (A defect in existing software; deprecated - use issue type, but it's needed for reporting now)
Type: No fields configured for Bug.
Projects: Status: Triage
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests