Version
main
Describe the bug.
Migrations run once against the database, but Kubernetes keeps old carbide-api pods alive until the rollout finishes. During that window, old code can run against the new schema (or new code against the old one) and behave incorrectly.
Example
As a consequence of deploying #1610, many explored endpoints were deleted. This reset the preingestion state of these endpoints and caused mutating actions to be performed on their assigned hosts.
The database migration sets `machine_id` on BMC `machine_interfaces` rows, but during a deployment, an old carbide-api pod still builds the site explorer underlay list with “only interfaces where machine_id is null.” After migration, those BMC rows no longer qualify, so the BMC IP drops out of the index. On the next tick, that pod treats the existing `explored_endpoints` row as orphaned, deletes it, and a later tick re-inserts it as a fresh endpoint with `preingestion_state: initial`.
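The sequence above can be reproduced with a small in-memory model. This is only an illustrative sketch, not the carbide-api code: the names `MachineInterface`, `old_underlay_ips`, and `explorer_tick`, the IP address, the machine ID, and the `"complete"` state are all hypothetical. Only the "machine_id is null" filter and the `initial` state come from the description.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class MachineInterface:
    ip: str
    machine_id: Optional[str]  # None until the migration sets it


def old_underlay_ips(interfaces: list[MachineInterface]) -> set[str]:
    """Filter used by the old pod: only interfaces where machine_id is null."""
    return {i.ip for i in interfaces if i.machine_id is None}


def explorer_tick(explored: dict[str, str], underlay_ips: set[str]) -> dict[str, str]:
    """One simplified tick: delete orphaned endpoints, insert unseen ones as 'initial'."""
    kept = {ip: state for ip, state in explored.items() if ip in underlay_ips}
    for ip in underlay_ips:
        kept.setdefault(ip, "initial")
    return kept


bmc = MachineInterface(ip="10.0.0.5", machine_id=None)
explored = {"10.0.0.5": "complete"}  # hypothetical "already preingested" state

# Before the migration, the old filter still sees the BMC, so nothing changes.
before = explorer_tick(explored, old_underlay_ips([bmc]))

# The migration sets machine_id; the old filter no longer matches the BMC row.
bmc.machine_id = "machine-1"
after_old_pod = explorer_tick(before, old_underlay_ips([bmc]))

# A later tick whose underlay list includes the BMC again re-inserts it from scratch.
after_later_tick = explorer_tick(after_old_pod, {bmc.ip})

print(before, after_old_pod, after_later_tick)
```

In this model, the endpoint loses its progress without any change to the hardware or the endpoint itself. The only trigger is the old filter being evaluated against migrated rows.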
Next Steps
A potential mitigation is to adjust the Kubernetes rollout settings so that old and new carbide-api pods never run simultaneously during a deployment. However, this requires understanding how the timing of the database migration aligns with the startup of the new pod.
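One way the rollout setting could look is a Deployment using the `Recreate` strategy, which terminates all existing pods before creating new ones. The sketch below builds the corresponding patch body. It assumes carbide-api is deployed as a Kubernetes Deployment currently using `RollingUpdate`; the helper name `recreate_strategy_patch` is hypothetical.

```python
import json


def recreate_strategy_patch() -> dict:
    """Build a patch body that switches a Deployment to the Recreate strategy."""
    return {
        "spec": {
            "strategy": {
                "type": "Recreate",
                # Clear rollingUpdate settings left over from RollingUpdate,
                # since they are not valid together with Recreate.
                "rollingUpdate": None,
            }
        }
    }


patch_json = json.dumps(recreate_strategy_patch())
print(patch_json)
```

The resulting JSON could be passed to `kubectl patch deployment <name> -p '<json>'`, with the actual Deployment name substituted. `Recreate` trades overlap for downtime during each rollout, and on its own it does not settle when the migration runs relative to the old pods shutting down, so the timing question above still applies.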