Skip to content

fix(ent): retry HasSBOM include-edge and PackageVersion bulk inserts on FK race#3042

Open
dgellman wants to merge 5 commits into
guacsec:mainfrom
dgellman:fix/hassbom-fk-race-retry
Open

fix(ent): retry HasSBOM include-edge and PackageVersion bulk inserts on FK race#3042
dgellman wants to merge 5 commits into
guacsec:mainfrom
dgellman:fix/hassbom-fk-race-retry

Conversation

@dgellman
Copy link
Copy Markdown

@dgellman dgellman commented May 7, 2026

Summary

Two ingest paths fail intermittently with PostgreSQL FK violations (SQLSTATE 23503) when concurrent transactions are racing on the same parent rows:

  • IngestHasSBOMs writes M2M edges into bill_of_materials_included_software_packages etc. that reference package_version_id rows committed in a sibling IngestPackages transaction. When the sibling has not yet committed, the M2M insert fails. Production telemetry has shown ~30–40% of resulting HasSBOM rows ending up with empty includedSoftware.
  • IngestPackages itself fails on its PackageVersion upsert when a concurrent transaction is mid-flight on the parent PackageName row.

Both paths now wrap their per-batch Exec(ctx) in a small retryOnFKViolation helper. On a PG foreign_key_violation it retries with bounded backoff (500 ms, then 1 s — 3 attempts total) and honors context cancellation. Non-FK errors propagate immediately.

What this does and does not address

This absorbs the race within typical commit-latency windows (~1.5 s). It does not eliminate the race architecturally. A deferred FK constraint (DEFERRABLE INITIALLY DEFERRED) or atomic package + SBOM ingest in a single transaction would remove it at the source. I deliberately did not bundle a schema migration or transaction-boundary refactor into this PR — those involve maintainer-level decisions about migration sequencing and the public ingest API. Happy to follow up with whichever direction maintainers prefer in a separate PR or issue.

Detection

isPGForeignKeyViolation matches *pq.Error.Code == "23503" specifically, not ent's generic IsConstraintError — unique and check-constraint violations are not retried because they are not expected to become valid on retry.

Files

  • pkg/assembler/backends/ent/backend/retry.go (new): detector + bounded-backoff helper, no ent dependency
  • pkg/assembler/backends/ent/backend/retry_test.go (new): unit tests for both helpers using fabricated *pq.Error values, no DB
  • pkg/assembler/backends/ent/backend/sbom.go: wrap the four updateHasSBOMWithInclude* per-batch Exec calls
  • pkg/assembler/backends/ent/backend/package.go: wrap PackageVersion.CreateBulk in upsertBulkPackage and PackageVersion.Create in upsertPackage

Test plan

Run on this branch (rebased onto current main):

  • go build ./... clean (exit 0)
  • go vet ./... clean (exit 0)
  • go test -race -timeout=30s ./... -count=1 — 79 packages OK, 0 fail, 0 data races
  • golangci-lint run ./pkg/assembler/backends/ent/backend/... — 0 issues in changed files (the 2 staticcheck warnings reported are pre-existing in search.go, unrelated to this PR)
  • New unit tests in retry_test.go all pass: detection covers wrapped/by-value/doubly-wrapped *pq.Error, plain errors, nil, unique-violation negative case, check-violation negative case; retry behavior covers success-first-attempt, non-retryable propagation, recovery after transient FK, exhaustion after max attempts, and context cancellation

dgellman added 3 commits May 7, 2026 16:33
… 23503

Signed-off-by: Daniel Gellman <dgellman8@gmail.com>
Signed-off-by: Daniel Gellman <dgellman8@gmail.com>
… and upsertPackage

Signed-off-by: Daniel Gellman <dgellman8@gmail.com>
@dgellman dgellman force-pushed the fix/hassbom-fk-race-retry branch from 2337bfa to b936869 Compare May 7, 2026 20:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants