[DBMON-6602] Avoid cleanup when cancel called while check running#23728
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 566c265ec1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| for job in self._async_jobs: | ||
| job.cancel() |
There was a problem hiding this comment.
Prevent canceled sync jobs from running later
When cancellation happens while check() is still in flight but before it reaches the DBM/data-observability run_job_loop calls, this only sets each job's cancel event. DBMAsyncJob.run_job_loop still executes the _run_sync_job_rate_limited() path when run_sync is enabled, and at least PostgresDataObservability.run_job() does not check the event before opening connections, so custom queries can still be issued after the instance was unscheduled. Please make the in-flight check skip these jobs once cancellation has been requested, or have the sync path no-op on a set cancel event.
Useful? React with 👍 / 👎.
Codecov Report❌ Patch coverage is Additional details and impacted files🚀 New features to boost your workflow:
|
🎉 All green!❄️ No new flaky tests detected 🎯 Code Coverage (details) 🔗 Commit SHA: b40f036 | Docs | Datadog PR Page | Give us feedback! |
Validation ReportAll 20 validations passed. Show details
|
What does this PR do?
Fixes a race condition where
cancel()could destroy check state (close database connections, null_query_manager,_db, etc.) whilecheck()is still running on another thread. This caused SIGSEGV crashes in libpq during cluster check rebalancing when multiple Postgres checks were unscheduled simultaneously.The fix splits cancel into two phases:
_cancel_async_jobs): sets cancel events on async job threads. Safe to run concurrently withcheck()._finalize): joins async job threads, closes connections, and nulls state. Only runs whencheck()is guaranteed idle.A lock coordinates
run()andcancel()to ensure_finalize()executes exactly once — either bycancel()directly (if the check is idle) or byrun()'s finally block (if the check is in-flight).The cleanup introduced in #23640 is required — it breaks reference cycles (check → async job → check, check → query manager → check, check → logger → check) and closes connections so that garbage collection can free the check object in a timely manner. The problem was not the cleanup itself but where it ran: directly inside
cancel(), which can execute concurrently withcheck(). This PR preserves all of that cleanup by moving it to_finalize(), which only runs whencheck()is guaranteed idle.Note: the base
AgentCheckclass currently lacks a formalized pattern for coordinatingcancel()with an in-flightrun(). Therun()/cancel()lock coordination is implemented directly in the Postgres check for now, with the intention of moving this into the base class (and potentially the Go-sideCheckWrapper) so all checks benefit from the same safety guarantees.Motivation
The base class documents that
cancel()"can be called while the check is running." PR #23640 added aggressive cleanup tocancel()(closing connections, nulling attributes) for GC improvements, which violated this contract. Whencancel()ran whilecheck()was mid-query via psycopg,_close_db()freed the underlying libpq socket. When the I/O completed andcheck()resumed, libpq dereferenced freed memory, producingSIGSEGV addr=0x0instead of a recoverable Python exception. This was observed during a cluster check rebalance (26 configs removed, 29 added) where 8 Postgres checks were cancelled in rapid succession.Review checklist (to be filled by reviewers)
qa/skip-qalabel if the PR doesn't need to be tested during QA.backport/<branch-name>label to the PR and it will automatically open a backport PR once this one is merged