[DBMON-6602] Avoid cleanup when cancel called while check running #23728

Merged
eric-weaver merged 4 commits into master from eric.weaver/DBMON-6602
May 18, 2026

@eric-weaver

Contributor

What does this PR do?

Fixes a race condition where cancel() could destroy check state (close database connections, null _query_manager, _db, etc.) while check() is still running on another thread. This caused SIGSEGV crashes in libpq during cluster check rebalancing when multiple Postgres checks were unscheduled simultaneously.

The fix splits cancel into two phases:

  • Signal phase (_cancel_async_jobs): sets cancel events on async job threads. Safe to run concurrently with check().
  • Finalize phase (_finalize): joins async job threads, closes connections, and nulls state. Only runs when check() is guaranteed idle.

A lock coordinates run() and cancel() to ensure _finalize() executes exactly once — either by cancel() directly (if the check is idle) or by run()'s finally block (if the check is in-flight).
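The coordination described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the class name `TwoPhaseCheck`, the flags `_running`, `_cancel_requested`, and `_finalized`, and the `finalize_calls` counter are all hypothetical, and only the method names `run`, `cancel`, `check`, `_cancel_async_jobs`, and `_finalize` come from the description.

```python
import threading


class TwoPhaseCheck:
    """Hypothetical stand-in for a check; not the actual Postgres check."""

    def __init__(self):
        self._lock = threading.Lock()        # guards the three flags below
        self._running = False
        self._cancel_requested = False
        self._finalized = False
        self._jobs_cancelled = threading.Event()
        self.finalize_calls = 0              # illustration only
        self.db = object()                   # placeholder for a connection

    def check(self):
        pass  # real work would use self.db here

    def _cancel_async_jobs(self):
        # Signal phase: only sets an event, so it is safe while check() runs.
        self._jobs_cancelled.set()

    def _finalize(self):
        # Finalize phase: tears down state; must not overlap with check().
        self.finalize_calls += 1
        self.db = None

    def run(self):
        with self._lock:
            if self._cancel_requested:
                return                       # cancelled earlier: do not start
            self._running = True
        try:
            self.check()
        finally:
            with self._lock:
                self._running = False
                must_finalize = self._cancel_requested and not self._finalized
                if must_finalize:
                    self._finalized = True   # claim the single finalize
            if must_finalize:
                self._finalize()

    def cancel(self):
        self._cancel_async_jobs()
        with self._lock:
            self._cancel_requested = True
            must_finalize = not self._running and not self._finalized
            if must_finalize:
                self._finalized = True       # claim the single finalize
        if must_finalize:
            self._finalize()
```

In this sketch, whichever side observes "cancel requested and check idle" first claims the finalize under the lock, so the teardown happens at most once and never while `check()` is executing.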

The cleanup introduced in #23640 is required — it breaks reference cycles (check → async job → check, check → query manager → check, check → logger → check) and closes connections so that garbage collection can free the check object in a timely manner. The problem was not the cleanup itself but where it ran: directly inside cancel(), which can execute concurrently with check(). This PR preserves all of that cleanup by moving it to _finalize(), which only runs when check() is guaranteed idle.
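As a rough illustration of why breaking the cycles matters, the sketch below shows a check ↔ job cycle under CPython's reference counting. The `Check` and `Job` classes and the `freed_without_gc` helper are hypothetical and exist only for this example; the behavior shown assumes CPython.

```python
import gc
import weakref


class Job:
    def __init__(self, check):
        self.check = check            # back-reference: job -> check


class Check:
    def __init__(self):
        self.job = Job(self)          # forward reference: check -> job

    def finalize(self):
        self.job = None               # break the check -> job -> check cycle


def freed_without_gc(finalize_first):
    """Return True if reference counting alone frees the check (CPython)."""
    was_enabled = gc.isenabled()
    gc.disable()                      # rule out the cycle collector
    try:
        check = Check()
        ref = weakref.ref(check)
        if finalize_first:
            check.finalize()
        del check                     # drop the last external reference
        return ref() is None
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()                  # clean up any leftover cycle
```

With the cycle intact, the object survives until the cycle collector runs; once the cycle is broken, dropping the last reference frees it immediately.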

Note: the base AgentCheck class currently lacks a formalized pattern for coordinating cancel() with an in-flight run(). The run()/cancel() lock coordination is implemented directly in the Postgres check for now, with the intention of moving this into the base class (and potentially the Go-side CheckWrapper) so all checks benefit from the same safety guarantees.

Motivation

The base class documents that cancel() "can be called while the check is running." PR #23640 added aggressive cleanup to cancel() (closing connections, nulling attributes) for GC improvements, which violated this contract. When cancel() ran while check() was mid-query via psycopg, _close_db() freed the underlying libpq socket. When the I/O completed and check() resumed, libpq dereferenced freed memory, producing SIGSEGV addr=0x0 instead of a recoverable Python exception. This was observed during a cluster check rebalance (26 configs removed, 29 added) where 8 Postgres checks were cancelled in rapid succession.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 566c265ec1

Comment thread postgres/datadog_checks/postgres/postgres.py
Comment on lines +537 to +538
for job in self._async_jobs:
    job.cancel()

P2: Prevent canceled sync jobs from running later

When cancellation happens while check() is still in flight but before it reaches the DBM/data-observability run_job_loop calls, this only sets each job's cancel event. DBMAsyncJob.run_job_loop still executes the _run_sync_job_rate_limited() path when run_sync is enabled, and at least PostgresDataObservability.run_job() does not check the event before opening connections, so custom queries can still be issued after the instance was unscheduled. Please make the in-flight check skip these jobs once cancellation has been requested, or have the sync path no-op on a set cancel event.
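One way the suggested guard could look is sketched below. This is an assumption-laden illustration, not the actual `DBMAsyncJob` code: `SketchJob`, its `runs` counter, and the boolean return value are hypothetical, and the real sync path does more than call `run_job()` directly.

```python
import threading


class SketchJob:
    """Hypothetical job; only the cancel-event idea mirrors the discussion."""

    def __init__(self, run_sync=True):
        self._cancel_event = threading.Event()
        self._run_sync = run_sync
        self.runs = 0                 # illustration only

    def cancel(self):
        # Signal only; nothing is torn down here.
        self._cancel_event.set()

    def run_job(self):
        self.runs += 1                # stand-in for opening connections, etc.

    def run_job_loop(self):
        # No-op once cancellation has been requested, including on the
        # synchronous path.
        if self._cancel_event.is_set():
            return False
        if self._run_sync:
            self.run_job()
        return True
```

Checking the event at the top of `run_job_loop` means a check that is already in flight skips the job instead of issuing queries for an unscheduled instance.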

@codecov

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 94.23077% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.49%. Comparing base (d54c6b8) to head (b40f036).
⚠️ Report is 8 commits behind head on master.


@datadog-prod-us1-5

datadog-prod-us1-5 Bot commented May 18, 2026

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 94.23%
Overall Coverage: 93.52% (+6.13%)

🔗 Commit SHA: b40f036

@dd-octo-sts

dd-octo-sts Bot commented May 18, 2026

Contributor

Validation Report

All 20 validations passed.

| Validation | Description | Status |
| --- | --- | --- |
| agent-reqs | Verify check versions match the Agent requirements file | Passed |
| ci | Validate CI configuration and Codecov settings | Passed |
| codeowners | Validate every integration has a CODEOWNERS entry | Passed |
| config | Validate default configuration files against spec.yaml | Passed |
| dep | Verify dependency pins are consistent and Agent-compatible | Passed |
| http | Validate integrations use the HTTP wrapper correctly | Passed |
| imports | Validate check imports do not use deprecated modules | Passed |
| integration-style | Validate check code style conventions | Passed |
| jmx-metrics | Validate JMX metrics definition files and config | Passed |
| labeler | Validate PR labeler config matches integration directories | Passed |
| legacy-signature | Validate no integration uses the legacy Agent check signature | Passed |
| license-headers | Validate Python files have proper license headers | Passed |
| licenses | Validate third-party license attribution list | Passed |
| metadata | Validate metadata.csv metric definitions | Passed |
| models | Validate configuration data models match spec.yaml | Passed |
| openmetrics | Validate OpenMetrics integrations disable the metric limit | Passed |
| package | Validate Python package metadata and naming | Passed |
| readmes | Validate README files have required sections | Passed |
| saved-views | Validate saved view JSON file structure and fields | Passed |
| version | Validate version consistency between package and changelog | Passed |

@eric-weaver eric-weaver added this pull request to the merge queue May 18, 2026
Merged via the queue into master with commit 4c7e8bb May 18, 2026
112 of 114 checks passed
@eric-weaver eric-weaver deleted the eric.weaver/DBMON-6602 branch May 18, 2026 17:30
@dd-octo-sts dd-octo-sts Bot added this to the 7.81.0 milestone May 18, 2026
AAraKKe pushed a commit that referenced this pull request May 19, 2026
…3728) (#23729)

* Avoid cleanup when cancel called while check running

* Add changelog

* check if cancelled before running jobs

* Add some debug lines for cancel flow

(cherry picked from commit 4c7e8bb)

Co-authored-by: Eric Weaver <eweaver755@gmail.com>