[Backport 7.80.x] [DBMON-6602] Avoid cleanup when cancel called while check running #23729

Merged
AAraKKe merged 1 commit into 7.80.x from backport-23728-to-7.80.x
May 19, 2026

dd-octo-sts Bot commented May 18, 2026

Backport 4c7e8bb from #23728.


What does this PR do?

Fixes a race condition where cancel() could destroy check state (closing database connections and nulling _query_manager, _db, and other attributes) while check() was still running on another thread. This caused SIGSEGV crashes in libpq during cluster check rebalancing when multiple Postgres checks were unscheduled simultaneously.

The fix splits cancel into two phases:

  • Signal phase (_cancel_async_jobs): sets cancel events on async job threads. Safe to run concurrently with check().
  • Finalize phase (_finalize): joins async job threads, closes connections, and nulls state. Only runs when check() is guaranteed idle.
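
The difference between the two phases can be sketched with a plain `threading.Event`. This is a minimal illustration, not the Postgres check's code: `AsyncJob`, `signal_cancel`, and the polling loop are hypothetical names standing in for the check's async job threads.

```python
import threading


class AsyncJob:
    """Hypothetical stand-in for an async job thread owned by a check."""

    def __init__(self):
        self._cancel_event = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self.iterations = 0

    def start(self):
        self._thread.start()

    def _loop(self):
        # Event.wait() returns True once the event is set, which ends the loop.
        while not self._cancel_event.wait(timeout=0.01):
            self.iterations += 1

    def signal_cancel(self):
        # Signal phase: non-blocking and touches no shared resources,
        # so it is safe to call while other work is still in flight.
        self._cancel_event.set()

    def join(self, timeout=None):
        # Finalize phase: blocks until the job thread has exited.
        self._thread.join(timeout)

    def is_alive(self):
        return self._thread.is_alive()


job = AsyncJob()
job.start()
job.signal_cancel()  # may run concurrently with the owner's work
job.join(timeout=5)  # should only run once the owner is idle
```

Setting the event only asks the thread to stop; joining it and releasing resources is deferred to the finalize step.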

A lock coordinates run() and cancel() to ensure _finalize() executes exactly once: either by cancel() directly (if the check is idle) or by run()'s finally block (if the check is in flight).
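
One way this coordination could look is sketched below. It is an assumption-laden illustration rather than the code in this PR: `CoordinatedCheck`, `SlowCheck`, and the `_running`, `_cancelled`, and `_finalized` flags are hypothetical, and only the method names `run`, `cancel`, `check`, `_cancel_async_jobs`, and `_finalize` come from the description above.

```python
import threading


class CoordinatedCheck:
    """Hypothetical check whose _finalize() runs exactly once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = False
        self._cancelled = False
        self._finalized = False
        self.finalize_calls = 0

    def check(self):
        pass  # placeholder for the real collection work

    def _cancel_async_jobs(self):
        pass  # signal phase: would set cancel events on job threads

    def _finalize(self):
        self.finalize_calls += 1  # would join threads and close connections

    def run(self):
        with self._lock:
            if self._cancelled:
                return  # cancelled before starting: skip the run entirely
            self._running = True
        try:
            self.check()
        finally:
            with self._lock:
                self._running = False
                must_finalize = self._cancelled and not self._finalized
                if must_finalize:
                    self._finalized = True
            if must_finalize:
                self._finalize()  # cancel() arrived while check() was running

    def cancel(self):
        self._cancel_async_jobs()  # always safe, even mid-check
        with self._lock:
            self._cancelled = True
            must_finalize = not self._running and not self._finalized
            if must_finalize:
                self._finalized = True
        if must_finalize:
            self._finalize()  # check is idle, so clean up right away


class SlowCheck(CoordinatedCheck):
    """Check whose check() blocks until released, to simulate in-flight work."""

    def __init__(self):
        super().__init__()
        self.started = threading.Event()
        self.release = threading.Event()

    def check(self):
        self.started.set()
        self.release.wait(timeout=5)


# In-flight case: cancel() defers cleanup to run()'s finally block.
inflight = SlowCheck()
worker = threading.Thread(target=inflight.run)
worker.start()
inflight.started.wait(timeout=5)
inflight.cancel()
calls_while_running = inflight.finalize_calls
inflight.release.set()
worker.join(timeout=5)

# Idle case: cancel() finalizes directly, and a repeat cancel() is a no-op.
idle = CoordinatedCheck()
idle.cancel()
idle.cancel()
```

In this sketch the lock only guards the flags. `_finalize()` itself runs outside the lock, so a slow cleanup does not block a concurrent `cancel()` call.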

The cleanup introduced in #23640 is still required: it breaks reference cycles (check → async job → check, check → query manager → check, check → logger → check) and closes connections so that garbage collection can free the check object in a timely manner. The problem was not the cleanup itself but where it ran: directly inside cancel(), which can execute concurrently with check(). This PR preserves all of that cleanup by moving it to _finalize(), which only runs when check() is guaranteed idle.
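
The effect of breaking a cycle can be shown with a small sketch, assuming CPython's reference counting. `Check`, `Job`, and `finalize` here are hypothetical stand-ins, not the integration's classes. With the cyclic collector disabled, the object is only freed immediately when the back-reference has been cleared first.

```python
import gc
import weakref


class Job:
    def __init__(self, check):
        self.check = check  # back-reference: job -> check


class Check:
    def __init__(self):
        self.job = Job(self)  # check -> job, completing the cycle

    def finalize(self):
        self.job = None  # break check -> job -> check


def freed_without_gc(break_cycle):
    """Return True if the check is freed by reference counting alone."""
    check = Check()
    ref = weakref.ref(check)
    if break_cycle:
        check.finalize()
    del check
    return ref() is None


gc.disable()  # rule out the cyclic collector for the comparison
try:
    freed_with_cleanup = freed_without_gc(True)
    freed_without_cleanup = freed_without_gc(False)
finally:
    gc.enable()
    gc.collect()  # reclaim the cycle left behind by the second call
```

Without the cleanup, the object stays alive until a later cyclic collection pass, which is the delay the cleanup is meant to avoid.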

Note: the base AgentCheck class currently lacks a formalized pattern for coordinating cancel() with an in-flight run(). The run()/cancel() lock coordination is implemented directly in the Postgres check for now, with the intention of moving this into the base class (and potentially the Go-side CheckWrapper) so all checks benefit from the same safety guarantees.

Motivation

The base class documents that cancel() "can be called while the check is running." PR #23640 added aggressive cleanup to cancel() (closing connections, nulling attributes) for GC improvements, which violated this contract. When cancel() ran while check() was mid-query via psycopg, _close_db() freed the underlying libpq socket. When the I/O completed and check() resumed, libpq dereferenced freed memory, producing SIGSEGV addr=0x0 instead of a recoverable Python exception. This was observed during a cluster check rebalance (26 configs removed, 29 added) where 8 Postgres checks were cancelled in rapid succession.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

…3728)

* Avoid cleanup when cancel called while check running

* Add changelog

* check if cancelled before running jobs

* Add some debug lines for cancel flow

(cherry picked from commit 4c7e8bb)

dd-octo-sts Bot commented May 18, 2026

Validation Report

All 20 validations passed.

| Validation | Description | Status |
| --- | --- | --- |
| agent-reqs | Verify check versions match the Agent requirements file | Passed |
| ci | Validate CI configuration and Codecov settings | Passed |
| codeowners | Validate every integration has a CODEOWNERS entry | Passed |
| config | Validate default configuration files against spec.yaml | Passed |
| dep | Verify dependency pins are consistent and Agent-compatible | Passed |
| http | Validate integrations use the HTTP wrapper correctly | Passed |
| imports | Validate check imports do not use deprecated modules | Passed |
| integration-style | Validate check code style conventions | Passed |
| jmx-metrics | Validate JMX metrics definition files and config | Passed |
| labeler | Validate PR labeler config matches integration directories | Passed |
| legacy-signature | Validate no integration uses the legacy Agent check signature | Passed |
| license-headers | Validate Python files have proper license headers | Passed |
| licenses | Validate third-party license attribution list | Passed |
| metadata | Validate metadata.csv metric definitions | Passed |
| models | Validate configuration data models match spec.yaml | Passed |
| openmetrics | Validate OpenMetrics integrations disable the metric limit | Passed |
| package | Validate Python package metadata and naming | Passed |
| readmes | Validate README files have required sections | Passed |
| saved-views | Validate saved view JSON file structure and fields | Passed |
| version | Validate version consistency between package and changelog | Passed |

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 94.23077% with 6 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (7.80.x@ff792eb). Learn more about missing BASE report.

@datadog-datadog-prod-us1-2

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 94.23%
Overall Coverage: 93.77%

🔗 Commit SHA: a79e913

AAraKKe merged commit 5d19587 into 7.80.x on May 19, 2026
80 checks passed
AAraKKe deleted the backport-23728-to-7.80.x branch on May 19, 2026 08:12
dd-octo-sts Bot added this to the 7.79.0 milestone on May 19, 2026