fix: [SLES-2810] prune sorted_reparenting_info on context release to stop warning flood #1161

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 1 commit into main from tianning.li/SLES-2810-leak-parent on Apr 6, 2026
@litianningdatadog
Contributor

Summary

  • Root cause: After on_platform_report removes a context from context_buffer, the corresponding ReparentingInfo entry was left in sorted_reparenting_info indefinitely (capacity 500). Every subsequent trace batch caused update_reparenting to iterate all stale entries and emit a WARN for each one, producing a flood of "Mismatched request info. Context not found for request_id" messages in CloudWatch.
  • Fix: Call sorted_reparenting_info.retain(...) immediately after context_buffer.remove(request_id) in on_platform_report to prune the completed invocation's entry.
  • Tests: Two regression tests added — one verifying the entry is pruned after on_platform_report, one reproducing the exact production sequence (invoke → add_reparenting → report → trace batch) to confirm stale entries no longer appear in get_reparenting_info().
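The leak and the fix described above can be sketched with a simplified model. This is not the extension's actual code: the `Processor` struct, the `HashMap`-backed `context_buffer`, the `VecDeque` storage, and the stripped-down `ReparentingInfo` are all assumptions made for illustration. Only the names `on_platform_report`, `update_reparenting`, `context_buffer`, and `sorted_reparenting_info` come from the PR description.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical stand-in for the real ReparentingInfo; only request_id matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct ReparentingInfo {
    pub request_id: String,
}

/// Simplified, assumed model of the state touched by the fix.
#[derive(Default)]
pub struct Processor {
    /// Live invocation contexts keyed by request_id (assumed representation).
    pub context_buffer: HashMap<String, ()>,
    /// Reparenting entries in insertion order (assumed representation).
    pub sorted_reparenting_info: VecDeque<ReparentingInfo>,
}

impl Processor {
    /// Registers a live context for a new invocation.
    pub fn on_invoke(&mut self, request_id: &str) {
        self.context_buffer.insert(request_id.to_string(), ());
    }

    /// Records a reparenting entry for the invocation.
    pub fn add_reparenting(&mut self, request_id: &str) {
        self.sorted_reparenting_info.push_back(ReparentingInfo {
            request_id: request_id.to_string(),
        });
    }

    /// Releases the context, then prunes the matching reparenting entry.
    /// Without the `retain` call the entry would outlive its context.
    pub fn on_platform_report(&mut self, request_id: &str) {
        self.context_buffer.remove(request_id);
        self.sorted_reparenting_info
            .retain(|info| info.request_id != request_id);
    }

    /// Returns the request_ids that would hit the "Context not found" path
    /// on a trace batch, instead of logging a WARN for each.
    pub fn update_reparenting(&self) -> Vec<String> {
        self.sorted_reparenting_info
            .iter()
            .filter(|info| !self.context_buffer.contains_key(&info.request_id))
            .map(|info| info.request_id.clone())
            .collect()
    }
}
```

In this model, replaying invoke → add_reparenting → report → trace batch leaves no stale entry for `update_reparenting` to report, which is the property the two regression tests are described as checking.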

Fixes https://datadoghq.atlassian.net/browse/SLES-2810

Test plan

  • cargo test test_reparenting_info_pruned_after_on_platform_report passes
  • cargo test test_update_reparenting_ignores_completed_invocations passes
  • Full lifecycle test suite: 218 passed, 0 failed

🤖 Generated with Claude Code

…stop warning flood

After on_platform_report removes a context from context_buffer, the
corresponding ReparentingInfo entry was left in sorted_reparenting_info
indefinitely. Every subsequent trace batch caused update_reparenting to
iterate all stale entries and emit a WARN for each one, producing a
flood of "Mismatched request info. Context not found for request_id"
messages in CloudWatch.

Fix: retain only entries whose request_id matches a live context when
releasing in on_platform_report. Add two regression tests covering the
pruning behaviour and the update_reparenting read path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
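
The commit message phrases the fix as retaining only entries whose request_id matches a live context. A minimal sketch of that predicate follows, assuming the entries are reduced to their request IDs and the live contexts are available as a set. The helper name `prune_to_live_contexts` and both parameter types are hypothetical and do not appear in the PR.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical helper: keep only entries whose request_id still has a live context.
/// `VecDeque::retain` preserves the relative order of the surviving elements,
/// so an already-sorted buffer stays sorted after pruning.
pub fn prune_to_live_contexts(
    sorted_reparenting_info: &mut VecDeque<String>,
    live_request_ids: &HashSet<String>,
) {
    sorted_reparenting_info.retain(|request_id| live_request_ids.contains(request_id));
}
```

Compared with removing only the entry for the request being reported, this form also clears any older entries whose contexts are already gone, at the cost of a membership check per entry.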
Copilot AI review requested due to automatic review settings April 3, 2026 20:13
@litianningdatadog litianningdatadog requested a review from a team as a code owner April 3, 2026 20:13
@litianningdatadog litianningdatadog marked this pull request as draft April 3, 2026 20:13
Contributor

Copilot AI left a comment

Pull request overview

This pull request fixes SLES-2810, where stale ReparentingInfo entries were accumulating in the sorted_reparenting_info buffer and causing a flood of warning messages. When invocations completed and their contexts were removed, the corresponding reparenting entries were left behind, causing update_reparenting() to emit warnings for every subsequent trace batch.

Changes:

  • Added pruning logic in on_platform_report() to remove stale reparenting entries when their contexts are released
  • Added a helper function make_trace_sender() to reduce test code duplication
  • Added two regression tests to verify the fix works correctly

@litianningdatadog litianningdatadog marked this pull request as ready for review April 3, 2026 21:33
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d bot merged commit b67655d into main Apr 6, 2026
57 checks passed
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d bot deleted the tianning.li/SLES-2810-leak-parent branch April 6, 2026 17:51