Skip to content

fix(memory-monitor): lower fail-fast thresholds and humanize diagnostic messages#995

Merged
Patrick Nilan (pnilan) merged 4 commits intomainfrom
devin/1776796633-lower-memory-thresholds
Apr 23, 2026
Merged

fix(memory-monitor): lower fail-fast thresholds and humanize diagnostic messages#995
Patrick Nilan (pnilan) merged 4 commits intomainfrom
devin/1776796633-lower-memory-thresholds

Conversation

@devin-ai-integration
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot commented Apr 21, 2026

Summary

Lowers the CDK memory monitor's fail-fast thresholds so syncs near OOM reliably raise AirbyteTracedException (surfacing as a visible failure) instead of being OOM-killed silently by the kernel, and cleans up the operator-facing diagnostic surface (human-readable byte values, a startup config log, and a clarified docstring).

Reported by Patrick: connectors on CDK/SDM 7.16.0+ are still failing silently because the current 98%/95% gates sit too close to the OOM cliff — between two memory checks, allocations can jump a container straight past the gates into kernel OOM-kill.

1. Threshold changes (airbyte_cdk/utils/memory_monitor.py)

  • _CRITICAL_THRESHOLD: 0.98 → 0.95 — fires the AirbyteTracedException at 95% of the container limit instead of 98%, adding headroom for the check-interval race window.
  • _HIGH_PRESSURE_THRESHOLD: 0.95 → 0.90 — must sit strictly below the critical threshold so tightened polling (configured check_interval → 100 messages) activates before the fail-fast gate.
  • _ANON_SHARE_OF_USAGE_THRESHOLD: 0.85 → 0.85 (unchanged) — deliberately left untouched to preserve the current false-positive profile; a high anon share of current cgroup usage remains the gate for "is this actually heap pressure, or file-backed / page-cache memory?"

The dual-condition logic is unchanged: fail-fast still only fires when both (1) cgroup usage ≥ critical threshold AND (2) anonymous memory ≥ 85% of current cgroup usage.

2. Human-readable byte values in errors and logs

The previous internal_message format used raw bytes, which were hard to parse at a glance —

Cgroup memory: 2109915136 / 2147483648 bytes (98%). Anonymous memory (process RssAnon): 2101399552 bytes (99% of current cgroup usage). ...

A new _format_bytes() helper renders byte counts in decimal units (GB / MB / KB) with 2 decimal places. The fail-fast exception's internal_message and both critical-but-not-raising log paths now emit:

Cgroup memory: 2.11 GB / 2.15 GB (98%). Anonymous memory (process RssAnon): 2.10 GB (99% of current cgroup usage). Thresholds: cgroup >= 95%, anon share of usage >= 85%.

3. Startup config log

MemoryMonitor.__init__ now emits a single INFO line reporting the configured thresholds and check intervals so an operator inspecting sync logs can immediately confirm which CDK version's gates are active:

MemoryMonitor instantiated with critical threshold: 95%, anon share of usage threshold: 85%, high-pressure threshold: 90%, check interval: 5000 messages (tightens to 100 under high pressure).

The monitor is instantiated once per AirbyteEntrypoint, so this fires exactly once per source process (not per stream).

4. Docstring clarification

Addressed Copilot review feedback (comment 3119645981) by rewording the class and method docstrings so they no longer claim the interval tightens "from 5000 to 100 messages" unconditionally — the actual behavior tightens from the configured check_interval (default 5000) to 100 messages.

Review & Testing Checklist for Human

  • Confirm the new thresholds match operational intent. 95% critical is what Patrick specified; 90% high-pressure is the derived constraint (must sit below critical for tightened polling to fire first). Anon share stays at 85% to preserve the current false-positive profile.
  • Validate against real production telemetry. The hypothesis is that few healthy connectors sit stably in the 90–94% range. Spot-check a few representative high-memory connectors — if any live there under normal load, the 95% critical gate could produce new false-positive fail-fasts.
  • Release strategy. This ships to every Python CDK / SDM connector on the next CDK release. Recommended: cut a pre-release and bump a high-memory connector that has been silent-failing on 7.16.0+ first, run a production-sized sync in a memory-constrained container, and confirm the monitor emits a visible AirbyteTracedException with failure_type=system_error instead of exiting silently — before promoting to the general population.

Notes

  • No behavior changes beyond the numeric thresholds, the internal_message / log-string formatting, and the new startup INFO log. The fail-fast gate, dual-condition logic, graceful degradation paths, and one-shot logging policy are all unchanged.
  • Decimal units (GB = 10^9) were chosen over binary (GiB = 2^30) because they match how operators describe container limits and because Patrick's worked example used decimal.
  • All 36 memory_monitor unit tests pass locally. ruff check, ruff format, and mypy --config-file mypy.ini are clean on both edited files.

Link to Devin session: https://app.devin.ai/sessions/738dba1a2a1a4c15998cb1dd5d0cd487

…liff

Co-Authored-By: patrick.nilan@airbyte.io <patrick.nilan@airbyte.io>
@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions
Copy link
Copy Markdown

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1776796633-lower-memory-thresholds#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1776796633-lower-memory-thresholds

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment
📚 Show Repo Guidance

Helpful Resources

📝 Edit this welcome message.

@pnilan Patrick Nilan (pnilan) marked this pull request as ready for review April 21, 2026 18:43
Copilot AI review requested due to automatic review settings April 21, 2026 18:43
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts the Python CDK memory monitor thresholds to fail fast earlier under near-OOM conditions, improving visibility of memory-related failures (raising AirbyteTracedException) before the kernel OOM-kills the container.

Changes:

  • Lowered the critical and high-pressure thresholds in MemoryMonitor (0.98→0.95, 0.95→0.90) and updated inline/docs text accordingly.
  • Updated unit_tests/utils/test_memory_monitor.py mocks and assertions to align with the new threshold values.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
airbyte_cdk/utils/memory_monitor.py Lowers fail-fast/high-pressure thresholds and updates docstrings/comments to match new behavior.
unit_tests/utils/test_memory_monitor.py Renames/adjusts test fixtures and assertions to reflect the new threshold gates.
Comments suppressed due to low confidence (1)

unit_tests/utils/test_memory_monitor.py:147

  • This test’s comment/asser­tion is a bit misleading: the monitor doesn’t emit a debug “probe” message on the normal cgroup-v2 path, so caplog.records is typically empty and the all(... <= DEBUG ...) check reads oddly. Consider asserting directly that there are no INFO+ records (and updating the comment) to make the intent clearer and the test more robust to future debug logging changes.
def test_no_log_below_threshold(caplog: pytest.LogCaptureFixture) -> None:
    """No log should be emitted when usage is below the 90% high-pressure threshold."""
    monitor = MemoryMonitor(check_interval=1)
    with (
        caplog.at_level(logging.DEBUG, logger="airbyte"),
        patch.object(Path, "exists", _v2_exists),
        patch.object(Path, "read_text", _v2_mock_read(usage=_MOCK_USAGE_AT_85)),
    ):
        monitor.check_memory_usage()
    # Only the debug probe message, no info/warning
    assert all(r.levelno <= logging.DEBUG for r in caplog.records)


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread airbyte_cdk/utils/memory_monitor.py Outdated
…igured thresholds on init

- Add _format_bytes helper that renders byte counts as GB/MB/KB with 2 decimal places (decimal units).
- Replace raw byte values in the fail-fast internal_message and the two critical-but-not-raising log paths with _format_bytes(...).
- Emit a one-shot INFO log on MemoryMonitor instantiation reporting the configured thresholds and check intervals.
- Clarify docstrings that the tightened high-pressure check interval overrides the configurable check_interval (default 5000), addressing Copilot review feedback.
- Tests: parametrized _format_bytes coverage, instantiation-log assertion, and assertions that the internal_message uses the humanized format.

Co-Authored-By: patrick.nilan@airbyte.io <patrick.nilan@airbyte.io>
@devin-ai-integration devin-ai-integration Bot changed the title fix(memory-monitor): lower fail-fast thresholds to catch OOM before the cliff fix(memory-monitor): lower fail-fast thresholds and humanize diagnostic messages Apr 21, 2026
@pnilan
Copy link
Copy Markdown
Contributor

Patrick Nilan (pnilan) commented Apr 21, 2026

/prerelease

Prerelease Job Info

This job triggers the publish workflow with default arguments to create a prerelease.

Prerelease job started... Check job output.

✅ Prerelease workflow triggered successfully.

View the publish workflow run: https://github.com/airbytehq/airbyte-python-cdk/actions/runs/24740803773

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 21, 2026

PyTest Results (Fast)

4 034 tests  +12   4 023 ✅ +12   7m 14s ⏱️ -32s
    1 suites ± 0      11 💤 ± 0 
    1 files   ± 0       0 ❌ ± 0 

Results for commit db3a583. ± Comparison against base commit fd553bd.

This pull request removes 1 and adds 13 tests. Note that renamed tests count towards both.
unit_tests.utils.test_memory_monitor ‑ test_switches_to_high_pressure_check_interval_after_crossing_95_percent
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[0-0 B]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[1000-1.00 KB]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[1000000-1.00 MB]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[1000000000-1.00 GB]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[1500-1.50 KB]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[2109915136-2.11 GB]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[2147483648-2.15 GB]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[960000000-960.00 MB]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[999-999 B]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[999999-1000.00 KB]
…

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 21, 2026

PyTest Results (Full)

4 037 tests  +12   4 025 ✅ +12   11m 31s ⏱️ +44s
    1 suites ± 0      12 💤 ± 0 
    1 files   ± 0       0 ❌ ± 0 

Results for commit db3a583. ± Comparison against base commit fd553bd.

This pull request removes 1 and adds 13 tests. Note that renamed tests count towards both.
unit_tests.utils.test_memory_monitor ‑ test_switches_to_high_pressure_check_interval_after_crossing_95_percent
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[0-0 B]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[1000-1.00 KB]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[1000000-1.00 MB]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[1000000000-1.00 GB]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[1500-1.50 KB]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[2109915136-2.11 GB]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[2147483648-2.15 GB]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[960000000-960.00 MB]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[999-999 B]
unit_tests.utils.test_memory_monitor ‑ test_format_bytes[999999-1000.00 KB]
…

♻️ This comment has been updated with latest results.

…ions

Scenario tests assert an exact expected list of log messages. The new one-shot INFO log emitted by MemoryMonitor.__init__ is CDK infrastructure — not a connector-level sync log — so we filter it out in _verify_read_output before comparing against scenario.expected_logs, rather than adding it to every scenario.

Co-Authored-By: patrick.nilan@airbyte.io <patrick.nilan@airbyte.io>
OOM pressure near the cliff is a transient condition (workload/slice shape
dependent) rather than an internal CDK logic error, so a retry with fresh
process memory is meaningful. Flip failure_type from system_error to
transient_error so orchestrators can retry appropriately.

Co-Authored-By: patrick.nilan@airbyte.io <patrick.nilan@airbyte.io>
@pnilan Patrick Nilan (pnilan) merged commit 60bae81 into main Apr 23, 2026
26 of 27 checks passed
@pnilan Patrick Nilan (pnilan) deleted the devin/1776796633-lower-memory-thresholds branch April 23, 2026 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants