fix(memory-monitor): lower fail-fast thresholds and humanize diagnostic messages#995
Conversation
…liff Co-Authored-By: patrick.nilan@airbyte.io <patrick.nilan@airbyte.io>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksTesting This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1776796633-lower-memory-thresholds#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1776796633-lower-memory-thresholdsPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
There was a problem hiding this comment.
Pull request overview
Adjusts the Python CDK memory monitor thresholds to fail fast earlier under near-OOM conditions, improving visibility of memory-related failures (raising AirbyteTracedException) before the kernel OOM-kills the container.
Changes:
- Lowered the critical and high-pressure thresholds in
MemoryMonitor(0.98→0.95, 0.95→0.90) and updated inline/docs text accordingly. - Updated
unit_tests/utils/test_memory_monitor.pymocks and assertions to align with the new threshold values.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
airbyte_cdk/utils/memory_monitor.py |
Lowers fail-fast/high-pressure thresholds and updates docstrings/comments to match new behavior. |
unit_tests/utils/test_memory_monitor.py |
Renames/adjusts test fixtures and assertions to reflect the new threshold gates. |
Comments suppressed due to low confidence (1)
unit_tests/utils/test_memory_monitor.py:147
- This test’s comment/assertion is a bit misleading: the monitor doesn’t emit a debug “probe” message on the normal cgroup-v2 path, so
caplog.recordsis typically empty and theall(... <= DEBUG ...)check reads oddly. Consider asserting directly that there are no INFO+ records (and updating the comment) to make the intent clearer and the test more robust to future debug logging changes.
def test_no_log_below_threshold(caplog: pytest.LogCaptureFixture) -> None:
"""No log should be emitted when usage is below the 90% high-pressure threshold."""
monitor = MemoryMonitor(check_interval=1)
with (
caplog.at_level(logging.DEBUG, logger="airbyte"),
patch.object(Path, "exists", _v2_exists),
patch.object(Path, "read_text", _v2_mock_read(usage=_MOCK_USAGE_AT_85)),
):
monitor.check_memory_usage()
# Only the debug probe message, no info/warning
assert all(r.levelno <= logging.DEBUG for r in caplog.records)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…igured thresholds on init - Add _format_bytes helper that renders byte counts as GB/MB/KB with 2 decimal places (decimal units). - Replace raw byte values in the fail-fast internal_message and the two critical-but-not-raising log paths with _format_bytes(...). - Emit a one-shot INFO log on MemoryMonitor instantiation reporting the configured thresholds and check intervals. - Clarify docstrings that the tightened high-pressure check interval overrides the configurable check_interval (default 5000), addressing Copilot review feedback. - Tests: parametrized _format_bytes coverage, instantiation-log assertion, and assertions that the internal_message uses the humanized format. Co-Authored-By: patrick.nilan@airbyte.io <patrick.nilan@airbyte.io>
|
/prerelease
|
PyTest Results (Fast)4 034 tests +12 4 023 ✅ +12 7m 14s ⏱️ -32s Results for commit db3a583. ± Comparison against base commit fd553bd. This pull request removes 1 and adds 13 tests. Note that renamed tests count towards both.♻️ This comment has been updated with latest results. |
PyTest Results (Full)4 037 tests +12 4 025 ✅ +12 11m 31s ⏱️ +44s Results for commit db3a583. ± Comparison against base commit fd553bd. This pull request removes 1 and adds 13 tests. Note that renamed tests count towards both.♻️ This comment has been updated with latest results. |
…ions Scenario tests assert an exact expected list of log messages. The new one-shot INFO log emitted by MemoryMonitor.__init__ is CDK infrastructure — not a connector-level sync log — so we filter it out in _verify_read_output before comparing against scenario.expected_logs, rather than adding it to every scenario. Co-Authored-By: patrick.nilan@airbyte.io <patrick.nilan@airbyte.io>
OOM pressure near the cliff is a transient condition (workload/slice shape dependent) rather than an internal CDK logic error, so a retry with fresh process memory is meaningful. Flip failure_type from system_error to transient_error so orchestrators can retry appropriately. Co-Authored-By: patrick.nilan@airbyte.io <patrick.nilan@airbyte.io>
Summary
Lowers the CDK memory monitor's fail-fast thresholds so syncs near OOM reliably raise
AirbyteTracedException(surfacing as a visible failure) instead of being OOM-killed silently by the kernel, and cleans up the operator-facing diagnostic surface (human-readable byte values, a startup config log, and a clarified docstring).Reported by Patrick: connectors on CDK/SDM 7.16.0+ are still failing silently because the current 98%/95% gates sit too close to the OOM cliff — between two memory checks, allocations can jump a container straight past the gates into kernel OOM-kill.
1. Threshold changes (
airbyte_cdk/utils/memory_monitor.py)_CRITICAL_THRESHOLD: 0.98 → 0.95 — fires theAirbyteTracedExceptionat 95% of the container limit instead of 98%, adding headroom for the check-interval race window._HIGH_PRESSURE_THRESHOLD: 0.95 → 0.90 — must sit strictly below the critical threshold so tightened polling (configuredcheck_interval→ 100 messages) activates before the fail-fast gate._ANON_SHARE_OF_USAGE_THRESHOLD: 0.85 → 0.85 (unchanged) — deliberately left untouched to preserve the current false-positive profile; a high anon share of current cgroup usage remains the gate for "is this actually heap pressure, or file-backed / page-cache memory?"The dual-condition logic is unchanged: fail-fast still only fires when both (1) cgroup usage ≥ critical threshold AND (2) anonymous memory ≥ 85% of current cgroup usage.
2. Human-readable byte values in errors and logs
The previous
internal_messageformat used raw bytes, which were hard to parse at a glance —A new
_format_bytes()helper renders byte counts in decimal units (GB / MB / KB) with 2 decimal places. The fail-fast exception'sinternal_messageand both critical-but-not-raising log paths now emit:3. Startup config log
MemoryMonitor.__init__now emits a single INFO line reporting the configured thresholds and check intervals so an operator inspecting sync logs can immediately confirm which CDK version's gates are active:The monitor is instantiated once per
AirbyteEntrypoint, so this fires exactly once per source process (not per stream).4. Docstring clarification
Addressed Copilot review feedback (comment 3119645981) by rewording the class and method docstrings so they no longer claim the interval tightens "from 5000 to 100 messages" unconditionally — the actual behavior tightens from the configured
check_interval(default 5000) to 100 messages.Review & Testing Checklist for Human
AirbyteTracedExceptionwithfailure_type=system_errorinstead of exiting silently — before promoting to the general population.Notes
internal_message/ log-string formatting, and the new startup INFO log. The fail-fast gate, dual-condition logic, graceful degradation paths, and one-shot logging policy are all unchanged.memory_monitorunit tests pass locally.ruff check,ruff format, andmypy --config-file mypy.iniare clean on both edited files.Link to Devin session: https://app.devin.ai/sessions/738dba1a2a1a4c15998cb1dd5d0cd487