[bugfix] ODPS reader: pin pyodps==0.12.5.1 (retry connection resets) + log session/table context on errors#537
Conversation
…tion errors
_read_rows_arrow_with_retry only caught ODPSError, so a transient
requests.exceptions.ConnectionError ("Connection reset by peer", errno 104)
from the MaxCompute storage API escaped uncaught and killed the DataLoader
worker with no diagnostic context. _get_session_record_count had no error
handling at all.
Add a broad `except` to both call sites that logs the session_id, row range,
and table/quota/endpoint (plus the exception repr, which carries the ODPS
RequestId when present) before re-raising — diagnostics only, no new retry.
Also make the existing ODPSError retry non-silent: warn on each attempt and
error on exhaustion.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…repr _storage_debug_info put the raw pyodps Table object into the error string. str()/repr() of a Table dumps the entire column schema (hundreds of lines for wide feature tables) and accesses lazily-loaded fields, so formatting it inside the except block can fire a metadata reload exactly when the storage API is already failing. Use client.table.full_table_name (project.`table`) -- a compact, I/O-free identifier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
pyodps 0.12.6 replaced the urllib3 transport-level Retry on the storage-api
HTTP adapter with a Python-level filter that only retries ODPSError with
status 502/503/504. A transient connection reset ("Connection reset by
peer") is a requests.ConnectionError, not an ODPSError, so it is no longer
retried and kills the DataLoader worker. 0.13.0 has the same behavior. Pin
to 0.12.5.1, the last release whose urllib3 Retry retries connection resets.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ffb7800 to
30237b0
Compare
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| "table": client.table.full_table_name, | ||
| "quota": client._quota_name, | ||
| "endpoint": client._rest_endpoint, |
There was a problem hiding this comment.
debug_info is built eagerly on every call, before the try (lines 227 & 263) — so this runs on the success path too, not just on errors. It dereferences private pyodps attrs (_quota_name, _rest_endpoint). If a future pyodps bump renames one — and this PR exists precisely because of pyodps version sensitivity — _storage_debug_info itself raises AttributeError and breaks every read with an error unrelated to the actual failure, instead of producing the diagnostics it's meant to provide.
Consider making it defensive (fall back to a placeholder per field) and/or building the string lazily only on the error branch. OdpsReader already holds the quota/endpoint as public state, which avoids the private-attr coupling for the read path.
| self._raise() | ||
|
|
||
|
|
||
| class OdpsStorageErrorLogTest(unittest.TestCase): |
There was a problem hiding this comment.
The new tests raise requests.exceptions.ConnectionError, which is not an ODPSError, so both tests only exercise the generic except Exception branch (odps_dataset.py:251-253). The ODPSError retry loop — per-attempt logger.warning, the time.sleep backoff, and the "after N retries" logger.error on exhaustion (lines 237-250) — has no coverage. A refactor that broke the retry loop entirely would still pass both tests.
Consider adding a stub whose read_rows_arrow raises ODPSError N times then returns a sentinel reader (patch time.sleep to keep it fast): assert success-after-retry (warning count, no error) and exhaustion-after-3-retries (error text contains the retry count + session id).
Code review summaryFocused, well-scoped bugfix with an excellent PR description — the root-cause analysis (pyodps 0.12.6 dropping the urllib3-level connection-reset retry) and the deliberate Two items worth addressing (inline):
Minor / optional (not blocking):
Performance and documentation review: clean — |
… path _storage_debug_info was built before the try, so it ran on every (including successful) read and dereferenced pyodps client internals on the happy path. Build it only inside the except branches now -- via a small closure in _read_rows_arrow_with_retry (referenced in three branches) and inline in _get_session_record_count -- so the success path does no diagnostic work. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…with_retry The existing tests raise ConnectionError, which only exercises the generic except branch. Add a stub that raises ODPSError N times then returns a sentinel (time.sleep patched) to cover: retry-then-success (2 warnings, no error) and exhaustion after 3 retries (3 warnings + one error carrying the retry count and session id). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make the private-attr reads in _storage_debug_info defensive so a missing pyodps internal yields None instead of crashing the error-path diagnostic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| def _storage_debug_info(client: StorageApiArrowClient, **req_fields: Any) -> str: | ||
| """Format MaxCompute storage-api context for error logging.""" | ||
| parts = { | ||
| "table": client.table.full_table_name, |
There was a problem hiding this comment.
Minor / defensive: this helper runs only on the error path, but client.table.full_table_name is the one attribute accessed without a guard, while quota/endpoint below use getattr(..., None). If client.table ever raises (e.g. a partially-constructed client, or an unexpected AttributeError), the diagnostic log this PR exists to produce won't be emitted, and the original ConnectionError/ODPSError gets buried behind a chained secondary exception.
(I checked pyodps: full_table_name resolves via _get_schema_name() + already-set project.name/name, so it does not trigger a network reload() — so this is purely robustness, not the I/O risk the PR description hints at.)
Consider making the whole helper best-effort so logging can never raise:
try:
table = client.table.full_table_name
except Exception:
table = None| self.assertIn("sess-read-123", logged) | ||
| self.assertIn("test_project.test_table", logged) | ||
|
|
||
| def test_get_read_session_logs_session_and_reraises(self): |
There was a problem hiding this comment.
Coverage gap: _get_session_record_count's happy path is untested. This test only exercises the exception branch; the SessionStatus.INIT wait loop (time.sleep(1) → retry) and the final return scan_resp.record_count have no coverage, so a regression in the status comparison or the return value wouldn't be caught.
Consider a stub whose get_read_session returns INIT once then a ready status with a record_count, with time.sleep mocked, asserting the returned count and that it polled once. (Minor, also: none of the tests assert quota=/endpoint= actually appear in the logged string — worth one assertIn so a future pyodps rename of those private attrs doesn't silently log None while tests stay green.)
|
Code review summary Solid, well-scoped bugfix. The root-cause analysis (pyodps dropping urllib3-level connection-reset retry in 0.12.6) is convincing, the Reviewed across code quality, performance, tests, docs, and security. Nothing blocking. Two minor items posted inline:
Cleared with no concerns: performance (all new logic is on per-session/error paths, the per-batch |
Problem
MaxCompute (ODPS) training jobs intermittently crash with a transient network drop during a storage-api read:
The exception escapes the DataLoader worker and tears down the whole distributed job, leaving only a raw stack trace — no session id, table, or row range to tell which read failed.
Root cause
pyodps stopped retrying connection resets in 0.12.6. Up to 0.12.5.1, the storage-api HTTP adapter mounted a urllib3
Retry(total=retry_times, allowed_methods={GET,HEAD,DELETE}, status_forcelist=[502,503,504]). A connection reset surfaces inside urllib3 as aProtocolError, whichRetryclassifies as a retryable read error for idempotent methods — so it was retried up toretry_timestimes. 0.12.6 (and 0.13.0) dropped that adapter-level retry and moved retry into a Python filter that only retriesODPSErrorwith status 502/503/504; arequests.ConnectionErroris neither, so it is re-raised on the first attempt. Upgrading past 0.12.5.1 silently removed connection-reset resilience.On the tzrec side,
tzrec/datasets/odps_dataset.pyalso had two gaps:_read_rows_arrow_with_retryonly caughtODPSError(and retried silently), and_get_session_record_counthad no error handling at all — so an escaped reset died without any context.Change
pyodps==0.12.5.1(requirements/runtime.txt) to restore the transport-level urllib3 retry of connection resets._storage_debug_infoformats session id / row range / table / quota / endpoint, and both call sites log it plus the exceptionrepr(which carries the ODPSRequestIdwhen present). The existingODPSErrorretry path is made non-silent —warningper attempt,erroron exhaustion.client.table.full_table_name, not theTableobject —str(Table)dumps the entire column schema (hundreds of lines for wide feature tables) and touches lazily-loaded fields, which could fire a metadata reload inside theexceptblock exactly when the connection is already broken.Test
Ungated
OdpsStorageErrorLogTest(no credentials/network) drives a stub client that raisesConnectionErrorand asserts both functions re-raise and log the session id / table:🤖 Generated with Claude Code