[bugfix] ODPS reader: pin pyodps==0.12.5.1 (retry connection resets) + log session/table context on errors by tiankongdeguiji · Pull Request #537 · alibaba/TorchEasyRec

tiankongdeguiji · 2026-06-02T03:29:52Z

Problem

MaxCompute (ODPS) training jobs intermittently crash with a transient network drop during a storage-api read:

requests.exceptions.ConnectionError:
  ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

The exception escapes the DataLoader worker and tears down the whole distributed job, leaving only a raw stack trace — no session id, table, or row range to tell which read failed.

Root cause

pyodps stopped retrying connection resets in 0.12.6. Up to 0.12.5.1, the storage-api HTTP adapter mounted a urllib3 Retry(total=retry_times, allowed_methods={GET,HEAD,DELETE}, status_forcelist=[502,503,504]). A connection reset surfaces inside urllib3 as a ProtocolError, which Retry classifies as a retryable read error for idempotent methods — so it was retried up to retry_times times. 0.12.6 (and 0.13.0) dropped that adapter-level retry and moved retry into a Python filter that only retries ODPSError with status 502/503/504; a requests.ConnectionError is neither, so it is re-raised on the first attempt. Upgrading past 0.12.5.1 silently removed connection-reset resilience.

On the tzrec side, tzrec/datasets/odps_dataset.py also had two gaps: _read_rows_arrow_with_retry only caught ODPSError (and retried silently), and _get_session_record_count had no error handling at all — so an escaped reset died without any context.

Change

Pin pyodps==0.12.5.1 (requirements/runtime.txt) to restore the transport-level urllib3 retry of connection resets.
Log storage-api context before re-raising (diagnostics only — no new tzrec-side retry): _storage_debug_info formats session id / row range / table / quota / endpoint, and both call sites log it plus the exception repr (which carries the ODPS RequestId when present). The existing ODPSError retry path is made non-silent — warning per attempt, error on exhaustion.
Log client.table.full_table_name, not the Table object — str(Table) dumps the entire column schema (hundreds of lines for wide feature tables) and touches lazily-loaded fields, which could fire a metadata reload inside the except block exactly when the connection is already broken.

Test

Ungated OdpsStorageErrorLogTest (no credentials/network) drives a stub client that raises ConnectionError and asserts both functions re-raise and log the session id / table:

python -m tzrec.datasets.odps_dataset_test OdpsStorageErrorLogTest

🤖 Generated with Claude Code

…tion errors _read_rows_arrow_with_retry only caught ODPSError, so a transient requests.exceptions.ConnectionError ("Connection reset by peer", errno 104) from the MaxCompute storage API escaped uncaught and killed the DataLoader worker with no diagnostic context. _get_session_record_count had no error handling at all. Add a broad `except` to both call sites that logs the session_id, row range, and table/quota/endpoint (plus the exception repr, which carries the ODPS RequestId when present) before re-raising — diagnostics only, no new retry. Also make the existing ODPSError retry non-silent: warn on each attempt and error on exhaustion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…repr _storage_debug_info put the raw pyodps Table object into the error string. str()/repr() of a Table dumps the entire column schema (hundreds of lines for wide feature tables) and accesses lazily-loaded fields, so formatting it inside the except block can fire a metadata reload exactly when the storage API is already failing. Use client.table.full_table_name (project.`table`) -- a compact, I/O-free identifier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

pyodps 0.12.6 replaced the urllib3 transport-level Retry on the storage-api HTTP adapter with a Python-level filter that only retries ODPSError with status 502/503/504. A transient connection reset ("Connection reset by peer") is a requests.ConnectionError, not an ODPSError, so it is no longer retried and kills the DataLoader worker. 0.13.0 has the same behavior. Pin to 0.12.5.1, the last release whose urllib3 Retry retries connection resets. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ection-error-log

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-18T10:59:48Z

+        "table": client.table.full_table_name,
+        "quota": client._quota_name,
+        "endpoint": client._rest_endpoint,


debug_info is built eagerly on every call, before the try (lines 227 & 263) — so this runs on the success path too, not just on errors. It dereferences private pyodps attrs (_quota_name, _rest_endpoint). If a future pyodps bump renames one — and this PR exists precisely because of pyodps version sensitivity — _storage_debug_info itself raises AttributeError and breaks every read with an error unrelated to the actual failure, instead of producing the diagnostics it's meant to provide.

Consider making it defensive (fall back to a placeholder per field) and/or building the string lazily only on the error branch. OdpsReader already holds the quota/endpoint as public state, which avoids the private-attr coupling for the read path.

github-actions · 2026-06-18T10:59:52Z

+        self._raise()
+
+
+class OdpsStorageErrorLogTest(unittest.TestCase):


The new tests raise requests.exceptions.ConnectionError, which is not an ODPSError, so both tests only exercise the generic except Exception branch (odps_dataset.py:251-253). The ODPSError retry loop — per-attempt logger.warning, the time.sleep backoff, and the "after N retries" logger.error on exhaustion (lines 237-250) — has no coverage. A refactor that broke the retry loop entirely would still pass both tests.

Consider adding a stub whose read_rows_arrow raises ODPSError N times then returns a sentinel reader (patch time.sleep to keep it fast): assert success-after-retry (warning count, no error) and exhaustion-after-3-retries (error text contains the retry count + session id).

github-actions · 2026-06-18T11:00:06Z

Code review summary

Focused, well-scoped bugfix with an excellent PR description — the root-cause analysis (pyodps 0.12.6 dropping the urllib3-level connection-reset retry) and the deliberate full_table_name-over-str(Table) choice are clearly reasoned and correct. The new test is appropriately ungated (no creds/network), and the diagnostics genuinely improve on-call debuggability.

Two items worth addressing (inline):

_storage_debug_info is fragile + eager — built on every call before the try (success path included) and reads private pyodps attrs (_quota_name/_rest_endpoint). A future pyodps rename would make the diagnostic helper itself crash every read. Suggest defensive per-field fallback and/or lazy construction on the error branch.
Retry path untested — ConnectionError isn't an ODPSError, so the tests only hit the except Exception branch; the ODPSError retry loop (warning-per-attempt, backoff, exhaustion error) has no coverage.

Minor / optional (not blocking):

The except Exception branch re-raises connection errors with no tzrec-side retry — intentional per the PR (transport-layer retry is restored via the pin), but a one-line comment stating that would help future readers.
{e!r} is logged at several sites; safe today (endpoint is a host, not a signed URL; creds are never passed in), but logging exception repr is a recurring leak vector — str(e) or a small redaction pass would future-proof it.

Performance and documentation review: clean — .full_table_name correctly avoids the lazy metadata reload, debug_info is per-interval/session (not per-batch), and the docstring/comment are accurate and consistent with the module's conventions.

… path _storage_debug_info was built before the try, so it ran on every (including successful) read and dereferenced pyodps client internals on the happy path. Build it only inside the except branches now -- via a small closure in _read_rows_arrow_with_retry (referenced in three branches) and inline in _get_session_record_count -- so the success path does no diagnostic work. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…with_retry The existing tests raise ConnectionError, which only exercises the generic except branch. Add a stub that raises ODPSError N times then returns a sentinel (time.sleep patched) to cover: retry-then-success (2 warnings, no error) and exhaustion after 3 retries (3 warnings + one error carrying the retry count and session id). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Make the private-attr reads in _storage_debug_info defensive so a missing pyodps internal yields None instead of crashing the error-path diagnostic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-18T11:24:51Z

+def _storage_debug_info(client: StorageApiArrowClient, **req_fields: Any) -> str:
+    """Format MaxCompute storage-api context for error logging."""
+    parts = {
+        "table": client.table.full_table_name,


Minor / defensive: this helper runs only on the error path, but client.table.full_table_name is the one attribute accessed without a guard, while quota/endpoint below use getattr(..., None). If client.table ever raises (e.g. a partially-constructed client, or an unexpected AttributeError), the diagnostic log this PR exists to produce won't be emitted, and the original ConnectionError/ODPSError gets buried behind a chained secondary exception.

(I checked pyodps: full_table_name resolves via _get_schema_name() + already-set project.name/name, so it does not trigger a network reload() — so this is purely robustness, not the I/O risk the PR description hints at.)

Consider making the whole helper best-effort so logging can never raise:

try: table = client.table.full_table_name except Exception: table = None

github-actions · 2026-06-18T11:24:56Z

+        self.assertIn("sess-read-123", logged)
+        self.assertIn("test_project.test_table", logged)
+
+    def test_get_read_session_logs_session_and_reraises(self):


Coverage gap: _get_session_record_count's happy path is untested. This test only exercises the exception branch; the SessionStatus.INIT wait loop (time.sleep(1) → retry) and the final return scan_resp.record_count have no coverage, so a regression in the status comparison or the return value wouldn't be caught.

Consider a stub whose get_read_session returns INIT once then a ready status with a record_count, with time.sleep mocked, asserting the returned count and that it polled once. (Minor, also: none of the tests assert quota=/endpoint= actually appear in the logged string — worth one assertIn so a future pyodps rename of those private attrs doesn't silently log None while tests stay green.)

github-actions · 2026-06-18T11:25:10Z

Code review summary

Solid, well-scoped bugfix. The root-cause analysis (pyodps dropping urllib3-level connection-reset retry in 0.12.6) is convincing, the ==0.12.5.1 pin is the right call given the regression, and the diagnostic logging adds genuine value for debugging transient MaxCompute failures. The choice of full_table_name over str(Table) to avoid a lazy schema reload is correct, and the new tests cover the three meaningful paths (re-raise non-ODPSError, retry-then-succeed, retry-exhaustion). Nice diagnostics design (session_id/row_index per line aids disambiguation across distributed workers).

Reviewed across code quality, performance, tests, docs, and security. Nothing blocking. Two minor items posted inline:

_storage_debug_info robustness — client.table.full_table_name is the one unguarded access in an error-only helper; making it best-effort (like the getattr guards on quota/endpoint) ensures the diagnostic log never raises a secondary exception. Defensive only — verified it doesn't trigger network I/O.
Test coverage — the happy path of _get_session_record_count (the INIT wait loop + record_count return) is untested; and no test asserts quota=/endpoint= actually appear in the logged string.

Cleared with no concerns: performance (all new logic is on per-session/error paths, the per-batch reader.read() hot path is untouched), security (logged fields — table/quota/endpoint/RequestId — carry no credentials; helpers are stateless and worker-safe), and docs (no conflicting pyodps version references; version bump and docstrings follow convention).

tiankongdeguiji and others added 3 commits June 2, 2026 11:29

tiankongdeguiji force-pushed the bugfix/odps-mc-connection-error-log branch from ffb7800 to 30237b0 Compare June 18, 2026 10:02

tiankongdeguiji changed the title ~~[bugfix] ODPS reader: log session/table context on storage-api connection errors~~ [bugfix] ODPS reader: pin pyodps==0.12.5.1 (retry connection resets) + log session/table context on errors Jun 18, 2026

tiankongdeguiji and others added 2 commits June 18, 2026 18:04

Merge remote-tracking branch 'origin/master' into bugfix/odps-mc-conn…

1e59b81

…ection-error-log

[chore] bump version to 1.2.20

54a9b8e

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

tiankongdeguiji added the claude-review Let Claude Review label Jun 18, 2026

github-actions Bot removed the claude-review Let Claude Review label Jun 18, 2026

github-actions Bot reviewed Jun 18, 2026

View reviewed changes

tiankongdeguiji and others added 3 commits June 18, 2026 19:09

[bugfix] ODPS reader: read quota/endpoint via getattr(..., None)

60798f0

Make the private-attr reads in _storage_debug_info defensive so a missing pyodps internal yields None instead of crashing the error-path diagnostic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

tiankongdeguiji added the claude-review Let Claude Review label Jun 18, 2026

github-actions Bot removed the claude-review Let Claude Review label Jun 18, 2026

github-actions Bot reviewed Jun 18, 2026

View reviewed changes

chengaofei approved these changes Jun 22, 2026

View reviewed changes

tiankongdeguiji merged commit e2bc17d into alibaba:master Jun 22, 2026
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[bugfix] ODPS reader: pin pyodps==0.12.5.1 (retry connection resets) + log session/table context on errors#537

[bugfix] ODPS reader: pin pyodps==0.12.5.1 (retry connection resets) + log session/table context on errors#537
tiankongdeguiji merged 8 commits into
alibaba:masterfrom
tiankongdeguiji:bugfix/odps-mc-connection-error-log

tiankongdeguiji commented Jun 2, 2026 •

edited

Loading

Uh oh!

github-actions Bot Jun 18, 2026

Uh oh!

github-actions Bot Jun 18, 2026

Uh oh!

github-actions Bot commented Jun 18, 2026

Uh oh!

github-actions Bot Jun 18, 2026

Uh oh!

github-actions Bot Jun 18, 2026

Uh oh!

github-actions Bot commented Jun 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		self._raise()


		class OdpsStorageErrorLogTest(unittest.TestCase):

Uh oh!

Conversation

tiankongdeguiji commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause

Change

Test

Uh oh!

github-actions Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 18, 2026

Code review summary

Uh oh!

github-actions Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tiankongdeguiji commented Jun 2, 2026 •

edited

Loading