Fix: Eagerly initialize ML Diagnostics to prevent Protobuf descriptor collision crash #4092

Open
bvandermoon wants to merge 1 commit into main from fix-mldiagnostics-crash

@bvandermoon
Collaborator

Description

Eagerly initializes libtpu_metric at the beginning of MaxText's root initialization to prevent a fatal C++ Protobuf descriptor registration crash.

Context & Problem

Customers experience a fatal C++ Protobuf descriptor registration crash (File already exists in database: google/protobuf/timestamp.proto) when running MaxText training jobs with ML Diagnostics enabled (managed_mldiagnostics=True).

Root Cause

Lazy loading of ML Diagnostics occurs midway through the training flow. By then, JAX and other extensions have already registered timestamp.proto in the shared DescriptorPool. When libtpu_metric._initialize() is called later, it tries to register the same descriptor, causing an abort.
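The failing order described above can be illustrated with a toy model. This is only a sketch: `ToyDescriptorPool`, `DuplicateFileError`, and `simulate_lazy_load` are hypothetical names, and a Python exception stands in for the native abort. It does not use the real Protobuf API; it only shows that a registry keyed by file name rejects a second registration of `google/protobuf/timestamp.proto`.

```python
class DuplicateFileError(RuntimeError):
    """Stands in for the fatal abort raised by the native Protobuf runtime."""


class ToyDescriptorPool:
    """Toy stand-in for a process-wide descriptor database keyed by file name."""

    def __init__(self):
        self._owners = {}

    def register(self, file_name, owner):
        # A second registration of the same file name is treated as fatal.
        if file_name in self._owners:
            raise DuplicateFileError(
                f"File already exists in database: {file_name}"
            )
        self._owners[file_name] = owner


def simulate_lazy_load():
    """Replay the failing order: another library registers first, diagnostics second."""
    pool = ToyDescriptorPool()
    pool.register("google/protobuf/timestamp.proto", owner="jax")
    try:
        # Lazy initialization arrives late and collides with the existing entry.
        pool.register("google/protobuf/timestamp.proto", owner="libtpu_metric")
    except DuplicateFileError as err:
        return str(err)
    return None
```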

Solution

Eagerly initialize libtpu_metric at the beginning of MaxText's root initialization so that its registration happens before JAX is imported and calls cloud_tpu_init(). We use max_logging for cleaner logging integration and catch any Exception (e.g. if the package isn't installed) so that a failed initialization is logged instead of breaking the import.
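A minimal sketch of the eager-initialization pattern follows; it is not the code in this PR. The helper name `eagerly_initialize` and the log messages are hypothetical, the standard `logging` module stands in for max_logging to keep the example self-contained, and the call to `_initialize()` is assumed to take no arguments.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def eagerly_initialize(module_name="libtpu_metric", init_name="_initialize"):
    """Import a module and call its init hook; return True on success.

    Intended to run at the very top of a package's __init__.py, before
    any import that might register the same Protobuf descriptors.
    """
    try:
        module = importlib.import_module(module_name)
        getattr(module, init_name)()
    except Exception as err:  # broad on purpose: a missing package must not break import
        logger.warning("ML diagnostics initialization skipped: %s", err)
        return False
    logger.info("ML diagnostics initialization succeeded")
    return True
```

Returning a boolean instead of raising keeps the package importable on machines where the diagnostics package is absent, which matches the graceful-failure intent described above.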

Tests

Verified on a dedicated TPU VM:

  1. Checked that importing the maxtext package completes successfully and prints INFO:absl:ML diagnostics initialization without any Protobuf descriptor registration crash:
    python3 -c "from absl import logging; logging.set_verbosity(logging.INFO); import maxtext"
  2. Ran all unit tests in tests/unit/configs_test.py and they all passed:
    pytest ../tests/unit/configs_test.py
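The manual import check in step 1 could also be automated. The sketch below is an assumption, not part of this PR: `import_is_clean` and `COLLISION_MARKER` are hypothetical names. It imports the module in a fresh interpreter, because a native abort would otherwise take down the test process itself.

```python
import subprocess
import sys

# Fragment of the fatal message quoted in the problem description.
COLLISION_MARKER = "File already exists in database"


def import_is_clean(module_name, timeout=120):
    """Import module_name in a child interpreter and report whether it succeeded.

    Returns False if the child exits non-zero (including a native abort)
    or if the collision message appears in its output.
    """
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    output = result.stdout + result.stderr
    return result.returncode == 0 and COLLISION_MARKER not in output
```

On a TPU VM with MaxText installed, `import_is_clean("maxtext")` would be the call of interest.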

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov

codecov Bot commented Jun 5, 2026

Codecov Report

❌ Patch coverage is 75.00000% with 2 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/maxtext/__init__.py | 75.00% | 2 Missing ⚠️

@bvandermoon bvandermoon force-pushed the fix-mldiagnostics-crash branch from 317fbfb to 0731546 Compare June 6, 2026 00:40