Cache already downloaded HuggingFace shards.

niting · niting · commit 9812b96dad4f · 2026-05-21T15:59:33.000-07:00
Currently, shards seem to be re-downloaded every time they are required,
because `get_tensor` calls `hf_hub_download` on each lookup, which slows
down conversion. Running the script with this change shows a significant
improvement.
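
The caching pattern can be sketched in isolation as below. This is a minimal illustration, not the code in `to_maxtext.py`: `ShardPathCache`, `get_path`, and `fake_download` are hypothetical names, and `fake_download` stands in for `hf_hub_download` so the example runs without network access.

```python
from typing import Callable, Dict, List


class ShardPathCache:
  """Memoizes shard name -> local path so each shard is resolved once."""

  def __init__(self, resolve: Callable[[str], str]):
    # `resolve` is a stand-in for the expensive lookup (hypothetical).
    self._resolve = resolve
    self._local_shard_paths: Dict[str, str] = {}

  def get_path(self, shard_name: str) -> str:
    # Only call the resolver on a cache miss.
    if shard_name not in self._local_shard_paths:
      self._local_shard_paths[shard_name] = self._resolve(shard_name)
    return self._local_shard_paths[shard_name]


calls: List[str] = []


def fake_download(shard_name: str) -> str:
  # Records each invocation so the number of "downloads" can be checked.
  calls.append(shard_name)
  return f"/tmp/hf_cache/{shard_name}"


cache = ShardPathCache(fake_download)
for _ in range(3):
  cache.get_path("model-00001-of-00002.safetensors")
cache.get_path("model-00002-of-00002.safetensors")
print(len(calls))  # 2: one resolver call per distinct shard
```

Like the diff, this sketch takes no lock around the dictionary, so two threads that miss on the same shard at the same time may both call the resolver. That should be harmless as long as the resolver returns the same path for the same shard.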

Benchmark: 2-Layer Qwen3 MoE Checkpoint Conversion (Lazy Loading Enabled)

| Metric                       | Baseline (Cached) | Optimized (Phase 1 Only) | Speedup  |
|------------------------------|-------------------|--------------------------|----------|
| Sharding (Materialization)   | 81.6s (1.36 min)  | 16.2s (0.27 min)         | **5.0x** |
| Overall Elapsed              | 83.4s (1.39 min)  | 17.4s (0.29 min)         | **4.8x** |

Integration Tests (tests/integration/checkpoint_conversion_test.py):
- Baseline: 148.73s (2:28)
- Optimized: 77.33s (1:17) -> **1.9x speedup overall** (includes model download)
diff --git a/src/maxtext/checkpoint_conversion/to_maxtext.py b/src/maxtext/checkpoint_conversion/to_maxtext.py
@@ -116,6 +116,8 @@ def __init__(self, model_id, token, revision=None):
     self.shard_map = {}
     self.current_shard_name = None
     self.current_shard_content = {}
+    # Cache for resolved local shard paths
+    self._local_shard_paths = {}
     # Use a lock to serialize heavy RAM operations, but NOT downloads
     self._ram_lock = threading.Lock()
     self._initialize_index()
@@ -183,17 +185,21 @@ def get_tensor(self, key: str) -> np.ndarray:
       # You might need advanced fuzzy matching here if you encounter errors.
       raise ValueError(f"Key {key} not found in HF checkpoint index.")
 
-    if self.is_local:
-      local_path = os.path.join(self.model_id, shard_name)
+    if shard_name in self._local_shard_paths:
+      local_path = self._local_shard_paths[shard_name]
     else:
-      # STEP 1: Download outside the lock.
-      # multiple threads can download different shards at the same time.
-      local_path = hf_hub_download(
-          repo_id=self.model_id,
-          filename=shard_name,
-          token=self.token,
-          revision=self.revision,
-      )
+      if self.is_local:
+        local_path = os.path.join(self.model_id, shard_name)
+      else:
+        # STEP 1: Download outside the lock.
+        # multiple threads can download different shards at the same time.
+        local_path = hf_hub_download(
+            repo_id=self.model_id,
+            filename=shard_name,
+            token=self.token,
+            revision=self.revision,
+        )
+      self._local_shard_paths[shard_name] = local_path
 
     # STEP 2: Lock ONLY the reading into RAM.
     # This prevents multiple threads from simultaneously allocating large chunks of RAM.