tinker checkpoint delete: 32-way parallelism is ~2× slower than serial #45

@Butanium

Summary

tinker checkpoint delete parallelizes server requests with ThreadPoolExecutor(max_workers=32) (tinker/cli/commands/checkpoint.py:919, 937-958). For large delete jobs this is about 2× slower than a serial loop: roughly half the throughput, so each delete costs about twice the wall-clock time. The CLI's default concurrency therefore hurts the very workflow it is meant to speed up: cleaning up many checkpoints at once.

Reproduction

I had ~728 step-numbered LoRA checkpoints to clean up. I took two disjoint slices of 50 paths each and timed deletion via the SDK:

from tinker import ServiceClient
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

client = ServiceClient().create_rest_client()

# sample_a and sample_b are the two disjoint lists of 50 checkpoint paths
# mentioned above; how they were built is not shown here.

def delete_one(p):
    # Block until this delete request has completed.
    client.delete_checkpoint_from_tinker_path(p).result()

# Parallel (workers=32, matching the CLI default)
t0 = time.time()
with ThreadPoolExecutor(max_workers=32) as pool:
    for f in as_completed([pool.submit(delete_one, p) for p in sample_a]):
        f.result()  # re-raise any exception from the worker thread
print("parallel:", time.time() - t0)

# Serial
t0 = time.time()
for p in sample_b:
    delete_one(p)
print("serial:", time.time() - t0)

| Mode | n | elapsed | rate | avg s/delete |
|---|---|---|---|---|
| Parallel (workers=32) | 50 | 19.90s | 2.51/s | 0.398s |
| Serial (workers=1) | 50 | 10.47s | 4.78/s | 0.209s |

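Note that avg s/delete is elapsed time divided by n, i.e. the amortized wall-clock cost per delete, and that cost is ~2× higher under concurrent load. I did not time individual requests, but with up to 32 of them in flight at once, per-request latency is likely well above the amortized figure. Both points are consistent with server-side contention (a lock or serialization of the delete path on the backend).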
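To separate throughput from per-request latency, each call can be timed individually. The following is a minimal sketch, not the code used for the numbers above: `timed_call`, `benchmark`, and `fake_delete` are hypothetical names, and `fake_delete` only simulates a backend that serializes deletes behind a lock. It says nothing about how the real service behaves; swapping in the real `delete_one` would measure that.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def timed_call(fn, arg):
    """Run fn(arg) and return how long that single call took, in seconds."""
    start = time.perf_counter()
    fn(arg)
    return time.perf_counter() - start


def benchmark(fn, items, workers):
    """Return (elapsed_s, throughput_per_s, mean_latency_s) for one run."""
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(lambda item: timed_call(fn, item), items))
    elapsed = time.perf_counter() - t0
    return elapsed, len(items) / elapsed, mean(latencies)


# Hypothetical stand-in for delete_one: a global lock models a backend that
# handles one delete at a time. This is an assumption for illustration only.
_lock = threading.Lock()


def fake_delete(path):
    with _lock:
        time.sleep(0.005)


if __name__ == "__main__":
    paths = [f"fake-path-{i}" for i in range(40)]
    for workers in (1, 2, 4, 8, 32):
        elapsed, rate, latency = benchmark(fake_delete, paths, workers)
        print(f"workers={workers:2d} elapsed={elapsed:.2f}s "
              f"rate={rate:.1f}/s mean_latency={latency:.3f}s")
```

With a fully serialized stand-in like this one, throughput stays roughly flat as workers increase while mean latency grows, because each call spends most of its time waiting for the lock. A real slowdown in throughput, as reported above, would point to overhead beyond simple serialization.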

The serial rate held up over a follow-on cleanup of 600+ more paths:

[50/631]  ok=50  elapsed=10s  rate=4.91/s
[150/631] ok=150 elapsed=30s  rate=5.01/s
[250/631] ok=250 elapsed=48s  rate=5.17/s

A steady ~5/s serial rate means ~125s for 628 deletes. At the ~2.5/s measured with the CLI default of 32 workers, the same job would take ~250s.
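The estimate is just count divided by rate. As a quick check using the rates reported above (`eta_seconds` is a hypothetical helper name):

```python
def eta_seconds(n_items, rate_per_s):
    """Estimated wall-clock time to process n_items at a steady rate."""
    return n_items / rate_per_s


serial_eta = eta_seconds(628, 5.0)     # ~5/s from the serial progress log
parallel_eta = eta_seconds(628, 2.51)  # 2.51/s from the 50-path parallel run

print(f"serial ~{serial_eta:.1f}s, parallel ~{parallel_eta:.1f}s")
```

This assumes the 2.51/s rate from the 50-path sample would hold for the whole job, which was not measured.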

Versions

  • tinker SDK: 0.21.0 (CLI bundled)
  • Python 3.13.5

Suggested fix

Either:

  1. Lower _DELETE_CONCURRENCY in tinker/cli/commands/checkpoint.py:919 — based on the numbers above, 1 looks better than 32 here. Maybe benchmark at 2, 4, 8 to find the actual sweet spot — but 1 is already strictly better than the current default.
  2. Investigate the server-side contention — a ~2× per-request slowdown under 32-way concurrency suggests a hot lock that could be relaxed; if so the CLI's parallelism would be worth keeping.
  3. Expose --concurrency as a CLI flag so users can pick. Reasonable middle ground.
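Option 3 could look roughly like the sketch below. It is an illustration under assumptions, not the tinker CLI's actual code: `delete_many`, `build_parser`, and the `delete-sketch` program name are hypothetical, the sketch uses argparse regardless of what the real CLI uses, and the default of 1 simply follows the numbers in this issue.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def delete_many(paths, delete_fn, concurrency=1):
    """Call delete_fn on every path and return the number of paths processed."""
    if concurrency < 1:
        raise ValueError("concurrency must be >= 1")
    paths = list(paths)
    if concurrency == 1:
        # Plain loop: one request in flight at a time, no thread pool.
        for path in paths:
            delete_fn(path)
    else:
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            # list() drains the iterator so worker exceptions are re-raised.
            list(pool.map(delete_fn, paths))
    return len(paths)


def build_parser():
    """Hypothetical parser showing a user-selectable --concurrency flag."""
    parser = argparse.ArgumentParser(prog="delete-sketch")
    parser.add_argument("paths", nargs="+", help="checkpoint paths to delete")
    parser.add_argument(
        "--concurrency",
        type=int,
        default=1,
        help="number of delete requests in flight at once (default: 1)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # print stands in for the real delete call in this sketch.
    delete_many(args.paths, print, concurrency=args.concurrency)
```

Keeping the serial case as a plain loop makes the default path trivially easy to reason about, while users who find a better setting by benchmarking can still opt in.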

The current behavior is also confusing because intuition says "more workers = faster", but the opposite is true here. I lost ~2 minutes of cleanup time before I read the CLI source and re-tested serially.


Discovered by clement-dumas with Claude Code (Opus 4.7).

Co-Authored-By: Claude Opus 4.7 noreply@anthropic.com
