tinker checkpoint delete: 32-way parallelism is ~2× slower than serial

## Summary

`tinker checkpoint delete` parallelizes server requests with `ThreadPoolExecutor(max_workers=32)` ([cli/commands/checkpoint.py:919, 937-958](https://github.com/thinking-machines-lab/tinker/blob/main/tinker/cli/commands/checkpoint.py)). In my measurements this is about 2× **slower** than a serial loop, both in throughput and in amortized time per delete. So the CLI's default concurrency hurts the workflow it is meant to speed up (cleaning up many checkpoints at once).

## Reproduction

I had ~728 step-numbered LoRA checkpoints to clean up. I took two disjoint slices of 50 paths each (`sample_a` and `sample_b` below) and timed deletion via the SDK:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

from tinker import ServiceClient

client = ServiceClient().create_rest_client()

# sample_a and sample_b: two disjoint lists of 50 checkpoint paths each
# (their definitions are omitted here).

def delete_one(p):
    # Wait for this delete request to finish before returning.
    client.delete_checkpoint_from_tinker_path(p).result()

# Parallel (workers=32, matching the CLI default)
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=32) as pool:
    for f in as_completed([pool.submit(delete_one, p) for p in sample_a]):
        f.result()  # re-raise any exception from the worker thread
print("parallel:", time.perf_counter() - t0)

# Serial
t0 = time.perf_counter()
for p in sample_b:
    delete_one(p)
print("serial:", time.perf_counter() - t0)
```

| Mode | n | elapsed | rate | avg s/delete |
|---|---:|---:|---:|---:|
| Parallel (workers=32) | 50 | 19.90s | 2.51/s | 0.398s |
| Serial (workers=1) | 50 | 10.47s | 4.78/s | 0.209s |

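Note that `avg s/delete` is elapsed time divided by n, so it is the amortized wall-clock cost per delete, not a directly measured per-request latency. Even so, the backend completes about half as many deletes per second when requests overlap, so 32-way concurrency is not merely failing to help; it is actively making things worse. That is consistent with server-side contention (a lock or serialization of the delete path on the backend).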
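To look at per-request latency directly rather than elapsed/n, each call can be timed inside its worker thread. The following is a minimal sketch, not what I ran for the table above: `timed`, `bench`, and `fake_delete` are hypothetical names, and `fake_delete` is a stand-in that would be replaced by `delete_one` from the script above.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed(fn, arg):
    """Run fn(arg) and return how long that single call took, in seconds."""
    t0 = time.perf_counter()
    fn(arg)
    return time.perf_counter() - t0


def bench(fn, items, workers):
    """Return (wall-clock seconds, list of per-call latencies) for one run."""
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The timer starts when a worker picks the item up, so time spent
        # waiting in the executor's queue is not counted as latency.
        latencies = list(pool.map(lambda item: timed(fn, item), items))
    return time.perf_counter() - t0, latencies


def fake_delete(path):
    # Hypothetical stand-in for delete_one(p); swap in the real call.
    time.sleep(0.01)


if __name__ == "__main__":
    paths = [f"path-{i}" for i in range(20)]
    for workers in (1, 8):
        elapsed, lat = bench(fake_delete, paths, workers)
        print(f"workers={workers} elapsed={elapsed:.2f}s "
              f"median={statistics.median(lat):.3f}s max={max(lat):.3f}s")
```

Comparing the median and max latency at `workers=1` and `workers=32` would show whether individual requests slow down under load, and by how much.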

The serial rate held up over a follow-on cleanup of 600+ more paths:

```
[50/631]  ok=50  elapsed=10s  rate=4.91/s
[150/631] ok=150 elapsed=30s  rate=5.01/s
[250/631] ok=250 elapsed=48s  rate=5.17/s
```

Steady ~5/s serial → ~125s for 628 deletes. Extrapolating from the 2.51/s parallel sample, the CLI default would take ~250s for the same job.
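The arithmetic behind those two estimates is just count divided by rate. A small sketch, assuming each rate stays constant for the whole job (the parallel rate comes from a single 50-path sample, so that figure is an extrapolation); `estimated_seconds` is a hypothetical helper name:

```python
def estimated_seconds(n_deletes, rate_per_s):
    """Time to finish n_deletes at a constant rate in deletes per second."""
    return n_deletes / rate_per_s


serial = estimated_seconds(628, 5.0)     # steady serial rate from the log above
parallel = estimated_seconds(628, 2.51)  # rate from the 50-path parallel sample
print(f"serial ~{serial:.0f}s, parallel ~{parallel:.0f}s")
```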

## Versions

- tinker SDK: 0.21.0 (CLI bundled)
- Python 3.13.5

## Suggested fix

Either:

1. **Lower `_DELETE_CONCURRENCY`** in [`tinker/cli/commands/checkpoint.py:919`](https://github.com/thinking-machines-lab/tinker/blob/main/tinker/cli/commands/checkpoint.py). Based on the numbers above, 1 beats 32 here. It may be worth benchmarking 2, 4, and 8 to find the actual sweet spot, but `1` already outperformed the current default in this test.
2. **Investigate the server-side contention.** Throughput halving under 32-way concurrency suggests a hot lock that could be relaxed; if so, the CLI's parallelism would be worth keeping.
3. **Expose `--concurrency`** as a CLI flag so users can pick. Reasonable middle ground.
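
For option 1, a sweep over a few worker counts could locate the sweet spot. Here is a rough sketch under some assumptions: `sweep` and `fake_delete` are hypothetical names, `fake_delete` stands in for the real delete call, and each worker count gets its own disjoint slice because a checkpoint can only be deleted once.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def sweep(fn, items, worker_counts):
    """Time fn over disjoint, equal-sized slices of items, one per worker count.

    Returns a dict mapping worker count to calls completed per second.
    """
    size = len(items) // len(worker_counts)
    rates = {}
    for i, workers in enumerate(worker_counts):
        chunk = items[i * size:(i + 1) * size]
        t0 = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() waits for every call and re-raises any worker exception.
            list(pool.map(fn, chunk))
        rates[workers] = len(chunk) / (time.perf_counter() - t0)
    return rates


def fake_delete(path):
    # Hypothetical stand-in for the real delete; swap in delete_one(p).
    time.sleep(0.01)


if __name__ == "__main__":
    paths = [f"path-{i}" for i in range(40)]
    for workers, rate in sweep(fake_delete, paths, [1, 2, 4, 8]).items():
        print(f"workers={workers}: {rate:.1f}/s")
```

Against the real backend, slices of 50 or more paths per worker count would make the rates less noisy. Running the worker counts in a different order on a second pass would help rule out ordering effects.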

The current behavior is also confusing because intuition says "more workers = faster", but the opposite is true. Wasted ~2 minutes of my cleanup time before I read the CLI source and re-tested serially.

---

Discovered by clement-dumas with Claude Code (Opus 4.7).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
