All LoRA sampler checkpoints permanently hang on sample_async() — base model works fine #108

@xiayang851023

Bug Description

Calling sample_async() on a sampling client created from any LoRA checkpoint never returns; the call is cancelled by the client-side 60s timeout. Base model sampling works normally (~2.4s response time).

Environment

  • Tinker SDK: 0.18.2 (latest from PyPI)
  • Base model: Qwen/Qwen3-4B-Instruct-2507
  • LoRA rank: 16
  • 5 training runs, 31 checkpoints total, all show "Never" expiry in console

Reproduction

import asyncio

from tinker import ServiceClient, ModelInput, SamplingParams

async def test():
    sc = ServiceClient()

    # Base model: sample_async() works, ~2.4s
    base_sampler = await sc.create_sampling_client_async(base_model="Qwen/Qwen3-4B-Instruct-2507")

    # Any LoRA checkpoint: sample_async() hangs until the 60s timeout
    lora_sampler = await sc.create_sampling_client_async(
        model_path="tinker://0ac02541-f609-540f-860e-0bf885b7292e:train:0/sampler_weights/dpo_final"
    )
    # create_sampling_client_async() returns OK (0.5s),
    # but sample_async() on lora_sampler never returns; it times out after 60s with CancelledError

asyncio.run(test())

Tested checkpoints (all hang)

  • tinker://0ac02541-f609-540f-860e-0bf885b7292e:train:0/sampler_weights/dpo_final (created 23 min ago, 136 MB)
  • tinker://a735527d-a478-59b1-b510-e3b0211f4989:train:0/sampler_weights/dpo_final (created 2 weeks ago, 136 MB)
  • All show "Never" expiry in the Tinker console
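
To make the per-checkpoint comparison easier to repeat, here is a minimal sketch of a timing harness. It uses only the standard library. The names `probe_all` and `fake_probe` are hypothetical and not part of the Tinker SDK. In a real run, `probe` would presumably create a sampling client for the given path and await one sample_async() call; that wiring is left out here because the exact call arguments are not shown in this report.

```python
import asyncio
import time

async def probe_all(paths, probe, timeout_s):
    """Await probe(path) for each path and record (path, status, seconds)."""
    rows = []
    for path in paths:
        start = time.perf_counter()
        try:
            await asyncio.wait_for(probe(path), timeout=timeout_s)
            status = "ok"
        except asyncio.TimeoutError:
            # No reply arrived within timeout_s.
            status = "timeout"
        except Exception as exc:
            # An explicit failure (for example an HTTP error surfaced by a client).
            status = f"error: {type(exc).__name__}"
        rows.append((path, status, round(time.perf_counter() - start, 2)))
    return rows

async def fake_probe(path):
    # Hypothetical stand-in: "base" answers quickly, every other path never answers.
    if path == "base":
        await asyncio.sleep(0.01)
    else:
        await asyncio.get_running_loop().create_future()  # never resolved

rows = asyncio.run(probe_all(["base", "lora-a", "lora-b"], fake_probe, timeout_s=0.05))
for row in rows:
    print(row)
```

Separating "timeout" from "error" in the status column is what distinguishes a request that is never answered from one that is rejected.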

Call stack at timeout

sample_async() → _sample_async_impl() → _APIFuture.result_async()
  → asyncio.wait_for(future, timeout=60) → CancelledError

No server response is returned — not a 400/404/500 error. The request appears to be queued server-side and never processed.
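
For context on the CancelledError in the stack above, the sketch below shows how asyncio.wait_for behaves when the awaited work never completes: the inner awaitable is cancelled (it sees CancelledError), and the caller of wait_for normally receives a TimeoutError. This is plain asyncio with a hypothetical stand-in coroutine, not Tinker code. Whether the SDK re-raises the inner CancelledError to the user is an assumption I have not verified.

```python
import asyncio

async def pending_request(events):
    """Hypothetical stand-in for a request whose reply never arrives."""
    try:
        await asyncio.get_running_loop().create_future()  # never resolved
    except asyncio.CancelledError:
        events.append("inner: CancelledError")
        raise  # cancellation must propagate

async def call_with_timeout(timeout_s):
    events = []
    try:
        await asyncio.wait_for(pending_request(events), timeout=timeout_s)
    except asyncio.TimeoutError:
        events.append("outer: TimeoutError")
    return events

events = asyncio.run(call_with_timeout(0.05))
print(events)  # ['inner: CancelledError', 'outer: TimeoutError']
```

Either exception here is consistent with the future simply never being resolved, which matches the observation that no server response arrives at all.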

Related Issues

  • GitHub Issue #234 ("Stuck when sampling") was closed on 2026-03-27 by @YujiaBao with comment "Closing for now — if still experiencing, please reopen." That issue described a similar permanent hang in sample_async() (with a 7200s timeout) in a CUA RL workflow. If a fix was deployed when it was closed, it does not appear to have resolved the problem reported here.

Impact

All inference via LoRA checkpoints is blocked. DPO/SFT training works fine and checkpoints save successfully, but sampling from trained checkpoints is completely unusable.
