fix(training): live progress metrics persist mid-training
Two independent bugs suppressed all but the final PROGRESS emit, so job.metrics stayed None during training and only the post-training summary landed in the DB.
1. unsloth_train.py ProgressEmitter wrote PROGRESS:{...} without a leading newline. HuggingFace tqdm redraws its step progress bar with carriage returns and no trailing newline, so each PROGRESS record was appended to the tqdm output on the same line. The worker's parser used line.startswith('PROGRESS:'), which never matched the concatenated form. Prepending \n guarantees each PROGRESS emit lands on its own line regardless of tqdm state.
2. worker.py performed DB commits synchronously inside the subprocess stdout read loop. Moved writes to a daemon thread drained from a queue.Queue so stdout reads never block on DB latency. Added attempted/succeeded/failed counters for observability.
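A minimal sketch of the failure described in item 1 and the effect of the leading newline. It assumes the reader splits stdout on "\n" only; parse_progress, split_stdout, and the sample tqdm text are hypothetical stand-ins, not the actual code in worker.py or unsloth_train.py.

```python
import json

PREFIX = "PROGRESS:"


def parse_progress(line):
    """Return the decoded payload if the line is a PROGRESS record, else None."""
    if line.startswith(PREFIX):
        return json.loads(line[len(PREFIX):])
    return None


def split_stdout(raw):
    # Assumption: the reader splits on "\n" only, so "\r" stays inside a line.
    return raw.split("\n")


def collect(raw):
    """Parse every line of raw stdout and keep the PROGRESS payloads."""
    parsed = (parse_progress(line) for line in split_stdout(raw))
    return [p for p in parsed if p is not None]


# Illustrative tqdm-style bar: carriage return, no trailing newline.
tqdm_bar = "\r 10%|#         | 1/10"
payload = json.dumps({"step": 1, "loss": 2.5})

# Old emit: the record is glued onto the bar, so startswith() never matches.
before = tqdm_bar + PREFIX + payload + "\n"
# New emit: the leading "\n" puts the record on its own line.
after = tqdm_bar + "\n" + PREFIX + payload + "\n"

found_before = collect(before)
found_after = collect(after)
```

With the old emit the record is never recognized; with the leading newline it parses as a normal PROGRESS line.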
Live training metrics (step, epoch, loss, learning_rate) now land in job.metrics on every logging step and are visible via client.training.retrieve().
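The queue-drained writer from item 2 might look roughly like the sketch below. MetricsWriter, its commit callable, and fake_commit are hypothetical names for illustration; the real worker.py may structure this differently. Only queue.Queue, the daemon thread, and the attempted/succeeded/failed counters come from the description above.

```python
import queue
import threading


class MetricsWriter:
    """Persist metrics on a background thread so stdout reads never wait on the DB."""

    def __init__(self, commit):
        self._commit = commit  # hypothetical callable that persists one metrics dict
        self._queue = queue.Queue()
        self.attempted = 0
        self.succeeded = 0
        self.failed = 0
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def submit(self, metrics):
        # Called from the stdout read loop; returns immediately.
        self._queue.put(metrics)

    def _drain(self):
        while True:
            metrics = self._queue.get()
            if metrics is None:  # sentinel pushed by close()
                break
            self.attempted += 1
            try:
                self._commit(metrics)
                self.succeeded += 1
            except Exception:
                # A failed write is counted, not raised, so later writes still run.
                self.failed += 1

    def close(self):
        # Flush everything queued so far, then stop the thread.
        self._queue.put(None)
        self._thread.join()


stored = []


def fake_commit(metrics):
    # Stand-in for a DB commit; rejects records without a loss value.
    if "loss" not in metrics:
        raise ValueError("missing loss")
    stored.append(metrics)


writer = MetricsWriter(fake_commit)
writer.submit({"step": 1, "loss": 2.5})
writer.submit({"step": 2})
writer.close()
```

Only the writer thread touches the counters, so they need no lock in this sketch; a caller reading them while the thread is running would see approximate values.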