Commit a57e9e7
fix: Launchers - Kubernetes - Fix getting logs when logs are not valid UTF-8 (#277)
The Kubernetes client's default read_namespaced_pod_log path does a strict
.decode('utf8') over the full log payload before checking HTTP status.
When a pod with high-volume tqdm progress bars (block glyphs █▉▊▋▌▍▎▏,
3-byte UTF-8) runs with num_proc>1, concurrent writes to the same fd can
split a multi-byte glyph across a chunk boundary, leaving an orphaned
continuation byte. The strict decode throws UnicodeDecodeError, which
bubbles through the log-upload retry wrapper and marks an otherwise-healthy
training run as SYSTEM_ERROR.
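
The failure mode described above can be reproduced without a cluster. The following is a minimal sketch, not the launcher's code: the byte layout of `torn` is an assumed illustration of two interleaved writers, and `strict_decode_fails` is a hypothetical helper.

```python
# A tqdm block glyph is three bytes in UTF-8.
glyph = "█".encode("utf-8")  # b'\xe2\x96\x88'

# Assumed interleaving: one writer's glyph is cut after two bytes, another
# writer's line lands in between, and the last byte arrives afterwards.
torn = (
    b"77%|" + glyph + glyph[:2]
    + b"\nstep 10 loss=0.42\n"
    + glyph[2:] + b"|\n"
)


def strict_decode_fails(payload: bytes) -> bool:
    """Return True if a strict UTF-8 decode of the payload raises."""
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Decoding with errors="replace" never raises; invalid bytes become U+FFFD.
lenient = torn.decode("utf-8", errors="replace")
```

Here `strict_decode_fails(torn)` is true, while `lenient` still contains the complete `step 10 loss=0.42` line between the replacement characters.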
Fix: pass _preload_content=False to get the raw urllib3 response and decode
manually with errors="replace". This is applied to both the single-pod
(LaunchedKubernetesContainer.get_log) and multi-pod
(LaunchedKubernetesJob._get_log_by_pod_key) log-read paths.
A warning is logged whenever replacement characters are injected, so the
next occurrence is observable in Observe without requiring a separate
debug build.
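
A hedged sketch of the approach described above follows. It is not the code from this commit: `decode_pod_log` and `read_pod_log` are hypothetical names, and `core_v1` is assumed to be a `kubernetes.client.CoreV1Api` instance (or any object with a compatible `read_namespaced_pod_log` method), passed in so the sketch stays self-contained.

```python
import logging

logger = logging.getLogger(__name__)

REPLACEMENT_CHAR = "\ufffd"  # inserted by errors="replace" for invalid bytes


def decode_pod_log(raw: bytes, pod_name: str) -> str:
    """Decode log bytes leniently and warn if anything was replaced."""
    text = raw.decode("utf-8", errors="replace")
    # This also counts any U+FFFD already present in the pod's own output,
    # so the number is an upper bound on the substitutions made here.
    replaced = text.count(REPLACEMENT_CHAR)
    if replaced:
        logger.warning(
            "Log for pod %s contained invalid UTF-8; "
            "%d replacement character(s) present after decoding",
            pod_name,
            replaced,
        )
    return text


def read_pod_log(core_v1, name: str, namespace: str) -> str:
    """Fetch a pod log as raw bytes and decode it without raising."""
    # _preload_content=False asks the generated client to return the raw
    # urllib3 response instead of an already-decoded str.
    response = core_v1.read_namespaced_pod_log(
        name=name, namespace=namespace, _preload_content=False
    )
    return decode_pod_log(response.data, name)
```

Keeping the decode step in its own function lets both the single-pod and multi-pod read paths share the same warning behavior.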
The existing "Bad Request" catch for PodInitializing is unaffected:
the kubernetes client's status check runs outside the _preload_content
block and still raises ApiException with the correct reason phrase.
## User experience: before and after
### Orchestrator log-upload path (run lifecycle)
**Before:** the UnicodeDecodeError bubbles out of the retry wrapper. The run is marked `SYSTEM_ERROR`, no logs are uploaded, and all downstream tasks (e.g. Upload HF, Upload Training Summary) are skipped. The user sees a failed run with no log output and no indication that their training code was healthy.
**After:** the log is decoded successfully and uploaded. The run continues to completion. One or two progress-bar characters are replaced with the Unicode replacement character U+FFFD (rendered as `?` in the example output) at the point of corruption, but the rest of the log is intact and readable.
### API log-read path (viewing logs for a running execution)
**Before:** the request throws before returning a response. The user gets a 500 error in the UI when trying to view logs mid-run.
**After:** the full log is returned. The substituted character appears inline exactly where the torn byte was, typically mid-progress-bar where it is visually unnoticeable.
---
### Example log output
**Before** (UnicodeDecodeError thrown at byte 5,115,152 — nothing returned):
```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 5115152: unexpected end of data
```
**After** (log returned; `?` marks the single substituted byte at the corruption point):
```
2024-06-17T20:29:11Z Map (num_proc=64): 77%|██████████████████████████████████████████████████████████████████████████████████████▌ | 137412/178432 [00:34<00:10, 3991.03 examples/s]
2024-06-17T20:29:12Z Map (num_proc=64): 79%|███████████████████████████████████████████████████████████████?██████ | 140876/178432 [00:35<00:09, 3987.11 examples/s]
2024-06-17T20:29:13Z Map (num_proc=64): 81%|█████████████████████████████████████████████████████████████████████▏ | 144501/178432 [00:36<00:08, 3994.77 examples/s]
2024-06-17T20:29:55Z ***** Running training *****
2024-06-17T20:29:55Z Num examples = 144,501
2024-06-17T20:29:55Z Num Epochs = 3
```
The `?` on line 2 is where one torn block glyph was replaced. All structured log lines above and below it — training config, loss values, eval metrics — are fully intact.
Fixes: #2811

Parent: decae68
1 file changed
Lines changed: 26 additions & 3 deletions