
Commit 112746b

docs(examples/server): add health-check trap when switching to background load

1 parent f7fd76a

1 file changed: examples/server/README.md (51 additions, 0 deletions)
@@ -59,3 +59,54 @@ output = await loop.run_in_executor(None, lambda: pipeline(image_input.prompt, g
At this point, execution of the `pipeline` call is moved onto a [new thread](https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.run_in_executor), and the event loop stays free to handle other requests until the `pipeline` returns a result.
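
For reference, a minimal self-contained sketch of that call (the `generate` wrapper and the prompt-only call signature are illustrative, not the exact code in `server.py`):

```python
import asyncio

async def generate(pipeline, prompt: str):
    loop = asyncio.get_running_loop()
    # None selects the default ThreadPoolExecutor; the blocking diffusion call
    # runs there while the event loop keeps serving other requests.
    output = await loop.run_in_executor(None, lambda: pipeline(prompt))
    return output.images
```
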
Another important aspect of this implementation is creating a `pipeline` from `shared_pipeline`. The goal is to avoid loading the underlying model onto the GPU more than once, while still giving each request, which runs on its own thread, its own generator and scheduler. The scheduler, in particular, is not thread-safe, and sharing one scheduler across threads causes errors like `IndexError: index 21 is out of bounds for dimension 0 with size 21`.
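
One way to get that per-request isolation is `from_pipe`, which reuses the already-loaded components instead of reloading weights. A sketch, not necessarily the exact code in `server.py`:

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Per request: share the heavy components with `shared_pipeline`, but give this
# thread its own scheduler instance (schedulers keep mutable step state) and RNG.
pipeline = StableDiffusion3Pipeline.from_pipe(shared_pipeline)
pipeline.scheduler = pipeline.scheduler.from_config(pipeline.scheduler.config)
generator = torch.Generator(device="cuda").manual_seed(42)
```
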
## Production deployment notes

### Synchronous load (the default in this example) is the safe pattern

`server.py` calls `from_pretrained` at module import time, before `uvicorn.run(...)` binds the HTTP port. Whatever orchestrates the process (Kubernetes, Cloud Run, Vertex AI, AWS Fargate) does not see an open port until the model is fully on the GPU and ready to serve. No health-check work is required: the readiness probe naturally fails (connection refused) until the model is loaded, and naturally succeeds once it is.
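
Condensed to its shape (the model ID and port are illustrative, not a substitute for the real `server.py`):

```python
import torch
import uvicorn
from diffusers import StableDiffusion3Pipeline
from fastapi import FastAPI

app = FastAPI()

# Module-level load: the process blocks here until the weights are on the GPU...
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")

if __name__ == "__main__":
    # ...and only then does uvicorn bind the port, so "port open" implies "model ready".
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
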
This is fine for small models and slow rollouts. It is **not** fine for cold-start-sensitive deployments of large pipelines like SD3 or FLUX.2, where the orchestrator's startup probe (often only a few minutes by default) can time out before `from_pretrained` returns, and the replica is killed before it ever serves a request.

### Background-loading variant and the health-check trap

The standard fix is to spawn `from_pretrained` on a background thread inside FastAPI's lifespan and bind the port immediately. That moves the readiness signal from "is the port open?" to "what does `/health` return?", which introduces a subtle and very common bug:

> If `/health` returns `200 OK` from the moment the process starts (for example, because the route is just `return {"status": "ok"}`), the orchestrator will mark the replica ready as soon as the container boots, then route real traffic to it. Every request will return 5xx until the background load finishes. Worse: if the background load *crashes*, the replica still returns `200 OK` from `/health`, so the orchestrator silently keeps it in rotation.

The `/health` endpoint must reflect the actual state of the pipeline: `503` while loading, `503` if the load errored out, and `200` only once `from_pretrained` has returned successfully (and, ideally, a smoke prediction has succeeded). A minimal pattern:

```python
import threading
from contextlib import asynccontextmanager

import torch
from diffusers import StableDiffusion3Pipeline
from fastapi import FastAPI, Response


def _load_in_background():
    # Runs on a plain daemon thread so the event loop (and the port) stay responsive.
    try:
        app.state.pipe = StableDiffusion3Pipeline.from_pretrained(
            "stabilityai/stable-diffusion-3-medium-diffusers",
            torch_dtype=torch.float16,
        ).to("cuda")
    except Exception as e:
        # Record the failure so /health can report it instead of lying with 200.
        app.state.load_error = repr(e)


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Kick off the load and return immediately; uvicorn binds the port right away,
    # and readiness is delegated to /health.
    threading.Thread(target=_load_in_background, daemon=True).start()
    yield


app = FastAPI(lifespan=lifespan)
app.state.pipe = None
app.state.load_error = None


@app.get("/health")
async def health(response: Response):
    if app.state.pipe is not None:
        return {"status": "ready"}
    response.status_code = 503  # not ready: still loading, or the load failed
    if app.state.load_error is not None:
        return {"status": "error", "detail": app.state.load_error}
    return {"status": "loading"}
```
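
To also cover the smoke prediction mentioned above, publish the pipeline only after one cheap generation has succeeded. A sketch of the amended loader (the tiny one-step, 64x64 call is illustrative):

```python
def _load_in_background():
    try:
        pipe = StableDiffusion3Pipeline.from_pretrained(
            "stabilityai/stable-diffusion-3-medium-diffusers",
            torch_dtype=torch.float16,
        ).to("cuda")
        # Smoke prediction: one denoising step at a tiny resolution. If the GPU,
        # weights, or scheduler are broken, it surfaces here, not on user traffic.
        pipe("smoke test", num_inference_steps=1, height=64, width=64)
        # Publish only after the smoke test passes; /health flips to 200 here.
        app.state.pipe = pipe
    except Exception as e:
        app.state.load_error = repr(e)
```
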
The same 503-while-loading rule applies to a Kubernetes `readinessProbe`, Vertex AI's prediction-container health route (`AIP_HEALTH_ROUTE`), and Cloud Run's startup probe.
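
Whichever platform you deploy on, the probe logic reduces to the same loop. A hypothetical client-side helper (name, URL, and timings are illustrative) that mimics it can be handy in deploy scripts and integration tests:

```python
import time

import requests


def wait_until_ready(url: str = "http://localhost:8000/health", timeout: float = 600.0) -> dict:
    # Mimics a readiness probe: "connection refused" and 503 both mean "not ready yet".
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            resp = requests.get(url, timeout=5)
            body = resp.json()
            if resp.status_code == 200:
                return body  # {"status": "ready"}
            if body.get("status") == "error":
                # Fail fast instead of waiting out the timeout on a dead replica.
                raise RuntimeError(f"model load failed: {body.get('detail')}")
        except requests.ConnectionError:
            pass  # port not bound yet (synchronous-load phase, or very early boot)
        time.sleep(2)
    raise TimeoutError("replica never became ready")
```
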
If you do not need cold-start parallelism, prefer the synchronous-load pattern in `server.py`. The bug above is one of the easier ways to lose half a day debugging an "available" replica that returns 500 to every request.
