fix(ai-proxy): yield to scheduler in streaming SSE loop to avoid worker CPU starvation

nic-6443 · nic-6443 · commit 1008af0fbde3 · 2026-04-19T11:05:05.000+08:00

When an upstream LLM emits SSE chunks in a tight burst (e.g. a model
hallucinating and producing tokens at 100+ per second), the streaming
loop in parse_streaming_response can run for an extended period without
yielding to the nginx scheduler.

body_reader() (cosocket recv) only yields when the recv buffer is
empty; if the kernel has already buffered several chunks, successive
calls return immediately. ngx.flush(true) only yields when the
downstream send buffer is full; a fast client drains immediately. So
neither end of the loop guarantees a yield, and the SSE coroutine ends
up monopolizing the worker — starving health checks, concurrent
requests, and timer callbacks on the same worker.

Add an explicit ngx.sleep(0) at the end of each loop iteration. A
zero-delay sleep does no waiting; it only yields the current coroutine
so that other ready coroutines can run. The cost is negligible: in
normal AI traffic, chunks already arrive with inter-chunk gaps, so an
extra yield per chunk is invisible; in burst scenarios, it caps the
loop's uninterrupted runtime at one chunk's worth of work.
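
The starvation and the fix can be illustrated outside nginx with plain
Lua coroutines. This is only a toy model under stated assumptions: the
round-robin run() below stands in for the nginx event loop (which is
event-driven, not a simple queue), coroutine.yield() plays the role of
ngx.sleep(0), and every name in it (run, sse_loop, health_check) is
hypothetical rather than taken from the APISIX code.

```lua
-- Toy round-robin scheduler standing in for the nginx event loop.
-- Each task is resumed until it yields or finishes; a task that
-- yielded is re-queued behind the other ready tasks.
local function run(fns)
    local order = {}   -- records the order in which work was done
    local queue = {}
    for _, fn in ipairs(fns) do
        queue[#queue + 1] = coroutine.create(fn)
    end
    while #queue > 0 do
        local co = table.remove(queue, 1)
        local ok, err = coroutine.resume(co, order)
        assert(ok, err)
        if coroutine.status(co) ~= "dead" then
            queue[#queue + 1] = co
        end
    end
    return order
end

-- Simulated SSE loop: three chunks are already "buffered", so reading
-- never blocks and nothing in the loop yields on its own.
local function sse_loop(yield_each_chunk)
    return function(order)
        for i = 1, 3 do
            order[#order + 1] = "chunk" .. i
            if yield_each_chunk then
                coroutine.yield()   -- stand-in for ngx.sleep(0)
            end
        end
    end
end

-- Simulated short task that shares the worker with the SSE loop.
local function health_check(order)
    order[#order + 1] = "health"
end

local without_yield = run({ sse_loop(false), health_check })
local with_yield = run({ sse_loop(true), health_check })

print(table.concat(without_yield, " "))  -- chunk1 chunk2 chunk3 health
print(table.concat(with_yield, " "))     -- chunk1 health chunk2 chunk3
```

Without the yield, the health check runs only after the whole burst
has been processed; with it, the health check gets the worker after
the first chunk.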
diff --git a/apisix/plugins/ai-providers/base.lua b/apisix/plugins/ai-providers/base.lua
@@ -361,6 +361,13 @@ function _M.parse_streaming_response(self, ctx, res, target_proto, converter)
         else
             plugin.lua_response_filter(ctx, res.headers, chunk)
         end
+
+        -- Yield to the nginx scheduler so other coroutines on this worker
+        -- (health checks, concurrent requests) can run. body_reader() and
+        -- ngx.flush() do not yield when the upstream socket already has data
+        -- buffered or the downstream client drains immediately, so under
+        -- bursty SSE upstreams this loop can monopolize the worker CPU.
+        ngx.sleep(0)
     end
 end