fix(ollama): soft-fail on empty response in generate_from_raw (#1161)

planetf1 · web-flow · commit 25729807c230 · 2026-06-03T10:38:30.000Z
* fix(ollama): soft-fail on empty response body; remove stale xfail (#599)

Ollama returns HTTP 200 with an empty `response` field when the first
sampled token is EOS (runner.go:546-552). This is real but vanishingly
rare: 4 400 requests across 1 100 trials showed zero occurrences once the
primary cause (return_exceptions=True in asyncio.gather, PR #1163) was
removed.

This PR adds belt-and-braces handling for that genuine-but-rare path: warm
the model in isolation runs, detect an HTTP-200-with-empty-body response
and log a warning rather than silently returning an empty string.

The stale xfail on test_generate_from_raw (which was passing cleanly after
#1163) is removed. Three unit tests cover the soft-fail branch directly
without needing a live Ollama server.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

* docs(agents): add asyncio.gather return_exceptions footgun to common issues

Lesson from #599 investigation: return_exceptions=True silently converts
exceptions to empty values in batch backends.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

* docs(ollama): document soft-fail contract in generate_from_raw Returns

Addresses reviewer nit on PR #1161: the removed `Raises:` block left
callers with no way to discover that empty done responses now soft-fail
(``value=""``, error stashed in ``_generate_log.extra["error"]``) rather
than propagate, while network-level exceptions still propagate
all-or-nothing.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

* docs(ollama): correct generate_from_raw soft-fail/error wording

Self-review fixes on the prior docstring update:
- "raw response" -> "serialized response dict" (it is
  response.model_dump())
- move the gather all-or-nothing behaviour to a Note: an exception
  propagates (the call raises) rather than returning an empty list, and
  successful sibling requests are discarded
- drop the inaccurate "Ollama client exceptions (e.g. ConnectionError)"
  framing (ConnectionError is not an ollama type)

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

* docs(ollama): note sibling actions unaffected by empty-response soft-fail

Restores the per-action isolation point from Paul's review suggestion: an
empty-response soft-fail only affects that action's thunk; other actions
in the batch still produce normal results.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

* test(ollama): use mock_ollama_backend fixture for #599 tests

After rebasing onto main, the shared mock_ollama_backend factory fixture
(test/backends/conftest.py) is now the canonical way to build a patched
backend. Convert the three generate_from_raw empty-response tests from the
local _make_backend helper (which no longer exists post-rebase) to the
fixture, matching the rest of the file.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

---------

Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
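
For readers unfamiliar with the `asyncio.gather` behaviour this message refers to, here is a minimal standalone sketch. It is not code from this PR: `fake_request` and `naive_values` are hypothetical stand-ins for a backend request and a careless consumer. It only illustrates how `return_exceptions=True` turns a failure into an ordinary list element, whereas the default propagates it to the caller.

```python
import asyncio


async def fake_request(i: int) -> str:
    """Hypothetical stand-in for one backend request; request 2 always fails."""
    await asyncio.sleep(0)
    if i == 2:
        raise ConnectionError(f"request {i} failed")
    return f"answer {i}"


async def gather_exceptions_as_values() -> list:
    # return_exceptions=True: the exception object is returned in the result
    # list at the failing request's position instead of being raised.
    return await asyncio.gather(
        *(fake_request(i) for i in range(4)), return_exceptions=True
    )


async def gather_all_or_nothing() -> list:
    # Default (return_exceptions=False): the first exception propagates to the
    # awaiting caller, so no result list is produced at all.
    return await asyncio.gather(*(fake_request(i) for i in range(4)))


def naive_values(results: list) -> list[str]:
    # Hypothetical consumer that does not expect exception objects and maps
    # anything that is not a string to "", hiding the failure.
    return [r if isinstance(r, str) else "" for r in results]


mixed = asyncio.run(gather_exceptions_as_values())
silent = naive_values(mixed)

try:
    asyncio.run(gather_all_or_nothing())
    raised = False
except ConnectionError:
    raised = True
```

In this sketch the failure in `mixed` is only visible to a caller that checks each element's type, which is the situation the new AGENTS.md row warns about.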
diff --git a/AGENTS.md b/AGENTS.md
@@ -119,6 +119,7 @@ Use the tool's common name (e.g., GitHub Copilot, Cursor, etc.).
 | `uv.lock` out of sync | Run `uv sync` |
 | Ollama refused | Run `ollama serve` |
 | Telemetry import errors | Run `uv sync` to install OpenTelemetry deps |
+| Silent empty strings from async backends | Check for `asyncio.gather(..., return_exceptions=True)` — exceptions become values silently; use `return_exceptions=False` unless callers explicitly handle `BaseException` values |
 
 ## 10. Self-Review (before notifying user)
 1. `uv run pytest test/ -m "not qualitative"` passes?
diff --git a/mellea/backends/ollama.py b/mellea/backends/ollama.py
@@ -44,22 +44,22 @@ class OllamaModelBackend(FormatterBackend):
 
     Args:
         model_id (str | ModelIdentifier): Ollama model ID. If a
-            `ModelIdentifier` is passed, its `ollama_name` attribute must
+            ``ModelIdentifier`` is passed, its ``ollama_name`` attribute must
             be set.
         formatter (ChatFormatter | None): Formatter for rendering components.
-            Defaults to `TemplateFormatter`.
+            Defaults to ``TemplateFormatter``.
         base_url (str | None): Ollama server endpoint; defaults to
-            `env(OLLAMA_HOST)` or `http://localhost:11434`.
+            ``env(OLLAMA_HOST)`` or ``http://localhost:11434``.
         model_options (dict | None): Default model options for generation requests.
         timeout (float | None): Request timeout in seconds for the underlying HTTP
-            client. `None` (the default) preserves the upstream `ollama` SDK
+            client. ``None`` (the default) preserves the upstream ``ollama`` SDK
             default. Set this to bound how long a single request will wait when
             the Ollama server is overloaded or stalled.
 
     Attributes:
         to_mellea_model_opts_map (dict): Mapping from Ollama-specific option names
-            to Mellea `ModelOption` sentinel keys.
-        from_mellea_model_opts_map (dict): Mapping from Mellea `ModelOption`
+            to Mellea ``ModelOption`` sentinel keys.
+        from_mellea_model_opts_map (dict): Mapping from Mellea ``ModelOption``
             sentinel keys to Ollama-specific option names.
     """
 
@@ -281,9 +281,9 @@ async def _generate_from_context(
         model_options: dict | None = None,
         tool_calls: bool = False,
     ) -> tuple[ModelOutputThunk[C], Context]:
-        """Generate a completion for `action` given `ctx` via the Ollama chat API.
+        """Generate a completion for ``action`` given ``ctx`` via the Ollama chat API.
 
-        Delegates to `generate_from_chat_context`. Only chat contexts are supported.
+        Delegates to ``generate_from_chat_context``. Only chat contexts are supported.
 
         Args:
             action (Component[C] | CBlock): The component or content block to generate
@@ -293,12 +293,12 @@ async def _generate_from_context(
                 structured/constrained output decoding.
             model_options (dict | None): Per-call model options that override the
                 backend's defaults.
-            tool_calls (bool): If `True`, expose available tools to the model and
+            tool_calls (bool): If ``True``, expose available tools to the model and
                 parse tool-call responses.
 
         Returns:
             tuple[ModelOutputThunk[C], Context]: A thunk holding the (lazy) model output
-                and an updated context that includes `action` and the new output.
+                and an updated context that includes ``action`` and the new output.
         """
         # Start span without auto-closing (will be closed in post_processing)
         span = start_generate_span(self, action, ctx, format, tool_calls)
@@ -334,7 +334,7 @@ async def generate_from_chat_context(
     ) -> ModelOutputThunk[C]:
         """Generate a new completion from the provided context using this backend's formatter.
 
-        Treats the `Context` as a chat history and uses the `ollama.Client.chat()`
+        Treats the ``Context`` as a chat history and uses the ``ollama.Client.chat()``
         interface to generate a completion. Returns a thunk that lazily resolves
         the model output.
 
@@ -345,7 +345,7 @@ async def generate_from_chat_context(
             _format (type[BaseModelSubclass] | None): Optional Pydantic model class for
                 structured output decoding.
             model_options (dict | None): Per-call model options.
-            tool_calls (bool): If `True`, expose available tools and parse responses.
+            tool_calls (bool): If ``True``, expose available tools and parse responses.
 
         Returns:
             ModelOutputThunk[C]: A thunk holding the (lazy) model output.
@@ -502,12 +502,18 @@ async def generate_from_raw(
 
         Returns:
             list[ModelOutputThunk]: A list of model output thunks, one per action.
-
-        Raises:
-            Exception: Any exception raised by the Ollama client (e.g.,
-                ``ConnectionError``, ``ollama.ResponseError``) propagates
-                directly to the caller. Semantics are all-or-nothing: if any
-                request fails, no thunks are returned.
+                If Ollama returns an empty done response (``response=""``,
+                ``done=True``, no thinking content) for an action, that thunk
+                soft-fails: it has ``value=""``, with the ``RuntimeError`` stored
+                at ``thunk._generate_log.extra["error"]`` and the serialized
+                response dict at ``thunk._generate_log.extra["empty_response"]``.
+                Other actions in the batch are unaffected.
+
+        Note:
+            Requests are awaited with ``asyncio.gather`` (all-or-nothing): if any
+            request raises (e.g. ``ollama.ResponseError`` or a connection error),
+            that exception propagates to the caller and no list is returned, even
+            for requests that completed successfully.
         """
         if len(actions) > 1:
             MelleaLogger.get_logger().info(
@@ -549,22 +555,39 @@ async def generate_from_raw(
         results = []
         date = datetime.datetime.now()
         for i, response in enumerate(responses):
-            result = ModelOutputThunk(
-                value=response.response,
-                meta={
-                    "generate_response": response.model_dump(),
-                    "usage": {
-                        "completion_tokens": response.eval_count,
-                        "prompt_tokens": response.prompt_eval_count,
-                        "total_tokens": (
-                            response.prompt_eval_count + response.eval_count
-                            if response.prompt_eval_count is not None
-                            and response.eval_count is not None
-                            else None
-                        ),
+            result = None
+            error = None
+            if response.done and not response.response and not response.thinking:
+                # Empty done response with no thinking content. Commonly caused by the
+                # Ollama model-load race (#599) but can also occur on an early stop or
+                # stop-sequence hit.
+                empty_err = RuntimeError(
+                    f"generate_from_raw: request {i} returned an empty response from Ollama "
+                    "(response='', done=True). This commonly occurs when the model is still "
+                    "loading, but can also indicate an early stop or stop-sequence hit. "
+                    "See https://github.com/generative-computing/mellea/issues/599 "
+                    "and https://github.com/ollama/ollama/issues/16326"
+                )
+                MelleaLogger.get_logger().warning(str(empty_err))
+                result = ModelOutputThunk(value="")
+                error = empty_err
+            else:
+                result = ModelOutputThunk(
+                    value=response.response,
+                    meta={
+                        "generate_response": response.model_dump(),
+                        "usage": {
+                            "completion_tokens": response.eval_count,
+                            "prompt_tokens": response.prompt_eval_count,
+                            "total_tokens": (
+                                response.prompt_eval_count + response.eval_count
+                                if response.prompt_eval_count is not None
+                                and response.eval_count is not None
+                                else None
+                            ),
+                        },
                     },
-                },
-            )
+                )
             action = actions[i]
             result.parsed_repr = (
                 action.parse(result) if isinstance(action, Component) else result.value
@@ -582,6 +605,10 @@ async def generate_from_raw(
                 "seed": model_opts.get(ModelOption.SEED, None),
             }
             generate_log.action = action
+
+            if error:
+                generate_log.extra["error"] = error
+                generate_log.extra["empty_response"] = response.model_dump()
             result._generate_log = generate_log
 
             results.append(result)
@@ -624,9 +651,9 @@ async def processing(
     ):
         """Accumulate text and tool calls from a single Ollama ChatResponse chunk.
 
-        Called for each streaming or non-streaming `ollama.ChatResponse`. Also
+        Called for each streaming or non-streaming ``ollama.ChatResponse``. Also
         extracts tool call requests inline and merges the chunk into the running
-        aggregated response stored in `mot._meta["chat_response"]`.
+        aggregated response stored in ``mot._meta["chat_response"]``.
 
         Args:
             mot (ModelOutputThunk): The output thunk being populated.
diff --git a/test/backends/test_ollama.py b/test/backends/test_ollama.py
@@ -2,11 +2,13 @@
 import json
 from typing import Annotated
 
+import ollama as _ollama
 import pydantic
 import pytest
 
 from mellea import start_session
 from mellea.backends import ModelOption
+from mellea.backends.model_ids import IBM_GRANITE_4_1_3B
 from mellea.backends.ollama import OllamaModelBackend
 from mellea.core import CBlock, Requirement
 from mellea.stdlib.context import SimpleContext
@@ -16,6 +18,30 @@
 pytestmark = [pytest.mark.ollama, pytest.mark.e2e]
 
 
+@pytest.fixture(scope="module", autouse=True)
+def _ensure_model_warm() -> None:
+    """Warm up the default model before tests run in this module.
+
+    The conftest warms models when transitioning *into* the ollama test group, but
+    that warm-up does not fire when this file is run in isolation (e.g.
+    ``pytest test/backends/test_ollama.py``). Without this fixture the first test
+    in such a run fires concurrent requests against a cold model, triggering the
+    Ollama load-race that returns empty responses (see #599 and
+    https://github.com/ollama/ollama/issues/16326).
+
+    ``keep_alive=-1`` pins the model in memory until the conftest module-boundary
+    eviction fires at the end of this test file.
+    """
+    _model = IBM_GRANITE_4_1_3B.ollama_name
+    assert _model is not None  # IBM_GRANITE_4_1_3B always has ollama_name set
+    try:
+        _ollama.generate(
+            model=_model, prompt="hi", options={"num_predict": 1}, keep_alive=-1
+        )
+    except Exception:
+        pass  # best-effort; per-test failures will be clearer than a fixture abort
+
+
 @pytest.fixture(scope="function")
 def session():
     """Fresh Ollama session for each test."""
@@ -101,15 +127,9 @@ class Email(pydantic.BaseModel):
     # assert email.to.email_address.endswith("example.com")
 
 
-@pytest.mark.xfail(
-    strict=False, reason="Ollama intermittently returns empty responses for raw prompts"
-)
 @pytest.mark.qualitative
 @pytest.mark.timeout(150)
 async def test_generate_from_raw(session) -> None:
-    # Note capital letter "W" at the beginning of each prompt. This capital letter is
-    # very important to the ollama version of Granite 4.0 micro, the current default
-    # model for Mellea.
     prompts = ["What is 1+1?", "What is 2+2?", "What is 3+3?", "What is 4+4?"]
 
     results = await session.backend.generate_from_raw(
diff --git a/test/backends/test_ollama_unit.py b/test/backends/test_ollama_unit.py