Skip to content

Commit 2572980

Browse files
authored
fix(ollama): soft-fail on empty response in generate_from_raw (#1161)
* fix(ollama): soft-fail on empty response body; remove stale xfail (#599) Ollama returns HTTP 200 with an empty `response` field when the first sampled token is EOS (runner.go:546-552). This is real but vanishingly rare — 4 400 requests across 1 100 trials showed zero occurrences once the primary cause (return_exceptions=True in asyncio.gather, PR #1163) was removed. This PR adds belt-and-braces handling for that genuine-but-rare path: warm the model in isolation runs, detect an HTTP-200-with-empty-body response and log a warning rather than silently returning an empty string. The stale xfail on test_generate_from_raw (which was passing cleanly after #1163) is removed. Three unit tests cover the soft-fail branch directly without needing a live Ollama server. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com> * docs(agents): add asyncio.gather return_exceptions footgun to common issues Lesson from #599 investigation: return_exceptions=True silently converts exceptions to empty values in batch backends. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com> * docs(ollama): document soft-fail contract in generate_from_raw Returns Addresses reviewer nit on PR #1161: the removed `Raises:` block left callers with no way to discover that empty done responses now soft-fail (``value=""``, error stashed in ``_generate_log.extra["error"]``) rather than propagate, while network-level exceptions still propagate all-or-nothing. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com> * docs(ollama): correct generate_from_raw soft-fail/error wording Self-review fixes on the prior docstring update: - "raw response" -> "serialized response dict" (it is response.model_dump()) - move the gather all-or-nothing behaviour to a Note: an exception propagates (the call raises) rather than returning an empty list, and successful sibling requests are discarded - drop the inaccurate "Ollama client exceptions (e.g. ConnectionError)" framing (ConnectionError is not an ollama type) Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com> * docs(ollama): note sibling actions unaffected by empty-response soft-fail Restores the per-action isolation point from Paul's review suggestion: an empty-response soft-fail only affects that action's thunk; other actions in the batch still produce normal results. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com> * test(ollama): use mock_ollama_backend fixture for #599 tests After rebasing onto main, the shared mock_ollama_backend factory fixture (test/backends/conftest.py) is now the canonical way to build a patched backend. Convert the three generate_from_raw empty-response tests from the local _make_backend helper (which no longer exists post-rebase) to the fixture, matching the rest of the file. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com> --------- Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
1 parent 006485c commit 2572980

4 files changed

Lines changed: 180 additions & 79 deletions

File tree

AGENTS.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -119,6 +119,7 @@ Use the tool's common name (e.g., GitHub Copilot, Cursor, etc.).
119119
| `uv.lock` out of sync | Run `uv sync` |
120120
| Ollama refused | Run `ollama serve` |
121121
| Telemetry import errors | Run `uv sync` to install OpenTelemetry deps |
122+
| Silent empty strings from async backends | Check for `asyncio.gather(..., return_exceptions=True)` — exceptions become values silently; use `return_exceptions=False` unless callers explicitly handle `BaseException` values |
122123

123124
## 10. Self-Review (before notifying user)
124125
1. `uv run pytest test/ -m "not qualitative"` passes?

mellea/backends/ollama.py

Lines changed: 62 additions & 35 deletions
Original file line numberDiff line numberDiff line change
@@ -44,22 +44,22 @@ class OllamaModelBackend(FormatterBackend):
4444
4545
Args:
4646
model_id (str | ModelIdentifier): Ollama model ID. If a
47-
`ModelIdentifier` is passed, its `ollama_name` attribute must
47+
``ModelIdentifier`` is passed, its ``ollama_name`` attribute must
4848
be set.
4949
formatter (ChatFormatter | None): Formatter for rendering components.
50-
Defaults to `TemplateFormatter`.
50+
Defaults to ``TemplateFormatter``.
5151
base_url (str | None): Ollama server endpoint; defaults to
52-
`env(OLLAMA_HOST)` or `http://localhost:11434`.
52+
``env(OLLAMA_HOST)`` or ``http://localhost:11434``.
5353
model_options (dict | None): Default model options for generation requests.
5454
timeout (float | None): Request timeout in seconds for the underlying HTTP
55-
client. `None` (the default) preserves the upstream `ollama` SDK
55+
client. ``None`` (the default) preserves the upstream ``ollama`` SDK
5656
default. Set this to bound how long a single request will wait when
5757
the Ollama server is overloaded or stalled.
5858
5959
Attributes:
6060
to_mellea_model_opts_map (dict): Mapping from Ollama-specific option names
61-
to Mellea `ModelOption` sentinel keys.
62-
from_mellea_model_opts_map (dict): Mapping from Mellea `ModelOption`
61+
to Mellea ``ModelOption`` sentinel keys.
62+
from_mellea_model_opts_map (dict): Mapping from Mellea ``ModelOption``
6363
sentinel keys to Ollama-specific option names.
6464
"""
6565

@@ -281,9 +281,9 @@ async def _generate_from_context(
281281
model_options: dict | None = None,
282282
tool_calls: bool = False,
283283
) -> tuple[ModelOutputThunk[C], Context]:
284-
"""Generate a completion for `action` given `ctx` via the Ollama chat API.
284+
"""Generate a completion for ``action`` given ``ctx`` via the Ollama chat API.
285285
286-
Delegates to `generate_from_chat_context`. Only chat contexts are supported.
286+
Delegates to ``generate_from_chat_context``. Only chat contexts are supported.
287287
288288
Args:
289289
action (Component[C] | CBlock): The component or content block to generate
@@ -293,12 +293,12 @@ async def _generate_from_context(
293293
structured/constrained output decoding.
294294
model_options (dict | None): Per-call model options that override the
295295
backend's defaults.
296-
tool_calls (bool): If `True`, expose available tools to the model and
296+
tool_calls (bool): If ``True``, expose available tools to the model and
297297
parse tool-call responses.
298298
299299
Returns:
300300
tuple[ModelOutputThunk[C], Context]: A thunk holding the (lazy) model output
301-
and an updated context that includes `action` and the new output.
301+
and an updated context that includes ``action`` and the new output.
302302
"""
303303
# Start span without auto-closing (will be closed in post_processing)
304304
span = start_generate_span(self, action, ctx, format, tool_calls)
@@ -334,7 +334,7 @@ async def generate_from_chat_context(
334334
) -> ModelOutputThunk[C]:
335335
"""Generate a new completion from the provided context using this backend's formatter.
336336
337-
Treats the `Context` as a chat history and uses the `ollama.Client.chat()`
337+
Treats the ``Context`` as a chat history and uses the ``ollama.Client.chat()``
338338
interface to generate a completion. Returns a thunk that lazily resolves
339339
the model output.
340340
@@ -345,7 +345,7 @@ async def generate_from_chat_context(
345345
_format (type[BaseModelSubclass] | None): Optional Pydantic model class for
346346
structured output decoding.
347347
model_options (dict | None): Per-call model options.
348-
tool_calls (bool): If `True`, expose available tools and parse responses.
348+
tool_calls (bool): If ``True``, expose available tools and parse responses.
349349
350350
Returns:
351351
ModelOutputThunk[C]: A thunk holding the (lazy) model output.
@@ -502,12 +502,18 @@ async def generate_from_raw(
502502
503503
Returns:
504504
list[ModelOutputThunk]: A list of model output thunks, one per action.
505-
506-
Raises:
507-
Exception: Any exception raised by the Ollama client (e.g.,
508-
``ConnectionError``, ``ollama.ResponseError``) propagates
509-
directly to the caller. Semantics are all-or-nothing: if any
510-
request fails, no thunks are returned.
505+
If Ollama returns an empty done response (``response=""``,
506+
``done=True``, no thinking content) for an action, that thunk
507+
soft-fails: it has ``value=""``, with the ``RuntimeError`` stored
508+
at ``thunk._generate_log.extra["error"]`` and the serialized
509+
response dict at ``thunk._generate_log.extra["empty_response"]``.
510+
Other actions in the batch are unaffected.
511+
512+
Note:
513+
Requests are awaited with ``asyncio.gather`` (all-or-nothing): if any
514+
request raises (e.g. ``ollama.ResponseError`` or a connection error),
515+
that exception propagates to the caller and no list is returned, even
516+
for requests that completed successfully.
511517
"""
512518
if len(actions) > 1:
513519
MelleaLogger.get_logger().info(
@@ -549,22 +555,39 @@ async def generate_from_raw(
549555
results = []
550556
date = datetime.datetime.now()
551557
for i, response in enumerate(responses):
552-
result = ModelOutputThunk(
553-
value=response.response,
554-
meta={
555-
"generate_response": response.model_dump(),
556-
"usage": {
557-
"completion_tokens": response.eval_count,
558-
"prompt_tokens": response.prompt_eval_count,
559-
"total_tokens": (
560-
response.prompt_eval_count + response.eval_count
561-
if response.prompt_eval_count is not None
562-
and response.eval_count is not None
563-
else None
564-
),
558+
result = None
559+
error = None
560+
if response.done and not response.response and not response.thinking:
561+
# Empty done response with no thinking content. Commonly caused by the
562+
# Ollama model-load race (#599) but can also occur on an early stop or
563+
# stop-sequence hit.
564+
empty_err = RuntimeError(
565+
f"generate_from_raw: request {i} returned an empty response from Ollama "
566+
"(response='', done=True). This commonly occurs when the model is still "
567+
"loading, but can also indicate an early stop or stop-sequence hit. "
568+
"See https://github.com/generative-computing/mellea/issues/599 "
569+
"and https://github.com/ollama/ollama/issues/16326"
570+
)
571+
MelleaLogger.get_logger().warning(str(empty_err))
572+
result = ModelOutputThunk(value="")
573+
error = empty_err
574+
else:
575+
result = ModelOutputThunk(
576+
value=response.response,
577+
meta={
578+
"generate_response": response.model_dump(),
579+
"usage": {
580+
"completion_tokens": response.eval_count,
581+
"prompt_tokens": response.prompt_eval_count,
582+
"total_tokens": (
583+
response.prompt_eval_count + response.eval_count
584+
if response.prompt_eval_count is not None
585+
and response.eval_count is not None
586+
else None
587+
),
588+
},
565589
},
566-
},
567-
)
590+
)
568591
action = actions[i]
569592
result.parsed_repr = (
570593
action.parse(result) if isinstance(action, Component) else result.value
@@ -582,6 +605,10 @@ async def generate_from_raw(
582605
"seed": model_opts.get(ModelOption.SEED, None),
583606
}
584607
generate_log.action = action
608+
609+
if error:
610+
generate_log.extra["error"] = error
611+
generate_log.extra["empty_response"] = response.model_dump()
585612
result._generate_log = generate_log
586613

587614
results.append(result)
@@ -624,9 +651,9 @@ async def processing(
624651
):
625652
"""Accumulate text and tool calls from a single Ollama ChatResponse chunk.
626653
627-
Called for each streaming or non-streaming `ollama.ChatResponse`. Also
654+
Called for each streaming or non-streaming ``ollama.ChatResponse``. Also
628655
extracts tool call requests inline and merges the chunk into the running
629-
aggregated response stored in `mot._meta["chat_response"]`.
656+
aggregated response stored in ``mot._meta["chat_response"]``.
630657
631658
Args:
632659
mot (ModelOutputThunk): The output thunk being populated.

test/backends/test_ollama.py

Lines changed: 26 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -2,11 +2,13 @@
22
import json
33
from typing import Annotated
44

5+
import ollama as _ollama
56
import pydantic
67
import pytest
78

89
from mellea import start_session
910
from mellea.backends import ModelOption
11+
from mellea.backends.model_ids import IBM_GRANITE_4_1_3B
1012
from mellea.backends.ollama import OllamaModelBackend
1113
from mellea.core import CBlock, Requirement
1214
from mellea.stdlib.context import SimpleContext
@@ -16,6 +18,30 @@
1618
pytestmark = [pytest.mark.ollama, pytest.mark.e2e]
1719

1820

21+
@pytest.fixture(scope="module", autouse=True)
22+
def _ensure_model_warm() -> None:
23+
"""Warm up the default model before tests run in this module.
24+
25+
The conftest warms models when transitioning *into* the ollama test group, but
26+
that warm-up does not fire when this file is run in isolation (e.g.
27+
``pytest test/backends/test_ollama.py``). Without this fixture the first test
28+
in such a run fires concurrent requests against a cold model, triggering the
29+
Ollama load-race that returns empty responses (see #599 and
30+
https://github.com/ollama/ollama/issues/16326).
31+
32+
``keep_alive=-1`` pins the model in memory until the conftest module-boundary
33+
eviction fires at the end of this test file.
34+
"""
35+
_model = IBM_GRANITE_4_1_3B.ollama_name
36+
assert _model is not None # IBM_GRANITE_4_1_3B always has ollama_name set
37+
try:
38+
_ollama.generate(
39+
model=_model, prompt="hi", options={"num_predict": 1}, keep_alive=-1
40+
)
41+
except Exception:
42+
pass # best-effort; per-test failures will be clearer than a fixture abort
43+
44+
1945
@pytest.fixture(scope="function")
2046
def session():
2147
"""Fresh Ollama session for each test."""
@@ -101,15 +127,9 @@ class Email(pydantic.BaseModel):
101127
# assert email.to.email_address.endswith("example.com")
102128

103129

104-
@pytest.mark.xfail(
105-
strict=False, reason="Ollama intermittently returns empty responses for raw prompts"
106-
)
107130
@pytest.mark.qualitative
108131
@pytest.mark.timeout(150)
109132
async def test_generate_from_raw(session) -> None:
110-
# Note capital letter "W" at the beginning of each prompt. This capital letter is
111-
# very important to the ollama version of Granite 4.0 micro, the current default
112-
# model for Mellea.
113133
prompts = ["What is 1+1?", "What is 2+2?", "What is 3+3?", "What is 4+4?"]
114134

115135
results = await session.backend.generate_from_raw(

0 commit comments

Comments
 (0)