Summary
The Ollama Python SDK (`ollama`) is the official Python client for Ollama, a widely used platform for running LLMs locally. It provides execution APIs for chat completions (`chat()`), text generation (`generate()`), and embeddings (`embed()`) with its own API surface, distinct from OpenAI's format. This repository has zero instrumentation for any Ollama SDK surface — no integration, no wrapper, no patcher, no `auto_instrument()` support.
The ollama package has 9.9k GitHub stars, is used by ~34.8k downstream projects, and is actively maintained (latest: v0.6.2, April 29, 2026). It is one of the most widely used interfaces for local LLM inference in the Python ecosystem.
While Ollama also exposes an OpenAI-compatible HTTP endpoint, most Python users interact through the native `ollama` SDK, which has its own request/response schemas. `wrap_openai()` cannot be used with the native `ollama.Client` or the module-level functions. The AgentScope integration in this repo patches `agentscope.model.OllamaChatModel.__call__` (an AgentScope wrapper), not the Ollama SDK itself — direct `ollama.chat()` calls produce no Braintrust spans.
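To make the untraced surface concrete, a minimal native-SDK call looks roughly like this (model name and options are illustrative; assumes a local Ollama server with the model already pulled):

```python
import ollama  # native SDK; talks to a local server started with `ollama serve`

# None of this is traced today -- no Braintrust span is produced.
response = ollama.chat(
    model="llama3.2",  # illustrative model; must already be pulled locally
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    options={"temperature": 0.2},
)
print(response["message"]["content"])
print(response["prompt_eval_count"], response["eval_count"])  # prompt / completion tokens
```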
What needs to be instrumented
The ollama package (v0.6.2) exposes these execution surfaces via module-level functions, Client, and AsyncClient, none of which are instrumented:
Chat (highest priority)
| SDK Method | Description | Streaming | Return type |
| --- | --- | --- | --- |
| `ollama.chat(model, messages, ...)` | Chat completions with conversation history and tool use | `stream=True` returns an iterator of dicts | dict with `message`, `model`, `eval_count`, `prompt_eval_count` |
Response shape: Returns a dict with message (role + content), model, eval_count (completion tokens), prompt_eval_count (prompt tokens), total_duration, load_duration, prompt_eval_duration, eval_duration. Token usage and latency metrics are directly available.
Tool calling: Supports tools parameter for function calling. Tool calls appear in message.tool_calls.
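A hedged sketch of the tool-calling surface (model and tool are illustrative; assumes a recent SDK version, where responses are pydantic models that also support dict-style access):

```python
import ollama


def get_weather(city: str) -> str:
    """Illustrative tool; recent SDK versions derive a tool schema from the signature."""
    return f"Sunny in {city}"


response = ollama.chat(
    model="llama3.2",  # illustrative; any tool-capable model
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[get_weather],  # plain JSON tool-schema dicts are also accepted
)
# Tool calls, if any, surface on message.tool_calls.
for call in response.message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```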
Generate
| SDK Method | Description | Streaming | Return type |
| --- | --- | --- | --- |
| `ollama.generate(model, prompt, ...)` | Text generation from a prompt | `stream=True` returns an iterator of dicts | dict with `response`, `model`, `eval_count`, `prompt_eval_count` |
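A minimal streaming `generate()` sketch (model name is illustrative); the final chunk carries the token counts and durations:

```python
import ollama

# Each chunk is a partial response; the last chunk has done=True plus usage fields.
for chunk in ollama.generate(
    model="llama3.2",  # illustrative
    prompt="Write a haiku about local inference.",
    stream=True,
):
    print(chunk["response"], end="", flush=True)
    if chunk["done"]:
        print("\ncompletion tokens:", chunk["eval_count"])
```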
Embed
| SDK Method | Description | Return type |
| --- | --- | --- |
| `ollama.embed(model, input)` | Generate embeddings from text (single or batch) | dict with `embeddings`, `model` |
All methods have corresponding Client instance methods and AsyncClient async variants with identical signatures.
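An illustrative sketch of the client-based surfaces (model names are placeholders; assumes the models are pulled and a local server is running):

```python
import asyncio

import ollama

# Sync client -- same signatures as the module-level functions.
client = ollama.Client()  # defaults to http://localhost:11434
emb = client.embed(model="nomic-embed-text", input=["hello", "world"])
print(len(emb["embeddings"]))  # one vector per input


# Async client -- identical surface, awaitable.
async def main():
    aclient = ollama.AsyncClient()
    resp = await aclient.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": "hi"}],
    )
    print(resp["message"]["content"])


asyncio.run(main())
```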
Implementation notes
Module-level and client-level API: The ollama package exposes both module-level convenience functions (ollama.chat(...)) and class-based clients (Client().chat(...), AsyncClient().chat(...)). Both need instrumentation.
Patching strategy: The module-level functions delegate to a default Client instance. Patching `Client.chat`, `Client.generate`, and `Client.embed` (plus the corresponding `AsyncClient` methods) should cover both usage patterns; note that if the module-level attributes are bound to the default client at import time, they may need to be re-pointed after the class methods are patched.
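A minimal patching sketch under these assumptions (it assumes the braintrust SDK's `start_span()`/`span.log()` API; `patch_ollama_chat` and `_field` are hypothetical names, and only the non-streaming path is shown):

```python
import functools

import braintrust  # assumes the braintrust SDK's start_span()/span.log() API
import ollama


def _field(resp, key):
    """Read a field from either a dict (older SDKs) or a pydantic response (newer SDKs)."""
    return resp.get(key) if isinstance(resp, dict) else getattr(resp, key, None)


def patch_ollama_chat():
    """Hypothetical patcher: wraps Client.chat so chat calls emit a span."""
    original = ollama.Client.chat

    @functools.wraps(original)
    def traced_chat(self, *args, **kwargs):
        with braintrust.start_span(name="ollama.chat") as span:
            span.log(
                input=kwargs.get("messages"),
                metadata={
                    "provider": "ollama",
                    "model": kwargs.get("model"),
                    "options": kwargs.get("options"),
                },
            )
            response = original(self, *args, **kwargs)
            # Non-streaming path only; stream=True needs chunk accumulation (next note).
            span.log(
                output=_field(response, "message"),
                metrics={
                    "prompt_tokens": _field(response, "prompt_eval_count"),
                    "completion_tokens": _field(response, "eval_count"),
                },
            )
            return response

    ollama.Client.chat = traced_chat
    # If the module-level ollama.chat was bound to a default client at import
    # time, re-point it as well so both usage patterns are covered.
    if hasattr(ollama, "_client"):
        ollama.chat = ollama._client.chat
```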
Streaming: Both chat and generate support stream=True, returning iterators of partial response dicts. The integration must accumulate chunks and finalize the span when the stream is exhausted.
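A sketch of that accumulation for chat streams, reusing the `_field` helper from the patching sketch above (chunk fields follow the response shape described earlier; the span API is again assumed):

```python
import time


def traced_chat_stream(span, stream):
    """Wrap a chat stream: yield chunks through, then log once on exhaustion."""
    parts = []
    start = time.time()
    first_token_at = None
    last = None
    for chunk in stream:
        if first_token_at is None:
            first_token_at = time.time()
        message = _field(chunk, "message")
        parts.append(_field(message, "content") or "")
        last = chunk
        yield chunk
    # The final chunk carries done=True plus token counts and durations.
    span.log(
        output={"role": "assistant", "content": "".join(parts)},
        metrics={
            "prompt_tokens": _field(last, "prompt_eval_count") if last else None,
            "completion_tokens": _field(last, "eval_count") if last else None,
            "time_to_first_token": (first_token_at - start) if first_token_at else None,
        },
    )
    span.end()
```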
Rich timing metrics: Ollama responses include total_duration, load_duration, prompt_eval_duration, and eval_duration in nanoseconds, providing fine-grained latency data beyond what most cloud providers expose.
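Those durations would need converting from nanoseconds before being logged as metrics; a small sketch assuming a dict-shaped response:

```python
NS_PER_SECOND = 1e9

DURATION_KEYS = ("total_duration", "load_duration", "prompt_eval_duration", "eval_duration")


def duration_metrics(response: dict) -> dict:
    """Convert Ollama's nanosecond timings into seconds for span metrics (hypothetical helper)."""
    return {
        key: response[key] / NS_PER_SECOND
        for key in DURATION_KEYS
        if response.get(key) is not None
    }
```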
Parameters relevant for span metadata: model, options (contains temperature, top_p, top_k, num_predict, stop, seed), format (structured output), tools, keep_alive.
No API key: Ollama runs locally, so there's no API key to sanitize in VCR cassettes. However, testing requires a running Ollama server with models pulled.
Proposed span shape
chat() / generate()
| Span field | Content |
| --- | --- |
| input | `messages` (chat) or `prompt` (generate), `system`, `tools` |
| output | `message` (chat) or `response` (generate) |
| metadata | `provider: "ollama"`, `model`, `options` (temperature, etc.) |
| metrics | `tokens`, `prompt_tokens`, `completion_tokens`, `time_to_first_token` (streaming), Ollama-specific durations |
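For concreteness, a hypothetical logged span for a non-streaming chat call might look like the following (all values are invented for illustration; durations already converted to seconds):

```python
{
    "input": [{"role": "user", "content": "Why is the sky blue?"}],
    "output": {"role": "assistant", "content": "Because of Rayleigh scattering..."},
    "metadata": {
        "provider": "ollama",
        "model": "llama3.2",
        "options": {"temperature": 0.2},
    },
    "metrics": {
        "prompt_tokens": 26,
        "completion_tokens": 298,
        "tokens": 324,
        "total_duration": 5.04,
        "eval_duration": 4.71,
    },
}
```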
embed()
| Span field | Content |
| --- | --- |
| input | Input text(s) |
| output | Embedding dimensions/count |
| metadata | `provider: "ollama"`, `model` |
No coverage in any instrumentation layer
- No integration directory (`py/src/braintrust/integrations/ollama/`)
- No wrapper function (e.g. `wrap_ollama()`)
- No patcher in any existing integration (the AgentScope `_OllamaChatModelPatcher` patches AgentScope's model wrapper, not the ollama SDK)
- No nox test session (`test_ollama`)
- No version entry in `py/src/braintrust/integrations/versioning.py`
- No mention in `py/src/braintrust/integrations/__init__.py`
A grep for `ollama` across `py/src/braintrust/integrations/` returns only `agentscope/patchers.py`, which patches AgentScope's own `OllamaChatModel` class, not the `ollama` SDK.
Braintrust docs status
not_found — Ollama is not listed on the Braintrust tracing guide or the integrations directory. The custom providers page documents using Ollama's OpenAI-compatible endpoint via the proxy, but this does not cover native ollama SDK calls.
Upstream references
Local repo files inspected
- `py/src/braintrust/integrations/` — no `ollama/` directory exists on `main`
- `py/src/braintrust/wrappers/` — no Ollama wrapper
- `py/noxfile.py` — no `test_ollama` session
- `py/src/braintrust/integrations/__init__.py` — Ollama not listed in the integration registry
- `py/src/braintrust/integrations/versioning.py` — no Ollama version matrix
- `py/src/braintrust/integrations/agentscope/patchers.py` — patches `agentscope.model.OllamaChatModel`, not the native `ollama` SDK