
HYBIM 665 Converting stop_llm calls from sync to async #319

Open
shuningc wants to merge 6 commits into main from HYBIM-665-async-stop_llm

shuningc commented May 8, 2026

Problem

TelemetryHandler.stop_llm() (and every other stop_*/fail_* method) runs entirely on the caller's request thread. This includes:

  1. Message serialization: _emitter.on_end() → SpanEmitter._apply_finish_attrs() serializes input/output messages into span attributes (JSON encoding,
    string truncation)
  2. Completion callbacks: _notify_completion() fans out to every registered callback (evaluators, persistence hooks, upload hooks), any of which may do I/O
  3. Metric flush: meter_provider.force_flush() blocks until the metric export cycle completes

In a multi-step agentic workflow, stop_llm is called once per LLM turn, so these costs stack. In high-throughput services that have added manual
instrumentation calls on the hot path, this adds measurable latency to every user-facing request.

Solution

Add an optional background ThreadPoolExecutor inside TelemetryHandler. When OTEL_INSTRUMENTATION_GENAI_ASYNC_FINALIZATION=true, the completion callbacks and metric
flush of each stop_*/fail_* call are submitted to the thread pool instead of running inline; span finalization (_emitter.on_end(), including message
serialization) still runs on the caller's thread. The flag defaults to false, so there is no behavior change without opt-in.


File-by-file changes

  1. environment_variables.py: two new env var constants

Why: All configuration in this library is driven by env vars. Adding the two new variables here keeps them alongside the existing constants, makes them
importable by name, and ensures they appear in __all__ for documentation tooling.

  • OTEL_INSTRUMENTATION_GENAI_ASYNC_FINALIZATION: opt-in flag (default false)
  • OTEL_INSTRUMENTATION_GENAI_ASYNC_FINALIZATION_QUEUE_SIZE: maximum number of running plus queued finalization tasks before new work falls back to running inline (default 128)
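A minimal sketch of how the two variables might be read. The helper `read_async_finalization_config` and its parsing rules (case-insensitive "true", invalid or non-positive sizes falling back to 128) are assumptions for illustration; the PR only specifies the names and defaults.

```python
import os
from dataclasses import dataclass

# Variable names from the PR; using the name itself as the constant's value is an assumption.
OTEL_INSTRUMENTATION_GENAI_ASYNC_FINALIZATION = "OTEL_INSTRUMENTATION_GENAI_ASYNC_FINALIZATION"
OTEL_INSTRUMENTATION_GENAI_ASYNC_FINALIZATION_QUEUE_SIZE = (
    "OTEL_INSTRUMENTATION_GENAI_ASYNC_FINALIZATION_QUEUE_SIZE"
)
DEFAULT_QUEUE_SIZE = 128  # documented default


@dataclass(frozen=True)
class AsyncFinalizationConfig:
    enabled: bool
    queue_size: int


def read_async_finalization_config(env=None):
    """Hypothetical helper: parse both variables, applying the documented defaults."""
    env = os.environ if env is None else env
    # Opt-in flag: anything other than "true" (case-insensitive) leaves it off.
    raw_flag = env.get(OTEL_INSTRUMENTATION_GENAI_ASYNC_FINALIZATION, "false")
    enabled = raw_flag.strip().lower() == "true"
    raw_size = env.get(OTEL_INSTRUMENTATION_GENAI_ASYNC_FINALIZATION_QUEUE_SIZE)
    try:
        queue_size = int(raw_size) if raw_size is not None else DEFAULT_QUEUE_SIZE
    except ValueError:
        queue_size = DEFAULT_QUEUE_SIZE  # assumption: unparsable values use the default
    if queue_size <= 0:
        queue_size = DEFAULT_QUEUE_SIZE  # assumption: a non-positive size is rejected
    return AsyncFinalizationConfig(enabled=enabled, queue_size=queue_size)


print(read_async_finalization_config({}))
# AsyncFinalizationConfig(enabled=False, queue_size=128)
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the helper easy to test.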

  2. handler.py: core implementation

Why this approach: TelemetryHandler is the shared singleton that every instrumented framework calls. Changing it here covers LangChain, LlamaIndex,
CrewAI, and FastMCP in one place without changing any instrumentation code.

What runs inline (unchanged behaviour):

  • invocation.end_time: must be set before the call returns so timing is accurate
  • pop_current_span: resets a ContextVar and detaches an OTel context token, both of which belong to the caller's thread. Moving this to a background
    thread would corrupt parent/child span relationships for any code that runs after stop_* returns
  • _agent_context_stack pop in stop_agent/fail_agent: same requirement as above, it must run on the caller's thread
  • _emitter.on_end() / _emitter.on_error(): calls span.end(), which must complete before the framework proceeds. LlamaIndex's workflow event loop checks
    span.is_recording() to decide whether to open new child spans; if span.end() is deferred, it sees the parent as still open and creates extra outer
    wrapper spans
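Why the ContextVar reset cannot move to the pool can be reproduced with the standard library alone. This sketch uses a hypothetical stand-in variable, `current_span`, rather than the library's own state: a token returned by `ContextVar.set()` on the request thread cannot be reset from a `ThreadPoolExecutor` worker, because the worker runs in a different `contextvars.Context`. The PR applies the same reasoning to the OTel context token.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the handler's "current span" ContextVar.
current_span = contextvars.ContextVar("current_span", default=None)


def reset_in_worker(token):
    """Try to restore the previous value from a background thread."""
    try:
        current_span.reset(token)
        return "reset succeeded"
    except ValueError as exc:
        # The token is bound to the Context in which set() was called.
        return f"ValueError: {exc}"


token = current_span.set("llm-span")  # set on the request thread, e.g. at start time
with ThreadPoolExecutor(max_workers=1) as pool:
    outcome = pool.submit(reset_in_worker, token).result()

print(outcome)             # the worker could not reset the token
print(current_span.get())  # "llm-span": the caller's context was never restored
current_span.reset(token)  # resetting inline, on the caller's thread, works
print(current_span.get())  # None
```

If the reset were deferred, the caller would keep seeing the finished span as current, which is how later spans end up with the wrong parent.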

What moves to the background:

  • _notify_completion(invocation): fans out to completion callbacks (evaluators, upload hooks), which may do I/O and are typically the most expensive part
  • meter_provider.force_flush(): waits for metric export
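The inline/background split above can be sketched with a toy handler. Everything here (`ToyHandler`, `Invocation`, and the recorded step names standing in for `on_end`, `_notify_completion` and `force_flush`) is hypothetical; the sketch only mirrors the ordering described: timing and `on_end` run on the caller's thread, while callbacks and the metric flush go to an executor when one exists.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Invocation:
    end_time: Optional[float] = None
    ran_on: dict = field(default_factory=dict)  # step name -> thread ident that ran it


class ToyHandler:
    """Hypothetical stand-in for TelemetryHandler; each step only records its thread."""

    def __init__(self, async_finalization: bool):
        self._executor = ThreadPoolExecutor(max_workers=2) if async_finalization else None

    def _record(self, inv: Invocation, step: str) -> None:
        inv.ran_on[step] = threading.get_ident()

    def stop_llm(self, inv: Invocation) -> Invocation:
        # Inline: timing, context restoration and span.end() finish before returning.
        inv.end_time = time.time()
        # (pop_current_span would run here, still on the caller's thread)
        self._record(inv, "on_end")  # stand-in for _emitter.on_end() -> span.end()

        def finalize() -> None:
            self._record(inv, "notify_completion")  # stand-in for _notify_completion()
            self._record(inv, "force_flush")  # stand-in for meter_provider.force_flush()

        # Deferrable: run in the pool when enabled, otherwise inline as before.
        if self._executor is None:
            finalize()
        else:
            self._executor.submit(finalize)
        return inv

    def shutdown(self) -> None:
        if self._executor is not None:
            self._executor.shutdown(wait=True)  # drain pending finalizations
            self._executor = None


handler = ToyHandler(async_finalization=True)
inv = handler.stop_llm(Invocation())
handler.shutdown()  # wait for background work before inspecting it
caller = threading.get_ident()
print(inv.ran_on["on_end"] == caller)             # True
print(inv.ran_on["notify_completion"] == caller)  # False
```

Capturing `inv` in a closure keeps the deferred work self-contained, so nothing on the background thread needs the caller's context.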

_submit_finalization(fn) helper logic:

  1. If _finalizer_executor is None (flag off or after shutdown): run fn() inline, identical to the old behaviour
  2. If the BoundedSemaphore cannot be acquired (queue full): run fn() inline, so telemetry is never silently dropped
  3. Otherwise: submit fn to the thread pool and release the semaphore in a done_callback when the task completes

shutdown(wait=True): Drains the thread pool. Called automatically by _reset_for_testing() so tests don't leak background threads between cases.
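A minimal sketch of the three rules plus an idempotent `shutdown()`, assuming a `threading.BoundedSemaphore` sized from the queue-size setting guards a `ThreadPoolExecutor`. The class name `FinalizationQueue`, the worker count, and the lock are illustrative choices, not the PR's code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FinalizationQueue:
    """Hypothetical stand-in for the handler's _submit_finalization/shutdown pair."""

    def __init__(self, queue_size=128, max_workers=2):
        self._executor = ThreadPoolExecutor(
            max_workers=max_workers, thread_name_prefix="genai-finalizer"
        )
        # Counts running + queued tasks; acquire() fails once queue_size are outstanding.
        self._slots = threading.BoundedSemaphore(queue_size)
        self._lock = threading.Lock()

    def submit(self, fn):
        executor = self._executor
        # Rule 1: no executor (flag off or after shutdown) -> run inline.
        if executor is None:
            fn()
            return
        # Rule 2: queue full -> run inline instead of dropping the work.
        if not self._slots.acquire(blocking=False):
            fn()
            return
        # Rule 3: hand off to the pool; free the slot when the task finishes.
        try:
            future = executor.submit(fn)
        except RuntimeError:
            # Shut down between the check and submit(): fall back inline.
            self._slots.release()
            fn()
            return
        future.add_done_callback(lambda _f: self._slots.release())

    def shutdown(self, wait=True):
        # Idempotent: only the first call actually shuts the pool down.
        with self._lock:
            executor, self._executor = self._executor, None
        if executor is not None:
            executor.shutdown(wait=wait)


q = FinalizationQueue(queue_size=1, max_workers=1)
gate = threading.Event()
ran_on = []
q.submit(gate.wait)                                     # occupies the only slot
q.submit(lambda: ran_on.append(threading.get_ident()))  # queue full -> runs inline
gate.set()
q.shutdown()
q.shutdown()                                            # second call is a no-op
print(ran_on == [threading.get_ident()])                # True
```

Exceptions raised by `fn` in the pool are stored on the returned future and never surface in this sketch; a real implementation would likely log them from the done callback.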


  3. tests/test_async_finalization.py: new test file (16 tests)

Why: The async/inline boundary is subtle. Without tests it's easy to accidentally move something that must stay inline (like _pop_current_span) into the
background closure.

Four test classes:

  • TestDefaultInlineBehavior — with flag off, confirms executor is None and on_end/callbacks are called synchronously before stop_llm returns
  • TestAsyncFinalizationEnabled — with flag on, confirms:
    • on_end runs on the caller's thread before stop_llm returns
    • _notify_completion runs on a background thread
    • end_time is set inline
    • _pop_current_span restores context inline
    • callbacks eventually complete after shutdown(wait=True)
    • shutdown drains all queued work
    • all invocation types (stop_embedding, stop_workflow, fail_llm) follow the same pattern
  • TestQueueFullFallback — drains the semaphore to simulate a full queue, verifies on_end falls back to the caller's thread
  • TestShutdown: shutdown() is idempotent, clears the executor, and subsequent stop_llm calls still emit telemetry inline
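The queue-full check in TestQueueFullFallback can be sketched with the standard library's `unittest`. This assumes a hypothetical free-function stand-in, `submit_finalization(executor, slots, fn)`, rather than the real handler, and records which thread ran the work: draining the semaphore must force the work onto the caller's thread.

```python
import threading
import unittest
from concurrent.futures import ThreadPoolExecutor


def submit_finalization(executor, slots, fn):
    """Hypothetical free-function version of the handler's _submit_finalization."""
    if executor is None or not slots.acquire(blocking=False):
        fn()  # flag off, shut down, or queue full: run on the caller's thread
        return
    try:
        future = executor.submit(fn)
    except RuntimeError:  # executor already shut down
        slots.release()
        fn()
        return
    future.add_done_callback(lambda _f: slots.release())


class TestQueueFullFallback(unittest.TestCase):
    def setUp(self):
        self.executor = ThreadPoolExecutor(max_workers=1)
        self.slots = threading.BoundedSemaphore(2)
        self.ran_on = []

    def tearDown(self):
        self.executor.shutdown(wait=True)

    def _record_thread(self):
        self.ran_on.append(threading.get_ident())

    def test_runs_in_background_when_slots_available(self):
        submit_finalization(self.executor, self.slots, self._record_thread)
        self.executor.shutdown(wait=True)  # drain before asserting
        self.assertEqual(len(self.ran_on), 1)
        self.assertNotEqual(self.ran_on[0], threading.get_ident())

    def test_falls_back_inline_when_queue_full(self):
        # Drain every slot to simulate a full queue.
        while self.slots.acquire(blocking=False):
            pass
        submit_finalization(self.executor, self.slots, self._record_thread)
        # The work ran synchronously, so no waiting is needed.
        self.assertEqual(self.ran_on, [threading.get_ident()])

    def test_runs_inline_without_executor(self):
        submit_finalization(None, self.slots, self._record_thread)
        self.assertEqual(self.ran_on, [threading.get_ident()])
```

Saved as a test module, this runs with `python -m unittest`; comparing thread identities rather than sleeping keeps the tests deterministic.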

Correctness verification

Two local test scripts run the same agent query twice (sync then async) against the real Circuit API and compare span output:

  • .local/test_circuit_langchain.py — LangChain create_react_agent via LangchainInstrumentor (auto instrumentation)

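Both confirm the same span count, span names, and attributes in sync and async modes. Only run-specific values differ: trace and span IDs, gen_ai.agent.id
(which matches the invoke_agent span ID), timestamps, and durations.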
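A sketch of that comparison, assuming each run's spans were dumped as a JSON list in the same shape as the LangChain span output in this PR (load each file with `json.load`). The function name and the set of ignored keys are assumptions; parent IDs are replaced by parent names so the hierarchy is still compared.

```python
# Top-level span fields that change on every run (assumed list, based on the dumps).
RUN_SPECIFIC_FIELDS = {
    "trace_id", "span_id", "parent_span_id", "start_time", "end_time", "duration_ms",
}
# gen_ai.agent.id equals the agent span's ID in the dumps, so it also changes per run.
RUN_SPECIFIC_ATTRIBUTES = {"gen_ai.agent.id"}


def _normalize(span, names_by_id):
    """Drop run-specific values; keep hierarchy by recording the parent's name."""
    kept = {k: v for k, v in span.items() if k not in RUN_SPECIFIC_FIELDS}
    kept["parent_name"] = names_by_id.get(span.get("parent_span_id"))
    kept["attributes"] = {
        k: v for k, v in span.get("attributes", {}).items()
        if k not in RUN_SPECIFIC_ATTRIBUTES
    }
    return kept


def _normalized(spans):
    names_by_id = {s["span_id"]: s["name"] for s in spans}
    return sorted((_normalize(s, names_by_id) for s in spans), key=lambda s: s["name"])


def compare_span_dumps(sync_spans, async_spans):
    """Return a list of differences; an empty list means the runs are equivalent."""
    problems = []
    if len(sync_spans) != len(async_spans):
        problems.append(f"span count differs: {len(sync_spans)} vs {len(async_spans)}")
    for a, b in zip(_normalized(sync_spans), _normalized(async_spans)):
        if a != b:
            problems.append(f"span {a['name']!r} (sync) != span {b['name']!r} (async)")
    return problems


sync_run = [
    {"name": "invoke_agent a", "trace_id": "t1", "span_id": "s1", "parent_span_id": None,
     "attributes": {"gen_ai.agent.id": "s1", "gen_ai.operation.name": "invoke_agent"}},
    {"name": "chat m", "trace_id": "t1", "span_id": "s2", "parent_span_id": "s1",
     "attributes": {"gen_ai.agent.id": "s1", "gen_ai.operation.name": "chat"}},
]
async_run = [
    {"name": "invoke_agent a", "trace_id": "t2", "span_id": "s3", "parent_span_id": None,
     "attributes": {"gen_ai.agent.id": "s3", "gen_ai.operation.name": "invoke_agent"}},
    {"name": "chat m", "trace_id": "t2", "span_id": "s4", "parent_span_id": "s3",
     "attributes": {"gen_ai.agent.id": "s3", "gen_ai.operation.name": "chat"}},
]
print(compare_span_dumps(sync_run, async_run))  # []
```

Sorting by name assumes span names are unique within a run, which holds for the three-span dumps here but not necessarily for longer agent traces.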


Non-breaking guarantee

  • The flag defaults to false: no behaviour change for existing users
  • When the flag is on, span.end() still completes before stop_* returns, so OTel span timing, parent/child relationships, and framework-level span
    checks are unaffected
  • The queue-full fallback ensures telemetry is never dropped under backpressure

LangChain sync span output

[
  {
    "name": "chat gpt-5-nano",
    "trace_id": "c0d7afd89e3687f2967972ab19ccb102",
    "span_id": "888688537ba42111",
    "parent_span_id": "145768e7306eb692",
    "start_time": 1777946530406614000,
    "end_time": 1777946534038007000,
    "duration_ms": 3631.393,
    "status": "StatusCode.UNSET",
    "attributes": {
      "gen_ai.tool.definitions": "[{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"description\": \"Get the current weather for a city.\", \"parameters\": {\"properties\": {\"city\": {\"type\": \"string\"}}, \"required\": [\"city\"], \"type\": \"object\"}}}, {\"type\": \"function\", \"function\": {\"name\": \"get_time\", \"description\": \"Get the current time in a timezone.\", \"parameters\": {\"properties\": {\"timezone\": {\"type\": \"string\"}}, \"required\": [\"timezone\"], \"type\": \"object\"}}}, {\"type\": \"function\", \"function\": {\"name\": \"calculate\", \"description\": \"Calculate a math expression.\", \"parameters\": {\"properties\": {\"expression\": {\"type\": \"string\"}}, \"required\": [\"expression\"], \"type\": \"object\"}}}]",
      "gen_ai.evaluation.sampled": true,
      "gen_ai.evaluation.error": "None",
      "gen_ai.input.messages": "[{\"role\": \"user\", \"parts\": [{\"type\": \"text\", \"content\": \"You are a helpful assistant. Use tools when needed.\"}]}, {\"role\": \"user\", \"parts\": [{\"type\": \"text\", \"content\": \"What is 2 + 2?\"}]}]",
      "gen_ai.provider.name": "openai",
      "gen_ai.agent.name": "circuit-react-agent",
      "gen_ai.agent.id": "c49542bbf7b48e41",
      "gen_ai.request.model": "gpt-5-nano",
      "gen_ai.operation.name": "chat",
      "gen_ai.response.model": "gpt-5-nano-2025-08-07",
      "gen_ai.usage.input_tokens": 188,
      "gen_ai.usage.output_tokens": 75,
      "gen_ai.response.finish_reasons": [
        "stop"
      ],
      "gen_ai.request.stream": false,
      "gen_ai.output.messages": "[{\"role\": \"assistant\", \"parts\": [{\"type\": \"text\", \"content\": \"4\"}], \"finish_reason\": \"stop\"}]"
    },
    "events": []
  },
  {
    "name": "step model",
    "trace_id": "c0d7afd89e3687f2967972ab19ccb102",
    "span_id": "145768e7306eb692",
    "parent_span_id": "c49542bbf7b48e41",
    "start_time": 1777946530403703000,
    "end_time": 1777946534038877000,
    "duration_ms": 3635.174,
    "status": "StatusCode.UNSET",
    "attributes": {
      "gen_ai.step.name": "model",
      "gen_ai.step.type": "chain",
      "gen_ai.evaluation.sampled": true,
      "gen_ai.evaluation.error": "None",
      "gen_ai.agent.name": "circuit-react-agent",
      "gen_ai.agent.id": "c49542bbf7b48e41"
    },
    "events": []
  },
  {
    "name": "invoke_agent circuit-react-agent",
    "trace_id": "c0d7afd89e3687f2967972ab19ccb102",
    "span_id": "c49542bbf7b48e41",
    "parent_span_id": null,
    "start_time": 1777946530402301000,
    "end_time": 1777946534046669000,
    "duration_ms": 3644.368,
    "status": "StatusCode.UNSET",
    "attributes": {
      "gen_ai.agent.id": "c49542bbf7b48e41",
      "gen_ai.framework": "langchain",
      "gen_ai.evaluation.sampled": true,
      "gen_ai.evaluation.error": "None",
      "gen_ai.output.messages": "[{\"role\": \"assistant\", \"parts\": [{\"type\": \"text\", \"content\": \"4\"}]}]",
      "gen_ai.agent.name": "circuit-react-agent",
      "gen_ai.conversation_root": true,
      "gen_ai.operation.name": "invoke_agent"
    },
    "events": []
  }
]

LangChain async span output

[
  {
    "name": "chat gpt-5-nano",
    "trace_id": "7535042f0263427201f59e5f01475de5",
    "span_id": "dc5a3b3c6cf342f3",
    "parent_span_id": "9f1809b2d7c2fc17",
    "start_time": 1777946534057605000,
    "end_time": 1777946535507804000,
    "duration_ms": 1450.199,
    "status": "StatusCode.UNSET",
    "attributes": {
      "gen_ai.tool.definitions": "[{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"description\": \"Get the current weather for a city.\", \"parameters\": {\"properties\": {\"city\": {\"type\": \"string\"}}, \"required\": [\"city\"], \"type\": \"object\"}}}, {\"type\": \"function\", \"function\": {\"name\": \"get_time\", \"description\": \"Get the current time in a timezone.\", \"parameters\": {\"properties\": {\"timezone\": {\"type\": \"string\"}}, \"required\": [\"timezone\"], \"type\": \"object\"}}}, {\"type\": \"function\", \"function\": {\"name\": \"calculate\", \"description\": \"Calculate a math expression.\", \"parameters\": {\"properties\": {\"expression\": {\"type\": \"string\"}}, \"required\": [\"expression\"], \"type\": \"object\"}}}]",
      "gen_ai.evaluation.sampled": true,
      "gen_ai.evaluation.error": "None",
      "gen_ai.input.messages": "[{\"role\": \"user\", \"parts\": [{\"type\": \"text\", \"content\": \"You are a helpful assistant. Use tools when needed.\"}]}, {\"role\": \"user\", \"parts\": [{\"type\": \"text\", \"content\": \"What is 2 + 2?\"}]}]",
      "gen_ai.provider.name": "openai",
      "gen_ai.agent.name": "circuit-react-agent",
      "gen_ai.agent.id": "69d14c88266c2d3b",
      "gen_ai.request.model": "gpt-5-nano",
      "gen_ai.operation.name": "chat",
      "gen_ai.response.model": "gpt-5-nano-2025-08-07",
      "gen_ai.usage.input_tokens": 188,
      "gen_ai.usage.output_tokens": 75,
      "gen_ai.response.finish_reasons": [
        "stop"
      ],
      "gen_ai.request.stream": false,
      "gen_ai.output.messages": "[{\"role\": \"assistant\", \"parts\": [{\"type\": \"text\", \"content\": \"4\"}], \"finish_reason\": \"stop\"}]"
    },
    "events": []
  },
  {
    "name": "step model",
    "trace_id": "7535042f0263427201f59e5f01475de5",
    "span_id": "9f1809b2d7c2fc17",
    "parent_span_id": "69d14c88266c2d3b",
    "start_time": 1777946534054480000,
    "end_time": 1777946535508086000,
    "duration_ms": 1453.606,
    "status": "StatusCode.UNSET",
    "attributes": {
      "gen_ai.step.name": "model",
      "gen_ai.step.type": "chain",
      "gen_ai.evaluation.sampled": true,
      "gen_ai.evaluation.error": "None",
      "gen_ai.agent.name": "circuit-react-agent",
      "gen_ai.agent.id": "69d14c88266c2d3b"
    },
    "events": []
  },
  {
    "name": "invoke_agent circuit-react-agent",
    "trace_id": "7535042f0263427201f59e5f01475de5",
    "span_id": "69d14c88266c2d3b",
    "parent_span_id": null,
    "start_time": 1777946534053922000,
    "end_time": 1777946535508631000,
    "duration_ms": 1454.709,
    "status": "StatusCode.UNSET",
    "attributes": {
      "gen_ai.agent.id": "69d14c88266c2d3b",
      "gen_ai.framework": "langchain",
      "gen_ai.evaluation.sampled": true,
      "gen_ai.evaluation.error": "None",
      "gen_ai.output.messages": "[{\"role\": \"assistant\", \"parts\": [{\"type\": \"text\", \"content\": \"4\"}]}]",
      "gen_ai.agent.name": "circuit-react-agent",
      "gen_ai.conversation_root": true,
      "gen_ai.operation.name": "invoke_agent"
    },
    "events": []
  }
]

shuningc and others added 6 commits May 7, 2026 23:46
shuningc requested review from a team as code owners May 8, 2026 07:01