Reference example for writing a custom converter when no built-in framework adapter fits your agent. The agent here uses the Anthropic Python SDK directly (no LangChain, LangGraph, Strands, or OpenAI Agents SDK in sight) and drives a hand-rolled ReAct tool-call loop over the Messages API. The companion converter.py maps the agent's raw conversation history into an agentverify ExecutionResult that every assertion, including step-level ones, accepts unchanged.
agentverify ships built-in adapters for LangChain, LangGraph, Strands Agents, and OpenAI Agents SDK. If you're using any of those, you don't need this example. from_langchain(...) / from_langgraph(...) / from_strands(...) / from_openai_agents(...) each turn the framework's native output into an ExecutionResult in one call.
If your agent is built without any of those frameworks (a pure-Python script calling an LLM SDK directly, a custom orchestrator, or an in-house framework), you write a small converter function that does the same job. This example shows what that converter looks like and how much code it actually takes (about 80 lines of real logic).
The agent's job is intentionally small so the converter is easy to read: given "What's 100 + 200 with 10% tax added?", it chains two tools (add for the pre-tax total, apply_tax for the grossed-up final figure) and returns a one-sentence summary. The interesting assertion is that apply_tax's amount argument must actually come from add's tool result, not be hallucinated by the model.
- Python 3.10+
- Anthropic API key (only for re-recording the cassette; not needed for replay)
See the main README for setup instructions (git clone, venv, pip install).
cd examples/custom-converter-python-agent
export ANTHROPIC_API_KEY=sk-ant-your_key_here
python agent.pyThe agent prints its final answer and the total token usage for the session.
Tests ship with a pre-recorded cassette under tests/cassettes/. Just run:
pytestThe cassette replays deterministically. No Anthropic API calls, zero cost.
| Test Class | Test | agentverify Assertion | Description |
|---|---|---|---|
TestTaxAgentFlat |
test_tool_sequence |
assert_tool_calls() |
Exact sequence: add(100, 200) then apply_tax(_, rate=0.1) |
TestTaxAgentFlat |
test_safety_no_dangerous_tools |
assert_no_tool_call() |
delete_file, execute_command, transfer_funds never appear |
TestTaxAgentFlat |
test_budget_and_output |
assert_cost() + assert_final_output() via assert_all |
Token budget ≤ 10k and the final output quotes 330 |
TestTaxAgentStepLevel |
test_first_step_calls_add |
assert_step() |
Step 0 is the assistant's first turn: exactly one add call |
TestTaxAgentStepLevel |
test_second_step_calls_apply_tax_with_sum |
assert_step() |
Step 1 is the assistant's second turn: apply_tax with some amount and the requested rate |
TestTaxAgentStepLevel |
test_apply_tax_uses_add_result |
assert_step_uses_result_from() |
The headline check. Step 1 actually consumes step 0's add result. Catches "model hallucinated a number instead of using the tool result" bugs |
TestTaxAgentStepLevel |
test_final_step_has_no_tool_calls |
assert_step() |
The last step (the summary) has zero tool calls |
TestTaxAgentStepLevel |
test_final_output_contains_grossed_up_number |
assert_final_output() |
The summary quotes the post-tax figure |
Every step-level check works on cassette replay because the converter attaches each user message's tool_result blocks onto the preceding assistant step's tool_results, exactly matching the structure agentverify's step-level API expects.
To re-record the cassette with real LLM calls:
- Set the API key:
export ANTHROPIC_API_KEY=sk-ant-your_key_here - Run a single test with
--cassette-mode=record. All eight tests share the same cassette file, so running the whole suite in record mode rewrites the file once per test; recording from a single test keeps the file in a deterministic state.pytest -k test_first_step_calls_add --cassette-mode=record
- Commit the updated cassette.
temperature is not explicitly pinned (Anthropic defaults apply), and claude-haiku-4-5 is non-deterministic even at low temperature. The test assertions are written to tolerate that variance: assert_tool_calls uses partial_args=True so the exact amount the model passes to apply_tax isn't hard-coded, and assert_final_output matches the post-tax number via a regex that accepts both 330 and 330.00.
The SDK emits the conversation as a flat list of user / assistant messages. Each assistant turn becomes one step; each user turn carries the tool results for the previous assistant turn. converter.py walks the list in order and shapes it like this:
| Anthropic message field | ExecutionResult / Step field |
|---|---|
assistant message's tool_use content blocks |
step.tool_calls |
Next user message's tool_result content blocks |
step.tool_results |
assistant message's text content blocks (joined) |
step.output |
| Conversation history up to this assistant turn | step.input_context |
response.usage.input_tokens / output_tokens (summed) |
execution_result.token_usage |
Final assistant text block |
execution_result.final_output |
The actual code is about 80 lines including docstrings and type hints. Copy converter.py as a starting point for your own adapter and adjust the field extraction for whatever shape your agent emits.
Use a custom converter when:
- Your agent is built on an SDK that agentverify doesn't ship a built-in adapter for (Anthropic direct, OpenAI direct without the Agents SDK, Cohere, Mistral, your own in-house orchestrator, etc.)
- Your agent deviates meaningfully from what a built-in adapter expects (e.g. a LangChain agent wrapped in custom pre/post-processing that changes the step structure)
- You want the conversion logic to live in your own codebase alongside tests, rather than importing it from a library
If you're using LangChain, LangGraph, Strands Agents, or OpenAI Agents SDK unmodified, prefer the built-in adapter. It's a one-liner and handles edge cases (token usage aggregation, tool-result backfill, message-format quirks) that you'd otherwise rediscover yourself.