A benchmarking framework for measuring LLM accuracy in autonomous tool-use scenarios. This harness was developed to produce data for research on AI agent performance when interacting with external systems via the Model Context Protocol (MCP).
This project implements a ReAct (Reasoning + Acting) agent pattern to test how well large language models can:
- Reason about what information they need to answer a question
- Select appropriate tools from available MCP server endpoints
- Execute tool calls with correct parameters
- Interpret results and iterate until the question is answered
- Synthesize accurate final responses
The harness captures comprehensive metrics including token usage, API costs, execution time, and tool call patterns—enabling quantitative analysis of LLM accuracy and efficiency.
The harness tests 16 prompts across 4 data sources using 6 MCP providers:
| Data Source | # Prompts | Test Categories |
|---|---|---|
| CRM | 4 | Deals count, contact search, deal listing, deal updates |
| Project Management | 4 | Issue filtering, backlog queries, sprint issues, status transitions |
| Data Warehouse | 4 | Schema introspection, ORDER BY queries, date filtering, UPDATE operations |
| ERP | 4 | Account filtering, ID range queries, charge listing, timezone lookup |

| Provider | Auth Type | Supported Data Sources |
|---|---|---|
| CData Connect AI | Basic Auth | CRM, Project Management, Data Warehouse, ERP (350+ sources) |
| CRM Native | Bearer | CRM only |
| ERP Native | Bearer | ERP only |
| iPaaS Provider | URL Token | CRM, Project Management, Data Warehouse |
| MCP Gateway | URL Token | CRM, Project Management |
| Unified API | stdio/Bearer | CRM, Project Management |
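Each provider's auth type maps to a different MCP connection configuration. A minimal sketch of how these might be expressed (illustrative only: the repo's actual config.py may differ, and the transport and header layout are assumptions):

```python
import os

# Illustrative provider connection configs; not the repo's actual config.py.
PROVIDERS = {
    "connectai": {   # Basic Auth: base64-encoded username:personal_access_token
        "transport": "streamable_http",
        "url": os.environ["CONNECTAI_MCP_URL"],
        "headers": {"Authorization": f"Basic {os.environ['CONNECTAI_AUTH']}"},
    },
    "crm_native": {  # Bearer token in the Authorization header
        "transport": "streamable_http",
        "url": os.environ["CRM_NATIVE_MCP_URL"],
        "headers": {"Authorization": f"Bearer {os.environ['CRM_NATIVE_BEARER_TOKEN']}"},
    },
    "ipaas": {       # URL token: the credential is embedded in the URL itself
        "transport": "streamable_http",
        "url": os.environ["IPAAS_MCP_URL"],
    },
}
```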
The overall flow through the harness:

```
User Prompt
     │
     ▼
┌─────────────────────────────────────┐
│ LangGraph ReAct Agent (GPT-5)       │
│ - Reasoning loop                    │
│ - Tool selection                    │
│ - Result interpretation             │
└─────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────┐
│ MCP Tools (via MultiServerMCPClient)│
│ - CRM operations                    │
│ - Project management operations     │
│ - Data warehouse SQL queries        │
│ - ERP data                          │
└─────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────┐
│ DynamicToolLogger                   │
│ - Token counting (OpenAI API)       │
│ - Cost calculation                  │
│ - Execution timing                  │
│ - Tool call statistics              │
└─────────────────────────────────────┘
```
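This loop is straightforward to wire up with the libraries listed under Dependencies. A minimal sketch, assuming a recent langgraph (prebuilt `create_react_agent`) and a recent langchain_mcp_adapters where `MultiServerMCPClient.get_tools()` is awaitable; the server URL, model identifier, and prompt are placeholders:

```python
import asyncio

from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent


async def main() -> None:
    # Connect to a single MCP server over streamable HTTP (placeholder URL).
    client = MultiServerMCPClient({
        "crm": {"transport": "streamable_http", "url": "https://example.com/mcp"},
    })
    tools = await client.get_tools()  # MCP endpoints exposed as LangChain tools

    # Prebuilt ReAct loop: reason -> select tool -> call -> interpret -> repeat.
    agent = create_react_agent("openai:gpt-5", tools)
    result = await agent.ainvoke({
        "messages": [{"role": "user", "content": "How many deals set to close this quarter?"}]
    })
    print(result["messages"][-1].content)  # final synthesized answer


asyncio.run(main())
```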
```bash
# Clone the repository
git clone https://github.com/CDataSoftware/connect-ai-accuracy-test-harness.git
cd connect-ai-accuracy-test-harness

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your credentials
```

```bash
# Run a single prompt
python run_tests.py --provider connectai --prompt-id crm_deals_count

# Run all CRM prompts, 5 runs each
python run_tests.py --provider connectai --data-source crm --runs 5

# Run all project management prompts and export results
python run_tests.py --provider connectai --data-source proj_mgmt --runs 5 --output results.csv

# List available prompts and providers
python run_tests.py --list-prompts
python run_tests.py --list-providers
```

Edit .env to configure your MCP server credentials. See .env.example for all options.
For all tests:

```
OPENAI_API_KEY=your_openai_api_key
```

For CData Connect AI (supports all data sources):

```
CONNECTAI_MCP_URL=https://mcp.cloud.cdata.com/mcp
CONNECTAI_AUTH=your_base64_encoded_credentials
```

Note: `CONNECTAI_AUTH` should be the base64-encoded `username:personal_access_token`.
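For example, the encoded value can be generated with Python's standard library (the username and token below are placeholders):

```python
import base64

# Encode "username:personal_access_token" for the CONNECTAI_AUTH variable.
cred = base64.b64encode(b"username:personal_access_token").decode("ascii")
print(cred)
```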
For CRM Native:

```
CRM_NATIVE_MCP_URL=https://example.com/mcp
CRM_NATIVE_BEARER_TOKEN=your_token
```

For ERP Native:

```
ERP_NATIVE_MCP_URL=https://your_account_id.myerp.com/mcp
ERP_NATIVE_BEARER_TOKEN=your_token
```

For iPaaS Provider:

```
IPAAS_MCP_URL=https://example.com/mcp
```

For MCP Gateway:

```
MCP_GATEWAY_URL=https://example.com/mcp
```

For Unified API:

```
UNIFIED_API_MCP_URL=https://example.com/mcp
UNIFIED_API_AUTH_TOKEN=your_api_key
```

The run_tests.py script provides a unified interface for running all tests:
```bash
# Run specific prompt against specific provider
python run_tests.py --provider connectai --prompt-id crm_deals_count --runs 5

# Run all prompts for a data source
python run_tests.py --provider ipaas --data-source proj_mgmt --runs 3

# Run all available prompts for a provider
python run_tests.py --provider mcp_gateway --runs 1

# Export results to CSV
python run_tests.py --provider connectai --data-source crm --runs 5 --output crm_results.csv

# Export results to JSON
python run_tests.py --provider connectai --data-source crm --runs 5 --output crm_results.json
```

The DynamicToolLogger captures:
| Metric | Description |
|---|---|
| `llm_prompt_tokens` | Input tokens sent to GPT-5 |
| `llm_completion_tokens` | Output tokens generated |
| `total_tool_calls` | Number of MCP tool invocations |
| `execution_time` | End-to-end duration in seconds |
| `input_cost` | Cost at $1.25 per 1M input tokens |
| `output_cost` | Cost at $10.00 per 1M output tokens |
| `tools_available` | Number of tools exposed by the MCP server |
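The bookkeeping behind these metrics can be sketched as follows. This is a minimal illustration using the per-token prices from the table above; the class and method names are hypothetical, not the actual API of the repo's logger.py:

```python
import time

INPUT_PRICE_PER_M = 1.25    # USD per 1M input tokens (see table above)
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens


class UsageAccumulator:
    """Hypothetical metrics accumulator; illustrative only."""

    def __init__(self) -> None:
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.tool_calls = 0
        self.start = time.monotonic()

    def record_llm_usage(self, usage: dict) -> None:
        # 'usage' is the token_usage dict reported by the OpenAI API.
        self.prompt_tokens += usage.get("prompt_tokens", 0)
        self.completion_tokens += usage.get("completion_tokens", 0)

    def record_tool_call(self) -> None:
        self.tool_calls += 1

    def summary(self) -> dict:
        input_cost = self.prompt_tokens * INPUT_PRICE_PER_M / 1_000_000
        output_cost = self.completion_tokens * OUTPUT_PRICE_PER_M / 1_000_000
        return {
            "llm_prompt_tokens": self.prompt_tokens,
            "llm_completion_tokens": self.completion_tokens,
            "total_tool_calls": self.tool_calls,
            "execution_time": time.monotonic() - self.start,
            "input_cost": round(input_cost, 4),
            "output_cost": round(output_cost, 4),
            "total_cost": round(input_cost + output_cost, 4),
        }
```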
Sample console output:

```
FINAL TOKEN USAGE STATISTICS
========================================
TOOL CALL STATISTICS:
Total tool calls: 3

LLM TOKEN USAGE (from OpenAI API):
Prompt tokens: 12,847
Completion tokens: 892
Total LLM tokens: 13,739

COST BREAKDOWN:
Input cost: $0.0161
Output cost: $0.0089
Total cost: $0.0250
```
```
.
├── run_tests.py          # Unified test runner
├── prompts.py            # Test prompt definitions (16 prompts)
├── config.py             # MCP provider configurations
├── logger.py             # Shared metrics logger
├── validate_config.py    # Configuration validation utility
├── requirements.txt      # Python dependencies
├── .env.example          # Environment template
└── LICENSE               # MIT License
```
CRM prompts:

| ID | Prompt | Category |
|---|---|---|
| `crm_deals_count` | How many deals set to close this quarter? | read |
| `crm_find_contact` | Find contacts named John Smith | read |
| `crm_list_deals` | List deals for Acme Corp | read |
| `crm_update_deal` | Change deal stage to contract sent | write |

Project Management prompts:

| ID | Prompt | Category |
|---|---|---|
| `proj_mgmt_issues_by_assignee` | Issues assigned to user in To Do | read |
| `proj_mgmt_backlog_issues` | Issues in project backlog | read |
| `proj_mgmt_sprint_issues` | Sprint issues sorted by priority | read |
| `proj_mgmt_move_issue` | Move issue to In Progress | write |

Data Warehouse prompts:

| ID | Prompt | Category |
|---|---|---|
| `data_warehouse_show_tables` | Show tables in schema | read |
| `data_warehouse_top_orders` | Top 10 orders by amount | read |
| `data_warehouse_recent_orders` | Orders in last 30 days | read |
| `data_warehouse_update_order` | Update order status | write |

ERP prompts:

| ID | Prompt | Category |
|---|---|---|
| `erp_income_accounts` | Accounts where type is Income | read |
| `erp_bins` | Bins for IDs 1-10 | read |
| `erp_charges` | List all charges | read |
| `erp_timezones` | Time zones for country IDs | read |
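Each prompt is defined in prompts.py together with the answer used for scoring. A hypothetical shape for one entry (illustrative only; apart from `expected_answer`, which the reproduction steps below rely on, the field names are assumptions):

```python
# Hypothetical structure of a prompt definition; the repo's actual
# prompts.py may organize these differently.
PROMPTS = {
    "crm_deals_count": {
        "prompt": "How many deals set to close this quarter?",
        "data_source": "crm",
        "category": "read",
        "expected_answer": "4",  # placeholder value
    },
}
```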
To reproduce the accuracy testing from the whitepaper:

- Configure credentials for all MCP providers you want to test
- Run the test suite with 5 runs per prompt:

  ```bash
  python run_tests.py --provider connectai --data-source crm --runs 5 --output connectai_crm.csv
  python run_tests.py --provider crm_native --data-source crm --runs 5 --output native_crm.csv
  python run_tests.py --provider ipaas --data-source crm --runs 5 --output ipaas_crm.csv
  ```

- Compare results with the expected answers in `prompts.py`
- Calculate accuracy by comparing `answer` to `expected_answer`
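That comparison is simple to script. A minimal sketch over a CSV export, assuming the result columns are named `answer` and `expected_answer` (an assumption based on the step above):

```python
import csv


def accuracy(path: str) -> float:
    """Fraction of runs whose answer exactly matches the expected answer."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return 0.0
    correct = sum(
        1 for row in rows
        if row["answer"].strip() == row["expected_answer"].strip()
    )
    return correct / len(rows)


print(f"Accuracy: {accuracy('connectai_crm.csv'):.0%}")
```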
- `langchain` / `langchain-openai` - LLM orchestration
- `langgraph` - ReAct agent implementation
- `langchain_mcp_adapters` - MCP protocol integration
- `tiktoken` - Token counting
- `python-dotenv` - Environment management
MIT License - See LICENSE file for details.
This is a research tool. Issues and pull requests are welcome for bug fixes and improvements to measurement accuracy.