Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 

README.md

@reaatech/agent-eval-harness-mcp-server

npm version License: MIT CI

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Three-layer MCP (Model Context Protocol) server exposing evaluation tools. Provides 13 tools across three layers — atomic judge operations, orchestrated suite runs, and CI gate operations — all accessible via MCP stdio transport for integration with AI coding agents like Claude Desktop.

Installation

npm install @reaatech/agent-eval-harness-mcp-server

Feature Overview

  • 13 MCP tools — covering the full evaluation lifecycle from atomic judgment to CI gate checking
  • Three-layer architectureeval.judge.* (5 fast, stateless atomic ops), eval.suite.* (5 orchestrated longer-running ops), eval.gate.* (3 blocking CI gate ops)
  • Stdio transport — standard MCP protocol over stdin/stdout, no HTTP server required
  • Auto-discovery — agents can list available tools and their input/output schemas at connection
  • In-memory state — session-scoped run storage with no external database dependency
  • JSON Schema tool definitions — each tool declares its input shape for type-safe agent invocation

Quick Start

import { createMCPServer } from '@reaatech/agent-eval-harness-mcp-server';

const server = await createMCPServer();
await server.start(); // Connects via stdio — ready for MCP clients

Configure tool layers individually:

import { createMCPServer } from '@reaatech/agent-eval-harness-mcp-server';

const server = await createMCPServer({
  name: 'my-eval-server',
  enableJudgeTools: true,
  enableSuiteTools: true,
  enableGateTools: false, // gate ops disabled
});

API Reference

Server

Export Type Description
EvalHarnessMCPServer Class MCP server instance wrapping @modelcontextprotocol/sdk
createMCPServer(config?) async (config?: Partial<MCPServerConfig>) => Promise<EvalHarnessMCPServer> Create and start server in one call

EvalHarnessMCPServer methods:

Method Description
run() Connect and start listening on stdio transport
getServer() Access underlying MCP Server instance
close() Gracefully close the server connection

Configuration

MCPServerConfig

Field Type Default Description
name string 'agent-eval-harness' Server name reported to MCP clients
version string package.version Server version
enableJudgeTools boolean true Register eval.judge.* tools
enableSuiteTools boolean true Register eval.suite.* tools
enableGateTools boolean true Register eval.gate.* tools

Tool Reference

Layer 1 — eval.judge.* (Atomic Operations)

Fast, stateless operations designed for mid-task self-evaluation by agents.

Tool Input Output Description
eval.judge.faithfulness { context: string, response: string } { score, explanation, confidence } Score response faithfulness to context
eval.judge.relevance { intent: string, response: string } { score, explanation, confidence } Score response relevance to intent
eval.judge.tool_correctness { expected_tool: string, actual_tool: string, arguments?: object, result?: object } { score, explanation, confidence } Validate tool selection and arguments
eval.judge.cost_check { trajectory: object, budget: number } { within_budget, cost, budget, usage_percentage } Verify cost within budget
eval.judge.latency_check { trajectory: object, sla: number } { within_sla, p99_ms, p50_ms, p90_ms, total_ms } Verify latency within SLA

Layer 2 — eval.suite.* (Orchestrated Runs)

Stateful operations for eval-driven development. In-memory storage per session.

Tool Input Output Description
eval.suite.run { trajectories: object[], config?: { metrics?, judge_model?, budget_limit? } } { run_id, status, total_trajectories, completed, failed, duration_ms } Execute evaluation suite
eval.suite.status { run_id: string } { run_id, status, progress, completed, total, started_at, ended_at } Get run progress
eval.suite.results { run_id: string, format?: 'json' | 'summary' } Aggregated results or summary Retrieve evaluation results
eval.suite.compare { baseline_run: string, candidate_run: string } { score_diff, verdict, regressions, improvements, key_findings } Compare two runs
eval.suite.baseline { run_id: string, name?: string } { baseline_id, name, set_at } Set baseline for regression

Layer 3 — eval.gate.* (CI Gates)

Blocking, opinionated operations for CI/CD pipelines. In-memory gate storage per session.

Tool Input Output Description
eval.gate.run { run_id?: string, gate_config?: string, results?: object, comparison?: object } { passed, total_gates, passed_gates, failed_gates, results, exit_code } Run CI-style pass/fail gate
eval.gate.config { action: 'get' | 'set' | 'list', config?: object[], preset?: 'standard' | 'strict' | 'lenient' } { gates } or { success, gates_loaded } Get/set/list gate configuration
eval.gate.diff { baseline: object, candidate: object, metrics?: string[] } { score_diff, metric_diffs, regressions, improvements, verdict } Detailed diff from baseline

Related Packages

Package Description
@reaatech/agent-eval-harness-types Shared domain types and schemas
@reaatech/agent-eval-harness-trajectory Trajectory evaluation
@reaatech/agent-eval-harness-tool-use Tool-use validation
@reaatech/agent-eval-harness-cost Cost tracking
@reaatech/agent-eval-harness-latency Latency monitoring
@reaatech/agent-eval-harness-judge LLM-as-judge
@reaatech/agent-eval-harness-golden Golden trajectories
@reaatech/agent-eval-harness-suite Suite runner
@reaatech/agent-eval-harness-gate CI gates
@reaatech/agent-eval-harness-cli CLI
@reaatech/agent-eval-harness-observability Observability

License

MIT