Prometheus Metrics

ecs_agent.metrics provides a Prometheus metrics surface for production observability. The integration subscribes to the World event bus, records framework-owned runtime events into a private CollectorRegistry, and exposes Prometheus text format without using the global default registry.

Quick Start

import asyncio

from ecs_agent.core import Runner, World
from ecs_agent.metrics import install_prometheus_metrics, render_metrics

world = World()
metrics = install_prometheus_metrics(world)


async def main() -> None:
    # Register systems and add agent entities as usual.
    await Runner().run(world, max_ticks=3)


asyncio.run(main())

# Render the recorder's private registry as Prometheus text bytes.
body = render_metrics(metrics)

install_prometheus_metrics(world) is idempotent: repeated calls on the same world return the already installed PrometheusMetrics recorder and do not double-subscribe handlers. Use uninstall_prometheus_metrics(world) to remove the event-bus subscriptions; the returned recorder remains renderable for a final scrape or for test assertions.
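
The install and uninstall semantics above can be pictured with a toy model. This is a minimal sketch, not ecs_agent's implementation: ToyBus, ToyRecorder, install, and uninstall are hypothetical stand-ins, and the sketch assumes only that the bus keeps a list of handlers and that the installer remembers one recorder per bus.

```python
from collections import Counter


class ToyBus:
    """Hypothetical event bus: a plain list of subscribed handlers."""

    def __init__(self):
        self.handlers = []

    def publish(self, event_name):
        for handler in list(self.handlers):
            handler(event_name)


class ToyRecorder:
    """Hypothetical recorder: counts events by name."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event_name):
        self.counts[event_name] += 1


# id(bus) -> recorder; an assumption about how install state could be tracked.
_installed = {}


def install(bus):
    # Return the existing recorder instead of subscribing a second handler.
    if id(bus) in _installed:
        return _installed[id(bus)]
    recorder = ToyRecorder()
    bus.handlers.append(recorder.on_event)
    _installed[id(bus)] = recorder
    return recorder


def uninstall(bus):
    # Drop the subscription but hand the recorder back so it stays readable.
    recorder = _installed.pop(id(bus), None)
    if recorder is not None:
        bus.handlers.remove(recorder.on_event)
    return recorder


bus = ToyBus()
first = install(bus)
second = install(bus)   # same recorder, no second subscription
bus.publish("tick")
final = uninstall(bus)
bus.publish("tick")     # published after uninstall, so not recorded
```

In this toy model the second install call returns the first recorder, and the recorder returned by uninstall still holds the counts gathered before the subscription was removed.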

Endpoint Modes

All endpoint helpers use the registry owned by the PrometheusMetrics instance you pass in.

from ecs_agent.metrics import (
    make_metrics_asgi_app,
    make_metrics_wsgi_app,
    render_metrics,
    start_metrics_server,
)

body = render_metrics(metrics)
asgi_app = make_metrics_asgi_app(metrics)
wsgi_app = make_metrics_wsgi_app(metrics)
handle = start_metrics_server(9100, addr="127.0.0.1", metrics=metrics)

try:
    ...
finally:
    handle.close(timeout=5)

render_metrics(metrics) returns Prometheus text bytes for custom handlers, tests, or CLI output.
make_metrics_asgi_app(metrics) returns a small framework-free ASGI callable suitable for mounting at /metrics.
make_metrics_wsgi_app(metrics) returns a framework-free WSGI callable suitable for mounting at /metrics.
start_metrics_server(port, addr=..., metrics=...) starts a standalone HTTP server. The returned handle supports close(), shutdown(), server_close(), join(), and tuple-unpacking for access to the underlying server/thread.
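
To make "framework-free WSGI callable" concrete, here is a minimal sketch of what such a callable might look like. It is not the code behind make_metrics_wsgi_app: make_text_wsgi_app and its render argument are hypothetical, and the content type shown is the standard Prometheus text format 0.0.4 value, which the real helper may or may not use.

```python
PROMETHEUS_TEXT = "text/plain; version=0.0.4; charset=utf-8"


def make_text_wsgi_app(render):
    """Build a WSGI callable that serves render() output.

    `render` is a hypothetical zero-argument function returning bytes.
    """

    def app(environ, start_response):
        if environ.get("REQUEST_METHOD", "GET") != "GET":
            start_response("405 Method Not Allowed", [("Allow", "GET")])
            return [b""]
        body = render()
        headers = [
            ("Content-Type", PROMETHEUS_TEXT),
            ("Content-Length", str(len(body))),
        ]
        start_response("200 OK", headers)
        return [body]

    return app


# Exercise the app directly, the way a WSGI server would call it.
captured = {}


def start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


app = make_text_wsgi_app(lambda: b"# HELP demo_total Demo counter.\n")
chunks = app({"REQUEST_METHOD": "GET", "PATH_INFO": "/metrics"}, start_response)
```

Because the callable depends only on the WSGI calling convention, it can be mounted under any WSGI server or router, or called directly in a test as shown.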

Prometheus and Grafana UI Demo

For a runnable local Prometheus + Grafana setup, see examples/prometheus/. The demo starts an ecs-agent process with a standalone :9100/metrics endpoint, a Docker Compose Prometheus server that scrapes it, and a Grafana instance with an automatically provisioned Prometheus datasource and ecs-agent Overview dashboard. To view charts, open http://localhost:3000 and log in with admin / admin. For raw PromQL queries such as ecs_agent_runs_total, ecs_agent_llm_invocations_total, or rate(ecs_agent_llm_invocations_total[1m]), use http://localhost:9090.

Metric Contract

Prometheus names use the ecs_agent_ prefix. Duration histograms are in seconds. Counters are listed without the Prometheus client's generated _created samples.

Metric	Type	Labels	Meaning
`ecs_agent_runs_total`	Counter	`status`	Agent run outcomes.
`ecs_agent_runner_ticks_total`	Counter	`status`	Completed runner tick outcomes.
`ecs_agent_runner_tick_duration_seconds`	Histogram	`status`	Runner tick duration.
`ecs_agent_system_executions_total`	Counter	`system`, `status`	System execution outcomes.
`ecs_agent_system_execution_duration_seconds`	Histogram	`system`, `status`	Per-system execution duration.
`ecs_agent_active_entities`	Gauge	none	Latest observed active entity count.
`ecs_agent_llm_invocations_total`	Counter	`provider`, `model`, `operation`, `status`, `streaming`	Logical framework-owned LLM invocation outcomes.
`ecs_agent_llm_invocation_duration_seconds`	Histogram	`provider`, `model`, `operation`, `status`, `streaming`	Logical LLM invocation duration.
`ecs_agent_llm_tokens_total`	Counter	`provider`, `model`, `token_type`	Normalized prompt/completion/total/cache token usage when providers return usage.
`ecs_agent_llm_retries_total`	Counter	`provider`, `model`, `reason`	Retry attempts emitted by retry-capable model wrappers.
`ecs_agent_tool_calls_total`	Counter	`tool`, `status`	Tool call outcomes.
`ecs_agent_tool_call_duration_seconds`	Histogram	`tool`, `status`	Tool call runtime.
`ecs_agent_tool_denied_total`	Counter	`tool`, `reason`	Denied tool attempts.
`ecs_agent_tool_approved_total`	Counter	`tool`, `policy`	Approved tool attempts.
`ecs_agent_errors_total`	Counter	`system`, `error_type`	Captured system error outcomes without raw exception text.
`ecs_agent_terminals_total`	Counter	`reason`	Terminal run outcomes.
`ecs_agent_stream_events_total`	Counter	`event`, `status`	Stream start/delta/end/interruption lifecycle counts.
`ecs_agent_stream_first_delta_seconds`	Histogram	`provider`, `model`, `operation`, `status`	Time to first stream delta.
`ecs_agent_stream_duration_seconds`	Histogram	`provider`, `model`, `operation`, `status`	Total stream duration.
`ecs_agent_subagent_lifecycle_total`	Counter	`phase`, `status`	Subagent lifecycle events.
`ecs_agent_message_bus_events_total`	Counter	`event`, `operation`	Message-bus publish, deliver, timeout, and response-style operations.
`ecs_agent_checkpoint_operations_total`	Counter	`operation`, `status`	Checkpoint save/restore operations.
`ecs_agent_compaction_operations_total`	Counter	`operation`, `status`	Conversation compaction operations.
`ecs_agent_mcts_nodes_scored_total`	Counter	`phase`, `status`	MCTS node scoring outcomes without node IDs or scores as labels.
`ecs_agent_plan_steps_total`	Counter	`operation`, `status`	Planning and replanning step operations.
`ecs_agent_tool_result_cached_total`	Counter	`status`	Tool-result cache write and reuse outcomes.
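
When asserting on these metrics in tests, it can help to read sample values back out of the rendered text. The sketch below is a small standard-library parser for Prometheus text exposition lines. parse_samples is a hypothetical helper, not part of ecs_agent.metrics, and the sample body, including the status values success and error, is illustrative rather than real output. The parser does not unescape label values.

```python
import re

# name, optional {labels}, value, optional integer timestamp
_SAMPLE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(body: bytes):
    """Yield (name, labels, value) for each sample line in Prometheus text."""
    for line in body.decode("utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        yield name, labels, float(value)


# Illustrative body; real output comes from render_metrics(metrics).
sample = b"""\
# HELP ecs_agent_runs_total Agent run outcomes.
# TYPE ecs_agent_runs_total counter
ecs_agent_runs_total{status="success"} 3.0
ecs_agent_runs_total{status="error"} 1.0
ecs_agent_active_entities 2.0
"""

totals = {}
for name, labels, value in parse_samples(sample):
    if name == "ecs_agent_runs_total":
        totals[labels["status"]] = value
```

With the real recorder, the same loop could be run over the bytes returned by render_metrics(metrics).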

Business and quality metrics are intentionally not inferred in this phase. They need explicit application events in a future extension so the framework does not guess business semantics from prompts, responses, tool payloads, or result text.

Label Safety Policy

Allowed labels are the low-cardinality contract values: system, status, reason, operation, provider, model, streaming, event, phase, policy, tool, error_type, and token_type.

The metrics layer rejects labels outside that set. It never exports ID-class labels such as entity, tool-call, request, correlation, trace, session, checkpoint, node, branch, or message identifiers. It also does not export world names, raw topics, prompts, responses, tool arguments/results, raw exception strings, artifact paths, API keys, or tokens.

Label values are sanitized to bounded strings. Empty values, very long values, whitespace-bearing values, path-like values, and quoted values fall back to unknown rather than becoming high-cardinality labels.
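
The policy above can be sketched as two small checks: one on label names and one on label values. This is an illustration under stated assumptions, not the library's code: sanitize_label_value and checked_labels are hypothetical names, the 64-character bound is an assumed limit, and the path and quote tests are one plausible reading of "path-like" and "quoted".

```python
ALLOWED_LABELS = frozenset({
    "system", "status", "reason", "operation", "provider", "model",
    "streaming", "event", "phase", "policy", "tool", "error_type",
    "token_type",
})
MAX_LABEL_LENGTH = 64  # assumed bound; the real limit is not stated here


def sanitize_label_value(value) -> str:
    """Return a bounded label value, or "unknown" if it looks unsafe."""
    text = "" if value is None else str(value)
    if not text or len(text) > MAX_LABEL_LENGTH:
        return "unknown"  # empty or very long
    if any(ch.isspace() for ch in text):
        return "unknown"  # whitespace-bearing
    if "/" in text or "\\" in text:
        return "unknown"  # path-like (assumed to mean: contains a separator)
    if '"' in text or "'" in text:
        return "unknown"  # quoted
    return text


def checked_labels(labels: dict) -> dict:
    """Reject label names outside the contract, then sanitize the values."""
    unexpected = set(labels) - ALLOWED_LABELS
    if unexpected:
        raise ValueError(f"labels outside the contract: {sorted(unexpected)}")
    return {name: sanitize_label_value(value) for name, value in labels.items()}


safe = checked_labels({"tool": "search", "status": "/tmp/secret.txt"})
```

Falling back to a single unknown value keeps the number of distinct label combinations bounded even when a caller passes unexpected data.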

Event Sources

Metrics are recorded from explicit event-bus events rather than log scraping:

Runner/system lifecycle events feed run, tick, system, terminal, and active-entity metrics.
LLM invocation, retry, and stream events feed model, token, retry, and stream metrics.
Tool execution and approval events feed tool counters and durations.
Delegation, message-bus, checkpoint, compaction, MCTS, plan-step, and tool-result cache events feed runtime-control metrics.
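
As a rough picture of how one event could feed both a counter and a duration histogram, consider the sketch below. Everything in it is hypothetical: the ToolCallFinished payload, the ToolMetrics class, and the bucket bounds are assumptions for illustration, and the real event classes and bucket layout may differ.

```python
from bisect import bisect_left
from collections import Counter
from dataclasses import dataclass

# Assumed histogram upper bounds, in seconds.
BUCKETS = (0.1, 0.5, 1.0, 5.0, float("inf"))


@dataclass(frozen=True)
class ToolCallFinished:
    """Hypothetical event payload carrying only low-cardinality fields."""

    tool: str
    status: str
    duration_seconds: float


class ToolMetrics:
    def __init__(self):
        self.calls = Counter()    # (tool, status) -> count
        self.buckets = Counter()  # (tool, status, upper_bound) -> count

    def handle(self, event: ToolCallFinished) -> None:
        key = (event.tool, event.status)
        self.calls[key] += 1
        # Prometheus buckets are cumulative: count the observation in every
        # bucket whose upper bound is >= the observed value.
        first = bisect_left(BUCKETS, event.duration_seconds)
        for bound in BUCKETS[first:]:
            self.buckets[key + (bound,)] += 1


tool_metrics = ToolMetrics()
tool_metrics.handle(ToolCallFinished("search", "success", 0.3))
tool_metrics.handle(ToolCallFinished("search", "success", 2.0))
```

Note that the payload carries a tool name and a status but no call ID or arguments, which matches the label safety policy described earlier.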

Live Smoke Test

The live smoke test is optional and environment-gated. It skips cleanly without LLM_API_KEY and uses only environment variables for live configuration:

LLM_API_KEY="$LLM_API_KEY" \
  LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1 \
  LLM_MODEL=qwen3.5-flash \
  LLM_API_FORMAT=openai_chat_completions \
  uv run pytest tests/live/test_prometheus_metrics_live.py -v

Supported LLM_API_FORMAT values are openai_chat_completions, openai_responses, and anthropic_messages. The live test helper also accepts the existing shorthand values openai, chat, responses, and anthropic.
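
The environment gating and shorthand handling could look roughly like the sketch below. resolve_api_format and live_config are hypothetical names, not the live test helper itself. The shorthand mapping and the default format are assumptions: the text above lists the accepted shorthands but does not say which canonical value each resolves to.

```python
import os

CANONICAL_FORMATS = (
    "openai_chat_completions",
    "openai_responses",
    "anthropic_messages",
)

# Assumed mapping from shorthand to canonical value.
SHORTHAND = {
    "openai": "openai_chat_completions",
    "chat": "openai_chat_completions",
    "responses": "openai_responses",
    "anthropic": "anthropic_messages",
}


def resolve_api_format(raw: str) -> str:
    """Map a canonical or shorthand value to its canonical form."""
    value = raw.strip().lower()
    value = SHORTHAND.get(value, value)
    if value not in CANONICAL_FORMATS:
        raise ValueError(f"unsupported LLM_API_FORMAT: {raw!r}")
    return value


def live_config(environ=os.environ):
    """Return live-test settings, or None when LLM_API_KEY is absent."""
    if not environ.get("LLM_API_KEY"):
        return None  # the caller should skip the live test
    return {
        "api_key": environ["LLM_API_KEY"],
        "base_url": environ.get("LLM_BASE_URL"),
        "model": environ.get("LLM_MODEL"),
        # The default used here is an assumption.
        "api_format": resolve_api_format(
            environ.get("LLM_API_FORMAT", "openai_chat_completions")
        ),
    }


skipped = live_config({})
config = live_config({"LLM_API_KEY": "placeholder", "LLM_API_FORMAT": "anthropic"})
```

A test module could call a helper like this once and skip when it returns None, so that no live configuration ever needs to be hard-coded.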
