The ToolHive Registry Server provides comprehensive observability through OpenTelemetry (OTEL), supporting both distributed tracing and metrics collection via OTLP exporters.

```
┌─────────────────────────────────────────────────────────────────────┐
│                           Registry Server                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                  │
│  │    HTTP     │  │    Sync     │  │  Registry   │                  │
│  │ Middleware  │  │   Metrics   │  │   Metrics   │                  │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘                  │
│         │                │                │                         │
│         └────────────────┼────────────────┘                         │
│                          │                                          │
│                   ┌──────▼──────┐                                   │
│                   │  Telemetry  │                                   │
│                   │   Facade    │                                   │
│                   └──────┬──────┘                                   │
│                          │                                          │
│         ┌────────────────┼────────────────┐                         │
│         │                │                │                         │
│  ┌──────▼──────┐  ┌──────▼──────┐  ┌──────▼──────┐                  │
│  │   Tracer    │  │    Meter    │  │  Resource   │                  │
│  │  Provider   │  │  Provider   │  │ Attributes  │                  │
│  └──────┬──────┘  └──────┬──────┘  └─────────────┘                  │
│         │                │                                          │
│         └────────┬───────┘                                          │
│                  │ OTLP HTTP                                        │
└──────────────────┼──────────────────────────────────────────────────┘
                   │
                   ▼
           ┌────────────────┐
           │      OTEL      │
           │   Collector    │
           └───────┬────────┘
                   │
         ┌─────────┼─────────┐
         │         │         │
         ▼         ▼         ▼
    ┌────────┐ ┌────────┐ ┌────────┐
    │ Jaeger │ │Promethe│ │ Grafana│
    │        │ │   us   │ │        │
    └────────┘ └────────┘ └────────┘
```

The telemetry implementation is located in internal/telemetry/:

| File | Responsibility |
|---|---|
| telemetry.go | Main facade orchestrating tracer and meter providers |
| config.go | Configuration types with validation and defaults |
| tracer.go | TracerProvider setup with OTLP HTTP exporter |
| meter.go | MeterProvider setup with OTLP HTTP exporter |
| metrics.go | Application-specific metrics (registry and sync) |
| middleware.go | HTTP metrics middleware for Chi router |
| tracing_middleware.go | HTTP tracing middleware for distributed tracing |

Telemetry is configured via the main application config file:

```yaml
telemetry:
  enabled: true
  serviceName: "thv-registry-api"
  serviceVersion: "1.0.0"
  endpoint: "otel-collector:4318"
  insecure: true
  tracing:
    enabled: true
    sampling: 0.05 # 5% of traces sampled
  metrics:
    enabled: true
```

| Option | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable/disable all telemetry |
| serviceName | string | "thv-registry-api" | Service name in telemetry data |
| serviceVersion | string | "" | Service version in telemetry data |
| endpoint | string | "localhost:4318" | OTLP HTTP endpoint |
| insecure | bool | false | Use insecure connection (no TLS) |
| tracing.enabled | bool | false | Enable distributed tracing |
| tracing.sampling | float64 | 0.05 | Trace sampling ratio (0.0-1.0) |
| metrics.enabled | bool | false | Enable metrics collection |

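
The defaults and the sampling range in the table above can be expressed as a small Go configuration type. The following is a minimal sketch, not the contents of config.go: the type names (`Config`, `TracingConfig`, `MetricsConfig`) and the methods `WithDefaults` and `Validate` are hypothetical and chosen only for illustration. A pointer is used for the sampling ratio so that an unset value can be told apart from an explicit 0.0.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical types mirroring the options table; the real types in
// internal/telemetry/config.go may be shaped differently.
type TracingConfig struct {
	Enabled  bool
	Sampling *float64 // nil means "not set, use the default ratio"
}

type MetricsConfig struct {
	Enabled bool
}

type Config struct {
	Enabled        bool
	ServiceName    string
	ServiceVersion string
	Endpoint       string
	Insecure       bool
	Tracing        TracingConfig
	Metrics        MetricsConfig
}

// Defaults taken from the options table.
const (
	defaultServiceName = "thv-registry-api"
	defaultEndpoint    = "localhost:4318"
	defaultSampling    = 0.05
)

// WithDefaults returns a copy of c with unset fields replaced by the
// documented defaults.
func (c Config) WithDefaults() Config {
	if c.ServiceName == "" {
		c.ServiceName = defaultServiceName
	}
	if c.Endpoint == "" {
		c.Endpoint = defaultEndpoint
	}
	if c.Tracing.Sampling == nil {
		s := defaultSampling
		c.Tracing.Sampling = &s
	}
	return c
}

// Validate checks the documented constraints. A disabled configuration is
// always considered valid because nothing will be exported.
func (c Config) Validate() error {
	if !c.Enabled {
		return nil
	}
	if c.Endpoint == "" {
		return errors.New("telemetry: endpoint must not be empty")
	}
	if s := c.Tracing.Sampling; s != nil && (*s < 0.0 || *s > 1.0) {
		return fmt.Errorf("telemetry: tracing.sampling must be in [0.0, 1.0], got %v", *s)
	}
	return nil
}
```

Calling `Config{Enabled: true}.WithDefaults()` in this sketch yields the endpoint `localhost:4318` and a sampling ratio of 0.05, and `Validate` rejects a ratio such as 1.5.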
All metrics are prefixed with thv_reg_srv_ to distinguish them from other metrics in the system.

| Metric | Type | Labels | Description |
|---|---|---|---|
| thv_reg_srv_http_request_duration_seconds | Histogram | method, route, status_code | Duration of HTTP requests |
| thv_reg_srv_http_requests_total | Counter | method, route, status_code | Total number of HTTP requests |
| thv_reg_srv_http_active_requests | UpDownCounter | - | Number of in-flight requests |
| thv_reg_srv_servers_total | Gauge | registry | Number of servers in each registry |
| thv_reg_srv_sync_duration_seconds | Histogram | registry, success | Duration of sync operations |

Histogram bucket boundaries:
- HTTP metrics: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 seconds
- Sync metrics: 0.1, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300 seconds
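
To make the bucket boundaries concrete, the sketch below shows which bucket a given duration would fall into. It assumes explicit-bucket histogram semantics with inclusive upper bounds, where values above the last boundary land in an overflow bucket. The names `httpBuckets`, `syncBuckets`, and `bucketIndex` are hypothetical and exist only for this illustration.

```go
package main

import "sort"

// Boundaries copied from the lists above, in seconds.
var (
	httpBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}
	syncBuckets = []float64{0.1, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300}
)

// bucketIndex returns the index of the first boundary that is greater than
// or equal to the observed value. A result equal to len(bounds) means the
// observation exceeded every boundary and belongs to the overflow bucket.
func bucketIndex(bounds []float64, seconds float64) int {
	return sort.SearchFloat64s(bounds, seconds)
}
```

Under these assumptions, a 12 ms request (`bucketIndex(httpBuckets, 0.012)`) lands in the bucket bounded by 0.025, and a 30.5 s sync lands in the bucket bounded by 60. Requests slower than 10 seconds all share the overflow bucket, so percentile estimates in that range are coarse.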
The Registry Server implements distributed tracing across three layers: HTTP, Service, and Sync operations. Traces provide end-to-end visibility into request flows and background operations.

```
┌─────────────────────────────────────────────────────────────────────┐
│ HTTP Request Span (root)                                            │
│ Name: "GET /registry/v0.1/servers"                                  │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Service Span (child)                                            │ │
│ │ Name: "dbService.ListServers"                                   │ │
│ │ - Attributes: registry.name, pagination.limit, result.count     │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ Background Sync Span                                                │
│ Name: "sync.performRegistrySync"                                    │
│ - Attributes: registry.name, registry.type, sync.success,           │
│               sync.duration_seconds, sync.server_count              │
└─────────────────────────────────────────────────────────────────────┘
```


| Span Name | Kind | Description |
|---|---|---|
| {METHOD} {route} | Server | Root span for all HTTP requests |

Attributes:

| Attribute | Type | Description |
|---|---|---|
| http.request.method | string | HTTP method (GET, POST, etc.) |
| http.route | string | Route pattern (e.g., /registry/v0.1/servers/{serverName}) |
| url.path | string | Actual URL path |
| user_agent.original | string | Client user agent |
| http.response.status_code | int | Response status code |

Excluded Endpoints:
The following endpoints are intentionally excluded from tracing:

| Endpoint | Reason for Exclusion |
|---|---|
| /health | Health check endpoint |
| /readiness | Readiness probe endpoint |

Rationale for excluding health and readiness endpoints:
- High frequency, low diagnostic value: Health and readiness probes are typically called every 5-30 seconds by Kubernetes or load balancers. This generates a high volume of nearly identical spans that provide minimal insight into application behavior.
- Trace storage costs: Each span consumes storage in your tracing backend (e.g., Jaeger, Tempo). Health check spans can account for 50-90% of total span volume while providing almost no diagnostic value, significantly increasing storage costs.
- Signal-to-noise ratio: When investigating issues, health check spans clutter trace views and make it harder to find meaningful application traces. Excluding them improves the signal-to-noise ratio for debugging.
- Predictable behavior: Health and readiness endpoints have simple, predictable behavior (return 200 OK or an error). Unlike business logic endpoints, they rarely need trace-level debugging; HTTP metrics are sufficient for monitoring them.
- Industry best practice: Most observability frameworks and guidelines recommend excluding infrastructure endpoints from tracing. The OpenTelemetry community generally advises filtering out health checks at the instrumentation level.

If you need to debug health check issues, HTTP metrics (thv_reg_srv_http_request_duration_seconds) still capture latency and error rates for these endpoints.

| Span Name | Description |
|---|---|
| dbService.ListServers | List servers with optional filtering |
| dbService.ListServerVersions | List versions of a specific server |
| dbService.GetServerVersion | Get a specific server version |
| dbService.PublishServerVersion | Publish a new server version |
| dbService.DeleteServerVersion | Delete a server version |
| dbService.ListRegistries | List all registries |
| dbService.GetRegistryByName | Get a specific registry |
| dbService.CreateRegistry | Create a new registry |
| dbService.UpdateRegistry | Update an existing registry |
| dbService.DeleteRegistry | Delete a registry |

Common Attributes:

| Attribute | Type | Description |
|---|---|---|
| registry.name | string | Name of the registry |
| server.name | string | Name of the server |
| server.version | string | Version of the server |
| pagination.limit | int | Page size limit |
| pagination.has_cursor | bool | Whether pagination cursor is used |
| result.count | int | Number of results returned |
| registry.type | string | Type of registry (git, api, file, managed) |


| Span Name | Description |
|---|---|
| sync.performRegistrySync | Sync a registry from its source |

Attributes:

| Attribute | Type | Description |
|---|---|---|
| registry.name | string | Name of the registry being synced |
| registry.type | string | Type of registry source |
| sync.success | bool | Whether sync completed successfully |
| sync.duration_seconds | float64 | Duration of the sync operation |
| sync.server_count | int | Number of servers synced (on success) |


```
Trace ID: abc123def456...
[12ms] GET /registry/v0.1/servers
├── http.request.method: GET
├── http.route: /registry/v0.1/servers
├── http.response.status_code: 200
│
└── [10ms] dbService.ListServers
    ├── registry.name: upstream
    ├── pagination.limit: 50
    ├── pagination.has_cursor: false
    └── result.count: 25
```

```
Trace ID: xyz789abc...
[30.5s] sync.performRegistrySync
├── registry.name: upstream
├── registry.type: git
├── sync.success: true
├── sync.duration_seconds: 30.5
└── sync.server_count: 42
```

```
Trace ID: err456def...
[5ms] GET /registry/v0.1/servers/unknown/versions/1.0.0
├── http.request.method: GET
├── http.route: /registry/v0.1/servers/{serverName}/versions/{version}
├── http.response.status_code: 404
│
└── [3ms] dbService.GetServerVersion
    ├── server.name: unknown
    ├── server.version: 1.0.0
    ├── status: ERROR
    └── exception.message: server not found: unknown@1.0.0
```

Each component uses a unique tracer name for identification:
| Component | Tracer Name |
|---|---|
| HTTP Middleware | github.com/stacklok/toolhive-registry-server/http |
| Database Service | github.com/stacklok/toolhive-registry-server/service/db |
| Sync Coordinator | github.com/stacklok/toolhive-registry-server/sync/coordinator |
The Registry Server supports W3C Trace Context propagation. Incoming requests with traceparent headers will have their trace context extracted and used as the parent for all child spans. This enables distributed tracing across multiple services.
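
For reference, a version-00 traceparent header has four dash-separated fields: `00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`. The sketch below parses that format with the standard library only. It is an illustration of the header layout, not the propagator the Registry Server uses, and the names `TraceParent` and `parseTraceParent` are hypothetical. It handles version 00 only and ignores the companion tracestate header.

```go
package main

import (
	"encoding/hex"
	"errors"
	"strings"
)

// TraceParent holds the fields of a version-00 W3C traceparent header.
type TraceParent struct {
	TraceID  [16]byte
	ParentID [8]byte
	Sampled  bool
}

var errInvalidTraceParent = errors.New("invalid traceparent header")

// parseTraceParent parses "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>".
func parseTraceParent(header string) (TraceParent, error) {
	var tp TraceParent
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" ||
		len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return TraceParent{}, errInvalidTraceParent
	}
	// The format requires lowercase hex, but the hex package also accepts
	// uppercase digits, so reject mixed or upper case explicitly.
	if header != strings.ToLower(header) {
		return TraceParent{}, errInvalidTraceParent
	}
	if _, err := hex.Decode(tp.TraceID[:], []byte(parts[1])); err != nil {
		return TraceParent{}, errInvalidTraceParent
	}
	if _, err := hex.Decode(tp.ParentID[:], []byte(parts[2])); err != nil {
		return TraceParent{}, errInvalidTraceParent
	}
	flags, err := hex.DecodeString(parts[3])
	if err != nil {
		return TraceParent{}, errInvalidTraceParent
	}
	// All-zero trace and parent IDs are defined as invalid.
	if tp.TraceID == ([16]byte{}) || tp.ParentID == ([8]byte{}) {
		return TraceParent{}, errInvalidTraceParent
	}
	// The lowest flag bit carries the upstream sampling decision.
	tp.Sampled = flags[0]&0x01 == 0x01
	return tp, nil
}
```

A caller that sends `traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01` is passing along trace ID `4bf92f35...` with the sampled flag set, so spans created for that request join the caller's trace.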
The following components are not yet instrumented but are planned for future tracing coverage:
| Component | Description | Potential Value |
|---|---|---|
| Sync Writer | Database write operations during sync | Diagnose write performance bottlenecks |
| State Service | Sync state tracking operations | Debug sync state management issues |
These additions would provide deeper visibility into sync operations, particularly useful for diagnosing performance issues in large-scale deployments.
The telemetry implementation handles disabled or missing components gracefully:
- When telemetry is disabled, no-op providers are used (negligible overhead)
- Metrics and tracing can be independently enabled/disabled
- Nil provider checks prevent panics if metrics are not configured
The HTTP middleware extracts Chi route patterns (e.g., /registry/v0.1/servers/{serverName}) instead of actual URLs (e.g., /registry/v0.1/servers/my-server) to prevent metric cardinality explosion.
All telemetry data includes these resource attributes:

| Attribute | Description |
|---|---|
| service.name | Service name from config |
| service.version | Service version from config |
| host.name | Hostname of the running instance |
| telemetry.sdk.name | "opentelemetry" |
| telemetry.sdk.language | "go" |
| telemetry.sdk.version | OTEL SDK version |

Both traces and metrics are exported via OTLP HTTP (port 4318 by default):
- Traces use batch processing for efficiency
- Metrics use a periodic reader with 60-second intervals

If metrics are not appearing:
- Verify telemetry is enabled in config:

  ```yaml
  telemetry:
    enabled: true
    metrics:
      enabled: true
  ```

- Check the OTEL endpoint is reachable from the registry server
- Verify the OTEL Collector is configured to receive OTLP and export to Prometheus
- Check Prometheus is scraping the OTEL Collector's metrics endpoint (default port 8889)

If you see high cardinality warnings, check for:
- Custom routes not registered with Chi (will show as unknown_route)
- Dynamic path segments not using Chi parameters

If traces are not appearing:
- Verify tracing is enabled:

  ```yaml
  telemetry:
    tracing:
      enabled: true
  ```

- Check sampling rate - default is 5% (sampling: 0.05)
- Verify Jaeger or trace backend is configured in OTEL Collector
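
The sampling rate deserves particular attention when testing: at a ratio of 0.05, most individual requests produce no trace at all. The sketch below illustrates ratio-based sampling keyed on the trace ID, in the style of the OpenTelemetry Go SDK's trace-ID-ratio sampler. Whether the Registry Server uses exactly this sampler is an assumption, and `shouldSample` is a hypothetical name.

```go
package main

import "encoding/binary"

// shouldSample makes a deterministic sampling decision from the trace ID:
// the low 8 bytes are read as a 63-bit number and compared against
// ratio * 2^63. The same trace ID always yields the same decision, so every
// service that applies the same ratio agrees on which traces to keep.
func shouldSample(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	upperBound := uint64(ratio * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < upperBound
}
```

With randomly generated trace IDs, a ratio of 0.05 keeps roughly 1 in 20 traces. When verifying a new setup, temporarily setting `sampling: 1.0` removes sampling as a possible cause of missing traces.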