This guide covers the main configuration knobs, environment variables, and diagnostic tools available for tuning yagpcc performance. Every option listed here can be set in yagpcc.yaml unless stated otherwise.
Go 1.19+ honours the GOMEMLIMIT environment variable (or its runtime equivalent). Setting it caps the total amount of memory the Go runtime will try to use before triggering more aggressive garbage collection.
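
    # Soft-limit the memory managed by the Go runtime to 512 MiB
    export GOMEMLIMIT=512MiB
    ./devbin/yagpcc

This is useful when yagpcc runs on a host shared with Greenplum processes and you want to keep the agent from consuming too much RAM. GOMEMLIMIT is a soft limit: the runtime collects garbage more aggressively as usage approaches it, but memory can still exceed the limit if the live heap does not fit.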
Tip: Combine GOMEMLIMIT with maximum_stored_queries (see below) for predictable memory usage.
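One way to choose a value is to derive it from the RAM you are willing to give the agent. The sketch below is illustrative only: the helper name and the 10% headroom are assumptions, not part of yagpcc. The only fact relied on is that GOMEMLIMIT accepts a byte count with an optional B, KiB, MiB, GiB, or TiB suffix.

```python
def gomemlimit_value(budget_bytes: int, headroom: float = 0.10) -> str:
    """Return a GOMEMLIMIT string for a given RAM budget (hypothetical helper).

    `headroom` is the fraction of the budget left for memory that the Go
    runtime does not manage; the 10% default is an assumption.
    """
    if budget_bytes <= 0:
        raise ValueError("budget_bytes must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    limit_mib = int(budget_bytes * (1 - headroom)) // (1024 * 1024)
    return f"{max(limit_mib, 1)}MiB"


# A 512 MiB budget with the default headroom.
print(gomemlimit_value(512 * 1024 * 1024))  # 460MiB
```

The result can be exported as GOMEMLIMIT before starting yagpcc, in the same way as the fixed value shown earlier.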
yagpcc uses zap in production mode for structured JSON logging. The log destination and level are configured under the app.logging key:

    app:
      logging:
        level: debug   # debug, info, warn, error
        file: stdout   # default: write to stdout

| file value | Behaviour |
|---|---|
| stdout (default) | Logs go to standard output. When running under systemd or a process manager, they are captured by the journal or redirected to a file by the service unit. |
| A file path (e.g. /var/log/yagpcc/server.log) | Logs are written directly to the specified file. The directory must exist and be writable by the yagpcc process. |

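Because the log is structured JSON, it can be summarised with a few lines of scripting. The following is a minimal sketch, assuming each log line is a JSON object with a "level" field (the zap production default; yagpcc may override the key name). The function name is hypothetical.

```python
import json
from collections import Counter


def count_levels(lines):
    """Count log records per level; lines that are not JSON objects are
    tallied under 'unparsed' (for example, plain-text runtime output)."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["unparsed"] += 1
            continue
        if not isinstance(record, dict):
            counts["unparsed"] += 1
            continue
        # "level" is assumed to be the key holding the log level.
        counts[record.get("level", "unknown")] += 1
    return counts


sample = [
    '{"level":"info","msg":"session refresh"}',
    '{"level":"error","msg":"segment pull failed"}',
    "panic: runtime error",
]
print(count_levels(sample))
```

With a log file, the same function can be fed an open file handle, since iterating over a file yields one line at a time.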
Besides the structured log, Go itself may write to stderr in certain situations (e.g. runtime panics, fatal errors from log.Fatal, or unrecovered goroutine crashes). These messages bypass zap and go directly to the process stderr stream.
When running yagpcc as a systemd service, both stdout and stderr are typically captured by journalctl:

    # View recent yagpcc logs from the journal
    journalctl -u yagpcc -n 200 --no-pager
    # Follow logs in real time
    journalctl -u yagpcc -f
    # Filter by priority (errors only)
    journalctl -u yagpcc -p err

When running manually, redirect both streams to a file if needed:

    ./devbin/yagpcc > /var/log/yagpcc/output.log 2>&1

The master role writes historical data (sessions, queries, segment metrics) to JSON files via the archiver. These are configured under arch_config:

    arch_config:
      sessions_file: sessions.json          # default
      queries_file: queries.json            # default
      segments_file: segments.json          # default
      plan_detail_file: plan_details.json   # default
      max_file_size: 419430400              # 400 MiB; files rotate when this size is exceeded

These files grow over time and are rotated when they exceed max_file_size. Monitor their disk usage, especially on clusters with high query throughput.
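To monitor how close each archive file is to rotation, a small script can compare file sizes against max_file_size. This is a sketch under assumptions: the function name is hypothetical, and the paths passed in must match wherever your deployment actually writes the archiver files.

```python
import os

# 400 MiB expressed in bytes, matching the max_file_size value documented here.
MAX_FILE_SIZE = 400 * 1024 * 1024


def rotation_progress(paths, max_file_size=MAX_FILE_SIZE):
    """Map each existing path to its size as a fraction of max_file_size.

    Paths that do not exist are skipped rather than reported as errors.
    """
    return {
        path: os.path.getsize(path) / max_file_size
        for path in paths
        if os.path.exists(path)
    }


if __name__ == "__main__":
    # File names are the documented defaults; the working directory is assumed.
    for path, fraction in rotation_progress(
        ["sessions.json", "queries.json", "segments.json", "plan_details.json"]
    ).items():
        print(f"{path}: {fraction:.1%} of max_file_size")
```

A fraction that reaches 100% quickly after each rotation suggests high archive throughput and is a cue to check free disk space.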
Setting level: debug produces verbose output (every session refresh, segment pull, and procfs cycle is logged). For production, use info or warn to reduce I/O overhead:

    app:
      logging:
        level: info

The yagp-hooks-collector extension inside Greenplum can send telemetry for nested (internal) queries. On busy clusters this can generate a large volume of messages over UDS. If you do not need nested-query visibility, disable it in the hooks-collector configuration to reduce CPU and memory pressure on both the extension and yagpcc.
Refer to the yagp-hooks-collector documentation for the exact parameter name (typically a GUC like yagp_hooks_collector.send_nested_queries).

    maximum_stored_queries: 50000   # default

Controls the upper bound on how many running-query records yagpcc keeps in memory. Lowering this value reduces RAM usage but may cause recently started short queries to be evicted before they complete. Raising it allows tracking more concurrent queries at the cost of higher memory.
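The trade-off can be made concrete with back-of-the-envelope arithmetic. The sketch below is hypothetical: the average size of a running-query record is not documented here, so avg_record_bytes is an assumption that you would need to measure yourself (for example from a heap profile), and share is the fraction of the memory budget you are willing to spend on these records.

```python
def max_records_for_budget(budget_bytes, avg_record_bytes, share=0.5):
    """Estimate how many running-query records fit in a memory budget.

    avg_record_bytes is a measured or guessed average (an assumption);
    share is the fraction of the budget reserved for query records.
    """
    if avg_record_bytes <= 0:
        raise ValueError("avg_record_bytes must be positive")
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return int(budget_bytes * share) // avg_record_bytes


# Example: a 512 MiB budget, half of it for records of roughly 4 KiB each.
print(max_records_for_budget(512 * 1024 * 1024, 4096))  # 65536
```

If the estimate falls below the configured maximum_stored_queries, either lower the setting or raise the memory budget.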

    short_agg_interval: 10m   # default

Queries shorter than minimum_query_duration_sec are aggregated into per-user buckets. short_agg_interval defines the time window for these buckets. A shorter interval produces more granular aggregation but increases the number of stored aggregation records; a longer interval reduces record count but lowers time resolution.
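The effect of the window size on record count is easiest to see in a toy model. This is a minimal sketch of per-user, fixed-window bucketing, not yagpcc's actual implementation: the input tuple layout and the aggregated fields (count, total_sec) are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_short_queries(queries, min_duration_sec, interval_sec=600):
    """Group short queries into (user, window_start) buckets.

    queries: iterable of (user, start_epoch_sec, duration_sec) tuples.
    Queries lasting at least min_duration_sec are skipped, since they
    would be tracked individually instead of being aggregated.
    """
    buckets = defaultdict(lambda: {"count": 0, "total_sec": 0.0})
    for user, start, duration in queries:
        if duration >= min_duration_sec:
            continue
        # Align the start time to the beginning of its window.
        window_start = int(start // interval_sec) * interval_sec
        bucket = buckets[(user, window_start)]
        bucket["count"] += 1
        bucket["total_sec"] += duration
    return dict(buckets)


sample = [("alice", 0, 0.5), ("alice", 599, 0.2), ("alice", 600, 0.1), ("bob", 10, 5.0)]
for key, value in sorted(aggregate_short_queries(sample, min_duration_sec=1.0).items()):
    print(key, value)
```

With a 10-minute window the three short queries from alice land in two buckets; halving the window can only keep or increase the number of buckets, which is the storage cost described above.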

    procfs_enabled: true            # default; set to false to disable
    procfs_refresh_interval: 60s    # default

When enabled, the master periodically queries each segment host for /proc/<pid>/stat, /proc/<pid>/status, and /proc/<pid>/io data of running Greenplum processes. This provides per-session CPU, RSS, and I/O metrics with 5/15/30-minute exponential moving averages.
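To see how the refresh interval affects the moving averages, consider the following sketch of a time-weighted EMA. The exact formula yagpcc uses is not given in this guide, so the smoothing factor alpha = 1 - exp(-dt / window) is an assumption (a common choice for irregularly spaced samples), and the function name is hypothetical.

```python
import math

# Window lengths in seconds for the 5/15/30-minute averages.
WINDOWS_SEC = {"5m": 300, "15m": 900, "30m": 1800}


def update_emas(emas, sample, dt_sec):
    """Return new EMA values after observing `sample` dt_sec after the
    previous one. Missing averages are seeded with the first sample."""
    updated = {}
    for name, window in WINDOWS_SEC.items():
        alpha = 1.0 - math.exp(-dt_sec / window)
        previous = emas.get(name, sample)
        updated[name] = previous + alpha * (sample - previous)
    return updated


emas = update_emas({}, 10.0, 60)      # first sample seeds every window
emas = update_emas(emas, 20.0, 60)    # a jump in the metric one minute later
print(emas)
```

Under this model the 5-minute average reacts fastest to the jump and the 30-minute average slowest, and a shorter refresh interval simply produces more, smaller update steps.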
Disable procfs gathering (procfs_enabled: false) if:
- You do not need per-process resource metrics.
- The cluster has many segment hosts and the extra gRPC calls add noticeable load.
- You are running on a non-Linux platform where /proc is unavailable.
Adjust the interval to balance freshness vs. load. A shorter interval gives more responsive EMA values but increases gRPC traffic to segment hosts.

    segment_pull_threads: 15   # default

Number of goroutines that concurrently pull query metrics from segment-host yagpcc instances. Increasing this value speeds up data collection on large clusters (many segment hosts) but uses more network connections and CPU on the master. Decreasing it reduces master load at the cost of staler data.
Rule of thumb: Set to roughly the number of distinct segment hosts, capped at a reasonable concurrency limit (e.g. 30–50).

    segment_pull_rate_sec: 2   # default (seconds between pull cycles)

Controls how often the master enqueues segment hosts for data pulling. A lower value means more frequent pulls and fresher data, but higher network and CPU usage on both master and segments. A higher value reduces load but increases the staleness of segment-level metrics.
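The thread-count rule of thumb and the pull interval can be sanity-checked together. The sketch below is a rough model, not a description of yagpcc internals: both helper names are hypothetical, the default cap of 30 is taken from the low end of the suggested range, and the sweep estimate assumes every pull takes the same time.

```python
import math


def suggest_pull_threads(segment_hosts, cap=30):
    """Roughly one worker per segment host, bounded by a concurrency cap."""
    return max(1, min(segment_hosts, cap))


def sweep_time_sec(segment_hosts, threads, avg_pull_sec):
    """Estimated time to pull every host once, assuming equal pull latency."""
    return math.ceil(segment_hosts / threads) * avg_pull_sec


hosts = 200
threads = suggest_pull_threads(hosts)
print(threads, sweep_time_sec(hosts, threads, avg_pull_sec=0.5))  # 30 3.5
```

If the estimated sweep time exceeds segment_pull_rate_sec, as it does in this example with the default of 2 seconds, pulls are likely to queue up; raising the thread count or the interval brings the two back in line.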

    session_send_metric_interval: 60s   # default

How often session snapshots are written to the archiver channel. Lowering this produces more frequent session records in the archive files (useful for fine-grained historical analysis) but increases I/O and archive file growth. Raising it reduces archive volume.
yagpcc exposes Prometheus metrics on the instrumentation HTTP endpoint. By default this listens on port 1433.
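
    app:
      instrumentation:
        addr: "[::1]:1433"

The metrics are served at:

    GET http://<host>:1433/metrics

Note that [::1] is the IPv6 loopback address, so with the addr shown above the endpoint is reachable only from the local host. Bind to an address reachable by your Prometheus server if it scrapes remotely.
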
| Metric | Type | Description |
|---|---|---|
| new_sessions | Counter | Total sessions created since startup. |
| new_queries | Counter | Total queries received since startup. |
| deleted_sessions | Counter | Total sessions removed. |
| deleted_queries | Counter | Total queries removed after archival. |
| new_aggregated_queries | Counter | Short queries aggregated. |
| dropped_queries | Counter | Queries dropped (e.g. storage full). |
| failed_queries | Counter | Queries that ended with an error status. |
| total_sessions | Gauge | Current number of tracked sessions. |
| total_queries | Gauge | Current number of tracked running queries. |
| aggregated_queries | Gauge | Current number of aggregated query buckets. |
| queries_in_flight | Gauge | Queries currently being processed. |
| time | Histogram | Internal handler latencies, labeled by method (e.g. processSegment, RefreshSessions, RefreshQueries, RefreshProcfs, SendSessionMetrics). |
| query | Histogram | Query duration distribution. |
| slice | Histogram | Slice-level duration distribution. |
| query_statuses | Counter (vec) | Query completion statuses, labeled by status. |
| executing_query | Gauge (vec) | Currently executing queries bucketed by elapsed time (le labels). Recalculated on scrape. |


    # prometheus.yml
    scrape_configs:
      - job_name: yagpcc
        scrape_interval: 15s
        static_configs:
          - targets:
              - "master-host:1433"
              - "segment-host-1:1433"
              - "segment-host-2:1433"

Key metrics to watch:

- total_sessions and total_queries: if these grow unboundedly, check maximum_stored_queries and clear_deleted_sessions.
- time{method="processSegment"}: a high p99 indicates slow segment pulls; consider increasing segment_pull_threads or investigating network latency.
- time{method="RefreshProcfs"}: if this approaches procfs_refresh_interval, procfs gathering is too slow; increase the interval or reduce segment_pull_threads for procfs.
- dropped_queries: a non-zero value means the running-queries storage is full; raise maximum_stored_queries.
- executing_query: shows the distribution of currently running query durations; useful for spotting long-running query buildup.

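For a quick check without a Prometheus server, the metrics endpoint can be read directly. The sketch below parses the Prometheus text exposition format in a deliberately simplified way (it assumes no trailing timestamps) and applies two of the checks described in this section. The function names and the 90% threshold are assumptions, and the exported metric names may carry a prefix in your build.

```python
def parse_metrics(text):
    """Parse 'series value' lines from Prometheus text format into a dict.

    Comment lines (# HELP / # TYPE) and blank lines are ignored. The
    series key keeps any {label="..."} part verbatim.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def storage_warnings(samples, maximum_stored_queries=50000):
    """Return human-readable warnings based on the parsed samples."""
    found = []
    if samples.get("dropped_queries", 0) > 0:
        found.append("dropped_queries is non-zero: running-queries storage may be full")
    if samples.get("total_queries", 0) >= 0.9 * maximum_stored_queries:
        found.append("total_queries is within 10% of maximum_stored_queries")
    return found


sample = """# TYPE dropped_queries counter
dropped_queries 3
total_queries 100
"""
print(storage_warnings(parse_metrics(sample)))
```

To run it against a live instance, the text can be fetched with `urllib.request.urlopen("http://localhost:1433/metrics").read().decode()` from the Python standard library, assuming the instrumentation address shown in this guide.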
yagpcc includes a built-in Go pprof debug server that can be activated on demand via a Unix signal. This avoids the security risk of leaving pprof endpoints permanently open.

    debug_port: 6060     # port for the pprof HTTP server (0 = disabled)
    debug_minutes: 10    # how long the server stays up after activation

Send SIGUSR2 to the yagpcc process:

    kill -USR2 $(pidof yagpcc)

The debug HTTP server starts on [::1]:<debug_port> and automatically shuts down after debug_minutes. Sending SIGUSR2 again while the server is running extends the timer.
Once activated, the following endpoints are available:

    # Interactive web UI (requires graphviz for SVG)
    go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/heap"
    # 30-second CPU profile
    go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
    # Goroutine dump
    curl "http://localhost:6060/debug/pprof/goroutine?debug=2"
    # Heap profile (text)
    go tool pprof -text "http://localhost:6060/debug/pprof/heap"
    # Allocs profile (total allocations, useful for GC pressure)
    go tool pprof -text "http://localhost:6060/debug/pprof/allocs"
    # Execution trace (5 seconds)
    curl -o trace.out "http://localhost:6060/debug/pprof/trace?seconds=5"
    go tool trace trace.out
    # Block profile (contention)
    curl "http://localhost:6060/debug/pprof/block?debug=1"
    # Mutex profile
    curl "http://localhost:6060/debug/pprof/mutex?debug=1"

The URLs are quoted so that the shell does not treat the ? character as a glob pattern.

High CPU usage:

    go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"

Look at the flame graph for hot functions. Common culprits: JSON serialization in archivers, lock contention in storage, or excessive segment pulls.

High memory usage:

    go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/heap"

Check inuse_space for current allocations and alloc_space for cumulative allocations. If total_queries is high, consider lowering maximum_stored_queries.

Goroutine leaks:

    curl "http://localhost:6060/debug/pprof/goroutine?debug=2" | head -100

Look for goroutines stuck in channel operations or network I/O. A growing goroutine count may indicate segment hosts that are unreachable (connections timing out).

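Reading a long goroutine dump by eye is tedious, so it can help to count goroutines by state first. The sketch below relies on the header format that Go prints for each goroutine in a debug=2 dump (for example `goroutine 7 [chan receive, 12 minutes]:`); the function name is hypothetical.

```python
import re
from collections import Counter

# Captures the state from headers such as "goroutine 7 [chan receive, 12 minutes]:".
HEADER = re.compile(r"^goroutine \d+ \[([^\],]+)")


def count_goroutine_states(dump):
    """Count goroutines per wait state in a debug=2 goroutine dump."""
    states = Counter()
    for line in dump.splitlines():
        match = HEADER.match(line)
        if match:
            states[match.group(1)] += 1
    return states


sample = """goroutine 1 [running]:
main.main()

goroutine 7 [chan receive, 12 minutes]:
main.worker()

goroutine 8 [chan receive, 3 minutes]:
main.worker()

goroutine 9 [IO wait]:
net.(*conn).Read()
"""
print(count_goroutine_states(sample).most_common())
```

Comparing the counts from two dumps taken a few minutes apart shows which state is accumulating goroutines, which narrows down where to look in the full stack traces.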
| Parameter | YAML key | Default | Effect |
|---|---|---|---|
| Go memory limit | GOMEMLIMIT env var | unlimited | Soft cap on memory managed by the Go runtime; triggers more aggressive GC |
| Log level | app.logging.level | debug | Controls log verbosity; use info or warn in production to reduce I/O |
| Log destination | app.logging.file | stdout | Where structured logs are written (stdout or a file path) |
| Archive max file size | arch_config.max_file_size | 400 MiB | Archiver JSON files rotate when exceeding this size |
| Stored queries cap | maximum_stored_queries | 50000 | Max running-query records in memory |
| Short-query aggregation window | short_agg_interval | 10m | Time bucket size for short-query aggregation |
| Procfs gathering | procfs_enabled | true | Enable/disable per-process resource stats |
| Procfs refresh rate | procfs_refresh_interval | 60s | How often procfs data is collected |
| Segment pull workers | segment_pull_threads | 15 | Concurrent goroutines pulling from segments |
| Segment pull cycle | segment_pull_rate_sec | 2 | Seconds between segment pull cycles |
| Session snapshot rate | session_send_metric_interval | 60s | How often session snapshots are archived |
| Pprof port | debug_port | 0 (disabled) | Port for on-demand pprof server |
| Pprof duration | debug_minutes | 0 | Minutes pprof stays active after SIGUSR2 |

- Service architecture — Roles, interfaces, ports, data flow.
- Per-process resource statistics — Procfs data flow and EMA calculations.
- API description — gRPC and CSV HTTP API reference.