Guide to tuning Flagent performance based on load testing results.
Note: Metrics, anomaly, and smart rollout indices/tables apply when using the corresponding features (Core metrics or Enterprise). Evaluation and flag CRUD are always relevant for OSS.
The following indices are created automatically to optimize queries:
-- Primary metrics queries (by flag and time)
CREATE INDEX idx_metric_flag_timestamp ON metric_data_points(flag_id, timestamp DESC);
CREATE INDEX idx_metric_flag_key_timestamp ON metric_data_points(flag_key, timestamp DESC);
CREATE INDEX idx_metric_type_timestamp ON metric_data_points(metric_type, timestamp DESC);
-- Variant-specific queries
CREATE INDEX idx_metric_variant ON metric_data_points(flag_id, variant_id, timestamp DESC)
WHERE variant_id IS NOT NULL;-- Anomaly lookups
CREATE INDEX idx_anomaly_flag_resolved ON anomaly_alerts(flag_id, resolved, detected_at DESC);
CREATE INDEX idx_anomaly_severity ON anomaly_alerts(severity, resolved, detected_at DESC);
-- Unresolved anomalies (hot queries)
CREATE INDEX idx_anomaly_unresolved ON anomaly_alerts(resolved, detected_at DESC)
WHERE resolved = false;-- Rollout config lookups
CREATE INDEX idx_rollout_flag_status ON smart_rollout_configs(flag_id, status, enabled);
-- Active rollouts (background job queries)
CREATE INDEX idx_rollout_active ON smart_rollout_configs(enabled, status, last_increment_at)
WHERE enabled = true AND status = 'ACTIVE';
-- History queries
CREATE INDEX idx_rollout_history ON smart_rollout_history(rollout_config_id, changed_at DESC);Checking index usage (PostgreSQL):
SELECT
schemaname,
tablename,
indexname,
idx_scan as scans,
idx_tup_read as tuples_read,
idx_tup_fetch as tuples_fetched
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
AND tablename LIKE '%metric%' OR tablename LIKE '%anomaly%' OR tablename LIKE '%rollout%'
ORDER BY idx_scan DESC;In code:
val stats = PerformanceOptimization.getIndexStats()
stats.forEach { stat ->
println("${stat.table}.${stat.index}: ${stat.scans} scans")
}Analyze tables (update statistics for query optimizer):
PerformanceOptimization.analyze()Vacuum (PostgreSQL, reclaim space):
VACUUM ANALYZE metric_data_points;
VACUUM ANALYZE anomaly_alerts;
VACUUM ANALYZE smart_rollout_configs;Auto-vacuum settings (postgresql.conf):
autovacuum = on
autovacuum_max_workers = 3
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 50
autovacuum_analyze_threshold = 50
- Throughput: 1000+ req/s per instance (goal: 2000+ req/s)
- Latency: p99 < 10ms, mean < 1ms for evaluation-only path
- Error rate: < 1%
EvalCache refresh:
# Shorter interval = fresher data, more DB load. Default: 3s
export FLAGENT_EVALCACHE_REFRESHINTERVAL=3s
# For high-throughput evaluation, consider 5-10s if data changes infrequently
export FLAGENT_EVALCACHE_REFRESHINTERVAL=5sWorker pool (Netty):
# Default = CPU cores. For I/O-bound evaluation, increase if CPU is underutilized
export FLAGENT_WORKER_POOL_SIZE=8Database pool (for cache fetcher):
export DB_POOL_SIZE=50
export DB_MIN_IDLE=10Enable pprof endpoints for heap/thread analysis:
export FLAGENT_PPROF_ENABLED=trueThen access:
GET /debug/pprof/heap— heap dumpGET /debug/pprof/thread— thread dumpGET /debug/pprof/profile— CPU profile (see response for JFR instructions)
JFR (Java Flight Recorder) for CPU profiling:
java -XX:StartFlightRecording=filename=recording.jfr,duration=60s -jar flagent.jarAsync-profiler (Linux):
# Attach to running JVM
./profiler.sh -e cpu -d 30 -f flamegraph.svg <pid>- EvalCache: Ensure cache is warmed before traffic; avoid cold start
- enableDebug: Set
enableDebug=falsein production — debug path is significantly heavier - Logging: Evaluation endpoints are excluded from verbose logging by default. To add more paths:
export FLAGENT_MIDDLEWARE_VERBOSE_LOGGER_EXCLUDE_URLS=/api/v1/evaluation,/api/v1/evaluation/batch,/health - Serialization: Evaluation response uses kotlinx.serialization; ensure no custom serializers add overhead in hot path
Production (200 concurrent users):
export DB_POOL_SIZE=50
export DB_MIN_IDLE=10High Load (500+ concurrent users):
export DB_POOL_SIZE=100
export DB_MIN_IDLE=25Development:
export DB_POOL_SIZE=10
export DB_MIN_IDLE=2Recommended HikariCP settings (configured in DatabaseConfig.kt):
maximumPoolSize = 50 // Max connections
minimumIdle = 10 // Always ready connections
connectionTimeout = 30000 // 30s timeout
idleTimeout = 600000 // 10 min idle timeout
maxLifetime = 1800000 // 30 min connection lifetime
validationTimeout = 5000 // 5s validation timeoutCPU-based:
connections = (core_count × 2) + effective_spindle_count
Load-based (web apps):
connections = expected_concurrent_requests / 2
Example:
- 8 CPU cores →
(8 × 2) + 1 = 17 connections - 200 concurrent requests →
200 / 2 = 100 connections - Recommended: take the higher value, cap at 100
In code:
val recommended = DatabaseConfig.getRecommendedPoolSize(
coreCount = 8,
expectedConcurrentRequests = 200
)
// Returns: 100Enable slow query logging:
-- postgresql.conf
log_min_duration_statement = 1000 # Log queries > 1sFind slow queries:
SELECT
query,
calls,
total_time / 1000 as total_seconds,
mean_time / 1000 as avg_seconds,
max_time / 1000 as max_seconds
FROM pg_stat_statements
WHERE query LIKE '%metric%' OR query LIKE '%anomaly%'
ORDER BY mean_time DESC
LIMIT 20;Configured timeouts in DatabaseConfig.QueryTimeouts:
- SHORT_QUERY_MS: 1s - simple inserts/selects
- MEDIUM_QUERY_MS: 5s - aggregations
- LONG_QUERY_MS: 30s - complex analytics
Usage:
transaction {
queryTimeout = DatabaseConfig.QueryTimeouts.MEDIUM_QUERY_MS.toInt()
// ... query
}AiRolloutScheduler automatically cleans up old data:
// In AiRolloutScheduler
private suspend fun cleanupOldData() {
val retentionDays = 90 // 90 days retention
val cutoffTime = System.currentTimeMillis() - (retentionDays * 24 * 60 * 60 * 1000L)
metricsCollectionService.cleanupOldMetrics(cutoffTime)
}Retention configuration:
export METRIC_RETENTION_DAYS=90 # Default: 90 days
export ANOMALY_RETENTION_DAYS=180 # Default: 180 days-- Delete metrics older than 90 days
DELETE FROM metric_data_points
WHERE timestamp < (extract(epoch from now() - interval '90 days') * 1000);
-- Delete resolved anomalies older than 180 days
DELETE FROM anomaly_alerts
WHERE resolved = true
AND detected_at < (extract(epoch from now() - interval '180 days') * 1000);For large data volumes (millions of metrics), partitioning by timestamp is recommended:
-- Create partitioned table (PostgreSQL 10+)
CREATE TABLE metric_data_points_partitioned (
-- same columns as metric_data_points
) PARTITION BY RANGE (timestamp);
-- Create monthly partitions
CREATE TABLE metric_data_points_2024_01
PARTITION OF metric_data_points_partitioned
FOR VALUES FROM (1704067200000) TO (1706745600000);
-- Auto-create partitions with pg_partman extensionEvalCache is already implemented for evaluation:
- TTL: 60 seconds
- Refresh interval: 3 seconds
- Cache size: unlimited (production: add LRU eviction)
Metrics Aggregation Cache (recommended):
// Cache aggregated metrics for 5 minutes
val cache = ConcurrentHashMap<String, CachedAggregation>()
data class CachedAggregation(
val aggregation: MetricAggregation,
val timestamp: Long,
val ttlMs: Long = 300_000 // 5 minutes
) {
fun isExpired(): Boolean = System.currentTimeMillis() - timestamp > ttlMs
}For distributed caching (multi-instance deployment):
// Redis for shared cache across instances
val redis = RedisClient.create("redis://localhost:6379")
// Cache metrics aggregations
fun getCachedAggregation(flagId: Int, metricType: String): MetricAggregation? {
val key = "metrics:agg:$flagId:$metricType"
val cached = redis.get(key)
return cached?.let { Json.decodeFromString(it) }
}Evaluation API (POST /api/v1/evaluation) is the hot path. Optimize for low latency and high throughput (target: ~2000 req/s).
Environment variables:
FLAGENT_EVALCACHE_REFRESHINTERVAL— cache refresh interval (default 3s). Shorter = fresher data, more DB load.FLAGENT_WORKER_POOL_SIZE— Netty worker threads (default: CPU cores). Increase for high concurrency (e.g., 8–16 on multi-core).FLAGENT_PPROF_ENABLED— enable pprof for profiling (default: false). Use with async-profiler or JFR to find bottlenecks.FLAGENT_DB_DBCONNECTIONSTR— ensure HikariCP pool size is adequate (see Connection Pool Tuning).
Tips:
- EvalCache serves evaluation from memory; DB is hit only on refresh.
enableDebug=falseavoids extra debug logging overhead (evaluation endpoints are excluded from verbose middleware).- Profile first: run
evaluation-load-test.js, then enable pprof to identify hot paths. - See benchmarks.md for load test instructions.
Metrics API:
- Single metric: ~50ms avg, ~150ms p95
- Batch (50 metrics): ~300ms avg, ~800ms p95
- Get metrics: ~200ms avg, ~500ms p95
- Aggregation: ~400ms avg, ~1000ms p95
Anomaly Detection:
- Detection: ~800ms avg, ~2000ms p95
- Get alerts: ~150ms avg, ~400ms p95
Metrics API:
- Single metric: ~10-20ms avg, ~50ms p95 (5x faster)
- Batch (50 metrics): ~100-150ms avg, ~300ms p95 (2x faster)
- Get metrics: ~50-100ms avg, ~200ms p95 (2-4x faster)
- Aggregation: ~100-200ms avg, ~400ms p95 (2-4x faster)
Anomaly Detection:
- Detection: ~200-400ms avg, ~800ms p95 (2-4x faster)
- Get alerts: ~30-60ms avg, ~150ms p95 (2-5x faster)
- p95 < 200ms for all API endpoints
- p99 < 500ms
- Error rate < 1%
- Throughput: 500+ RPS per instance
Use the provided dashboards:
grafana/dashboards/flagent-metrics.json- Metrics visualizationgrafana/dashboards/flagent-anomalies.json- Anomaly alerts
Run:
docker-compose -f grafana/docker-compose.grafana.yml up -d
# Open http://localhost:3000 (admin/admin)Database:
- Connection pool usage (
hikaricp_connections_active) - Query duration (
query_duration_ms) - Slow queries count
- Table sizes
API:
- Request duration (p50, p95, p99)
- Error rate
- Throughput (RPS)
AI Rollouts:
- Active rollouts count
- Anomalies detected (by severity)
- Metrics collection rate
- Scheduler job duration
Prometheus alerts (recommended):
groups:
- name: flagent
rules:
# High error rate
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 5m
annotations:
summary: "High error rate detected"
# Slow queries
- alert: SlowQueries
expr: query_duration_p95 > 1000
for: 5m
annotations:
summary: "Slow database queries detected"
# Connection pool exhaustion
- alert: ConnectionPoolExhausted
expr: hikaricp_connections_active / hikaricp_connections_max > 0.9
for: 2m
annotations:
summary: "Connection pool almost exhausted"- Database indices created (
PerformanceOptimization.apply()) - Connection pool configured (default 10 in Database.kt; modify
maximumPoolSizefor higher load) - Query timeouts configured
- Data retention policies set
- Monitoring dashboards deployed
- Alerting rules configured
- Load tests passed (p95 < 500ms, error rate < 1%)
- Weekly: Check index usage stats
- Weekly: Review slow query logs
- Monthly: Analyze tables (
VACUUM ANALYZE) - Monthly: Review data retention and cleanup
- Quarterly: Re-run load tests
- Quarterly: Review and adjust connection pool size
When scaling up:
- Increase connection pool size (modify
maximumPoolSizeinDatabase.kt) - Add read replicas for read-heavy workloads
- Enable Redis caching for multi-instance deployments
- Consider database partitioning for large tables
- Add CDN for static assets
- Implement rate limiting per tenant
Causes:
- Missing indices → add missing indices
- Complex aggregations → add caching
- Too many connections → reduce pool size
Solution:
-- Find expensive queries
SELECT query, total_time FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;Causes:
- Connection pool too large
- Cache size unlimited
- Memory leaks
Solution:
# Reduce pool size: modify maximumPoolSize in backend/src/main/kotlin/flagent/repository/Database.kt
# Monitor JVM heap (if using JVM)
jstat -gc <pid> 1000Solution:
- Check if indices exist:
\d+ table_name - Explain query:
EXPLAIN ANALYZE SELECT ... - Add missing indices
- Optimize query (add WHERE clauses, reduce JOIN complexity)
Solution:
- Increase pool size: modify
maximumPoolSizeinDatabase.kt(default: 10) - Check for connection leaks (monitor
hikaricp_connections_active) - Reduce connection timeout
- Implement connection retry logic
- Always use indices for timestamp range queries
- Monitor query performance regularly
- Set appropriate timeouts for all queries
- Implement data retention policies
- Use batch operations where possible
- Cache aggregations (5-15 minute TTL)
- Analyze tables after bulk operations
- Test under load before production
- Monitor connection pool usage
- Plan for scale from day one