Performance Tuning Guide

Guide to tuning Flagent performance based on load testing results.

Note: Metrics, anomaly, and smart rollout indices/tables apply when using the corresponding features (Core metrics or Enterprise). Evaluation and flag CRUD are always relevant for OSS.

Database Optimization

Indices

The following indices are created automatically to optimize queries:

Metrics Queries

-- Primary metrics queries (by flag and time)
CREATE INDEX idx_metric_flag_timestamp ON metric_data_points(flag_id, timestamp DESC);
CREATE INDEX idx_metric_flag_key_timestamp ON metric_data_points(flag_key, timestamp DESC);
CREATE INDEX idx_metric_type_timestamp ON metric_data_points(metric_type, timestamp DESC);

-- Variant-specific queries
CREATE INDEX idx_metric_variant ON metric_data_points(flag_id, variant_id, timestamp DESC) 
WHERE variant_id IS NOT NULL;

Anomaly Queries

-- Anomaly lookups
CREATE INDEX idx_anomaly_flag_resolved ON anomaly_alerts(flag_id, resolved, detected_at DESC);
CREATE INDEX idx_anomaly_severity ON anomaly_alerts(severity, resolved, detected_at DESC);

-- Unresolved anomalies (hot queries)
CREATE INDEX idx_anomaly_unresolved ON anomaly_alerts(resolved, detected_at DESC) 
WHERE resolved = false;

Smart Rollout Queries

-- Rollout config lookups
CREATE INDEX idx_rollout_flag_status ON smart_rollout_configs(flag_id, status, enabled);

-- Active rollouts (background job queries)
CREATE INDEX idx_rollout_active ON smart_rollout_configs(enabled, status, last_increment_at) 
WHERE enabled = true AND status = 'ACTIVE';

-- History queries
CREATE INDEX idx_rollout_history ON smart_rollout_history(rollout_config_id, changed_at DESC);

Index Monitoring

Checking index usage (PostgreSQL):

SELECT
    schemaname,
    relname AS tablename,
    indexrelname AS indexname,
    idx_scan AS scans,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
  AND (relname LIKE '%metric%' OR relname LIKE '%anomaly%' OR relname LIKE '%rollout%')
ORDER BY idx_scan DESC;

In code:

val stats = PerformanceOptimization.getIndexStats()
stats.forEach { stat ->
    println("${stat.table}.${stat.index}: ${stat.scans} scans")
}
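
A common follow-up to the scan counts above is to list indices that are never used. The sketch below is illustrative only: `IndexStat` is a hypothetical stand-in for whatever `getIndexStats()` returns (field names follow the `println` above, types are assumed), and `unusedIndices` is not part of Flagent.

```kotlin
// Hypothetical shape of one row from PerformanceOptimization.getIndexStats();
// field names follow the println above, the types are assumptions.
data class IndexStat(val table: String, val index: String, val scans: Long)

// Indices with zero scans since statistics were last reset are candidates
// for review: they still cost write time and disk space.
fun unusedIndices(stats: List<IndexStat>): List<IndexStat> =
    stats.filter { it.scans == 0L }
        .sortedBy { "${it.table}.${it.index}" }
```

A zero scan count only means the index was unused since the last statistics reset, so check again after a representative period of traffic before dropping anything.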

Table Maintenance

Analyze tables (update statistics for query optimizer):

PerformanceOptimization.analyze()

Vacuum (PostgreSQL, reclaim space):

VACUUM ANALYZE metric_data_points;
VACUUM ANALYZE anomaly_alerts;
VACUUM ANALYZE smart_rollout_configs;

Auto-vacuum settings (postgresql.conf):

autovacuum = on
autovacuum_max_workers = 3
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 50
autovacuum_analyze_threshold = 50
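
To see what these thresholds mean for a large table: PostgreSQL starts an autovacuum once dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * (number of tuples)`. The scale factor is not set in the snippet above, so the sketch assumes PostgreSQL's default of 0.2. The function name is illustrative.

```kotlin
import kotlin.math.roundToLong

// Sketch: dead-tuple count above which autovacuum triggers for a table.
// threshold mirrors autovacuum_vacuum_threshold = 50 from the config above;
// scaleFactor = 0.2 is the PostgreSQL default and an assumption here.
fun autovacuumTriggerPoint(
    liveTuples: Long,
    threshold: Long = 50,
    scaleFactor: Double = 0.2
): Long = threshold + (scaleFactor * liveTuples).roundToLong()
```

With 10 million rows in `metric_data_points`, this gives roughly 2 million dead tuples before autovacuum runs, so a lower per-table scale factor may be worth considering for high-churn tables.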

Evaluation Throughput

Target Metrics

Throughput: 1000+ req/s per instance (goal: 2000+ req/s)
Latency: p99 < 10ms, mean < 1ms for evaluation-only path
Error rate: < 1%

Key Configuration

EvalCache refresh:

# Shorter interval = fresher data, more DB load. Default: 3s
export FLAGENT_EVALCACHE_REFRESHINTERVAL=3s

# For high-throughput evaluation, consider 5-10s if data changes infrequently
export FLAGENT_EVALCACHE_REFRESHINTERVAL=5s

Worker pool (Netty):

# Default = CPU cores. For I/O-bound evaluation, increase if CPU is underutilized
export FLAGENT_WORKER_POOL_SIZE=8

Database pool (for cache fetcher):

export DB_POOL_SIZE=50
export DB_MIN_IDLE=10

Profiling

Enable pprof endpoints for heap/thread analysis:

export FLAGENT_PPROF_ENABLED=true

Then access:

GET /debug/pprof/heap — heap dump
GET /debug/pprof/thread — thread dump
GET /debug/pprof/profile — CPU profile (see response for JFR instructions)

JFR (Java Flight Recorder) for CPU profiling:

java -XX:StartFlightRecording=filename=recording.jfr,duration=60s -jar flagent.jar

Async-profiler (Linux):

# Attach to running JVM
./profiler.sh -e cpu -d 30 -f flamegraph.svg <pid>

Optimization Checklist

EvalCache: Ensure cache is warmed before traffic; avoid cold start
enableDebug: Set enableDebug=false in production — debug path is significantly heavier

Logging: Evaluation endpoints are excluded from verbose logging by default. To add more paths:

export FLAGENT_MIDDLEWARE_VERBOSE_LOGGER_EXCLUDE_URLS=/api/v1/evaluation,/api/v1/evaluation/batch,/health

Serialization: Evaluation response uses kotlinx.serialization; ensure no custom serializers add overhead in hot path

Connection Pool Tuning

Recommended Settings

Production (200 concurrent users):

export DB_POOL_SIZE=50
export DB_MIN_IDLE=10

High Load (500+ concurrent users):

export DB_POOL_SIZE=100
export DB_MIN_IDLE=25

Development:

export DB_POOL_SIZE=10
export DB_MIN_IDLE=2

HikariCP Configuration

Recommended HikariCP settings (configured in DatabaseConfig.kt):

maximumPoolSize = 50           // Max connections
minimumIdle = 10               // Always ready connections
connectionTimeout = 30000      // 30s timeout
idleTimeout = 600000           // 10 min idle timeout
maxLifetime = 1800000          // 30 min connection lifetime
validationTimeout = 5000       // 5s validation timeout

Connection Pool Sizing Formula

CPU-based:

connections = (core_count × 2) + effective_spindle_count

Load-based (web apps):

connections = expected_concurrent_requests / 2

Example:

8 CPU cores → (8 × 2) + 1 = 17 connections
200 concurrent requests → 200 / 2 = 100 connections
Recommended: take the higher value, cap at 100

In code:

val recommended = DatabaseConfig.getRecommendedPoolSize(
    coreCount = 8,
    expectedConcurrentRequests = 200
)
// Returns: 100
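
The two formulas and the cap can be combined as in the sketch below. It is a hypothetical stand-in for `DatabaseConfig.getRecommendedPoolSize`, whose real implementation may differ; the `effectiveSpindleCount` and `cap` parameters are assumptions added for illustration.

```kotlin
import kotlin.math.max
import kotlin.math.min

// Hypothetical re-implementation of the sizing rule described above:
// take the higher of the CPU-based and load-based estimates, then cap it.
fun recommendedPoolSize(
    coreCount: Int,
    expectedConcurrentRequests: Int,
    effectiveSpindleCount: Int = 1, // assumption: a single SSD or volume
    cap: Int = 100
): Int {
    val cpuBased = coreCount * 2 + effectiveSpindleCount
    val loadBased = expectedConcurrentRequests / 2
    return min(max(cpuBased, loadBased), cap)
}
```

For 8 cores and 20 concurrent requests the CPU-based estimate (17) wins; for 8 cores and 200 concurrent requests the load-based estimate (100) wins and sits exactly at the cap.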

Query Performance

Slow Query Detection (PostgreSQL)

Enable slow query logging:

# postgresql.conf
log_min_duration_statement = 1000  # Value in ms: log queries that run longer than 1s

Find slow queries:

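-- Requires the pg_stat_statements extension; times are reported in milliseconds.
-- On PostgreSQL 13+, use total_exec_time, mean_exec_time, and max_exec_time instead.
SELECT
    query,
    calls,
    total_time / 1000 AS total_seconds,
    mean_time / 1000 AS avg_seconds,
    max_time / 1000 AS max_seconds
FROM pg_stat_statements
WHERE query LIKE '%metric%' OR query LIKE '%anomaly%'
ORDER BY mean_time DESC
LIMIT 20;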

Query Timeouts

Configured timeouts in DatabaseConfig.QueryTimeouts:

SHORT_QUERY_MS: 1s - simple inserts/selects
MEDIUM_QUERY_MS: 5s - aggregations
LONG_QUERY_MS: 30s - complex analytics

Usage:

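transaction {
    // Assuming this is an Exposed transaction: queryTimeout is in seconds, so convert from ms
    queryTimeout = (DatabaseConfig.QueryTimeouts.MEDIUM_QUERY_MS / 1000).toInt()
    // ... query
}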

Data Retention & Cleanup

Automated Cleanup

AiRolloutScheduler automatically cleans up old data:

// In AiRolloutScheduler
private suspend fun cleanupOldData() {
    val retentionDays = 90 // 90 days retention
    val cutoffTime = System.currentTimeMillis() - (retentionDays * 24 * 60 * 60 * 1000L)
    
    metricsCollectionService.cleanupOldMetrics(cutoffTime)
}

Retention configuration:

export METRIC_RETENTION_DAYS=90  # Default: 90 days
export ANOMALY_RETENTION_DAYS=180 # Default: 180 days
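
The scheduler snippet above hardcodes 90 days, while the variables suggest the value is configurable. One way the two could be connected is sketched below; the helper names are hypothetical, and how Flagent actually reads these variables is not shown in this guide.

```kotlin
import java.time.Duration

// Sketch: read a retention setting (in days) from an environment-like map,
// falling back to the default when the value is missing or not a positive number.
fun retentionDays(env: Map<String, String>, name: String, default: Long): Long =
    env[name]?.toLongOrNull()?.takeIf { it > 0 } ?: default

// Cutoff timestamp in epoch milliseconds: rows older than this are eligible for cleanup.
fun cutoffMillis(nowMillis: Long, retentionDays: Long): Long =
    nowMillis - Duration.ofDays(retentionDays).toMillis()
```

In a scheduler this might be called as `cutoffMillis(System.currentTimeMillis(), retentionDays(System.getenv(), "METRIC_RETENTION_DAYS", 90))`.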

Manual Cleanup

-- Delete metrics older than 90 days
DELETE FROM metric_data_points 
WHERE timestamp < (extract(epoch from now() - interval '90 days') * 1000);

-- Delete resolved anomalies older than 180 days
DELETE FROM anomaly_alerts 
WHERE resolved = true 
AND detected_at < (extract(epoch from now() - interval '180 days') * 1000);

Partitioning (Advanced)

For large data volumes (millions of metrics), partitioning by timestamp is recommended:

-- Create partitioned table (PostgreSQL 10+)
CREATE TABLE metric_data_points_partitioned (
    -- same columns as metric_data_points
) PARTITION BY RANGE (timestamp);

-- Create monthly partitions
CREATE TABLE metric_data_points_2024_01 
PARTITION OF metric_data_points_partitioned
FOR VALUES FROM (1704067200000) TO (1706745600000);

-- Auto-create partitions with pg_partman extension
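
If pg_partman is not available, the monthly DDL can also be generated from application code. The sketch below assumes, as in the example above, that `timestamp` holds epoch milliseconds and that month boundaries are in UTC; the function name is illustrative.

```kotlin
import java.time.YearMonth
import java.time.ZoneOffset

// Sketch: build the DDL for one monthly partition, with bounds in epoch
// milliseconds (UTC), matching the hand-written 2024_01 example above.
fun monthlyPartitionDdl(
    month: YearMonth,
    parent: String = "metric_data_points_partitioned"
): String {
    val from = month.atDay(1).atStartOfDay().toInstant(ZoneOffset.UTC).toEpochMilli()
    val to = month.plusMonths(1).atDay(1).atStartOfDay().toInstant(ZoneOffset.UTC).toEpochMilli()
    val name = "metric_data_points_%04d_%02d".format(month.year, month.monthValue)
    return "CREATE TABLE $name PARTITION OF $parent FOR VALUES FROM ($from) TO ($to);"
}
```

Range partition bounds are inclusive at the lower end and exclusive at the upper end, so consecutive months generated this way do not overlap.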

Caching Strategy

In-Memory Cache

EvalCache is already implemented for evaluation:

TTL: 60 seconds
Refresh interval: 3 seconds
Cache size: unbounded (for production, consider adding LRU eviction)

Metrics Aggregation Cache (recommended):

// Cache aggregated metrics for 5 minutes
val cache = ConcurrentHashMap<String, CachedAggregation>()

data class CachedAggregation(
    val aggregation: MetricAggregation,
    val timestamp: Long,
    val ttlMs: Long = 300_000 // 5 minutes
) {
    fun isExpired(): Boolean = System.currentTimeMillis() - timestamp > ttlMs
}
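
The snippet above defines the cached entry but not the lookup path. A minimal read-through sketch is shown below. It is generic so that it runs without Flagent types, the class name `TtlCache` is hypothetical, and the clock is injectable purely to make the expiry logic testable.

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Sketch of a read-through TTL cache in the spirit of CachedAggregation above.
class TtlCache<K : Any, V : Any>(
    private val ttlMs: Long = 300_000, // 5 minutes
    private val now: () -> Long = { System.currentTimeMillis() }
) {
    private data class Entry<T>(val value: T, val storedAt: Long)

    private val entries = ConcurrentHashMap<K, Entry<V>>()

    // Returns the cached value while it is fresh; otherwise recomputes and stores it.
    // ConcurrentHashMap.compute runs the lambda atomically per key, so concurrent
    // callers for the same key do not trigger duplicate recomputation.
    fun getOrCompute(key: K, compute: () -> V): V {
        val entry = entries.compute(key) { _, existing ->
            if (existing != null && now() - existing.storedAt <= ttlMs) existing
            else Entry(compute(), now())
        }
        return checkNotNull(entry).value
    }
}
```

Because `compute` holds a lock on the key's bin while the lambda runs, a slow aggregation query briefly blocks other writers to nearby keys. That is usually acceptable for a small aggregation cache, but it is a trade-off to be aware of.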

Redis Cache (Optional)

For distributed caching (multi-instance deployment):

// Redis for shared cache across instances (API calls assume the Lettuce client)
val redis = RedisClient.create("redis://localhost:6379")
val commands = redis.connect().sync()

// Cache metrics aggregations; returns null on a cache miss
fun getCachedAggregation(flagId: Int, metricType: String): MetricAggregation? {
    val key = "metrics:agg:$flagId:$metricType"
    val cached: String? = commands.get(key)
    return cached?.let { Json.decodeFromString<MetricAggregation>(it) }
}

Evaluation Throughput

Evaluation API (POST /api/v1/evaluation) is the hot path. Optimize for low latency and high throughput (target: ~2000 req/s).

Environment variables:

FLAGENT_EVALCACHE_REFRESHINTERVAL: cache refresh interval (default 3s). Shorter = fresher data, more DB load.
FLAGENT_WORKER_POOL_SIZE: Netty worker threads (default: CPU cores). Increase for high concurrency (e.g., 8–16 on multi-core).
FLAGENT_PPROF_ENABLED: enable pprof for profiling (default: false). Use with async-profiler or JFR to find bottlenecks.
FLAGENT_DB_DBCONNECTIONSTR: database connection string. Make sure the HikariCP pool that uses it is sized adequately (see Connection Pool Tuning).

Tips:

EvalCache serves evaluation from memory; DB is hit only on refresh.
enableDebug=false avoids extra debug logging overhead (evaluation endpoints are excluded from verbose middleware).
Profile first: run evaluation-load-test.js, then enable pprof to identify hot paths.
See benchmarks.md for load test instructions.

Load Test Results & Benchmarks

Baseline Performance (before optimization)

Metrics API:

Single metric: ~50ms avg, ~150ms p95
Batch (50 metrics): ~300ms avg, ~800ms p95
Get metrics: ~200ms avg, ~500ms p95
Aggregation: ~400ms avg, ~1000ms p95

Anomaly Detection:

Detection: ~800ms avg, ~2000ms p95
Get alerts: ~150ms avg, ~400ms p95

After Optimization (indices + pool tuning)

Metrics API:

Single metric: ~10-20ms avg, ~50ms p95 (2.5-5x faster)
Batch (50 metrics): ~100-150ms avg, ~300ms p95 (2-3x faster)
Get metrics: ~50-100ms avg, ~200ms p95 (2-4x faster)
Aggregation: ~100-200ms avg, ~400ms p95 (2-4x faster)

Anomaly Detection:

Detection: ~200-400ms avg, ~800ms p95 (2-4x faster)
Get alerts: ~30-60ms avg, ~150ms p95 (2.5-5x faster)

Target Performance (production)

p95 < 200ms for all API endpoints
p99 < 500ms
Error rate < 1%
Throughput: 500+ RPS per instance
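
When checking raw load test output against these targets, the percentiles have to be computed from the recorded latencies. The sketch below uses the nearest-rank method, which is one of several common percentile definitions; load testing tools may interpolate and report slightly different values. Function names are illustrative.

```kotlin
import kotlin.math.ceil

// Sketch: nearest-rank percentile over recorded latencies in milliseconds.
fun percentile(latenciesMs: List<Double>, p: Double): Double {
    require(latenciesMs.isNotEmpty()) { "no samples" }
    require(p > 0.0 && p <= 100.0) { "p must be in (0, 100]" }
    val sorted = latenciesMs.sorted()
    val rank = ceil(p * sorted.size / 100.0).toInt() // 1-based nearest rank
    return sorted[rank - 1]
}

// Checks the latency targets listed above: p95 < 200 ms and p99 < 500 ms.
fun meetsLatencyTargets(latenciesMs: List<Double>): Boolean =
    percentile(latenciesMs, 95.0) < 200.0 && percentile(latenciesMs, 99.0) < 500.0
```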

Monitoring & Observability

Grafana Dashboards

Use the provided dashboards:

grafana/dashboards/flagent-metrics.json - Metrics visualization
grafana/dashboards/flagent-anomalies.json - Anomaly alerts

Run:

docker-compose -f grafana/docker-compose.grafana.yml up -d
# Open http://localhost:3000 (admin/admin)

Key Metrics to Monitor

Database:

Connection pool usage (hikaricp_connections_active)
Query duration (query_duration_ms)
Slow queries count
Table sizes

API:

Request duration (p50, p95, p99)
Error rate
Throughput (RPS)

AI Rollouts:

Active rollouts count
Anomalies detected (by severity)
Metrics collection rate
Scheduler job duration

Alerting Rules

Prometheus alerts (recommended):

groups:
  - name: flagent
    rules:
      # High error rate (more than 5% of requests returning 5xx)
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
          
      # Slow queries
      - alert: SlowQueries
        expr: query_duration_p95 > 1000
        for: 5m
        annotations:
          summary: "Slow database queries detected"
          
      # Connection pool exhaustion
      - alert: ConnectionPoolExhausted
        expr: hikaricp_connections_active / hikaricp_connections_max > 0.9
        for: 2m
        annotations:
          summary: "Connection pool almost exhausted"

Production Checklist

Pre-Launch

Database indices created (PerformanceOptimization.apply())
Connection pool configured (default 10 in Database.kt; modify maximumPoolSize for higher load)
Query timeouts configured
Data retention policies set
Monitoring dashboards deployed
Alerting rules configured
Load tests passed (p95 < 200ms, p99 < 500ms, error rate < 1%)

Regular Maintenance

Weekly: Check index usage stats
Weekly: Review slow query logs
Monthly: Analyze tables (VACUUM ANALYZE)
Monthly: Review data retention and cleanup
Quarterly: Re-run load tests
Quarterly: Review and adjust connection pool size

Scaling Checklist

When scaling up:

Increase connection pool size (modify maximumPoolSize in Database.kt)
Add read replicas for read-heavy workloads
Enable Redis caching for multi-instance deployments
Consider database partitioning for large tables
Add CDN for static assets
Implement rate limiting per tenant

Troubleshooting

High CPU Usage

Causes:

Missing indices → add missing indices
Complex aggregations → add caching
Too many connections → reduce pool size

Solution:

-- Find expensive queries (on PostgreSQL 13+, the column is total_exec_time)
SELECT query, total_time FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;

High Memory Usage

Causes:

Connection pool too large
Cache size unlimited
Memory leaks

Solution:

# Reduce pool size: modify maximumPoolSize in backend/src/main/kotlin/flagent/repository/Database.kt

# Monitor JVM heap and GC activity (one sample every 1000 ms)
jstat -gc <pid> 1000

Slow Queries

Solution:

Check if indices exist: \d+ table_name
Explain query: EXPLAIN ANALYZE SELECT ...
Add missing indices
Optimize query (add WHERE clauses, reduce JOIN complexity)

Connection Pool Exhaustion

Solution:

Increase pool size: modify maximumPoolSize in Database.kt (default: 10)
Check for connection leaks (monitor hikaricp_connections_active)
Reduce connectionTimeout so callers fail fast instead of queuing for a connection
Implement connection retry logic
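
For the retry item, a minimal sketch with exponential backoff is shown below. The attempt count, delays, and function name are illustrative assumptions, not Flagent defaults. The `sleep` parameter exists so the backoff can be tested without waiting; in coroutine code, a suspending delay would be the more usual choice than blocking the thread.

```kotlin
// Sketch: retry a block with exponential backoff between attempts.
fun <T> retryWithBackoff(
    maxAttempts: Int = 3,
    initialDelayMs: Long = 100,
    factor: Double = 2.0,
    sleep: (Long) -> Unit = { ms -> Thread.sleep(ms) },
    block: () -> T
): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var delayMs = initialDelayMs
    var lastError: Exception? = null
    for (attempt in 1..maxAttempts) {
        try {
            return block()
        } catch (e: Exception) {
            lastError = e
            // Back off only if another attempt will follow.
            if (attempt < maxAttempts) {
                sleep(delayMs)
                delayMs = (delayMs * factor).toLong()
            }
        }
    }
    throw checkNotNull(lastError)
}
```

Retries add load to an already saturated pool, so they work best alongside a short `connectionTimeout` and a small attempt limit. In real use it is also safer to retry only on transient errors rather than on every exception.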

Best Practices

Always use indices for timestamp range queries
Monitor query performance regularly
Set appropriate timeouts for all queries
Implement data retention policies
Use batch operations where possible
Cache aggregations (5-15 minute TTL)
Analyze tables after bulk operations
Test under load before production
Monitor connection pool usage
Plan for scale from day one
