|
| 1 | +# Performance Reference |
| 2 | + |
| 3 | +Backend API response time measurements for pyplots-backend (Cloud Run, europe-west4). |
| 4 | + |
| 5 | +## Infrastructure |
| 6 | + |
| 7 | +| Component | Config | Notes | |
| 8 | +|-----------|--------|-------| |
| 9 | +| Cloud Run (backend) | 1 vCPU, 1Gi RAM, min-instances=1 | gen2, startup-cpu-boost=true | |
| 10 | +| Cloud Run (frontend) | 1 vCPU, 256Mi RAM, min-instances=1 | nginx serving SPA | |
| 11 | +| Cloud SQL | `db-f1-micro`, PostgreSQL 18, PD-SSD 10GB | Shared vCPU, 614MB RAM | |
| 12 | +| Cache | In-memory TTLCache, 600s TTL, max 1000 entries | Per-instance, not shared | |
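The cache row above explains why the measurements in this document split into cached and uncached requests: within the 600s TTL a response comes from memory, and the first request after expiry pays for a DB query. A minimal standard-library sketch of that behavior follows. It is not the backend's actual cache implementation; `SimpleTTLCache`, `get_or_load`, and the `loader` callback are hypothetical names, and the 1000-entry size limit is left out for brevity.

```python
import time
from typing import Any, Callable


class SimpleTTLCache:
    """Tiny per-process TTL cache, for illustration only (no size limit)."""

    def __init__(self, ttl_s: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so expiry can be simulated
        self._entries: dict[str, tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]  # cache hit: served from memory, no DB query
        value = loader()  # cache miss or expired entry: slow path (DB query)
        self._entries[key] = (now, value)
        return value
```

Because the dictionary lives inside one process, every Cloud Run instance warms its own copy, which is what "per-instance, not shared" means in the table.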
| 13 | + |
| 14 | +## Baseline: Before `--no-cpu-throttling` (March 24, 2026) |
| 15 | + |
| 16 | +Cloud Run config: `cpu-throttling=true` (request-based billing), 512Mi RAM. |
| 17 | + |
| 18 | +### Uncached Requests (first hit after cache expiry, requires DB query) |
| 19 | + |
| 20 | +| Endpoint | Samples | Min | Median | Max | Notes | |
| 21 | +|----------|---------|-----|--------|-----|-------| |
| 22 | +| `/specs` | 20 | 1.07s | 2.62s | 5.02s | Loads 259 specs + selectinload(impls) | |
| 23 | +| `/stats` | 20 | 1.10s | 2.71s | 11.00s | Aggregate stats | |
| 24 | +| `/libraries` | 4 | 0.46s | 6.96s | 7.06s | 9 rows, simple SELECT | |
| 25 | +| `/specs/{id}` | 4 | 7.08s | 8.66s | 9.59s | Single spec + all impls | |
| 26 | + |
| 27 | +### Cached Requests (cache hit, no DB) |
| 28 | + |
| 29 | +| Endpoint | Samples | Min | Median | Max | |
| 30 | +|----------|---------|-----|--------|-----| |
| 31 | +| `/specs` | 10 | 13ms | 17ms | 19ms | |
| 32 | +| `/stats` | 5 | 2ms | 3ms | 3ms | |
| 33 | +| `/libraries` | 10 | 3ms | 37ms | 57ms | |
| 34 | + |
| 35 | +### OOM Events (512Mi RAM) |
| 36 | + |
| 37 | +14 OOM crashes in 14 days (March 10-23): |
| 38 | + |
| 39 | +``` |
| 40 | +2026-03-23 04:37 Out-of-memory event detected |
| 41 | +2026-03-20 23:12 Out-of-memory event detected |
| 42 | +2026-03-18 20:26 Out-of-memory event detected |
| 43 | +2026-03-15 22:42 Out-of-memory event detected |
| 44 | +2026-03-14 20:00 Out-of-memory event detected (3x within 1 min) |
| 45 | +2026-03-14 17:23 Out-of-memory event detected |
| 46 | +2026-03-12 21:26 Out-of-memory event detected |
| 47 | +2026-03-12 14:37 Out-of-memory event detected (2x within 1 min) |
| 48 | +2026-03-10 14:23 Out-of-memory event detected |
| 49 | +2026-03-10 13:29 Out-of-memory event detected (2x within 1 min) |
| 50 | +``` |
| 51 | + |
| 52 | +## After `--no-cpu-throttling` + 1Gi RAM (March 25, 2026) |
| 53 | + |
| 54 | +Cloud Run config: `cpu-throttling=false` (instance-based billing), 1Gi RAM. |
| 55 | + |
| 56 | +Deployed revision `pyplots-backend-00085-4rn` at 2026-03-24 22:25 UTC. |
| 57 | + |
| 58 | +### Uncached Requests |
| 59 | + |
| 60 | +| Endpoint | Samples | Min | Median | Max | Notes | |
| 61 | +|----------|---------|-----|--------|-----|-------| |
| 62 | +| `/specs` | 20 | 0.77s | 1.85s | 9.20s | Median down from 2.62s, max up from 5.02s; no consistent improvement |
| 63 | +| `/stats` | 20 | 0.77s | 1.84s | 9.93s | Median down from 2.71s, max still ~10s; no consistent improvement |
| 64 | +| `/libraries` | 8 | 0.31s | 7.59s | 8.57s | No improvement (median up from 6.96s) |
| 65 | +| `/specs/{id}` | 6 | 7.08s | 8.76s | 9.59s | No improvement (median was 8.66s) |
| 66 | + |
| 67 | +### Cached Requests |
| 68 | + |
| 69 | +| Endpoint | Samples | Min | Median | Max | |
| 70 | +|----------|---------|-----|--------|-----| |
| 71 | +| `/specs` | 5 | 12ms | 14ms | 20ms | |
| 72 | +| `/stats` | 2 | 2ms | 2ms | 2ms | |
| 73 | +| `/libraries` | 10 | 18ms | 107ms | 228ms | |
| 74 | + |
| 75 | +### OOM Events (1Gi RAM) |
| 76 | + |
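| 77 | +**0 OOM events since upgrade** (March 24-25). The memory increase appears to have resolved the OOM crashes, although the observation window is short: the baseline period also had individual days without an OOM event. |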
| 78 | + |
| 79 | +## Cloud SQL Metrics (March 25, 2026) |
| 80 | + |
| 81 | +Measured via Cloud Monitoring API while running `db-f1-micro`: |
| 82 | + |
| 83 | +| Metric | Value | Notes | |
| 84 | +|--------|-------|-------| |
| 85 | +| CPU Utilization | 9-12% | Low — but shared 0.2 vCPU means real capacity is tiny | |
| 86 | +| Memory Utilization | 100.0% | **Misleading** — includes OS page cache (normal Linux behavior) | |
| 87 | +| Memory Total Usage | 184-219 MB | Actual PostgreSQL process memory | |
| 88 | +| Memory Quota | 614 MB | Total available (~400 MB used as OS page cache) | |
| 89 | +| Disk Utilization | 4.0% | 0.4 GB of 10 GB used | |
| 90 | + |
| 91 | +**Note:** The `memory/utilization` metric at 100% is NOT indicative of memory pressure. Linux uses all free RAM as filesystem page cache, which is normal and beneficial. Actual PostgreSQL memory usage is ~200 MB / 614 MB. |
| 92 | + |
| 93 | +### Root Cause: Shared 0.2 vCPU under concurrent load |
| 94 | + |
| 95 | +Connection establishment is **not** the issue — `num_backends` metric shows 5-8 persistent connections to the `pyplots` database at all times. The connection pool (`pool_size=5`, `max_overflow=10`, `pool_pre_ping=True`) keeps connections alive. |
| 96 | + |
| 97 | +The bottleneck is the **0.2 shared vCPU** handling concurrent queries. When the 600s cache expires, the SPA fires 4 parallel requests (`/specs`, `/stats`, `/libraries`, `/specs/{id}`), each triggering a DB query at the same time. Splitting 0.2 shared vCPU across 4 queries leaves each with roughly 0.05 vCPU, which is consistent with the 6-9s response times. |
| 98 | + |
| 99 | +``` |
| 100 | +# DB connections (pyplots database) — persistent, never drops to 0 |
| 101 | +gcloud monitoring: num_backends |
| 102 | + 22:07 → 5 22:01 → 8 21:44 → 7 21:38 → 7 |
| 103 | +``` |
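The CPU-sharing argument in this section can be sketched as a simple processor-sharing model. This is a rough illustration rather than a measurement: it assumes the queries are purely CPU-bound and that the shared vCPU is split evenly, and the 0.4 CPU-seconds per query is a hypothetical figure chosen only for the example.

```python
def per_query_share(capacity_vcpu: float, concurrent_queries: int) -> float:
    """vCPU available to each query when capacity is split evenly."""
    return capacity_vcpu / concurrent_queries


def estimated_latency_s(cpu_seconds_per_query: float, capacity_vcpu: float,
                        concurrent_queries: int) -> float:
    """Wall-clock time of one CPU-bound query under even sharing."""
    share = per_query_share(capacity_vcpu, concurrent_queries)
    return cpu_seconds_per_query / share


if __name__ == "__main__":
    work = 0.4  # hypothetical CPU-seconds per query, not a measured value
    for tier, vcpu in [("db-f1-micro", 0.2), ("db-g1-small", 0.5)]:
        for n in (1, 4):
            latency = estimated_latency_s(work, vcpu, n)
            print(f"{tier}: {n} concurrent -> {latency:.1f}s")
```

Under these assumptions, a query that would take about 2s on its own on `db-f1-micro` takes about 8s when four run at once. The model ignores I/O waits and burst credits, so it only indicates the order of magnitude.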
| 104 | + |
| 105 | +### Connection errors during OOM events |
| 106 | + |
| 107 | +``` |
| 108 | +2026-03-20 23:12 FATAL: connection to client lost (7x simultaneous) |
| 109 | +``` |
| 110 | + |
| 111 | +The Cloud Run backend crashed with an OOM at the same timestamp (2026-03-20 23:12), which dropped all active DB connections at once. |
| 112 | + |
| 113 | +## Cloud SQL Tier Comparison |
| 114 | + |
| 115 | +| Spec | `db-f1-micro` (current) | `db-g1-small` | `db-custom-1-3840` | |
| 116 | +|------|-------------------------|---------------|---------------------| |
| 117 | +| CPU | 0.2 shared vCPU (burstable) | 0.5 shared vCPU (burstable) | 1 dedicated vCPU | |
| 118 | +| RAM | 614 MB | 1.7 GB | 3.75 GB | |
| 119 | +| CPU behavior | Sustained workloads throttled to 0.2 vCPU | Sustained throttled to 0.5 vCPU | Full core, no throttling | |
| 120 | +| Price/month | ~$9 | ~$27 | ~$51 | |
| 121 | +| PG buffer cache | ~0 MB (RAM full from OS+PG overhead) | ~800 MB | ~2.5 GB | |
| 122 | +| Google recommendation | Dev/test only | Lightweight workloads | Min. for production | |
| 123 | + |
| 124 | +Upgrade command: `gcloud sql instances patch pyplots-db --tier=db-g1-small` |
| 125 | + |
| 126 | +## Conclusion |
| 127 | + |
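| 128 | +`--no-cpu-throttling` had **no consistent impact** on uncached request latency: the `/specs` and `/stats` medians dropped, but maxima stayed at 9-10s and `/libraries` and `/specs/{id}` did not improve. The bottleneck is the Cloud SQL `db-f1-micro` instance (0.2 shared vCPU, 614 MB RAM). Memory is not the issue (actual usage ~200 MB), and neither is connection establishment: `num_backends` shows 5-8 persistent pooled connections. The likely cause is shared-CPU throttling when several queries run concurrently after a cache expiry. |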
| 129 | + |
| 130 | +Cached responses are consistently fast (2-230ms). The problem only occurs when the 600s cache expires and the backend queries Cloud SQL. |
| 131 | + |
| 132 | +**Next step:** Upgrade Cloud SQL from `db-f1-micro` to `db-g1-small` (0.5 vCPU, 1.7 GB, ~$27/mo) and re-measure. |
| 135 | + |
| 136 | +## After Cloud SQL upgrade to `db-g1-small` (pending) |
| 137 | + |
| 138 | +TODO: Re-measure after upgrade and fill in results. |
| 139 | + |
| 140 | +## How to Reproduce These Measurements |
| 141 | + |
| 142 | +### Query slow requests (>500ms) for specific endpoints |
| 143 | + |
| 144 | +```bash |
| 145 | +gcloud logging read \ |
| 146 | + 'resource.type="cloud_run_revision" |
| 147 | + AND resource.labels.service_name="pyplots-backend" |
| 148 | + AND httpRequest.requestUrl=~"/(specs|stats|libraries)$" |
| 149 | + AND httpRequest.latency>="0.5s"' \ |
| 150 | + --limit=30 \ |
| 151 | + --freshness=1d \ |
| 152 | + --format='table(timestamp,httpRequest.requestUrl,httpRequest.latency)' |
| 153 | +``` |
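The Min/Median/Max columns in the tables above can be derived from the `httpRequest.latency` values this query returns. A minimal sketch, assuming the latencies have been copied out as duration strings ending in `s` (for example `2.62s`); `parse_latency` and `summarize` are hypothetical helper names, and the sample values are made up for illustration, not taken from the measurements.

```python
from statistics import median


def parse_latency(value: str) -> float:
    """Convert a duration string such as '2.62s' to seconds."""
    return float(value.strip().removesuffix("s"))


def summarize(latencies: list[str]) -> dict[str, float]:
    """Return sample count, min, median, and max in seconds."""
    seconds = [parse_latency(v) for v in latencies]
    return {
        "samples": len(seconds),
        "min": min(seconds),
        "median": median(seconds),
        "max": max(seconds),
    }


if __name__ == "__main__":
    # Made-up example values, one per matching log entry.
    print(summarize(["0.5s", "1.5s", "2.5s", "3.5s"]))
```

Reporting the median next to min and max keeps a single slow outlier from dominating the summary, which matters with sample sizes as small as 4 to 20.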
| 154 | + |
| 155 | +### Query fast requests (<500ms) for cache hits |
| 156 | + |
| 157 | +```bash |
| 158 | +gcloud logging read \ |
| 159 | + 'resource.type="cloud_run_revision" |
| 160 | + AND resource.labels.service_name="pyplots-backend" |
| 161 | + AND httpRequest.requestUrl=~"/(specs|stats|libraries)$" |
| 162 | + AND httpRequest.latency<"0.5s"' \ |
| 163 | + --limit=20 \ |
| 164 | + --freshness=1d \ |
| 165 | + --format='table(timestamp,httpRequest.requestUrl,httpRequest.latency)' |
| 166 | +``` |
| 167 | + |
| 168 | +### Query OOM events |
| 169 | + |
| 170 | +```bash |
| 171 | +gcloud logging read \ |
| 172 | + 'resource.type="cloud_run_revision" |
| 173 | + AND resource.labels.service_name="pyplots-backend" |
| 174 | + AND textPayload=~"Out-of-memory"' \ |
| 175 | + --limit=15 \ |
| 176 | + --freshness=30d \ |
| 177 | + --format='table(timestamp,textPayload)' |
| 178 | +``` |
| 179 | + |
| 180 | +### Check Cloud Run service configuration |
| 181 | + |
| 182 | +```bash |
| 183 | +gcloud run services describe pyplots-backend \ |
| 184 | + --region europe-west4 \ |
| 185 | + --format='yaml(spec.template.metadata.annotations,spec.template.spec.containers[0].resources)' |
| 186 | +``` |
| 187 | + |
| 188 | +### Check Cloud SQL instance tier |
| 189 | + |
| 190 | +```bash |
| 191 | +gcloud sql instances describe pyplots-db \ |
| 192 | + --format='yaml(settings.tier,settings.dataDiskSizeGb,databaseVersion,settings.activationPolicy)' |
| 193 | +``` |
| 194 | + |
| 195 | +### Cloud SQL CPU utilization (last 6 hours) |
| 196 | + |
| 197 | +```bash |
| 198 | +curl -s "https://monitoring.googleapis.com/v3/projects/$(gcloud config get-value project)/timeSeries?filter=metric.type%3D%22cloudsql.googleapis.com%2Fdatabase%2Fcpu%2Futilization%22&interval.startTime=$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%SZ)&interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \ |
| 199 | + -H "Authorization: Bearer $(gcloud auth print-access-token)" | \ |
| 200 | + python3 -c " |
| 201 | +import json, sys |
| 202 | +data = json.load(sys.stdin) |
| 203 | +for ts in data.get('timeSeries', []): |
| 204 | + for p in ts.get('points', [])[:20]: |
| 205 | + t = p['interval']['endTime'] |
| 206 | + v = p['value']['doubleValue'] |
| 207 | + print(f'{t}: {v*100:.1f}%') |
| 208 | +" |
| 209 | +``` |
| 210 | + |
| 211 | +### Cloud SQL memory utilization |
| 212 | + |
| 213 | +```bash |
| 214 | +# Replace "cpu" with "memory" in the metric type: |
| 215 | +# cloudsql.googleapis.com/database/memory/utilization |
| 216 | +``` |