
Commit c5f48d6

feat(performance): update Cloud Run configuration and add performance reference documentation
- Change CPU throttling setting in cloudbuild.yaml
- Add performance.md for backend API response time measurements and infrastructure details
1 parent f80cbae commit c5f48d6

File tree

2 files changed: +217 -1 lines changed

api/cloudbuild.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -58,7 +58,7 @@ steps:
       - "--execution-environment=gen2"
       - "--set-env-vars=GOOGLE_CLOUD_PROJECT=$PROJECT_ID"
       - "--set-env-vars=GCS_BUCKET=pyplots-images"
-      - "--no-cpu-throttling"
+      - "--cpu-throttling"
       - "--timeout=600"
     id: "deploy"
     waitFor: ["push-image"]
```

docs/reference/performance.md

Lines changed: 216 additions & 0 deletions

# Performance Reference

Backend API response time measurements for pyplots-backend (Cloud Run, europe-west4).

## Infrastructure

| Component | Config | Notes |
|-----------|--------|-------|
| Cloud Run (backend) | 1 vCPU, 1Gi RAM, min-instances=1 | gen2, startup-cpu-boost=true |
| Cloud Run (frontend) | 1 vCPU, 256Mi RAM, min-instances=1 | nginx serving SPA |
| Cloud SQL | `db-f1-micro`, PostgreSQL 18, PD-SSD 10GB | Shared vCPU, 614 MB RAM |
| Cache | In-memory TTLCache, 600s TTL, max 1000 entries | Per-instance, not shared |
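The cache row above drives the latency numbers that follow: entries live for 600 s, at most 1000 of them, per instance. A minimal stdlib sketch of that TTL behavior (an illustrative stand-in, not the backend's actual cache class, which is a TTLCache along the lines of `cachetools`):

```python
import time

class MiniTTLCache:
    """Illustrative stand-in for the backend's TTLCache: per-instance,
    entries expire after `ttl` seconds, capped at `maxsize` entries."""

    def __init__(self, maxsize=1000, ttl=600.0, clock=time.monotonic):
        self.maxsize, self.ttl, self.clock = maxsize, ttl, clock
        self._data = {}  # key -> (expires_at, value)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None or item[0] < self.clock():
            self._data.pop(key, None)  # drop expired entry on access
            return default
        return item[1]

    def set(self, key, value):
        now = self.clock()
        # Evict expired entries first, then the oldest entry if still full.
        self._data = {k: v for k, v in self._data.items() if v[0] >= now}
        if len(self._data) >= self.maxsize:
            self._data.pop(next(iter(self._data)))
        self._data[key] = (now + self.ttl, value)

cache = MiniTTLCache(maxsize=1000, ttl=600.0)
cache.set("/specs", {"count": 259})
```

A miss after the 600 s TTL is what triggers the slow uncached path measured below; because the cache is per-instance, every new Cloud Run instance also starts cold.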
## Baseline: Before `--no-cpu-throttling` (March 24, 2026)

Cloud Run config: `cpu-throttling=true` (request-based billing), 512Mi RAM.

### Uncached Requests (first hit after cache expiry, requires DB query)

| Endpoint | Samples | Min | Median | Max | Notes |
|----------|---------|-----|--------|-----|-------|
| `/specs` | 20 | 1.07s | 2.62s | 5.02s | Loads 259 specs + selectinload(impls) |
| `/stats` | 20 | 1.10s | 2.71s | 11.00s | Aggregate stats |
| `/libraries` | 4 | 0.46s | 6.96s | 7.06s | 9 rows, simple SELECT |
| `/specs/{id}` | 4 | 7.08s | 8.66s | 9.59s | Single spec + all impls |

### Cached Requests (cache hit, no DB)

| Endpoint | Samples | Min | Median | Max |
|----------|---------|-----|--------|-----|
| `/specs` | 10 | 13ms | 17ms | 19ms |
| `/stats` | 5 | 2ms | 3ms | 3ms |
| `/libraries` | 10 | 3ms | 37ms | 57ms |

### OOM Events (512Mi RAM)

14 OOM crashes in 15 days (March 10-23):

```
2026-03-23 04:37  Out-of-memory event detected
2026-03-20 23:12  Out-of-memory event detected
2026-03-18 20:26  Out-of-memory event detected
2026-03-15 22:42  Out-of-memory event detected
2026-03-14 20:00  Out-of-memory event detected (3x within 1 min)
2026-03-14 17:23  Out-of-memory event detected
2026-03-12 21:26  Out-of-memory event detected
2026-03-12 14:37  Out-of-memory event detected (2x within 1 min)
2026-03-10 14:23  Out-of-memory event detected
2026-03-10 13:29  Out-of-memory event detected (2x within 1 min)
```

## After `--no-cpu-throttling` + 1Gi RAM (March 25, 2026)

Cloud Run config: `cpu-throttling=false` (instance-based billing), 1Gi RAM.

Deployed revision `pyplots-backend-00085-4rn` at 2026-03-24 22:25 UTC.

### Uncached Requests

| Endpoint | Samples | Min | Median | Max | Notes |
|----------|---------|-----|--------|-----|-------|
| `/specs` | 20 | 0.77s | 1.85s | 9.20s | No improvement |
| `/stats` | 20 | 0.77s | 1.84s | 9.93s | No improvement |
| `/libraries` | 8 | 0.31s | 7.59s | 8.57s | No improvement |
| `/specs/{id}` | 6 | 7.08s | 8.76s | 9.59s | No improvement |

### Cached Requests

| Endpoint | Samples | Min | Median | Max |
|----------|---------|-----|--------|-----|
| `/specs` | 5 | 12ms | 14ms | 20ms |
| `/stats` | 2 | 2ms | 2ms | 2ms |
| `/libraries` | 10 | 18ms | 107ms | 228ms |

### OOM Events (1Gi RAM)

**0 OOM events since upgrade** (March 24-25). The memory increase resolved the OOM crashes.

## Cloud SQL Metrics (March 25, 2026)

Measured via the Cloud Monitoring API while running `db-f1-micro`:

| Metric | Value | Notes |
|--------|-------|-------|
| CPU Utilization | 9-12% | Low, but the shared 0.2 vCPU means real capacity is tiny |
| Memory Utilization | 100.0% | **Misleading**: includes OS page cache (normal Linux behavior) |
| Memory Total Usage | 184-219 MB | Actual PostgreSQL process memory |
| Memory Quota | 614 MB | Total available (~400 MB used as OS page cache) |
| Disk Utilization | 4.0% | 0.4 GB of 10 GB used |

**Note:** The `memory/utilization` metric at 100% is NOT indicative of memory pressure. Linux uses all free RAM as filesystem page cache, which is normal and beneficial. Actual PostgreSQL memory usage is ~200 MB of 614 MB.

### Root Cause: Shared 0.2 vCPU under concurrent load

Connection establishment is **not** the issue: the `num_backends` metric shows 5-8 persistent connections to the `pyplots` database at all times. The connection pool (`pool_size=5`, `max_overflow=10`, `pool_pre_ping=True`) keeps connections alive.
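Those pool parameters follow SQLAlchemy QueuePool semantics: up to `pool_size` persistent connections, `max_overflow` extras under burst, and a liveness check before each checkout when `pool_pre_ping` is on. A stdlib sketch of those semantics (hypothetical `FakeConn` and `MiniPool`, not the backend's actual code):

```python
import queue

class FakeConn:
    """Hypothetical stand-in for a DB connection."""
    def __init__(self):
        self.alive = True
    def ping(self):
        return self.alive

class MiniPool:
    """Sketch of QueuePool-style behavior: pool_size persistent
    connections, max_overflow extras under burst, pre-ping on checkout."""

    def __init__(self, pool_size=5, max_overflow=10, pre_ping=True):
        self.idle = queue.Queue()
        self.capacity = pool_size + max_overflow
        self.pre_ping = pre_ping
        self.checked_out = 0

    def acquire(self):
        if self.checked_out >= self.capacity:
            raise RuntimeError("pool exhausted (pool_size + max_overflow)")
        try:
            conn = self.idle.get_nowait()
            if self.pre_ping and not conn.ping():
                conn = FakeConn()  # stale connection: replace transparently
        except queue.Empty:
            conn = FakeConn()  # open a new connection
        self.checked_out += 1
        return conn

    def release(self, conn):
        self.checked_out -= 1
        self.idle.put(conn)  # kept alive for reuse, hence persistent backends
```

With min-instances=1 and this pooling, the 5-8 open `num_backends` are expected: connections persist across requests rather than being re-established per query.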

The bottleneck is the **0.2 shared vCPU** handling concurrent queries. When the 600s cache expires, the SPA fires 4 parallel requests (`/specs`, `/stats`, `/libraries`, `/specs/{id}`), each triggering a DB query simultaneously. With 0.2 shared vCPU split across 4 queries, each effectively gets ~0.05 vCPU, which explains the 6-9s response times.

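The four-request burst is easy to reproduce with a small harness; a sketch using `asyncio` (`fetch` is a placeholder simulated with `asyncio.sleep`; to measure the live API, swap in a real HTTP client such as `httpx`):

```python
import asyncio
import time

ENDPOINTS = ["/specs", "/stats", "/libraries", "/specs/{id}"]

async def fetch(endpoint: str) -> float:
    """Placeholder for an HTTP GET; returns per-request latency in seconds."""
    start = time.perf_counter()
    await asyncio.sleep(0.05)  # stand-in for the real request
    return time.perf_counter() - start

async def burst() -> list[float]:
    # Fire all four requests concurrently, as the SPA does on page load.
    return await asyncio.gather(*(fetch(e) for e in ENDPOINTS))

latencies = asyncio.run(burst())
```

Against the live backend this makes the contention visible: four concurrent queries on a 0.2 shared vCPU each see roughly 0.2/4 ≈ 0.05 vCPU, so per-request latency balloons even though each query is cheap in isolation.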
```
# DB connections (pyplots database): persistent, never drops to 0
gcloud monitoring: num_backends
22:07 → 5    22:01 → 8    21:44 → 7    21:38 → 7
```

### Connection errors during OOM events

```
2026-03-20 23:12  FATAL: connection to client lost (7x simultaneous)
```

The Cloud Run backend OOM crash at the same timestamp caused all active DB connections to drop.

## Cloud SQL Tier Comparison

| Spec | `db-f1-micro` (current) | `db-g1-small` | `db-custom-1-3840` |
|------|-------------------------|---------------|---------------------|
| CPU | 0.2 shared vCPU (burstable) | 0.5 shared vCPU (burstable) | 1 dedicated vCPU |
| RAM | 614 MB | 1.7 GB | 3.75 GB |
| CPU behavior | Sustained workloads throttled to 0.2 vCPU | Sustained workloads throttled to 0.5 vCPU | Full core, no throttling |
| Price/month | ~$9 | ~$27 | ~$51 |
| PG buffer cache | ~0 MB (RAM full from OS+PG overhead) | ~800 MB | ~2.5 GB |
| Google recommendation | Dev/test only | Lightweight workloads | Min. for production |

Upgrade command: `gcloud sql instances patch pyplots-db --tier=db-g1-small`

## Conclusion

`--no-cpu-throttling` had **no measurable impact** on uncached request latency. The bottleneck is the Cloud SQL `db-f1-micro` instance (0.2 shared vCPU, 614 MB RAM). Memory is not the issue (actual usage ~200 MB), and connection establishment was ruled out above (5-8 persistent pooled connections). That leaves the shared 0.2 vCPU throttled under concurrent query load as the cause.

Cached responses are consistently fast (2-230ms). The problem only occurs when the 600s cache expires and the backend queries Cloud SQL.

**Next step:** Upgrade Cloud SQL from `db-f1-micro` to `db-g1-small` (0.5 vCPU, 1.7 GB, ~$27/mo) and re-measure: `gcloud sql instances patch pyplots-db --tier=db-g1-small`

## After Cloud SQL upgrade to `db-g1-small` (pending)

TODO: Re-measure after upgrade and fill in results.


## How to Reproduce These Measurements

### Query slow requests (>500ms) for specific endpoints

```bash
gcloud logging read \
  'resource.type="cloud_run_revision"
   AND resource.labels.service_name="pyplots-backend"
   AND httpRequest.requestUrl=~"/(specs|stats|libraries)$"
   AND httpRequest.latency>="0.5s"' \
  --limit=30 \
  --freshness=1d \
  --format='table(timestamp,httpRequest.requestUrl,httpRequest.latency)'
```

### Query fast requests (<500ms) for cache hits

```bash
gcloud logging read \
  'resource.type="cloud_run_revision"
   AND resource.labels.service_name="pyplots-backend"
   AND httpRequest.requestUrl=~"/(specs|stats|libraries)$"
   AND httpRequest.latency<"0.5s"' \
  --limit=20 \
  --freshness=1d \
  --format='table(timestamp,httpRequest.requestUrl,httpRequest.latency)'
```

### Query OOM events

```bash
gcloud logging read \
  'resource.type="cloud_run_revision"
   AND resource.labels.service_name="pyplots-backend"
   AND textPayload=~"Out-of-memory"' \
  --limit=15 \
  --freshness=30d \
  --format='table(timestamp,textPayload)'
```

### Check Cloud Run service configuration

```bash
gcloud run services describe pyplots-backend \
  --region europe-west4 \
  --format='yaml(spec.template.metadata.annotations,spec.template.spec.containers[0].resources)'
```

### Check Cloud SQL instance tier

```bash
gcloud sql instances describe pyplots-db \
  --format='yaml(settings.tier,settings.dataDiskSizeGb,databaseVersion,settings.activationPolicy)'
```

### Cloud SQL CPU utilization (last 6 hours)

```bash
# Note: "date -d" is GNU date; on macOS use "date -u -v-6H" instead.
curl -s "https://monitoring.googleapis.com/v3/projects/$(gcloud config get-value project)/timeSeries?filter=metric.type%3D%22cloudsql.googleapis.com%2Fdatabase%2Fcpu%2Futilization%22&interval.startTime=$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%SZ)&interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" | \
  python3 -c "
import json, sys

data = json.load(sys.stdin)
for ts in data.get('timeSeries', []):
    for p in ts.get('points', [])[:20]:
        t = p['interval']['endTime']
        v = p['value']['doubleValue']
        print(f'{t}: {v*100:.1f}%')
"
```

### Cloud SQL memory utilization

```bash
# Same request as above, with "cpu" replaced by "memory" in the metric type:
# cloudsql.googleapis.com/database/memory/utilization
```
