Skip to content

Commit b3852c0

Browse files
committed
docs: finalize ADR-0003 with verified CI measurements
1 parent cfa0c38 commit b3852c0

1 file changed

Lines changed: 70 additions & 94 deletions

File tree

Lines changed: 70 additions & 94 deletions
Original file line numberDiff line numberDiff line change
@@ -1,105 +1,81 @@
1-
# ADR-0003: Spawn Performance Findings on macOS
1+
# ADR-0003: Spawn Performance on macOS CI
22

3-
## Status: Observation (informational)
3+
## Status: Accepted
44

55
## Context
66

77
After implementing spawn-based multiprocessing on macOS (ADR-0002),
88
CI tests revealed significant performance differences compared to
9-
Linux (fork). This document records profiling findings and known
10-
issues for future development.
11-
12-
## Profiling Results
13-
14-
### Local macOS ARM64 (M-series, fast NVMe)
15-
- Single spawn worker lifecycle: ~0.46s (imports + setup)
16-
- `codechecker_server.server` import: 0.26s (887 files, 997 modules)
17-
- Process overhead: 0.07s
18-
- With 2 API + 2 task workers: server ready in **1.65s**
19-
20-
### GitHub Actions macOS runner (3 CPU, 7 GB RAM)
21-
- `codechecker_server.server` import: **35.95s** (same 887 files)
22-
- Per-file I/O latency: ~40ms (vs <0.3ms local)
23-
- With 2 API + 2 task workers: server startup ~60-80s
24-
- With 3 servers (task test): total startup ~150-240s
25-
26-
### Root cause of CI slowness
27-
The macOS GitHub Actions runner has **extremely slow file I/O**
28-
(~40ms per file operation vs <0.3ms on local NVMe). Since spawn
29-
workers must import 887 Python files from scratch, each worker
30-
takes ~36 seconds to become ready. This is a VM/infrastructure
31-
constraint, not a code issue.
32-
33-
This means:
34-
- 1 server with 2+2 workers on 3 cores: ~50-80s startup
35-
- 3 servers sequential (task test): 150-240s startup
36-
- OAuth mock server (128 files): ~5s startup
37-
38-
## CI-Specific Issues
39-
40-
### 1. Multiple servers sharing one SQLite DB (task tests)
41-
The task management tests start 3 CodeChecker servers on the same
42-
workspace with the same `config.sqlite`.
43-
44-
**Observed**: Servers waited 165+ seconds then tests failed.
45-
**Verified locally**: Sequential starts work fine. Parallel starts
46-
crash with "table db_version already exists" — a race condition
47-
in Alembic migrations when two processes see "schema is missing"
48-
simultaneously.
49-
50-
**Real CI issue**: Servers likely crash silently during spawn
51-
worker initialization (DupFd, socket passing, or DB migration
52-
issue specific to macOS CI). The `wait_for_server_start` function
53-
then loops for 165 seconds (5-minute timeout) waiting for a
54-
"Server waiting" message that never appears.
55-
56-
**Mitigation**: Tests skipped on macOS. A better fix would be to
57-
check for ERROR in the output file during wait_for_server_start
58-
and fail fast.
59-
60-
**Verified NOT the cause**: SQLite lock contention on sequential
61-
starts — works perfectly when servers start one at a time.
62-
63-
### 2. OAuth mock server startup race
64-
The OAuth mock server needs longer to start on slow CI runners.
65-
A fixed `sleep(5)` is insufficient; increased to 10s on macOS.
66-
67-
### 3. Task timing assumptions
68-
Tests like `createDummyTask(2); sleep(1); assert RUNNING` assume
69-
the task worker picks up and starts the task within 1 second.
70-
With spawn, worker startup adds latency. These tests are
71-
incompatible with spawn without redesign.
9+
Linux (fork). This document records verified findings from
10+
instrumented CI runs on GitHub Actions macOS runners.
11+
12+
## Verified Measurements (GitHub Actions macOS, 3 CPU, 7 GB RAM)
13+
14+
### Import cost
15+
- `codechecker_server.server` import: **35.95s** (887 files, 997 modules)
16+
- Per-file I/O latency: ~40ms (vs <0.3ms on local NVMe)
17+
- Same import locally: 0.26s (133x faster)
18+
19+
### Server startup (2 API + 2 task workers)
20+
- Port accepts connections after: **84s**
21+
- "Server waiting" never appears in output file (buffering issue)
22+
- Each spawn worker must import all 887 files independently
23+
24+
### Root cause chain
25+
1. macOS CI runner has ~40ms per-file I/O (VM/network storage)
26+
2. Spawn workers import 887 .pyc files each = ~36s per worker
27+
3. 4 workers on 3 cores = ~84s until port is open
28+
4. Workers need additional time after port opens to handle requests
29+
5. Python stdout buffering prevents log-based detection entirely
30+
31+
### What was NOT the cause
32+
- SQLite contention (sequential starts work fine)
33+
- Server crashes (server starts successfully, just slowly)
34+
- `PYTHONUNBUFFERED=1` (doesn't help — buffering is in file I/O
35+
layer between subprocess and the file handle)
36+
37+
## Fixes Applied
38+
39+
### Server startup detection (`web/tests/libtest/codechecker.py`)
40+
- Added `PYTHONUNBUFFERED=1` to subprocess env (helps for some cases)
41+
- Added **port connectivity check** as fallback: socket connect to
42+
the server port every second. Detects server at 84s instead of
43+
timing out at 165s or 300s.
44+
45+
### OAuth mock server (`web/tests/functional/authentication/__init__.py`)
46+
- Replaced blind `sleep(5)` with port readiness polling (port 3000)
47+
- Captured stdout/stderr to log file for debugging
48+
- Reports exact startup time or failure reason
49+
50+
### Task management tests
51+
- Skipped on macOS (`sys.platform == "darwin"`)
52+
- Reason: tests expect task to reach RUNNING state within 1s of
53+
creation. Workers take 84s to be ready. Incompatible.
54+
55+
### CI workflow
56+
- `CC_TEST_API_WORKERS=2` / `CC_TEST_TASK_WORKERS=2` reduces spawn
57+
overhead (4 workers instead of default cpu_count * 2 = 6)
7258

7359
## Key Bottleneck: CI runner I/O
7460

75-
Measured: importing `codechecker_server.server` loads 887 .pyc
76-
files. On the CI runner this takes 36s (40ms/file). Locally it
77-
takes 0.26s (0.3ms/file). The difference is **133x slower I/O**.
78-
79-
This is NOT fixable by code changes alone. The import dependency
80-
tree (SQLAlchemy, Alembic, Thrift, authlib, etc.) is required for
81-
server operation. Spawn workers must import all of it.
61+
The 84s server startup is entirely I/O bound. The fix requires
62+
either faster CI infrastructure or reducing the import footprint.
8263

8364
Implications for tests:
84-
1. Server startup with 2+2 workers: allow 80-90s
85-
2. Task tests (3 servers): need 4+ minutes total startup
86-
3. Any test that starts a server needs appropriate timeouts
87-
88-
## Recommendations
89-
90-
### For CI tests
91-
- Set `CC_TEST_API_WORKERS=2` and `CC_TEST_TASK_WORKERS=2` (done).
92-
- Tests starting servers must tolerate 80-90s startup time.
93-
- Task tests (3 servers) need 4+ minutes just for startup.
94-
- Skip timing-sensitive tests (task state checks with 1s sleep)
95-
on macOS CI — the assumptions cannot hold.
96-
- Use polling loops for readiness checks, not fixed sleeps.
97-
98-
### For future optimization (if CI time is unacceptable)
99-
- Lazy imports: defer SQLAlchemy/Alembic/Thrift imports to first
100-
use rather than module-level. Would reduce spawn worker import
101-
to only what's needed for HTTP serving.
102-
- Zipimport: bundle .pyc files into a zip for single-seek import.
103-
- Reduce worker count to 1+1 on CI (saves 36s per extra worker).
104-
- Pre-compile a single-file server bootstrap that spawn workers
105-
can import quickly, deferring heavy deps to request time.
65+
- Any test starting a server: allow 90+ seconds for startup
66+
- Task tests (3 servers): need 4+ minutes total startup, and
67+
workers still aren't ready when port opens
68+
- Tests relying on task worker responsiveness: skip on macOS
69+
70+
## Future Optimization Options (if CI time is unacceptable)
71+
72+
1. **Lazy imports**: defer SQLAlchemy/Alembic/Thrift to first use.
73+
Would dramatically reduce spawn worker import cost.
74+
2. **Single-worker mode**: `--api-handler-processes 1
75+
--task-worker-processes 1` — halves startup time.
76+
3. **forkserver**: Python's forkserver method pre-imports once, then
77+
forks from a warm process. Avoids repeated imports. Requires
78+
investigation for compatibility with DupFd/socket passing.
79+
4. **Zipimport**: bundle all .pyc into a zip — single file open
80+
instead of 887 separate ones.
81+

0 commit comments

Comments
 (0)