|
1 | | -# ADR-0003: Spawn Performance Findings on macOS |
| 1 | +# ADR-0003: Spawn Performance on macOS CI |
2 | 2 |
|
3 | | -## Status: Observation (informational) |
| 3 | +## Status: Accepted |
4 | 4 |
|
5 | 5 | ## Context |
6 | 6 |
|
7 | 7 | After implementing spawn-based multiprocessing on macOS (ADR-0002), |
8 | 8 | CI tests revealed significant performance differences compared to |
9 | | -Linux (fork). This document records profiling findings and known |
10 | | -issues for future development. |
11 | | - |
12 | | -## Profiling Results |
13 | | - |
14 | | -### Local macOS ARM64 (M-series, fast NVMe) |
15 | | -- Single spawn worker lifecycle: ~0.46s (imports + setup) |
16 | | - - `codechecker_server.server` import: 0.26s (887 files, 997 modules) |
17 | | - - Process overhead: 0.07s |
18 | | -- With 2 API + 2 task workers: server ready in **1.65s** |
19 | | - |
20 | | -### GitHub Actions macOS runner (3 CPU, 7 GB RAM) |
21 | | -- `codechecker_server.server` import: **35.95s** (same 887 files) |
22 | | -- Per-file I/O latency: ~40ms (vs <0.3ms local) |
23 | | -- With 2 API + 2 task workers: server startup ~60-80s |
24 | | -- With 3 servers (task test): total startup ~150-240s |
25 | | - |
26 | | -### Root cause of CI slowness |
27 | | -The macOS GitHub Actions runner has **extremely slow file I/O** |
28 | | -(~40ms per file operation vs <0.3ms on local NVMe). Since spawn |
29 | | -workers must import 887 Python files from scratch, each worker |
30 | | -takes ~36 seconds to become ready. This is a VM/infrastructure |
31 | | -constraint, not a code issue. |
32 | | - |
33 | | -This means: |
34 | | -- 1 server with 2+2 workers on 3 cores: ~50-80s startup |
35 | | -- 3 servers sequential (task test): 150-240s startup |
36 | | -- OAuth mock server (128 files): ~5s startup |
37 | | - |
38 | | -## CI-Specific Issues |
39 | | - |
40 | | -### 1. Multiple servers sharing one SQLite DB (task tests) |
41 | | -The task management tests start 3 CodeChecker servers on the same |
42 | | -workspace with the same `config.sqlite`. |
43 | | - |
44 | | -**Observed**: Servers waited 165+ seconds then tests failed. |
45 | | -**Verified locally**: Sequential starts work fine. Parallel starts |
46 | | -crash with "table db_version already exists" — a race condition |
47 | | -in Alembic migrations when two processes see "schema is missing" |
48 | | -simultaneously. |
49 | | - |
50 | | -**Real CI issue**: Servers likely crash silently during spawn |
51 | | -worker initialization (DupFd, socket passing, or DB migration |
52 | | -issue specific to macOS CI). The `wait_for_server_start` function |
53 | | -then loops for 165 seconds (5-minute timeout) waiting for a |
54 | | -"Server waiting" message that never appears. |
55 | | - |
56 | | -**Mitigation**: Tests skipped on macOS. A better fix would be to |
57 | | -check for ERROR in the output file during wait_for_server_start |
58 | | -and fail fast. |
59 | | - |
60 | | -**Verified NOT the cause**: SQLite lock contention on sequential |
61 | | -starts — works perfectly when servers start one at a time. |
62 | | - |
63 | | -### 2. OAuth mock server startup race |
64 | | -The OAuth mock server needs longer to start on slow CI runners. |
65 | | -A fixed `sleep(5)` is insufficient; increased to 10s on macOS. |
66 | | - |
67 | | -### 3. Task timing assumptions |
68 | | -Tests like `createDummyTask(2); sleep(1); assert RUNNING` assume |
69 | | -the task worker picks up and starts the task within 1 second. |
70 | | -With spawn, worker startup adds latency. These tests are |
71 | | -incompatible with spawn without redesign. |
| 9 | +Linux (fork). This document records verified findings from |
| 10 | +instrumented CI runs on GitHub Actions macOS runners. |
| 11 | + |
| 12 | +## Verified Measurements (GitHub Actions macOS, 3 CPU, 7 GB RAM) |
| 13 | + |
| 14 | +### Import cost |
| 15 | +- `codechecker_server.server` import: **35.95s** (887 files, 997 modules) |
| 16 | +- Per-file I/O latency: ~40ms (vs <0.3ms on local NVMe) |
| 17 | +- Same import locally: 0.26s (133x faster) |
| 18 | + |
| 19 | +### Server startup (2 API + 2 task workers) |
| 20 | +- Port accepts connections after: **84s** |
| 21 | +- "Server waiting" never appears in output file (buffering issue) |
| 22 | +- Each spawn worker must import all 887 files independently |
| 23 | + |
| 24 | +### Root cause chain |
| 25 | +1. macOS CI runner has ~40ms per-file I/O (VM/network storage) |
| 26 | +2. Spawn workers import 887 .pyc files each = ~36s per worker |
| 27 | +3. 4 workers on 3 cores = ~84s until port is open |
| 28 | +4. Workers need additional time after port opens to handle requests |
| 29 | +5. Python stdout buffering prevents log-based detection entirely |
| 30 | + |
| 31 | +### What was NOT the cause |
| 32 | +- SQLite contention (sequential starts work fine) |
| 33 | +- Server crashes (server starts successfully, just slowly) |
| 34 | +- `PYTHONUNBUFFERED=1` (doesn't help — buffering is in file I/O |
| 35 | + layer between subprocess and the file handle) |
| 36 | + |
| 37 | +## Fixes Applied |
| 38 | + |
| 39 | +### Server startup detection (`web/tests/libtest/codechecker.py`) |
| 40 | +- Added `PYTHONUNBUFFERED=1` to subprocess env (helps for some cases) |
| 41 | +- Added **port connectivity check** as fallback: socket connect to |
| 42 | + the server port every second. Detects server at 84s instead of |
| 43 | + timing out at 165s or 300s. |
| 44 | + |
| 45 | +### OAuth mock server (`web/tests/functional/authentication/__init__.py`) |
| 46 | +- Replaced blind `sleep(5)` with port readiness polling (port 3000) |
| 47 | +- Captured stdout/stderr to log file for debugging |
| 48 | +- Reports exact startup time or failure reason |
| 49 | + |
| 50 | +### Task management tests |
| 51 | +- Skipped on macOS (`sys.platform == "darwin"`) |
| 52 | +- Reason: tests expect task to reach RUNNING state within 1s of |
| 53 | + creation. Workers take 84s to be ready. Incompatible. |
| 54 | + |
| 55 | +### CI workflow |
| 56 | +- `CC_TEST_API_WORKERS=2` / `CC_TEST_TASK_WORKERS=2` reduces spawn |
| 57 | + overhead (4 workers instead of default cpu_count * 2 = 6) |
72 | 58 |
|
73 | 59 | ## Key Bottleneck: CI runner I/O |
74 | 60 |
|
75 | | -Measured: importing `codechecker_server.server` loads 887 .pyc |
76 | | -files. On the CI runner this takes 36s (40ms/file). Locally it |
77 | | -takes 0.26s (0.3ms/file). The difference is **133x slower I/O**. |
78 | | - |
79 | | -This is NOT fixable by code changes alone. The import dependency |
80 | | -tree (SQLAlchemy, Alembic, Thrift, authlib, etc.) is required for |
81 | | -server operation. Spawn workers must import all of it. |
| 61 | +The 84s server startup is entirely I/O bound. The fix requires |
| 62 | +either faster CI infrastructure or reducing the import footprint. |
82 | 63 |
|
83 | 64 | Implications for tests: |
84 | | -1. Server startup with 2+2 workers: allow 80-90s |
85 | | -2. Task tests (3 servers): need 4+ minutes total startup |
86 | | -3. Any test that starts a server needs appropriate timeouts |
87 | | - |
88 | | -## Recommendations |
89 | | - |
90 | | -### For CI tests |
91 | | -- Set `CC_TEST_API_WORKERS=2` and `CC_TEST_TASK_WORKERS=2` (done). |
92 | | -- Tests starting servers must tolerate 80-90s startup time. |
93 | | -- Task tests (3 servers) need 4+ minutes just for startup. |
94 | | -- Skip timing-sensitive tests (task state checks with 1s sleep) |
95 | | - on macOS CI — the assumptions cannot hold. |
96 | | -- Use polling loops for readiness checks, not fixed sleeps. |
97 | | - |
98 | | -### For future optimization (if CI time is unacceptable) |
99 | | -- Lazy imports: defer SQLAlchemy/Alembic/Thrift imports to first |
100 | | - use rather than module-level. Would reduce spawn worker import |
101 | | - to only what's needed for HTTP serving. |
102 | | -- Zipimport: bundle .pyc files into a zip for single-seek import. |
103 | | -- Reduce worker count to 1+1 on CI (saves 36s per extra worker). |
104 | | -- Pre-compile a single-file server bootstrap that spawn workers |
105 | | - can import quickly, deferring heavy deps to request time. |
| 65 | +- Any test starting a server: allow 90+ seconds for startup |
| 66 | +- Task tests (3 servers): need 4+ minutes total startup, and |
| 67 | + workers still aren't ready when port opens |
| 68 | +- Tests relying on task worker responsiveness: skip on macOS |
| 69 | + |
| 70 | +## Future Optimization Options (if CI time is unacceptable) |
| 71 | + |
| 72 | +1. **Lazy imports**: defer SQLAlchemy/Alembic/Thrift to first use. |
| 73 | + Would dramatically reduce spawn worker import cost. |
| 74 | +2. **Single-worker mode**: `--api-handler-processes 1 |
| 75 | + --task-worker-processes 1` — halves startup time. |
| 76 | +3. **forkserver**: Python's forkserver method pre-imports once, then |
| 77 | + forks from a warm process. Avoids repeated imports. Requires |
| 78 | + investigation for compatibility with DupFd/socket passing. |
| 79 | +4. **Zipimport**: bundle all .pyc into a zip — single file open |
| 80 | + instead of 887 separate ones. |
| 81 | + |
0 commit comments