@@ -13,19 +13,20 @@ transceiver under heavy mid-flight cancellation).
 
 ## Status
 
-This is the **skeleton** stage. The harness class structure is in
-place; thread bodies are intentionally stubs that exit immediately.
-The pytest test exercises the lifecycle (`setup → start →
-wait_until_done → stop`) only.
+The harness class structure and lifecycle are in place. Thread bodies
+land incrementally:
 
-Thread bodies are implemented incrementally in subsequent commits,
-in roughly this order (read-only first, side-effecting later):
+| Thread | Status |
+|--------|--------|
+| `log_scanner_thread` | Implemented — hard-zero log fail-fast |
+| `metrics_thread` | Stub (Step 2) |
+| `injector_thread` | Implemented — SIGSTOP/SIGCONT/SIGKILL + respawn |
+| `canary_thread` | Stub |
+| `load_thread` | Stub |
 
-1. `log_scanner_thread` (read-only — easiest)
-2. `metrics_thread` (read-only — almost as easy)
-3. `injector_thread` (subprocess control)
-4. `canary_thread` (HTTP client + token-equivalence)
-5. `load_thread` (wraps existing `run_cancel_stress_test`)
+Component-level coverage: `test_log_scanner.py`, `test_injector.py`.
+The parametrized marathon pytest still runs a lifecycle smoke until
+`setup()` launches a real cluster and the remaining threads are wired.
 
 ## File layout
 
@@ -35,6 +36,8 @@ tests/integration/defs/stress_test/disagg_cancel/
 ├── __init__.py
 ├── harness.py                    (DisaggCancellationStressHarness)
 ├── test_disagg_cancel_stress.py  (pytest entry point)
+├── test_log_scanner.py           (log_scanner unit tests)
+├── test_injector.py              (injector unit tests)
 └── configs/
     ├── README.md                 (YAML schema + how to add a config)
     ├── marathon_cpp_v1_deepseek.yaml
@@ -57,24 +60,70 @@ nightly / weekly via
 `tests/integration/test_lists/qa/llm_function_stress.txt` (wiring
 lands together with the load-thread implementation).
 
-### Local smoke (skeleton stage)
+### Unit tests (no GPU, no cluster)
+
+Component tests for individual harness threads run in isolation. They
+do **not** need `LLM_MODELS_ROOT`, GPUs, or a TRT-LLM venv with
+`transformers` — use `--confcutdir` so pytest skips the parent
+`tests/integration/defs/conftest.py`.
+
+From the repository root:
 
 ```bash
 cd /path/to/TensorRT-LLM
-LLM_MODELS_ROOT=/path/to/models \
-    pytest -sv tests/integration/defs/stress_test/disagg_cancel/
+
+export PYTHONPATH=tests/integration/defs:tests/integration/defs/disaggregated
+
+# Step 3 — injector thread (SIGSTOP / SIGCONT / SIGKILL + respawn)
+python3 -m pytest -c /dev/null -o addopts= \
+    --confcutdir=tests/integration/defs/stress_test \
+    tests/integration/defs/stress_test/disagg_cancel/test_injector.py -v
+
+# Step 1 — log scanner (optional sanity alongside injector PR)
+python3 -m pytest -c /dev/null -o addopts= \
+    --confcutdir=tests/integration/defs/stress_test \
+    tests/integration/defs/stress_test/disagg_cancel/test_log_scanner.py -v
+
+# Marathon YAML parse/validate (includes stress_config.injections schedule)
+python3 -m pytest -c /dev/null -o addopts= \
+    --confcutdir=tests/integration/defs/stress_test \
+    tests/integration/defs/stress_test/disagg_cancel/test_disagg_cancel_stress.py::test_all_marathon_yamls_parse_and_validate -v
+```
+
+All three together:
+
+```bash
+python3 -m pytest -c /dev/null -o addopts= \
+    --confcutdir=tests/integration/defs/stress_test \
+    tests/integration/defs/stress_test/disagg_cancel/test_injector.py \
+    tests/integration/defs/stress_test/disagg_cancel/test_log_scanner.py \
+    tests/integration/defs/stress_test/disagg_cancel/test_disagg_cancel_stress.py::test_all_marathon_yamls_parse_and_validate -q
+```
+
+In a full TRT-LLM dev container/venv (with `transformers` installed),
+the same tests also run under the normal integration pytest path:
+
+```bash
+pytest -sv tests/integration/defs/stress_test/disagg_cancel/test_injector.py
+```
+
+### Lifecycle smoke (injector not exercised on real workers)
+
+```bash
+pytest -sv tests/integration/defs/stress_test/disagg_cancel/test_disagg_cancel_stress.py::test_disagg_cancellation_marathon
 ```
 
-In the skeleton stage this should complete in seconds because all
-threads are no-ops; it only verifies the harness lifecycle compiles
-and the YAMLs parse.
+`setup()` is still a stub, so this only checks harness lifecycle
+(`setup` → `start` → `wait` → `stop`). The injector thread exits
+immediately because no workers are registered via
+`bind_tracked_workers()`.
 
-### Local marathon (once thread bodies are wired)
+### Local marathon (after `setup()` + load/canary land)
 
-Once thread bodies are implemented, the same command will run the
-full 2-hour marathon against the C++ marathon config. To run a shorter smoke
-during development, set `duration_min: 10` and trim
-`injections:` in the YAML.
+Once `setup()` launches a real 3P3D cluster and registers workers,
+the full 2-hour marathon runs via the same pytest entry point. For
+development, set `duration_min: 10` and trim `injections:` in the
+YAML.
 
 ## Pass criteria
 
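For readers unfamiliar with the fault pattern the status table names for `injector_thread` (SIGSTOP/SIGCONT/SIGKILL + respawn), the following is a minimal, POSIX-only sketch of that pattern against a stand-in child process. It is not the harness code: `spawn_worker`, `pause_resume`, and `kill_and_respawn` are hypothetical names chosen for illustration, the worker here is just a sleeping Python child, and the real injector's worker tracking (`bind_tracked_workers()`) and schedule handling are not reproduced.

```python
import signal
import subprocess
import sys
import time


def spawn_worker() -> subprocess.Popen:
    """Start a stand-in worker (hypothetical): a Python child that sleeps."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )


def pause_resume(proc: subprocess.Popen, pause_s: float) -> None:
    """Freeze the process with SIGSTOP, wait, then thaw it with SIGCONT."""
    proc.send_signal(signal.SIGSTOP)
    time.sleep(pause_s)
    proc.send_signal(signal.SIGCONT)


def kill_and_respawn(proc: subprocess.Popen) -> subprocess.Popen:
    """SIGKILL the process, reap it, and start a replacement worker."""
    proc.send_signal(signal.SIGKILL)
    proc.wait(timeout=10)  # reap the child so it does not linger as a zombie
    return spawn_worker()
```

A paused worker should still be alive after `pause_resume`, and `kill_and_respawn` should hand back a fresh live process while the old one reports termination by SIGKILL. A real injector would presumably also re-register the replacement with whatever tracks workers, which this sketch leaves out.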