Commit 4574851

[TRTLLM-12648][test] implement disagg cancellation injector thread (#14920)
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
1 parent f0ca418 commit 4574851

4 files changed

Lines changed: 999 additions & 36 deletions

File tree

tests/integration/defs/stress_test/disagg_cancel/README.md

Lines changed: 71 additions & 22 deletions
@@ -13,19 +13,20 @@ transceiver under heavy mid-flight cancellation).
1313

1414
## Status
1515

16-
This is the **skeleton** stage. The harness class structure is in
17-
place; thread bodies are intentionally stubs that exit immediately.
18-
The pytest test exercises the lifecycle (`setup → start →
19-
wait_until_done → stop`) only.
16+
The harness class structure and lifecycle are in place. Thread bodies
17+
land incrementally:
2018

21-
Thread bodies are implemented incrementally in subsequent commits,
22-
in roughly this order (read-only first, side-effecting later):
19+
| Thread | Status |
20+
|--------|--------|
21+
| `log_scanner_thread` | Implemented — hard-zero log fail-fast |
22+
| `metrics_thread` | Stub (Step 2) |
23+
| `injector_thread` | Implemented — SIGSTOP/SIGCONT/SIGKILL + respawn |
24+
| `canary_thread` | Stub |
25+
| `load_thread` | Stub |
2326

24-
1. `log_scanner_thread` (read-only — easiest)
25-
2. `metrics_thread` (read-only — almost as easy)
26-
3. `injector_thread` (subprocess control)
27-
4. `canary_thread` (HTTP client + token-equivalence)
28-
5. `load_thread` (wraps existing `run_cancel_stress_test`)
27+
Component-level coverage: `test_log_scanner.py`, `test_injector.py`.
28+
The parametrized marathon pytest still runs a lifecycle smoke until
29+
`setup()` launches a real cluster and the remaining threads are wired.
2930

3031
## File layout
3132

@@ -35,6 +36,8 @@ tests/integration/defs/stress_test/disagg_cancel/
3536
├── __init__.py
3637
├── harness.py (DisaggCancellationStressHarness)
3738
├── test_disagg_cancel_stress.py (pytest entry point)
39+
├── test_log_scanner.py (log_scanner unit tests)
40+
├── test_injector.py (injector unit tests)
3841
└── configs/
3942
├── README.md (YAML schema + how to add a config)
4043
├── marathon_cpp_v1_deepseek.yaml
@@ -57,24 +60,70 @@ nightly / weekly via
5760
`tests/integration/test_lists/qa/llm_function_stress.txt` (wiring
5861
lands together with the load-thread implementation).
5962

60-
### Local smoke (skeleton stage)
63+
### Unit tests (no GPU, no cluster)
64+
65+
Component tests for individual harness threads run in isolation. They
66+
do **not** need `LLM_MODELS_ROOT`, GPUs, or a TRT-LLM venv with
67+
`transformers` — use `--confcutdir` so pytest skips the parent
68+
`tests/integration/defs/conftest.py`.
69+
70+
From the repository root:
6171

6272
```bash
6373
cd /path/to/TensorRT-LLM
64-
LLM_MODELS_ROOT=/path/to/models \
65-
pytest -sv tests/integration/defs/stress_test/disagg_cancel/
74+
75+
export PYTHONPATH=tests/integration/defs:tests/integration/defs/disaggregated
76+
77+
# Step 3 — injector thread (SIGSTOP / SIGCONT / SIGKILL + respawn)
78+
python3 -m pytest -c /dev/null -o addopts= \
79+
--confcutdir=tests/integration/defs/stress_test \
80+
tests/integration/defs/stress_test/disagg_cancel/test_injector.py -v
81+
82+
# Step 1 — log scanner (optional sanity alongside injector PR)
83+
python3 -m pytest -c /dev/null -o addopts= \
84+
--confcutdir=tests/integration/defs/stress_test \
85+
tests/integration/defs/stress_test/disagg_cancel/test_log_scanner.py -v
86+
87+
# Marathon YAML parse/validate (includes stress_config.injections schedule)
88+
python3 -m pytest -c /dev/null -o addopts= \
89+
--confcutdir=tests/integration/defs/stress_test \
90+
tests/integration/defs/stress_test/disagg_cancel/test_disagg_cancel_stress.py::test_all_marathon_yamls_parse_and_validate -v
91+
```
92+
93+
All three together:
94+
95+
```bash
96+
python3 -m pytest -c /dev/null -o addopts= \
97+
--confcutdir=tests/integration/defs/stress_test \
98+
tests/integration/defs/stress_test/disagg_cancel/test_injector.py \
99+
tests/integration/defs/stress_test/disagg_cancel/test_log_scanner.py \
100+
tests/integration/defs/stress_test/disagg_cancel/test_disagg_cancel_stress.py::test_all_marathon_yamls_parse_and_validate -q
101+
```
102+
103+
In a full TRT-LLM dev container/venv (with `transformers` installed),
104+
the same tests also run under the normal integration pytest path:
105+
106+
```bash
107+
pytest -sv tests/integration/defs/stress_test/disagg_cancel/test_injector.py
108+
```
109+
110+
### Lifecycle smoke (injector not exercised on real workers)
111+
112+
```bash
113+
pytest -sv tests/integration/defs/stress_test/disagg_cancel/test_disagg_cancel_stress.py::test_disagg_cancellation_marathon
66114
```
67115

68-
In the skeleton stage this should complete in seconds because all
69-
threads are no-ops; it only verifies the harness lifecycle compiles
70-
and the YAMLs parse.
116+
`setup()` is still a stub, so this only checks harness lifecycle
117+
(`setup` → `start` → `wait` → `stop`). The injector thread exits
118+
immediately because no workers are registered via
119+
`bind_tracked_workers()`.
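
To illustrate the behavior described above, here is a minimal sketch of a signal-based injector, not the harness's actual code. It assumes a POSIX host; the names `inject` and `run_injections`, the `workers` mapping, and the `(name, action)` schedule format are all hypothetical. It shows the early exit when no workers are registered, the SIGSTOP/SIGCONT pause, and SIGKILL followed by a respawn.

```python
import os
import signal
import subprocess
import sys
import time


def inject(pid: int, action: str, pause_s: float = 0.0) -> None:
    """Apply one injection to a worker process (hypothetical helper)."""
    if action == "pause":
        os.kill(pid, signal.SIGSTOP)  # freeze the worker
        time.sleep(pause_s)
        os.kill(pid, signal.SIGCONT)  # resume it
    elif action == "kill":
        os.kill(pid, signal.SIGKILL)  # hard kill; the caller respawns
    else:
        raise ValueError(f"unknown action: {action}")


def run_injections(workers: dict, schedule: list) -> int:
    """Run a schedule and return the number of injections applied.

    workers maps a name to a subprocess.Popen; schedule is a list of
    (name, action) pairs. Both shapes are assumptions for this sketch.
    """
    if not workers:
        return 0  # nothing registered: exit immediately
    applied = 0
    for name, action in schedule:
        proc = workers[name]
        inject(proc.pid, action, pause_s=0.1)
        if action == "kill":
            proc.wait()  # reap the killed process
            # Respawn with the same command line.
            workers[name] = subprocess.Popen(proc.args)
        applied += 1
    return applied


if __name__ == "__main__":
    cmd = [sys.executable, "-c", "import time; time.sleep(30)"]
    demo = {"w0": subprocess.Popen(cmd)}
    print(run_injections(demo, [("w0", "pause"), ("w0", "kill")]))
    demo["w0"].kill()  # clean up the respawned worker
    demo["w0"].wait()
```

With an empty `workers` mapping the function returns before touching any process, which mirrors the lifecycle smoke case where nothing has been bound.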
71120

72-
### Local marathon (once thread bodies are wired)
121+
### Local marathon (after `setup()` + load/canary land)
73122

74-
Once thread bodies are implemented, the same command will run the
75-
full 2-hour marathon against the C++ marathon config. To run a shorter smoke
76-
during development, set `duration_min: 10` and trim
77-
`injections:` in the YAML.
123+
Once `setup()` launches a real 3P3D cluster and registers workers,
124+
the full 2-hour marathon runs via the same pytest entry point. For
125+
development, set `duration_min: 10` and trim `injections:` in the
126+
YAML.
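
As a rough sketch of the shortening step, the helper below caps `duration_min` and drops injections scheduled past the new duration. It works on a mapping that has already been loaded from the YAML. The function name `shorten_for_dev` and the per-injection time field `at_min` are placeholders, not the real schema; the actual field names are defined in `configs/README.md`.

```python
import copy


def shorten_for_dev(config: dict, duration_min: int = 10,
                    time_key: str = "at_min") -> dict:
    """Return a shortened copy of a marathon config mapping.

    Assumes `config` holds `duration_min` and an `injections` list whose
    entries carry a start offset in minutes under `time_key`. The key
    name is hypothetical; adjust it to the real schema.
    """
    short = copy.deepcopy(config)  # leave the caller's config untouched
    short["duration_min"] = duration_min
    # Keep only injections that would fire within the shortened run.
    short["injections"] = [
        inj for inj in short.get("injections", [])
        if inj.get(time_key, 0) < duration_min
    ]
    return short
```

Editing the YAML by hand achieves the same result; a helper like this is only convenient when the short run is generated from the nightly config rather than maintained as a separate file.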
78127

79128
## Pass criteria
80129
