[TRTLLM-12838][infra] enhance code coverage for catching subprocess data #14985

Open
crazydemo wants to merge 4 commits into NVIDIA:main from crazydemo:cbts-coverage-utils

Conversation

@crazydemo

@crazydemo crazydemo commented Jun 5, 2026

Collaborator

This PR is still being debugged.

Summary by CodeRabbit

  • New Features

    • Implemented per-test coverage attribution for enhanced test-level visibility
    • Added coverage auditing and validation tools for coverage baseline analysis
    • Introduced coverage stability comparison capability between CI runs
  • Documentation

    • Added comprehensive documentation for coverage utilities and their usage

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@crazydemo crazydemo requested review from a team as code owners June 5, 2026 02:31
@crazydemo crazydemo requested review from mlefeb01 and yuanjingx87 June 5, 2026 02:31
@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Contributor

📝 Walkthrough

This pull request introduces CBTS Layer C, a CI-only per-test coverage collection infrastructure that attributes code execution to specific pytest test nodeids across multi-process GPU test execution. The changes span Jenkins pipeline conditional gating, Python coverage bootstrap via sitecustomize, pytest plugin hooks for MPI and context switching, coverage.py template configuration, audit/analysis utilities, and integration validation.

Changes

CBTS Layer C Coverage Infrastructure

| Layer / File(s) | Summary |
|---|---|
| Pipeline gating and CBTS integration: jenkins/L0_Test.groovy | Introduces CBTS_DOGFOOD_STAGES allowlist and isCbtsStage() helper to conditionally enable CBTS coverage collection. Extends uploadResults() to fetch .coverage.<stage>* and sentinel files from remote nodes and package as cbts-<stage>.tar.gz. Adds cbtsMode flag to getPytestBaseCommandLine() to inject CBTS environment variables (PYTHONPATH, marker/sentinel paths), activate -p cbts_plugin, and adjust pytest-cov flags. Conditionally generates rich .coveragerc for CBTS stages (sysmon core, parallel settings, per-stage data file) or legacy config for non-CBTS stages. |
| Per-process coverage bootstrap: jenkins/scripts/cbts/coverage_utils/sitecustomize.py | Activates coverage at import time when CBTS_COVERAGE_CONFIG is set. Generates unique data file suffixes with hostname/pid/random to avoid deduplication. Spawns daemon thread for periodic cov.save() to handle SIGKILL'd processes. Reads CBTS_TEST_ID on startup and polls marker file for test context changes, calling cov.switch_context() on updates with immediate save. Includes daemon that waits for tensorrt_llm.mpi_session module and applies coverage-compatible MPI pool patch. |
| Pytest plugin for test attribution: jenkins/scripts/cbts/coverage_utils/cbts_plugin.py | Provides pytest hooks to enable per-test coverage attribution. Monkey-patches MpiPoolSession._start_mpi_pool to whitelist COVERAGE_* and PYTHON* environment variables for MPI worker processes, with upstream refactor guards. Hooks pytest_runtest_protocol to write current test nodeid to marker file, set CBTS_TEST_ID in os.environ, create per-nodeid sentinel files (sha1 hash), and switch coverage context to match test execution. |
| Coverage configuration template and generation: jenkins/scripts/cbts/coverage_utils/coveragerc.template, jenkins/scripts/cbts/coverage_utils/make_coveragerc.sh | Introduces Coverage.py template with sysmon core, parallel/multiprocess concurrency, per-stage data file paths, wheel+src source mappings, and omit patterns for tests/utilities. Shell utility validates required env vars, locates template relative to script, substitutes placeholders, and writes materialized .coveragerc to job workspace. |
| Coverage audit and analysis tools: jenkins/scripts/cbts/coverage_utils/coverage_audit.py, jenkins/scripts/cbts/coverage_utils/stability_diff.py | coverage_audit.py loads coverage contexts, counts product/engine files, computes engine body lines (excluding imports/defs), optionally matches junit test statuses, and classifies each test/context into statuses (OK, SUBPROCESS_LOST, MAIN_ONLY, EMPTY, TEST_FAILED, MISSING_*) based on coverage counts, sentinel presence, and junit metadata. Emits CSV/JSON reports and console summary. Scans pytest logs for save failures. stability_diff.py compares per-test covered line sets across two coverage databases, emitting per-file divergence details and exit code for regression detection. |
| End-to-end integration validation: jenkins/scripts/cbts/coverage_utils/e2e_smoke.sh | Bash script validating full CBTS pipeline: generates runtime .coveragerc from template, executes pytest integration test with cbts_plugin enabled, combines parallel .coverage.<STAGE>.* files, embeds Python assertions to verify test nodeid appears in coverage context and product code was hit, and invokes coverage_audit for non-blocking report generation. |
| Architecture and usage documentation: jenkins/scripts/cbts/coverage_utils/README.md | Documents CBTS_COVERAGE_CONFIG activation gate, directory file roles and responsibilities, manual pytest+coverage smoke test workflow, and design decisions including MPI env propagation via monkey-patch, per-test context switching via marker files, periodic saves for short-lived workers, and daemon-thread module loading polling. |
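
For orientation, the plugin-side bookkeeping described in the table (marker file, CBTS_TEST_ID, sha1-named sentinel) can be sketched with the standard library alone. This is an illustrative sketch, not the code in cbts_plugin.py: the function names `sentinel_name` and `announce_test` are hypothetical, and the write-then-rename step is an assumption added here so a polling reader never sees a half-written nodeid.

```python
import hashlib
import os
from pathlib import Path


def sentinel_name(nodeid: str) -> str:
    """Map a pytest nodeid to a fixed-length, filesystem-safe file name."""
    return hashlib.sha1(nodeid.encode("utf-8")).hexdigest()


def announce_test(nodeid: str, marker_file: Path, sentinel_dir: Path) -> Path:
    """Publish the current test id and record that the test started.

    Hypothetical helper: the real plugin may order or name these steps differently.
    """
    marker_file.parent.mkdir(parents=True, exist_ok=True)
    sentinel_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: write to a temp file and rename, so readers never observe
    # a partially written nodeid.
    tmp = marker_file.with_name(marker_file.name + ".tmp")
    tmp.write_text(nodeid, encoding="utf-8")
    os.replace(tmp, marker_file)

    # Processes spawned after this point can read the id from their environment.
    os.environ["CBTS_TEST_ID"] = nodeid

    # One empty file per nodeid; its presence shows the test was announced.
    sentinel = sentinel_dir / sentinel_name(nodeid)
    sentinel.touch()
    return sentinel
```

Hashing the nodeid keeps sentinel names short and free of characters such as `/`, `[` and `::` that appear in parametrized test ids.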

Sequence Diagram(s)

No sequence diagrams are necessary for this change. The infrastructure involves many independent components (pipeline gating, coverage bootstrap, pytest hooks, configuration, audit tools) that do not form a single coherent request-response flow. The interactions are primarily asynchronous (daemon threads for saves and polling) or trigger-based (pytest hooks, marker file changes) rather than sequential message flows.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

This change introduces a substantial, multi-layered coverage infrastructure with heterogeneous components spanning Groovy pipeline logic, Python coverage bootstrap and context switching, pytest plugin hooks with MPI patching, configuration templating, and multiple analysis/audit tools. The logic density is moderate to high across sitecustomize (daemon threads, polling), cbts_plugin (MPI inspection and patching), coverage_audit (context classification and junit parsing), and pytest output handling. The variety of implementation patterns (Groovy scripting, Python background threads, pytest hook wrapping, Bash templating, SQLite queries) and the coupling between components (jenkins → python imports → sitecustomize → cbts_plugin → coverage contexts → audit classification) require careful attention to understand the full data flow and failure modes.

Suggested reviewers

  • tburt-nv
  • yiqingy0
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 53.85% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ⚠️ Warning | PR description is incomplete and does not follow the template. It contains only a placeholder status note without substantive content in required sections. | Fill in the Description and Test Coverage sections with clear explanations. Update PR title to follow [TICKET][type] format. Complete or remove the PR Checklist section. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: enhancing code coverage infrastructure to capture subprocess data, with specific JIRA reference and proper formatting. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
jenkins/scripts/cbts/coverage_utils/coverage_audit.py (1)

1-556: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Run ruff format to fix formatting violations.

The pipeline reports that ruff-format modified 4 files. Please run the formatter locally and commit the changes:

ruff format jenkins/scripts/cbts/coverage_utils/coverage_audit.py
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@jenkins/scripts/cbts/coverage_utils/coverage_audit.py` around lines 1 - 556,
Run the project formatter (ruff) on the modified module to fix style issues and
commit the result: execute `ruff format
jenkins/scripts/cbts/coverage_utils/coverage_audit.py` (or run `ruff format` for
the repo) and add the updated file to your branch so the PR contains the
formatted version of coverage_audit.py; no code logic changes are required—only
apply and commit the ruff formatting edits for this module.
🧹 Nitpick comments (3)
jenkins/scripts/cbts/coverage_utils/README.md (2)

19-19: 💤 Low value

Consider clarifying the per-test context-switching mechanism.

The description states the plugin does "per-test calls cov.switch_context()", but based on the marker-file polling architecture described in lines 75-78, it appears the plugin writes the marker file per test, and sitecustomize.py in worker processes reads that file and calls switch_context(). If the plugin also calls switch_context() directly in the pytest main process, that would be worth mentioning explicitly to distinguish the two mechanisms.

Suggested clarification
-| `cbts_plugin.py` | Pytest plugin (`-p cbts_plugin`). Monkey-patches `mpi_session.MpiPoolSession._start_mpi_pool` to widen env propagation; per-test calls `cov.switch_context()`. |
+| `cbts_plugin.py` | Pytest plugin (`-p cbts_plugin`). Monkey-patches `mpi_session.MpiPoolSession._start_mpi_pool` to widen env propagation; writes per-test marker files to trigger `cov.switch_context()` in workers. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@jenkins/scripts/cbts/coverage_utils/README.md` at line 19, Update the README
to explicitly clarify the per-test context-switching flow: state that
cbts_plugin.py (the pytest plugin) writes a per-test marker file (as described
by the marker-file polling in the README) when running tests and that the
worker-side sitecustomize.py polls/reads that marker file and calls
cov.switch_context() within the worker process; also explicitly indicate whether
cbts_plugin.py or mpi_session.MpiPoolSession._start_mpi_pool in the main pytest
process ever calls cov.switch_context() directly (make clear if it does not), so
readers can distinguish the plugin's marker-file writer role from the worker's
switch_context() reader role.
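
To make the reader role in this comment concrete, here is a minimal sketch of a worker-side poller. It assumes only what the review text states (a marker file that holds the current nodeid); the class name `MarkerPoller` and the `on_change` callback are hypothetical. In a real worker the callback would presumably wrap `cov.switch_context(nodeid)` followed by a save, but that wiring is not shown in this PR page.

```python
import threading
from collections.abc import Callable
from pathlib import Path


class MarkerPoller:
    """Poll a marker file and report each new test id exactly once."""

    def __init__(self, marker_file: Path, on_change: Callable[[str], None]) -> None:
        self._marker_file = marker_file
        self._on_change = on_change
        self._last_seen = ""

    def poll_once(self) -> bool:
        """Return True if the marker held a new, non-empty test id."""
        try:
            current = self._marker_file.read_text(encoding="utf-8").strip()
        except (OSError, UnicodeError):
            return False  # marker missing, unreadable, or mid-write
        if not current or current == self._last_seen:
            return False
        self._last_seen = current
        self._on_change(current)
        return True

    def run_forever(self, interval_s: float, stop: threading.Event) -> None:
        """Poll until `stop` is set; intended to run in a daemon thread."""
        while not stop.wait(interval_s):
            self.poll_once()
```

The poller only reads the marker file; writing it stays with the pytest main process, which is the separation of roles the comment asks the README to spell out.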

50-50: 💤 Low value

Clarify the purpose of --cov-report= flag.

The pytest command includes --cov-report= (empty value), which is a pytest-cov flag. Since coverage is being collected via sitecustomize.py and coverage.py directly (not via pytest-cov), this flag's purpose is not immediately clear. It likely suppresses pytest-cov output if the plugin is installed, but a brief inline comment would help readers understand why it's present.

Suggested clarification
+# --cov-report= suppresses pytest-cov output if installed; coverage is collected via sitecustomize.py
 pytest -p cbts_plugin --cov-report= -vs \
     "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8B::test_nvfp4"

Or update the comment before the pytest command to explain this.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@jenkins/scripts/cbts/coverage_utils/README.md` at line 50, Add a brief inline
comment explaining the purpose of the `--cov-report=` flag in the pytest
invocation: state that it intentionally sets an empty report value to suppress
pytest-cov's default output when coverage is being managed externally via
sitecustomize.py and coverage.py, so readers understand it's to avoid duplicate
or noisy coverage output rather than a typo; update the README line containing
the pytest command (`pytest -p cbts_plugin --cov-report= -vs \`) or the
preceding comment block to include this clarification.
jenkins/scripts/cbts/coverage_utils/cbts_plugin.py (1)

59-60: ⚡ Quick win

Add function type annotations for new plugin hooks/helpers.

install_mpi_pool_patch, pytest_configure, and pytest_runtest_protocol should be annotated (including return types) to match repository typing standards.

💡 Suggested fix
+from collections.abc import Generator
@@
-def install_mpi_pool_patch(*, raise_on_refactor=True):
+def install_mpi_pool_patch(*, raise_on_refactor: bool = True) -> bool:
@@
-def pytest_configure(config):  # noqa: D401 - pytest hook
+def pytest_configure(config: pytest.Config) -> None:  # noqa: D401 - pytest hook
@@
-def pytest_runtest_protocol(item, nextitem):  # noqa: D401 - pytest hook
+def pytest_runtest_protocol(
+    item: pytest.Item, nextitem: pytest.Item | None
+) -> Generator[None, None, None]:  # noqa: D401 - pytest hook

As per coding guidelines: “Always annotate functions; make the return type None if the function does not return anything.”

Also applies to: 116-117, 123-124

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@jenkins/scripts/cbts/coverage_utils/cbts_plugin.py` around lines 59 - 60, Add
explicit type annotations for the new plugin hooks/helpers: annotate
install_mpi_pool_patch as def install_mpi_pool_patch(*, raise_on_refactor: bool
= True) -> None, annotate pytest_configure with the pytest config type (e.g.,
def pytest_configure(config: "pytest.Config") -> None) and annotate
pytest_runtest_protocol with its expected signature and return type (e.g., def
pytest_runtest_protocol(item: "pytest.Item", nextitem: Optional["pytest.Item"])
-> Optional[bool] or -> None depending on current usage). Ensure
imports/forward-references are added if needed (Optional, typing) and use ->
None where the function does not return a value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 937-943: runIsolatedTests() currently appends the "--cov-append"
pytest flag unconditionally which breaks CBTS runs because cbtsMode removes
pytest-cov flags; change runIsolatedTests() to only add "--cov-append" when
pytest-cov is enabled (i.e., when cbtsMode is false or when the cov flags are
present). Locate the check/array that builds pytestArgs in runIsolatedTests()
and gate the push/append of "--cov-append" behind a condition that checks the
cbtsMode variable (or the presence of "--cov=" flags), so CBTS paths never
receive pytest-cov-specific arguments.

In `@jenkins/scripts/cbts/coverage_utils/cbts_plugin.py`:
- Around line 148-164: Save the previous os.environ.get("CBTS_TEST_ID") before
setting it to nodeid, then wrap the sentinel creation,
coverage.switch_context(nodeid) and yield in a try/finally so that after the
yield you restore the original CBTS_TEST_ID (reassign the saved value if not
None, or delete the key if it was absent) to avoid leaking the test id to later
subprocesses; apply this change around the existing symbols CBTS_TEST_ID,
sentinel creation/Path(sentinel).touch(), cov.switch_context(nodeid) and yield.
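
The save-and-restore pattern requested above can be sketched as a small context manager. This is a generic illustration, not the plugin's code; `scoped_env` is a hypothetical name, and a hook wrapper could apply the same try/finally directly around its `yield`.

```python
import os
from collections.abc import Iterator
from contextlib import contextmanager


@contextmanager
def scoped_env(name: str, value: str) -> Iterator[None]:
    """Set os.environ[name] for the duration of the block, then restore it."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore exactly what was there before: re-set the old value,
        # or remove the key if it was absent.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous
```

Because the restore runs in `finally`, the test id does not leak to later subprocesses even when the wrapped test raises.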

In `@jenkins/scripts/cbts/coverage_utils/coverage_audit.py`:
- Around line 14-87: The module-level docstring contains backslashes in the
Usage example which triggers ruff D301; change the triple-quoted string at the
top of coverage_audit.py to a raw string literal by prefixing it with r (i.e.,
replace """ with r""" for the module docstring) so backslashes are treated
literally; ensure the string start that encloses the entire docstring (the one
documenting Inputs/Outputs/Usage) is the only change and keep all text
unchanged.
- Around line 233-243: In the _decl_import_lines function docstring, insert a
blank line between the one-line summary and the rest of the description (to
satisfy ruff D205) and correct the typo by replacing "unparseable" with
"unparsable" so the docstring follows Google-style formatting and codespell
rules.
- Around line 155-157: Update the docstring that begins "Strip the dotted module
prefix from a junit ``classname`` when the ``file`` attribute is present." by
inserting a single blank line between that one-line summary and the following
description text so the summary and description are separated (PEP 257 /
Google-style docstring). Locate the docstring for the function or method that
contains that summary (the one handling classname/file attributes in
coverage_audit.py) and add the blank line immediately after the first summary
line.

In `@jenkins/scripts/cbts/coverage_utils/e2e_smoke.sh`:
- Around line 75-84: The step banner prints inconsistent progress counts: change
the "[1/4]" banner to match the other steps which use "/5". Update the echo that
generates the materializing .coveragerc banner (the line echo "==> [1/4]
Materializing .coveragerc" near the COVERAGERC assignment) so it reads "[1/5]"
to be consistent with later banners; no logic changes to COVERAGERC, SCRIPT_DIR,
or make_coveragerc.sh are needed.
- Around line 190-201: The audit step is labeled "non-blocking" but currently
runs under set -e so any failure aborts the script; make the python3
"${SCRIPT_DIR}/coverage_audit.py" invocation genuinely non-blocking by
preventing its non-zero exit from propagating (for example wrap the call in an
if/then that logs a warning on failure, or run it with "|| true", or temporarily
disable set -e around the call), while still writing to AUDIT_CSV and AUDIT_JSON
and preserving the existing variables (COMBINED, JUNIT_XML, PYTEST_LOG,
CBTS_SENTINEL_DIR) so the rest of the script continues even if coverage_audit.py
fails.
- Around line 70-72: Before calling rm -rf on JOB_WORKSPACE in e2e_smoke.sh, add
a safety guard that validates JOB_WORKSPACE is set, non-empty, not root ("/"),
and matches an expected pattern or minimum length (e.g., contains "workspace" or
a known CI workspace prefix) to prevent accidental deletes; if the check fails,
abort with an error message instead of running rm -rf, then proceed to mkdir -p
when the guard passes. Ensure the guard references the JOB_WORKSPACE variable
name so reviewers can find and verify it.

In `@jenkins/scripts/cbts/coverage_utils/README.md`:
- Around line 36-57: The README's smoke test steps omit CBTS_MARKER_FILE and
CBTS_SENTINEL_DIR which Jenkins sets and which sitecustomize.py partially
documents; update the README (in the coverage_utils README.md) to either export
CBTS_MARKER_FILE and CBTS_SENTINEL_DIR in the example (using the same pattern as
COV_DIR/JOB_WORKSPACE/STAGE_NAME) or add a short note documenting their default
values (mentioning CBTS_MARKER_FILE default of /tmp/cbts/current_test.txt and
the default or required value for CBTS_SENTINEL_DIR) so users and scripts
relying on sitecustomize.py know what paths will be used.

In `@jenkins/scripts/cbts/coverage_utils/sitecustomize.py`:
- Around line 82-99: Concurrent access to the shared Coverage object is
unprotected: create a module-level threading.Lock (e.g., _cov_lock =
threading.Lock()) and serialize all operations that touch the Coverage instance
by acquiring the lock around calls to cov.save(), cov.switch_context(),
cov.stop() (and any other cov.* calls) — e.g., in _periodic_save(), the atexit
handler, the marker/context-switching functions, and the shutdown code wrap
those calls in a with _cov_lock: block so saves and context switches cannot
interleave.
- Around line 190-196: The try/except around cov.save() currently swallows all
errors; change it to catch the smallest reasonable exception set and surface
failures instead of passing silently — e.g., replace the bare except with except
Exception as e: and log the failure (using
logging.getLogger(__name__).exception(...) or similar) including the exception
message and traceback so missing subprocess coverage is visible; if a logger is
not available, create one with logging.getLogger and log at WARNING/ERROR level
rather than silently ignoring, keeping the cov.save() call and surrounding try
block intact.
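
Both sitecustomize.py findings (serialize access to the Coverage object, and log save failures instead of swallowing them) fit in one small wrapper. The sketch below is an assumption-laden illustration: `SerializedCoverage` and `CoverageLike` are hypothetical names, and the wrapped object is only required to expose `save()` and `switch_context()`, which `coverage.Coverage` does.

```python
import logging
import threading
from typing import Protocol

logger = logging.getLogger(__name__)


class CoverageLike(Protocol):
    """The subset of the coverage API this sketch relies on."""

    def save(self) -> None: ...

    def switch_context(self, new_context: str) -> None: ...


class SerializedCoverage:
    """Serialize save/switch_context calls coming from several threads."""

    def __init__(self, cov: CoverageLike) -> None:
        self._cov = cov
        self._lock = threading.Lock()

    def switch_and_save(self, context: str) -> None:
        # Hold the lock across both calls so a periodic save cannot
        # interleave with the context switch.
        with self._lock:
            self._cov.switch_context(context)
            self._save_locked()

    def save(self) -> None:
        with self._lock:
            self._save_locked()

    def _save_locked(self) -> None:
        # Caller must hold self._lock. Failures are logged with a traceback
        # rather than silently ignored.
        try:
            self._cov.save()
        except Exception:
            logger.exception("coverage save failed; subprocess data may be missing")

    def run_periodic_save(self, interval_s: float, stop: threading.Event) -> None:
        """Save every `interval_s` seconds until `stop` is set."""
        while not stop.wait(interval_s):
            self.save()
```

A plain `threading.Lock` is enough here because `_save_locked` never re-acquires it; an atexit handler would call `save()` through the same wrapper.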

In `@jenkins/scripts/cbts/coverage_utils/stability_diff.py`:
- Around line 58-59: The single-line if statement uses a semicolon to combine
print and return (if len(argv) != 4: print(__doc__, file=sys.stderr); return 2);
split this into a proper block by replacing the one-liner with a multi-line if
block that calls print(__doc__, file=sys.stderr) on its own line and then
executes return 2 on the next line so argv, __doc__, sys.stderr and the return
are each on separate lines and PEP8 E702 is satisfied.
- Around line 26-27: Add an explicit return type annotation to the
lines_for_context function: change its signature to return dict[str,
frozenset[int]] | None so callers and type checkers know it may return either a
mapping of file paths to frozensets of line numbers or None; update the function
definition for lines_for_context(con, ctx_name) accordingly using Python 3.10+
native generics.
- Line 57: The function definition for main (def main(argv)) lacks a return type
annotation; update the signature to include the correct return type (use -> None
if it does not return a value) so change def main(argv) to def main(argv) ->
None and ensure any call sites or tests expecting a return remain valid.
- Around line 14-20: The module-level docstring in stability_diff.py currently
lacks a blank line between the one-line summary ("Stability oracle: diff
per-test coverage sets between two CBTS runs of the same test list.") and the
rest of the description; update the triple-quoted module docstring to insert a
single blank line after the summary line so the docstring follows PEP
257/Google-style conventions (edit the top-level triple-quoted string in
stability_diff.py).
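
For context on the `dict[str, frozenset[int]]` shape mentioned for lines_for_context, a per-test comparison between two runs could look like the sketch below. It illustrates the idea under stated assumptions and is not stability_diff.py itself: `diff_line_sets` and `exit_code` are hypothetical names, and using exit code 1 for divergence is an assumption (the review only shows `return 2` for a usage error).

```python
def diff_line_sets(
    run_a: dict[str, frozenset[int]],
    run_b: dict[str, frozenset[int]],
) -> dict[str, tuple[frozenset[int], frozenset[int]]]:
    """Return {path: (only_in_a, only_in_b)} for files whose covered lines differ."""
    diverged: dict[str, tuple[frozenset[int], frozenset[int]]] = {}
    for path in sorted(run_a.keys() | run_b.keys()):
        a = run_a.get(path, frozenset())
        b = run_b.get(path, frozenset())
        if a != b:
            diverged[path] = (a - b, b - a)
    return diverged


def exit_code(diverged: dict[str, tuple[frozenset[int], frozenset[int]]]) -> int:
    """Assumed convention: 0 when both runs agree, 1 when any file diverges."""
    return 1 if diverged else 0
```

A file present in only one run shows up with all of its lines on one side of the tuple, so missing files and partially different files are reported the same way.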


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 90090ba6-6904-4e43-88e4-5d47f507ee1c

📥 Commits

Reviewing files that changed from the base of the PR and between df2d5b9 and e5bf8e9.

📒 Files selected for processing (9)
  • jenkins/L0_Test.groovy
  • jenkins/scripts/cbts/coverage_utils/README.md
  • jenkins/scripts/cbts/coverage_utils/cbts_plugin.py
  • jenkins/scripts/cbts/coverage_utils/coverage_audit.py
  • jenkins/scripts/cbts/coverage_utils/coveragerc.template
  • jenkins/scripts/cbts/coverage_utils/e2e_smoke.sh
  • jenkins/scripts/cbts/coverage_utils/make_coveragerc.sh
  • jenkins/scripts/cbts/coverage_utils/sitecustomize.py
  • jenkins/scripts/cbts/coverage_utils/stability_diff.py

Comment thread jenkins/L0_Test.groovy Outdated
Comment thread jenkins/scripts/cbts/coverage_utils/cbts_plugin.py
Comment thread jenkins/scripts/cbts/coverage_utils/coverage_audit.py Outdated
Comment thread jenkins/scripts/cbts/coverage_utils/coverage_audit.py Outdated
Comment thread jenkins/scripts/cbts/coverage_utils/coverage_audit.py Outdated
Comment thread jenkins/scripts/cbts/coverage_utils/sitecustomize.py Outdated
Comment thread jenkins/scripts/cbts/coverage_utils/stability_diff.py Outdated
Comment thread jenkins/scripts/cbts/coverage_utils/stability_diff.py Outdated
Comment thread jenkins/scripts/cbts/coverage_utils/stability_diff.py Outdated
Comment thread jenkins/scripts/cbts/coverage_utils/stability_diff.py Outdated
@crazydemo crazydemo force-pushed the cbts-coverage-utils branch 2 times, most recently from 58d9454 to 6967fc4 on June 5, 2026 03:03
@crazydemo

Collaborator Author

/bot run --extra-stage "DGX_H100-PyTorch-1"

@tensorrt-cicd

Collaborator

PR_Github #52225 [ run ] triggered by Bot. Commit: 6967fc4 Link to invocation

@crazydemo

Collaborator Author

/bot kill

@crazydemo

Collaborator Author

/bot run --stage-list "DGX_H100-PyTorch-1"

@tensorrt-cicd

Collaborator

PR_Github #52248 [ run ] triggered by Bot. Commit: 6967fc4 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #52225 [ run ] completed with state ABORTED. Commit: 6967fc4

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #52248 [ run ] completed with state FAILURE. Commit: 6967fc4
/LLM/main/L0_MergeRequest_PR pipeline #41562 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@crazydemo

Collaborator Author

/bot run --stage-list "A30-PyTorch-1"

@tensorrt-cicd

Collaborator

PR_Github #52306 [ run ] triggered by Bot. Commit: 697959a Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #52306 [ run ] completed with state SUCCESS. Commit: 697959a
/LLM/main/L0_MergeRequest_PR pipeline #41612 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@crazydemo crazydemo force-pushed the cbts-coverage-utils branch from 70f2f22 to a194444 on June 6, 2026 10:55
@crazydemo

Collaborator Author

/bot run --stage-list "A30-PyTorch-1"

@tensorrt-cicd

Collaborator

PR_Github #52502 [ run ] triggered by Bot. Commit: a194444 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #52502 [ run ] completed with state SUCCESS. Commit: a194444
/LLM/main/L0_MergeRequest_PR pipeline #41793 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@crazydemo

Collaborator Author

/bot run --stage-list "A30-PyTorch-1"

@tensorrt-cicd

Collaborator

PR_Github #52575 [ run ] triggered by Bot. Commit: aced00d Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #52575 [ run ] completed with state SUCCESS. Commit: aced00d
/LLM/main/L0_MergeRequest_PR pipeline #41857 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@crazydemo

Collaborator Author

/bot run --stage-list "DGX_H100-PyTorch-1, A30-AutoDeploy-1, DGX_H100-4_GPUs-PyTorch-1"

@tensorrt-cicd

Collaborator

PR_Github #52637 [ run ] triggered by Bot. Commit: aced00d Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #52637 [ run ] completed with state FAILURE. Commit: aced00d
/LLM/main/L0_MergeRequest_PR pipeline #41916 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@crazydemo

Collaborator Author

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-GptOss-1, A30-AutoDeploy-1"

Add jenkins/scripts/cbts/coverage_utils/ (CI infrastructure only; every
file is a no-op unless CBTS_COVERAGE_CONFIG is set):

- sitecustomize.py: per-process coverage bootstrap using the sysmon core,
  with periodic + context-switch saves and a lock serializing all
  switch_context/save/stop calls across the daemon threads and atexit.
- cbts_plugin.py: pytest plugin (-p cbts_plugin) that switches the
  per-test context and patches mpi_session._start_mpi_pool so MPI workers
  inherit the coverage env.
- coveragerc.template + make_coveragerc.sh: runtime .coveragerc.
- coverage_summary.py: per-stage liveness line (covered/ran test cases),
  stdlib-only and read-only.
- README.md: files, the post-merge gate, output layout, and how to query
  the merged DB.
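
The locking scheme described for sitecustomize.py could look roughly like the
sketch below. It is not the PR's code: SerializedCoverage, bootstrap and
make_cov are hypothetical names, and the wrapped object only needs start,
switch_context, save and stop methods (coverage.Coverage provides these).
Saving before each context switch is an assumption about what
"context-switch saves" means.

```python
import os
import threading


class SerializedCoverage:
    """Wrap a coverage-like object so switch_context/save/stop never overlap."""

    def __init__(self, cov, save_interval=30.0):
        self._cov = cov
        self._lock = threading.Lock()   # one lock for every mutating call
        self._stopped = False
        self._wake = threading.Event()  # set by stop() to end the saver thread
        self._interval = save_interval

    def switch_context(self, name):
        with self._lock:
            if not self._stopped:
                self._cov.save()  # flush data recorded under the old context
                self._cov.switch_context(name)

    def save(self):
        with self._lock:
            if not self._stopped:
                self._cov.save()

    def stop(self):
        with self._lock:
            if self._stopped:
                return  # idempotent: atexit and explicit calls may both fire
            self._stopped = True
            self._cov.stop()
            self._cov.save()
        self._wake.set()

    def start_periodic_saves(self):
        def loop():
            # Event.wait returns False on timeout, True once stop() has run.
            while not self._wake.wait(self._interval):
                self.save()

        thread = threading.Thread(target=loop, name="cbts-saver", daemon=True)
        thread.start()
        return thread


def bootstrap(make_cov, environ=os.environ):
    """Return a started wrapper, or None when CBTS_COVERAGE_CONFIG is unset."""
    rcfile = environ.get("CBTS_COVERAGE_CONFIG")
    if not rcfile:
        return None  # no-op path: the process runs uninstrumented
    cov = make_cov(rcfile)
    cov.start()
    return SerializedCoverage(cov)
```

In a real sitecustomize.py, make_cov would presumably build
coverage.Coverage(config_file=rcfile, data_suffix=True), and the wrapper's
stop method would be registered with atexit so the final save goes through
the same lock as the periodic saves.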

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>

Wire coverage_utils into L0_Test.groovy and slurm_run.sh. isCbtsStage()
gates each stage on CBTS_PIPELINE_ELIGIBLE (post-merge pipelines only --
official PostMerge or /bot run --post-merge), the ENABLE_CBTS_COVERAGE
global kill-switch and CBTS_EXCLUDE_STAGES, and skips perf stages. CBTS
stages render the sysmon .coveragerc and run -p cbts_plugin; every other
stage gets an empty rcfile and runs uninstrumented (no legacy pytest-cov).
Per-process .coverage files are gathered back (SLURM scp into cbts/, K8s
local), and coverage_summary.py logs covered/ran per stage right after
pytest.
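
The gating order above can be sketched in Python for illustration (the actual
isCbtsStage() lives in Groovy). The value encodings are assumptions:
CBTS_PIPELINE_ELIGIBLE is taken to be the string "true", ENABLE_CBTS_COVERAGE
to default to enabled, CBTS_EXCLUDE_STAGES to be a comma-separated list, and
perf stages to be recognizable by "perf" in the stage name.

```python
def is_cbts_stage(stage_name, env):
    """Decide whether a stage should run with CBTS coverage instrumentation."""
    # 1. Pipeline-level eligibility (post-merge pipelines only).
    if env.get("CBTS_PIPELINE_ELIGIBLE", "").lower() != "true":
        return False
    # 2. Global kill-switch; assumed to default to enabled when unset.
    if env.get("ENABLE_CBTS_COVERAGE", "true").lower() == "false":
        return False
    # 3. Per-stage exclusion list; comma-separated format is an assumption.
    excluded = {
        name.strip()
        for name in env.get("CBTS_EXCLUDE_STAGES", "").split(",")
        if name.strip()
    }
    if stage_name in excluded:
        return False
    # 4. Perf stages are skipped; matching on the name is an assumption.
    if "perf" in stage_name.lower():
        return False
    return True
```

Checking the cheap pipeline-wide conditions first means a stage on an
ineligible pipeline never consults the exclusion list at all.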

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>

L0_MergeRequest.groovy's Test Coverage stage combines all stages'
.coverage files and uploads the merged DB to
${UPLOAD_PATH}/cbts-coverage/coverage.sqlite. Batched mv for the gather,
10-min timeout, no in-CI HTML render (consumers render on demand).
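
A consumer of the merged DB might query it along these lines. This is a
sketch that assumes the file uses coverage.py's standard SQLite layout (file,
context and line_bits tables) and line rather than branch measurement;
contexts_touching is a hypothetical helper, not part of the PR.

```python
import sqlite3


def contexts_touching(conn, path_suffix):
    """Return sorted non-empty contexts that recorded lines in matching files."""
    rows = conn.execute(
        "SELECT f.path, c.context "
        "FROM line_bits AS lb "
        "JOIN file AS f ON f.id = lb.file_id "
        "JOIN context AS c ON c.id = lb.context_id"
    )
    # Filter in Python so '_' and '%' in paths are not treated as wildcards.
    hits = {ctx for path, ctx in rows if ctx and path.endswith(path_suffix)}
    return sorted(hits)
```

Typical use would be opening the downloaded coverage.sqlite with
sqlite3.connect and passing a source-file suffix; the empty context is
dropped because it holds data that was not attributed to any test.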

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>

@crazydemo crazydemo force-pushed the cbts-coverage-utils branch from 5849967 to dd70f53 Compare June 11, 2026 08:57
@crazydemo

Collaborator Author

/bot run --post-merge --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #53537 [ run ] triggered by Bot. Commit: dd70f53 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53537 [ run ] completed with state ABORTED. Commit: dd70f53

Link to invocation

@crazydemo

Collaborator Author

/bot run --post-merge --stage-list "A30-AutoDeploy-1"

@tensorrt-cicd

Collaborator

PR_Github #54058 [ run ] triggered by Bot. Commit: 9d2d09f Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54058 [ run ] completed with state FAILURE. Commit: 9d2d09f
/LLM/main/L0_MergeRequest_PR pipeline #43142 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@crazydemo

Collaborator Author

/bot run --post-merge --stage-list "A30-AutoDeploy-1"

@tensorrt-cicd

Collaborator

PR_Github #54060 [ run ] triggered by Bot. Commit: 9d2d09f Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54060 [ run ] completed with state SUCCESS. Commit: 9d2d09f
/LLM/main/L0_MergeRequest_PR pipeline #43144 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@crazydemo

Collaborator Author

/bot run --post-merge --stage-list "A30-PyTorch-1"

@tensorrt-cicd

Collaborator

PR_Github #54085 [ run ] triggered by Bot. Commit: 9d2d09f Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54085 [ run ] completed with state FAILURE. Commit: 9d2d09f
/LLM/main/L0_MergeRequest_PR pipeline #43169 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

…entry

Nested inner pytests spawned by test_unittests_v2 run as pytest-main, so
sitecustomize skipped the CBTS_TEST_ID context switch and their in-process
coverage drained to the empty context. Switch context from CBTS_TEST_ID for
nested inner pytests too, attributing their coverage to the test-db entry
without loading cbts_plugin in the inner pytest.
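
The fix can be pictured as one extra step in the bootstrap: when the parent
test exported CBTS_TEST_ID, the child process adopts it as its coverage
context. A minimal sketch, where apply_inherited_context is a hypothetical
name and cov is any object with a switch_context method:

```python
import os


def apply_inherited_context(cov, environ=os.environ):
    """Switch to the parent's test context if CBTS_TEST_ID is set.

    Returns True when a switch happened, False when the variable is absent
    or blank (coverage then stays in the default, empty context).
    """
    test_id = environ.get("CBTS_TEST_ID", "").strip()
    if not test_id:
        return False
    cov.switch_context(test_id)
    return True
```

Because the decision depends only on the environment, the nested inner
pytest needs no plugin of its own to be attributed correctly.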

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>

@crazydemo crazydemo force-pushed the cbts-coverage-utils branch from 9d2d09f to 83437db Compare June 14, 2026 09:36
@crazydemo

Collaborator Author

/bot run --post-merge --stage-list "A30-PyTorch-1"

@tensorrt-cicd

Collaborator

PR_Github #54108 [ run ] triggered by Bot. Commit: 83437db Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54108 [ run ] completed with state SUCCESS. Commit: 83437db
/LLM/main/L0_MergeRequest_PR pipeline #43191 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@crazydemo

Collaborator Author

/bot run --post-merge --stage-list "A10-PyTorch-1"

@tensorrt-cicd

Collaborator

PR_Github #54118 [ run ] triggered by Bot. Commit: 83437db Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54118 [ run ] completed with state SUCCESS. Commit: 83437db
/LLM/main/L0_MergeRequest_PR pipeline #43202 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@crazydemo

Collaborator Author

/bot run --post-merge --stage-list "A10-PyTorch-1"

@tensorrt-cicd

Collaborator

PR_Github #54202 [ run ] triggered by Bot. Commit: 83437db Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54202 [ run ] completed with state FAILURE. Commit: 83437db
/LLM/main/L0_MergeRequest_PR pipeline #43280 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@crazydemo

Collaborator Author

/bot run --post-merge --stage-list "B300-PyTorch-1"

@tensorrt-cicd

Collaborator

PR_Github #54212 [ run ] triggered by Bot. Commit: 83437db Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54212 [ run ] completed with state SUCCESS. Commit: 83437db
/LLM/main/L0_MergeRequest_PR pipeline #43290 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation
