Nvtx tracer #1087

Open

Ohm-Rishabh wants to merge 10 commits into hao-ai-lab:main from Ohm-Rishabh:nvtx_tracer

Conversation

@Ohm-Rishabh
Contributor

This is the PR for adding NVTX range markers in the codebase.

To get a trace: python fastvideo/tests/training/Vanilla/mfu_calculation.py --profile

To add a custom range: with nvtx_range("range name"): ...code block...
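For reference, a minimal self-contained sketch of such a context manager, assuming it is layered on torch.cuda.nvtx (the actual fastvideo.profiler implementation may differ); it degrades to no-ops so the sketch also runs without a CUDA-enabled PyTorch build:

```python
from contextlib import contextmanager

# Probe for usable NVTX bindings; fall back to no-ops on machines
# without a CUDA-enabled PyTorch build (an assumption of this sketch).
try:
    from torch.cuda import nvtx as _nvtx
    _nvtx.range_push("probe")
    _nvtx.range_pop()
except Exception:
    class _nvtx:  # type: ignore[no-redef]
        @staticmethod
        def range_push(message):
            pass

        @staticmethod
        def range_pop():
            pass


@contextmanager
def nvtx_range(name):
    """Wrap a code block in a named NVTX range (visible in Nsight Systems)."""
    _nvtx.range_push(name)
    try:
        yield
    finally:
        _nvtx.range_pop()


# Usage, mirroring the PR description:
with nvtx_range("range name"):
    result = sum(range(10))  # ...code block...
```

The try/finally ensures range_pop always runs, so an exception inside the block cannot leave the NVTX range stack unbalanced.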

@gemini-code-assist
Contributor

Summary of Changes

Hello @Ohm-Rishabh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the profiling capabilities of the codebase by integrating NVTX range markers into key computational graphs and the training loop. This allows developers to gain deeper insights into performance bottlenecks using NVIDIA's profiling tools. Additionally, the changes include updates to a training test script to leverage these new profiling features and minor cleanups within the training pipeline for improved stability and data logging.

Highlights

  • NVTX Profiling Integration: Introduced NVTX (NVIDIA Tools Extension) range markers across critical components of the model architecture and training pipeline, enabling detailed performance profiling with tools like Nsight Systems.
  • New NVTX Utilities: Added dedicated nvtx_range, nvtx_annotate, and nvtx_mark utilities in fastvideo/profiler.py to provide flexible and independent NVTX tracing capabilities.
  • MFU Calculation Test Enhancements: Updated the MFU calculation test script to support Nsight profiling, adjusted training parameters (e.g., batch size, max steps), and refined environment variable settings for more robust testing and profiling.
  • Training Pipeline Refinements: Streamlined the training pipeline by removing audio muxing logic and improving the robustness of metric logging, ensuring accurate capture of model architecture details.


Changelog
  • examples/training/finetune/wan_t2v_1.3B/crush_smol/preprocess_wan_data_t2v.sh
    • Updated the GPU assignment and video batch size for the preprocessing script.
  • fastvideo/attention/layer.py
    • Imported nvtx_range for profiling.
    • Wrapped QKV stacking, all-to-all scatter, rotary embedding application, QKV preprocessing, attention implementation, and all-to-all gather operations with NVTX ranges in DistributedAttention.forward.
    • Wrapped rotary embedding application and attention implementation with NVTX ranges in LocalAttention.forward.
  • fastvideo/layers/layernorm.py
    • Imported nvtx_range for profiling.
    • Wrapped RMSNorm.forward_native with an NVTX range.
    • Wrapped ScaleResidual.forward with an NVTX range.
    • Wrapped FP32LayerNorm.forward with an NVTX range.
    • Wrapped ScaleResidualLayerNormScaleShift.forward with an NVTX range.
    • Wrapped LayerNormScaleShift.forward with an NVTX range.
  • fastvideo/layers/linear.py
    • Imported nvtx_range for profiling.
    • Wrapped the quant_method.apply call in ReplicatedLinear.forward with an NVTX range.
  • fastvideo/layers/mlp.py
    • Imported nvtx_range for profiling.
    • Wrapped fc_in, act, and fc_out calls in MLP.forward with NVTX ranges.
  • fastvideo/models/dits/wanvideo.py
    • Imported nvtx_range for profiling.
    • Wrapped KV computation, attention, and output projection in WanT2VCrossAttention.forward with NVTX ranges.
    • Wrapped scale shift table processing, QKV computation, self-attention, cross-attention, feed-forward, and MLP residual operations in WanTransformerBlock.forward with NVTX ranges.
    • Wrapped rotary position embedding, patch embedding, sequence model parallel shard, attention mask creation, condition embedder, transformer block iterations, output normalization, all-gather with unpad, and projection out operations in WanTransformer3DModel.forward with NVTX ranges.
  • fastvideo/models/loader/component_loader.py
    • Updated the pipeline configuration's DIT config to ensure downstream code can access the actual model architecture.
  • fastvideo/profiler.py
    • Integrated torch.cuda.nvtx.range_push and range_pop into the ProfilerController.region context manager.
    • Added new standalone NVTX utilities: nvtx_range (context manager), nvtx_annotate (decorator), and nvtx_mark (instantaneous marker).
  • fastvideo/tests/training/Vanilla/mfu_calculation.py
    • Updated DATA_PATH to a combined parquet dataset.
    • Changed MASTER_PORT and added CUDA_VISIBLE_DEVICES and FASTVIDEO_ATTENTION_BACKEND environment variables.
    • Adjusted train_batch_size and max_train_steps for testing.
    • Added logging for training metrics including batch size, sequence length, context length, step time, hidden dimension, number of layers, and FFN dimension.
    • Modified test_distributed_training to accept a profile argument and integrated nsys profile command for Nsight tracing.
  • fastvideo/training/training_pipeline.py
    • Imported nvtx_range for profiling.
    • Removed shutil and tempfile imports.
    • Wrapped transformer_forward, loss computation (sharded and non-sharded), backward pass, reduce_loss, clip_grad_norm, optimizer_step, and the main train_step loop with NVTX ranges.
    • Removed audio muxing functionality (_mux_audio method and its calls).
    • Adjusted the logic for logging visualization and validation steps.
    • Removed the try-except block around metric logging, ensuring all metrics are consistently captured.
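The three standalone utilities listed for fastvideo/profiler.py could plausibly be layered on torch's NVTX bindings. A hedged sketch follows: the names match the changelog, but the bodies are assumptions, with a no-op fallback so the code runs even without a CUDA-enabled PyTorch build.

```python
from contextlib import contextmanager
from functools import wraps

# Probe for usable NVTX bindings; degrade to no-ops otherwise
# (an assumption of this sketch, not the PR's actual code).
try:
    from torch.cuda import nvtx as _nvtx
    _nvtx.range_push("probe")
    _nvtx.range_pop()
    _nvtx.mark("probe")
except Exception:
    class _nvtx:  # type: ignore[no-redef]
        @staticmethod
        def range_push(message):
            pass

        @staticmethod
        def range_pop():
            pass

        @staticmethod
        def mark(message):
            pass


@contextmanager
def nvtx_range(name):
    """Context manager: one NVTX range around a block."""
    _nvtx.range_push(name)
    try:
        yield
    finally:
        _nvtx.range_pop()


def nvtx_annotate(name=None):
    """Decorator: an NVTX range around every call of the wrapped function."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            with nvtx_range(name or fn.__qualname__):
                return fn(*args, **kwargs)
        return wrapper
    return decorator


def nvtx_mark(message):
    """Emit an instantaneous NVTX marker (a point event, not a range)."""
    _nvtx.mark(message)
```

nvtx_annotate defaults the range name to the wrapped function's qualified name, which keeps Nsight timelines readable without requiring a label at every call site.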
Activity
  • The pull request introduces NVTX range markers and profiling utilities to the codebase.
  • The author, Ohm-Rishabh, has integrated these markers into various core components and the training pipeline.
  • The MFU calculation test script has been updated to utilize the new profiling capabilities and adjusted for specific training parameters.
  • Minor refactorings and cleanups were performed in the training pipeline, including the removal of audio muxing.

@gemini-code-assist Bot left a comment

Code Review

This pull request introduces NVTX range markers throughout the codebase to improve profiling capabilities with NVIDIA Nsight Systems. The changes are well-implemented, adding a new nvtx_range utility and applying it to key performance-critical sections in various models and layers. This will provide valuable, fine-grained insights into GPU execution.

My review includes a few suggestions to improve the portability of example and test scripts by removing hardcoded GPU device IDs. I also noted some leftover debug code and a significant out-of-scope change (removal of audio muxing logic) that would be better handled in a separate pull request to maintain clarity and focus.

I am having trouble creating individual review comments, so my feedback is consolidated below.

examples/training/finetune/wan_t2v_1.3B/crush_smol/preprocess_wan_data_t2v.sh (4)

medium

Hardcoding CUDA_VISIBLE_DEVICES makes this example script difficult to run on machines where GPU 2 is not available or suitable. It would be more robust to expect the user to set this environment variable before running the script, or to parameterize it with a default:

# Let the caller choose the GPU, defaulting to 0 when unset:
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0}"

fastvideo/tests/training/Vanilla/mfu_calculation.py (36)

medium

Hardcoding CUDA_VISIBLE_DEVICES in a test script reduces its portability and can cause failures on systems where the specified GPUs (5, 6) are not available. Please remove this line and let the environment (e.g., a CI runner or the user's shell) control which GPUs are visible to the script.

fastvideo/training/training_pipeline.py (835)

medium

This commented-out return statement appears to be a leftover from a debugging session. It should be removed to keep the code clean.

fastvideo/training/training_pipeline.py (951-1042)

medium

The removal of the _mux_audio static method and its related logic is a significant change. While this might be a valid cleanup, it seems unrelated to the main purpose of this pull request, which is to add NVTX tracers. Including unrelated changes makes the PR harder to review and understand. It would be better to submit this change in a separate PR with a descriptive title and explanation.

@jzhang38
Collaborator

jzhang38 commented Mar 8, 2026

Can you resolve the conflict?

@Ohm-Rishabh
Contributor Author

Can you resolve the conflict?

I've resolved the merge conflicts. Let me know if any other changes are needed.

@Eigensystem Eigensystem removed the go label Mar 28, 2026
@mergify
Contributor

mergify Bot commented Mar 28, 2026

This PR has merge conflicts with the base branch. Please rebase:

git fetch origin main
git rebase origin/main
# Resolve any conflicts, then:
git push --force-with-lease

@mergify mergify Bot added the needs-rebase PR has merge conflicts label Mar 28, 2026
@mergify
Contributor

mergify Bot commented Mar 28, 2026

Pre-commit checks failed

Hi @Ohm-Rishabh, the pre-commit checks have failed. To fix them locally:

# Install pre-commit if you haven't already
uv pip install pre-commit
pre-commit install

# Run all checks and auto-fix what's possible
pre-commit run --all-files

Common fixes:

  • yapf: yapf -i <file> (formatting)
  • ruff: ruff check --fix <file> (linting)
  • codespell: codespell --write-changes <file> (spelling)

After fixing, commit and push the changes. The checks will re-run automatically.

For future commits, pre-commit will run automatically on changed files before each commit.

1 similar comment

@mergify
Contributor

mergify Bot commented Mar 30, 2026

Buildkite CI tests failed

Hi @Ohm-Rishabh, some Buildkite CI tests have failed. Check the build for details:
View Buildkite build →

Common causes:

  • Test failures: Check the failing step's output for assertion errors or tracebacks
  • Import errors: Make sure new dependencies are added to pyproject.toml
  • GPU memory: Some tests require specific GPU types (L40S, H100 NVL)
  • Kernel build: If you changed fastvideo-kernel/, the build may have failed

If the failure is unrelated to your changes, leave a comment explaining why.

@mergify mergify Bot added the scope: training Training pipeline, methods, configs label Mar 30, 2026
@mergify mergify Bot added scope: attention Attention backends (VSA, STA, Flash, etc.) scope: infra CI, tests, Docker, build scope: docs Documentation scope: model Model architecture (DiTs, encoders, VAEs) labels Mar 30, 2026
@mergify
Contributor

mergify Bot commented Mar 30, 2026

⚠️ PR title format required

Your PR title must start with a type tag in brackets. Examples:

  • [feat] Add new model support
  • [bugfix] Fix VAE tiling corruption
  • [refactor] Restructure training pipeline
  • [perf] Optimize attention kernel
  • [ci] Update test infrastructure
  • [docs] Add inference guide
  • [misc] Clean up configs
  • [new-model] Port Flux2 to FastVideo

Valid tags: feat, feature, bugfix, fix, refactor, perf, ci, doc, docs, misc, chore, kernel, new-model

Please update your PR title and the merge protection check will pass automatically.

@mergify
Contributor

mergify Bot commented Mar 30, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 PR merge requirements

Waiting for:

  • #approved-reviews-by>=1
  • check-success=fastcheck-passed
  • check-success=full-suite-passed
  • check-success~=pre-commit
  • title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model)\]
This rule is failing.
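The failing title rule can be checked locally before pushing. A small sketch using Python's re module, with the pattern copied from the rule above:

```python
import re

# Pattern copied from the merge-protection title rule.
TITLE_RE = re.compile(
    r"(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model)\]"
)

def title_ok(title):
    """Return True when the PR title starts with a valid type tag."""
    return TITLE_RE.match(title) is not None

print(title_ok("Nvtx tracer"))          # False: no type tag
print(title_ok("[feat] Nvtx tracer"))   # True: tag added
```

Retitling this PR to something like "[feat] Nvtx tracer" would satisfy the rule.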


@mergify
Contributor

mergify Bot commented Mar 30, 2026

❌ CI tests failed

@Ohm-Rishabh — to see what failed:

  1. Scroll to the Checks section below
  2. Find the check marked with ❌ (e.g. buildkite/ci/microscope-transformer-tests)
  3. Click Details to view the full build log

Or view all builds for this branch on Buildkite →

Common causes:

  • Assertion error / test failure — check the failing test's traceback
  • Import error — new dependency missing from pyproject.toml
  • OOM — some tests need specific GPUs (L40S, H100 NVL)

If the failure looks unrelated to your changes, comment why and a maintainer will review.
