hotfix: inplace pin memory caused cudaErrorHostMemoryAlreadyRegistered #69

Merged
blahgeek merged 6 commits into MoonshotAI:main from specture724:fix/inplace-pin-712
Dec 22, 2025

specture724 commented Dec 18, 2025

Resolves #67.
Manually unregister memory that was pinned with cudaHostRegister by calling cudaHostUnregister.
Rename test_pin_memory.py to test_reuse_pin_memory.py.
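
The failure mode behind this fix can be illustrated with a small pure-Python model. This is only a sketch and does not call CUDA: `FakeHostRegistry`, `host_register`, `host_unregister`, and the address `0x1000` are hypothetical stand-ins, and the overlap rule is a simplifying assumption about how the runtime tracks page-locked ranges. It shows why registering a reused buffer fails until the earlier registration is explicitly released.

```python
class HostMemoryAlreadyRegistered(RuntimeError):
    """Stand-in for cudaErrorHostMemoryAlreadyRegistered (not the real error type)."""


class FakeHostRegistry:
    """Toy model of the table of page-locked host ranges (assumption, not CUDA)."""

    def __init__(self) -> None:
        self._ranges: dict[int, int] = {}  # base address -> size in bytes

    def host_register(self, ptr: int, size: int) -> None:
        # Reject a range that overlaps one that is still registered.
        for start, length in self._ranges.items():
            if ptr < start + length and start < ptr + size:
                raise HostMemoryAlreadyRegistered(hex(ptr))
        self._ranges[ptr] = size

    def host_unregister(self, ptr: int) -> None:
        # Only the base pointer is needed to release a registration.
        del self._ranges[ptr]


registry = FakeHostRegistry()
ptr, size = 0x1000, 4096  # pretend address and length of a reused host buffer

registry.host_register(ptr, size)  # first pin succeeds

try:
    registry.host_register(ptr, size)  # reuse without unpinning
    second_attempt_failed = False
except HostMemoryAlreadyRegistered:
    second_attempt_failed = True

registry.host_unregister(ptr)      # the manual unpin step
registry.host_register(ptr, size)  # pinning the same buffer now succeeds
repinned = True
```

In the real code path the two operations would be the CUDA runtime calls named in the description; the model only captures the bookkeeping that makes the explicit unregister necessary.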

Copilot AI left a comment

Pull request overview

This PR fixes a CUDA error (cudaErrorHostMemoryAlreadyRegistered) that occurred when reusing inplace pin memory by implementing manual memory unpinning with cudaHostUnregister. The changes also make inplace pin memory the default behavior and add NPU device compatibility checks.

Key Changes:

  • Implemented manual memory unpinning using cudaHostUnregister for CUDA-registered memory to prevent reuse errors
  • Changed default value of use_inplace_pin_memory from False to True
  • Added NPU device detection to disable inplace pin memory on unsupported hardware
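
How the new default and the NPU check might combine can be sketched as follows. The function name `resolve_inplace_pin` and the `device_type` string are hypothetical, introduced only for illustration; the actual check in checkpoint_engine/ps.py may be structured differently.

```python
def resolve_inplace_pin(use_inplace_pin_memory: bool = True,
                        device_type: str = "cuda") -> bool:
    """Decide whether in-place pinning is used (illustrative sketch only)."""
    if device_type == "npu":
        # Assumption: NPU devices cannot use the in-place path, so fall back.
        return False
    # Otherwise honor the caller's choice; the default is now True.
    return use_inplace_pin_memory


default_on_cuda = resolve_inplace_pin()
forced_off_on_npu = resolve_inplace_pin(True, device_type="npu")
opted_out = resolve_inplace_pin(False)
```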

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| checkpoint_engine/ps.py | Added manually_pinned flag to MemoryBuffer, implemented _unpin() function for manual memory unpinning, changed default for use_inplace_pin_memory to True, added NPU device check, and updated documentation |
| tests/test_reuse_pin_memory.py | New test file (renamed from test_pin_memory.py) that validates shared memory pool registration and unregistration behavior |
| tests/test_inplace_unpin.py | New test that validates repeated pin/unpin cycles work correctly without CUDA errors |
| tests/test_update.py | Removed unused variable assignment and changed directory cleanup from shutil.rmtree() to os.removedirs() |
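
The two cleanup functions mentioned for tests/test_update.py behave differently, which the standard-library sketch below demonstrates. The directory and file names (`a/b`, `ckpt`, `weights.bin`, `keep`) are made up for the example and are not taken from the test file.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as sandbox:
    # A sentinel file keeps the sandbox non-empty, so parent pruning stops here.
    open(os.path.join(sandbox, "keep"), "w").close()

    # os.removedirs removes an empty leaf, then prunes parents that become empty.
    empty_leaf = os.path.join(sandbox, "a", "b")
    os.makedirs(empty_leaf)
    os.removedirs(empty_leaf)
    pruned_parent = not os.path.exists(os.path.join(sandbox, "a"))

    # On a directory that still contains files, os.removedirs raises OSError.
    full_dir = os.path.join(sandbox, "ckpt")
    os.makedirs(full_dir)
    open(os.path.join(full_dir, "weights.bin"), "w").close()
    try:
        os.removedirs(full_dir)
        removedirs_failed = False
    except OSError:
        removedirs_failed = True

    # shutil.rmtree deletes the directory together with its contents.
    shutil.rmtree(full_dir)
    rmtree_removed = not os.path.exists(full_dir)
```

So os.removedirs() is only a drop-in replacement when the directory is already empty by the time cleanup runs.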

Comment thread tests/test_update.py Outdated
Comment thread checkpoint_engine/ps.py
Comment thread checkpoint_engine/ps.py Outdated
Comment thread tests/test_inplace_unpin.py
Comment thread checkpoint_engine/ps.py
Comment thread checkpoint_engine/ps.py
Comment thread checkpoint_engine/ps.py
specture724 changed the title from "fix: inplace pin memory caused cudaErrorHostMemoryAlreadyRegistered" to "hotfix: inplace pin memory caused cudaErrorHostMemoryAlreadyRegistered" on Dec 19, 2025
blahgeek merged commit 4350e70 into MoonshotAI:main on Dec 22, 2025
2 checks passed

Development

Successfully merging this pull request may close these issues.

Bug found when using in-place pinned memory

3 participants