test: pin Planetarium.NetMQ to 4.0.0.262-planetarium-pre1 #2759

Open
ipdae wants to merge 1 commit into main from yang/test-netmq-prerelease-pre1

ipdae (Contributor) commented May 7, 2026

Summary

Pin Planetarium.NetMQ to a prerelease (4.0.0.262-planetarium-pre1) in NineChronicles.Headless.Executable.csproj so the executable resolves the patched NetMQ build instead of the 4.0.0.261-planetarium that comes in transitively through Libplanet.Net 5.5.x.

Do not merge. This is a soak-test branch for the Heimdall validator + remote-headless Exit 139 crash. Once the upstream chain (planetarium/netmq#6 release -> Libplanet 5.5.4 release -> lib9c bump -> headless) lands, this pin should be reverted.

Background

Heimdall validators have been hitting Exit 139 roughly every 2h with this stack trace:

Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
   at NetMQ.Core.Transports.StreamEngine.MechanismReady()
   at NetMQ.Core.Transports.StreamEngine.ProcessHandshakeCommand(Msg& msg)
   at NetMQ.Core.Transports.StreamEngine.ProcessInput()
   at NetMQ.Core.Utils.Proactor.Loop()

Cancellation in the handshake path races with m_mechanism teardown. The fork's StreamEngine.cs dereferences m_mechanism without a null check both at the entry of ProcessHandshakeCommand and inside MechanismReady, so the resulting NullReferenceException goes unhandled and crashes the whole process.

The downstream symptom on remote-headless is the recurring tx staging timeout that end users see twice before a staged action vanishes. The timeout rate dropped roughly 19x after a manual kubectl rollout restart. See planetarium/libplanet#4050 for the full operational analysis.

What's in the prerelease

Planetarium.NetMQ 4.0.0.262-planetarium-pre1 adds early-return null guards at the two NRE sites:

PushMsgResult ProcessHandshakeCommand(ref Msg msg)
{
    if (m_mechanism == null) return PushMsgResult.Error;   // new
    var result = m_mechanism.ProcessHandshakeCommand(ref msg);
    ...
}

void MechanismReady()
{
    if (m_mechanism == null) return;                        // new
    if (m_options.HeartbeatInterval > 0) ...
}

Source: planetarium/netmq#6 (CI green, code-review pending). Public API and AssemblyVersion (4.0.0.0) are unchanged, so this is a binary-compatible swap with 4.0.0.261-planetarium.

Why pin in the executable csproj

NineChronicles.Headless.Executable is the Exe entry point. Adding the PackageReference here means NuGet's nearest-wins rule resolves 4.0.0.262-planetarium-pre1 for the executable's dependency closure, overriding the transitive 4.0.0.261-planetarium from Libplanet.Net. Library csprojs in this repo don't reference NetMQ directly, so only the executable's runtime resolution matters for production validation.
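For reference, the pin amounts to a single direct reference in the executable's project file. The fragment below is a sketch only: the PackageReference line is the one quoted under Rollback, while its placement in a separate ItemGroup and the comments are assumptions, since the actual diff is not reproduced in this description.

```xml
<!-- NineChronicles.Headless.Executable.csproj (excerpt; surrounding items omitted) -->
<!-- Assumption: the reference may sit in an existing ItemGroup rather than its own. -->
<ItemGroup>
  <!-- Direct reference: nearest-wins picks this over the transitive
       4.0.0.261-planetarium pulled in through Libplanet.Net 5.5.x. -->
  <PackageReference Include="Planetarium.NetMQ" Version="4.0.0.262-planetarium-pre1" />
</ItemGroup>
```

Because the reference lives only in the executable project, reverting it restores the transitive version without touching any library csproj.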

Test plan

  • Validator/headless pod build picks up the prerelease (verify in deploy logs: Planetarium.NetMQ 4.0.0.262-planetarium-pre1).
  • Heimdall validator-5-0 runs without Exit 139 for >=24h.
  • tx staging timeout rate on remote-headless drops from baseline ~266/5min back toward steady-state.

Rollback

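If the prerelease introduces any regression, remove the pin from NineChronicles.Headless.Executable.csproj so resolution falls back to the transitive 4.0.0.261-planetarium: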

- <PackageReference Include="Planetarium.NetMQ" Version="4.0.0.262-planetarium-pre1" />

The NuGet package itself can be unlisted on nuget.org but cannot be deleted, which is acceptable for a prerelease.

🤖 Generated with Claude Code

Heimdall validators have been crashing with Exit 139 every ~2h on a
NullReferenceException inside NetMQ.Core.Transports.StreamEngine
(MechanismReady / ProcessHandshakeCommand). The downstream symptom on
remote-headless pods is the recurring "tx staging timeout" error that
end users see twice in a row before staged actions vanish.

Pin Planetarium.NetMQ to a prerelease that adds null guards at the two
NRE sites observed in production (planetarium/netmq#6). The pin sits in
the Headless executable csproj so NuGet's nearest-wins resolves it over
the 4.0.0.261-planetarium that comes in transitively through
Libplanet.Net 5.5.x.

This is for Heimdall validator/headless pod soak testing only — revert
once the upstream bump lands in Libplanet 5.5.x and lib9c picks it up.

Refs: planetarium/libplanet#4050

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>