test: pin Planetarium.NetMQ to 4.0.0.262-planetarium-pre1 #2759
Open
ipdae wants to merge 1 commit into
Heimdall validators have been crashing with Exit 139 every ~2h on a NullReferenceException inside NetMQ.Core.Transports.StreamEngine (MechanismReady / ProcessHandshakeCommand). The downstream symptom on remote-headless pods is the recurring "tx staging timeout" error that end users see twice in a row before staged actions vanish.

Pin Planetarium.NetMQ to a prerelease that adds null guards at the two NRE sites observed in production (planetarium/netmq#6). The pin sits in the Headless executable csproj so NuGet's nearest-wins rule resolves it over the 4.0.0.261-planetarium that comes in transitively through Libplanet.Net 5.5.x.

This is for Heimdall validator/headless pod soak testing only; revert once the upstream bump lands in Libplanet 5.5.x and lib9c picks it up.

Refs: planetarium/libplanet#4050

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

Pin Planetarium.NetMQ to a prerelease (4.0.0.262-planetarium-pre1) in NineChronicles.Headless.Executable.csproj so the executable resolves the patched NetMQ build instead of the 4.0.0.261-planetarium that comes in transitively through Libplanet.Net 5.5.x.

Do not merge. This is a soak-test branch for the Heimdall validator + remote-headless Exit 139 crash. Once the upstream chain (planetarium/netmq#6 release -> Libplanet 5.5.4 release -> lib9c bump -> headless) lands, this pin should be reverted.

Background
Heimdall validators have been hitting Exit 139 roughly every 2h with this stack:

Cancellation in the handshake path races m_mechanism teardown. The fork's StreamEngine.cs dereferences m_mechanism without a null check, both at ProcessHandshakeCommand entry and inside MechanismReady, so the engine crashes the whole process.

The downstream symptom on remote-headless is the recurring tx staging timeout that end users see twice before a staged action vanishes; the timeout rate dropped by a measured ~19x after a manual kubectl rollout restart. See planetarium/libplanet#4050 for the full operational analysis.

What's in the prerelease
Planetarium.NetMQ 4.0.0.262-planetarium-pre1 adds early-return null guards at the two NRE sites:

Source: planetarium/netmq#6 (CI green, code-review pending). Public API and AssemblyVersion (4.0.0.0) are unchanged, so this is a binary-compatible swap with 4.0.0.261-planetarium.

Why pin in the executable csproj
NineChronicles.Headless.Executable is the Exe entry point. Adding the PackageReference here means NuGet's nearest-wins rule resolves 4.0.0.262-planetarium-pre1 for the executable's dependency closure, overriding the transitive 4.0.0.261-planetarium from Libplanet.Net. Library csprojs in this repo don't reference NetMQ directly, so only the executable's runtime resolution matters for production validation.

Test plan
[...] Planetarium.NetMQ 4.0.0.262-planetarium-pre1).
[...] Exit 139 for >=24h.
tx staging timeout rate on remote-headless drops from baseline ~266/5min back toward steady-state.

Rollback
If the prerelease introduces any regression:
- <PackageReference Include="Planetarium.NetMQ" Version="4.0.0.262-planetarium-pre1" />

The NuGet package itself can be unlisted on nuget.org but cannot be deleted, which is fine for a prerelease.
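For reference, a minimal sketch of how the pin might sit inside NineChronicles.Headless.Executable.csproj, assuming a standard SDK-style project. Only the PackageReference line comes from this PR; the enclosing ItemGroup and the comments are illustrative and not copied from the repository.

```xml
<!-- Sketch only: the enclosing ItemGroup is an assumption, not the repo's actual layout. -->
<ItemGroup>
  <!-- Direct reference in the Exe project. Under NuGet's nearest-wins rule a direct
       reference takes precedence over the transitive 4.0.0.261-planetarium
       that arrives through Libplanet.Net. -->
  <PackageReference Include="Planetarium.NetMQ" Version="4.0.0.262-planetarium-pre1" />
</ItemGroup>
```

Rolling back should then amount to deleting that single PackageReference line, after which resolution falls back to the transitive version.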
🤖 Generated with Claude Code