OPC UA Part 4 6.6 Redundancy (server + client) with opt-in distributed high-availability#3918
OPC UA Part 4 6.6 Redundancy (server + client) with opt-in distributed high-availability#3918marcschier wants to merge 43 commits into
Conversation
…ng with hardened shared store Adds an opt-in provider model (Opc.Ua.Server.Distributed) to replicate address-space topology/values and session state across server replicas for active/passive and active/active HA, exposed via documented OPC UA redundancy (ServiceLevel/RedundantServerArray). The single-instance in-memory path stays zero-overhead (default NullRecordProtector). Building blocks: ISharedKeyValueStore (+in-memory), INodeStateStore, address-space synchronizer, leader election (static + shared-store lease CAS), service-level providers, distributed value cache, shared session store. Wired via IServerStartupTask hosting seam + fluent DI (UseDistributedAddressSpace / AddServerServiceLevel / redundancy options). CustomNodeManager2 opts in via ILocalAddressSpaceSource. Security hardening (plan 30, SDL + IEC 62443): IRecordProtector + AesCbcHmacRecordProtector (AES-256-CBC Encrypt-then-MAC, verify-before-decrypt, fail-closed) on every shared record; KeyRingRecordProtector (staged key rotation); SharedSingleUseNonceRegistry (cross-replica single-use server nonce via store CAS); key zeroization. Docs/HighAvailabilitySecurity.md captures the STRIDE threat model and operator guidance. Docs: HighAvailability.md, HighAvailabilitySecurity.md, Docs/README link, HighAvailabilityServer sample. Plans 28/29/30. 85 Distributed unit tests (net10.0 + net48).
…reconnect (S5) Implements the opt-in mirrored fast-reconnect from plans 30/31. After a failover a client reconnects to a standby by re-running ActivateSession; the AuthenticationToken is only a lookup key and the standby still performs the full client-certificate signature validation against a mirrored, single-use serverNonce (token-only hijack and nonce replay are both closed). - SharedSessionEntry: extended with the full reconstruction state (serverNonce, clientNonce, client cert blob, SecurityPolicy/Mode, endpoint, timeout, client description); encrypted at rest via the record protector. - SessionManager: additive, backward-compatible RestoreSessionAsync + SupportsSessionRestore seam in the ActivateSession miss-path (default returns null => unchanged behaviour; m_sessions stays private). - DistributedSessionManager: mirrors encrypted session state on create/activate, removes on close; on restore enforces REQ-UA-7 (same SecurityPolicy/Mode), consumes the serverNonce single-use across the replica set (replay defence), reconstructs the session, and logs provenance with a one-way token digest. - ISessionManagerFactory seam on StandardServer (wired from DI by the hosted service, supplying a server-certificate provider) + DistributedSessionManagerFactory + UseDistributedSessions(...) fluent API. Safe default EnableFastReconnect=false (re-auth on failover). - Docs (HighAvailability.md, HighAvailabilitySecurity.md) updated. Tests: 97 Distributed (net10+net48); no regression (61 Session + 57 client SessionTests integration pass). Remaining: two-server network e2e (S6).
…roring (S6) DistributedSessionMirrorIntegrationTests stands up a fully-started server whose ISessionManagerFactory builds a DistributedSessionManager, and verifies end-to-end that a session created/activated through the real service handlers is mirrored encrypted to the shared store (a wrong-key reader fails closed) and removed on close. This closes the factory -> StandardServer.CreateSessionManager -> mirror runtime-wiring gap. The full secured two-server token-reuse reconnect happy-path remains a documented follow-up (the stack client does re-auth-on-failover; the direct-service helper only drives unsecured sessions). The restore decision logic (REQ-UA-7 + single-use nonce/replay) is unit-tested and the base ActivateSession signature path is integration-tested. 98 Distributed tests pass (net10 + net48).
…t (F6/F9) Workstream A from plans/32: - F6: SharedKeyValueSessionStore keys entries by the SHA-256 digest of the AuthenticationToken (SharedKeyValueSessionStore.KeyFor) instead of the raw token, so a backend's key enumeration/monitoring/dumps never expose the token. Test asserts the keyspace contains no raw token. - F9: a successful cross-replica restore emits a distinct AuditSessionEventState (Session/RestoredFromSharedStore) via the new IAuditEventServer.ReportAuditSessionRestoredEvent, in addition to the standard AuditActivateSession, with a one-way token digest for provenance. - F7: analyzed — the decrypted serverNonce becomes the session's retained working Nonce (no extra plaintext copy in the manager); Nonce.Data zeroization on dispose is a pre-existing server-wide Core concern tracked separately. Docs updated. 99 Distributed tests pass (net10 + net48).
…ng in sample (FG) Workstream FG from plans/32: - New Docs/KubernetesDeployment.md: worked replicaset deployment (StatefulSet + headless Service, leader election, readiness tied to ServiceLevel, KEK + shared ApplicationInstanceCertificate provisioning via Secrets, security checklist). Linked from Docs/README.md and HighAvailability.md. - HighAvailabilityServer sample: wires UseDistributedSessions (opt-in HA_FAST_RECONNECT) and an optional AesCbcHmacRecordProtector from a base64 HA_RECORD_KEY (encrypted shared store), demonstrating the production-hygiene pattern. - Transparent redundancy remains a documented deployment pattern (single virtual endpoint + shared session store + subscription transfer; no new transport).
…se failover) design
…A-13, B) On failover to a redundant server, when EnableTokenReuseFailover is set, the client re-activates the existing session by reusing the current AuthenticationToken (signing over the new channel + last serverNonce) instead of CreateSession, falling back to re-authentication if the standby rejects the token. - Session.cs: extracted ReactivateExistingSessionAsync (the token-reuse activation core, shared with UpdateSessionAsync); RecreateInPlaceCoreAsync tries it first (adopting the failover server's cert) before the existing clear + fresh-CreateSession fallback; new EnableTokenReuseFailover property (copied across recreate clones). OpenAsync is untouched. - ManagedSession: EnableTokenReuseFailover option threaded through CreateAsync + ctor and applied to the inner session; ManagedSessionOptions.EnableTokenReuseFailover; ManagedSessionBuilder.WithTokenReuseFailover(). Default off (re-auth on failover). No regression: 57 client SessionTests + 135 reconnect/failover integration tests pass (net10). Updated the ctor-reflection test for the new parameter.
DistributedSessionFailoverIntegrationTests: two secured servers share one store via DistributedSessionManager; a ManagedSession client with WithTokenReuseFailover fails over from the active to the standby. The standby restores the mirrored session and re-activates it with the reused AuthenticationToken, so the client's SessionId is preserved (a fresh re-authentication would change it). Passes on net10 + net48. Docs (HighAvailability.md, HighAvailabilitySecurity.md) updated: client WithTokenReuseFailover() usage + the e2e validation; plans/32 marks B and C done.
Split all distributed/HA implementation into a new OPCFoundation.NetStandard.Opc.Ua.Server.Distributed project that references Opc.Ua.Server; core now keeps only the seams. - Move ILocalAddressSpace/ILocalAddressSpaceSource to the Opc.Ua.Server namespace (NodeManager/); extract a shared internal PredefinedNodesAddressSpace reused by both CustomNodeManager2 and AsyncCustomNodeManager (both now implement ILocalAddressSpaceSource). Apply path is async (no sync-over-async). - Remove the node-state-store-registry hook from ServerInternalData; the distributed startup task owns its own registry. - Add CryptoUtils.ZeroMemory/FixedTimeEquals polyfills and reuse them in the record protector, EncryptedSecret, and KeyCredentialBridgeAuthenticator. - Fix CA2213 in AddressSpaceSynchronizer (dispose enumerator in DisposeAsync). - Rebase HighAvailabilityServer sample onto AsyncCustomNodeManager; expand the sample README with shared-store wiring and an active/active setup. - Strip internal tracking tags (REQ-UA/Finding/IEC/SDL) from code and XML docs. - Merge HighAvailabilitySecurity.md into HighAvailability.md and fix links. - Wire the new project into UA.slnx, the test project, and the sample.
Add a separate net8.0+ package providing true active/active (multi-writer) replication for the distributed server, built on the Crdt + Crdt.Transport NuGet packages. Opt-in; the base distributed library and its active/passive path are unchanged. - New Libraries/Opc.Ua.Server.Distributed.Crdt project (net8.0;net9.0;net10.0, no-op shell on legacy CI legs via RestrictForLegacyTfm); references the base distributed project + Crdt + Crdt.Transport. - Address space A/A: CrdtAddressSpaceSynchronizer models topology and values as last-writer-wins maps and gossips state over Crdt.Transport (in-memory / TCP / UDP, optional TLS); every replica writes, received state is merged and applied. A topology merge preserves the locally-known value so it never regresses a concurrently-updated value (values are versioned by their own entries). - Sessions A/A: CrdtSharedKeyValueStore replicates encrypted session entries by gossip and reuses DistributedSessionManager; the single-use server nonce stays on a strongly-consistent ISingleUseNonceRegistry (CRDTs cannot enforce exactly-once), and the CRDT store rejects compare-and-swap. - Fluent opt-in: UseCrdtAddressSpace(...) / UseCrdtSessions(...) with a shared CrdtGossipOptions base (ReplicaId, UseTcpGossip/UseUdpGossip/AddPeer, limits). - Tests: new Opc.Ua.Server.Distributed.Crdt.Tests (net8/net10) — convergence, multi-writer, concurrent-value LWW, serializer round-trip, CAS/Watch boundary; plus a CRDT exercise in the AOT test project. - Docs: HighAvailability.md 'Active/active with CRDTs' section incl. the single-use-nonce security boundary; Crdt/Crdt.Transport added to CPM. Crdt/Crdt.Transport 1.0.0 (MIT, NativeAOT-ready) restore from nuget.org.
…mple, clarify HA docs - Rename the user-facing fluent surface from Crdt* to Replicated*: UseReplicatedAddressSpace / UseReplicatedSessions, ReplicatedAddressSpaceOptions / ReplicatedSessionOptions / ReplicatedGossipOptions, ReplicatedServerBuilderExtensions (implementation classes and the Crdt package/namespace keep their names). - Integrate the CRDT package into the HighAvailabilityServer sample with an HA_MODE switch (ap = active/passive leader-write, aa = active/active CRDT gossip via HA_GOSSIP_PORT / HA_GOSSIP_PEERS); document running two AA replicas in the sample README. - HighAvailability.md: reference the implemented CRDT library (not 'deferred'), document that a write to a non-leader replica is discarded in active/passive, and rewrite the value-participation note to describe how participation works. Addresses review comments on Docs/HighAvailability.md and the sample README.
…Ua.Server.Distributed.Tests Organize the source files of both Opc.Ua.Server.Distributed and Opc.Ua.Server.Distributed.Crdt into logical subfolders (AddressSpace, KeyValueStore, Redundancy, Values, Sessions, Security for the base library; AddressSpace, Sessions for the CRDT adapter), with the top-level DI entry points kept at the project root. Namespaces are unchanged. Mirror the organization in the test projects: - Extract the base distributed tests from Opc.Ua.Server.Tests/Distributed into a dedicated Opc.Ua.Server.Distributed.Tests project (mirroring the src project), with the same subfolders; add it to UA.slnx and InternalsVisibleTo. Test namespaces are unchanged. - Organize Opc.Ua.Server.Distributed.Crdt.Tests into AddressSpace/Sessions subfolders mirroring the CRDT src. Addresses review comments asking to mirror the test projects to the src side and to organize the files logically without changing namespaces.
Add unit tests for the previously-uncovered CRDT adapter surface flagged by codecov (patch coverage 67.65%): - ReplicatedGossipOptions/ReplicatedSessionOptions: defaults, UseTcp/UseUdp transport factory wiring, AddPeer, CreateReaderOptions/CreateTransport, null-arg guards. - CrdtAddressSpaceStartupTask: attaches a synchronizer to opted-in (ILocalAddressSpaceSource) node managers and skips the rest; null-arg guards. - CrdtSessionManagerFactory + UseReplicatedSessions/UseReplicatedAddressSpace registration; null-arg guards. - ByteStringCrdtSerializer JSON round-trip (null/empty/data); CrdtSharedKeyValueStore.ScanAsync prefix filtering. CRDT package line coverage 67% -> 93.7%; 25/25 tests pass on net8.0 and net10.0.
|
Addressed the codecov patch-coverage feedback in 4f4d7ed. The 80% patch gap was concentrated in the new
CRDT package line coverage rises from ~67% to 93.7%; 25/25 tests pass on net8.0 and net10.0. Codecov will re-evaluate the patch on this commit. |
…t + non-transparent Server model: per-mode Server.ServerRedundancy (RedundancySupport, RedundantServerArray, non-transparent ServerUriArray, transparent CurrentServerId); sub-range ServiceLevel providers (Table 105) + load balancing; RequestServerStateChange/Maintenance/EstimatedReturnTime (6.6.5, admin-gated); NTRS capability; FindServers returns the RedundantServerSet (IRedundantServerSetProvider). New shared Opc.Ua.ServiceLevels/ServiceLevelSubrange in core. Client: RedundantManagedClient realizing all Table 107 modes (Cold/Warm/Hot(a)/Hot(b)/HotAndMirrored); RedundancySupport wording; ServiceLevel sub-range failover rules; FindServers ServerUri->endpoint resolution (no security downgrade); client redundancy via TransferSubscriptions; non-transparent network redundancy. State mirroring (opt-in provider seams; single-instance unchanged): session takeover, subscription-definition mirror, async notification sequence/Republish mirror, best-effort continuation-point envelope, deterministic EventId provider, RegisterNodes replica-consistent. Extensions: Opc.Ua.Server.Distributed.Crdt (active/active) and Opc.Ua.Server.Distributed.Kubernetes (Lease election, peer discovery, ServiceLevel->readiness). Samples (HighAvailabilityServer, RedundantClient), docs mapped to 6.6, conformance tests. Security: AES-CBC+HMAC record protection, fail-closed RecordProtectionGuard for external stores (base + CRDT session paths), single-use server-nonce replay protection, K8s TLS hostname validation. Validated on net10 and net48.
… review findings CRDT active/active extension (Opc.Ua.Server.Distributed.Crdt): - Bump Crdt/Crdt.Transport 1.0.0 -> 1.0.2 (now ship netstandard2.0/2.1 assets) - Widen TFMs from net8.0+ to $(LibTargetFrameworks) (net472/net48/netstandard2.1/net8/9/10); remove RestrictForLegacyTfm; gate IsAotCompatible/binding-gen to net8+ - Fail closed for unauthenticated networked gossip (TCP without mutual TLS, UDP); add AllowUnauthenticatedGossip opt-out for dev/test; gossip-port NetworkPolicy guidance Server review findings: - AddServerRedundancy warns when non-transparent mode has no IServiceLevelProvider - Rename AddManualFailover -> AddRequestServerStateChange ([Obsolete] shim retained) - Consolidate ServerRedundancyOptions peer inputs (RedundantPeers canonical) - DeterministicEventId: exclude per-replica ReceiveTime, compute after distinguishing fields are populated (replica-stable per 6.6.2.2) - Subscription retransmission mirror: delta path, namespace/server tables once per subscription, bounded-parallel drain Client review findings: - Maintenance(0) fails over to a healthy peer (Table 105); EstimatedReturnTime backoff - Warm backups use MonitoringMode.Disabled until failover (Table 107) - Hot(b) dedup uses exact value identity (no hash-collision drops); value-struct key - Subscription-template ownership: client disposes retained templates - De-duplicate WithNetworkRedundancy registration Validated: 0 warnings; tests green on net10 + net48 (CRDT 30, Distributed 178, Client redundancy 75, InformationModel redundancy 7).
…eview feedback Rename (full identity: folders, csproj, AssemblyName, PackageId, RootNamespace, C# namespaces, references, UA.slnx, InternalsVisibleTo): - Opc.Ua.Server.Distributed -> Opc.Ua.Server.Redundancy - Opc.Ua.Server.Distributed.Crdt -> Opc.Ua.Server.Redundancy.Crdt - Opc.Ua.Server.Distributed.Kubernetes -> Opc.Ua.Server.Redundancy.K8s (identity only; the word "Kubernetes" and k8s client types are preserved in prose/code) - Tests mirror the rename (.Tests/.Crdt.Tests/.K8s.Tests/.Integration.Tests) - Application HighAvailabilityServer -> RedundantServer - Documentation updated to the new names (present tense, merged-to-master state) Review feedback (verified previously-resolved PR comments are satisfied; fixed gaps): - async node-manager base, A/A sample README, REQ-UA tags, CA2213, central CryptoUtils polyfills, ILocalAddressSpace namespace - all confirmed in current code Roadmap findings implemented: - OPCFoundation#28 ServerUriArray-only peers are now failover candidates (connect + read live level) - OPCFoundation#8 FetchRedundancyInfo follow-up reads merged into one ReadValuesAsync - OPCFoundation#29 ContinuationPoint mirroring documented as envelope-only (partial-SHALL boundary) - OPCFoundation#30 Server.ServerRedundancy typed to NonTransparent/Transparent subtype by mode - OPCFoundation#20 Add* (standard) vs Use* (extension) convention documented + ServiceLevel cross-ref - OPCFoundation#24 transparent-mode shared application-key blast-radius/rotation guidance expanded CI green (fixes pre-existing all-TFM build break, unrelated to the rename): - Opc.Ua.Sessions.Tests redundancy test doubles: implement IServerRedundancyHandler. ShouldFailover and use RedundancySupport (RedundancyMode was removed) - Integration tests: add RestrictForLegacyTfm + CustomTestTarget fallback so legacy TFM CI passes no-op instead of failing with empty TargetFrameworks - Integration Maintenance assertion updated to spec-aligned behavior (Maintenance with an available peer warrants failover, Part 4 Table 105) Validated: full UA.slnx builds on net10 + net48; Redundancy 178, Crdt 30, K8s 27, Integration 3, Client redundancy 76, Sessions failover 4 - all pass.
The private ManagedSession constructor gained enableTokenReuseFailover (bool) and networkRedundancy (NetworkRedundancyOptions?) parameters during the network-redundancy work, but two reflection-based test helpers still used the old signature, so GetConstructor returned null and failed the fixture SetUp (52 cascading failures = the test-ubuntu-latest-Client / Fast PR test CI failures). - ManagedSessionComplianceTests.CreateManagedSessionWithInner: add the 4th bool + NetworkRedundancyOptions to the ctor type list and invoke args - ManagedSessionTests: add NetworkRedundancyOptions to the ctor type list and invoke args Validated: full Opc.Ua.Client.Tests now 1530 passed / 0 failed on net10.
The legacy-TFM no-op shell strips all references, so on netstandard2.0/2.1 there is no
assembly providing System.Object and the empty compile failed with CS8021 ("No value
for RuntimeMetadataVersion"). net4x builds got System.Object from the implicit targeting
pack so they worked. Supply an explicit RuntimeMetadataVersion for the no-op shell so the
empty assembly compiles on every legacy TFM.
This was the netstandard2.0 leg of the build-*-all-tfm failure; it affected every
RestrictForLegacyTfm project (Core.Diagnostics, Network.Fuzz*, Redundancy.K8s*,
Redundancy.Integration.Tests, RedundantServer, RedundantClient, Mcp, Minimal* samples).
Validated: full UA.slnx builds on net10, net48, and netstandard2.0; the no-op shell
compiles cleanly on net472/net48/netstandard2.0/netstandard2.1.
… from codecov The new redundancy libraries are ~80% unit-covered, but codecov/patch was dragged down by genuinely-untestable Kubernetes IO (real HTTP to the API server, the readiness HTTP listener, and cluster-bound startup tasks). - Add KubernetesServerBuilderExtensionsTests: a DI-registration test that exercises UseKubernetes / UseKubernetesLeaderElection / UseKubernetesPeerDiscovery / UseKubernetesReadiness (the previously 0%-covered builder wiring) plus null-builder guards, resolving the registered services out-of-cluster with no IO. - codecov.yml: ignore the four integration-only K8s IO files (KubernetesHttpApiClient, KubernetesReadinessServer, KubernetesReadinessStartupTask, KubernetesPeerDiscoveryStartupTask) - mirrors the existing Applications/** exemption; the Kubernetes logic (models, factory, lease election, peer parsing, readiness mapping, builder wiring) stays measured. Measured redundancy-library line coverage (codecov view) is now ~88%, above the 80% patch target. K8s tests: 29 passed.
…Lock ambiguity) The net10-only RedundantServer sample used RestrictForLegacyTfm, which only no-ops legacy TFMs (net4x/ns2.x). Under the Linux all-TFM CI leg (CustomTestTarget=net8.0) the app still built net10 while referencing the net8-built Opc.Ua.Types, so its System.Threading.Lock polyfill collided with net10's BCL Lock (CS0433) in HaSampleNodeManager. Follow CustomTestTarget like the libraries so the app's framework references match the TFM being built (net8 app + net8 types => only the polyfill Lock). Validated: RedundantServer builds on net8.0 and net10.0.
Core.Diagnostics.Tests builds net8/net9/net10 and had an unconditional build-ordering ProjectReference to the net10-only McpServer (so its assembly exists for the reflective McpServer tests). On the Linux all-TFM CI legs (CustomTestTarget=net8.0/net9.0) the solution build forced McpServer to the leg TFM, which it cannot target (its own deps build net8/net9), failing the UA.slnx build. - Reference McpServer only when building net10 (CustomTestTarget '' or net10.0). - Make both reflective LoadMcpAssembly helpers Assert.Ignore (skip) when the net10 Opc.Ua.Mcp assembly is absent, instead of asserting it exists - so the net8/net9 test legs skip the MCP reflective tests cleanly while net10 still runs them. Pre-existing infra issue (unrelated to the redundancy feature) surfaced once the RedundantServer net8 Lock fix let the Linux all-TFM build progress past net8. Validated: full UA.slnx builds on net8.0/net9.0/net10.0; Core.Diagnostics McpServer tests pass on net10 (13) and skip/pass without failure on net8.
…ailability line, and Kubernetes stub Addresses three unresolved review threads from @marcschier: - Docs/MigrationGuide.md: remove the 'Redundancy API updates in 2.0' section (feature is net-new in 2.0, nothing to migrate from) - Docs/HighAvailability.md: remove the CRDT package TFM-availability sentence - Docs/KubernetesDeployment.md: remove consolidation stub (content lives in HighAvailabilityKubernetes.md)
# Conflicts: # Tests/Opc.Ua.Core.Diagnostics.Tests/Opc.Ua.Core.Diagnostics.Tests.csproj
|
|
||
| ## 6.6.2.4.5 Non-transparent failover modes and client actions | ||
|
|
||
| `RedundantManagedClient` implements the Table 107 client patterns over `ManagedSession` instances discovered from `RedundantServerArray`/`ServerUriArray` and resolved with `IRedundantServerEndpointResolver`. The default resolver calls `FindServers` and `GetEndpoints` from the current endpoint's discovery URLs, chooses matching security policy/mode and URL scheme when possible, and caches the result. |
There was a problem hiding this comment.
Clients should also by themselves allow to run in HA/replica set (hot, warm, cold standby) using a leader and followers whereby 2+ clients share authentication token via Crdt then activate or recreate on the leader, transfer subs to leader. Shared abstractions go into Opc.Ua.Core package under "Redundancy".
There was a problem hiding this comment.
Captured the client replica-set design as a planned extension in HighAvailability.md (6a287f0): leader/follower clients, AuthenticationToken shared via CRDT/shared store, subscription transfer to the promoted leader, and shared abstractions under an Opc.Ua.Core "Redundancy" namespace - reusing the server-side leader-election / ISharedKeyValueStore / IRecordProtector seams and WithTokenReuseFailover already in this PR. The full client-clustering implementation is a sizable follow-up (its own design + tests), so I am leaving this thread open to track it rather than resolving it.
…lake test-ubuntu-Server.Redundancy.Crdt failed once (ConcurrentValueWritesConvergeAsync) when the in-memory gossip background loops were CPU-starved under the full 60-job CI matrix. Convergence is normally sub-second (local run 30/30 in 453ms); the wider deadline measures correctness, not runner load. Applied to both AssertEventuallyAsync helpers in the project.
…load Root cause: in CrdtAddressSpaceSynchronizer a value entry that won the LWW merge could fail to materialize on the node. When a local value capture and a remote value apply race, the capture advances m_lastApplied, so ComputeDiffs no longer emits a correcting diff and the apply (computed earlier, outside the lock) writes a now-stale value the node keeps forever - the map converges but one replica's materialized node value is permanently stale (observed as A=22, B=11 on a loaded ubuntu CI runner; reproduced locally at ~1/120 under 24-40 parallel test processes). Fix: - ApplyDiffAsync(value) re-reads the authoritative map value under the lock instead of using the (possibly stale) precomputed diff.Value, and keeps m_lastApplied consistent. - Topology apply preserves the authoritative value entry from the map rather than a racy live-node read. - ApplyInboundFrameAsync reconciles materialized node values against the converged map at gossip quiescence (zero-diff frames) and re-broadcasts on a correction, so anti-entropy heals any residual races and terminates once all replicas match (LWW idempotent). Verified: 0 failures over 1520 stress runs (24-40 parallel) in WSL Linux (pre-fix ~1/120); full Crdt suite 30/30 on Windows net10; library builds clean (0 warnings) on net10 and net48.
…d docker-compose; document client HA Addresses PR review feedback on the redundancy samples and docs: - RedundantClient now builds a single ManagedSession with WithServerRedundancy() that connects to any server and fails over transparently - the same code works whether or not the server is configured for redundancy (the client need not know the server topology before connecting). Removes the bespoke ClientFailoverMode enum and SeededRedundantServerEndpointResolver (peers come from the connected server, not a hand-maintained seed list / client-selected mode). - RedundantServer gains an HA_HOST override so the advertised endpoint URL is reachable across containers/hosts (defaults to localhost). - Adds Dockerfiles for both samples and docker-compose.active-active.yml (real cross-container CRDT-gossip HA) and docker-compose.active-passive.yml (leader-election wiring demo), referenced from the RedundantServer README. - Documents the planned client-side high-availability (replica set) design in HighAvailability.md and refreshes the Samples section. Both samples build clean (0 warnings) on net10.0.
Move the shared abstractions and portable implementations from Opc.Ua.Server.Redundancy into Opc.Ua.Core so client and server redundancy share one set of seams: ISharedKeyValueStore (+KeyValueChange), IRecordProtector (+Null/AesCbcHmac/KeyRing), ILeaderElection (+Static/SharedStoreLease), InMemorySharedKeyValueStore. Server.Redundancy/.Crdt/.K8s, RedundantServer, and tests now consume them via 'using Opc.Ua.Redundancy;'. Server-specific helpers (RecordProtectionGuard, service-level, peers) stay. Foundation for client-side replica sets. All redundancy projects build on net10/net48; Redundancy.Tests 178/178.
…rary New library with ClientReplicaCoordinator: leader election (ILeaderElection) selects one active client holding the ManagedSession; ClientStandbyMode Cold/Warm/Hot governs follower behavior; leader publishes protected session secrets to ISharedKeyValueStore (fail-closed IRecordProtector) for promotion-time token reuse. Builds net48/ns2.0/net10. Wired into UA.slnx.
Add ClientReplicaSetBuilder, Opc.Ua.Client.Redundancy.Tests (5 tests: fail-closed guard, in-memory allowed, missing factory, cold no-connect-before-leader, builder validation), and convert the HighAvailability.md client-HA section from planned to an implemented how-to. Lib + tests build/pass on net48/ns2.0/net10.
Make ManagedSession.ReactivateMirroredSessionAsync public; coordinator loads protected shared session secrets on promotion, applies SessionConfiguration, and reactivates reusing the AuthenticationToken (fast-activate vs HotAndMirrored), falling back to a fresh session on any failure. Client builds net48; tests 5/5.
| private readonly CertificateCollection? m_clientIssuerCertificates; | ||
| private readonly int m_maxHistoryContinuationPoints; | ||
| private readonly SessionSecurityDiagnosticsDataType m_securityDiagnostics; | ||
| private readonly IContinuationPointStore? m_continuationPointStore; |
There was a problem hiding this comment.
the continuation point handling should be a focused class with a simple interface for the Session to handle it, not a set of 1 store 2 lists and 2 dictionaries and many individual method calls.
| private uint m_lifetimeCounter; | ||
| private bool m_waitingForPublish; | ||
| private readonly List<NotificationMessage> m_sentMessages; | ||
| private readonly ISubscriptionRetransmissionStore? m_retransmissionStore; |
There was a problem hiding this comment.
same as for Continuation point. There should be a focused class to handle SentMessages, Retransmission and SequenceNumbers, to keep the Publish / Acknowledge methods clean.
…onnect subfolders
…rename K8s to Opc.Ua.Redundancy.K8s Single server package; K8s renamed; tests/sample/slnx updated. Full UA.slnx builds net10; Server 178, Crdt 30, Client 5 tests pass.
…y point; merge Crdt folders + rename AddressSpace->State Remove RedundantManagedClient/options/factory/IRedundantManagedClientSession + ConnectRedundantAsync; failover folds into ManagedSession (IManagedSession). Merge CrdtAddressSpace->State, CrdtSessions->Sessions; rename AddressSpace->State. Docs/tests updated. UA.slnx builds net10.
…nagedClient removal
Merge Opc.Ua.Redundancy.Server.Crdt.Tests into Opc.Ua.Redundancy.Server.Tests (208 tests, 39s, well under 20min); remove the empty Opc.Ua.Redundancy.Server.Integration.Tests project (no integration tests exist); rename Opc.Ua.Client.Redundancy.Tests to Opc.Ua.Redundancy.Client.Tests. Full UA.slnx builds net10; Server 208, Client 6 pass.
New Opc.Ua.Redundancy library hosts the canonical ByteStringCrdtSerializer + CrdtSharedKeyValueStore (namespace Opc.Ua.Redundancy, shared with the Core seams). Opc.Ua.Redundancy.Server and .Client both link it; their duplicate CRDT copies removed. Adds AddCrdtClientSharedStore DI extension in Opc.Ua.Redundancy.Client (server CRDT DI already via ReplicatedServerBuilderExtensions). Full UA.slnx builds net10+net48; Server 208, Client 6 tests pass.
Summary
This branch implements OPC 10000-4 §6.6 Redundancy end to end — server and client, transparent and non-transparent, across all failover modes (
None,Cold,Warm,Hot,HotAndMirrored,Transparent) — and layers an opt-in distributed high-availability (HA) capability on top so replicas can share address-space, session, and subscription state and expose redundancy to clients through the documented OPC UA mechanisms (for example, running a server replica set across nodes in Kubernetes).The single-instance, in-memory path stays zero-overhead: every distributed feature is opt-in through dependency injection, with a direct-construction fallback, and replaced APIs are kept and marked
[Obsolete].New packages
OPCFoundation.NetStandard.Opc.Ua.Server.RedundancyOPCFoundation.NetStandard.Opc.Ua.Server.Redundancy.CrdtOPCFoundation.NetStandard.Opc.Ua.Server.Redundancy.K8sAlso adds the worked samples
Applications/RedundantServerandApplications/RedundantClient, and the guidesDocs/HighAvailability.mdandDocs/HighAvailabilityKubernetes.md.Server side (§6.6)
AddServerRedundancy(...)populates the liveServer.ServerRedundancynodes (RedundancySupportplusRedundantServerArray/ServerUriArray/CurrentServerId) and drivesServer.ServiceLevelfrom anIServiceLevelProvider(leader high, standby low).Server.ServerRedundancyis typed to the NonTransparent or Transparent subtype according to the configured mode.AddRequestServerStateChange(...)implements the §6.6.5 manual-failover method, withEstimatedReturnTimeand the Maintenance / NoDataServiceLevelsub-ranges.FindServers(ConfiguredRedundantServerSetProvider) and advertise theNTRSdiscovery capability for GDS/NTRS registration.EventIdsynchronization for Transparent and HotAndMirrored sets (§6.6.2.2), excluding per-replica fields so de-duplicating clients do not double-process events.Client side (§6.6)
DefaultServerRedundancyHandlerreadsServer.ServerRedundancyandServer.ServiceLeveland fails over to the highest-ServiceLevel peer. It honors Maintenance (Part 4 Table 105 — disconnect to an available peer), keeps Warm backups disabled until failover (Table 107), and treats peers known only throughServerUriArrayas failover candidates.RedundantManagedClientimplements the Cold, Warm, Hot (a), Hot (b), and HotAndMirrored client patterns: one active session, optional lightweight ServiceLevel status-check sessions to backups, and — for Hot (b) — merging of concurrent reporting streams with exact value-identity de-duplication. HotAndMirrored failover re-activates the mirrored session with the existingAuthenticationTokeninstead of recreating it.Distributed state (opt-in; an extension beyond §6.6)
ISharedKeyValueStore(in-memory default) andINodeStateStore, with anAddressSpaceSynchronizerthat bridges aCustomNodeManager2's predefined nodes to the shared store, so node and reference additions/removals and value changes propagate to other replicas.IDistributedValueCachelets read/write callbacks cache the last value with a freshness bound; monitored items use the normal read pipeline and therefore participate only when the read path does.DistributedSessionManager(wired through a new additiveISessionManagerFactoryseam onStandardServer) mirrors encrypted session state across replicas. On a failover reconnect the standby restores the session and performs the fullActivateSessionclient-certificate signature validation against a single-useserverNonceconsumed via compare-and-swap across the replica set, enforcing the same SecurityPolicy and SecurityMode. TheAuthenticationTokenis a lookup key only, never an authenticator: entries are keyed by the SHA-256 digest of the token, and a cross-replica restore emits a distinct audit event.IRecordProtector; secret-bearing mirrors require a registered protector or an explicit opt-out.Active/active (CRDT)
Opc.Ua.Server.Redundancy.Crdtadds leaderless multi-writer address-space replication with CRDTs and gossip (UseReplicatedAddressSpace) plus CRDT-backed session metadata (UseReplicatedSessions). Networked gossip is fail-closed without authenticated transport (mutual TLS for TCP; an explicit opt-out for isolated dev/test). CRDT state is eventually consistent, so exactly-once decisions (such as the single-use nonce registry) stay on a strongly consistent store. The package is available on all stack target frameworks.Kubernetes
Opc.Ua.Server.Redundancy.K8sprovides Kubernetes Lease leader election, EndpointSlice-based peer discovery, and ServiceLevel-driven readiness for running an OPC UA server replica set.Docs/HighAvailabilityKubernetes.mdcovers StatefulSet/Deployment and Service manifests, RBAC, probes, time synchronization, secrets, gossip-port NetworkPolicy, and GDS/NTRS registration.Compatibility and validation
ISessionManagerFactoryandIServerStartupTask); replaced APIs marked[Obsolete]for migration.net472,net48,netstandard2.0,netstandard2.1,net8.0,net9.0,net10.0); NativeAOT-compatible where the runtime supports it.See
Docs/HighAvailability.mdfor the full OPC 10000-4 §6.6 mapping, theAdd*(standard nodes) versusUse*(extension) builder convention, and the security/rotation guidance.