Skip to content

Import/4.0.0#15

Merged
lupodevelop merged 68 commits into
mainfrom
import/4.0.0
May 13, 2026
Merged

Import/4.0.0#15
lupodevelop merged 68 commits into
mainfrom
import/4.0.0

Conversation

@lupodevelop
Copy link
Copy Markdown
Owner

This pull request introduces several improvements and additions to the project, focusing on developer experience, documentation, and test infrastructure.

NEW VERSION: 4.0.0 - BREAKING CHANGES

Add `/notes` to .gitignore to prevent the notes directory from being tracked by Git.
Introduce a budgeted, race-safe atom creation pipeline to prevent atom-table exhaustion (ETS+atomics-backed counter) and provide budget-aware helpers (to_node_atom_safe/to_cookie_atom_safe, atom_budget_reset). Add validation and utility functions: is_valid_registry_name, clamp_timeout, binary_copy, exit_shutdown, subject_tag_matches_name, monotonic_ms, and refinements to create_subject/encode_subject. Tighten decode_subject_safe to only catch documented badarg errors and reject non-standard subject layouts. Overall this hardens FFI boundaries and prevents crashes from unbounded atom creation or pinned sub-binaries.
Change registry_ffi to return tagged tuples suited for Gleam FFI and add a local-ownership check for unregister.

- Limit exports to register/2, unregister/1, whereis/1.
- register/2: validate Pid; returns {ok, nil} on success, {error, already_exists} when global registration fails, and {error, invalid_process} for non-Pid input.
- unregister/1: check global:whereis_name/1 first; if undefined return {error, not_found}; if the registered PID's node matches the local node, call global:unregister_name/1 and return {ok, nil}; otherwise return {error, not_owned} to prevent remote-unregister attacks.
- whereis/1: return {error, nil} when not found, {ok, Pid} when found.
- Remove several helper/FFI classification functions and associated exports.

Rationale: Avoid atom conversion/atom table exhaustion, and prevent a "registry wipe" attack by ensuring only names owned by this VM can be unregistered. An ETS-based ACL was avoided due to staleness and memory-leak concerns; :global is used as the single source of truth.
Rework cluster_ffi.erl to use safe atom conversion for node and cookie inputs and to surface failures as typed return tuples consumable by Gleam. start_node/2 now validates name and cookie via to_node_atom_safe/to_cookie_atom_safe, classifies network vs start errors, and emits telemetry when atom-budget is exhausted. Added node_type_for/1, is_network_reason/1, is_alive/0, monitor_nodes/1 and decode_node_event/1; connect/1 and ping/1 were changed to return tagged {ok, nil} / {error, ...} values and to emit telemetry on atom-budget exhaustion. Updated imports and removed older unsafe atom helpers for clearer error handling and improved observability.
Introduce src/config_ffi.erl to store runtime configuration in persistent_term for O(1) reads. Exposes put_config/get_config/reset_config plus separate slots and setters for atom budget and conflict-resolver timeout (put_atom_budget/1, put_conflict_resolver_timeout/1, get_conflict_resolver_timeout/0) to avoid decoding Gleam tuples in FFI helpers. Provides a default conflict timeout of 1000 ms and a reset_config/0 helper that clears persistent_term entries and calls distribute_ffi_utils:atom_budget_reset() for tests.
Introduce test/cluster_monitor_test_ffi.erl with helper functions to simulate node events and malformed messages for the cluster_monitor actor. Exports send_nodedown_event/2 and send_nodeup_event/2 which convert a binary node name to an atom and send {nodedown, Node} / {nodeup, Node} to a PID, plus send_garbage_term/1 to deliver an unexpected term. Enables testing the actor without running a distributed node.
Introduce telemetry_ffi.erl: an ETS-backed telemetry sink store with lock-free reads. Implements install/1, reset/0 and emit/1; ensure_table/0 lazily initializes a named ETS table with {read_concurrency, true}. install stores a single arity-1 sink (last-wins), reset removes it, and emit performs an O(1 lookup and invokes the sink inline. Sink exceptions are caught and logged via logger:warning (logging only the event tag to avoid leaking payloads). Chooses ETS over persistent_term to avoid triggering global GC on installs.
Change try_reserve_atom/0 to return {ok, Ref} on success or {error, atom_budget_exceeded} when the atom budget is exhausted instead of a boolean. Update budget_aware_atom_create/1 to use the typed result: reserve before creating an atom, catch error:system_limit from binary_to_atom/2 to decrement the reserved counter and return atom_budget_exceeded, and avoid swallowing unrelated exceptions. Tighten comments around which exceptions are caught. Also update specs for to_node_atom_safe/1 and to_cookie_atom_safe/1 to include the new {error, atom_budget_exceeded} return case. These changes make atom-table saturation degrade to a typed refusal and ensure the budget counter stays consistent on failure.
Introduce test/conflict_test_ffi.erl containing helpers that mirror conflict_ffi:safe_invoke/5 with a tightened timeout for tests. Exports helpers to invoke the wrapped resolver (including a KillBoth variant), provide the local node name as a binary, and supply a bogus non-Pid term for negative tests. The file replicates the private safe_invoke logic (spawn/monitor, translate outcomes, emit telemetry, handle crashes/timeouts and run fallback) so tests can exercise timeout/crash behavior without modifying the production module.
Introduce conflict_ffi.erl to provide a safe wrapper for Gleam/Erlang conflict resolvers passed to :global.register_name/3. It runs user resolvers in a short-lived spawned worker with a configurable timeout (config_ffi:get_conflict_resolver_timeout), emits telemetry (conflict_resolved and conflict_resolver_failed), and applies a deterministic fallback (lowest term-ordered pid) if the resolver crashes or times out. Exposes register_with_resolver/3 and default_resolver_timeout_ms/0 and includes helper logic to translate outcomes, monitor/drain worker replies, and format failure reasons.
Introduce a tiny pure Erlang module with helpers for distribute/conflict resolvers. Keeps scheduling-sensitive logic out of conflict_ffi by providing pid_node_is_local/1 (checks if a pid belongs to the local node) and pid_node_name/1 (returns the pid's node as a UTF-8 binary or empty binary). Functions are simple and unit-testable.
Rework the distribute facade: bump version to 4.0.0 and add a comprehensive, documented API surface. Introduces configuration APIs (configure/get_config and Config types), cluster monitor functions (start_monitor, subscribe, unsubscribe), and richer cluster helpers (start_node, is_distributed, has_peers, health) with error-to-string helpers. Adds many type aliases and re-exports (RegisterError, CallError, DecodeError, etc.), extends the actor lifecycle surface (start_actor/_with_timeout/_observed, start_registered/_with_timeout, start_supervised/_with_timeout, pool/_with_timeout, child_spec variants), and improves messaging primitives (reply, receive/_with_timeout, call/_with_timeout, call_isolated/_with_timeout). Also adjusts imports and changes send/receive error types (SendError), plus lots of docs and rationale for safe cross-node messaging.
Import the telemetry module and expose simple telemetry types and an install function. Adds TelemetryEvent and TelemetrySink type aliases and install_telemetry/1 which installs (or replaces) the global telemetry sink for library events (registry, atom budget, payload limits, codec failures, timeouts). See distribute/telemetry for full event semantics.
Change the catch clause in start_node/2 to pattern-match error:CaughtReason:CaughtStack and include both the error reason and stack in the formatted R2 message (using "~p:~p"). This provides richer diagnostics for start failures while preserving the existing {error, {start_failed, R2}} return structure.
Replace the indefinite receive in await_worker_down with a bounded teardown: first check erlang:is_process_alive/1, demonitor and return immediately if the worker is dead; otherwise wait up to 50ms for the 'DOWN' message, then demonitor(flush) and return. This prevents hanging while waiting for monitor delivery. Also update the safe_invoke comment to document the bounded timeout teardown behavior.
Change Gleam wire codecs to be safer and stricter: switch string/bitarray/list/subject length prefixes from 16-bit to 32-bit, add runtime checks for 32-bit length overflow, and add signed-64 bounds for Int encoding to avoid silent truncation. Enforce strict top-level decoding (reject trailing bytes) via to_decoder and make nil_decoder fail on trailing data. Add DecodeError variants for PayloadTooLarge and ListTooLong and enforce a per-decode element cap (from config) to mitigate billion-laughs attacks. Use a binary_copy FFI to materialise slices to avoid retaining large parent binaries, and change list encoding to build element chunks (avoid quadratic concatenation). Misc: small comment/typo cleanups in Erlang helper files.
Improve named actor lifecycle handling and resource cleanup.

- Add two-phase terminate_orphan_gracefully that exits, waits a grace period, then escalates to kill; emits telemetry on escalation and uses an FFI exit_shutdown helper.
- Introduce init_timeout_ms params across start/registered/supervised/pool APIs and provide default_* helpers that read from config.
- Add start_observed and start_registered_observed to accept a decode-error callback for malformed messages.
- Replace receiver.Next with receiver.HandlerStep in handler signatures.
- Make child_spec unregister/registration behavior robust: kill orphan on register failure and add StartRegisteredError + formatter to distinguish start vs register failures.
- Revamp pool: validate size >= 1, build child specs with index_list, and document cascading failure risk with :global.
- Add start_resource_owner and helpers (await_lifetime_death, resource_owner_poll_ms) as a pattern to own external resources and run cleanup when a lifetime PID dies (workaround for missing terminate callbacks).
- Misc: imports for codec/config/telemetry, expanded docs and comments throughout.
Introduce a new typed cluster monitor actor (src/distribute/cluster_monitor.gleam) that wraps Erlang's net_kernel:monitor_nodes. Provides a safe Message protocol and ClusterEvent type, plus public APIs: start, start_observed (with an unknown-message hook), subscribe and unsubscribe. Uses a selector to classify raw mailbox terms via an FFI decode_node_event and to rewrite DOWN messages for O(1) subscriber pruning using process.monitor; deduplicates subscriptions and proactively demonitors on unsubscribe. Adds FFI bindings for monitor_nodes and node-event decoding and implements resilient broadcast of NodeUp/NodeDown events to registered subscribers.
Make composite codecs stricter and safer: enforce that Option(None) is exactly a single 0 byte (reject trailing bytes), add 32-bit length checks to tuple2/tuple3 encoders to avoid silent truncation >4GiB, and validate that inner decoders consume their entire length-prefixed slices via a new reject_inner_trailing helper (surface trailing bytes as InvalidBinary). Also add a small result(codec) constructor and improve docs/comments explaining the protections against byte-smuggling and length-prefix corruption.
Enforce unsigned 32-bit bounds for tag length and version to avoid silent wrapping and defend against absurd inputs (adds unsigned_32_max and checks that return ValueTooLarge). Introduce sized_decoder that validates tag and version, returns remaining bytes for use in composite codecs, and surfaces TagMismatch/VersionMismatch/InsufficientData/InvalidBinary errors. Also add a codec() helper to bundle encoder, decoder, and sized_decoder.
Refactor global subject messaging to add safety checks, observability, and mailbox-safety primitives. Key changes: add config/telemetry checks and size-limits (send/reply/call now fail with PayloadTooLarge), emit telemetry for rejected payloads and decode failures, clamp negative timeouts, and use process.new_subject for subject creation. Introduce unsafe_from_name (marked @internal) and document its risk, expand CallError/SendError variants, and implement monitor-backed call flow with bounded reply draining. Add call_isolated and default helpers to avoid orphaned reply accumulation in long-running callers and other small helpers (monotonic_ms, clamp_timeout, drain_reply).
Introduce src/distribute/telemetry.gleam: a single-event observability sink for the distribute library. Adds the Event sum type with variants for registry lifecycle, atom-budget exhaustion, payload rejection, decode failures, call lifecycle, orphan escalation, and conflict resolution, plus origin enums (AtomBudgetOrigin, PayloadOrigin, DecodeOrigin) and EventSink. Provides public install/reset/emit functions with small FFI bindings (install_ffi/reset_ffi/emit_ffi) and extensive docs describing semantics (inline execution, last-wins install, failure handling, and forward-compatibility guidance) and intended downstream usage.
Introduce src/distribute/conflict.gleam to provide conflict-resolution primitives for :global registrations during split-brain healing. Adds ConflictOutcome and Resolver types and built-in pure resolvers: lowest_pid_wins, highest_pid_wins, keep_local, kill_both, and node_priority. Includes Erlang FFI bindings for PID comparisons and node inspection and helper logic for node ranking; documentation explains timeout/fallback behavior and operational guidance (telemetry, safety, and data-loss considerations).
Add payload-size enforcement, telemetry, and mailbox protections to typed receivers and distributed workers. Introduces decode_checked and decode_observed helpers that reject oversized payloads (config.get().max_payload_size_bytes), emit telemetry on rejection/failure, and call an optional on_decode_error hook used by observed start functions. Non-Subject mailbox terms are mapped to Garbage via select_other to avoid growing selective-receive penalties. API changes: Next -> HandlerStep, Timeout -> ReceiveTimeout, receive_typed now clamps timeouts via an FFI clamp_timeout, start_receiver delegates to start_receiver_observed, and start_distributed_worker now has an observed variant and accepts init_timeout_ms. Also adds receive_error_to_string and updates selector-based handlers to use the new decoding/telemetry flow to protect from OOM/DoS and improve observability.
Replace untyped dynamic FFI usage with typed Result bindings, add name validation and monotonic timing, and improve safety/observability around global registrations.

Key changes:
- Replace dynamic.Dynamic FFI bindings with typed Result signatures and add new FFI helpers (monotonic_ms, subject_tag_matches_name, is_valid_registry_name_ffi, register_with_resolver_ffi).
- Enforce registry name charset/length via FFI and centralize validation (1..255 bytes, [a-zA-Z0-9._-]).
- Introduce richer error types: UnregisterError and LookupError and helper stringifiers for register/unregister errors.
- Emit telemetry for registration/unregistration outcomes and resolver failures/successes.
- Validate subject/tag invariants when registering subjects/global subjects and return clear InvalidArgument errors instead of silently breaking.
- Add register_global_with_resolver to support custom split-brain conflict resolvers with operational guard rails and telemetry.
- Rework unregister to return typed outcomes and emit telemetry; add unregister_typed helper.
- Improve lookup semantics: whereis now returns a typed Result, lookup reconstructs GlobalSubject via unsafe_from_name, lookup_with_timeout uses monotonic clock and precise sleep clamping, and a new non-blocking lookup_async spawns a monitored poller that replies to a provided subject.
- Remove various ad-hoc dynamic conversions and string-based checks in favor of typed, documented behaviour.

These changes harden the registry API against split-brain pitfalls, make failures more actionable, and provide async polling for cross-node propagation windows.
Add tests covering result codecs, strict Option decoding, and rejection of smuggled inner bytes for tuple decoders. Imports gleam/bit_array to construct forged frames and adds: result_bundled_codec_test, result_bundled_sized_test, option_decoder_rejects_trailing_after_none_test, option_decoder_accepts_clean_none_test, tuple2_rejects_smuggled_inner_bytes_test, and tuple3_rejects_smuggled_inner_bytes_test. These ensure decoders do not accept trailing or injected bytes and enforce strict sizing semantics.
Introduce a new test suite (test/conflict_test.gleam) covering built-in conflict resolvers and the FFI shim that protects :global, including telemetry checks for timeouts, panics, and worker crashes. Update test/distribute_test.gleam to expect version 4.0.0 and add a manifest-matching version test, node-start validation, facade helpers (named, start_registered variants), and other API call updates (codec shorthand, receive_with_timeout, start_actor_with_timeout). Replace the local unique_id helper with test_helpers.unique_id and adjust imports and extern FFI declarations as needed.
Update test/global_test.gleam: add import for distribute/config and a large suite of tests covering edge cases for GlobalSubject behavior. New tests include payload-size enforcement for send/receive/reply/call, encoder/decoder failure handling, call target-down behavior, drain_reply correctness (clears multiple late messages, no-op on empty, bounded under flood), receive_default honoring configured timeouts, and from_pid Nil-tag sharing. Also a small comment tweak. These tests verify robustness around timeouts, oversized payloads, late replies, and configuration usage.
Introduce test/multinode_peer_test.gleam: a real peer-node end-to-end smoke test that verifies cross-node actor messaging. The test starts a local distributed node, registers a test actor, launches a separate OTP peer via Erlang FFI (start_peer/stop_peer), checks node connectivity, encodes and sends an integer payload from the peer to the registered actor, verifies receipt, and performs cleanup. Includes FFI stubs for peer control and a helper to ensure the local cluster is started. This provides a minimal two-node integration check without full cluster orchestration.
Introduce a minimal peer-node helper module for automated multinode tests. Provides start_peer/1 to spawn a peer node (configures name and cookie, connects back to the origin), stop_peer/1 to stop it, peer_connected_nodes/1 to list connected nodes as binaries, and send_to_actor/3 to locate a global actor and send an encoded message. Includes basic error handling and returns binary error reasons for use by FFI tests.
Introduce test/multinode_real_ffi.erl: a test-only peer driver to spawn real BEAM peer nodes and exercise cross-node semantics in the test suite. Provides APIs to ensure distribution (ensure_distribution), start/stop peers (start_peer/stop_peer), perform peer RPCs to register/resolve global names (peer_call_register_global, peer_call_whereis_global), force peer-to-peer connections (connect_peers), and trigger :global.sync on peers (sync_global). Modeled on dev/peer_ffi.erl, it configures cookies, connects peers back to the origin node, and wraps RPC error handling for robust test usage.
Introduce test/multinode_real_test.gleam to run real cross-node BEAM integration tests. The file adds FFI bindings to start/stop peers, connect peers, register/whereis global names, and force :global sync. It implements two tests: Z2 verifies registry.lookup_with_timeout resolves within a time window after a remote registration, and Z3 races register_global on two peers to ensure exactly one winner. Tests are mandatory and fail hard if distribution prerequisites (e.g., epmd) are not available.
Add test/multinode_test.gleam to simulate cross-node message delivery by sending raw encoded binaries directly to actor Subjects. Includes tests for integer decoding, stateful accumulation across multiple sends, registry lookup+send for reconstructed subjects, and ensuring malformed binaries are dropped without crashing actors. Uses helper functions for unique IDs and exercises codec/actor/registry/global paths to validate decode+dispatch end-to-end without requiring a real peer/epmd.
Expand receiver test suite with many new cases: observed error-hook tests, payload size enforcement (reject/accept) for local and distributed receivers/selectors, garbage-drain/resilience tests (including a Z5 flood test), and compile-time type-rename checks. Replace the ad-hoc unique_id with test_helpers.unique_id and add an FFI helper inject_garbage_term; add config resets and explicit timeouts where needed. Also update a couple of start_distributed_worker calls to pass the timeout argument.
Replace ad-hoc unique_id with test_helpers.unique_id and remove the local unique_id/erlang externals. Switch many tests to construct globals via global.unsafe_from_name (ensuring subject tag == registry name) and relax unregister assertions to let _ = registry.unregister(...).

Add a large set of new tests covering lookup_with_timeout (immediate, not found, delayed), async lookup variants, unregister ACL/idempotency behavior, parameter validation for timeouts/poll intervals, and register_global/register_typed tag-mismatch checks. Comments explain rationale (fail-fast validation and protection against cross-node unregister attacks) and improve test coverage for error cases and timing behavior.
Expand tagged tests to enforce unsigned 32-bit version bounds and tag-size handling. Adds tests that verify roundtrips for max (2^32-1) and zero versions, that negative and one-past-2^32 versions produce ValueTooLarge errors, and a sanity check for tag-length boundary handling (empty tag roundtrip). Aims to prevent silent wrapping and cross-node decoding corruption.
Add comprehensive tests for the distribute telemetry sink (test/telemetry_test.gleam). The new file installs and exercises a telemetry sink and asserts emitted events for sink install/replace semantics, sink panics, registry lifecycle (registered, registration failed, unregistered, tag mismatches), atom-budget exhaustion (start_node, ping), payload rejection, and call lifecycle events (timeouts, clamped timeouts, isolated proxy crashes, and target down). Includes helpers and small FFI stubs to control test conditions and silences the logger during deliberate panic/crash scenarios. Each test resets telemetry/config to keep cases isolated.
Introduce test/telemetry_test_ffi.erl providing small Erlang helpers for telemetry tests: spawn_short_lived/0 to produce an immediately-dead pid, exit_self_shutdown/0 to synchronously exit the caller with reason 'shutdown' (used to simulate proxy crashes without SASL noise), and silence_logger/0 / restore_logger/1 to suppress and restore primary logger levels during tests. These helpers make tests more robust and keep test output clean when intentionally triggering failure branches.
Introduce test/test_helpers.gleam which provides a unique_id() helper for tests. It wraps Erlang's unique_integer and integer_to_binary to produce a positive string ID for test isolation to avoid name collisions.
Introduce an interactive development script (dev/distribute_dev.gleam) to run an end-to-end multi-node smoke test of the distribute stack: starts a distributed node, registers a typed actor (counter), spawns a peer BEAM node, sends encoded messages from the peer (or falls back to a local simulation), and verifies state updates. Also add an Erlang FFI helper (dev/peer_ffi.erl) that manages a peer node lifecycle, connects it to the main node (net_kernel:connect_node), exposes functions to query connected nodes and send messages/PIDs from the peer, and ensures proper synchronization of :global lookups. These utilities simplify manual testing of inter-node messaging and distribution setup (requires epmd and OTP 25+).
Introduce comprehensive docs for the distribute library v4.0.0. Adds README, quickstart, getting_started, actors_and_registry, messaging, codecs_and_types, recipes, migration_3_to_4, and safety_and_limits to document the facade API, codecs and wire-format changes, migration guidance from v3, usage recipes, and operational/limits guidance.
Introduce a ROADMAP describing tentative release sequencing and priorities for the v4.x line. Documents v4.0.0 status, a v4.0.1 bug-fix target (fix for registry.lookup_async spinning on peer drop), planned v4.1.0 ergonomics and backend-abstraction refactor, and v4.2.0 plan to offer an opt-in `syn` backend as a sibling package; also notes possible future directions. Emphasizes SemVer guarantees and that schedules are tentative.
Update package version to 4.0.0 and add Gleam tool constraint (>= 1.16.0). Raise the gleam_stdlib dependency constraint to >= 1.0.0 in gleam.toml. Update manifest.toml to use gleam_stdlib 1.0.0 and gleeunit 1.10.0 (with updated checksums) and adjust the requirements to match the new gleam_stdlib constraint.
Publish v4.0.0 notes and refresh README. Adds a comprehensive CHANGELOG entry describing the v4 production release: breaking changes (32-bit wire prefix, API renames/removed facade items), security/hardening (payload caps, atom-budget, strict codec checks), telemetry and sink isolation, custom conflict resolvers with safe FFI shim, many bug fixes, performance improvements, new tests (Z-suite), and migration/documentation updates. README rewritten with a short 30‑second taste, updated docs links, development commands, and license mention.
Files from main relied on the Erlang :telemetry library which is not
a project dependency. Remove internal/telemetry.gleam, the duplicate
cluster/monitor.gleam, and its test. Keep codec/variant.gleam and its
tests which have no external dependencies.
The merge from main introduced telemetry as a runtime dep in gleam.toml
but the package was never resolved in manifest.toml, causing a crash at
test startup. Removed from both files; no code in the branch uses it.
- Remove telemetry from gleam.toml and manifest.toml (added by main
  merge, never resolved, caused runtime crash at test startup)
- Z2/Z3 multinode tests now skip silently when distribution is
  unavailable instead of failing hard
- Remove verbose io.println output from test skip paths
…test

- multinode_real_test: skip Z2/Z3 silently when distribution unavailable
- multinode_peer_test: skip roundtrip test when distribution unavailable
- codec/variant_error_test: use 32-bit length prefix to match codec spec
- Remove telemetry dep from gleam.toml/manifest.toml (main merge artifact)
@lupodevelop lupodevelop merged commit dcfb822 into main May 13, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant