Commit aa2afbc

committed
chore: save session state [skip ci]
1 parent 7d70a97 commit aa2afbc

2 files changed

Lines changed: 86 additions & 129 deletions

STATE.md

Lines changed: 28 additions & 0 deletions
@@ -108,6 +108,34 @@ CARGO_BUILD_JOBS=2 cargo clippy
108108

109109
---
110110

111+
## Core Pillars (Non-Negotiable Design Decision)
112+
113+
Every module in hyperi-rustlib MUST auto-integrate with the core infrastructure
114+
pillars using the **global singleton pattern**. Services get observability for
115+
free — no handles passed, no opt-in, no extra code in downstream apps.
116+
117+
| Pillar | Singleton | Module | Pattern |
118+
|--------|-----------|--------|---------|
119+
| Config | `OnceLock<Config>` | `config` | `T::from_cascade()` reads from global figment |
120+
| Logging | Global `tracing` subscriber | `logger` | `tracing::info!()` macros — always available |
121+
| Metrics | Global `metrics` recorder | `metrics` | `metrics::counter!()` macros — no-op if no recorder |
122+
| Tracing | Global OTel subscriber | `otel` | Trace context auto-propagated (planned: always-on) |
123+
| Health | Global `HealthState` | `http-server` | Unified readiness flag (planned: auto-wired) |
124+
| Shutdown | `CancellationToken` | (planned) | Unified graceful shutdown (planned: auto-wired) |
125+
126+
**Rule:** When adding ANY new module or feature to rustlib:
127+
1. If it has configurable behaviour → load from cascade via `from_cascade()`
128+
2. If it does I/O or processing → add `#[cfg(feature = "metrics")]` counters/gauges/histograms
129+
3. If it can fail or has interesting state → add `tracing::` log calls
130+
4. If it affects service health → report into unified `HealthState`
131+
132+
**The goal:** A DFE app that does `MetricsManager::new("dfe_loader")` +
133+
`logger::setup_default()` + `config::setup()` at startup gets full
134+
observability across every rustlib feature it uses — transport, tiered-sink,
135+
spool, cache, secrets, HTTP client, DLQ — with zero additional wiring.
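
A minimal sketch of the global singleton pattern this section describes, using only the standard library. The `Config` fields and the `setup`, `get`, and `module_log_level` functions are hypothetical stand-ins chosen for illustration; the actual hyperi-rustlib API (`config::setup()`, `T::from_cascade()`) is not shown in this diff and may differ.

```rust
use std::sync::OnceLock;

/// Hypothetical config type; the real crate's `Config` is not shown in this diff.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub app_name: String,
    pub log_level: String,
}

/// Process-wide config, initialised at most once.
static CONFIG: OnceLock<Config> = OnceLock::new();

/// Install the global config. Returns `false` if one was already installed.
pub fn setup(cfg: Config) -> bool {
    CONFIG.set(cfg).is_ok()
}

/// Read the global config from anywhere; `None` until `setup` has run.
pub fn get() -> Option<&'static Config> {
    CONFIG.get()
}

/// A library module reading the singleton without being handed a reference.
/// Falls back to "info" when no config has been installed.
pub fn module_log_level() -> String {
    get()
        .map(|c| c.log_level.clone())
        .unwrap_or_else(|| "info".to_string())
}

fn main() {
    println!("before setup: {}", module_log_level());
    setup(Config {
        app_name: "dfe_loader".into(),
        log_level: "debug".into(),
    });
    println!("after setup: {}", module_log_level());
}
```

The point of the design is that `module_log_level` takes no arguments: a downstream app calls `setup` once at startup and every module picks up the result.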
136+
137+
---
138+
111139
## Decisions
112140

113141
- **Dynamic linking for C deps** — rdkafka, libgit2, zstd, zlib, openssl all link against system libs via pkg-config. Eliminates ~30min C++ build for rdkafka. aws-lc-sys is the one exception (AWS SDK hardcodes it, no opt-out).

TODO.md

Lines changed: 58 additions & 129 deletions
@@ -8,157 +8,86 @@
88

99
## Current Tasks
1010

11-
### Kafka Transport Metrics Parity with gRPC `[NEXT]`
11+
### Core Pillars Implementation `[NEXT]`
12+
13+
Full plan: `docs/superpowers/plans/2026-03-26-core-pillars.md`
14+
15+
**Phase 1: OTel Tracing Auto-Propagation**
16+
- [ ] Auto-initialise OTel layer in logger when `otel` feature + `OTEL_EXPORTER_OTLP_ENDPOINT` set
17+
- [ ] gRPC trace context propagation (tonic interceptors, `traceparent` header)
18+
- [ ] Kafka trace context propagation (message headers)
19+
- [ ] HTTP client trace context injection
20+
- [ ] HTTP server trace context extraction
21+
22+
**Phase 2: Unified HealthState**
23+
- [ ] `src/health/` module with global `HealthRegistry` singleton
24+
- [ ] `HealthComponent` trait — modules register at construction
25+
- [ ] Wire transport, circuit breaker, config reloader into registry
26+
- [ ] `/readyz` aggregates from `HealthRegistry::is_healthy()`
27+
- [ ] `/health/detailed` JSON endpoint with per-component status
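
One way the Phase 2 items could fit together, sketched with the standard library only. The names `HealthComponent`, `HealthRegistry`, and `is_healthy()` come from the plan above; every signature here, plus the `FlagComponent` helper and the `detailed()` method, is an assumption made for illustration, since the module does not exist yet.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical trait: anything that can report its own health.
pub trait HealthComponent: Send + Sync {
    fn name(&self) -> &str;
    fn is_healthy(&self) -> bool;
}

/// Hypothetical registry: a list of components behind a mutex.
#[derive(Default)]
pub struct HealthRegistry {
    components: Mutex<Vec<Arc<dyn HealthComponent>>>,
}

impl HealthRegistry {
    /// The process-wide registry, created on first use.
    pub fn global() -> &'static HealthRegistry {
        static REGISTRY: OnceLock<HealthRegistry> = OnceLock::new();
        REGISTRY.get_or_init(HealthRegistry::default)
    }

    /// Modules would call this once, at construction.
    pub fn register(&self, component: Arc<dyn HealthComponent>) {
        self.components.lock().unwrap().push(component);
    }

    /// Healthy only if every registered component is healthy
    /// (true when nothing is registered).
    pub fn is_healthy(&self) -> bool {
        self.components.lock().unwrap().iter().all(|c| c.is_healthy())
    }

    /// Per-component status, the data a detailed endpoint could serialise.
    pub fn detailed(&self) -> Vec<(String, bool)> {
        self.components
            .lock()
            .unwrap()
            .iter()
            .map(|c| (c.name().to_string(), c.is_healthy()))
            .collect()
    }
}

/// Illustrative component backed by a single atomic flag.
pub struct FlagComponent {
    name: String,
    healthy: AtomicBool,
}

impl FlagComponent {
    pub fn new(name: &str, healthy: bool) -> Arc<Self> {
        Arc::new(Self {
            name: name.to_string(),
            healthy: AtomicBool::new(healthy),
        })
    }

    pub fn set(&self, healthy: bool) {
        self.healthy.store(healthy, Ordering::Release);
    }
}

impl HealthComponent for FlagComponent {
    fn name(&self) -> &str {
        &self.name
    }
    fn is_healthy(&self) -> bool {
        self.healthy.load(Ordering::Acquire)
    }
}

fn main() {
    let transport = FlagComponent::new("transport", true);
    HealthRegistry::global().register(transport.clone());
    println!("ready: {}", HealthRegistry::global().is_healthy());
    transport.set(false);
    println!("ready: {}", HealthRegistry::global().is_healthy());
}
```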
28+
29+
**Phase 3: Unified Graceful Shutdown**
30+
- [ ] `src/shutdown/` module with global `CancellationToken`
31+
- [ ] SIGTERM/SIGINT → `token.cancel()` → all modules drain
32+
- [ ] Wire http-server, tiered-sink, config-reloader, gRPC transport
33+
34+
**Phase 4: New Transports**
35+
- [ ] File transport (NDJSON, wraps existing `NdjsonWriter`)
36+
- [ ] Pipe transport (stdin/stdout, newline-delimited)
37+
- [ ] HTTP transport (POST to endpoint, uses `HttpClient`)
38+
- [ ] Redis/Valkey Streams transport (`XADD`/`XREADGROUP`/`XACK`)
39+
40+
**Phase 5: DLQ Transport Integration**
41+
- [ ] DLQ Kafka backend uses `Box<dyn Transport>` instead of raw producer
42+
- [ ] DLQ can write to any transport (file, HTTP, Redis, Kafka)
43+
44+
**Phase 6: Always-On Defaults**
45+
- [ ] Make config, logger, metrics, health, shutdown default features
46+
- [ ] Downstream dfe-* app remediation (remove boilerplate)
47+
- [ ] Audit hyperi-pylib and write alignment plan
1248

13-
gRPC transport (v1.19.7) auto-emits `dfe_transport_*` metrics. Kafka transport does not — two gaps:
14-
15-
1. **Add metrics instrumentation to `KafkaTransport::send()`** — emit `dfe_transport_sent_total{transport="kafka"}`, `dfe_transport_send_errors_total{transport="kafka"}`, `dfe_transport_backpressured_total{transport="kafka"}`, and send duration histogram, matching gRPC parity.
16-
17-
2. **Wire `StatsContext` into `KafkaTransport`** — currently `new_with_context()` is a stub that ignores the context. The struct uses `DefaultConsumerContext` / plain `FutureProducer`, so `statistics.interval.ms` callbacks (set to 1000ms by all profiles) go to a no-op. Need to either make the struct generic over context (complicates `Transport` trait impl) or use `StatsContext` by default when the `metrics` feature is enabled. This enables `rdkafka_broker_rtt_avg_seconds`, `rdkafka_global_msg_cnt`, `rdkafka_topic_partition_consumer_lag`, etc.
18-
19-
Downstream impact: dfe-fetcher, dfe-receiver, dfe-loader all use `KafkaTransport` and would get these for free once wired.
20-
21-
### Gap Analysis P2 — HTTP Client, Database URLs, Cache
22-
23-
- [ ] HTTP client module with retry middleware (reqwest + reqwest-middleware + reqwest-retry)
24-
- Wrap reqwest with exponential backoff, configurable timeouts
25-
- Auto-register config via `unmarshal_key_registered`
26-
- Metrics integration (request count, duration, errors)
27-
- [ ] Database URL builders (PostgreSQL, ClickHouse, Redis)
28-
- Build connection strings from env vars with standard prefixes
29-
- `SensitiveString` for password fields
30-
- [ ] Cache module with disk/memory backends
31-
- Consolidate secrets cache pattern into reusable module
32-
- TTL, stale-while-revalidate, size bounds
49+
---
3350

3451
### Completed Recent
3552

36-
- [x] **Config registry** (v1.19.3-v1.19.5) — auto-registering reflectable config, `/config` admin endpoint, `SensitiveString`, heuristic redaction, change notification, `ConfigReloader` hook
37-
- [x] **CEL expression profile** (v1.19.2) — `matches()` blocked by default, `ProfileConfig` with per-category overrides, string literal false-positive prevention
38-
- [x] **Config cascade wiring** (v1.19.2) — expression, memory, version_check, scaling, grpc, secrets auto-read from figment cascade
53+
- [x] **Universal metrics instrumentation** (v1.19.8) — tiered-sink, spool, dlq, cache, http-client, secrets all auto-emit Prometheus metrics via global singleton. Core pillar design decision documented in CLAUDE.md.
54+
- [x] **Kafka transport metrics + StatsContext** (v1.19.8) — `KafkaTransport` always uses `StatsContext` for consumer and producer. `dfe_transport_*` metrics on `send()`. `rdkafka_*` metrics auto-emitted. Zero downstream code changes.
55+
- [x] **gRPC transport metrics** (v1.19.7) — `dfe_transport_*` metrics on send/recv. Server push handler uses `try_send` with backpressure status codes.
56+
- [x] **HTTP client module** (v1.19.6) — reqwest + reqwest-middleware + reqwest-retry, exponential backoff, config cascade
57+
- [x] **Database URL builders** (v1.19.6) — PostgreSQL, ClickHouse, Redis/Valkey, MongoDB. Display trait redacts passwords.
58+
- [x] **Cache module** (v1.19.6) — moka-backed concurrent in-memory cache, per-source TTL, source isolation
59+
- [x] **Dependency update** (v1.19.6) — all deps to latest, cargo-audit ignores for transitive advisories
60+
- [x] **Config registry** (v1.19.3-v1.19.5) — auto-registering reflectable config, `/config` admin endpoint, `SensitiveString`, heuristic redaction, change notification
61+
- [x] **CEL expression profile** (v1.19.2) — `matches()` blocked by default, `ProfileConfig` with per-category overrides
62+
- [x] **Config cascade wiring** (v1.19.2) — expression, memory, version_check, scaling, grpc, secrets auto-read from cascade
3963
- [x] **MemoryGuard underflow fix** (v1.19.1) — `fetch_sub` replaced with `fetch_update` + `saturating_sub`
40-
- [x] **Test restructure** (v1.19.1) — `tests/integration/`, `tests/e2e/`, `tests/common/` per testing standard
64+
- [x] **Test restructure** (v1.19.1) — `tests/integration/`, `tests/e2e/`, `tests/common/`
4165
- [x] **hyperi-ci release-merge** — CLI command replaces per-project workflow files
42-
- [x] **Rust edition 2024** — migrated from 2021; `temp-env` replaces unsafe `set_var`/`remove_var` in tests across 6 files
43-
- [x] **async-trait removal** — public traits (`Sink`, `Transport`, `SecretProvider`) now use `fn ... -> impl Future + Send` (Rust 1.75+ native)
44-
- [x] **kafka_config module** — `config_from_file`, 7 named profiles, `merge_with_overrides`; librdkafka settings loaded from config git dir (only cascade exception)
45-
- [x] **File output sink** — `src/io/`, `src/output/`, `output-file` feature
46-
- [x] **CLI module** — CommonArgs, StandardCommand, DfeApp trait (`cli` feature)
47-
- [x] **Top module** — ratatui TUI dashboard, Prometheus parser, oneshot mode (`top` feature)
48-
- [x] **CI gating fix** — Semantic Release now gated on CI success via workflow_run
49-
50-
---
51-
52-
## Completed
53-
54-
- [x] Vector compat integration tests — 6 tests using real Vector binary + VectorCompatClient (fetch-vector.sh + YAML)
55-
- [x] vault_env integration tests fixed — clear_vault_env() prevents VAULT_TOKEN leakage
56-
- [x] Dependency update sweep — all crates to latest, tonic/prost 0.14 migration (v1.8.4)
57-
- [x] Stale hs-rustlib removed from JFrog hypersec-cargo-local and hyperi-cargo-local
58-
- [x] MaskingLayer fixed — writer-based redaction for both JSON and text formats (v1.8.4)
59-
- [x] Logger output capturing tests — 10 tests (JSON, text, filtering, masking)
60-
- [x] Coloured log output — custom FormatEvent with owo-colors colour scheme
61-
- [x] Metrics graceful shutdown tests — 4 tests (shutdown, rapid cycle, render after stop, concurrent)
62-
- [x] gRPC transport integration tests — 8 tests (send/recv, ordering, large payload, compression)
63-
- [x] gRPC transport with Vector wire protocol compatibility (v1.8.0)
64-
- tonic-based gRPC replacing Zenoh transport
65-
- DFE native proto (`dfe.transport.v1`) + vendored Vector proto
66-
- Vector compat source/sink for migration from Vector pipelines
67-
- build.rs for conditional proto code generation
68-
- [x] Zenoh transport removed — replaced by gRPC (v1.8.0)
69-
- [x] Version check module — startup check against releases.hyperi.io (v1.7.0)
70-
- [x] Deployment validation module — Helm chart and Dockerfile contract checks (v1.7.0)
71-
- [x] CI: ARC self-hosted runners enabled (v1.7.1–v1.8.3)
72-
- [x] Clippy/formatting fixes — approx_constant lint, dprint float formatting (v1.8.1–v1.8.3)
73-
- [x] Package rename: hs-rustlib -> hyperi-rustlib, published v1.4.3 to JFrog
74-
- [x] Rebrand: HyperSec -> HyperI across source, docs, configs, workflows
75-
- [x] Registry migration: hypersec registry -> hyperi registry
76-
- [x] Submodule URLs: hypersec-io -> hyperi-io
77-
- [x] CI config: .hypersec-ci.yaml -> .hyperi-ci.yaml
78-
- [x] Directory-config store with git2 integration (v1.4.0)
79-
- [x] OpenTelemetry metrics support (v1.4.0)
80-
- [x] Secrets management module (OpenBao/Vault, AWS) (v1.3.x)
81-
- [x] HTTP server module (axum-based) (v1.2.0)
82-
- [x] Transport module (Kafka/Memory abstraction)
83-
- [x] TieredSink module (disk spillover with circuit breaker)
84-
- [x] Spool module (disk-backed queue)
85-
- [x] Configuration module (7-layer cascade with figment)
86-
- [x] Logger module (structured JSON, RFC3339, masking)
87-
- [x] Metrics module (Prometheus + process/container)
88-
- [x] Environment detection module
89-
- [x] Runtime paths module (XDG + container awareness)
90-
- [x] Dependency audit (serde_yml -> serde-yaml-ng, queue-file -> yaque, once_cell -> LazyLock)
91-
- [x] Config cascade alignment with hyperi-pylib unified spec (v1.6.0)
92-
- load_home_dotenv default false, app_name support, container/user config paths
93-
- Created docs/CONFIG-CASCADE.md
94-
- PG layer documented as built-for-not-with (YAML gitops covers current needs)
9566
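
The MemoryGuard underflow entry in the list above replaces `fetch_sub` with `fetch_update` + `saturating_sub`. The sketch below shows why that matters on a bare `AtomicUsize`. The function names are hypothetical and this is not the MemoryGuard code itself, which does not appear in this diff.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Release with `fetch_sub`: the counter wraps around if `bytes` exceeds
/// the current value, leaving a huge bogus usage figure.
pub fn release_wrapping(used: &AtomicUsize, bytes: usize) -> usize {
    used.fetch_sub(bytes, Ordering::AcqRel);
    used.load(Ordering::Acquire)
}

/// Release with `fetch_update` + `saturating_sub`: the counter is clamped
/// at zero. Returns the value stored by this call.
pub fn release_saturating(used: &AtomicUsize, bytes: usize) -> usize {
    let previous = used
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
            Some(current.saturating_sub(bytes))
        })
        .expect("closure always returns Some");
    previous.saturating_sub(bytes)
}

fn main() {
    let wrapping = AtomicUsize::new(10);
    println!("fetch_sub:    {}", release_wrapping(&wrapping, 25));

    let saturating = AtomicUsize::new(10);
    println!("fetch_update: {}", release_saturating(&saturating, 25));
}
```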

9667
---
9768

98-
## Backlog (P1 - Config Registry)
99-
100-
### Reflectable Config Registry
101-
102-
Central registry where every module registers its config section at startup.
103-
Currently modules independently call `unmarshal_key()` — no visibility into
104-
what config keys exist, their types, defaults, or descriptions.
105-
106-
**Goal:** Any DFE app can list/dump/expose all available config sections.
107-
108-
- [x] Auto-registration via `unmarshal_key_registered` — records `(key, type_name, defaults, effective)` in global registry. Zero code changes in downstream apps.
109-
- [x] `registry::sections()` — list all registered sections
110-
- [x] `registry::dump_effective()` — JSON map of effective values
111-
- [x] `registry::dump_defaults()` — JSON map of defaults (via `T::default()`)
112-
- [x] Heuristic auto-redaction (password, secret, token, key, credential, auth, private, cert, encryption)
113-
- [x] `#[serde(skip_serializing)]` as additional layer for fields that should never appear
114-
- [x] expression, memory, version_check, scaling, grpc, secrets wired with `from_cascade()` auto-register
115-
- [x] Modules without defaults (tiered_sink, http_server, kafka, spool, dlq) use `unmarshal_key_registered` from downstream apps
116-
- [x] `/config` admin endpoint (opt-in via `enable_config_endpoint`) — returns redacted effective + defaults JSON
117-
- [x] Change notification (opt-in) — `registry::on_change(key, callback)` + `registry::update()`
118-
- Modules that need hot-reload subscribe; others keep `OnceLock` (init-once)
119-
- [x] `ConfigReloader.with_registry_update(key)` connects hot-reload to registry
120-
- [x] `SensitiveString` type — compile-time safe, `Serialize` always redacts
121-
- [x] 19 registry + 12 sensitive string tests covering all redaction guarantees
122-
- [ ] Migrate all dfe-* and hyperi-* apps to `unmarshal_key_registered` pattern
123-
- [ ] Align hyperi-pylib with same registry pattern
124-
125-
---
126-
127-
## Backlog (P2 - from Gap Analysis)
128-
129-
### Phase 1 - Core Enterprise
130-
131-
- [ ] Database URL builders module (PostgreSQL, Redis)
132-
- [ ] HTTP client module with retry middleware (reqwest-retry)
69+
## Backlog
13370

13471
### Secrets Providers
13572

136-
- [ ] GCP Secret Manager provider (`secrets-gcp` feature, `google-cloud-secretmanager` crate)
137-
- [ ] Azure Key Vault provider (`secrets-azure` feature, `azure_security_keyvault` crate)
73+
- [ ] GCP Secret Manager provider (`secrets-gcp` feature)
74+
- [ ] Azure Key Vault provider (`secrets-azure` feature)
13875

139-
### Phase 2 - Enhanced Features
76+
### Kafka — Opinionated SASL-SCRAM Named Constructors
14077

141-
- [ ] Cache module with disk/Redis backing
142-
- [ ] CLI framework helpers (wrap Clap)
78+
- [ ] `KafkaConfig::external_sasl_scram(brokers, username, password)` — SASL_SSL + SCRAM-SHA-512
79+
- [ ] `KafkaConfig::internal_sasl_scram(brokers, username, password)` — SASL_PLAINTEXT + SCRAM-SHA-512
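
A possible shape for the two named constructors, assuming `KafkaConfig` can be modelled as a map of librdkafka properties. That representation and the private `sasl_scram` helper are assumptions for illustration; the real struct is not shown in this diff. The property keys (`bootstrap.servers`, `security.protocol`, `sasl.mechanism`, `sasl.username`, `sasl.password`) are standard librdkafka settings.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the crate's `KafkaConfig`:
/// a plain map of librdkafka property names to values.
#[derive(Debug, Clone)]
pub struct KafkaConfig {
    pub properties: BTreeMap<String, String>,
}

impl KafkaConfig {
    /// Shared builder: only the security protocol differs between the two
    /// public constructors.
    fn sasl_scram(protocol: &str, brokers: &str, username: &str, password: &str) -> Self {
        let mut properties = BTreeMap::new();
        properties.insert("bootstrap.servers".to_string(), brokers.to_string());
        properties.insert("security.protocol".to_string(), protocol.to_string());
        properties.insert("sasl.mechanism".to_string(), "SCRAM-SHA-512".to_string());
        properties.insert("sasl.username".to_string(), username.to_string());
        properties.insert("sasl.password".to_string(), password.to_string());
        Self { properties }
    }

    /// Brokers reached over TLS: SASL_SSL + SCRAM-SHA-512.
    pub fn external_sasl_scram(brokers: &str, username: &str, password: &str) -> Self {
        Self::sasl_scram("SASL_SSL", brokers, username, password)
    }

    /// Brokers reached without TLS: SASL_PLAINTEXT + SCRAM-SHA-512.
    pub fn internal_sasl_scram(brokers: &str, username: &str, password: &str) -> Self {
        Self::sasl_scram("SASL_PLAINTEXT", brokers, username, password)
    }
}

fn main() {
    let cfg = KafkaConfig::external_sasl_scram("broker-1:9093", "svc-loader", "example");
    // Print everything except the password.
    for (key, value) in cfg.properties.iter().filter(|(k, _)| *k != "sasl.password") {
        println!("{key} = {value}");
    }
}
```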
14380

144-
### Phase 3 - Advanced
81+
### Other
14582

146-
- [ ] Standalone Kafka client (if transport layer insufficient)
14783
- [ ] PII anonymiser (evaluate Rust libraries)
14884
- [ ] Python bindings for ClickHouse client (PyO3)
14985

150-
### Kafka — Opinionated SASL-SCRAM Named Constructors
151-
152-
- [ ] Add `KafkaConfig::external_sasl_scram(brokers, username, password)` — SASL_SSL + SCRAM-SHA-512
153-
- [ ] Add `KafkaConfig::internal_sasl_scram(brokers, username, password)` — SASL_PLAINTEXT + SCRAM-SHA-512
154-
- [ ] Encodes the decision once: SCRAM works unchanged on Apache Kafka, AutoMQ, MSK, Confluent Cloud
155-
- [ ] Remove per-project manual assembly of protocol + sasl + tls fields in dfe-loader, dfe-receiver
156-
15786
---
15887

15988
## Notes
16089

16190
- Use `CARGO_BUILD_JOBS=2` for all cargo commands
162-
- Transport backends: Kafka, gRPC (native + Vector compat), Memory (Zenoh removed in v1.8.0)
91+
- Transport backends: Kafka, gRPC (native + Vector compat), Memory
92+
- Core pillars plan: `docs/superpowers/plans/2026-03-26-core-pillars.md`
16393
- See docs/GAP_ANALYSIS.md for detailed comparison with hyperi-pylib
164-
- See docs/CLICKHOUSE_PYTHON_BINDINGS.md for Python binding proposal
