|
| 1 | +# BP-69: Adopt slog for structured logging |
| 2 | + |
| 3 | +- Issue: [#4750](https://github.com/apache/bookkeeper/issues/4750) |
| 4 | +- State: Under Discussion |
| 5 | +- Release: N/A |
| 6 | + |
| 7 | +### Motivation |
| 8 | + |
| 9 | +BookKeeper today uses SLF4J across the codebase. Logs are unstructured lines with |
| 10 | +positional `{}` placeholders, and the same identity attributes (ledger id, bookie |
| 11 | +id, entry id, journal id, etc.) are repeated inline in every message string. This |
| 12 | +has three concrete consequences: |
| 13 | + |
| 14 | +1. **Poor queryability in log backends.** Modern log pipelines (Loki, Elastic, |
| 15 | + Splunk, OpenSearch, GCP Logging, CloudWatch) are indexed on structured |
| 16 | + key/value attributes. Grepping `"ledgerId=12345"` out of free-form messages is |
| 17 | + slow, lossy, and breaks whenever a log message is reworded. |
| 18 | +2. **Boilerplate and inconsistency.** Every log statement restates the same |
| 19 | + identity context ("ledger 12345, entry 67, bookie 10.0.0.1:3181 ..."), with |
| 20 | + no consistent attribute naming. Two log sites that mean the same thing get |
| 21 | + formatted differently. |
| 22 | +3. **`isDebugEnabled()` boilerplate.** To avoid `String.format`/toString cost on |
| 23 | + hot paths with debug disabled, the code is sprinkled with explicit level |
| 24 | + guards (hundreds of them), which obscure the actual logic. |
| 25 | + |
| 26 | +A similar analysis was done for Apache Pulsar in |
| 27 | +[PIP-467: Adopt slog for structured logging across Pulsar][pip-467]. PIP-467 |
| 28 | +reached the same conclusion, selected the `slog` library, was accepted, and the |
| 29 | +migration of the entire Pulsar codebase is complete. Because Pulsar is |
| 30 | +BookKeeper's largest user, a shared structured logging substrate lets |
| 31 | +cross-project debugging chains stay structured and correlated end-to-end (more |
| 32 | +on this in the "Cross-boundary logger context" section below, which is the main |
| 33 | +API change this BP introduces for BookKeeper). |
| 34 | + |
| 35 | +BookKeeper itself has a prior attempt at structured logging, the |
| 36 | +`bookkeeper-slogger` module (`Slogger` API with `kv(key, value)` and `ctx()`). |
| 37 | +It was introduced alongside `DbLedgerStorage`/`DirectEntryLogger` but was never |
| 38 | +adopted broadly: only **5 main-source files** in `bookkeeper-server` use it today |
| 39 | +(`DbLedgerStorage`, `DirectEntryLogger`, `DirectCompactionEntryLog`, |
| 40 | +`DirectWriter`, `EntryLogIdsImpl`). The API is clunky (no debug/trace levels, |
| 41 | +verbose `.kv().kv().info(...)` chains, no zero-overhead disabled levels), and |
| 42 | +the SLF4J bridge is the only handler implementation. It is effectively dead |
| 43 | +code. We propose to retire it. |
| 44 | + |
| 45 | +### Public Interfaces |
| 46 | + |
| 47 | +**New public interfaces** (`@Public`/`@Unstable`): |
| 48 | + |
| 49 | +1. A `Map<String, Object>` parameter on the client-side builder API so the |
| 50 | + caller (Pulsar, a user application, etc.) can attach its own logging |
| 51 | + context attributes to a ledger `Handle`: |
| 52 | + - `CreateBuilder withLoggerContext(Map<String, Object> attrs)` |
| 53 | + - `OpenBuilder withLoggerContext(Map<String, Object> attrs)` |
| 54 | + - `DeleteBuilder withLoggerContext(Map<String, Object> attrs)` |
| 55 | + |
| 56 | + When set, the resulting `WriteHandle`/`ReadHandle` binds those attributes |
| 57 | + onto its internal logger. All BookKeeper client log lines emitted on behalf |
| 58 | + of that handle automatically carry the caller's attributes — e.g. Pulsar's |
| 59 | + `managedLedger=<topic>` and `cursor=<subscription>` — without BookKeeper |
| 60 | + knowing anything about Pulsar concepts. Note that the public API does |
| 61 | + **not** expose slog types: callers pass a plain `Map`, so they are not |
| 62 | + required to take a slog dependency to use the feature. |
| 63 | + |
| 64 | +2. A new compile-scope dependency: `io.github.merlimat.slog:slog` (currently |
| 65 | + `0.9.7`). |
| 66 | + |
| 67 | +**Changed public interfaces**: none. SLF4J remains on the classpath during and |
| 68 | +after the migration as a transitive dependency, because slog ships an SLF4J |
| 69 | +handler (`Slf4jHandler`) that sends structured records to any existing SLF4J |
| 70 | +backend (Logback, Log4j2). Applications that currently configure BookKeeper |
| 71 | +logging via `logback.xml` / `log4j2.xml` continue to work unchanged. |
| 72 | + |
| 73 | +**Removed public interfaces**: |
| 74 | + |
| 75 | +- The `bookkeeper-slogger` module (`org.apache.bookkeeper.slogger.Slogger`, |
| 76 | + `AbstractSlogger`, `Sloggable`, `ConsoleSlogger`, `NullSlogger`, and the |
| 77 | + `bookkeeper-slogger-slf4j` bridge). It is `@Unstable` and has no adoption |
| 78 | + outside the 5 internal files listed above. |
| 79 | + |
| 80 | +**Not changed**: |
| 81 | + |
| 82 | +- The BookKeeper wire protocol, binary formats (journal, entry log, ledger |
| 83 | + index), metadata formats, metrics, and CLI are untouched. |
| 84 | +- MDC behavior (`MdcUtils`, MDC propagation in `OrderedExecutor`) is preserved; |
| 85 | + slog's SLF4J handler forwards structured attributes as MDC to the underlying |
| 86 | + SLF4J backend so that existing log appenders keep rendering them. |
| 87 | + |
| 88 | +### Proposed Changes |
| 89 | + |
| 90 | +#### 1. Add `slog` as a core dependency |
| 91 | + |
| 92 | +- Add `io.github.merlimat.slog:slog:0.9.7` to root `pom.xml` |
| 93 | + `<dependencyManagement>` and to the global compile classpath, alongside the |
| 94 | + existing SLF4J API. |
| 95 | +- Add `lombok.config` at the repo root so `@CustomLog` generates a slog |
| 96 | + `Logger`: |
| 97 | + ``` |
| 98 | + lombok.log.custom.declaration = io.github.merlimat.slog.Logger \ |
| 99 | + io.github.merlimat.slog.Logger.get(TYPE) |
| 100 | + ``` |
| 101 | +- Continue to ship the SLF4J handler at runtime so existing log configs (Logback |
| 102 | + / Log4j2 XML) keep working. |
| 103 | + |
| 104 | +#### 2. Migrate SLF4J call sites to slog, one module at a time |
| 105 | + |
| 106 | +Phased conversion. Each phase is a single commit; the codebase must compile at |
| 107 | +the end of each phase. |
| 108 | + |
| 109 | +| Phase | Scope | |
| 110 | +|-------|-------------------------------------------------------------| |
| 111 | +| 1 | `bookkeeper-common` | |
| 112 | +| 2 | `stats`, `bookkeeper-common-allocator` | |
| 113 | +| 3 | `bookkeeper-server` (core, largest phase) | |
| 114 | +| 4 | `bookkeeper-http`, `tools` | |
| 115 | +| 5 | `stream/distributedlog` | |
| 116 | +| 6 | `metadata-drivers`, `benchmark`, peripheral tests | |
| 117 | +| 7 | Cleanup: remove `bookkeeper-slogger`, remove `@Slf4j` usage | |
| 118 | + |
| 119 | +Conversion rules: |
| 120 | + |
| 121 | +- `@Slf4j` → `@CustomLog`; `LoggerFactory.getLogger(X.class)` → |
| 122 | + `Logger.get(X.class)`; naming: static = `LOG`, instance-bound = `log`. |
| 123 | +- `log.info("text {} {}", a, b)` is **not** mechanically rewritten to `logf(...)` |
| 124 | + — instead convert parameters to structured attributes: |
| 125 | + `log.info().attr("a", a).attr("b", b).log("text")`. |
| 126 | +- Classes with identity (ledger id, bookie id, stream name, ...) build an |
| 127 | + instance-bound logger in the constructor and drop those attributes from every |
| 128 | + call site: |
| 129 | + ```java |
| 130 | + this.log = Logger.get(PerChannelBookieClient.class) |
| 131 | + .with().attr("bookieId", bookieId).build(); |
| 132 | + ``` |
| 133 | +- `if (log.isDebugEnabled())` guards are removed; slog returns a no-op `Event` |
| 134 | + for disabled levels so the `.attr()` chain is zero-overhead. |
| 135 | +- When a `.attr()` value is itself expensive to compute (e.g. |
| 136 | + `Iterables.size(iter)`, `Joiner.on(...).join(...)`, |
| 137 | + `Arrays.toString(...)`), use slog's lambda form so the lambda is only invoked |
| 138 | + when the level is enabled: |
| 139 | + ```java |
| 140 | + log.debug(e -> e.attr("count", Iterables.size(newLocations)) |
| 141 | + .log("Update locations")); |
| 142 | + ``` |
| 143 | + |
| 144 | +#### 3. Cross-boundary logger context through the client API |
| 145 | + |
| 146 | +This is the main API change. Today a caller (Pulsar, a user app, a CLI tool) |
| 147 | +that wants to correlate its own log lines with BookKeeper's log lines has no |
| 148 | +way to do so — BookKeeper's client code logs with its own loggers, and the |
| 149 | +caller's context (topic name, cursor name, request id, trace id) never makes |
| 150 | +it into those lines. |
| 151 | + |
| 152 | +We add an optional `withLoggerContext(Map<String, Object>)` method on the |
| 153 | +ledger builder API. The caller provides a plain map of attribute names to |
| 154 | +values; slog does not appear in the public API surface: |
| 155 | + |
| 156 | +```java |
| 157 | +// Pulsar ManagedLedgerImpl |
| 158 | +Map<String, Object> logCtx = Map.of( |
| 159 | + "managedLedger", name, |
| 160 | + "namespace", namespace); |
| 161 | + |
| 162 | +CompletableFuture<WriteHandle> f = bookKeeper.newCreateLedgerOp() |
| 163 | + .withEnsembleSize(3) |
| 164 | + .withWriteQuorumSize(2) |
| 165 | + .withAckQuorumSize(2) |
| 166 | + .withLoggerContext(logCtx) // <-- new |
| 167 | + .execute(); |
| 168 | +``` |
| 169 | + |
| 170 | +Inside BookKeeper, the `LedgerHandle` constructor binds those attributes onto |
| 171 | +its internal logger: |
| 172 | + |
| 173 | +```java |
| 174 | +var builder = Logger.get(LedgerHandle.class).with(); |
| 175 | +if (callerAttrs != null) { |
| 176 | + callerAttrs.forEach(builder::attr); |
| 177 | +} |
| 178 | +this.log = builder.attr("ledgerId", ledgerId).build(); |
| 179 | +``` |
| 180 | + |
| 181 | +The net effect: a failed write on a bookie emits a log line like |
| 182 | + |
| 183 | +``` |
| 184 | +level=ERROR ledgerId=12345 entryId=67 bookieId=bk-3:3181 \ |
| 185 | + managedLedger=persistent://public/default/my-topic \ |
| 186 | + "Write to bookie failed" |
| 187 | +``` |
| 188 | + |
| 189 | +without any BookKeeper code referencing the `managedLedger` concept. The |
| 190 | +propagation is one-way (caller context flows in; BookKeeper-only attributes do |
| 191 | +not leak out), and `withLoggerContext` is optional — if unset, BookKeeper uses |
| 192 | +its own `Logger.get(...)` as before and behavior is unchanged. |
| 193 | + |
| 194 | +Scope for this BP: `CreateBuilder`, `OpenBuilder`, `DeleteBuilder`. The same |
| 195 | +mechanism can be extended to `newListLedgersOp` / admin APIs in follow-up |
| 196 | +changes. |
| 197 | + |
| 198 | +#### 4. Retire `bookkeeper-slogger` |
| 199 | + |
| 200 | +In the cleanup phase: |
| 201 | + |
| 202 | +- Convert the 5 remaining `Slogger` call sites in |
| 203 | + `bookkeeper-server` to the new slog API (they are already structured-event |
| 204 | + loggers; the translation is mechanical). |
| 205 | +- Delete the `bookkeeper-slogger/` module (API + SLF4J bridge + tests). |
| 206 | +- Remove it from the root `pom.xml` reactor. |
| 207 | + |
| 208 | +### Compatibility, Deprecation, and Migration Plan |
| 209 | + |
| 210 | +- **End-user impact for the log output format**: with the default SLF4J |
| 211 | + handler, the rendered log lines continue to come from the user's existing |
| 212 | + Logback/Log4j2 pattern. Structured attributes are forwarded as MDC, so a |
| 213 | + layout like `%X{ledgerId} %msg` picks them up automatically. No log config |
| 214 | + change is required to upgrade. |
| 215 | +- **End-user impact for the client API**: additive only. `withLoggerContext(...)` |
| 216 | + is a new method on `CreateBuilder` / `OpenBuilder` / `DeleteBuilder`; existing |
| 217 | + callers that don't use it see no behavioral change. Its parameter is a |
| 218 | + standard JDK `Map<String, Object>`, so callers are not required to take a |
| 219 | + slog dependency to use the feature. |
| 220 | +- **Dependency impact**: applications get `io.github.merlimat.slog:slog` as a |
| 221 | + new transitive dependency. slog has no transitive dependencies of its own |
| 222 | + beyond an optional SLF4J handler (runtime scope). |
| 223 | +- **`bookkeeper-slogger` removal**: `@Unstable`, 5 internal call sites, zero |
| 224 | + external adoption observed. We will deprecate in the release that begins the |
| 225 | + migration and remove after Phase 7 lands. Users who happen to depend on it |
| 226 | + can migrate to the published slog API, which is a superset. |
| 227 | +- **SLF4J deprecation**: **not** part of this BP. SLF4J stays for the |
| 228 | + foreseeable future as the rendering backend behind slog. |
| 229 | +- **Rollout**: each phase is independently mergeable and independently |
| 230 | + revertable. There is no flag day. |
| 231 | + |
| 232 | +### Test Plan |
| 233 | + |
| 234 | +- Each phase preserves the existing unit/integration test suite unchanged; |
| 235 | + "the tests still pass" is the baseline success criterion per phase. |
| 236 | +- For the new `withLoggerContext(Map<String, Object>)` API, add unit tests |
| 237 | + that: |
| 238 | + 1. Verify the log events emitted by a `LedgerHandle` carry the attributes |
| 239 | + supplied via the map (use a capturing slog handler in tests). |
| 240 | + 2. Verify the default (no `withLoggerContext` call) behavior is identical |
| 241 | + to pre-BP, including logger name. |
| 242 | +- For log-format compatibility, a small integration check in |
| 243 | + `bookkeeper-server` that wires the slog SLF4J handler to Logback and asserts |
| 244 | + that MDC-styled patterns render structured attributes. |
| 245 | +- No wire/format/metric tests change — none of those are touched. |
| 246 | + |
| 247 | +### Rejected Alternatives |
| 248 | + |
| 249 | +1. **Stay on SLF4J, adopt a convention.** Rejected: SLF4J's `{}` format has no |
| 250 | + attribute names, no composable context (beyond MDC, which is thread-local |
| 251 | + and brittle across async/executor boundaries), and no zero-overhead disabled |
| 252 | + levels. Cross-project correlation with Pulsar would not be possible in a |
| 253 | + structured way. |
| 254 | +2. **Expand `bookkeeper-slogger` to cover the whole codebase.** Rejected: the |
| 255 | + API surface is small and missing critical features (`debug`/`trace`, |
| 256 | + lambda-deferred attribute evaluation, per-instance context via |
| 257 | + `with().build()`). It has no external users. Hardening it would mean |
| 258 | + reinventing slog while duplicating work with Pulsar's choice. Cheaper to |
| 259 | + retire it. |
| 260 | +3. **Use Log4j2's `ThreadContext` / structured API directly.** Rejected: binds |
| 261 | + BookKeeper to a specific backend (we currently support Logback, Log4j2, and |
| 262 | + others via SLF4J). slog keeps the rendering backend pluggable. |
| 263 | +4. **OpenTelemetry Logs Bridge API.** Rejected for this BP: OTel logs are still |
| 264 | + stabilizing, and binding the in-process logging API to an OTel SDK is a much |
| 265 | + larger commitment (and dependency footprint) than what is needed to solve |
| 266 | + the stated problem. Nothing in this BP precludes later adding an OTel |
| 267 | + handler to slog. |
| 268 | +5. **Expose a slog `Logger` parameter on the builder (e.g. |
| 269 | + `withLogger(io.github.merlimat.slog.Logger)`) instead of a |
| 270 | + `Map<String, Object>`.** Rejected: it would force every BookKeeper client |
| 271 | + that wants to propagate logging context to take a direct compile dependency |
| 272 | + on slog and to construct a slog `Logger` before calling the client API. The |
| 273 | + map-of-attributes form keeps slog out of the public API surface while giving |
| 274 | + the caller everything they actually need (a bag of key/value pairs to bind |
| 275 | + to the handle's logger). Nothing prevents a caller that already uses slog |
| 276 | + from deriving the map from an existing logger's attributes. |
| 277 | + |
| 278 | +[pip-467]: https://github.com/apache/pulsar/blob/master/pip/pip-467.md |
0 commit comments