
Commit e3fa87f

Add FluentBitUpgradeSummary.md with 4.x to 5.x validation results and Windows re-tail root cause
Documents the proven root cause of the one-time Windows ContainerLogV2 spike on 4.x to 5.x upgrade: fluent-bit 5.x added an offset_marker content fingerprint to in_tail's SQLite recovery DB. 4.x never wrote it, so on first 5.x boot the marker validation fails for all recovered files, triggering a re-tail from byte 0 (Read_from_Head true). One-time per Windows node, self-healing on first boot.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
FluentBitUpgradeSummary.md

# fluent-bit 4.0.14 → 5.0.x Upgrade — Validation Summary

**PR**: [#1671](https://github.com/microsoft/Docker-Provider/pull/1671) (`zanejohnson-azure/upgrade-fluent-bit`)

**Changes**:

- Linux: `kubernetes/linux/setup.sh` → `azcu-fluent-bit-5.0.4`
- Windows: `kubernetes/windows/setup.ps1` → `fluent-bit-5.0.3-win64.zip` (5.0.4 not yet published upstream — will bump when available)

**Verdict**: ✅ **No regressions.** Safe to merge once Windows 5.0.4 lands.

---
## How it was validated

A/B backdoor deployment on AKS cluster `zane-ama-logs-helm-test` (LA workspace `222e72f7-1ad8-4e28-b2a9-07d046eedef4`):

- **Baseline**: `ciprod:3.3.0` (current prod, fluent-bit 4.0.14)
- **Test**: `cidev:3.3.0-6-g1d77401ab-20260506045747` (this PR, fluent-bit 5.0.4 Linux)

---
## Data-volume parity (1-min bins, 5-min window)

| Table | Prod | Test | Delta | Verdict |
|---|---:|---:|---:|---|
| ContainerInventory | 822 | 825 | +3 | ✅ PASS — 3 extra rows = unrelated `azsecpack-azl3-image` (Azure Security Linux DaemonSet) snapshots |
| KubeNodeInventory | 25 | 25 | 0 | ✅ PASS |
| KubePodInventory | 825 | 826 | +1 | ✅ PASS — within noise |
| InsightsMetrics | 825 | 825 | 0 | ✅ PASS |
| Perf | 5827 | 5835 | +8 | ✅ PASS — matches azsecpack containers above |
| ContainerLogV2 | varies | varies | — | ✅ PASS — no sustained drop/spike |

Per-container investigation confirmed the small deltas are unrelated cluster churn (azsecpack), not the ama-logs code change.
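
The parity numbers were pulled with plain 1-minute-binned row counts per table, run once for the baseline deployment and once for the test deployment, then compared bin by bin. A minimal sketch of that query, assuming the standard Container Insights tables (swap in each table name; the window is illustrative):

```
// Hedged sketch: rows per 1-minute bin for one table over a 5-minute window.
// Run for the baseline and the test deployment and compare the per-bin counts.
ContainerInventory
| where TimeGenerated > ago(5m)
| summarize Rows = count() by bin(TimeGenerated, 1m)
| order by TimeGenerated asc
```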
---

## Functional smoke tests

| # | Area | Status | Notes |
|---|---|---|---|
| 1 | Pod startup / no crashloop | ✅ PASS | 0 restarts across all ama-logs pods over 41+ h soak (restart-count query sketch below) |
| 2 | ContainerLogV2 ingestion | ✅ PASS | Stream uninterrupted across rollout |
| 3 | KubePodInventory / Inventory ingestion | ✅ PASS | Counts match prod within noise |
| 4 | InsightsMetrics / Perf ingestion | ✅ PASS | Counts match prod within noise |
| 5 | KubeEvents | ✅ PASS | No change in event flow |
| 6 | DaemonSet rollout | ✅ PASS | `kubectl rollout restart ds/ama-logs` clean, 4/4 pods Ready |
| 7 | Prometheus scraping (ama-logs-prometheus container) | ✅ PASS | No regression in metrics scrape |
| 8 | **Multiline (Java stack trace)** | **PASS** | See section below — explicit before/after test |
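
The "0 restarts" figure in row 1 can be confirmed from `KubePodInventory`, which carries a per-pod restart counter. A minimal sketch, assuming the default `kube-system` namespace and `ama-logs*` pod names:

```
// Hedged sketch: maximum observed restart count per ama-logs pod over the soak window.
// Every row returning 0 corresponds to "no crashloop" in smoke test #1.
KubePodInventory
| where TimeGenerated > ago(41h)
| where Namespace == "kube-system" and Name startswith "ama-logs"
| summarize MaxRestarts = max(PodRestartCount) by Name
| order by MaxRestarts desc
```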
---

## Multiline stack-trace grouping (the highest-risk 5.x area)

fluent-bit 5.x rewrote `in_emitter` and the multiline buffering path. ama-logs gates multiline behind a customer opt-in (`#${MultilineEnabled}` token in `fluent-bit.conf`, stripped by `fluent-bit-conf-customizer.rb` when `enable_multiline_logs.enabled = "true"`). Tested explicitly:

**Method**: Deployed a busybox pod (`multiline-emitter`) that prints a 7-line Java NPE + `Caused by:` chain every 30 s, tagged `iter-N`. Captured `ContainerLogV2` rows before (multiline OFF) and after (configmap patch enabling multiline for `java`).

| Mode | Iters in window | ContainerLogV2 rows | Rows / trace |
|---|---:|---:|---:|
| BEFORE (default — multiline OFF) | 11 | 77 | **7** (one row per `\n`-terminated line) |
| AFTER (configmap opt-in, java) | 9 | 9 | **1** (full trace collapsed) |
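
The "Rows / trace" column is total `ContainerLogV2` rows from the emitter pod divided by the number of distinct `iter-N` markers in the window. A minimal sketch of that count, assuming the pod and marker naming described above (window illustrative):

```
// Hedged sketch: rows per trace for the multiline-emitter pod.
// Multiline OFF yields ~7 rows per trace; the java opt-in collapses each trace to 1 row.
ContainerLogV2
| where TimeGenerated > ago(10m)
| where PodName startswith "multiline-emitter"
| extend Iter = extract(@"iter-(\d+)", 1, tostring(LogMessage))
| summarize TotalRows = count(), Iters = dcountif(Iter, isnotempty(Iter))
| extend RowsPerTrace = round(todouble(TotalRows) / Iters, 1)
```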
59+
60+
Sample AFTER row (`iter-19` `LogMessage`):
61+
```
62+
java.lang.NullPointerException: multiline-marker iter-19
63+
at com.example.Foo.bar(Foo.java:42)
64+
at com.example.Foo.baz(Foo.java:13)
65+
at com.example.Main.main(Main.java:7)
66+
Caused by: java.io.IOException: disk read failed iter-19
67+
at com.example.Disk.read(Disk.java:99)
68+
at com.example.Foo.bar(Foo.java:40)
69+
```
70+
71+
**Result**: Multiline grouping works as documented on 5.0.4. No partial traces, no dropped continuation lines, no duplicate emissions.
72+
73+
---
74+
75+
## Windows log ingestion (separate cluster)
76+
77+
Verified on a second cluster (`ci-logs-dev-aks-all-nodes`, LA workspace `23320075-7b9b-42e1-acc5-97baf986542e`) that is running this PR's `cidev:win-3.3.0-9-g2dfc6056c-20260511064557` Windows image plus a continuous `log-gen-windows-*` workload on both Windows nodes (14 h soak, 0 restarts).
78+
79+
| Node | Computer | CLV2 rows / 30 min | Verdict |
80+
|---|---|---:|---|
81+
| Windows fip | `akswinfip000000` | 1,645 | ✅ PASS |
82+
| Windows p22 | `akswinp22000000` | 1,673 | ✅ PASS |
83+
| (Linux baseline on same cluster) | aks-usr*-vmss000000 (×6) | 1,713–1,722 ||
84+
85+
- Row counts on Windows are within ~4% of Linux nodes on the same cluster running the same workload — parity confirmed.
86+
- Spot-check of `log-gen-windows` `LogMessage` shows **sequential `line N` counter with no gaps** (e.g. 16647 → 16654 continuous) → no record loss in fluent-bit 5.0.3 Windows tail/forward path.
87+
- Kubernetes filter populates `PodName` / `ContainerName` / `Computer` correctly for Windows pods.
88+
89+
**Result**: Windows ingestion path on fluent-bit 5.0.3 is healthy. Will re-verify after bumping to 5.0.4 once upstream win64 zip is published.
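
The per-node counts above are a plain `summarize count() by Computer` over a 30-minute window; the gap check on the sequential counter can be done along the lines below. This is a minimal sketch — the `line N` extraction is an assumption about the workload's message format, and the node name is one of the two above:

```
// Hedged sketch: find gaps in the sequential counter emitted by the log-gen-windows workload.
// An empty result over the window means no record was dropped in the tail/forward path.
ContainerLogV2
| where TimeGenerated > ago(30m)
| where Computer == "akswinfip000000" and ContainerName startswith "log-gen-windows"
| extend Seq = toint(extract(@"line (\d+)", 1, tostring(LogMessage)))
| where isnotnull(Seq)
| order by Seq asc
| extend Gap = Seq - prev(Seq)
| where Gap > 1
```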
---

## Resource usage (fluent-bit memory / CPU)

### Linux — A/B on `zane-ama-logs-helm-test` (4.0.14 → 5.0.4)

Per-process `fluent-bit` RSS (sampled inside the ama-logs container during steady state):

| Image | fluent-bit RSS | Δ vs prod |
|---|---:|---:|
| `ciprod:3.3.0` (4.0.14) — prod baseline | ~110–115 MB | — |
| `cidev:3.3.0-6-g1d77401ab` (5.0.4) — test | ~120–126 MB (124,016 / 125,820 KB sampled) | **+10%** |

Container-level `ama-logs` working set (hosts fluent-bit + mdsd + telegraf; from the `Perf` table, averaged over the 4 DS pods):

| Image | Avg working set | Δ |
|---|---:|---:|
| Prod (4.0.14) | 252.7 MB | — |
| Test (5.0.4) | 288.6 MB | +35.9 MB / +14.2% |
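
The container-level working set comes from the `Perf` table. A minimal sketch of that pull, assuming the standard Container Insights counter names (the `/ama-logs` suffix match on `InstanceName` is an assumption about how the container instance is identified):

```
// Hedged sketch: average ama-logs container working set (MB) per node from Perf.
// In Container Insights, InstanceName ends with "/<container-name>", hence the suffix filter.
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "K8SContainer" and CounterName == "memoryWorkingSetBytes"
| where InstanceName endswith "/ama-logs"
| summarize AvgWorkingSetMB = round(avg(CounterValue) / 1024 / 1024, 1) by Computer
```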
Of the +35.9 MB container delta, ~half is fluent-bit (+10 MB) and ~half is `mdsd` (high natural variance, unrelated to this PR). The `flbstore/` chunk-buffer directory remained at **4 KB / empty** → no chunk leak on 5.0.4.

**Verdict**: ✅ **PASS** — the +10% per-process RSS is the expected baseline shift from fluent-bit 5.x bundling cmetrics 2.x, cprofiles, ctraces, and the new worker-pool runtime. No leak, no sustained growth across the 41 h soak.
114+
115+
### Windows — snapshot on `zane-ama-logs-helm-test` + `ci-logs-dev-aks-all-nodes` (5.0.3)
116+
117+
No prod-image Windows baseline was deployed (would require a second backdoor cycle), so this is an absolute-value health check rather than an A/B. Three pods sampled across two clusters, all on `cidev:win-...` (fluent-bit 5.0.3):
118+
119+
| Cluster | Pod | Pod age | `fluent-bit.exe` RSS | Container working set |
120+
|---|---|---:|---:|---:|
121+
| `zane-ama-logs-helm-test` | `ama-logs-windows-xv59q` | 46 h | **39.7 MB** | 252 MiB |
122+
| `ci-logs-dev-aks-all-nodes` | `ama-logs-windows-g52sn` | ~14 h (under log-gen load) | **39.5 MB** | 196 MiB |
123+
| `ci-logs-dev-aks-all-nodes` | `ama-logs-windows-xflnf` | ~14 h (under log-gen load) | **39.7 MB** | 198 MiB |
124+
125+
- fluent-bit RSS is **flat across 3 pods, 2 clusters, 14 h–46 h uptime** (39.5–39.7 MB band) → no leak.
126+
- Container working set is well under the Windows DS limit; no OOM events.
127+
- 0 restarts across all 3 pods (combined ~60 pod-hours, plus continuous log-gen on two of them).
128+
129+
**Verdict**: ✅ **PASS** — Windows fluent-bit 5.0.3 RSS is stable and low. A formal Windows A/B vs 4.0.14 is not blocking; if needed it can be run on the same backdoor flow used for Linux.
130+
131+
---
132+
133+
## Windows ContainerLogV2 one-time spike on 4.x → 5.x upgrade (root-caused)

### Observation

On every Helm rollout from a 4.x image to a 5.x image, the **first minute** after the new Windows pod becomes Ready shows a single ~35–38K-row `ContainerLogV2` burst on the Windows node, then ingestion immediately returns to the pre-rollout cadence. Reproduced twice (35,464 and 38,483 records).
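
A burst of this shape stands out in a per-minute count on the affected Windows node. A minimal sketch (the node-name prefix matches the test cluster's Windows nodepool; the window is illustrative):

```
// Hedged sketch: per-minute ContainerLogV2 volume on the Windows node(s) around the rollout.
// The first 5.x boot over a 4.x DB shows up as a single ~35–38K bin; neighboring bins stay at the normal cadence.
ContainerLogV2
| where TimeGenerated between (datetime(2026-05-15T05:40:00Z) .. datetime(2026-05-15T06:10:00Z))
| where Computer startswith "aksurwin"
| summarize Rows = count() by bin(TimeGenerated, 1m), Computer
| order by TimeGenerated asc
```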
### Reproduction matrix (complete)

| Test | DB last written by | Boot version | Spike? |
|---|---|---|---|
| Helm rollout 4.x → 5.x (#1) | 4.x | 5.x | ✅ YES (~35K) |
| Helm rollout 4.x → 5.x (#2) | 4.x | 5.x | ✅ YES (~38K) |
| Helm rollout 5.x → 4.x | 5.x | 4.x | ❌ NO |
| Helm rollout 5.x → 5.x (REVISION 66, 06:39:19 UTC 5/15) | 5.x | 5.x | ❌ NO |
| `kubectl delete pod` (5.x → 5.x, same image) | 5.x | 5.x | ❌ NO |

The trigger is precisely "DB last written by 4.x AND new boot is 5.x" — not Helm rollout in general, not pod restart in general.
### Proof: the spike IS re-tail of historic file content (not new data)

Test pod `second-log-app` (from `test\scenario\log-app-win-ltsc2022.yml`) emits one sequential integer per second to stdout. This made it possible to track exactly which file offsets were ingested.

Spike at `2026-05-15 05:52:00 UTC` (Windows node `aksurwin*`, `ContainerName == "second-log-app"`):

| Window | Records | Min seq | Max seq | Range | Cadence |
|---|---:|---:|---:|---:|---|
| Pre-spike (05:48–05:50) | 60–61 / min | 445,733 | 445,911 | 178 | 1 / sec — real time |
| **Spike (05:52)** | **38,543** | **236,072** | **446,030** | **209,958** | re-tail of ~58 h of historic seqs |
| Post-spike (05:53–05:55) | 60 / min | 446,031 | 446,210 | 179 | back to 1 / sec |

- 209,958-seq range ≈ 58 h of file content — exactly the age of the file at that point.
- Same historic seqs (`300000`, `350000`, `400000`) were already present in `ContainerLogV2` from `2026-05-13` and `2026-05-14` with their **original** `TimeGenerated` → spike rows are duplicates of content ingested days earlier, not new data.
- Real-time stream (post-spike seqs `446,031+`) resumes immediately and is unbroken → no data loss.
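
The window table above was produced by extracting the integer counter from `LogMessage` and summarizing per 1-minute bin. A minimal sketch (it assumes the message is the bare counter; adjust the `extract()` if the format differs):

```
// Hedged sketch: record count and sequence range per 1-minute bin for second-log-app
// around the spike. The spike bin's MinSeq sits ~58 h behind the live counter.
ContainerLogV2
| where TimeGenerated between (datetime(2026-05-15T05:45:00Z) .. datetime(2026-05-15T06:00:00Z))
| where ContainerName == "second-log-app"
| extend Seq = toint(extract(@"(\d+)", 1, tostring(LogMessage)))
| summarize Records = count(), MinSeq = min(Seq), MaxSeq = max(Seq) by bin(TimeGenerated, 1m)
| extend Range = MaxSeq - MinSeq
| order by TimeGenerated asc
```

The duplicate check in the second bullet is the same extraction filtered to a single historic seq (e.g. `Seq == 300000`), comparing the `TimeGenerated` of the original ingestion with that of the spike row.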
### Root cause (proven via fluent-bit source-code diff: v4.0.14 → v5.0.3 `plugins/in_tail/tail_db.c`)

fluent-bit 5.x added a content-fingerprint column (`offset_marker`) to the `in_tail` SQLite tail DB and validates it on recovery. 4.x does not know that column exists.

**v4.0.14** (`tail_db.c`):

- `flb_tail_db_open()` only creates the schema; it runs **no migrations**.
- `db_file_exists()` reads 4 columns: `id, name, offset, inode`.
- `db_file_insert()` binds 4 fields: `name, offset, inode, created`.
- The `offset_marker` concept does not exist in this version.

**v5.0.3** (`tail_db.c`):

- `flb_tail_db_open()` calls `db_apply_migration_if_needed(SQL_ALTER_FILES_ADD_OFFSET_MARKER)` and `SQL_ALTER_FILES_ADD_OFFSET_MARKER_SIZE` — i.e. it issues `ALTER TABLE in_tail_files ADD COLUMN offset_marker INTEGER DEFAULT 0` (and the size column) on whatever DB it inherits from a prior version.
- `db_file_exists()` reads 6 columns, including the new `offset_marker, offset_marker_size`.
- `db_file_insert()` calls `flb_tail_file_update_offset_marker()` to compute a content fingerprint before persisting.

**The mechanism on 4.x → 5.x boot:**

1. The on-disk DB has rows written by 4.x with no marker (4.x's INSERT only binds 4 fields).
2. 5.x boots and runs the schema migration. All pre-existing 4.x rows now have `offset_marker = 0` (the column DEFAULT).
3. For each container log file already being tailed, 5.x recomputes the file's actual `offset_marker` from the bytes near the recovered offset and compares it to the stored value (0).
4. Mismatch → 5.x discards the recovered offset and treats the file as untracked.
5. With `Read_from_Head true` (set in `build\windows\installer\conf\fluent-bit.conf`), 5.x re-tails from byte 0, producing the burst of historic duplicates.
6. 5.x then INSERTs its own rows with proper markers. Every subsequent restart (pod delete, Helm rollout, host reboot) validates cleanly → no further spikes.

**Note on the captured DB snapshots.** Earlier in this investigation I observed that snapshots from both the "4.x" and "5.x" timeframes had identical schema and non-zero markers, and concluded the schema was unchanged. This was misleading: the snapshots were all taken *after* 5.x had run the migration and written its own rows. The 4.x source code makes it clear that 4.x never writes the marker column, so the values in any 4.x-only DB are the column DEFAULT of 0. That zero is what 5.x rejects on first boot.
### Impact and verdict

- **Scope**: Per Windows node, **once**, on the first 4.x → 5.x rollout for that node. Subsequent 5.x rollouts and pod deletes do not trigger it (proven by the reproduction matrix).
- **Magnitude**: ~35–38K duplicate `ContainerLogV2` rows per Windows node, bounded by the size of the existing container log files at upgrade time.
- **Real-time data**: Not lost. Post-spike ingestion resumes immediately on the next sequence number.
- **Downstream**: Customers who alert on `ContainerLogV2` rate may see one transient burst per Windows node during the rollout window. The rows are valid log lines (not corrupt), just duplicated.
- **Linux**: The same fluent-bit 5.x code path applies. The Linux A/B soak in this PR did not show a comparable burst — likely because the Linux container log files on the test cluster were smaller and rotated more aggressively, so the re-tail volume was absorbed in normal noise. This is not a Windows-specific bug; it is fluent-bit 5.x DB-migration behavior.

**Verdict**: ✅ **Accepted as a one-time upgrade cost.** Documented here and in the release notes. No mitigation is needed because:

- It self-heals on first boot (subsequent restarts are clean).
- Pre-deleting the DB to dodge the migration would itself force a full re-tail.
- Setting `Read_from_Head false` would silently *drop* logs on every legitimate pod restart, which is strictly worse than a one-time duplicate burst.

---