# fluent-bit 4.0.14 → 5.0.x Upgrade — Validation Summary

**PR**: [#1671](https://github.com/microsoft/Docker-Provider/pull/1671) (`zanejohnson-azure/upgrade-fluent-bit`)
**Changes**:
- Linux: `kubernetes/linux/setup.sh` → `azcu-fluent-bit-5.0.4`
- Windows: `kubernetes/windows/setup.ps1` → `fluent-bit-5.0.3-win64.zip` (5.0.4 not yet published upstream — will bump when available)

**Verdict**: ✅ **No regressions.** Safe to merge once Windows 5.0.4 lands.

---

## How it was validated

A/B backdoor deployment on AKS cluster `zane-ama-logs-helm-test` (LA workspace `222e72f7-1ad8-4e28-b2a9-07d046eedef4`):
- **Baseline**: `ciprod:3.3.0` (current prod, fluent-bit 4.0.14)
- **Test**: `cidev:3.3.0-6-g1d77401ab-20260506045747` (this PR, fluent-bit 5.0.4 Linux)

---

## Data-volume parity (1-min bins, 5-min window)

| Table | Prod | Test | Delta | Verdict |
|---|---:|---:|---:|---|
| ContainerInventory | 822 | 825 | +3 | ✅ PASS — 3 extra rows are unrelated `azsecpack-azl3-image` (Azure Security Linux DaemonSet) snapshots |
| KubeNodeInventory | 25 | 25 | 0 | ✅ PASS |
| KubePodInventory | 825 | 826 | +1 | ✅ PASS — within noise |
| InsightsMetrics | 825 | 825 | 0 | ✅ PASS |
| Perf | 5827 | 5835 | +8 | ✅ PASS — matches the azsecpack containers above |
| ContainerLogV2 | varies | varies | — | ✅ PASS — no sustained drop/spike |

Per-container investigation confirmed the small deltas are unrelated cluster churn (azsecpack), not the ama-logs code change.

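The per-table counts above came from simple row-count KQL against the workspace. A minimal sketch of the method using the `azure-monitor-query` SDK — the table names and workspace ID are the real ones from this validation, but the KQL is illustrative: the per-node/per-image filtering that splits prod vs. test rows is omitted for brevity.

```python
# Count rows per table in 1-minute bins over a 5-minute window — the
# shape of the parity check above. The filtering that attributes rows
# to the prod vs. test DaemonSet pods is omitted (illustrative sketch).
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "222e72f7-1ad8-4e28-b2a9-07d046eedef4"
TABLES = ["ContainerInventory", "KubeNodeInventory", "KubePodInventory",
          "InsightsMetrics", "Perf", "ContainerLogV2"]

client = LogsQueryClient(DefaultAzureCredential())

for table in TABLES:
    query = f"{table} | summarize Rows = count() by bin(TimeGenerated, 1m)"
    resp = client.query_workspace(WORKSPACE_ID, query,
                                  timespan=timedelta(minutes=5))
    for row in resp.tables[0].rows:
        print(table, row)
```
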
---

## Functional smoke tests

| # | Area | Status | Notes |
|---|---|---|---|
| 1 | Pod startup / no crashloop | ✅ PASS | 0 restarts across all ama-logs pods over 41+ h soak |
| 2 | ContainerLogV2 ingestion | ✅ PASS | Stream uninterrupted across rollout |
| 3 | KubePodInventory / Inventory ingestion | ✅ PASS | Counts match prod within noise |
| 4 | InsightsMetrics / Perf ingestion | ✅ PASS | Counts match prod within noise |
| 5 | KubeEvents | ✅ PASS | No change in event flow |
| 6 | DaemonSet rollout | ✅ PASS | `kubectl rollout restart ds/ama-logs` clean, 4/4 pods Ready |
| 7 | Prometheus scraping (ama-logs-prometheus container) | ✅ PASS | No regression in metrics scrape |
| 8 | **Multiline (Java stack trace)** | ✅ **PASS** | See section below — explicit before/after test |

---

## Multiline stack-trace grouping (the highest-risk 5.x area)

fluent-bit 5.x rewrote `in_emitter` and the multiline buffering path. ama-logs gates multiline behind a customer opt-in: a `#${MultilineEnabled}` token in `fluent-bit.conf` that `fluent-bit-conf-customizer.rb` strips when `enable_multiline_logs.enabled = "true"`. Tested explicitly:

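The gate itself is a comment trick: the multiline filter lines ship prefixed with the literal token, so fluent-bit parses them as comments until the customizer strips the token on opt-in. The real implementation is the Ruby customizer named above; this is an illustrative Python paraphrase of the idea (function name hypothetical):

```python
# Illustrative paraphrase of the multiline opt-in gate. Token-prefixed
# lines are comments by default; opting in strips the token and the
# lines become live config. The shipping implementation is
# fluent-bit-conf-customizer.rb (Ruby) — this sketch only mirrors it.
TOKEN = "#${MultilineEnabled}"

def customize_conf(conf_text: str, multiline_enabled: bool) -> str:
    if not multiline_enabled:
        return conf_text  # token-prefixed lines stay commented out
    return "\n".join(
        line.replace(TOKEN, "", 1) if line.startswith(TOKEN) else line
        for line in conf_text.splitlines()
    )
```
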
**Method**: Deployed a busybox pod (`multiline-emitter`) that prints a 7-line Java NPE + `Caused by:` chain every 30 s, tagged `iter-N`. Captured `ContainerLogV2` rows before (multiline OFF) and after (configmap patch enabling multiline for `java`).

| Mode | Iters in window | ContainerLogV2 rows | Rows / trace |
|---|---:|---:|---:|
| BEFORE (default — multiline OFF) | 11 | 77 | **7** (one row per `\n`-terminated line) |
| AFTER (configmap opt-in, java) | 9 | 9 | **1** (full trace collapsed) |

Sample AFTER row (`iter-19` `LogMessage`):
```
java.lang.NullPointerException: multiline-marker iter-19
    at com.example.Foo.bar(Foo.java:42)
    at com.example.Foo.baz(Foo.java:13)
    at com.example.Main.main(Main.java:7)
Caused by: java.io.IOException: disk read failed iter-19
    at com.example.Disk.read(Disk.java:99)
    at com.example.Foo.bar(Foo.java:40)
```

**Result**: Multiline grouping works as documented on 5.0.4. No partial traces, no dropped continuation lines, no duplicate emissions.

---

## Windows log ingestion (separate cluster)

Verified on a second cluster (`ci-logs-dev-aks-all-nodes`, LA workspace `23320075-7b9b-42e1-acc5-97baf986542e`) that is running this PR's `cidev:win-3.3.0-9-g2dfc6056c-20260511064557` Windows image plus a continuous `log-gen-windows-*` workload on both Windows nodes (14 h soak, 0 restarts).

| Node | Computer | CLV2 rows / 30 min | Verdict |
|---|---|---:|---|
| Windows fip | `akswinfip000000` | 1,645 | ✅ PASS |
| Windows p22 | `akswinp22000000` | 1,673 | ✅ PASS |
| (Linux baseline on same cluster) | `aks-usr*-vmss000000` (×6) | 1,713–1,722 | — |

- Row counts on Windows are within ~4% of Linux nodes on the same cluster running the same workload — parity confirmed.
- Spot-check of `log-gen-windows` `LogMessage` shows a **sequential `line N` counter with no gaps** (e.g. 16647 → 16654 continuous) → no record loss in the fluent-bit 5.0.3 Windows tail/forward path (gap check sketched below).
- Kubernetes filter populates `PodName` / `ContainerName` / `Computer` correctly for Windows pods.

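The no-gaps claim was checked by parsing the `line N` counter out of each `LogMessage` and confirming contiguity. A minimal sketch of that check — the `line <number>` message shape is the log-gen workload's; everything else is illustrative:

```python
# Gap check for the log-gen sequential counter: parse "line N" out of
# each LogMessage, sort, and report any non-contiguous jumps. An empty
# result (as observed, e.g. 16647 → 16654 with no holes) means no
# record loss across the tail/forward path.
import re

def find_gaps(messages: list[str]) -> list[tuple[int, int]]:
    seqs = sorted(int(m.group(1))
                  for msg in messages
                  if (m := re.search(r"line (\d+)", msg)))
    return [(a, b) for a, b in zip(seqs, seqs[1:]) if b != a + 1]

assert find_gaps([f"line {n}" for n in range(16647, 16655)]) == []
```
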
**Result**: Windows ingestion path on fluent-bit 5.0.3 is healthy. Will re-verify after bumping to 5.0.4 once the upstream win64 zip is published.

---

## Resource usage (fluent-bit memory / CPU)

### Linux — A/B on `zane-ama-logs-helm-test` (4.0.14 → 5.0.4)

Per-process `fluent-bit` RSS (sampled inside the ama-logs container during steady state):

| Image | fluent-bit RSS | Δ vs prod |
|---|---:|---:|
| `ciprod:3.3.0` (4.0.14) — prod baseline | ~110–115 MB | — |
| `cidev:3.3.0-6-g1d77401ab` (5.0.4) — test | ~120–126 MB (124,016 / 125,820 KB sampled) | **+10%** |

Container-level `ama-logs` working set (hosts fluent-bit + mdsd + telegraf; from the `Perf` table, averaged over 4 DS pods):

| Image | Avg working set | Δ |
|---|---:|---:|
| Prod (4.0.14) | 252.7 MB | — |
| Test (5.0.4) | 288.6 MB | +35.9 MB / +14.2% |

Of the +35.9 MB container delta, ~half is fluent-bit (+10 MB) and ~half is `mdsd` (high natural variance, unrelated to this PR). The `flbstore/` chunk-buffer directory remained at **4 KB / empty** → no chunk leak on 5.0.4.
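
The working-set averages come from the standard Container Insights `Perf` counters. A sketch of the query shape, run via the same `LogsQueryClient` pattern as the parity sketch above — the `InstanceName` filter is a loose assumption (the field embeds cluster ID, pod UID, and container name), not the exact filter used:

```python
# Average ama-logs container working set from Perf, per instance.
# ObjectName/CounterName are the standard Container Insights values;
# the InstanceName match is illustrative, not the literal filter.
WORKING_SET_QUERY = """
Perf
| where ObjectName == "K8SContainer" and CounterName == "memoryWorkingSetBytes"
| where InstanceName has "ama-logs"
| summarize AvgWorkingSetMB = avg(CounterValue) / (1024 * 1024) by InstanceName
"""
# resp = client.query_workspace(WORKSPACE_ID, WORKING_SET_QUERY,
#                               timespan=timedelta(hours=41))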

**Verdict**: ✅ **PASS** — the +10% per-process RSS is the expected baseline shift from fluent-bit 5.x bundling cmetrics 2.x, cprofiles, ctraces, and the new worker-pool runtime. No leak, no sustained growth across the 41 h soak.

### Windows — snapshot on `zane-ama-logs-helm-test` + `ci-logs-dev-aks-all-nodes` (5.0.3)

No prod-image Windows baseline was deployed (that would require a second backdoor cycle), so this is an absolute-value health check rather than an A/B. Three pods sampled across two clusters, all on `cidev:win-...` (fluent-bit 5.0.3):

| Cluster | Pod | Pod age | `fluent-bit.exe` RSS | Container working set |
|---|---|---:|---:|---:|
| `zane-ama-logs-helm-test` | `ama-logs-windows-xv59q` | 46 h | **39.7 MB** | 252 MiB |
| `ci-logs-dev-aks-all-nodes` | `ama-logs-windows-g52sn` | ~14 h (under log-gen load) | **39.5 MB** | 196 MiB |
| `ci-logs-dev-aks-all-nodes` | `ama-logs-windows-xflnf` | ~14 h (under log-gen load) | **39.7 MB** | 198 MiB |

- fluent-bit RSS is **flat across 3 pods, 2 clusters, and 14–46 h of uptime** (39.5–39.7 MB band) → no leak.
- Container working set is well under the Windows DS limit; no OOM events.
- 0 restarts across all 3 pods (combined ~60 pod-hours, plus continuous log-gen on two of them).

**Verdict**: ✅ **PASS** — Windows fluent-bit 5.0.3 RSS is stable and low. A formal Windows A/B vs 4.0.14 is not blocking; if needed, it can be run on the same backdoor flow used for Linux.

---

## Windows ContainerLogV2 one-time spike on 4.x → 5.x upgrade (root-caused)

### Observation

On every Helm rollout from a 4.x image to a 5.x image, the **first minute** after the new Windows pod becomes Ready shows a single ~35–38K `ContainerLogV2` burst on the Windows node, then ingestion immediately returns to the pre-rollout cadence. Reproduced twice (35,464 and 38,483 records).

### Reproduction matrix (complete)

| Test | DB last written by | Boot version | Spike? |
|---|---|---|---|
| Helm rollout 4.x → 5.x (#1) | 4.x | 5.x | ✅ YES (~35K) |
| Helm rollout 4.x → 5.x (#2) | 4.x | 5.x | ✅ YES (~38K) |
| Helm rollout 5.x → 4.x | 5.x | 4.x | ❌ NO |
| Helm rollout 5.x → 5.x (REVISION 66, 06:39:19 UTC 5/15) | 5.x | 5.x | ❌ NO |
| `kubectl delete pod` (5.x → 5.x, same image) | 5.x | 5.x | ❌ NO |

The trigger is precisely "DB last written by 4.x AND new boot is 5.x" — not Helm rollout in general, not pod restart in general.

### Proof: the spike IS re-tail of historic file content (not new data)

Test pod `second-log-app` (from `test\scenario\log-app-win-ltsc2022.yml`) emits one sequential integer per second to stdout, which makes it possible to track exactly which file offsets were ingested.

Spike at `2026-05-15 05:52:00 UTC` (Windows node `aksurwin*`, `ContainerName == "second-log-app"`):

| Window | Records | Min seq | Max seq | Range | Cadence |
|---|---:|---:|---:|---:|---|
| Pre-spike (05:48–05:50) | 60–61 / min | 445,733 | 445,911 | 178 | 1 / sec — real time |
| **Spike (05:52)** | **38,543** | **236,072** | **446,030** | **209,958** | re-tail of ~58 h of historic seqs |
| Post-spike (05:53–05:55) | 60 / min | 446,031 | 446,210 | 179 | back to 1 / sec |

- The 209,958-seq range ≈ 58 h of file content — exactly the age of the file at that point.
- The same historic seqs (`300000`, `350000`, `400000`) were already present in `ContainerLogV2` from `2026-05-13` and `2026-05-14` with their **original** `TimeGenerated` → the spike rows are duplicates of content ingested days earlier, not new data.
- The real-time stream (post-spike seqs `446,031+`) resumes immediately and is unbroken → no data loss.
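
The per-window counts and sequence ranges above were computed by extracting the sequential integer from `LogMessage` and aggregating per 1-minute bin. A sketch of the KQL shape (the `extract()` pattern assumes the message body is the bare integer the test pod emits; run via `LogsQueryClient.query_workspace` as in the parity sketch):

```python
# Records + sequence range per 1-minute bin for second-log-app around
# the spike window. extract() pulls the sequential integer out of
# LogMessage; the exact message shape is assumed.
SPIKE_QUERY = r"""
ContainerLogV2
| where ContainerName == "second-log-app"
| extend Seq = tolong(extract(@"(\d+)", 1, LogMessage))
| summarize Records = count(), MinSeq = min(Seq), MaxSeq = max(Seq),
            Range = max(Seq) - min(Seq) by bin(TimeGenerated, 1m)
| order by TimeGenerated asc
"""
```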

### Root cause (proven via fluent-bit source-code diff: v4.0.14 → v5.0.3 `plugins/in_tail/tail_db.c`)

fluent-bit 5.x added a content-fingerprint column (`offset_marker`) to the `in_tail` SQLite tail DB and validates it on recovery. 4.x does not know that column exists.

**v4.0.14** (`tail_db.c`):
- `flb_tail_db_open()` only creates the schema; it runs **no migrations**.
- `db_file_exists()` reads 4 columns: `id, name, offset, inode`.
- `db_file_insert()` binds 4 fields: `name, offset, inode, created`.
- The `offset_marker` concept does not exist in this version.

**v5.0.3** (`tail_db.c`):
- `flb_tail_db_open()` calls `db_apply_migration_if_needed(SQL_ALTER_FILES_ADD_OFFSET_MARKER)` and `SQL_ALTER_FILES_ADD_OFFSET_MARKER_SIZE` — i.e. it issues `ALTER TABLE in_tail_files ADD COLUMN offset_marker INTEGER DEFAULT 0` (and the size column) on whatever DB it inherits from a prior version.
- `db_file_exists()` reads 6 columns, including the new `offset_marker, offset_marker_size`.
- `db_file_insert()` calls `flb_tail_file_update_offset_marker()` to compute a content fingerprint before persisting.

**The mechanism on 4.x → 5.x boot:**

1. The on-disk DB has rows written by 4.x with no marker (4.x's INSERT only binds 4 fields).
2. 5.x boots and runs the schema migration. All pre-existing 4.x rows now have `offset_marker = 0` (the column DEFAULT — see the sketch after this list).
3. For each container log file already being tailed, 5.x recomputes the file's actual `offset_marker` from the bytes near the recovered offset and compares it to the stored value (0).
4. Mismatch → 5.x discards the recovered offset and treats the file as untracked.
5. With `Read_from_Head true` (set in `build\windows\installer\conf\fluent-bit.conf`), 5.x re-tails from byte 0, producing the burst of historic duplicates.
6. 5.x then INSERTs its own rows with proper markers. Every subsequent restart (pod delete, Helm rollout, host reboot) validates cleanly → no further spikes.

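Steps 1–4 are easy to reproduce outside fluent-bit with plain SQLite. A minimal sketch — the table/column names follow the source diff above, the schema is trimmed to the columns this section discusses, and the fingerprint comparison is a paraphrase of the recovery validation around `flb_tail_file_update_offset_marker()`, not the literal C:

```python
# Reproduce the 4.x -> 5.x DB handoff: the "4.x" writer knows nothing
# about offset_marker; the "5.x" migration adds it with DEFAULT 0, so
# every inherited row carries marker 0 and fails the recovery check.
import sqlite3

db = sqlite3.connect(":memory:")

# 4.x boot: no marker columns; INSERT binds only 4 fields.
db.execute("""CREATE TABLE in_tail_files (
    id INTEGER PRIMARY KEY, name TEXT, offset INTEGER,
    inode INTEGER, created INTEGER)""")
db.execute("INSERT INTO in_tail_files (name, offset, inode, created) "
           "VALUES ('app.log', 123456, 777, 1770000000)")

# 5.x boot: schema migration on the inherited DB.
db.execute("ALTER TABLE in_tail_files ADD COLUMN offset_marker INTEGER DEFAULT 0")
db.execute("ALTER TABLE in_tail_files ADD COLUMN offset_marker_size INTEGER DEFAULT 0")

stored = db.execute("SELECT offset, offset_marker FROM in_tail_files "
                    "WHERE name = 'app.log'").fetchone()

# Paraphrased recovery check: 5.x recomputes a content fingerprint from
# the file bytes near the stored offset; real content never hashes to
# the inherited 0, so the offset is discarded and the file re-tailed.
recomputed_marker = 0xBEEF  # stand-in for the recomputed fingerprint
if stored[1] != recomputed_marker:
    print(f"marker mismatch at offset {stored[0]}: "
          "discard offset, re-tail from head (Read_from_Head true)")
```
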
**Note on the captured DB snapshots.** Earlier in this investigation I observed that snapshots from both "4.x" and "5.x" timeframes had identical schema and non-zero markers, and concluded the schema was unchanged. This was misleading: the snapshots were all taken *after* 5.x had run the migration and written its own rows. The 4.x source code makes it clear that 4.x never writes the marker column, so the values in any 4.x-only DB are the column DEFAULT of 0. That zero is what 5.x rejects on first boot.

### Impact and verdict

- **Scope**: Per Windows node, **once**, on the first 4.x → 5.x rollout per node. Subsequent 5.x rollouts and pod deletes do not trigger it (proven by the reproduction matrix).
- **Magnitude**: ~35–38K duplicate `ContainerLogV2` rows per Windows node, bounded by the size of the existing container log files at upgrade time.
- **Real-time data**: Not lost. Post-spike ingestion resumes immediately on the next sequence number.
- **Downstream**: Customers who alert on `ContainerLogV2` rate may see one transient burst per Windows node during the rollout window. The rows are valid log lines (not corrupt), just duplicated.
- **Linux**: The same fluent-bit 5.x code path applies. The Linux A/B soak in this PR did not show a comparable burst — likely because Linux container log files on the test cluster were smaller and rotated more aggressively, so the re-tail volume was absorbed in normal noise. This is not a Windows-specific bug; it is fluent-bit 5.x DB-migration behavior.

**Verdict**: ✅ **Accepted as a one-time upgrade cost.** Documented here and in the release notes. No mitigation needed because:
- It self-heals on first boot (subsequent restarts are clean).
- Pre-deleting the DB to dodge the migration would itself force a full re-tail.
- Setting `Read_from_Head false` would silently *drop* logs on every legitimate pod restart, which is strictly worse than a one-time duplicate burst.

---
