
Commit e3fa87f

Add FluentBitUpgradeSummary.md with 4.x to 5.x validation results and Windows re-tail root cause
Documents the proven root cause of the one-time Windows ContainerLogV2 spike on 4.x to 5.x upgrade: fluent-bit 5.x added an offset_marker content fingerprint to in_tail's SQLite recovery DB. 4.x never wrote it, so on first 5.x boot the marker validation fails for all recovered files, triggering a re-tail from byte 0 (Read_from_Head true). One-time per Windows node, self-healing on first boot.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
FluentBitUpgradeSummary.md

# fluent-bit 4.0.14 → 5.0.x Upgrade — Validation Summary

**PR**: [#1671](https://github.com/microsoft/Docker-Provider/pull/1671) (`zanejohnson-azure/upgrade-fluent-bit`)

**Changes**:

- Linux: `kubernetes/linux/setup.sh` → `azcu-fluent-bit-5.0.4`
- Windows: `kubernetes/windows/setup.ps1` → `fluent-bit-5.0.3-win64.zip` (5.0.4 not yet published upstream — will bump when available)

**Verdict**: ✅ **No regressions.** Safe to merge once Windows 5.0.4 lands.

---
## How it was validated

A/B backdoor deployment on AKS cluster `zane-ama-logs-helm-test` (LA workspace `222e72f7-1ad8-4e28-b2a9-07d046eedef4`):

- **Baseline**: `ciprod:3.3.0` (current prod, fluent-bit 4.0.14)
- **Test**: `cidev:3.3.0-6-g1d77401ab-20260506045747` (this PR, fluent-bit 5.0.4 Linux)

---
## Data-volume parity (1-min bins, 5-min window)

| Table | Prod | Test | Delta | Verdict |
|---|---:|---:|---:|---|
| ContainerInventory | 822 | 825 | +3 | ✅ PASS — 3 extra rows = unrelated `azsecpack-azl3-image` (Azure Security Linux DaemonSet) snapshots |
| KubeNodeInventory | 25 | 25 | 0 | ✅ PASS |
| KubePodInventory | 825 | 826 | +1 | ✅ PASS — within noise |
| InsightsMetrics | 825 | 825 | 0 | ✅ PASS |
| Perf | 5827 | 5835 | +8 | ✅ PASS — matches azsecpack containers above |
| ContainerLogV2 | varies | varies | — | ✅ PASS — no sustained drop/spike |

Per-container investigation confirmed the small deltas are unrelated cluster churn (azsecpack), not the ama-logs code change.
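
The parity numbers were pulled with plain 1-minute-binned row counts per table, run once for the baseline deployment and once for the test deployment, then compared bin by bin. A minimal sketch of that query, assuming the standard Container Insights tables (swap in each table name; the window is illustrative):

```
// Hedged sketch: rows per 1-minute bin for one table over a 5-minute window.
// Run for the baseline and the test deployment and compare the per-bin counts.
ContainerInventory
| where TimeGenerated > ago(5m)
| summarize Rows = count() by bin(TimeGenerated, 1m)
| order by TimeGenerated asc
```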
---

## Functional smoke tests

| # | Area | Status | Notes |
|---|---|---|---|
| 1 | Pod startup / no crashloop | ✅ PASS | 0 restarts across all ama-logs pods over 41+ h soak (restart-count query sketch below) |
| 2 | ContainerLogV2 ingestion | ✅ PASS | Stream uninterrupted across rollout |
| 3 | KubePodInventory / Inventory ingestion | ✅ PASS | Counts match prod within noise |
| 4 | InsightsMetrics / Perf ingestion | ✅ PASS | Counts match prod within noise |
| 5 | KubeEvents | ✅ PASS | No change in event flow |
| 6 | DaemonSet rollout | ✅ PASS | `kubectl rollout restart ds/ama-logs` clean, 4/4 pods Ready |
| 7 | Prometheus scraping (ama-logs-prometheus container) | ✅ PASS | No regression in metrics scrape |
| 8 | **Multiline (Java stack trace)** | **PASS** | See section below — explicit before/after test |
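
The "0 restarts" figure in row 1 can be confirmed from `KubePodInventory`, which carries a per-pod restart counter. A minimal sketch, assuming the default `kube-system` namespace and `ama-logs*` pod names:

```
// Hedged sketch: maximum observed restart count per ama-logs pod over the soak window.
// Every row returning 0 corresponds to "no crashloop" in smoke test #1.
KubePodInventory
| where TimeGenerated > ago(41h)
| where Namespace == "kube-system" and Name startswith "ama-logs"
| summarize MaxRestarts = max(PodRestartCount) by Name
| order by MaxRestarts desc
```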
---

## Multiline stack-trace grouping (the highest-risk 5.x area)

fluent-bit 5.x rewrote `in_emitter` and the multiline buffering path. ama-logs gates multiline behind a customer opt-in (`#${MultilineEnabled}` token in `fluent-bit.conf`, stripped by `fluent-bit-conf-customizer.rb` when `enable_multiline_logs.enabled = "true"`). Tested explicitly:

**Method**: Deployed a busybox pod (`multiline-emitter`) that prints a 7-line Java NPE + `Caused by:` chain every 30 s, tagged `iter-N`. Captured `ContainerLogV2` rows before (multiline OFF) and after (configmap patch enabling multiline for `java`).

| Mode | Iters in window | ContainerLogV2 rows | Rows / trace |
|---|---:|---:|---:|
| BEFORE (default — multiline OFF) | 11 | 77 | **7** (one row per `\n`-terminated line) |
| AFTER (configmap opt-in, java) | 9 | 9 | **1** (full trace collapsed) |
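
The "Rows / trace" column is total `ContainerLogV2` rows from the emitter pod divided by the number of distinct `iter-N` markers in the window. A minimal sketch of that count, assuming the pod and marker naming described above (window illustrative):

```
// Hedged sketch: rows per trace for the multiline-emitter pod.
// Multiline OFF yields ~7 rows per trace; the java opt-in collapses each trace to 1 row.
ContainerLogV2
| where TimeGenerated > ago(10m)
| where PodName startswith "multiline-emitter"
| extend Iter = extract(@"iter-(\d+)", 1, tostring(LogMessage))
| summarize TotalRows = count(), Iters = dcountif(Iter, isnotempty(Iter))
| extend RowsPerTrace = round(todouble(TotalRows) / Iters, 1)
```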
59+
60+
Sample AFTER row (`iter-19` `LogMessage`):
61+
```
62+
java.lang.NullPointerException: multiline-marker iter-19
63+
at com.example.Foo.bar(Foo.java:42)
64+
at com.example.Foo.baz(Foo.java:13)
65+
at com.example.Main.main(Main.java:7)
66+
Caused by: java.io.IOException: disk read failed iter-19
67+
at com.example.Disk.read(Disk.java:99)
68+
at com.example.Foo.bar(Foo.java:40)
69+
```
70+
71+
**Result**: Multiline grouping works as documented on 5.0.4. No partial traces, no dropped continuation lines, no duplicate emissions.
72+
73+
---
74+
75+
## Windows log ingestion (separate cluster)
76+
77+
Verified on a second cluster (`ci-logs-dev-aks-all-nodes`, LA workspace `23320075-7b9b-42e1-acc5-97baf986542e`) that is running this PR's `cidev:win-3.3.0-9-g2dfc6056c-20260511064557` Windows image plus a continuous `log-gen-windows-*` workload on both Windows nodes (14 h soak, 0 restarts).
78+
79+
| Node | Computer | CLV2 rows / 30 min | Verdict |
80+
|---|---|---:|---|
81+
| Windows fip | `akswinfip000000` | 1,645 | ✅ PASS |
82+
| Windows p22 | `akswinp22000000` | 1,673 | ✅ PASS |
83+
| (Linux baseline on same cluster) | aks-usr*-vmss000000 (×6) | 1,713–1,722 ||
84+
85+
- Row counts on Windows are within ~4% of Linux nodes on the same cluster running the same workload — parity confirmed.
86+
- Spot-check of `log-gen-windows` `LogMessage` shows **sequential `line N` counter with no gaps** (e.g. 16647 → 16654 continuous) → no record loss in fluent-bit 5.0.3 Windows tail/forward path.
87+
- Kubernetes filter populates `PodName` / `ContainerName` / `Computer` correctly for Windows pods.
88+
89+
**Result**: Windows ingestion path on fluent-bit 5.0.3 is healthy. Will re-verify after bumping to 5.0.4 once upstream win64 zip is published.
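
The per-node counts above are a plain `summarize count() by Computer` over a 30-minute window; the gap check on the sequential counter can be done along the lines below. This is a minimal sketch — the `line N` extraction is an assumption about the workload's message format, and the node name is one of the two above:

```
// Hedged sketch: find gaps in the sequential counter emitted by the log-gen-windows workload.
// An empty result over the window means no record was dropped in the tail/forward path.
ContainerLogV2
| where TimeGenerated > ago(30m)
| where Computer == "akswinfip000000" and ContainerName startswith "log-gen-windows"
| extend Seq = toint(extract(@"line (\d+)", 1, tostring(LogMessage)))
| where isnotnull(Seq)
| order by Seq asc
| extend Gap = Seq - prev(Seq)
| where Gap > 1
```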
---

## Resource usage (fluent-bit memory / CPU)

### Linux — A/B on `zane-ama-logs-helm-test` (4.0.14 → 5.0.4)

Per-process `fluent-bit` RSS (sampled inside the ama-logs container during steady state):

| Image | fluent-bit RSS | Δ vs prod |
|---|---:|---:|
| `ciprod:3.3.0` (4.0.14) — prod baseline | ~110–115 MB | — |
| `cidev:3.3.0-6-g1d77401ab` (5.0.4) — test | ~120–126 MB (124,016 / 125,820 KB sampled) | **+10%** |

Container-level `ama-logs` working set (hosts fluent-bit + mdsd + telegraf; from the `Perf` table, averaged over the 4 DS pods):

| Image | Avg working set | Δ |
|---|---:|---:|
| Prod (4.0.14) | 252.7 MB | — |
| Test (5.0.4) | 288.6 MB | +35.9 MB / +14.2% |
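
The container-level working set comes from the `Perf` table. A minimal sketch of that pull, assuming the standard Container Insights counter names (the `/ama-logs` suffix match on `InstanceName` is an assumption about how the container instance is identified):

```
// Hedged sketch: average ama-logs container working set (MB) per node from Perf.
// In Container Insights, InstanceName ends with "/<container-name>", hence the suffix filter.
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "K8SContainer" and CounterName == "memoryWorkingSetBytes"
| where InstanceName endswith "/ama-logs"
| summarize AvgWorkingSetMB = round(avg(CounterValue) / 1024 / 1024, 1) by Computer
```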
Of the +35.9 MB container delta, ~half is fluent-bit (+10 MB) and ~half is `mdsd` (high natural variance, unrelated to this PR). The `flbstore/` chunk-buffer directory remained at **4 KB / empty** → no chunk leak on 5.0.4.

**Verdict**: ✅ **PASS** — the +10% per-process RSS is the expected baseline shift from fluent-bit 5.x bundling cmetrics 2.x, cprofiles, ctraces, and the new worker-pool runtime. No leak, no sustained growth across the 41 h soak.
114+
115+
### Windows — snapshot on `zane-ama-logs-helm-test` + `ci-logs-dev-aks-all-nodes` (5.0.3)
116+
117+
No prod-image Windows baseline was deployed (would require a second backdoor cycle), so this is an absolute-value health check rather than an A/B. Three pods sampled across two clusters, all on `cidev:win-...` (fluent-bit 5.0.3):
118+
119+
| Cluster | Pod | Pod age | `fluent-bit.exe` RSS | Container working set |
120+
|---|---|---:|---:|---:|
121+
| `zane-ama-logs-helm-test` | `ama-logs-windows-xv59q` | 46 h | **39.7 MB** | 252 MiB |
122+
| `ci-logs-dev-aks-all-nodes` | `ama-logs-windows-g52sn` | ~14 h (under log-gen load) | **39.5 MB** | 196 MiB |
123+
| `ci-logs-dev-aks-all-nodes` | `ama-logs-windows-xflnf` | ~14 h (under log-gen load) | **39.7 MB** | 198 MiB |
124+
125+
- fluent-bit RSS is **flat across 3 pods, 2 clusters, 14 h–46 h uptime** (39.5–39.7 MB band) → no leak.
126+
- Container working set is well under the Windows DS limit; no OOM events.
127+
- 0 restarts across all 3 pods (combined ~60 pod-hours, plus continuous log-gen on two of them).
128+
129+
**Verdict**: ✅ **PASS** — Windows fluent-bit 5.0.3 RSS is stable and low. A formal Windows A/B vs 4.0.14 is not blocking; if needed it can be run on the same backdoor flow used for Linux.
130+
131+
---
132+
133+
## Windows ContainerLogV2 one-time spike on 4.x → 5.x upgrade (root-caused)

### Observation

On every Helm rollout from a 4.x image to a 5.x image, the **first minute** after the new Windows pod becomes Ready shows a single ~35–38K-row `ContainerLogV2` burst on the Windows node, then ingestion immediately returns to the pre-rollout cadence. Reproduced twice (35,464 and 38,483 records).
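
A burst of this shape stands out in a per-minute count on the affected Windows node. A minimal sketch (the node-name prefix matches the test cluster's Windows nodepool; the window is illustrative):

```
// Hedged sketch: per-minute ContainerLogV2 volume on the Windows node(s) around the rollout.
// The first 5.x boot over a 4.x DB shows up as a single ~35–38K bin; neighboring bins stay at the normal cadence.
ContainerLogV2
| where TimeGenerated between (datetime(2026-05-15T05:40:00Z) .. datetime(2026-05-15T06:10:00Z))
| where Computer startswith "aksurwin"
| summarize Rows = count() by bin(TimeGenerated, 1m), Computer
| order by TimeGenerated asc
```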
### Reproduction matrix (complete)

| Test | DB last written by | Boot version | Spike? |
|---|---|---|---|
| Helm rollout 4.x → 5.x (#1) | 4.x | 5.x | ✅ YES (~35K) |
| Helm rollout 4.x → 5.x (#2) | 4.x | 5.x | ✅ YES (~38K) |
| Helm rollout 5.x → 4.x | 5.x | 4.x | ❌ NO |
| Helm rollout 5.x → 5.x (REVISION 66, 06:39:19 UTC 5/15) | 5.x | 5.x | ❌ NO |
| `kubectl delete pod` (5.x → 5.x, same image) | 5.x | 5.x | ❌ NO |

The trigger is precisely "DB last written by 4.x AND new boot is 5.x" — not Helm rollout in general, not pod restart in general.
### Proof: the spike IS re-tail of historic file content (not new data)

Test pod `second-log-app` (from `test\scenario\log-app-win-ltsc2022.yml`) emits one sequential integer per second to stdout. This made it possible to track exactly which file offsets were ingested.

Spike at `2026-05-15 05:52:00 UTC` (Windows node `aksurwin*`, `ContainerName == "second-log-app"`):

| Window | Records | Min seq | Max seq | Range | Cadence |
|---|---:|---:|---:|---:|---|
| Pre-spike (05:48–05:50) | 60–61 / min | 445,733 | 445,911 | 178 | 1 / sec — real time |
| **Spike (05:52)** | **38,543** | **236,072** | **446,030** | **209,958** | re-tail of ~58 h of historic seqs |
| Post-spike (05:53–05:55) | 60 / min | 446,031 | 446,210 | 179 | back to 1 / sec |

- 209,958-seq range ≈ 58 h of file content — exactly the age of the file at that point.
- Same historic seqs (`300000`, `350000`, `400000`) were already present in `ContainerLogV2` from `2026-05-13` and `2026-05-14` with their **original** `TimeGenerated` → spike rows are duplicates of content ingested days earlier, not new data.
- Real-time stream (post-spike seqs `446,031+`) resumes immediately and is unbroken → no data loss.
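
The window table above was produced by extracting the integer counter from `LogMessage` and summarizing per 1-minute bin. A minimal sketch (it assumes the message is the bare counter; adjust the `extract()` if the format differs):

```
// Hedged sketch: record count and sequence range per 1-minute bin for second-log-app
// around the spike. The spike bin's MinSeq sits ~58 h behind the live counter.
ContainerLogV2
| where TimeGenerated between (datetime(2026-05-15T05:45:00Z) .. datetime(2026-05-15T06:00:00Z))
| where ContainerName == "second-log-app"
| extend Seq = toint(extract(@"(\d+)", 1, tostring(LogMessage)))
| summarize Records = count(), MinSeq = min(Seq), MaxSeq = max(Seq) by bin(TimeGenerated, 1m)
| extend Range = MaxSeq - MinSeq
| order by TimeGenerated asc
```

The duplicate check in the second bullet is the same extraction filtered to a single historic seq (e.g. `Seq == 300000`), comparing the `TimeGenerated` of the original ingestion with that of the spike row.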
### Root cause (proven via fluent-bit source-code diff: v4.0.14 → v5.0.3 `plugins/in_tail/tail_db.c`)

fluent-bit 5.x added a content-fingerprint column (`offset_marker`) to the `in_tail` SQLite tail DB and validates it on recovery. 4.x does not know that column exists.

**v4.0.14** (`tail_db.c`):

- `flb_tail_db_open()` only creates the schema; it runs **no migrations**.
- `db_file_exists()` reads 4 columns: `id, name, offset, inode`.
- `db_file_insert()` binds 4 fields: `name, offset, inode, created`.
- The `offset_marker` concept does not exist in this version.

**v5.0.3** (`tail_db.c`):

- `flb_tail_db_open()` calls `db_apply_migration_if_needed(SQL_ALTER_FILES_ADD_OFFSET_MARKER)` and `SQL_ALTER_FILES_ADD_OFFSET_MARKER_SIZE` — i.e. it issues `ALTER TABLE in_tail_files ADD COLUMN offset_marker INTEGER DEFAULT 0` (and the size column) on whatever DB it inherits from a prior version.
- `db_file_exists()` reads 6 columns, including the new `offset_marker, offset_marker_size`.
- `db_file_insert()` calls `flb_tail_file_update_offset_marker()` to compute a content fingerprint before persisting.

**The mechanism on 4.x → 5.x boot:**

1. The on-disk DB has rows written by 4.x with no marker (4.x's INSERT only binds 4 fields).
2. 5.x boots and runs the schema migration. All pre-existing 4.x rows now have `offset_marker = 0` (the column DEFAULT).
3. For each container log file already being tailed, 5.x recomputes the file's actual `offset_marker` from the bytes near the recovered offset and compares it to the stored value (0).
4. Mismatch → 5.x discards the recovered offset and treats the file as untracked.
5. With `Read_from_Head true` (set in `build\windows\installer\conf\fluent-bit.conf`), 5.x re-tails from byte 0, producing the burst of historic duplicates.
6. 5.x then INSERTs its own rows with proper markers. Every subsequent restart (pod delete, Helm rollout, host reboot) validates cleanly → no further spikes.

**Note on the captured DB snapshots.** Earlier in this investigation I observed that snapshots from both the "4.x" and "5.x" timeframes had identical schema and non-zero markers, and concluded the schema was unchanged. This was misleading: the snapshots were all taken *after* 5.x had run the migration and written its own rows. The 4.x source code makes it clear that 4.x never writes the marker column, so the values in any 4.x-only DB are the column DEFAULT of 0. That zero is what 5.x rejects on first boot.
### Impact and verdict

- **Scope**: Per Windows node, **once**, on the first 4.x → 5.x rollout for that node. Subsequent 5.x rollouts and pod deletes do not trigger it (proven by the reproduction matrix).
- **Magnitude**: ~35–38K duplicate `ContainerLogV2` rows per Windows node, bounded by the size of the existing container log files at upgrade time.
- **Real-time data**: Not lost. Post-spike ingestion resumes immediately on the next sequence number.
- **Downstream**: Customers who alert on `ContainerLogV2` rate may see one transient burst per Windows node during the rollout window. The rows are valid log lines (not corrupt), just duplicated.
- **Linux**: The same fluent-bit 5.x code path applies. The Linux A/B soak in this PR did not show a comparable burst — likely because the Linux container log files on the test cluster were smaller and rotated more aggressively, so the re-tail volume was absorbed in normal noise. This is not a Windows-specific bug; it is fluent-bit 5.x DB-migration behavior.

**Verdict**: ✅ **Accepted as a one-time upgrade cost.** Documented here and in the release notes. No mitigation is needed because:

- It self-heals on first boot (subsequent restarts are clean).
- Pre-deleting the DB to dodge the migration would itself force a full re-tail.
- Setting `Read_from_Head false` would silently *drop* logs on every legitimate pod restart, which is strictly worse than a one-time duplicate burst.

---