Skip to content

in_tail: reconcile files after missed inotify events#11750

Open
sandy2008 wants to merge 2 commits intofluent:masterfrom
sandy2008:codex/in-tail-inotify-reconcile
Open

in_tail: reconcile files after missed inotify events#11750
sandy2008 wants to merge 2 commits intofluent:masterfrom
sandy2008:codex/in-tail-inotify-reconcile

Conversation

@sandy2008
Copy link
Copy Markdown

@sandy2008 sandy2008 commented Apr 27, 2026

This makes the inotify tail backend recover from missed events instead of waiting for the next modify event to happen to the same file. That matters when the inotify queue overflows, or when a truncate/copytruncate/delete/rotation event is missed while a file still has readable data.

What changed:

  • handle IN_Q_OVERFLOW by reconciling tracked files and rescanning paths
  • share truncate, delete, rotation, and pending-byte checks through one reconcile helper
  • catch missed truncate/copytruncate during progress checks, not only on IN_MODIFY
  • keep unlinked files around while they still have pending or buffered bytes
  • use the no-DB offset marker path so copytruncate detection still works without a SQL DB
  • return watch re-registration failures instead of ignoring them

Testing:

  • Linux Docker build: cmake -S . -B /build -DFLB_TESTS_RUNTIME=On -DFLB_TESTS_INTERNAL=On -DFLB_EXAMPLES=Off && cmake --build /build --target fluent-bit-bin -j"$(nproc)"
  • Linux Docker normal integration: FLUENT_BIT_BINARY=/build/bin/fluent-bit /tmp/flb-it-venv/bin/python -m pytest tests/integration/scenarios/in_tail/tests/test_in_tail_001.py -q -k "copytruncate_with_stale_writer or follows_rename_rotation or handles_multiple_rename_rotations or multi_file_rapid_rotation or rotate_wait_keeps_old_inode_then_purges_it"
  • Linux Docker valgrind integration: FLUENT_BIT_BINARY=/build/bin/fluent-bit VALGRIND=1 VALGRIND_STRICT=1 /tmp/flb-it-venv/bin/python -m pytest tests/integration/scenarios/in_tail/tests/test_in_tail_001.py -q -k "copytruncate_with_stale_writer or follows_rename_rotation or handles_multiple_rename_rotations or multi_file_rapid_rotation or rotate_wait_keeps_old_inode_then_purges_it"

Summary by CodeRabbit

  • Refactor

    • Centralized resumable-offset persistence for consistent progress tracking.
    • Consolidated truncate-recovery into a single reusable routine for clearer behavior.
    • Reworked event handling to use a shared reconciliation path for file state and rotation management.
  • Bug Fixes

    • Inotify overflow now triggers full reconciliation to avoid missed events.
    • More reliable detection and cleanup of deleted/rotated files and pending-data accounting.

Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

Walkthrough

Centralizes resumable-offset persistence and truncate-reset into new helpers and refactors inotify handling to use a shared file-state reconciler, unifying truncation, rotation, deletion, pending-byte detection, and DB updates across event and periodic checks.

Changes

Cohort / File(s) Summary
Tail file logic
plugins/in_tail/tail_file.c, plugins/in_tail/tail_file.h
Adds update_resumable_offset_state() and exported flb_tail_file_reset_on_truncate(); consolidates seek/reset/truncate handling and centralizes SQLDB vs marker persistence; refactors callers (e.g., adjust_counters, flb_tail_file_chunk) to use the new helpers.
Inotify & reconciliation
plugins/in_tail/tail_fs_inotify.c
Introduces reconcile_file_state() and moves per-event/tracking logic into it; tail_fs_event() treats IN_Q_OVERFLOW by triggering a full scan and otherwise delegates rotation/truncate/delete/pending-byte decisions to the reconciler; periodic progress checks now call the reconciler.
Public API surface
plugins/in_tail/tail_file.h
Adds declaration: int flb_tail_file_reset_on_truncate(struct flb_tail_file *file, int64_t size_delta, const char *caller);.

Sequence Diagram(s)

sequenceDiagram
    participant Inotify
    participant TailFS as "tail_fs_inotify"
    participant Reconciler as "reconcile_file_state"
    participant TailFile as "tail_file"
    participant SQLDB

    Inotify->>TailFS: file event (modify/move/unlink/IN_Q_OVERFLOW)
    alt IN_Q_OVERFLOW
        TailFS->>TailFS: trigger full flb_tail_scan()
    else normal event
        TailFS->>Reconciler: reconcile_file_state(file)
        Reconciler->>TailFile: fstat(), compare offsets/sizes
        alt truncated or offset mismatch
            Reconciler->>TailFile: flb_tail_file_reset_on_truncate(size_delta, caller)
            TailFile->>SQLDB: update_resumable_offset_state() (if enabled)
        else rotation/delete/pending
            Reconciler->>TailFS: mark removed/rotated or pending_bytes > 0
        end
        TailFS->>TailFS: schedule read if pending
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • edsiper
  • koleini
  • fujimotos

Poem

🐇 I nibble bytes and hop back fast,

Reset the offset, forget the past.
One helper now keeps state in line,
Reconcile, persist — all tidy, fine.
Hooray, I thumped the dupes — a carrot dine! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'in_tail: reconcile files after missed inotify events' accurately describes the main objective of the PR, which is to add recovery from missed inotify events in the in_tail plugin through file reconciliation.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sandy2008 sandy2008 marked this pull request as ready for review April 27, 2026 10:32
@sandy2008 sandy2008 closed this Apr 27, 2026
@sandy2008 sandy2008 reopened this Apr 27, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ebe842a79

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 435 to 436
if (ret == -1) {
flb_errno();
flb_plg_error(ins, "fstat error");

continue;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle reconcile errors instead of silently continuing

in_tail_progress_check_callback treats every -1 from reconcile_file_state() as a benign skip, but that helper now returns -1 for operational failures too (not just intentional file removal), including failed watch re-registration when inotify_add_watch returns errors like ENOSPC. In that path flb_tail_file_rotated() has already marked the file as rotated, so later passes stop attempting re-registration, leaving the file unwatched and degrading ingestion correctness/timeliness under inotify-pressure scenarios.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks. You're right — once flb_tail_file_rotated() runs and the inotify add fails, flb_tail_file_is_rotated() short-circuits forever and the file stays unwatched.

Pushed a fix in 68a562d that splits the rotation handling in reconcile_file_state into two parts: detect-and-mark-rotated (gated on file->rotated == 0), and a separate "re-add the watch if we've lost it" block (file->rotated != 0 && file->watch_fd == -1). That way the next reconcile pass — whether triggered by another inotify event or by in_tail_progress_check_callback — will retry inotify_add_watch and recover once kernel resources free up.

Kept the continue-on--1 in the progress callback as is for now, since the operational -1 paths it can hit are mostly transient (stat blip, brief name resolution failure) and the next periodic tick will retry anyway. Happy to tighten that further if you'd prefer.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
plugins/in_tail/tail_fs_inotify.c (3)

364-383: IN_MOVE_SELF rotation handling now duplicates reconcile_file_state.

reconcile_file_state already calls flb_tail_file_is_rotatedflb_tail_file_rotatedflb_tail_fs_removeflb_tail_fs_add_rotated (lines 260-279), and flb_tail_file_is_rotated correctly detects the inode/name divergence after a MOVE_SELF rename. After this explicit block runs, file->rotated != 0 makes flb_tail_file_is_rotated short-circuit to FLB_FALSE, so the second pass is a no-op — i.e., the only thing this block buys is doing the rotation work slightly earlier in the same callback.

Consider dropping the explicit block and relying on reconcile_file_state for MOVE_SELF too; this also keeps all rotation re-registration failure handling in one place. Optional, since the current code is correct.

♻️ Suggested simplification
-    /* Check file rotation (only if it has not been rotated before) */
-    if (ev.mask & IN_MOVE_SELF && file->rotated == 0) {
-        flb_plg_debug(ins, "inode=%"PRIu64" rotated IN_MOVE SELF '%s'",
-                      file->inode, file->name);
-
-        /* A rotated file must be re-registered */
-        ret = flb_tail_file_rotated(file);
-        if (ret == -1) {
-            return -1;
-        }
-
-        ret = flb_tail_fs_remove(ctx, file);
-        if (ret == -1) {
-            return -1;
-        }
-
-        ret = flb_tail_fs_add_rotated(file);
-        if (ret == -1) {
-            return -1;
-        }
-    }
-
     ret = reconcile_file_state(ctx, file, "tail_fs_event", &pending_data_detected);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/in_tail/tail_fs_inotify.c` around lines 364 - 383, The IN_MOVE_SELF
handling block duplicates rotation logic already performed by
reconcile_file_state; remove the explicit ev.mask & IN_MOVE_SELF branch (the
block that calls flb_tail_file_rotated, flb_tail_fs_remove,
flb_tail_fs_add_rotated) and rely on reconcile_file_state which uses
flb_tail_file_is_rotated → flb_tail_file_rotated → flb_tail_fs_remove →
flb_tail_fs_add_rotated to detect and re-register rotated files, keeping all
rotation and failure handling centralized in reconcile_file_state.

167-197: Truncate-reset semantics now diverge between this helper and adjust_counters in tail_file.c.

reset_file_on_truncate resets stream_offset, last_processed_bytes, and pending_bytes after a truncate, but the truncate branch in adjust_counters (plugins/in_tail/tail_file.c lines 1707-1727) only resets offset/buf_len and calls update_resumable_offset_state. Records emitted after a truncate detected via adjust_counters will keep the pre-truncate stream_offset, producing wrong offset_key values, while ones detected via inotify reconcile will be correct. Worth unifying both paths through a single shared helper.

Two smaller things to consider in this helper:

  • file->pending_bytes = file->size; at Line 188 is immediately overwritten by reconcile_file_state at lines 232-237 — it can be dropped.
  • For the non-SQLDB case, the in-memory marker (file->db_offset_marker / db_offset_marker_size) isn't refreshed here; it works out today because flb_tail_file_offset_marker_matches early-returns when db_offset <= 0, but a defensive flb_tail_file_update_offset_marker(file) (or reuse of update_resumable_offset_state) would make the post-truncate state explicit.
♻️ Sketch for unifying truncate state reset
 static int reset_file_on_truncate(struct flb_tail_config *ctx,
                                   struct flb_tail_file *file,
                                   int64_t size_delta,
                                   const char *caller)
 {
     int64_t offset;

     offset = lseek(file->fd, 0, SEEK_SET);
     if (offset == -1) {
         flb_errno();
         return -1;
     }

     flb_plg_debug(ctx->ins,
                   "%s: inode=%"PRIu64" file truncated %s (diff: %"PRId64" bytes)",
                   caller, file->inode, file->name, size_delta);

     file->offset = offset;
     file->stream_offset = offset;
     file->last_processed_bytes = 0;
     file->buf_len = 0;
-    file->pending_bytes = file->size;
-
-#ifdef FLB_HAVE_SQLDB
-    if (ctx->db) {
-        flb_tail_db_file_offset(file, ctx);
-    }
-#endif
+
+    /* Persist offset and refresh the in-memory marker for both SQLDB
+     * and non-SQLDB code paths (mirrors update_resumable_offset_state
+     * in tail_file.c). */
+#ifdef FLB_HAVE_SQLDB
+    if (ctx->db) {
+        flb_tail_db_file_offset(file, ctx);
+    }
+    else
+#endif
+    {
+        flb_tail_file_update_offset_marker(file);
+    }

     return 0;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/in_tail/tail_fs_inotify.c` around lines 167 - 197, The truncate reset
in reset_file_on_truncate currently diverges from adjust_counters: make both
paths share the same state-reset semantics by updating reset_file_on_truncate to
reset the same fields used by adjust_counters (clear file->offset and
file->buf_len) and also refresh resumable-offset state (call the existing helper
flb_tail_file_update_offset_marker(file) or reuse update_resumable_offset_state)
so stream_offset/last_processed_bytes and the in-memory db offset marker are
consistent after a truncate; remove the redundant file->pending_bytes =
file->size assignment (it is overwritten by reconcile_file_state) and keep the
FLB_HAVE_SQLDB branch to call flb_tail_db_file_offset(file, ctx) so persistent
state is updated as before.

385-388: Minor: pending_data_detected in tail_fs_event is written but never read.

reconcile_file_state requires the out param, but the caller at Line 385 ignores it; the unconditional tail_signal_pending(ctx) in the success path further down already handles the “there might be more data” case. Consider either using the value to skip in_tail_collect_event when nothing is pending, or making the out param optional (NULL-able) so callers that don't care don't need a placeholder local.

Returning 0 on reconcile_file_state == -1 is fine for the “file was removed” case but does swallow watch re-registration failures from the rotation block inside reconcile_file_state; given the periodic in_tail_progress_check_callback will reconcile again, recovery is still possible.

The in_tail_progress_check_callback rewrite (Line 416-446) is clean — delegating per-file work to reconcile_file_state removes the previous duplicated size/pending math, and the (void) ins; is the right way to silence the unused-parameter warning.

Also applies to: 416-449

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/in_tail/tail_fs_inotify.c` around lines 385 - 388, The call in
tail_fs_event to reconcile_file_state currently passes a local
pending_data_detected that is never read; either use that out-param to avoid
unnecessary work (i.e., after calling reconcile_file_state check
pending_data_detected and skip calling in_tail_collect_event when false, while
still calling tail_signal_pending only when needed), or make
reconcile_file_state's out parameter optional (allow NULL) and update callers
(tail_fs_event and in_tail_progress_check_callback) to pass NULL when they don't
care; adjust tail_fs_event to stop creating the unused local if you go the NULL
route. Keep the existing return behavior for reconcile_file_state == -1 (return
0 from tail_fs_event) so rotation/watch re-registration failures continue to be
handled by periodic in_tail_progress_check_callback. Ensure references to
reconcile_file_state, tail_fs_event, in_tail_collect_event, tail_signal_pending,
and in_tail_progress_check_callback are updated consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@plugins/in_tail/tail_fs_inotify.c`:
- Around line 364-383: The IN_MOVE_SELF handling block duplicates rotation logic
already performed by reconcile_file_state; remove the explicit ev.mask &
IN_MOVE_SELF branch (the block that calls flb_tail_file_rotated,
flb_tail_fs_remove, flb_tail_fs_add_rotated) and rely on reconcile_file_state
which uses flb_tail_file_is_rotated → flb_tail_file_rotated → flb_tail_fs_remove
→ flb_tail_fs_add_rotated to detect and re-register rotated files, keeping all
rotation and failure handling centralized in reconcile_file_state.
- Around line 167-197: The truncate reset in reset_file_on_truncate currently
diverges from adjust_counters: make both paths share the same state-reset
semantics by updating reset_file_on_truncate to reset the same fields used by
adjust_counters (clear file->offset and file->buf_len) and also refresh
resumable-offset state (call the existing helper
flb_tail_file_update_offset_marker(file) or reuse update_resumable_offset_state)
so stream_offset/last_processed_bytes and the in-memory db offset marker are
consistent after a truncate; remove the redundant file->pending_bytes =
file->size assignment (it is overwritten by reconcile_file_state) and keep the
FLB_HAVE_SQLDB branch to call flb_tail_db_file_offset(file, ctx) so persistent
state is updated as before.
- Around line 385-388: The call in tail_fs_event to reconcile_file_state
currently passes a local pending_data_detected that is never read; either use
that out-param to avoid unnecessary work (i.e., after calling
reconcile_file_state check pending_data_detected and skip calling
in_tail_collect_event when false, while still calling tail_signal_pending only
when needed), or make reconcile_file_state's out parameter optional (allow NULL)
and update callers (tail_fs_event and in_tail_progress_check_callback) to pass
NULL when they don't care; adjust tail_fs_event to stop creating the unused
local if you go the NULL route. Keep the existing return behavior for
reconcile_file_state == -1 (return 0 from tail_fs_event) so rotation/watch
re-registration failures continue to be handled by periodic
in_tail_progress_check_callback. Ensure references to reconcile_file_state,
tail_fs_event, in_tail_collect_event, tail_signal_pending, and
in_tail_progress_check_callback are updated consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a8aac234-f0a4-47a4-ac71-1734ecd362ac

📥 Commits

Reviewing files that changed from the base of the PR and between 29deec9 and 4ebe842.

📒 Files selected for processing (2)
  • plugins/in_tail/tail_file.c
  • plugins/in_tail/tail_fs_inotify.c

@sandy2008
Copy link
Copy Markdown
Author

Thanks for the thorough review @coderabbitai — all three nitpicks are addressed in 68a562d.

364-383 (IN_MOVE_SELF redundancy): Yep, dropped the explicit block. reconcile_file_state already calls flb_tail_file_is_rotated which catches the inode/name divergence after a rename, and the new "re-add watch if lost" branch I added for the Codex P2 makes the retry path unified. Left a short comment in tail_fs_event pointing at where it now happens.

167-197 (truncate-reset divergence): Good call on this one — that was a real bug. The adjust_counters truncate path was leaving stream_offset and last_processed_bytes stale, so any records emitted between the truncate and the next reconcile would carry the wrong offset_key. Pulled out a shared helper flb_tail_file_reset_on_truncate in tail_file.c that does the lseek + reset + update_resumable_offset_state (so non-SQLDB also gets the in-memory marker refreshed). Both adjust_counters and reconcile_file_state go through it now. Dropped the redundant pending_bytes = file->size while I was at it.

385-388 (unused pending_data_detected): Went with the NULL-able out-param option — felt safer than gating in_tail_collect_event on it, since that function also drains buffered data and advances multiline state, not just pending bytes. tail_fs_event now passes NULL, the local was deleted, and the internal if (*pending_data_detected == FLB_TRUE) check uses file->pending_bytes > 0 instead so it doesn't dereference the optional pointer.

All five targeted in_tail integration tests still pass under both regular and VALGRIND=1 VALGRIND_STRICT=1, and the full test_in_tail_001.py suite is green (26 passed, 1 skipped).

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 28, 2026

Tip

For best results, initiate chat on the files or code changes.

@sandy2008 Thanks for the detailed walkthrough of each fix — these are solid resolutions:

  • IN_MOVE_SELF: Delegating to flb_tail_file_is_rotated inside reconcile_file_state is the right unification point; the added comment in tail_fs_event should help future readers understand why no explicit block is needed.
  • Truncate-reset: The shared flb_tail_file_reset_on_truncate helper is exactly the right fix here — keeping stream_offset and last_processed_bytes consistent was the critical missing piece, and routing both callers through a single function prevents this class of divergence from creeping back in.
  • pending_data_detected: The NULL-able out-param with the file->pending_bytes > 0 fallback is a clean solution that avoids the brittle coupling between the out-param and in_tail_collect_event's broader responsibilities.

Green on 26/27 integration tests under Valgrind strict mode is a strong signal. All three concerns are well addressed. 🎉


✏️ Learnings added
Learnt from: sandy2008
URL: https://github.com/fluent/fluent-bit/pull/11750

Timestamp: 2026-04-28T17:39:59.910Z
Learning: In plugins/in_tail/tail_file.c, `flb_tail_file_reset_on_truncate` is a shared helper that performs lseek + offset/counter reset + `update_resumable_offset_state`. Both `adjust_counters` (in tail_fs_inotify.c) and `reconcile_file_state` call it so that `stream_offset` and `last_processed_bytes` are never left stale after a truncate, and non-SQLDB paths also get the in-memory offset marker refreshed.

Learnt from: sandy2008
URL: https://github.com/fluent/fluent-bit/pull/11750

Timestamp: 2026-04-28T17:39:59.910Z
Learning: In plugins/in_tail/tail_fs_inotify.c, `reconcile_file_state` has a NULL-able `pending_data_detected` out-parameter. When the caller passes NULL (e.g. `tail_fs_event`), the internal check uses `file->pending_bytes > 0` directly instead of dereferencing the optional pointer. This avoids tying `in_tail_collect_event` invocation solely to pending-byte detection, since that function also drains buffered data and advances multiline state.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11059
File: plugins/in_tail/tail_file.c:1618-1640
Timestamp: 2025-10-23T07:43:16.216Z
Learning: In plugins/in_tail/tail_file.c, when truncate_long_lines is enabled and the buffer is full, the early truncation path uses `lines > 0` as the validation pattern to confirm whether process_content successfully processed content. This is intentional to track occurrences of line processing rather than byte consumption, and consuming bytes based on `processed_bytes > 0` would be overkill for this validation purpose.

If you found this review helpful, would you consider giving us a shout-out on X?

Thank you for using CodeRabbit!

- retry inotify_add_watch on next reconcile when a previously rotated
  file ended up unwatched (e.g. ENOSPC on the first add)
- drop the IN_MOVE_SELF block in tail_fs_event; reconcile_file_state
  already detects inode/name divergence and re-registers the watch via
  the same retry path
- share the truncate reset between adjust_counters and the inotify
  reconcile path through a new flb_tail_file_reset_on_truncate helper
  so post-truncate offset_key values stay consistent across both paths
- make reconcile_file_state's pending_data_detected out param NULL-able
  and pass NULL from tail_fs_event where the value was unused

Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sandy2008 sandy2008 force-pushed the codex/in-tail-inotify-reconcile branch from 68a562d to dc50818 Compare April 28, 2026 17:41
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/in_tail/tail_file.c`:
- Around line 172-182: The function update_resumable_offset_state currently
swallows errors from the persistence calls; change it to return an int status
and propagate errors from flb_tail_db_file_offset and
flb_tail_file_update_offset_marker (i.e., return their failure codes instead of
ignoring them), keeping the FLB_HAVE_SQLDB conditional logic intact; update all
callers to check the returned status and handle/report persistence failures so
truncate recovery and post-read updates can act on real persistence errors.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f5b3f23f-3a7a-4ec9-9191-2a8d19e4fbce

📥 Commits

Reviewing files that changed from the base of the PR and between 4ebe842 and 68a562d.

📒 Files selected for processing (3)
  • plugins/in_tail/tail_file.c
  • plugins/in_tail/tail_file.h
  • plugins/in_tail/tail_fs_inotify.c

Comment on lines +172 to +182
static void update_resumable_offset_state(struct flb_tail_file *file)
{
#ifdef FLB_HAVE_SQLDB
if (file->config->db) {
flb_tail_db_file_offset(file, file->config);
return;
}
#endif

flb_tail_file_update_offset_marker(file);
}
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Propagate offset-state persistence failures instead of swallowing them.

update_resumable_offset_state() ignores failures from DB/marker persistence. That makes truncate recovery and post-read progress updates report success even when resumable state update failed.

Proposed fix
-static void update_resumable_offset_state(struct flb_tail_file *file)
+static int update_resumable_offset_state(struct flb_tail_file *file)
 {
 `#ifdef` FLB_HAVE_SQLDB
     if (file->config->db) {
-        flb_tail_db_file_offset(file, file->config);
-        return;
+        return flb_tail_db_file_offset(file, file->config);
     }
 `#endif`
 
-    flb_tail_file_update_offset_marker(file);
+    return flb_tail_file_update_offset_marker(file);
 }
@@
-    update_resumable_offset_state(file);
+    if (update_resumable_offset_state(file) == -1) {
+        return -1;
+    }
     return 0;
@@
-                update_resumable_offset_state(file);
+                if (update_resumable_offset_state(file) == -1) {
+                    return FLB_TAIL_ERROR;
+                }
                 return adjust_counters(ctx, file);
@@
-        update_resumable_offset_state(file);
+        if (update_resumable_offset_state(file) == -1) {
+            return FLB_TAIL_ERROR;
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/in_tail/tail_file.c` around lines 172 - 182, The function
update_resumable_offset_state currently swallows errors from the persistence
calls; change it to return an int status and propagate errors from
flb_tail_db_file_offset and flb_tail_file_update_offset_marker (i.e., return
their failure codes instead of ignoring them), keeping the FLB_HAVE_SQLDB
conditional logic intact; update all callers to check the returned status and
handle/report persistence failures so truncate recovery and post-read updates
can act on real persistence errors.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for flagging this — talked it through with my reviewer agents and we landed on pushing back on the propagate-up suggestion. Tracing the call sites:

update_resumable_offset_state runs after consume_bytes() has already shifted data out of file->buf_data and after records have been emitted to the engine. By the time we hit the persist call, the data is gone from the in-process buffer.

If the helper returns -1 and we propagate that up to flb_tail_file_chunkFLB_TAIL_ERROR, tail.c maps that to flb_tail_file_remove(file) (tail.c:117-119). On the next scan the file gets re-discovered and re-tailed from the last persisted offset — so a transient sqlite-lock blip would replay every record we just emitted. That's a worse outcome than the current "log and continue, accept that we may replay this batch on a hard restart" behavior.

The swallow pattern is also pre-existing — master swallows flb_tail_db_file_offset returns at tail_file.c:1204, :1227, tail_fs_stat.c:144, plus the two sites this PR refactored into the helper. This PR didn't introduce the silent-failure shape, just consolidated two of the call sites.

That said, your underlying concern about visibility is fair — flb_tail_db_file_offset already logs lock failures, but the sqlite_step path and flb_tail_file_update_offset_marker don't. Happy to do a targeted follow-up PR that adds debug-level logging inside update_resumable_offset_state (and/or inside the compute_offset_marker failure path) without changing return semantics, if that addresses the operability gap. Wanted to keep this PR scoped to the inotify reconcile work it started as.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sandy2008, that's a well-reasoned pushback and I'm convinced. The key point I missed is that the persist call happens after the buffer has been drained and records emitted — so converting a persist failure into FLB_TAIL_ERRORflb_tail_file_remove creates a guaranteed duplication path (re-tail from last persisted offset) for what is likely a transient SQLite contention blip. "Accept potential replay only on a hard restart" is clearly the lesser evil compared to "guarantee replay on every transient lock failure." The pre-existing pattern on master further confirms this is a deliberate choice, not an oversight.

The follow-up PR adding flb_plg_debug/flb_plg_warn inside update_resumable_offset_state and the compute_offset_marker failure path would address the operability gap without touching return semantics — that sounds like the right scope. Happy to review it when it lands.

Withdrawing the original suggestion.


✏️ Learnings added
Learnt from: sandy2008
Repo: fluent/fluent-bit PR: 11750
File: plugins/in_tail/tail_file.c:172-182
Timestamp: 2026-04-28T17:54:10.813Z
Learning: In plugins/in_tail/tail_file.c, `update_resumable_offset_state()` intentionally swallows errors from `flb_tail_db_file_offset` and `flb_tail_file_update_offset_marker`. By the time it is called, `consume_bytes()` has already drained the in-process buffer and records have been emitted to the engine. Propagating failure up to `FLB_TAIL_ERROR` would trigger `flb_tail_file_remove(file)` in tail.c, causing re-discovery and re-tailing from the last persisted offset — duplicating already-emitted records on every transient SQLite lock blip. The accepted tradeoff is "log and continue; accept potential replay only on a hard restart." This swallow pattern is pre-existing in master (tail_file.c:1204, :1227, tail_fs_stat.c:144) and was intentionally preserved/consolidated in PR `#11750`. Improved visibility should come from adding debug-level logging inside the helper and compute_offset_marker failure path in a follow-up PR, not by changing return semantics.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: sandy2008
Repo: fluent/fluent-bit PR: 0
File: :0-0
Timestamp: 2026-04-28T17:39:59.910Z
Learning: In plugins/in_tail/tail_file.c, `flb_tail_file_reset_on_truncate` is a shared helper that performs lseek + offset/counter reset + `update_resumable_offset_state`. Both `adjust_counters` (in tail_fs_inotify.c) and `reconcile_file_state` call it so that `stream_offset` and `last_processed_bytes` are never left stale after a truncate, and non-SQLDB paths also get the in-memory offset marker refreshed.

Learnt from: sandy2008
Repo: fluent/fluent-bit PR: 0
File: :0-0
Timestamp: 2026-04-28T17:39:59.910Z
Learning: In plugins/in_tail/tail_fs_inotify.c, `reconcile_file_state` has a NULL-able `pending_data_detected` out-parameter. When the caller passes NULL (e.g. `tail_fs_event`), the internal check uses `file->pending_bytes > 0` directly instead of dereferencing the optional pointer. This avoids tying `in_tail_collect_event` invocation solely to pending-byte detection, since that function also drains buffered data and advances multiline state.

Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11059
File: plugins/in_tail/tail_file.c:1618-1640
Timestamp: 2025-10-23T07:43:16.216Z
Learning: In plugins/in_tail/tail_file.c, when truncate_long_lines is enabled and the buffer is full, the early truncation path uses `lines > 0` as the validation pattern to confirm whether process_content successfully processed content. This is intentional to track occurrences of line processing rather than byte consumption, and consuming bytes based on `processed_bytes > 0` would be overkill for this validation purpose.

Learnt from: cosmo0920
Repo: fluent/fluent-bit PR: 11711
File: tests/runtime/http_client_chunked.c:172-175
Timestamp: 2026-04-14T20:30:06.356Z
Learning: In this project (fluent/fluent-bit), use K&R brace style for C/C++: for control statements (`if`, `else`, `while`, `do`, `for`), place the opening brace on the same line as the control statement (e.g., `if (cond) {`). Do not use Allman style (opening brace on the next line).

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
plugins/in_tail/tail_file.c (1)

172-182: ⚠️ Potential issue | 🟠 Major

Propagate resumable-offset persistence failures from the shared helper.

flb_tail_db_file_offset() and flb_tail_file_update_offset_marker() can fail, but update_resumable_offset_state() drops that status. As a result, truncate recovery and the post-read updates at Line 1795 and Line 1979 can still report success with stale resumable state.

Also applies to: 206-207, 1795-1795, 1979-1979

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/in_tail/tail_file.c` around lines 172 - 182,
update_resumable_offset_state currently ignores failures from
flb_tail_db_file_offset and flb_tail_file_update_offset_marker; change
update_resumable_offset_state to return an int status (0 on success, non-zero on
failure), propagate and return the error from flb_tail_db_file_offset and
flb_tail_file_update_offset_marker instead of dropping it, and update all
callers (including the truncate-recovery and post-read update sites that invoke
update_resumable_offset_state) to check the returned status and handle/propagate
failures accordingly so resumable-offset persistence errors are not treated as
success.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@plugins/in_tail/tail_file.c`:
- Around line 172-182: update_resumable_offset_state currently ignores failures
from flb_tail_db_file_offset and flb_tail_file_update_offset_marker; change
update_resumable_offset_state to return an int status (0 on success, non-zero on
failure), propagate and return the error from flb_tail_db_file_offset and
flb_tail_file_update_offset_marker instead of dropping it, and update all
callers (including the truncate-recovery and post-read update sites that invoke
update_resumable_offset_state) to check the returned status and handle/propagate
failures accordingly so resumable-offset persistence errors are not treated as
success.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ff41faaf-e067-4b74-8019-d5f63343acf2

📥 Commits

Reviewing files that changed from the base of the PR and between 68a562d and dc50818.

📒 Files selected for processing (3)
  • plugins/in_tail/tail_file.c
  • plugins/in_tail/tail_file.h
  • plugins/in_tail/tail_fs_inotify.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • plugins/in_tail/tail_file.h

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant