File source identity issue: checksum collisions and inode reuse can cause skipped data #25079

@Rentu

Description

The file source currently has two fingerprint modes:

checksum (first N lines)
device_and_inode

Both can fail in real production cases:

Checksum mode issue
Different files can share the same first N lines, in which case they get the same fingerprint.
Vector may then switch the watcher to another path based on mtime but keep the old offset/checkpoint, which can skip data in the new file.
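
To make the collision concrete, here is a minimal sketch, not Vector's actual implementation. `head_fingerprint` is a hypothetical helper, and SHA-256 stands in for whatever checksum the file source really uses; the point is only that hashing the first N lines cannot distinguish two files that begin with the same header.

```python
import hashlib
import os
import tempfile


def head_fingerprint(path: str, lines: int = 1) -> str:
    """Hash only the first `lines` lines of a file (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for _ in range(lines):
            line = f.readline()
            if not line:  # file has fewer lines than requested
                break
            digest.update(line)
    return digest.hexdigest()


def write_file(directory: str, name: str, body: str) -> str:
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write(body)
    return path


def read_bytes(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as d:
    header = "timestamp,level,message\n"
    rotated = write_file(d, "app.log.1", header + "1,INFO,old entry\n")
    current = write_file(d, "app.log", header + "2,WARN,new entry\n")

    # Same first line, so the fingerprints match even though the files differ.
    same_fingerprint = head_fingerprint(rotated) == head_fingerprint(current)
    same_content = read_bytes(rotated) == read_bytes(current)

print(same_fingerprint, same_content)  # True False
```

Any log format that starts every file with a fixed header (CSV column names, a banner line) would behave this way under a first-N-lines fingerprint.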

Inode mode issue
An inode can be reused after file deletion/rotation under high churn.
A new file may appear with the same (device, inode) pair but hold a different content generation.
Reusing the old checkpoint offset for this new generation can also skip data.
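
The inode case can be sketched the same way. Real inode reuse depends on the filesystem and cannot be forced reliably, so this example simulates it with made-up `(device, inode)` values; the `checkpoints` dictionary and `resume_offset` function are hypothetical stand-ins for a checkpoint store keyed only by device and inode.

```python
from typing import Dict, Tuple

# (device, inode), i.e. what os.stat() reports as st_dev and st_ino.
FileKey = Tuple[int, int]

# Hypothetical checkpoint store: file identity -> byte offset already read.
checkpoints: Dict[FileKey, int] = {}


def resume_offset(key: FileKey) -> int:
    """Return the stored offset for this identity, or 0 for an unknown file."""
    return checkpoints.get(key, 0)


# Generation 1: a file was read up to byte 4096, then deleted by rotation.
old_key: FileKey = (2049, 131075)  # made-up values for illustration
checkpoints[old_key] = 4096

# Generation 2: the filesystem hands the freed inode to a brand-new file.
new_key: FileKey = (2049, 131075)
new_file_size = 10_000

# The key matches, so the reader resumes mid-file instead of at byte 0.
start = resume_offset(new_key)
skipped_bytes = min(start, new_file_size)

print(start, skipped_bytes)  # 4096 4096
```

Nothing in the key distinguishes the two generations, so the first 4096 bytes of the new file are never read.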

So checksum-only is unsafe for “same file” identity, and inode-only is also unsafe under inode reuse.

Feature request
Please add a safer composite file identity mode, for example:

primary key: device + inode
plus content generation validation (e.g. a header checksum or the first bytes stored in the checkpoint)
if validation fails, treat the file as a new generation and safely reset the resume offset

This would reduce the risk of lost or skipped data while keeping backward compatibility (an opt-in mode is fine).
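
One possible shape for the composite mode described above, as a rough sketch only. `Checkpoint`, `save_checkpoint`, `resume_offset`, and the 256-byte header window are all assumptions made for illustration, not existing Vector types or options.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Tuple

FileKey = Tuple[int, int]  # primary key: (device, inode)


@dataclass
class Checkpoint:
    offset: int          # bytes already consumed
    header_len: int      # how many leading bytes were hashed
    header_digest: str   # digest of those leading bytes


def _digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def save_checkpoint(store: Dict[FileKey, Checkpoint], key: FileKey,
                    head: bytes, offset: int, header_bytes: int = 256) -> None:
    """Record the offset together with a digest of the file's leading bytes."""
    prefix = head[:header_bytes]
    store[key] = Checkpoint(offset, len(prefix), _digest(prefix))


def resume_offset(store: Dict[FileKey, Checkpoint], key: FileKey,
                  current_head: bytes) -> int:
    """Resume only if the file at this key still starts with the same bytes."""
    cp = store.get(key)
    if cp is None:
        return 0
    prefix = current_head[:cp.header_len]
    if len(prefix) < cp.header_len or _digest(prefix) != cp.header_digest:
        # Validation failed: treat as a new generation and start from byte 0.
        del store[key]
        return 0
    return cp.offset


store: Dict[FileKey, Checkpoint] = {}
key: FileKey = (2049, 131075)  # made-up values for illustration

old = b"2024-01-01 service started\n" + b"x" * 4000
save_checkpoint(store, key, old[:256], offset=len(old))

# Same generation, file has grown: the header still matches, so resume.
same_gen = resume_offset(store, key, (old + b"more\n")[:256])

# Inode reused by a different file: the header differs, so restart at 0.
new = b"2024-02-01 service started\n" + b"y" * 4000
new_gen = resume_offset(store, key, new[:256])
```

In this sketch `same_gen` equals the saved offset and `new_gen` is 0. Note that two generations with byte-identical headers would still pass validation, so the header window would probably need to be large enough, or combined with other signals such as file size shrinking below the offset, to be useful in practice.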

Use Cases

No response

Attempted Solutions

No response

Proposal

No response

References

No response

Version

No response


Labels: source: file (Anything `file` source related)
