[server] Add disk-usage write protection to TabletServer #3338

@swuferhong

Description

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

When a TabletServer's local data disk is nearly full, continued writes can fail partway through, corrupt in-progress segments, or trigger cascading failures across replicas. Currently, Fluss has no built-in mechanism to proactively reject writes before the disk runs out of space.

Solution

Introduce a periodic disk-usage monitoring state machine with hysteresis-based write protection:

  • Sampling: A background scheduler thread calls statvfs every server.data-disk.check-interval (default 30s) to collect the disk usage ratio. The overhead is negligible (~100μs per call; statvfs typically reads filesystem metadata that the kernel already holds in memory, so no disk I/O is involved).
  • Multi-disk strategy: When multiple data directories are configured, the collector reports the maximum usage across all distinct FileStores, so that a single nearly-full disk is never masked by other low-usage disks.
  • Hysteresis state machine: Writes are locked when any disk's usage reaches server.data-disk.write-limit-ratio (default 0.85) and resume only after usage drops below (ratio - 0.10). With the default ratio, writes lock at 85% usage and resume once usage falls below 75%. The fixed 10% recovery gap prevents rapid lock/unlock oscillation.
  • Write interception: Only client-driven writes (appendLog / putKv) are rejected with a retriable DiskWriteLockedException. Follower replication is not blocked to preserve replica consistency.
  • Dynamic reconfiguration: server.data-disk.write-limit-ratio can be changed at runtime via alterConfigs(). Changes take effect immediately by triggering a synchronous disk check.
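
To make the proposal concrete, here is a minimal sketch of the collector and the hysteresis state machine described above. It is not the actual implementation: the class name DiskWriteGuard and its methods are hypothetical, java.nio.file.FileStore stands in for the statvfs call, and de-duplicating disks relies on FileStore equality as provided by the default filesystem provider.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; names are illustrative and not existing Fluss APIs. */
final class DiskWriteGuard {
    // Fixed recovery gap from the proposal: unlock only below (limit - 0.10).
    static final double RECOVERY_GAP = 0.10;

    private double writeLimitRatio;        // server.data-disk.write-limit-ratio
    private volatile boolean writeLocked;  // read by the write path without locking

    DiskWriteGuard(double writeLimitRatio) {
        this.writeLimitRatio = writeLimitRatio;
    }

    /** Returns the maximum usage ratio across the distinct FileStores behind the data dirs. */
    static double maxUsageRatio(List<Path> dataDirs) throws IOException {
        Set<FileStore> seen = new HashSet<>();
        double max = 0.0;
        for (Path dir : dataDirs) {
            FileStore store = Files.getFileStore(dir);
            if (!seen.add(store)) {
                continue; // several directories on the same disk count once
            }
            long total = store.getTotalSpace();
            if (total <= 0) {
                continue; // size unknown, skip rather than divide by zero
            }
            double used = 1.0 - (double) store.getUsableSpace() / total;
            max = Math.max(max, used);
        }
        return max;
    }

    /** Feeds one sample into the state machine and returns the resulting lock state. */
    synchronized boolean onSample(double usageRatio) {
        if (!writeLocked && usageRatio >= writeLimitRatio) {
            writeLocked = true;   // reached the limit: lock client writes
        } else if (writeLocked && usageRatio < writeLimitRatio - RECOVERY_GAP) {
            writeLocked = false;  // dropped below the recovery threshold: unlock
        }
        return writeLocked;       // inside the band the previous state is kept
    }

    /** Applies a new limit and re-evaluates immediately, as a dynamic config change would. */
    synchronized boolean updateLimit(double newRatio, double currentUsage) {
        writeLimitRatio = newRatio;
        return onSample(currentUsage);
    }

    boolean isWriteLocked() {
        return writeLocked;
    }
}
```

In this sketch, the scheduler would call onSample(maxUsageRatio(dataDirs)) every check interval, and the appendLog / putKv paths would check isWriteLocked() and throw the retriable DiskWriteLockedException, while the follower replication path skips the check.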

Anything else?

No response

Willingness to contribute

  • I'm willing to submit a PR!
