Search before asking
Motivation
When a TabletServer's local data disk is nearly full, continued writes can fail with errors, corrupt in-progress segments, or trigger cascading failures across replicas. Fluss currently has no built-in mechanism to proactively reject writes before the disk runs out of space.
Solution
Introduce a periodic disk-usage monitoring state machine with hysteresis-based write protection:
- Sampling: A background scheduler thread calls statvfs every server.data-disk.check-interval (default 30s) to collect the disk usage ratio. The overhead is negligible (~100μs per call, purely in-memory).
- Multi-disk strategy: When multiple data directories are configured, the collector reports the maximum usage across all distinct FileStores, so that a single nearly-full disk is never masked by other low-usage disks.
- Hysteresis state machine: Writes are locked when any disk's usage reaches server.data-disk.write-limit-ratio (default 0.85) and resume only after usage drops below the ratio minus 0.10 (0.75 with the default). The fixed 10% recovery gap prevents rapid lock/unlock oscillation.
- Write interception: Only client-driven writes (appendLog / putKv) are rejected with a retriable DiskWriteLockedException. Follower replication is not blocked to preserve replica consistency.
- Dynamic reconfiguration: server.data-disk.write-limit-ratio can be changed at runtime via alterConfigs(). Changes take effect immediately by triggering a synchronous disk check.
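To make the proposal concrete, here is a minimal sketch of the state machine described above, not the actual implementation. The class `DiskWriteGate`, its methods, and the `WriteSource` enum are hypothetical names chosen for illustration. The nested `DiskWriteLockedException` is only a local stand-in for the exception named above. Sampling uses `java.nio.file.FileStore` and assumes the JDK's `FileStore` equality identifies directories that share a filesystem.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; names are illustrative and not taken from the Fluss code base. */
final class DiskWriteGate {
    /** Who issued the write: clients are gated, follower replication is not. */
    enum WriteSource { CLIENT, FOLLOWER }

    /** Local stand-in for the retriable exception named in the proposal. */
    static final class DiskWriteLockedException extends RuntimeException {
        DiskWriteLockedException(String message) { super(message); }
    }

    private final double recoveryGap; // fixed at 0.10 in the proposal
    private double limitRatio;        // server.data-disk.write-limit-ratio
    private boolean locked;

    DiskWriteGate(double limitRatio, double recoveryGap) {
        this.limitRatio = limitRatio;
        this.recoveryGap = recoveryGap;
    }

    /** Feeds one usage sample (0.0 to 1.0) and returns whether writes are now locked. */
    synchronized boolean update(double usageRatio) {
        if (!locked && usageRatio >= limitRatio) {
            locked = true;   // reached the limit: lock
        } else if (locked && usageRatio < limitRatio - recoveryGap) {
            locked = false;  // dropped below the recovery threshold: unlock
        }
        return locked;       // in between, the previous state is kept
    }

    /** Applies a new limit and re-evaluates immediately against a fresh sample. */
    synchronized boolean reconfigure(double newLimitRatio, double currentUsage) {
        this.limitRatio = newLimitRatio;
        return update(currentUsage);
    }

    /** Rejects client writes while locked; follower replication always passes. */
    synchronized void checkWritable(WriteSource source) {
        if (locked && source == WriteSource.CLIENT) {
            throw new DiskWriteLockedException("data disk usage is above the write limit");
        }
    }

    /** Maximum usage ratio across the distinct FileStores backing the given directories. */
    static double maxUsage(List<Path> dataDirs) {
        Set<FileStore> seen = new HashSet<>();
        double max = 0.0;
        for (Path dir : dataDirs) {
            try {
                FileStore store = Files.getFileStore(dir);
                long total = store.getTotalSpace();
                if (!seen.add(store) || total <= 0) {
                    continue; // same filesystem already counted, or size unknown
                }
                max = Math.max(max, 1.0 - (double) store.getUsableSpace() / total);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return max;
    }
}
```

In this sketch a scheduler would call `gate.update(DiskWriteGate.maxUsage(dataDirs))` once per check interval, and the appendLog / putKv paths would call `checkWritable(WriteSource.CLIENT)` before accepting data. Taking the maximum rather than the average is what keeps one nearly full disk from being hidden by emptier ones.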
Anything else?
No response
Willingness to contribute