Fix #64: Add mount health check using Ino==1 detection #71
Add a periodic health check that detects when a JuiceFS FUSE mount is no longer active by checking stat.Ino == 1 on the mountpoint, which is consistent with the existing mount verification logic in ceMount/eeMount. When mount death is detected, the driver logs an error so the issue is no longer silent. The health check runs every 30s in a goroutine with proper context-based cancellation on unmount. Also fixes the ceMount retry bug: move exec.Command inside the loop since Cmd.Run() can only be called once per instance. Partially addresses juicedata#64 (silent fallback to local disk after FUSE mount death).
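The polling loop described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the driver's code: `healthChecker`, `startHealthCheck`, `stop`, and `demoHealthCheck` are hypothetical names, the mount check is passed in as a function, and the interval is shortened from 30s so the demo finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// healthChecker is a hypothetical handle for one running health check.
type healthChecker struct {
	cancel context.CancelFunc
	done   chan struct{}
}

// startHealthCheck launches a goroutine that calls check on every tick
// and calls onDead whenever check reports that the mount is gone.
func startHealthCheck(interval time.Duration, check func() bool, onDead func()) *healthChecker {
	ctx, cancel := context.WithCancel(context.Background())
	hc := &healthChecker{cancel: cancel, done: make(chan struct{})}
	go func() {
		defer close(hc.done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return // cancelled on unmount
			case <-ticker.C:
				if !check() {
					onDead()
				}
			}
		}
	}()
	return hc
}

// stop cancels the goroutine and waits until it has exited.
func (hc *healthChecker) stop() {
	hc.cancel()
	<-hc.done
}

// demoHealthCheck simulates a mount that is always dead and returns how
// many times the failure callback fired before the checker was stopped.
func demoHealthCheck() int {
	failures := 0
	reported := make(chan struct{}, 1)
	hc := startHealthCheck(5*time.Millisecond,
		func() bool { return false }, // pretend the mount has died
		func() {
			failures++
			select {
			case reported <- struct{}{}:
			default:
			}
		})
	<-reported // wait for at least one failed check
	hc.stop()  // after stop returns, the goroutine no longer touches failures
	return failures
}

func main() {
	fmt.Println("failures reported:", demoHealthCheck())
}
```

In a real driver the `onDead` callback would presumably be where the error is logged; waiting on `done` in `stop` is a design choice of this sketch so that callers know the goroutine has fully exited.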
            return logError("failed to umount %s: %s", r.Name, err)
        }

        v.stopHealthCheck()
We need to ensure that the connections become 0 before stopping the HealthCheck.
        select {
        case <-ticker.C:
            if !isJuicefsMounted(v.Mountpoint) {
                logrus.Errorf("JuiceFS mount at %s is no longer active", v.Mountpoint)
This does not ensure it will continue to write into the host filesystem?
Per review feedback: stopHealthCheck should only be called when v.connections drops to 0, not on every unmount. This ensures the health check keeps running while other containers are still using the volume.
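The reference-counting rule in this commit can be illustrated with a small sketch. The `volume` type below is a hypothetical stand-in for the driver's `jfsVolume`, and the `checking` flag stands in for actually starting and stopping the health check goroutine; only the "stop when connections reaches 0" logic is shown.

```go
package main

import (
	"fmt"
	"sync"
)

// volume is a hypothetical stand-in for the driver's jfsVolume.
type volume struct {
	mu          sync.Mutex
	connections int
	checking    bool // true while the health check would be running
}

// mount registers one more container; the first one starts the check.
func (v *volume) mount() {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.connections++
	if v.connections == 1 {
		v.checking = true // real code would start the goroutine here
	}
}

// unmount drops one container; only the last one stops the check.
func (v *volume) unmount() {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.connections == 0 {
		return // nothing mounted, nothing to do
	}
	v.connections--
	if v.connections == 0 {
		v.checking = false // real code would cancel the goroutine here
	}
}

// isChecking reports whether the health check is considered running.
func (v *volume) isChecking() bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	return v.checking
}

// demoRefCount mounts twice and unmounts twice, reporting the health
// check state after the first and after the last unmount.
func demoRefCount() (afterFirst, afterLast bool) {
	v := &volume{}
	v.mount()
	v.mount()
	v.unmount()
	afterFirst = v.isChecking()
	v.unmount()
	afterLast = v.isChecking()
	return afterFirst, afterLast
}

func main() {
	first, last := demoRefCount()
	fmt.Println(first, last) // prints "true false"
}
```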
Thanks for the review, @zxh326. I've pushed a fix for the first point:

1. Stop health check only when connections drops to 0: Fixed in aaf66f7. The health check now only stops when the last container unmounts, so it keeps running while other containers are still using the volume.
2. Regarding "this does not ensure it will continue to write into the host filesystem": You're right, logging alone doesn't prevent the silent fallback. The current PR only makes the issue visible. For active remediation when mount death is detected, I see four options:

Which approach would you prefer? I'm happy to implement it as a follow-up PR so this one stays focused on detection and visibility.
Summary
Fixes #64 — FUSE mount death causes silent fallback to local disk, filling up host storage.
Changes
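- Periodic health check (every 30s) that checks stat.Ino == 1 on the mountpoint, consistent with the existing mount verification in ceMount/eeMount
- Mount death is logged at ERROR so the issue is no longer silent
- Health check goroutine with context.Context-based cancellation on unmount
- sync.Mutex added to jfsVolume for thread-safe health check lifecycle management
- Health check lifecycle tied to the Mount() and Unmount() driver methods

Addressing review feedback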
Per @zxh326's review on the previous PR (#66):
✅ This PR now uses Ino == 1 instead of parsing /proc/self/mountinfo for FUSE mount detection.

The health check currently logs when the mount dies, making the issue visible rather than silent. A future enhancement could attempt automatic remount, but that introduces complexity (re-acquiring credentials, race conditions with concurrent Docker operations) that should be discussed separately.
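For illustration, an inode-based mount check along these lines might look like the sketch below. It assumes a Unix-like platform where `os.FileInfo.Sys()` returns a `*syscall.Stat_t`, and it relies on the PR's premise that a live JuiceFS FUSE mount reports inode 1 for its root. `isMountedByIno` and `fuseRootIno` are hypothetical names; the driver's actual helper appears in the diff as `isJuicefsMounted`, whose body is not shown here.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// fuseRootIno is the inode number the root of the FUSE mount is
// assumed to report, per the PR description.
const fuseRootIno = 1

// isMountedByIno reports whether path currently looks like the root of
// a FUSE mount. Any stat failure is treated as "not mounted".
func isMountedByIno(path string) bool {
	fi, err := os.Stat(path)
	if err != nil {
		return false
	}
	st, ok := fi.Sys().(*syscall.Stat_t)
	if !ok {
		return false // platform does not expose a Stat_t
	}
	return st.Ino == fuseRootIno
}

func main() {
	// Pass a mountpoint as the first argument; defaults to "/".
	path := "/"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	fmt.Printf("%s looks like a FUSE mount root: %v\n", path, isMountedByIno(path))
}
```

Once the FUSE process dies and the mountpoint falls back to the underlying directory, the directory's own inode number is returned instead, which is what lets a check like this notice the change.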
Also includes
- exec.Command moved inside the loop since Cmd.Run() can only be called once per instance (addresses "Bug: ceMount loop reuses exec.Cmd, causing 'exec: already started' on retry", #62)
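The retry fix can be sketched as below. This is not the driver's `ceMount`: `runWithRetry` and `demoRetryMissingBinary` are hypothetical helpers, and the binary name in the demo is deliberately one that should not exist. The point is only that `exec.Command` is called inside the loop, so each attempt gets a fresh `*exec.Cmd`.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// runWithRetry runs a command up to attempts times, building a new
// *exec.Cmd for every attempt because a Cmd cannot be reused once Run
// has been called on it.
func runWithRetry(attempts int, delay time.Duration, name string, args ...string) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		cmd := exec.Command(name, args...) // fresh Cmd per attempt
		if err = cmd.Run(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("%s failed after %d attempts: %w", name, attempts, err)
}

// demoRetryMissingBinary retries a command that is assumed not to exist,
// so every attempt fails and the wrapped error is returned.
func demoRetryMissingBinary() error {
	return runWithRetry(3, time.Millisecond, "jfs-demo-binary-that-does-not-exist")
}

func main() {
	if err := demoRetryMissingBinary(); err != nil {
		fmt.Println("error:", err)
	}
}
```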