Fix #64: Add mount health check using Ino==1 detection #71

Open
raphaelbahat wants to merge 2 commits into juicedata:master from raphaelbahat:fix/health-check-ino

Conversation

@raphaelbahat
Contributor

Summary

Fixes #64 — FUSE mount death causes silent fallback to local disk, filling up host storage.

Changes

  • Add a periodic health check (every 30s) that detects when a JuiceFS mount is no longer active by checking stat.Ino == 1 on the mountpoint, consistent with the existing mount verification in ceMount/eeMount
  • When mount death is detected, the driver logs an ERROR so the issue is no longer silent
  • The health check runs in a goroutine with context.Context-based cancellation on unmount
  • Add sync.Mutex to jfsVolume for thread-safe health check lifecycle management
  • Add locking to Mount() and Unmount() driver methods

Addressing review feedback

Per @zxh326's review on the previous PR (#66):

> I would also avoid using "is this path any FUSE mount?" as the health signal here. The existing mount verification already checks stat.Ino == 1 on the mountpoint, which is a better signal for this driver because it verifies that the path still looks like the JuiceFS mount root.

✅ This PR now uses Ino == 1 instead of parsing /proc/self/mountinfo for FUSE mount detection.

> It only logs when /proc/self/mountinfo no longer shows the mountpoint as a FUSE mount, but the driver still returns the same mountpoint from Mount()/Path(). That means Docker can still continue writing to the local directory after the JuiceFS mount is gone.

The health check currently logs when the mount dies, making the issue visible rather than silent. A future enhancement could attempt automatic remount, but that introduces complexity (re-acquiring credentials, race conditions with concurrent Docker operations) that should be discussed separately.

Also includes

Add a periodic health check that detects when a JuiceFS FUSE mount is
no longer active by checking stat.Ino == 1 on the mountpoint, which is
consistent with the existing mount verification logic in ceMount/eeMount.

When mount death is detected, the driver logs an error so the issue is
no longer silent. The health check runs every 30s in a goroutine with
proper context-based cancellation on unmount.

Also fixes the ceMount retry bug: move exec.Command inside the loop
since Cmd.Run() can only be called once per instance.

Partially addresses juicedata#64 (silent fallback to local disk after FUSE mount death).
Comment thread main.go Outdated
	return logError("failed to umount %s: %s", r.Name, err)
}

v.stopHealthCheck()
Member

We need to ensure that the connections become 0 before stopping the HealthCheck.

Comment thread main.go
select {
case <-ticker.C:
	if !isJuicefsMounted(v.Mountpoint) {
		logrus.Errorf("JuiceFS mount at %s is no longer active", v.Mountpoint)
Member

This does not prevent it from continuing to write into the host filesystem, right?

Per review feedback: stopHealthCheck should only be called when
v.connections drops to 0, not on every unmount. This ensures the
health check keeps running while other containers are still using
the volume.
@raphaelbahat
Contributor Author

raphaelbahat commented May 8, 2026

Thanks for the review, @zxh326. I've pushed a fix for the first point:

1. Stop health check only when connections drops to 0: Fixed in aaf66f7. The health check now only stops when the last container unmounts, so it keeps running while other containers are still using the volume.

2. Regarding your second point, that logging does not stop writes from reaching the host filesystem: You're right, logging alone doesn't prevent the silent fallback. The current PR only makes the issue visible.

For active remediation when mount death is detected, I see four options:

  • Fail fast — unmount the dead mountpoint: When mount death is detected, call umount on the mountpoint so that all I/O in the container immediately fails (EIO/ENOTCONN), causing the application to crash and the container to exit per its restart policy. This prevents silent writes to local disk, causes no data loss (local files remain on disk for recovery), and lets the normal Docker lifecycle handle remounting — Mount() is called again when the container restarts. Simpler than auto-remount since no credential re-acquisition is needed.
  • Automatic remount: Re-run the mount command when the health check detects mount death. This is the most seamless for users but introduces complexity (re-acquiring credentials, race conditions with concurrent Docker operations, handling partial mount states).
  • Return error from Path()/Get(): Signal to Docker that the volume is unhealthy. However, Docker may not handle this gracefully and could cause container failures.
  • Remove the mountpoint directory: Force Docker to recreate the volume on next access. Risky; could lead to data loss if the FUSE mount is temporarily unavailable but recoverable.

Which approach would you prefer? I'm happy to implement it as a follow-up PR so this one stays focused on detection and visibility.

Development

Successfully merging this pull request may close these issues.

FUSE mount death causes silent fallback to local disk, filling up host storage

2 participants