Rustfs Disk-Full Recovery Failure & Metadata Corruption Report #2737

bartfaizoli76 · 2026-04-29T08:47:28Z

Rustfs Version: docker.io/rustfs/rustfs:1.0.0-alpha.98
Deployment Mode: Single-replica Deployment on OpenShift with LVMS thin-provisioned PVC

1. Executive Summary

When the underlying storage volume reached ~98% capacity (71 GiB used out of 72 GiB), rustfs entered an unrecoverable state in which:

All S3 API operations failed with `local disk /data health is faulty, returning error`
Internal metadata files became corrupted and produced `InsufficientReadQuorum` errors
Expanding the volume from 72 GiB to 100 GiB did NOT restore service, because the metadata was already damaged

The only recovery path we found was deleting the PVC, which destroyed all stored data.

2. Environment Details

| Component | Details |
| --- | --- |
| Platform | OpenShift (HCP v1.3.7) |
| Storage Class | `lvms-vg2` (LVM Thin Provisioning) |
| PVC Size | `72Gi` (XFS filesystem) |
| Rustfs Deployment | Single replica, `resources.limits.memory: 2Gi`, `resources.limits.cpu: 2000m` |
| Data Path | `/data` mounted from `rustfs-data-pvc` |
| fsGroup | `10001` |
| Initial Buckets | `edb-hcp` (63 GiB), `logs` (negligible) |

3. Problem Timeline

Phase 1: Silent Degradation (disk fills up)

PostgreSQL WAL backups, metrics, and logs accumulated in the `edb-hcp` bucket
Disk usage grew to 71 GiB / 72 GiB (98%)
No proactive alerts from rustfs; operations simply began failing
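
Since rustfs raised no alert of its own, an operator-side check could have flagged the volume before it filled. The following is a minimal sketch of such a check, not part of rustfs: the function names and the 80% / 90% thresholds are assumptions chosen for illustration, and `shutil.disk_usage` may report slightly different figures than `df` because of how reserved blocks are counted.

```python
import shutil


def usage_percent(used_bytes: int, total_bytes: int) -> float:
    """Return used space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def classify(percent: float, warn_at: float = 80.0, crit_at: float = 90.0) -> str:
    """Map a usage percentage to 'ok', 'warning', or 'critical'.

    The default thresholds are illustrative assumptions, not rustfs settings.
    """
    if percent >= crit_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"


def check_volume(path: str, warn_at: float = 80.0, crit_at: float = 90.0) -> str:
    """Classify the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return classify(usage_percent(usage.used, usage.total), warn_at, crit_at)


if __name__ == "__main__":
    # The figures from this report: 71 GiB used out of 72 GiB.
    print(classify(usage_percent(71, 72)))
```

Run periodically against the data mount (for example `check_volume("/data")` from a sidecar or cron job), a check like this would have reported `critical` well before the API began failing.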

Phase 2: Total API Failure

All S3 operations (GET, PUT, DELETE, LIST) returned errors. Rustfs logs showed:

WARN  local disk /data health is faulty, returning error

Impact:

`mc ls` → "The specified bucket does not exist"
`mc rm` → "Bucket does not exist" (even though the buckets were present)
CNPG backups failed
HCP cluster creation failed (503 errors due to S3 unavailability)

Phase 3: Diagnosis (host-level checks)

We diagnosed the issue on the host node:

# Filesystem check — CLEAN
xfs_repair -n /dev/mapper/vg2-...
# Result: ALL PHASES PASSED, no corruption

# Kernel logs — CLEAN
dmesg | grep -i "xfs\|error\|I/O"
# Result: only normal mount/unmount messages

# Disk usage
/dev/mapper/vg2-...   72G   71G  1.5G  98% /data

Conclusion: The XFS filesystem was healthy, so the failure originated in rustfs's internal logic rather than in the storage stack.

Phase 4: Volume Expansion (did NOT help)

We expanded the underlying infrastructure:

pvresize /dev/vdb — physical volume expanded to 200 GiB
LV already showed 100G virtual size
xfs_growfs — filesystem grown to 100 GiB
Restarted the rustfs pod so that it remounted the volume

Result: Disk now reported 71% used (100G total, 71G used, 30G free)

But rustfs STILL failed with NEW errors:

WARN  local disk /data health is faulty, returning error
ERROR InsufficientReadQuorum(".rustfs.sys", "buckets/.usage.json")
ERROR Storage resources are insufficient for the read operation: .rustfs.sys/buckets/.usage.json
ERROR Failed to read data usage info from backend: .rustfs.sys/buckets/.usage.json.bkp

Phase 5: Root Cause Identified

The `.rustfs.sys/buckets/.usage.json` metadata file (stored as erasure-coded chunks in rustfs's internal format) had lost read quorum. Together with the `ErasureWriteQuorum` errors logged while the disk was full (see the appendix), this indicates:

During the disk-full period, writes to metadata files either partially succeeded or were rolled back inconsistently
Erasure-coded chunks for internal metadata became unreadable
The damage could not be reversed without low-level chunk reconstruction

4. What Actually Failed

| Layer | Status | Observation |
| --- | --- | --- |
| Physical disk (`/dev/vdb`) | ✅ Healthy | No SMART errors |
| LVM thin pool (`vg2/thin-pool-2`) | ✅ Healthy | 61% allocated after resize |
| Logical volume (`vg2/be6af05d-...`) | ✅ Healthy | 100G, no errors |
| XFS filesystem | ✅ Healthy | `xfs_repair -n` clean |
| Rustfs S3 API | ❌ Broken | All operations rejected |
| Rustfs metadata (`.rustfs.sys`) | ❌ Corrupted | `InsufficientReadQuorum` |

5. Reproduction Steps

Deploy rustfs 1.0.0-alpha.98 with a 72 GiB XFS volume
Continuously write S3 objects until disk reaches ~98% capacity
Observe: API calls begin failing with `local disk /data health is faulty`
Attempt to delete objects via the S3 API → fails (rustfs cannot write the metadata updates)
Expand volume to 100+ GiB via LVM/XFS
Restart rustfs pod
Observe: metadata read quorum failures persist, service remains broken

6. Feature Requests / Bug Reports for Rustfs Team

6.1 🐛 Bug: Metadata corruption on disk-full

Severity: Critical
Description: When the underlying disk fills up, rustfs's internal metadata writes (.rustfs.sys/*) can leave erasure-coded chunks in an inconsistent state. After disk space is freed or expanded, metadata read quorum cannot be restored.

Expected behavior: Rustfs should handle ENOSPC gracefully without corrupting its own metadata. At minimum, metadata writes should be atomic or self-healing.

Actual behavior: Metadata files (.usage.json, .usage.json.bkp, .bloomcycle.bin) become permanently unreadable with InsufficientReadQuorum.
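
To illustrate what "atomic" could mean for a single metadata file, here is a minimal sketch of the common POSIX write-to-temp, fsync, then rename pattern. It is not rustfs's implementation, and `atomic_write` is a hypothetical helper name; erasure-coded metadata spread over several shards would need more than this, but the principle carries over: a failed write (including ENOSPC) should leave the previous version intact.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or new content, never a mix.

    POSIX-only sketch: relies on rename atomicity and directory fsync.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)          # may raise OSError(ENOSPC) on a full disk
            tmp.flush()
            os.fsync(tmp.fileno())   # make the new content durable first
        os.replace(tmp_path, path)   # atomic switch to the new version
    except BaseException:
        # On any failure the original file is untouched; drop the partial temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Persist the rename itself by syncing the containing directory.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With this pattern, running out of space during the write leaves the existing file readable, which is the property the expected behavior above asks for.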

6.2 ⚠ Feature: Configurable disk-usage threshold

Description: The `local disk /data health is faulty` message does not say which of the following caused it:

A full disk
I/O errors
Filesystem corruption

Request: Add a configurable MIN_FREE_SPACE_PERCENT or MIN_FREE_SPACE_GB parameter. When breached:

Reject new writes with a clear DiskFull error
Allow reads and deletes to continue (so users can free space!)
Log the actual reason: Disk usage 98% exceeds threshold 90%
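
The requested behavior can be sketched as a small admission check. This is only an illustration of the policy, not rustfs code: `Op`, `DiskFull`, `admit`, and the `min_free_percent` parameter are hypothetical names modeled on the request above.

```python
from enum import Enum


class Op(Enum):
    READ = "read"
    WRITE = "write"
    DELETE = "delete"


class DiskFull(Exception):
    """Raised when a write is refused because free space is below the threshold."""


def admit(op: Op, used_bytes: int, total_bytes: int,
          min_free_percent: float = 10.0) -> None:
    """Refuse new writes below the free-space threshold; always admit reads and deletes."""
    free_percent = 100.0 * (total_bytes - used_bytes) / total_bytes
    if op is Op.WRITE and free_percent < min_free_percent:
        used_percent = 100.0 - free_percent
        threshold = 100.0 - min_free_percent
        raise DiskFull(
            f"Disk usage {used_percent:.0f}% exceeds threshold {threshold:.0f}%"
        )
    # Reads and deletes fall through so that users can still free space.


if __name__ == "__main__":
    admit(Op.DELETE, used_bytes=98, total_bytes=100)      # allowed
    try:
        admit(Op.WRITE, used_bytes=98, total_bytes=100)   # refused
    except DiskFull as exc:
        print(exc)
```

Because a delete still has to update metadata, a real implementation would presumably need to keep some headroom reserved for those internal writes; the threshold is what provides that headroom.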

6.3 ⚠ Feature: Read-only / maintenance mode on disk-full

Description: When disk is critically full, rustfs rejects everything, including DELETE operations.

Request: Implement a "read-only with deletes allowed" mode. Users must be able to remove objects to recover space.

6.4 ⚠ Feature: Metadata self-healing

Description: If internal metadata files (.usage.json, etc.) lose read quorum, rustfs has no recovery path.

Request: On startup, if metadata is unreadable:

Attempt to reconstruct from available chunks
If reconstruction fails, rebuild metadata from scratch by scanning the bucket store (like fsck)
Log a clear warning: Rebuilding metadata from bucket scan — this may take time
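
The rebuild step could look roughly like the sketch below. It rests on strong assumptions that this report does not confirm: that each bucket is a top-level directory under the data root, and that summing file sizes is an acceptable approximation of usage. The real on-disk layout is rustfs's internal erasure-coded format, so `rebuild_usage` is a hypothetical illustration of the fsck-style idea only.

```python
import os


def rebuild_usage(data_root: str, system_dir: str = ".rustfs.sys") -> dict:
    """Recompute per-bucket file counts and byte totals by walking the data root.

    Assumes (hypothetically) one top-level directory per bucket; the system
    metadata directory is skipped because it is what is being rebuilt.
    """
    usage = {}
    for entry in os.scandir(data_root):
        if not entry.is_dir(follow_symlinks=False) or entry.name == system_dir:
            continue
        file_count = 0
        total_bytes = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                total_bytes += os.lstat(os.path.join(dirpath, name)).st_size
                file_count += 1
        usage[entry.name] = {"files": file_count, "bytes": total_bytes}
    return usage
```

The result could then be written back as a fresh usage file, replacing the unreadable one instead of failing on it at every startup.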

6.5 ⚠ Feature: Disk health diagnostic endpoint

Description: /health returns only HTTP 200/503. There's no way to query WHY the disk is considered faulty.

Request: Add /health/detailed or expose metrics showing:

Disk usage %
Disk health status (with reason: full / io_error / metadata_corrupt / ok)
Metadata quorum status per file
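
As an illustration of what such an endpoint might return, here is a sketch of a possible response body. The field names, the `DiskHealth` type, and the status values are assumptions derived from the list above; nothing here reflects an existing rustfs API.

```python
import json
from dataclasses import asdict, dataclass, field

# Reasons taken from the request above.
VALID_STATUSES = {"ok", "full", "io_error", "metadata_corrupt"}


@dataclass
class DiskHealth:
    path: str
    usage_percent: float
    status: str
    # Hypothetical per-file quorum map, e.g. {"buckets/.usage.json": "lost"}.
    metadata_quorum: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {self.status}")


def http_status(report: DiskHealth) -> int:
    """Keep the existing 200/503 contract; the body carries the reason."""
    return 200 if report.status == "ok" else 503


def render(report: DiskHealth) -> str:
    """Serialize the report as a JSON response body."""
    return json.dumps(asdict(report), sort_keys=True)


if __name__ == "__main__":
    report = DiskHealth(
        path="/data",
        usage_percent=98.0,
        status="full",
        metadata_quorum={"buckets/.usage.json": "lost"},
    )
    print(http_status(report), render(report))
```

A body like this would have distinguished "disk full" from "metadata corrupt" immediately, which is the distinction that took host-level debugging to establish in this incident.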

7. Workarounds (for operators hitting this issue)

❌ Did NOT work:

Volume expansion (LVM/XFS resize) — metadata already corrupted
Direct filesystem deletion — rustfs metadata out of sync
xfs_repair — filesystem was already clean

✅ Only working recovery:

Scale rustfs to 0 replicas
Delete PVC rustfs-data-pvc (this destroys all data)
Recreate PVC with larger size (e.g., 200 GiB)
Scale rustfs back to 1
Recreate buckets and reconfigure S3 clients

8. Appendix: Key Log Excerpts

Disk-full period (before resize):

WARN  local disk /data health is faulty, returning error
WARN  get_bucket_info{bucket="edb-hcp"}: local disk /data health is faulty
ERROR save_config_with_opts: err: ErasureWriteQuorum, file: buckets/.usage.json
ERROR Failed to save data usage bloom name to buckets/.bloomcycle.bin: erasure write quorum

After volume resize (100 GiB, 71% used):

WARN  local disk /data health is faulty, returning error
ERROR read_config_with_metadata: err: InsufficientReadQuorum(".rustfs.sys", "buckets/.usage.json")
ERROR Failed to read data usage info from backend: Storage resources are insufficient

9. Contact / Follow-up

This issue was encountered during EDB Hosted Cloud Platform (HCP) PoC operations.
We are happy to provide additional logs, core dumps, or reproduction environments if helpful.

Suggested priority: P1 — data-loss / unrecoverable state
Suggested labels: bug, data-loss, disk-full, metadata, erasure-coding
