The issue happen in my local build SM 0.5.5 image, lets put here to see if it is a real bug for GC
Environment
| Parameter |
Value |
| Cluster |
908 |
| Namespace |
nuobject2sh-dev |
| Deployment |
sm-long-running2-1013 |
| HomeObjVer |
homeobject/4.1.0@oss/v4 |
| HomeStoreVer |
homestore/7.4.4@oss/master |
Issue Description
The storage manager instance sm2 became stuck because no space left and was unable to synchronize with other cluster members. And the gc task failed due to data verification failure. The data was not migrated successfully by the previous gc task.
Test Scenario
- Initial Test Cycle:
write_blobs → delete_blobs → upgrade_version → gc → write_blobs → seal_shards → truncate_logs → read_verify (PASSED)
- Follow-up Test: Write 10,000 additional blobs
blob_size: 4M
replica_count: 3
write:delete:read ratio: 6:4
- Failure Point: During the write phase,
sm2 stuck .
Symptoms
1. Space Full Error trigger EGC
[03/07/26 01:17:40.715] [storage_mgr] [warning] block allocation failure, repl_key=[...], status=[SPACE_FULL]
[03/07/26 01:17:40.715] [storage_mgr] [error] Allocate blk for rreq failed error=NO_SPACE_LEFT
2. GC Task Failure with Blob Verification Errors
randomly selected one task log; all blobs on pg=0, shard=0x19 failed verification, and the PBA mismatch:
[03/07/26 01:11:51.753] [storage_mgr] [error] [18] [gc_manager.cpp:872:operator()] [gc_task_id=112, pg_id=0, shard_id=0x19] blob verification fails for move_from_chunk=66, blob_id=10488, pba=[{blk#=52278 count=1025 chunk=55},]
[03/07/26 01:11:51.767] [storage_mgr] [error] [18] [hs_blob_manager.cpp:702:do_verify_blob] Invalid header found: [header=magic=0xdbcc00880a6daea8 version=47 shard=0xd7ed5efeee4f468 blob_size=1930654802 user_size=4123788523 ...
...
[03/07/26 01:11:52.694] [storage_mgr] [error] [18] [hs_blob_manager.cpp:702:do_verify_blob] Invalid header found: [header=magic=0x5812e3650743c043 version=223 shard=0xddbf494dcb517fd7 blob_size=3020033140 user_size=2718524605 ...
[03/07/26 01:11:52.694] [storage_mgr] [error] [18] [gc_manager.cpp:872:operator()] [gc_task_id=112, pg_id=0, shard_id=0x19] blob verification fails for move_from_chunk=66, blob_id=10699, pba=[{blk#=95328 count=1025 chunk=55},]
[03/07/26 01:11:52.694] [storage_mgr] [error] [251] [gc_manager.cpp:986:copy_valid_data] [gc_task_id=112, pg_id=0, shard_id=0x19] Failed to copy all blobs from move_from_chunk=66 to move_to_chunk=51
[03/07/26 01:11:52.694] [storage_mgr] [error] [251] [gc_manager.cpp:1168:handle_error_before_persisting_gc_metablk] [gc_task_id=112, pg_id=0, shard_id=0xffffffffffffffff] move_from_chunk=66 to move_to_chunk=51 with priority=1 failed before persisting gc metablk
3. Previous GC Task for the Shard Could Not Complete
[03/07/26 00:59:34.119] [storage_mgr] [debug] [250] [gc_manager.cpp:388:operator()] start gc task : move_from_chunk_id=55, priority=1
[03/07/26 00:59:34.119] [storage_mgr] [debug] [250] [gc_manager.cpp:1191:process_gc_task] [gc_task_id=71, pg_id=0, shard_id=0xffffffffffffffff] start process gc task for move_from_chunk=55 with priority=1
[03/07/26 00:59:34.119] [storage_mgr] [debug] [250] [gc_manager.cpp:1210:process_gc_task] [gc_task_id=71, pg_id=0, shard_id=0xffffffffffffffff] task for move_from_chunk=55 to move_to_chunk=66 with priority=1 start copying data
[03/07/26 00:59:34.119] [storage_mgr] [debug] [250] [gc_manager.cpp:1059:purge_reserved_chunk] [gc_task_id=71, pg_id=0, shard_id=0xffffffffffffffff] reset chunk=66 before using it for gc
[03/07/26 00:59:34.120] [storage_mgr] [warning] [250] [gc_manager.cpp:694:copy_valid_data] [gc_task_id=71, pg_id=0, shard_id=0x19] last shard in move_from_chunk=55 has a state of OPEN!
[03/07/26 00:59:34.227] [storage_mgr] [debug] [250] [gc_manager.cpp:762:copy_valid_data] [gc_task_id=71, pg_id=0, shard_id=0x19] empty shard found in move_from_chunk=55
[03/07/26 00:59:34.227] [storage_mgr] [debug] [250] [gc_manager.cpp:990:copy_valid_data] [gc_task_id=71, pg_id=0, shard_id=0x19] successfully copy blobs from move_from_chunk=55 to move_to_chunk=66
...
[03/07/26 00:59:34.244] [storage_mgr] [info] [250] [gc_manager.cpp:1291:process_gc_task] [gc_task_id=71, pg_id=0, shard_id=0xffffffffffffffff] task for move_from_chunk=55 to move_to_chunk=66 with priority=1 is completed!
4. shard 0x19 creation and GC overlap
[03/07/26 00:59:04.905] [storage_mgr] [debug] [74] [raft_repl_dev.cpp:1694:handle_commit] [traceID=13542536828225921791] [rdev0:c5f48809-45d9-4b88-b843-73ef1661a1ad] Raft channel: Commit rreq=[dsn=18752 term=3 lsn=18757 op=HS_DATA_LINKED local_blkid=[[{blk#=42027 count=1 chunk=55},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | DATA_WRITTEN | LOG_RECEIVED | LOG_FLUSHED]]
[03/07/26 00:59:34.910] [storage_mgr] [debug] [74] [hs_shard_manager.cpp:470:local_create_shard] [trace_id=13542536828225921791,shardID=0x19,pg=0,shard=0x19] local_create_shard 25, vchunk_id=17, p_chunk_id=55, pg_id=0
[03/07/26 00:59:34.910] [storage_mgr] [debug] [74] [hs_shard_manager.cpp:525:on_shard_message_commit] [trace_id=13542536828225921791,shardID=0x19,pg=0,shard=0x19] Commit done for creating shard
Phenomenon Analysis
Primary Issue: Race Condition Between GC and Shard Operations?
The logs reveal a critical race condition between garbage collection and shard lifecycle operations:
-
Shard in OPEN State During GC: At 00:59:34.120, a warning was logged indicating that the last shard in chunk=55 was still in OPEN state when GC started moving data to chunk=66.
-
Concurrent Operations: While GC was copying data from chunk 55 to chunk 66, the shard was still writing data to the original chunk 55.
-
Data Corruption: The PBA references in the GC metadata point to chunk=55, but the actual data content after GC points to invalid blob headers. This indicates that:
- GC metadata was not updated to reflect new locations
- The data has been corrupted
The issue happen in my local build SM 0.5.5 image, lets put here to see if it is a real bug for GC
Environment
Issue Description
The storage manager instance
sm2became stuck because no space left and was unable to synchronize with other cluster members. And the gc task failed due to data verification failure. The data was not migrated successfully by the previous gc task.Test Scenario
write_blobs→delete_blobs→upgrade_version→gc→write_blobs→seal_shards→truncate_logs→read_verify(PASSED)blob_size: 4Mreplica_count: 3write:delete:readratio: 6:4sm2stuck .Symptoms
1. Space Full Error trigger EGC
2. GC Task Failure with Blob Verification Errors
randomly selected one task log; all blobs on pg=0, shard=0x19 failed verification, and the PBA mismatch:
3. Previous GC Task for the Shard Could Not Complete
4. shard 0x19 creation and GC overlap
Phenomenon Analysis
Primary Issue: Race Condition Between GC and Shard Operations?
The logs reveal a critical race condition between garbage collection and shard lifecycle operations:
Shard in OPEN State During GC: At
00:59:34.120, a warning was logged indicating that the last shard inchunk=55was still inOPENstate when GC started moving data tochunk=66.Concurrent Operations: While GC was copying data from chunk 55 to chunk 66, the shard was still writing data to the original chunk 55.
Data Corruption: The PBA references in the GC metadata point to
chunk=55, but the actual data content after GC points to invalid blob headers. This indicates that: