GC conflict with shard creating

The issue happen in my local build SM 0.5.5 image, lets put here to see if it is a real bug for GC


## Environment

| Parameter | Value |
|-----------|-------|
| **Cluster** | 908 |
| **Namespace** | nuobject2sh-dev |
| **Deployment** | sm-long-running2-1013 |
| **HomeObjVer** | homeobject/4.1.0@oss/v4 |
| **HomeStoreVer** | homestore/7.4.4@oss/master |

## Issue Description

The storage manager instance `sm2` became stuck because no space left and was unable to synchronize with other cluster members. And the gc task failed due to data verification failure. The data was not migrated successfully by the previous gc task.

### Test Scenario

1. **Initial Test Cycle**: `write_blobs` → `delete_blobs` → `upgrade_version` → `gc` → `write_blobs` → `seal_shards` → `truncate_logs` → `read_verify` (PASSED)
2. **Follow-up Test**: Write 10,000 additional blobs
   - `blob_size`: 4M
   - `replica_count`: 3
   - `write:delete:read` ratio: 6:4
3. **Failure Point**: During the write phase, `sm2` stuck .

## Symptoms

### 1. Space Full Error trigger EGC
```
[03/07/26 01:17:40.715] [storage_mgr] [warning] block allocation failure, repl_key=[...], status=[SPACE_FULL]
[03/07/26 01:17:40.715] [storage_mgr] [error] Allocate blk for rreq failed error=NO_SPACE_LEFT
```

### 2. GC Task Failure with Blob Verification Errors

randomly selected one task log; all blobs on pg=0, shard=0x19 failed verification, and the PBA mismatch:

```
[03/07/26 01:11:51.753] [storage_mgr] [error] [18] [gc_manager.cpp:872:operator()] [gc_task_id=112, pg_id=0, shard_id=0x19] blob verification fails for move_from_chunk=66, blob_id=10488, pba=[{blk#=52278 count=1025 chunk=55},]	
[03/07/26 01:11:51.767] [storage_mgr] [error] [18] [hs_blob_manager.cpp:702:do_verify_blob] Invalid header found: [header=magic=0xdbcc00880a6daea8 version=47 shard=0xd7ed5efeee4f468 blob_size=1930654802 user_size=4123788523 ...
...
[03/07/26 01:11:52.694] [storage_mgr] [error] [18] [hs_blob_manager.cpp:702:do_verify_blob] Invalid header found: [header=magic=0x5812e3650743c043 version=223 shard=0xddbf494dcb517fd7 blob_size=3020033140 user_size=2718524605 ...
[03/07/26 01:11:52.694] [storage_mgr] [error] [18] [gc_manager.cpp:872:operator()] [gc_task_id=112, pg_id=0, shard_id=0x19] blob verification fails for move_from_chunk=66, blob_id=10699, pba=[{blk#=95328 count=1025 chunk=55},]
[03/07/26 01:11:52.694] [storage_mgr] [error] [251] [gc_manager.cpp:986:copy_valid_data] [gc_task_id=112, pg_id=0, shard_id=0x19] Failed to copy all blobs from move_from_chunk=66 to move_to_chunk=51
[03/07/26 01:11:52.694] [storage_mgr] [error] [251] [gc_manager.cpp:1168:handle_error_before_persisting_gc_metablk] [gc_task_id=112, pg_id=0, shard_id=0xffffffffffffffff] move_from_chunk=66 to move_to_chunk=51 with priority=1 failed before persisting gc metablk

```

### 3. Previous GC Task for the Shard Could Not Complete
```
[03/07/26 00:59:34.119] [storage_mgr] [debug] [250] [gc_manager.cpp:388:operator()] start gc task : move_from_chunk_id=55, priority=1
[03/07/26 00:59:34.119] [storage_mgr] [debug] [250] [gc_manager.cpp:1191:process_gc_task] [gc_task_id=71, pg_id=0, shard_id=0xffffffffffffffff] start process gc task for move_from_chunk=55 with priority=1 
[03/07/26 00:59:34.119] [storage_mgr] [debug] [250] [gc_manager.cpp:1210:process_gc_task] [gc_task_id=71, pg_id=0, shard_id=0xffffffffffffffff] task for move_from_chunk=55 to move_to_chunk=66 with priority=1 start copying data
[03/07/26 00:59:34.119] [storage_mgr] [debug] [250] [gc_manager.cpp:1059:purge_reserved_chunk] [gc_task_id=71, pg_id=0, shard_id=0xffffffffffffffff] reset chunk=66 before using it for gc
[03/07/26 00:59:34.120] [storage_mgr] [warning] [250] [gc_manager.cpp:694:copy_valid_data] [gc_task_id=71, pg_id=0, shard_id=0x19] last shard in move_from_chunk=55 has a state of OPEN!
[03/07/26 00:59:34.227] [storage_mgr] [debug] [250] [gc_manager.cpp:762:copy_valid_data] [gc_task_id=71, pg_id=0, shard_id=0x19] empty shard found in move_from_chunk=55
[03/07/26 00:59:34.227] [storage_mgr] [debug] [250] [gc_manager.cpp:990:copy_valid_data] [gc_task_id=71, pg_id=0, shard_id=0x19] successfully copy blobs from move_from_chunk=55 to move_to_chunk=66
...
[03/07/26 00:59:34.244] [storage_mgr] [info] [250] [gc_manager.cpp:1291:process_gc_task] [gc_task_id=71, pg_id=0, shard_id=0xffffffffffffffff] task for move_from_chunk=55 to move_to_chunk=66 with priority=1 is completed!

```

### 4. shard 0x19 creation and GC overlap
```
[03/07/26 00:59:04.905] [storage_mgr] [debug] [74] [raft_repl_dev.cpp:1694:handle_commit] [traceID=13542536828225921791] [rdev0:c5f48809-45d9-4b88-b843-73ef1661a1ad] Raft channel: Commit rreq=[dsn=18752 term=3 lsn=18757 op=HS_DATA_LINKED local_blkid=[[{blk#=42027 count=1 chunk=55},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | DATA_WRITTEN | LOG_RECEIVED | LOG_FLUSHED]]

[03/07/26 00:59:34.910] [storage_mgr] [debug] [74] [hs_shard_manager.cpp:470:local_create_shard] [trace_id=13542536828225921791,shardID=0x19,pg=0,shard=0x19] local_create_shard 25, vchunk_id=17, p_chunk_id=55, pg_id=0
[03/07/26 00:59:34.910] [storage_mgr] [debug] [74] [hs_shard_manager.cpp:525:on_shard_message_commit] [trace_id=13542536828225921791,shardID=0x19,pg=0,shard=0x19] Commit done for creating shard
```

## Phenomenon Analysis


### Primary Issue: Race Condition Between GC and Shard Operations?

The logs reveal a critical race condition between garbage collection and shard lifecycle operations:

1. **Shard in OPEN State During GC**: At `00:59:34.120`, a warning was logged indicating that the last shard in `chunk=55` was still in `OPEN` state when GC started moving data to `chunk=66`.

2. **Concurrent Operations**: While GC was copying data from chunk 55 to chunk 66, the shard was still writing data to the original chunk 55.

3. **Data Corruption**: The PBA references in the GC metadata point to `chunk=55`, but the actual data content after GC points to invalid blob headers. This indicates that:
   - GC metadata was not updated to reflect new locations
   - The data has been corrupted

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GC conflict with shard creating #870

Environment

Issue Description

Test Scenario

Symptoms

1. Space Full Error trigger EGC

2. GC Task Failure with Blob Verification Errors

3. Previous GC Task for the Shard Could Not Complete

4. shard 0x19 creation and GC overlap

Phenomenon Analysis

Primary Issue: Race Condition Between GC and Shard Operations?

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Parameter	Value
Cluster	908
Namespace	nuobject2sh-dev
Deployment	sm-long-running2-1013
HomeObjVer	homeobject/4.1.0@oss/v4
HomeStoreVer	homestore/7.4.4@oss/master

GC conflict with shard creating #870

Description

Environment

Issue Description

Test Scenario

Symptoms

1. Space Full Error trigger EGC

2. GC Task Failure with Blob Verification Errors

3. Previous GC Task for the Shard Could Not Complete

4. shard 0x19 creation and GC overlap

Phenomenon Analysis

Primary Issue: Race Condition Between GC and Shard Operations?

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions