Commit bfc0ba6
[perf] Performance optimization of 'KVStorageManager'
Co-authored-by: dpj135<diaopengjie@huawei.com>
# message auto-generated for no-merge-commit merge:
!12 merge dpj/optimize_KVSmanager into main
[perf] Performance optimization of 'KVStorageManager'
Created-by: dpj135
Commit-by: dpj135
Merged-by: ascend-robot
Description:
## Description
We tried to optimize every method of KVStorageManager.
**The process of `storage_manager.put_data`:** --> `storage_manager._generate_keys` --> `storage_manager._get_shape_type_list` --> `storage_client.get` --> `storage_manager._merge_tensors_to_tensordict`.
We found that `torch.stack` in `storage_manager._merge_tensors_to_tensordict` was very slow when processing CPU data, so we used multithreading for optimization.
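The idea can be sketched as follows. This is a minimal illustration, not the code from this merge request: the helper name `stack_fields_threaded`, the `fields` layout (field name mapped to a list of per-sample CPU tensors), and the default of 4 threads are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

import torch


def stack_fields_threaded(fields, num_threads=4):
    """Stack each field's list of per-sample tensors on a worker thread.

    `fields` (hypothetical layout) maps a field name to a list of equally
    shaped CPU tensors. Returns a dict of field name -> stacked tensor.
    """
    names = list(fields)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves input order, so results line up with `names`.
        stacked = list(pool.map(lambda name: torch.stack(fields[name]), names))
    return dict(zip(names, stacked))


if __name__ == "__main__":
    batch = {
        "input_ids": [torch.zeros(8, dtype=torch.int64) for _ in range(4)],
        "rewards": [torch.ones(1) for _ in range(4)],
    }
    merged = stack_fields_threaded(batch)
    print({name: tuple(t.shape) for name, t in merged.items()})
```

Each field is stacked independently, so the work splits naturally per field; whether per-field or per-chunk splitting is better depends on the field count and tensor sizes.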
We tested with the **165M** data configuration from the TQ performance test; the results are shown below. The 'ratio' in the figures refers to the share of the `tq_client.async_get_data` runtime. The **datasystem** is mocked as follows:
```python
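import random
import time
from unittest.mock import MagicMock

import torch

# Runs inside a test method: `self` and `total_data_size_gb` come from the enclosing test.
mock_storage = MagicMock()


def mock_get_side_effect(keys, shapes=None, dtypes=None):
    # Simulate network transfer time
    time.sleep(total_data_size_gb / 8)
    # Fields with a dtype become zero tensors; the rest become random 1024-char strings.
    return [
        torch.zeros(s, dtype=d) if d is not None
        else "".join(
            random.choices("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", k=1024)
        )
        for s, d in zip(shapes, dtypes)
    ]


mock_storage.side_effect = mock_get_side_effect
self.data_system_client.storage_manager.storage_client.get = mock_storage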
```
The test results are as follows:
```
batch_size 1024
seq_len 8192
field_num 10
Total size 165M
```
1. 50% CPU tensor + 50% non-tensor:
- Before optimization

- After optimization

2. 100% CPU tensor:
- Before optimization

- After optimization

## Analysis
In a scenario with 100% CPU tensor usage, an overall performance improvement of 40% is achieved when the number of threads is 4.
In a scenario with 50% CPU tensors and 50% non-tensors, with 4 threads, the overall performance improves by 7.5%.
> `torch.stack` releases the Python Global Interpreter Lock (GIL) while it runs, so several threads can stack tensors in parallel. A large share of non-tensor data (Python objects) reduces the gain, because handling Python objects always holds the GIL.
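
The effect described in the note can be illustrated with a small standard-library sketch. It is an assumption-laden stand-in, not a benchmark of this change: `time.sleep` plays the role of an operation that releases the GIL (as `torch.stack` is said to above), a pure-Python loop plays the role of non-tensor handling, and the function names and sizes are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def gil_releasing_task(seconds=0.05):
    # time.sleep releases the GIL while waiting (stand-in for torch.stack).
    time.sleep(seconds)


def gil_holding_task(n=200_000):
    # A pure-Python loop (stand-in for non-tensor / Python-object handling).
    total = 0
    for i in range(n):
        total += i
    return total


def timed(task, num_threads, repeats=4):
    """Run `task` `repeats` times on `num_threads` threads; return wall time in seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(task) for _ in range(repeats)]
        for future in futures:
            future.result()
    return time.perf_counter() - start


if __name__ == "__main__":
    for task in (gil_releasing_task, gil_holding_task):
        serial = timed(task, num_threads=1)
        threaded = timed(task, num_threads=4)
        print(f"{task.__name__}: 1 thread {serial:.3f}s, 4 threads {threaded:.3f}s")
```

On a standard (GIL-enabled) CPython build, the GIL-releasing task should finish noticeably faster with 4 threads, while the pure-Python task typically shows little or no gain.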
## TODO:
- [x] Provide a more comprehensive unit test
- [x] The number of threads needs to be adjusted.
- [ ] Optimize performance on nontensors **if necessary**.
See merge request: Ascend/TransferQueue!121
Parent 214a4d6, commit bfc0ba6
2 files changed
Lines changed: 447 additions & 334 deletions