
Commit bfc0ba6

dpj135 authored and ascend-robot committed
[perf] Performance optimization of 'KVStorageManager'
Co-authored-by: dpj135 <diaopengjie@huawei.com>

# message auto-generated for no-merge-commit merge:
!12 merge dpj/optimize_KVSmanager into main

[perf] Performance optimization of 'KVStorageManager'

Created-by: dpj135
Commit-by: dpj135
Merged-by: ascend-robot

Description:

## Description

We tried to optimize every method of `KVStorageManager`.

**The process of `storage_manager.put_data`:**
--> `storage_manager._generate_keys`
--> `storage_manager._get_shape_type_list`
--> `storage_client.get`
--> `storage_manager._merge_tensors_to_tensordict`.

We found that `torch.stack` in `storage_manager._merge_tensors_to_tensordict` was very slow when processing CPU data, so we parallelized it with multithreading.

We tested with the **165M** data configuration from the TQ performance test; the results are shown below. The 'radio' (ratio) label in the figures below refers to the proportion of the `tq_client.async_get_data` runtime.

For these tests we mocked the **datasystem**:

```python
import random
import time
from unittest.mock import MagicMock

import torch

# Fragment from a test method: `self` and `total_data_size_gb` come from the enclosing test.
mock_storage = MagicMock()

def mock_get_side_effect(keys, shapes=None, dtypes=None):
    # Simulate network transfer time.
    time.sleep(total_data_size_gb / 8)
    # Return a zero tensor for tensor fields and a random 1024-character string otherwise.
    return [
        torch.zeros(s, dtype=d)
        if d is not None
        else "".join(
            random.choices("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", k=1024)
        )
        for s, d in zip(shapes, dtypes)
    ]

mock_storage.side_effect = mock_get_side_effect
self.data_system_client.storage_manager.storage_client.get = mock_storage
```

Test results are as follows:

```
batch_size 1024
seq_len 8192
field_num 10
Total size 165M
```

1. 50% cpu-tensor + 50% non-tensor:
   - Before optimization
     ![image.png](https://raw.gitcode.com/user-images/assets/8886051/3aad78e3-7725-4d61-a092-cf1fe78f2091/image.png 'image.png')
   - After optimization
     ![image.png](https://raw.gitcode.com/user-images/assets/8886051/d718aec8-2b14-4041-be4a-c49c4e10b330/image.png 'image.png')
2. 100% cpu-tensor:
   - Before optimization
     ![image.png](https://raw.gitcode.com/user-images/assets/8886051/fc003b92-472b-4766-be58-3f42ae5efabc/image.png 'image.png')
   - After optimization
     ![image.png](https://raw.gitcode.com/user-images/assets/8886051/4789b90c-d3cf-4c89-8751-d5323292502c/image.png 'image.png')

## Analysis

In the 100% CPU-tensor scenario, overall performance improves by 40% with 4 threads.
In the 50% CPU-tensor + 50% non-tensor scenario, overall performance improves by 7.5% with 4 threads.

> `torch.stack` releases the Python Global Interpreter Lock (GIL), so several threads can stack tensors in parallel. A large share of non-tensor data (Python objects) reduces the gain, because processing non-tensors always holds the GIL.

## TODO:

- [x] Provide a more comprehensive unit test.
- [x] The number of threads needs to be adjusted.
- [ ] Optimize performance on non-tensors **if necessary**.

See merge request: Ascend/TransferQueue!12
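The multithreaded `torch.stack` idea described above can be illustrated with a minimal sketch. It is not the code from this commit: the helper name `stack_fields_threaded`, the per-field task split, and the sample field names are all assumptions made for illustration. It only shows the general pattern of handing each field's list of CPU tensors to a thread pool so the stacks can overlap.

```python
from concurrent.futures import ThreadPoolExecutor

import torch


def stack_fields_threaded(fields, num_threads=4):
    """Stack each field's per-sample tensors in a worker thread.

    Hypothetical helper, not part of KVStorageManager.
    `fields` maps a field name to a list of equally shaped CPU tensors.
    Returns a dict mapping each field name to one stacked tensor.
    """
    names = list(fields)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # One task per field; pool.map preserves the order of `names`.
        stacked = list(pool.map(lambda name: torch.stack(fields[name]), names))
    return dict(zip(names, stacked))


# Illustrative data: 4 samples for each of two made-up fields.
batch = {
    "input_ids": [torch.full((8,), i, dtype=torch.int64) for i in range(4)],
    "rewards": [torch.zeros(2) for _ in range(4)],
}
merged = stack_fields_threaded(batch)
```

The default of 4 threads mirrors the thread count quoted in the analysis. Any actual speedup depends on tensor sizes and on how much of the batch consists of Python objects, so it should be measured rather than assumed.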
1 parent 214a4d6 commit bfc0ba6

2 files changed

Lines changed: 447 additions & 334 deletions
