Commit c6e766c

Merge pull request #74 from TransferQueue/dev
Merge Dev 1021
2 parents 68c04e7 + 086b7ac commit c6e766c

1 file changed

Lines changed: 17 additions & 8 deletions

README.md

@@ -7,10 +7,15 @@
 <br />
 <br />
 
+<a href="https://deepwiki.com/TransferQueue/TransferQueue"><img src="https://devin.ai/assets/deepwiki-badge.png" alt="Ask DeepWiki.com" style="height:20px;"></a>
+[![GitHub Repo stars](https://img.shields.io/github/stars/TransferQueue/TransferQueue)](https://github.com/TransferQueue/TransferQueue/stargazers/)
+[![GitHub commit activity](https://img.shields.io/github/commit-activity/w/TransferQueue/TransferQueue)](https://github.com/TransferQueue/TransferQueue/graphs/commit-activity)
+
 </div>
 <br/>
 
 
+
 <h2 id="overview">🎉 Overview</h2>
 
 TransferQueue is a high-performance data storage and transfer module with panoramic data visibility and streaming scheduling capabilities, optimized for efficient dataflow in post-training workflows.
@@ -33,9 +38,9 @@ TransferQueue offers **fine-grained, sample-level** data management and **load-b
 
 
 <h2 id="updates">🔄 Updates</h2>
-
+- **Oct 21, 2025**: Official integration into verl is ready [verl/pulls/3649](https://github.com/volcengine/verl/pull/3649). Following PRs will optimize the single controller architecture by fully decoupling data & control flows.
 - **July 22, 2025**: We present a series of Chinese blogs on <a href="https://zhuanlan.zhihu.com/p/1930244241625449814">Zhihu 1</a>, <a href="https://zhuanlan.zhihu.com/p/1933259599953232589">2</a>.
-- **July 21, 2025**: We start an RFC on verl community [RFC#2662](https://github.com/volcengine/verl/discussions/2662).
+- **July 21, 2025**: We started an RFC on verl community [verl/discussions/2662](https://github.com/volcengine/verl/discussions/2662).
 - **July 2, 2025**: We publish the paper [AsyncFlow](https://arxiv.org/abs/2507.01663).
 
 
@@ -48,15 +53,15 @@ TransferQueue offers **fine-grained, sample-level** data management and **load-b
 
 In the control plane, `TransferQueueController` tracks the **production status** and **consumption status** of each training sample as metadata. When all the required data fields are ready (i.e., written to the `TransferQueueStorage`), we know that this data sample can be consumed by downstream tasks.
 
-For consumption status, we record the consumption records for each computational task (e.g., `generate_sequences`, `compute_log_prob`, etc.). Therefore, even different computation tasks require the same data field, they can consume the data independently without interfering with each other.
+For consumption status, we record the consumption records for each computational task (e.g., `generate_sequences`, `compute_log_prob`, etc.). Therefore, even when different computation tasks require the same data field, they can consume the data independently without interfering with each other.
 
 
 <p align="center">
 <img src="https://cdn.nlark.com/yuque/0/2025/png/23208217/1758696820173-456c1784-42ba-40c8-a292-2ff1401f49c5.png" width="70%">
 </p>
 
 
-> In the future, we plan to support **load-balancing** and **dynamic batching** capabilities in the control plane. Besides, we will support data management for disaggregated frameworks where each rank manages the data retrieval by itself, rather than coordinated by a single controller.
+> In the future, we plan to support **load-balancing** and **dynamic batching** capabilities in the control plane. Additionally, we will support data management for disaggregated frameworks where each rank manages the data retrieval by itself, rather than coordinated by a single controller.
 
 ### Data Plane: Distributed Data Storage
 
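Note on the hunk above: the control-plane paragraphs describe per-sample production status and per-task consumption status. The following is a minimal Python sketch of that bookkeeping, not the TransferQueue implementation. `ToyController`, `mark_produced`, `is_ready`, and `get_batch` are hypothetical names, and the sketch assumes a single global set of required fields for simplicity.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyController:
    """Hypothetical stand-in for the control plane; not the TransferQueue API."""

    required_fields: tuple[str, ...]
    # sample index -> names of fields already written to storage
    produced: dict[int, set[str]] = field(default_factory=dict)
    # task name -> sample indices that task has already consumed
    consumed: dict[str, set[int]] = field(default_factory=dict)

    def mark_produced(self, sample: int, field_name: str) -> None:
        self.produced.setdefault(sample, set()).add(field_name)

    def is_ready(self, sample: int) -> bool:
        # A sample is consumable once every required field has been written.
        return set(self.required_fields) <= self.produced.get(sample, set())

    def get_batch(self, task: str, batch_size: int) -> list[int]:
        # Consumption is recorded per task, so tasks do not interfere.
        seen = self.consumed.setdefault(task, set())
        ready = [s for s in sorted(self.produced) if self.is_ready(s) and s not in seen]
        batch = ready[:batch_size]
        seen.update(batch)
        return batch


ctrl = ToyController(required_fields=("prompt", "response"))
for i in range(3):
    ctrl.mark_produced(i, "prompt")
ctrl.mark_produced(0, "response")
ctrl.mark_produced(2, "response")

batch_a = ctrl.get_batch("compute_log_prob", batch_size=8)
batch_b = ctrl.get_batch("compute_values", batch_size=8)  # hypothetical second task
batch_a_again = ctrl.get_batch("compute_log_prob", batch_size=8)
```

Here both tasks receive samples 0 and 2, since sample 1 still lacks its `response` field, and a repeated request from the same task returns nothing new until more samples become ready.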
@@ -86,13 +91,13 @@ The interaction workflow of TransferQueue system is as follows:
 2. `TransferQueueController` scans the production and consumption metadata for each sample (row), and dynamically assembles a micro-batch metadata according to the load-balancing policy. This mechanism enables sample-level data scheduling.
 3. The process retrieves the actual data from distributed storage units using the metadata provided by the controller.
 
-To simplify the usage of TransferQueue, we have encapsulated this process into `AsyncTransferQueueClient` and `TransferQueueClient`. These clients provide both asynchronous and synchronous interfaces for data transfer, allowing users to easily integrate TransferQueue to their framework.
+To simplify the usage of TransferQueue, we have encapsulated this process into `AsyncTransferQueueClient` and `TransferQueueClient`. These clients provide both asynchronous and synchronous interfaces for data transfer, allowing users to easily integrate TransferQueue into their framework.
 
 
 > In the future, we will provide a `StreamingDataLoader` interface for disaggregated frameworks as discussed in [RFC#2662](https://github.com/volcengine/verl/discussions/2662). Leveraging this abstraction, each rank can automatically get its own data like `DataLoader` in PyTorch. The TransferQueue system will handle the underlying data scheduling and transfer logic caused by different parallelism strategies, significantly simplifying the design of disaggregated frameworks.
 
 
-<h2 id="show-cases">🔥 Show Cases</h2>
+<h2 id="show-cases">🔥 Showcases</h2>
 
 ### General Usage
 
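Note on the hunk above: step 3 of the workflow has a process fetch actual data from distributed storage units using metadata from the controller. The sketch below illustrates one way such a lookup could work. It does not use the real `TransferQueueClient` interface. The metadata layout (a list of `(unit_id, sample_index)` pairs), the `storage_units` dictionary, and the `fetch` helper are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical storage units: unit id -> {global sample index -> row of fields}
storage_units = {
    0: {0: {"prompt": "p0"}, 2: {"prompt": "p2"}},
    1: {1: {"prompt": "p1"}, 3: {"prompt": "p3"}},
}


def fetch(metadata, field_names):
    """Resolve (unit_id, sample_index) pairs into rows, preserving metadata order."""
    # Group the requested samples by storage unit so each unit is visited once.
    by_unit = defaultdict(list)
    for unit_id, sample in metadata:
        by_unit[unit_id].append(sample)

    rows = {}
    for unit_id, samples in by_unit.items():
        unit = storage_units[unit_id]
        for sample in samples:
            rows[sample] = {name: unit[sample][name] for name in field_names}

    # Return rows in the order the (assumed) controller listed them.
    return [rows[sample] for _, sample in metadata]


# Micro-batch metadata as a controller might assemble it (assumed shape).
micro_batch_meta = [(1, 3), (0, 0), (1, 1)]
batch = fetch(micro_batch_meta, ["prompt"])
```

In a real deployment the per-unit loop would be a network request to each storage unit rather than a dictionary lookup.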
@@ -146,7 +151,7 @@ We will soon release the Python package on PyPI.
 
 ### Build wheel package from source code
 
-The building and installation steps are the following:
+Follow these steps to build and install:
 1. Retrieve source code from GitHub repo
 ```bash
 git clone https://github.com/TransferQueue/TransferQueue/
@@ -167,11 +172,15 @@ The building and installation steps are the following:
 
 <h2 id="milestones"> 🛣️ RoadMap</h2>
 
+- [ ] Support data rewrite for partial rollout & agentic post-training
+- [x] Provide a general storage abstraction layer `TransferQueueStorageManager` to manage distributed storage units, which simplifies `Client` design and makes it possible to introduce different storage backends ([PR66](https://github.com/TransferQueue/TransferQueue/pull/66))
+- [ ] Provide a `KVStorageManager` to cover all the KV based storage backends
+- [ ] Support topic-based data partitioning to maintain train/val/test data simultaneously
 - [ ] Release the first stable version through PyPI
 - [ ] Support disaggregated framework (each rank retrieves its own data without going through a centralized node)
 - [ ] Provide a `StreamingDataLoader` interface for disaggregated framework
 - [ ] Support load-balancing and dynamic batching
-- [ ] Provide a general storage abstraction layer for different backends (e.g., [MoonCakeStore](https://github.com/kvcache-ai/Mooncake))
+- [ ] Support high-performance storage backends for RDMA transmission (e.g., [MoonCakeStore](https://github.com/kvcache-ai/Mooncake), [Ray Direct Transport](https://docs.ray.io/en/master/ray-core/direct-transport.html)...)
 - [ ] High-performance serialization and deserialization
 - [ ] More documentation, examples and tutorials
 