Commit 7bae3af

Merge pull request foundation-model-stack#80 from takeshi-yoshimura/tyos/scan-fix-01
0.3.2
2 parents 799a5de + 3ffe389 commit 7bae3af

12 files changed

Lines changed: 401 additions & 236 deletions

.github/workflows/test-paddle.yaml

Lines changed: 4 additions & 14 deletions
@@ -32,22 +32,12 @@ jobs:
3232
- name: Install Python dependencies
3333
run: |
3434
python -m pip install --upgrade pip
35-
tf_ver=4.52
36-
npy_ver=2.2
37-
torch_ver=2.7
38-
if [ "${{ matrix.python-version }}" = "3.10" ]; then
39-
torch_ver=2.3
40-
elif [ "${{ matrix.python-version }}" = "3.11" ]; then
41-
torch_ver=2.5
42-
elif [ "${{ matrix.python-version }}" = "3.12" ]; then
43-
torch_ver=2.6
44-
elif [ "${{ matrix.python-version }}" = "3.13" ]; then
45-
torch_ver=2.7
46-
fi
47-
pip install torch==${torch_ver} --index-url https://download.pytorch.org/whl/cpu # transformers requires torch
35+
tf_ver=5.0.0
36+
npy_spec="numpy>=2.2,<2.5"
37+
pip install torch==2.11.0 --index-url https://download.pytorch.org/whl/cpu # transformers requires torch
4838
pip install paddlepaddle==3.0.0
4939
# TOFIX: safetensors version (0.7.0 had a bug around fp8 in Dec 5 2025)
50-
pip install pytest pytest-cov setuptools_scm safetensors==0.6.2 transformers==${tf_ver} numpy==${npy_ver}
40+
pip install pytest pytest-cov setuptools_scm safetensors==0.6.2 transformers==${tf_ver} "${npy_spec}"
5141
- name: Build Package
5242
run: |
5343
pip install .
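
The `npy_spec` value above replaces an exact NumPy pin with a range. As a rough illustration of which releases `numpy>=2.2,<2.5` admits, the sketch below compares release tuples by hand. The helper names `parse_release` and `in_range` are hypothetical, and only plain numeric versions are handled; pip itself applies the full version-specifier rules (pre-releases, epochs, and so on).

```python
def parse_release(version: str) -> tuple[int, ...]:
    """Turn a plain numeric version such as '2.4.1' into (2, 4, 1)."""
    return tuple(int(part) for part in version.split("."))


def in_range(version: str, lower: str = "2.2", upper: str = "2.5") -> bool:
    """Return True if lower <= version < upper (numeric releases only)."""
    parts = [parse_release(v) for v in (lower, version, upper)]
    # Pad with zeros so that '2.2' and '2.2.0' compare as equal.
    width = max(len(p) for p in parts)
    lo, ver, hi = (p + (0,) * (width - len(p)) for p in parts)
    return lo <= ver < hi


if __name__ == "__main__":
    for candidate in ["2.1.3", "2.2", "2.4.1", "2.5.0"]:
        print(candidate, in_range(candidate))
```

The upper bound is exclusive, so any 2.5 release falls outside the range while every 2.2.x through 2.4.x release is accepted.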

.github/workflows/test-torch.yaml

Lines changed: 6 additions & 18 deletions
@@ -32,22 +32,10 @@ jobs:
3232
- name: Install Python dependencies
3333
run: |
3434
python -m pip install --upgrade pip
35-
tf_ver=4.52
36-
npy_ver=2.2
37-
torch_ver=2.7
38-
if [ "${{ matrix.python-version }}" = "3.10" ]; then
39-
torch_ver=2.3
40-
elif [ "${{ matrix.python-version }}" = "3.11" ]; then
41-
torch_ver=2.5
42-
elif [ "${{ matrix.python-version }}" = "3.12" ]; then
43-
torch_ver=2.6
44-
elif [ "${{ matrix.python-version }}" = "3.13" ]; then
45-
torch_ver=2.7
46-
elif [ "${{ matrix.python-version }}" = "3.14" ]; then
47-
torch_ver=2.11
48-
fi
49-
pip install torch==${torch_ver} --index-url https://download.pytorch.org/whl/cpu
50-
pip install pytest pytest-cov setuptools_scm safetensors transformers==${tf_ver} numpy==${npy_ver}
35+
tf_ver=5.0.0
36+
npy_spec="numpy>=2.2,<2.5"
37+
pip install torch==2.11.0 --index-url https://download.pytorch.org/whl/cpu
38+
pip install pytest pytest-cov setuptools_scm safetensors transformers==${tf_ver} "${npy_spec}"
5139
- name: Build package
5240
run: |
5341
pip install .
@@ -89,8 +77,8 @@ jobs:
8977
shell: bash
9078
run: |
9179
python -m pip install --upgrade pip
92-
pip install torch==2.7 --index-url https://download.pytorch.org/whl/cpu
93-
pip install pytest pytest-cov setuptools_scm safetensors transformers==4.52 numpy==2.2
80+
pip install torch==2.11.0 --index-url https://download.pytorch.org/whl/cpu
81+
pip install pytest pytest-cov setuptools_scm safetensors transformers==5.0.0 "numpy>=2.2,<2.5"
9482
pip install .
9583
- name: Run tests
9684
shell: bash

README.md

Lines changed: 23 additions & 110 deletions
Original file line numberDiff line numberDiff line change
@@ -1,127 +1,40 @@
1-
fastsafetensors is an efficient safetensors model loader.
2-
This library is tested with Python 3.10-3.13 and PyTorch 2.1-2.7.
1+
fastsafetensors
2+
================
33

4-
Disclaimer: This repository contains a research prototype. It should be used with caution.
4+
fastsafetensors is an efficient safetensors loader. If you develop your own code that loads large safetensors files, you can try the fastsafetensors APIs (see [docs](./docs/overview.md)). For example, vLLM and SGLang accept a `--load-format fastsafetensors` command-line argument that speeds up their initialization.
55

6-
# Features
6+
This library supports Linux/CUDA, ROCm without GDS, Windows, [3FS](https://github.com/deepseek-ai/3fs), and unified-memory systems such as DGX Spark, among others. We welcome further platform- or storage-specific optimizations, which can be contributed as new [copier backends](fastsafetensors/copier/). Our CI tests Python 3.10-3.14 with PyTorch 2.11.0.
77

8-
We introduced three major features to optimize model loading performance:
9-
1. Batched, lazy tensor instantiation.
10-
2. GPU offloading for sharding, type conversions, and device pointer alignment.
11-
3. GPU Direct Storage enablement for file loading from storage to GPU memory.
8+
# Performance Highlights
129

13-
A major design difference from the original safetensors file loader is that fastsafetensors does *NOT* use `mmap`.
14-
The original loader loads tensors on demand from memory-mapped files,
15-
but unfortunately, it cannot fully utilize high-throughput I/O such as NVMe SSDs.
16-
Therefore, we asynchronously transfer files in parallel to saturate storage throughput.
17-
The loader then lazily instantiates tensors in GPU device memory with DLPack.
10+
Performance highlights from the [CLOUD 2025 paper](https://arxiv.org/abs/2505.23072) and benchmark docs:
11+
- Standalone model loading was **4.8x-7.5x faster** than the default `safetensors` deserializer on Llama, Falcon, and Bloom models, and reached **26.4 GB/s** NVMe read throughput for Llama-70B on four GPUs with GDS.
12+
- In the paper's vLLM integration experiment, startup time dropped from **12.39s to 4.74s** for Llama-2-13B on 4x L40S GPUs, and from **16.04s to 6.88s** on 1x A100.
13+
- On AMD ROCm without GDS, the documented `nogds` path reached **6.02 GB/s** for GPT-2 Medium versus **1.28 GB/s** with `mmap` (**4.7x** throughput), and **2.62 GB/s** for GPT-2 versus **1.01 GB/s** with `mmap` (**2.6x** throughput). See the [report](./docs/amd-perf.md) for more details.
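
The ratios in the highlights above can be re-derived from the raw numbers. The short sketch below does so; the two startup-time ratios are computed here and are not quoted in the source, and the dictionary labels are only shorthand for the configurations named in the bullets.

```python
# (baseline, fastsafetensors) pairs copied from the highlights above.
startup_seconds = {
    "vLLM, Llama-2-13B, 4x L40S": (12.39, 4.74),
    "vLLM, 1x A100": (16.04, 6.88),
}
throughput_gb_per_s = {
    "ROCm nogds, GPT-2 Medium": (1.28, 6.02),
    "ROCm nogds, GPT-2": (1.01, 2.62),
}


def speedup_from_times(before: float, after: float) -> float:
    """Lower is better for times, so the speedup is before / after."""
    return before / after


def speedup_from_throughput(before: float, after: float) -> float:
    """Higher is better for throughput, so the speedup is after / before."""
    return after / before


if __name__ == "__main__":
    for label, (before, after) in startup_seconds.items():
        print(f"{label}: {speedup_from_times(before, after):.2f}x")
    for label, (before, after) in throughput_gb_per_s.items():
        print(f"{label}: {speedup_from_throughput(before, after):.2f}x")
```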
1814

19-
Another design change is to offload sharding and other tensor manipulations to GPUs.
20-
The original loader provides slicing for sharding in user programs before copying to device memory. However, it incurs high CPU usage for host memory accesses.
21-
Therefore, we introduce special APIs to run sharding with `torch.distributed` collective operations such as `broadcast` and `scatter`.
22-
The offloading is also applied to other tensor manipulations such as type conversions.
23-
24-
The above two designs can be naturally extended to utilize device-to-device data transfers with GPU Direct Storage.
25-
The technology helps minimize copy overheads from NVMe SSDs to GPU memory by bypassing host CPU and memory.
26-
27-
## Basic API usage
28-
29-
`SafeTensorsFileLoader` is a low-level entrypoint. To use it, pass either `SingleGroup()` for simple inference or `ProcessGroup()` (from `torch.distributed`) for tensor-parallel inference. The loader supports both CPU and CUDA devices, with optional GPU Direct Storage (GDS) support. You can specify the device and GDS settings using the `device` and `nogds` arguments, respectively. Note that if GDS is not available, the loader will fail to open files when `nogds=False`. For more information on enabling GDS, please refer to the NVIDIA documentation.
30-
31-
After creating a `SafeTensorsFileLoader` instance, first map target files and a rank using the `.add_filenames()` method. Then, call `.copy_file_to_device()` to trigger the actual file copies on aggregated GPU memory fragments and directly instantiate a group of tensors. Once the files are loaded, you can retrieve a tensor using the `.get_tensor()` method. Additionally, you can obtain sharded tensors by `.get_sharded()`, which internally runs collective operations in `torch.distributed`.
32-
33-
Important: To release the GPU memory allocated for tensors, you must explicitly call the `.close()` method. This is because fastsafetensors allows multiple tensors to share a limited number of GPU memory fragments. As a result, it is the user's responsibility to ensure that all tensors are properly released before calling `.close()`, which will then safely release the underlying GPU memory.
34-
35-
`fastsafe_open` is an easier entrypoint. You can force GDS off and run in fallback mode if `nogds=True`. However, users must be aware of the above tricky memory management model, which should be fixed in future releases.
36-
37-
```python
38-
with fastsafe_open(filenames=[filename], nogds=True, device="cpu", debug_log=True) as f:
39-
    for key in f.get_keys():
40-
        t = f.get_tensor(key).clone().detach() # clone if t is used outside
41-
```
42-
43-
## Configuration
44-
45-
`AutoLoader` supports file-based configuration for loader type, pipeline mode, copy settings, and more.
46-
See [Configuration Guide](./docs/configuration.md) for defaults, examples, and all available options.
47-
48-
## Development
49-
50-
### Pre-commit Hooks
51-
52-
Our CI workflow checks code formatting and linting with Python 3.13. Therefore, we recommend testing your code with Python 3.13 and running the following pre-commit hooks before contributing your code.
53-
54-
To set up:
55-
56-
1. Install development dependencies:
57-
```bash
58-
pip install -e ".[dev]"
59-
```
60-
61-
2. Install pre-commit hooks:
62-
```bash
63-
pre-commit install
64-
```
65-
66-
Now, every time you commit, the following checks will run automatically:
67-
- **black**: Code formatting
68-
- **isort**: Import sorting
69-
- **flake8**: Basic linting (syntax errors, undefined names)
70-
- **mypy**: Type checking
71-
- **trailing-whitespace**: Remove trailing whitespace
72-
- **end-of-file-fixer**: Ensure files end with a newline
73-
- **check-yaml**: Validate YAML files
74-
- **check-toml**: Validate TOML files
75-
- **check-merge-conflict**: Detect merge conflict markers
76-
- **debug-statements**: Detect debug statements
77-
78-
To manually run pre-commit on all files:
79-
```bash
80-
pre-commit run --all-files
81-
```
82-
83-
To skip pre-commit hooks (not recommended):
84-
```bash
85-
git commit --no-verify
86-
```
87-
88-
## Code of Conduct
89-
90-
Please refer to [Foundation Model Stack Community Code of Conduct](https://github.com/foundation-model-stack/foundation-model-stack/blob/main/code-of-conduct.md).
91-
92-
## Publication
93-
94-
Takeshi Yoshimura, Tatsuhiro Chiba, Manish Sethi, Daniel Waddington, Swaminathan Sundararaman. (2025) Speeding up Model Loading with fastsafetensors [arXiv:2505.23072](https://arxiv.org/abs/2505.23072) and IEEE CLOUD 2025.
95-
96-
## For NVIDIA
97-
98-
### Install from PyPI
99-
100-
See https://pypi.org/project/fastsafetensors/
15+
# Quick Start
10116

10217
```bash
10318
pip install fastsafetensors
19+
pip install vllm # for quick demo
20+
vllm serve Qwen/Qwen3-0.6B --load-format fastsafetensors
21+
...
22+
Loading safetensors using Fastsafetensor loader: 0% Completed | 0/1 [00:00<?, ?it/s]
23+
Loading safetensors using Fastsafetensor loader: 100% Completed | 1/1 [00:00<00:00, 1.23it/s]
10424
```
10525

106-
### Install from source
26+
# Design Details
10727

108-
```bash
109-
pip install .
110-
```
28+
See [Overview](./docs/overview.md) for features, basic API usage, and configuration.
11129

112-
## For ROCm
30+
# Code of Conduct
11331

114-
On ROCm, there is no GDS-equivalent support, so fastsafetensors only supports `nogds=True` mode.
115-
The performance gain example can be found at [amd-perf.md](./docs/amd-perf.md).
32+
Please refer to [Foundation Model Stack Community Code of Conduct](https://github.com/foundation-model-stack/foundation-model-stack/blob/main/code-of-conduct.md).
11633

117-
### Install from GitHub Source
34+
# Development
11835

119-
```bash
120-
pip install git+https://github.com/foundation-model-stack/fastsafetensors.git
121-
```
36+
See [Development](./docs/development.md).
12237

123-
### Install from source
38+
# Publication
12439

125-
```bash
126-
pip install .
127-
```
40+
Takeshi Yoshimura, Tatsuhiro Chiba, Manish Sethi, Daniel Waddington, Swaminathan Sundararaman. (2025) Speeding up Model Loading with fastsafetensors [arXiv:2505.23072](https://arxiv.org/abs/2505.23072) and IEEE CLOUD 2025.

docs/development.md

Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,81 @@
1+
Development
2+
===========
3+
4+
This project requires all commits to comply with the Developer Certificate of Origin (DCO). We can only accept contributions whose commits include a valid
5+
`Signed-off-by` line.
6+
7+
To sign off a commit, use:
8+
9+
```bash
10+
git commit -s
11+
```
12+
13+
Each commit in a pull request must include a sign-off line such as:
14+
15+
```
16+
Signed-off-by: Your Name <your.email@example.com>
17+
```
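
A local check for the sign-off line can catch a missing `-s` before pushing. The sketch below is only an illustration: `has_sign_off` and the regular expression are assumptions, and the DCO check that runs on pull requests may apply different or stricter rules.

```python
import re

# One "Signed-off-by: Name <email>" trailer on its own line.
SIGN_OFF = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def has_sign_off(commit_message: str) -> bool:
    """Return True if the message contains a well-formed sign-off line."""
    return SIGN_OFF.search(commit_message) is not None


if __name__ == "__main__":
    signed = "Fix scan\n\nSigned-off-by: Your Name <your.email@example.com>\n"
    unsigned = "Fix scan\n"
    print(has_sign_off(signed), has_sign_off(unsigned))
```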
18+
19+
# Tests
20+
21+
This repository runs its CI tests in CPU-only mode. They start automatically when you open a PR. We only accept changes that pass these tests, the lint checks, and the DCO check.
22+
23+
You can also run the tests in your local environment with the Makefile targets:
24+
25+
```
26+
make unittest
27+
make unittest-parallel
28+
make vllm
29+
```
30+
31+
# Pre-commit Hooks
32+
33+
Our CI workflow checks code formatting and linting with Python 3.13. Therefore, we recommend testing your code with Python 3.13 and running the following pre-commit hooks before contributing your code.
34+
35+
To set up:
36+
37+
1. Install development dependencies:
38+
```bash
39+
pip install -e ".[dev]"
40+
```
41+
42+
2. Install pre-commit hooks:
43+
```bash
44+
pre-commit install
45+
```
46+
47+
Now, every time you commit, the following checks will run automatically:
48+
- **black**: Code formatting
49+
- **isort**: Import sorting
50+
- **flake8**: Basic linting (syntax errors, undefined names)
51+
- **mypy**: Type checking
52+
- **trailing-whitespace**: Remove trailing whitespace
53+
- **end-of-file-fixer**: Ensure files end with a newline
54+
- **check-yaml**: Validate YAML files
55+
- **check-toml**: Validate TOML files
56+
- **check-merge-conflict**: Detect merge conflict markers
57+
- **debug-statements**: Detect debug statements
58+
59+
To manually run pre-commit on all files:
60+
```bash
61+
pre-commit run --all-files
62+
```
63+
64+
To skip pre-commit hooks (not recommended):
65+
```bash
66+
git commit --no-verify
67+
```
68+
69+
# Build & install
70+
71+
## Build & install from GitHub Source
72+
73+
```bash
74+
pip install git+https://github.com/foundation-model-stack/fastsafetensors.git
75+
```
76+
77+
## Build & install from source
78+
79+
```bash
80+
pip install .
81+
```

docs/overview.md

Lines changed: 71 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,71 @@
1+
Overview
2+
=========
3+
4+
# Features
5+
6+
Fastsafetensors introduces three major features to optimize model loading performance:
7+
1. Batched, lazy tensor instantiation.
8+
2. GPU offloading for sharding, type conversions, and device pointer alignment.
9+
3. GPU Direct Storage enablement for file loading from storage to GPU memory.
10+
11+
A major design difference from the original safetensors file loader is that fastsafetensors does *NOT* use `mmap`.
12+
The original loader loads tensors on demand from memory-mapped files,
13+
but unfortunately, it cannot fully utilize high-throughput I/O such as NVMe SSDs.
14+
Therefore, we asynchronously transfer files in parallel to saturate storage throughput.
15+
The loader then lazily instantiates tensors in GPU device memory with DLPack.
16+
17+
Another design change is to offload sharding and other tensor manipulations to GPUs.
18+
The original loader provides slicing for sharding in user programs before copying to device memory. However, it incurs high CPU usage for host memory accesses.
19+
Therefore, we introduce special APIs to run sharding with `torch.distributed` collective operations such as `broadcast` and `scatter`.
20+
The offloading is also applied to other tensor manipulations such as type conversions.
21+
22+
The above two designs can be naturally extended to utilize device-to-device data transfers with GPU Direct Storage.
23+
The technology helps minimize copy overheads from NVMe SSDs to GPU memory by bypassing host CPU and memory.
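
To make the contrast with on-demand `mmap` access concrete, here is a minimal standard-library sketch of keeping several positioned reads in flight into one preallocated buffer. It is not the library's implementation, which this document does not show: `parallel_read`, the chunk size, and the worker count are assumptions, everything stays in host memory, and `os.pread` is POSIX-only.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def parallel_read(path: str, chunk_size: int = 1 << 20, workers: int = 4) -> bytearray:
    """Read a whole file with several positioned reads in flight at once."""
    size = os.path.getsize(path)
    buffer = bytearray(size)  # single preallocated destination
    view = memoryview(buffer)
    fd = os.open(path, os.O_RDONLY)

    def read_chunk(offset: int) -> None:
        length = min(chunk_size, size - offset)
        done = 0
        while done < length:  # pread may return fewer bytes than requested
            data = os.pread(fd, length - done, offset + done)
            if not data:
                raise IOError("unexpected end of file")
            view[offset + done:offset + done + len(data)] = data
            done += len(data)

    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() waits for every chunk and re-raises worker exceptions.
            list(pool.map(read_chunk, range(0, size, chunk_size)))
    finally:
        view.release()
        os.close(fd)
    return buffer


if __name__ == "__main__":
    payload = os.urandom(3 * (1 << 20) + 123)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    try:
        print(bytes(parallel_read(tmp.name)) == payload)
    finally:
        os.unlink(tmp.name)
```

Each worker writes to a disjoint slice of the buffer, so no locking is needed.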
24+
25+
# Basic API usage
26+
27+
`SafeTensorsFileLoader` is a low-level entrypoint. To use it, pass either `SingleGroup()` for simple inference or `ProcessGroup()` (from `torch.distributed`) for tensor-parallel inference. The loader supports both CPU and CUDA devices, with optional GPU Direct Storage (GDS) support. You can specify the device and GDS settings using the `device` and `nogds` arguments, respectively. Note that if GDS is not available, the loader will fail to open files when `nogds=False`. For more information on enabling GDS, please refer to the NVIDIA documentation.
28+
29+
After creating a `SafeTensorsFileLoader` instance, first map target files and a rank using the `.add_filenames()` method. Then, call `.copy_file_to_device()` to trigger the actual file copies on aggregated GPU memory fragments and directly instantiate a group of tensors. Once the files are loaded, you can retrieve a tensor using the `.get_tensor()` method. Additionally, you can obtain sharded tensors by `.get_sharded()`, which internally runs collective operations in `torch.distributed`.
30+
31+
Important: To release the GPU memory allocated for tensors, you must explicitly call the `.close()` method. This is because fastsafetensors allows multiple tensors to share a limited number of GPU memory fragments. As a result, it is the user's responsibility to ensure that all tensors are properly released before calling `.close()`, which will then safely release the underlying GPU memory.
32+
33+
`fastsafe_open` is an easier entrypoint. Passing `nogds=True` forces GDS off and runs the loader in fallback mode. Users still need to keep the memory management model described above in mind, which should be fixed in future releases.
34+
35+
```python
36+
with fastsafe_open(filenames=[filename], nogds=True, device="cpu", debug_log=True) as f:
37+
    for key in f.get_keys():
38+
        t = f.get_tensor(key).clone().detach() # clone if t is used outside
39+
```
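
The `.clone()` in the snippet above matters because the returned tensors alias memory that is released when the loader closes. The standard-library analogy below shows the same hazard with a `bytearray` and `memoryview`; `FragmentPool` is a hypothetical toy class, not part of fastsafetensors, and "releasing" is simulated by zeroing the buffer.

```python
class FragmentPool:
    """Toy stand-in for a loader that hands out views into one shared buffer."""

    def __init__(self, payload: bytes) -> None:
        self._buffer = bytearray(payload)

    def get_view(self, start: int, stop: int) -> memoryview:
        # No copy is made: the view aliases the pool's buffer.
        return memoryview(self._buffer)[start:stop]

    def close(self) -> None:
        # Simulate releasing the memory by overwriting it in place.
        for i in range(len(self._buffer)):
            self._buffer[i] = 0


pool = FragmentPool(b"weights!")
view = pool.get_view(0, 7)
copy = bytes(view)  # analogous to .clone(): owns its own storage
pool.close()

if __name__ == "__main__":
    print(bytes(view))  # contents are gone after close()
    print(copy)         # the copy is unaffected
```

With real GPU memory the stale view would not read as zeros; it would point at freed memory, which is why the copy has to happen before `.close()`.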
40+
41+
# AutoLoader configuration
42+
43+
`AutoLoader` supports file-based configuration for loader type, pipeline mode, copy settings, and more.
44+
See [Configuration Guide](./configuration.md) for defaults, examples, and all available options.
45+
46+
# ROCm
47+
48+
On ROCm, there is no GDS-equivalent support, so fastsafetensors only supports `nogds=True` mode.
49+
The performance gain example can be found at [amd-perf.md](./amd-perf.md).
50+
51+
# Windows
52+
53+
From [PR#72](https://github.com/foundation-model-stack/fastsafetensors/pull/72):
54+
55+
On Linux, GDS uses cuFile to DMA data directly from NVMe into GPU memory. Windows has no cuFile — instead, it offers [DirectStorage](https://devblogs.microsoft.com/directx/directstorage-api-available-on-pc/), a DirectX 12 API designed for the same purpose.
56+
57+
Since DirectStorage writes into D3D12 resources (not CUDA buffers), we bridge the two APIs through CUDA external memory interop:
58+
59+
```
60+
NVMe -> [DirectStorage] -> D3D12 shared buffer -> [cudaImportExternalMemory] -> CUDA device pointer
61+
```
62+
63+
The key steps are:
64+
65+
1. Create a D3D12 committed resource with D3D12_HEAP_FLAG_SHARED so it can be exported
66+
2. DirectStorage reads from NVMe into this D3D12 buffer via IDStorageQueue
67+
3. Export the D3D12 resource as an NT handle via CreateSharedHandle
68+
4. Import into CUDA via cudaImportExternalMemory + cudaExternalMemoryGetMappedBuffer to get a regular CUDA device pointer
69+
5. Synchronize using a D3D12 fence imported as a cudaExternalSemaphore
70+
71+
All DirectStorage, D3D12, and DXGI libraries are loaded at runtime via LoadLibrary/GetProcAddress — no link-time SDK dependency on DirectStorage is required.
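
The runtime-loading pattern in the last sentence can be illustrated in Python with `ctypes`, which resolves a library and its symbols at run time in much the same way as LoadLibrary/GetProcAddress. This is only an analogy under stated assumptions: `load_optional` is a hypothetical helper, and the C runtime stands in for the DirectStorage and D3D12 libraries simply because it is present on most systems.

```python
import ctypes
import ctypes.util
import sys


def load_optional(name: str):
    """Return a handle to the shared library `name`, or None if unavailable."""
    path = ctypes.util.find_library(name)
    if path is None:
        return None
    try:
        return ctypes.CDLL(path)
    except OSError:
        return None


# The library is looked up when this line runs, not when the program is built.
libc = load_optional("msvcrt" if sys.platform == "win32" else "c")
missing = load_optional("surely_not_installed_library_xyz")

if __name__ == "__main__":
    if libc is not None:
        # Attribute access resolves the symbol by name at run time.
        libc.strlen.argtypes = [ctypes.c_char_p]
        libc.strlen.restype = ctypes.c_size_t
        print(libc.strlen(b"fastsafetensors"))
    print(missing is None)
```

Because a missing library yields `None` rather than a load-time failure, the caller can fall back to another code path when the optional dependency is absent.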
