|
1 | | -fastsafetensors is an efficient safetensors model loader. |
2 | | -This library is tested with Python 3.10-3.13 and PyTorch 2.1-2.7. |
| 1 | +fastsafetensors |
| 2 | +================ |
3 | 3 |
|
4 | | -Disclaimer: This repository contains a research prototype. It should be used with caution. |
| 4 | +fastsafetensors is an efficient safetensors loader. If you write your own code to load large safetensors files, you can use the fastsafetensors APIs (see [docs](./docs/overview.md)). For example, vLLM and SGLang provide a `--load-format fastsafetensors` command-line argument to speed up their initialization.
5 | 5 |
|
6 | | -# Features |
| 6 | +This library supports platforms including Linux/CUDA, ROCm without GDS, Windows, [3FS](https://github.com/deepseek-ai/3fs), and unified-memory systems such as DGX Spark. We welcome further platform- and storage-specific optimizations, which can be contributed as new [copier backends](fastsafetensors/copier/). Our CI tests Python 3.10-3.14 with PyTorch 2.11.0.
7 | 7 |
|
8 | | -We introduced three major features to optimize model loading performance: |
9 | | -1. Batched, lazy tensor instantiation. |
10 | | -2. GPU offloading for sharding, type conversions, and device pointer alignment. |
11 | | -3. GPU Direct Storage enablement for file loading from storage to GPU memory. |
| 8 | +# Performance Highlights |
12 | 9 |
|
13 | | -A major design difference from the original safetensors file loader is that fastsafetensors does *NOT* use `mmap`. |
14 | | -The original loader loads tensors on demand from memory-mapped files, |
15 | | -but unfortunately, it cannot fully utilize high-throughput I/O such as NVMe SSDs. |
16 | | -Therefore, we asynchronously transfer files in parallel to saturate storage throughput. |
17 | | -The loader then lazily instantiates tensors in GPU device memory with DLPack. |
| 10 | +Performance highlights from the [CLOUD 2025 paper](https://arxiv.org/abs/2505.23072) and benchmark docs: |
| 11 | +- Standalone model loading was **4.8x-7.5x faster** than the default `safetensors` deserializer on Llama, Falcon, and Bloom models, and reached **26.4 GB/s** NVMe read throughput for Llama-70B on four GPUs with GDS. |
| 12 | +- In the paper's vLLM integration experiment, startup time dropped from **12.39s to 4.74s** for Llama-2-13B on 4x L40S GPUs, and from **16.04s to 6.88s** on 1x A100. |
| 13 | +- On AMD ROCm without GDS, the documented `nogds` path reached **6.02 GB/s** for GPT-2 Medium versus **1.28 GB/s** with `mmap` (**4.7x** throughput), and **2.62 GB/s** for GPT-2 versus **1.01 GB/s** with `mmap` (**2.6x** throughput). See the [report](./docs/amd-perf.md) for more details. |
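The multipliers in the bullets above can be reproduced from the raw figures. The following is a minimal sketch, not part of the library: the helper name `speedup` is hypothetical, and the numbers are copied from the ROCm and vLLM bullets. The paper does not state the vLLM startup ratios; they are derived here only for illustration.

```python
def speedup(baseline: float, improved: float, higher_is_better: bool) -> float:
    """Return how many times better `improved` is than `baseline`."""
    if baseline <= 0 or improved <= 0:
        raise ValueError("measurements must be positive")
    # Throughput improves by growing; latency improves by shrinking.
    return improved / baseline if higher_is_better else baseline / improved


# Throughput in GB/s (higher is better), from the ROCm bullet.
rocm_gpt2_medium = speedup(1.28, 6.02, higher_is_better=True)
rocm_gpt2 = speedup(1.01, 2.62, higher_is_better=True)

# Startup time in seconds (lower is better), from the vLLM bullet.
vllm_l40s = speedup(12.39, 4.74, higher_is_better=False)
vllm_a100 = speedup(16.04, 6.88, higher_is_better=False)

for label, value in [
    ("ROCm GPT-2 Medium throughput", rocm_gpt2_medium),
    ("ROCm GPT-2 throughput", rocm_gpt2),
    ("vLLM startup, 4x L40S", vllm_l40s),
    ("vLLM startup, 1x A100", vllm_a100),
]:
    print(f"{label}: {value:.2f}x")
```

The two ROCm ratios round to the 4.7x and 2.6x quoted above.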
18 | 14 |
|
19 | | -Another design change is to offload sharding and other tensor manipulations to GPUs. |
20 | | -The original loader provides slicing for sharding in user programs before copying to device memory. However, it incurs high CPU usage for host memory accesses. |
21 | | -Therefore, we introduce special APIs to run sharding with `torch.distributed` collective operations such as `broadcast` and `scatter`. |
22 | | -The offloading is also applied to other tensor manipulations such as type conversions. |
23 | | - |
24 | | -The above two designs can be naturally extended to utilize device-to-device data transfers with GPU Direct Storage. |
25 | | -The technology helps minimize copy overheads from NVMe SSDs to GPU memory by bypassing host CPU and memory. |
26 | | - |
27 | | -## Basic API usage |
28 | | - |
29 | | -`SafeTensorsFileLoader` is a low-level entrypoint. To use it, pass either `SingleGroup()` for simple inference or `ProcessGroup()` (from `torch.distributed`) for tensor-parallel inference. The loader supports both CPU and CUDA devices, with optional GPU Direct Storage (GDS) support. You can specify the device and GDS settings using the `device` and `nogds` arguments, respectively. Note that if GDS is not available, the loader will fail to open files when `nogds=False`. For more information on enabling GDS, please refer to the NVIDIA documentation. |
30 | | - |
31 | | -After creating a `SafeTensorsFileLoader` instance, first map target files and a rank using the `.add_filenames()` method. Then, call `.copy_file_to_device()` to trigger the actual file copies on aggregated GPU memory fragments and directly instantiate a group of tensors. Once the files are loaded, you can retrieve a tensor using the `.get_tensor()` method. Additionally, you can obtain sharded tensors by `.get_sharded()`, which internally runs collective operations in `torch.distributed`. |
32 | | - |
33 | | -Important: To release the GPU memory allocated for tensors, you must explicitly call the `.close()` method. This is because fastsafetensors allows multiple tensors to share a limited number of GPU memory fragments. As a result, it is the user's responsibility to ensure that all tensors are properly released before calling `.close()`, which will then safely release the underlying GPU memory. |
34 | | - |
35 | | -`fastsafe_open` is an easier entrypoint. You can force GDS off and run in fallback mode if `nogds=True`. However, users must be aware of the above tricky memory management model, which should be fixed in future releases. |
36 | | - |
37 | | -```python |
38 | | -with fastsafe_open(filenames=[filename], nogds=True, device="cpu", debug_log=True) as f: |
39 | | - for key in f.get_keys(): |
40 | | - t = f.get_tensor(key).clone().detach() # clone if t is used outside |
41 | | -``` |
42 | | - |
43 | | -## Configuration |
44 | | - |
45 | | -`AutoLoader` supports file-based configuration for loader type, pipeline mode, copy settings, and more. |
46 | | -See [Configuration Guide](./docs/configuration.md) for defaults, examples, and all available options. |
47 | | - |
48 | | -## Development |
49 | | - |
50 | | -### Pre-commit Hooks |
51 | | - |
52 | | -Our CI workflow checks code formatting and linting with Python 3.13. Therefore, we recommend testing your code with Python 3.13 and running the following pre-commit hooks before contributing your code. |
53 | | - |
54 | | -To set up: |
55 | | - |
56 | | -1. Install development dependencies: |
57 | | -```bash |
58 | | -pip install -e ".[dev]" |
59 | | -``` |
60 | | - |
61 | | -2. Install pre-commit hooks: |
62 | | -```bash |
63 | | -pre-commit install |
64 | | -``` |
65 | | - |
66 | | -Now, every time you commit, the following checks will run automatically: |
67 | | -- **black**: Code formatting |
68 | | -- **isort**: Import sorting |
69 | | -- **flake8**: Basic linting (syntax errors, undefined names) |
70 | | -- **mypy**: Type checking |
71 | | -- **trailing-whitespace**: Remove trailing whitespace |
72 | | -- **end-of-file-fixer**: Ensure files end with a newline |
73 | | -- **check-yaml**: Validate YAML files |
74 | | -- **check-toml**: Validate TOML files |
75 | | -- **check-merge-conflict**: Detect merge conflict markers |
76 | | -- **debug-statements**: Detect debug statements |
77 | | - |
78 | | -To manually run pre-commit on all files: |
79 | | -```bash |
80 | | -pre-commit run --all-files |
81 | | -``` |
82 | | - |
83 | | -To skip pre-commit hooks (not recommended): |
84 | | -```bash |
85 | | -git commit --no-verify |
86 | | -``` |
87 | | - |
88 | | -## Code of Conduct |
89 | | - |
90 | | -Please refer to [Foundation Model Stack Community Code of Conduct](https://github.com/foundation-model-stack/foundation-model-stack/blob/main/code-of-conduct.md). |
91 | | - |
92 | | -## Publication |
93 | | - |
94 | | -Takeshi Yoshimura, Tatsuhiro Chiba, Manish Sethi, Daniel Waddington, Swaminathan Sundararaman. (2025) Speeding up Model Loading with fastsafetensors [arXiv:2505.23072](https://arxiv.org/abs/2505.23072) and IEEE CLOUD 2025. |
95 | | - |
96 | | -## For NVIDIA |
97 | | - |
98 | | -### Install from PyPI |
99 | | - |
100 | | -See https://pypi.org/project/fastsafetensors/ |
| 15 | +# Quick Start |
101 | 16 |
|
102 | 17 | ```bash |
103 | 18 | pip install fastsafetensors |
| 19 | +pip install vllm # for quick demo |
| 20 | +vllm serve Qwen/Qwen3-0.6B --load-format fastsafetensors |
| 21 | +... |
| 22 | +Loading safetensors using Fastsafetensor loader: 0% Completed | 0/1 [00:00<?, ?it/s] |
| 23 | +Loading safetensors using Fastsafetensor loader: 100% Completed | 1/1 [00:00<00:00, 1.23it/s] |
104 | 24 | ``` |
105 | 25 |
|
106 | | -### Install from source |
| 26 | +# Design Details |
107 | 27 |
|
108 | | -```bash |
109 | | -pip install . |
110 | | -``` |
| 28 | +See [Overview](./docs/overview.md) for features, basic API usage, and configuration. |
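For orientation, here is a minimal sketch of the basic loading pattern, based on the `fastsafe_open` snippet from the previous README. The helper `load_all_tensors` and its `opener` parameter are hypothetical, introduced only so the sketch runs without a GPU or the library installed. The keyword arguments `filenames`, `nogds`, and `device` and the methods `get_keys()` and `get_tensor()` are taken from that earlier snippet and may differ in current releases, so check the Overview before relying on them.

```python
from typing import Any, Callable, Dict, List


def load_all_tensors(
    filenames: List[str],
    opener: Callable[..., Any],
    device: str = "cpu",
    nogds: bool = True,
) -> Dict[str, Any]:
    """Copy every tensor out of the given files using a fastsafe_open-style opener.

    `opener` is assumed to behave like fastsafetensors' `fastsafe_open`:
    a context manager whose handle exposes `get_keys()` and `get_tensor(key)`.
    """
    tensors: Dict[str, Any] = {}
    with opener(filenames=filenames, nogds=nogds, device=device) as f:
        for key in f.get_keys():
            # Clone before leaving the context: the loader's buffers are
            # shared between tensors and are released when the context exits.
            tensors[key] = f.get_tensor(key).clone().detach()
    return tensors
```

With the library installed, passing `fastsafe_open` (imported from `fastsafetensors`) as `opener` together with a list of safetensors paths should return a dictionary of independent tensors; `nogds=True` selects the fallback path that does not require GPU Direct Storage.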
111 | 29 |
|
112 | | -## For ROCm |
| 30 | +# Code of Conduct |
113 | 31 |
|
114 | | -On ROCm, there is no GDS-equivalent support, so fastsafetensors only supports `nogds=True` mode. |
115 | | -The performance gain example can be found at [amd-perf.md](./docs/amd-perf.md). |
| 32 | +Please refer to [Foundation Model Stack Community Code of Conduct](https://github.com/foundation-model-stack/foundation-model-stack/blob/main/code-of-conduct.md). |
116 | 33 |
|
117 | | -### Install from GitHub Source |
| 34 | +# Development |
118 | 35 |
|
119 | | -```bash |
120 | | -pip install git+https://github.com/foundation-model-stack/fastsafetensors.git |
121 | | -``` |
| 36 | +See [Development](./docs/development.md). |
122 | 37 |
|
123 | | -### Install from source |
| 38 | +# Publication |
124 | 39 |
|
125 | | -```bash |
126 | | -pip install . |
127 | | -``` |
| 40 | +Takeshi Yoshimura, Tatsuhiro Chiba, Manish Sethi, Daniel Waddington, Swaminathan Sundararaman. (2025) Speeding up Model Loading with fastsafetensors [arXiv:2505.23072](https://arxiv.org/abs/2505.23072) and IEEE CLOUD 2025. |