## Motivation
For users of inference engines (e.g., vLLM), the storage backend for model weights varies across deployment environments:
| Deployment Environment | Expected Loading Method |
| --- | --- |
| Local NVMe SSD + CUDA | `SafeTensorsFileLoader` (GDS, auto-fallback to nogds) |
| 3FS Distributed File System | `ThreeFSLoader` / `ParallelThreeFSLoader` |
| Other distributed storage (future) | New loader implementations |
Pain point: Switching storage backends requires modifying engine code (different imports, different constructor parameters). This means:
- Engine maintainers must maintain separate code branches for each storage type — not sustainable
- End users cannot switch backends via simple configuration
- The same engine codebase cannot be reused across different storage environments
Goal: Engine code uses a single unified entry point; users switch the underlying loader via environment variables or config files:
```bash
# Default (local GDS/nogds)
python run_inference.py

# Switch to 3FS
FASTSAFETENSORS_LOADER=threefs python run_inference.py

# Fine-grained control via config file
FASTSAFETENSORS_CONFIG=/path/to/config.json python run_inference.py
```
## Current State
```
PipelineParallel (base class: producer-consumer parallel loading framework)
├── ParallelLoader → internally creates SafeTensorsFileLoader
└── ParallelThreeFSLoader → internally creates ThreeFSLoader
```

- `SafeTensorsFileLoader` can also be used via `ParallelLoader` with `queue_size=-1`, but this is not intuitive
- The two parallel loaders differ only in which underlying loader is created in the constructor; all other logic (`iterate_weights()`, iteration, scheduling) is identical
- Existing env var conventions: `FASTSAFETENSORS_DEBUG`, `FASTSAFETENSORS_UNIFIED_MEM`
## Approach Comparison

### Approach A: New Unified Entry Class (e.g., `FastLoader`)
```python
from fastsafetensors import FastLoader

loader = FastLoader(pg, files, device="cuda:0")  # auto-selects underlying loader via env/config
for key, tensor in loader.iterate_weights():
    process(key, tensor)
loader.close()
```
- Pros: Purely additive — no changes to existing classes; best backward compatibility; clean separation of concerns
- Cons: New class name; one extra layer of delegation
### Approach B: Extend Existing `ParallelLoader`
```python
from fastsafetensors import ParallelLoader

loader = ParallelLoader(pg, files, device="cuda:0")  # auto-selects underlying loader via env/config
```
- Pros: Zero learning curve — entry point stays the same
- Cons: Modifies existing class behavior; `nogds`/`bbuf_size_kb` params would be ignored in `threefs` mode, which may cause confusion
Backward compatibility: Both approaches behave identically to today when no new env vars are set. Approach A leaves existing ParallelLoader completely untouched; Approach B defaults to loader="base", so existing calls are unaffected.
## Configuration Design

### Principles
- Static config (env / config file): loader type, framework, debug toggle, loader-specific tuning params
- Runtime params (passed in code): `device`, `pg`, `hf_weights_files` — never in config files
- Priority: env vars > config file > code params > defaults
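The stated priority can be implemented by layering the sources lowest-priority first, so each later source overwrites the earlier ones. `resolve_config` is a hypothetical helper, and only the top-level keys are shown:

```python
import json
import os

DEFAULTS = {"loader": "base", "framework": None, "debug_log": False}

def resolve_config(code_params=None):
    """Merge settings with priority: env vars > config file > code params > defaults."""
    cfg = dict(DEFAULTS)
    cfg.update(code_params or {})                 # code params override defaults
    path = os.environ.get("FASTSAFETENSORS_CONFIG")
    if path:
        with open(path) as f:
            cfg.update(json.load(f))              # config file overrides code params
    if "FASTSAFETENSORS_LOADER" in os.environ:    # env vars win over everything
        cfg["loader"] = os.environ["FASTSAFETENSORS_LOADER"]
    if "FASTSAFETENSORS_FRAMEWORK" in os.environ:
        cfg["framework"] = os.environ["FASTSAFETENSORS_FRAMEWORK"]
    return cfg
```

For example, `resolve_config({"loader": "threefs"})` returns `"threefs"` only when neither `FASTSAFETENSORS_LOADER` nor a config file overrides it.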
### Environment Variables
| Variable | Description | Values | Default |
| --- | --- | --- | --- |
| `FASTSAFETENSORS_LOADER` | Loader type | `base`, `threefs` | `base` |
| `FASTSAFETENSORS_CONFIG` | Config file path | file path | none |
| `FASTSAFETENSORS_FRAMEWORK` | DL framework | `pytorch`, `paddle` | none |
| `FASTSAFETENSORS_DEBUG` | Debug logging (existing) | `true`, `false` | `false` |
### Config File (JSON)
Specified via FASTSAFETENSORS_CONFIG:
```json
{
  "loader": "threefs",
  "framework": "pytorch",
  "debug_log": false,
  "base": {
    "nogds": false,
    "bbuf_size_kb": 16384,
    "max_threads": 16
  },
  "threefs": {},
  "parallel": {
    "max_concurrent_producers": 1,
    "queue_size": 0,
    "use_tqdm_on_load": true
  }
}
```
- `loader`: `"base"` → `SafeTensorsFileLoader` (handles GDS/nogds fallback internally); `"threefs"` → `ThreeFSLoader`
- `base` / `threefs`: loader-specific tuning params
- `parallel`: `PipelineParallel`-level tuning params
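As a sketch of how these sections could map to constructor arguments (the helper name `split_loader_kwargs` is hypothetical), the parsed config splits into kwargs for the selected loader plus the `PipelineParallel`-level options:

```python
def split_loader_kwargs(cfg: dict):
    """Return (loader_name, loader_kwargs, parallel_kwargs) from a parsed config dict."""
    loader = cfg.get("loader", "base")
    loader_kwargs = cfg.get(loader, {})        # the "base" or "threefs" section
    parallel_kwargs = cfg.get("parallel", {})  # PipelineParallel-level tuning
    return loader, loader_kwargs, parallel_kwargs

# Using the example config above (abridged):
cfg = {
    "loader": "threefs",
    "base": {"nogds": False, "bbuf_size_kb": 16384},
    "threefs": {},
    "parallel": {"queue_size": 0},
}
name, loader_kw, parallel_kw = split_loader_kwargs(cfg)
```

This keeps the inactive section (here `base`) parsed but unused, which matches the Approach B concern that `nogds`/`bbuf_size_kb` are silently ignored in `threefs` mode.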
## Discussion
The above proposal is based on real-world inference engine integration scenarios. We'd love to hear the maintainers' thoughts:
- Do you agree with the direction of unifying the loader entry point via configuration?
- If so, any suggestions or alternative approaches you'd prefer?