Welcome to the NumPack API documentation! NumPack is a high-performance array storage library that combines Rust's performance with Python's ease of use.
- 01. Getting Started Guide
  - Installation instructions
  - Quick start examples
  - Basic concepts and usage patterns
  - Context manager and file management
  - Supported data types
- API Reference (Detailed) (NEW)
  - Complete function-level documentation
  - Parameters, return values, and examples
  - Organized by module (Core, IO, Utils)
- 02. Core Operations
  - Complete API reference for all basic operations
    - `save()`, `load()`, `replace()`, `append()`, `drop()`
  - Random access with `getitem()`
  - Metadata operations
  - Stream loading
  - File management
- 03. Batch Processing
  - High-performance batch modes (25-174x speedup)
  - `batch_mode()`: memory-cached processing
  - `writable_batch_mode()`: zero-copy file mapping
  - Comparison and selection guide
  - Best practices and examples
- 04. Advanced Features
  - Lazy arrays and memory-mapped loading
  - Streaming operations for large datasets
  - Advanced indexing and slicing
  - In-place operations
  - Memory management strategies
  - Cross-platform considerations
- 05. Performance Guide
  - Comprehensive benchmark results
  - Optimization strategies
  - Common performance pitfalls
  - Platform-specific optimizations
  - Real-world use case examples
- 06. Quick Reference
  - API cheatsheet
  - Common patterns
  - Decision trees for choosing the right approach
  - Troubleshooting guide
- 07. IO Conversion
  - PyTorch tensor conversion
  - PyArrow/Feather/Parquet conversion
  - SafeTensors conversion
  - NumPy, HDF5, Zarr, CSV conversion
  - Memory to .npk and file to .npk patterns

| Use Case | Documentation | Key Features |
|---|---|---|
| First-time users | Getting Started | Installation, basic usage |
| API lookup | Core Operations | Complete API reference |
| Performance optimization | Batch Processing, Performance Guide | 25-174x speedup |
| Large datasets | Advanced Features | Lazy loading, streaming |
| Quick answers | Quick Reference | Cheatsheet, common patterns |
| Format conversion | IO Conversion | PyTorch, Arrow, Parquet, SafeTensors |

| Feature | Documentation | Performance Gain |
|---|---|---|
| Batch modifications | Batch Processing | 25-174x faster |
| Row replacement | Core Operations | 397x faster than NPY |
| Data append | Core Operations | 405x faster than NPY |
| Lazy loading | Advanced Features | 54x faster initialization |
| Streaming | Advanced Features | Memory-efficient |
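
To make the "faster than NPY" comparisons more concrete, here is a minimal sketch of what row replacement looks like with a plain `.npy` file: the whole array is loaded and the whole file is written back, even when only a few rows change. It uses only NumPy. The helper name `replace_rows_npy`, the file name, and the array shape are illustrative assumptions, and this is not necessarily the exact baseline behind the figures above.

```python
import os
import tempfile

import numpy as np


def replace_rows_npy(path, new_rows, indexes):
    """Replace rows in a .npy file by rewriting the whole file (hypothetical helper)."""
    arr = np.load(path)       # reads the entire array into memory
    arr[indexes] = new_rows   # the actual change touches only a few rows
    np.save(path, arr)        # but the full array is written back out


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.npy")
    np.save(path, np.zeros((1000, 100)))

    replace_rows_npy(path, np.ones((3, 100)), [0, 1, 2])

    result = np.load(path)
    changed_rows = int(result.any(axis=1).sum())  # rows that are no longer all zero
```

Because the cost of this pattern grows with the size of the file rather than the number of rows changed, it is the kind of workload where an in-place format is likely to help most.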
NumPack excels in three key areas:
```python
# Replace 100 rows: 0.047ms (NPY: 18.51ms)
npk.replace({'features': new_data}, [0, 1, 2, ...])

# Append 100 rows: 0.067ms (NPY: 27.09ms)
npk.append({'features': new_data})
```

```python
# Initialization: 0.002ms (NPY mmap: 0.107ms)
lazy_arr = npk.load('features', lazy=True)
subset = lazy_arr[1000:2000]
```

```python
# 100 modifications: 4.9ms (Normal: 856ms)
with npk.writable_batch_mode() as wb:
    arr = wb.load('features')
    for i in range(100):
        arr *= 1.1
```

```python
from numpack import NumPack
import numpy as np

with NumPack("data.npk") as npk:
    # Save
    npk.save({'features': np.random.rand(1000, 100)})

    # Load
    features = npk.load('features')
```

```python
with NumPack("data.npk") as npk:
    # Replace specific rows (397x faster than NPY)
    npk.replace({'features': new_data}, [0, 1, 2])

    # Append new rows (405x faster than NPY)
    npk.append({'features': more_data})
```

```python
# For frequent modifications (174x speedup)
with NumPack("data.npk") as npk:
    with npk.writable_batch_mode() as wb:
        arr = wb.load('features')
        arr *= 2.0  # Direct file modification
```

```python
# Lazy loading (54x faster initialization)
with NumPack("large_data.npk") as npk:
    lazy_arr = npk.load('features', lazy=True)
    subset = lazy_arr[1000:2000]

# Streaming (memory-efficient)
with NumPack("large_data.npk") as npk:
    for batch in npk.stream_load('features', buffer_size=10000):
        process(batch)
```

```bash
pip install numpack
```

Requirements:
- Python >= 3.9
- NumPy >= 1.26.0
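
If it helps to confirm an environment before installing, the two requirements above can be checked with a short script. This is a rough sketch using only the standard library; the helper names `version_tuple` and `check_requirements` are hypothetical and not part of NumPack.

```python
import sys
from importlib import metadata


def version_tuple(version, parts=3):
    """Turn '1.26.4' into (1, 26, 4); non-numeric suffixes such as 'rc1' are dropped."""
    numbers = []
    for piece in version.split(".")[:parts]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        numbers.append(int(digits or 0))
    while len(numbers) < parts:
        numbers.append(0)
    return tuple(numbers)


def check_requirements():
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append("Python >= 3.9 is required")
    try:
        numpy_version = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        problems.append("NumPy is not installed")
    else:
        if version_tuple(numpy_version) < (1, 26, 0):
            problems.append(f"NumPy >= 1.26.0 is required, found {numpy_version}")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```

The script prints nothing when both requirements are met.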
```bash
git clone https://github.com/BirchKwok/NumPack.git
cd NumPack
# Quote the specifier so the shell does not treat > and < as redirections
pip install "maturin>=1.0,<2.0"
maturin develop
```

Additional requirements:
- Rust >= 1.70.0
- Appropriate C/C++ compiler

Good fits for NumPack:
- Machine learning and deep learning pipelines
- Real-time data stream processing
- Data annotation and correction workflows
- Feature stores with dynamic updates
- Any scenario requiring frequent data modifications
- Fast data loading requirements

Cases where another format may serve better:
- Write-once, never modify → Use NPY (2.2x faster initial write)
- Frequent single-row random access → Use NPY mmap
- Extreme compression requirements → Use NPZ (10% smaller, 1000x slower)
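
For the single-row random access case, plain NumPy memory mapping looks roughly like the sketch below. It relies only on `np.save` and `np.load(..., mmap_mode="r")`; the file name, shape, and row index are made up for illustration.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.npy")

    # Write once: every value in row i equals i, so lookups are easy to verify
    data = np.arange(1000, dtype=np.float64)[:, None] * np.ones((1, 8))
    np.save(path, data)

    # Memory-map the file; no row is read until it is indexed
    mapped = np.load(path, mmap_mode="r")
    row = np.array(mapped[123])  # copy a single row out of the mapping

    # Release the mapping before the temporary directory is removed
    del mapped
```

Only the pages backing the requested row need to be read from disk, which is why this pattern suits scattered single-row lookups on data that never changes.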
```python
from numpack import NumPack, LazyArray

# Main class
npk = NumPack("data.npk")

# Lazy array (memory-mapped)
lazy_arr = npk.load('array', lazy=True)
```

| Method | Purpose | Performance |
|---|---|---|
| `save(arrays)` | Save arrays | 2.2x slower than NPY |
| `load(name, lazy=False)` | Load array | 1.3x faster (eager), 54x faster (lazy) |
| `replace(arrays, indexes)` | Replace rows | 397x faster than NPY |
| `append(arrays)` | Append rows | 405x faster than NPY |
| `drop(name, indexes)` | Drop arrays/rows | Very fast (logical) |
| `getitem(name, indexes)` | Random access | Fast |
| `stream_load(name, buffer_size)` | Stream batches | Memory-efficient |
| `batch_mode()` | Batch processing | 25-37x speedup |
| `writable_batch_mode()` | Zero-copy batch | 174x speedup |
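
The speedup figures in the table can be sanity-checked against the raw timings quoted earlier in this document. The sketch below only does the division; the variable names `timings` and `speedups` are illustrative, and the timing values are copied from the inline comments in the examples.

```python
# (label, baseline_ms, numpack_ms) as quoted in this document
timings = [
    ("replace 100 rows", 18.51, 0.047),
    ("append 100 rows", 27.09, 0.067),
    ("lazy initialization", 0.107, 0.002),
    ("100 batch modifications", 856.0, 4.9),
]

# Speedup is simply baseline time divided by NumPack time
speedups = {label: baseline / numpack for label, baseline, numpack in timings}

for label, ratio in speedups.items():
    print(f"{label}: {ratio:.1f}x")
```

The ratios land close to the quoted 397x, 405x, 54x, and 174x. The small differences are what one would expect if the published timings were rounded before being listed.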
- Read Getting Started Guide
- Try basic examples from Core Operations
- Learn about context managers and file management
- Explore Batch Processing for performance gains
- Learn when to use batch_mode vs writable_batch_mode
- Study Advanced Features for lazy loading
- Master Performance Guide
- Optimize your specific use case
- Understand platform-specific optimizations

Q: File handle warning on Windows?
- Use a context manager: `with NumPack(...) as npk:`
- See Getting Started

Q: Out of memory errors?
- Use lazy loading or streaming
- See Advanced Features

Q: Slow performance?
- Use an appropriate batch mode
- See Performance Guide

Q: Need to reclaim disk space after deletions?
- Call `npk.update(array_name)` to compact
- See Core Operations

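
For the out-of-memory question above, streaming only helps if the per-batch work keeps a small, fixed amount of state. Here is a minimal sketch of one such `process(batch)` step, a running per-column mean. `RunningMean` and `fake_stream` are hypothetical helpers; `fake_stream` stands in for a streaming loader so the example runs with NumPy alone, and it assumes batches arrive as 2-D blocks of rows.

```python
import numpy as np


class RunningMean:
    """Accumulate a per-column mean one batch at a time (hypothetical helper)."""

    def __init__(self):
        self.total = None
        self.count = 0

    def process(self, batch):
        batch = np.asarray(batch, dtype=np.float64)
        column_sums = batch.sum(axis=0)
        self.total = column_sums if self.total is None else self.total + column_sums
        self.count += batch.shape[0]

    def result(self):
        if self.count == 0:
            raise ValueError("no batches were processed")
        return self.total / self.count


def fake_stream(array, buffer_size):
    """Stand-in for a streaming loader: yields consecutive row blocks of `array`."""
    for start in range(0, array.shape[0], buffer_size):
        yield array[start:start + buffer_size]


data = np.arange(50, dtype=np.float64).reshape(10, 5)
stats = RunningMean()
for batch in fake_stream(data, buffer_size=4):
    stats.process(batch)

streamed_mean = stats.result()
```

With NumPack, the loop would presumably iterate over `npk.stream_load('features', buffer_size=10000)` instead of `fake_stream`, as in the streaming example earlier in this document.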
All documentation includes practical examples. For complete working examples, see:
- `examples/inplace_operators_example.py`
- `examples/writable_batch_mode_example.py`
- `examples/drop_operations_example.py`

Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the Apache License, Version 2.0.
Copyright 2025 NumPack Contributors