Name	Name	Last commit message	Last commit date
parent directory ..
api_reference	api_reference
01_getting_started.md	01_getting_started.md
02_core_operations.md	02_core_operations.md
03_batch_processing.md	03_batch_processing.md
04_advanced_features.md	04_advanced_features.md
05_performance_guide.md	05_performance_guide.md
06_quick_reference.md	06_quick_reference.md
07_io_conversion.md	07_io_conversion.md
README.md	README.md

NumPack API Documentation

Welcome to the NumPack API documentation! NumPack is a high-performance array storage library that combines Rust's performance with Python's ease of use.

Documentation Index

Getting Started

01. Getting Started Guide
- Installation instructions
- Quick start examples
- Basic concepts and usage patterns
- Context manager and file management
- Supported data types

API Reference

API Reference (Detailed) ⭐ NEW
- Complete function-level documentation
- Parameters, return values, and examples
- Organized by module (Core, IO, Utils)
02. Core Operations
- Complete API reference for all basic operations
- save(), load(), replace(), append(), drop()
- Random access with getitem()
- Metadata operations
- Stream loading
- File management
03. Batch Processing
- High-performance batch modes (25-174x speedup)
- batch_mode(): Memory-cached processing
- writable_batch_mode(): Zero-copy file mapping
- Comparison and selection guide
- Best practices and examples
04. Advanced Features
- Lazy arrays and memory-mapped loading
- Streaming operations for large datasets
- Advanced indexing and slicing
- In-place operations
- Memory management strategies
- Cross-platform considerations

Optimization

05. Performance Guide
- Comprehensive benchmark results
- Optimization strategies
- Common performance pitfalls
- Platform-specific optimizations
- Real-world use case examples

Quick Reference

06. Quick Reference
- API cheatsheet
- Common patterns
- Decision trees for choosing the right approach
- Troubleshooting guide

Format Conversion

07. IO Conversion
- PyTorch tensor conversion
- PyArrow/Feather/Parquet conversion
- SafeTensors conversion
- NumPy, HDF5, Zarr, CSV conversion
- Memory ↔ .npk and File ↔ .npk patterns

Quick Links

By Use Case

Use Case	Documentation	Key Features
First-time users	Getting Started	Installation, basic usage
API lookup	Core Operations	Complete API reference
Performance optimization	Batch Processing, Performance Guide	25-174x speedup
Large datasets	Advanced Features	Lazy loading, streaming
Quick answers	Quick Reference	Cheatsheet, common patterns
Format conversion	IO Conversion	PyTorch, Arrow, Parquet, SafeTensors

By Feature

Feature	Documentation	Performance Gain
Batch modifications	Batch Processing	25-174x faster
Row replacement	Core Operations	397x faster than NPY
Data append	Core Operations	405x faster than NPY
Lazy loading	Advanced Features	54x faster initialization
Streaming	Advanced Features	Memory-efficient

Performance Highlights

NumPack excels in three key areas:

1. Data Modification (397-405x faster)

# Replace 100 rows: 0.047ms (NPY: 18.51ms)
npk.replace({'features': new_data}, [0, 1, 2, ...])

# Append 100 rows: 0.067ms (NPY: 27.09ms)
npk.append({'features': new_data})

2. Lazy Loading (54x faster)

# Initialization: 0.002ms (NPY mmap: 0.107ms)
lazy_arr = npk.load('features', lazy=True)
subset = lazy_arr[1000:2000]

3. Batch Processing (25-174x faster)

# 100 modifications: 4.9ms (Normal: 856ms)
with npk.writable_batch_mode() as wb:
    arr = wb.load('features')
    for i in range(100):
        arr *= 1.1

Common Tasks

Save and Load Data

from numpack import NumPack
import numpy as np

with NumPack("data.npk") as npk:
    # Save
    npk.save({'features': np.random.rand(1000, 100)})
    
    # Load
    features = npk.load('features')

Full documentation

Modify Existing Data

with NumPack("data.npk") as npk:
    # Replace specific rows (397x faster than NPY)
    npk.replace({'features': new_data}, [0, 1, 2])
    
    # Append new rows (405x faster than NPY)
    npk.append({'features': more_data})

Full documentation

High-Performance Batch Processing

# For frequent modifications (174x speedup)
with NumPack("data.npk") as npk:
    with npk.writable_batch_mode() as wb:
        arr = wb.load('features')
        arr *= 2.0  # Direct file modification

Full documentation

Large Dataset Handling

# Lazy loading (54x faster initialization)
with NumPack("large_data.npk") as npk:
    lazy_arr = npk.load('features', lazy=True)
    subset = lazy_arr[1000:2000]

# Streaming (memory-efficient)
with NumPack("large_data.npk") as npk:
    for batch in npk.stream_load('features', buffer_size=10000):
        process(batch)

Full documentation

Installation

From PyPI (Recommended)

pip install numpack

Requirements:

Python >= 3.9
NumPy >= 1.26.0

From Source

git clone https://github.com/BirchKwok/NumPack.git
cd NumPack
pip install maturin>=1.0,<2.0
maturin develop

Additional requirements:

Rust >= 1.70.0
Appropriate C/C++ compiler

Full installation guide

When to Use NumPack

Strongly Recommended (90% of use cases)

Machine learning and deep learning pipelines
Real-time data stream processing
Data annotation and correction workflows
Feature stores with dynamic updates
Any scenario requiring frequent data modifications
Fast data loading requirements

Consider Alternatives (10% of use cases)

Write-once, never modify → Use NPY (2.2x faster initial write)
Frequent single-row random access → Use NPY mmap
Extreme compression requirements → Use NPZ (10% smaller, 1000x slower)

Performance comparison

API Overview

Core Classes

from numpack import NumPack, LazyArray

# Main class
npk = NumPack("data.npk")

# Lazy array (memory-mapped)
lazy_arr = npk.load('array', lazy=True)

Key Methods

Method	Purpose	Performance
`save(arrays)`	Save arrays	2.2x slower than NPY
`load(name, lazy=False)`	Load array	1.3x faster (eager), 54x faster (lazy)
`replace(arrays, indexes)`	Replace rows	397x faster than NPY
`append(arrays)`	Append rows	405x faster than NPY
`drop(name, indexes)`	Drop arrays/rows	Very fast (logical)
`getitem(name, indexes)`	Random access	Fast
`stream_load(name, buffer_size)`	Stream batches	Memory-efficient
`batch_mode()`	Batch processing	25-37x speedup
`writable_batch_mode()`	Zero-copy batch	174x speedup

Complete API reference

Learning Path

Beginner

Read Getting Started Guide
Try basic examples from Core Operations
Learn about context managers and file management

Intermediate

Explore Batch Processing for performance gains
Learn when to use batch_mode vs writable_batch_mode
Study Advanced Features for lazy loading

Advanced

Master Performance Guide
Optimize your specific use case
Understand platform-specific optimizations

Troubleshooting

Common Issues

Q: File handle warning on Windows?

Use context manager: with NumPack(...) as npk:
See Getting Started

Q: Out of memory errors?

Use lazy loading or streaming
See Advanced Features

Q: Slow performance?

Use appropriate batch mode
See Performance Guide

Q: Need to reclaim disk space after deletions?

Call npk.update(array_name) to compact
See Core Operations

Full troubleshooting guide

Examples Repository

All documentation includes practical examples. For complete working examples, see:

examples/inplace_operators_example.py
examples/writable_batch_mode_example.py
examples/drop_operations_example.py

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the Apache License, Version 2.0.

FilesExpand file tree

docs

Directory actions

More options

Directory actions

More options

Latest commit

History

docs

Folders and files

parent directory

README.md

NumPack API Documentation

Documentation Index

Getting Started

API Reference

Optimization

Quick Reference

Format Conversion

Quick Links

By Use Case

By Feature

Performance Highlights

1. Data Modification (397-405x faster)

2. Lazy Loading (54x faster)

3. Batch Processing (25-174x faster)

Common Tasks

Save and Load Data

Modify Existing Data

High-Performance Batch Processing

Large Dataset Handling

Installation

From PyPI (Recommended)

From Source

When to Use NumPack

Strongly Recommended (90% of use cases)

Consider Alternatives (10% of use cases)

API Overview

Core Classes

Key Methods

Learning Path

Beginner

Intermediate

Advanced

Troubleshooting

Common Issues

Examples Repository

🤝 Contributing

📄 License

🔗 Additional Resources