- Unify API around LemaModel and model.get_trainer()
- Centralize model conversion and utility logic in lema.utils.model_utils
- Add automatic interval-based checkpointing to LemaTrainer
- Update Kaggle generation tool to create a comprehensive research workspace
- Refresh all documentation (README, User Guide, API Reference, Architecture)
- Verify 7B model stability via Kaggle benchmarks
**Virtualize GPU VRAM for LLM Fine-Tuning**
LEMA is a specialized framework designed to facilitate the fine-tuning of Large Language Models (LLMs) on hardware where model size exceeds available VRAM. Unlike standard frameworks, which require the full model to be resident in GPU memory, LEMA treats the model as a collection of discrete, addressable binary segments and asynchronously prefetches layers into VRAM, trading PCIe bandwidth for memory headroom. This **Triple-Buffer Strategy** allows 7B+ models to be trained on GPUs with as little as 16GB of VRAM.
## Core Features
### 1. Binary Indexed Engagement
LEMA utilizes a **Global Binary Index (GBI)** to map `.safetensors` files directly into the process's virtual address space using `mmap`. This allows for zero-copy mapping and O(1) access to specific layer weights without full model deserialization.
### 2. Layer-wise Execution (Patchwork)
Instead of a monolithic `model.forward()`, LEMA decomposes the computational graph into a sequence of isolated layer blocks.
- **Weight Swapping**: Only the current layer and the next layer occupy VRAM.
- **Persistence**: Model weights remain frozen in System RAM/disk; only LoRA adapters are maintained in active memory.
### 3. The Triple-Buffer Strategy
LEMA orchestrates data movement across three tiers to hide the latency of PCIe transfers:
- **Storage (NVMe)**: The source of truth (the Global Binary File).
- **System RAM**: Pinned memory buffers for staging.
- **VRAM**: Active slot / prefetch slot for execution.
This strategy allows for asynchronous prefetching, where the CPU pushes the next layer to VRAM while the GPU computes the current layer.
### 4. High-Level API

Simplified `LemaModel` and `LemaTrainer` interfaces for fast integration.

### 5. Automatic Checkpointing

Built-in interval-based saving of LoRA adapters and optimizer states.

## Key Performance (Tesla P100 - 16GB)

Benchmarks were performed on a Tesla P100 (16GB VRAM), comparing Standard PEFT (LoRA) against LEMA (Streaming).

*Note: At sequence length 128, Standard PEFT narrowly fits in 16GB of VRAM. Increasing the workload to a standard sequence length of 512, however, causes an immediate **Out-Of-Memory (OOM)** crash. LEMA maintains a consistent ~6GB footprint even as sequence length scales, providing over **10GB of headroom** for activations and larger batches.*

The primary value of LEMA is not just "fitting" the model, but providing the **computational headroom** necessary for real-world training. On a 16GB GPU:

- **Standard PEFT**: Operates at ~88% VRAM capacity just to load the model and run a minimal step, leaving no room for longer contexts or higher batch sizes.
- **LEMA**: Operates at ~37% VRAM capacity, allowing significantly larger sequence lengths, higher batch sizes, or even larger models (13B+) on the same hardware.
## Installation

### From Source

```bash
git clone https://github.com/Pomilon/LEMA.git
cd LEMA
pip install -e .
```

### Requirements

- PyTorch >= 2.0.0
- Transformers >= 4.30.0
- Safetensors >= 0.3.0
- Accelerate >= 0.20.0
- PEFT >= 0.4.0
## Quick Start

LEMA uses a configuration-driven approach:

```python
import torch
from lema import LemaConfig, LemaModel, MemoryStrategy

# 1. Configuration
config = LemaConfig(
    model_name_or_path="NousResearch/Llama-2-7b-hf",
    gbi_path="llama2_7b.safetensors",   # single monolithic safetensors file
    strategy=MemoryStrategy.STREAMING,  # Disk -> RAM -> VRAM
    lora_rank=16,
    learning_rate=1e-4,
    gradient_checkpointing=True,        # essential for large models
)
```
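The remaining steps follow the unified `LemaModel` / `model.get_trainer()` API described in the changelog above. A minimal sketch, with `train_dataset` standing in for your tokenized dataset:

```python
# 2. Train (sketch; get_trainer() per the unified API, dataset prep omitted)
model = LemaModel(config)
trainer = model.get_trainer(train_dataset)  # train_dataset: placeholder
trainer.train()
```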
## Roadmap

While LEMA v1.0 is stable and functional for 7B fine-tuning, I aim to significantly extend it:
* **Library Integration**: I am working toward deeper integration with Hugging Face `Trainer` and `Accelerate` for seamless usage in existing pipelines.
* **Quantization Support**: I intend to implement native support for 8-bit and 4-bit loading within the streaming pipeline for even lower memory footprints.
* **Model Support**: I am expanding support beyond Llama and GPT-2 to include Mistral, Mixtral (MoE), and other architectures.
---

This document describes the internal mechanics of the Layer-wise Efficient Memory Abstraction (LEMA) framework.
## The Problem: The VRAM Wall
Standard fine-tuning (even with PEFT/LoRA) requires the entire model's weights to be resident in VRAM. For a Llama-2 7B model in FP16, that is ~14GB; adding optimizer states and activations quickly exceeds the capacity of consumer GPUs (e.g., 16GB).
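The arithmetic behind that figure, as a quick sanity check:

```python
params = 7e9       # Llama-2 7B parameter count
bytes_fp16 = 2     # bytes per FP16 parameter
print(f"~{params * bytes_fp16 / 1e9:.0f} GB for weights alone")  # ~14 GB,
# before optimizer states, gradients, and activations are even counted
```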
## The LEMA Solution: Virtualization
LEMA treats GPU VRAM not as static storage for the model, but as a **dynamic cache** for execution.
### 1. The Triple-Buffer Strategy
LEMA hides data transfer latency by pipelining movements across three memory tiers:
1. **Storage (NVMe)**: Weights reside in `.safetensors` files, accessed via `mmap` (zero-copy).
2. **System RAM (Pinned)**: Acts as a prefetch buffer; pinned memory ensures high-speed Host-to-Device (H2D) transfers.
3. **VRAM (Execution)**: Divided into two "slots" (Active and Prefetch).
### 2. The Execution Pipeline
While the GPU is computing Layer $N$ in Slot A, LEMA is:
- Asynchronously transferring Layer $N+1$ from RAM to Slot B (VRAM).
- Loading Layer $N+2$ from disk to RAM (staging).

When Layer $N$ finishes, the slots swap instantly; the sketch below illustrates the two-slot loop.
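A minimal sketch of that loop using CUDA streams. This is illustrative rather than LEMA's actual code: each "layer" is reduced to a single pinned FP16 weight matrix, and `streamed_forward` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def streamed_forward(weights_cpu, x):
    """Compute layer N on the default stream while copying layer N+1
    on a side stream (Slot A / Slot B ping-pong)."""
    copy_stream = torch.cuda.Stream()
    active = weights_cpu[0].to("cuda", non_blocking=True)  # fill Slot A
    for i in range(len(weights_cpu)):
        prefetch = None
        if i + 1 < len(weights_cpu):
            with torch.cuda.stream(copy_stream):
                # H2D copy overlaps with the matmul below
                prefetch = weights_cpu[i + 1].to("cuda", non_blocking=True)
        x = F.relu(F.linear(x, active))                  # "Layer N" in Slot A
        torch.cuda.current_stream().wait_stream(copy_stream)
        active = prefetch                                # swap slots
    return x

# Usage: pinned CPU weights are what make the copies truly asynchronous
layers = [torch.randn(4096, 4096, dtype=torch.float16).pin_memory()
          for _ in range(8)]
out = streamed_forward(layers, torch.randn(1, 4096, dtype=torch.float16, device="cuda"))
```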
### 3. The LEMA-Loop (Training Logic)
#### Forward Pass
- The model is executed layer by layer.
- Only "Boundary Activations" (the output of each layer) are stored in VRAM.
- Intermediate activations are discarded.
#### Backward Pass
- LEMA traverses the layers in reverse.
- For each layer (see the sketch below):
  1. The weights are swapped back into VRAM.
  2. The layer's forward pass is **re-executed** (Segmented Gradient Checkpointing) using the stored boundary activations.
  3. Gradients are calculated for the LoRA adapters.
  4. Optimizer states for those specific adapters are updated.
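The recomputation in step 2 maps directly onto PyTorch's stock checkpointing utility. A minimal sketch of the idea (LEMA layers the per-layer VRAM weight swap on top of this):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

def segmented_forward(layers: nn.ModuleList, x: torch.Tensor) -> torch.Tensor:
    # Each checkpoint() call stores only the layer's input (the boundary
    # activation) and re-runs the layer's forward during the backward pass.
    for layer in layers:
        x = checkpoint(layer, x, use_reentrant=False)
    return x
```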
### 4. GBI (Global Binary Index)
LEMA uses a specialized indexer to bypass standard PyTorch/Pickle deserialization. By reading the `.safetensors` header, LEMA knows the exact byte offsets for every parameter, allowing it to "slice" the file and load only the parameters needed for the current layer module.
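Concretely, a safetensors file begins with an 8-byte little-endian header length, followed by a JSON header mapping each tensor name to its dtype, shape, and byte offsets. A minimal sketch of offset-based loading (illustrative; `read_tensor` is not LEMA's actual API):

```python
import json
import mmap
import struct

import torch

_DTYPES = {"F16": torch.float16, "F32": torch.float32, "BF16": torch.bfloat16}

def read_tensor(path: str, name: str) -> torch.Tensor:
    with open(path, "rb") as f:
        # Copy-on-write mapping: file stays read-only, pages appear writable
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
    header_len = struct.unpack("<Q", mm[:8])[0]      # 8-byte header size
    header = json.loads(mm[8:8 + header_len])        # JSON tensor index
    meta = header[name]
    start, end = meta["data_offsets"]                # relative to data section
    base = 8 + header_len
    view = memoryview(mm)[base + start:base + end]   # zero-copy slice
    t = torch.frombuffer(view, dtype=_DTYPES[meta["dtype"]])
    return t.view(*meta["shape"])
```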
## Performance Trade-offs
- **VRAM Efficiency**: ~50-70% reduction for 7B+ models.
- **Compute Overhead**: 1.2x-1.8x slowdown compared to fully resident training, depending on PCIe bandwidth and disk speed.
- **System RAM**: Requires space roughly equal to the model size (or less when using aggressive disk streaming).
---

This guide covers common workflows for fine-tuning Large Language Models using LEMA on memory-constrained hardware.
## 1. Preparing Your Model
LEMA requires model weights in a single, non-sharded `.safetensors` file. We provide a utility that handles the conversion, including breaking shared (tied) weights, automatically.
### Recommended Conversion
```python
from lema.utils.model_utils import prepare_monolithic_safetensors

# This handles downloading, shared-weight cloning, and monolithic saving
prepare_monolithic_safetensors(
    "NousResearch/Llama-2-7b-hf",
    "llama2_7b.safetensors",
    device="auto",  # use "auto" to save RAM during conversion if a GPU is available
)
```
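The shared-weight step exists because `safetensors` refuses to serialize tensors that alias the same storage. A hedged sketch of the idea using GPT-2, whose `lm_head` is tied to the token embedding (illustrative, not the utility's actual code):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# GPT-2 ties lm_head to the token embedding; clone it so every tensor
# owns its own storage before the monolithic safetensors save.
head, emb = model.lm_head.weight, model.transformer.wte.weight
if head.data_ptr() == emb.data_ptr():
    model.lm_head.weight = torch.nn.Parameter(head.clone())
```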
## 2. Fine-Tuning Workflow
The standard workflow involves four steps: Configuration, Initialization, Training, and Saving.
### Basic Example
```python
import torch
from lema import LemaConfig, LemaModel, LemaTrainer
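
# -- The rest of this example is a sketch: LemaConfig fields are those
# -- documented in this guide; get_trainer() follows the unified API noted
# -- in the changelog; the save call is illustrative, not confirmed API.

# 1. Configuration
config = LemaConfig(
    model_name_or_path="NousResearch/Llama-2-7b-hf",
    gbi_path="llama2_7b.safetensors",
    lora_rank=16,
    learning_rate=1e-4,
    gradient_checkpointing=True,
)

# 2. Initialization
model = LemaModel(config)

# 3. Training ("train_dataset" is a placeholder for your tokenized dataset)
trainer = model.get_trainer(train_dataset)
trainer.train()

# 4. Saving (interval-based checkpointing also runs automatically during training)
trainer.save_model("outputs/")  # illustrative method name
```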
## 3. LoRA Target Modules

When using LEMA, ensure your `lora_target_modules` in `LemaConfig` match your model's architecture:
- **Llama**: `["q_proj", "v_proj", ...]` (default)
- **GPT-2**: `["c_attn"]`
## 4. Memory Strategies
LEMA supports two primary strategies in `LemaConfig`:
- **`MemoryStrategy.STREAMING` (Default)**:
  - **Path**: Disk -> Pinned RAM -> VRAM.
  - **Pros**: Lowest VRAM usage. Can fit models much larger than System RAM if needed (via `mmap`).
  - **Cons**: Higher latency due to the PCIe/disk bottleneck.
- **`MemoryStrategy.RESIDENT`**:
  - **Path**: RAM -> VRAM.
  - **Pros**: Faster than streaming; model weights stay in RAM.
  - **Cons**: Requires enough System RAM to hold the full model weights (~14GB for a 7B FP16 model).
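Switching strategies is a single field in the config; for example (reusing the fields shown in the Quick Start):

```python
from lema import LemaConfig, MemoryStrategy

# Trade ~14GB of System RAM for lower step latency
config = LemaConfig(
    model_name_or_path="NousResearch/Llama-2-7b-hf",
    gbi_path="llama2_7b.safetensors",
    strategy=MemoryStrategy.RESIDENT,
)
```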
## 5. Tips for Maximum Efficiency
1. **Gradient Checkpointing**: Always enable `gradient_checkpointing=True` for 7B+ models. This significantly reduces VRAM usage during the backward pass by not storing intermediate activations.
2. **Pinned Memory**: LEMA automatically uses pinned memory for transfers. Ensure your system has sufficient RAM available for the staging buffers (~2x the size of the largest layer; see the sketch below).
3. **NVMe Storage**: When using `STREAMING` mode, placing your `.safetensors` file on an NVMe SSD will greatly reduce the streaming overhead.
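As a rough illustration of the pinned-memory rule of thumb above (buffer sizes are hypothetical; LEMA manages these buffers internally):

```python
import torch

# Double-buffer staging sized to the largest layer, per the ~2x rule of thumb
largest_layer_bytes = 500 * 2**20  # e.g., ~500 MiB for a large 7B layer block
staging = [
    torch.empty(largest_layer_bytes, dtype=torch.uint8, pin_memory=True)
    for _ in range(2)  # one active buffer, one prefetch buffer
]
```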