Skip to content

Commit b52d3f5

Browse files
committed
Refactor LEMA into a unified library framework
- Unify API around LemaModel and model.get_trainer() - Centralize model conversion and utility logic in lema.utils.model_utils - Add automatic interval-based checkpointing to LemaTrainer - Update Kaggle generation tool to create a comprehensive research workspace - Refresh all documentation (README, User Guide, API Reference, Architecture) - Verify 7B model stability via Kaggle benchmarks
1 parent b2e1bf3 commit b52d3f5

22 files changed

Lines changed: 2179 additions & 934 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -76,3 +76,4 @@ wandb/
7676
*_results.txt
7777
kaggle_status*.log
7878
results/
79+
*-metadata.json

README.md

Lines changed: 40 additions & 92 deletions
Original file line numberDiff line numberDiff line change
@@ -1,124 +1,69 @@
11
# LEMA: Layer-wise Efficient Memory Abstraction
22

3-
**Architectural Specification for VRAM-Efficient Model Fine-Tuning**
3+
**Virtualize GPU VRAM for LLM Fine-Tuning**
44

5-
LEMA is a specialized framework designed to facilitate the fine-tuning of Large Language Models (LLMs) on hardware where model size exceeds available VRAM. Unlike standard frameworks that require the full model to be resident in GPU memory, LEMA treats the model as a collection of discrete, addressable binary segments. By implementing a virtualized memory abstraction layer, LEMA performs asynchronous pre-fetching of layers into VRAM, effectively trading PCIe bandwidth for memory headroom.
5+
LEMA is a specialized framework designed to facilitate the fine-tuning of Large Language Models (LLMs) on hardware where model size exceeds available VRAM. By treating model weights as addressable binary segments and implementing a **Triple-Buffer Strategy**, LEMA allows training 7B+ models on GPUs with as little as 16GB VRAM.
66

7-
## Core Features
8-
9-
### 1. Binary Indexed Engagement
10-
LEMA utilizes a **Global Binary Index (GBI)** to map `.safetensors` files directly into the process's virtual address space using `mmap`. This allows for zero-copy mapping and O(1) access to specific layer weights without full model deserialization.
11-
12-
### 2. Layer-wise Execution (Patchwork)
13-
Instead of a monolithic `model.forward()`, LEMA decomposes the computational graph into a sequence of isolated layer blocks.
14-
- **Weight Swapping**: Only the current layer and the next layer occupy VRAM.
15-
- **Persistence**: Model weights remain frozen in System RAM/Disk; only LoRA adapters are maintained in active memory.
16-
17-
### 3. The Triple-Buffer Strategy
18-
LEMA orchestrates data movement across three tiers to hide the latency of PCIe transfers:
19-
- **Storage (NVMe)**: The source of truth (Global Binary File).
20-
- **System RAM**: Pinned Memory Buffers for staging.
21-
- **VRAM**: Active Slot / Prefetch Slot for execution.
22-
23-
This strategy allows for asynchronous prefetching, where the CPU pushes the next layer to VRAM while the GPU computes the current layer.
24-
25-
## Performance
7+
## Key Performance (Tesla P100 - 16GB)
268

27-
Benchmarks performed on a Tesla P100 (16GB VRAM) comparing Standard PEFT (LoRA) vs LEMA (Streaming).
9+
| Model | Standard PEFT | LEMA | Status |
10+
| :--- | :--- | :--- | :--- |
11+
| **Llama-2 7B** | **OOM (Crash)** | **5.90 GB VRAM** | **Stable** |
12+
| **SmolLM2 1.7B**| 3.88 GB | 3.20 GB | Stable |
13+
| **TinyLlama 1.1B**| 2.67 GB | 2.12 GB | Stable |
2814

29-
![VRAM Benchmark](docs/assets/vram_benchmark.png)
30-
31-
| Model | Standard PEFT VRAM | LEMA VRAM | Savings | Status (Fine-Tuning) |
32-
| :--- | :--- | :--- | :--- | :--- |
33-
| **TinyLlama 1.1B** | 2.67 GB | **2.12 GB** | **20.5%** | **Stable** |
34-
| **SmolLM2 1.7B** | 3.88 GB | **3.20 GB** | **17.6%** | **Stable** |
35-
| **Llama-2 7B** | 13.99 GB* | **5.90 GB** | **~58%** | **LEMA Recommended** |
36-
37-
*\*Note: At sequence length 128, Standard PEFT narrowly fits in 16GB VRAM. However, increasing the workload to a standard sequence length of 512 causes an immediate **Out-Of-Memory (OOM)** crash. LEMA maintains a consistent ~6GB footprint even as sequence length scales, providing over **10GB of headroom** for activations and larger batches.*
38-
39-
![Speed Benchmark](docs/assets/speed_benchmark.png)
15+
## Core Features
4016

41-
### The Headroom Advantage
42-
The primary value of LEMA is not just "fitting" the model, but providing the **computational headroom** necessary for real-world training. On a 16GB GPU:
43-
- **Standard PEFT**: Operating at ~88% VRAM capacity just to load the model and run a minimal step. Zero room for longer contexts or higher batch sizes.
44-
- **LEMA**: Operating at ~37% VRAM capacity. Allows for significantly larger sequence lengths, higher batch sizes, or even larger models (13B+) on the same hardware.
17+
- **Binary Indexed Engagement (GBI)**: Zero-copy mapping of `.safetensors` files using `mmap`.
18+
- **Triple-Buffer Pipeline**: Pipelined data movement (Disk -> RAM -> VRAM) to hide PCIe latency.
19+
- **High-Level API**: Simplified `LemaModel` and `LemaTrainer` interfaces for fast integration.
20+
- **Automatic Checkpointing**: Built-in interval-based saving of LoRA adapters and optimizer states.
4521

4622
## Installation
4723

48-
### From Source
4924
```bash
5025
git clone https://github.com/Pomilon/LEMA.git
5126
cd LEMA
5227
pip install -e .
5328
```
5429

55-
### Requirements
56-
- PyTorch >= 2.0.0
57-
- Transformers >= 4.30.0
58-
- Safetensors >= 0.3.0
59-
- Accelerate >= 0.20.0
60-
- PEFT >= 0.4.0
61-
62-
## Usage
63-
64-
LEMA uses a configuration-driven approach:
30+
## Quick Start
6531

6632
```python
67-
from lema.config import LemaConfig, MemoryStrategy
68-
from lema.engine.trainer import LemaTrainer
69-
from lema.models.llama import LlamaAdapter
70-
from lema.core.gbi import GlobalBinaryIndex
71-
from lema.core.lora import LoRAManager
7233
import torch
34+
from lema import LemaConfig, LemaModel, MemoryStrategy
7335

7436
# 1. Configuration
7537
config = LemaConfig(
76-
model_name_or_path="llama2_7b.safetensors",
77-
device="cuda",
78-
strategy=MemoryStrategy.STREAMING, # Disk -> RAM -> VRAM
38+
model_name_or_path="NousResearch/Llama-2-7b-hf",
39+
gbi_path="llama2_7b.safetensors", # Single monolithic safetensors file
40+
strategy=MemoryStrategy.STREAMING,
7941
lora_rank=16,
80-
learning_rate=1e-4,
81-
gradient_checkpointing=True # Essential for large models
42+
gradient_checkpointing=True
8243
)
8344

84-
# 2. Components
85-
# Load HF config dict manually or via AutoConfig
86-
from transformers import AutoConfig
87-
hf_config = AutoConfig.from_pretrained("NousResearch/Llama-2-7b-hf")
88-
adapter = LlamaAdapter(hf_config.to_dict())
89-
90-
gbi = GlobalBinaryIndex(config.gbi_path)
91-
92-
# 3. LoRA Setup
93-
lora_manager = LoRAManager({
94-
"r": config.lora_rank,
95-
"target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]
96-
}, device=config.device)
97-
98-
# Initialize adapter with LoRA
99-
for layer in adapter.get_layer_metadata():
100-
if layer['type'] == 'block':
101-
module = adapter.construct_layer_module(layer['id'], None, lora_manager)
102-
adapter.release_layer_module(module)
103-
104-
# 4. Trainer
105-
optimizer = torch.optim.AdamW(lora_manager.get_trainable_parameters(), lr=config.learning_rate)
106-
107-
trainer = LemaTrainer(
108-
config=config,
109-
model_adapter=adapter,
110-
gbi=gbi,
111-
lora_manager=lora_manager,
112-
optimizer=optimizer
113-
)
45+
# 2. Initialize Model & Trainer
46+
model = LemaModel(config)
47+
model.initialize_lora() # Pre-initialize adapters
48+
49+
optimizer = torch.optim.AdamW(model.get_trainable_parameters(), lr=1e-4)
50+
trainer = model.get_trainer(optimizer)
11451

115-
# 5. Training Step
52+
# 3. Train
11653
input_ids = torch.randint(0, 32000, (1, 512)).cuda()
117-
trainer.train_step(input_ids, labels=input_ids)
54+
logits, loss = trainer.train_step(input_ids, labels=input_ids)
11855
```
11956

120-
## License
121-
MIT License - Copyright (c) 2026 Pomilon
57+
## Documentation
58+
59+
- [**User Guide**](docs/USER_GUIDE.md): Model preparation, conversion, and tips.
60+
- [**API Reference**](docs/API_REFERENCE.md): Detailed class and method specifications.
61+
- [**Architecture**](docs/ARCHITECTURE.md): Deep dive into the memory pipeline and LEMA-loop.
62+
63+
## Kaggle Benchmark
64+
65+
You can run the latest verification suite on Kaggle using the provided notebook:
66+
[**LEMA Benchmark Notebook**](https://www.kaggle.com/code/kloyford/lema-benchmark-notebook)
12267

12368
## Future Roadmap
12469

@@ -128,3 +73,6 @@ While LEMA v1.0 is stable and functional for 7B fine-tuning, I aim to significan
12873
* **Library Integration**: I am working toward deeper integration with Hugging Face `Trainer` and `Accelerate` for seamless usage in existing pipelines.
12974
* **Quantization Support**: I intend to implement native support for 8-bit and 4-bit loading within the streaming pipeline for even lower memory footprints.
13075
* **Model Support**: I am expanding support beyond Llama and GPT-2 to include Mistral, Mixtral (MoE), and other architectures.
76+
77+
## License
78+
MIT License - Copyright (c) 2026 Pomilon

demo_model/config.json

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
1+
{
2+
"activation_function": "gelu_new",
3+
"attn_pdrop": 0.1,
4+
"bos_token_id": 50256,
5+
"embd_pdrop": 0.1,
6+
"eos_token_id": 50256,
7+
"initializer_range": 0.02,
8+
"layer_norm_epsilon": 1e-05,
9+
"model_type": "gpt2",
10+
"n_embd": 128,
11+
"n_head": 4,
12+
"n_inner": null,
13+
"n_layer": 2,
14+
"n_positions": 1024,
15+
"reorder_and_upcast_attn": false,
16+
"resid_pdrop": 0.1,
17+
"scale_attn_by_inverse_layer_idx": false,
18+
"scale_attn_weights": true,
19+
"summary_activation": null,
20+
"summary_first_dropout": 0.1,
21+
"summary_proj_to_labels": true,
22+
"summary_type": "cls_index",
23+
"summary_use_proj": true,
24+
"transformers_version": "4.55.2",
25+
"use_cache": true,
26+
"vocab_size": 1000
27+
}

docs/API_REFERENCE.md

Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,60 @@
1+
# LEMA API Reference
2+
3+
This document provides detailed information about the LEMA (Layer-wise Efficient Memory Abstraction) library API.
4+
5+
## Core API
6+
7+
### `LemaModel`
8+
The primary entry point for the framework. It orchestrates memory management, adapters, and LoRA parameters.
9+
10+
#### `__init__(config: LemaConfig)`
11+
Initializes the model using a `LemaConfig` object.
12+
13+
#### `get_trainer(optimizer: torch.optim.Optimizer)`
14+
Returns a `LemaTrainer` instance pre-configured with this model's components and memory manager.
15+
16+
#### `initialize_lora()`
17+
Pre-initializes all LoRA adapters. Must be called before `get_trainable_parameters()` for new models.
18+
19+
#### `get_trainable_parameters()`
20+
Returns a list of all trainable parameters (LoRA weights) managed by the model.
21+
22+
#### `save_pretrained(save_directory: str)`
23+
Saves the configuration and LoRA adapter weights.
24+
25+
#### `from_pretrained(path: str, **kwargs)` (Class Method)
26+
Loads a LEMA model from a directory containing `lema_config.json` and `adapter_model.bin`.
27+
28+
---
29+
30+
### `LemaConfig`
31+
Configuration dataclass for LEMA.
32+
33+
| Parameter | Type | Default | Description |
34+
| :--- | :--- | :--- | :--- |
35+
| `model_name_or_path` | `str` | Required | HuggingFace ID or path to model directory. |
36+
| `model_type` | `str` | `None` | `llama` or `gpt2`. Auto-detected if None. |
37+
| `gbi_path` | `str` | `None` | Path to the `.safetensors` file. |
38+
| `device` | `str` | `"cuda"` | Execution device. |
39+
| `strategy` | `MemoryStrategy` | `STREAMING` | `STREAMING` or `RESIDENT`. |
40+
| `save_steps` | `int` | `500` | Steps between automatic checkpoints. |
41+
| `output_dir` | `str` | `"output"` | Directory for automatic checkpoints. |
42+
| `lora_rank` | `int` | `16` | LoRA rank (r). |
43+
| `lora_alpha` | `int` | `32` | LoRA alpha. |
44+
| `learning_rate` | `float` | `1e-4` | Learning rate. |
45+
| `gradient_checkpointing`| `bool` | `False` | Enable to save activation VRAM. |
46+
47+
---
48+
49+
### `LemaTrainer`
50+
Orchestrates the training loop with layer-swapping logic.
51+
52+
#### `__init__(config, model_adapter, gbi, lora_manager=None, optimizer=None, memory_manager=None)`
53+
Low-level constructor. Preferred usage is via `LemaModel.get_trainer()`.
54+
55+
#### `train_step(inputs: torch.Tensor, labels: torch.Tensor = None)`
56+
Executes one forward and backward pass. Tracks `global_step` and triggers auto-checkpointing.
57+
- Returns: `(logits, loss_value)`.
58+
59+
#### `save_checkpoint(save_directory: str)`
60+
Saves the model state, configuration, and optimizer state.

docs/ARCHITECTURE.md

Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,46 @@
1+
# LEMA Architecture
2+
3+
This document describes the internal mechanics of the Layer-wise Efficient Memory Abstraction (LEMA) framework.
4+
5+
## The Problem: The VRAM Wall
6+
Standard fine-tuning (even with PEFT/LoRA) requires the entire model weights to be resident in VRAM. For a Llama-2 7B model in FP16, this is ~14GB. Adding optimizer states and activations quickly exceeds the capacity of consumer GPUs (e.g., 16GB).
7+
8+
## The LEMA Solution: Virtualization
9+
LEMA treats GPU VRAM not as a static storage for the model, but as a **dynamic cache** for execution.
10+
11+
### 1. The Triple-Buffer Strategy
12+
LEMA hides data transfer latency by pipelining movements across three memory tiers:
13+
14+
1. **Storage (NVMe)**: Weights reside in `.safetensors` files. Accessed via `mmap` (Zero-copy).
15+
2. **System RAM (Pinned)**: Acting as a "Prefetch Buffer". Pinned memory ensures high-speed Host-to-Device (H2D) transfers.
16+
3. **VRAM (Execution)**: Divided into two "Slots" (Active and Prefetch).
17+
18+
### 2. The Execution Pipeline
19+
While the GPU is computing Layer $N$ in Slot A, LEMA is:
20+
- Asynchronously transferring Layer $N+1$ from RAM to Slot B (VRAM).
21+
- Loading Layer $N+2$ from Disk to RAM (Staging).
22+
23+
When Layer $N$ finishes, the slots swap instantly.
24+
25+
### 3. The LEMA-Loop (Training Logic)
26+
27+
#### Forward Pass
28+
- Model is executed layer-by-layer.
29+
- Only "Boundary Activations" (the output of each layer) are stored in VRAM.
30+
- Intermediate activations are discarded.
31+
32+
#### Backward Pass
33+
- LEMA traverses the layers in reverse.
34+
- For each layer:
35+
1. The weights are swapped back into VRAM.
36+
2. The layer's forward pass is **re-executed** (Segmented Gradient Checkpointing) using the stored boundary activations.
37+
3. Gradients are calculated for the LoRA adapters.
38+
4. Optimizer states for those specific adapters are updated.
39+
40+
### 4. GBI (Global Binary Index)
41+
LEMA uses a specialized indexer to bypass standard PyTorch/Pickle deserialization. By reading the `.safetensors` header, LEMA knows the exact byte offsets for every parameter, allowing it to "slice" the file and load only the parameters needed for the current layer module.
42+
43+
## Performance Trade-offs
44+
- **VRAM Efficiency**: ~50-70% reduction for 7B+ models.
45+
- **Compute Overhead**: 1.2x - 1.8x slowdown compared to fully resident training, depending on PCIe bandwidth and disk speed.
46+
- **System RAM**: Requires space equal to the model size (or less if using aggressive disk streaming).

docs/USER_GUIDE.md

Lines changed: 79 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,79 @@
1+
# LEMA User Guide
2+
3+
This guide covers common workflows for fine-tuning Large Language Models using LEMA on memory-constrained hardware.
4+
5+
## 1. Preparing Your Model
6+
7+
LEMA requires model weights in a single, non-sharded `.safetensors` format. We provide a utility to handle conversion and shared-weight breaking automatically.
8+
9+
### Recommended Conversion
10+
11+
```python
12+
from lema.utils.model_utils import prepare_monolithic_safetensors
13+
14+
# This handles downloading, shared-weight cloning, and monolithic saving
15+
prepare_monolithic_safetensors(
16+
"NousResearch/Llama-2-7b-hf",
17+
"llama2_7b.safetensors",
18+
device="auto" # Use 'auto' to save RAM during conversion if a GPU is available
19+
)
20+
```
21+
22+
## 2. Fine-Tuning Workflow
23+
24+
The standard workflow involves four steps: Configuration, Initialization, Training, and Saving.
25+
26+
### Basic Example
27+
28+
```python
29+
import torch
30+
from lema import LemaConfig, LemaModel, LemaTrainer
31+
32+
# 1. Setup Config
33+
config = LemaConfig(
34+
model_name_or_path="NousResearch/Llama-2-7b-hf",
35+
gbi_path="llama2_7b.safetensors",
36+
lora_rank=16,
37+
gradient_checkpointing=True
38+
)
39+
40+
# 2. Initialize
41+
model = LemaModel(config)
42+
model.initialize_lora() # Crucial for new models
43+
44+
# 3. Training
45+
optimizer = torch.optim.AdamW(model.get_trainable_parameters(), lr=1e-4)
46+
trainer = model.get_trainer(optimizer)
47+
48+
for batch in dataloader:
49+
logits, loss = trainer.train_step(batch['input_ids'], labels=batch['labels'])
50+
print(f"Loss: {loss}")
51+
52+
# 4. Save
53+
trainer.save_checkpoint("checkpoints/lema-llama-7b-v1")
54+
```
55+
56+
## 3. Architecture Specifics
57+
58+
When using LEMA, ensure your `lora_target_modules` in `LemaConfig` match your model's architecture:
59+
- **Llama**: `["q_proj", "v_proj", ...]` (Default)
60+
- **GPT-2**: `["c_attn"]`
61+
62+
## 4. Memory Strategies
63+
64+
LEMA supports two primary strategies in `LemaConfig`:
65+
66+
- **`MemoryStrategy.STREAMING` (Default)**:
67+
- **Path**: Disk -> Pinned RAM -> VRAM.
68+
- **Pros**: Lowest VRAM usage. Can fit models much larger than System RAM if needed (via `mmap`).
69+
- **Cons**: Higher latency due to PCIe/Disk bottleneck.
70+
- **`MemoryStrategy.RESIDENT`**:
71+
- **Path**: RAM -> VRAM.
72+
- **Pros**: Faster than streaming. Model weights stay in RAM.
73+
- **Cons**: Requires enough System RAM to hold the full model weights (~14GB for a 7B FP16 model).
74+
75+
## 4. Tips for Maximum Efficiency
76+
77+
1. **Gradient Checkpointing**: Always enable `gradient_checkpointing=True` for 7B+ models. This significantly reduces VRAM usage during the backward pass by not storing intermediate activations.
78+
2. **Pinned Memory**: LEMA automatically uses pinned memory for transfers. Ensure your system has sufficient RAM available for the staging buffers (~2x the size of the largest layer).
79+
3. **NVMe Storage**: When using `STREAMING` mode, placing your `.safetensors` file on an NVMe SSD will greatly reduce the "Streaming Overhead".

0 commit comments

Comments
 (0)