
Commit b2e1bf3

Initial Public Release: LEMA v1.0.0

37 files changed: 4612 additions & 0 deletions

.github/workflows/tests.yml

Lines changed: 34 additions & 0 deletions
name: LEMA Tests

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install pytest

      - name: Run Tests
        env:
          PYTHONPATH: .
        run: |
          pytest tests/
Lines changed: 27 additions & 0 deletions
name: Verify Installation

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:
  smoke-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install LEMA
        run: |
          pip install .

      - name: Verify Import
        run: |
          python -c "from lema.config import LemaConfig; print('LEMA Config imported successfully')"
          python -c "from lema.models.llama import LlamaAdapter; print('LEMA LlamaAdapter imported successfully')"

.gitignore

Lines changed: 78 additions & 0 deletions
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Virtual Environment
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Jupyter Notebook
.ipynb_checkpoints

# IDEs
.vscode/
.idea/
*.swp
*.swo

# Project specific
backups/
kaggle_logs*/
kaggle_benchmark/
*.safetensors
*.bin
*.pt
*.pth
output/
tmp/
temp/
wandb/

# LEMA specific logs
*_results.txt
kaggle_status*.log
results/

LICENSE

Lines changed: 21 additions & 0 deletions
MIT License

Copyright (c) 2026 Pomilon

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 130 additions & 0 deletions
# LEMA: Layer-wise Efficient Memory Abstraction

**Architectural Specification for VRAM-Efficient Model Fine-Tuning**

LEMA is a specialized framework for fine-tuning Large Language Models (LLMs) on hardware where the model's size exceeds the available VRAM. Unlike standard frameworks, which require the full model to be resident in GPU memory, LEMA treats the model as a collection of discrete, addressable binary segments. Through a virtualized memory-abstraction layer, it asynchronously prefetches layers into VRAM, effectively trading PCIe bandwidth for memory headroom.

## Core Features

### 1. Binary Indexed Engagement
LEMA uses a **Global Binary Index (GBI)** to map `.safetensors` files directly into the process's virtual address space via `mmap`. This gives zero-copy mapping and O(1) access to individual layer weights without deserializing the full model.
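To make the idea concrete, here is a minimal sketch of zero-copy layer access over a `.safetensors` file using only `mmap` and the file's JSON header; the helper names and the example tensor key are illustrative and are not LEMA's actual GBI API.

```python
import json
import mmap
import struct

import numpy as np
import torch

def open_safetensors(path):
    """Memory-map a .safetensors file and parse its JSON index (the header)."""
    f = open(path, "rb")
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header_len = struct.unpack("<Q", mm[:8])[0]        # first 8 bytes: header size
    header = json.loads(mm[8:8 + header_len])          # maps tensor name -> dtype/shape/offsets
    return mm, header, 8 + header_len                  # data section starts after the header

def load_tensor(mm, header, data_start, name):
    """Materialize one tensor by slicing the mapping -- no full-model deserialization."""
    meta = header[name]
    begin, end = meta["data_offsets"]
    raw = mm[data_start + begin : data_start + end]    # only this tensor's bytes are touched
    dtype = {"F32": np.float32, "F16": np.float16}[meta["dtype"]]
    return torch.from_numpy(np.frombuffer(raw, dtype=dtype).copy().reshape(meta["shape"]))

# Hypothetical usage: pull a single projection weight out of a multi-GB checkpoint.
# mm, header, data_start = open_safetensors("llama2_7b.safetensors")
# w = load_tensor(mm, header, data_start, "model.layers.0.self_attn.q_proj.weight")
```

Because the OS pages the mapping in lazily, opening a multi-gigabyte checkpoint costs little more than reading its header.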
### 2. Layer-wise Execution (Patchwork)
Instead of a monolithic `model.forward()`, LEMA decomposes the computational graph into a sequence of isolated layer blocks (see the sketch after this list).
- **Weight Swapping**: Only the current layer and the next layer occupy VRAM.
- **Persistence**: Model weights remain frozen in system RAM/disk; only the LoRA adapters are kept in active memory.
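The control flow behind these two points can be sketched as follows; `build_layer` and `load_layer_weights` are stand-in callables (not LEMA's real trainer internals) that construct one transformer block and fetch its weights from the binary index:

```python
import torch

def streamed_forward(hidden, layer_ids, build_layer, load_layer_weights, device="cuda"):
    """Run a forward pass one transformer block at a time.

    Only the block currently executing holds weights in VRAM; everything else
    stays on disk or in system RAM.
    """
    for layer_id in layer_ids:
        block = build_layer(layer_id)                 # CPU-side module skeleton
        state = load_layer_weights(layer_id)          # tensors read via the binary index
        block.load_state_dict(state, strict=False)
        block.to(device)                              # weights enter VRAM here

        hidden = block(hidden)                        # compute this block

        block.to("cpu")                               # evict the frozen weights again
        del block, state
        torch.cuda.empty_cache()                      # hand the slot back to the allocator
    return hidden
```

In LEMA the eviction of one block and the upload of the next overlap, which is what the triple-buffer strategy below addresses.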
### 3. The Triple-Buffer Strategy
LEMA orchestrates data movement across three tiers to hide PCIe transfer latency (see the sketch after this list):
- **Storage (NVMe)**: The source of truth (the Global Binary File).
- **System RAM**: Pinned memory buffers for staging.
- **VRAM**: An active slot and a prefetch slot for execution.

This lets the CPU push the next layer to VRAM asynchronously while the GPU computes the current layer.
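A sketch of this double-buffered upload pattern using a dedicated CUDA copy stream and pinned staging buffers; the helper names (`prefetch`, `load_cpu_weights`, `run_block`) are illustrative rather than LEMA's internal `TripleBufferManager` interface:

```python
import torch

copy_stream = torch.cuda.Stream()   # dedicated stream for host -> device uploads

def prefetch(cpu_tensors, device="cuda"):
    """Begin an asynchronous upload of one layer's weights on the copy stream."""
    pinned = {n: t.pin_memory() for n, t in cpu_tensors.items()}   # staging in pinned RAM
    gpu = {}
    with torch.cuda.stream(copy_stream):
        for n, t in pinned.items():
            gpu[n] = t.to(device, non_blocking=True)
    return gpu, pinned   # keep the pinned buffers alive until the copy has landed

def run_layers(hidden, layer_ids, load_cpu_weights, run_block):
    """Overlap compute on layer i with the upload of layer i + 1."""
    next_gpu, next_pinned = prefetch(load_cpu_weights(layer_ids[0]))
    for i, layer_id in enumerate(layer_ids):
        torch.cuda.current_stream().wait_stream(copy_stream)   # current weights have arrived
        active_gpu, active_pinned = next_gpu, next_pinned
        if i + 1 < len(layer_ids):
            next_gpu, next_pinned = prefetch(load_cpu_weights(layer_ids[i + 1]))
        hidden = run_block(layer_id, active_gpu, hidden)        # compute overlaps the next upload
        del active_gpu, active_pinned
    return hidden
```

When the copy stream stays ahead of compute, the PCIe transfer cost is largely hidden; when it cannot (small layers, slow bus), what remains shows up as the overhead factors in the benchmarks below.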
## Performance

Benchmarks performed on a Tesla P100 (16GB VRAM), comparing Standard PEFT (LoRA) against LEMA (Streaming).

![VRAM Benchmark](docs/assets/vram_benchmark.png)

| Model | Standard PEFT VRAM | LEMA VRAM | Savings | Status (Fine-Tuning) |
| :--- | :--- | :--- | :--- | :--- |
| **TinyLlama 1.1B** | 2.67 GB | **2.12 GB** | **20.5%** | **Stable** |
| **SmolLM2 1.7B** | 3.88 GB | **3.20 GB** | **17.6%** | **Stable** |
| **Llama-2 7B** | 13.99 GB* | **5.90 GB** | **~58%** | **LEMA Recommended** |

*\*Note: At sequence length 128, Standard PEFT narrowly fits in 16GB VRAM. Increasing the workload to a standard sequence length of 512 causes an immediate **Out-Of-Memory (OOM)** crash. LEMA maintains a consistent ~6GB footprint even as sequence length scales, leaving over **10GB of headroom** for activations and larger batches.*

![Speed Benchmark](docs/assets/speed_benchmark.png)

### The Headroom Advantage
The primary value of LEMA is not merely "fitting" the model, but providing the **computational headroom** needed for real-world training. On a 16GB GPU:
- **Standard PEFT**: Operates at ~88% of VRAM capacity (13.99 GB / 16 GB) just to load the model and run a minimal step, leaving no room for longer contexts or larger batch sizes.
- **LEMA**: Operates at ~37% of VRAM capacity (5.90 GB / 16 GB), leaving room for significantly longer sequences, larger batches, or even larger models (13B+) on the same hardware.
## Installation

### From Source
```bash
git clone https://github.com/Pomilon/LEMA.git
cd LEMA
pip install -e .
```

### Requirements
- PyTorch >= 2.0.0
- Transformers >= 4.30.0
- Safetensors >= 0.3.0
- Accelerate >= 0.20.0
- PEFT >= 0.4.0
## Usage

LEMA uses a configuration-driven approach:

```python
from lema.config import LemaConfig, MemoryStrategy
from lema.engine.trainer import LemaTrainer
from lema.models.llama import LlamaAdapter
from lema.core.gbi import GlobalBinaryIndex
from lema.core.lora import LoRAManager
import torch

# 1. Configuration
config = LemaConfig(
    model_name_or_path="llama2_7b.safetensors",
    device="cuda",
    strategy=MemoryStrategy.STREAMING,  # Disk -> RAM -> VRAM
    lora_rank=16,
    learning_rate=1e-4,
    gradient_checkpointing=True  # Essential for large models
)

# 2. Components
# Load the HF config dict manually or via AutoConfig
from transformers import AutoConfig
hf_config = AutoConfig.from_pretrained("NousResearch/Llama-2-7b-hf")
adapter = LlamaAdapter(hf_config.to_dict())

gbi = GlobalBinaryIndex(config.gbi_path)

# 3. LoRA Setup
lora_manager = LoRAManager({
    "r": config.lora_rank,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]
}, device=config.device)

# Initialize the adapter with LoRA
for layer in adapter.get_layer_metadata():
    if layer['type'] == 'block':
        module = adapter.construct_layer_module(layer['id'], None, lora_manager)
        adapter.release_layer_module(module)

# 4. Trainer
optimizer = torch.optim.AdamW(lora_manager.get_trainable_parameters(), lr=config.learning_rate)

trainer = LemaTrainer(
    config=config,
    model_adapter=adapter,
    gbi=gbi,
    lora_manager=lora_manager,
    optimizer=optimizer
)

# 5. Training Step
input_ids = torch.randint(0, 32000, (1, 512)).cuda()
trainer.train_step(input_ids, labels=input_ids)
```
## License
MIT License - Copyright (c) 2026 Pomilon

## Future Roadmap

While LEMA v1.0 is stable and functional for 7B fine-tuning, I aim to significantly reduce the streaming overhead and expand compatibility.

* **C++/CUDA Backend**: I plan to move the `TripleBufferManager` and memory-streaming logic from Python into a C++ extension or custom CUDA kernels to bypass the GIL and reduce overhead toward the theoretical minimum (~1.1x).
* **Library Integration**: I am working toward deeper integration with Hugging Face `Trainer` and `Accelerate` for seamless use in existing pipelines.
* **Quantization Support**: I intend to add native support for 8-bit and 4-bit loading within the streaming pipeline for an even lower memory footprint.
* **Model Support**: I am expanding support beyond Llama and GPT-2 to include Mistral, Mixtral (MoE), and other architectures.

docs/BENCHMARK_RESULTS.md

Lines changed: 48 additions & 0 deletions
# LEMA Benchmark Results (v0.7 - Release Candidate)

Benchmarks were performed on **Kaggle (Tesla P100 GPU, 16GB VRAM)**.
Comparisons were made between **Standard PEFT (LoRA)** and **LEMA (Streaming Strategy)**.

## 1. VRAM Usage (Memory Efficiency)

LEMA demonstrates significant VRAM savings, particularly for larger models, where the overhead of optimizer states and activations usually causes OOM errors.

![VRAM Benchmark](assets/vram_benchmark.png)

### Detailed Metrics

| Model | Parameters | Standard PEFT VRAM | LEMA VRAM | Savings |
| :--- | :--- | :--- | :--- | :--- |
| **GPT-2 (Small)** | 124M | 0.44 GB | 1.05 GB | N/A\* |
| **TinyLlama** | 1.1B | 2.67 GB | **2.12 GB** | **20.5%** |
| **SmolLM2** | 1.7B | 3.88 GB | **3.20 GB** | **17.6%** |
| **Llama-2** | 7B | **13.99 GB** (Load Only)\*\* | **5.90 GB** | **57.9%** |

*\*Note on GPT-2: For extremely small models, LEMA's fixed buffering overhead exceeds the model size; LEMA is optimized for large-scale models.*
*\*\*Note on Llama-2 7B: Standard PEFT can load the model (13.99 GB) but fails immediately with **Out-Of-Memory (OOM)** when attempting a training step due to gradients/activations. LEMA trains comfortably with >10 GB of headroom.*

---
## 2. Training Speed (Throughput)

LEMA trades execution speed for memory capability: the architecture moves weights from system RAM to VRAM for every layer, which introduces latency.

![Speed Benchmark](assets/speed_benchmark.png)

### Detailed Metrics

| Model | PEFT Speed (s/step) | LEMA Speed (s/step) | Overhead Factor | Status |
| :--- | :--- | :--- | :--- | :--- |
| **TinyLlama 1.1B** | 0.46 s | 1.45 s | **3.1x** | Usable |
| **Llama-2 7B** | **FAILED (OOM)** | **7.21 s** | **N/A** | **Enabling** |

**Analysis**:
- For models that fit in VRAM (1.1B), LEMA introduces roughly 3x overhead (1.45 s / 0.46 s ≈ 3.1) due to Python-based stream orchestration and PCIe transfer latency.
- For models that **do not fit** (7B on 16GB cards), the overhead factor is undefined: LEMA enables training where it was previously impossible.
## 3. Configuration Used

- **LoRA Targets**: All linear layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` for Llama).
- **Sequence Length**: 512.
- **Precision**: FP16.
- **Gradient Checkpointing**: Enabled for 7B, disabled for smaller models.

A sketch of this setup using the public API appears below.
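For reference, a rough sketch of how this configuration maps onto the public API from the README's usage example; the LoRA rank is taken from that example (this document does not state it), and FP16 handling is omitted because the example does not show a precision switch:

```python
import torch
from lema.config import LemaConfig, MemoryStrategy
from lema.core.lora import LoRAManager

# Llama-2 7B benchmark run, expressed with the parameters shown in the README.
config = LemaConfig(
    model_name_or_path="llama2_7b.safetensors",
    device="cuda",
    strategy=MemoryStrategy.STREAMING,
    lora_rank=16,                     # rank from the README example, not stated above
    learning_rate=1e-4,
    gradient_checkpointing=True,      # enabled for the 7B run
)

# LoRA on all linear layers, matching the benchmark's target list.
lora_manager = LoRAManager({
    "r": config.lora_rank,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}, device=config.device)

# Sequence length 512, batch size 1, as benchmarked.
input_ids = torch.randint(0, 32000, (1, 512), device="cuda")
```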
