# LEMA: Layer-wise Efficient Memory Abstraction

**Architectural Specification for VRAM-Efficient Model Fine-Tuning**

LEMA is a specialized framework designed to facilitate the fine-tuning of Large Language Models (LLMs) on hardware where model size exceeds available VRAM. Unlike standard frameworks that require the full model to be resident in GPU memory, LEMA treats the model as a collection of discrete, addressable binary segments. By implementing a virtualized memory abstraction layer, LEMA performs asynchronous pre-fetching of layers into VRAM, effectively trading PCIe bandwidth for memory headroom.
| 6 | + |
## Core Features

### 1. Binary Indexed Engagement
LEMA utilizes a **Global Binary Index (GBI)** to map `.safetensors` files directly into the process's virtual address space using `mmap`. This allows for zero-copy mapping and O(1) access to specific layer weights without full model deserialization.
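The mechanism can be sketched with the standard library alone. The snippet below writes a minimal `.safetensors`-style file (8-byte little-endian header length, JSON header, raw payload), then `mmap`s it and builds a name-to-(offset, length) index. `TinyBinaryIndex`, the tensor names, and the file contents are illustrative stand-ins, not LEMA's actual implementation:

```python
import json
import mmap
import os
import struct
import tempfile

def write_tiny_safetensors(path):
    # Minimal .safetensors layout: u64 LE header size, JSON header, raw data.
    header = {
        "layer.0.weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]},
        "layer.1.weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [16, 32]},
    }
    blob = json.dumps(header).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)))
        f.write(blob)
        f.write(struct.pack("<8f", *range(8)))  # two fake 2x2 float32 layers

class TinyBinaryIndex:
    """Sketch of a GBI: map tensor names to byte spans in an mmap'd file."""
    def __init__(self, path):
        self._f = open(path, "rb")
        self._mm = mmap.mmap(self._f.fileno(), 0, access=mmap.ACCESS_READ)
        (hsize,) = struct.unpack("<Q", self._mm[:8])
        meta = json.loads(self._mm[8 : 8 + hsize].decode())
        base = 8 + hsize  # data section starts right after the JSON header
        self.index = {
            name: (base + lo, hi - lo)
            for name, ent in meta.items()
            if name != "__metadata__"
            for lo, hi in [ent["data_offsets"]]
        }

    def raw_bytes(self, name):
        # memoryview over the mapping: touches only the requested span, no copy.
        start, length = self.index[name]
        return memoryview(self._mm)[start : start + length]

path = os.path.join(tempfile.mkdtemp(), "tiny.safetensors")
write_tiny_safetensors(path)
gbi = TinyBinaryIndex(path)
layer1 = struct.unpack("<4f", gbi.raw_bytes("layer.1.weight"))
```

Because the header records byte offsets for every tensor, looking up one layer is a dictionary hit plus a bounded read from the mapping; nothing else in the file is deserialized.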
| 11 | + |
### 2. Layer-wise Execution (Patchwork)
Instead of a monolithic `model.forward()`, LEMA decomposes the computational graph into a sequence of isolated layer blocks.
- **Weight Swapping**: Only the current layer and the next layer occupy VRAM.
- **Persistence**: Model weights remain frozen in System RAM/Disk; only LoRA adapters are maintained in active memory.
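As a rough illustration (a toy sketch, not LEMA's actual API), the execution pattern reduces to a loop in which only one layer's frozen base weights are materialized at a time, while the small trainable LoRA deltas stay resident throughout:

```python
# Toy sketch: FROZEN_BASE stands in for weights parked on disk/RAM;
# vram_slot stands in for the single resident layer slot in VRAM.
FROZEN_BASE = {i: 0.5 for i in range(4)}   # frozen scalar "weights" per layer
lora_deltas = {i: 0.0 for i in range(4)}   # trainable adapters, always resident

def layerwise_forward(x):
    for i in range(4):
        vram_slot = FROZEN_BASE[i]             # "load" layer i into VRAM
        x = x * (vram_slot + lora_deltas[i])   # apply base weight + LoRA delta
        del vram_slot                          # "evict" before the next layer
    return x
```

With all deltas at zero this reduces to the frozen base: `layerwise_forward(1.0)` gives `0.5**4 = 0.0625`. Training only ever updates `lora_deltas`, so the base weights never need to live in VRAM simultaneously.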
| 16 | + |
### 3. The Triple-Buffer Strategy
LEMA orchestrates data movement across three tiers to hide the latency of PCIe transfers:
- **Storage (NVMe)**: The source of truth (Global Binary File).
- **System RAM**: Pinned Memory Buffers for staging.
- **VRAM**: Active Slot / Prefetch Slot for execution.

This strategy allows for asynchronous prefetching, where the CPU pushes the next layer to VRAM while the GPU computes the current layer.
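This producer/consumer overlap can be sketched with a bounded queue standing in for the prefetch slot. This is a simplification of real CUDA-stream overlap; the names and timings below are illustrative:

```python
import queue
import threading
import time

LAYER_IDS = list(range(6))            # layers as laid out in the binary file

def fetch(layer_id):
    time.sleep(0.005)                 # simulated NVMe -> pinned RAM -> VRAM copy
    return f"weights[{layer_id}]"

def prefetcher(slot):
    # Producer: stages layer i+1 while the consumer computes layer i.
    for lid in LAYER_IDS:
        slot.put(fetch(lid))          # blocks while the prefetch slot is full
    slot.put(None)                    # sentinel: no more layers

slot = queue.Queue(maxsize=1)         # the single prefetch slot
threading.Thread(target=prefetcher, args=(slot,), daemon=True).start()

executed = []
while (weights := slot.get()) is not None:
    executed.append(weights)          # "compute" with the active layer
```

The `maxsize=1` bound is what makes this a buffer strategy rather than a full preload: the producer can run at most one layer ahead, so peak "VRAM" occupancy stays at the active slot plus one prefetch slot regardless of model depth.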
| 24 | + |
## Performance

Benchmarks were performed on a Tesla P100 (16 GB VRAM), comparing standard PEFT (LoRA) against LEMA (streaming).

| Model | Standard PEFT VRAM | LEMA VRAM | Savings | Status (Fine-Tuning) |
| :--- | :--- | :--- | :--- | :--- |
| **TinyLlama 1.1B** | 2.67 GB | **2.12 GB** | **20.5%** | **Stable** |
| **SmolLM2 1.7B** | 3.88 GB | **3.20 GB** | **17.6%** | **Stable** |
| **Llama-2 7B** | 13.99 GB* | **5.90 GB** | **~58%** | **LEMA Recommended** |

*\*Note: At sequence length 128, Standard PEFT narrowly fits in 16 GB of VRAM. However, increasing the workload to a standard sequence length of 512 causes an immediate **Out-Of-Memory (OOM)** crash. LEMA maintains a consistent ~6 GB footprint even as sequence length scales, providing over **10 GB of headroom** for activations and larger batches.*

### The Headroom Advantage
The primary value of LEMA is not just "fitting" the model, but providing the **computational headroom** necessary for real-world training. On a 16 GB GPU:
- **Standard PEFT**: Operates at ~88% VRAM capacity just to load the model and run a minimal step, leaving no room for longer contexts or larger batch sizes.
- **LEMA**: Operates at ~37% VRAM capacity, allowing significantly larger sequence lengths, higher batch sizes, or even larger models (13B+) on the same hardware.
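The percentages above follow directly from the Llama-2 7B row of the benchmark table:

```python
VRAM_GB = 16.0
standard_peft_gb = 13.99          # Llama-2 7B, Standard PEFT (seq len 128)
lema_gb = 5.90                    # Llama-2 7B, LEMA streaming

standard_util = standard_peft_gb / VRAM_GB   # ~0.87 of the card
lema_util = lema_gb / VRAM_GB                # ~0.37 of the card
headroom_gb = VRAM_GB - lema_gb              # >10 GB left for activations
```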
| 45 | + |
## Installation

### From Source
```bash
git clone https://github.com/Pomilon/LEMA.git
cd LEMA
pip install -e .
```

### Requirements
- PyTorch >= 2.0.0
- Transformers >= 4.30.0
- Safetensors >= 0.3.0
- Accelerate >= 0.20.0
- PEFT >= 0.4.0
| 61 | + |
## Usage

LEMA uses a configuration-driven approach:

```python
from lema.config import LemaConfig, MemoryStrategy
from lema.engine.trainer import LemaTrainer
from lema.models.llama import LlamaAdapter
from lema.core.gbi import GlobalBinaryIndex
from lema.core.lora import LoRAManager
import torch

# 1. Configuration
config = LemaConfig(
    model_name_or_path="llama2_7b.safetensors",
    device="cuda",
    strategy=MemoryStrategy.STREAMING,  # Disk -> RAM -> VRAM
    lora_rank=16,
    learning_rate=1e-4,
    gradient_checkpointing=True  # Essential for large models
)

# 2. Components
# Load HF config dict manually or via AutoConfig
from transformers import AutoConfig
hf_config = AutoConfig.from_pretrained("NousResearch/Llama-2-7b-hf")
adapter = LlamaAdapter(hf_config.to_dict())

gbi = GlobalBinaryIndex(config.gbi_path)

# 3. LoRA Setup
lora_manager = LoRAManager({
    "r": config.lora_rank,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]
}, device=config.device)

# Initialize adapter with LoRA
for layer in adapter.get_layer_metadata():
    if layer['type'] == 'block':
        module = adapter.construct_layer_module(layer['id'], None, lora_manager)
        adapter.release_layer_module(module)

# 4. Trainer
optimizer = torch.optim.AdamW(lora_manager.get_trainable_parameters(), lr=config.learning_rate)

trainer = LemaTrainer(
    config=config,
    model_adapter=adapter,
    gbi=gbi,
    lora_manager=lora_manager,
    optimizer=optimizer
)

# 5. Training Step
input_ids = torch.randint(0, 32000, (1, 512)).cuda()
trainer.train_step(input_ids, labels=input_ids)
```
| 119 | + |
## License
MIT License - Copyright (c) 2026 Pomilon

## Future Roadmap

While LEMA v1.0 is stable and functional for 7B fine-tuning, I aim to significantly reduce the streaming overhead and expand compatibility.

* **C++/CUDA Backend**: I plan to move the `TripleBufferManager` and memory streaming logic from Python to a C++ extension or custom CUDA kernels to bypass the GIL and reduce overhead to the theoretical minimum (~1.1x).
* **Library Integration**: I am working toward deeper integration with Hugging Face `Trainer` and `Accelerate` for seamless usage in existing pipelines.
* **Quantization Support**: I intend to implement native support for 8-bit and 4-bit loading within the streaming pipeline for even lower memory footprints.
* **Model Support**: I am expanding support beyond Llama and GPT-2 to include Mistral, Mixtral (MoE), and other architectures.