- Unify API around LemaModel and model.get_trainer()
- Centralize model conversion and utility logic in lema.utils.model_utils
- Add automatic interval-based checkpointing to LemaTrainer
- Update Kaggle generation tool to create a comprehensive research workspace
- Refresh all documentation (README, User Guide, API Reference, Architecture)
- Verify 7B model stability via Kaggle benchmarks
**Virtualize GPU VRAM for LLM Fine-Tuning**
LEMA is a specialized framework designed to facilitate the fine-tuning of Large Language Models (LLMs) on hardware where model size exceeds available VRAM. Unlike standard frameworks, which require the full model to be resident in GPU memory, LEMA treats the model as a collection of discrete, addressable binary segments and asynchronously prefetches layers into VRAM, trading PCIe bandwidth for memory headroom. This **Triple-Buffer Strategy** allows 7B+ models to be trained on GPUs with as little as 16GB of VRAM.
## Core Features
### 1. Binary Indexed Engagement
LEMA utilizes a **Global Binary Index (GBI)** to map `.safetensors` files directly into the process's virtual address space using `mmap`. This allows for zero-copy mapping and O(1) access to specific layer weights without full model deserialization.
### 2. Layer-wise Execution (Patchwork)
Instead of a monolithic `model.forward()`, LEMA decomposes the computational graph into a sequence of isolated layer blocks.
- **Weight Swapping**: Only the current layer and the next layer occupy VRAM.
- **Persistence**: Model weights remain frozen in System RAM/disk; only LoRA adapters are maintained in active memory.
### 3. The Triple-Buffer Strategy
LEMA orchestrates data movement across three tiers to hide the latency of PCIe transfers:
- **Storage (NVMe)**: The source of truth (the Global Binary File).
- **System RAM**: Pinned memory buffers for staging.
- **VRAM**: Active slot / prefetch slot for execution.
This strategy allows for asynchronous prefetching, where the CPU pushes the next layer to VRAM while the GPU computes the current layer.
### 4. High-Level API

Simplified `LemaModel` and `LemaTrainer` interfaces for fast integration.

### 5. Automatic Checkpointing

Built-in interval-based saving of LoRA adapters and optimizer states.

## Key Performance (Tesla P100 - 16GB)

Benchmarks were performed on a Tesla P100 (16GB VRAM), comparing Standard PEFT (LoRA) against LEMA (Streaming).

*Note: At sequence length 128, Standard PEFT narrowly fits in 16GB of VRAM. Increasing the workload to a standard sequence length of 512, however, causes an immediate **Out-Of-Memory (OOM)** crash. LEMA maintains a consistent ~6GB footprint even as sequence length scales, providing over **10GB of headroom** for activations and larger batches.*

The primary value of LEMA is not just "fitting" the model, but providing the **computational headroom** necessary for real-world training. On a 16GB GPU:

- **Standard PEFT**: Operates at ~88% VRAM capacity just to load the model and run a minimal step, leaving no room for longer contexts or higher batch sizes.
- **LEMA**: Operates at ~37% VRAM capacity, allowing significantly larger sequence lengths, higher batch sizes, or even larger models (13B+) on the same hardware.
## Installation

### From Source

```bash
git clone https://github.com/Pomilon/LEMA.git
cd LEMA
pip install -e .
```

### Requirements

- PyTorch >= 2.0.0
- Transformers >= 4.30.0
- Safetensors >= 0.3.0
- Accelerate >= 0.20.0
- PEFT >= 0.4.0
## Quick Start

LEMA uses a configuration-driven approach:

```python
import torch
from lema import LemaConfig, LemaModel, MemoryStrategy

# 1. Configuration
config = LemaConfig(
    model_name_or_path="NousResearch/Llama-2-7b-hf",
    gbi_path="llama2_7b.safetensors",   # single monolithic safetensors file
    strategy=MemoryStrategy.STREAMING,  # Disk -> RAM -> VRAM
    lora_rank=16,
    learning_rate=1e-4,
    gradient_checkpointing=True,        # essential for large models
)
```
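The remaining steps follow the unified `LemaModel` / `model.get_trainer()` API described in the changelog above. A minimal sketch, with `train_dataset` standing in for your tokenized dataset:

```python
# 2. Train (sketch; get_trainer() per the unified API, dataset prep omitted)
model = LemaModel(config)
trainer = model.get_trainer(train_dataset)  # train_dataset: placeholder
trainer.train()
```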
## Roadmap

While LEMA v1.0 is stable and functional for 7B fine-tuning, I aim to significantly extend it:
* **Library Integration**: I am working toward deeper integration with Hugging Face `Trainer` and `Accelerate` for seamless usage in existing pipelines.
* **Quantization Support**: I intend to implement native support for 8-bit and 4-bit loading within the streaming pipeline for even lower memory footprints.
* **Model Support**: I am expanding support beyond Llama and GPT-2 to include Mistral, Mixtral (MoE), and other architectures.
---

This document describes the internal mechanics of the Layer-wise Efficient Memory Abstraction (LEMA) framework.
## The Problem: The VRAM Wall
Standard fine-tuning (even with PEFT/LoRA) requires the entire model's weights to be resident in VRAM. For a Llama-2 7B model in FP16, that is ~14GB; adding optimizer states and activations quickly exceeds the capacity of consumer GPUs (e.g., 16GB).
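The arithmetic behind that figure, as a quick sanity check:

```python
params = 7e9       # Llama-2 7B parameter count
bytes_fp16 = 2     # bytes per FP16 parameter
print(f"~{params * bytes_fp16 / 1e9:.0f} GB for weights alone")  # ~14 GB,
# before optimizer states, gradients, and activations are even counted
```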
## The LEMA Solution: Virtualization
LEMA treats GPU VRAM not as static storage for the model, but as a **dynamic cache** for execution.
### 1. The Triple-Buffer Strategy
LEMA hides data transfer latency by pipelining movements across three memory tiers:
1. **Storage (NVMe)**: Weights reside in `.safetensors` files, accessed via `mmap` (zero-copy).
2. **System RAM (Pinned)**: Acts as a prefetch buffer; pinned memory ensures high-speed Host-to-Device (H2D) transfers.
3. **VRAM (Execution)**: Divided into two "slots" (Active and Prefetch).
### 2. The Execution Pipeline
While the GPU is computing Layer $N$ in Slot A, LEMA is:
- Asynchronously transferring Layer $N+1$ from RAM to Slot B (VRAM).
- Loading Layer $N+2$ from disk to RAM (staging).

When Layer $N$ finishes, the slots swap instantly; the sketch below illustrates the two-slot loop.
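A minimal sketch of that loop using CUDA streams. This is illustrative rather than LEMA's actual code: each "layer" is reduced to a single pinned FP16 weight matrix, and `streamed_forward` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def streamed_forward(weights_cpu, x):
    """Compute layer N on the default stream while copying layer N+1
    on a side stream (Slot A / Slot B ping-pong)."""
    copy_stream = torch.cuda.Stream()
    active = weights_cpu[0].to("cuda", non_blocking=True)  # fill Slot A
    for i in range(len(weights_cpu)):
        prefetch = None
        if i + 1 < len(weights_cpu):
            with torch.cuda.stream(copy_stream):
                # H2D copy overlaps with the matmul below
                prefetch = weights_cpu[i + 1].to("cuda", non_blocking=True)
        x = F.relu(F.linear(x, active))                  # "Layer N" in Slot A
        torch.cuda.current_stream().wait_stream(copy_stream)
        active = prefetch                                # swap slots
    return x

# Usage: pinned CPU weights are what make the copies truly asynchronous
layers = [torch.randn(4096, 4096, dtype=torch.float16).pin_memory()
          for _ in range(8)]
out = streamed_forward(layers, torch.randn(1, 4096, dtype=torch.float16, device="cuda"))
```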
### 3. The LEMA-Loop (Training Logic)
#### Forward Pass
- The model is executed layer by layer.
- Only "Boundary Activations" (the output of each layer) are stored in VRAM.
- Intermediate activations are discarded.
#### Backward Pass
- LEMA traverses the layers in reverse.
- For each layer (see the sketch below):
  1. The weights are swapped back into VRAM.
  2. The layer's forward pass is **re-executed** (Segmented Gradient Checkpointing) using the stored boundary activations.
  3. Gradients are calculated for the LoRA adapters.
  4. Optimizer states for those specific adapters are updated.
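The recomputation in step 2 maps directly onto PyTorch's stock checkpointing utility. A minimal sketch of the idea (LEMA layers the per-layer VRAM weight swap on top of this):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

def segmented_forward(layers: nn.ModuleList, x: torch.Tensor) -> torch.Tensor:
    # Each checkpoint() call stores only the layer's input (the boundary
    # activation) and re-runs the layer's forward during the backward pass.
    for layer in layers:
        x = checkpoint(layer, x, use_reentrant=False)
    return x
```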
### 4. GBI (Global Binary Index)
LEMA uses a specialized indexer to bypass standard PyTorch/Pickle deserialization. By reading the `.safetensors` header, LEMA knows the exact byte offsets for every parameter, allowing it to "slice" the file and load only the parameters needed for the current layer module.
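Concretely, a safetensors file begins with an 8-byte little-endian header length, followed by a JSON header mapping each tensor name to its dtype, shape, and byte offsets. A minimal sketch of offset-based loading (illustrative; `read_tensor` is not LEMA's actual API):

```python
import json
import mmap
import struct

import torch

_DTYPES = {"F16": torch.float16, "F32": torch.float32, "BF16": torch.bfloat16}

def read_tensor(path: str, name: str) -> torch.Tensor:
    with open(path, "rb") as f:
        # Copy-on-write mapping: file stays read-only, pages appear writable
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
    header_len = struct.unpack("<Q", mm[:8])[0]      # 8-byte header size
    header = json.loads(mm[8:8 + header_len])        # JSON tensor index
    meta = header[name]
    start, end = meta["data_offsets"]                # relative to data section
    base = 8 + header_len
    view = memoryview(mm)[base + start:base + end]   # zero-copy slice
    t = torch.frombuffer(view, dtype=_DTYPES[meta["dtype"]])
    return t.view(*meta["shape"])
```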
## Performance Trade-offs
- **VRAM Efficiency**: ~50-70% reduction for 7B+ models.
- **Compute Overhead**: 1.2x-1.8x slowdown compared to fully resident training, depending on PCIe bandwidth and disk speed.
- **System RAM**: Requires space roughly equal to the model size (or less when using aggressive disk streaming).
---

This guide covers common workflows for fine-tuning Large Language Models using LEMA on memory-constrained hardware.
## 1. Preparing Your Model
LEMA requires model weights in a single, non-sharded `.safetensors` file. We provide a utility that handles the conversion, including breaking shared (tied) weights, automatically.
### Recommended Conversion
```python
from lema.utils.model_utils import prepare_monolithic_safetensors

# This handles downloading, shared-weight cloning, and monolithic saving
prepare_monolithic_safetensors(
    "NousResearch/Llama-2-7b-hf",
    "llama2_7b.safetensors",
    device="auto",  # use "auto" to save RAM during conversion if a GPU is available
)
```
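The shared-weight step exists because `safetensors` refuses to serialize tensors that alias the same storage. A hedged sketch of the idea using GPT-2, whose `lm_head` is tied to the token embedding (illustrative, not the utility's actual code):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# GPT-2 ties lm_head to the token embedding; clone it so every tensor
# owns its own storage before the monolithic safetensors save.
head, emb = model.lm_head.weight, model.transformer.wte.weight
if head.data_ptr() == emb.data_ptr():
    model.lm_head.weight = torch.nn.Parameter(head.clone())
```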
## 2. Fine-Tuning Workflow
The standard workflow involves four steps: Configuration, Initialization, Training, and Saving.
### Basic Example
```python
import torch
from lema import LemaConfig, LemaModel, LemaTrainer
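
# -- The rest of this example is a sketch: LemaConfig fields are those
# -- documented in this guide; get_trainer() follows the unified API noted
# -- in the changelog; the save call is illustrative, not confirmed API.

# 1. Configuration
config = LemaConfig(
    model_name_or_path="NousResearch/Llama-2-7b-hf",
    gbi_path="llama2_7b.safetensors",
    lora_rank=16,
    learning_rate=1e-4,
    gradient_checkpointing=True,
)

# 2. Initialization
model = LemaModel(config)

# 3. Training ("train_dataset" is a placeholder for your tokenized dataset)
trainer = model.get_trainer(train_dataset)
trainer.train()

# 4. Saving (interval-based checkpointing also runs automatically during training)
trainer.save_model("outputs/")  # illustrative method name
```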
## 3. LoRA Target Modules

When using LEMA, ensure your `lora_target_modules` in `LemaConfig` match your model's architecture:
- **Llama**: `["q_proj", "v_proj", ...]` (default)
- **GPT-2**: `["c_attn"]`
## 4. Memory Strategies
LEMA supports two primary strategies in `LemaConfig`:
- **`MemoryStrategy.STREAMING` (Default)**:
  - **Path**: Disk -> Pinned RAM -> VRAM.
  - **Pros**: Lowest VRAM usage. Can fit models much larger than System RAM if needed (via `mmap`).
  - **Cons**: Higher latency due to the PCIe/disk bottleneck.
- **`MemoryStrategy.RESIDENT`**:
  - **Path**: RAM -> VRAM.
  - **Pros**: Faster than streaming; model weights stay in RAM.
  - **Cons**: Requires enough System RAM to hold the full model weights (~14GB for a 7B FP16 model).
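Switching strategies is a single field in the config; for example (reusing the fields shown in the Quick Start):

```python
from lema import LemaConfig, MemoryStrategy

# Trade ~14GB of System RAM for lower step latency
config = LemaConfig(
    model_name_or_path="NousResearch/Llama-2-7b-hf",
    gbi_path="llama2_7b.safetensors",
    strategy=MemoryStrategy.RESIDENT,
)
```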
## 5. Tips for Maximum Efficiency
1. **Gradient Checkpointing**: Always enable `gradient_checkpointing=True` for 7B+ models. This significantly reduces VRAM usage during the backward pass by not storing intermediate activations.
2. **Pinned Memory**: LEMA automatically uses pinned memory for transfers. Ensure your system has sufficient RAM available for the staging buffers (~2x the size of the largest layer; see the sketch below).
3. **NVMe Storage**: When using `STREAMING` mode, placing your `.safetensors` file on an NVMe SSD will greatly reduce the streaming overhead.
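As a rough illustration of the pinned-memory rule of thumb above (buffer sizes are hypothetical; LEMA manages these buffers internally):

```python
import torch

# Double-buffer staging sized to the largest layer, per the ~2x rule of thumb
largest_layer_bytes = 500 * 2**20  # e.g., ~500 MiB for a large 7B layer block
staging = [
    torch.empty(largest_layer_bytes, dtype=torch.uint8, pin_memory=True)
    for _ in range(2)  # one active buffer, one prefetch buffer
]
```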