# LODA LLM: Natural Language to Assembly Code Generation

This module extends the LODA Python project with Large Language Model (LLM) capabilities for generating LODA assembly code from natural language descriptions of integer sequences.

## Overview

The LODA LLM system can understand descriptions like "Fibonacci numbers" or "squares of positive integers" and generate corresponding LODA assembly programs that compute these sequences.

### Key Features

- **Transformer-based Architecture**: Uses a T5 encoder-decoder model for sequence-to-sequence translation
- **OEIS Integration**: Trained on 145,000+ OEIS sequence descriptions and LODA programs
- **Robust Preprocessing**: Extracts and augments training data from existing LODA programs
- **Comprehensive Evaluation**: Validates generated programs and evaluates sequence correctness
- **Interactive Interface**: Command-line tool for real-time code generation

## Architecture

```
Natural Language → T5 Encoder → Hidden Representation → T5 Decoder → LODA Code
        ↓                                                                ↓
"Fibonacci numbers"                                             "mov $1,$0\n..."
```

### Components

1. **Data Preprocessing** (`data_preprocessing.py`)
   - Extracts sequence descriptions from LODA program comments
   - Creates training pairs of (description, LODA code)
   - Augments data with description variations
   - Handles data cleaning and validation

2. **Model Architecture** (`model.py`)
   - T5-based encoder-decoder transformer
   - Custom LODA tokenizer for assembly syntax
   - Text format conversion for T5 compatibility
   - Model saving/loading utilities

3. **Training Pipeline** (`trainer.py`)
   - PyTorch training loop with proper batching
   - Learning rate scheduling and gradient clipping
   - Validation and checkpointing
   - Support for different T5 model sizes

4. **Inference & Evaluation** (`inference.py`)
   - Code generation from natural language
   - Program validation and sequence testing
   - Evaluation metrics (validity, accuracy)
   - Interactive generation interface

## Installation

1. Install dependencies:
```bash
pip install -r requirements.txt
```

2. The new LLM dependencies include:
   - `torch>=1.9.0` - PyTorch for deep learning
   - `transformers>=4.20.0` - Hugging Face transformers (T5)
   - `datasets>=2.0.0` - Data loading utilities
   - `tqdm>=4.62.0` - Progress bars
   - `scikit-learn>=1.0.0` - Evaluation metrics

## Usage

### 1. Prepare Training Data

```python
from loda.ml.llm.data_preprocessing import create_dataset

# Create training dataset from OEIS programs
dataset = create_dataset(
    programs_dir="programs/oeis",
    output_file="loda_training_data.json",
    max_examples=10000,  # Use subset for faster training
    augment=True  # Create description variations
)
```

### 2. Train the Model

```python
from loda.ml.llm.trainer import train_loda_llm

# Train the model
model = train_loda_llm(
    programs_dir="programs/oeis",
    output_dir="trained_model",
    model_name="t5-small",  # or "t5-base", "t5-large"
    max_examples=10000,
    num_epochs=3,
    batch_size=8
)
```

Command-line training:
```bash
python -m loda.ml.llm.trainer \
    --programs_dir programs/oeis \
    --output_dir trained_model \
    --max_examples 10000 \
    --num_epochs 3
```

### 3. Generate Code

```python
from loda.ml.llm.inference import load_model_for_inference

# Load trained model
generator = load_model_for_inference("trained_model")

# Generate code
results = generator.generate("Fibonacci numbers")
for result in results:
    print(f"Generated: {result.generated_code}")
    print(f"Valid: {result.is_valid}")
    if result.generated_sequence:
        print(f"Sequence: {result.generated_sequence}")
```

Interactive mode:
```bash
python -m loda.ml.llm.inference --mode interactive --model_path trained_model
```

### 4. Evaluate Performance

```python
from loda.ml.llm.inference import evaluate_model

# Evaluate on test set
metrics, results = evaluate_model("trained_model", "test_data.json")
print(f"Valid program rate: {metrics['valid_program_rate']:.1%}")
print(f"Sequence match rate: {metrics['sequence_match_rate']:.1%}")
```

## Training Data Format

Training examples are JSON objects with the following structure:

```json
{
  "sequence_id": "A000045",
  "description": "Fibonacci numbers: F(n) = F(n-1) + F(n-2) with F(0) = 0 and F(1) = 1",
  "loda_code": "mov $1,$0\nmov $4,1\nlpb $0\n...",
  "terms": [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
}
```
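
A minimal sketch of reading this file back, assuming `create_dataset` above writes a JSON list of such objects (the exact on-disk layout is an assumption):

```python
import json

# Hedged sketch: load the training file produced by create_dataset()
# and inspect a few examples.
with open("loda_training_data.json") as f:
    examples = json.load(f)

for ex in examples[:3]:
    print(ex["sequence_id"], "->", ex["description"])
    print("first terms:", ex["terms"][:5])
```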

## Model Configuration

### Supported T5 Models

- `t5-small` (60M parameters) - Fast training, good for experimentation
- `t5-base` (220M parameters) - Better quality, moderate resource requirements
- `t5-large` (770M parameters) - Best quality, high resource requirements

### Training Parameters

```python
# Recommended settings for different use cases

# Quick experimentation
train_loda_llm(
    model_name="t5-small",
    max_examples=1000,
    batch_size=16,
    num_epochs=1,
    learning_rate=1e-4
)

# Production training
train_loda_llm(
    model_name="t5-base",
    max_examples=-1,  # Use all data
    batch_size=8,
    num_epochs=5,
    learning_rate=5e-5
)
```

## Evaluation Metrics

The system provides several evaluation metrics:

- **Valid Program Rate**: Percentage of generated programs that parse and execute
- **Exact Match Rate**: Percentage of generated programs that match the target program exactly
- **Sequence Match Rate**: Percentage of generated programs that produce the correct sequence terms
- **Generation Time**: Average time to generate a program
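
A rough sketch of how the validity and sequence-match rates could be computed from a batch of generation results (the result attributes `is_valid` and `generated_sequence` follow the usage example above; the pairing with expected terms is an assumption):

```python
# Hedged sketch only: aggregate rates over (result, expected_terms) pairs.
def compute_rates(results, expected_sequences):
    total = len(results)
    valid = sum(1 for r in results if r.is_valid)
    matched = sum(
        1
        for r, expected in zip(results, expected_sequences)
        if r.generated_sequence and list(r.generated_sequence) == list(expected)
    )
    return {
        "valid_program_rate": valid / total if total else 0.0,
        "sequence_match_rate": matched / total if total else 0.0,
    }
```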

## Implementation Details

### LODA Tokenization

The system uses a custom tokenizer designed for LODA assembly:

```python
# LODA operations
operations = ['mov', 'add', 'sub', 'mul', 'div', 'lpb', 'lpe', ...]

# Memory operands
operands = ['$0', '$1', '$2', '$$1', '$$2', ...]

# Constants
constants = ['0', '1', '2', '-1', ...]
```
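
For illustration, a tokenizer along these lines could split each instruction on commas and whitespace; this is a minimal sketch, not the actual `LodaTokenizer` from `model.py`:

```python
import re

# Illustrative sketch: split LODA instructions into operation and operand tokens.
def tokenize_loda(code: str) -> list[str]:
    tokens = []
    for line in code.splitlines():
        line = line.split(";")[0].strip()  # drop LODA comments
        if not line:
            continue
        # "mov $1,$0" -> ["mov", "$1", "$0"]
        tokens.extend(re.split(r"[,\s]+", line))
    return tokens

print(tokenize_loda("mov $1,$0\nadd $1,5"))  # ['mov', '$1', '$0', 'add', '$1', '5']
```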

### Text Format Conversion

Since T5 expects text input/output, LODA code is converted to a text representation:

```
Original LODA: mov $1,$0
               add $1,5

Text format:   mov $1 $0 | add $1 5
```
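
A minimal round-trip sketch of this conversion (the helper names are illustrative; the actual conversion lives in `model.py`):

```python
import re

# Hedged sketch: join instructions with " | " and turn commas into spaces,
# then reverse the mapping to recover newline/comma LODA syntax.
def loda_to_text(code: str) -> str:
    lines = [re.sub(r"[,\s]+", " ", line.strip())
             for line in code.splitlines() if line.strip()]
    return " | ".join(lines)

def text_to_loda(text: str) -> str:
    instructions = []
    for chunk in text.split("|"):
        parts = chunk.split()
        if not parts:
            continue
        op, operands = parts[0], parts[1:]
        if operands:
            instructions.append(op + " " + ",".join(operands))
        else:
            instructions.append(op)
    return "\n".join(instructions)

assert text_to_loda(loda_to_text("mov $1,$0\nadd $1,5")) == "mov $1,$0\nadd $1,5"
```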

### Data Augmentation

Training descriptions are augmented to improve robustness:

```
Original: "Fibonacci numbers"
Augmented:
- "Sequence of fibonacci numbers"
- "Generate fibonacci numbers"
- "Compute fibonacci numbers"
```
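
One way this could look in code (the templates are illustrative; the actual variations are generated in `data_preprocessing.py`):

```python
# Illustrative sketch of template-based description augmentation.
TEMPLATES = ["Sequence of {d}", "Generate {d}", "Compute {d}"]

def augment_description(description: str) -> list[str]:
    # lowercase the first character so the templates read naturally
    d = description[0].lower() + description[1:] if description else description
    return [description] + [t.format(d=d) for t in TEMPLATES]

print(augment_description("Fibonacci numbers"))
# ['Fibonacci numbers', 'Sequence of fibonacci numbers',
#  'Generate fibonacci numbers', 'Compute fibonacci numbers']
```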

## Performance Considerations

### Memory Usage

- T5-small: ~2GB GPU memory for training
- T5-base: ~8GB GPU memory for training
- T5-large: ~16GB GPU memory for training

### Training Time

Approximate training times (on a V100 GPU):
- 1,000 examples: 10-30 minutes
- 10,000 examples: 2-6 hours
- 100,000+ examples: 1-3 days

### Generation Speed

- T5-small: ~0.1-0.5 seconds per program
- T5-base: ~0.2-1.0 seconds per program
- T5-large: ~0.5-2.0 seconds per program

## Troubleshooting

### Common Issues

1. **CUDA out of memory**: Reduce the batch size or use a smaller model
2. **Poor generation quality**: Train longer or use a larger model
3. **Invalid programs**: Check training data quality and augmentation

### Model Selection

Choose a model size based on your requirements:

| Use Case | Model | Trade-offs |
|----------|-------|------------|
| Research/Experimentation | t5-small | Fast, lower quality |
| Production/Demo | t5-base | Balanced speed/quality |
| Best Results | t5-large | Slow, highest quality |

## Extending the System

### Custom Training Data

Add new training examples:

```python
from loda.ml.llm.data_preprocessing import TrainingExample

custom_example = TrainingExample(
    sequence_id="custom_001",
    description="Powers of 2",
    loda_code="mov $1,1\nlpb $0\n mul $1,2\n sub $0,1\nlpe\nmov $0,$1",
    terms=[1, 2, 4, 8, 16, 32]
)
```

### Fine-tuning

Fine-tune on specific sequence types:

```python
# Load pre-trained model
model = LodaT5Model.load_model("base_model")

# Train on specialized data
train_loda_llm(
    programs_dir="specialized_programs",
    model=model,  # Start from pre-trained
    learning_rate=1e-5,  # Lower learning rate
    num_epochs=1
)
```

## Future Improvements

- **Better tokenization**: Domain-specific vocabulary
- **Program synthesis**: Multi-step reasoning
- **Verification**: Formal correctness checking
- **Interactive refinement**: Human-in-the-loop generation
- **Specialized architectures**: CodeBERT, CodeT5+ integration

---

For more information, see the LODA project documentation and the individual module docstrings.