01 Full Simulation - Complete LLM from Scratch

This notebook walks through a complete LLM training and inference cycle.

Setup

// Load all modules
const SimpleTokenizer = require("../01_tokenizer/simple_tokenizer");
const TokenEmbeddings = require("../02_embeddings/token_embeddings");
const PositionalEncoding = require("../02_embeddings/positional_encoding");
const SimpleGPT = require("../05_full_model/gpt");
const GPTTrainer = require("../05_full_model/train");
const GreedyDecoder = require("../06_inference/greedy_decode");

Step 1: Initialize Components

// Create tokenizer
const tokenizer = new SimpleTokenizer();
console.log(`Vocabulary size: ${tokenizer.vocab_size}`);

// Test encoding/decoding
const text = "hello";
const tokens = tokenizer.encode(text);
console.log(`Text: "${text}"`);
console.log(`Tokens: ${tokens}`);
console.log(`Decoded: "${tokenizer.decode(tokens)}"`);

// Pad sequences
const padded = tokenizer.pad_sequence(tokens, 10);
console.log(`Padded (length 10): ${padded}`);
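
To make the encode/decode/pad round trip above concrete, here is a minimal character-level sketch. It is not the repository's `SimpleTokenizer`: the class name `CharTokenizer`, the default alphabet, and the reserved ids (0 for padding, 1 for unknown characters) are all assumptions chosen for illustration.

```javascript
// Hypothetical character-level tokenizer (illustrative; not SimpleTokenizer).
class CharTokenizer {
  constructor(alphabet = "abcdefghijklmnopqrstuvwxyz ") {
    // Assumption: id 0 is padding, id 1 stands for unknown characters.
    this.pad_id = 0;
    this.unk_id = 1;
    this.id_to_char = ["<pad>", "<unk>", ...alphabet];
    this.char_to_id = new Map(this.id_to_char.map((ch, id) => [ch, id]));
    this.vocab_size = this.id_to_char.length;
  }

  // Map each character to its id, falling back to unk_id.
  encode(text) {
    return [...text].map((ch) => this.char_to_id.get(ch) ?? this.unk_id);
  }

  // Drop padding so a padded sequence decodes back to the original text.
  decode(ids) {
    return ids
      .filter((id) => id !== this.pad_id)
      .map((id) => this.id_to_char[id])
      .join("");
  }

  // Truncate if too long, otherwise right-pad with pad_id.
  pad_sequence(ids, length) {
    const clipped = ids.slice(0, length);
    return clipped.concat(new Array(length - clipped.length).fill(this.pad_id));
  }
}

const sketch_tokenizer = new CharTokenizer();
const sketch_ids = sketch_tokenizer.encode("hello");
console.log(sketch_tokenizer.decode(sketch_tokenizer.pad_sequence(sketch_ids, 10)));
```

Reserving a fixed padding id keeps every sequence in a batch the same length, and skipping that id in `decode` means padding never leaks into generated text.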

Step 2: Create Embeddings

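// JavaScript has no keyword arguments, so values are passed positionally;
// the trailing comments name the parameter each value is meant to fill.
const embeddings = new TokenEmbeddings(
  100, // vocab_size
  64, // embedding_dim
);
const pos_encoding = new PositionalEncoding(64); // embedding_dim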

// Get embeddings for tokens
const token_ids = [2, 3, 5];
const emb = embeddings.forward(token_ids);
console.log(`Embeddings shape: ${emb.length} × ${emb[0].length}`);

// Add positional encoding
const with_pos = pos_encoding.forward(emb);
console.log(
  `With positional encoding: ${with_pos.length} × ${with_pos[0].length}`,
);
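
The notebook does not show how `PositionalEncoding` computes its values. One common choice is the fixed sinusoidal scheme, sketched below under that assumption; the function names `sinusoidalEncoding` and `addPositions` are hypothetical and not part of the repository.

```javascript
// Sinusoidal positional encoding sketch (assumed scheme, not confirmed by the source).
// Even dimensions use sin, odd dimensions use cos, and each sin/cos pair shares
// the frequency 1 / 10000^(2k / embedding_dim), where k is the pair index.
function sinusoidalEncoding(seq_len, embedding_dim) {
  const pe = [];
  for (let pos = 0; pos < seq_len; pos++) {
    const row = new Array(embedding_dim);
    for (let i = 0; i < embedding_dim; i++) {
      const pair = Math.floor(i / 2);
      const angle = pos / Math.pow(10000, (2 * pair) / embedding_dim);
      row[i] = i % 2 === 0 ? Math.sin(angle) : Math.cos(angle);
    }
    pe.push(row);
  }
  return pe;
}

// Add the encoding element-wise to a (seq_len × embedding_dim) array of embeddings.
function addPositions(emb) {
  const pe = sinusoidalEncoding(emb.length, emb[0].length);
  return emb.map((row, pos) => row.map((x, i) => x + pe[pos][i]));
}

const zeros = Array.from({ length: 3 }, () => new Array(64).fill(0));
const encoded = addPositions(zeros);
console.log(`${encoded.length} × ${encoded[0].length}`);
```

Because the encoding is added rather than concatenated, the output keeps the same shape as the input embeddings, which matches the two shape printouts above.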

Step 3: Initialize Model

// Arguments are positional; the comments name the parameter each value fills.
const model = new SimpleGPT(
  100, // vocab_size
  64, // embedding_dim
  8, // num_heads
  2, // num_blocks
  50, // max_length
);
console.log("Model created!");
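
In multi-head attention the embedding is normally split evenly across heads, so `embedding_dim` has to be divisible by `num_heads`. The helper below is a hypothetical sanity check for the values used above; whether `SimpleGPT` validates its own arguments is not shown in this notebook.

```javascript
// Hypothetical helper: derive a few sizes from the configuration and
// reject head counts that do not divide the embedding dimension.
function describeConfig({ vocab_size, embedding_dim, num_heads, num_blocks, max_length }) {
  if (embedding_dim % num_heads !== 0) {
    throw new Error(
      `embedding_dim (${embedding_dim}) must be divisible by num_heads (${num_heads})`,
    );
  }
  return {
    head_dim: embedding_dim / num_heads, // width of each attention head
    embedding_table_size: vocab_size * embedding_dim, // entries in the token embedding table
    num_blocks,
    max_length,
  };
}

const config_summary = describeConfig({
  vocab_size: 100,
  embedding_dim: 64,
  num_heads: 8,
  num_blocks: 2,
  max_length: 50,
});
console.log(config_summary.head_dim); // 8
```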

Step 4: Training

// Create training data
const train_texts = ["hello world", "how are you", "machine learning"];

const train_targets = [
  [2, 3, 0, 0], // next tokens
  [4, 5, 6, 0],
  [7, 8, 9, 0],
];

// Train
const trainer = new GPTTrainer(model, 0.001); // learning_rate
trainer.train(train_texts, train_targets, 5); // epochs

// Retrieve the losses recorded during training
const losses = trainer.get_losses();
console.log(`Losses: ${losses}`);
console.log("Training complete!");
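
The notebook does not say which loss `GPTTrainer` minimizes. For next-token prediction the usual choice is cross-entropy over the softmax of the model's logits, sketched below. The function names are hypothetical, and treating target id 0 as padding to be skipped is an assumption suggested by the trailing zeros in `train_targets`.

```javascript
// Numerically stable softmax: subtract the max logit before exponentiating.
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Cross-entropy for one position: negative log-probability of the target token.
function crossEntropy(logits, target_id) {
  return -Math.log(softmax(logits)[target_id]);
}

// Mean loss over a sequence, skipping positions whose target is padding.
// Assumption: pad_id = 0, as the zeros in train_targets suggest.
function sequenceLoss(logits_per_position, targets, pad_id = 0) {
  let total = 0;
  let counted = 0;
  targets.forEach((target_id, pos) => {
    if (target_id === pad_id) return;
    total += crossEntropy(logits_per_position[pos], target_id);
    counted += 1;
  });
  return counted === 0 ? 0 : total / counted;
}

// An untrained model with uniform logits over 4 tokens scores ln(4) per position.
console.log(sequenceLoss([[0, 0, 0, 0], [0, 0, 0, 0]], [2, 0]));
```

A loss near the natural log of the vocabulary size means the model is still guessing uniformly; training should push the recorded losses below that value.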

Step 5: Inference

// Try generation with greedy decoding
const decoder = new GreedyDecoder(model, tokenizer);
const prompt = "hello";
const generated = decoder.decode(prompt, 10); // max_tokens
console.log(`Generated: "${generated}"`);
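
Greedy decoding picks the highest-scoring token at every step and feeds it back in as input. The loop below is a minimal sketch of that idea, not the repository's `GreedyDecoder`: `next_logits` is a hypothetical stand-in for a model call that returns one score per vocabulary entry, and `eos_id` is an optional stop token assumed for illustration.

```javascript
// Index of the largest value; ties resolve to the earliest index.
function argmax(values) {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

// Append the most likely next token until max_tokens or eos_id is reached.
function greedyGenerate(next_logits, prompt_ids, max_tokens, eos_id = null) {
  const ids = [...prompt_ids];
  for (let step = 0; step < max_tokens; step++) {
    const next_id = argmax(next_logits(ids));
    if (next_id === eos_id) break;
    ids.push(next_id);
  }
  return ids;
}

// Toy "model" over 5 tokens: always prefers (last token + 1) mod 5.
const toy_logits = (ids) => {
  const scores = new Array(5).fill(0);
  scores[(ids[ids.length - 1] + 1) % 5] = 1;
  return scores;
};
console.log(greedyGenerate(toy_logits, [0], 3)); // [ 0, 1, 2, 3 ]
```

Greedy decoding is deterministic, so the same prompt always produces the same continuation. Sampling-based strategies trade that determinism for variety.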

Key Takeaways

  1. ✅ Tokenizer converts text to numbers
  2. ✅ Embeddings convert numbers to vectors
  3. ✅ Positional encoding adds position information
  4. ✅ Transformer blocks process and refine representations
  5. ✅ Output projection predicts next token
  6. ✅ Training minimizes loss (prediction error)
  7. ✅ Inference generates new text
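
Takeaway 5 is the one step the notebook never shows directly. As a rough sketch, an output projection multiplies the final hidden vector by a weight matrix with one column per vocabulary entry, producing one logit per token. The function name and the tiny weights below are hypothetical, not values from `SimpleGPT`.

```javascript
// Project a hidden vector (length d) to vocabulary logits using a d × vocab_size matrix.
function projectToLogits(hidden, weights) {
  const vocab_size = weights[0].length;
  const logits = new Array(vocab_size).fill(0);
  for (let d = 0; d < hidden.length; d++) {
    for (let v = 0; v < vocab_size; v++) {
      logits[v] += hidden[d] * weights[d][v];
    }
  }
  return logits;
}

// 2-dimensional hidden state, 3-token vocabulary.
const toy_hidden = [1, 2];
const toy_weights = [
  [1, 0, 2],
  [0, 1, 1],
];
console.log(projectToLogits(toy_hidden, toy_weights)); // [ 1, 2, 4 ]
```

The largest logit (index 2 here) is the token greedy decoding would pick next.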

That's it! You've built a complete LLM! 🎉