
🧠 LLM Memory Calculator

A comprehensive web application for calculating Large Language Model (LLM) memory requirements and performance metrics across a wide range of GPU configurations and unified-memory systems (Apple Silicon).

✨ Features

🚀 Latest Features (v2.0)

  • 🔄 Auto-Updating Model Database: Fetches latest Ollama models including Gemma3, DeepSeek-R1, Qwen3, Llama4, Phi4
  • 🍎 Enhanced Apple Silicon Support: M4 series support, adjustable unified memory (8GB-512GB)
  • 🎛️ Advanced Quantization: 1-bit to 32-bit precision options (INT1, INT2, INT3, INT4, INT5, INT6, INT8, FP16, BF16, FP32)
  • 🧮 Modular Calculations: Granular control over memory and performance calculations
  • 💾 Comprehensive Memory Analysis: KV cache, activation memory, framework overhead, system overhead, peak memory

Core Features

  • Memory Footprint Calculation: sizing formulas based on VMware's published LLM sizing methodology
  • Performance Metrics: Latency, throughput, time-to-first-token, prefill time estimation
  • GPU Database: 80+ GPUs including NVIDIA H100/H200, AMD MI300X, Apple M4 series
  • LLM Model Support: 200+ models from Ollama + proprietary APIs (Claude, GPT, Gemini)
  • Real-time Analysis: OOM detection, optimization recommendations, warnings
  • Multi-GPU Support: Tensor parallelism calculations for enterprise deployments
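
For multi-GPU deployments, tensor-parallel sizing amounts to sharding weights and KV cache across the group while paying per-device overhead on every GPU. A rough sketch of that idea (illustrative only, not the calculator's exact implementation):

// Rough per-GPU memory under tensor parallelism: weights and KV cache are
// sharded across the group, while per-device overhead is paid on every GPU.
// Illustrative sketch only, not the exact logic in src/utils/calculator.ts.
function perGpuMemoryGB(
  weightsGB: number,
  kvCacheGB: number,
  numGpus: number,
  perDeviceOverheadGB: number
): number {
  return (weightsGB + kvCacheGB) / numGpus + perDeviceOverheadGB;
}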

🚀 Quick Start

Prerequisites

  • Node.js 18+
  • npm/yarn/pnpm

Installation

# Clone the repository
git clone <repository-url>
cd llm-memory-calculator

# Install dependencies
npm install

# Start development server
npm run dev

# Open http://localhost:3000

🎯 Usage Guide

Basic Workflow

  1. 📱 Select Model: Choose from auto-updated Ollama models or proprietary APIs
  2. 🖥️ Choose Hardware: Select from comprehensive GPU/processor database
  3. ⚙️ Configure Parameters: Set context size, concurrent requests, quantization
  4. 🎛️ Customize Calculations: Toggle memory components and performance metrics
  5. 📊 Analyze Results: Review memory usage, performance, warnings, and recommendations

Advanced Features

Memory Calculation Options

  • ✅ KV Cache Memory: Attention cache for inference contexts
  • ✅ Activation Memory: Intermediate computation memory
  • ✅ Framework Overhead: PyTorch/CUDA overhead (15% default)
  • ✅ System Overhead: OS/driver reserved memory (10% default)
  • ✅ Peak Memory Factor: Model loading peak usage (1.5x multiplier)
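
Internally these toggles map onto a configuration object consumed by the calculation engine. The shape below is an illustrative sketch; field names are assumptions, not necessarily the exact types in src/types/index.ts:

// Illustrative shape of the memory-calculation toggles. Field names are
// hypothetical and may differ from the real definitions in src/types/index.ts.
interface MemoryCalculationOptions {
  includeKvCache: boolean;        // attention cache for inference contexts
  includeActivations: boolean;    // intermediate computation memory
  frameworkOverheadRatio: number; // PyTorch/CUDA overhead, 0.15 by default
  systemOverheadRatio: number;    // OS/driver reserved memory, 0.10 by default
  peakMemoryFactor: number;       // model-loading peak usage, 1.5 by default
}

const defaultMemoryOptions: MemoryCalculationOptions = {
  includeKvCache: true,
  includeActivations: true,
  frameworkOverheadRatio: 0.15,
  systemOverheadRatio: 0.10,
  peakMemoryFactor: 1.5,
};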

Performance Calculation Options

  • ✅ Prefill Time: Input processing latency
  • ✅ Generation Time (TPOT): Time per output token
  • ✅ Throughput: Tokens per second output rate
  • ✅ End-to-End Latency: Complete request processing time

Analysis Options

  • ✅ Warnings: Performance bottlenecks, OOM conditions
  • ✅ Recommendations: Optimization suggestions, hardware advice
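
As a rough illustration of how the OOM check feeds the warnings list, a minimal sketch (function and field names are assumptions, not the exact API of src/utils/calculator.ts):

// Minimal OOM-warning sketch. Names and message wording are illustrative only.
interface MemoryBreakdownGB {
  weights: number;
  kvCache: number;
  activations: number;
  frameworkOverhead: number;
  systemOverhead: number;
}

function oomWarnings(breakdown: MemoryBreakdownGB, availableGB: number): string[] {
  const warnings: string[] = [];
  const total =
    breakdown.weights + breakdown.kvCache + breakdown.activations +
    breakdown.frameworkOverhead + breakdown.systemOverhead;
  if (total > availableGB) {
    warnings.push(
      `Estimated ${total.toFixed(1)} GB exceeds the available ${availableGB} GB: ` +
      `consider heavier quantization, a smaller model, or more GPUs.`
    );
  }
  return warnings;
}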

🧮 Calculation Methodology

Memory Footprint Formula

Total Memory = Model_Weights + KV_Cache + Activation_Memory + Framework_Overhead + System_Overhead

Model_Weights = parameters × quantization_bytes_per_param / (1024³)
KV_Cache = 2 × 2 × n_layers × d_model × context_window × concurrent_requests / (1024³)
Activation_Memory = batch_size × seq_len × n_layers × d_model × bytes_per_activation / (1024³)
Framework_Overhead = Model_Weights × 0.15  (configurable)
System_Overhead = Total_GPU_Memory × 0.10  (configurable)

All components are expressed in GiB; the 2 × 2 factor in KV_Cache covers the K and V tensors at 2 bytes (FP16) per element.
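
The same formula, written out as a simplified TypeScript sketch (an illustration of the approach, not the exact code in src/utils/calculator.ts; all results are in GiB and the KV cache assumes 2-byte FP16 entries):

// Simplified sketch of the memory footprint formula (results in GiB).
const GIB = 1024 ** 3;

// parameters × bytes per parameter (e.g. 2 for FP16, 0.5 for INT4)
function modelWeightsGB(parameters: number, bytesPerParam: number): number {
  return (parameters * bytesPerParam) / GIB;
}

// 2 (K and V) × 2 bytes (FP16) × layers × hidden size × context × requests
function kvCacheGB(nLayers: number, dModel: number, contextWindow: number, concurrentRequests: number): number {
  return (2 * 2 * nLayers * dModel * contextWindow * concurrentRequests) / GIB;
}

function activationMemoryGB(batchSize: number, seqLen: number, nLayers: number, dModel: number, bytesPerActivation: number): number {
  return (batchSize * seqLen * nLayers * dModel * bytesPerActivation) / GIB;
}

function totalMemoryGB(weightsGB: number, kvGB: number, activationsGB: number, totalGpuMemoryGB: number, frameworkRatio = 0.15, systemRatio = 0.10): number {
  return weightsGB + kvGB + activationsGB + weightsGB * frameworkRatio + totalGpuMemoryGB * systemRatio;
}

For example, an 8B-parameter model at FP16 needs roughly 8e9 × 2 / 1024³ ≈ 14.9 GiB for weights alone, before any KV cache or overhead.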

Performance Metrics

Prefill_Time = (2 × Model_Parameters / num_GPUs) / GPU_TFLOPS
Time_per_Output_Token = (2 × Model_Parameters / num_GPUs) / Memory_Bandwidth × 1000
TTFT = Prefill_Time + TPOT
E2E_Latency = Prompt_Size × Prefill_Time + Response_Size × TPOT
Throughput = Response_Size / E2E_Latency
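
A sketch of the same metrics in TypeScript; the units below reflect one consistent reading of the formulas (parameters in billions, TFLOPS and GB/s as peak specs, times in milliseconds) and may not match src/utils/calculator.ts exactly:

// Performance-metric sketch. Assumed units: paramsB = parameters in billions,
// tflops = per-GPU compute (TFLOPS), bandwidthGBs = memory bandwidth (GB/s),
// all times in milliseconds. These unit assumptions are illustrative only.
function prefillTimePerTokenMs(paramsB: number, numGpus: number, tflops: number): number {
  return (2 * paramsB) / numGpus / tflops;
}

function timePerOutputTokenMs(paramsB: number, numGpus: number, bandwidthGBs: number): number {
  return ((2 * paramsB) / numGpus / bandwidthGBs) * 1000;
}

function e2eLatencyMs(promptTokens: number, responseTokens: number, prefillMs: number, tpotMs: number): number {
  return promptTokens * prefillMs + responseTokens * tpotMs;
}

function throughputTokensPerSecond(responseTokens: number, e2eMs: number): number {
  return responseTokens / (e2eMs / 1000);
}

For instance, a 70B-parameter model on a single GPU with roughly 4800 GB/s of bandwidth gives a TPOT of about 2 × 70 / 4800 × 1000 ≈ 29 ms per output token under these assumptions.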

🖥️ Supported Hardware

NVIDIA GPUs (Consumer & Data Center)

| Series   | Models                              | Memory Range |
|----------|-------------------------------------|--------------|
| RTX 40   | 4090, 4080 SUPER, 4070 Ti, 4060 Ti  | 8GB - 24GB   |
| RTX 30   | 3090 Ti, 3090, 3080 Ti, 3070, 3060  | 8GB - 24GB   |
| H-Series | H100 SXM/PCIe/NVL, H200 SXM/NVL     | 80GB - 188GB |
| A-Series | A100 80GB/40GB, A30, A10            | 24GB - 80GB  |
| L-Series | L40S, L40                           | 48GB         |

AMD GPUs (Consumer & Data Center)

| Series    | Models                              | Memory Range  |
|-----------|-------------------------------------|---------------|
| RX 7000   | 7900 XTX, 7900 XT, 7800 XT, 7700 XT | 12GB - 24GB   |
| RX 6000   | 6900 XT, 6800 XT, 6700 XT           | 12GB - 16GB   |
| MI Series | MI300X, MI250X                      | 128GB - 192GB |

Intel GPUs

| Series | Models     | Memory Range |
|--------|------------|--------------|
| Arc A  | A770, A750 | 8GB - 16GB   |

Apple Silicon (Unified Memory)

| Generation | Models                       | Memory Range | Bandwidth         |
|------------|------------------------------|--------------|-------------------|
| M4 (2024)  | M4, M4 Pro, M4 Max           | 16GB - 128GB | 120GB/s - 546GB/s |
| M3 (2023)  | M3, M3 Pro, M3 Max, M3 Ultra | 24GB - 512GB | 100GB/s - 800GB/s |
| M2 (2022)  | M2, M2 Pro, M2 Max, M2 Ultra | 24GB - 192GB | 100GB/s - 800GB/s |
| M1 (2020)  | M1, M1 Pro, M1 Max, M1 Ultra | 16GB - 128GB | 68GB/s - 800GB/s  |

Note: Apple Silicon supports adjustable unified memory configurations

🤖 Supported Models

🏠 Local Models (Auto-Updated from Ollama)

  • Meta Llama: 3.1 (8B, 70B, 405B), 3.2 (1B, 3B, 11B, 90B), 3.3 (70B), 4 (expected)
  • Mistral AI: 7B v0.3, Nemo 12B, Small 22B, Large 123B
  • Mixtral: 8x7B, 8x22B (Mixture of Experts)
  • Alibaba Qwen: 2.5 (0.5B-72B), 2.5-Coder (7B-32B), Qwen3 series
  • Google Gemma: 2B, 7B, 9B, 27B, Gemma3 series
  • Microsoft Phi: 3-Mini (3.8B), 3-Medium (14B), Phi4 (14B)
  • DeepSeek: Coder (6.7B, 33B), DeepSeek-R1 (1.5B-67B), DeepSeek-V3
  • Code Models: CodeLlama, StarCoder2, CodeGemma, Granite-Code
  • Lightweight: TinyLlama (1.1B), SmolLM2 (135M-1.7B), MiniCPM
  • Specialized: Nomic-Embed, BGE, Moondream (vision), LLaVA

☁️ Proprietary Models (API Only)

  • Anthropic Claude: 3 Haiku, 3 Sonnet, 3 Opus
  • OpenAI GPT: 3.5-Turbo, 4, 4-Turbo
  • Google Gemini: 1.5 Flash, 1.5 Pro

🔧 Quantization Support

| Format | Bits | Memory Usage | Quality    | Use Case             |
|--------|------|--------------|------------|----------------------|
| FP32   | 32   | 100%         | Highest    | Research, training   |
| FP16   | 16   | 50%          | High       | Production inference |
| BF16   | 16   | 50%          | High       | Stable training      |
| INT8   | 8    | 25%          | Good       | Efficient inference  |
| INT6   | 6    | 18.75%       | Moderate   | Memory-constrained   |
| INT5   | 5    | 15.625%      | Moderate   | Extreme efficiency   |
| INT4   | 4    | 12.5%        | Acceptable | Maximum practical    |
| INT3   | 3    | 9.375%       | Poor       | Research             |
| INT2   | 2    | 6.25%        | Very Poor  | Experimental         |
| INT1   | 1    | 3.125%       | Unusable   | Binary networks      |
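
The memory-usage column follows directly from bits per parameter (bits ÷ 8 bytes per weight). A small lookup like the sketch below captures it; the real options live in src/data/quantizationConfigs.ts:

// Bytes per parameter by quantization format (bits / 8). Illustrative sketch;
// the real configuration lives in src/data/quantizationConfigs.ts.
const QUANT_BITS: Record<string, number> = {
  FP32: 32, FP16: 16, BF16: 16,
  INT8: 8, INT6: 6, INT5: 5, INT4: 4, INT3: 3, INT2: 2, INT1: 1,
};

function bytesPerParam(format: keyof typeof QUANT_BITS): number {
  return QUANT_BITS[format] / 8;
}

// Example: a 7B-parameter model at INT4 needs about 7e9 × 0.5 / 1024³ ≈ 3.3 GiB
// for weights alone, versus ≈ 13 GiB at FP16.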

🏗️ Architecture

src/
├── components/                 # React components
│   ├── Calculator.tsx         # Main calculator interface
│   ├── DebugModels.tsx       # Model debugging tools
│   ├── DatabaseStatus.tsx    # Database status display
│   └── FeatureHighlights.tsx # Feature showcase
├── data/                      # Static configuration
│   ├── gpuSpecs.ts           # GPU specifications
│   ├── quantizationConfigs.ts # Quantization options
│   └── ollamaModels.ts       # Fallback model data
├── hooks/                     # React hooks
│   └── useDataUpdater.ts     # Auto-updating data hook
├── types/                     # TypeScript definitions
│   └── index.ts              # Shared type definitions
├── utils/                     # Core logic
│   ├── calculator.ts         # Calculation engine
│   ├── dataUpdater.ts        # Dynamic model fetching
│   └── __tests__/            # Test suites
└── App.tsx                   # Main application

🛠️ Development

Scripts

npm run dev          # Development server with HMR
npm run build        # Production build
npm run preview      # Preview production build
npm run test         # Run test suite
npm run test:watch   # Tests in watch mode
npm run test:coverage # Coverage report
npm run lint         # Code linting
npm run lint:fix     # Auto-fix linting issues

Testing

Comprehensive test coverage for:

  • ✅ Memory calculation accuracy
  • ✅ Performance metric calculations
  • ✅ Quantization conversions
  • ✅ Multi-GPU configurations
  • ✅ OOM detection logic
  • ✅ Edge cases and error handling
  • ✅ Data fetching and parsing

# Run tests
npm run test

# Watch mode during development
npm run test:watch

# Generate coverage report
npm run test:coverage
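
A representative unit test might look like the sketch below (the local helper stands in for the real calculation engine; names and asserted values are illustrative):

// Illustrative Jest test sketch. The helper stands in for the real engine in
// src/utils/calculator.ts; the asserted relationships are the point, not the values.
const bytesPerParam = (bits: number): number => bits / 8;

describe('quantization conversions', () => {
  it('scales weight memory linearly with bit width', () => {
    expect(bytesPerParam(16)).toBe(2);
    expect(bytesPerParam(4)).toBeCloseTo(bytesPerParam(8) / 2);
  });
});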

Technical Stack

  • Framework: React 18 + TypeScript
  • UI Library: Material-UI (MUI) v5
  • Build Tool: Vite (fast HMR, modern bundling)
  • Testing: Jest + React Testing Library
  • Code Quality: ESLint + TypeScript ESLint
  • Charts: Recharts for data visualization

🔄 Auto-Update System

The application automatically fetches the latest model information from Ollama's model registry:

Features

  • 🔄 24-Hour Auto-Updates: Checks for new models daily
  • 🚀 Force Update: Manual refresh for immediate updates
  • 📦 CORS Proxy: Bypasses browser restrictions via Vite proxy
  • 🛡️ Fallback System: Uses cached data if updates fail
  • 🐛 Debug Tools: Built-in model fetching diagnostics
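
A simplified version of this update-with-fallback flow is sketched below; the proxy path, cache key, and response shape are assumptions, and the real logic lives in src/utils/dataUpdater.ts:

// Simplified fetch-with-fallback sketch. '/ollama' is assumed to be rewritten
// to the Ollama registry by the Vite proxy; cache key and types are illustrative.
interface ModelEntry { name: string; parameters: number; }

const CACHE_KEY = 'ollama-models-cache';
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

async function getModels(bundledFallback: ModelEntry[]): Promise<ModelEntry[]> {
  const cached = localStorage.getItem(CACHE_KEY);
  if (cached) {
    const { timestamp, models } = JSON.parse(cached);
    if (Date.now() - timestamp < ONE_DAY_MS) return models; // refreshed within 24h
  }
  try {
    const response = await fetch('/ollama/library');
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    const models: ModelEntry[] = await response.json();
    localStorage.setItem(CACHE_KEY, JSON.stringify({ timestamp: Date.now(), models }));
    return models;
  } catch {
    // Fall back to stale cached data, then to the bundled list in src/data/ollamaModels.ts.
    return cached ? JSON.parse(cached).models : bundledFallback;
  }
}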

Model Detection

Automatically detects and adds new models including:

  • Latest Llama, Gemma, Qwen releases
  • Emerging models from Mistral, DeepSeek
  • Specialized models (code, vision, embedding)
  • Community-contributed models

🚀 Deployment

Production Build

# Create optimized build
npm run build

# Preview production build
npm run preview

# Deploy dist/ folder to your hosting platform

Environment Configuration

Create .env.local for environment-specific settings:

VITE_API_BASE_URL=https://your-api-domain.com
VITE_ENABLE_DEBUG=false
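
Vite exposes VITE_-prefixed variables on import.meta.env at build time, so application code can read them like this:

// Reading the environment variables above from application code.
const apiBaseUrl: string = import.meta.env.VITE_API_BASE_URL ?? '';
const debugEnabled: boolean = import.meta.env.VITE_ENABLE_DEBUG === 'true';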

Hosting Recommendations

  • Vercel: Zero-config deployment with automatic HTTPS
  • Netlify: Easy deployment with form handling
  • GitHub Pages: Free hosting for open-source projects
  • Docker: Containerized deployment for enterprise

🤝 Contributing

Development Setup

# Fork and clone the repository
git clone https://github.com/your-username/llm-memory-calculator.git
cd llm-memory-calculator

# Install dependencies
npm install

# Start development server
npm run dev

Contribution Guidelines

  1. 🔀 Fork the repository
  2. 🌿 Create a feature branch: git checkout -b feature/amazing-feature
  3. ✨ Develop your changes with tests
  4. ✅ Test your changes: npm run test
  5. 📝 Commit with clear messages: git commit -m 'Add amazing feature'
  6. 🚀 Push to your branch: git push origin feature/amazing-feature
  7. 📋 Submit a pull request

Areas for Contribution

  • 🆕 New GPU/processor support
  • 🤖 Additional LLM model support
  • 📊 Enhanced visualization features
  • 🧮 Advanced calculation options
  • 🌐 Internationalization
  • 📱 Mobile responsiveness
  • ⚡ Performance optimizations

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

  • VMware: Sizing methodology and performance formulas
  • qoofyk: Original Python calculator inspiration
  • Ollama Team: Model registry and local inference platform
  • TechPowerUp: Comprehensive GPU specification database
  • React Community: Exceptional development ecosystem

📧 Support

For questions, issues, or feature requests, please open an issue on GitHub.

⭐ Star this repository if it helped you! ⭐

Made with ❤️ for the LLM community