
🧠 LLM Memory Calculator

A comprehensive web application for calculating Large Language Model (LLM) memory requirements and performance metrics across a wide range of GPU configurations and unified-memory systems (Apple Silicon).

✨ Features

🚀 Latest Features (v2.0)

  • 🔄 Auto-Updating Model Database: Fetches latest Ollama models including Gemma3, DeepSeek-R1, Qwen3, Llama4, Phi4
  • 🍎 Enhanced Apple Silicon Support: M4 series support, adjustable unified memory (8GB-512GB)
  • 🎛️ Advanced Quantization: 1-bit to 32-bit precision options (INT1, INT2, INT3, INT4, INT5, INT6, INT8, FP16, BF16, FP32)
  • 🧮 Modular Calculations: Granular control over memory and performance calculations
  • 💾 Comprehensive Memory Analysis: KV cache, activation memory, framework overhead, system overhead, peak memory

Core Features

  • Memory Footprint Calculation: sizing formulas based on VMware's published LLM sizing methodology
  • Performance Metrics: Latency, throughput, time-to-first-token, prefill time estimation
  • GPU Database: 80+ GPUs including NVIDIA H100/H200, AMD MI300X, Apple M4 series
  • LLM Model Support: 200+ models from Ollama + proprietary APIs (Claude, GPT, Gemini)
  • Real-time Analysis: OOM detection, optimization recommendations, warnings
  • Multi-GPU Support: Tensor parallelism calculations for enterprise deployments
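
For multi-GPU deployments, tensor-parallel sizing amounts to sharding weights and KV cache across the group while paying per-device overhead on every GPU. A rough sketch of that idea (illustrative only, not the calculator's exact implementation):

// Rough per-GPU memory under tensor parallelism: weights and KV cache are
// sharded across the group, while per-device overhead is paid on every GPU.
// Illustrative sketch only, not the exact logic in src/utils/calculator.ts.
function perGpuMemoryGB(
  weightsGB: number,
  kvCacheGB: number,
  numGpus: number,
  perDeviceOverheadGB: number
): number {
  return (weightsGB + kvCacheGB) / numGpus + perDeviceOverheadGB;
}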

🚀 Quick Start

Prerequisites

  • Node.js 18+
  • npm/yarn/pnpm

Installation

# Clone the repository
git clone <repository-url>
cd llm-memory-calculator

# Install dependencies
npm install

# Start development server
npm run dev

# Open http://localhost:3000

🎯 Usage Guide

Basic Workflow

  1. 📱 Select Model: Choose from auto-updated Ollama models or proprietary APIs
  2. 🖥️ Choose Hardware: Select from comprehensive GPU/processor database
  3. ⚙️ Configure Parameters: Set context size, concurrent requests, quantization
  4. 🎛️ Customize Calculations: Toggle memory components and performance metrics
  5. 📊 Analyze Results: Review memory usage, performance, warnings, and recommendations

Advanced Features

Memory Calculation Options

  • ✅ KV Cache Memory: Attention cache for inference contexts
  • ✅ Activation Memory: Intermediate computation memory
  • ✅ Framework Overhead: PyTorch/CUDA overhead (15% default)
  • ✅ System Overhead: OS/driver reserved memory (10% default)
  • ✅ Peak Memory Factor: Model loading peak usage (1.5x multiplier)
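
Internally these toggles map onto a configuration object consumed by the calculation engine. The shape below is an illustrative sketch; field names are assumptions, not necessarily the exact types in src/types/index.ts:

// Illustrative shape of the memory-calculation toggles. Field names are
// hypothetical and may differ from the real definitions in src/types/index.ts.
interface MemoryCalculationOptions {
  includeKvCache: boolean;        // attention cache for inference contexts
  includeActivations: boolean;    // intermediate computation memory
  frameworkOverheadRatio: number; // PyTorch/CUDA overhead, 0.15 by default
  systemOverheadRatio: number;    // OS/driver reserved memory, 0.10 by default
  peakMemoryFactor: number;       // model-loading peak usage, 1.5 by default
}

const defaultMemoryOptions: MemoryCalculationOptions = {
  includeKvCache: true,
  includeActivations: true,
  frameworkOverheadRatio: 0.15,
  systemOverheadRatio: 0.10,
  peakMemoryFactor: 1.5,
};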

Performance Calculation Options

  • ✅ Prefill Time: Input processing latency
  • ✅ Generation Time (TPOT): Time per output token
  • ✅ Throughput: Tokens per second output rate
  • ✅ End-to-End Latency: Complete request processing time

Analysis Options

  • ✅ Warnings: Performance bottlenecks, OOM conditions
  • ✅ Recommendations: Optimization suggestions, hardware advice
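
As a rough illustration of how the OOM check feeds the warnings list, a minimal sketch (function and field names are assumptions, not the exact API of src/utils/calculator.ts):

// Minimal OOM-warning sketch. Names and message wording are illustrative only.
interface MemoryBreakdownGB {
  weights: number;
  kvCache: number;
  activations: number;
  frameworkOverhead: number;
  systemOverhead: number;
}

function oomWarnings(breakdown: MemoryBreakdownGB, availableGB: number): string[] {
  const warnings: string[] = [];
  const total =
    breakdown.weights + breakdown.kvCache + breakdown.activations +
    breakdown.frameworkOverhead + breakdown.systemOverhead;
  if (total > availableGB) {
    warnings.push(
      `Estimated ${total.toFixed(1)} GB exceeds the available ${availableGB} GB: ` +
      `consider heavier quantization, a smaller model, or more GPUs.`
    );
  }
  return warnings;
}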

🧮 Calculation Methodology

Memory Footprint Formula

Total Memory = Model_Weights + KV_Cache + Activation_Memory + Framework_Overhead + System_Overhead

Model_Weights = parameters × quantization_bytes_per_param / (1024³)
KV_Cache = 2 × 2 × n_layers × d_model × context_window × concurrent_requests / (1024³)
Activation_Memory = batch_size × seq_len × n_layers × d_model × bytes_per_activation / (1024³)
Framework_Overhead = Model_Weights × 0.15  (configurable)
System_Overhead = Total_GPU_Memory × 0.10  (configurable)

All components are expressed in GiB; the 2 × 2 factor in KV_Cache covers the K and V tensors at 2 bytes (FP16) per element.
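
The same formula, written out as a simplified TypeScript sketch (an illustration of the approach, not the exact code in src/utils/calculator.ts; all results are in GiB and the KV cache assumes 2-byte FP16 entries):

// Simplified sketch of the memory footprint formula (results in GiB).
const GIB = 1024 ** 3;

// parameters × bytes per parameter (e.g. 2 for FP16, 0.5 for INT4)
function modelWeightsGB(parameters: number, bytesPerParam: number): number {
  return (parameters * bytesPerParam) / GIB;
}

// 2 (K and V) × 2 bytes (FP16) × layers × hidden size × context × requests
function kvCacheGB(nLayers: number, dModel: number, contextWindow: number, concurrentRequests: number): number {
  return (2 * 2 * nLayers * dModel * contextWindow * concurrentRequests) / GIB;
}

function activationMemoryGB(batchSize: number, seqLen: number, nLayers: number, dModel: number, bytesPerActivation: number): number {
  return (batchSize * seqLen * nLayers * dModel * bytesPerActivation) / GIB;
}

function totalMemoryGB(weightsGB: number, kvGB: number, activationsGB: number, totalGpuMemoryGB: number, frameworkRatio = 0.15, systemRatio = 0.10): number {
  return weightsGB + kvGB + activationsGB + weightsGB * frameworkRatio + totalGpuMemoryGB * systemRatio;
}

For example, an 8B-parameter model at FP16 needs roughly 8e9 × 2 / 1024³ ≈ 14.9 GiB for weights alone, before any KV cache or overhead.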

Performance Metrics

Prefill_Time = (2 × Model_Parameters / num_GPUs) / GPU_TFLOPS
Time_per_Output_Token = (2 × Model_Parameters / num_GPUs) / Memory_Bandwidth × 1000
TTFT = Prefill_Time + TPOT
E2E_Latency = Prompt_Size × Prefill_Time + Response_Size × TPOT
Throughput = Response_Size / E2E_Latency
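
A sketch of the same metrics in TypeScript; the units below reflect one consistent reading of the formulas (parameters in billions, TFLOPS and GB/s as peak specs, times in milliseconds) and may not match src/utils/calculator.ts exactly:

// Performance-metric sketch. Assumed units: paramsB = parameters in billions,
// tflops = per-GPU compute (TFLOPS), bandwidthGBs = memory bandwidth (GB/s),
// all times in milliseconds. These unit assumptions are illustrative only.
function prefillTimePerTokenMs(paramsB: number, numGpus: number, tflops: number): number {
  return (2 * paramsB) / numGpus / tflops;
}

function timePerOutputTokenMs(paramsB: number, numGpus: number, bandwidthGBs: number): number {
  return ((2 * paramsB) / numGpus / bandwidthGBs) * 1000;
}

function e2eLatencyMs(promptTokens: number, responseTokens: number, prefillMs: number, tpotMs: number): number {
  return promptTokens * prefillMs + responseTokens * tpotMs;
}

function throughputTokensPerSecond(responseTokens: number, e2eMs: number): number {
  return responseTokens / (e2eMs / 1000);
}

For instance, a 70B-parameter model on a single GPU with roughly 4800 GB/s of bandwidth gives a TPOT of about 2 × 70 / 4800 × 1000 ≈ 29 ms per output token under these assumptions.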

🖥️ Supported Hardware

NVIDIA GPUs (Consumer & Data Center)

| Series   | Models                              | Memory Range |
|----------|-------------------------------------|--------------|
| RTX 40   | 4090, 4080 SUPER, 4070 Ti, 4060 Ti  | 8GB - 24GB   |
| RTX 30   | 3090 Ti, 3090, 3080 Ti, 3070, 3060  | 8GB - 24GB   |
| H-Series | H100 SXM/PCIe/NVL, H200 SXM/NVL     | 80GB - 188GB |
| A-Series | A100 80GB/40GB, A30, A10            | 24GB - 80GB  |
| L-Series | L40S, L40                           | 48GB         |

AMD GPUs (Consumer & Data Center)

| Series    | Models                              | Memory Range  |
|-----------|-------------------------------------|---------------|
| RX 7000   | 7900 XTX, 7900 XT, 7800 XT, 7700 XT | 12GB - 24GB   |
| RX 6000   | 6900 XT, 6800 XT, 6700 XT           | 12GB - 16GB   |
| MI Series | MI300X, MI250X                      | 128GB - 192GB |

Intel GPUs

| Series | Models     | Memory Range |
|--------|------------|--------------|
| Arc A  | A770, A750 | 8GB - 16GB   |

Apple Silicon (Unified Memory)

| Generation | Models                       | Memory Range | Bandwidth         |
|------------|------------------------------|--------------|-------------------|
| M4 (2024)  | M4, M4 Pro, M4 Max           | 16GB - 128GB | 120GB/s - 546GB/s |
| M3 (2023)  | M3, M3 Pro, M3 Max, M3 Ultra | 24GB - 512GB | 100GB/s - 800GB/s |
| M2 (2022)  | M2, M2 Pro, M2 Max, M2 Ultra | 24GB - 192GB | 100GB/s - 800GB/s |
| M1 (2020)  | M1, M1 Pro, M1 Max, M1 Ultra | 16GB - 128GB | 68GB/s - 800GB/s  |

Note: Apple Silicon supports adjustable unified memory configurations

🤖 Supported Models

🏠 Local Models (Auto-Updated from Ollama)

  • Meta Llama: 3.1 (8B, 70B, 405B), 3.2 (1B, 3B, 11B, 90B), 3.3 (70B), 4 (expected)
  • Mistral AI: 7B v0.3, Nemo 12B, Small 22B, Large 123B
  • Mixtral: 8x7B, 8x22B (Mixture of Experts)
  • Alibaba Qwen: 2.5 (0.5B-72B), 2.5-Coder (7B-32B), Qwen3 series
  • Google Gemma: 2B, 7B, 9B, 27B, Gemma3 series
  • Microsoft Phi: 3-Mini (3.8B), 3-Medium (14B), Phi4 (14B)
  • DeepSeek: Coder (6.7B, 33B), DeepSeek-R1 (1.5B-67B), DeepSeek-V3
  • Code Models: CodeLlama, StarCoder2, CodeGemma, Granite-Code
  • Lightweight: TinyLlama (1.1B), SmolLM2 (135M-1.7B), MiniCPM
  • Specialized: Nomic-Embed, BGE, Moondream (vision), LLaVA

☁️ Proprietary Models (API Only)

  • Anthropic Claude: 3 Haiku, 3 Sonnet, 3 Opus
  • OpenAI GPT: 3.5-Turbo, 4, 4-Turbo
  • Google Gemini: 1.5 Flash, 1.5 Pro

🔧 Quantization Support

| Format | Bits | Memory Usage | Quality    | Use Case             |
|--------|------|--------------|------------|----------------------|
| FP32   | 32   | 100%         | Highest    | Research, training   |
| FP16   | 16   | 50%          | High       | Production inference |
| BF16   | 16   | 50%          | High       | Stable training      |
| INT8   | 8    | 25%          | Good       | Efficient inference  |
| INT6   | 6    | 18.75%       | Moderate   | Memory-constrained   |
| INT5   | 5    | 15.625%      | Moderate   | Extreme efficiency   |
| INT4   | 4    | 12.5%        | Acceptable | Maximum practical    |
| INT3   | 3    | 9.375%       | Poor       | Research             |
| INT2   | 2    | 6.25%        | Very Poor  | Experimental         |
| INT1   | 1    | 3.125%       | Unusable   | Binary networks      |
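
The memory-usage column follows directly from bits per parameter (bits ÷ 8 bytes per weight). A small lookup like the sketch below captures it; the real options live in src/data/quantizationConfigs.ts:

// Bytes per parameter by quantization format (bits / 8). Illustrative sketch;
// the real configuration lives in src/data/quantizationConfigs.ts.
const QUANT_BITS: Record<string, number> = {
  FP32: 32, FP16: 16, BF16: 16,
  INT8: 8, INT6: 6, INT5: 5, INT4: 4, INT3: 3, INT2: 2, INT1: 1,
};

function bytesPerParam(format: keyof typeof QUANT_BITS): number {
  return QUANT_BITS[format] / 8;
}

// Example: a 7B-parameter model at INT4 needs about 7e9 × 0.5 / 1024³ ≈ 3.3 GiB
// for weights alone, versus ≈ 13 GiB at FP16.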

🏗️ Architecture

src/
├── components/                 # React components
│   ├── Calculator.tsx         # Main calculator interface
│   ├── DebugModels.tsx       # Model debugging tools
│   ├── DatabaseStatus.tsx    # Database status display
│   └── FeatureHighlights.tsx # Feature showcase
├── data/                      # Static configuration
│   ├── gpuSpecs.ts           # GPU specifications
│   ├── quantizationConfigs.ts # Quantization options
│   └── ollamaModels.ts       # Fallback model data
├── hooks/                     # React hooks
│   └── useDataUpdater.ts     # Auto-updating data hook
├── types/                     # TypeScript definitions
│   └── index.ts              # Shared type definitions
├── utils/                     # Core logic
│   ├── calculator.ts         # Calculation engine
│   ├── dataUpdater.ts        # Dynamic model fetching
│   └── __tests__/            # Test suites
└── App.tsx                   # Main application

🛠️ Development

Scripts

npm run dev          # Development server with HMR
npm run build        # Production build
npm run preview      # Preview production build
npm run test         # Run test suite
npm run test:watch   # Tests in watch mode
npm run test:coverage # Coverage report
npm run lint         # Code linting
npm run lint:fix     # Auto-fix linting issues

Testing

Comprehensive test coverage for:

  • ✅ Memory calculation accuracy
  • ✅ Performance metric calculations
  • ✅ Quantization conversions
  • ✅ Multi-GPU configurations
  • ✅ OOM detection logic
  • ✅ Edge cases and error handling
  • ✅ Data fetching and parsing

# Run tests
npm run test

# Watch mode during development
npm run test:watch

# Generate coverage report
npm run test:coverage
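
A representative unit test might look like the sketch below (the local helper stands in for the real calculation engine; names and asserted values are illustrative):

// Illustrative Jest test sketch. The helper stands in for the real engine in
// src/utils/calculator.ts; the asserted relationships are the point, not the values.
const bytesPerParam = (bits: number): number => bits / 8;

describe('quantization conversions', () => {
  it('scales weight memory linearly with bit width', () => {
    expect(bytesPerParam(16)).toBe(2);
    expect(bytesPerParam(4)).toBeCloseTo(bytesPerParam(8) / 2);
  });
});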

Technical Stack

  • Framework: React 18 + TypeScript
  • UI Library: Material-UI (MUI) v5
  • Build Tool: Vite (fast HMR, modern bundling)
  • Testing: Jest + React Testing Library
  • Code Quality: ESLint + TypeScript ESLint
  • Charts: Recharts for data visualization

🔄 Auto-Update System

The application automatically fetches the latest model information from Ollama's model registry:

Features

  • 🔄 24-Hour Auto-Updates: Checks for new models daily
  • 🚀 Force Update: Manual refresh for immediate updates
  • 📦 CORS Proxy: Bypasses browser restrictions via Vite proxy
  • 🛡️ Fallback System: Uses cached data if updates fail
  • 🐛 Debug Tools: Built-in model fetching diagnostics
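
A simplified version of this update-with-fallback flow is sketched below; the proxy path, cache key, and response shape are assumptions, and the real logic lives in src/utils/dataUpdater.ts:

// Simplified fetch-with-fallback sketch. '/ollama' is assumed to be rewritten
// to the Ollama registry by the Vite proxy; cache key and types are illustrative.
interface ModelEntry { name: string; parameters: number; }

const CACHE_KEY = 'ollama-models-cache';
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

async function getModels(bundledFallback: ModelEntry[]): Promise<ModelEntry[]> {
  const cached = localStorage.getItem(CACHE_KEY);
  if (cached) {
    const { timestamp, models } = JSON.parse(cached);
    if (Date.now() - timestamp < ONE_DAY_MS) return models; // refreshed within 24h
  }
  try {
    const response = await fetch('/ollama/library');
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    const models: ModelEntry[] = await response.json();
    localStorage.setItem(CACHE_KEY, JSON.stringify({ timestamp: Date.now(), models }));
    return models;
  } catch {
    // Fall back to stale cached data, then to the bundled list in src/data/ollamaModels.ts.
    return cached ? JSON.parse(cached).models : bundledFallback;
  }
}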

Model Detection

Automatically detects and adds new models including:

  • Latest Llama, Gemma, Qwen releases
  • Emerging models from Mistral, DeepSeek
  • Specialized models (code, vision, embedding)
  • Community-contributed models

🚀 Deployment

Production Build

# Create optimized build
npm run build

# Preview production build
npm run preview

# Deploy dist/ folder to your hosting platform

Environment Configuration

Create .env.local for environment-specific settings:

VITE_API_BASE_URL=https://your-api-domain.com
VITE_ENABLE_DEBUG=false
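
Vite exposes VITE_-prefixed variables on import.meta.env at build time, so application code can read them like this:

// Reading the environment variables above from application code.
const apiBaseUrl: string = import.meta.env.VITE_API_BASE_URL ?? '';
const debugEnabled: boolean = import.meta.env.VITE_ENABLE_DEBUG === 'true';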

Hosting Recommendations

  • Vercel: Zero-config deployment with automatic HTTPS
  • Netlify: Easy deployment with form handling
  • GitHub Pages: Free hosting for open-source projects
  • Docker: Containerized deployment for enterprise

🤝 Contributing

Development Setup

# Fork and clone the repository
git clone https://github.com/your-username/llm-memory-calculator.git
cd llm-memory-calculator

# Install dependencies
npm install

# Start development server
npm run dev

Contribution Guidelines

  1. 🔀 Fork the repository
  2. 🌿 Create a feature branch: git checkout -b feature/amazing-feature
  3. ✨ Develop your changes with tests
  4. ✅ Test your changes: npm run test
  5. 📝 Commit with clear messages: git commit -m 'Add amazing feature'
  6. 🚀 Push to your branch: git push origin feature/amazing-feature
  7. 📋 Submit a pull request

Areas for Contribution

  • 🆕 New GPU/processor support
  • 🤖 Additional LLM model support
  • 📊 Enhanced visualization features
  • 🧮 Advanced calculation options
  • 🌐 Internationalization
  • 📱 Mobile responsiveness
  • ⚡ Performance optimizations

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

  • VMware: Sizing methodology and performance formulas
  • qoofyk: Original Python calculator inspiration
  • Ollama Team: Model registry and local inference platform
  • TechPowerUp: Comprehensive GPU specification database
  • React Community: Exceptional development ecosystem

📧 Support

For questions, issues, or feature requests, please open an issue on GitHub.

⭐ Star this repository if it helped you! ⭐

Made with ❤️ for the LLM community