A comprehensive web application for calculating Large Language Model (LLM) memory requirements and performance metrics for various GPU configurations and unified memory systems (Apple Silicon).
- Auto-Updating Model Database: Fetches the latest Ollama models including Gemma3, DeepSeek-R1, Qwen3, Llama4, Phi4
- Enhanced Apple Silicon Support: M4 series support, adjustable unified memory (8GB-512GB)
- Advanced Quantization: 1-bit to 32-bit precision options (INT1, INT2, INT3, INT4, INT5, INT6, INT8, FP16, BF16, FP32)
- Modular Calculations: Granular control over memory and performance calculations
- Comprehensive Memory Analysis: KV cache, activation memory, framework overhead, system overhead, peak memory
- Memory Footprint Calculation: Sizing formulas based on VMware's LLM inference sizing guide
- Performance Metrics: Latency, throughput, time-to-first-token, prefill time estimation
- GPU Database: 80+ GPUs including NVIDIA H100/H200, AMD MI300X, Apple M4 series
- LLM Model Support: 200+ models from Ollama + proprietary APIs (Claude, GPT, Gemini)
- Real-time Analysis: OOM detection, optimization recommendations, warnings
- Multi-GPU Support: Tensor parallelism calculations for enterprise deployments
Prerequisites:

- Node.js 18+
- npm, yarn, or pnpm
```bash
# Clone the repository
git clone <repository-url>
cd llm-memory-calculator

# Install dependencies
npm install

# Start development server
npm run dev

# Open http://localhost:3000
```

- Select Model: Choose from auto-updated Ollama models or proprietary APIs
- Choose Hardware: Select from the comprehensive GPU/processor database
- Configure Parameters: Set context size, concurrent requests, quantization
- Customize Calculations: Toggle memory components and performance metrics
- Analyze Results: Review memory usage, performance, warnings, and recommendations
Memory components:

- KV Cache Memory: Attention cache for inference contexts
- Activation Memory: Intermediate computation memory
- Framework Overhead: PyTorch/CUDA overhead (15% default)
- System Overhead: OS/driver reserved memory (10% default)
- Peak Memory Factor: Model loading peak usage (1.5x multiplier)

Performance metrics:

- Prefill Time: Input processing latency
- Generation Time (TPOT): Time per output token
- Throughput: Tokens per second output rate
- End-to-End Latency: Complete request processing time

Analysis output:

- Warnings: Performance bottlenecks, OOM conditions
- Recommendations: Optimization suggestions, hardware advice
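The OOM warnings reduce to comparing the estimated total memory (from the formulas in the next section) against what the selected hardware provides. A minimal hypothetical sketch, not the actual `calculator.ts` logic:

```typescript
// Hypothetical OOM check: requiredGB comes from the memory formulas below.
function checkMemoryFit(requiredGB: number, gpuMemoryGB: number, numGpus: number): string[] {
  const warnings: string[] = [];
  const availableGB = gpuMemoryGB * numGpus;
  if (requiredGB > availableGB) {
    warnings.push(
      `OOM: ${requiredGB.toFixed(1)} GB required, ${availableGB} GB available - ` +
      `try a lower-bit quantization, a smaller context, or more GPUs`
    );
  } else if (requiredGB > 0.9 * availableGB) {
    warnings.push('Less than 10% headroom: peak usage during model loading may OOM');
  }
  return warnings;
}
```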
Memory is estimated as:

```
Total_Memory = Model_Weights + KV_Cache + Activation_Memory + Framework_Overhead + System_Overhead

Model_Weights      = parameters × quantization_bytes_per_param
KV_Cache           = 2 × 2 × n_layers × d_model × context_window × concurrent_requests / 1024³
Activation_Memory  = batch_size × seq_len × n_layers × d_model × bytes_per_activation / 1024³
Framework_Overhead = Model_Weights × 0.15      (configurable)
System_Overhead    = Total_GPU_Memory × 0.10   (configurable)
```
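The same formulas as a small TypeScript sketch (names are illustrative, not the real `calculator.ts` API; `batch_size`/`seq_len` are taken to be the concurrent-request count and context window, and results are in GiB):

```typescript
const GiB = 1024 ** 3;

interface MemoryInputs {
  parameters: number;             // total model parameters
  quantBytesPerParam: number;     // e.g. 2 for FP16, 0.5 for INT4
  nLayers: number;
  dModel: number;                 // hidden dimension
  contextWindow: number;          // tokens
  concurrentRequests: number;
  bytesPerActivation: number;     // e.g. 2 for FP16 activations
  totalGpuMemoryGB: number;
  frameworkOverheadRate?: number; // default 0.15
  systemOverheadRate?: number;    // default 0.10
}

function estimateTotalMemoryGB(m: MemoryInputs): number {
  const weights = (m.parameters * m.quantBytesPerParam) / GiB;
  // K and V tensors (×2), stored at 2 bytes each, per layer per token
  const kvCache =
    (2 * 2 * m.nLayers * m.dModel * m.contextWindow * m.concurrentRequests) / GiB;
  const activations =
    (m.concurrentRequests * m.contextWindow * m.nLayers * m.dModel *
      m.bytesPerActivation) / GiB;
  const framework = weights * (m.frameworkOverheadRate ?? 0.15);
  const system = m.totalGpuMemoryGB * (m.systemOverheadRate ?? 0.10);
  return weights + kvCache + activations + framework + system;
}
```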
Performance is estimated as:

```
Prefill_Time          = (2 × Model_Parameters / num_GPUs) / GPU_TFLOPS
Time_per_Output_Token = (2 × Model_Parameters / num_GPUs) / Memory_Bandwidth × 1000
TTFT                  = Prefill_Time + TPOT
E2E_Latency           = Prompt_Size × Prefill_Time + Response_Size × TPOT
Throughput            = Response_Size / E2E_Latency
```
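And the performance side in the same sketch style (times in milliseconds; `Prefill_Time` is read as a per-input-token cost, which is how the E2E formula uses it, and the `2 ×` factors assume 2 FLOPs and 2 bytes per parameter per token):

```typescript
interface PerfInputs {
  parameters: number;        // model parameters
  numGpus: number;
  gpuTflops: number;         // per-GPU compute throughput
  memBandwidthGBps: number;  // per-GPU memory bandwidth
  promptTokens: number;
  responseTokens: number;
}

function estimatePerformance(p: PerfInputs) {
  // 2 FLOPs per parameter per token, split across GPUs
  const prefillMsPerToken =
    ((2 * p.parameters) / p.numGpus) / (p.gpuTflops * 1e12) * 1000;
  // Each generated token streams the (2-byte) weights once from memory
  const tpotMs =
    ((2 * p.parameters) / p.numGpus) / (p.memBandwidthGBps * 1e9) * 1000;
  const ttftMs = prefillMsPerToken + tpotMs;
  const e2eMs = p.promptTokens * prefillMsPerToken + p.responseTokens * tpotMs;
  const throughputTps = p.responseTokens / (e2eMs / 1000);
  return { prefillMsPerToken, tpotMs, ttftMs, e2eMs, throughputTps };
}

// Example: an 8B model on one GPU with 1000 GB/s of bandwidth gives
// TPOT ≈ 2 × 8e9 bytes / 1e12 B/s = 16 ms, i.e. ~62 tokens/s per request.
```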
**NVIDIA**

| Series | Models | Memory Range |
|---|---|---|
| RTX 40 | 4090, 4080 SUPER, 4070 Ti, 4060 Ti | 8GB - 24GB |
| RTX 30 | 3090 Ti, 3090, 3080 Ti, 3070, 3060 | 8GB - 24GB |
| H-Series | H100 SXM/PCIe/NVL, H200 SXM/NVL | 80GB - 188GB |
| A-Series | A100 80GB/40GB, A30, A10 | 24GB - 80GB |
| L-Series | L40S, L40 | 48GB |
**AMD**

| Series | Models | Memory Range |
|---|---|---|
| RX 7000 | 7900 XTX, 7900 XT, 7800 XT, 7700 XT | 12GB - 24GB |
| RX 6000 | 6900 XT, 6800 XT, 6700 XT | 12GB - 16GB |
| MI Series | MI300X, MI250X | 128GB - 192GB |
**Intel**

| Series | Models | Memory Range |
|---|---|---|
| Arc A | A770, A750 | 8GB - 16GB |
**Apple Silicon**

| Generation | Models | Memory Range | Bandwidth |
|---|---|---|---|
| M4 (2024) | M4, M4 Pro, M4 Max | 16GB - 128GB | 120GB/s - 546GB/s |
| M3 (2023) | M3, M3 Pro, M3 Max, M3 Ultra | 24GB - 512GB | 100GB/s - 800GB/s |
| M2 (2022) | M2, M2 Pro, M2 Max, M2 Ultra | 24GB - 192GB | 100GB/s - 800GB/s |
| M1 (2020) | M1, M1 Pro, M1 Max, M1 Ultra | 16GB - 128GB | 68GB/s - 800GB/s |
Note: Apple Silicon supports adjustable unified memory configurations
Open-source models (via Ollama):

- Meta Llama: 3.1 (8B, 70B, 405B), 3.2 (1B, 3B, 11B, 90B), 3.3 (70B), 4 (expected)
- Mistral AI: 7B v0.3, Nemo 12B, Small 22B, Large 123B
- Mixtral: 8x7B, 8x22B (Mixture of Experts)
- Alibaba Qwen: 2.5 (0.5B-72B), 2.5-Coder (7B-32B), Qwen3 series
- Google Gemma: 2B, 7B, 9B, 27B, Gemma3 series
- Microsoft Phi: 3-Mini (3.8B), 3-Medium (14B), Phi4 (14B)
- DeepSeek: Coder (6.7B, 33B), DeepSeek-R1 (1.5B-67B), DeepSeek-V3
- Code Models: CodeLlama, StarCoder2, CodeGemma, Granite-Code
- Lightweight: TinyLlama (1.1B), SmolLM2 (135M-1.7B), MiniCPM
- Specialized: Nomic-Embed, BGE, Moondream (vision), LLaVA
Proprietary API models:

- Anthropic Claude: 3 Haiku, 3 Sonnet, 3 Opus
- OpenAI GPT: 3.5-Turbo, 4, 4-Turbo
- Google Gemini: 1.5 Flash, 1.5 Pro
| Format | Bits | Memory (vs FP32) | Quality | Use Case |
|---|---|---|---|---|
| FP32 | 32 | 100% | Highest | Research, training |
| FP16 | 16 | 50% | High | Production inference |
| BF16 | 16 | 50% | High | Stable training |
| INT8 | 8 | 25% | Good | Efficient inference |
| INT6 | 6 | 18.75% | Moderate | Memory-constrained |
| INT5 | 5 | 15.625% | Moderate | Extreme efficiency |
| INT4 | 4 | 12.5% | Acceptable | Maximum practical |
| INT3 | 3 | 9.375% | Poor | Research |
| INT2 | 2 | 6.25% | Very Poor | Experimental |
| INT1 | 1 | 3.125% | Unusable | Binary networks |
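The table reduces to simple arithmetic: memory relative to FP32 is `bits / 32`, and bytes per parameter is `bits / 8`. A tiny sketch (hypothetical helpers):

```typescript
// Bytes needed to store one parameter at a given bit width.
function bytesPerParam(bits: number): number {
  return bits / 8;
}

// Memory usage relative to FP32, as in the table above.
function relativeMemoryUsage(bits: number): string {
  return `${parseFloat(((bits / 32) * 100).toFixed(3))}%`;
}

// relativeMemoryUsage(4) === "12.5%"   (INT4 row)
// relativeMemoryUsage(6) === "18.75%"  (INT6 row)
// bytesPerParam(16)      === 2         (FP16/BF16)
```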
```
src/
├── components/                 # React components
│   ├── Calculator.tsx          # Main calculator interface
│   ├── DebugModels.tsx         # Model debugging tools
│   ├── DatabaseStatus.tsx      # Database status display
│   └── FeatureHighlights.tsx   # Feature showcase
├── data/                       # Static configuration
│   ├── gpuSpecs.ts             # GPU specifications
│   ├── quantizationConfigs.ts  # Quantization options
│   └── ollamaModels.ts         # Fallback model data
├── hooks/                      # React hooks
│   └── useDataUpdater.ts       # Auto-updating data hook
├── types/                      # TypeScript definitions
│   └── index.ts                # Shared type definitions
├── utils/                      # Core logic
│   ├── calculator.ts           # Calculation engine
│   ├── dataUpdater.ts          # Dynamic model fetching
│   └── __tests__/              # Test suites
└── App.tsx                     # Main application
```
```bash
npm run dev           # Development server with HMR
npm run build         # Production build
npm run preview       # Preview production build
npm run test          # Run test suite
npm run test:watch    # Tests in watch mode
npm run test:coverage # Coverage report
npm run lint          # Code linting
npm run lint:fix      # Auto-fix linting issues
```

Comprehensive test coverage for:
- Memory calculation accuracy
- Performance metric calculations
- Quantization conversions
- Multi-GPU configurations
- OOM detection logic
- Edge cases and error handling
- Data fetching and parsing
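A hypothetical example of what one of these cases might look like (reusing the `bytesPerParam` sketch from the quantization section; the real suites live in `src/utils/__tests__/`):

```typescript
// calculator.test.ts — illustrative only; assumes a bytesPerParam export.
import { bytesPerParam } from '../calculator';

describe('quantization conversions', () => {
  it('maps bit widths to bytes per parameter', () => {
    expect(bytesPerParam(32)).toBe(4);  // FP32
    expect(bytesPerParam(16)).toBe(2);  // FP16 / BF16
    expect(bytesPerParam(4)).toBe(0.5); // INT4
  });

  it('halves weight memory going from FP16 to INT8', () => {
    expect(bytesPerParam(8) / bytesPerParam(16)).toBe(0.5);
  });
});
```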
```bash
# Run tests
npm run test

# Watch mode during development
npm run test:watch

# Generate coverage report
npm run test:coverage
```

Tech stack:

- Framework: React 18 + TypeScript
- UI Library: Material-UI (MUI) v5
- Build Tool: Vite (fast HMR, modern bundling)
- Testing: Jest + React Testing Library
- Code Quality: ESLint + TypeScript ESLint
- Charts: Recharts for data visualization
The application automatically fetches the latest model information from Ollama's model registry:
- 24-Hour Auto-Updates: Checks for new models daily
- Force Update: Manual refresh for immediate updates
- CORS Proxy: Bypasses browser restrictions via the Vite dev-server proxy (see the sketch below this list)
- Fallback System: Uses cached data if updates fail
- Debug Tools: Built-in model-fetching diagnostics
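A minimal sketch of the proxy idea in `vite.config.ts` (the `/api/ollama` path and target URL are assumptions, not the project's actual configuration): the browser calls the dev server, which forwards the request server-side, so no cross-origin request ever leaves the page.

```typescript
// vite.config.ts — dev-server proxy sketch for fetching the Ollama registry.
import { defineConfig } from 'vite';
import react from '@vitejs/plugin-react';

export default defineConfig({
  plugins: [react()],
  server: {
    proxy: {
      // Browser calls /api/ollama/...; Vite forwards to ollama.com server-side.
      '/api/ollama': {
        target: 'https://ollama.com',
        changeOrigin: true,
        rewrite: (path) => path.replace(/^\/api\/ollama/, ''),
      },
    },
  },
});
```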
Automatically detects and adds new models including:
- Latest Llama, Gemma, Qwen releases
- Emerging models from Mistral, DeepSeek
- Specialized models (code, vision, embedding)
- Community-contributed models
```bash
# Create optimized build
npm run build

# Preview production build
npm run preview

# Deploy the dist/ folder to your hosting platform
```

Create `.env.local` for environment-specific settings (a snippet showing how these are read follows the hosting options below):

```
VITE_API_BASE_URL=https://your-api-domain.com
VITE_ENABLE_DEBUG=false
```

- Vercel: Zero-config deployment with automatic HTTPS
- Netlify: Easy deployment with form handling
- GitHub Pages: Free hosting for open-source projects
- Docker: Containerized deployment for enterprise
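Vite inlines `VITE_`-prefixed variables at build time and exposes them on `import.meta.env`; reading them looks like this:

```typescript
// Anywhere in the app source (requires Vite's client types).
const apiBaseUrl = import.meta.env.VITE_API_BASE_URL ?? '';
const debugEnabled = import.meta.env.VITE_ENABLE_DEBUG === 'true';
```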
```bash
# Fork and clone the repository
git clone https://github.com/your-username/llm-memory-calculator.git
cd llm-memory-calculator

# Install dependencies
npm install

# Start development server
npm run dev
```

- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Develop your changes with tests
- Test your changes: `npm run test`
- Commit with clear messages: `git commit -m 'Add amazing feature'`
- Push to your branch: `git push origin feature/amazing-feature`
- Submit a pull request
Areas where contributions are welcome:

- New GPU/processor support
- Additional LLM model support
- Enhanced visualization features
- Advanced calculation options
- Internationalization
- Mobile responsiveness
- Performance optimizations
MIT License - see LICENSE file for details.
- VMware: Sizing methodology and performance formulas
- qoofyk: Original Python calculator inspiration
- Ollama Team: Model registry and local inference platform
- TechPowerUp: Comprehensive GPU specification database
- React Community: Exceptional development ecosystem
- VMware LLM Inference Sizing Guide
- Original LLM Sizing Guide
- Ollama Model Library
- TechPowerUp GPU Database
For questions, issues, or feature requests:
- GitHub Issues
- Discussions
- Email: your-email@domain.com
⭐ Star this repository if it helped you! ⭐

Made with ❤️ for the LLM community