Hardware Needed to Run a Local LLM
Introduction
From cost savings and data privacy to having more choice and control over the model being run, there are compelling reasons to run an LLM locally rather than relying on services like Microsoft Copilot or ChatGPT. However, the process can be overwhelming for newcomers, and rapidly changing technology means much online information quickly becomes outdated.
This guide provides an introductory overview for those interested in running their own LLM, covering key variables to consider when taking your first steps.
Hardware Considerations
One of the first considerations for running local LLMs is available hardware. While LLMs can run on almost any computer, optimal performance requires one or more graphics cards with enough VRAM for the models you plan to run.
Consumer vs Workstation Platform
Platform choice typically affects expandability rather than raw inference performance. Key factors include:
- PCIe lanes: Determine how many GPUs can be used effectively
- Memory channels: Critical for CPU inference performance
Workstation platforms like AMD Threadripper Pro and Intel Xeon W provide significantly more PCIe lanes and memory channels than consumer platforms (AMD Ryzen or Intel Core). This becomes increasingly important with larger models and multi-GPU configurations.
Between AMD and Intel platforms, LLM performance with a given GPU is nearly identical.
CPU vs GPU for Inference
GPU Inference:
- Significantly faster (10x to 100x) than CPU inference
- Limited by VRAM capacity (typically 8GB to 48GB)
- Best for smaller models that fit in VRAM
CPU Inference:
- Uses system RAM instead of VRAM
- Can handle larger models that exceed VRAM capacity
- Much slower performance
Hybrid Approach: Inference libraries like llama.cpp support hybrid methods utilizing both CPU/RAM and GPU/VRAM resources, offering a middle ground when models don’t entirely fit in VRAM.
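To make this concrete, here is a minimal sketch of hybrid offloading using the llama-cpp-python bindings for llama.cpp. The model file and layer count are placeholder assumptions; in practice you would raise n_gpu_layers until VRAM is full and let the remaining layers run on the CPU from system RAM.

```python
# Minimal sketch of hybrid CPU/GPU inference with llama-cpp-python.
# The GGUF filename and layer count below are placeholders, not recommendations.
from llama_cpp import Llama

llm = Llama(
    model_path="models/example-8b-instruct-q4_k_m.gguf",  # hypothetical quantized model file
    n_gpu_layers=20,  # offload 20 layers to the GPU; the rest run on the CPU from RAM
    n_ctx=8192,       # context window to allocate
)

output = llm("Q: Why offload only some layers to the GPU? A:", max_tokens=64)
print(output["choices"][0]["text"])
```

Setting n_gpu_layers to 0 gives pure CPU inference, while -1 (or any value larger than the model's layer count) offloads everything to the GPU.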
NVIDIA vs AMD GPUs
For LLMs, NVIDIA is the straightforward choice due to:
- Broad CUDA support across Linux and Windows
- Wide adoption across inference libraries
- Superior raw performance
However, AMD shouldn’t be dismissed entirely, especially if you already own AMD GPUs. Options like LM Studio offer ROCm support (technical preview on Windows), and Linux provides more robust AMD GPU support.
RAM & Storage Considerations
RAM Requirements
For GPU inference, RAM primarily facilitates loading model weights from storage into VRAM. We recommend:
- At least as much system RAM as total VRAM
- Preferably 1.5-2x the total VRAM
For CPU inference, RAM bandwidth is the primary performance bottleneck:
- Higher memory frequency matters more than lower latency
- Platforms supporting more memory channels perform better
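Because token generation on CPU is largely memory-bandwidth bound, a useful back-of-the-envelope estimate is memory bandwidth divided by model size: each generated token requires streaming the full set of weights from RAM once. The bandwidth and model-size figures below are illustrative assumptions, not benchmark results.

```python
# Rough upper bound on CPU token generation speed for a bandwidth-bound workload:
# tokens/sec ~= usable memory bandwidth (GB/s) / model size in RAM (GB).

def estimate_cpu_tokens_per_sec(model_size_gb: float, mem_bandwidth_gbs: float) -> float:
    return mem_bandwidth_gbs / model_size_gb

# Illustrative assumptions (not measurements):
# a 4-bit 70B model (~40 GB) on dual-channel DDR5 (~80 GB/s theoretical)
print(estimate_cpu_tokens_per_sec(40, 80))    # ~2 tokens/sec
# the same model on an 8-channel workstation platform (~300 GB/s)
print(estimate_cpu_tokens_per_sec(40, 300))   # ~7.5 tokens/sec
```

This is also why the memory channel count noted above matters so much for CPU inference.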
Storage
Storage doesn’t significantly impact performance once models are loaded. Fast NVMe drives minimize loading times, but this matters most when frequently testing different models. Keep in mind that larger models can measure hundreds of GB, so storage capacity becomes important for maintaining a model library.
Software & OS Options
Operating System Choice
Linux is the preferred OS for running LLMs:
- Most AI/ML projects developed and optimized for Linux
- Lower system resource usage, especially VRAM consumed by the desktop environment
- Proper NCCL support for multi-GPU configurations
- Better documentation and support
Running LLMs on Windows has become more accessible with tools like:
- Chat with RTX
- LM Studio
- text-generation-webui (oobabooga)
These provide simple GUIs for getting started without command-line expertise.
Multi-User Access
When planning to serve multiple users, consider:
- Batching: Allows parallel processing of multiple inputs, greatly improving throughput at the cost of increased memory usage
- API access: Most modern setups use API servers rather than direct LLM access, enabling integration with various applications and plugins
- Security: Ensure your server isn’t inadvertently exposed to unauthorized users
Popular backends for multi-user scenarios include Ollama and vLLM, while text-generation-webui supports API functionality with the --api flag.
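As a concrete illustration of the API-access point, the sketch below sends a chat request to a locally hosted, OpenAI-compatible endpoint. The port, path, and model name are assumptions; vLLM, Ollama, and text-generation-webui each expose such an endpoint, but on different default addresses, so check your backend's documentation.

```python
# Minimal sketch of querying a local, OpenAI-compatible LLM server over HTTP.
# The URL and model name are assumptions that depend on which backend you run.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",   # assumed local endpoint and port
    json={
        "model": "example-8b-instruct",            # whatever model the server has loaded
        "messages": [{"role": "user", "content": "Explain why batching improves throughput."}],
        "max_tokens": 128,
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```

Because the request is plain HTTP, any application or plugin that speaks the OpenAI API format can be pointed at your local server instead, which is also why securing that endpoint matters.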
Choosing Which LLM to Use
Two Key Questions:
1. Does it do what I want it to do?
- For translation: Choose models trained on your target languages (e.g., Qwen2-7B-Instruct for Japanese)
- For coding: Select coding-specific models or models with code in training data
- For image analysis: Use multimodal models like idefics2-8b-base
Note that newer general models often outperform older specialized models due to improved generalization capabilities.
2. Does it fit within available system resources?
Estimating Memory Requirements
Parameter Count Method: Models are typically named with their parameter count (e.g., Llama 3.1 comes in 8B, 70B, and 405B versions).
For FP16/BF16 models: Parameter count (billions) × 2 = GB of memory required
Example: An 8B parameter model needs roughly 16GB of memory.
Quantization: Quantized models reduce memory footprint at the cost of some accuracy:
- 8-bit quantization: ~50% of original size
- 4-bit quantization: ~25% of original size
Common quantization formats: GPTQ, GGUF, AWQ, exl2
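These rules of thumb are easy to turn into a quick calculation. The sketch below covers weights only; quantized formats also store metadata and often keep some layers at higher precision, so real files are typically somewhat larger than this estimate.

```python
# Rough weight-memory estimate from parameter count and bits per weight:
# FP16/BF16 ~2 bytes/parameter, 8-bit ~1 byte, 4-bit ~0.5 bytes.
# Weights only -- context (KV cache) and framework overhead are extra.

def estimate_weight_memory_gb(params_billion: float, bits_per_weight: int = 16) -> float:
    return params_billion * bits_per_weight / 8

for bits in (16, 8, 4):
    print(f"70B model @ {bits}-bit: ~{estimate_weight_memory_gb(70, bits):.0f} GB")
# 16-bit: ~140 GB, 8-bit: ~70 GB, 4-bit: ~35 GB
```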
Context Window Considerations
Context windows have grown dramatically:
- Llama 1: 2,048 tokens
- Llama 2: 4,096 tokens
- Llama 3: 8,192 tokens
- Llama 3.1: 128,000 tokens
Larger context windows require additional memory beyond model weights. Plan for approximately 15% additional memory on top of the model size to accommodate context.
Optimization techniques like Flash Attention and KV cache (context) quantization can significantly reduce this overhead.
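For a closer estimate than the flat 15% rule, the KV cache that holds the context can be sized directly from the model architecture: two tensors (keys and values) per layer, per KV head, per head dimension, per token. The architecture values in the sketch below are approximate Llama-3.1-8B-style assumptions (32 layers, 8 KV heads via grouped-query attention, head dimension 128); substitute the actual config for your model.

```python
# Back-of-the-envelope KV cache (context) memory estimate:
# 2 (keys + values) x layers x kv_heads x head_dim x tokens x bytes per element.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9

# Assumed Llama-3.1-8B-style architecture values:
print(kv_cache_gb(32, 8, 128, 8_192))     # ~1 GB at an 8K context
print(kv_cache_gb(32, 8, 128, 128_000))   # ~17 GB at the full 128K context
```

Halving bytes_per_elem models the effect of the 8-bit KV cache quantization mentioned above.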
LLM Leaderboards
Review leaderboards to compare model capabilities:
- LMSys Leaderboard: Includes proprietary models like ChatGPT
- Open LLM Leaderboard: Only open-source models
While no benchmark is perfect, they provide useful comparative insights.
Final Thoughts
Running LLMs on local hardware has never been more accessible. From choosing the right hardware platform to selecting appropriate models and software tools, the landscape offers options for various budgets and use cases.
Key takeaways:
- GPU inference offers the best performance but is limited by VRAM capacity
- Quantization enables running larger models on modest hardware
- Linux provides the most robust support, but Windows options are improving
- Context windows significantly impact memory requirements
- Choose models based on both capability and resource constraints
Ready to deploy LLMs on your infrastructure? Contact us to discuss your specific requirements and hardware recommendations.
This guide provides foundational knowledge for getting started with local LLMs. The AI/ML landscape evolves rapidly, so always verify current best practices when implementing solutions.