Hardware Needed to Run a Local LLM
Introduction
From cost savings and data privacy to having more choice and control over the model being run, there are compelling reasons to run an LLM locally rather than relying on services like Microsoft Copilot or ChatGPT. However, the process can be overwhelming for newcomers, and rapidly changing technology means much online information quickly becomes outdated.
This guide provides an introductory overview for those interested in running their own LLM, covering key variables to consider when taking your first steps.
Hardware Considerations
One of the first considerations for running local LLMs is available hardware. While LLMs can run on almost any computer, optimal performance requires one or more graphics cards with enough VRAM for the models you plan to run.
Consumer vs Workstation Platform
Platform choice typically affects expandability rather than raw inference performance. Key factors include:
- PCIe lanes: Determine how many GPUs can be used effectively
- Memory channels: Critical for CPU inference performance
Workstation platforms like AMD Threadripper Pro and Intel Xeon W provide significantly more PCIe lanes and memory channels than consumer platforms (AMD Ryzen or Intel Core). This becomes increasingly important with larger models and multi-GPU configurations.
Between AMD and Intel platforms, LLM performance with a given GPU is nearly identical.
CPU vs GPU for Inference
GPU Inference:
- Significantly faster (10x to 100x) than CPU inference
- Limited by VRAM capacity (typically 8GB to 48GB)
- Best for smaller models that fit in VRAM
CPU Inference:
- Uses system RAM instead of VRAM
- Can handle larger models that exceed VRAM capacity
- Much slower performance
Hybrid Approach: Inference libraries like llama.cpp support hybrid methods utilizing both CPU/RAM and GPU/VRAM resources, offering a middle ground when models don’t entirely fit in VRAM.
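To make this concrete, here is a minimal sketch of hybrid offloading using the llama-cpp-python bindings for llama.cpp. The model file and layer count are placeholder assumptions; in practice you would raise n_gpu_layers until VRAM is full and let the remaining layers run on the CPU from system RAM.

```python
# Minimal sketch of hybrid CPU/GPU inference with llama-cpp-python.
# The GGUF filename and layer count below are placeholders, not recommendations.
from llama_cpp import Llama

llm = Llama(
    model_path="models/example-8b-instruct-q4_k_m.gguf",  # hypothetical quantized model file
    n_gpu_layers=20,  # offload 20 layers to the GPU; the rest run on the CPU from RAM
    n_ctx=8192,       # context window to allocate
)

output = llm("Q: Why offload only some layers to the GPU? A:", max_tokens=64)
print(output["choices"][0]["text"])
```

Setting n_gpu_layers to 0 gives pure CPU inference, while -1 (or any value larger than the model's layer count) offloads everything to the GPU.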
NVIDIA vs AMD GPUs
For LLMs, NVIDIA is the straightforward choice due to:
- Broad CUDA support across Linux and Windows
- Wide adoption across inference libraries
- Superior raw performance
However, AMD shouldn’t be dismissed entirely, especially if you already own AMD GPUs. Options like LM Studio offer ROCm support (technical preview on Windows), and Linux provides more robust AMD GPU support.
RAM & Storage Considerations
RAM Requirements
For GPU inference, RAM primarily facilitates loading model weights from storage into VRAM. We recommend:
- At least as much system RAM as total VRAM
- Preferably 1.5-2x the total VRAM
For CPU inference, RAM bandwidth is the primary performance bottleneck:
- Higher memory frequency matters more than lower latency
- Platforms supporting more memory channels perform better
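Because token generation on CPU is largely memory-bandwidth bound, a useful back-of-the-envelope estimate is memory bandwidth divided by model size: each generated token requires streaming the full set of weights from RAM once. The bandwidth and model-size figures below are illustrative assumptions, not benchmark results.

```python
# Rough upper bound on CPU token generation speed for a bandwidth-bound workload:
# tokens/sec ~= usable memory bandwidth (GB/s) / model size in RAM (GB).

def estimate_cpu_tokens_per_sec(model_size_gb: float, mem_bandwidth_gbs: float) -> float:
    return mem_bandwidth_gbs / model_size_gb

# Illustrative assumptions (not measurements):
# a 4-bit 70B model (~40 GB) on dual-channel DDR5 (~80 GB/s theoretical)
print(estimate_cpu_tokens_per_sec(40, 80))    # ~2 tokens/sec
# the same model on an 8-channel workstation platform (~300 GB/s)
print(estimate_cpu_tokens_per_sec(40, 300))   # ~7.5 tokens/sec
```

This is also why the memory channel count noted above matters so much for CPU inference.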
Storage
Storage doesn’t significantly impact performance once models are loaded. Fast NVMe drives minimize loading times, but this matters most when frequently testing different models. Keep in mind that larger models can measure hundreds of GB, so storage capacity becomes important for maintaining a model library.
Software & OS Options
Operating System Choice
Linux is the preferred OS for running LLMs:
- Most AI/ML projects developed and optimized for Linux
- Lower system resource usage, especially VRAM consumed by the desktop environment
- Proper NCCL support for multi-GPU configurations
- Better documentation and support
Running LLMs on Windows has become more accessible with tools like:
- Chat with RTX
- LM Studio
- text-generation-webui (oobabooga)
These provide simple GUIs for getting started without command-line expertise.
Multi-User Access
When planning to serve multiple users, consider:
- Batching: Allows parallel processing of multiple inputs, greatly improving throughput at the cost of increased memory usage
- API access: Most modern setups use API servers rather than direct LLM access, enabling integration with various applications and plugins
- Security: Ensure your server isn’t inadvertently exposed to unauthorized users
Popular backends for multi-user scenarios include Ollama and vLLM, while text-generation-webui supports API functionality with the --api flag.
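As a concrete illustration of the API-access point, the sketch below sends a chat request to a locally hosted, OpenAI-compatible endpoint. The port, path, and model name are assumptions; vLLM, Ollama, and text-generation-webui each expose such an endpoint, but on different default addresses, so check your backend's documentation.

```python
# Minimal sketch of querying a local, OpenAI-compatible LLM server over HTTP.
# The URL and model name are assumptions that depend on which backend you run.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",   # assumed local endpoint and port
    json={
        "model": "example-8b-instruct",            # whatever model the server has loaded
        "messages": [{"role": "user", "content": "Explain why batching improves throughput."}],
        "max_tokens": 128,
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```

Because the request is plain HTTP, any application or plugin that speaks the OpenAI API format can be pointed at your local server instead, which is also why securing that endpoint matters.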
Choosing Which LLM to Use
Two Key Questions:
1. Does it do what I want it to do?
- For translation: Choose models trained on your target languages (e.g., Qwen2-7B-Instruct for Japanese)
- For coding: Select coding-specific models or models with code in training data
- For image analysis: Use multimodal models like idefics2-8b-base
Note that newer general models often outperform older specialized models due to improved generalization capabilities.
2. Does it fit within available system resources?
Estimating Memory Requirements
Parameter Count Method: Models are typically named with their parameter count (e.g., Llama 3.1 comes in 8B, 70B, and 405B versions).
For FP16/BF16 models: Parameter count (billions) × 2 = GB of memory required
Example: An 8B parameter model needs roughly 16GB of memory.
Quantization: Quantized models reduce memory footprint at the cost of some accuracy:
- 8-bit quantization: ~50% of original size
- 4-bit quantization: ~25% of original size
Common quantization formats: GPTQ, GGUF, AWQ, exl2
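These rules of thumb are easy to turn into a quick calculation. The sketch below covers weights only; quantized formats also store metadata and often keep some layers at higher precision, so real files are typically somewhat larger than this estimate.

```python
# Rough weight-memory estimate from parameter count and bits per weight:
# FP16/BF16 ~2 bytes/parameter, 8-bit ~1 byte, 4-bit ~0.5 bytes.
# Weights only -- context (KV cache) and framework overhead are extra.

def estimate_weight_memory_gb(params_billion: float, bits_per_weight: int = 16) -> float:
    return params_billion * bits_per_weight / 8

for bits in (16, 8, 4):
    print(f"70B model @ {bits}-bit: ~{estimate_weight_memory_gb(70, bits):.0f} GB")
# 16-bit: ~140 GB, 8-bit: ~70 GB, 4-bit: ~35 GB
```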
Context Window Considerations
Context windows have grown dramatically:
- Llama 1: 2,048 tokens
- Llama 2: 4,096 tokens
- Llama 3: 8,192 tokens
- Llama 3.1: 128,000 tokens
Larger context windows require additional memory beyond model weights. Plan for approximately 15% additional memory on top of the model size to accommodate context.
Optimization techniques like Flash Attention and KV cache (context) quantization can significantly reduce this overhead.
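For a closer estimate than the flat 15% rule, the KV cache that holds the context can be sized directly from the model architecture: two tensors (keys and values) per layer, per KV head, per head dimension, per token. The architecture values in the sketch below are approximate Llama-3.1-8B-style assumptions (32 layers, 8 KV heads via grouped-query attention, head dimension 128); substitute the actual config for your model.

```python
# Back-of-the-envelope KV cache (context) memory estimate:
# 2 (keys + values) x layers x kv_heads x head_dim x tokens x bytes per element.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9

# Assumed Llama-3.1-8B-style architecture values:
print(kv_cache_gb(32, 8, 128, 8_192))     # ~1 GB at an 8K context
print(kv_cache_gb(32, 8, 128, 128_000))   # ~17 GB at the full 128K context
```

Halving bytes_per_elem models the effect of the 8-bit KV cache quantization mentioned above.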
LLM Leaderboards
Review leaderboards to compare model capabilities:
- LMSys Leaderboard: Includes proprietary models like ChatGPT
- Open LLM Leaderboard: Only open-source models
While no benchmark is perfect, they provide useful comparative insights.
Final Thoughts
Running LLMs on local hardware has never been more accessible. From choosing the right hardware platform to selecting appropriate models and software tools, the landscape offers options for various budgets and use cases.
Key takeaways:
- GPU inference offers the best performance but is limited by VRAM capacity
- Quantization enables running larger models on modest hardware
- Linux provides the most robust support, but Windows options are improving
- Context windows significantly impact memory requirements
- Choose models based on both capability and resource constraints
Ready to deploy LLMs on your infrastructure? Contact us to discuss your specific requirements and hardware recommendations.
This guide provides foundational knowledge for getting started with local LLMs. The AI/ML landscape evolves rapidly, so always verify current best practices when implementing solutions.