Open-Source LLMs You Can Run Locally (2025 Guide)

Large Language Models are no longer cloud-only.
Today, several open or open-weight models can be run entirely on your own machine, giving you privacy, control, and predictable costs.

This guide explains which models matter, what hardware you need, and which ones are worth running locally in 2025.


Why Run an LLM Locally?

Running locally means:

  • 🔐 Your data never leaves your machine
  • 💸 No API costs
  • ⚡ Low-latency responses
  • 🛠️ Full control over tuning, prompts, and tooling

The trade-off is hardware requirements — mainly RAM and GPU memory.


How to Read the Tables (Important)

Memory numbers assume INT4 / Q4 quantization, which is the standard for local inference.

| Term | Meaning |
|------|---------|
| RAM  | System memory (CPU inference) |
| VRAM | GPU memory (much faster) |
| Q4   | 4-bit quantization (best balance) |
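A useful rule of thumb: at Q4, each parameter takes about half a byte, plus runtime overhead for the KV cache and activations. The sketch below encodes that rule; the 20% overhead figure is a rough assumption, not a benchmark, so treat the output as a ballpark, not a guarantee.

```python
def q4_memory_gb(params_billion: float, overhead: float = 0.20) -> float:
    """Rough memory estimate for running a model at Q4 (4-bit) quantization.

    4 bits = 0.5 bytes per parameter; `overhead` (an assumed 20%) covers
    the KV cache, activations, and runtime buffers.
    """
    weight_gb = params_billion * 0.5  # 0.5 bytes/param => 0.5 GB per billion params
    return weight_gb * (1 + overhead)

for name, size_b in [("LLaMA 3.1 8B", 8), ("DeepSeek-R1 32B", 32), ("LLaMA 3.1 70B", 70)]:
    print(f"{name}: ~{q4_memory_gb(size_b):.0f} GB")
```

The estimates land close to the lower end of the ranges in the tables below; real usage grows with context length, since the KV cache scales with it.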

The Most Popular Open Models (At a Glance)




Meta LLaMA – The Default Choice

Meta’s LLaMA models are the reference standard for local LLMs: strong reasoning, huge ecosystem, excellent tooling.

LLaMA 3.1 Memory Requirements

| Model | Parameters | RAM (Q4) | VRAM (Q4) | Typical Use |
|-------|-----------|----------|-----------|-------------|
| LLaMA 3.1 8B | 8B | 6–8 GB | ~6 GB | Best all-round local model |
| LLaMA 3.1 70B | 70B | 40–48 GB | ~40 GB | Research, agents, long reasoning |
| LLaMA 3.1 405B | 405B | — | — | Datacenter-only |

Verdict:
👉 If you run only one local model, make it LLaMA 3.1 8B.


Mistral & Mixtral – Smarter per Token

Mistral models are known for efficiency and speed.
Mixtral uses a Mixture-of-Experts (MoE) architecture: each token is routed through only a few of the model's expert sub-networks, so the compute per token is far below what the total parameter count suggests.
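To make the MoE idea concrete, here is a toy routing sketch (not Mixtral's actual implementation): a router scores all experts, and only the top-k get nonzero gate weights, so only those experts run for that token. Mixtral 8×7B uses k=2 of 8 experts.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k=2):
    """Toy top-k MoE routing: keep the k highest-scoring experts
    and renormalize their gate weights; the rest are skipped entirely."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return {i: probs[i] / kept for i in top}

# 8 experts, but only 2 receive any work for this token:
gates = route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3])
print(gates)
```

This is why Mixtral 8×7B can rival much larger dense models per token of compute, yet all experts' weights must still fit in memory, which is why its RAM footprint is far above a 13B dense model's.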


Mistral / Mixtral Memory Table

| Model | Parameters | RAM (Q4) | VRAM (Q4) | Notes |
|-------|-----------|----------|-----------|-------|
| Mistral 7B | 7B | ~5 GB | ~5 GB | Fast, efficient |
| Mixtral 8x7B | ~47B total (~13B active) | 30–36 GB | 24–32 GB | Excellent quality |
| Mixtral 8x22B | ~141B total (~39B active) | 80–100 GB | 60–80 GB | Serious workstation |

Verdict:
👉 Mixtral 8×7B is one of the best “feels like GPT-4” local models.


DeepSeek – Reasoning Monsters

DeepSeek models punch far above their size, especially for logic, analysis, and math.


DeepSeek Memory Requirements

| Model | Parameters | RAM (Q4) | VRAM (Q4) | Best For |
|-------|-----------|----------|-----------|----------|
| DeepSeek-R1 7B | 7B | ~6 GB | ~6 GB | Lightweight reasoning |
| DeepSeek-R1 32B | 32B | 20–24 GB | ~20 GB | Deep analysis |
| DeepSeek-V2 | 236B | — | — | Not practical locally |

Verdict:
👉 Best reasoning per GB of RAM available today.


Coding-Focused Open Models

If you want code completion, refactoring, or architecture reasoning, use a model trained specifically for code.

Coding Models (Local)

| Model | Parameters | RAM (Q4) | VRAM (Q4) | Strength |
|-------|-----------|----------|-----------|----------|
| Code LLaMA 13B | 13B | ~10 GB | ~10 GB | Solid baseline |
| Codestral 22B | 22B | 14–18 GB | ~16 GB | Excellent reasoning |
| Qwen2.5-Coder 32B | 32B | 20–24 GB | ~20 GB | Top-tier open coder |

Small Models for Laptops & Miniservers

Not everyone has a monster GPU — these models run comfortably on a 32 GB laptop, even CPU-only.

| Model | Parameters | RAM (Q4) | Why Use It |
|-------|-----------|----------|------------|
| Phi-3 Mini | 3.8B | ~3 GB | Shockingly capable |
| Gemma 2 9B | 9B | 7–8 GB | Good quality per byte |

Recommended Hardware Setups

| Setup | What You Can Run |
|-------|------------------|
| Laptop (32 GB RAM) | 7B–13B models |
| RTX 3090 / 4090 (24 GB VRAM) | 7B–32B, Mixtral 8×7B |
| Dual-GPU workstation | 70B-class models |
| CPU-only server (128 GB RAM) | 32B–70B (slower) |

Recommended Software Stack

  • Inference engines: llama.cpp, ExLlamaV2, vLLM
  • Model formats: GGUF (llama.cpp; CPU and GPU), GPTQ / AWQ (GPU)
  • UIs: Ollama, LM Studio, text-generation-webui
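Getting started with this stack can be as short as one command. The snippet below is a sketch: the Ollama model tag and the GGUF filename are examples that were valid at the time of writing and may have changed, so check the current model library before copying.

```shell
# Easiest path: Ollama downloads a quantized build and serves it locally.
ollama run llama3.1:8b

# Or drive llama.cpp directly with a GGUF file you downloaded yourself
# (filename is an example; -n caps the number of generated tokens):
./llama-cli -m llama-3.1-8b-instruct.Q4_K_M.gguf \
    -p "Explain mixture-of-experts in one paragraph." -n 256
```

Ollama is the gentlest on-ramp; llama.cpp gives you finer control over quantization level, context size, and GPU offload.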

Final Recommendations

| Use Case | Best Model |
|----------|-----------|
| One local model | LLaMA 3.1 8B |
| Best reasoning | DeepSeek-R1 32B |
| Coding | Qwen2.5-Coder 32B |
| Laptop | Phi-3 Mini |
| “Feels like GPT-4” | Mixtral 8×7B |
