Running LLMs Locally on Your Machine
Most of us interact with Large Language Models (LLMs) every day through cloud-based inference APIs such as ChatGPT, Claude, Gemini, GitHub Copilot, or Amazon Q. These services make it effortless to use powerful models without worrying about hardware, setup, or optimization.
However, there are times when running an LLM locally on your computer makes more sense. Let’s explore when and how to do that.
Why Run LLMs Locally?
There are two main reasons you might consider running an LLM on your own system instead of through the cloud.
1. Privacy and Data Sensitivity
If your task involves confidential information, such as proprietary code, internal data, or customer records, it may not be safe (or allowed) to send it over a network to an external service.
Running a model locally keeps all processing on your own device, so no data leaves your machine.
2. Cost Efficiency
Cloud-based models charge based on usage, and every token counts. For simple or repetitive tasks, those costs can add up quickly.
Local models are free to run after installation, making them ideal for lightweight coding tasks, experimentation, or offline use.
When to Use Local vs Cloud Models
Here’s a simple way to decide:
- Use Cloud Models (like GPT-5, Claude 3.5 Sonnet, or Gemini-Pro) for:
  - Complex reasoning and creative generation
  - Research or analytical work
  - Tasks that demand high-quality, nuanced outputs
- Use Local Models (like Llama 3, Mistral, Phi-3, etc.) for:
  - Coding assistance
  - Drafting small text snippets
  - Quick experiments and prototypes
  - Offline or private workloads
Running LLMs with Ollama
One of the easiest ways to run open-source LLMs locally is Ollama. It provides a simple command-line tool, a local server, and an official Python client for running and managing models.
Step 1: Download and Install
Download Ollama for your operating system from
👉 https://ollama.com/download
Then install it following the instructions for your platform.
Step 2: Verify Installation
Open your terminal and run:
ollama
You’ll see a list of available commands, such as:
Usage:
ollama [flags]
ollama [command]
Available Commands:
serve Start ollama
create Create a model
show Show information for a model
run Run a model
stop Stop a running model
pull Pull a model from a registry
push Push a model to a registry
signin Sign in to ollama.com
signout Sign out from ollama.com
list List models
ps List running models
cp Copy a model
rm Remove a model
Step 3: Explore Available Models
Ollama maintains a model library at ollama.com/library, where you can find models like:
- llama3.2
- mistral
- phi3
- gemma
- codellama
Step 4: Download a Model
For example, to download Llama 3.2, run:
ollama pull llama3.2
Step 5: Run the Model
Start an interactive chat session with:
ollama run llama3.2
You’ll enter the Ollama REPL (Read-Eval-Print Loop) prompt:
>>>
Now you can chat directly with the model.
Step 6: System Requirements
As a rough rule of thumb, FP16 weights take about 2 bytes per parameter and FP32 about 4 bytes, so you’ll need:
- about 2 GB of VRAM for a 1B-parameter model in FP16
- about 4 GB of VRAM for the same model in FP32
Pick your model based on your system’s GPU or CPU capacity.
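To see where those numbers come from, here is a tiny back-of-the-envelope helper. It is purely illustrative: the function name and the quantization figures are my own assumptions, and real usage also needs room for activations, the KV cache, and runtime overhead.

# Rough weight-memory estimate per precision. Illustrative only.
BYTES_PER_PARAM = {'fp32': 4, 'fp16': 2, 'int8': 1, 'q4': 0.5}

def estimate_weight_memory_gb(params_in_billions, precision='fp16'):
    """Approximate gigabytes needed just to hold the weights."""
    return params_in_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

for precision in ('fp32', 'fp16', 'q4'):
    print(f"1B params @ {precision}: ~{estimate_weight_memory_gb(1, precision):.1f} GB")

Many models in the Ollama library ship in 4-bit quantized form by default, which shrinks this footprint considerably and is why even 7B–8B models can fit on consumer hardware.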
Using Ollama with Python
You can also run Ollama models programmatically from Python.
mkdir running-llm-locally
cd running-llm-locally
python3 -m venv .venv
source .venv/bin/activate
pip install ollama
Then create a simple script:
import ollama

model = 'llama3.2'

# Ask the locally running model a single question.
response = ollama.chat(model=model, messages=[{
    'role': 'user',
    'content': 'How to run LLM locally?'
}])

# Print just the reply text rather than the whole response object.
print(response['message']['content'])

That’s it. You now have a local LLM responding to prompts directly from Python.
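For longer answers, the Ollama Python client can also stream the reply as it is generated instead of returning it all at once. A minimal sketch (the prompt is just an example):

import ollama

# stream=True yields the reply incrementally, chunk by chunk.
stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'How to run LLM locally?'}],
    stream=True,
)

for chunk in stream:
    # Each chunk contains the next piece of the assistant's message.
    print(chunk['message']['content'], end='', flush=True)
print()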
Running LLMs with Hugging Face Transformers
Another powerful way to run local models is using the Hugging Face Transformers library.
Hugging Face hosts thousands of open-source models and datasets for NLP, vision, and multimodal tasks.
Step 1: Install Dependencies
mkdir running-llm-transformers
cd running-llm-transformers
python3 -m venv .venv
source .venv/bin/activate
pip install transformers torch
Step 2: Load and Run a Model
Here’s a minimal example using the Phi-3 Mini Instruct model from Microsoft:
from transformers import pipeline
model_path = 'microsoft/Phi-3-mini-4k-instruct'
generative_pipe = pipeline('text-generation', model=model_path, tokenizer=model_path)
prompt = "How to run LLM locally?"
response = generative_pipe(prompt, max_new_tokens=100)
print(response[0]['generated_text'])
This code downloads the model automatically (the first run might take a few minutes) and runs it locally for text generation.
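If memory is tight, you can ask Transformers to load the weights in half precision and let it place them on the best available device. This is a sketch under the assumption that you have also installed the accelerate package, which device_map='auto' requires:

import torch
from transformers import pipeline

model_path = 'microsoft/Phi-3-mini-4k-instruct'

# Load the weights in FP16 and let Accelerate pick GPU/CPU placement.
# Requires `pip install accelerate`; on a CPU-only machine you may prefer
# to drop torch_dtype and keep the default precision.
generative_pipe = pipeline(
    'text-generation',
    model=model_path,
    torch_dtype=torch.float16,
    device_map='auto',
)

response = generative_pipe('How to run LLM locally?', max_new_tokens=100)
print(response[0]['generated_text'])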
Final Thoughts
Running LLMs locally gives you freedom, privacy, and control.
For heavy workloads, cloud-based models like GPT-5 or Claude 3.5 are unmatched.
But for everyday coding, lightweight writing, or offline experimentation, local models via Ollama or Hugging Face Transformers are practical, efficient, and surprisingly capable.
It’s easier than ever to get started. Just install, pull a model, and start chatting with your own AI right on your machine.