Running LLMs Locally on Your Machine

Most of us interact with Large Language Models (LLMs) every day through cloud-based inference APIs such as ChatGPT, Claude, Gemini, GitHub Copilot, or Amazon Q. These services make it effortless to use powerful models without worrying about hardware, setup, or optimization.

However, there are times when running an LLM locally on your computer makes more sense. Let’s explore when and how to do that.


Why Run LLMs Locally?

There are two main reasons you might consider running an LLM on your own system instead of through the cloud.

1. Privacy and Data Sensitivity

If your task involves confidential information, such as proprietary code, internal data, or customer records, it may not be safe (or allowed) to send it over a network to an external service.
Running a model locally keeps all processing on your own device, so no data leaves your machine.

2. Cost Efficiency

Cloud-based models charge based on usage, and every token counts. For simple or repetitive tasks, those costs can add up quickly.
Local models cost nothing to run beyond your own hardware once installed, making them ideal for lightweight coding tasks, experimentation, or offline use.


When to Use Local vs Cloud Models

Here’s a simple way to decide:

  • Use Cloud Models (like GPT-5, Claude 3.5 Sonnet, or Gemini Pro) for:
    • Complex reasoning and creative generation
    • Research or analytical work
    • Tasks that demand high-quality, nuanced outputs
  • Use Local Models (like Llama 3, Mistral, Phi-3, etc.) for:
    • Coding assistance
    • Drafting small text snippets
    • Quick experiments and prototypes
    • Offline or private workloads

Running LLMs with Ollama

One of the easiest ways to run open-source LLMs locally is with Ollama, which provides a simple command-line tool and a Python API for running and managing models.

Step 1: Download and Install

Download Ollama for your operating system from
👉 https://ollama.com/download

Then install it following the instructions for your platform.
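
On Linux, the download page also provides a one-line install script:

curl -fsSL https://ollama.com/install.sh | sh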

Step 2: Verify Installation

Open your terminal and run:

ollama

You’ll see a list of available commands:

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  signin      Sign in to ollama.com
  signout     Sign out from ollama.com
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model

Step 3: Explore Available Models

Ollama maintains a model library at ollama.com/library, where you can find models like:

  • llama3.2
  • mistral
  • phi3
  • gemma
  • codellama

Step 4: Download a Model

For example, to download Llama 3.2, run:

ollama pull llama3.2
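
Once the download finishes, you can confirm the model is available with the list command shown earlier:

ollama list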

Step 5: Run the Model

Start an interactive chat session with:

ollama run llama3.2

You’ll enter the Ollama REPL (Read-Eval-Print Loop) prompt:

>>>

Now you can chat directly with the model.
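
For example, type a prompt at the >>> marker and the reply streams back; /bye ends the session:

>>> Explain what a large language model is in one sentence.
(the model’s answer appears here)
>>> /bye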

Step 6: System Requirements

Roughly, you’ll need:

  • 2 GB of VRAM for a 1B-parameter model in FP16 (2 bytes per parameter)
  • 4 GB of VRAM for the same model in FP32 (4 bytes per parameter)

Quantized variants, which many models in the Ollama library use by default, need considerably less memory. Pick your model based on your system’s GPU or CPU capacity.
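
A quick back-of-the-envelope sketch of where these numbers come from (the helper below is purely illustrative; real usage is somewhat higher because of activations and the KV cache):

def estimate_model_memory_gb(params_in_billions, bytes_per_param):
    # Weights only: one billion parameters at N bytes each is roughly N GB.
    return params_in_billions * bytes_per_param

print(estimate_model_memory_gb(1, 2))    # 1B model in FP16 -> ~2 GB
print(estimate_model_memory_gb(1, 4))    # 1B model in FP32 -> ~4 GB
print(estimate_model_memory_gb(8, 0.5))  # 8B model, 4-bit quantized -> ~4 GB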


Using Ollama with Python

You can also run Ollama models programmatically from Python. Set up a project and install the ollama Python package:

mkdir running-llm-locally
cd running-llm-locally
python3 -m venv .venv
source .venv/bin/activate
pip install ollama

Then create a simple script (the Ollama app or ollama serve must be running in the background):

import ollama

model = 'llama3.2'

# Send a single chat message to the locally running model.
response = ollama.chat(model=model, messages=[{
    'role': 'user',
    'content': 'How to run LLM locally?'
}])

# The reply text is in the message content of the response.
print(response['message']['content'])

That’s it. You now have a local LLM responding to prompts directly from Python.
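
The library can also stream the reply as it is generated, which is nicer for longer answers. A minimal sketch, assuming the same llama3.2 model is already pulled:

import ollama

# stream=True returns an iterator of partial responses instead of one final object.
stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'How to run LLM locally?'}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries the next piece of the assistant's message.
    print(chunk['message']['content'], end='', flush=True)
print()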


Running LLMs with Hugging Face Transformers

Another powerful way to run local models is using the Hugging Face Transformers library.
Hugging Face hosts hundreds of thousands of open-source models and datasets for NLP, vision, and multimodal tasks.

Step 1: Install Dependencies

mkdir running-llm-transformers
cd running-llm-transformers
python3 -m venv .venv
source .venv/bin/activate
pip install transformers torch

Step 2: Load and Run a Model

Here’s a minimal example using the Phi-3 Mini Instruct model from Microsoft:

from transformers import pipeline

model_path = 'microsoft/Phi-3-mini-4k-instruct'

# Build a text-generation pipeline; the model and tokenizer are downloaded
# from the Hugging Face Hub on first use and cached locally.
generative_pipe = pipeline('text-generation', model=model_path, tokenizer=model_path)

prompt = "How to run LLM locally?"
response = generative_pipe(prompt, max_new_tokens=100)

# The output includes the prompt followed by the generated continuation.
print(response[0]['generated_text'])

This code downloads the model automatically (the first run might take a few minutes) and runs it locally for text generation.
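
If you have a supported GPU, you can usually halve memory use by loading the weights in FP16 and letting Transformers place the model automatically. A sketch of that variant (it assumes the accelerate package is also installed, via pip install accelerate):

import torch
from transformers import pipeline

model_path = 'microsoft/Phi-3-mini-4k-instruct'

# float16 halves memory versus FP32; device_map='auto' (which needs accelerate)
# places the model on the available GPU and falls back to CPU otherwise.
generative_pipe = pipeline(
    'text-generation',
    model=model_path,
    tokenizer=model_path,
    torch_dtype=torch.float16,
    device_map='auto',
)

response = generative_pipe("How to run LLM locally?", max_new_tokens=100)
print(response[0]['generated_text'])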


Final Thoughts

Running LLMs locally gives you freedom, privacy, and control.
For heavy workloads, cloud-based models like GPT-5 or Claude 3.5 are unmatched.
But for everyday coding, lightweight writing, or offline experimentation, local models via Ollama or Hugging Face Transformers are practical, efficient, and surprisingly capable.

It’s easier than ever to get started. Just install, pull a model, and start chatting with your own AI right on your machine.