Getting Started with vLLM: Scalable LLM Inference Made Easy
vLLM is an open-source inference and serving engine for Large Language Models (LLMs).
If you’ve ever tried running models using Hugging Face Transformers or llama.cpp, you know that serving multiple concurrent requests efficiently is challenging - especially with limited GPU/CPU resources.
That’s where vLLM shines. It provides optimized techniques for faster and more efficient inference, including:
- Key-Value (KV) caching to avoid recomputing attention for tokens that have already been processed
- Quantization (INT8, FP8) and reduced-precision (FP16) weights to shrink the memory footprint
- PagedAttention for smarter memory management
- Continuous batching for better request throughput
- An OpenAI-compatible API server for easy integration
In this post, I’ll walk you through installing and running vLLM locally on macOS Sonoma (M3 Apple Silicon).
For other platforms, please refer to the official vLLM installation guide.
Installation
We’ll use uv, a modern and fast Python package manager.
Install it using:
curl -LsSf https://astral.sh/uv/install.sh | sh
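Once the installer finishes, open a new shell if needed so your PATH picks it up, and confirm uv is available:
# should print something like "uv 0.x.y"
uv --version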
Step 1: Create a working directory
mkdir vllm-demo && cd vllm-demo
Step 2: Create a virtual environment
vLLM requires Python 3.10+. I’m using Python 3.12:
uv venv --python 3.12
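Then activate the environment (uv creates it in .venv by default) so the python3.12 commands later in this post resolve to this environment:
# activate the virtual environment created by uv
source .venv/bin/activate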
Step 3: Clone the vLLM source
Currently, there are no prebuilt vLLM wheels for macOS (Apple Silicon), so we’ll build it from source:
git clone https://github.com/vllm-project/vllm.git
cd vllm
Step 4: Install dependencies
Try installing dependencies using:
uv pip install -r requirements/cpu.txt
If you encounter version conflicts, this command works better:
uv pip install --index-strategy unsafe-best-match -r requirements/cpu.txt
Step 5: Build and install vLLM
uv pip install -e .
That’s it - vLLM is now installed locally.
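As a quick sanity check, confirm the package imports and reports a version (with the virtual environment still active):
# prints the installed vLLM version
python3.12 -c "import vllm; print(vllm.__version__)"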
Running Your First Inference API
Let’s start a local API server with a lightweight model that fits comfortably in system memory.
We’ll use microsoft/Phi-3-mini-4k-instruct:
python3.12 -m vllm.entrypoints.api_server \
--model microsoft/Phi-3-mini-4k-instruct \
--disable-sliding-window
Since we’re running on CPU, we disable the sliding window feature (it’s GPU-optimized).
If everything goes well, you’ll see output like:
INFO: Started server process [28887]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Your inference server is now live at http://localhost:8000.
Making Requests to the Inference API
You can use the simple client script, api_client.py, from the examples directory of the vLLM repository you cloned earlier.
Try it out:
python3.12 api_client.py --prompt "Who is the director of Batman?"
Output example:
Prompt: 'Who is the director of Batman?'
Beam candidate 0:
"The director of the 1989 film 'Batman' is Tim Burton."You can increase --max-tokens from the default 16 to 100 or more for longer outputs.
Why Use vLLM?
| Feature | What It Does |
|---|---|
| PagedAttention | Efficient memory management by splitting the attention Key/Value cache into “pages” - reducing fragmentation and memory waste. (Red Hat, arXiv Paper) |
| Continuous Batching | Dynamically merges incoming requests into active batches to keep hardware fully utilized - no more waiting for a full batch before inference. (Designveloper) |
| Optimized Kernels & Quantization | Support for quantized (INT8, FP8) and reduced-precision (FP16) models, plus optimized GPU/CPU kernels that integrate with high-performance libraries like FlashAttention. (Docs) |
| Streaming + OpenAI-Compatible API | Supports real-time streaming token generation and an OpenAI-compatible API for seamless integration with existing clients. |
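To make the last row concrete: if you launch the OpenAI-compatible server instead of the demo one (python3.12 -m vllm.entrypoints.openai.api_server with the same model flags as above), existing OpenAI-style clients can talk to it unchanged. A minimal sketch against the standard /v1/completions route:
# query the OpenAI-compatible endpoint exposed by vLLM
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "microsoft/Phi-3-mini-4k-instruct", "prompt": "Who is the director of Batman?", "max_tokens": 50}'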
Conclusion
vLLM is an excellent choice if you want high-performance, scalable, and OpenAI-compatible LLM inference - without relying on expensive cloud endpoints.
It brings production-grade features like continuous batching, paged attention, and streaming responses right to your local setup.
Whether you’re experimenting with small models on a MacBook or deploying large models on GPUs in production, vLLM offers a fast and flexible foundation for serving LLMs efficiently.