Local LLM Setup Guide
Step-by-step guide to running local language models with Ollama and LM Studio — installation, model selection, VS Code integration, Python usage, and when to use local vs. cloud.
Why Run Models Locally?
Local models keep your data on your machine. No API calls, no usage logs, no third-party access. This matters when you’re working with student data, unpublished research, or sensitive institutional information.
For the full argument on why local matters — including privacy, institutional compliance, and defence use cases — read the companion blog post: Why I Run Language Models on My Own Machine.
This guide covers the practical how-to: installation, model selection, and connecting local models to your development tools.
Option 1: Ollama
Ollama is the fastest way to get a local model running from the command line. It handles model downloads, quantisation, and serving with a single command.
Install Ollama
- Download from ollama.com or install via Homebrew
- Verify the installation works
- Pull your first model
# macOS (Homebrew)
brew install ollama
# Verify installation
ollama --version
# Pull a model
ollama pull llama3.2
Run a Model
# Start a chat session
ollama run llama3.2
# Or use the API
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.2", "prompt": "Explain p-values in plain English"}'
Useful Ollama Commands
# List downloaded models
ollama list
# Pull a specific model
ollama pull mistral
# Remove a model to free disk space
ollama rm llama3.2
# Show model details (size, parameters, quantisation)
ollama show llama3.2
# Set a system prompt inside an interactive session
ollama run llama3.2
>>> /set system "You are a helpful research assistant specialising in psychology."
Option 2: LM Studio
LM Studio provides a graphical interface for downloading and chatting with models. It’s a good choice if you prefer a GUI over the terminal.
- Download from lmstudio.ai
- Search for a model in the built-in model browser
- Download and load the model
- Start chatting in the built-in interface
LM Studio also provides a local API server that’s compatible with the OpenAI API format — useful for connecting to other tools and scripts.
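As a rough sketch of what that looks like, the OpenAI Python client can be pointed at LM Studio's local server. This assumes the server is enabled and running on its default port (1234); the model name below is a placeholder for whichever model you have loaded:
from openai import OpenAI

# Point the OpenAI client at LM Studio's local server (default port 1234)
client = OpenAI(
    base_url='http://localhost:1234/v1',
    api_key='lm-studio'  # required by the client but not checked locally
)

response = client.chat.completions.create(
    model='local-model',  # placeholder: use the identifier shown in LM Studio's model list
    messages=[{'role': 'user', 'content': 'Summarise the trade-offs of running LLMs locally.'}]
)
print(response.choices[0].message.content)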
Choosing a Model
The right model depends on your hardware and what you need it for. Here’s a practical guide:
General-Purpose Models
| Model | Size | RAM Needed | Good For |
|---|---|---|---|
| Llama 3.2 (3B) | ~2 GB | 8 GB | Quick tasks, summarisation, simple Q&A |
| Gemma 3 (4B) | ~2.5 GB | 8 GB | Compact but strong, great quality for size |
| Phi-4 Mini (3.8B) | ~2.3 GB | 8 GB | Compact, fast inference, good reasoning |
| Mistral (7B) | ~4 GB | 16 GB | Solid all-rounder, structured output |
| Qwen 3 (8B) | ~5 GB | 16 GB | Strong multilingual, good at reasoning |
| Gemma 3 (12B) | ~7.5 GB | 16 GB | Excellent quality, multimodal (vision) |
| GPT-OSS (20B) | ~12 GB | 24 GB | OpenAI’s open-weight model, general-purpose |
Coding-Focused Models
| Model | Size | RAM Needed | Good For |
|---|---|---|---|
| Qwen 2.5 Coder (7B) | ~4.5 GB | 16 GB | Code generation, refactoring |
| Qwen 3 Coder (30B) | ~18 GB | 48 GB | State-of-the-art open-source coding |
Larger Models (If You Have the RAM)
| Model | Size | RAM Needed | Good For |
|---|---|---|---|
| Gemma 3 (27B) | ~16 GB | 32 GB | Beats much larger models on benchmarks |
| QwQ (32B) | ~18 GB | 48 GB | Strong reasoning, long context |
| Llama 3.3 (70B) | ~40 GB | 64 GB | Near-cloud quality, versatile |
| GPT-OSS (120B) | ~70 GB | 128 GB | OpenAI’s large open-weight model, frontier-level |
Practical advice: Start with Gemma 3 (4B) or Qwen 3 (8B). They handle most everyday tasks well and run on any modern laptop. Move to a larger model only if you find the output quality insufficient for your specific use case.
To pull any of these in Ollama:
ollama pull gemma3
ollama pull qwen3
ollama pull phi4-mini
ollama pull gpt-oss
ollama pull mistral
Hardware Requirements
- Minimum: 8 GB RAM for small models (3B parameters)
- Recommended: 16 GB RAM for 7–8B models
- Ideal: Apple Silicon Mac with 32 GB+ unified memory
- Power user: 64 GB+ for 70B models or running multiple models
Apple Silicon Notes
Apple Silicon Macs (M1/M2/M3/M4) are exceptionally well-suited for local LLMs because of unified memory — the CPU and GPU share the same memory pool, so the full RAM is available for model inference. A 32 GB M-series Mac handles 8B models with ease and can run larger quantised models too.
ARM-Based Mini PCs
The new generation of ARM-based mini PCs (such as systems built on Ampere or Qualcomm chips) with memory shared between CPU and GPU offers similar benefits to Apple Silicon at different price points. These are worth considering if you need a dedicated local inference machine.
Model Formats: GGUF vs. MLX
When you download a model, it comes in a specific format. The two you’ll encounter most for local use are GGUF and MLX.
GGUF (llama.cpp)
GGUF is the universal format for local LLMs. It runs on everything — Mac, Windows, Linux, CPU, GPU. Ollama uses GGUF under the hood.
- Works on: Any hardware (CPU or GPU)
- Best for: Most users, cross-platform compatibility, Ollama
- Trade-off: Good performance everywhere, but not optimised for any one chip
MLX (Apple Silicon only)
MLX is Apple’s machine learning framework, optimised specifically for M-series chips. MLX models squeeze more speed out of Apple Silicon’s unified memory architecture.
- Works on: Apple Silicon Macs only (M1/M2/M3/M4)
- Best for: Mac users who want maximum inference speed
- Trade-off: Faster on Apple Silicon, but Mac-only — not portable
If you’re on a Mac: Try MLX models via LM Studio (which supports both formats) for the best speed. Ollama sticks with GGUF, which still performs well on Apple Silicon.
If you’re on anything else (or want simplicity): Stick with GGUF via Ollama. It just works.
Quantisation Levels
Models are compressed (“quantised”) to fit in less RAM. Common quantisation levels you’ll see:
| Quantisation | Size vs. Full Precision | Quality | When to Use |
|---|---|---|---|
| Q4_K_M | ~25–30% | Good for most tasks | Default choice — best balance of size and quality |
| Q5_K_M | ~35% | Slightly better | When you have spare RAM and want a bit more quality |
| Q6_K | ~45% | Near-original | When quality matters most and RAM isn’t tight |
| Q8_0 | ~50% | Excellent | Maximum quality quantised model |
| Q3_K_S | ~20% | Noticeable degradation | Only when RAM is very limited |
Practical advice: Ollama picks a sensible default (usually Q4_K_M) when you pull a model. Unless you have a specific reason to change it, the default is fine.
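If you want a rough sense of how large a quantised model will be before downloading it, the arithmetic is simple: multiply the parameter count (in billions) by the bits per weight and divide by 8 to get an approximate file size in gigabytes, then allow a few gigabytes of headroom for the context window and the rest of the system. A back-of-the-envelope sketch, with approximate bit widths:
def approx_model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough quantised model size: parameters (billions) x bits per weight / 8."""
    return params_billions * bits_per_weight / 8

# An 8B model at Q4_K_M (~4.8 bits per weight) is roughly 5 GB,
# which lines up with the Qwen 3 (8B) row in the table above.
print(round(approx_model_size_gb(8, 4.8), 1))  # ~4.8
print(round(approx_model_size_gb(8, 8.5), 1))  # Q8_0 (~8.5 bits) -> ~8.5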
Connecting Local Models to VS Code
Running a local model in the terminal is useful, but connecting it to your code editor makes it part of your daily workflow.
Continue Extension
Continue is an open-source AI coding assistant for VS Code that works with local models.
- Install the Continue extension from the VS Code marketplace
- Open Continue settings (click the gear icon in the Continue panel)
- Add Ollama as a provider:
{
  "models": [
    {
      "title": "Ollama - Llama 3.1",
      "provider": "ollama",
      "model": "llama3.1"
    }
  ]
}
- Make sure Ollama is running (ollama serve in a terminal)
- Start chatting or using inline completions
Continue gives you chat, inline editing, and code explanations — all powered by your local model, with no data leaving your machine.
GitHub Copilot with Local Models
GitHub Copilot doesn’t natively support local models, but you can use Ollama’s OpenAI-compatible API with tools that accept custom API endpoints. If you’re already using Copilot for cloud-based assistance, Continue is the best complement for local model access.
Other Options
- Cody (Sourcegraph) — supports Ollama as a backend
- Cline — VS Code extension that works with local Ollama models
- Aider — terminal-based coding assistant with Ollama support
Using Local Models from Python
If you’re building tools, research pipelines, or just want to script your interactions, connecting to local models from Python is straightforward.
Using the Ollama Python Library
pip install ollama
import ollama
# Simple generation
response = ollama.generate(
    model='llama3.1',
    prompt='Summarise the key assumptions of linear regression in 3 bullet points.'
)
print(response['response'])

# Chat with message history
response = ollama.chat(
    model='llama3.1',
    messages=[
        {'role': 'system', 'content': 'You are a research methods expert.'},
        {'role': 'user', 'content': 'When should I use a mixed-effects model instead of a repeated-measures ANOVA?'}
    ]
)
print(response['message']['content'])
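If you want the output to appear as it is generated rather than all at once, the library also supports streaming. A small sketch, assuming the same model is available locally:
# Stream the reply chunk by chunk instead of waiting for the full response
stream = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Explain effect sizes in two sentences.'}],
    stream=True
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()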
Using the OpenAI-Compatible API
Ollama serves an OpenAI-compatible API on localhost:11434. This means you can use the OpenAI Python library with local models — useful if you have existing code that uses the OpenAI API and want to switch to local.
pip install openai
from openai import OpenAI
# Point the client at Ollama's local server
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required but not used
)

response = client.chat.completions.create(
    model='llama3.1',
    messages=[
        {'role': 'user', 'content': 'Write a Python function to compute Cohen\'s d.'}
    ]
)
print(response.choices[0].message.content)
This approach lets you develop locally and swap in a cloud model for production by changing the base_url and api_key — no other code changes needed.
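One way to make that swap painless is to read the endpoint, key, and model name from environment variables, with local defaults. A minimal sketch; the variable names (LLM_BASE_URL, LLM_API_KEY, LLM_MODEL) are illustrative rather than any standard:
import os
from openai import OpenAI

# Defaults target the local Ollama server; set the environment variables
# to point the same script at a cloud provider's OpenAI-compatible endpoint.
client = OpenAI(
    base_url=os.environ.get('LLM_BASE_URL', 'http://localhost:11434/v1'),
    api_key=os.environ.get('LLM_API_KEY', 'ollama'),
)
model = os.environ.get('LLM_MODEL', 'llama3.1')

response = client.chat.completions.create(
    model=model,
    messages=[{'role': 'user', 'content': 'Explain the difference between fixed and random effects.'}]
)
print(response.choices[0].message.content)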
Batch Processing Research Papers
Here’s a practical example — processing multiple paper abstracts:
import ollama
import json
abstracts = [
    "Abstract of paper 1...",
    "Abstract of paper 2...",
    "Abstract of paper 3...",
]

results = []

for i, abstract in enumerate(abstracts):
    response = ollama.generate(
        model='llama3.1',
        prompt=f"""Analyse this research abstract and extract:
- Research question
- Methodology (1 sentence)
- Key finding (1 sentence)
- Sample size

Abstract: {abstract}

Respond in JSON format."""
    )
    results.append({
        'paper': i + 1,
        'analysis': response['response']
    })

# Save results
with open('paper_analysis.json', 'w') as f:
    json.dump(results, f, indent=2)
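Smaller local models don't always return strictly valid JSON, so if you plan to process the results further it is worth parsing defensively. A small sketch of one way to do that, keeping the raw text when parsing fails:
# Try to parse each analysis as JSON, keeping the raw text as a fallback
parsed = []
for item in results:
    try:
        parsed.append({'paper': item['paper'], 'analysis': json.loads(item['analysis'])})
    except json.JSONDecodeError:
        parsed.append({'paper': item['paper'], 'analysis_raw': item['analysis']})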
When to Use Local vs. Cloud
Not every task needs a local model, and not every task needs a cloud model. Here’s a practical decision framework:
Use Local When:
- Privacy matters — student data, unpublished manuscripts, ethics-restricted information, peer review content
- You’re iterating rapidly — testing prompts, debugging pipelines, running batch jobs where API costs add up
- You’re offline or on restricted networks — travel, secure environments, unreliable internet
- Cost is a concern — local models have zero marginal cost per request
- You’re teaching — students can experiment freely without usage limits or account requirements
Use Cloud When:
- You need maximum capability — complex reasoning, long-form writing, nuanced analysis that exceeds what 8B models can do
- The task requires a large context window — processing very long documents where local models run out of context
- Speed matters more than privacy — cloud models on dedicated hardware are typically faster than local inference
- You need specific features — web search, image generation, tool use, or other capabilities not available locally
The Hybrid Approach
The most practical setup is both:
- Ollama running locally for everyday tasks, privacy-sensitive work, and development
- A cloud model (ChatGPT, Claude, Gemini) for tasks that need frontier capability
Develop locally, deploy to cloud when needed. Use local models as your default and escalate to cloud when the task demands it. This gives you the best of both worlds: privacy and zero cost for most tasks, maximum capability when you need it.
Quick Start Checklist
- Install Ollama: brew install ollama (macOS) or download from ollama.com
- Pull a model: ollama pull llama3.1
- Test it: ollama run llama3.1
- Install Continue in VS Code for editor integration
- Install the Python library: pip install ollama
- Try the OpenAI-compatible API for existing scripts
You can have a working local LLM setup in under 10 minutes. Start with a chat session, see how it fits your workflow, and build from there.