Local LLM Setup Guide
Step-by-step guide to running local language models with Ollama and LM Studio — installation, model selection, VS Code integration, Python usage, and when to use local vs. cloud.
Why Run Models Locally?
Local models keep your data on your machine. No API calls, no usage logs, no third-party access. This matters when you’re working with student data, unpublished research, or sensitive institutional information.
For the full argument on why local matters — including privacy, institutional compliance, and defence use cases — read the companion blog post: Why I Run Language Models on My Own Machine.
This guide covers the practical how-to: installation, model selection, and connecting local models to your development tools.
Option 1: Ollama
Ollama is the fastest way to get a local model running from the command line. It handles model downloads, quantisation, and serving with a single command.
Install Ollama
- Download from ollama.com or install via Homebrew
- Verify the installation works
- Pull your first model
# macOS (Homebrew)
brew install ollama
# Verify installation
ollama --version
# Pull a model
ollama pull llama3.2
Run a Model
# Start a chat session
ollama run llama3.2
# Or use the API
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.2", "prompt": "Explain p-values in plain English"}'
Useful Ollama Commands
# List downloaded models
ollama list
# Pull a specific model
ollama pull mistral
# Remove a model to free disk space
ollama rm llama3.2
# Show model details (size, parameters, quantisation)
ollama show llama3.2
# Set a system prompt inside an interactive session
ollama run llama3.2
>>> /set system "You are a helpful research assistant specialising in psychology."
Option 2: LM Studio
LM Studio provides a graphical interface for downloading and chatting with models. It’s a good choice if you prefer a GUI over the terminal.
- Download from lmstudio.ai
- Search for a model in the built-in model browser
- Download and load the model
- Start chatting in the built-in interface
LM Studio also provides a local API server that’s compatible with the OpenAI API format — useful for connecting to other tools and scripts.
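As a rough sketch of what that looks like, the OpenAI Python client can be pointed at LM Studio's local server. This assumes the server is enabled and running on its default port (1234); the model name below is a placeholder for whichever model you have loaded:
from openai import OpenAI

# Point the OpenAI client at LM Studio's local server (default port 1234)
client = OpenAI(
    base_url='http://localhost:1234/v1',
    api_key='lm-studio'  # required by the client but not checked locally
)

response = client.chat.completions.create(
    model='local-model',  # placeholder: use the identifier shown in LM Studio's model list
    messages=[{'role': 'user', 'content': 'Summarise the trade-offs of running LLMs locally.'}]
)
print(response.choices[0].message.content)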
Choosing a Model
The right model depends on your hardware and what you need it for. Here’s a practical guide:
General-Purpose Models
| Model | Size | RAM Needed | Good For |
|---|---|---|---|
| Llama 3.2 (3B) | ~2 GB | 8 GB | Quick tasks, summarisation, simple Q&A |
| Gemma 3 (4B) | ~2.5 GB | 8 GB | Compact but strong, great quality for size |
| Phi-4 Mini (3.8B) | ~2.3 GB | 8 GB | Compact, fast inference, good reasoning |
| Mistral (7B) | ~4 GB | 16 GB | Solid all-rounder, structured output |
| Qwen 3 (8B) | ~5 GB | 16 GB | Strong multilingual, good at reasoning |
| Gemma 3 (12B) | ~7.5 GB | 16 GB | Excellent quality, multimodal (vision) |
| GPT-OSS (20B) | ~12 GB | 24 GB | OpenAI’s open-weight model, general-purpose |
Coding-Focused Models
| Model | Size | RAM Needed | Good For |
|---|---|---|---|
| Qwen 2.5 Coder (7B) | ~4.5 GB | 16 GB | Code generation, refactoring |
| Qwen 3 Coder (30B) | ~18 GB | 48 GB | State-of-the-art open-source coding |
Larger Models (If You Have the RAM)
| Model | Size | RAM Needed | Good For |
|---|---|---|---|
| Gemma 3 (27B) | ~16 GB | 32 GB | Beats much larger models on benchmarks |
| QwQ (32B) | ~18 GB | 48 GB | Strong reasoning, long context |
| Llama 3.3 (70B) | ~40 GB | 64 GB | Near-cloud quality, versatile |
| GPT-OSS (120B) | ~70 GB | 128 GB | OpenAI’s large open-weight model, frontier-level |
Practical advice: Start with Gemma 3 (4B) or Qwen 3 (8B). They handle most everyday tasks well and run on any modern laptop. Move to a larger model only if you find the output quality insufficient for your specific use case.
To pull any of these in Ollama:
ollama pull gemma3
ollama pull qwen3
ollama pull phi4-mini
ollama pull gpt-oss
ollama pull mistral
Hardware Requirements
- Minimum: 8 GB RAM for small models (3B parameters)
- Recommended: 16 GB RAM for 7–8B models
- Ideal: Apple Silicon Mac with 32 GB+ unified memory
- Power user: 64 GB+ for 70B models or running multiple models
Apple Silicon Notes
Apple Silicon Macs (M1/M2/M3/M4) are exceptionally well-suited for local LLMs because of unified memory — the CPU and GPU share the same memory pool, so the full RAM is available for model inference. A 32 GB M-series Mac handles 8B models with ease and can run larger quantised models too.
ARM-Based Mini PCs
The new generation of ARM-based mini PCs (such as systems built on Ampere or Qualcomm chips) with memory shared between CPU and GPU offers similar benefits to Apple Silicon at different price points. These are worth considering if you need a dedicated local inference machine.
Model Formats: GGUF vs. MLX
When you download a model, it comes in a specific format. The two you’ll encounter most for local use are GGUF and MLX.
GGUF (llama.cpp)
GGUF is the universal format for local LLMs. It runs on everything — Mac, Windows, Linux, CPU, GPU. Ollama uses GGUF under the hood.
- Works on: Any hardware (CPU or GPU)
- Best for: Most users, cross-platform compatibility, Ollama
- Trade-off: Good performance everywhere, but not optimised for any one chip
MLX (Apple Silicon only)
MLX is Apple’s machine learning framework, optimised specifically for M-series chips. MLX models squeeze more speed out of Apple Silicon’s unified memory architecture.
- Works on: Apple Silicon Macs only (M1/M2/M3/M4)
- Best for: Mac users who want maximum inference speed
- Trade-off: Faster on Apple Silicon, but Mac-only — not portable
If you’re on a Mac: Try MLX models via LM Studio (which supports both formats) for the best speed. Ollama sticks with GGUF, which still performs well on Apple Silicon.
If you’re on anything else (or want simplicity): Stick with GGUF via Ollama. It just works.
Quantisation Levels
Models are compressed (“quantised”) to fit in less RAM. Common quantisation levels you’ll see:
| Quantisation | Size vs. Full Precision | Quality | When to Use |
|---|---|---|---|
| Q4_K_M | ~25–30% | Good for most tasks | Default choice — best balance of size and quality |
| Q5_K_M | ~35% | Slightly better | When you have spare RAM and want a bit more quality |
| Q6_K | ~45% | Near-original | When quality matters most and RAM isn’t tight |
| Q8_0 | ~50% | Excellent | Maximum quality quantised model |
| Q3_K_S | ~20% | Noticeable degradation | Only when RAM is very limited |
Practical advice: Ollama picks a sensible default (usually Q4_K_M) when you pull a model. Unless you have a specific reason to change it, the default is fine.
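If you want a rough sense of how large a quantised model will be before downloading it, the arithmetic is simple: multiply the parameter count (in billions) by the bits per weight and divide by 8 to get an approximate file size in gigabytes, then allow a few gigabytes of headroom for the context window and the rest of the system. A back-of-the-envelope sketch, with approximate bit widths:
def approx_model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough quantised model size: parameters (billions) x bits per weight / 8."""
    return params_billions * bits_per_weight / 8

# An 8B model at Q4_K_M (~4.8 bits per weight) is roughly 5 GB,
# which lines up with the Qwen 3 (8B) row in the table above.
print(round(approx_model_size_gb(8, 4.8), 1))  # ~4.8
print(round(approx_model_size_gb(8, 8.5), 1))  # Q8_0 (~8.5 bits) -> ~8.5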
Connecting Local Models to VS Code
Running a local model in the terminal is useful, but connecting it to your code editor makes it part of your daily workflow.
Continue Extension
Continue is an open-source AI coding assistant for VS Code that works with local models.
- Install the Continue extension from the VS Code marketplace
- Open Continue settings (click the gear icon in the Continue panel)
- Add Ollama as a provider:
{
  "models": [
    {
      "title": "Ollama - Llama 3.1",
      "provider": "ollama",
      "model": "llama3.1"
    }
  ]
}
- Make sure Ollama is running (ollama serve in a terminal)
- Start chatting or using inline completions
Continue gives you chat, inline editing, and code explanations — all powered by your local model, with no data leaving your machine.
GitHub Copilot with Local Models
GitHub Copilot doesn’t natively support local models, but you can use Ollama’s OpenAI-compatible API with tools that accept custom API endpoints. If you’re already using Copilot for cloud-based assistance, Continue is the best complement for local model access.
Other Options
- Cody (Sourcegraph) — supports Ollama as a backend
- Cline — VS Code extension that works with local Ollama models
- Aider — terminal-based coding assistant with Ollama support
Using Local Models from Python
If you’re building tools, research pipelines, or just want to script your interactions, connecting to local models from Python is straightforward.
Using the Ollama Python Library
pip install ollama
import ollama
# Simple generation
response = ollama.generate(
    model='llama3.1',
    prompt='Summarise the key assumptions of linear regression in 3 bullet points.'
)
print(response['response'])

# Chat with message history
response = ollama.chat(
    model='llama3.1',
    messages=[
        {'role': 'system', 'content': 'You are a research methods expert.'},
        {'role': 'user', 'content': 'When should I use a mixed-effects model instead of a repeated-measures ANOVA?'}
    ]
)
print(response['message']['content'])
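If you want the output to appear as it is generated rather than all at once, the library also supports streaming. A small sketch, assuming the same model is available locally:
# Stream the reply chunk by chunk instead of waiting for the full response
stream = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Explain effect sizes in two sentences.'}],
    stream=True
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()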
Using the OpenAI-Compatible API
Ollama serves an OpenAI-compatible API on localhost:11434. This means you can use the OpenAI Python library with local models — useful if you have existing code that uses the OpenAI API and want to switch to local.
pip install openai
from openai import OpenAI
# Point the client at Ollama's local server
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required but not used
)

response = client.chat.completions.create(
    model='llama3.1',
    messages=[
        {'role': 'user', 'content': 'Write a Python function to compute Cohen\'s d.'}
    ]
)
print(response.choices[0].message.content)
This approach lets you develop locally and swap in a cloud model for production by changing the base_url and api_key — no other code changes needed.
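One way to make that swap painless is to read the endpoint, key, and model name from environment variables, with local defaults. A minimal sketch; the variable names (LLM_BASE_URL, LLM_API_KEY, LLM_MODEL) are illustrative rather than any standard:
import os
from openai import OpenAI

# Defaults target the local Ollama server; set the environment variables
# to point the same script at a cloud provider's OpenAI-compatible endpoint.
client = OpenAI(
    base_url=os.environ.get('LLM_BASE_URL', 'http://localhost:11434/v1'),
    api_key=os.environ.get('LLM_API_KEY', 'ollama'),
)
model = os.environ.get('LLM_MODEL', 'llama3.1')

response = client.chat.completions.create(
    model=model,
    messages=[{'role': 'user', 'content': 'Explain the difference between fixed and random effects.'}]
)
print(response.choices[0].message.content)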
Batch Processing Research Papers
Here’s a practical example — processing multiple paper abstracts:
import ollama
import json
abstracts = [
    "Abstract of paper 1...",
    "Abstract of paper 2...",
    "Abstract of paper 3...",
]

results = []

for i, abstract in enumerate(abstracts):
    response = ollama.generate(
        model='llama3.1',
        prompt=f"""Analyse this research abstract and extract:
- Research question
- Methodology (1 sentence)
- Key finding (1 sentence)
- Sample size

Abstract: {abstract}

Respond in JSON format."""
    )
    results.append({
        'paper': i + 1,
        'analysis': response['response']
    })

# Save results
with open('paper_analysis.json', 'w') as f:
    json.dump(results, f, indent=2)
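Smaller local models don't always return strictly valid JSON, so if you plan to process the results further it is worth parsing defensively. A small sketch of one way to do that, keeping the raw text when parsing fails:
# Try to parse each analysis as JSON, keeping the raw text as a fallback
parsed = []
for item in results:
    try:
        parsed.append({'paper': item['paper'], 'analysis': json.loads(item['analysis'])})
    except json.JSONDecodeError:
        parsed.append({'paper': item['paper'], 'analysis_raw': item['analysis']})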
When to Use Local vs. Cloud
Not every task needs a local model, and not every task needs a cloud model. Here’s a practical decision framework:
Use Local When:
- Privacy matters — student data, unpublished manuscripts, ethics-restricted information, peer review content
- You’re iterating rapidly — testing prompts, debugging pipelines, running batch jobs where API costs add up
- You’re offline or on restricted networks — travel, secure environments, unreliable internet
- Cost is a concern — local models have zero marginal cost per request
- You’re teaching — students can experiment freely without usage limits or account requirements
Use Cloud When:
- You need maximum capability — complex reasoning, long-form writing, nuanced analysis that exceeds what 8B models can do
- The task requires a large context window — processing very long documents where local models run out of context
- Speed matters more than privacy — cloud models on dedicated hardware are typically faster than local inference
- You need specific features — web search, image generation, tool use, or other capabilities not available locally
The Hybrid Approach
The most practical setup is both:
- Ollama running locally for everyday tasks, privacy-sensitive work, and development
- A cloud model (ChatGPT, Claude, Gemini) for tasks that need frontier capability
Develop locally, deploy to cloud when needed. Use local models as your default and escalate to cloud when the task demands it. This gives you the best of both worlds: privacy and zero cost for most tasks, maximum capability when you need it.
Quick Start Checklist
- Install Ollama: brew install ollama (macOS) or download from ollama.com
- Pull a model: ollama pull llama3.1
- Test it: ollama run llama3.1
- Install Continue in VS Code for editor integration
- Install the Python library: pip install ollama
- Try the OpenAI-compatible API for existing scripts
You can have a working local LLM setup in under 10 minutes. Start with a chat session, see how it fits your workflow, and build from there.