Getting Started with Ollama: Run LLMs Locally on Your Mac (2026)
Learn how to install and use Ollama to run powerful AI models like Llama 3, Mistral, and Gemma locally on your Mac. Complete beginner's guide.
Ollama makes running large language models (LLMs) on your local machine as easy as running a single command. No cloud API keys, no per-token costs, no data leaving your device.
In this guide, you'll learn how to install Ollama, choose the right model for your hardware, and start using local AI for coding, writing, and more.
What is Ollama?
Ollama is a tool that packages and runs open-source LLMs locally. Think of it as Docker for AI models—it handles downloading, configuring, and running models with simple commands.
Key benefits:
- Privacy: Your prompts never leave your device (see why local-first AI matters)
- No API costs: Run unlimited queries for free
- Offline: Works without internet after model download
- Fast: Local inference with no network latency
Installing Ollama
macOS Installation
The install script at ollama.com targets Linux; on a Mac, the easiest route is Homebrew:
# Install with Homebrew
brew install ollama
Or download the app directly from ollama.com.
Verify Installation
ollama --version
# Output: ollama version 0.x.x
That's it. Ollama is ready to use.
Downloading Your First Model
Ollama supports dozens of open-source models. Let's start with Llama 3, Meta's widely used open model:
# Download Llama 3 (8B parameters)
ollama pull llama3
# This takes a few minutes depending on your connection
# Model size: ~4.7 GB
Popular Models to Try
| Model | Size | Best For | Command |
|---|---|---|---|
| llama3 | 4.7 GB | General tasks, coding | ollama pull llama3 |
| mistral | 4.1 GB | Fast, efficient | ollama pull mistral |
| codellama | 3.8 GB | Code generation | ollama pull codellama |
| gemma | 5.0 GB | Balanced performance | ollama pull gemma |
| phi3 | 2.2 GB | Small, fast | ollama pull phi3 |
Model Size Recommendations
| Your RAM | Recommended Models |
|---|---|
| 8 GB | phi3, gemma:2b |
| 16 GB | llama3:8b, mistral, codellama |
| 32 GB+ | mixtral; llama3:70b (64 GB is more comfortable) |
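A rough way to sanity-check these numbers yourself: a 4-bit quantized model needs on the order of 0.5–0.6 GB of RAM per billion parameters, plus some overhead for context and runtime. The constants below are ballpark assumptions, not Ollama-published figures:

```python
def estimated_ram_gb(params_billion, gb_per_billion=0.55, overhead_gb=1.5):
    """Ballpark RAM estimate for a 4-bit quantized model:
    ~0.55 GB per billion parameters plus fixed runtime overhead."""
    return params_billion * gb_per_billion + overhead_gb

for params in (3.8, 8, 70):
    print(f"{params}B params -> ~{estimated_ram_gb(params):.1f} GB")
```

For llama3:8b this lands near the 4.7 GB download size in the table above; for 70B parameters it shows why 32 GB is a tight fit.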
Running Your First Query
Interactive Chat
ollama run llama3
This starts an interactive session:
>>> What's the best way to handle errors in Python?
In Python, there are several approaches to error handling...
[AI response continues]
>>> /bye
Single Query
ollama run llama3 "Explain async/await in JavaScript in 3 sentences"
From a File
cat prompt.txt | ollama run llama3
Using Ollama with Code
REST API
Ollama runs a local API server on port 11434. By default, /api/generate streams the reply as newline-delimited JSON; add "stream": false to get a single JSON response:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Write a Python function to check if a number is prime",
  "stream": false
}'
Python Integration
import requests

def query_ollama(prompt, model="llama3"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    return response.json()["response"]
# Use it
result = query_ollama("Explain recursion simply")
print(result)
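The helper above disables streaming. With the default "stream": true, Ollama sends newline-delimited JSON, one fragment per line, ending with a record where "done" is true. A small parser sketch for that format (the sample input is fabricated for illustration):

```python
import json

def join_stream(ndjson_text):
    """Concatenate the 'response' fragments of Ollama's streaming
    (newline-delimited JSON) output into the full reply."""
    parts = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Fabricated sample: two fragments, then the final 'done' record
sample = "\n".join([
    '{"response": "Hello", "done": false}',
    '{"response": " world", "done": false}',
    '{"response": "", "done": true}',
])
print(join_stream(sample))  # Hello world
```

In practice you would feed it the lines yielded by `response.iter_lines()` on a streaming `requests.post`.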
JavaScript/Node.js
async function queryOllama(prompt, model = "llama3") {
  const response = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  const data = await response.json();
  return data.response;
}
// Use it
const result = await queryOllama("What is a closure?");
console.log(result);
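For multi-turn conversations there is also a /api/chat endpoint that takes a list of role-tagged messages instead of a single prompt. A minimal payload builder, sticking with Python (the reply text comes back under message.content):

```python
import json

def build_chat_payload(history, user_msg, model="llama3"):
    """Build an /api/chat request body: prior turns plus the new user message."""
    messages = list(history) + [{"role": "user", "content": user_msg}]
    return {"model": model, "messages": messages, "stream": False}

history = [
    {"role": "user", "content": "What is a closure?"},
    {"role": "assistant", "content": "A closure is a function that remembers..."},
]
payload = build_chat_payload(history, "Show a JavaScript example")
print(json.dumps(payload, indent=2))
# POST this to http://localhost:11434/api/chat and read
# response.json()["message"]["content"]
```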
Common Use Cases
Code Generation
ollama run codellama "Write a TypeScript function that debounces another function"
Code Review
cat my-code.py | ollama run llama3 "Review this code for bugs and improvements"
Documentation
ollama run llama3 "Generate JSDoc for this function: function add(a, b) { return a + b; }"
Writing Assistance
ollama run llama3 "Make this email more professional: hey can you send me that report thing"
Learning
ollama run llama3 "Explain Docker containers like I'm a beginner developer"
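The pipe-based examples above can also be scripted. A small sketch that prepends an instruction to a file's contents, mirroring `cat my-code.py | ollama run llama3 "..."` (the temp file here is just for the demo):

```python
import tempfile
from pathlib import Path

def review_prompt(path, instruction="Review this code for bugs and improvements"):
    """Prepend an instruction to a file's contents, ready to send to Ollama."""
    return instruction + "\n\n" + Path(path).read_text()

# Demo with a throwaway file
tmp = Path(tempfile.mkdtemp()) / "my-code.py"
tmp.write_text("def add(a, b): return a + b\n")
print(review_prompt(str(tmp)))
```

Pass the result to the query_ollama helper from earlier, or pipe it through the CLI.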
Performance Tips
1. Use the Right Model Size
Smaller models are faster. For quick tasks, use lighter models:
# Fast but less capable
ollama run phi3 "Quick question here"
# Slower but more capable
ollama run llama3:70b "Complex analysis needed"
2. Keep Ollama Running
Start Ollama as a service so it's ready before your first request:
# The macOS app keeps Ollama running in the background automatically;
# with Homebrew, start the background service:
brew services start ollama
# Check which models are currently loaded
ollama ps
3. Use GPU Acceleration
On Apple Silicon Macs, Ollama automatically uses the GPU. No configuration needed.
4. Adjust Context Window
For longer conversations, raise the context window inside an interactive session:
ollama run llama3
>>> /set parameter num_ctx 8192
You can also set it per request through the API's "options" field, e.g. "options": {"num_ctx": 8192}.
Managing Models
List Installed Models
ollama list
# Output:
# NAME ID SIZE MODIFIED
# llama3:latest a6990ed6be41 4.7 GB 2 days ago
# mistral:latest f974a74358d6 4.1 GB 1 week ago
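The same list is available programmatically from the /api/tags endpoint, which returns installed models with their size in bytes. A helper to total the disk usage (the sample response below is fabricated to match the listing above):

```python
def disk_usage_gb(tags):
    """Sum the 'size' fields (bytes) of an /api/tags-style response."""
    return sum(m["size"] for m in tags.get("models", [])) / 1e9

# Fabricated /api/tags-style response
sample = {"models": [
    {"name": "llama3:latest", "size": 4_700_000_000},
    {"name": "mistral:latest", "size": 4_100_000_000},
]}
print(f"{disk_usage_gb(sample):.1f} GB")  # 8.8 GB
```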
Remove a Model
ollama rm mistral
Update a Model
ollama pull llama3
# Re-downloads if there's a new version
Check Model Info
ollama show llama3
Ollama vs Cloud APIs
| Aspect | Ollama (Local) | Cloud APIs |
|---|---|---|
| Privacy | ✅ 100% local | ❌ Data sent to servers |
| Cost | ✅ Free | ❌ Pay per token |
| Speed | ⚠️ Hardware dependent | ✅ Fast servers |
| Offline | ✅ Yes | ❌ No |
| Model Quality | ⚠️ Open source | ✅ Frontier models |
| Setup | ⚠️ Some config | ✅ API key only |
When to use Ollama:
- Privacy-sensitive tasks (learn about Vocoding's privacy-first architecture)
- High-volume queries where cost matters
- Offline work (check system compatibility requirements)
- Learning and experimentation
When to use cloud:
- Maximum quality needed (GPT-4, Claude)
- Limited local hardware
- Production applications with SLAs
Combining Ollama with Vocoding
Vocoding integrates with Ollama for a powerful local AI workflow. See how Vocoding works for the full pipeline:
- Voice input → Local Whisper transcription
- Prompt optimization → Local or cloud LLM
- AI processing → Ollama for local, or cloud for maximum quality
This gives you:
- Complete privacy when using Ollama
- Cloud quality when needed
- Seamless switching between both
Example Workflow
- Press ⌥+T - Speak your intent
- Vocoding transcribes locally with Whisper
- Press ⌥+O - Optimize with Ollama
- Paste optimized prompt anywhere
No audio or text ever leaves your device.
Troubleshooting
"Model not found"
# Make sure the model is downloaded
ollama pull llama3
"Out of memory"
Use a smaller model:
ollama run phi3 # Instead of llama3
Slow responses
- Close other memory-intensive apps
- Use a smaller model
- Check Activity Monitor for memory pressure
Connection refused
Make sure Ollama is running:
# Start Ollama service
ollama serve
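To check programmatically whether the server is up before sending requests, you can probe its root endpoint (it answers with a plain "Ollama is running" page when alive). A stdlib-only sketch:

```python
import urllib.request
import urllib.error

def ollama_is_up(base_url="http://localhost:11434", timeout=2):
    """Return True if the Ollama server answers on its root endpoint."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(ollama_is_up())
```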
What's Next?
Once you're comfortable with Ollama basics:
- Try different models - Each has strengths for different tasks
- Build integrations - Use the API in your projects
- Create custom models - Fine-tune for specific use cases
- Explore Modelfiles - Customize model behavior
Ready for Voice-to-AI Workflows?
Vocoding combines local Whisper transcription with Ollama integration, giving you a complete privacy-first AI assistant.
Get Vocoding for €147 - Voice input, local AI, zero cloud dependency.
Ready to code at the speed of thought?
Join developers using voice-first AI productivity.
Get Early Access