Vocoding · Tutorial · 5 min read

Getting Started with Ollama: Run LLMs Locally on Your Mac (2026)

Learn how to install and use Ollama to run powerful AI models like Llama 3, Mistral, and Gemma locally on your Mac. Complete beginner's guide.

Ollama makes running large language models (LLMs) on your local machine as easy as running a single command. No cloud API keys, no per-token costs, no data leaving your device.

In this guide, you'll learn how to install Ollama, choose the right model for your hardware, and start using local AI for coding, writing, and more.

What is Ollama?

Ollama is a tool that packages and runs open-source LLMs locally. Think of it as Docker for AI models—it handles downloading, configuring, and running models with simple commands.

Key benefits:

  • Privacy: Your prompts never leave your device (see why local-first AI matters)
  • No API costs: Run unlimited queries for free
  • Offline: Works without internet after model download
  • Fast: Local inference with no network latency

Installing Ollama

macOS Installation

The easiest way to install Ollama on a Mac is to download the app directly from ollama.com. If you prefer the command line, Homebrew works too:

# Install via Homebrew
brew install ollama

(The curl install script at ollama.com is intended for Linux, not macOS.)

Verify Installation

ollama --version
# Output: ollama version 0.x.x

That's it. Ollama is ready to use.

Downloading Your First Model

Ollama supports dozens of open-source models. Let's start with Llama 3, Meta's widely used open-weight model:

# Download Llama 3 (8B parameters)
ollama pull llama3

# This takes a few minutes depending on your connection
# Model size: ~4.7 GB
| Model | Size | Best For | Command |
|---|---|---|---|
| llama3 | 4.7 GB | General tasks, coding | ollama pull llama3 |
| mistral | 4.1 GB | Fast, efficient | ollama pull mistral |
| codellama | 3.8 GB | Code generation | ollama pull codellama |
| gemma | 5.0 GB | Balanced performance | ollama pull gemma |
| phi-3 | 2.2 GB | Small, fast | ollama pull phi3 |

Model Size Recommendations

| Your RAM | Recommended Models |
|---|---|
| 8 GB | phi-3, gemma:2b |
| 16 GB | llama3:8b, mistral, codellama |
| 32 GB+ | llama3:70b, mixtral |
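The table above can be turned into a quick picker. A minimal sketch (the tier cutoffs and model names mirror the table; the RAM probe uses POSIX sysconf, which works on macOS and Linux):

```python
import os

def recommend_model(ram_gb: float) -> str:
    # Tiers mirror the RAM recommendations table above.
    if ram_gb >= 32:
        return "llama3:70b"
    if ram_gb >= 16:
        return "llama3:8b"
    return "phi3"

# Rough physical-RAM probe via POSIX sysconf.
ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
print(f"{ram_gb:.0f} GB RAM -> try: {recommend_model(ram_gb)}")
```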

Running Your First Query

Interactive Chat

ollama run llama3

This starts an interactive session:

>>> What's the best way to handle errors in Python?

In Python, there are several approaches to error handling...
[AI response continues]

>>> /bye

Single Query

ollama run llama3 "Explain async/await in JavaScript in 3 sentences"

From a File

cat prompt.txt | ollama run llama3

Using Ollama with Code

REST API

Ollama runs a local API server on port 11434:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Write a Python function to check if a number is prime"
}'
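Note that by default the API streams its reply as newline-delimited JSON chunks; pass "stream": false to get a single JSON object instead. A sketch of reassembling a streamed reply (the chunks below are sample data, not real server output):

```python
import json

# Sample of the newline-delimited JSON that /api/generate streams by default;
# a real response comes from the local server on port 11434.
sample_stream = [
    '{"model":"llama3","response":"Primes are","done":false}',
    '{"model":"llama3","response":" fun.","done":false}',
    '{"model":"llama3","response":"","done":true}',
]

def join_stream(lines):
    # Concatenate "response" fragments until a chunk reports done=true.
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk["response"])
        if chunk["done"]:
            break
    return "".join(parts)

print(join_stream(sample_stream))  # -> Primes are fun.
```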

Python Integration

import requests

def query_ollama(prompt, model="llama3"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False}
    )
    return response.json()["response"]

# Use it
result = query_ollama("Explain recursion simply")
print(result)
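For multi-turn conversations, Ollama also exposes a /api/chat endpoint that takes a list of role-tagged messages instead of one flat prompt. A sketch using only the standard library (it assumes the default local server; the helper names are mine):

```python
import json
from urllib import request

def build_chat_payload(messages, model="llama3"):
    # /api/chat expects role-tagged messages: "system", "user", "assistant".
    return {"model": model, "messages": messages, "stream": False}

def chat_ollama(messages, model="llama3"):
    req = request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(build_chat_payload(messages, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # needs a running Ollama server
        return json.load(resp)["message"]["content"]

history = [
    {"role": "system", "content": "Answer in one sentence."},
    {"role": "user", "content": "What is a generator in Python?"},
]
# reply = chat_ollama(history)  # uncomment with Ollama running
```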

JavaScript/Node.js

async function queryOllama(prompt, model = "llama3") {
  const response = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  const data = await response.json();
  return data.response;
}

// Use it
const result = await queryOllama("What is a closure?");
console.log(result);

Common Use Cases

Code Generation

ollama run codellama "Write a TypeScript function that debounces another function"

Code Review

cat my-code.py | ollama run llama3 "Review this code for bugs and improvements"

Documentation

ollama run llama3 "Generate JSDoc for this function: function add(a, b) { return a + b; }"

Writing Assistance

ollama run llama3 "Make this email more professional: hey can you send me that report thing"

Learning

ollama run llama3 "Explain Docker containers like I'm a beginner developer"

Performance Tips

1. Use the Right Model Size

Smaller models are faster. For quick tasks, use lighter models:

# Fast but less capable
ollama run phi3 "Quick question here"

# Slower but more capable
ollama run llama3:70b "Complex analysis needed"

2. Keep Ollama Running

Start Ollama as a service for faster first responses:

# macOS: Ollama runs automatically after installation
# Check status
ollama ps

3. Use GPU Acceleration

On Apple Silicon Macs, Ollama automatically uses the GPU. No configuration needed.

4. Adjust Context Window

For longer conversations, increase the context:

ollama run llama3
>>> /set parameter num_ctx 8192
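The context window can also be set per request through the API's options field (num_ctx is the context length in tokens):

```python
import json

# Per-request options ride alongside the prompt in an /api/generate call.
payload = {
    "model": "llama3",
    "prompt": "Summarize the discussion so far.",
    "stream": False,
    "options": {"num_ctx": 8192},  # context window, in tokens
}
print(json.dumps(payload, indent=2))
```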

Managing Models

List Installed Models

ollama list

# Output:
# NAME            ID              SIZE      MODIFIED
# llama3:latest   a6990ed6be41    4.7 GB    2 days ago
# mistral:latest  f974a74358d6    4.1 GB    1 week ago
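The same information is available programmatically from the local API's GET /api/tags endpoint. A sketch that parses its response shape (sample data shown, since the real list depends on what you've pulled):

```python
# Shape of a GET http://localhost:11434/api/tags response (sample data).
sample_tags = {
    "models": [
        {"name": "llama3:latest", "size": 4700000000},
        {"name": "mistral:latest", "size": 4100000000},
    ]
}

def model_names(tags):
    # Equivalent to the NAME column of `ollama list`.
    return [m["name"] for m in tags["models"]]

print(model_names(sample_tags))  # -> ['llama3:latest', 'mistral:latest']
```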

Remove a Model

ollama rm mistral

Update a Model

ollama pull llama3
# Re-downloads if there's a new version

Check Model Info

ollama show llama3

Ollama vs Cloud APIs

| Aspect | Ollama (Local) | Cloud APIs |
|---|---|---|
| Privacy | ✅ 100% local | ❌ Data sent to servers |
| Cost | ✅ Free | ❌ Pay per token |
| Speed | ⚠️ Hardware dependent | ✅ Fast servers |
| Offline | ✅ Yes | ❌ No |
| Model Quality | ⚠️ Open source | ✅ Frontier models |
| Setup | ⚠️ Some config | ✅ API key only |

When to use Ollama:

  • Privacy-sensitive prompts or data
  • High-volume or experimental use where per-token costs add up
  • Offline or air-gapped work

When to use cloud:

  • Maximum quality needed (GPT-4, Claude)
  • Limited local hardware
  • Production applications with SLAs

Combining Ollama with Vocoding

Vocoding integrates with Ollama for a powerful local AI workflow. See how Vocoding works for the full pipeline:

  1. Voice input → Local Whisper transcription
  2. Prompt optimization → Local or cloud LLM
  3. AI processing → Ollama for local, or cloud for maximum quality

This gives you:

  • Complete privacy when using Ollama
  • Cloud quality when needed
  • Seamless switching between both

Example Workflow

  1. Press ⌥+T - Speak your intent
  2. Vocoding transcribes locally with Whisper
  3. Press ⌥+O - Optimize with Ollama
  4. Paste optimized prompt anywhere

No audio or text ever leaves your device.

Troubleshooting

"Model not found"

# Make sure the model is downloaded
ollama pull llama3

"Out of memory"

Use a smaller model:

ollama run phi3  # Instead of llama3

Slow responses

  • Close other memory-intensive apps
  • Use a smaller model
  • Check Activity Monitor for memory pressure

Connection refused

Make sure Ollama is running:

# Start Ollama service
ollama serve

What's Next?

Once you're comfortable with Ollama basics:

  1. Try different models - Each has strengths for different tasks
  2. Build integrations - Use the API in your projects
  3. Create custom models - Fine-tune for specific use cases
  4. Explore Modelfiles - Customize model behavior

Ready for Voice-to-AI Workflows?

Vocoding combines local Whisper transcription with Ollama integration, giving you a complete privacy-first AI assistant.

Get Vocoding for €147 - Voice input, local AI, zero cloud dependency.

Tags: ollama getting started, local llm, run ai locally, ollama tutorial, llama local

Ready to code at the speed of thought?

Join developers using voice-first AI productivity.

Get Early Access