Getting Started with Ollama: Run LLMs Locally on Your Mac (2026)
Learn how to install and use Ollama to run powerful AI models like Llama 3, Mistral, and Gemma locally on your Mac. Complete beginner's guide.
Ollama makes running large language models (LLMs) on your local machine as easy as running a single command. No cloud API keys, no per-token costs, no data leaving your device.
In this guide, you'll learn how to install Ollama, choose the right model for your hardware, and start using local AI for coding, writing, and more.
What is Ollama?
Ollama is a tool that packages and runs open-source LLMs locally. Think of it as Docker for AI models—it handles downloading, configuring, and running models with simple commands.
Key benefits:
- Privacy: Your prompts never leave your device (see why local-first AI matters)
- No API costs: Run unlimited queries for free
- Offline: Works without internet after model download
- Fast: Local inference with no network latency
Installing Ollama
macOS Installation
The install script at ollama.com targets Linux; on a Mac, the easiest route is Homebrew:
# Install with Homebrew
brew install ollama
Or download the app directly from ollama.com.
Verify Installation
ollama --version
# Output: ollama version 0.x.x
That's it. Ollama is ready to use.
Downloading Your First Model
Ollama supports dozens of open-source models. Let's start with Llama 3, Meta's widely used open model:
# Download Llama 3 (8B parameters)
ollama pull llama3
# This takes a few minutes depending on your connection
# Model size: ~4.7 GB
Popular Models to Try
| Model | Size | Best For | Command |
|---|---|---|---|
| llama3 | 4.7 GB | General tasks, coding | ollama pull llama3 |
| mistral | 4.1 GB | Fast, efficient | ollama pull mistral |
| codellama | 3.8 GB | Code generation | ollama pull codellama |
| gemma | 5.0 GB | Balanced performance | ollama pull gemma |
| phi3 | 2.2 GB | Small, fast | ollama pull phi3 |
Model Size Recommendations
| Your RAM | Recommended Models |
|---|---|
| 8 GB | phi3, gemma:2b |
| 16 GB | llama3:8b, mistral, codellama |
| 32 GB+ | mixtral; llama3:70b (64 GB is more comfortable) |
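A rough way to sanity-check these numbers yourself: a 4-bit quantized model needs on the order of 0.5–0.6 GB of RAM per billion parameters, plus some overhead for context and runtime. The constants below are ballpark assumptions, not Ollama-published figures:

```python
def estimated_ram_gb(params_billion, gb_per_billion=0.55, overhead_gb=1.5):
    """Ballpark RAM estimate for a 4-bit quantized model:
    ~0.55 GB per billion parameters plus fixed runtime overhead."""
    return params_billion * gb_per_billion + overhead_gb

for params in (3.8, 8, 70):
    print(f"{params}B params -> ~{estimated_ram_gb(params):.1f} GB")
```

For llama3:8b this lands near the 4.7 GB download size in the table above; for 70B parameters it shows why 32 GB is a tight fit.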
Running Your First Query
Interactive Chat
ollama run llama3
This starts an interactive session:
>>> What's the best way to handle errors in Python?
In Python, there are several approaches to error handling...
[AI response continues]
>>> /bye
Single Query
ollama run llama3 "Explain async/await in JavaScript in 3 sentences"
From a File
cat prompt.txt | ollama run llama3
Using Ollama with Code
REST API
Ollama runs a local API server on port 11434. By default, /api/generate streams the reply as newline-delimited JSON; add "stream": false to get a single JSON response:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Write a Python function to check if a number is prime",
  "stream": false
}'
Python Integration
import requests

def query_ollama(prompt, model="llama3"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    return response.json()["response"]
# Use it
result = query_ollama("Explain recursion simply")
print(result)
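The helper above disables streaming. With the default "stream": true, Ollama sends newline-delimited JSON, one fragment per line, ending with a record where "done" is true. A small parser sketch for that format (the sample input is fabricated for illustration):

```python
import json

def join_stream(ndjson_text):
    """Concatenate the 'response' fragments of Ollama's streaming
    (newline-delimited JSON) output into the full reply."""
    parts = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Fabricated sample: two fragments, then the final 'done' record
sample = "\n".join([
    '{"response": "Hello", "done": false}',
    '{"response": " world", "done": false}',
    '{"response": "", "done": true}',
])
print(join_stream(sample))  # Hello world
```

In practice you would feed it the lines yielded by `response.iter_lines()` on a streaming `requests.post`.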
JavaScript/Node.js
async function queryOllama(prompt, model = "llama3") {
  const response = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  const data = await response.json();
  return data.response;
}
// Use it
const result = await queryOllama("What is a closure?");
console.log(result);
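For multi-turn conversations there is also a /api/chat endpoint that takes a list of role-tagged messages instead of a single prompt. A minimal payload builder, sticking with Python (the reply text comes back under message.content):

```python
import json

def build_chat_payload(history, user_msg, model="llama3"):
    """Build an /api/chat request body: prior turns plus the new user message."""
    messages = list(history) + [{"role": "user", "content": user_msg}]
    return {"model": model, "messages": messages, "stream": False}

history = [
    {"role": "user", "content": "What is a closure?"},
    {"role": "assistant", "content": "A closure is a function that remembers..."},
]
payload = build_chat_payload(history, "Show a JavaScript example")
print(json.dumps(payload, indent=2))
# POST this to http://localhost:11434/api/chat and read
# response.json()["message"]["content"]
```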
Common Use Cases
Code Generation
ollama run codellama "Write a TypeScript function that debounces another function"
Code Review
cat my-code.py | ollama run llama3 "Review this code for bugs and improvements"
Documentation
ollama run llama3 "Generate JSDoc for this function: function add(a, b) { return a + b; }"
Writing Assistance
ollama run llama3 "Make this email more professional: hey can you send me that report thing"
Learning
ollama run llama3 "Explain Docker containers like I'm a beginner developer"
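The pipe-based examples above can also be scripted. A small sketch that prepends an instruction to a file's contents, mirroring `cat my-code.py | ollama run llama3 "..."` (the temp file here is just for the demo):

```python
import tempfile
from pathlib import Path

def review_prompt(path, instruction="Review this code for bugs and improvements"):
    """Prepend an instruction to a file's contents, ready to send to Ollama."""
    return instruction + "\n\n" + Path(path).read_text()

# Demo with a throwaway file
tmp = Path(tempfile.mkdtemp()) / "my-code.py"
tmp.write_text("def add(a, b): return a + b\n")
print(review_prompt(str(tmp)))
```

Pass the result to the query_ollama helper from earlier, or pipe it through the CLI.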
Performance Tips
1. Use the Right Model Size
Smaller models are faster. For quick tasks, use lighter models:
# Fast but less capable
ollama run phi3 "Quick question here"
# Slower but more capable
ollama run llama3:70b "Complex analysis needed"
2. Keep Ollama Running
Start Ollama as a service so it's ready before your first request:
# The macOS app keeps Ollama running in the background automatically;
# with Homebrew, start the background service:
brew services start ollama
# Check which models are currently loaded
ollama ps
3. Use GPU Acceleration
On Apple Silicon Macs, Ollama automatically uses the GPU. No configuration needed.
4. Adjust Context Window
For longer conversations, raise the context window inside an interactive session:
ollama run llama3
>>> /set parameter num_ctx 8192
You can also set it per request through the API's "options" field, e.g. "options": {"num_ctx": 8192}.
Managing Models
List Installed Models
ollama list
# Output:
# NAME ID SIZE MODIFIED
# llama3:latest a6990ed6be41 4.7 GB 2 days ago
# mistral:latest f974a74358d6 4.1 GB 1 week ago
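The same list is available programmatically from the /api/tags endpoint, which returns installed models with their size in bytes. A helper to total the disk usage (the sample response below is fabricated to match the listing above):

```python
def disk_usage_gb(tags):
    """Sum the 'size' fields (bytes) of an /api/tags-style response."""
    return sum(m["size"] for m in tags.get("models", [])) / 1e9

# Fabricated /api/tags-style response
sample = {"models": [
    {"name": "llama3:latest", "size": 4_700_000_000},
    {"name": "mistral:latest", "size": 4_100_000_000},
]}
print(f"{disk_usage_gb(sample):.1f} GB")  # 8.8 GB
```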
Remove a Model
ollama rm mistral
Update a Model
ollama pull llama3
# Re-downloads if there's a new version
Check Model Info
ollama show llama3
Ollama vs Cloud APIs
| Aspect | Ollama (Local) | Cloud APIs |
|---|---|---|
| Privacy | ✅ 100% local | ❌ Data sent to servers |
| Cost | ✅ Free | ❌ Pay per token |
| Speed | ⚠️ Hardware dependent | ✅ Fast servers |
| Offline | ✅ Yes | ❌ No |
| Model Quality | ⚠️ Open source | ✅ Frontier models |
| Setup | ⚠️ Some config | ✅ API key only |
When to use Ollama:
- Privacy-sensitive tasks (learn about Vocoding's privacy-first architecture)
- High-volume queries where cost matters
- Offline work (check system compatibility requirements)
- Learning and experimentation
When to use cloud:
- Maximum quality needed (GPT-4, Claude)
- Limited local hardware
- Production applications with SLAs
Combining Ollama with Vocoding
Vocoding integrates with Ollama for a powerful local AI workflow. See how Vocoding works for the full pipeline:
- Voice input → Local Whisper transcription
- Prompt optimization → Local or cloud LLM
- AI processing → Ollama for local, or cloud for maximum quality
This gives you:
- Complete privacy when using Ollama
- Cloud quality when needed
- Seamless switching between both
Example Workflow
- Press ⌥+T - Speak your intent
- Vocoding transcribes locally with Whisper
- Press ⌥+O - Optimize with Ollama
- Paste optimized prompt anywhere
No audio or text ever leaves your device.
Troubleshooting
"Model not found"
# Make sure the model is downloaded
ollama pull llama3
"Out of memory"
Use a smaller model:
ollama run phi3 # Instead of llama3
Slow responses
- Close other memory-intensive apps
- Use a smaller model
- Check Activity Monitor for memory pressure
Connection refused
Make sure Ollama is running:
# Start Ollama service
ollama serve
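To check programmatically whether the server is up before sending requests, you can probe its root endpoint (it answers with a plain "Ollama is running" page when alive). A stdlib-only sketch:

```python
import urllib.request
import urllib.error

def ollama_is_up(base_url="http://localhost:11434", timeout=2):
    """Return True if the Ollama server answers on its root endpoint."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(ollama_is_up())
```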
What's Next?
Once you're comfortable with Ollama basics:
- Try different models - Each has strengths for different tasks
- Build integrations - Use the API in your projects
- Create custom models - Fine-tune for specific use cases
- Explore Modelfiles - Customize model behavior
Ready for Voice-to-AI Workflows?
Vocoding combines local Whisper transcription with Ollama integration, giving you a complete privacy-first AI assistant.
Get Vocoding for €147 - Voice input, local AI, zero cloud dependency.
Ready to code at the speed of thought?
Join developers using voice-first AI productivity.
Get Early Access