Local LLMs
Run powerful language models locally for complete privacy and zero API costs. Memory supports Ollama for easy local inference with GPU acceleration.
Why Local LLMs?
Complete Privacy
Your data never leaves your machine. No API calls, no cloud processing, no data sharing.
Zero API Costs
After initial setup, inference is free. No per-token charges, no monthly bills.
Offline Capable
Works without internet. Perfect for sensitive environments or travel.
Low Latency
No network round-trips. Responses start immediately for a snappier experience.
Setting Up Ollama
Ollama is the easiest way to run local LLMs. Install it with a single command, then verify the install and start the service:
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
ollama version 0.5.4
# Start the Ollama service
ollama serve
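Ollama listens on http://localhost:11434 by default. To confirm the server is reachable before going further, you can query its HTTP API from another terminal (endpoint name per the Ollama API docs; adjust if your installed version differs):
# Check that the API is responding
curl http://localhost:11434/api/version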
Recommended Models
These models offer the best balance of quality and performance for Memory:
Mistral 7B
4.1 GB. Excellent general-purpose model. Best quality-to-size ratio for most use cases. Great at following instructions and reasoning.
Qwen 2.5 7B
4.4 GB. Strong multilingual model from Alibaba. Excellent at coding and structured tasks. Good alternative to Mistral.
Phi-3 Medium
7.9 GB. Microsoft's efficient model. Strong reasoning for its size. Best for complex analytical tasks.
Llama 3.1 8B
4.7 GB. Meta's latest open model. Excellent instruction following and broad knowledge.
Gemma 2 9B
5.4 GB. Google's efficient model. Great for conversational tasks and general assistance.
TinyLlama 1.1B
637 MB. Ultra-lightweight model for resource-constrained systems. Basic tasks only.
Installing Models
# Pull recommended models
ollama pull mistral:7b
ollama pull qwen2.5:7b
ollama pull phi3:14b
# Pull the embedding model (required for Memory's semantic search)
ollama pull nomic-embed-text
# List installed models
ollama list
NAME                ID              SIZE      MODIFIED
mistral:7b          f974a74358d6    4.1 GB    2 days ago
qwen2.5:7b          845dbda0ea48    4.4 GB    2 days ago
nomic-embed-text    0a109f422b47    274 MB    2 days ago
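Once a model is pulled, a quick one-off prompt from the CLI is the fastest way to confirm it loads and generates (the prompt below is just an example):
# Run a single prompt against a pulled model
ollama run mistral:7b "Summarize in one sentence why local inference reduces latency."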
Configuring Memory
Update your Memory configuration to use local models:
# config/memory.yaml
llm:
  provider: ollama
  base_url: http://localhost:11434

  # Primary chat model
  model: mistral:7b

  # Embedding model for semantic search
  embedding_model: nomic-embed-text

  # Model selection strategy
  auto_select: true         # Use different models based on task
  models:
    fast: mistral:7b        # Quick responses
    balanced: qwen2.5:7b    # General use
    complex: phi3:14b       # Complex reasoning
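To verify that the base_url and model names in this config actually resolve, you can exercise both the chat model and the embedding model through Ollama's HTTP API directly. These are standard Ollama endpoints (not Memory-specific) and may differ slightly across Ollama versions:
# Generation check against the configured chat model
curl http://localhost:11434/api/generate -d '{
  "model": "mistral:7b",
  "prompt": "Reply with the single word: ready",
  "stream": false
}'

# Embedding check for the semantic-search model
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "test sentence"
}'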
Hardware Requirements
| Model Size | Min RAM | Recommended RAM | GPU VRAM | CPU-only Speed (approx.) |
|---|---|---|---|---|
| 1-3B (TinyLlama) | 4 GB | 8 GB | 2 GB+ | ~30 tok/s |
| 7B (Mistral, Qwen) | 8 GB | 16 GB | 6 GB+ | ~20 tok/s |
| 14B (Phi-3 Medium) | 16 GB | 32 GB | 10 GB+ | ~10 tok/s |
| 70B+ (Llama 3 70B) | 64 GB | 128 GB | 48 GB+ | ~5 tok/s |
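Before pulling a larger model, it is worth checking what your machine actually has available. On Linux with an NVIDIA GPU, for example (commands differ on macOS and Windows):
# Available system RAM
free -h

# Installed GPU and VRAM usage (NVIDIA)
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv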
GPU Acceleration
For optimal performance, use GPU acceleration:
NVIDIA (Recommended)
Best support with CUDA. Works out of the box with Ollama.
- RTX 3060 (12GB) - 7B models
- RTX 3080/4080 (10-16GB) - 14B models
- RTX 3090/4090 (24GB) - 30B+ models
Apple Silicon
Excellent Metal support. Unified memory is a huge advantage.
- M1/M2 (8GB) - 7B models
- M1/M2 Pro (16GB) - 14B models
- M1/M2 Max/Ultra (32-128GB) - 70B+ models
AMD
ROCm support available but less mature than CUDA.
- RX 6800/7900 - Good support
- Requires ROCm setup
- Check Ollama docs for compatibility
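To check whether Ollama is actually using the GPU rather than silently falling back to CPU, recent Ollama releases report this per loaded model (output format may vary by version):
# Show loaded models and whether they run on GPU or CPU
ollama ps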
Hybrid Mode: Local + Cloud
Memory supports switching between local and cloud models based on task complexity: local models handle routine queries, while complex reasoning escalates to Claude or GPT-4.
# config/memory.yaml - Hybrid configuration
llm:
  default_provider: ollama

  providers:
    ollama:
      base_url: http://localhost:11434
      model: mistral:7b
    claude:
      api_key: ${ANTHROPIC_API_KEY}
      model: claude-3-sonnet
    openai:
      api_key: ${OPENAI_API_KEY}
      model: gpt-4o

  routing:
    simple_queries: ollama      # "What's my schedule?"
    code_tasks: ollama          # Code completion
    complex_analysis: claude    # Deep reasoning
    creative_writing: claude    # Long-form content
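Ollama also exposes an OpenAI-compatible chat endpoint, which is one way a client can treat local and cloud backends interchangeably by changing only the base URL and credentials (whether Memory routes through this path is an implementation detail). A local example, assuming the model from the config above:
# OpenAI-compatible chat request served by local Ollama
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral:7b",
    "messages": [{"role": "user", "content": "What is on my schedule today?"}]
  }'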
Cost Comparison
Running Mistral 7B locally costs approximately $0.001/1K tokens (electricity only), compared to $0.015/1K tokens for Claude 3 Sonnet or $0.03/1K tokens for GPT-4. For 100K tokens/day, that's ~$3/month local vs ~$45-90/month cloud.
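The monthly figures follow directly from the per-1K-token rates; a throwaway sketch of the same arithmetic (prices as quoted above, 30-day month, 100K tokens/day = 100 x 1K tokens/day):
# Rough monthly cost at 100K tokens/day
echo "local:  $(echo "100 * 0.001 * 30" | bc) USD/month"
echo "claude: $(echo "100 * 0.015 * 30" | bc) USD/month"
echo "gpt-4:  $(echo "100 * 0.03 * 30" | bc) USD/month"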