Local LLMs

Run powerful language models locally for complete privacy and zero API costs. Memory supports Ollama for easy local inference with GPU acceleration.

Why Local LLMs?

Complete Privacy

Your data never leaves your machine. No API calls, no cloud processing, no data sharing.

Zero API Costs

After the initial setup, inference costs nothing beyond electricity. No per-token charges, no monthly bills.

Offline Capable

Works without internet. Perfect for sensitive environments or travel.

Low Latency

No network round-trips. Responses start immediately for a snappier experience.

Setting Up Ollama

Ollama is the easiest way to run local LLMs. Install it with a single command, then verify the install and start the service:

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version
ollama version 0.5.4

# Start the Ollama service
ollama serve
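
Once the service is running, you can confirm the API is reachable on its default port (11434). The /api/tags endpoint lists installed models, so an empty list at this stage is expected:

# Check that the local Ollama API is responding (returns installed models as JSON)
curl http://localhost:11434/api/tags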

Recommended Models

These models offer the best balance of quality and performance for Memory:

Qwen 2.5 7B

4.4 GB

Strong multilingual model from Alibaba. Excellent at coding and structured tasks. A good alternative to Mistral 7B, the default chat model in the configuration below.

Parameters 7B
Context 32K
RAM Needed 8GB+
Speed Fast

Phi-3 Medium

7.9 GB

Microsoft's efficient model. Strong reasoning for its size. Best for complex analytical tasks.

Parameters 14B
Context 128K
RAM Needed 16GB+
Speed Medium

Llama 3.1 8B

4.7 GB

Meta's latest open model. Excellent instruction following and broad knowledge.

Parameters 8B
Context 128K
RAM Needed 8GB+
Speed Fast

Gemma 2 9B

5.4 GB

Google's efficient model. Great for conversational tasks and general assistance.

Parameters 9B
Context 8K
RAM Needed 10GB+
Speed Fast

TinyLlama 1.1B

637 MB

Ultra-lightweight model for resource-constrained systems. Basic tasks only.

Parameters 1.1B
Context 2K
RAM Needed 4GB+
Speed Very Fast

Installing Models

# Pull recommended models
ollama pull mistral:7b
ollama pull qwen2.5:7b
ollama pull phi3:14b

# Pull the embedding model (required for Memory's semantic search)
ollama pull nomic-embed-text

# List installed models
ollama list
NAME                ID              SIZE      MODIFIED
mistral:7b          f974a74358d6    4.1 GB    2 days ago
qwen2.5:7b          845dbda0ea48    4.4 GB    2 days ago
nomic-embed-text    0a109f422b47    274 MB    2 days ago
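
Before pointing Memory at these models, a quick smoke test from the command line confirms they load and respond. The prompts below are just placeholders:

# One-off generation to confirm the chat model loads and answers
ollama run mistral:7b "Summarize the benefits of local inference in one sentence."

# Confirm the embedding model responds via the local API
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "hello world"}'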

Configuring Memory

Update your Memory configuration to use local models:

# config/memory.yaml

llm:
  provider: ollama
  base_url: http://localhost:11434
  
  # Primary chat model
  model: mistral:7b
  
  # Embedding model for semantic search
  embedding_model: nomic-embed-text
  
  # Model selection strategy
  auto_select: true # Use different models based on task
  models:
    fast: mistral:7b # Quick responses
    balanced: qwen2.5:7b # General use
    complex: phi3:14b # Complex reasoning
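
It is also worth verifying that every model referenced in this config is actually installed before starting Memory. A small loop like the one below (model names copied from the config above) does the check:

# Verify each model referenced in memory.yaml is installed locally
for m in mistral:7b qwen2.5:7b phi3:14b nomic-embed-text; do
  ollama show "$m" > /dev/null 2>&1 && echo "ok: $m" || echo "missing: $m"
done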

Hardware Requirements

Model Size              Min RAM    Recommended RAM    GPU VRAM    CPU Speed
1-3B (TinyLlama)        4GB        8GB                2GB+        ~30 tok/s
7B (Mistral, Qwen)      8GB        16GB               6GB+        ~20 tok/s
14B (Phi-3 Medium)      16GB       32GB               10GB+       ~10 tok/s
70B+ (Llama 3 70B)      64GB       128GB              48GB+       ~5 tok/s
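
To see where your machine lands in this table, check available RAM and VRAM before pulling a large model:

# Linux: system RAM and NVIDIA VRAM
free -h
nvidia-smi --query-gpu=name,memory.total --format=csv

# macOS: unified memory (shared between CPU and GPU)
sysctl -n hw.memsize | awk '{printf "%.0f GB\n", $1/1024/1024/1024}'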

GPU Acceleration

For optimal performance, use GPU acceleration:

NVIDIA (Recommended)

Best support with CUDA. Works out of the box with Ollama.

  • RTX 3060 (12GB) - 7B models
  • RTX 3080/4080 (10-16GB) - 14B models
  • RTX 3090/4090 (24GB) - 30B+ models

Apple Silicon

Excellent Metal support. Unified memory lets the GPU share the full system RAM, so larger models fit without dedicated VRAM.

  • M1/M2 (8GB) - 7B models
  • M1/M2 Pro (16GB) - 14B models
  • M1/M2 Max/Ultra (32-128GB) - 70B+ models

AMD

ROCm support available but less mature than CUDA.

  • RX 6800/7900 - Good support
  • Requires ROCm setup
  • Check Ollama docs for compatibility
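
Whichever GPU you have, you can confirm Ollama is actually using it: after sending a prompt, the PROCESSOR column of ollama ps should read 100% GPU (or a GPU/CPU split if the model only partially fits in VRAM).

# Check where the currently loaded model is running
ollama ps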

Hybrid Mode: Local + Cloud

Memory supports switching between local and cloud models based on task complexity: use local models for routine queries and escalate to Claude or GPT-4 for complex reasoning.

# config/memory.yaml - Hybrid configuration

llm:
  default_provider: ollama
  
  providers:
    ollama:
      base_url: http://localhost:11434
      model: mistral:7b
    
    claude:
      api_key: ${ANTHROPIC_API_KEY}
      model: claude-3-sonnet
    
    openai:
      api_key: ${OPENAI_API_KEY}
      model: gpt-4o
  
  routing:
    simple_queries: ollama # "What's my schedule?"
    code_tasks: ollama # Code completion
    complex_analysis: claude # Deep reasoning
    creative_writing: claude # Long-form content
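
Under the hood, this routing amounts to choosing an endpoint per request. The sketch below is illustrative only, not Memory's actual implementation: it assumes a route_query helper, maps the task labels from the config above to either the local Ollama API or the Anthropic Messages API, and hard-codes the dated model ID claude-3-sonnet-20240229 that the public API expects.

#!/usr/bin/env bash
# Illustrative routing sketch - not Memory's actual code.
route_query() {
  local task="$1" prompt="$2"
  case "$task" in
    complex_analysis|creative_writing)
      # Escalate to Claude via the Anthropic Messages API
      curl -s https://api.anthropic.com/v1/messages \
        -H "x-api-key: $ANTHROPIC_API_KEY" \
        -H "anthropic-version: 2023-06-01" \
        -H "content-type: application/json" \
        -d "{\"model\": \"claude-3-sonnet-20240229\", \"max_tokens\": 1024,
             \"messages\": [{\"role\": \"user\", \"content\": \"$prompt\"}]}"
      ;;
    *)
      # Everything else stays on the local Ollama instance
      curl -s http://localhost:11434/api/generate \
        -d "{\"model\": \"mistral:7b\", \"prompt\": \"$prompt\", \"stream\": false}"
      ;;
  esac
}

route_query simple_queries "What's on my schedule today?"
route_query complex_analysis "Compare these three contract drafts and flag the risks."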

Cost Comparison

Running Mistral 7B locally costs approximately $0.001/1K tokens (electricity only), compared to $0.015/1K tokens for Claude 3 Sonnet or $0.03/1K tokens for GPT-4. For 100K tokens/day, that's ~$3/month local vs ~$45-90/month cloud.