Local LLMs

Run powerful language models locally for complete privacy and zero API costs. Memory supports Ollama for easy local inference with GPU acceleration.

Why Local LLMs?

Complete Privacy

Your data never leaves your machine. No API calls, no cloud processing, no data sharing.

Zero API Costs

After the initial setup, inference costs nothing beyond electricity. No per-token charges, no monthly bills.

Offline Capable

Works without internet. Perfect for sensitive environments or travel.

Low Latency

No network round-trips. Responses start immediately for a snappier experience.

Setting Up Ollama

Ollama is the easiest way to run local LLMs. Install it with a single command, then verify the install and start the service:

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version
ollama version 0.5.4

# Start the Ollama service
ollama serve
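
Once the service is running, you can confirm the API is reachable on its default port (11434). The /api/tags endpoint lists installed models, so an empty list at this stage is expected:

# Check that the local Ollama API is responding (returns installed models as JSON)
curl http://localhost:11434/api/tags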

Recommended Models

These models offer the best balance of quality and performance for Memory:

Qwen 2.5 7B

4.4 GB

Strong multilingual model from Alibaba. Excellent at coding and structured tasks. A good alternative to Mistral 7B, the default chat model in the configuration below.

Parameters 7B
Context 32K
RAM Needed 8GB+
Speed Fast

Phi-3 Medium

7.9 GB

Microsoft's efficient model. Strong reasoning for its size. Best for complex analytical tasks.

Parameters 14B
Context 128K
RAM Needed 16GB+
Speed Medium

Llama 3.1 8B

4.7 GB

Meta's latest open model. Excellent instruction following and broad knowledge.

Parameters 8B
Context 128K
RAM Needed 8GB+
Speed Fast

Gemma 2 9B

5.4 GB

Google's efficient model. Great for conversational tasks and general assistance.

Parameters 9B
Context 8K
RAM Needed 10GB+
Speed Fast

TinyLlama 1.1B

637 MB

Ultra-lightweight model for resource-constrained systems. Basic tasks only.

Parameters 1.1B
Context 2K
RAM Needed 4GB+
Speed Very Fast

Installing Models

# Pull recommended models
ollama pull mistral:7b
ollama pull qwen2.5:7b
ollama pull phi3:14b

# Pull the embedding model (required for Memory's semantic search)
ollama pull nomic-embed-text

# List installed models
ollama list
NAME                ID              SIZE      MODIFIED
mistral:7b          f974a74358d6    4.1 GB    2 days ago
qwen2.5:7b          845dbda0ea48    4.4 GB    2 days ago
nomic-embed-text    0a109f422b47    274 MB    2 days ago
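
Before pointing Memory at these models, a quick smoke test from the command line confirms they load and respond. The prompts below are just placeholders:

# One-off generation to confirm the chat model loads and answers
ollama run mistral:7b "Summarize the benefits of local inference in one sentence."

# Confirm the embedding model responds via the local API
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "hello world"}'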

Configuring Memory

Update your Memory configuration to use local models:

# config/memory.yaml

llm:
  provider: ollama
  base_url: http://localhost:11434
  
  # Primary chat model
  model: mistral:7b
  
  # Embedding model for semantic search
  embedding_model: nomic-embed-text
  
  # Model selection strategy
  auto_select: true # Use different models based on task
  models:
    fast: mistral:7b # Quick responses
    balanced: qwen2.5:7b # General use
    complex: phi3:14b # Complex reasoning
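
It is also worth verifying that every model referenced in this config is actually installed before starting Memory. A small loop like the one below (model names copied from the config above) does the check:

# Verify each model referenced in memory.yaml is installed locally
for m in mistral:7b qwen2.5:7b phi3:14b nomic-embed-text; do
  ollama show "$m" > /dev/null 2>&1 && echo "ok: $m" || echo "missing: $m"
done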

Hardware Requirements

Model Size              Min RAM    Recommended RAM    GPU VRAM    CPU Speed
1-3B (TinyLlama)        4GB        8GB                2GB+        ~30 tok/s
7B (Mistral, Qwen)      8GB        16GB               6GB+        ~20 tok/s
14B (Phi-3 Medium)      16GB       32GB               10GB+       ~10 tok/s
70B+ (Llama 3 70B)      64GB       128GB              48GB+       ~5 tok/s
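
To see where your machine lands in this table, check available RAM and VRAM before pulling a large model:

# Linux: system RAM and NVIDIA VRAM
free -h
nvidia-smi --query-gpu=name,memory.total --format=csv

# macOS: unified memory (shared between CPU and GPU)
sysctl -n hw.memsize | awk '{printf "%.0f GB\n", $1/1024/1024/1024}'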

GPU Acceleration

For optimal performance, use GPU acceleration:

NVIDIA (Recommended)

Best support with CUDA. Works out of the box with Ollama.

  • RTX 3060 (12GB) - 7B models
  • RTX 3080/4080 (10-16GB) - 14B models
  • RTX 3090/4090 (24GB) - 30B+ models

Apple Silicon

Excellent Metal support. Unified memory lets the GPU share the full system RAM, so larger models fit without dedicated VRAM.

  • M1/M2 (8GB) - 7B models
  • M1/M2 Pro (16GB) - 14B models
  • M1/M2 Max/Ultra (32-128GB) - 70B+ models

AMD

ROCm support available but less mature than CUDA.

  • RX 6800/7900 - Good support
  • Requires ROCm setup
  • Check Ollama docs for compatibility
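
Whichever GPU you have, you can confirm Ollama is actually using it: after sending a prompt, the PROCESSOR column of ollama ps should read 100% GPU (or a GPU/CPU split if the model only partially fits in VRAM).

# Check where the currently loaded model is running
ollama ps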

Hybrid Mode: Local + Cloud

Memory supports switching between local and cloud models based on task complexity: use local models for routine queries and escalate to Claude or GPT-4 for complex reasoning.

# config/memory.yaml - Hybrid configuration

llm:
  default_provider: ollama
  
  providers:
    ollama:
      base_url: http://localhost:11434
      model: mistral:7b
    
    claude:
      api_key: ${ANTHROPIC_API_KEY}
      model: claude-3-sonnet
    
    openai:
      api_key: ${OPENAI_API_KEY}
      model: gpt-4o
  
  routing:
    simple_queries: ollama # "What's my schedule?"
    code_tasks: ollama # Code completion
    complex_analysis: claude # Deep reasoning
    creative_writing: claude # Long-form content
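
Under the hood, this routing amounts to choosing an endpoint per request. The sketch below is illustrative only, not Memory's actual implementation: it assumes a route_query helper, maps the task labels from the config above to either the local Ollama API or the Anthropic Messages API, and hard-codes the dated model ID claude-3-sonnet-20240229 that the public API expects.

#!/usr/bin/env bash
# Illustrative routing sketch - not Memory's actual code.
route_query() {
  local task="$1" prompt="$2"
  case "$task" in
    complex_analysis|creative_writing)
      # Escalate to Claude via the Anthropic Messages API
      curl -s https://api.anthropic.com/v1/messages \
        -H "x-api-key: $ANTHROPIC_API_KEY" \
        -H "anthropic-version: 2023-06-01" \
        -H "content-type: application/json" \
        -d "{\"model\": \"claude-3-sonnet-20240229\", \"max_tokens\": 1024,
             \"messages\": [{\"role\": \"user\", \"content\": \"$prompt\"}]}"
      ;;
    *)
      # Everything else stays on the local Ollama instance
      curl -s http://localhost:11434/api/generate \
        -d "{\"model\": \"mistral:7b\", \"prompt\": \"$prompt\", \"stream\": false}"
      ;;
  esac
}

route_query simple_queries "What's on my schedule today?"
route_query complex_analysis "Compare these three contract drafts and flag the risks."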

Cost Comparison

Running Mistral 7B locally costs approximately $0.001/1K tokens (electricity only), compared to $0.015/1K tokens for Claude 3 Sonnet or $0.03/1K tokens for GPT-4. For 100K tokens/day, that's ~$3/month local vs ~$45-90/month cloud.