AWS Cloud Architecture

The Memory demo runs on AWS with a clean separation between static frontend and GPU-accelerated backend. This architecture balances cost, performance, and scalability.

Architecture overview (diagram): users connect over HTTPS to CloudFront (CDN + SSL termination at myai.senuamedia.com), which serves the Vue.js SPA (~100KB) from an S3 bucket. API calls are routed via Route 53 DNS to ai.senuamedia.com, an EC2 g4dn.xlarge GPU instance running Nginx as a reverse proxy in front of the FastAPI Memory core and Ollama for GPU inference over local SLM models (Mistral 7B, Qwen 2.5 7B, Phi-3 14B, ~50GB total). TLS is provided by a Let's Encrypt certificate, and 100GB of EBS gp3 storage holds the models and data (SQLite + ChromaDB). Total monthly cost is roughly $150 (Spot instance + storage + transfer).

Infrastructure Components

Static Frontend: S3 + CloudFront CDN

The Vue.js single-page application is served from S3 with CloudFront providing global CDN distribution and SSL termination. The entire frontend is under 100KB, ensuring fast load times worldwide.

  • Bundle size: <100KB
  • CDN latency: <50ms
  • Hosting cost: ~$5/mo
  • Encryption: TLS 1.3
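
Deployment can be scripted end to end. A minimal sketch using boto3, assuming the bundle has been built into a local dist/ folder; the bucket name and CloudFront distribution ID below are placeholders, not the project's actual values:

```python
# Hypothetical deploy sketch: sync the built SPA to S3, then invalidate CloudFront.
import mimetypes
import pathlib
import time

import boto3

BUCKET = "myai-frontend-bucket"          # placeholder bucket name
DISTRIBUTION_ID = "E1234567890ABC"       # placeholder CloudFront distribution ID

s3 = boto3.client("s3")
cloudfront = boto3.client("cloudfront")

# Upload every file under dist/ with a sensible Content-Type.
for path in pathlib.Path("dist").rglob("*"):
    if path.is_file():
        content_type, _ = mimetypes.guess_type(path.name)
        s3.upload_file(
            str(path), BUCKET, str(path.relative_to("dist")),
            ExtraArgs={"ContentType": content_type or "application/octet-stream"},
        )

# Invalidate the CDN cache so the new bundle is served immediately.
cloudfront.create_invalidation(
    DistributionId=DISTRIBUTION_ID,
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/*"]},
        "CallerReference": str(time.time()),
    },
)
```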
GPU Backend Server: EC2 g4dn.xlarge

The backend runs on a GPU-enabled EC2 instance with NVIDIA T4 GPU for local LLM inference. Ollama manages the models, FastAPI handles the REST API, and Nginx provides reverse proxy with Let's Encrypt SSL.

  • Compute: 4 vCPU
  • RAM: 16GB
  • GPU VRAM: 16GB
  • SSD storage: 100GB
Local SLM Models: Ollama Model Server

Three production-ready Small Language Models are pre-loaded, each optimized for different use cases. All inference happens locally on the GPU with zero external API calls.

Model | Size | Strengths | Cost
Mistral 7B | ~4.1GB | General purpose, excellent instruction following | Free
Qwen 2.5 7B | ~4.4GB | Strong reasoning, multilingual support | Free
Phi-3 14B | ~8.0GB | Complex reasoning, longer context | Free
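
Since Ollama exposes a local HTTP API (port 11434 by default), the backend can call any of these models without traffic leaving the instance. A minimal sketch against the /api/generate endpoint; the prompt and model tag are illustrative:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def local_generate(prompt: str, model: str = "mistral") -> str:
    """Run a single non-streaming completion against a locally hosted model."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Example: pick the model that fits the task from the table above.
print(local_generate("Summarize the three-tier memory design in one sentence.",
                     model="qwen2.5:7b"))
```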

Cost Comparison: Local SLM vs Cloud LLM APIs

The economics of running local models are compelling. After the initial infrastructure cost, inference is essentially free, eliminating the per-token charges that make cloud LLM APIs expensive at scale.

Cloud LLM APIs Only

  • $0.003 - $0.06 per 1K tokens (varies by model)
  • Costs scale linearly with usage
  • No predictable monthly budget
  • Data sent to external servers
  • Rate limits and quotas
  • Vendor lock-in risk

Local SLM + Optional Cloud

  • Fixed monthly infrastructure cost
  • Unlimited local inference at no extra cost
  • Predictable budget regardless of usage
  • Data stays on your infrastructure
  • No rate limits for local models
  • Escalate to cloud LLMs only when needed

Scenario | Cloud API Cost | Memory + Local SLM | Monthly Savings
Light usage (50K queries/mo) | $150 - $500 | ~$150 fixed | $0 - $350
Moderate usage (200K queries/mo) | $600 - $2,000 | ~$150 fixed | $450 - $1,850
Heavy usage (500K queries/mo) | $1,500 - $5,000 | ~$150 fixed | $1,350 - $4,850
Enterprise (1M+ queries/mo) | $3,000 - $10,000+ | ~$400 (larger instance) | $2,600 - $9,600+
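
As a rough back-of-the-envelope check, the break-even point follows directly from an assumed per-query token count and API price; the inputs below are illustrative assumptions, not measured values:

```python
# Illustrative break-even estimate: at what monthly query volume does a fixed
# ~$150 GPU instance undercut a pay-per-token cloud API? All inputs are assumptions.
FIXED_MONTHLY_COST = 150.0     # spot g4dn.xlarge + storage + transfer (from the table)
TOKENS_PER_QUERY = 1_000       # assumed prompt + completion tokens per query
PRICE_PER_1K_TOKENS = 0.003    # assumed low-end cloud API price

cost_per_query = (TOKENS_PER_QUERY / 1_000) * PRICE_PER_1K_TOKENS
break_even_queries = FIXED_MONTHLY_COST / cost_per_query
print(f"Break-even at ~{break_even_queries:,.0f} queries/month")  # ~50,000
```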

The Hybrid Sweet Spot

Route 80-90% of queries to local SLMs (free after infrastructure), and only escalate complex reasoning tasks to Claude or GPT-4. This gives you the best of both worlds: low costs for routine queries, frontier capabilities when needed.
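
One way to implement this split is a small router in front of the model adapters. A minimal sketch with a deliberately crude complexity heuristic; both generate functions are placeholders for the local and cloud adapters described under the technology stack:

```python
# Hypothetical routing sketch: keep routine queries on the local SLM and
# escalate only the hard ones to a paid cloud model.

def local_generate(prompt: str) -> str:    # placeholder: local Ollama adapter
    raise NotImplementedError

def cloud_generate(prompt: str) -> str:    # placeholder: Claude/GPT-4 adapter
    raise NotImplementedError

ESCALATION_KEYWORDS = ("step by step", "prove", "compare in detail")

def is_complex(query: str) -> bool:
    """Crude heuristic: long queries or certain phrasings suggest harder reasoning."""
    return len(query) > 500 or any(k in query.lower() for k in ESCALATION_KEYWORDS)

def answer(query: str) -> str:
    return cloud_generate(query) if is_complex(query) else local_generate(query)
```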

Hosting Cost Breakdown

Resource | Specification | On-Demand | Spot Instance
EC2 g4dn.xlarge | 4 vCPU, 16GB RAM, T4 GPU | $380/mo | $120/mo
EBS gp3 Storage | 100GB SSD | $10/mo | $10/mo
S3 + CloudFront | Static hosting + CDN | ~$5/mo | ~$5/mo
Route 53 | DNS hosting | $1/mo | $1/mo
Data Transfer | ~100GB/mo estimated | ~$10/mo | ~$10/mo
Total | | ~$406/mo | ~$146/mo

Spot Instance Strategy

Using EC2 Spot Instances reduces GPU costs by ~70%. For personal/demo use, occasional interruptions are acceptable. For production, use On-Demand or Reserved Instances for guaranteed availability.
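
For reference, a spot-backed instance is requested with the standard run_instances call plus a market option. A minimal boto3 sketch; the region, AMI ID, and key pair name are placeholders to adapt:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: a GPU-ready AMI (e.g. a Deep Learning AMI)
    InstanceType="g4dn.xlarge",
    KeyName="my-keypair",              # placeholder key pair
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            # Without MaxPrice the request defaults to the on-demand price cap,
            # so you only ever pay the current spot rate.
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```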

Three-Tier Memory System

The memory architecture mirrors human cognition with three distinct tiers, each optimized for different access patterns and retention characteristics.

  • Short-Term Memory: recent messages, session context, O(1) access
  • Long-Term Memory: vector embeddings, semantic search via ChromaDB
  • Persistent Memory: user profile, preferences, core facts
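
A condensed sketch of how the three tiers could map onto code, using a deque for short-term context, a ChromaDB collection for semantic recall, and a plain dict standing in for the SQLite-backed persistent store; names and sizes are illustrative, not the demo's actual schema:

```python
from collections import deque

import chromadb

class MemoryStore:
    """Illustrative three-tier memory: deque, ChromaDB collection, key-value store."""

    def __init__(self, path: str = "./memory_db"):
        # Short-term: last N messages, O(1) append and access.
        self.short_term: deque[str] = deque(maxlen=20)
        # Long-term: vector embeddings with semantic search via ChromaDB.
        self.chroma = chromadb.PersistentClient(path=path)
        self.long_term = self.chroma.get_or_create_collection("long_term_memory")
        # Persistent: core facts and preferences (SQLite-backed in the real app).
        self.persistent: dict[str, str] = {}

    def remember(self, message_id: str, text: str) -> None:
        self.short_term.append(text)
        self.long_term.add(ids=[message_id], documents=[text])

    def recall(self, query: str, n: int = 3) -> list[str]:
        """Semantic search over long-term memory."""
        results = self.long_term.query(query_texts=[query], n_results=n)
        return results["documents"][0]
```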

Technology Stack

Backend

Python 3.11, FastAPI, SQLite for structured data, ChromaDB for vector storage. Async throughout for high concurrency.
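
As an illustration of the async pattern (not the demo's actual routes), a minimal FastAPI endpoint using aiosqlite so SQLite reads don't block the event loop; the table and endpoint names are assumptions:

```python
import aiosqlite
from fastapi import FastAPI

app = FastAPI()
DB_PATH = "memory.db"  # assumed SQLite file

@app.get("/messages/{session_id}")
async def get_messages(session_id: str) -> list[dict]:
    """Fetch a session's recent messages without blocking other requests."""
    async with aiosqlite.connect(DB_PATH) as db:
        db.row_factory = aiosqlite.Row
        cursor = await db.execute(
            "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id DESC LIMIT 20",
            (session_id,),
        )
        rows = await cursor.fetchall()
    return [dict(row) for row in rows]
```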

Frontend

Vue.js 3 single-file application. No build step is required; the app loads directly from a single HTML file.

LLM Integration

Ollama for local inference, with adapters for Claude and OpenAI APIs. Model-agnostic prompt templates.
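
The adapter idea can be sketched as a single Protocol with interchangeable backends; the class names are illustrative, with the OpenAI client shown as the cloud example:

```python
from typing import Protocol

import requests
from openai import OpenAI

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class OllamaModel:
    """Local inference through Ollama's HTTP API."""
    def __init__(self, model: str = "mistral"):
        self.model = model

    def complete(self, prompt: str) -> str:
        r = requests.post("http://localhost:11434/api/generate",
                          json={"model": self.model, "prompt": prompt, "stream": False})
        return r.json()["response"]

class OpenAIModel:
    """Cloud escalation path: same interface, different backend."""
    def __init__(self, model: str = "gpt-4o"):
        self.client = OpenAI()
        self.model = model

    def complete(self, prompt: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model, messages=[{"role": "user", "content": prompt}])
        return resp.choices[0].message.content
```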

Security

JWT authentication with MFA support. bcrypt password hashing. All API endpoints require authentication.
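
A condensed sketch of the hashing and token flow using the bcrypt and PyJWT packages; the secret, expiry, and claim layout are placeholders, not the demo's actual configuration:

```python
import datetime

import bcrypt
import jwt  # PyJWT

SECRET = "change-me"              # placeholder; load from the environment in practice
TOKEN_TTL = datetime.timedelta(hours=1)

def hash_password(plain: str) -> bytes:
    return bcrypt.hashpw(plain.encode(), bcrypt.gensalt())

def verify_password(plain: str, hashed: bytes) -> bool:
    return bcrypt.checkpw(plain.encode(), hashed)

def issue_token(user_id: str) -> str:
    claims = {"sub": user_id,
              "exp": datetime.datetime.now(datetime.timezone.utc) + TOKEN_TTL}
    return jwt.encode(claims, SECRET, algorithm="HS256")

def verify_token(token: str) -> str:
    # Raises if the signature is invalid or the token has expired.
    return jwt.decode(token, SECRET, algorithms=["HS256"])["sub"]
```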
