AWS Cloud Architecture

The Memory demo runs on AWS with a clean separation between static frontend and GPU-accelerated backend. This architecture balances cost, performance, and scalability.

Architecture overview (diagram): users connect over HTTPS to CloudFront (CDN + SSL termination at myai.senuamedia.com), which serves the Vue.js SPA (~100KB) from an S3 bucket. API calls are routed via Route 53 DNS to ai.senuamedia.com, an EC2 g4dn.xlarge GPU instance running Nginx as a reverse proxy in front of the FastAPI Memory core and Ollama for GPU inference over local SLM models (Mistral 7B, Qwen 2.5 7B, Phi-3 14B, ~50GB total). TLS is provided by a Let's Encrypt certificate, and 100GB of EBS gp3 storage holds the models and data (SQLite + ChromaDB). Total monthly cost is roughly $150 (Spot instance + storage + transfer).

Infrastructure Components

Static Frontend: S3 + CloudFront CDN

The Vue.js single-page application is served from S3 with CloudFront providing global CDN distribution and SSL termination. The entire frontend is under 100KB, ensuring fast load times worldwide.

  • Bundle size: <100KB
  • CDN latency: <50ms
  • Hosting cost: ~$5/mo
  • Encryption: TLS 1.3
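
Deployment can be scripted end to end. A minimal sketch using boto3, assuming the bundle has been built into a local dist/ folder; the bucket name and CloudFront distribution ID below are placeholders, not the project's actual values:

```python
# Hypothetical deploy sketch: sync the built SPA to S3, then invalidate CloudFront.
import mimetypes
import pathlib
import time

import boto3

BUCKET = "myai-frontend-bucket"          # placeholder bucket name
DISTRIBUTION_ID = "E1234567890ABC"       # placeholder CloudFront distribution ID

s3 = boto3.client("s3")
cloudfront = boto3.client("cloudfront")

# Upload every file under dist/ with a sensible Content-Type.
for path in pathlib.Path("dist").rglob("*"):
    if path.is_file():
        content_type, _ = mimetypes.guess_type(path.name)
        s3.upload_file(
            str(path), BUCKET, str(path.relative_to("dist")),
            ExtraArgs={"ContentType": content_type or "application/octet-stream"},
        )

# Invalidate the CDN cache so the new bundle is served immediately.
cloudfront.create_invalidation(
    DistributionId=DISTRIBUTION_ID,
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/*"]},
        "CallerReference": str(time.time()),
    },
)
```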
GPU Backend Server: EC2 g4dn.xlarge

The backend runs on a GPU-enabled EC2 instance with NVIDIA T4 GPU for local LLM inference. Ollama manages the models, FastAPI handles the REST API, and Nginx provides reverse proxy with Let's Encrypt SSL.

  • Compute: 4 vCPU
  • RAM: 16GB
  • GPU VRAM: 16GB
  • SSD storage: 100GB
Local SLM Models: Ollama Model Server

Three production-ready Small Language Models are pre-loaded, each optimized for different use cases. All inference happens locally on the GPU with zero external API calls.

Model | Size | Strengths | Cost
Mistral 7B | ~4.1GB | General purpose, excellent instruction following | Free
Qwen 2.5 7B | ~4.4GB | Strong reasoning, multilingual support | Free
Phi-3 14B | ~8.0GB | Complex reasoning, longer context | Free
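
Since Ollama exposes a local HTTP API (port 11434 by default), the backend can call any of these models without traffic leaving the instance. A minimal sketch against the /api/generate endpoint; the prompt and model tag are illustrative:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def local_generate(prompt: str, model: str = "mistral") -> str:
    """Run a single non-streaming completion against a locally hosted model."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Example: pick the model that fits the task from the table above.
print(local_generate("Summarize the three-tier memory design in one sentence.",
                     model="qwen2.5:7b"))
```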

Cost Comparison: Local SLM vs Cloud LLM APIs

The economics of running local models are compelling. After the initial infrastructure cost, inference is essentially free, eliminating the per-token charges that make cloud LLM APIs expensive at scale.

Cloud LLM APIs Only

  • $0.003 - $0.06 per 1K tokens (varies by model)
  • Costs scale linearly with usage
  • No predictable monthly budget
  • Data sent to external servers
  • Rate limits and quotas
  • Vendor lock-in risk

Local SLM + Optional Cloud

  • Fixed monthly infrastructure cost
  • Unlimited local inference at no extra cost
  • Predictable budget regardless of usage
  • Data stays on your infrastructure
  • No rate limits for local models
  • Escalate to cloud LLMs only when needed

Scenario | Cloud API Cost | Memory + Local SLM | Monthly Savings
Light usage (50K queries/mo) | $150 - $500 | ~$150 fixed | $0 - $350
Moderate usage (200K queries/mo) | $600 - $2,000 | ~$150 fixed | $450 - $1,850
Heavy usage (500K queries/mo) | $1,500 - $5,000 | ~$150 fixed | $1,350 - $4,850
Enterprise (1M+ queries/mo) | $3,000 - $10,000+ | ~$400 (larger instance) | $2,600 - $9,600+
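
As a rough back-of-the-envelope check, the break-even point follows directly from an assumed per-query token count and API price; the inputs below are illustrative assumptions, not measured values:

```python
# Illustrative break-even estimate: at what monthly query volume does a fixed
# ~$150 GPU instance undercut a pay-per-token cloud API? All inputs are assumptions.
FIXED_MONTHLY_COST = 150.0     # spot g4dn.xlarge + storage + transfer (from the table)
TOKENS_PER_QUERY = 1_000       # assumed prompt + completion tokens per query
PRICE_PER_1K_TOKENS = 0.003    # assumed low-end cloud API price

cost_per_query = (TOKENS_PER_QUERY / 1_000) * PRICE_PER_1K_TOKENS
break_even_queries = FIXED_MONTHLY_COST / cost_per_query
print(f"Break-even at ~{break_even_queries:,.0f} queries/month")  # ~50,000
```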

The Hybrid Sweet Spot

Route 80-90% of queries to local SLMs (free after infrastructure), and only escalate complex reasoning tasks to Claude or GPT-4. This gives you the best of both worlds: low costs for routine queries, frontier capabilities when needed.
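
One way to implement this split is a small router in front of the model adapters. A minimal sketch with a deliberately crude complexity heuristic; both generate functions are placeholders for the local and cloud adapters described under the technology stack:

```python
# Hypothetical routing sketch: keep routine queries on the local SLM and
# escalate only the hard ones to a paid cloud model.

def local_generate(prompt: str) -> str:    # placeholder: local Ollama adapter
    raise NotImplementedError

def cloud_generate(prompt: str) -> str:    # placeholder: Claude/GPT-4 adapter
    raise NotImplementedError

ESCALATION_KEYWORDS = ("step by step", "prove", "compare in detail")

def is_complex(query: str) -> bool:
    """Crude heuristic: long queries or certain phrasings suggest harder reasoning."""
    return len(query) > 500 or any(k in query.lower() for k in ESCALATION_KEYWORDS)

def answer(query: str) -> str:
    return cloud_generate(query) if is_complex(query) else local_generate(query)
```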

Hosting Cost Breakdown

Resource | Specification | On-Demand | Spot Instance
EC2 g4dn.xlarge | 4 vCPU, 16GB RAM, T4 GPU | $380/mo | $120/mo
EBS gp3 Storage | 100GB SSD | $10/mo | $10/mo
S3 + CloudFront | Static hosting + CDN | ~$5/mo | ~$5/mo
Route 53 | DNS hosting | $1/mo | $1/mo
Data Transfer | ~100GB/mo estimated | ~$10/mo | ~$10/mo
Total | | ~$406/mo | ~$146/mo

Spot Instance Strategy

Using EC2 Spot Instances reduces GPU costs by ~70%. For personal/demo use, occasional interruptions are acceptable. For production, use On-Demand or Reserved Instances for guaranteed availability.
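
For reference, a spot-backed instance is requested with the standard run_instances call plus a market option. A minimal boto3 sketch; the region, AMI ID, and key pair name are placeholders to adapt:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: a GPU-ready AMI (e.g. a Deep Learning AMI)
    InstanceType="g4dn.xlarge",
    KeyName="my-keypair",              # placeholder key pair
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            # Without MaxPrice the request defaults to the on-demand price cap,
            # so you only ever pay the current spot rate.
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```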

Three-Tier Memory System

The memory architecture mirrors human cognition with three distinct tiers, each optimized for different access patterns and retention characteristics.

  • Short-Term Memory: recent messages, session context, O(1) access
  • Long-Term Memory: vector embeddings, semantic search via ChromaDB
  • Persistent Memory: user profile, preferences, core facts
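
A condensed sketch of how the three tiers could map onto code, using a deque for short-term context, a ChromaDB collection for semantic recall, and a plain dict standing in for the SQLite-backed persistent store; names and sizes are illustrative, not the demo's actual schema:

```python
from collections import deque

import chromadb

class MemoryStore:
    """Illustrative three-tier memory: deque, ChromaDB collection, key-value store."""

    def __init__(self, path: str = "./memory_db"):
        # Short-term: last N messages, O(1) append and access.
        self.short_term: deque[str] = deque(maxlen=20)
        # Long-term: vector embeddings with semantic search via ChromaDB.
        self.chroma = chromadb.PersistentClient(path=path)
        self.long_term = self.chroma.get_or_create_collection("long_term_memory")
        # Persistent: core facts and preferences (SQLite-backed in the real app).
        self.persistent: dict[str, str] = {}

    def remember(self, message_id: str, text: str) -> None:
        self.short_term.append(text)
        self.long_term.add(ids=[message_id], documents=[text])

    def recall(self, query: str, n: int = 3) -> list[str]:
        """Semantic search over long-term memory."""
        results = self.long_term.query(query_texts=[query], n_results=n)
        return results["documents"][0]
```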

Technology Stack

Backend

Python 3.11, FastAPI, SQLite for structured data, ChromaDB for vector storage. Async throughout for high concurrency.
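
As an illustration of the async pattern (not the demo's actual routes), a minimal FastAPI endpoint using aiosqlite so SQLite reads don't block the event loop; the table and endpoint names are assumptions:

```python
import aiosqlite
from fastapi import FastAPI

app = FastAPI()
DB_PATH = "memory.db"  # assumed SQLite file

@app.get("/messages/{session_id}")
async def get_messages(session_id: str) -> list[dict]:
    """Fetch a session's recent messages without blocking other requests."""
    async with aiosqlite.connect(DB_PATH) as db:
        db.row_factory = aiosqlite.Row
        cursor = await db.execute(
            "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id DESC LIMIT 20",
            (session_id,),
        )
        rows = await cursor.fetchall()
    return [dict(row) for row in rows]
```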

Frontend

Vue.js 3 single-file application. No build step is required; the app loads directly from a single HTML file.

LLM Integration

Ollama for local inference, with adapters for Claude and OpenAI APIs. Model-agnostic prompt templates.
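
The adapter idea can be sketched as a single Protocol with interchangeable backends; the class names are illustrative, with the OpenAI client shown as the cloud example:

```python
from typing import Protocol

import requests
from openai import OpenAI

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class OllamaModel:
    """Local inference through Ollama's HTTP API."""
    def __init__(self, model: str = "mistral"):
        self.model = model

    def complete(self, prompt: str) -> str:
        r = requests.post("http://localhost:11434/api/generate",
                          json={"model": self.model, "prompt": prompt, "stream": False})
        return r.json()["response"]

class OpenAIModel:
    """Cloud escalation path: same interface, different backend."""
    def __init__(self, model: str = "gpt-4o"):
        self.client = OpenAI()
        self.model = model

    def complete(self, prompt: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model, messages=[{"role": "user", "content": prompt}])
        return resp.choices[0].message.content
```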

Security

JWT authentication with MFA support. bcrypt password hashing. All API endpoints require authentication.
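
A condensed sketch of the hashing and token flow using the bcrypt and PyJWT packages; the secret, expiry, and claim layout are placeholders, not the demo's actual configuration:

```python
import datetime

import bcrypt
import jwt  # PyJWT

SECRET = "change-me"              # placeholder; load from the environment in practice
TOKEN_TTL = datetime.timedelta(hours=1)

def hash_password(plain: str) -> bytes:
    return bcrypt.hashpw(plain.encode(), bcrypt.gensalt())

def verify_password(plain: str, hashed: bytes) -> bool:
    return bcrypt.checkpw(plain.encode(), hashed)

def issue_token(user_id: str) -> str:
    claims = {"sub": user_id,
              "exp": datetime.datetime.now(datetime.timezone.utc) + TOKEN_TTL}
    return jwt.encode(claims, SECRET, algorithm="HS256")

def verify_token(token: str) -> str:
    # Raises if the signature is invalid or the token has expired.
    return jwt.decode(token, SECRET, algorithms=["HS256"])["sub"]
```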
