AWS Cloud Architecture
The Memory demo runs on AWS with a clean separation between static frontend and GPU-accelerated backend. This architecture balances cost, performance, and scalability.
Infrastructure Components
Static Frontend
S3 + CloudFront CDN
The Vue.js single-page application is served from S3 with CloudFront providing global CDN distribution and SSL termination. The entire frontend is under 100KB, ensuring fast load times worldwide.
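A typical deploy loop for this setup uploads the built files to S3 and invalidates the CloudFront cache. The sketch below uses boto3; the bucket name and distribution ID are placeholders, not the demo's actual values.

```python
import mimetypes
import time
from pathlib import Path

import boto3

BUCKET = "memory-demo-frontend"     # hypothetical bucket name
DISTRIBUTION_ID = "E1234567890ABC"  # hypothetical distribution ID

s3 = boto3.client("s3")
cloudfront = boto3.client("cloudfront")

def deploy(dist_dir: str = "dist") -> None:
    """Upload the built SPA to S3, then purge the CDN cache."""
    for path in Path(dist_dir).rglob("*"):
        if not path.is_file():
            continue
        content_type, _ = mimetypes.guess_type(path.name)
        s3.upload_file(
            str(path), BUCKET, path.relative_to(dist_dir).as_posix(),
            ExtraArgs={"ContentType": content_type or "application/octet-stream"},
        )
    cloudfront.create_invalidation(
        DistributionId=DISTRIBUTION_ID,
        InvalidationBatch={
            "Paths": {"Quantity": 1, "Items": ["/*"]},
            "CallerReference": str(time.time()),  # must be unique per invalidation
        },
    )
```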
GPU Backend Server
EC2 g4dn.xlarge
The backend runs on a GPU-enabled EC2 instance with an NVIDIA T4 GPU for local LLM inference. Ollama serves the models, FastAPI handles the REST API, and Nginx acts as a reverse proxy with Let's Encrypt SSL.
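A minimal version of that request path might look like the following: FastAPI receives the request and forwards it to Ollama's local REST API (port 11434 is Ollama's default). The endpoint path and default model name are illustrative, not the demo's actual values.

```python
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

class Query(BaseModel):
    prompt: str
    model: str = "llama3.2"  # placeholder; any model pulled into Ollama works

@app.post("/api/chat")
async def chat(q: Query) -> dict:
    # Inference stays on the instance's T4 GPU; nothing leaves the box.
    async with httpx.AsyncClient(timeout=120) as client:
        r = await client.post(
            OLLAMA_URL,
            json={"model": q.model, "prompt": q.prompt, "stream": False},
        )
        r.raise_for_status()
    return {"response": r.json()["response"]}
```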
Local SLM Models
Ollama Model Server
Three production-ready Small Language Models are pre-loaded, each optimized for different use cases. All inference happens locally on the GPU with zero external API calls.
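The actual model roster isn't listed here, so the mapping below is illustrative; the pattern is simply a per-task lookup into the models Ollama already has loaded.

```python
# Illustrative roster only -- the demo's three SLMs may differ.
MODELS = {
    "chat": "llama3.2:3b",        # general conversation
    "code": "qwen2.5-coder:7b",   # code-centric queries
    "summarize": "phi3:mini",     # fast, cheap summarization
}

def pick_model(task: str) -> str:
    """Choose the locally served model tuned for a given use case."""
    return MODELS.get(task, MODELS["chat"])
```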
Cost Comparison: Local SLM vs Cloud LLM APIs
The economics of running local models are compelling. After the initial infrastructure cost, inference is essentially free, eliminating the per-token charges that make cloud LLM APIs expensive at scale.
Cloud LLM APIs Only
- $0.003 - $0.06 per 1K tokens (varies by model)
- Costs scale linearly with usage
- Unpredictable monthly spend
- Data sent to external servers
- Rate limits and quotas
- Vendor lock-in risk
Local SLM + Optional Cloud
- Fixed monthly infrastructure cost
- Unlimited local inference at no extra cost
- Predictable budget regardless of usage
- Data stays on your infrastructure
- No rate limits for local models
- Escalate to cloud LLMs only when needed
| Scenario | Cloud API Cost | Memory + Local SLM | Monthly Savings |
|---|---|---|---|
| Light usage (50K queries/mo) | $150 - $500 | ~$150 fixed | $0 - $350 |
| Moderate usage (200K queries/mo) | $600 - $2,000 | ~$150 fixed | $450 - $1,850 |
| Heavy usage (500K queries/mo) | $1,500 - $5,000 | ~$150 fixed | $1,350 - $4,850 |
| Enterprise (1M+ queries/mo) | $3,000 - $10,000+ | ~$400 (larger instance) | $2,600 - $9,600+ |
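The break-even point falls out of simple arithmetic. Assuming ~1K tokens per query and the low end of the cloud pricing range, the fixed spot-instance cost is recovered at roughly the "light usage" volume:

```python
FIXED_MONTHLY = 150.0        # ~$150/mo spot-instance stack, from the table above
PRICE_PER_1K_TOKENS = 0.003  # low end of the cloud API range
TOKENS_PER_QUERY = 1_000     # assumption: ~1K tokens per query

cost_per_query = PRICE_PER_1K_TOKENS * TOKENS_PER_QUERY / 1_000
break_even = FIXED_MONTHLY / cost_per_query
print(f"Break-even at ~{break_even:,.0f} queries/mo")  # -> 50,000
```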
The Hybrid Sweet Spot
Route 80-90% of queries to local SLMs (free after infrastructure), and only escalate complex reasoning tasks to Claude or GPT-4. This gives you the best of both worlds: low costs for routine queries, frontier capabilities when needed.
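One way to implement that routing is a cheap heuristic (or a small classifier) in front of the two backends. The sketch below is an assumption about how such a router could look: `ask_local` reuses the Ollama endpoint from earlier, `ask_claude` uses the Anthropic SDK with an illustrative model alias, and the keyword heuristic is deliberately naive.

```python
import anthropic
import httpx

COMPLEX_MARKERS = ("prove", "analyze", "compare", "step by step")  # naive heuristic

async def ask_local(prompt: str, model: str = "llama3.2") -> str:
    async with httpx.AsyncClient(timeout=120) as client:
        r = await client.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
        )
        r.raise_for_status()
        return r.json()["response"]

async def ask_claude(prompt: str) -> str:
    client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the env
    msg = await client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model choice
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

async def answer(prompt: str) -> str:
    """Serve the routine 80-90% locally for free; pay only for the hard tail."""
    if any(m in prompt.lower() for m in COMPLEX_MARKERS):
        return await ask_claude(prompt)
    return await ask_local(prompt)
```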
Hosting Cost Breakdown
| Resource | Specification | On-Demand | Spot Instance |
|---|---|---|---|
| EC2 g4dn.xlarge | 4 vCPU, 16GB RAM, T4 GPU | $380/mo | $120/mo |
| EBS gp3 Storage | 100GB SSD | $10/mo | $10/mo |
| S3 + CloudFront | Static hosting + CDN | ~$5/mo | ~$5/mo |
| Route 53 | DNS hosting | $1/mo | $1/mo |
| Data Transfer | ~100GB/mo estimated | ~$10/mo | ~$10/mo |
| Total | | ~$406/mo | ~$146/mo |
Spot Instance Strategy
Using EC2 Spot Instances reduces GPU costs by ~70%. For personal/demo use, occasional interruptions are acceptable. For production, use On-Demand or Reserved Instances for guaranteed availability.
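With boto3, a spot launch of the same instance type is one parameter away from an on-demand launch. The AMI and key pair names below are placeholders; a persistent request with "stop" interruption behavior keeps the EBS root volume across interruptions.

```python
import boto3

ec2 = boto3.client("ec2")

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder; use a GPU/Deep Learning AMI
    InstanceType="g4dn.xlarge",
    KeyName="memory-demo-key",        # placeholder key pair
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "SpotInstanceType": "persistent",        # relaunch after interruption
            "InstanceInterruptionBehavior": "stop",  # stop, not terminate: EBS survives
        },
    },
)
print(resp["Instances"][0]["InstanceId"])
```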
Three-Tier Memory System
The memory architecture mirrors human cognition with three distinct tiers, each optimized for different access patterns and retention characteristics.
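The section doesn't name the tiers, so the sketch below assumes a common short-term / working / long-term split, purely to show the shape of a tiered store with age-based promotion between tiers.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

@dataclass
class MemoryItem:
    content: str
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class ThreeTierMemory:
    short_term: list[MemoryItem] = field(default_factory=list)  # current session
    working: list[MemoryItem] = field(default_factory=list)     # recent context
    long_term: list[MemoryItem] = field(default_factory=list)   # consolidated facts

    def consolidate(self, max_age: timedelta = timedelta(hours=1)) -> None:
        """Promote items that outlive the short-term window (cutoff is illustrative)."""
        now = datetime.now(timezone.utc)
        aged = [m for m in self.short_term if now - m.created > max_age]
        self.short_term = [m for m in self.short_term if m not in aged]
        self.working.extend(aged)
```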
Technology Stack
Backend
Python 3.11, FastAPI, SQLite for structured data, ChromaDB for vector storage. Async throughout for high concurrency.
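As a sketch of how the two stores divide the work (file paths and schema are illustrative): SQLite keeps the canonical record, ChromaDB keeps the embedding for semantic recall.

```python
import sqlite3

import chromadb

db = sqlite3.connect("memory.db")  # illustrative path
db.execute("CREATE TABLE IF NOT EXISTS memories (id TEXT PRIMARY KEY, content TEXT)")

chroma = chromadb.PersistentClient(path="./chroma")  # illustrative path
collection = chroma.get_or_create_collection("memories")

def store(memory_id: str, content: str) -> None:
    """Write the canonical row to SQLite and the embedding to ChromaDB."""
    db.execute("INSERT OR REPLACE INTO memories VALUES (?, ?)", (memory_id, content))
    db.commit()
    collection.add(ids=[memory_id], documents=[content])  # embedded automatically

def recall(query: str, k: int = 5) -> list[str]:
    """Semantic search over stored memories via the vector index."""
    hits = collection.query(query_texts=[query], n_results=k)
    return hits["documents"][0]
```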
Frontend
Vue.js 3 single-file application. No build step required; it loads directly from a single HTML file.
LLM Integration
Ollama for local inference, with adapters for Claude and OpenAI APIs. Model-agnostic prompt templates.
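The interface and template below are an illustration of what "model-agnostic" can mean in practice, not the demo's actual code: callers depend on one `complete()` signature, and the same template text is reused across the Ollama, Claude, and OpenAI adapters.

```python
from string import Template
from typing import Protocol

class LLMAdapter(Protocol):
    """Ollama, Claude, and OpenAI adapters all expose the same call."""
    async def complete(self, prompt: str) -> str: ...

# A model-agnostic template: no backend-specific formatting baked in.
SUMMARIZE = Template("Summarize the following in $length sentences:\n\n$text")

async def summarize(llm: LLMAdapter, text: str) -> str:
    return await llm.complete(SUMMARIZE.substitute(length=3, text=text))
```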
Security
JWT authentication with MFA support. bcrypt password hashing. All API endpoints require authentication.
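A minimal sketch of that flow with bcrypt and PyJWT (secret handling and expiry are illustrative; MFA verification would sit between the password check and token issuance):

```python
import datetime

import bcrypt
import jwt  # PyJWT

SECRET = "change-me"  # illustrative; load from a secrets manager in production

def hash_password(password: str) -> bytes:
    return bcrypt.hashpw(password.encode(), bcrypt.gensalt())

def verify_password(password: str, hashed: bytes) -> bool:
    return bcrypt.checkpw(password.encode(), hashed)

def issue_token(user_id: str) -> str:
    """Issue a short-lived JWT after password (and MFA) checks pass."""
    payload = {
        "sub": user_id,
        "exp": datetime.datetime.now(datetime.timezone.utc)
               + datetime.timedelta(hours=1),
    }
    return jwt.encode(payload, SECRET, algorithm="HS256")
```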