1. The Markov Property and Language Models

To understand why current AI systems lack personal memory, we must start with a fundamental concept from probability theory: the Markov property.

A stochastic process has the Markov property if the probability of future states depends only on the current state, not on the sequence of events that preceded it. Formally:

Markov Property
P(X_{n+1} | X_n, X_{n-1}, ..., X_1) = P(X_{n+1} | X_n)

Early language models were explicitly Markov chains. An n-gram model predicts the next word based only on the previous n-1 words:

N-gram Language Model
P(w_t | w_1, ..., w_{t-1}) ≈ P(w_t | w_{t-n+1}, ..., w_{t-1})

This approximation makes computation tractable but fundamentally limits the model's ability to capture long-range dependencies. The model has no memory of anything beyond its fixed context window.
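
As a concrete (if toy) illustration, the following Python sketch builds a bigram model, i.e. n = 2: next-word probabilities are estimated purely from counts of the single preceding word, so any earlier history is ignored. The two-sentence corpus is made up for the example.

```python
from collections import defaultdict, Counter

def train_bigram(corpus):
    """Count next-word frequencies conditioned only on the previous word (n = 2)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """P(w_t | w_{t-1}) via maximum-likelihood estimation from the counts."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

corpus = ["the model predicts the next word",
          "the next word depends on the previous word"]
counts = train_bigram(corpus)
print(next_word_probs(counts, "the"))  # conditioned only on "the", not the full history
```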

Key Insight

While modern transformers aren't technically Markov chains (they can attend to their full context window), they still exhibit a form of "session-level Markov property" - each conversation starts fresh with no memory of previous interactions.

2. Transformer Architecture and the Attention Mechanism

The transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), revolutionized language modeling by replacing recurrence with self-attention.

At its core, a transformer computes attention scores between all pairs of tokens in the input sequence:

Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where Q (queries), K (keys), and V (values) are linear projections of the input, and d_k is the dimension of the key vectors. This mechanism allows each token to "attend" to every other token in the sequence.
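
A minimal NumPy sketch of this computation (single head, no masking or batching, illustrative only):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                              # 4 tokens, d_k = d_v = 8
print(scaled_dot_product_attention(x, x, x).shape)       # (4, 8)
```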

Attention Computation Flow

Input Embeddings → Q, K, V Projections → Attention Scores → Weighted Sum → Output

The transformer's power comes from parallel processing of all positions and multi-head attention, which allows the model to attend to different types of relationships simultaneously:

Multi-Head Attention
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
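
Continuing the NumPy sketch above, a simplified multi-head version; note that, as in most practical implementations, the per-head matrices W_i^Q, W_i^K, W_i^V are taken here as column slices of single d_model × d_model projections rather than stored separately:

```python
import numpy as np

def attention(Q, K, V):
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Project X, split into h heads, attend per head, concatenate, apply W^O."""
    d_head = W_q.shape[1] // h
    heads = []
    for i in range(h):
        cols = slice(i * d_head, (i + 1) * d_head)
        heads.append(attention(X @ W_q[:, cols], X @ W_k[:, cols], X @ W_v[:, cols]))
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)   # (5, 16)
```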

3. Why LLMs Are Fundamentally Memoryless

Despite their sophistication, large language models have a critical limitation: they cannot form new memories. This emerges from several architectural and practical constraints:

Fixed Parameters After Training

Once training is complete, the model's weights are frozen. The knowledge encoded in those weights - the patterns learned from training data - cannot be updated through inference. When you chat with GPT-4 or Claude, you're not teaching them anything permanent.

Inference vs Training
Training: θ_{t+1} = θ_t − η ∇L(θ_t) (weights update)
Inference: θ remains constant (no learning)
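
A toy numeric contrast makes the distinction concrete (a made-up quadratic loss, not a real language model):

```python
import numpy as np

theta = np.array([2.0, -1.0])                 # toy "model parameters"
grad_L = lambda th: 2 * th                    # gradient of the toy loss L(θ) = ||θ||^2
eta = 0.1                                     # learning rate η

# Training step: parameters move against the gradient, θ_{t+1} = θ_t - η ∇L(θ_t)
theta = theta - eta * grad_L(theta)

# Inference: the forward pass reads θ but never writes it back
def forward(x, th):
    return x @ th                             # θ stays exactly as training left it

print(theta, forward(np.array([1.0, 1.0]), theta))
```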

Context Window Limitations

Transformers have a fixed context window - the maximum number of tokens they can process at once. While this has grown (from 2K to 128K+ tokens), it's still fundamentally bounded:

  • Computational cost scales quadratically with sequence length due to self-attention: O(n²)
  • Position encodings must accommodate longer sequences
  • Working memory is limited to what fits in the context
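
To make the quadratic term concrete: at a 128K-token context, the attention score matrix alone has roughly 128,000² ≈ 1.6 × 10¹⁰ entries per head per layer.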

No User-Specific Adaptation

An LLM serving millions of users cannot maintain individual state for each user. From the model's perspective, every user is indistinguishable except for the content of the current context window.

The Stateless Inference Problem

Each session is independent. Information from Session 1 never reaches Session 2.

Session 1: "I'm a Python developer working on ML pipelines..."
Session 2: the model has no knowledge of Session 1.

4. The Individual Context Problem

The memoryless nature of LLMs creates a fundamental mismatch with how humans expect to interact with intelligent systems. Consider what a human assistant naturally accumulates over time:

  • Preferences - Communication style, technical depth, formatting preferences
  • Context - Current projects, team members, organizational structure
  • History - Past decisions, resolved issues, learned patterns
  • Expertise model - What you know well, where you need help

An LLM lacks all of this. Every interaction requires re-establishing context, leading to:

  • Repetition overhead - Explaining the same context repeatedly
  • Generic responses - Advice that doesn't account for your specific situation
  • Lost continuity - No ability to reference "what we discussed last week"
  • Missed opportunities - Cannot proactively surface relevant past information

The Efficiency Gap

Studies suggest knowledge workers spend 20-30% of their time searching for information they've encountered before. An AI assistant without memory cannot help close this gap - it needs the same information re-provided every time.

5. Three-Tier Memory Architecture

Memory's architecture draws from cognitive science research on human memory systems. Rather than a single monolithic store, we implement three distinct tiers optimized for different temporal scales and access patterns:

Short-Term Memory (Working Memory)

Capacity: ~20 recent messages | Retention: Current session | Access: O(1) direct retrieval

Long-Term Memory (Episodic/Semantic)

Capacity: Unbounded | Retention: Persistent with decay | Access: O(log n) vector similarity search

Persistent Memory (Core Identity)

Capacity: Structured facts | Retention: Permanent | Access: O(1) key-value lookup
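
A minimal sketch of how the three tiers might be laid out as data structures (illustrative only; the class and method names here are hypothetical, not Memory's actual API):

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    # Short-term: bounded FIFO of recent (query, response) turns
    short_term: deque = field(default_factory=lambda: deque(maxlen=20))
    # Long-term: list of (embedding, text, timestamp) entries searched by similarity
    long_term: list = field(default_factory=list)
    # Persistent: structured key-value facts about the user
    persistent: dict = field(default_factory=dict)

    def remember_turn(self, query, response):
        self.short_term.append((query, response))    # O(1); oldest turn drops automatically

    def remember_fact(self, key, value):
        self.persistent[key] = value                 # O(1) key-value write

store = MemoryStore()
store.remember_fact("role", "Python developer working on ML pipelines")
store.remember_turn("How do I schedule my pipeline?", "Given your Airflow setup, ...")
```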

Cognitive Science Foundations

This architecture maps to established models of human memory:

  • Atkinson-Shiffrin Model - Sensory → Short-term → Long-term memory stores
  • Baddeley's Working Memory - Central executive managing phonological loop, visuospatial sketchpad, and episodic buffer
  • Tulving's Memory Systems - Distinction between episodic (events) and semantic (facts) memory

6. Mathematical Formulation

We can formalize the memory-augmented generation process. Let:

  • x = input query
  • M_S = short-term memory (recent context)
  • M_L = long-term memory (vector store)
  • M_P = persistent memory (user profile)
  • θ = LLM parameters

Memory-Augmented Response Generation
P(y | x) = P_θ(y | x, C(x, M_S, M_L, M_P))

Where C is the context construction function that retrieves and assembles relevant information from all memory tiers:

Context Construction
C(x, M_S, M_L, M_P) = [M_P; TopK(M_L, embed(x)); M_S]

The TopK function performs approximate nearest neighbor search in the embedding space:

Semantic Retrieval
TopK(M_L, q) = argmax^{(k)}_{m ∈ M_L} sim(q, m)
where sim(a, b) = (a · b) / (||a|| ||b||) (cosine similarity)
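
A brute-force sketch of TopK and C in NumPy (exact cosine similarity over a small in-memory array rather than an approximate nearest-neighbor index; the stored embeddings and the query embedding are random stand-ins for real embedding-model output):

```python
import numpy as np

def cosine_sim(q_vec, memory_vecs):
    """sim(a, b) = (a · b) / (||a|| ||b||), vectorized over all stored memories."""
    return memory_vecs @ q_vec / (
        np.linalg.norm(memory_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)

def top_k(memory_vecs, memory_texts, q_vec, k=3):
    """TopK(M_L, q): the k stored memories most similar to the query embedding."""
    idx = np.argsort(-cosine_sim(q_vec, memory_vecs))[:k]
    return [memory_texts[i] for i in idx]

def build_context(M_P, M_L_vecs, M_L_texts, M_S, q_vec, k=3):
    """C(x, M_S, M_L, M_P) = [M_P; TopK(M_L, embed(x)); M_S], assembled as prompt text."""
    profile = [f"{key}: {val}" for key, val in M_P.items()]
    return "\n".join(profile + top_k(M_L_vecs, M_L_texts, q_vec, k) + M_S)

rng = np.random.default_rng(0)
M_L_vecs = rng.normal(size=(5, 8))               # pretend embeddings of 5 stored memories
M_L_texts = [f"memory {i}" for i in range(5)]
print(build_context({"role": "Python developer"}, M_L_vecs, M_L_texts,
                    ["user: how do I speed up my pipeline?"], rng.normal(size=8)))
```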

Memory Update Dynamics

After each interaction, memories are updated according to different policies:

Short-Term Memory Update (FIFO Queue)
M_S^{t+1} = [M_S^t; (x_t, y_t)]         if |M_S^t| < capacity
M_S^{t+1} = [M_S^t[2:]; (x_t, y_t)]     otherwise (evict the oldest entry)

Long-Term Memory Update (Surprise-Based)
M_L^{t+1} = M_L^t ∪ {(embed(x_t, y_t), t)}   if surprise(x_t, y_t) > τ

The surprise function, inspired by the Titans architecture, measures how unexpected an interaction is relative to existing memories:

Surprise Scoring
surprise(x, y) = 1 − max_{m ∈ M_L} sim(embed(x, y), m)
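
A sketch of this surprise-gated write policy; the threshold τ and the embeddings are placeholders, and similarity is again brute-force cosine:

```python
import numpy as np

def surprise(vec, memory_vecs):
    """surprise = 1 - max similarity to anything already stored; 1.0 if memory is empty."""
    if not memory_vecs:
        return 1.0
    M = np.asarray(memory_vecs)
    sims = M @ vec / (np.linalg.norm(M, axis=1) * np.linalg.norm(vec) + 1e-9)
    return 1.0 - float(sims.max())

def maybe_store(memory_vecs, vec, tau=0.3):
    """Append the new embedding to long-term memory only if it is sufficiently novel."""
    if surprise(vec, memory_vecs) > tau:
        memory_vecs.append(vec)
    return memory_vecs
```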

7. Retrieval-Augmented Generation

The core mechanism for injecting personal context into LLM responses is Retrieval-Augmented Generation (RAG). Rather than relying solely on parametric knowledge, RAG retrieves relevant documents and includes them in the prompt.

RAG Pipeline

User Query → Embed Query → Vector Search → Retrieve Top-K → Augment Prompt → Generate

Advanced RAG Techniques

Basic RAG often retrieves irrelevant or redundant information. Memory implements several advanced techniques:

  • Hybrid Search - Combines dense vector search with sparse BM25 retrieval for better recall (one common fusion strategy is sketched after this list)
  • Query Expansion - Uses the LLM to rewrite queries before retrieval, improving match quality
  • Reranking - Cross-encoder models rerank initial results for higher precision
  • Multi-hop Retrieval - Iteratively retrieves and reasons for complex queries requiring information synthesis
  • Source Diversity - Ensures retrieved context spans multiple data sources
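
As an example of the hybrid-search item above, reciprocal rank fusion is one common way to merge a dense ranking with a sparse BM25 ranking without calibrating their raw scores against each other (the document IDs below are made up):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked result lists (e.g., dense vector search and BM25)
    by summing 1 / (k + rank) for each document across the lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc3", "doc1", "doc7"]   # from vector similarity search
bm25_hits = ["doc1", "doc5", "doc3"]    # from sparse keyword retrieval
print(reciprocal_rank_fusion([dense_hits, bm25_hits]))  # ['doc1', 'doc3', ...]
```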

8. Personal SLM Training

Beyond retrieval augmentation, Memory supports training personalized Small Language Models (SLMs) on your data. This creates a model that has genuinely learned your patterns, not just retrieved them.

LoRA: Low-Rank Adaptation

Full fine-tuning of LLMs is computationally expensive. LoRA (Low-Rank Adaptation) enables efficient personalization by training small adapter matrices:

LoRA Weight Update
W' = W + BA
where W ∈ R^{d×k}, B ∈ R^{d×r}, A ∈ R^{r×k}, r << min(d, k)

By keeping the rank r small (typically 8-64), LoRA cuts the number of trainable parameters by several orders of magnitude (up to roughly 10,000x in the original paper) while preserving adaptation quality.
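
A NumPy sketch of the adapted forward pass with toy dimensions; in actual LoRA training the base weight W stays frozen and only A and B receive gradients (B is zero-initialized so the adapter starts as a no-op):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8                       # hidden dims and low rank r << min(d, k)

W = rng.normal(size=(d, k))                 # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01          # trainable adapter, R^{r x k}
B = np.zeros((d, r))                        # trainable adapter, R^{d x r}, zero-init so BA = 0

def adapted_forward(x):
    """y = x W'^T with W' = W + BA; only A and B are updated during fine-tuning."""
    return x @ (W + B @ A).T

x = rng.normal(size=(1, k))
print(adapted_forward(x).shape)                                   # (1, 512)
print(f"trainable params: {A.size + B.size} vs full: {W.size}")   # 8192 vs 262144
```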

Personal SLM Pipeline

  1. Data Curation - Select high-quality examples from your memory store
  2. Instruction Formatting - Convert memories to instruction-response pairs (see the sketch after this list)
  3. LoRA Training - Train adapters on a base model (e.g., Mistral, Phi, Qwen)
  4. Evaluation - Test on held-out personal data
  5. Deployment - Merge adapters with base model for inference
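
As an illustration of step 2, a hedged sketch that converts remembered interactions into instruction-response pairs in a generic JSONL fine-tuning format (the field names and file path are illustrative, not a required schema):

```python
import json

memories = [
    {"query": "How should I structure my ML pipeline repo?",
     "response": "Given that you use Airflow and Poetry, one layout is ..."},
]

def to_instruction_pairs(memories, path="personal_sft.jsonl"):
    """Write one {"instruction", "output"} record per remembered interaction."""
    with open(path, "w") as f:
        for m in memories:
            record = {"instruction": m["query"], "output": m["response"]}
            f.write(json.dumps(record) + "\n")

to_instruction_pairs(memories)
```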

Hybrid Approach

The most effective system combines retrieval (RAG) with personalized weights (LoRA). RAG provides specific factual context while the trained model captures stylistic patterns and implicit preferences.

9. References and Further Reading

Foundational Papers

  • Vaswani et al. (2017). "Attention Is All You Need" - The transformer architecture
  • Lewis et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" - RAG foundations
  • Hu et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models" - Efficient fine-tuning

Memory Systems

  • Behrouz et al. (2024). "Titans: Learning to Memorize at Test Time" - Neural long-term memory (Google Research)
  • Park et al. (2023). "Generative Agents: Interactive Simulacra of Human Behavior" - Memory for agent systems
  • Zhong et al. (2024). "MemoryBank: Enhancing Large Language Models with Long-Term Memory"

Cognitive Science

  • Baddeley (2000). "The episodic buffer: a new component of working memory?"
  • Tulving (1985). "Memory and consciousness"
  • Atkinson & Shiffrin (1968). "Human memory: A proposed system and its control processes"

Model Compression

  • Dettmers et al. (2022). "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale"
  • Frantar et al. (2023). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers"