◆ Open-source AI memory infrastructure

Cached SmartVector Memory (CaSVeM)

The core idea is simple: fetch only the context your AI actually needs from memory, as quickly and as cheaply as possible, so it can give better answers.

5 Cache Layers
~$0 Per-query cost
90% Search space reduction
100% Local & private
Hierarchical Cache Memory · Learned Cache Index · Local LLM Controller · Zero Cloud Dependency · Query-Aware Synthesis · Typed Memory Chains · Bidirectional Pointer Graph · Privacy Preserving
// The problem

LLMs are stateless.
Memory systems are broken.

Every time your AI starts a new session, it starts from zero. Current solutions either scan everything (expensive and slow) or return irrelevant chunks (inaccurate).

Most memory systems are flat RAG pipelines — they embed a query, find similar text, dump it in the prompt, and hope for the best. There's no intelligence in the retrieval. No concept of what's important. No hierarchy.

CaSVeM treats memory like a CPU cache hierarchy — layered, intelligent, and fast. The right context reaches your AI at near-zero cost, every time.

# What everyone else does
query → embed → scan ALL vectors
→ return top-K chunks
→ hope it's relevant
→ pay per API call

# What CaSVeM does
query → learned index predicts slots
→ search only relevant subset
→ LLM synthesises memory block
→ exact context, ~$0 cost

# Result
90% smaller search space
Zero cloud dependency
Better answers
// Memory Hierarchy

Five layers.
One system.

L1
Ultra-Compressed Context
Always injected into every prompt automatically. Core facts, active preferences, current projects.
~500 tokens
20 lines max
L2
Topic Summaries
Compressed summaries per topic. Searched first after L1. Vector search entry point.
~5K tokens
100 lines max
L3
Detailed Knowledge
Context-rich, temporally aware facts. Accessed via pointers from L2. Never scanned in full.
~50K tokens
500 lines max
L4
Raw Extracted Facts
Atomic facts from sessions. Timestamped, tagged, deduplicated. Source for L3 consolidation.
~500K tokens
5000 lines max
L5
Session Archive
Full conversation transcripts. Append-only, never modified. Permanent source of truth.
Unlimited
Never deleted
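
The five layers and their budgets can be written down as a small lookup table. This is only an illustrative sketch: `Layer`, `LAYERS`, and `fits` are hypothetical names, not part of the CaSVeM API, and the token figures are the approximate ones listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Layer:
    name: str
    role: str
    max_tokens: Optional[int]  # None means unlimited
    max_lines: Optional[int]   # None means unlimited


# Budgets copied from the hierarchy above; the structure itself is an assumption.
LAYERS = [
    Layer("L1", "Ultra-Compressed Context", 500, 20),
    Layer("L2", "Topic Summaries", 5_000, 100),
    Layer("L3", "Detailed Knowledge", 50_000, 500),
    Layer("L4", "Raw Extracted Facts", 500_000, 5_000),
    Layer("L5", "Session Archive", None, None),
]


def fits(layer: Layer, tokens: int, lines: int) -> bool:
    """Return True if content of this size stays within the layer's budget."""
    token_ok = layer.max_tokens is None or tokens <= layer.max_tokens
    line_ok = layer.max_lines is None or lines <= layer.max_lines
    return token_ok and line_ok
```

For example, `fits(LAYERS[0], tokens=400, lines=10)` is true, while a 600-token block would have to live in L2 or below.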
// Query Pipeline

Seven stages.
Milliseconds.

01
Encode
~20ms
Dense embedding via local model
02
L1 Cache
~1ms
Exact Redis key lookup
03
L2 Cache
~5ms
Cache hit when ANN cosine similarity ≥ 0.92
04
Filter
~1ms
Tenant + type metadata filter
04.5
Learned Index
~2ms ◆ NEW
Predicts relevant slots before ANN runs. 90% smaller search space.
05
ANN Search
~10ms
Hybrid dense + sparse (BM25) search, fused with reciprocal rank fusion (RRF), on the predicted subset
06
Rerank
~80ms
Cross-encoder on candidates
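
Stage 05 merges a dense ranking and a sparse BM25 ranking with RRF (reciprocal rank fusion). A minimal sketch of that fusion step follows. The function name `rrf_fuse` and the memory ids are made up for illustration, and `k=60` is the commonly used default constant; whether CaSVeM uses the same value is not stated here.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids with reciprocal rank fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["m3", "m1", "m7"]   # hypothetical dense (embedding) ranking
sparse = ["m1", "m9", "m3"]  # hypothetical sparse (BM25) ranking
fused = rrf_fuse([dense, sparse])
print(fused)  # ['m1', 'm3', 'm9', 'm7']
```

Items that appear in both lists rise to the top, which is why the fused candidates are a reasonable input for the cross-encoder in stage 06.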
// Comparison

How CaSVeM
stacks up.

Feature | CaSVeM | Mem0 | HydraDB | Supermemory | Zep
Fully local deployment | ✓ Always | Partial
Zero per-query cost | ✓ ~$0 | ✗ API cost | ✗ $249/mo+ | ✗ Token-based | ✗ API cost
LLM as cache controller
Learned cache index | ✓ Stage 4.5
Hierarchical cache layers | ✓ L1–L5 | ✗ Flat | ✗ Flat | ✗ Flat | ✗ Flat
Query-aware synthesis | | ✗ Raw chunks | ✗ Raw chunks | ✗ Raw chunks | ✗ Raw chunks
Typed memory chains | ✓ Causal/temporal | Partial | Partial | Graph only
Open source | ✓ Apache 2.0 | Partial | Partial
Privacy preserving | ✓ Never leaves device | ✗ Their servers | ✗ Cloud | ✗ Cloudflare | ✗ Cloud
// Core features

Built different.

Learned Cache Index

A neural classifier trained on your access patterns predicts which memory slots to search before vector search runs. 90% smaller search space. Gets smarter with every query.

Novel — first in category
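
To make "predict slots before vector search" concrete, here is a deliberately simple stand-in. It is not the neural classifier described above: it keeps one centroid per slot, built from the queries that were answered from that slot, and ranks slots by cosine similarity. `SlotPredictor`, `record_hit`, `predict`, and the slot names are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SlotPredictor:
    """Nearest-centroid stand-in for a learned cache index (illustration only)."""

    def __init__(self):
        self.sums = {}    # slot id -> running sum of query vectors
        self.counts = {}  # slot id -> number of recorded hits

    def record_hit(self, query_vec, slot_id):
        """Update a slot's centroid after a query was answered from it."""
        if slot_id not in self.sums:
            self.sums[slot_id] = [0.0] * len(query_vec)
            self.counts[slot_id] = 0
        self.sums[slot_id] = [s + q for s, q in zip(self.sums[slot_id], query_vec)]
        self.counts[slot_id] += 1

    def predict(self, query_vec, top_n=1):
        """Return the top_n slots whose centroid is closest to the query."""
        scored = []
        for slot_id, total in self.sums.items():
            centroid = [s / self.counts[slot_id] for s in total]
            scored.append((cosine(query_vec, centroid), slot_id))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [slot_id for _, slot_id in scored[:top_n]]


predictor = SlotPredictor()
predictor.record_hit([1.0, 0.0], "finance")
predictor.record_hit([0.9, 0.1], "finance")
predictor.record_hit([0.0, 1.0], "travel")
slots = predictor.predict([0.8, 0.2], top_n=1)
print(slots)  # ['finance']
```

Only the predicted slots are then handed to ANN search. How much the search space shrinks depends on how many slots exist and on `top_n`.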
🧠
LLM as Controller

A local model decides retrieval depth dynamically. No hard-coded heuristics. No fixed rules. The controller reads your query and decides whether L2 is enough or L4 is needed.

Intelligent retrieval
🔗
Memory Chains

Typed directed graphs for causal, temporal, and departmental retrieval. Decision chains, finance chains, entity chains — traverse history in one hop, not dozens of searches.

Causal + temporal
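
A typed directed graph with pointers in both directions can be sketched in a few lines. Everything here is an assumption for illustration (`MemoryGraph`, `link`, `chain`, the edge type strings, and the node ids); it only shows how a single chain walk can replace repeated searches.

```python
from collections import defaultdict, deque


class MemoryGraph:
    """Typed directed graph that stores every edge in both directions."""

    def __init__(self):
        self.forward = defaultdict(list)   # (node, edge type) -> successors
        self.backward = defaultdict(list)  # (node, edge type) -> predecessors

    def link(self, src, dst, edge_type):
        """Add a typed edge src -> dst and its reverse pointer."""
        self.forward[(src, edge_type)].append(dst)
        self.backward[(dst, edge_type)].append(src)

    def chain(self, start, edge_type, reverse=False):
        """Breadth-first walk along one edge type; returns nodes in visit order."""
        index = self.backward if reverse else self.forward
        order, seen, queue = [], {start}, deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nxt in index.get((node, edge_type), []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return order


graph = MemoryGraph()
graph.link("d1", "d2", "causal")    # decision d1 led to d2
graph.link("d2", "d3", "causal")    # d2 led to d3
graph.link("d1", "e1", "temporal")  # e1 happened after d1

print(graph.chain("d1", "causal"))                # ['d1', 'd2', 'd3']
print(graph.chain("d3", "causal", reverse=True))  # ['d3', 'd2', 'd1']
```

Because edges are typed, a causal walk never wanders into temporal links, and the reverse pointers let you ask "what led to this?" as cheaply as "what followed from this?".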
🔒
Fully Local

Runs entirely on your hardware. No cloud API. No data sent anywhere. No per-query cost after setup. Works offline. Air-gapped enterprise deployments supported.

Privacy preserving
📊
Retention Scoring

Every memory line has a composite score: importance × recency × access frequency × uniqueness. Hot memories promote up. Cold memories demote down. Automatically.

Auto-managed
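
The composite score above is a plain product of four factors. The sketch below assumes each factor is scaled to the range 0 to 1, models recency as exponential decay with a 30-day half-life, and uses made-up promote and demote thresholds. None of these constants or function names come from CaSVeM.

```python
def recency(age_days, half_life_days=30.0):
    """Exponential decay: 1.0 when fresh, 0.5 after one half-life (assumed model)."""
    return 0.5 ** (age_days / half_life_days)


def frequency(access_count, saturation=10):
    """Map a raw access count into (0, 1]; reaches 1.0 at `saturation` accesses."""
    return min(1.0, (1 + access_count) / (1 + saturation))


def retention_score(importance, age_days, access_count, uniqueness):
    """importance x recency x access frequency x uniqueness, each in [0, 1]."""
    return importance * recency(age_days) * frequency(access_count) * uniqueness


def decide(score, promote_at=0.5, demote_at=0.05):
    """Hypothetical thresholds: hot lines move up a layer, cold lines move down."""
    if score >= promote_at:
        return "promote"
    if score <= demote_at:
        return "demote"
    return "keep"


hot = retention_score(importance=0.9, age_days=0, access_count=10, uniqueness=0.9)
cold = retention_score(importance=0.9, age_days=120, access_count=0, uniqueness=0.9)
print(decide(hot), decide(cold))  # promote demote
```

Because the factors multiply, a memory has to do reasonably well on all four to stay hot; a single near-zero factor is enough to push it down.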
🔄
Query-Aware Synthesis

The output is not raw chunks. A local LLM synthesises a tailored memory block written specifically for each query. Every prompt gets a different memory. Every answer gets better context.

Not raw chunks
// Quick start

Up in minutes.

Terminal
# Pull embedding model
ollama pull qwen3-embedding:0.6b

# Start Qdrant
docker run -d --name qdrant \
  -p 6333:6333 qdrant/qdrant

# Install CaSVeM
pip install casvem

# Start the system
casvem start
Python
import casvem

# Write a session
casvem.write(
    transcript="...",
    user_id="user_01",
)

# Retrieve context for a query
ctx = casvem.query(
    q="What are my preferences?",
    user_id="user_01",
)

# Send to any LLM (`llm` stands in for whichever chat client you use)
llm.chat(ctx.prompt)

Give your AI
a real memory.

GitHub →