◆ Open source AI memory infrastructure

Cached SmartVector Memory

The core idea is simple: fetch only the context your AI actually needs from memory, as quickly and cheaply as possible, so it can give better answers.


5 cache layers
~$0 per-query cost
90% search space reduction
100% local & private

Hierarchical Cache Memory · Learned Cache Index · Local LLM Controller · Zero Cloud Dependency · Query-Aware Synthesis · Typed Memory Chains · Bidirectional Pointer Graph · Privacy Preserving

// The problem

LLMs are stateless.
Memory systems are broken.

Every time your AI starts a new session, it starts from zero. Current solutions either scan everything (expensive and slow) or return irrelevant chunks (inaccurate).

Most memory systems are flat RAG pipelines — they embed a query, find similar text, dump it in the prompt, and hope for the best. There's no intelligence in the retrieval. No concept of what's important. No hierarchy.

CaSVeM treats memory like a CPU cache hierarchy — layered, intelligent, and fast. The right context reaches your AI at near-zero cost, every time.

# What everyone else does
query → embed → scan ALL vectors
      → return top-K chunks
      → hope it's relevant
      → pay per API call

# What CaSVeM does
query → learned index predicts slots
      → search only relevant subset
      → LLM synthesises memory block
      → exact context, ~$0 cost

# Result
90% smaller search space
Zero cloud dependency
Better answers

// Memory Hierarchy

Five layers.
One system.

L1 · Ultra-Compressed Context (~500 tokens, 20 lines max)
Always injected into every prompt automatically. Core facts, active preferences, current projects.

L2 · Topic Summaries (~5K tokens, 100 lines max)
Compressed summaries per topic. Searched first after L1. Vector search entry point.

L3 · Detailed Knowledge (~50K tokens, 500 lines max)
Context-rich, temporally aware facts. Accessed via pointers from L2. Never scanned in full.

L4 · Raw Extracted Facts (~500K tokens, 5000 lines max)
Atomic facts from sessions. Timestamped, tagged, deduplicated. Source for L3 consolidation.

L5 · Session Archive (unlimited, never deleted)
Full conversation transcripts. Append-only, never modified. Permanent source of truth.
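
One way to picture the hierarchy is as a small table of layer budgets. The sketch below is illustrative only: the `LayerSpec` class, its field names, and the `has_room` helper are assumptions made for this example and are not taken from the CaSVeM codebase. The token and line limits are copied from the layer descriptions above, with the layers labelled L1 to L5 as in the comparison table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LayerSpec:
    """Hypothetical description of one memory layer (not CaSVeM's real schema)."""
    name: str
    role: str
    token_budget: Optional[int]  # None means unlimited
    max_lines: Optional[int]     # None means no line cap


# Budgets copied from the layer descriptions; everything else is illustrative.
LAYERS = [
    LayerSpec("L1", "Ultra-Compressed Context", 500, 20),
    LayerSpec("L2", "Topic Summaries", 5_000, 100),
    LayerSpec("L3", "Detailed Knowledge", 50_000, 500),
    LayerSpec("L4", "Raw Extracted Facts", 500_000, 5000),
    LayerSpec("L5", "Session Archive", None, None),
]


def has_room(spec: LayerSpec, current_lines: int) -> bool:
    """Return True if one more line fits under the layer's line cap."""
    return spec.max_lines is None or current_lines < spec.max_lines
```

With a table like this, a writer could check `has_room(LAYERS[0], current_lines)` before adding a line to L1 and push the line to a lower layer when the cap is reached.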

// Query Pipeline

Seven stages.
Milliseconds.

01 · Encode (~20ms)
Dense embedding via local model

02 · L1 Cache (~1ms)
Exact Redis key lookup

03 · L2 Cache (~5ms)
ANN cosine ≥ 0.92 hit

04 · Filter (~1ms)
Tenant + type metadata filter

04.5 · Learned Index (~2ms) ◆ NEW
Predicts relevant slots before ANN runs. 90% smaller search space.

05 · ANN Search (~10ms)
Hybrid dense + sparse BM25 RRF on predicted subset

06 · Rerank (~80ms)
Cross-encoder on candidates
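
The two cache stages near the top of the pipeline (an exact key lookup, then a similarity hit at cosine ≥ 0.92) can be sketched in a few lines. This is a minimal illustration under stated assumptions: a plain dict stands in for Redis, a linear scan stands in for the ANN index, and the `QueryCache` class and its method names are hypothetical rather than CaSVeM's API. Only the 0.92 threshold comes from the pipeline description.

```python
import math
from typing import Dict, List, Optional, Tuple

SIMILARITY_THRESHOLD = 0.92  # threshold quoted in the L2 Cache stage


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class QueryCache:
    """Hypothetical two-level query cache: exact match first, then semantic match."""

    def __init__(self) -> None:
        self.exact: Dict[str, str] = {}                     # stands in for Redis
        self.semantic: List[Tuple[List[float], str]] = []   # stands in for an ANN index

    def put(self, query: str, embedding: List[float], result: str) -> None:
        self.exact[query] = result
        self.semantic.append((embedding, result))

    def get(self, query: str, embedding: List[float]) -> Optional[str]:
        # Level 1: exact key lookup.
        hit = self.exact.get(query)
        if hit is not None:
            return hit
        # Level 2: best cached neighbour, accepted only above the threshold.
        best_score, best = 0.0, None
        for cached_embedding, result in self.semantic:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best = score, result
        return best if best_score >= SIMILARITY_THRESHOLD else None
```

A `None` result means both caches missed and the query would continue to the filter, learned index, ANN search, and rerank stages.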

// Comparison

How CaSVeM
stacks up.

Feature	CaSVeM	Mem0	HydraDB	Supermemory	Zep
Fully local deployment	✓ Always	✗	✗	✗	Partial
Zero per-query cost	✓ ~$0	✗ API cost	✗ $249/mo+	✗ Token-based	✗ API cost
LLM as cache controller	✓	✗	✗	✗	✗
Learned cache index	✓ Stage 4.5	✗	✗	✗	✗
Hierarchical cache layers	✓ L1–L5	✗ Flat	✗ Flat	✗ Flat	✗ Flat
Query-aware synthesis	✓	✗ Raw chunks	✗ Raw chunks	✗ Raw chunks	✗ Raw chunks
Typed memory chains	✓ Causal/temporal	✗	Partial	Partial	Graph only
Open source	✓ Apache 2.0	✓	✗	Partial	Partial
Privacy preserving	✓ Never leaves device	✗ Their servers	✗ Cloud	✗ Cloudflare	✗ Cloud

// Core features

Built different.

⚡

Learned Cache Index

A neural classifier trained on your access patterns predicts which memory slots to search before vector search runs. 90% smaller search space. Gets smarter with every query.

Novel — first in category
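
To make the idea concrete, here is a deliberately small stand-in for the learned index. It is not the neural classifier described above: it is an online logistic scorer with one weight vector per slot, written in plain Python so it runs anywhere. The `SlotPredictor` class, its parameters, and the default `keep_fraction=0.1` (chosen to mirror the 90% search space reduction) are assumptions for illustration.

```python
import math
from typing import List, Set


class SlotPredictor:
    """Hypothetical learned index: scores every slot, keeps the top fraction."""

    def __init__(self, n_slots: int, dim: int, lr: float = 0.5) -> None:
        self.w = [[0.0] * dim for _ in range(n_slots)]  # one weight vector per slot
        self.lr = lr

    def _score(self, slot: int, x: List[float]) -> float:
        z = sum(w * xi for w, xi in zip(self.w[slot], x))
        return 1.0 / (1.0 + math.exp(-z))  # probability the slot is relevant

    def update(self, x: List[float], hit_slots: Set[int]) -> None:
        """Learn from one query: hit_slots are the slots that actually held the answer."""
        for s in range(len(self.w)):
            target = 1.0 if s in hit_slots else 0.0
            err = target - self._score(s, x)
            self.w[s] = [w + self.lr * err * xi for w, xi in zip(self.w[s], x)]

    def predict(self, x: List[float], keep_fraction: float = 0.1) -> List[int]:
        """Return the slots worth searching, highest score first."""
        k = max(1, int(len(self.w) * keep_fraction))
        ranked = sorted(range(len(self.w)), key=lambda s: self._score(s, x), reverse=True)
        return ranked[:k]
```

Vector search would then run only over the slots returned by `predict`, and each answered query would feed back into `update`.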

🧠

LLM as Controller

A local model decides retrieval depth dynamically. No hard-coded heuristics. No fixed rules. The controller understands your query and decides whether L2 is enough or L4 is needed.

Intelligent retrieval
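
The control flow behind a depth decision might look roughly like the sketch below. It assumes the controller escalates one layer at a time and stops when it judges the context sufficient; the source does not say whether depth is chosen this way or in a single step. The `retrieve` function, the `fetch` mapping, and the `is_sufficient` callable (a stub for the local LLM's judgment) are all hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

LAYER_ORDER = ["L2", "L3", "L4"]  # searchable layers, shallowest first


def retrieve(
    query: str,
    fetch: Dict[str, Callable[[str], List[str]]],
    is_sufficient: Callable[[str, List[str]], bool],
) -> Tuple[str, List[str]]:
    """Gather context layer by layer until the controller says it is enough.

    `is_sufficient` stands in for the local model's decision; here it is
    just a callable so the loop can be run without an LLM.
    """
    context: List[str] = []
    deepest = LAYER_ORDER[0]
    for layer in LAYER_ORDER:
        context.extend(fetch[layer](query))
        deepest = layer
        if is_sufficient(query, context):
            break
    return deepest, context
```

In a real deployment the `is_sufficient` stub would be replaced by a call to the local model, which is where the query understanding actually happens.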

🔗

Memory Chains

Typed directed graphs for causal, temporal, and departmental retrieval. Decision chains, finance chains, entity chains — traverse history in one hop, not dozens of searches.

Causal + temporal
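
A typed directed graph with back-pointers is enough to show how a chain can be followed in one call instead of many searches. The sketch below is an assumption-heavy illustration: the `MemoryGraph` class, the edge type names, and the example memories are invented for this example and do not reflect CaSVeM's storage format.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class MemoryGraph:
    """Hypothetical typed graph: every edge has a type and a back-pointer."""

    def __init__(self) -> None:
        self.out: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)
        self.inc: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)

    def link(self, src: str, edge_type: str, dst: str) -> None:
        self.out[src].append((edge_type, dst))  # forward pointer
        self.inc[dst].append((edge_type, src))  # backward pointer

    def chain(self, start: str, edge_type: str, backward: bool = False) -> List[str]:
        """Follow edges of one type from `start` until the chain ends."""
        edges = self.inc if backward else self.out
        path, seen, node = [start], {start}, start
        while True:
            nxt = next(
                (t for et, t in edges[node] if et == edge_type and t not in seen),
                None,
            )
            if nxt is None:
                return path
            path.append(nxt)
            seen.add(nxt)
            node = nxt


g = MemoryGraph()
g.link("chose Postgres", "caused", "migrated schema")
g.link("migrated schema", "caused", "updated ORM")
g.link("chose Postgres", "followed_by", "team offsite")
causal_chain = g.chain("chose Postgres", "caused")
```

Because each edge is stored in both directions, the same call with `backward=True` walks from an outcome back to the decision that caused it.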

🔒

Fully Local

Runs entirely on your hardware. No cloud API. No data sent anywhere. No per-query cost after setup. Works offline. Air-gapped enterprise deployments supported.

Privacy preserving

📊

Retention Scoring

Every memory line has a composite score: importance × recency × access frequency × uniqueness. Hot memories are promoted up the hierarchy. Cold memories are demoted down it. Automatically.

Auto-managed
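
The composite score is a plain product of four factors, which can be sketched as follows. Only the four factors come from the description above. How each factor is normalised is an assumption here: recency uses an exponential half-life, access frequency uses a saturating curve, and the `promote_at` / `demote_at` thresholds and all names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryLine:
    importance: float   # assumed to be in 0..1
    age_days: float     # days since the line was last written
    access_count: int   # how often it has been retrieved
    uniqueness: float   # assumed to be in 0..1


def retention_score(m: MemoryLine, half_life_days: float = 30.0, freq_scale: float = 5.0) -> float:
    """importance x recency x access frequency x uniqueness, each in 0..1."""
    recency = 0.5 ** (m.age_days / half_life_days)
    # +1 so a freshly written, never-accessed line is not zeroed out.
    frequency = 1.0 - math.exp(-(1 + m.access_count) / freq_scale)
    return m.importance * recency * frequency * m.uniqueness


def next_action(score: float, promote_at: float = 0.5, demote_at: float = 0.05) -> str:
    """Map a score to a hypothetical promote / keep / demote decision."""
    if score >= promote_at:
        return "promote"
    if score <= demote_at:
        return "demote"
    return "keep"
```

Because the factors multiply, a line that scores near zero on any one of them ends up cold regardless of the others, which is one reasonable reading of a composite score built this way.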

🔄

Query-Aware Synthesis

The output is not raw chunks. A local LLM synthesises a tailored memory block written specifically for each query. Every prompt gets a different memory. Every answer gets better context.

Not raw chunks

// Quick start

Up in minutes.

Terminal

# Pull embedding model
ollama pull qwen3-embedding:0.6b

# Start Qdrant
docker run -d --name qdrant \
  -p 6333:6333 qdrant/qdrant

# Install CaSVeM
pip install casvem

# Start the system
casvem start

Python

import casvem

# Write a session
casvem.write(
    transcript="...",
    user_id="user_01"
)

# Retrieve context for query
ctx = casvem.query(
    q="What are my preferences?",
    user_id="user_01"
)

# Send to any LLM (`llm` is your own chat client, not part of casvem)
llm.chat(ctx.prompt)
