ShoreAgents has a knowledge base - 39 entries that Maya (our AI salesperson) searches to answer questions about pricing, hiring process, compliance, and more.
The original implementation? Every single chat message triggered a vector search across all 39 entries. Every. Single. Message. Even "hello" and "thanks".
Vector search isn't slow, but it's not free either. Embedding the query, comparing against 39 vectors, ranking results - about 200ms per search. User sends 10 messages in a conversation? That's 2 seconds of pure overhead just on knowledge search.
Then I noticed: 80% of questions were the same 15 topics. Pricing. Process. Timeline. Benefits. The same queries hitting the same vectors returning the same results. Over and over.
This is where caching comes in.
The Before: Every Query Hits the Database
`typescript
// Original knowledge search - no caching
async function searchKnowledge(query: string) {
  // Embed the query with OpenAI
  const embedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query
  });

  // Search Supabase with vector similarity
  const { data: results } = await supabase.rpc('match_knowledge', {
    query_embedding: embedding.data[0].embedding,
    match_threshold: 0.7,
    match_count: 5
  });

  return results;
}
`
Every call: 100ms for embedding + 100ms for search = 200ms minimum.
The After: Cache Common Queries
`typescript
// With caching - dramatic improvement
const knowledgeCache = new Map();
const CACHE_TTL = 60 * 60 * 1000; // 1 hour

async function searchKnowledge(query: string) {
  // Normalize the query to build the cache key
  const cacheKey = normalizeQuery(query);

  // Check the cache first
  const cached = knowledgeCache.get(cacheKey);
  if (cached && Date.now() - cached.timestamp < CACHE_TTL) {
    return cached.results;
  }

  // Cache miss - do the actual search
  const embedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query
  });

  const { data: results } = await supabase.rpc('match_knowledge', {
    query_embedding: embedding.data[0].embedding,
    match_threshold: 0.7,
    match_count: 5
  });

  // Store in cache
  knowledgeCache.set(cacheKey, { results, timestamp: Date.now() });

  return results;
}

function normalizeQuery(query: string): string {
  return query
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9\s]/g, '')
    .split(/\s+/)
    .sort()
    .join(' ');
}
`
"What's your pricing?" and "your pricing what's" and "WHATS YOUR PRICING???" all hit the same cache entry.
Result: 80% cache hit rate. Average knowledge search time dropped from 200ms to 40ms.
Cache Invalidation: The Hard Part
Adding caching is easy. Knowing when to invalidate is hard.
For the knowledge base, there are two basic invalidation strategies, plus the hybrid of both that we actually use:
1. Time-based (TTL)
Cache entries expire after 1 hour regardless of changes. Simple, predictable, but users might see stale data for up to an hour.
Good for: Data that changes rarely, where slight staleness is acceptable.
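A minimal sketch of the pattern, using generic names rather than our actual helpers:
`typescript
// Minimal TTL-only read-through cache (illustrative names, not production code)
const ttlCache = new Map<string, { value: any; timestamp: number }>();

async function getWithTtl<T>(key: string, ttlMs: number, fetchFn: () => Promise<T>): Promise<T> {
  const entry = ttlCache.get(key);
  if (entry && Date.now() - entry.timestamp < ttlMs) {
    return entry.value; // Still fresh - serve from cache
  }

  // Expired or missing - refetch and reset the clock
  const value = await fetchFn();
  ttlCache.set(key, { value, timestamp: Date.now() });
  return value;
}
`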
2. Event-based
When someone updates the knowledge base, clear all knowledge caches immediately.
`typescript
// In the admin update function
async function updateKnowledgeEntry(id: string, content: string) {
  await supabase
    .from('knowledge')
    .update({ content })
    .eq('id', id);

  // Regenerate the embedding for this entry
  await regenerateEmbedding(id);

  // Clear ALL knowledge caches
  knowledgeCache.clear();

  // Log the invalidation
  logger.info('Knowledge cache invalidated', {
    trigger: 'entry_update',
    entryId: id
  });
}
`
Good for: Data where staleness causes real problems.
3. Hybrid (what we use)
Short TTL (1 hour) as a safety net, plus event-based invalidation for immediate consistency when updates happen.
`typescript
// Clear on any knowledge change
supabase
  .channel('knowledge_changes')
  .on('postgres_changes',
    { event: '*', schema: 'public', table: 'knowledge' },
    () => knowledgeCache.clear()
  )
  .subscribe();
`
What to Cache (And What Not To)
After implementing caching across ShoreAgents and BPOC, here's what I've learned:
Great candidates for caching:
| Data | TTL | Why |
|------|-----|-----|
| Knowledge base results | 1 hour | Changes rarely, searched constantly |
| Pricing engine output | 24 hours | Static unless multipliers change |
| Role salary data | 4 hours | External API, rate limited |
| User profile (public) | 15 min | Read often, updated sometimes |
| API config/settings | Until restart | Changes require deploy anyway |
Bad candidates for caching:
| Data | Why |
|------|-----|
| Authentication state | Security risk if stale |
| Real-time chat | Users expect instant updates |
| Quote in progress | User actively modifying |
| Lead analytics | Need accurate real-time counts |
| Financial transactions | Consistency critical |
Cache Key Design
Good cache keys are:
- Deterministic (same input = same key)
- Collision-free (different input = different key)
- Human-readable (for debugging)
Our pattern:
`
{namespace}:{entity}:{identifier}:{version}
`
Examples:
`
knowledge:search:whats-your-pricing:v1
pricing:role:virtual-assistant:php:v2
user:profile:user_123:v1
maya:session:ses_abc:context:v1
`
The version suffix lets us invalidate all caches of a type by bumping the version:
`typescript
const CACHE_VERSION = 'v2'; // Bump this to invalidate all

function cacheKey(namespace: string, entity: string, id: string) {
  return `${namespace}:${entity}:${id}:${CACHE_VERSION}`;
}
`
Changed how pricing works? Bump to v3. All old pricing caches become orphans and expire naturally.
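A quick illustration with the helper above (the pricing key is just an example):
`typescript
// With CACHE_VERSION = 'v2'
cacheKey('pricing', 'role', 'virtual-assistant');
// => 'pricing:role:virtual-assistant:v2'

// Bump CACHE_VERSION to 'v3' and the same call produces a new key;
// nothing reads the v2 entries again, so they simply age out
cacheKey('pricing', 'role', 'virtual-assistant');
// => 'pricing:role:virtual-assistant:v3'
`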
Multi-Layer Caching
For high-traffic data, we use multiple cache layers:
`
Request → Memory Cache → Redis Cache → Database
          (fastest)      (shared)      (source of truth)
`
`typescript
async function getUserProfile(userId: string) {
  // Layer 1: In-memory (per-instance)
  const memKey = `user:profile:${userId}`;
  const memCached = memoryCache.get(memKey);
  if (memCached) return memCached;

  // Layer 2: Redis (shared across instances)
  const redisCached = await redis.get(memKey);
  if (redisCached) {
    const parsed = JSON.parse(redisCached);
    memoryCache.set(memKey, parsed, { ttl: 60 }); // Backfill memory
    return parsed;
  }

  // Layer 3: Database (source of truth)
  const user = await db.users.get(userId);

  // Populate both caches
  await redis.setex(memKey, 300, JSON.stringify(user)); // 5 min
  memoryCache.set(memKey, user, { ttl: 60 }); // 1 min

  return user;
}
`
- Memory cache: 1 minute TTL, fastest, per-instance
- Redis cache: 5 minute TTL, shared across all instances
- Database: source of truth, always correct
Most requests hit memory. Memory misses hit Redis. Redis misses hit database. Database is rarely touched for hot data.
The Thundering Herd Problem
Imagine cache expires. 100 concurrent requests all see cache miss. 100 requests all hit the database simultaneously. Database falls over.
Solution: Cache stampede protection.
`typescript
const locks = new Map<string, Promise<any>>();

async function getWithLock<T>(key: string, fetchFn: () => Promise<T>): Promise<T> {
  // If someone else is already fetching this key, wait for their result
  const existing = locks.get(key);
  if (existing) return existing;

  // We're the one who fetches
  const promise = fetchFn()
    .then(result => {
      cache.set(key, result);
      return result;
    })
    .finally(() => locks.delete(key)); // Release the lock even if the fetch fails

  locks.set(key, promise);
  return promise;
}
`
First request fetches, others wait. Only one database hit regardless of concurrent demand.
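A rough usage sketch (searchKnowledgeProtected is a hypothetical wrapper, not code from the app): the lock simply sits in front of the existing search.
`typescript
// Hypothetical wrapper: concurrent misses for the same normalized query
// share one embedding call + vector search instead of firing in parallel
function searchKnowledgeProtected(query: string) {
  return getWithLock(normalizeQuery(query), () => searchKnowledge(query));
}
`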
Monitoring Cache Health
A cache without metrics is a mystery box. We track:
`typescript
const cacheMetrics = {
  hits: 0,
  misses: 0,
  errors: 0,
  latency_ms: [] as number[]
};

async function cachedFetch<T>(key: string, fetchFn: () => Promise<T>): Promise<T> {
  const start = Date.now();

  // Serve from cache when possible
  const cached = cache.get(key);
  if (cached !== undefined) {
    cacheMetrics.hits++;
    return cached;
  }

  // Cache miss - fetch, store, and record how long it took
  cacheMetrics.misses++;
  try {
    const result = await fetchFn();
    cache.set(key, result);
    cacheMetrics.latency_ms.push(Date.now() - start);
    return result;
  } catch (error) {
    cacheMetrics.errors++;
    throw error;
  }
}
`
Dashboard shows:
- Hit rate (should be >70% for hot data)
- Miss rate (spikes indicate invalidation or cold start)
- Error rate (the cache itself failing)
- P50/P95 latency (are we actually faster?)
If hit rate drops suddenly, something's wrong with our invalidation. If latency spikes, maybe Redis is overloaded.
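A minimal sketch of how those numbers can be derived from the counters above (the snapshot helper is illustrative, not our dashboard code):
`typescript
// Illustrative only: derive dashboard numbers from the raw counters
function cacheHealthSnapshot() {
  const total = cacheMetrics.hits + cacheMetrics.misses;
  const sorted = [...cacheMetrics.latency_ms].sort((a, b) => a - b);
  const p95Index = Math.max(0, Math.ceil(sorted.length * 0.95) - 1);

  return {
    hitRate: total ? cacheMetrics.hits / total : 0,
    missRate: total ? cacheMetrics.misses / total : 0,
    errorRate: total ? cacheMetrics.errors / total : 0,
    p95LatencyMs: sorted.length ? sorted[p95Index] : 0
  };
}
`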
Lessons Learned
After implementing caching across ShoreAgents:
1. Start with measurement. I didn't guess that knowledge search was slow - I measured it. Don't cache blindly.
2. Cache at the right layer. API responses? Database queries? Computed results? Each has different invalidation needs.
3. TTL is not a strategy. "It'll expire eventually" is not cache invalidation. Know exactly when your data becomes stale.
4. Every cache is a lie. You're telling users "this is current" when it might not be. Make sure it's a small lie.
5. Simple wins. In-memory Map with TTL covers 80% of use cases. Don't reach for Redis until you need shared state across instances.
The Maya knowledge search went from 200ms to 40ms average. Quote generation (with cached salary data) went from 12 seconds to 4 seconds. All because we stopped hitting the database for data we'd already fetched.
Every cache is a lie you're telling users about the current state of data. Make sure it's a small lie with a short lifespan.

