ShoreAgents has a knowledge base - 39 entries that Maya (our AI salesperson) searches to answer questions about pricing, hiring process, compliance, and more.
The original implementation? Every single chat message triggered a vector search across all 39 entries. Every. Single. Message. Even "hello" and "thanks".
Vector search isn't slow, but it's not free either. Embedding the query, comparing against 39 vectors, ranking results - about 200ms per search. User sends 10 messages in a conversation? That's 2 seconds of pure overhead just on knowledge search.
Then I noticed: 80% of questions were the same 15 topics. Pricing. Process. Timeline. Benefits. The same queries hitting the same vectors returning the same results. Over and over.
This is where caching comes in.
The Before: Every Query Hits the Database
`typescript
// Original knowledge search - no caching
async function searchKnowledge(query: string) {
  // Embed the query with OpenAI
  const embedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query
  });

  // Search Supabase with vector similarity
  const { data: results } = await supabase.rpc('match_knowledge', {
    query_embedding: embedding.data[0].embedding,
    match_threshold: 0.7,
    match_count: 5
  });

  return results;
}
`
Every call: 100ms for embedding + 100ms for search = 200ms minimum.
The After: Cache Common Queries
`typescript
// With caching - dramatic improvement
const knowledgeCache = new Map();
const CACHE_TTL = 60 * 60 * 1000; // 1 hour

async function searchKnowledge(query: string) {
  // Normalize the query to build the cache key
  const cacheKey = normalizeQuery(query);

  // Check the cache first
  const cached = knowledgeCache.get(cacheKey);
  if (cached && Date.now() - cached.timestamp < CACHE_TTL) {
    return cached.results;
  }

  // Cache miss - do the actual search
  const embedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query
  });

  const { data: results } = await supabase.rpc('match_knowledge', {
    query_embedding: embedding.data[0].embedding,
    match_threshold: 0.7,
    match_count: 5
  });

  // Store in cache
  knowledgeCache.set(cacheKey, { results, timestamp: Date.now() });

  return results;
}

function normalizeQuery(query: string): string {
  return query
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9\s]/g, '')
    .split(/\s+/)
    .sort()
    .join(' ');
}
`
"What's your pricing?" and "your pricing what's" and "WHATS YOUR PRICING???" all hit the same cache entry.
Result: 80% cache hit rate. Average knowledge search time dropped from 200ms to 40ms.
Cache Invalidation: The Hard Part
Adding caching is easy. Knowing when to invalidate is hard.
For the knowledge base, there are two basic invalidation strategies, plus the hybrid of both that we actually use:
1. Time-based (TTL)
Cache entries expire after 1 hour regardless of changes. Simple, predictable, but users might see stale data for up to an hour.
Good for: Data that changes rarely, where slight staleness is acceptable.
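A minimal sketch of the pattern, using generic names rather than our actual helpers:
`typescript
// Minimal TTL-only read-through cache (illustrative names, not production code)
const ttlCache = new Map<string, { value: any; timestamp: number }>();

async function getWithTtl<T>(key: string, ttlMs: number, fetchFn: () => Promise<T>): Promise<T> {
  const entry = ttlCache.get(key);
  if (entry && Date.now() - entry.timestamp < ttlMs) {
    return entry.value; // Still fresh - serve from cache
  }

  // Expired or missing - refetch and reset the clock
  const value = await fetchFn();
  ttlCache.set(key, { value, timestamp: Date.now() });
  return value;
}
`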
2. Event-based
When someone updates the knowledge base, clear all knowledge caches immediately.
`typescript
// In the admin update function
async function updateKnowledgeEntry(id: string, content: string) {
  await supabase
    .from('knowledge')
    .update({ content })
    .eq('id', id);

  // Regenerate the embedding for this entry
  await regenerateEmbedding(id);

  // Clear ALL knowledge caches
  knowledgeCache.clear();

  // Log the invalidation
  logger.info('Knowledge cache invalidated', {
    trigger: 'entry_update',
    entryId: id
  });
}
`
Good for: Data where staleness causes real problems.
3. Hybrid (what we use)
Short TTL (1 hour) as a safety net, plus event-based invalidation for immediate consistency when updates happen.
`typescript
// Clear on any knowledge change
supabase
  .channel('knowledge_changes')
  .on('postgres_changes',
    { event: '*', schema: 'public', table: 'knowledge' },
    () => knowledgeCache.clear()
  )
  .subscribe();
`
What to Cache (And What Not To)
After implementing caching across ShoreAgents and BPOC, here's what I've learned:
Great candidates for caching:
| Data | TTL | Why |
|------|-----|-----|
| Knowledge base results | 1 hour | Changes rarely, searched constantly |
| Pricing engine output | 24 hours | Static unless multipliers change |
| Role salary data | 4 hours | External API, rate limited |
| User profile (public) | 15 min | Read often, updated sometimes |
| API config/settings | Until restart | Changes require deploy anyway |
Bad candidates for caching:
| Data | Why |
|------|-----|
| Authentication state | Security risk if stale |
| Real-time chat | Users expect instant updates |
| Quote in progress | User actively modifying |
| Lead analytics | Need accurate real-time counts |
| Financial transactions | Consistency critical |
Cache Key Design
Good cache keys are:
- Deterministic (same input = same key)
- Collision-free (different input = different key)
- Human-readable (for debugging)
Our pattern:
`
{namespace}:{entity}:{identifier}:{version}
`
Examples:
`
knowledge:search:whats-your-pricing:v1
pricing:role:virtual-assistant:php:v2
user:profile:user_123:v1
maya:session:ses_abc:context:v1
`
The version suffix lets us invalidate all caches of a type by bumping the version:
`typescript
const CACHE_VERSION = 'v2'; // Bump this to invalidate all

function cacheKey(namespace: string, entity: string, id: string) {
  return `${namespace}:${entity}:${id}:${CACHE_VERSION}`;
}
`
Changed how pricing works? Bump to v3. All old pricing caches become orphans and expire naturally.
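A quick illustration with the helper above (the pricing key is just an example):
`typescript
// With CACHE_VERSION = 'v2'
cacheKey('pricing', 'role', 'virtual-assistant');
// => 'pricing:role:virtual-assistant:v2'

// Bump CACHE_VERSION to 'v3' and the same call produces a new key;
// nothing reads the v2 entries again, so they simply age out
cacheKey('pricing', 'role', 'virtual-assistant');
// => 'pricing:role:virtual-assistant:v3'
`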
Multi-Layer Caching
For high-traffic data, we use multiple cache layers:
`
Request → Memory Cache → Redis Cache → Database
          (fastest)      (shared)      (source of truth)
`
`typescript
async function getUserProfile(userId: string) {
  // Layer 1: In-memory (per-instance)
  const memKey = `user:profile:${userId}`;
  const memCached = memoryCache.get(memKey);
  if (memCached) return memCached;

  // Layer 2: Redis (shared across instances)
  const redisCached = await redis.get(memKey);
  if (redisCached) {
    const parsed = JSON.parse(redisCached);
    memoryCache.set(memKey, parsed, { ttl: 60 }); // Backfill memory
    return parsed;
  }

  // Layer 3: Database (source of truth)
  const user = await db.users.get(userId);

  // Populate both caches
  await redis.setex(memKey, 300, JSON.stringify(user)); // 5 min
  memoryCache.set(memKey, user, { ttl: 60 }); // 1 min

  return user;
}
`
- Memory cache: 1 minute TTL, fastest, per-instance
- Redis cache: 5 minute TTL, shared across all instances
- Database: source of truth, always correct
Most requests hit memory. Memory misses hit Redis. Redis misses hit database. Database is rarely touched for hot data.
The Thundering Herd Problem
Imagine cache expires. 100 concurrent requests all see cache miss. 100 requests all hit the database simultaneously. Database falls over.
Solution: Cache stampede protection.
`typescript
const locks = new Map<string, Promise<any>>();

async function getWithLock<T>(key: string, fetchFn: () => Promise<T>): Promise<T> {
  // If someone else is already fetching this key, wait for their result
  const existing = locks.get(key);
  if (existing) return existing;

  // We're the one who fetches
  const promise = fetchFn()
    .then(result => {
      cache.set(key, result);
      return result;
    })
    .finally(() => locks.delete(key)); // Release the lock even if the fetch fails

  locks.set(key, promise);
  return promise;
}
`
First request fetches, others wait. Only one database hit regardless of concurrent demand.
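A rough usage sketch (searchKnowledgeProtected is a hypothetical wrapper, not code from the app): the lock simply sits in front of the existing search.
`typescript
// Hypothetical wrapper: concurrent misses for the same normalized query
// share one embedding call + vector search instead of firing in parallel
function searchKnowledgeProtected(query: string) {
  return getWithLock(normalizeQuery(query), () => searchKnowledge(query));
}
`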
Monitoring Cache Health
A cache without metrics is a mystery box. We track:
`typescript
const cacheMetrics = {
  hits: 0,
  misses: 0,
  errors: 0,
  latency_ms: [] as number[]
};

async function cachedFetch<T>(key: string, fetchFn: () => Promise<T>): Promise<T> {
  const start = Date.now();

  // Serve from cache when possible
  const cached = cache.get(key);
  if (cached !== undefined) {
    cacheMetrics.hits++;
    return cached;
  }

  // Cache miss - fetch, store, and record how long it took
  cacheMetrics.misses++;
  try {
    const result = await fetchFn();
    cache.set(key, result);
    cacheMetrics.latency_ms.push(Date.now() - start);
    return result;
  } catch (error) {
    cacheMetrics.errors++;
    throw error;
  }
}
`
Dashboard shows:
- Hit rate (should be >70% for hot data)
- Miss rate (spikes indicate invalidation or cold start)
- Error rate (the cache itself failing)
- P50/P95 latency (are we actually faster?)
If hit rate drops suddenly, something's wrong with our invalidation. If latency spikes, maybe Redis is overloaded.
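A minimal sketch of how those numbers can be derived from the counters above (the snapshot helper is illustrative, not our dashboard code):
`typescript
// Illustrative only: derive dashboard numbers from the raw counters
function cacheHealthSnapshot() {
  const total = cacheMetrics.hits + cacheMetrics.misses;
  const sorted = [...cacheMetrics.latency_ms].sort((a, b) => a - b);
  const p95Index = Math.max(0, Math.ceil(sorted.length * 0.95) - 1);

  return {
    hitRate: total ? cacheMetrics.hits / total : 0,
    missRate: total ? cacheMetrics.misses / total : 0,
    errorRate: total ? cacheMetrics.errors / total : 0,
    p95LatencyMs: sorted.length ? sorted[p95Index] : 0
  };
}
`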
Lessons Learned
After implementing caching across ShoreAgents:
1. Start with measurement. I didn't guess that knowledge search was slow - I measured it. Don't cache blindly.
2. Cache at the right layer. API responses? Database queries? Computed results? Each has different invalidation needs.
3. TTL is not a strategy. "It'll expire eventually" is not cache invalidation. Know exactly when your data becomes stale.
4. Every cache is a lie. You're telling users "this is current" when it might not be. Make sure it's a small lie.
5. Simple wins. In-memory Map with TTL covers 80% of use cases. Don't reach for Redis until you need shared state across instances.
The Maya knowledge search went from 200ms to 40ms average. Quote generation (with cached salary data) went from 12 seconds to 4 seconds. All because we stopped hitting the database for data we'd already fetched.
Every cache is a lie you're telling users about the current state of data. Make sure it's a small lie with a short lifespan.

