Semantic Caching Cuts LLM Costs by 73%
A smarter caching strategy can dramatically reduce expenses when working with large language models.
Our Large Language Model (LLM) costs were ballooning—increasing 30% month-over-month—even though traffic hadn’t surged at the same rate. The culprit? Users were asking the same questions in countless different ways. “What’s your return policy?”, “How do I return something?”, and “Can I get a refund?” each triggered separate LLM calls, generating nearly identical responses and racking up hefty API bills.
Traditional, exact-match caching only captured 18% of these redundant requests. Slightly rephrased questions bypassed the cache entirely. By implementing semantic caching—a system that focuses on the meaning of queries, not just the words—we boosted our cache hit rate to 67%, slashing LLM API costs by 73%. But achieving those savings required overcoming some unexpected hurdles.
Why Exact-Match Caching Falls Short
Conventional caching relies on the query text as the cache key. This works perfectly when queries are identical:
```python
cache_key = hash(query_text)
if cache_key in cache:
    return cache[cache_key]
```
However, users rarely phrase questions the same way. An analysis of 100,000 production queries revealed:
- Only 18% were exact duplicates.
- 47% were semantically similar (same intent, different wording).
- 35% were genuinely novel queries.
That 47% represented a significant opportunity for cost savings. Each semantically similar query triggered a full LLM call, producing a response nearly identical to one already computed and cached.
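A minimal sketch makes the failure mode concrete; the queries and the stored response here are illustrative, not taken from our production data:

```python
# Exact-match caching: any rewording produces a different key.
cache = {}

def cache_store(query: str, response: str):
    # Even with normalization, the full text must match exactly.
    cache[hash(query.strip().lower())] = response

def cached_lookup(query: str):
    return cache.get(hash(query.strip().lower()))

cache_store("What's your return policy?", "You have 30 days to return items.")

# Same intent, different wording -> cache miss, full LLM call.
print(cached_lookup("How do I return something?"))  # None
```

Any phrasing that differs by a single word falls through to the LLM, which is exactly the 47% of traffic an exact-match cache cannot see.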
Semantic Caching Architecture
Semantic caching replaces text-based keys with embedding-based similarity lookup:
```python
from datetime import datetime, timezone
from typing import Optional

class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.vector_store = VectorStore()      # FAISS, Pinecone, etc.
        self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

    def get(self, query: str) -> Optional[str]:
        """Return cached response if a semantically similar query exists."""
        query_embedding = self.embedding_model.encode(query)
        # Find the most similar cached query
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= self.threshold:
            cache_id = matches[0].id
            return self.response_store.get(cache_id)
        return None

    def set(self, query: str, response: str):
        """Cache a query-response pair."""
        query_embedding = self.embedding_model.encode(query)
        cache_id = generate_id()
        self.vector_store.add(cache_id, query_embedding)
        self.response_store.set(cache_id, {
            'query': query,
            'response': response,
            'timestamp': datetime.now(timezone.utc)
        })
```
The core idea is to embed queries into a vector space and find cached queries within a defined similarity threshold, rather than relying on exact text matches.
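The similarity score in the lookup above is typically cosine similarity between embedding vectors. A minimal, self-contained sketch (the toy 3-dimensional vectors are illustrative; real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing in nearly the same direction score close to 1.0.
print(round(cosine_similarity([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]), 3))  # 0.999
# Orthogonal vectors score 0.0.
print(round(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 3))  # 0.0
```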
The Threshold Problem
The similarity threshold is a critical parameter. Set it too high, and you miss valid cache hits. Set it too low, and you risk returning incorrect responses. An initial threshold of 0.85 seemed reasonable—85% similar should equate to “the same question,” right?
Wrong. At 0.85, we encountered cache hits that were demonstrably incorrect.
We discovered that optimal thresholds vary depending on the query type:
| Query type | Optimal threshold | Rationale |
| --- | --- | --- |
| FAQ-style questions | 0.94 | High precision needed; wrong answers damage trust |
| Product searches | 0.88 | More tolerance for near-matches |
| Support queries | 0.92 | Balance between coverage and accuracy |
| Transactional queries | 0.97 | Very low tolerance for errors |
We implemented query-type-specific thresholds:
```python
class AdaptiveSemanticCache(SemanticCache):
    def __init__(self, embedding_model):
        # Inherit the embedding model, vector store, and response store.
        super().__init__(embedding_model)
        self.thresholds = {
            'faq': 0.94,
            'search': 0.88,
            'support': 0.92,
            'transactional': 0.97,
            'default': 0.92
        }
        self.query_classifier = QueryClassifier()

    def get_threshold(self, query: str) -> float:
        query_type = self.query_classifier.classify(query)
        return self.thresholds.get(query_type, self.thresholds['default'])

    def get(self, query: str) -> Optional[str]:
        threshold = self.get_threshold(query)
        query_embedding = self.embedding_model.encode(query)
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= threshold:
            return self.response_store.get(matches[0].id)
        return None
```
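The `QueryClassifier` itself is not specified here. As one hypothetical stand-in, a keyword-based classifier works for a first pass (the class name, keyword lists, and bucket assignments below are all assumptions for illustration; a production system would more likely use a small trained model):

```python
# Hypothetical keyword-based stand-in for QueryClassifier.
class QueryClassifier:
    KEYWORDS = {
        'transactional': ('order', 'payment', 'cancel', 'invoice'),
        'support': ('help', 'issue', 'problem', 'broken'),
        'search': ('find', 'show', 'looking for'),
    }

    def classify(self, query: str) -> str:
        q = query.lower()
        for query_type, words in self.KEYWORDS.items():
            if any(w in q for w in words):
                return query_type
        return 'default'  # falls through to the default threshold

clf = QueryClassifier()
print(clf.classify("Cancel my order"))       # transactional
print(clf.classify("What are your hours?"))  # default
```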
Threshold Tuning Methodology
Tuning thresholds required ground truth data—identifying which query pairs truly had the same intent. Our methodology involved:
- Sampling query pairs: We sampled 5,000 query pairs at various similarity levels (0.80-0.99).
- Human labeling: Annotators labeled each pair as “same intent” or “different intent.” We used three annotators per pair and took a majority vote.
- Computing precision/recall curves: For each threshold, we calculated precision (of cache hits, what fraction had the same intent?) and recall (of same-intent pairs, what fraction did we cache-hit?).
We then selected thresholds based on the cost of errors. For FAQ queries, where incorrect answers erode trust, we prioritized precision (0.94 threshold yielded 98% precision). For search queries, where a missed cache hit simply costs money, we optimized for recall (0.88 threshold).
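The precision/recall sweep in the last step can be sketched as follows; the labeled pairs here are illustrative, not the actual 5,000-pair dataset:

```python
def precision_recall_at(threshold, pairs):
    """pairs: (similarity, same_intent) tuples from human labeling.
    Precision: of pairs above threshold, what fraction share intent.
    Recall: of same-intent pairs, what fraction land above threshold."""
    hits = [(s, same) for s, same in pairs if s >= threshold]
    tp = sum(1 for _, same in hits if same)
    same_total = sum(1 for _, same in pairs if same)
    precision = tp / len(hits) if hits else 1.0
    recall = tp / same_total if same_total else 0.0
    return precision, recall

# Illustrative labeled sample: (cosine similarity, annotator majority vote).
pairs = [
    (0.98, True), (0.95, True), (0.93, True), (0.91, False),
    (0.89, True), (0.87, False), (0.84, False), (0.82, False),
]

for t in (0.85, 0.90, 0.95):
    p, r = precision_recall_at(t, pairs)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold trades recall for precision, which is why error-sensitive query types (FAQ, transactional) sit at the high end of the table above.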
Latency Overhead
Semantic caching introduces latency—you must embed the query and search the vector store before determining whether to call the LLM.
| Operation | Latency (p50) | Latency (p99) |
| --- | --- | --- |
| Query embedding | 12ms | 28ms |
| Vector search | 8ms | 19ms |
| Total cache lookup | 20ms | 47ms |
| LLM API call | 850ms | 2400ms |
The 20ms overhead is minimal compared to the 850ms LLM call avoided on cache hits. Even at the 99th percentile, the 47ms overhead is acceptable. Cache misses now take 20ms longer, but at a 67% hit rate, the net result is a 65% improvement in overall latency alongside the cost reduction.
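The net-latency figure can be sanity-checked with a simple expected-value calculation over the p50 numbers above:

```python
hit_rate = 0.67
lookup_ms = 20    # embedding + vector search (p50)
llm_ms = 850      # LLM API call (p50)

# Hits pay only the lookup; misses pay the lookup plus the LLM call.
expected_ms = hit_rate * lookup_ms + (1 - hit_rate) * (lookup_ms + llm_ms)
improvement = 1 - expected_ms / llm_ms

print(f"expected latency = {expected_ms:.0f}ms, improvement = {improvement:.0%}")
# expected latency = 300ms, improvement = 65%
```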
Cache Invalidation
Cached responses become stale. Product information changes, policies are updated, and yesterday’s correct answer can become today’s misinformation. We implemented three invalidation strategies:
- Time-based TTL: Simple expiration based on content type.
- Event-based invalidation: Invalidating related cache entries when underlying data changes.
- Staleness detection: Periodically re-running queries against current data to identify and invalidate outdated responses.
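The first two strategies can be layered over a cache entry roughly as follows; the per-content-type TTL values and tag-based event invalidation are assumptions for illustration, not the exact production design:

```python
import time

# Assumed per-content-type TTLs in seconds (illustrative values).
TTLS = {'product': 3600, 'policy': 86400, 'default': 6 * 3600}

class InvalidatingCache:
    def __init__(self):
        self.entries = {}   # cache_id -> (response, expires_at, tags)
        self.by_tag = {}    # tag -> set of cache_ids

    def set(self, cache_id, response, content_type='default', tags=()):
        expires_at = time.time() + TTLS.get(content_type, TTLS['default'])
        self.entries[cache_id] = (response, expires_at, tags)
        for tag in tags:
            self.by_tag.setdefault(tag, set()).add(cache_id)

    def get(self, cache_id):
        entry = self.entries.get(cache_id)
        if entry is None:
            return None
        response, expires_at, _ = entry
        if time.time() >= expires_at:      # time-based TTL
            del self.entries[cache_id]
            return None
        return response

    def invalidate_tag(self, tag):
        """Event-based: drop every entry touching the changed data."""
        for cache_id in self.by_tag.pop(tag, set()):
            self.entries.pop(cache_id, None)

cache = InvalidatingCache()
cache.set('a1', '30-day returns', content_type='policy', tags=('returns-policy',))
cache.invalidate_tag('returns-policy')   # policy changed upstream
print(cache.get('a1'))  # None
```

Staleness detection, the third strategy, runs offline against current data and calls the same invalidation path when a cached answer no longer matches.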
Production Results
After three months in production:
| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Cache hit rate | 18% | 67% | +272% |
| LLM API costs | $47K/month | $12.7K/month | -73% |
| Average latency | 850ms | 300ms | -65% |
| False-positive rate | N/A | 0.8% | — |
| Customer complaints (wrong answers) | Baseline | +0.3% | Minimal increase |
The 0.8% false-positive rate was acceptable. These cases occurred primarily at the boundaries of our thresholds, where similarity was just above the cutoff but intent differed slightly.
Key Takeaways
- Semantic caching is a practical way to control LLM costs by capturing redundancy that exact-match caching misses.
- Threshold tuning is crucial; use query-type-specific thresholds based on precision/recall analysis.
- Cache invalidation is essential. Combine TTL, event-based, and staleness detection to maintain data accuracy.
At a 73% cost reduction, this was our highest-ROI optimization for production LLM systems. Implementation complexity is moderate, but careful threshold tuning is vital to avoid degrading response quality.
