LLM Costs: Cut Bills by 73% with Semantic Caching

by priyanka.patel, tech editor

Large Language Model (LLM) costs were ballooning—increasing 30% month-over-month—even though traffic hadn’t surged at the same rate. The culprit? Users were asking the same questions in countless different ways. “What’s your return policy?”, “How do I return something?”, and “Can I get a refund?” each triggered separate LLM calls, generating nearly identical responses and racking up hefty API bills.


A smarter caching strategy can dramatically reduce expenses when working with large language models.

Traditional exact-match caching captured only 18% of these redundant requests. Slightly rephrased questions bypassed the cache entirely. By implementing semantic caching—a system that focuses on the meaning of queries, not just the words—we boosted our cache hit rate to 67%, slashing LLM API costs by 73%. But achieving those savings required overcoming some unexpected hurdles.

Why Exact-Match Caching Falls Short

Conventional caching relies on the query text as the cache key. This works perfectly when queries are identical:

def get_cached(query_text: str, cache: dict):
    cache_key = hash(query_text)
    if cache_key in cache:
        return cache[cache_key]
    return None
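
To see why rephrasings defeat exact-match keys, consider the three paraphrases from the opening example (a minimal sketch using SHA-256 as the hash; any text hash behaves the same way):

```python
import hashlib

queries = [
    "What's your return policy?",
    "How do I return something?",
    "Can I get a refund?",
]

# Each paraphrase hashes to a distinct key, so none reuses another's cache entry.
keys = {hashlib.sha256(q.encode("utf-8")).hexdigest() for q in queries}
print(len(keys))  # 3 distinct keys for 3 phrasings of one intent
```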

However, users rarely phrase questions the same way. An analysis of 100,000 production queries revealed:

  • Only 18% were exact duplicates.
  • 47% were semantically similar (same intent, different wording).
  • 35% were genuinely novel queries.

That 47% represented a significant opportunity for cost savings. Each semantically similar query triggered a full LLM call, producing a response nearly identical to one already computed and cached.

Semantic Caching Architecture

Semantic caching replaces text-based keys with embedding-based similarity lookup:

from datetime import datetime
from typing import Optional
from uuid import uuid4

def generate_id() -> str:
    """Unique key linking a vector index entry to its stored response."""
    return uuid4().hex

class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.vector_store = VectorStore()  # FAISS, Pinecone, etc.
        self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

    def get(self, query: str) -> Optional[str]:
        """Return cached response if semantically similar query exists."""
        query_embedding = self.embedding_model.encode(query)
        # Find most similar cached query
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= self.threshold:
            cache_id = matches[0].id
            return self.response_store.get(cache_id)
        return None

    def set(self, query: str, response: str):
        """Cache query-response pair."""
        query_embedding = self.embedding_model.encode(query)
        cache_id = generate_id()
        self.vector_store.add(cache_id, query_embedding)
        self.response_store.set(cache_id, {
            'query': query,
            'response': response,
            'timestamp': datetime.utcnow()
        })

The core idea is to embed queries into a vector space and find cached queries within a defined similarity threshold, rather than relying on exact text matches.
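
The VectorStore and ResponseStore above are placeholders. As a self-contained illustration of the same lookup logic, here is a toy in-memory variant that substitutes bag-of-words counts for a real embedding model (the embedding and threshold here are illustrative only):

```python
import math
from typing import Optional

def toy_embed(text: str) -> dict:
    """Bag-of-words counts as a stand-in for a real embedding model."""
    vec: dict = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class InMemorySemanticCache:
    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold
        self.entries: list = []  # (embedding, response) pairs

    def get(self, query: str) -> Optional[str]:
        if not self.entries:
            return None
        q = toy_embed(query)
        # Find the most similar cached query, then apply the threshold.
        best = max(self.entries, key=lambda entry: cosine(q, entry[0]))
        if cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def set(self, query: str, response: str) -> None:
        self.entries.append((toy_embed(query), response))
```

With a real embedding model, near-paraphrases land much closer together than bag-of-words vectors do, which is what makes thresholds in the 0.88-0.97 range meaningful.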

The Threshold Problem

The similarity threshold is a critical parameter. Set it too high, and you miss valid cache hits. Set it too low, and you risk returning incorrect responses. An initial threshold of 0.85 seemed reasonable—85% similar should equate to “the same question,” right?

Wrong. At 0.85, we encountered cache hits that were demonstrably incorrect.

We discovered that optimal thresholds vary depending on the query type:

Query type              Optimal threshold   Rationale
FAQ-style questions     0.94                High precision needed; wrong answers damage trust
Product searches        0.88                More tolerance for near-matches
Support queries         0.92                Balance between coverage and accuracy
Transactional queries   0.97                Very low tolerance for errors

We implemented query-type-specific thresholds:

class AdaptiveSemanticCache(SemanticCache):
    def __init__(self, embedding_model):
        # Reuse the base cache's embedding model and stores.
        super().__init__(embedding_model)
        self.thresholds = {
            'faq': 0.94,
            'search': 0.88,
            'support': 0.92,
            'transactional': 0.97,
            'default': 0.92
        }
        self.query_classifier = QueryClassifier()

    def get_threshold(self, query: str) -> float:
        query_type = self.query_classifier.classify(query)
        return self.thresholds.get(query_type, self.thresholds['default'])

    def get(self, query: str) -> Optional[str]:
        threshold = self.get_threshold(query)
        query_embedding = self.embedding_model.encode(query)
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= threshold:
            return self.response_store.get(matches[0].id)
        return None
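
QueryClassifier is not shown in the article; a minimal keyword-rule stand-in (the categories match the threshold table above, but the keywords are illustrative assumptions) shows the shape of that dependency:

```python
class KeywordQueryClassifier:
    """Toy stand-in for QueryClassifier: substring rules, first match wins."""

    RULES = {
        'transactional': ('order', 'payment', 'charge', 'refund'),
        'faq': ('policy', 'hours', 'shipping'),
        'support': ('help', 'error', 'broken', 'not working'),
    }

    def classify(self, query: str) -> str:
        q = query.lower()
        for query_type, keywords in self.RULES.items():
            if any(keyword in q for keyword in keywords):
                return query_type
        return 'default'
```

In production you would likely replace this with a small trained classifier; the cache only needs a `classify(query) -> str` method.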

Threshold Tuning Methodology

Tuning thresholds required ground truth data—identifying which query pairs truly had the same intent. Our methodology involved:

  1. Sampling query pairs: We sampled 5,000 query pairs at various similarity levels (0.80-0.99).
  2. Human labeling: Annotators labeled each pair as “same intent” or “different intent.” We used three annotators per pair and took a majority vote.
  3. Computing precision/recall curves: For each threshold, we calculated precision (of cache hits, what fraction had the same intent?) and recall (of same-intent pairs, what fraction did we cache-hit?).

We then selected thresholds based on the cost of errors. For FAQ queries, where incorrect answers erode trust, we prioritized precision (0.94 threshold yielded 98% precision). For search queries, where a missed cache hit simply costs money, we optimized for recall (0.88 threshold).
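
Step 3 of the methodology can be sketched as a precision/recall sweep over the labeled pairs (the data layout here is an assumption; the article does not show code):

```python
def precision_recall(pairs, threshold):
    """pairs: list of (similarity, same_intent) tuples from human labeling."""
    predicted_hits = [(s, same) for s, same in pairs if s >= threshold]
    tp = sum(1 for _, same in predicted_hits if same)          # correct cache hits
    fp = len(predicted_hits) - tp                              # wrong-answer hits
    fn = sum(1 for s, same in pairs if same and s < threshold) # missed hits
    precision = tp / (tp + fp) if predicted_hits else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labeled sample: high-similarity pairs are usually, but not always, same-intent.
labeled = [(0.98, True), (0.95, True), (0.93, False), (0.90, True), (0.86, False)]
print(precision_recall(labeled, 0.92))
```

Sweeping `threshold` over 0.80-0.99 and plotting the two curves is what lets you trade precision against recall per query type.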

Latency Overhead

Semantic caching introduces latency—you must embed the query and search the vector store before determining whether to call the LLM.

Operation            Latency (p50)   Latency (p99)
Query embedding      12 ms           28 ms
Vector search        8 ms            19 ms
Total cache lookup   20 ms           47 ms
LLM API call         850 ms          2,400 ms

The 20ms overhead is minimal compared to the 850ms LLM call avoided on cache hits. Even at the 99th percentile, the 47ms overhead is acceptable. Cache misses now take 20ms longer, but at a 67% hit rate, the net result is a 65% improvement in overall latency alongside the cost reduction.
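
The latency claim checks out arithmetically: at a 67% hit rate, expected per-request latency is a weighted average of the hit path (lookup only) and the miss path (lookup plus LLM call), using the p50 figures above:

```python
hit_rate = 0.67
lookup_ms = 20   # embedding + vector search (p50)
llm_ms = 850     # LLM API call (p50)

# Hits pay only the lookup; misses pay the lookup plus the LLM call.
expected_ms = hit_rate * lookup_ms + (1 - hit_rate) * (lookup_ms + llm_ms)
reduction = 1 - expected_ms / llm_ms
print(round(expected_ms), round(reduction * 100))  # ~300 ms, ~65% reduction
```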

Cache Invalidation

Cached responses become stale. Product information changes, policies are updated, and yesterday’s correct answer can become today’s misinformation. We implemented three invalidation strategies:

  1. Time-based TTL: Simple expiration based on content type.
  2. Event-based invalidation: Invalidating related cache entries when underlying data changes.
  3. Staleness detection: Periodically re-running queries against current data to identify and invalidate outdated responses.
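
Strategy 1 can be sketched as per-content-type TTLs (the TTL values and content types here are illustrative assumptions, not the article's):

```python
from datetime import datetime, timedelta, timezone

# Illustrative TTLs: more volatile content expires faster.
TTL_BY_CONTENT_TYPE = {
    "faq": timedelta(days=7),
    "product": timedelta(hours=6),
    "transactional": timedelta(minutes=5),
}
DEFAULT_TTL = timedelta(hours=1)

def is_fresh(content_type: str, cached_at: datetime, now: datetime = None) -> bool:
    """Return True if the cache entry is still within its TTL."""
    now = now or datetime.now(timezone.utc)
    ttl = TTL_BY_CONTENT_TYPE.get(content_type, DEFAULT_TTL)
    return now - cached_at <= ttl
```

Stores like Redis support native key expiry, so in practice the TTL is usually set at write time rather than checked at read time.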

Production Results

After three months in production:

Metric                               Before       After          Change
Cache hit rate                       18%          67%            +272%
LLM API costs                        $47K/month   $12.7K/month   -73%
Average latency                      850 ms       300 ms         -65%
False-positive rate                  N/A          0.8%           n/a
Customer complaints (wrong answers)  baseline     +0.3%          minimal increase

The 0.8% false-positive rate was acceptable. These cases occurred primarily at the boundaries of our thresholds, where similarity was just above the cutoff but intent differed slightly.

Key Takeaways

  • Semantic caching is a practical way to control LLM costs by capturing redundancy that exact-match caching misses.
  • Threshold tuning is crucial; use query-type-specific thresholds based on precision/recall analysis.
  • Cache invalidation is essential. Combine TTL, event-based, and staleness detection to maintain data accuracy.

At a 73% cost reduction, this was our highest-ROI optimization for production LLM systems. Implementation complexity is moderate, but careful threshold tuning is vital to avoid degrading response quality.
