Vedat Erenoglu

Prompt Caching: One of the Most Underrated Optimizations in LLM Systems

Most discussions around Large Language Models (LLMs) focus on better prompts, smarter models, or larger context windows. But there is a less-discussed optimization that can dramatically improve both speed and cost: prompt caching.

As of 2025, prompt caching has shifted from a niche optimization to a critical architectural requirement for high-performance AI systems. Major providers like Anthropic, OpenAI, DeepSeek, and Google have all rolled out caching implementations that can reduce API costs by up to 90% and latency by 80% for long-context tasks (Anthropic, 2024; OpenAI, 2024).

If you are building AI automation systems, AI agents, or RAG (Retrieval-Augmented Generation) applications, understanding prompt caching can make the difference between a profitable product and an unsustainable prototype.

What Prompt Caching Is Not

Prompt caching is often confused with output caching (or semantic caching). It is crucial to distinguish between the two.

Output caching works like a traditional database cache:

  1. A user sends a query (e.g., "What is the capital of France?").
  2. The system processes the query and generates an answer.
  3. The result is stored in a cache (e.g., Redis).
  4. If someone asks the exact same question again, the system returns the stored result instantly without calling the LLM.

This approach works well for static FAQs or deterministic APIs. However, with LLMs, it is limited because:

  • Prompts vary slightly: A user might ask, "Tell me about Paris" instead of "Capital of France."
  • Context evolves: In a chat, the history changes with every turn.
  • Dynamic generation: You might want the model to generate a fresh email draft, not reuse an old one.

So while output caching saves some calls, prompt caching optimizes the processing of the calls you actually make.
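The output-caching flow described above can be sketched in a few lines. A plain dict stands in for Redis, and `call_llm` is a placeholder for the real (expensive) model call:

```python
# Minimal sketch of output caching: the full response is stored
# keyed by the exact query string. A dict stands in for Redis.
cache: dict[str, str] = {}

def call_llm(query: str) -> str:
    # Placeholder for a real (expensive) model call.
    return f"answer to: {query}"

def answer(query: str) -> str:
    if query in cache:                 # exact-match hit: no LLM call
        return cache[query]
    result = call_llm(query)           # miss: pay for a full generation
    cache[query] = result
    return result

answer("What is the capital of France?")   # miss -> model call
answer("What is the capital of France?")   # hit  -> served from cache
answer("Tell me about Paris")              # semantically similar -> still a miss
```

Note that the third call misses even though it is semantically close to the first: exact-match output caching is blind to paraphrase, which is exactly the limitation listed above.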

What Actually Happens Inside an LLM

To understand prompt caching, you need to look inside the "black box."

When a prompt is sent to an LLM, the model does not immediately produce an answer. Instead, it first processes the entire prompt—token by token—and computes Key-Value (KV) pairs across its transformer layers.

These KV pairs represent the model’s internal "understanding" of the text:

  • How tokens relate to each other (attention mechanism).
  • What context is important.
  • Syntactic and semantic structures.

This step is called the prefill phase. It happens before the model generates the first token of output. For long prompts (e.g., a 50-page PDF), this prefill phase is computationally expensive, accounting for a significant portion of the latency and cost (Google DeepMind, 2024).
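A toy illustration of why prefill work is reusable: in a causal transformer, the KV entries at position i depend only on tokens 0..i, never on what comes after. So two prompts that share a prefix produce identical KVs over that prefix. The "KV" below is a stand-in hash, not real attention math:

```python
import hashlib

def prefill(tokens: list[str]) -> list[str]:
    """Toy prefill: the 'KV entry' at position i depends only on
    tokens[0..i], mirroring causal attention. Real KVs are tensors
    per layer; a short hash stands in here."""
    kvs, running = [], ""
    for tok in tokens:
        running += tok + "|"
        kvs.append(hashlib.sha256(running.encode()).hexdigest()[:8])
    return kvs

doc = ["<sys>", "You", "are", "a", "senior", "legal", "analyst"]
kv_a = prefill(doc + ["Does", "4.2", "apply?"])
kv_b = prefill(doc + ["Summarize", "section", "9"])

# KVs over the shared prefix are identical -> safe to cache and reuse;
# only the suffix positions differ between the two prompts.
assert kv_a[:len(doc)] == kv_b[:len(doc)]
assert kv_a[len(doc):] != kv_b[len(doc):]
```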

Where Prompt Caching Comes In

Prompt caching stores these precomputed KV pairs in memory (VRAM or high-speed storage). So when the same prompt (or a significant prefix of it) appears again, the model does not need to recompute the math.

Instead, it:

  1. Identifies the matching prefix.
  2. Loads the cached KV pairs instantly.
  3. Processes only the new part of the prompt (the user's unique question).
  4. Generates the response immediately.
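The four steps above can be sketched as follows. `compute` stands in for the expensive per-token KV math; real systems match and store cache entries block-wise rather than per token, but the logic is the same:

```python
def compute(token: str) -> str:
    return f"kv({token})"              # stand-in for expensive per-token math

kv_cache: dict[tuple, list] = {}       # token prefix -> its precomputed KVs

def prefill_with_cache(tokens: list[str]) -> int:
    # 1. Identify the longest cached prefix of this prompt.
    best = ()
    for prefix in kv_cache:
        if len(prefix) > len(best) and tokens[:len(prefix)] == list(prefix):
            best = prefix
    # 2. Load its KV pairs instead of recomputing them.
    kvs = list(kv_cache[best]) if best else []
    # 3. Process only the new suffix of the prompt.
    new = tokens[len(best):]
    kvs.extend(compute(t) for t in new)
    # Store every prefix so future prompts can partially match
    # (real systems do this block-wise, not per token).
    for i in range(1, len(tokens) + 1):
        kv_cache[tuple(tokens[:i])] = kvs[:i]
    # 4. Generation would start here; return how much work was done.
    return len(new)

work1 = prefill_with_cache(["big", "doc", "question", "one"])
work2 = prefill_with_cache(["big", "doc", "question", "two"])
# work1 == 4 (cold start), work2 == 1 (only the new final token)
```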

The Impact: Speed and Cost

The benefits are not theoretical. In published provider benchmarks:

  • Latency: Time-to-First-Token (TTFT) can drop from 10+ seconds to <500ms for large contexts (DeepSeek, 2025).
  • Cost: Providers like Anthropic charge 10% of the standard input price for cached tokens ($0.30/1M vs $3.00/1M for Claude 3.5 Sonnet) (Anthropic Pricing, 2025).

Why Prompt Caching Matters in Real Systems

For short prompts, the benefit of caching is negligible: processing "What is the capital of France?" (~10 tokens) is effectively instant.

But consider a production-grade AI Agent or RAG System. A typical prompt structure looks like this:

  1. System Instructions (2k tokens): "You are a senior legal analyst..."
  2. Knowledge Base (50k tokens): The full text of a merger agreement or technical manual.
  3. Few-Shot Examples (1k tokens): 5-10 examples of correct outputs.
  4. User Question (50 tokens): "Does section 4.2 apply here?"

Without Caching: The model must process 53,050 tokens for every single question. If a user asks 10 follow-up questions, you pay for ~530k tokens of processing.

With Prompt Caching: The system caches the first 53k tokens once.

  • Question 1: You pay for the cache write (usually standard price + surcharge).
  • Questions 2–10: You pay only for the 50 new tokens plus the cheap "cache read" fee for the context.
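The arithmetic behind that saving can be checked directly. The sketch below uses the Claude 3.5 Sonnet prices quoted above ($3.00/1M standard input, $0.30/1M cache reads) and assumes a 25% cache-write surcharge, Anthropic's published rate for the default 5-minute cache; it also ignores growth of the conversation history between turns:

```python
# Worked cost comparison for the 10-question scenario above.
PRICE_IN = 3.00 / 1_000_000        # $/token, standard input
PRICE_READ = 0.30 / 1_000_000      # $/token, cache read (10% of standard)
PRICE_WRITE = PRICE_IN * 1.25      # $/token, cache write (25% surcharge)

context, question, turns = 53_000, 50, 10

# Without caching: the full context is reprocessed on every question.
no_cache = turns * (context + question) * PRICE_IN

# With caching: pay the write once, then cheap reads plus new tokens.
cached = (context * PRICE_WRITE + question * PRICE_IN          # turn 1
          + (turns - 1) * (context * PRICE_READ + question * PRICE_IN))

print(f"no cache: ${no_cache:.2f}  cached: ${cached:.2f}  "
      f"saving: {1 - cached / no_cache:.0%}")
```

Even with the write surcharge on turn 1, the session cost drops by well over half; the longer the conversation, the closer the saving approaches the 90% read discount.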

Real-World ROI Case Study: In 2024, Klarna's AI support agent handled 2.3 million conversations. By optimizing context management (a form of prompt caching strategy), they maintained a <2 minute resolution time and replaced the workload of 700 full-time agents, driving an estimated $40M in annual profit improvement (Klarna, 2024).

What Parts of a Prompt Can Be Cached

Several prompt components are ideal candidates for caching:

  • System Prompts: Detailed personas and guardrails (e.g., "Do not hallucinate," "Format as JSON") that remain static across all users.
  • Large Documents: Uploaded files in "Chat with PDF" apps. The document doesn't change, only the questions do.
  • Few-Shot Examples: High-quality examples of inputs/outputs are critical for performance but expensive to send every time.
  • Tool Definitions: In agentic workflows, the schemas for 50+ available tools (Google Search, Calculator, Database) can take up thousands of tokens.

Prompt Structure Matters: The "Prefix" Rule

Most current implementations (Anthropic, OpenAI, vLLM) use prefix matching: a cache hit occurs only if the beginning of the new prompt exactly matches a previously cached prefix.

This imposes a strict constraint on prompt engineering: Static content must come first.

Good Prompt Structure (Cache-Friendly)

  1. System Instructions (Static)
  2. Document Context (Static)
  3. Few-Shot Examples (Static)
  4. User Question (Dynamic)

Result: The first 3 parts (99% of the prompt) are cached.

Bad Prompt Structure (Cache-Breaking)

  1. User Question (Dynamic)
  2. System Instructions (Static)
  3. Document Context (Static)

Result: Because the prompt starts with a unique user question ("Hello," "Help me," etc.), the prefix is different every time. The cache fails, and the model must recompute the entire document.
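In code, the rule is simple: concatenate the static parts once, and append the dynamic part last, so every request shares an identical prefix. A minimal sketch (the placeholder strings stand in for real content):

```python
# Cache-friendly prompt assembly: static blocks first, dynamic last,
# so every request shares an identical (cacheable) byte prefix.
SYSTEM = "You are a senior legal analyst..."
DOCUMENT = "<full text of the merger agreement>"
EXAMPLES = "<few-shot examples of correct outputs>"

STATIC_PREFIX = "\n\n".join([SYSTEM, DOCUMENT, EXAMPLES])

def build_prompt(user_question: str) -> str:
    return STATIC_PREFIX + "\n\n" + user_question   # dynamic part goes last

p1 = build_prompt("Does section 4.2 apply here?")
p2 = build_prompt("Summarize section 9.")

# Both prompts start with the same bytes, so the provider's prefix
# matcher can reuse the cached KVs for STATIC_PREFIX.
assert p1.startswith(STATIC_PREFIX) and p2.startswith(STATIC_PREFIX)
```

Putting the user question first would make `p1` and `p2` diverge from the very first byte, so no prefix could ever be reused.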

Practical Limits and Constraints (2025)

Prompt caching is powerful, but not magic. Be aware of these constraints:

  1. Minimum Size Threshold: Caching usually requires a minimum block size to be effective (e.g., 1,024 tokens for Anthropic and OpenAI). Caching a 50-token system prompt is often not supported or worth the overhead.

  2. Cache Lifetime (TTL): Caches are ephemeral.

    • Anthropic: 5 minutes default TTL, refreshes on every hit.
    • OpenAI: Caches are cleared after 5–10 minutes of inactivity, and always within about an hour of last use.
    • Google Gemini: Offers explicit "Context Caching" with longer TTLs (hours/days) but charges a storage fee per hour (Google Cloud, 2024).
  3. Provider Implementation Differences:

    • Explicit: Anthropic requires you to add cache_control breakpoints in your API calls.
    • Implicit: OpenAI and DeepSeek perform "automatic" caching—if they detect a matching prefix, they use it. You don't need to change your code, but you do need to structure your prompts correctly.
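For the explicit style, the sketch below builds a request body in the shape of Anthropic's Messages API, with a cache_control breakpoint marking everything up to and including that block as cacheable. Treat the exact field layout as illustrative and verify it against the provider's current documentation:

```python
# Request body in the shape of Anthropic's Messages API. The
# cache_control entry marks the static context as a cache breakpoint;
# field layout is illustrative -- check the current API docs.
def build_request(static_context: str, user_question: str) -> dict:
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": static_context,
                "cache_control": {"type": "ephemeral"},  # explicit breakpoint
            }
        ],
        "messages": [{"role": "user", "content": user_question}],
    }

req = build_request("<50k tokens of contract text>", "Does section 4.2 apply?")
```

With implicit providers such as OpenAI and DeepSeek, no such field exists: the same effect comes purely from keeping the static context at the front of the prompt, as in the prefix rule above.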

Why AI Engineers Should Care

Prompt caching is especially important for:

  • AI Copilots: IDE agents like Trae or Cursor cache your entire codebase so they can answer questions instantly without re-reading thousands of files.
  • RAG Applications: "Chat with your Data" apps become 10x cheaper.
  • Agentic Workflows: Autonomous agents often loop 10–20 times to solve a task. Caching the tool definitions and initial instructions saves massive redundant compute.

Final Thought

The future of AI systems is not just about bigger models; it is about smarter system design.

In 2025, latency is the new downtime and cost is the new technical debt. Prompt caching is one of the few optimizations that improves both simultaneously.

For engineers building the next generation of AI platforms, mastering the mechanics of the prefill phase and KV caching is no longer optional—it is a fundamental skill.
