
Prompt Caching

Prompt Caching allows you to cache frequently reused prompt prefixes, reducing token consumption and response latency.

How It Works

When your request contains a long, frequently reused system prompt or context:

  1. First request — All tokens are fully processed, and the prompt prefix is cached
  2. Subsequent requests — When the cache is hit, cached tokens are billed at a discounted rate instead of being fully reprocessed
  3. Cache expiration — The cache has a TTL (typically 5-10 minutes); once it expires, the prefix must be processed and cached again
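The lifecycle above can be sketched as a toy in-memory prefix cache with a TTL. This is purely illustrative — it is not OfoxAI's or any provider's actual implementation, and the 300-second TTL is just the low end of the typical range:

```python
import time

class PrefixCache:
    """Toy prefix cache: a hit requires the same prefix within the TTL window."""

    def __init__(self, ttl_seconds=300):  # typical TTL: 5-10 minutes
        self.ttl = ttl_seconds
        self.entries = {}  # prefix -> timestamp of last use

    def lookup(self, prefix):
        """Return True on a cache hit; either way, (re)store the entry."""
        now = time.time()
        hit = prefix in self.entries and (now - self.entries[prefix]) < self.ttl
        self.entries[prefix] = now  # a hit refreshes the TTL; a miss rebuilds the cache
        return hit

cache = PrefixCache(ttl_seconds=300)
print(cache.lookup("You are a helpful assistant..."))  # False: first request processes all tokens
print(cache.lookup("You are a helpful assistant..."))  # True: subsequent request hits the cache
```

Real caches key on exact token prefixes and live server-side, but the hit/miss/expire behavior follows this shape.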

Cache Support

OfoxAI’s model resources are provided by official cloud providers including AWS Bedrock, Azure OpenAI, Google Cloud, Alibaba Cloud, and Volcengine. Models that support Prompt Caching on these cloud providers are also supported by OfoxAI.

| Cloud Provider | Representative Models | Caching Mechanism |
| --- | --- | --- |
| AWS Bedrock | Claude series | Native Prompt Caching |
| Azure OpenAI | GPT-4o series | Automatic caching |
| Google Cloud | Gemini series | Context Caching |
| Alibaba Cloud | Qwen series | Platform-side caching |
| Volcengine | Doubao series | Platform-side caching |

Specific model caching support is subject to each cloud provider’s official documentation. OfoxAI transparently passes through caching-related parameters with no additional configuration needed.

Usage

OpenAI Protocol

Prompt Caching for OpenAI models is automatic — it activates when repeated prompt prefixes are detected (for OpenAI models, typically on prompts of 1,024 tokens or more):

caching_openai.py
from openai import OpenAI

# Point the OpenAI SDK at your OfoxAI endpoint (see your console for the base URL)
client = OpenAI(
    base_url="<your OfoxAI OpenAI-protocol endpoint>",
    api_key="<your OFOXAI_API_KEY>"
)

# Long system prompts are automatically cached
SYSTEM_PROMPT = """You are OfoxAI's technical support assistant.
Here is the product information you need to know:
- OfoxAI is an LLM Gateway supporting 100+ models
- Supports OpenAI / Anthropic / Gemini protocols
- ... (more product knowledge omitted)
"""

# First request: caches the system prompt
response1 = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What models does OfoxAI support?"}
    ]
)

# Second request: cache hit, faster and cheaper
response2 = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # Cache hit
        {"role": "user", "content": "How do I set up Claude Code?"}
    ]
)

Anthropic Protocol

Anthropic models support explicit cache control:

caching_anthropic.py
import anthropic

client = anthropic.Anthropic(
    base_url="https://api.ofox.ai/anthropic",
    api_key="<your OFOXAI_API_KEY>"
)

response = client.messages.create(
    model="anthropic/claude-sonnet-4.5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": "You are a professional assistant. Here is the product documentation...",
        "cache_control": {"type": "ephemeral"}  # Explicitly enable caching
    }],
    messages=[{"role": "user", "content": "Summarize the product features"}]
)

# Check cache hit status
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache hit tokens: {response.usage.cache_read_input_tokens}")

Cost Savings

When the cache is hit, cached tokens are billed at a lower rate. The savings vary by model:

  • Anthropic Claude series — Cache hits can save approximately 90% on input costs
  • OpenAI GPT series — Cache hits can save approximately 50% on input costs
  • Google Gemini series — Cache hits can save approximately 50-75% on input costs

Actual savings depend on cache hit rates and each cloud provider’s billing policies. Check the OfoxAI Console usage statistics for details.

Best Practices

  1. Put long text first — Place system prompts, knowledge base content, and other static parts at the beginning of messages
  2. Keep prefixes consistent — Caches match on exact prefixes; any change at the start of the prompt prevents a hit
  3. Design prompt structure wisely — Separate static and dynamic parts
# ✅ Good design: static content first, dynamic content last
messages = [
    {"role": "system", "content": LONG_STATIC_PROMPT},  # Cacheable
    {"role": "user", "content": dynamic_question}       # Dynamic part
]

# ❌ Bad design: dynamic content mixed into the static prefix
messages = [
    {"role": "system", "content": f"Today is {date}. {LONG_PROMPT}"}  # Changes daily, not cacheable
]
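One way to repair the bad design above is to move the dynamic date out of the system prompt and into the user turn, so the cached prefix never changes. A runnable sketch with placeholder values (`LONG_STATIC_PROMPT` and `question` stand in for real content):

```python
import datetime

LONG_STATIC_PROMPT = "...long static system prompt (placeholder)..."
question = "What models does OfoxAI support?"
date = datetime.date.today().isoformat()

messages = [
    {"role": "system", "content": LONG_STATIC_PROMPT},            # stable prefix, cacheable
    {"role": "user", "content": f"Today is {date}. {question}"},  # dynamic content last
]
print(messages[0]["content"] == LONG_STATIC_PROMPT)  # True: the prefix is identical every day
```

The system prompt is now byte-for-byte identical across requests, so every request after the first can hit the cache regardless of the date.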

Cache hits can be monitored in the usage field of API responses, as well as in the OfoxAI Console usage statistics.
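For the OpenAI protocol, the cached-token count is reported under `usage.prompt_tokens_details.cached_tokens`. A small helper that reads it defensively, since availability of the field may vary by upstream provider (the mock below stands in for a real response from `client.chat.completions.create`):

```python
from types import SimpleNamespace

def cached_tokens(response):
    """Return the cached-token count from an OpenAI-protocol response,
    or 0 if the upstream provider does not report it."""
    details = getattr(response.usage, "prompt_tokens_details", None)
    return getattr(details, "cached_tokens", 0) if details else 0

# Demo with a mocked response object
mock = SimpleNamespace(usage=SimpleNamespace(
    prompt_tokens=1200,
    prompt_tokens_details=SimpleNamespace(cached_tokens=1024),
))
print(cached_tokens(mock))  # 1024
```

For the Anthropic protocol, use `usage.cache_creation_input_tokens` and `usage.cache_read_input_tokens` instead, as shown in the earlier example.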
