Prompt Caching
Prompt Caching allows you to cache frequently reused prompt prefixes, reducing token consumption and response latency.
How It Works
When your request contains a long, frequently reused system prompt or context:
- First request — all tokens are processed in full, and the prompt prefix is written to the cache
- Subsequent requests — when the cache is hit, the cached portion of the prompt is billed at a reduced rate instead of full price
- Cache expiration — the cache has a TTL (typically 5-10 minutes); once it expires, the prefix must be cached again
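The lifecycle above can be sketched as a toy in-memory prefix cache (illustrative only — the real cache lives on the provider side, and some providers refresh the TTL on each hit):

```python
# Toy illustration of the provider-side lifecycle: a prefix is cached on
# first use, subsequent lookups hit until the TTL elapses, then the
# prefix must be cached again.
class PrefixCache:
    def __init__(self, ttl_seconds: float = 300.0):  # ~5-minute TTL
        self.ttl = ttl_seconds
        self._entries: dict[str, float] = {}  # prefix -> time it was cached

    def lookup(self, prefix: str, now: float) -> str:
        cached_at = self._entries.get(prefix)
        if cached_at is not None and now - cached_at < self.ttl:
            return "hit"    # cached portion billed at a reduced rate
        self._entries[prefix] = now
        return "miss"       # all tokens processed, prefix (re)cached

cache = PrefixCache(ttl_seconds=300)
print(cache.lookup("long system prompt", now=0))    # miss: first request
print(cache.lookup("long system prompt", now=60))   # hit: within TTL
print(cache.lookup("long system prompt", now=400))  # miss: TTL expired
```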
Cache Support
OfoxAI’s model resources are provided by official cloud providers including AWS Bedrock, Azure OpenAI, Google Cloud, Alibaba Cloud, and Volcengine. Models that support Prompt Caching on these cloud providers are also supported by OfoxAI.
| Cloud Provider | Representative Models | Caching Mechanism |
|---|---|---|
| AWS Bedrock | Claude series | Native Prompt Caching |
| Azure OpenAI | GPT-4o series | Automatic caching |
| Google Cloud | Gemini series | Context Caching |
| Alibaba Cloud | Qwen series | Platform-side caching |
| Volcengine | Doubao series | Platform-side caching |
Specific model caching support is subject to each cloud provider’s official documentation. OfoxAI transparently passes through caching-related parameters with no additional configuration needed.
Usage
OpenAI Protocol
Prompt Caching for OpenAI models is automatic — it activates when repeated prompt prefixes are detected:
```python
# `client` is an OpenAI SDK client configured with your OfoxAI API key

# Long system prompts are automatically cached
SYSTEM_PROMPT = """You are OfoxAI's technical support assistant.
Here is the product information you need to know:
- OfoxAI is an LLM Gateway supporting 100+ models
- Supports OpenAI / Anthropic / Gemini protocols
- ...
(more product knowledge omitted)
"""

# First request: caches the system prompt
response1 = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What models does OfoxAI support?"}
    ]
)

# Second request: cache hit, faster and cheaper
response2 = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # Cache hit
        {"role": "user", "content": "How do I set up Claude Code?"}
    ]
)
```
Anthropic Protocol
Anthropic models support explicit cache control:
```python
import anthropic

client = anthropic.Anthropic(
    base_url="https://api.ofox.ai/anthropic",
    api_key="<your OFOXAI_API_KEY>"
)

response = client.messages.create(
    model="anthropic/claude-sonnet-4.5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": "You are a professional assistant. Here is the product documentation...",
        "cache_control": {"type": "ephemeral"}  # Explicitly enable caching
    }],
    messages=[{"role": "user", "content": "Summarize the product features"}]
)

# Check cache hit status
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache hit tokens: {response.usage.cache_read_input_tokens}")
```
Cost Savings
When the cache is hit, cached tokens are billed at a lower rate. The savings vary by model:
- Anthropic Claude series — Cache hits can save approximately 90% on input costs
- OpenAI GPT series — Cache hits can save approximately 50% on input costs
- Google Gemini series — Cache hits can save approximately 50-75% on input costs
Actual savings depend on cache hit rates and each cloud provider’s billing policies. Check the OfoxAI Console usage statistics for details.
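As a back-of-the-envelope illustration of how discount and hit rate combine (the token counts and per-token price below are placeholders, not OfoxAI pricing):

```python
def cached_input_cost(total_tokens: int, cached_tokens: int,
                      price_per_token: float, cache_discount: float) -> float:
    """Input cost when `cached_tokens` of the prompt are billed at a discount."""
    uncached = total_tokens - cached_tokens
    discounted = cached_tokens * price_per_token * (1 - cache_discount)
    return uncached * price_per_token + discounted

# Hypothetical numbers: 10,000-token prompt, 8,000 tokens cached,
# 90% discount on cache hits (Claude-like)
full = 10_000 * 0.000003  # cost with no caching
with_cache = cached_input_cost(10_000, 8_000, 0.000003, 0.90)
print(f"Savings: {1 - with_cache / full:.0%}")  # 72%
```

Note that the effective saving (72% here) is lower than the headline discount (90%) because only the cached prefix, not the whole prompt, is discounted.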
Best Practices
- Put long text first — Place system prompts, knowledge base content, and other static parts at the beginning of messages
- Keep prefixes consistent — Only identical prefixes can hit the cache
- Design prompt structure wisely — Separate static and dynamic parts
```python
# ✅ Good design: static content first, dynamic content last
messages = [
    {"role": "system", "content": LONG_STATIC_PROMPT},  # Cacheable
    {"role": "user", "content": dynamic_question}       # Dynamic part
]

# ❌ Bad design: dynamic content mixed with static content
messages = [
    {"role": "system", "content": f"Today is {date}. {LONG_PROMPT}"}  # Changes daily, not cacheable
]
```
Cache hits can be monitored in the `usage` field of API responses, as well as in the OfoxAI Console usage statistics.
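As a sketch of what that monitoring can look like, the helper below reads an OpenAI-style usage payload (field names follow OpenAI's `prompt_tokens_details.cached_tokens` convention; treat the exact shape as provider-dependent):

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of prompt tokens served from cache (0.0 if not reported)."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt_tokens if prompt_tokens else 0.0

# Example payload shaped like an OpenAI chat.completions usage block
usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 120,
    "prompt_tokens_details": {"cached_tokens": 1536},
}
print(f"Cache hit ratio: {cache_hit_ratio(usage):.0%}")  # 75%
```

Tracking this ratio over time is a quick way to verify that your prompt structure is actually producing cache hits.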