Rate Limits
OfoxAI’s rate limits ensure platform stability. Understanding the limits helps you optimize your API usage.
Default Limits
OfoxAI uses pay-as-you-go pricing with a unified rate policy for all users:
| Limit | Quota |
|---|---|
| RPM (requests/minute) | 200 |
| TPM (tokens/minute) | Unlimited |
If you need a higher RPM quota, contact OfoxAI Support to request an adjustment.
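A 200 RPM budget works out to roughly one request every 300 ms. If you send requests in a tight loop, spacing them client-side avoids hitting the limit at all. A minimal sketch (`RequestPacer` is illustrative, not part of any SDK):

```python
import time

class RequestPacer:
    """Spaces requests evenly so a requests-per-minute budget is never exceeded."""

    def __init__(self, rpm: int = 200):
        self.min_interval = 60.0 / rpm  # seconds between requests
        self.last_request = 0.0

    def wait(self) -> float:
        """Sleep just long enough to honor the budget; return the delay used."""
        now = time.monotonic()
        delay = max(0.0, self.last_request + self.min_interval - now)
        if delay:
            time.sleep(delay)
        self.last_request = time.monotonic()
        return delay
```

Call `pacer.wait()` before each API request; the first call returns immediately and subsequent calls block just long enough to stay under the budget.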
Rate Limit Headers
Every API response includes rate limit information:
```
x-ratelimit-limit-requests: 200
x-ratelimit-remaining-requests: 195
x-ratelimit-reset-requests: 12s
```

| Header | Description |
|---|---|
| x-ratelimit-limit-requests | RPM limit value |
| x-ratelimit-remaining-requests | Remaining request count |
| x-ratelimit-reset-requests | Time until the request limit resets |
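With the OpenAI Python SDK you can reach these headers through `with_raw_response`. The helper below is a sketch (`parse_rate_limit` and `parse_reset` are illustrative, not SDK functions) that converts them into typed values, including `12s`-style reset durations:

```python
def parse_reset(value: str) -> float:
    """Convert a reset duration like '12s' or '250ms' into seconds."""
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    if value.endswith("s"):
        return float(value[:-1])
    return float(value)

def parse_rate_limit(headers) -> dict:
    """Extract the x-ratelimit-* request headers into a plain dict."""
    return {
        "limit": int(headers["x-ratelimit-limit-requests"]),
        "remaining": int(headers["x-ratelimit-remaining-requests"]),
        "reset_seconds": parse_reset(headers["x-ratelimit-reset-requests"]),
    }

# With the OpenAI SDK, raw headers are exposed via with_raw_response:
# raw = client.chat.completions.with_raw_response.create(model=..., messages=...)
# info = parse_rate_limit(raw.headers)
# response = raw.parse()  # the usual completion object
```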
Handling 429 Errors
When rate-limited, the API returns 429 Too Many Requests:
```python
import time

from openai import RateLimitError

try:
    response = client.chat.completions.create(...)
except RateLimitError as e:
    retry_after = float(e.response.headers.get("retry-after", 1))
    print(f"Rate limited, waiting {retry_after}s...")
    time.sleep(retry_after)
```

Optimization Strategies
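A single fixed wait handles occasional 429s; under sustained traffic, exponential backoff with jitter spreads retries out and avoids thundering-herd retries. A sketch (`backoff_delay` and `with_retries` are illustrative helpers, not part of the SDK):

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def with_retries(call, retry_on=Exception, max_attempts: int = 5, sleep=time.sleep):
    """Retry `call` on `retry_on`, sleeping a jittered backoff between attempts."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(backoff_delay(attempt))

# Usage with the OpenAI SDK:
# with_retries(lambda: client.chat.completions.create(...), retry_on=RateLimitError)
```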
1. Use Prompt Caching
For repeated system prompts, enabling caching reduces token consumption:
```python
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        # Long system prompts are automatically cached
        {"role": "system", "content": "You are a professional... (long text omitted)"},
        {"role": "user", "content": "User question"}
    ]
)
```

See Prompt Caching for details.
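Prompt caches typically match on an exact prefix, so the long system prompt must stay byte-identical across requests to get cache hits. Building messages from one shared constant makes that easy (a sketch; `build_messages` is illustrative):

```python
SYSTEM_PROMPT = "You are a professional... (long text omitted)"  # shared, unchanged across calls

def build_messages(user_content: str) -> list:
    """Prepend the shared system prompt so every request has an identical cacheable prefix."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]
```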
2. Batch Processing
Consolidate multiple short requests into a single request:
```python
# ❌ Not recommended: a separate request for each question
for question in questions:
    client.chat.completions.create(messages=[{"role": "user", "content": question}])

# ✅ Recommended: combine into one request
combined = "\n".join(f"{i+1}. {q}" for i, q in enumerate(questions))
client.chat.completions.create(
    messages=[{"role": "user", "content": f"Please answer the following questions:\n{combined}"}]
)
```

3. Choose the Right Model
| Scenario | Recommended Model | Reason |
|---|---|---|
| Simple chat | openai/gpt-4o-mini | Fast, saves tokens |
| Complex reasoning | openai/gpt-4o | High-quality output |
| Code generation | anthropic/claude-sonnet-4.5 | Strong coding ability |
| Long text processing | google/gemini-3-flash-preview | Large context, cost-effective |
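If you route requests programmatically, the table above can live in code as a simple lookup. A sketch (the `pick_model` helper and scenario keys are illustrative, not an OfoxAI API):

```python
MODEL_BY_SCENARIO = {
    "simple_chat": "openai/gpt-4o-mini",
    "complex_reasoning": "openai/gpt-4o",
    "code_generation": "anthropic/claude-sonnet-4.5",
    "long_text": "google/gemini-3-flash-preview",
}

def pick_model(scenario: str, default: str = "openai/gpt-4o-mini") -> str:
    """Map a task scenario to a model ID, falling back to a cheap default."""
    return MODEL_BY_SCENARIO.get(scenario, default)
```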
4. Control max_tokens
Set a reasonable max_tokens limit to avoid unnecessary token consumption:
```python
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Summarize in one sentence"}],
    max_tokens=100  # Limit output length
)
```

5. Use Model Fallback
Automatically switch to alternative models when the primary model hits its limit:
```python
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[...],
    extra_body={
        "provider": {
            "fallback": ["anthropic/claude-sonnet-4.5", "google/gemini-3-flash-preview"]
        }
    }
)
```

See Fallback for details.
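The provider-side option is the simplest route; the same idea can also be sketched client-side when you want custom per-model handling. An illustrative helper, not an SDK feature:

```python
def create_with_fallback(create, models, is_rate_limited):
    """Try each model in order, moving on when `is_rate_limited(exc)` says to."""
    last_error = None
    for model in models:
        try:
            return create(model)
        except Exception as exc:
            if not is_rate_limited(exc):
                raise  # non-rate-limit errors propagate immediately
            last_error = exc
    raise last_error

# Usage with the OpenAI SDK (RateLimitError from `openai`):
# create_with_fallback(
#     lambda m: client.chat.completions.create(model=m, messages=[...]),
#     ["openai/gpt-4o", "anthropic/claude-sonnet-4.5", "google/gemini-3-flash-preview"],
#     lambda e: isinstance(e, RateLimitError),
# )
```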