
Rate Limits

OfoxAI’s rate limits ensure platform stability. Understanding the limits helps you optimize your API usage.

Default Limits

OfoxAI uses pay-as-you-go pricing with a unified rate policy for all users:

| Limit | Quota |
| --- | --- |
| RPM (requests/minute) | 200 |
| TPM (tokens/minute) | Unlimited |

If you need a higher RPM quota, contact OfoxAI Support to request an adjustment.
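Because the RPM quota applies to your account as a whole, it can help to throttle requests locally before they ever hit the API. A minimal sliding-window throttle is sketched below; the class name and structure are illustrative, not part of any OfoxAI SDK:

```python
import time
from collections import deque


class RpmThrottle:
    """Client-side sliding-window throttle for an RPM quota."""

    def __init__(self, rpm: int = 200):
        self.rpm = rpm
        self.timestamps = deque()  # monotonic times of recent requests

    def acquire(self) -> None:
        """Block until a request may be sent without exceeding the quota."""
        now = time.monotonic()
        # Drop timestamps older than the 60-second window
        while self.timestamps and now - self.timestamps[0] >= 60:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.rpm:
            # Sleep until the oldest request leaves the window
            time.sleep(60 - (now - self.timestamps[0]))
        self.timestamps.append(time.monotonic())


throttle = RpmThrottle(rpm=200)
# Call throttle.acquire() before each API request
```

Calling `acquire()` before every request keeps you under the quota without ever seeing a 429 in steady state.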

Rate Limit Headers

Every API response includes rate limit information:

```
x-ratelimit-limit-requests: 200
x-ratelimit-remaining-requests: 195
x-ratelimit-reset-requests: 12s
```
| Header | Description |
| --- | --- |
| x-ratelimit-limit-requests | RPM limit value |
| x-ratelimit-remaining-requests | Remaining request count |
| x-ratelimit-reset-requests | Time until the request limit resets |
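Client code can read these headers to pace itself. The helper below parses the `12s`-style reset value shown above into seconds; the function name, and the assumption that values carry at most an `s` suffix, are ours rather than part of any SDK:

```python
import re


def parse_reset_seconds(value: str) -> float:
    """Parse a reset header like '12s' (or a bare number) into seconds.

    Assumes the format shown in this document; other units are not handled.
    """
    m = re.fullmatch(r"(\d+(?:\.\d+)?)s?", value.strip())
    if not m:
        raise ValueError(f"unrecognized reset value: {value!r}")
    return float(m.group(1))


# Example using the header values shown above
headers = {
    "x-ratelimit-limit-requests": "200",
    "x-ratelimit-remaining-requests": "195",
    "x-ratelimit-reset-requests": "12s",
}
remaining = int(headers["x-ratelimit-remaining-requests"])
reset_in = parse_reset_seconds(headers["x-ratelimit-reset-requests"])
```

When `remaining` gets close to zero, sleeping for `reset_in` seconds avoids a 429 entirely.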

Handling 429 Errors

When rate-limited, the API returns 429 Too Many Requests:

```python
import time

from openai import RateLimitError

try:
    response = client.chat.completions.create(...)
except RateLimitError as e:
    retry_after = float(e.response.headers.get("retry-after", 1))
    print(f"Rate limited, waiting {retry_after}s...")
    time.sleep(retry_after)
```
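Under sustained load, a single sleep is often not enough; exponential backoff with jitter spreads retries out so concurrent clients don't all retry at once. The sketch below is generic (the helper names are ours): pass it any request callable and a predicate that recognizes a rate-limit error, e.g. `lambda e: isinstance(e, RateLimitError)`.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    # Full jitter: a random wait in [0, min(cap, base * 2**attempt)]
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(make_request, is_rate_limited, max_attempts: int = 5,
                      base: float = 1.0):
    """Retry make_request() with jittered exponential backoff while
    is_rate_limited(exc) is true; re-raise anything else."""
    for attempt in range(max_attempts):
        try:
            return make_request()
        except Exception as exc:
            if not is_rate_limited(exc) or attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt, base=base))
```

For example: `call_with_retries(lambda: client.chat.completions.create(...), lambda e: isinstance(e, RateLimitError))`.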

Optimization Strategies

1. Use Prompt Caching

For repeated system prompts, enabling caching reduces token consumption:

```python
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        # Long system prompts are automatically cached
        {"role": "system", "content": "You are a professional... (long text omitted)"},
        {"role": "user", "content": "User question"},
    ],
)
```

See Prompt Caching for details.

2. Batch Processing

Consolidate multiple short requests into a single request:

```python
# ❌ Not recommended: a separate request for each question
for question in questions:
    client.chat.completions.create(messages=[{"role": "user", "content": question}])

# ✅ Recommended: combine into one request
combined = "\n".join(f"{i+1}. {q}" for i, q in enumerate(questions))
client.chat.completions.create(
    messages=[{"role": "user", "content": f"Please answer the following questions:\n{combined}"}]
)
```
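If combining every question would make one prompt unwieldy, a middle ground is to batch in fixed-size chunks. The helper and the chunk size below are illustrative, not a prescribed API:

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# e.g. send 10 questions as two requests of 5:
# for batch in chunked(questions, 5):
#     combined = "\n".join(f"{i+1}. {q}" for i, q in enumerate(batch))
#     client.chat.completions.create(
#         messages=[{"role": "user",
#                    "content": f"Please answer the following questions:\n{combined}"}]
#     )
```

This trades a few extra requests for shorter prompts and easier-to-parse answers.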

3. Choose the Right Model

| Scenario | Recommended Model | Reason |
| --- | --- | --- |
| Simple chat | openai/gpt-4o-mini | Fast, saves tokens |
| Complex reasoning | openai/gpt-4o | High-quality output |
| Code generation | anthropic/claude-sonnet-4.5 | Strong coding ability |
| Long text processing | google/gemini-3-flash-preview | Large context, cost-effective |
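The table above can be expressed as a small routing map so the model choice lives in one place. The task names and the cheap-model default are our own convention, not an OfoxAI feature:

```python
# Hypothetical task-to-model routing, mirroring the table above
MODEL_BY_TASK = {
    "simple_chat": "openai/gpt-4o-mini",
    "complex_reasoning": "openai/gpt-4o",
    "code_generation": "anthropic/claude-sonnet-4.5",
    "long_text": "google/gemini-3-flash-preview",
}


def pick_model(task: str) -> str:
    # Fall back to the cheapest model for unknown task types
    return MODEL_BY_TASK.get(task, "openai/gpt-4o-mini")
```

Routing through one function makes it easy to adjust models later without touching call sites.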

4. Control max_tokens

Set a reasonable max_tokens limit to avoid unnecessary token consumption:

```python
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Summarize in one sentence"}],
    max_tokens=100,  # Limit output length
)
```

5. Use Model Fallback

Automatically switch to alternative models when the primary model hits its limit:

```python
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[...],
    extra_body={
        "provider": {
            "fallback": ["anthropic/claude-sonnet-4.5", "google/gemini-3-flash-preview"]
        }
    },
)
```

See Fallback for details.
