Streaming Responses
Streaming lets you receive output in real time while the model is still generating, improving user experience and perceived speed.
How It Works
OfoxAI implements streaming with the Server-Sent Events (SSE) protocol:
- The client sets stream: true in the request
- The server returns the generated content incrementally, as chunks
- Each chunk is sent over SSE with a data: prefix
- When generation finishes, the server sends data: [DONE]
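The steps above can be sketched as a minimal SSE parser. This is a pure-Python illustration operating on sample lines rather than a live connection; `iter_sse_data` and the sample payloads are illustrative, with the payload shape following the OpenAI chat-completions chunk format:

```python
import json

def iter_sse_data(lines):
    """Yield the JSON payload of each SSE `data:` line, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore comments, event names, and keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        yield json.loads(payload)

# Sample wire data in the shape of OpenAI chat-completion chunks
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: [DONE]',
]

text = "".join(
    chunk["choices"][0]["delta"].get("content", "")
    for chunk in iter_sse_data(sample)
)
print(text)  # Hello
```

In practice the HTTP client's response iterator delivers these lines; the parsing logic is the same.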
OpenAI-Protocol Streaming
cURL
Terminal
curl https://api.ofox.ai/v1/chat/completions \
  -H "Authorization: Bearer $OFOX_API_KEY" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [{"role": "user", "content": "Write a poem about programming"}],
    "stream": true
  }'

Anthropic-Protocol Streaming
Python
stream_anthropic.py
import anthropic

client = anthropic.Anthropic(
    base_url="https://api.ofox.ai/anthropic",
    api_key="<your OFOXAI_API_KEY>"
)
with client.messages.stream(
    model="anthropic/claude-sonnet-4.5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a poem about programming"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Streaming + Function Calling
Streaming also works with function calling. The model first streams the tool-call request; once you have handled it, you continue the conversation:
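Note that in the OpenAI protocol a streamed tool call arrives as a series of deltas: the function name typically comes in the first fragment, and the JSON arguments string is split across later ones, so fragments must be concatenated before parsing. A minimal reassembly sketch over simulated deltas (plain dicts mirroring the shape of the SDK's chunk objects; the data here is illustrative):

```python
import json

# Simulated tool-call deltas as they might arrive across chunks:
# name in the first fragment, arguments split across the rest.
deltas = [
    {"index": 0, "name": "get_weather", "arguments": ""},
    {"index": 0, "name": None, "arguments": '{"ci'},
    {"index": 0, "name": None, "arguments": 'ty": "北京"}'},
]

calls = {}  # index -> accumulated {"name": ..., "arguments": ...}
for d in deltas:
    call = calls.setdefault(d["index"], {"name": None, "arguments": ""})
    if d["name"]:
        call["name"] = d["name"]
    call["arguments"] += d["arguments"] or ""

# Only the concatenated string is valid JSON
args = json.loads(calls[0]["arguments"])
print(calls[0]["name"], args)  # get_weather {'city': '北京'}
```

The same accumulation applies to the `delta.tool_calls` entries in the example below.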
stream_with_tools.py
from openai import OpenAI

# OpenAI-protocol endpoint (see the cURL example above)
client = OpenAI(base_url="https://api.ofox.ai/v1", api_key="<your OFOXAI_API_KEY>")

stream = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "What's the weather like in Beijing today?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather for a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"]
            }
        }
    }],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        # Handle the tool call (arguments arrive as fragments across chunks)
        print(f"Tool call: {delta.tool_calls[0].function}")
    elif delta.content:
        print(delta.content, end="", flush=True)

Error Handling and Reconnection
Streaming connections can drop due to network issues, so it is worth implementing retry logic.
stream_retry.py
import time

def stream_with_retry(client, max_retries=3, **kwargs):
    for attempt in range(max_retries):
        try:
            stream = client.chat.completions.create(stream=True, **kwargs)
            for chunk in stream:
                yield chunk
            return  # completed successfully
        except Exception as e:
            # Note: a retry restarts generation from scratch, so chunks
            # already yielded before the failure may be repeated
            if attempt < max_retries - 1:
                wait = 2 ** attempt  # exponential backoff
                print(f"\nConnection dropped, retrying in {wait}s...")
                time.sleep(wait)
            else:
                raise

Best Practices
- Always set a timeout to avoid waiting indefinitely
- Handle incomplete chunks: some chunks carry no content
- Implement a reconnection mechanism with exponential backoff
- Use flush on the frontend so content displays immediately
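As an illustration of the first point: an SDK's chunk iterator can block indefinitely on a socket read, so a per-chunk timeout can be enforced by consuming the stream on a worker thread. A minimal sketch; `iter_with_timeout` and its names are illustrative, not part of any SDK:

```python
import queue
import threading

def iter_with_timeout(iterable, timeout=30.0):
    """Yield items from `iterable`, raising TimeoutError if the
    producer stalls for longer than `timeout` seconds between items."""
    q = queue.Queue()
    done = object()  # sentinel marking normal end of stream

    def produce():
        try:
            for item in iterable:
                q.put(item)
            q.put(done)
        except Exception as exc:  # forward producer errors to the consumer
            q.put(exc)

    threading.Thread(target=produce, daemon=True).start()
    while True:
        try:
            item = q.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError(f"no chunk received within {timeout}s")
        if item is done:
            return
        if isinstance(item, Exception):
            raise item
        yield item

print(list(iter_with_timeout(iter(["Hel", "lo"]))))  # ['Hel', 'lo']
```

In practice you would wrap the SDK's stream object, e.g. `for chunk in iter_with_timeout(stream, timeout=30.0): ...`, combining the timeout with the retry helper above.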