# Vision

OfoxAI supports visual input for multimodal models, enabling analysis of images, screenshots, documents, and video content.
## Supported Models
| Model | Images | Video | Description |
|---|---|---|---|
| openai/gpt-4o | ✅ | — | High-quality image analysis |
| openai/gpt-4o-mini | ✅ | — | Fast image analysis |
| anthropic/claude-sonnet-4.5 | ✅ | — | Strong document and code understanding |
| google/gemini-3-flash-preview | ✅ | ✅ | Multimodal all-rounder |
| google/gemini-3.1-pro-preview | ✅ | ✅ | Most capable multimodal reasoning |
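If you route requests across several of these models, the capability matrix above can be mirrored client-side to avoid sending video to a model that only accepts images. A minimal sketch; `VISION_MODELS` and `supports_video` are local conveniences for illustration, not part of the OfoxAI API:

```python
# Capability matrix mirroring the table above.
VISION_MODELS = {
    "openai/gpt-4o": {"images": True, "video": False},
    "openai/gpt-4o-mini": {"images": True, "video": False},
    "anthropic/claude-sonnet-4.5": {"images": True, "video": False},
    "google/gemini-3-flash-preview": {"images": True, "video": True},
    "google/gemini-3.1-pro-preview": {"images": True, "video": True},
}

def supports_video(model: str) -> bool:
    """Return True if the model is listed as accepting video input."""
    return VISION_MODELS.get(model, {}).get("video", False)
```

Checking `supports_video(model)` before building a request lets you fall back to a Gemini model when video frames are present.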
## Image Analysis

### Sending Images via URL
```bash
curl https://api.ofox.ai/v1/chat/completions \
  -H "Authorization: Bearer $OFOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe what is in this image"},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
      ]
    }]
  }'
```

### Sending Images via Base64
Base64 encoding is suitable for local files or screenshots:
vision_base64.py

```python
import base64

from openai import OpenAI

client = OpenAI(
    base_url="https://api.ofox.ai/v1",
    api_key="<your OFOX_API_KEY>",
)

# Read and encode the local image
with open("screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this screenshot show?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{image_data}"
                }
            }
        ]
    }]
)
```

### Image Detail Level
Control analysis precision with the `detail` parameter:
| Value | Description | Use Case |
|---|---|---|
| `auto` | Automatic selection (default) | General scenarios |
| `low` | Lower precision, faster | Simple classification, tag identification |
| `high` | Higher precision, more detailed | Document OCR, detailed analysis |
```python
{
    "type": "image_url",
    "image_url": {
        "url": "https://example.com/document.jpg",
        "detail": "high"  # High precision mode
    }
}
```

### Multi-Image Comparison
You can send multiple images in a single request:
```python
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare the differences between these two images"},
            {"type": "image_url", "image_url": {"url": "https://example.com/before.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/after.jpg"}}
        ]
    }]
)
```

## Vision Input with Anthropic Protocol
```python
import base64

import anthropic

client = anthropic.Anthropic(
    base_url="https://api.ofox.ai/anthropic",
    api_key="<your OFOXAI_API_KEY>"
)

# Read and encode the local image
with open("photo.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="anthropic/claude-sonnet-4.5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": image_data
                }
            },
            {"type": "text", "text": "Describe this image"}
        ]
    }]
)
```

## Common Use Cases
- Document OCR — Extract text and tables from images
- Code screenshot analysis — Analyze code in screenshots and provide suggestions
- UI review — Analyze interface design and layout
- Chart interpretation — Analyze data charts and visualizations
- Object recognition — Identify objects and scenes in images
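For document OCR in particular, the base64 encoding and `detail: "high"` options from the sections above combine naturally. A minimal sketch of a helper that builds the request message for a local file; `ocr_message` is a hypothetical convenience, not part of any SDK:

```python
import base64

def ocr_message(path: str, prompt: str = "Extract all text and tables from this document") -> dict:
    """Build a high-detail OCR chat message from a local PNG image."""
    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{data}",
                    "detail": "high",  # high precision for OCR
                },
            },
        ],
    }
```

The returned dict can be passed directly in the `messages` list of `client.chat.completions.create(...)` as shown earlier.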