Multi-Modal Input Support

The following models support multi-modal input: they accept images, text, or both as input and return text output. They are optimized to interpret visual and textual context together, which enables use cases such as image analysis, visual interpretation, and context-aware reasoning. Use this list to identify the model best suited to your multi-modal requirements.

Gemini 3 Flash
Gemini 2.5 Flash Lite
Gemini 2.0 Flash
Gemini 3.1 Pro
Gemini 2.5 Pro
Claude Opus 4.1
Claude Opus 4.5
Claude Opus 4.8
Claude Sonnet 4
Claude Sonnet 4.5
Claude Sonnet 4.6
Amazon Nova Pro
Amazon Nova Lite
ChatGPT 4.1
ChatGPT o3
ChatGPT 5
ChatGPT 5.2
ChatGPT 5.3
ChatGPT 5.4
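
If you need to check support programmatically, the list above can be kept as a simple lookup table. The following is a minimal sketch, not part of any official SDK: the names `MULTIMODAL_MODELS` and `supports_multimodal_input` are hypothetical, and the model names are copied as display names from this page, which may differ from the identifiers your API expects.

```python
# Display names copied from the list above. These are assumed to be labels,
# not necessarily the model identifiers used in API requests.
MULTIMODAL_MODELS = frozenset({
    "Gemini 3 Flash",
    "Gemini 2.5 Flash Lite",
    "Gemini 2.0 Flash",
    "Gemini 3.1 Pro",
    "Gemini 2.5 Pro",
    "Claude Opus 4.1",
    "Claude Opus 4.5",
    "Claude Opus 4.8",
    "Claude Sonnet 4",
    "Claude Sonnet 4.5",
    "Claude Sonnet 4.6",
    "Amazon Nova Pro",
    "Amazon Nova Lite",
    "ChatGPT 4.1",
    "ChatGPT o3",
    "ChatGPT 5",
    "ChatGPT 5.2",
    "ChatGPT 5.3",
    "ChatGPT 5.4",
})

# Pre-normalized copy so lookups ignore case and surrounding whitespace.
_NORMALIZED = frozenset(name.casefold() for name in MULTIMODAL_MODELS)


def supports_multimodal_input(model_name: str) -> bool:
    """Return True if model_name appears in the list on this page."""
    return model_name.strip().casefold() in _NORMALIZED
```

A caller could use this check to decide whether to attach an image to a request or fall back to text-only input for models that are not listed.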