Multimodal Models

openai
$1.75/1M

OpenAI: GPT-5.2-Codex

GPT-5.2-Codex is an upgraded version of GPT-5.1-Codex optimized for software engineering a...

πŸ“ 400,000 ctx Compare →
bytedance-seed
$0.08/1M

ByteDance Seed: Seed 1.6 Flash

Seed 1.6 Flash is an ultra-fast multimodal deep thinking model by ByteDance Seed, supporti...

πŸ“ 262,144 ctx Compare →
bytedance-seed
$0.25/1M

ByteDance Seed: Seed 1.6

Seed 1.6 is a general-purpose model released by the ByteDance Seed team. It incorporates m...

πŸ“ 262,144 ctx Compare →
google
$0.50/1M

Google: Gemini 3 Flash Preview

Gemini 3 Flash Preview is a high-speed, high-value thinking model designed for agentic wor...

πŸ“ 1,048,576 ctx Compare →
z-ai
$0.30/1M

Z.AI: GLM 4.6V

GLM-4.6V is a large multimodal model designed for high-fidelity visual understanding and l...

πŸ“ 131,072 ctx Compare →
anthropic
$5.00/1M

Anthropic: Claude Opus 4.5

Claude Opus 4.5 is Anthropic's frontier reasoning model optimized for complex software e...

πŸ“ 200,000 ctx Compare →
google
$2.00/1M

Google: Nano Banana Pro (Gemini 3 Pro Image Preview)

Nano Banana Pro is Google's most advanced image-generation and editing model, built on G...

πŸ“ 65,536 ctx Compare →
google
$2.00/1M

Google: Gemini 3 Pro Preview

Gemini 3 Pro is Google's flagship frontier model for high-precision multimodal reasoning...

πŸ“ 1,048,576 ctx Compare →
openai
$1.25/1M

OpenAI: GPT-5.1-Codex

GPT-5.1-Codex is a specialized version of GPT-5.1 optimized for software engineering and c...

πŸ“ 400,000 ctx Compare →
amazon
$2.50/1M

Amazon: Nova Premier 1.0

Amazon Nova Premier is the most capable of Amazon's multimodal models for complex reason...

πŸ“ 1,000,000 ctx Compare →
nvidia
Free

NVIDIA: Nemotron Nano 12B 2 VL (free)

NVIDIA Nemotron Nano 2 VL is a 12-billion-parameter open multimodal reasoning model design...

πŸ“ 128,000 ctx Compare →
nvidia
$0.20/1M

NVIDIA: Nemotron Nano 12B 2 VL

NVIDIA Nemotron Nano 2 VL is a 12-billion-parameter open multimodal reasoning model design...

πŸ“ 131,072 ctx Compare →
qwen
$0.50/1M

Qwen: Qwen3 VL 32B Instruct

Qwen3-VL-32B-Instruct is a large-scale multimodal vision-language model designed for high-...

πŸ“ 262,144 ctx Compare →
openai
$2.50/1M

OpenAI: GPT-5 Image Mini

GPT-5 Image Mini combines OpenAI's advanced language capabilities, powered by [GPT-5 Mini]...

πŸ“ 400,000 ctx Compare →
qwen
$0.18/1M

Qwen: Qwen3 VL 8B Thinking

Qwen3-VL-8B-Thinking is the reasoning-optimized variant of the Qwen3-VL-8B multimodal mode...

πŸ“ 256,000 ctx Compare →
qwen
$0.08/1M

Qwen: Qwen3 VL 8B Instruct

Qwen3-VL-8B-Instruct is a multimodal vision-language model from the Qwen3-VL series, built...

πŸ“ 131,072 ctx Compare →
google
$0.30/1M

Google: Gemini 2.5 Flash Image (Nano Banana)

Gemini 2.5 Flash Image, a.k.a. "Nano Banana," is now generally available. It is a state of...

πŸ“ 32,768 ctx Compare →
qwen
$0.20/1M

Qwen: Qwen3 VL 30B A3B Thinking

Qwen3-VL-30B-A3B-Thinking is a multimodal model that unifies strong text generation with v...

πŸ“ 131,072 ctx Compare →
qwen
$0.15/1M

Qwen: Qwen3 VL 30B A3B Instruct

Qwen3-VL-30B-A3B-Instruct is a multimodal model that unifies strong text generation with v...

πŸ“ 262,144 ctx Compare →
qwen
$0.45/1M

Qwen: Qwen3 VL 235B A22B Thinking

Qwen3-VL-235B-A22B Thinking is a multimodal model that unifies strong text generation with...

πŸ“ 262,144 ctx Compare →
qwen
$0.20/1M

Qwen: Qwen3 VL 235B A22B Instruct

Qwen3-VL-235B-A22B Instruct is an open-weight multimodal model that unifies strong text ge...

πŸ“ 262,144 ctx Compare →
openai
$1.25/1M

OpenAI: GPT-5 Codex

GPT-5-Codex is a specialized version of GPT-5 optimized for software engineering and codin...

πŸ“ 400,000 ctx Compare →
x-ai
$0.20/1M

xAI: Grok 4 Fast

Grok 4 Fast is xAI's latest multimodal model with SOTA cost-efficiency and a 2M token cont...

πŸ“ 2,000,000 ctx Compare →
opengvlab
$0.10/1M

OpenGVLab: InternVL3 78B

The InternVL3 series is an advanced multimodal large language model (MLLM). Compared to In...

πŸ“ 32,768 ctx Compare →
stepfun-ai
$0.57/1M

StepFun: Step3

Step3 is a cutting-edge multimodal reasoning model, built on a Mixture-of-Experts archite...

πŸ“ 65,536 ctx Compare →
mistralai
$0.40/1M

Mistral: Mistral Medium 3.1

Mistral Medium 3.1 is an updated version of Mistral Medium 3, which is a high-performance ...

πŸ“ 131,072 ctx Compare →
baidu
$0.07/1M

Baidu: ERNIE 4.5 21B A3B

A sophisticated text-based Mixture-of-Experts (MoE) model featuring 21B total parameters w...

πŸ“ 120,000 ctx Compare →
baidu
$0.14/1M

Baidu: ERNIE 4.5 VL 28B A3B

A powerful multimodal Mixture-of-Experts chat model featuring 28B total parameters with 3B...

πŸ“ 30,000 ctx Compare →
z-ai
$0.60/1M

Z.AI: GLM 4.5V

GLM-4.5V is a vision-language foundation model for multimodal agent applications. Built on...

πŸ“ 65,536 ctx Compare →
openai
$1.25/1M

OpenAI: GPT-5 Chat

GPT-5 Chat is designed for advanced, natural, multimodal, and context-aware conversations ...

πŸ“ 128,000 ctx Compare →
bytedance
$0.10/1M

ByteDance: UI-TARS 7B

UI-TARS-1.5 is a multimodal vision-language agent optimized for GUI-based environments, in...

πŸ“ 128,000 ctx Compare →
google
Free

Google: Gemma 3n 2B (free)

Gemma 3n E2B IT is a multimodal, instruction-tuned model developed by Google DeepMind, des...

πŸ“ 8,192 ctx Compare →
baidu
$0.42/1M

Baidu: ERNIE 4.5 VL 424B A47B

ERNIE-4.5-VL-424B-A47B is a multimodal Mixture-of-Experts (MoE) model from Baidu's ERNIE...

πŸ“ 123,000 ctx Compare →
google
Free

Google: Gemma 3n 4B (free)

Gemma 3n E4B-it is optimized for efficient execution on mobile and low-resource devices, s...

πŸ“ 8,192 ctx Compare →
google
$0.02/1M

Google: Gemma 3n 4B

Gemma 3n E4B-it is optimized for efficient execution on mobile and low-resource devices, s...

πŸ“ 32,768 ctx Compare →
mistralai
$0.40/1M

Mistral: Mistral Medium 3

Mistral Medium 3 is a high-performance enterprise-grade language model designed to deliver...

πŸ“ 131,072 ctx Compare →
arcee-ai
$0.18/1M

Arcee AI: Spotlight

Spotlight is a 7-billion-parameter vision-language model derived from Qwen 2.5-V...

πŸ“ 131,072 ctx Compare →
meta-llama
$0.18/1M

Meta: Llama Guard 4 12B

Llama Guard 4 is a Llama 4 Scout-derived multimodal pretrained model, fine-tuned for conte...

πŸ“ 163,840 ctx Compare →
openai
$1.10/1M

OpenAI: o4 Mini High

OpenAI o4-mini-high is the same model as [o4-mini](/openai/o4-mini) with reasoning_effort ...

πŸ“ 200,000 ctx Compare →
openai
$1.10/1M

OpenAI: o4 Mini

OpenAI o4-mini is a compact reasoning model in the o-series, optimized for fast, cost-effi...

πŸ“ 200,000 ctx Compare →
openai
$2.00/1M

OpenAI: GPT-4.1

GPT-4.1 is a flagship large language model optimized for advanced instruction following, r...

πŸ“ 1,047,576 ctx Compare →
meta-llama
$0.15/1M

Meta: Llama 4 Maverick

Llama 4 Maverick 17B Instruct (128E) is a high-capacity multimodal language model from Met...

πŸ“ 1,048,576 ctx Compare →
meta-llama
$0.08/1M

Meta: Llama 4 Scout

Llama 4 Scout 17B Instruct (16E) is a mixture-of-experts (MoE) language model developed by...

πŸ“ 327,680 ctx Compare →
qwen
$0.05/1M

Qwen: Qwen2.5 VL 32B Instruct

Qwen2.5-VL-32B is a multimodal vision-language model fine-tuned through reinforcement lear...

πŸ“ 16,384 ctx Compare →
mistralai
Free

Mistral: Mistral Small 3.1 24B (free)

Mistral Small 3.1 24B Instruct is an upgraded variant of Mistral Small 3 (2501), featuring...

πŸ“ 128,000 ctx Compare →
mistralai
$0.03/1M

Mistral: Mistral Small 3.1 24B

Mistral Small 3.1 24B Instruct is an upgraded variant of Mistral Small 3 (2501), featuring...

πŸ“ 131,072 ctx Compare →
google
Free

Google: Gemma 3 4B (free)

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It ha...

πŸ“ 32,768 ctx Compare →
google
$0.02/1M

Google: Gemma 3 4B

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It ha...

πŸ“ 96,000 ctx Compare →
google
Free

Google: Gemma 3 12B (free)

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It ha...

πŸ“ 32,768 ctx Compare →
google
$0.03/1M

Google: Gemma 3 12B

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It ha...

πŸ“ 131,072 ctx Compare →
google
Free

Google: Gemma 3 27B (free)

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It ha...

πŸ“ 131,072 ctx Compare →
google
$0.04/1M

Google: Gemma 3 27B

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It ha...

πŸ“ 96,000 ctx Compare →
google
$0.10/1M

Google: Gemini 2.0 Flash

Gemini Flash 2.0 offers a significantly faster time to first token (TTFT) compared to [Gem...

πŸ“ 1,048,576 ctx Compare →
google
Free

Google: Gemini 2.0 Flash Experimental (free)

Gemini Flash 2.0 offers a significantly faster time to first token (TTFT) compared to [Gem...

πŸ“ 1,048,576 ctx Compare →
amazon
$0.06/1M

Amazon: Nova Lite 1.0

Amazon Nova Lite 1.0 is a very low-cost multimodal model from Amazon focused on fast ...

πŸ“ 300,000 ctx Compare →
amazon
$0.80/1M

Amazon: Nova Pro 1.0

Amazon Nova Pro 1.0 is a capable multimodal model from Amazon focused on providing a combi...

πŸ“ 300,000 ctx Compare →
mistralai
$2.00/1M

Mistral: Pixtral Large 2411

Pixtral Large is a 124B parameter, open-weight, multimodal model built on top of [Mistral ...

πŸ“ 131,072 ctx Compare →
anthropic
$6.00/1M

Anthropic: Claude 3.5 Sonnet

New Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, a...

πŸ“ 200,000 ctx Compare →
meta-llama
$0.05/1M

Meta: Llama 3.2 11B Vision Instruct

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle ...

πŸ“ 131,072 ctx Compare →
qwen
Free

Qwen: Qwen2.5-VL 7B Instruct (free)

Qwen2.5 VL 7B is a multimodal LLM from the Qwen Team with the following key enhancements: ...

πŸ“ 32,768 ctx Compare →
qwen
$0.20/1M

Qwen: Qwen2.5-VL 7B Instruct

Qwen2.5 VL 7B is a multimodal LLM from the Qwen Team with the following key enhancements: ...

πŸ“ 32,768 ctx Compare →
openai
$0.15/1M

OpenAI: GPT-4o-mini (2024-07-18)

GPT-4o mini is OpenAI's newest model after [GPT-4 Omni](/models/openai/gpt-4o), supporting...

πŸ“ 128,000 ctx Compare →
openai
$0.15/1M

OpenAI: GPT-4o-mini

GPT-4o mini is OpenAI's newest model after [GPT-4 Omni](/models/openai/gpt-4o), supporting...

πŸ“ 128,000 ctx Compare →
openai
$5.00/1M

OpenAI: GPT-4o (2024-05-13)

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs...

πŸ“ 128,000 ctx Compare →
openai
$2.50/1M

OpenAI: GPT-4o

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs...

πŸ“ 128,000 ctx Compare →
openai
$6.00/1M

OpenAI: GPT-4o (extended)

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs...

πŸ“ 128,000 ctx Compare →
anthropic
$0.25/1M

Anthropic: Claude 3 Haiku

Claude 3 Haiku is Anthropic's fastest and most compact model for near-instant responsivene...

πŸ“ 200,000 ctx Compare →
openai
$30.00/1M

OpenAI: GPT-4

OpenAI's flagship model, GPT-4 is a large-scale multimodal language model capable of solvi...

πŸ“ 8,191 ctx Compare →