Best AI Models for Vision

google

$0.14/1M

Google: Gemma 4 31B

Gemma 4 31B Instruct is Google DeepMind's 30.7B dense multimodal model supporting text and...

📝 262,144 ctx

z-ai

$1.20/1M

Z.ai: GLM 5V Turbo

GLM-5V-Turbo is Z.ai’s first native multimodal agent foundation model, built for vision-...

📝 202,752 ctx

rekaai

$0.10/1M

Reka Edge

Reka Edge is an extremely efficient 7B multimodal vision-language model that accepts image...

📝 16,384 ctx

xiaomi

$0.40/1M

Xiaomi: MiMo-V2-Omni

MiMo-V2-Omni is a frontier omni-modal model that natively processes image, video, and audi...

📝 262,144 ctx

openai

$0.20/1M

OpenAI: GPT-5.4 Nano

GPT-5.4 nano is the most lightweight and cost-efficient variant of the GPT-5.4 family, opt...

📝 400,000 ctx

openai

$0.75/1M

OpenAI: GPT-5.4 Mini

GPT-5.4 mini brings the core capabilities of GPT-5.4 to a faster, more efficient model opt...

📝 400,000 ctx

qwen

$0.05/1M

Qwen: Qwen3.5-9B

Qwen3.5-9B is a multimodal foundation model from the Qwen3.5 family, designed to deliver s...

📝 256,000 ctx

google

$0.50/1M

Google: Nano Banana 2 (Gemini 3.1 Flash Image Preview)

Gemini 3.1 Flash Image Preview, a.k.a. "Nano Banana 2," is Google’s latest state of the ...

📝 65,536 ctx

qwen

$0.16/1M

Qwen: Qwen3.5-35B-A3B

The Qwen3.5 Series 35B-A3B is a native vision-language model designed with a hybrid archit...

📝 262,144 ctx

qwen

$0.20/1M

Qwen: Qwen3.5-27B

The Qwen3.5 27B native vision-language Dense model incorporates a linear attention mechani...

📝 262,144 ctx

qwen

$0.26/1M

Qwen: Qwen3.5-122B-A10B

The Qwen3.5 122B-A10B native vision-language model is built on a hybrid architecture that ...

📝 262,144 ctx

qwen

$0.07/1M

Qwen: Qwen3.5-Flash

The Qwen3.5 native vision-language Flash models are built on a hybrid architecture that in...

📝 1,000,000 ctx

qwen

$0.26/1M

Qwen: Qwen3.5 Plus 2026-02-15

The Qwen3.5 native vision-language series Plus models are built on a hybrid architecture t...

📝 1,000,000 ctx

qwen

$0.39/1M

Qwen: Qwen3.5 397B A17B

The Qwen3.5 series 397B-A17B native vision-language model is built on a hybrid architectur...

📝 262,144 ctx

z-ai

$0.30/1M

Z.ai: GLM 4.6V

GLM-4.6V is a large multimodal model designed for high-fidelity visual understanding and l...

📝 131,072 ctx

amazon

$0.30/1M

Amazon: Nova 2 Lite

Nova 2 Lite is a fast, cost-effective reasoning model for everyday workloads that can proc...

📝 1,000,000 ctx

mistralai

$0.15/1M

Mistral: Ministral 3 8B 2512

A balanced model in the Ministral 3 family, Ministral 3 8B is a powerful, efficient tiny l...

📝 262,144 ctx

mistralai

$0.10/1M

Mistral: Ministral 3 3B 2512

The smallest model in the Ministral 3 family, Ministral 3 3B is a powerful, efficient tiny...

📝 131,072 ctx

google

$2.00/1M

Google: Nano Banana Pro (Gemini 3 Pro Image Preview)

Nano Banana Pro is Google’s most advanced image-generation and editing model, built on G...

📝 65,536 ctx

qwen

$0.10/1M

Qwen: Qwen3 VL 32B Instruct

Qwen3-VL-32B-Instruct is a large-scale multimodal vision-language model designed for high-...

📝 131,072 ctx

openai

$2.50/1M

OpenAI: GPT-5 Image Mini

GPT-5 Image Mini combines OpenAI's advanced language capabilities, powered by [GPT-5 Mini]...

📝 400,000 ctx

qwen

$0.08/1M

Qwen: Qwen3 VL 8B Instruct

Qwen3-VL-8B-Instruct is a multimodal vision-language model from the Qwen3-VL series, built...

📝 131,072 ctx

openai

$10.00/1M

OpenAI: GPT-5 Image

[GPT-5](https://openrouter.ai/openai/gpt-5) Image combines OpenAI's GPT-5 model with state...

📝 400,000 ctx

google

$0.30/1M

Google: Nano Banana (Gemini 2.5 Flash Image)

Gemini 2.5 Flash Image, a.k.a. "Nano Banana," is now generally available. It is a state of...

📝 32,768 ctx

qwen

$0.13/1M

Qwen: Qwen3 VL 30B A3B Thinking

Qwen3-VL-30B-A3B-Thinking is a multimodal model that unifies strong text generation with v...

📝 131,072 ctx

qwen

$0.13/1M

Qwen: Qwen3 VL 30B A3B Instruct

Qwen3-VL-30B-A3B-Instruct is a multimodal model that unifies strong text generation with v...

📝 131,072 ctx

qwen

$0.26/1M

Qwen: Qwen3 VL 235B A22B Thinking

Qwen3-VL-235B-A22B Thinking is a multimodal model that unifies strong text generation with...

📝 131,072 ctx

qwen

$0.20/1M

Qwen: Qwen3 VL 235B A22B Instruct

Qwen3-VL-235B-A22B Instruct is an open-weight multimodal model that unifies strong text ge...

📝 262,144 ctx

baidu

$0.14/1M

Baidu: ERNIE 4.5 VL 28B A3B

A powerful multimodal Mixture-of-Experts chat model featuring 28B total parameters with 3B...

📝 30,000 ctx

z-ai

$0.60/1M

Z.ai: GLM 4.5V

GLM-4.5V is a vision-language foundation model for multimodal agent applications. Built on...

📝 65,536 ctx

bytedance

$0.10/1M

ByteDance: UI-TARS 7B

UI-TARS-1.5 is a multimodal vision-language agent optimized for GUI-based environments, in...

📝 128,000 ctx

x-ai

$3.00/1M

xAI: Grok 4

Grok 4 is xAI's latest reasoning model with a 256k context window. It supports parallel to...

📝 256,000 ctx

baidu

$0.42/1M

Baidu: ERNIE 4.5 VL 424B A47B

ERNIE-4.5-VL-424B-A47B is a multimodal Mixture-of-Experts (MoE) model from Baidu’s ERNIE...

📝 123,000 ctx

arcee-ai

$0.18/1M

Arcee AI: Spotlight

Spotlight is a 7‑billion‑parameter vision‑language model derived from Qwen 2.5‑VL ...

📝 131,072 ctx

qwen

$0.20/1M

Qwen: Qwen2.5 VL 32B Instruct

Qwen2.5-VL-32B is a multimodal vision-language model fine-tuned through reinforcement lear...

📝 128,000 ctx

google

Free/1M