LLM Image Token Costs: How Many Tokens Does an Image Use?

There is no single answer to how many tokens an image uses in an LLM; it depends on the provider, the model, and the image's dimensions. On Anthropic's Claude, the rule is tokens = (width × height) / 750, so a 1-megapixel (1000×1000 px) image costs roughly 1,334 tokens, and any single image is capped at 1,568 tokens on standard models. OpenAI's GPT-4o and GPT-4.1 use a tile model (85 base tokens plus 170 per 512 px tile), and Google's Gemini uses fixed 768×768 px tiles at 258 tokens each. The practical takeaway for anyone building agents is the same across all three: inline images burn through a context window fast, so the established fix is to pass the model file paths or URIs, not the raw bytes.

Published June 2026 by the Mochify Engineering Team.

Per-image token cost by provider

The numbers below come from each provider's current vision documentation. Tokenisation rules change with model releases, so treat these as 2025–2026 figures and re-check the provider's documentation before you rely on a number.

| Provider | Formula | 1 MP image (1000×1000 px) | Typical photo (1920×1080 px) | Caps |
|---|---|---|---|---|
| Anthropic / Claude (standard models) | (width × height) / 750 | ~1,334 tokens | ~1,568 tokens (capped and downscaled) | 1,568 tokens max; long edge ≤1,568 px |
| Anthropic / Claude Opus 4.7 & 4.8 | Same formula, higher native resolution | ~1,334 tokens | ~2,765 tokens | 4,784 tokens max; long edge ≤2,576 px |
| OpenAI GPT-4o / GPT-4.1 (detail: high) | 85 base + 170 per 512 px tile (after scaling to fit 2048 px, shortest side 768 px) | ~765 tokens (4 tiles) | ~1,105 tokens (scaled to about 1365×768 px, 6 tiles) | 500 images and 50 MB payload per request |
| OpenAI GPT-4o (detail: low) | Fixed | 85 tokens | 85 tokens | Flat rate |
| OpenAI GPT-4.1-mini / o4-mini (patch model) | ceil(w/32) × ceil(h/32) patches, capped at 1,536, × model multiplier (1.62 for 4.1-mini, 1.72 for o4-mini) | ~1,659 tokens (1,024 patches × 1.62) | Scales to ≤1,536 patches, then multiplier | 1,536 patch cap before multiplier |
| Google Gemini | Both dims ≤384 px → 258 tokens flat; otherwise tile into 768×768 px tiles at 258 tokens each | ~258–1,032 tokens (tile-count dependent) | ~258–516 tokens | No per-image cap stated |

Sources: Anthropic Claude vision docs, OpenAI vision docs, and Google Gemini token docs, all accessed June 2026. Worked Gemini figures for non-square photos are approximate; Google documents the per-tile cost but not a worked example for a 16:9 image.
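The formulas in the table can be sketched as small estimators. This is a minimal sketch, not any provider's official calculator: the function names are hypothetical, rounding up to a whole token is an assumption, and the patch-model function simply clamps to the 1,536-patch cap instead of reproducing the provider's exact downscaling, so treat its result for large images as an upper bound.

```python
import math

def claude_tokens(width, height, max_edge=1568, max_tokens=1568):
    """Estimate Claude image tokens: downscale to the long-edge limit,
    apply (width * height) / 750, then apply the per-image cap."""
    scale = min(1.0, max_edge / max(width, height))
    w, h = round(width * scale), round(height * scale)
    return min(math.ceil(w * h / 750), max_tokens)

def openai_tile_tokens(width, height, detail="high"):
    """Estimate GPT-4o / GPT-4.1 tokens with the tile model."""
    if detail == "low":
        return 85  # flat rate
    # Step 1: fit inside a 2048 x 2048 square (downscale only).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # Step 3: count 512 px tiles, then add the 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

def openai_patch_tokens(width, height, multiplier=1.62, cap=1536):
    """Estimate patch-model tokens (1.62 for 4.1-mini, 1.72 for o4-mini).
    Assumption: clamp to the cap rather than model the exact resize."""
    patches = math.ceil(width / 32) * math.ceil(height / 32)
    return math.ceil(min(patches, cap) * multiplier)

print(claude_tokens(1000, 1000))                                 # 1334
print(claude_tokens(1920, 1080))                                 # 1568 (capped)
print(claude_tokens(1920, 1080, max_edge=2576, max_tokens=4784)) # 2765
print(openai_tile_tokens(1000, 1000))                            # 765
print(openai_tile_tokens(1920, 1080))                            # 1105
print(openai_patch_tokens(1000, 1000))                           # 1659
```

Passing the Opus limits (`max_edge=2576`, `max_tokens=4784`) reproduces the higher-resolution row; Gemini is left out because its tile count for non-square images is not documented with a worked example.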

Why this matters for local and agent workflows

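The cost is small for one image and dangerous at scale. Claude allows up to 100 images per API request on its 200k-context models, and OpenAI allows 500 image inputs per request, but the token budget usually binds before those hard limits do: 100 one-megapixel images at about 1,334 tokens each come to roughly 133,400 tokens, two-thirds of a 200k window, before any prompt text, tool output, or response is counted.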

It is worse on local and open-weight models. Consumer hardware typically runs models at an 8k–32k token context window. Taking Claude's rate of about 1,334 tokens per 1 MP image as a yardstick (open-weight vision encoders have their own rates), just 6 to 24 full-resolution inline images would exhaust the entire context before the model does any work. The bottleneck on memory-constrained hardware is rarely compute; it is context saturation.
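The arithmetic behind the 6-to-24 range is easy to check. The sketch below assumes a flat 1,334 tokens per image and round 8,000 and 32,000 token windows; the function names and the `reserved_tokens` parameter are illustrative, not part of any provider's API.

```python
def images_to_exhaust(context_tokens, tokens_per_image=1334):
    """Smallest number of images whose tokens meet or exceed the window."""
    return -(-context_tokens // tokens_per_image)  # integer ceiling division

def images_that_fit(context_tokens, tokens_per_image=1334, reserved_tokens=0):
    """Largest number of images that fit after reserving room for
    the prompt, tool output, and the model's reply."""
    usable = max(0, context_tokens - reserved_tokens)
    return usable // tokens_per_image

print(images_to_exhaust(8_000))    # 6
print(images_to_exhaust(32_000))   # 24
print(images_that_fit(8_000, reserved_tokens=2_000))  # 4
```

Reserving even 2,000 tokens for instructions and a reply leaves room for only four such images in an 8k window.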

The fix: pass file paths, not image bytes

The durable pattern is to keep binary out of the context window and hand the model a reference instead. The Model Context Protocol resources specification is built around exactly this: resources are identified by a URI (file:///…, https://…), so an agent receives a path or identifier rather than the encoded image. Anthropic's own guidance echoes the idea, noting that referencing uploaded images by file_id keeps request payloads small regardless of how many images accumulate in a conversation.
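To make the difference concrete, the sketch below compares the serialized size of an inline base64 image with that of a URI reference. The dictionaries are loosely modelled on MCP's image content and resource fields (`data`, `uri`, `name`, `mimeType`) but are illustrative only, not complete protocol messages; the helper names and the `screenshots/home.png` path are hypothetical. It measures payload size in characters, which is a proxy and not a token count.

```python
import base64
import json
from pathlib import Path

def inline_payload(image_bytes, mime_type="image/png"):
    """Embed the image itself: the payload grows with the file size."""
    return {
        "type": "image",
        "mimeType": mime_type,
        "data": base64.b64encode(image_bytes).decode("ascii"),
    }

def reference_payload(path, mime_type="image/png"):
    """Point at the image by URI: the payload stays a few dozen bytes."""
    return {
        "uri": path.resolve().as_uri(),  # e.g. file:///.../screenshots/home.png
        "name": path.name,
        "mimeType": mime_type,
    }

fake_image = bytes(300_000)  # stand-in for a 300 kB screenshot
inline = json.dumps(inline_payload(fake_image))
ref = json.dumps(reference_payload(Path("screenshots/home.png")))

print(len(inline))  # about 400,000 characters of base64
print(len(ref))     # roughly a hundred characters, depending on the directory
```

The reference stays the same size whether the file is 30 kB or 30 MB, which is why the pattern scales across long conversations.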

This is where Mochify's local MCP server fits a token-cost argument cleanly. Run as mochify serve, it returns file paths and metadata to the agent, not image bytes, so a compression step never injects a multi-thousand-token blob into the model's context. You drive it in plain English, for example: compress the PNGs in ./screenshots to WebP and give me the new paths. The encoding itself runs on Mochify's API (api.mochify.app), where files are streamed into memory and wiped immediately with zero retention; the image data travels to the API to be encoded, so it is not processed on your own machine, but it also never lands in the agent's context window. The hosted MCP server follows the same principle from the other direction, returning a short-lived download URL rather than inline binary. For a full local-workflow setup on constrained hardware, see the On-Device AI Agents guide.
