AI & Automation 15 min read · June 2, 2026 · Updated July 17, 2026

On-Device AI Agents: Image and PDF Optimization for Local Workflows

Local AI agents running on hardware like NVIDIA DGX Spark and Apple Silicon are practical in 2026 - but running the model locally doesn't automatically mean your data stays local. This guide covers what the hardware shift actually enables, how MCP connects agents to genuinely local tools, and how to build an image and PDF optimization workflow that keeps server-side data retention at zero.

Published June 2026 by the Mochify Engineering Team.

1. The On-Device AI Hardware Shift

The gap between cloud AI and desktop AI is closing faster than most people expected, and two hardware announcements in the first half of 2026 make the change concrete enough to build workflows around.

NVIDIA DGX Spark is a compact desktop unit built around the GB10 Grace Blackwell superchip. According to NVIDIA's DGX Spark hardware documentation, it delivers 1 PFLOP of FP4 AI performance, 128 GB of LPDDR5X unified memory at 273 GB/s bandwidth, and 4 TB of NVMe storage. That memory is coherently shared between the CPU and GPU - which is what makes it viable for large-model inference, not just smaller assistant models. Independent benchmark aggregations from May 2026 put 70B Q4 models at roughly 35-45 tokens per second on DGX Spark, versus 25-32 t/s on a 128 GB Mac Studio M5 Max. Both figures are usable for interactive agent workflows; neither approaches datacenter throughput.

NVIDIA RTX Spark was announced on 1 June 2026 in a joint release with Microsoft. It's a new superchip designed for Windows laptops and compact desktops, built on GB10-derived silicon, with a 20-core Grace CPU, 6,144 CUDA cores on a Blackwell GPU, and up to 128 GB of unified memory. The press materials describe it as enabling "personal agents on device" and position it as the hardware backbone for an "agentic AI OS." OEM systems including the Surface Laptop Ultra, Dell XPS 16 Creator Edition, and ASUS ProArt P16 are slated to ship in fall 2026.

Apple Silicon has been doing this quietly for longer. Apple's Foundation Models technical report from June 2025 describes an approximately 3B-parameter multimodal model running entirely on device, with tool-calling support, guided generation, and LoRA fine-tuning for developers at no per-call cloud cost. Larger requests route to Private Cloud Compute (Apple silicon in Apple's own data centers, with no user-data retention), but many assistant tasks complete entirely locally. That makes it a hybrid architecture, like Copilot+ PCs built on Qualcomm Snapdragon X Elite (40+ TOPS NPU) or AMD Ryzen AI 300, but one where the local leg does real work.

The throughput ceiling matters for workflow planning. A 70B model at Q4 quantization needs roughly 42-48 GB of memory (VRAM or unified memory) to load comfortably, and even on 128 GB hardware the output speed stays in the 25-45 tokens/second range. That's sufficient for coding agents, document processing, and automated batch workflows - the kind of background job that runs over a folder of images or a set of PDFs. It's not fast enough for real-time conversational interfaces at scale. Knowing where that ceiling sits is what separates a local deployment that works from one that frustrates.
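The memory figure above can be sanity-checked with simple arithmetic. The sketch below is a rough estimate only: the 4.8 bits-per-weight value and the 5 GB allowance for KV cache and runtime are illustrative assumptions, not measurements from this guide, and real quantization formats vary.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    # billions of parameters * bits per parameter / 8 bits per byte = gigabytes
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         overhead_gb: float, memory_gb: float) -> bool:
    """True if weights plus a fixed allowance for KV cache and runtime fit."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= memory_gb


# Assumed values: ~4.8 bits/weight for a 4-bit scheme, 5 GB of overhead.
print(round(weights_gb(70, 4.8), 1))  # 42.0
print(fits(70, 4.8, 5, 48))           # True
print(fits(70, 4.8, 5, 32))           # False
```

Under those assumptions a 70B model lands at the low end of the 42-48 GB range and does not fit on a 32 GB machine, which matches the tiering described here.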

2. What Local Agents Can (and Can't) Do Today

Local agents running on 128 GB hardware are genuinely capable for a well-defined class of tasks. The practical ceiling is the 7-70B model range.

For the 7-13B tier - which runs comfortably on 16-32 GB AI PCs at 30-100+ tokens per second - you get reliable performance on coding assistance, document Q&A, summarisation, and structured data extraction. These models handle tool-use frameworks well: they can plan a multi-step workflow, call tools via MCP, interpret results, and act on them without needing frontier-model reasoning quality. For straightforward image batch processing or file organization tasks, a well-prompted 7B model orchestrating the right tools is entirely workable.

At the 70B tier on 128 GB systems, reasoning quality improves substantially. Models like Llama 3.3 70B at Q4 quantization fit in around 42-48 GB of memory, leaving headroom for context and KV cache. The throughput is slower, but for batch workflows that run in the background - processing a folder of assets, converting documents, auditing a codebase - the 25-45 t/s range is plenty.
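To see why that range is workable for background jobs, here is a back-of-envelope estimate of generation time for a batch. The 300 generated tokens per file is a made-up planning figure, and the sketch ignores prompt processing and the tool's own runtime.

```python
def batch_minutes(files: int, tokens_per_file: int, tokens_per_second: float) -> float:
    """Minutes the model spends generating output tokens for a whole batch."""
    return files * tokens_per_file / tokens_per_second / 60


# Hypothetical job: 200 files, ~300 generated tokens of planning and tool calls each.
print(round(batch_minutes(200, 300, 25), 1))  # 40.0
print(round(batch_minutes(200, 300, 45), 1))  # 22.2
```

A job of that size finishes in well under an hour at either end of the range, which is fine for unattended work but would feel slow in an interactive session.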

Where local agents still fall short: holding 1M-token context windows requires hundreds of GB of high-bandwidth memory, currently the domain of H200 and MI300X class hardware. Very large models (120B+) at useful throughput are not practical on 128 GB consumer systems. Complex multi-hop reasoning on frontier tasks still benefits from cloud model quality.

Jensen Huang made a useful point in May 2026: agentic AI demands roughly ten times more compute than earlier generative AI, because agents read large inputs, reason, call tools, and generate many output tokens in a single pass. That amplifier makes local compute more valuable - it gets used up faster - but it also means even powerful desktop hardware will continue to route some work to cloud backends. Every credible platform (Apple, Microsoft, NVIDIA, Qualcomm) acknowledges this: they all describe hybrid architectures, not purely local stacks.

The practical upshot for media workflows: local agents on 128 GB hardware are the right home for orchestration, decision-making, and file management. Encoding-heavy work - the actual compression, conversion, and transformation of images and PDFs - should sit in a dedicated tool process that keeps media bytes out of the LLM context entirely.

3. How MCP Connects Agents to Local Tools

MCP (Model Context Protocol) is an open standard introduced by Anthropic in late 2024 for connecting AI agent runtimes to external tools. It defines a standard JSON-RPC message format and two transport modes: stdio for local, same-machine servers, and streamable HTTP for networked servers.

The stdio transport is the architecturally important one for local agents. When a host application (Claude Desktop, Cursor, Claude Code, or any MCP-compatible runtime) uses a stdio server, it launches the server as a child process, communicates with it over stdin/stdout, and manages the process lifecycle itself. Per the official MCP transport specification: "the client launches the MCP server as a subprocess... sends messages over stdin and reads responses from stdout." No network ports, no sockets required. The only configuration is a command and its arguments in a JSON config file.
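As a rough illustration of what travels over that pipe, the sketch below frames a JSON-RPC request as one newline-terminated line, which is how the stdio transport delimits messages. The tool name optimize_image and its path argument are hypothetical placeholders, not names taken from any real server.

```python
import json


def encode_message(message: dict) -> bytes:
    """Serialize one JSON-RPC message as a single newline-terminated line."""
    # json.dumps escapes newlines inside strings, so the line contains no raw "\n".
    line = json.dumps(message, separators=(",", ":"))
    return line.encode("utf-8") + b"\n"


def decode_message(line: bytes) -> dict:
    """Parse one line read from the peer's stdout back into a message."""
    return json.loads(line.decode("utf-8"))


def tools_call(request_id: int, tool: str, arguments: dict) -> dict:
    """Build a tools/call request; tool and arguments are caller-supplied."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# Hypothetical tool name and argument, for illustration only.
request = tools_call(1, "optimize_image", {"path": "/project/assets/hero.png"})
wire = encode_message(request)
print(wire)
```

Note that the request carries a file path, not file contents, so the message stays small no matter how large the image is.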

This matters for on-device workflows because a well-designed stdio tool server runs entirely within the user's OS permissions with no required network connectivity of its own. For a media optimization server, this means the agent can send a local file path, the tool processes the file, and returns a new path - without any of the image data passing through the LLM's context window.

MCP has been adopted quickly. By early 2026, Claude Desktop, Claude Code, Cursor, ChatGPT, and VS Code all support MCP servers. MCP marketplaces list hundreds of servers spanning image processing, database connectors, code search, and cloud provider SDKs. That breadth means the ecosystem of callable tools on local hardware is already substantial - and as new agent runtimes ship on DGX Spark and RTX Spark class hardware, they inherit the same tool ecosystem without re-integration work.

On platforms like Claude Code, adding a local MCP server is a one-line command (claude mcp add). On Claude Desktop, it's a short JSON snippet in claude_desktop_config.json. The agent then discovers the server's tools automatically and can call them from natural-language prompts without custom code. See our guide on how the Mochify MCP server works for a full breakdown of both MCP surfaces.

4. Local Agent Does Not Mean Local Data

This is the most important nuance in the whole on-device AI space, and it gets glossed over constantly in vendor marketing.

Running your agent model locally - whether that's a 70B model on DGX Spark or Apple's 3B on-device model - means inference happens on your hardware. It does not mean every tool the agent calls processes data locally.

Most MCP tool servers wrap remote APIs. TinyPNG's MCP integrations pass images to TinyPNG's cloud API for compression. The Avanquest PDF MCP uploads documents to Avanquest's servers for conversion. FAL's image generation MCP sends prompts to FAL's cloud infrastructure. These are all legitimate tools, but they're hybrid: the agent runtime is local, the processing is remote.

For anyone handling client documents, assets under NDA, or regulated data, this distinction matters directly. GDPR Article 5(1)(c), as the UK ICO explains in their data minimization guidance, requires that personal data be "adequate, relevant and limited to what is necessary." Using a cloud tool to process client images or PDFs creates a data transfer to a third party. That transfer needs to be accounted for under your data-processing obligations. Zero-retention server-side processing - where originals are never written to disk - simplifies that picture considerably.

The security angle compounds this. When a local agent runtime has access to API keys, configuration files, and private repositories, a single misconfigured tool call can silently exfiltrate sensitive content to a remote service - even if the model itself runs entirely offline. A local agent runtime is not a sandboxed environment by default. Giving an agent access to a tool that makes undisclosed network calls is a data governance risk regardless of where the model runs.

A practical three-question audit for any MCP tool you're adding to a privacy-sensitive workflow:

- Does it use stdio (same-machine subprocess) or HTTP (networked) transport?
- If HTTP, does it call a third-party remote API or only your own infrastructure?
- What data enters the tool call, and where exactly does it go?
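The first question can often be answered from the host's config file. The sketch below assumes a common convention in which stdio servers are declared with a "command" key and networked servers with a "url" key; hosts differ, so treat the key names as assumptions and check your own host's format. The remote-example entry is invented for the demonstration.

```python
import json


def classify_transport(entry: dict) -> str:
    """Guess the transport of one server entry from the keys it declares."""
    if "command" in entry:
        return "stdio"
    if "url" in entry:
        return "http"
    return "unknown"


def audit(config_text: str) -> dict:
    """Map each configured server name to its apparent transport."""
    servers = json.loads(config_text).get("mcpServers", {})
    return {name: classify_transport(entry) for name, entry in servers.items()}


example = """
{
  "mcpServers": {
    "mochify": {"command": "mochify", "args": ["serve"]},
    "remote-example": {"url": "https://example.com/mcp"}
  }
}
"""
print(audit(example))  # {'mochify': 'stdio', 'remote-example': 'http'}
```

A stdio result only tells you the server runs as a local subprocess. That process can still make its own network calls, so the remaining questions need a manual answer from the tool's documentation.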

The important distinction is between the agent runtime being local and the tools being local. They are independent variables. A well-designed tool stack for regulated work audits both. For more on the privacy implications of different optimization paths, see our comprehensive guide to privacy and image optimization.

5. Why Image Tokens Are Expensive in Agent Context

Passing images directly into a language model's context - rather than through a tool that returns a file path - is expensive in two ways that compound on local hardware.

First, raw token cost. Claude's vision documentation gives the formula: approximately width * height / 750 tokens per image. A 1,000 x 1,000 pixel image (1 megapixel) costs roughly 1,334 tokens. By the same formula, a standard product photo at 3,000 x 2,000 runs to around 8,000 tokens. Pass ten such images inline to start a batch job and you've consumed roughly 80,000 tokens before the agent has done any work.
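The arithmetic is easy to reproduce. This sketch applies the width * height / 750 approximation as quoted and rounds up; it deliberately ignores any resizing a provider may apply before counting, so treat the results as estimates.

```python
import math


def image_tokens(width: int, height: int) -> int:
    """Estimated token cost of one image using the width * height / 750 rule."""
    return math.ceil(width * height / 750)


print(image_tokens(1000, 1000))       # 1334
print(image_tokens(3000, 2000))       # 8000
print(10 * image_tokens(3000, 2000))  # 80000
```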

Second, context ceiling. Claude's context window documentation states that a single request can include up to 600 images or PDF pages (100 for models using 200k-token context windows). Those ceilings are manageable for cloud models with large context windows. For local models - which typically run with 8k to 32k context windows on consumer hardware - passing even a handful of full-resolution images inline can exhaust the entire context before any processing pipeline begins. At that point the workflow simply doesn't work.
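To make the ceiling concrete, the sketch below counts how many images fit in a given context window using the same width * height / 750 estimate. Reading "8k" and "32k" as 8,192 and 32,768 tokens is an assumption, and the optional reserve for the system prompt and tool definitions is illustrative.

```python
def images_that_fit(context_tokens: int, width: int, height: int,
                    reserved_tokens: int = 0) -> int:
    """How many same-sized images fit inline in a context window."""
    per_image = -(-width * height // 750)  # ceiling division
    return max(0, (context_tokens - reserved_tokens) // per_image)


print(images_that_fit(8192, 3000, 2000))   # 1
print(images_that_fit(32768, 3000, 2000))  # 4
print(images_that_fit(8192, 1000, 1000, reserved_tokens=2000))  # 4
```

Even before the agent's own instructions are counted, a 32k window holds only four product photos at this resolution.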

The established pattern, reflected in the MCP resources specification, is to pass file paths and metadata rather than binary data. The spec defines URI-based resource references (for example, file:///home/user/project/image.jpg) precisely for this: the tool server handles the file, the agent handles the reference. Anthropic's engineering write-up on MCP with code execution frames this directly: MCP lets agents "use fewer tokens" by moving heavy data and compute into dedicated tool processes, with the model only seeing references and summaries.
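A reference of that kind is cheap to build. The sketch below turns an absolute POSIX path into a file:// URI plus a little metadata; the uri, name, and mimeType field names follow the MCP resource shape, while the helper itself is a hypothetical illustration rather than any server's actual code.

```python
import mimetypes
import posixpath
from urllib.parse import quote


def file_reference(path: str) -> dict:
    """Describe a file by URI and metadata instead of embedding its bytes."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute POSIX path")
    name = posixpath.basename(path)
    mime, _ = mimetypes.guess_type(name)
    return {
        "uri": "file://" + quote(path),  # percent-encodes spaces and similar
        "name": name,
        "mimeType": mime or "application/octet-stream",
    }


ref = file_reference("/home/user/project/image.jpg")
print(ref["uri"])  # file:///home/user/project/image.jpg
```

The whole reference is a few dozen tokens regardless of the image's resolution, which is the point of the pattern.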

For on-device agents where context windows are tight and token throughput is measured in tens per second, this isn't an optimization. It's a prerequisite for running image workflows at any reasonable scale. For the per-model breakdown of how images are priced, see our guide to LLM image token costs.

6. Mochify Workflow: Optimizing Images and PDFs Inside a Local Agent

Mochify's local MCP server is built for exactly this pattern: the agent describes what it needs in plain language, the tool handles the file, and only a file path comes back into the agent's context. No image bytes enter the model. No context gets bloated. For the PDF-specific side of this, see working with PDFs in AI agent workflows, which digs into page extraction and splitting inside an agent pipeline in more detail.

Step 1: Install the Mochify CLI
The same Rust binary serves as both CLI and local MCP server. Install via Homebrew:
```
brew install mochify
```
On Linux, use the curl installer from github.com/getmochify/mochify-cli, or cargo install from the repo directly.
Step 2: Authenticate once
Run mochify auth login. A browser OAuth flow handles authentication, and credentials are saved to ~/.config/mochify/credentials.toml. Both the CLI and the local MCP server pick them up automatically - you won't need to touch them again.
Step 3: Wire the local MCP server into your agent host
For Claude Desktop, add a short snippet to claude_desktop_config.json:
```
{
  "mcpServers": {
    "mochify": {
      "command": "mochify",
      "args": ["serve"]
    }
  }
}
```
For Claude Code: claude mcp add mochify mochify serve. For the full setup and workflows, see our guide to the Mochify CLI and MCP workflow in Claude Code.
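If the config file already lists other servers, the entry has to be merged in rather than pasted over the top. The sketch below does that on the JSON text. The helper name is hypothetical, and reading and writing the file is left out because the config location varies by platform.

```python
import json


def add_mcp_server(config_text: str, name: str, command: str, args: list[str]) -> str:
    """Return config JSON with one stdio server added, keeping existing entries."""
    config = json.loads(config_text) if config_text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": args}
    return json.dumps(config, indent=2)


# Example: an existing config that already declares another (made-up) server.
existing = '{"mcpServers": {"other": {"command": "other-tool", "args": []}}}'
updated = add_mcp_server(existing, "mochify", "mochify", ["serve"])
print(updated)
```

Restart the host after editing so it relaunches its stdio servers with the new configuration.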
Step 4: Describe your task with Magic Flow
Mochify's natural language interface, Magic Flow, means you describe what you want rather than specifying format, quality, and resize settings by hand. The two-step pipeline - a language model parses the prompt, then Mochify's C++ compression engine executes - handles parameter resolution automatically. Real examples that work:
- "Convert all the PNGs in /project/assets to AVIF at web quality"
- "Compress these product photos, strip EXIF data, max 1600px wide"
- "Extract page 1 of the brief as a WebP thumbnail"
- "Split this 40-page contract into individual single-page PDFs"
Magic Flow is available in the web app, via the REST API at POST /v1/prompt, through the CLI with the -p flag, and on both MCP server surfaces.
Step 5: Receive file paths, not bytes
The local MCP server returns file paths and metadata - the path to the optimized output, format, dimensions, and compression ratio. No image data enters the agent's context. This is what makes high-volume batch processing viable on local hardware: the context window stays clean regardless of how many files are in the job.

Privacy note

Images and PDFs travel from your machine to api.mochify.app over HTTPS for the encoding step. Processing happens in RAM, with no disk writes of the source and no logs containing file data. The local Rust binary is a client over that API; it does not encode locally. Because the local MCP path uses no pickup store, compressed bytes come straight back to the binary and are written to disk - zero server-side retention end-to-end.

The hosted MCP server at mcp.mochify.app works differently: it returns a short-lived download URL on files.mochify.app (valid for about five minutes) rather than a file path. That's the right surface for non-developers who want OAuth-based access without installing anything. For agent workflows where you want file paths and zero server-side retention end-to-end, the local server is the right choice. See why we adjusted our zero-retention policy for MCP for the full explanation.

On video: if your workflow involves video, that's handled separately. Mochify's video engine runs entirely client-side in the browser - the bytes never leave your device. That's a stronger local privacy guarantee than even the MCP path, but it means video is web-app-only. The CLI, local MCP server, and REST API handle images and PDFs. Route video compression through the web app.

MCP access is available on all Mochify tiers including Free (25 images/month). For heavier batch workloads, Seller ($7.99/month) gives you 300 images/month and 25-file batches. Details at mochify.app/pricing.

7. Cheat Sheet: On-Device Agent and Local MCP Tool Stack

| Scenario | Local or remote? | Recommended approach | Data leaves device? |
| --- | --- | --- | --- |
| 70B inference (DGX Spark, Mac Studio 128 GB) | Local model | Ollama, llama.cpp, MLX | No |
| 7-13B inference (AI PC, 16-32 GB) | Local model | Ollama, llama.cpp | No |
| Tool calls via MCP stdio | Local subprocess | Any stdio MCP server | Only what the tool sends |
| Image/PDF optimization (Mochify local) | Local client, remote encoding | mochify serve + mochify auth login | Yes - HTTPS to api.mochify.app, zero retention |
| Image optimization via TinyPNG MCP | Remote API wrapper | TinyPNG MCP | Yes - to TinyPNG's servers |
| PDF handling via Avanquest MCP | Remote API wrapper | Avanquest PDF MCP | Yes - to Avanquest's servers |
| Video compression | Client-side browser only | mochify.app web app | No - browser only |
| Passing images inline to LLM | In-context | Avoid for local agents | Yes, in prompt |
| Passing file paths via MCP | Out-of-context | Recommended pattern | Paths and metadata only |
| Mochify hosted MCP | Remote connector | mcp.mochify.app (OAuth) | Yes - HTTPS + 5-min URL pickup |

FAQ

What hardware do I need to run a local AI agent for image workflows?

For 7-13B models, which handle most tool-orchestration tasks well, a 16-32 GB AI PC with a modern NPU or dedicated GPU is sufficient. You'll get 30-100 tokens/second at Q4 quantization - fast enough for interactive workflows and background batch jobs. For 70B models, you need at least 48 GB of unified memory; 128 GB gives comfortable headroom for context and KV cache. NVIDIA's DGX Spark and Apple's Mac Studio M5 Max (128 GB) both sit in that tier, delivering roughly 25-45 tokens/second on 70B Q4 workloads.

Does running a local AI agent mean my data stays on my machine?

Not automatically. The inference runs on your hardware, but tool calls made through MCP or other integration layers may send data to remote APIs depending on which tools are configured. Every tool in the workflow needs to be evaluated on its own: check whether it uses stdio transport (same-machine subprocess) or HTTP transport (networked), and whether any HTTP server calls a third-party remote API. Never assume “local agent” implies “local tools.”

Why shouldn't I pass images directly into my local agent's context?

Image tokens are expensive. A 1-megapixel image costs roughly 1,334 tokens (approximately width * height / 750), and local models typically run with 8k-32k context windows. Passing a batch of images inline can exhaust the entire context before any processing happens. The right pattern is to use an MCP tool that handles the image and returns a file path - the agent works with the path, not the pixels.

What is the difference between Mochify's hosted MCP server and its local MCP server?

The hosted MCP server (mcp.mochify.app) is a remote connector: register it via OAuth, it processes your image server-side, and returns a short-lived download URL on files.mochify.app (valid for about five minutes). No install required. The local MCP server is the same Rust binary run in server mode via mochify serve. It returns file paths and metadata directly to the agent - no URL, no pickup store - and keeps zero server-side retention end-to-end. For developer and agent workflows, the local server is the default recommendation.

Can Mochify's local MCP server handle PDF workflows too?

Yes. PDFs are a first-class format alongside images. The local MCP server can extract individual pages as PNG, JPEG, or WebP images, and can split multi-page PDFs into individual single-page files. Both operations are Magic Flow-capable: describe what you want in plain language (“extract page 3 as WebP,” “split this into pages”) and the system handles the parameters. The privacy model is the same as images: encoding happens at api.mochify.app in RAM with no source disk writes.

Does the local MCP server work with agent runtimes other than Claude?

Any MCP-compatible host works, including Claude Desktop, Claude Code, Cursor, ChatGPT, Gemini, and VS Code. If your agent runtime supports stdio MCP servers, Mochify's local server wires in the same way regardless of the underlying model or host.

What's the practical difference between 7B and 70B models for image workflow automation?

A 7B model is fast (75-125 t/s on good hardware) and handles most tool-orchestration tasks competently: calling MCP tools, interpreting results, writing structured outputs. For straightforward batch processing - “compress everything in this folder to AVIF” - 7B is usually sufficient. A 70B model adds better reasoning on ambiguous inputs: understanding a vague prompt, handling edge cases in a document, or planning a complex multi-step workflow. If your agent needs to read a document and make decisions based on its content before taking action on assets, the quality difference is meaningful.

Will video optimization ever be available from the CLI or local MCP server?

Not currently, and there's no commitment to when or whether that changes. Mochify's video engine runs client-side in the browser - the bytes never leave your device, which is a stronger privacy guarantee than any server-side path. That architecture is inherently web-app-only. If your workflow involves video, use the web app; route image and PDF work through the CLI or local MCP server.

Free Tool

Connect your local agent to Mochify

Install the CLI, run mochify auth login once, and your agent can start optimizing images and PDFs with a natural language prompt - zero bytes in the model context.

Try it free at mochify.app