AI & Automation 12 min read · June 16, 2026

Working with PDFs in AI Agent Workflows: Extract, Split, and Convert Pages

If you build agents or automation, sooner or later a PDF lands in the pipeline and the agent needs a page as an image, the document broken into single pages, or a thumbnail for a preview card. This guide covers the three operations that recur in automated workflows: extracting images from PDF files, splitting multi-page documents, and converting pages to web formats. It also covers the correctness traps, the agent tool-call pattern that keeps token costs sane, and the privacy model that matters when those PDFs hold regulated data.

Published June 16, 2026 by the Mochify Engineering Team. This guide is for developers and AI builders wiring PDF operations into pipelines: the mechanics of rendering versus extraction, splitting without breaking documents, converting pages to WebP and AVIF, the paths-not-bytes pattern, and a privacy-first workflow for regulated documents.

The three core PDF operations

Most PDF work in an agent pipeline reduces to three operations: turning a page into an image, splitting a multi-page file into single pages, and converting those page images into a web-friendly format. Each is conceptually simple and each has a sharp edge that bites automated workflows specifically.

The reason these three keep recurring is that a PDF is not an image and not a neat bag of text. It is a set of drawing instructions: text runs, vector paths, embedded bitmaps, fonts, and transparency, composited at render time. That single fact explains nearly every gotcha further down. "Get me page 3 as a PNG" is really "execute the page's drawing instructions at a chosen resolution and flatten the result to a raster," which is a different job from "pull the embedded photo out of page 3." Keep those two jobs distinct and most of the confusion disappears.

We'll take the operations in the order they usually appear in a pipeline: extract, split, convert.

How to extract images from a PDF

The phrase "extract images from a PDF" hides two genuinely different operations, and picking the wrong one is the most common mistake we see. You either pull embedded image objects straight out of the file, or you render the whole page to a new raster image. They produce different results and fail in different ways.

Embedded image extraction reads the PDF's internal object table and exports the original image streams, preserving the source format and resolution. This is the right call when a document contains a photo or a figure you want to reuse at its native quality, and tools like Poppler's pdfimages or PyMuPDF's get_images/extract_image are built for exactly this. The catch: it only sees objects that exist as discrete images. A text-only page, a vector chart, or a page that is itself one giant scanned bitmap will not give you the clean "logo.png" you were hoping for.

Page rendering runs the page's full drawing model (text, vectors, images, fonts, transparency) and composites it into a bitmap at a resolution you choose. This is what you want for previews, thumbnails, and feeding a page to a vision model. The single most important parameter is DPI. Renderers commonly default to roughly 72 DPI, which is fine for a tiny thumbnail and visibly blurry for anything else. As a rule of thumb, set 150 DPI for legible on-screen page images, 300 DPI when the output may be printed, and only reach for 600 DPI for fine line art or archival work. Higher DPI is not free: file size and render time climb with it, and you cannot recover detail by upscaling a low-DPI render after the fact, so set the resolution at the initial rasterisation step, not in post.
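To make the DPI trade-off concrete, here is a minimal sketch of the arithmetic a renderer performs. It relies on the fact that PDF page sizes are measured in points (1/72 inch); the function name and the US Letter example are ours, chosen for illustration.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # PDF user-space unit: 1 point = 1/72 inch


def render_size_px(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points (8.5 x 11 inches).
for dpi in (72, 150, 300, 600):
    w, h = render_size_px(612, 792, dpi)
    print(f"{dpi:>3} DPI -> {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

A Letter page is 1275 x 1650 pixels at 150 DPI and 2550 x 3300 at 300 DPI. Pixel count grows with the square of the DPI, which is why doubling the resolution roughly quadruples the raw data a render has to produce.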

A second rule of thumb worth internalising: rendering quality is not only about DPI. Anti-aliasing, font hinting, and color-space handling all affect whether small text and thin lines come out crisp or jagged, which is why a mature rendering engine matters more than a high number in the DPI field.

So the decision is: do you need the asset that was placed on the page (extract embedded) or what the page looks like (render)? Agent pipelines almost always want the second, because they are producing previews and model inputs rather than harvesting original artwork.

Splitting multi-page PDFs without breaking them

Splitting a multi-page PDF into single-page files is the simplest operation to describe and the easiest to get subtly wrong. Copying page ranges into new documents is trivial; preserving everything attached to those pages is not.

The two failure modes to plan around:

First, off-by-one page ranges. Command-line tools and human-facing UIs almost always use 1-based page numbers, while most programming libraries index pages from 0. An automation script that mixes a 1-based prompt ("split off pages 1 to 5") with a 0-based API call is the classic source of a split that is silently shifted by one page. Decide on one convention at the boundary of your pipeline and normalise to it.

Second, structural loss. Splitting often drops document features that live above the page level. The pdfcpu project is refreshingly explicit about this: its split documentation states that annotations, outlines (bookmarks), structure trees, and forms "are not carried over into the output files," because copying them piecemeal can leave them broken or invalid. That matters more than it sounds. If your pipeline relies on a tagged structure tree for accessibility, on bookmarks for navigation, or on form fields staying interactive, a naive split will quietly degrade all three. For most agent use cases (slice a report into per-page PDFs to route, summarise, or attach) this is acceptable. For anything touching forms or accessibility tags, treat split output as lossy and verify.

The practical guidance: split by page span for batch routing, split by bookmark when you genuinely need chapter-level sections and your tool supports the bookmark depth you need, and always test what survives before you trust it in production.
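One way to pin down the page-numbering convention at the pipeline boundary is a single conversion function that every caller goes through. The sketch below assumes the pipeline accepts 1-based inclusive ranges from prompts and hands 0-based indices to a library; the function name is hypothetical.

```python
from typing import List


def to_zero_based(first: int, last: int, page_count: int) -> List[int]:
    """Convert a 1-based inclusive page range to 0-based page indices."""
    if not 1 <= first <= last <= page_count:
        raise ValueError(f"page range {first}-{last} is outside 1-{page_count}")
    return list(range(first - 1, last))


# "Split off pages 1 to 5" of a 12-page file becomes indices 0 through 4.
print(to_zero_based(1, 5, page_count=12))
```

Rejecting out-of-range input here, rather than clamping it, means a shifted range fails loudly instead of producing a split that looks plausible and is one page off.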

Converting pages to web formats: WebP, AVIF, and documents as web assets

Converting rendered page images into WebP or AVIF is what turns "a PDF" into "web assets." Once a page is a raster image, the same format economics that govern product photos apply: modern formats are smaller than JPEG at the same perceived quality, which is the whole point of generating them.

WebP is the safe default. It is broadly supported and, per MDN's image format guide, delivers meaningfully smaller files than JPEG at comparable quality; Google's own comparative study puts lossy WebP around 25–34% smaller than JPEG at equivalent quality. AVIF often compresses smaller still and supports wider color and HDR, at the cost of slower encoding and patchier tooling. The conclusion many teams reach in practice: default to WebP, offer AVIF where the tooling is solid, and keep a JPEG fallback for legacy clients via a <picture> element.
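As a rough sketch of that fallback chain, the helper below builds the markup for one page preview. It assumes the pipeline has already written the same page as .avif, .webp, and .jpg files sharing one base URL; that naming scheme and the function name are illustrative, not something Mochify prescribes.

```python
from html import escape


def picture_tag(base_url: str, alt: str) -> str:
    """Build a <picture> element that offers AVIF, then WebP, then JPEG."""
    base = escape(base_url, quote=True)
    return (
        "<picture>\n"
        f'  <source srcset="{base}.avif" type="image/avif">\n'
        f'  <source srcset="{base}.webp" type="image/webp">\n'
        f'  <img src="{base}.jpg" alt="{escape(alt, quote=True)}">\n'
        "</picture>"
    )


print(picture_tag("/previews/report-p1", "First page of the report"))
```

The browser takes the first source whose type it supports, so the order of the source elements is what expresses the preference; the img element is the fallback for clients that support neither modern format.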

The "documents as web assets" framing is the useful mental shift here. A first-page thumbnail for a document-listing UI, a WebP preview embedded in a knowledge base, a cropped figure pulled from a report and served responsively: these are all just images now, and they belong in the same optimization pipeline you already run for photographs. Treating a PDF page as a web asset rather than a special document type is exactly how you keep one pipeline instead of two.

The reverse direction comes up just as often in these pipelines: bundling a set of WebP images (screenshots, scanned pages, generated assets) back into a single document. Mochify's WebP to PDF tool handles that side of the workflow.

Why PDF operations belong in agent workflows

PDF extract/split/convert operations show up in agent pipelines because documents are rarely the final destination; they are an input to be triaged, summarised, routed, or indexed. The operations are the pre-processing that makes the rest of the pipeline possible.

Realistic examples we see:

Document-processing agents that render each page to an image so a vision model can read scanned or image-heavy pages that have no extractable text layer.
Preview and thumbnail generation, where an agent renders the first page to a small WebP for a listing card, then routes the document based on what it sees.
Batch contract and report handling, where a large file is split into per-page or per-section PDFs so individual pages can be sent to different downstream steps (signature, redaction, summarisation).
Ingestion for search and retrieval, where pages are sliced into chunks and stored alongside thumbnails or cropped figures next to their embeddings.

This is increasingly true on local hardware too. As on-device AI agents move onto desktop-class machines, the same discipline matters even more, because a local model's context window is tight and a folder of rendered pages will blow through it fast.

In all of these, the agent is orchestrating tools, not doing the pixel work itself. Which leads to the pattern that actually matters for cost and reliability.

The paths-not-bytes pattern (and why it saves tokens)

The established pattern for a PDF tool in an agent workflow is to do the heavy work, write the results to disk or object storage, and return file paths and lightweight metadata into the model context - not the raw image or document bytes. There are two solid reasons, one architectural and one financial.

Architecturally, this is exactly how the Model Context Protocol models data. The MCP resources specification exposes data via URIs with MIME types, leaving the client to decide when to actually read the bytes. A tool that hands back file:///out/contract-p3.webp plus its format, dimensions, and page index fits this model cleanly; a tool that dumps a base64 blob into every response does not.

Financially, images are expensive in context because they are billed as tokens. OpenAI's pricing documentation works through an example where a single 512×512 image costs on the order of 210 tokens, and Azure's vision pricing breaks a comparable input down as 170 + 85 image tokens on top of the text. Those numbers are small per image and brutal at batch scale: render 200 pages and pipe each one through the model context and you are paying token rent on every pixel, repeatedly, for data the model often does not need to "see" at all. Return a path instead, and the bytes enter the context only on the rare call where the model genuinely must look.
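The batch arithmetic is worth running once. This sketch uses only the two per-image figures quoted above (roughly 210 tokens, and the 170 + 85 breakdown); the five-pass case is our assumption, standing in for an agent loop that re-sends the same context on each turn. Real costs depend on the model and its image detail settings.

```python
def batch_image_tokens(pages: int, tokens_per_image: int, passes: int = 1) -> int:
    """Total image tokens when every page image enters context `passes` times."""
    return pages * tokens_per_image * passes


PAGES = 200
FIGURES = (
    ("~210 per image", 210),
    ("170 + 85 per image", 170 + 85),
)

for label, per_image in FIGURES:
    once = batch_image_tokens(PAGES, per_image)
    five = batch_image_tokens(PAGES, per_image, passes=5)
    print(f"{label}: {once:,} tokens for one pass, {five:,} for five passes")
```

For 200 pages that comes to 42,000 or 51,000 image tokens per pass depending on which figure you use, and five times that if the images ride along through five turns.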

The takeaway: design your PDF tools to return URIs and metadata by default. Let the host application or a downstream tool fetch the actual file when, and only when, it is needed.
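A tool result in this style might look like the sketch below. The field names are hypothetical and are not taken from the MCP schema or from Mochify's actual responses; the point is only that the result carries a URI and a few small fields, never the file contents.

```python
import json
from dataclasses import asdict, dataclass
from urllib.parse import quote


@dataclass
class PageResult:
    uri: str
    mime_type: str
    page_number: int  # 1-based, as a person would say it
    width: int
    height: int


def page_result(path: str, mime_type: str, page_number: int,
                width: int, height: int) -> PageResult:
    """Describe a rendered page by location and metadata instead of by bytes."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute POSIX path")
    return PageResult("file://" + quote(path), mime_type, page_number, width, height)


result = page_result("/out/contract-p3.webp", "image/webp", 3, 1275, 1650)
print(json.dumps(asdict(result)))
```

Serialised, a record like this is on the order of a hundred characters regardless of how large the image is, and a downstream step can still open the URI when it genuinely needs the pixels.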

Mochify Workflow: extract and split a PDF inside an agent pipeline

Here is how this looks with Mochify's PDF utility, which handles two of the operations above (page-level image extraction and precision splitting) and is built to be driven by an agent. Mochify treats documents like web assets: the same engine that gets your images web-ready handles PDF pages, and you talk to it in plain English through Magic Flow. A worked example: an agent ingesting a folder of multi-page contracts, producing a WebP preview of each cover page and splitting the rest into single-page PDFs for routing.

1. Describe the task in natural language
Magic Flow is the interface, so the agent does not assemble flags. It sends something like "extract page 1 of each PDF as a WebP thumbnail, and split the remaining pages into single-page PDFs." Magic Flow runs a two-step pipeline: a language model parses the prompt into concrete parameters, then Mochify's C++ engine executes the extraction and the split.

2. The request hits the PDF endpoint
PDF work is served by a dedicated endpoint, POST /v1/pdf on api.mochify.app. The PDF utility is available wherever you reach Mochify as a developer: the web app, the CLI, both MCP servers, and the REST API.

3. Run it through the surface that suits the agent
For an autonomous pipeline, the local MCP server (mochify serve, the same Rust binary as the mochify CLI) is the one to reach for. It returns file paths and metadata, so the page images and split PDFs never enter the agent's context window, only their paths do. That is the paths-not-bytes pattern, enforced by the surface.

4. Hand the outputs downstream
The agent now has, say, cover-p1.webp plus per-page PDFs and their metadata. It routes, summarises, or attaches them, fetching actual bytes only if a step truly needs them.

Because MCP and API access are included on every tier, including Free, you can wire this into an agent without a paid plan to start. Point your agent at Mochify and you are running.

Privacy and compliance for document pipelines

Where PDF processing happens is a compliance question, not just an engineering one, because the documents agents handle (contracts, reports, anything under NDA or regulation) frequently contain personal data. The governing principle is data minimization.

Under GDPR, Article 5(1)(c) requires that personal data be "adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed." The UK ICO's guidance on data minimization puts it plainly: identify "the minimum amount of personal data you need to fulfil your purpose," hold that much "but no more," and review and delete what you no longer need. Holding more than necessary is likely to be unlawful.

Applied to a PDF pipeline, that principle pushes you toward two things: send the minimum, and retain nothing you do not have to. Every time a document goes to a third-party cloud service, you are transferring personal data to a processor, which carries contractual and security obligations and, often, retention you did not ask for. Where Mochify is that processor, the data-processing terms are set out in our Data Processing Agreement. A service that processes in memory and keeps nothing after the operation is materially lower-risk than one that stores your documents for caching, analytics, or re-use. This is why the zero-retention model matters for the kind of work agents do: the original is gone the instant encoding finishes, so there is no document sitting in someone's bucket to be breached, subpoenaed, or quietly mined later.

A clarification, because it is easy to overstate: zero-retention is not the same as on-device processing. For images and PDFs, the bytes do travel to the API for encoding; what makes it defensible is that they are not kept. Be precise about that distinction in your own compliance documentation. If you want the deeper treatment, our privacy and image optimization guide and our GDPR-focused TinyPNG alternative comparison both go further.

Quality and correctness gotchas

Most PDF pipeline bugs come from a short list of predictable mistakes. Here they are, so you can check for them before they ship.

Default low DPI. Rendering at the ~72 DPI default produces blurry previews and unreadable print output. Set 150–300 DPI deliberately, and remember upscaling afterwards cannot recover lost detail.
Confusing extract with render. Embedded image extraction will miss text-only pages and vector charts; page rendering is what you want for previews and vision-model inputs. Pick the operation that matches the goal.
Off-by-one splits. 1-based CLIs versus 0-based libraries cause silently shifted page ranges. Normalise the convention at your pipeline boundary.
Lost structure on split. Annotations, bookmarks, structure trees, and forms can be dropped when splitting. If accessibility tags or interactive forms matter, verify they survived.
Scanned PDFs are images, not text. A document that is really a stack of scanned pages has no text layer to extract. You render pages to images and OCR them; do not expect text extraction to work on a photocopy.
Color, transparency, and thin lines. High DPI alone does not guarantee quality. Anti-aliasing and color-space handling decide whether small text and fine lines render cleanly.

PDF operations cheat sheet

Operation	What it does	Sensible default	Watch out for
Extract embedded image	Pulls an original image object out of the file at native quality	Use when you want the placed photo/figure as-is	Misses text-only pages, vector charts, full-page scans
Render page to image	Rasterises the whole page (text, vectors, images)	150 DPI on-screen, 300 DPI for print	~72 DPI default is blurry; no recovery by upscaling
Split by page span	Breaks a multi-page PDF into single pages	1 page per file for routing	Off-by-one (1-based vs 0-based); structure loss
Split by bookmark	Cuts into logical sections (chapters)	When you need section-level files	Tool may only support shallow bookmark depth
Convert page to WebP/AVIF	Compresses a rendered page for web delivery	WebP default; AVIF where tooling is solid	Keep a JPEG fallback via <picture>
Return to an agent	Hand results back as paths + metadata	Paths-not-bytes; fetch on demand	Inline base64 bloats context and token cost

Mochify's PDF utility covers the extract and split operations here (PDF pages out as PNG, JPEG, or WebP; multi-page files split into single-page PDFs), driven by Magic Flow prompts like "extract page 3 as WebP" or "split this into pages." Try it at mochify.app.

Frequently asked questions

How do I extract images from a PDF in an automated pipeline?

Decide first whether you want the embedded image object (use an extractor like pdfimages or PyMuPDF's extract_image) or a render of the whole page (use a renderer at 150–300 DPI). Agent pipelines usually want the page render, because they are producing previews and model inputs rather than harvesting original artwork. With Mochify, you describe it in plain English ("extract page 3 as WebP") and the PDF utility handles it via POST /v1/pdf.

What DPI should I use when converting PDF pages to images?

Around 150 DPI for legible on-screen images, 300 DPI when output may be printed, and 600 DPI only for fine line art or archival work. The common ~72 DPI default is suitable for tiny thumbnails only. Set the resolution at render time, because upscaling a low-DPI image afterwards cannot recover detail.

Does splitting a PDF lose bookmarks or form fields?

Often, yes. Tools such as pdfcpu state outright that annotations, outlines, structure trees, and forms are not carried into split output. For routing and summarisation this is fine; if you depend on interactive forms or accessibility tags, treat the split as lossy and verify what survived.

Why return file paths instead of the image bytes to my agent?

Images are billed as tokens (a 512×512 image is roughly 210 tokens in OpenAI's worked example), so piping every page through the model context is expensive at batch scale, and the MCP resources spec is built around returning URIs and metadata rather than blobs. Returning paths keeps cost down and lets the bytes enter context only when a step genuinely needs them. Mochify's local MCP server returns paths and metadata by default.

Is Mochify's PDF utility available to AI agents, or only in the web app?

It is available across the web app, the CLI, both MCP servers, and the REST API, all served by POST /v1/pdf. (Video is the one capability that is web-app-only; the PDF utility is not.) For agent work, the local MCP server is the surface to lead with because it returns paths, not bytes.

Is PDF processing on Mochify private? Does the file stay on my machine?

The file does not stay on your machine. PDFs travel to api.mochify.app over HTTPS exactly like images, are processed in RAM, and are wiped immediately with no source disk writes and no logs containing file data. That is zero-retention, which is a stronger and more honest claim than "never leaves your device." Local CLI and local MCP paths are zero-retention end-to-end; the hosted MCP holds only the output behind a five-minute download URL.

Can I use the PDF utility on the free tier?

Yes. MCP and API access are included on every tier, including Free, which allows 25 operations per month (or 3 per session with no signup) and a 20MB file-size limit. Seller and Pro raise the file-size limit to 75MB and batch size to 25 files.

WebP or AVIF for PDF page previews?

Default to WebP: it is widely supported and typically 25–34% smaller than JPEG at similar quality. Use AVIF where your tooling and audience support it for even smaller files, and keep a JPEG fallback in a <picture> element for older clients.

Wiring PDFs into an agent?

Point it at Mochify's PDF utility, describe the job in plain English, and get back file paths and metadata - the page images and split PDFs never touch your model context.

Try it free at mochify.app →