PDF Content Extraction Tools Compared: From Plain Text to Structured Parsing

By Nite, in Software Recommendations

2025-11-04

PDF content extraction usually goes in two directions.

One direction is plain text only, as fast as possible. pdftotext is a good fit here, but charts, formulas, images, and similar non-text content are lost or only partly preserved.

The other direction is output that humans can read, ideally with headings, tables, image descriptions, and reading order preserved. I used to look at Docling first. After trying MinerU recently, I feel it is more reliable on some of my PDFs. This is just my personal experience, not a benchmark result.

Tool overview

pdftotext

pdftotext is a command-line tool from the Poppler toolkit. It does one narrow thing: extract text from a PDF quickly, without caring much about layout.

Its strengths are straightforward:

Fast processing
Plain text output that is easy to post-process
Low installation and runtime overhead
Suitable for batch jobs

Docling

Docling is an open-source document parsing tool from IBM. It can convert PDF, DOCX, PPTX, XLSX, and other document formats into structured outputs such as Markdown and JSON.

It is much heavier than pdftotext, but the output is also more complete. Heading levels, tables, images, formulas, and reading order are preserved where possible. If you are building document previews, knowledge base pages, or anything that needs downstream structured processing, Docling is a better fit than a plain text extractor.

Docling’s resource usage depends on which features are enabled. Basic parsing is not too demanding. If formula enrichment or chart extraction is enabled, VRAM usage can increase a lot.

MinerU

MinerU is an open-source document parsing tool from OpenDataLab. It converts PDFs, images, Office documents, and web pages into Markdown / JSON, which can then be used in LLM / RAG-style workflows.

MinerU provides a free web version for trying documents online. It can also be self-hosted from the GitHub repository, and the official docs cover Docker deployment. To avoid Python and pip package compatibility issues, I run it in a container, using an image I built from the official Dockerfile:

podman pull docker.io/nite07/mineru:latest

Repo: https://git.nite07.com/nite/custom-containerfile

I have been leaning toward MinerU recently for a few reasons.

First, on my documents, the recognition quality feels better than Docling's.

Second, it handles content that looks like tables but has no borders pretty well. Many PDF tables are really just whitespace and alignment. Some parsers turn them into scattered text. MinerU often turns this kind of content into readable Markdown tables, which is useful for both reading and later cleanup.

Third, it can recognize image content and add a summary to the output. For images, flowcharts, diagrams, and screenshots, it does more than just save image files. Some output can look close to this:

<details>
<summary>Flowchart</summary>

```mermaid
graph LR
    A["Original image X"] --> B["Encoder"]
    B --> C["Decoder"]
    C --> D["Decompressed (or reconstructed) image Ŷ"]
    style B fill:#ff0000,stroke:#333
    style C fill:#ff0000,stroke:#333
    note1["Compressed bitstream Y"] --> B
    note2["x̂ = x, x̂ ≠ x"] --> C
```

</details>

This kind of output works well in Markdown. Images and charts are not plain text in the first place, so having a readable summary first is already useful. Manual cleanup can come later.

I also wrote a separate post, MinerU on Low VRAM, about hybrid-auto-engine running out of memory on an 8GB GPU while vlm-auto-engine worked. If you plan to deploy MinerU locally, that post is worth checking first.

Usage comparison

Plain text extraction: pdftotext

If the task only needs body text, pdftotext is convenient.

Its advantage is not document understanding. It is speed. Plain text output also uses fewer tokens and is easy to clean, chunk, and ingest. The trade-off is clear: charts, formulas, image descriptions, and table structure will be lost or distorted.

# Quickly extract PDF text
pdftotext document.pdf - | head -n 50

# Batch process a document folder (pdftotext does not create the output directory)
mkdir -p text_output
for file in documents/*.pdf; do
    pdftotext "$file" "text_output/$(basename "$file" .pdf).txt"
done
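
To illustrate the "easy to clean, chunk, and ingest" point, here is a minimal Python sketch that tidies pdftotext-style output and packs paragraphs into size-limited chunks. The function names, the 1000-character default, and the cleanup rules are my own assumptions for illustration, not part of pdftotext or any particular RAG library.

```python
import re


def clean_text(raw: str) -> str:
    """Normalize pdftotext-style plain text before chunking."""
    text = raw.replace("\f", "\n\n")          # form feeds mark page breaks
    # Rejoin words split by a hyphen at a line end. This is a heuristic:
    # it can also merge genuine hyphenated compounds that wrap across lines.
    text = re.sub(r"-\n(?=[a-z])", "", text)
    text = re.sub(r"[ \t]+", " ", text)       # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)      # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)    # keep at most one blank line
    return text.strip()


def chunk_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.replace("\n", " ").strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

In practice you would read one of the `.txt` files produced by the loop above, pass it through `clean_text`, and feed the result of `chunk_paragraphs` to whatever indexer you use.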

pdftotext fits tasks like these:

Quickly previewing PDF text
Batch extracting body text
Building a rough full-text search index
Pre-checking documents before a richer parsing pipeline

For serious RAG, especially with technical documents, papers, or financial reports, plain text extraction is usually not enough. Charts, formulas, and tables are part of the source material. They should not be dropped by default.
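
The "pre-checking" use case from the list above can be sketched as a small heuristic: run pdftotext first, then look at how much text each page actually produced. Pages with almost no text are often scans or figure-heavy pages, which is a hint to send the file to Docling or MinerU instead. The function names and both thresholds below are hypothetical values I picked for illustration; tune them on your own documents.

```python
def page_char_counts(pdftotext_output: str) -> list[int]:
    """Count non-whitespace characters per page.

    pdftotext separates pages with a form feed ("\f") unless -nopgbrk is used.
    """
    pages = pdftotext_output.split("\f")
    if pages and not pages[-1].strip():
        pages.pop()  # drop the empty piece after a trailing form feed
    return [sum(1 for ch in page if not ch.isspace()) for page in pages]


def needs_rich_parsing(
    pdftotext_output: str,
    min_chars_per_page: int = 200,   # assumed threshold, not a standard value
    max_sparse_ratio: float = 0.3,   # assumed threshold, not a standard value
) -> bool:
    """Return True when too many pages yield little or no extractable text."""
    counts = page_char_counts(pdftotext_output)
    if not counts:
        return True  # no text at all, for example a fully scanned PDF
    sparse = sum(1 for count in counts if count < min_chars_per_page)
    return sparse / len(counts) > max_sparse_ratio
```

This only measures text volume. A page can contain plenty of text and still carry an important table or formula, so treat a `False` result as "plain text is probably usable", not as proof that nothing was lost.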

Structured parsing: Docling and MinerU

If the output is meant for humans, or if you need to preserve structure, Docling and MinerU are closer to that requirement.

Docling’s Python API is easy to use:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()

result = converter.convert("report.pdf")
markdown_content = result.document.export_to_markdown()

with open("report.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)

It can also run directly from the CLI. The default output is Markdown. You can use --to to choose JSON, HTML, and other formats, and --output to set the output directory:

# Convert to Markdown by default
docling report.pdf --output output

# Output JSON instead
docling report.pdf --to json --output output

The VLM pipeline can also be selected from the CLI:

docling report.pdf --pipeline vlm --output output

MinerU’s CLI is also direct:

mineru -p input.pdf -o output

If you use the container image, you can start the WebUI like this:

podman run --rm \
    --device nvidia.com/gpu=all \
    -p 7860:7860 \
    -it docker.io/nite07/mineru:latest \
    mineru-gradio --server-name 0.0.0.0 --server-port 7860

If VRAM is tight, you can refer to the low-VRAM post mentioned above and switch the backend to vlm-auto-engine or pipeline. On my machine, MinerU’s pure vlm-auto-engine mode is easier to run than Docling with formula and chart recognition enabled.

mineru -p input.pdf -o output -b vlm-auto-engine
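
If you script this, one option is to try backends in order and fall back to a lighter one when a run fails. The sketch below is a rough illustration under a few assumptions: the backend names are the ones mentioned in this post, the heaviest-first ordering is my own guess, and it assumes `mineru` exits with a non-zero status when a run fails (for example on an out-of-memory error). The helper names are hypothetical.

```python
import subprocess
from typing import Callable, Optional, Sequence

# Backend names as mentioned in this post; the ordering is an assumption.
BACKENDS = ("hybrid-auto-engine", "vlm-auto-engine", "pipeline")


def build_mineru_command(pdf_path: str, output_dir: str, backend: str) -> list[str]:
    """Build the same command line as `mineru -p ... -o ... -b ...`."""
    return ["mineru", "-p", pdf_path, "-o", output_dir, "-b", backend]


def run_command(cmd: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0.

    Raises FileNotFoundError if the executable is not installed.
    """
    return subprocess.run(list(cmd)).returncode == 0


def parse_with_fallback(
    pdf_path: str,
    output_dir: str,
    backends: Sequence[str] = BACKENDS,
    runner: Callable[[Sequence[str]], bool] = run_command,
) -> Optional[str]:
    """Try each backend in order; return the first one that succeeds, else None."""
    for backend in backends:
        if runner(build_mineru_command(pdf_path, output_dir, backend)):
            return backend
    return None
```

Calling `parse_with_fallback("input.pdf", "output")` returns the name of the backend that worked, which is handy to log so you know which mode produced a given output. The `runner` parameter exists so the logic can be exercised without a GPU or a MinerU install.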

Comparison table

| Dimension | pdftotext | Docling | MinerU |
| --- | --- | --- | --- |
| Processing speed | Fast | Slower | Slower |
| Resource usage | Low | Depends on enabled features; formula and chart recognition can increase VRAM usage | Depends on backend |
| Output format | Plain text | Markdown / JSON, etc. | Markdown / JSON, etc. |
| Structure preservation | Almost none | Good | Good |
| Table handling | Weak | Good | Good, with a nice experience on borderless table-like content |
| Image handling | Not supported | Supports extraction; extra capability depends on configuration | Supports extraction and content summaries |
| OCR | Not supported | Supported | Supported |
| Deployment | System package or command-line tool | pip / uv / source, etc. | pip / uv / source / Docker, etc. |
| Online version | None | Mainly local use | Official web version available |

These tools are not direct replacements for each other. pdftotext is good for quickly getting plain text. Docling and MinerU are better suited for document structure. The final result still depends on your own PDFs. The same tool can behave very differently on papers, scanned files, reports, and slide decks.