PDF Content Extraction Tools Compared: From Plain Text to Structured Parsing
PDF content extraction usually goes in two directions.
One direction is plain text only, as fast as possible. pdftotext is a good fit here, but it extracts only the text layer: charts and images are dropped, and formulas and tables usually come out flattened or scrambled.
The other direction is output that humans can read, ideally with headings, tables, image descriptions, and reading order preserved. I used to look at Docling first. After trying MinerU recently, I feel it is more reliable on some of my PDFs. This is just my personal experience, not a benchmark result.
Tool overview
pdftotext
pdftotext is a command-line tool from the Poppler toolkit. It does one narrow thing: extract text from a PDF quickly, without caring much about layout.
Its strengths are straightforward:
- Fast processing
- Plain text output that is easy to post-process
- Low installation and runtime overhead
- Suitable for batch jobs
Docling
Docling is an open-source document parsing tool from IBM. It can convert PDF, DOCX, PPTX, XLSX, and other document formats into structured outputs such as Markdown and JSON.
It is much heavier than pdftotext, but the output is also more complete. Heading levels, tables, images, formulas, and reading order can be preserved as much as possible. If you are building document previews, knowledge base pages, or anything that needs downstream structured processing, Docling is more comfortable than a plain text extractor.
Docling’s resource usage depends on which features are enabled. Basic parsing is not too demanding. If formula enrichment or chart extraction is enabled, VRAM usage can increase a lot.
MinerU
MinerU is an open-source document parsing tool from OpenDataLab. It converts PDFs, images, Office documents, and web pages into Markdown / JSON, which can then be used in LLM / RAG-style workflows.
MinerU provides a free web version for trying documents online. It can also be self-hosted from the GitHub repository. The official docs also include Docker deployment. To avoid Python and pip package compatibility issues, I run it in a container and built an image based on the official Dockerfile:
podman pull docker.io/nite07/mineru:latest
Repo: https://git.nite07.com/nite/custom-containerfile
I have been leaning toward MinerU recently for a few reasons.
First, on my documents, its recognition quality feels better than Docling's.
Second, it handles content that looks like tables but has no borders pretty well. Many PDF tables are really just whitespace and alignment. Some parsers turn them into scattered text. MinerU often turns this kind of content into readable Markdown tables, which is useful for both reading and later cleanup.
Third, it can recognize image content and add a summary to the output. For images, flowcharts, diagrams, and screenshots, it does more than just save image files. Some output can look close to this:
<details>
<summary>Flowchart</summary>
```mermaid
graph LR
A["Original image X"] --> B["Encoder"]
B --> C["Decoder"]
C --> D["Decompressed (or reconstructed) image Ŷ"]
style B fill:#ff0000,stroke:#333
style C fill:#ff0000,stroke:#333
note1["Compressed bitstream Y"] --> B
note2["x̂ = x, x̂ ≠ x"] --> C
```
</details>
This kind of output works well in Markdown. Images and charts are not plain text in the first place, so having a readable summary first is already useful. Manual cleanup can come later.
I also wrote a separate post, MinerU on Low VRAM, about hybrid-auto-engine running out of memory on an 8GB GPU while vlm-auto-engine worked. If you plan to deploy MinerU locally, that post is worth checking first.
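The fallback described above (a heavier backend runs out of memory, a lighter one works) can be sketched as a small Python wrapper around the MinerU CLI. This is a minimal sketch under stated assumptions, not MinerU's own behavior: the order in `BACKENDS`, the helper names `build_mineru_command` and `parse_with_fallback`, and the idea that a failed run exits with a non-zero code are all illustrative assumptions. It uses only the `-p`, `-o`, and `-b` flags.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Optional

# Assumed order: try the heavier backend first, then lighter ones.
BACKENDS = ["hybrid-auto-engine", "vlm-auto-engine", "pipeline"]


def build_mineru_command(pdf: Path, out_dir: Path, backend: str) -> List[str]:
    """Build a MinerU CLI invocation with an explicit backend."""
    return ["mineru", "-p", str(pdf), "-o", str(out_dir), "-b", backend]


def parse_with_fallback(
    pdf: Path,
    out_dir: Path,
    run: Callable = subprocess.run,
) -> Optional[str]:
    """Try each backend in order and return the first one that exits with 0.

    Assumption: a failed run (for example, out of GPU memory) is reported
    through a non-zero exit code. Returns None if every backend fails.
    """
    for backend in BACKENDS:
        result = run(build_mineru_command(pdf, out_dir, backend))
        if result.returncode == 0:
            return backend
    return None
```

Calling `parse_with_fallback(Path("input.pdf"), Path("output"))` runs the real CLI and raises `FileNotFoundError` if `mineru` is not on the PATH. The `run` parameter exists only so the loop can be exercised without a GPU.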
Usage comparison
Plain text extraction: pdftotext
If the task only needs body text, pdftotext is convenient.
Its advantage is not document understanding. It is speed. Plain text output also uses fewer tokens and is easy to clean, chunk, and ingest. The trade-off is clear: charts, formulas, image descriptions, and table structure will be lost or distorted.
# Quickly extract PDF text
pdftotext document.pdf - | head -n 50
# Batch process a document folder
# Create the output directory first; pdftotext does not create it
mkdir -p text_output
for file in documents/*.pdf; do
    pdftotext "$file" "text_output/$(basename "$file" .pdf).txt"
done
It fits tasks like these:
- Quickly previewing PDF text
- Batch extracting body text
- Building a rough full-text search index
- Pre-checking documents before a richer parsing pipeline
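The pre-check idea in the last item can be sketched in Python: extract text with pdftotext, split it into pages, and flag documents where many pages contain almost no text (often scans, or pages dominated by figures). This is a rough sketch, not a tested heuristic. The function names and the thresholds `min_chars` and `max_sparse_ratio` are assumptions chosen for illustration. It relies on pdftotext's default behavior of writing a form feed character at each page break.

```python
import subprocess
from typing import List


def split_pages(text: str) -> List[str]:
    """Split pdftotext output into pages on the form feed character."""
    pages = text.split("\f")
    # The output normally ends with a form feed, which leaves an empty tail.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


def extract_pages(pdf_path: str) -> List[str]:
    """Run pdftotext and return one string per page (needs Poppler installed)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return split_pages(result.stdout)


def needs_rich_parsing(
    pages: List[str],
    min_chars: int = 200,           # assumed: fewer characters means "sparse"
    max_sparse_ratio: float = 0.3,  # assumed: tolerated share of sparse pages
) -> bool:
    """Return True when too many pages carry little or no extractable text."""
    if not pages:
        return True
    sparse = sum(1 for page in pages if len(page.strip()) < min_chars)
    return sparse / len(pages) > max_sparse_ratio
```

A document that passes this check can stay on the fast plain-text path. One that fails is a candidate for Docling or MinerU. The thresholds should be tuned on your own documents.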
For serious RAG, especially with technical documents, papers, or financial reports, plain text extraction is usually not enough. Charts, formulas, and tables are part of the source material. They should not be dropped by default.
Structured parsing: Docling and MinerU
If the output is meant for humans, or if you need to preserve structure, Docling and MinerU are closer to that requirement.
Docling’s Python API is easy to use:
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("report.pdf")
markdown_content = result.document.export_to_markdown()
with open("report.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)
It can also run directly from the CLI. The default output is Markdown. You can use --to to choose JSON, HTML, and other formats, and --output to set the output directory:
# Convert to Markdown by default
docling report.pdf --output output
# Output JSON instead
docling report.pdf --to json --output output
The VLM pipeline can also be selected from the CLI:
docling report.pdf --pipeline vlm --output output
MinerU’s CLI is also direct:
mineru -p input.pdf -o output
If you use the container image, you can start the WebUI like this:
podman run --rm \
--device nvidia.com/gpu=all \
-p 7860:7860 \
-it mineru:latest \
mineru-gradio --server-name 0.0.0.0 --server-port 7860
If VRAM is tight, you can refer to the low-VRAM post mentioned above and switch the backend to vlm-auto-engine or pipeline. On my machine, MinerU’s pure vlm-auto-engine mode is easier to run than Docling with formula and chart recognition enabled.
mineru -p input.pdf -o output -b vlm-auto-engine
Comparison table
| Dimension | pdftotext | Docling | MinerU |
|---|---|---|---|
| Processing speed | Fast | Slower | Slower |
| Resource usage | Low | Depends on enabled features; formula and chart recognition can increase VRAM usage | Depends on backend |
| Output format | Plain text | Markdown / JSON, etc. | Markdown / JSON, etc. |
| Structure preservation | Almost none | Good | Good |
| Table handling | Weak | Good | Good, with a nice experience on borderless table-like content |
| Image handling | Not supported | Supports extraction; extra capability depends on configuration | Supports extraction and content summaries |
| OCR | Not supported | Supported | Supported |
| Deployment | System package or command-line tool | pip / uv / source, etc. | pip / uv / source / Docker, etc. |
| Online version | None | Mainly local use | Official web version available |
These tools are not direct replacements for each other. pdftotext is good for quickly getting plain text. Docling and MinerU are better suited for document structure. The final result still depends on your own PDFs. The same tool can behave very differently on papers, scanned files, reports, and slide decks.
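Since results depend so much on the input, one way to compare tools on your own PDFs is to count how much structure survives in each Markdown output. The sketch below is a rough illustration, not a quality metric: the `MarkdownStats` fields and the `markdown_stats` helper are hypothetical names. It only recognizes ATX headings, pipe-style table rows, and `![...]` image references, so tables emitted as HTML `<table>` markup would need a separate check.

```python
import re
from dataclasses import dataclass

FENCE = "`" * 3  # Markdown code fence marker
HEADING = re.compile(r"^#{1,6}\s")


@dataclass
class MarkdownStats:
    headings: int = 0
    table_rows: int = 0
    images: int = 0
    code_blocks: int = 0


def markdown_stats(markdown: str) -> MarkdownStats:
    """Count simple structural elements in a Markdown string."""
    stats = MarkdownStats()
    in_fence = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            # Count each block once, on its opening fence.
            if not in_fence:
                stats.code_blocks += 1
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # ignore everything inside code blocks
        if HEADING.match(stripped):
            stats.headings += 1
        elif stripped.startswith("|"):
            # Skip separator rows such as |---|:---:|
            if set(stripped) <= set("|-: "):
                continue
            stats.table_rows += 1
        stats.images += stripped.count("![")
    return stats
```

Running this on the Docling and MinerU outputs for the same PDF gives a quick first signal. If one output has no table rows where the source clearly has a table, that file is worth opening for a manual look.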