MinerU on Low VRAM: When Hybrid OOMs but VLM Works Fine
I recently needed to process some PDFs locally with MinerU and ran into a series of issues on my 8GB GPU. The most counter-intuitive finding: the supposedly “lighter” hybrid mode ran out of memory, while the seemingly “heavier” vlm mode worked fine. After digging through the source code, I figured out why. Here’s the rundown for anyone running MinerU on low-VRAM hardware.
First hurdle: the default backend OOMs right away
MinerU 3.x defaults to the hybrid-auto-engine backend. After installing, I ran:
mineru -p input.pdf -o output
It crashed almost immediately:
CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total
capacity of 8 GiB of which 65.56 MiB is free.
With 8GB of VRAM, only 65MB was free - not even enough for a 20MB allocation.
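Before blaming MinerU, it helps to confirm how much VRAM is actually free when the process starts. The snippet below is a minimal sketch that shells out to `nvidia-smi` with its CSV query flags and parses the result. The function names are my own, and it assumes every field comes back as a plain integer in MiB (which is what `nounits` normally gives on a single consumer GPU).

```python
import subprocess

# nvidia-smi query that prints one "total, used, free" row per GPU, in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.total,memory.used,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse nvidia-smi CSV rows into a list of dicts (values in MiB)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Assumption: all three fields are integers (no "[N/A]" entries).
        total, used, free = (int(field.strip()) for field in line.split(","))
        gpus.append({"total_mib": total, "used_mib": used, "free_mib": free})
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return the parsed memory figures for each GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `query_gpu_memory()` right before launching `mineru` shows whether another process (a browser, a desktop compositor, a leftover Python session) is already holding part of the 8GB.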
Tweaking parameters didn’t help
The documentation lists several environment variables for controlling memory usage:
MINERU_HYBRID_BATCH_RATIO=1 \
MINERU_API_MAX_CONCURRENT_REQUESTS=1 \
MINERU_PROCESSING_WINDOW_SIZE=1 \
mineru -p input.pdf -o output
In particular, MINERU_HYBRID_BATCH_RATIO has a reference table suggesting values of 8 or lower for GPUs under 6GB. My 8GB auto-detected value would be 2 (the get_batch_ratio function in the source sets it to 1 for <8GB, 2 for ≥8GB). I tried every combination anyway.
No luck - still OOM. The reason: batch_ratio only controls the batch size for small model inference, not the number of models loaded into VRAM.
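To make that concrete, here is a rough sketch of the selection rule as described above, plus a toy memory model. This is not MinerU's actual `get_batch_ratio` code: the `_sketch` name, the assumption that the environment variable takes precedence, and the `estimate_peak_gb` helper are all mine.

```python
import os


def get_batch_ratio_sketch(total_vram_gb, env=None):
    """Pick a batch ratio: 1 below 8GB, 2 at 8GB or above.

    Assumption: an explicit MINERU_HYBRID_BATCH_RATIO overrides auto-detection.
    """
    env = os.environ if env is None else env
    override = env.get("MINERU_HYBRID_BATCH_RATIO")
    if override:
        return int(override)
    return 1 if total_vram_gb < 8 else 2


def estimate_peak_gb(weights_gb, activation_gb_per_unit, batch_ratio):
    """Toy model: resident weights are a fixed cost, only activations scale."""
    return weights_gb + activation_gb_per_unit * batch_ratio
```

In this toy model, lowering `batch_ratio` only shrinks the second term. If the resident weights alone are already close to the card's capacity, no batch setting brings the total back under the limit.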
The counter-intuitive discovery: vlm mode actually works
On a whim, I switched to vlm-auto-engine:
mineru -p input.pdf -o output -b vlm-auto-engine
It ran through. Slower than hybrid, but no crashes.
This seemed wrong - vlm mode runs every page through a large model for inference. Shouldn’t that use more VRAM?
Source code reveals the answer
Looking at MinerU’s model_init.py, the two modes load completely different sets of models:
hybrid-auto-engine (HybridModelSingleton):
| Model | Purpose | Always in VRAM |
|---|---|---|
| PPDocLayoutV2 | Layout detection | Yes |
| PytorchPaddleOCR | OCR text recognition | Yes |
| MFR (UniMerNet) | Math formula recognition | Yes (if enabled) |
| Table recognition models | Table structure parsing | Yes |
| VLM (~2B params) | Complex content | Yes |
4–5 models simultaneously resident in VRAM. 8GB simply can’t hold them all.
vlm-auto-engine:
| Model | Purpose | Always in VRAM |
|---|---|---|
| VLM (~2B params) | Everything in one pass | Yes |
Just 1 model loaded, roughly 4–5GB VRAM. Fits within 8GB.
The conclusion is clear: while each small model in hybrid mode is individually lightweight, their combined VRAM footprint exceeds that of a single VLM model. On low-VRAM hardware, vlm mode is actually the better choice.
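The arithmetic behind that conclusion can be written down in a few lines. This is only a back-of-the-envelope sketch: the 4–5GB VLM figure is the rough number quoted above, and the helper name and the `reserved_gb` margin are assumptions of mine, not anything MinerU reports.

```python
def remaining_budget_gb(total_gb, resident_gb, reserved_gb=0.0):
    """VRAM left after the listed resident models and a reserved margin."""
    return total_gb - sum(resident_gb) - reserved_gb


# With the VLM alone taking roughly 4-5GB of an 8GB card:
best_case = remaining_budget_gb(8.0, [4.0])   # 4.0 GB left
worst_case = remaining_budget_gb(8.0, [5.0])  # 3.0 GB left
```

In hybrid mode, that remaining 3–4GB has to hold the layout, OCR, formula, and table models plus all inference activations and CUDA overhead. In vlm mode the same headroom is available for activations alone.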
Another gotcha: http-client with third-party LLMs is incompatible
Since local VRAM was tight, I tried offloading VLM inference to a remote API:
mineru -p input.pdf -o output -b hybrid-http-client -u http://some-api:30000/v1
The logs spat out a WARNING:
WARNING | mineru_vl_utils.mineru_client:parse_layout_output:251 -
Layout output does not match expected format: ```json
[
{"bbox_2d": [145, 83, 853, 442], "label": "Table I"},
...
]
The LLM wrapped its response in a ```json code block, which MinerU couldn’t parse.
Inspecting the request MinerU sends to the LLM shows that the prompt is remarkably barebones.
MinerU expects the remote model to be its fine-tuned VLM (based on Qwen2-VL), which has been trained to output special tokens when it sees “Layout Detection:”:
<|box_start|>145 83 853 442<|box_end|><|ref_start|>Table I<|ref_end|>
A general-purpose LLM has no idea about these special tokens, so it naturally returns JSON wrapped in a markdown code block. The http-client mode is not a “plug in any OpenAI-compatible API” solution - you must use it with MinerU’s fine-tuned VLM model.
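To illustrate why the JSON reply fails, here is a minimal sketch of a parser for the token format shown above. It is not the parser from `mineru_vl_utils`; the regex and function name are my own, and it only covers the exact shape in that one example line (the real output may carry more fields).

```python
import re

# Matches: <|box_start|>x1 y1 x2 y2<|box_end|><|ref_start|>label<|ref_end|>
BOX_PATTERN = re.compile(
    r"<\|box_start\|>(\d+) (\d+) (\d+) (\d+)<\|box_end\|>"
    r"<\|ref_start\|>(.*?)<\|ref_end\|>"
)


def parse_layout_tokens(text):
    """Return a list of {"bbox", "label"} dicts for every token group found."""
    blocks = []
    for match in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(value) for value in match.groups()[:4])
        blocks.append({"bbox": [x1, y1, x2, y2], "label": match.group(5)})
    return blocks
```

Fed the special-token line, this returns one block with bbox `[145, 83, 853, 442]` and label `Table I`. Fed the fenced JSON from the warning, it returns an empty list: the coordinates are the same, but none of the delimiters the parser looks for are present.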
The working solution
For 8GB VRAM environments, just use vlm-auto-engine:
# Install
uv pip install -U "mineru[all]"
# Run
mineru -p input.pdf -o output -b vlm-auto-engine
If vlm also OOMs (edge cases), fall back to pure CPU mode:
mineru -p input.pdf -o output -b pipeline
Slower, but rock-solid.
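If you process PDFs in bulk, the two commands above can be chained into an automatic fallback. The wrapper below is a sketch under two assumptions I haven't verified for every MinerU version: that `mineru` exits with a non-zero status on OOM, and that the "out of memory" text ends up on stderr. The function names and the injectable `runner` parameter are my own.

```python
import subprocess


def build_command(pdf, out_dir, backend):
    """Assemble the same mineru invocation used in this post."""
    return ["mineru", "-p", pdf, "-o", out_dir, "-b", backend]


def run_with_fallback(pdf, out_dir,
                      backends=("vlm-auto-engine", "pipeline"),
                      runner=subprocess.run):
    """Try each backend in order; move on only when the failure looks like OOM."""
    for backend in backends:
        result = runner(build_command(pdf, out_dir, backend),
                        capture_output=True, text=True)
        if result.returncode == 0:
            return backend  # report which backend succeeded
        stderr = result.stderr or ""
        # Assumption: OOM failures mention "out of memory" on stderr.
        if "out of memory" not in stderr.lower():
            raise RuntimeError(f"{backend} failed for a non-OOM reason:\n{stderr}")
    raise RuntimeError("every backend ran out of memory")
```

`run_with_fallback("input.pdf", "output")` returns the name of the backend that finished, so a batch script can log which files needed the slower path.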