Why Fine-Tuning Object Detection for Documents Is Harder Than You Think

Jun 22, 2026

Authors

Unstructured

Object Detection (OD) has always been the quiet workhorse of document transformation.

Long before Vision-Language Models (VLMs), document pipelines depended on OD to segment pages into meaningful regions—paragraphs, tables, images, forms. That segmentation made everything else possible: OCR, structure extraction, and downstream reasoning.

Today, even with powerful VLMs that can summarize pages or extract complex tables end-to-end, OD still plays a critical role. Pure VLM approaches can struggle with precise reading order and fail to reliably preserve structure in complex tables, especially those with spanning cells or multi-column layouts.

Unstructured’s High Fidelity Transformation Workflow (HFTW = OD + VLM) addresses this by grounding the model in layout first.

We:

  • Detect regions explicitly (tables, headers, formula images, blocks).
  • Extract elements by type.
  • Route each region to specialized models with task-specific prompts.

This type-aware routing improves accuracy, reduces cross-section leakage, and keeps pipelines stable.
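As a rough illustration of type-aware routing, the sketch below dispatches each detected region to a handler chosen by its label. The `Region` class, the handler functions, and the label names are hypothetical stand-ins, not Unstructured's actual API; in a real pipeline each handler would call a specialized model or a VLM with a task-specific prompt.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page pixels


@dataclass(frozen=True)
class Region:
    label: str  # class predicted by the detector, e.g. "table"
    box: Box


# Hypothetical handlers: placeholders for specialized extraction models.
def handle_table(region: Region) -> str:
    return f"table-extractor:{region.box}"


def handle_formula(region: Region) -> str:
    return f"formula-extractor:{region.box}"


def handle_text(region: Region) -> str:
    return f"text-extractor:{region.box}"


HANDLERS: Dict[str, Callable[[Region], str]] = {
    "table": handle_table,
    "formula": handle_formula,
}


def route(
    regions: List[Region],
    handlers: Dict[str, Callable[[Region], str]] = HANDLERS,
    default: Callable[[Region], str] = handle_text,
) -> List[str]:
    """Send each region to the handler for its label, else to the default."""
    return [handlers.get(region.label, default)(region) for region in regions]
```

Keeping the dispatch table separate from the handlers means a new element type only needs one new entry, and unknown labels fall back to plain text extraction rather than failing.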

That said, a pure VLM partitioner can still outperform Unstructured HFTW in certain scenarios. Forms are a strong example: when the entire page is semantically cohesive and typically processed as a whole, a full-page VLM pass can better capture relationships across fields without being constrained by predefined regions. In simpler or more uniform layouts, end-to-end VLM reasoning can be more flexible and easier to maintain.

In this post, we cover:

  • Why OD still defines document transformation quality
  • Why fine-tuning your own OD model is far harder than it looks
  • Why most teams should avoid owning OD models altogether
  • How our High Fidelity Transformation Workflow benefits from continuous OD improvements

Why Object Detection Quality Still Matters Downstream

Modern document pipelines are not just “VLM in, JSON out.” High-quality transformation starts with proper scoping.

In our HFTW, OD is the foundation. When bounding boxes are tight and accurate, VLMs operate on clean inputs. When OD degrades:

  • VLMs receive oversized or truncated inputs.
  • Prompts become brittle and over-engineered.
  • Pipelines grow complex to compensate for segmentation errors.

At that point, engineering effort shifts from improving quality to damage control.
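To make "oversized" and "truncated" concrete, here is a minimal sketch that scores a predicted crop against a reference box. The function names and the two metrics are illustrative assumptions, not metrics the post reports: `truncated` is the share of the reference region the crop misses, and `excess` is the share of the crop that lies outside the reference region.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def overlap(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(inter)


def crop_quality(pred: Box, truth: Box) -> Tuple[float, float]:
    """Return (truncated, excess) fractions for a predicted crop."""
    inter = overlap(pred, truth)
    truncated = 1.0 - inter / area(truth) if area(truth) > 0 else 0.0
    excess = 1.0 - inter / area(pred) if area(pred) > 0 else 0.0
    return truncated, excess
```

A crop that is twice as tall as the element it should contain yields `excess = 0.5`: half of what the downstream model sees is unrelated content.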

Every gain in OD accuracy translates directly into measurable performance improvements.

Quick Results Snapshot

We’ve seen this progression clearly across both detection quality and downstream extraction accuracy:

YOLOX → IBM Heron: +7 pts in element alignment and +3 pts in table detection F1, reducing misclassifications and missed boxes while driving measurable downstream accuracy gains.

IBM Heron → Fine-tuned Heron (Unstructured latest): +2.9 pts in detection recall, +2.3 pts in detection F1, +1.4 pts in reading order accuracy, +1.5 pts in corrected table TEDS, and +0.2 pts in content accuracy (CCT), delivering tighter boundaries, fewer merged elements, improved small-object recall, and further gains in structured extraction quality.
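For readers less familiar with detection metrics, the sketch below shows one common way to compute precision, recall, and F1 for a single class: greedily match each predicted box to the unmatched reference box with the highest IoU. The post does not describe the exact evaluation protocol, so the 0.5 IoU threshold and the greedy matching are assumptions for illustration only.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_prf(
    preds: List[Box], truths: List[Box], threshold: float = 0.5
) -> Tuple[float, float, float]:
    """Greedy one-to-one matching; returns (precision, recall, F1)."""
    unmatched = list(truths)
    true_pos = 0
    for pred in preds:
        best = max(unmatched, key=lambda t: iou(pred, t), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)  # each reference box can match only once
            true_pos += 1
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(truths) if truths else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Full benchmark suites also sort predictions by confidence and evaluate per class; this version omits both to stay short.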

Taken together, OD remains more critical than many teams realize. These improvements didn’t come from incremental tuning. They required rethinking how we approach OD fine-tuning altogether. Below, we discuss why Heron was chosen as the foundation for the next generation of OD fine-tuning, and the challenges involved in fine-tuning OD models.

Starting From Heron, Not From Scratch

This time, we didn’t start from scratch.

In the early days, when no open-source document object detection models met our needs, we built our own YOLO-based OD model from scratch. It performed reliably and supported a wide range of document types. For a long time, it was the backbone of our document transformation pipeline.

But as documents grew more complex—denser layouts, richer semantics, and higher precision requirements—its limitations became clear. We needed stronger layout understanding and better semantic awareness at the block level.

That’s where IBM’s Heron document object detection model came in.

Heron is a transformer-based model (RT-DETRv2) optimized for real-time performance and specifically trained for document understanding. Its transformer architecture enables stronger global context modeling, allowing it to reason about relationships between blocks, layout dependencies, and hierarchical structures across a page, while maintaining production-grade speed.

This made Heron a strong choice for us. We built on top of its strengths, elevating our layout intelligence. At the same time, Heron raised the bar for precision.

For the HFTW, “good” still wasn’t good enough. This workflow depends on extremely precise and consistent bounding boxes to ensure clean cropping and reliable downstream processing. Even minor inconsistencies can cascade into measurable quality drops later on.

So instead of training a new model from scratch, we focused on fine-tuning on top of a strong foundation, pushing Heron further to meet our stricter precision and consistency requirements.

And that’s where the real challenges began.

The Reality Check: Fine-Tuning OD Isn’t Plug-and-Play

Fine-tuning object detection is often described as routine: add your data, tweak a few hyperparameters, and train.

That description is dangerously misleading.

Our recent work exposed two fundamental challenges that reshaped how we think about owning OD models.

1. A Bug in the Training Framework

We fine-tuned using Hugging Face’s RT-DETRv2 training pipeline and hit persistent instability.

We tried everything you’d expect:

  • Learning rate sweeps
  • Optimizer and scheduler changes
  • Freezing and unfreezing backbones and heads

Nothing worked.

The root cause was not our data or methodology, but a bug in the training framework itself.

The loss appeared to converge, yet prediction probabilities consistently remained below 0.2. We spent a week troubleshooting in every direction, investing significant time and compute. Finally, we switched to RT-DETR’s official implementation, and the issue was resolved immediately.
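A symptom like this is cheap to watch for automatically. The sketch below is a hypothetical guard, not part of any training framework: it flags a run whose loss has dropped substantially while the best prediction confidence seen on a validation batch never clears a floor. The 0.2 floor mirrors the figure above; the required 50% loss drop is an arbitrary assumption.

```python
from typing import Sequence


def low_confidence_alarm(
    losses: Sequence[float],
    max_scores: Sequence[float],
    score_floor: float = 0.2,
    min_loss_drop: float = 0.5,
) -> bool:
    """Return True when loss fell but confidences stayed below the floor.

    losses: training loss recorded at each checkpoint, oldest first.
    max_scores: highest prediction confidence seen at each checkpoint.
    """
    if not losses or not max_scores:
        return False
    # Loss must have fallen by at least min_loss_drop relative to the start.
    loss_dropped = losses[-1] <= losses[0] * (1.0 - min_loss_drop)
    # No checkpoint ever produced a confident prediction.
    scores_stuck = max(max_scores) < score_floor
    return loss_dropped and scores_stuck
```

Logging the maximum confidence alongside the loss at every checkpoint makes this kind of mismatch visible within hours rather than after a week of hyperparameter sweeps.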

Fine-tuning OD models requires not only ML expertise, but also a deep understanding of training frameworks and library behavior. Even mature, widely adopted tooling can hide costly failure modes that only surface at scale.

2. Inconsistent Labels Across Datasets

The deeper challenge was data alignment.

Heron and our internal datasets were built with different annotation philosophies.

  • Heron favors larger bounding boxes, often covering full paragraphs or sections.
  • Our downstream pipelines require tighter, more granular blocks aligned with structured extraction tasks.

Both approaches work independently. We continue to use Heron successfully for OD inference. However, combining these datasets for fine-tuning introduced training instability.

2.1 Bounding Box Semantics & Overfitting

The two datasets differ in both label definitions and box granularity. When fine-tuned without adjustment, the model struggled to learn a consistent definition of what constitutes a “correct” bounding box.

As training epochs increased, this inconsistency led to overfitting behavior. The model oscillated between coarse and fine-grained spatial priors, degrading generalization performance.

To address this, we designed an agentic label harmonization approach to:

  • Normalize bounding box granularity across datasets
  • Align label semantics between Heron and internal annotations
  • Enforce consistent spatial definitions prior to training

This preprocessing step was essential to stabilize learning.
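The agentic workflow itself is not described here, so the sketch below substitutes a much simpler rule-based stand-in to show what harmonization means in practice. It renames labels into a shared taxonomy and drops any coarse box that wraps two or more finer boxes of the same class. The label names, the mapping, and the "two or more children" rule are all hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Annotation = Tuple[str, Box]

# Hypothetical mapping from one dataset's label names to a shared taxonomy.
LABEL_MAP: Dict[str, str] = {
    "Text": "paragraph",
    "Section-header": "header",
    "Table": "table",
}


def contains(outer: Box, inner: Box) -> bool:
    """True when inner lies entirely within outer."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


def harmonize(annotations: List[Annotation]) -> List[Annotation]:
    """Rename labels, then drop coarse boxes that wrap finer same-class boxes."""
    renamed = [(LABEL_MAP.get(label, label), box) for label, box in annotations]
    kept: List[Annotation] = []
    for i, (label, box) in enumerate(renamed):
        children = [
            other_box
            for j, (other_label, other_box) in enumerate(renamed)
            if j != i and other_label == label and other_box != box
            and contains(box, other_box)
        ]
        if len(children) >= 2:
            continue  # assumed rule: a wrapper around finer boxes is redundant
        kept.append((label, box))
    return kept
```

Fixed rules like these cover only the easy cases; ambiguous ones, such as deciding whether a caption belongs to its figure, are where a more flexible approach earns its keep.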

2.2 Catastrophic Forgetting & Data Imbalance

A second issue emerged during fine-tuning: catastrophic forgetting.

Our internal dataset is distributionally different and more imbalanced than Heron’s original training data. When fine-tuned aggressively, the model began to lose some of its previously general detection capabilities.

To mitigate this, we blended a portion of Heron’s original training data into the fine-tuning mix. This helped preserve prior capabilities while adapting the model to our tighter bounding box definitions.
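Blending of this kind is often called replay. A minimal sketch, assuming both datasets are simple in-memory sequences of examples: sample enough original examples that they make up roughly `replay_fraction` of the final mix. The 30% default is an arbitrary illustration, not the ratio used in this work.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def blend_with_replay(
    new_data: Sequence[T],
    original_data: Sequence[T],
    replay_fraction: float = 0.3,
    seed: int = 0,
) -> List[T]:
    """Mix new examples with a sample of original ones.

    replay_fraction is the target share of original examples in the result.
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)  # seeded for reproducible mixes
    # Solve n_replay / (len(new_data) + n_replay) = replay_fraction.
    n_replay = round(len(new_data) * replay_fraction / (1.0 - replay_fraction))
    n_replay = min(n_replay, len(original_data))
    mixed = list(new_data) + rng.sample(list(original_data), n_replay)
    rng.shuffle(mixed)
    return mixed
```

The right fraction is an empirical question: too little replay and general detection ability erodes, too much and the model never fully adopts the tighter box definitions.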

Fine-tuning OD models is therefore not only about optimization. It requires careful control over annotation consistency, dataset balance, and knowledge retention across training phases.

What These Challenges Taught Us

After navigating these challenges, one conclusion became unavoidable:

Fine-tuning object detection is not plug-and-play.

It demands:

  • Deep understanding of OD architectures
  • Strong debugging skills across training frameworks
  • Extremely consistent, high-quality annotations
  • Ongoing maintenance as document layouts evolve

For most teams, the operational cost far outweighs the benefits of owning a custom OD model.

This is why many document pipelines quietly struggle—not because VLMs are weak, but because OD is treated as a solved problem when it isn’t.

Lessons Learned and Path Forward

High-quality document transformation isn’t about chasing the newest model. It’s about strengthening the foundation.

What we learned is clear: OD is not plug-and-play. It requires rigorous debugging, consistent annotations, and careful control of data distribution. Small inconsistencies in bounding boxes cascade into measurable downstream errors.

OD is also critical because its quality directly determines the reliability of everything downstream: OCR accuracy, table structure recovery, reading order, and structured extraction.

That’s why we invest heavily in it, so you don’t have to. We let our users focus on building applications and delivering value while we focus on making sure the layout foundation is precise, stable, and continuously improving.
