Welcome to the third article in the series on Retrieval Augmented Generation (RAG) with Unstructured Data for LLMs. This series covers how to preprocess unstructured data, including tables and images, for LLM applications. In the first two articles, we discussed normalizing unstructured documents and extracting metadata.
Often, you'll need specialized models to preprocess images, PDFs, and certain other document types. In this article, you'll learn about document image analysis techniques. These include Document Layout Detection (DLD) and Vision Transformers (ViT).
You’ll also learn how to use these techniques to preprocess images and PDFs. Let’s get started.
So far, you have learned how to preprocess documents that can be handled by rules-based parsers, such as HTML files, Word documents, and PowerPoint presentations. However, some documents, such as PDFs and images, don't have structured information available within the document itself.
For these documents, you need to use visual information to understand the document's structure. In this article, you will learn about document image analysis, which allows you to extract formatting information and text from the raw image of a document.
Document Layout Detection
Document Layout Detection uses an object detection model to draw bounding boxes around the elements of a document. Once the boxes have been drawn, they are labeled with an element category, and the text is extracted from within each bounding box.
By contrast, Vision Transformers take a document image as input and produce text as output. These models can be trained to produce a structured output like JSON as the output text.
As a note, vision transformers can optionally include text input, just like an LLM transformer can.
Now let’s learn about Document Layout Detection. Document Layout Detection requires two steps:
- Identifying and categorizing bounding boxes around elements within the document, such as narrative text, titles, or bulleted lists.
- Extracting text from within those bounding boxes. Depending on the document type, there are two ways to do this:
- In some cases, the text is not available from within the document itself. In those cases, you’ll need to apply techniques such as Optical Character Recognition (OCR) to extract the text.
- In other cases, such as in some PDFs, the text is available within the document itself. You can use the bounding box information to trace the bounding box back to the original document and extract the text content that falls within the bounding box.
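To make these two steps concrete, here is a minimal sketch using the unstructured library's hi_res strategy locally. The file name is a placeholder, and local inference requires extra dependencies beyond the base package:

from unstructured.partition.pdf import partition_pdf

# Run layout detection locally on a PDF (hypothetical file path). The
# "hi_res" strategy uses an object detection model to find bounding boxes,
# labels each box, and extracts the text inside it, falling back to OCR
# when the PDF has no embedded text layer. Local inference needs extra
# dependencies (e.g. `pip install "unstructured[pdf]"`).
elements = partition_pdf(filename="example.pdf", strategy="hi_res")

for element in elements[:5]:
    print(element.category, "->", element.text[:60])
    # Bounding box coordinates are attached as metadata (availability can
    # depend on the library version and partition settings).
    if element.metadata.coordinates is not None:
        print(element.metadata.coordinates.points)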
You can take a look below at the architecture of the YOLOX model, which is one of the models frequently used for document layout detection. If you're interested, check out the arXiv paper.

Vision Transformers
Now, let’s learn about another technique for document image analysis – Vision Transformers. In contrast to document layout detection models, which draw a bounding box and then apply an OCR model when necessary, vision transformers accept images as input and then produce text directly as output.
In this case, OCR is not required to extract the text from the image.
One common architecture for vision transformers is the DONUT architecture, or Document Understanding Transformer. When applying a model such as the DONUT model, you can train the model to produce a valid JSON string as output, and that JSON string can contain the structured document output that we’re interested in.
You can see below what the architecture looks like for the DONUT model. Again, if you’re interested in learning more, check out the arxiv paper.

Now let's take a look at how vision transformers work in practice:
- You take an image representation of the document.
- You pass that image to the vision transformer.
- The vision transformer outputs a string, which is valid JSON.
- Each element in the JSON contains the text of the element and its category.
- Once you have this string, you can convert it into the normalized document elements that you expect from all of your other document types.
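As an illustrative sketch of this flow, here is roughly how a publicly available DONUT checkpoint can be run with the Hugging Face Transformers library. The receipt-parsing checkpoint and the input file are assumptions used only to show the image-in, JSON-out pattern, and the generation arguments follow the Transformers documentation:

import re
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Publicly available DONUT checkpoint fine-tuned for receipt parsing,
# used here only to illustrate the image-in, structured-JSON-out workflow.
checkpoint = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

image = Image.open("receipt.png").convert("RGB")  # hypothetical input image
pixel_values = processor(image, return_tensors="pt").pixel_values

# The task prompt tells the decoder which structured output to generate.
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "")
sequence = sequence.replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task token

# token2json converts the generated token sequence into nested JSON.
print(processor.token2json(sequence))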
So, when should I use a vision transformer, and when should I use a document layout detection model?
Pros and Cons of Both Models
Each model type has its advantages and disadvantages. For document layout detection models, some of the advantages are:
- The model is trained on a fixed set of element types, and so it can become very good at recognizing these.
- You get bounding box information, which allows you to trace results back to the original document and, in some cases, allows you to extract text without running OCR.
The disadvantages include:
- In some cases, Document Layout Detection Models require two model calls – first for the object detection model, and second for the OCR model.
- These models are less flexible. They work from a fixed set of element types, so if you need to extract something else, they may not be able to do so without further retraining.
For vision transformers, some of the advantages are:
- They are flexible for non-standard document types like forms and can extract information like key-value pairs relatively easily.
- They are more adaptable to new ontologies. Whereas adding a new element type is difficult for document layout detection models, you can potentially add one to a vision transformer through prompting alone.
Some of the disadvantages are:
- The model is generative, and so it is potentially prone to hallucination or repetition, just like a generative model in natural language use cases.
- These models tend to be much more computationally expensive than document layout detection models, and so they either require a lot of computing power or they run more slowly.
Practice DLD and ViT
Now that we’ve learned techniques to preprocess images and PDFs, let’s put those into practice. In this exercise, we will preprocess the same document first in an HTML representation, and then in a PDF representation.
You’ll see how you can extract a similar set of information from different document types, whether you’re processing the document using a rules-based technique, or you’re extracting that information based on visual cues.
To get started, you’ll need to import some of the same dependencies that you imported in the previous article, especially the unstructured open-source library.
We won't spend much time on setting up models for PDF preprocessing locally; the model-based work in this exercise runs through the unstructured API, which handles the required models for us. The necessary installations and directions are already given in the Colab notebook.
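Before running the snippets below, you'll need imports and a client along these lines. This is a minimal setup sketch; the exact import paths and client constructor can vary between versions of unstructured and unstructured-client, so follow the Colab notebook for the versions it pins:

from unstructured.partition.html import partition_html
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import dict_to_elements

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# Hypothetical placeholders: point these at your own API key and data folder.
s = UnstructuredClient(api_key_auth="YOUR_UNSTRUCTURED_API_KEY")
data_path = "./"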
Working with HTML Files
First, let's take a look at preprocessing the same document represented as HTML. The document in this case is a news article about the El Niño weather pattern that was published on CNN.
To process the HTML version, first set the file name, and then use the partition_html() function from unstructured to get the HTML elements.
By passing the file name into that function, we can get the narrative text, titles, and other information identified using the HTML tags and natural language processing:
filename = data_path + "example_files/el_nino.html"
html_elements = partition_html(filename=filename)
for element in html_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")

This should return:
TITLE: CNN
UNCATEGORIZEDTEXT: 1/30/2024
TITLE: A potent pair of atmospheric rivers will drench California as El Niño makes its first mark on winter
TITLE: By Mary Gilbert, CNN Meteorologist
UNCATEGORIZEDTEXT: Updated:
3:49 PM EST, Tue January 30, 2024
TITLE: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms ...
NARRATIVETEXT: The soaking storms will raise the flood threat across ...
...

Working with PDFs
Now, let's look at preprocessing the PDF representation of the same document. We will start by installing the pikepdf library. You can first preprocess the PDF using the fast strategy from unstructured.
The fast strategy extracts text directly from simple PDFs like this news article. If you preprocess the PDF using this strategy, you’ll see the extracted elements are very similar to the HTML case, with narrative text and titles identified:
!pip install pikepdf
filename = data_path + "example_files/el_nino.pdf"
pdf_elements = partition_pdf(filename=filename, strategy="fast")
for element in pdf_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")

You can also preprocess the PDF using a document layout detection model via the unstructured API. In this case, we'll use the YOLOX model, which will draw bounding boxes around elements and extract the text within them.
This is a model-based workload, so please be patient as the API call completes. Once done, you’ll again see the model has successfully identified titles, narrative text, etc. similar to the HTML outputs:
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="yolox",
)

try:
    resp = s.general.partition(req)
    dld_elements = dict_to_elements(resp.elements)
except SDKError as e:
    print(e)

for element in dld_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")

To compare, the HTML outputs had 32 elements, including:
- 23 narrative text elements
- 6 title elements
The document layout detection outputs had 39 elements:
- 28 narrative text elements
- 10 title elements (mix of headers and titles)
You can verify those numbers as follows:
import collections

print(len(html_elements))
html_categories = [el.category for el in html_elements]
collections.Counter(html_categories).most_common()

print(len(dld_elements))
dld_categories = [el.category for el in dld_elements]
collections.Counter(dld_categories).most_common()

So while not exactly identical, the outputs are pretty close regardless of whether the document is represented as a PDF or HTML. This means you can treat documents in either format similarly within your application.
Extracting Tables from Documents
You will now learn how to extract tables from documents and infer their structure. While most use cases focus on text content within documents, in some industries like finance and insurance, it is common to see structured information within unstructured documents.
This often occurs in the form of tables containing financial or numerical data embedded within documents. Table extraction enables applications to extract information contained within these tables in unstructured documents.
Some document types like HTML and Word contain table structure information within the document itself, such as the <table> tag in HTML. For these, you can use rules-based parsers to extract table information.
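For instance, here is a minimal sketch with the unstructured library and a small inline HTML snippet (the table content is made up). In recent versions of the library, the rules-based HTML partitioner returns a Table element whose metadata keeps an HTML representation of the table:

from unstructured.partition.html import partition_html

# A tiny HTML document with an embedded table (illustrative content only).
html_text = """
<html><body>
  <table>
    <tr><th>Quarter</th><th>Revenue</th></tr>
    <tr><td>Q1</td><td>$10M</td></tr>
  </table>
</body></html>
"""

elements = partition_html(text=html_text)
tables = [el for el in elements if el.category == "Table"]
print(tables[0].text)                   # flattened cell text
print(tables[0].metadata.text_as_html)  # reconstructed <table> markup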
However, for other formats like PDFs and images, we need to use visual cues to first identify the table within the document and then process it to extract the tabular information. There are a few key techniques you will learn to accomplish this:
Table Transformers
A table transformer is a model that identifies bounding boxes for table cells and converts the output to HTML. This involves two steps:
- First, identify tables using a document layout detection model
- Then, route the identified tables to the table transformer model
The advantages include the ability to trace cells back to the original bounding boxes. However, the disadvantage is that it requires multiple expensive model calls for layout detection and OCR.
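As a rough sketch of the second stage, here is how the open-source Table Transformer checkpoint from the Hugging Face model hub can be applied to an image that has already been cropped to a single table. The file name is a placeholder, and the detected rows and columns would still need to be intersected into cells and filled with OCR text to produce HTML:

import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

# Structure-recognition checkpoint: detects rows, columns, and headers
# inside an image that has already been cropped to a single table.
checkpoint = "microsoft/table-transformer-structure-recognition"
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = TableTransformerForObjectDetection.from_pretrained(checkpoint)

table_image = Image.open("cropped_table.png").convert("RGB")  # hypothetical crop
inputs = image_processor(images=table_image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into labeled bounding boxes.
target_sizes = torch.tensor([table_image.size[::-1]])
results = image_processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes
)[0]

for label, box in zip(results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], [round(v, 1) for v in box.tolist()])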
Vision Transformers for Tables
You can also extract table content from PDFs and images using vision transformer models like the ones you learned about previously. Instead of JSON, however, the target output here is HTML.
- Advantages: Allows for flexible prompting, single model call
- Disadvantages: Generative so can hallucinate, no bounding box info
OCR Post-Processing
A final technique is OCR post-processing of the tables:
- Identify tables using document layout detection
- OCR the table region
- Process the OCR output using rules-based parsing to construct HTML
The advantage is that it is fast and accurate for well-behaved tables; the disadvantage is that rules-based parsing is brittle for complex tables.
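As a toy illustration of the rules-based step (assuming pytesseract is installed and the table region has already been cropped out by layout detection), the sketch below treats each OCR line as a row and splits columns on runs of spaces. Real documents usually need more robust rules:

import re
import pytesseract
from PIL import Image

# OCR the cropped table region (hypothetical file), then apply a very
# simple rule: each OCR line is a table row, and columns are split
# wherever two or more spaces appear.
table_image = Image.open("cropped_table.png")
ocr_text = pytesseract.image_to_string(table_image)

html_rows = []
for line in ocr_text.splitlines():
    if not line.strip():
        continue
    cells = [c for c in re.split(r"\s{2,}", line.strip()) if c]
    html_rows.append("<tr>" + "".join(f"<td>{c}</td>" for c in cells) + "</tr>")

reconstructed_html = "<table>" + "".join(html_rows) + "</table>"
print(reconstructed_html)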
Now let's put what we have learned into practice. First, install the langchain and langchain-openai libraries. You will also need an OpenAI API key to run the following code. Let's start with the installations:
!pip install langchain
!pip install langchain-openai

You'll preprocess a more complex document containing tables and images. Specifically, you'll extract the content of Table 1 from the file below:
Pass the document to the unstructured API with table structure inference enabled (pdf_infer_table_structure=True) to extract tables:
filename = data_path + "example_files/embedded-images-tables.pdf"

with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="yolox",
    skip_infer_table_types=[],
    pdf_infer_table_structure=True,
)

try:
    resp = s.general.partition(req)
    elements = dict_to_elements(resp.elements)
except SDKError as e:
    print(e)

Filter for elements with category "Table" to get the table elements:

tables = [el for el in elements if el.category == "Table"]

Access the text using element.text:

tables[0].text

Get the HTML representation from element.metadata.text_as_html:

table_html = tables[0].metadata.text_as_html

This gives you the table content in text and HTML formats.
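If you want to inspect the table programmatically, one option (assuming pandas and an HTML parser such as lxml are installed) is to load the HTML into a DataFrame:

import io
import pandas as pd

# Parse the reconstructed <table> markup into a DataFrame for inspection.
df = pd.read_html(io.StringIO(table_html))[0]
print(df.head())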
To make the tables searchable in a retrieval augmented generation (RAG) system, you can summarize them using utilities from LangChain:
- Import summarization utilities
- Instantiate the summary chain
- Summarize the table HTML content
Note that you need an OpenAI API key at this step. Make sure you either set it as an environment variable or pass it directly to the ChatOpenAI function:
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-1106")
chain = load_summarize_chain(llm, chain_type="stuff")
chain.invoke([Document(page_content=table_html)])

Final Thoughts
Document image analysis techniques like Document Layout Detection (DLD) and Vision Transformers (ViT) let you process PDFs and images. DLD draws labeled bounding boxes and extracts the text within them, applying OCR when needed. ViT takes document images as input and produces structured text output directly.
DLD pros:
- trained on a fixed set of element types; bounding boxes trace results back to the original document.
DLD cons:
- may require multiple model calls; less flexible for new element types.
ViT pros:
- a single model call; flexible prompting for new element types and non-standard documents.
ViT cons:
- generative, so prone to hallucination and repetition; computationally expensive.
For tables, we briefly covered table transformers (layout detection + table structure model), vision transformers that output HTML, and OCR post-processing (layout detection + OCR + rules).
Now you know how to preprocess images and extract tables! Try it on your own files, and experiment with changing parameters like the vision model used. In the next article, we'll put all these skills together to build a full retrieval augmented generation (RAG) application.