Welcome to the fourth and final article in the Retrieval Augmented Generation (RAG) with Unstructured Data Applications series.
In the first three articles, we covered:
- How to normalize unstructured documents
- How to extract metadata
- How to preprocess images, PDFs, and tables
In this article, you will put all these techniques together and build a RAG bot over a corpus containing PDF, PowerPoint, and Markdown documents.
The Goal
The goal is to take a corpus of documents of different types that all discuss the Donut document understanding model, and to apply the RAG techniques from earlier in the series to get high-quality responses from large language models (LLMs).

You’ll pre-process all these document types and load them into a vector database. Then, you’ll query the vector database and insert the query results into a prompt. Finally, you’ll pass the prompt to an LLM so that you can chat with the set of documents and get relevant responses.

You can start by importing a few helper functions, many of which you’ve already seen throughout the series.
For this application, you’ll also need a new function, partition_md, because the corpus includes Markdown files. If you want to include other document types, such as Word documents, you can use the corresponding function, such as partition_docx.

You’ll also set up your Unstructured API client, because the corpus contains PDFs that require model-based parsing. With these pieces in place, you can implement a retrieval system backed by an external knowledge base.
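If you’re setting this up yourself, a minimal sketch of the setup might look like the following. The client variable s, the dict_to_elements helper, the data_path prefix, and the environment variable names are assumptions based on how they’re used later in this article:

```python
import os

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from unstructured.staging.base import dict_to_elements  # converts API output to Element objects

# Hypothetical setup: substitute your own API credentials and data directory.
s = UnstructuredClient(
    api_key_auth=os.environ["UNSTRUCTURED_API_KEY"],
    server_url=os.environ["UNSTRUCTURED_API_URL"],
)
data_path = "./"
```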
Start Building Your Own RAG Bot

Now that you’re ready to build your RAG bot, start by reviewing the Donut model documentation and taking a look at a few of the documents.

The first document you’ll include is a PDF of the Donut paper from arXiv. This PDF includes complex tables.
You’ll also include a PowerPoint deck containing information about the Donut model, along with the README from the GitHub repo for Donut, which is in Markdown format.
You can get started by preprocessing the PDF.
Applying what you learned earlier in the series, use the Unstructured API to pre-process this document with the YOLOX document layout detection model.

As you learned earlier, this is an expensive model-based workload, so don’t worry if it takes a few minutes to process:
```python
filename = data_path + "example_files/donut_paper.pdf"

# Read the PDF and wrap it for the Unstructured API.
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

# Use the hi_res strategy with the YOLOX layout model and table inference.
req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="yolox",
    pdf_infer_table_structure=True,
    skip_infer_table_types=[],
)

try:
    resp = s.general.partition(req)
    pdf_elements = dict_to_elements(resp.elements)
except SDKError as e:
    print(e)
```

After preprocessing the PDF, you can examine the results:
```python
pdf_elements[0].to_dict()
```

Which returns:
```python
{'type': 'Title',
 'element_id': '59a9f0edd370eaa8c5c59cd9256e63bd',
 'text': 'OCR-free Document Understanding Transformer',
 'metadata': {'filetype': 'application/pdf',
              'languages': ['eng'],
              'page_number': 1,
              'filename': 'donut_paper.pdf'}}
```

You may also want to take a look at some of the tables you’ll be able to query over once you have assembled your RAG application.
```python
from io import StringIO

from lxml import etree

tables = [el for el in pdf_elements if el.category == "Table"]
table_html = tables[0].metadata.text_as_html

parser = etree.XMLParser(remove_blank_text=True)
file_obj = StringIO(table_html)
tree = etree.parse(file_obj, parser)
print(etree.tostring(tree, pretty_print=True).decode())
```

That should give the following HTML table:
```html
<table>
  <tr>
    <td>NAVER CLOVA 4Upstage</td>
    <td>2NAVER Search STmax 6Google</td>
    <td>3SNAVER AI Lal 7LBox</td>
  </tr>
</table>
```

You may want to filter out unwanted content, like the references section. Use the parent_id metadata field to identify and remove elements nested under the references element:
```python
# Find the "References" title element.
reference_title = [
    el for el in pdf_elements
    if el.text == "References"
    and el.category == "Title"
][0]

reference_title.to_dict()

# Capture the element's ID so we can match elements nested under it.
references_id = reference_title.id

for element in pdf_elements:
    if element.metadata.parent_id == references_id:
        print(element)
        break

# Drop every element that belongs to the references section.
pdf_elements = [el for el in pdf_elements if el.metadata.parent_id != references_id]
```

You can also filter out header information that breaks up the narrative structure by removing elements with the category metadata field set to "Header":
```python
headers = [el for el in pdf_elements if el.category == "Header"]
headers[1].to_dict()
```

This returns the following Header element:
```python
{'type': 'Header',
 'element_id': 'a4b916e36299d9c6f3f676a6480f550c',
 'text': 'OCR-free Document Understanding Transformer',
 'metadata': {'filetype': 'application/pdf',
              'languages': ['eng'],
              'page_number': 3,
              'filename': 'donut_paper.pdf'}}
```

Next, you’ll pre-process your PowerPoint slides using the partition_pptx function from the unstructured open-source library.
Similarly, you can use the partition_md function to partition your Markdown file:
```python
# preprocess the PowerPoint slides
from unstructured.partition.pptx import partition_pptx

filename = data_path + "example_files/donut_slide.pptx"
pptx_elements = partition_pptx(filename=filename)

# preprocess the README file
from unstructured.partition.md import partition_md

filename = data_path + "example_files/donut_readme.md"
md_elements = partition_md(filename=filename)
```

Finally, remove the Header elements from the PDF content:

```python
pdf_elements = [el for el in pdf_elements if el.category != "Header"]
```

Loading Content to a Vector Database
After pre-processing all the documents, you can combine them into a single corpus and chunk it using the chunk_by_title function from the metadata and chunking article. Chunking by title keeps each section’s content together, which improves retrieval.
After chunking, you can load the documents into a vector database using utilities from LangChain. To search for content by file type, you can include the source as a metadata field when loading documents:
```python
from unstructured.chunking.title import chunk_by_title
from langchain.schema import Document

elements = chunk_by_title(pdf_elements + pptx_elements + md_elements)

documents = []
for element in elements:
    metadata = element.metadata.to_dict()
    del metadata["languages"]  # list values aren't supported as vector store metadata
    metadata["source"] = metadata["filename"]
    documents.append(Document(page_content=element.text, metadata=metadata))
```

The next step is to embed the documents using OpenAI embeddings and load them into a Chroma vector database using LangChain’s from_documents method.
You can then set up a retriever that performs a similarity search and returns six results to insert into your prompt for the LLM.

After setting up the vector database, you can create a prompt template using LangChain. This template will instruct the LLM to admit when it doesn’t know the answer rather than make one up.
```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 6},
)
```

With the vector store and retriever in place, you’re ready to set up the prompt template and query your LLM using LangChain’s ConversationalRetrievalChain.
You can ask a question like “How does Donut compare to other document understanding models?”, and the model will provide a response citing relevant sources from the corpus. You can also apply a hybrid search that filters on metadata fields, like the file name, to retrieve information from specific sources within the corpus (there’s a sketch of this below). First, define the prompt template and assemble the chain:
template = """You are an AI assistant for answering questions about the Donut document understanding model.
You are given the following extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "Hmm, I'm not sure." Don't try to make up an answer.
If the question is not about Donut, politely inform them that you are tuned to only answer questions about Donut.
Question: {question}
=========
{context}
=========
Answer in Markdown:"""
prompt = PromptTemplate(template=template, input_variables=["question", "context"])
llm = OpenAI(temperature=0)
doc_chain = load_qa_with_sources_chain(llm, chain_type="map_reduce")
question_generator_chain = LLMChain(llm=llm, prompt=prompt)
qa_chain = ConversationalRetrievalChain(
retriever=retriever,
question_generator=question_generator_chain,
combine_docs_chain=doc_chain,
)
qa_chain.invoke({
"question": "How does Donut compare to other document understanding models?",
"chat_history": []
})["answer"]Which should return:
```
Donut is a state-of-the-art document understanding model that does not require OCR and can be trained in an end-to-end fashion. It uses a simple architecture consisting of a visual encoder and textual decoder.
SOURCES: donut_readme.md, donut_paper.pdf, donut_slide.pptx
```

After completing your RAG bot, you can improve it by including your own files, asking different questions, or adding additional metadata fields to your hybrid search.
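As a minimal sketch of that kind of hybrid search, assuming Chroma’s metadata filter syntax and reusing the chain components built above, you could restrict retrieval to a single file (the question here is a hypothetical example):

```python
# Hypothetical example: restrict retrieval to a single source file.
# The "filter" kwarg is passed through to Chroma's metadata filtering.
filter_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 6, "filter": {"source": "donut_readme.md"}},
)

filter_chain = ConversationalRetrievalChain(
    retriever=filter_retriever,
    question_generator=question_generator_chain,
    combine_docs_chain=doc_chain,
)

filter_chain.invoke({
    "question": "How do I run Donut on my own documents?",
    "chat_history": [],
})["answer"]
```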
Iterating like this will continuously improve the quality of your information retrieval. You are now prepared to create a RAG bot over a diverse set of documents, based on information important to you or your organization.
Final Thoughts
In the “Preprocessing Unstructured Data for LLM Applications” series, you learned about ingesting and normalizing content from various data sources.
You also explored enriching RAG applications with metadata extracted during preprocessing.
Finally, the series covered advanced modeling techniques, including fine-tuning, to unlock content in PDFs and images, and showed how to turn those outputs into a functioning RAG application.
You’re now ready to build a RAG bot that is aware of information about your project or organization, using generative models to understand and process complex documents.