Welcome to the second article in the series on Retrieval Augmented Generation (RAG) with Unstructured Data for LLMs. This series covers extracting information from unstructured documents to improve language model performance.
In the first article, we discussed normalizing unstructured documents.
This second article focuses on enriching extracted content with metadata, which improves downstream RAG results by supporting hybrid search.
It also lets you produce chunked, more meaningful content for semantic search.
What is Metadata?
Metadata, in our context, is additional information that we extract while we’re pre-processing the document.
Metadata can be at the document level or the element level. It can be something we extract from the document information itself, like the last modified date or the file name.
Alternatively, it can be something we infer while pre-processing the document. For instance:
- The type of element
- Hierarchical relationships between different element types
This metadata will be crucial when building your RAG application, particularly for features like hybrid search.
In practice, metadata looks like this:
- You have your text, which is the actual raw content extracted from the document.
- Then, you’ll have all the other information, such as the page number, language, file name, and type of the element. All this information will be useful when you build hybrid search systems for your RAG application.
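As a concrete sketch of that shape (illustrative only; the actual fields produced by the Unstructured API appear later in this article), a single pre-processed element might look like this:

element = {
    "type": "NarrativeText",  # the inferred element type
    "element_id": "…",        # unique ID assigned to this element
    "text": "Many of the Swiss winter-resorts can put into the field...",
    "metadata": {
        "languages": ["eng"],              # detected language(s)
        "filename": "winter-sports.epub",  # file-level metadata
        "filetype": "application/epub",
        "parent_id": "…",  # links the element to its section title
    },
}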
Before learning about hybrid search, it’s vital to first understand some basics about semantic search for LLMs (Large Language Models).
The first step in a Retrieval Augmented Generation (RAG) system is typically retrieving relevant documents from a vector database. The most basic way to do this is semantic search: looking for content similar to your query.
After loading your unstructured documents into a vector database, you run a query. The system then searches for the most similar documents based on a measure of distance in the vector space.
Documents close in the vector space are considered similar, creating a useful knowledge base for the LLM.
For example, a tomato document would be close to vectors for “tomato” and “vegetable.” Animal documents, on the other hand, would cluster together but be further from the “tomato” vector.
The idea is that if you’re searching over a corpus of documents, you can ask the vector database for information relevant to your query. You then get back similar documents that you can insert into a prompt and feed to your LLM.
That’s typically what you’re trying to do in the semantic search part of a RAG application.
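To make "distance in the vector space" concrete, here is a toy sketch using numpy with made-up three-dimensional vectors (real embedding models produce vectors with hundreds or thousands of dimensions):

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; values near 0 mean unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up vectors for illustration only.
tomato    = np.array([0.9, 0.1, 0.0])
vegetable = np.array([0.8, 0.3, 0.1])
dog       = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(tomato, vegetable))  # high: close in vector space
print(cosine_similarity(tomato, dog))        # low: far apart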
However, there are some cases where similarity search isn’t ideal for returning information for your RAG system:
- There could be too many matches, which often happens when many documents cover the same topic.
- You may want to bias results based on other information, such as favoring more recent documents.
- Important information contained within the document, such as section information, might be missed when you’re only searching on semantics.
That’s where hybrid search comes into play. You can use metadata as a filtering option.
If you want to limit your search to a particular section of a document, you can filter on that metadata field.
If you want to limit your results to more recent information, you can structure your query accordingly, returning only documents after a specific date.
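As a minimal sketch of such a query (assuming a ChromaDB collection whose documents were stored with a numeric "year" metadata field; that field name is hypothetical):

# Hypothetical: assumes each document was added to `collection` with a
# numeric "year" metadata field when the collection was populated.
result = collection.query(
    query_texts=["recent winter resort recommendations"],
    n_results=3,
    where={"year": {"$gt": 2020}},  # only return documents newer than 2020
)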
Metadata Extraction
So now you’ve learned about metadata and hybrid search. Let’s see what this looks like in practice. Just like in the previous article, you can start by importing some helper functions from the open-source Unstructured library.
You have one additional import that’s important to highlight: ChromaDB.
ChromaDB is a lightweight vector database that can run in memory or persist to disk. You’ll use it in this article to conduct your similarity search. It can be installed and imported as follows:
!pip install chromadb

import chromadb

Other necessary installations are mentioned in the previous article, in the Installations section of the notebook. Similarly, clone the following repository to fetch the data used in this tutorial:

!git clone https://github.com/Nuri-Tas/LLM-Tutorials.git

In this example, you’ll work with an e-publication about winter sports in Switzerland. The goal is to identify the chapter for each section and then conduct a hybrid search where you ask a question about a particular chapter.
First, take a look at the document’s cover and table of contents. In this application, you will look for the titles in the table of contents. When pre-processing this document, you’ll get a metadata field called parent_id, which attaches each element to its section title.
Understanding the file format is crucial in this step.
You’ll use that metadata to construct a hybrid search for content within a particular section or chapter. Here is what the table of contents of our data looks like:
from IPython.display import Image
data_path = "/content/LLM-Tutorials/Data/LLM_Preprocessing/"
Image(filename=data_path+"images/winter-sports-toc.png", height=400, width=400)
The first step is to run the document through the Unstructured API; e-publications get converted to HTML before pre-processing. Once processed, you can use the JSONDisplay function to explore the results, which include titles classified as "Title" elements and metadata from the data source.
You can process the data as follows:
filename = data_path + "example_files/winter-sports.epub"

# `shared`, the client `s`, and `SDKError` come from the unstructured_client
# setup described in the previous article.
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(files=files)
try:
    resp = s.general.partition(req)
except SDKError as e:
    print(e)

The data is now ready to be displayed in JSON format. Let’s print the first three elements of the processed data:
print(json.dumps(resp.elements[0:3], indent=4))

which should return:
[
    {
        "type": "Title",
        "element_id": "6c6310b703135bfe4f64a9174a7af8eb",
        "text": "The Project Gutenberg eBook of Winter Sports in\nSwitzerland, by E. F. Benson",
        "metadata": {
            "languages": [
                "eng"
            ],
            "filename": "winter-sports.epub",
            "filetype": "application/epub"
        }
    },
    ...
]

Find Elements Associated with Chapters
Next, you’ll find elements associated with chapters by filtering for Title elements that contain keywords like "hockey." You can then get the element ID for the "Ice Hockey" title element.
This element ID will show up as the parent_id for elements within that chapter. This allows you to identify elements in a given chapter and search only within that chapter. We can find the corresponding element IDs as follows:
[x for x in resp.elements if x['type'] == 'Title' and 'hockey' in x['text'].lower()]

Which returns:
[{'type': 'Title',
  'element_id': '6cf4a015e8c188360ea9f02a9802269b',
  'text': 'ICE-HOCKEY',
  'metadata': {'languages': ['eng'],
   'filename': 'winter-sports.epub',
   'filetype': 'application/epub'}},
 {'type': 'Title',
  'element_id': '4ef38ec61b1326072f24495180c565a8',
  'text': 'ICE HOCKEY',
  'metadata': {'languages': ['eng'],
   'filename': 'winter-sports.epub',
   'filetype': 'application/epub'}}]

This was an example of a filtering operation. You can filter down to various other elements by, for instance, looking for NarrativeText instead of Title types.
Now that we know how to filter, we can find the corresponding IDs for each chapter:
chapters = [
"THE SUN-SEEKER",
"RINKS AND SKATERS",
"TEES AND CRAMPITS",
"ICE-HOCKEY",
"SKI-ING",
"NOTES ON WINTER RESORTS",
"FOR PARENTS AND GUARDIANS",
]
chapter_ids = {}
for element in resp.elements:
    for chapter in chapters:
        if element["text"] == chapter and element["type"] == "Title":
            chapter_ids[element["element_id"]] = chapter
            break

print(chapter_ids)

Doing this, you end up with a mapping of element IDs to chapter names:
{'a37f63b4dd470e0bc0d4d92e66758183': 'THE SUN-SEEKER',
'1766cdc7e0052527c77938d0a51d0495': 'RINKS AND SKATERS',
'99ff7a9efd2518b35b27b9bee204fc1d': 'TEES AND CRAMPITS',
'6cf4a015e8c188360ea9f02a9802269b': 'ICE-HOCKEY',
'f342e4727343974b173134e6931b9158': 'SKI-ING',
'a784c0efc6886ee75314b5fbb7b60ba0': 'NOTES ON WINTER RESORTS',
 'ecc5c7f65b2e27481f7a70508aa9a4c4': 'FOR PARENTS AND GUARDIANS'}

Once you have this mapping, you can view the text within a parent ID:
chapter_to_id = {v: k for k, v in chapter_ids.items()}
[x for x in resp.elements if x["metadata"].get("parent_id") == chapter_to_id["ICE-HOCKEY"]][0]

Which returns:
{'type': 'NarrativeText',
'element_id': 'c7c8e2f178cb0dc273ba7e811372640b',
'text': 'Many of the Swiss winter-resorts can put\ninto the field a very strong ice-hockey team, and fine teams from other\ncountries often make winter tours there; but the ice-hockey which the\nordinary winter visitor will be apt to join in will probably be of the\nmost elementary and unscientific kind indulged in, when the skating day\nis drawing to a close, by picked-up sides. As will be readily\nunderstood, the ice over which a hockey match has been played is\nperfectly useless for skaters any more that day until it has been swept,\nscraped, and sprinkled or flooded; and in consequence, at all Swiss\nresorts, with the exception of St. Moritz, where there is a rink that\nhas been made for the hockey-player, or when an important match is being\nplayed, this sport is supplementary to such others as I have spoken of.\nNobody, that is, plays hockey and nothing else, since he cannot play\nhockey at all till the greedy skaters have finished with the ice.',
'metadata': {'languages': ['eng'],
'parent_id': '6cf4a015e8c188360ea9f02a9802269b',
'filename': 'winter-sports.epub',
  'filetype': 'application/epub'}}

Hybrid Search with Vector Databases
Load Data to a Vector DB
The next step is to load the data into a vector database to perform a hybrid search. You set up your ChromaDB collection, passing parameters such as using cosine similarity in the vector space. Then you use your chapter mapping to insert documents into ChromaDB with chapter metadata:
# set up the chromadb client, persisting to disk and allowing resets
client = chromadb.PersistentClient(
    path="chroma_tmp",
    settings=chromadb.Settings(allow_reset=True),
)
client.reset()

Create a ChromaDB collection that uses cosine similarity for its search:
collection = client.create_collection(
    name="winter_sports",
    metadata={"hnsw:space": "cosine"}
)

Then load each element’s text into the collection, keyed by element ID and tagged with its chapter:
for element in resp.elements:
    # look up which chapter this element belongs to via its parent_id
    parent_id = element["metadata"].get("parent_id")
    chapter = chapter_ids.get(parent_id, "")
    collection.add(
        documents=[element["text"]],
        ids=[element["element_id"]],
        metadatas=[{"chapter": chapter}]
    )

Hybrid Search
Once the elements are uploaded with metadata, you can perform a hybrid search by setting a query text like "How many players are on a team?" and conditioning the search on the ICE-HOCKEY chapter. This limits results to that chapter, and you can then see the relevant information returned.
result = collection.query(
    query_texts=["How many players are on a team?"],
    n_results=2,
    where={"chapter": "ICE-HOCKEY"},
)
print(json.dumps(result, indent=2))

Which returns:
{
  "ids": [
    [
      "241221156e35865aa1715aa298bcc78d",
      "7a2340e355dc6059a061245db57f925b"
    ]
  ],
  "distances": [
    [
      0.5229758024215698,
      0.7836340665817261
    ]
  ],
  "metadatas": [
    [
      {
        "chapter": "ICE-HOCKEY"
      },
      {
        "chapter": "ICE-HOCKEY"
      }
    ]
  ],
  "embeddings": null,
  "documents": [
    [
      "It is a wonderful and delightful sight to watch the speed and\naccuracy of a ... ",
      "And in most places hockey is not taken very seriously ... "
    ]
  ],
  "uris": null,
  "data": null
}

Chunking
You’ve now learned about extracting metadata while preprocessing and using it for hybrid search. However, metadata is also useful for other operations, like chunking.
Chunking takes a large piece of text and breaks it into smaller pieces. This allows you to pass those into the vector database and include snippets in prompts for the LLM. There are two main reasons to do this:
- Some LLMs have a limited context window, so you can’t pass the full document.
- LLMs often cost more for larger context windows, so chunking into smaller pieces saves on inference costs.
Different chunking strategies produce different chunks, and better chunks generally result in better similarity search output and better LLM performance.
The simplest way to chunk is by fixed size (characters or tokens), starting a new chunk whenever you hit a size threshold (see the sketch below). By using the extracted metadata, however, you can chunk more intelligently using information about the document structure.
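As a minimal sketch of that naive approach (character-based, with an arbitrary chunk size):

def chunk_by_characters(text, chunk_size=500):
    # Naive fixed-size chunking: split every `chunk_size` characters,
    # ignoring sentence and section boundaries entirely.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

document_text = " ".join(element["text"] for element in resp.elements)
naive_chunks = chunk_by_characters(document_text)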
The process is:
- Load documents
- Preprocess using a tool like Unstructured
- Chunk the preprocessed elements
- Embed and load chunks into the vector database
- Extract relevant chunks for LLM prompts
When chunking by elements, you first split the document into atomic elements (titles, narrative text, lists, and so on). Then you combine those into chunks based on rules, such as starting a new chunk whenever you hit a title element. This groups content from the same section together into coherent chunks, as in the toy sketch below.
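Here is a toy sketch of that break-on-title rule (illustrative only; below you’ll use Unstructured’s built-in chunk_by_title instead):

def chunk_on_titles(elements):
    # Start a new chunk whenever a Title element appears, so each
    # chunk holds the content of exactly one section.
    chunks, current = [], []
    for element in elements:
        if element["type"] == "Title" and current:
            chunks.append(current)
            current = []
        current.append(element)
    if current:
        chunks.append(current)
    return chunks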
To use Unstructured’s built-in chunking, first convert the serialized JSON elements back into Element objects:

from unstructured.staging.base import dict_to_elements

elements = dict_to_elements(resp.elements)

Applying chunking strategies to the standardized element list allows rapid experimentation with different techniques to see what works best for your Large Language Model (LLM) application.
For example, traditional character-splitting chunking might split sections across chunks, so querying for “open domain question answering” would return irrelevant content about abstractive QA from the next section. But chunking by document elements and breaking on titles keeps all content for a given section together in the same chunk.
So when querying the vector database, you get back the relevant results in one coherent chunk, resulting in a better prompt for the LLM and better answer quality.
You can see this in practice by chunking the serialized JSON content from the publication example above using the chunk_by_title function from Unstructured.
from unstructured.chunking.title import chunk_by_title

chunks = chunk_by_title(
    elements,
    combine_text_under_n_chars=100,
    max_characters=3000,
)

This turns the original 752 elements into 255 chunks by combining small elements and breaking on titles.
print(len(elements), len(chunks))

Output:

752 255

Similar to the previous article, you can also upload your own local documents to the notebook; refer to the previous tutorial to initialize the required class.
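From here, you could load the chunks into a collection just as you did with the raw elements. A minimal sketch (the collection name is arbitrary, and each chunk’s position is used as its ID):

# Chunks returned by chunk_by_title are element objects with a .text attribute.
chunk_collection = client.create_collection(
    name="winter_sports_chunks",
    metadata={"hnsw:space": "cosine"},
)
for i, chunk in enumerate(chunks):
    chunk_collection.add(
        documents=[chunk.text],
        ids=[str(i)],  # ChromaDB IDs must be strings
    )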
Final Thoughts
In this article, we dived into how to extract metadata from documents and use it for hybrid search and intelligent chunking when building RAG applications.
Metadata like document structure and element types allows filtering search results and grouping related content into coherent chunks for better prompting of large language models.
Techniques like chunking by document sections using title elements keep relevant information together. This improves the quality of retrieved information passed to generative AI or language models.
In the next article, we will cover more complex pre-processing techniques for PDFs and images.