This article is the first of our series on LLM Applications with Multimodal Search and RAG. This series aims to provide software developers with an arsenal of skills to integrate various data types, including text, images, and videos.
Overview of Multimodality & Multimodal Search
Traditional search techniques fall short now that data takes many forms, such as images, audio, and video. This series will help you learn how to search across these data types by examining the ideas of multimodality and advanced search techniques.
You will learn about the basics and importance of multimodal data, how multimodal search works, and techniques for building and training models that handle different data types.
Definition of Multimodal Data
Multimodal data refers to information represented in many forms, such as text, images, audio, and video. Because so much of today's data is multimodal, traditional, single-modal search methods are inadequate for comprehensive information retrieval.
Modern data analysis and search techniques need to be able to handle and interpret a wide variety of data types to provide accurate and comprehensive results. The ability to process and analyze multimodal data has become important in fields such as artificial intelligence, data science, and information retrieval.
How Multimodality & Multimodal Search Work
Multimodal search combines various data types to provide more accurate and comprehensive search results. It uses models capable of understanding and processing different data types, such as images and text, to deliver a unified search experience.
Multimodal search enables the search engine to consider many forms of data, including visual and textual information, to enhance the relevance and richness of search results. It also offers a more nuanced and holistic understanding of the user’s query, leading to improved search accuracy and user satisfaction.
Models for Multimodal Embedding
Multimodal embedding models are powerful artificial intelligence tools that convert data from various modalities, such as images, text, and audio, into a common vector space. This enables the system to understand and compare different types of data, facilitating effective multimodal searches.
By bringing together information from different modalities, these models can enhance the accuracy and richness of data analysis, leading to better insights and decision-making in various applications such as image recognition, natural language processing, and multimedia retrieval.
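To make the idea of a shared vector space concrete, here is a small illustrative sketch. The three-dimensional vectors below are made up for illustration, not output from a real model: once text, images, and audio are embedded into the same space, cross-modal relevance reduces to a simple vector comparison such as cosine similarity.
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 means same direction, close to 0.0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings produced by a multimodal model (all in the same space)
text_vec  = np.array([0.12, 0.85, 0.31])  # caption: "a dog playing with a stick"
image_vec = np.array([0.10, 0.80, 0.35])  # photo of a dog with a stick
audio_vec = np.array([0.90, 0.05, 0.02])  # unrelated audio clip

print(cosine_similarity(text_vec, image_vec))  # high: same concept across modalities
print(cosine_similarity(text_vec, audio_vec))  # low: different content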
Contrastive Learning for Unifying Multimodal Models
Contrastive learning techniques integrate data from various modalities, such as text, images, audio, and video, ensuring that the embeddings capture the complex relationships between different types of data.
This unification process is critical for creating an effective multimodal search system that can retrieve relevant information across diverse data sources. By combining information from many modalities, the system can provide a more comprehensive understanding of the underlying data and improve the search experience for users.
Model Training
When training multimodal models, it’s essential to work with extensive datasets that contain many data types, such as images, text, and audio. These models are designed to learn and generate embeddings that capture the nuances of the content and context across different modalities.
By doing so, they can represent and understand the information present in the data, regardless of its modality. This capability is crucial for various applications, including image captioning, video analysis, and natural language processing, where understanding and integrating information from diverse sources is essential for accurate and comprehensive analysis.
Contrastive Loss Function
Remember that a contrastive loss function is often used in training multimodal models. It helps the model learn to differentiate between similar and dissimilar data points by minimizing the distance between embeddings of similar data and maximizing the distance between embeddings of dissimilar data.
This is particularly useful in tasks such as image and text matching, where the model needs to understand the relationships between different modalities. By optimizing the embeddings in this way, the model can better capture the similarities and differences between various types of data, leading to more accurate and robust multimodal representations.
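As a rough illustration of the idea (not the exact loss used by any particular model), the classic pairwise contrastive loss can be sketched in a few lines: similar pairs are pulled together, while dissimilar pairs are pushed at least a margin apart.
import numpy as np

def contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    # emb_a, emb_b: embeddings of two items (e.g., an image and a caption)
    # is_similar: True if the pair belongs together, False otherwise
    distance = np.linalg.norm(emb_a - emb_b)
    if is_similar:
        return distance ** 2                     # pull matching pairs together
    return max(0.0, margin - distance) ** 2      # push non-matching pairs at least `margin` apart

# Example: an image embedding and a matching / non-matching caption embedding (made-up values)
image_emb = np.array([0.2, 0.9, 0.1])
caption_emb = np.array([0.25, 0.85, 0.15])
print(contrastive_loss(image_emb, caption_emb, is_similar=True))    # small loss: already close
print(contrastive_loss(image_emb, -caption_emb, is_similar=False))  # zero loss: already farther apart than the margin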
Building a Multimodal Search Model
Step 1: Setting Up the Environment
First, we need to set up our environment by installing the necessary libraries and configuring API keys. Here’s the setup code:
# Ignore warnings for a cleaner output
import warnings
warnings.filterwarnings('ignore')
# Import necessary libraries
import os
from dotenv import load_dotenv, find_dotenv
import weaviate
# Load environment variables from a .env file
load_dotenv(find_dotenv()) # Locate and read local .env file
EMBEDDING_API_KEY = os.getenv("EMBEDDING_API_KEY") # Get the API key from environment variables
# Connect to an embedded Weaviate instance
client = weaviate.connect_to_embedded(
    version="1.24.4",  # Specify the Weaviate version
    environment_variables={
        "ENABLE_MODULES": "backup-filesystem,multi2vec-palm",  # Enable required modules
        "BACKUP_FILESYSTEM_PATH": "/home/jovyan/work/backups",  # Set backup path
    },
    headers={
        "X-PALM-Api-Key": EMBEDDING_API_KEY,  # Provide the API key for authentication
    }
)
# Check if the Weaviate client is ready
client.is_ready()
Step 2: Connect to Weaviate and Create a Collection
Weaviate is a vector search engine that integrates and searches various data modalities, including text, images, and videos. Using Weaviate, you can create a robust multimodal search system that stores and retrieves data based on its content and context. In this step, you will connect to Weaviate and create a collection to store our multimodal data:
from weaviate.classes.config import Configure
# Delete existing collection if it exists
if client.collections.exists("Animals"):
    client.collections.delete("Animals")  # Remove the existing collection to avoid conflicts

# Create a new collection named "Animals" with specific configurations
client.collections.create(
    name="Animals",  # Name of the new collection
    vectorizer_config=Configure.Vectorizer.multi2vec_palm(
        image_fields=["image"],  # Specify which fields will contain images
        video_fields=["video"],  # Specify which fields will contain videos
        project_id="semi-random-dev",  # Set the project ID
        location="us-central1",  # Specify the location
        model_id="multimodalembedding@001",  # Model ID for multimodal embedding
        dimensions=1408,  # Dimensions of the embedding vector
    )
)
Step 3: Upload Images to Weaviate
We then upload the image files to the Weaviate collection:
import base64
# Convert a file to base64 representation
def to_base64(path):
    with open(path, 'rb') as file:
        return base64.b64encode(file.read()).decode('utf-8')

# Get the "Animals" collection from Weaviate
animals = client.collections.get("Animals")

# Get the list of image files from the specified directory
image_source = os.listdir("./source/image/")

# Iterate over each image file and add it to the Weaviate collection
with animals.batch.rate_limit(requests_per_minute=100) as batch:
    for name in image_source:
        print(f"Adding {name}")
        path = f"./source/image/{name}"
        # Add the object to the batch for insertion into Weaviate
        batch.add_object({
            "name": name,  # Name of the object
            "path": path,  # Path to the image file
            "image": to_base64(path),  # Base64 representation of the image
            "mediaType": "image",  # Specify the media type as "image"
        })

# Check for failed objects during insertion
if animals.batch.failed_objects:
    print(f"Failed to import {len(animals.batch.failed_objects)} objects")
    for failed in animals.batch.failed_objects:
        print(f"e.g. Failed to import object with error: {failed.message}")
else:
    print("No errors")  # Print success message if no errors occurred
Step 4: Upload Video Files to Weaviate
Similarly, we upload video files to Weaviate:
video_source = os.listdir("./source/video/")  # Get the list of video files from the specified directory

# Iterate over each video file and add it to the Weaviate collection.
# Each video is inserted individually; data.insert raises an exception if an insert fails,
# so no batch error check is needed here.
for name in video_source:
    print(f"Adding {name}")
    path = f"./source/video/{name}"
    # Add the video object to the Weaviate collection
    animals.data.insert({
        "name": name,  # Name of the object
        "path": path,  # Path to the video file
        "video": to_base64(path),  # Base64 representation of the video
        "mediaType": "video"  # Specify the media type as "video"
    })
Step 5: Text-to-Media Search
Now we build a text-to-media search. The query loops in this and the following steps print each match and render it with a display_media helper, sketched below:
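The display_media helper is not included in the article's listing; the version below is a minimal sketch, assuming each object carries the path and mediaType properties defined above and that the media files are available locally:
from IPython.display import Image, Video, display

def display_media(item):
    # Render a search result inline in the notebook based on its stored mediaType
    path = item["path"]
    if item["mediaType"] == "image":
        display(Image(path, width=300))
    elif item["mediaType"] == "video":
        display(Video(path, width=300))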
import json

# Perform a multimodal search based on a text query
response = animals.query.near_text(
    query="dog playing with stick",  # Text query to search for
    return_properties=['name', 'path', 'mediaType'],  # Properties to return in the search results
    limit=3  # Limit the number of search results to 3
)

# Iterate over each object in the search response
for obj in response.objects:
    # Print the properties of each object in a readable format
    print(json.dumps(obj.properties, indent=2))
    # Display the media associated with each object (e.g., image or video)
    display_media(obj.properties)
Step 6: Image-to-Media Search
For image-to-media search, we use the following code:
from IPython.display import Image
# Display the image
Image("./test/test-cat.jpg", width=300)
# Perform a multimodal search based on an image
response = animals.query.near_image(
    near_image=to_base64("./test/test-cat.jpg"),  # Convert the image file to its base64 representation
    return_properties=['name', 'path', 'mediaType'],  # Properties to return in the search results
    limit=3  # Limit the number of search results to 3
)

# Iterate over each object in the search response
for obj in response.objects:
    # Print the properties of each object in a readable format
    print(json.dumps(obj.properties, indent=2))
    # Display the media associated with each object (e.g., image or video)
    display_media(obj.properties)
Step 7: Search for Pictures with a Web URL
We can also search using an image URL. The query below relies on a url_to_base64 helper to download the image and encode it; a minimal sketch follows:
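url_to_base64 is not defined in the article's listing; this sketch assumes the requests library is installed:
import base64
import requests

def url_to_base64(url):
    # Download the image and return its contents as a base64 string
    response = requests.get(url)
    response.raise_for_status()
    return base64.b64encode(response.content).decode('utf-8')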
# Display the image from the URL
Image("https://raw.githubusercontent.com/weaviate-tutorials/multimodal-workshop/main/2-multimodal/test/test-meerkat.jpg", width=300)
# Perform a multimodal search based on an image URL
response = animals.query.near_image(
    near_image=url_to_base64("https://raw.githubusercontent.com/weaviate-tutorials/multimodal-workshop/main/2-multimodal/test/test-meerkat.jpg"),  # Fetch the image and convert it to base64
    return_properties=['name', 'path', 'mediaType'],  # Properties to return in the search results
    limit=3  # Limit the number of search results to 3
)

# Iterate over each object in the search response
for obj in response.objects:
    # Print the properties of each object in a readable format
    print(json.dumps(obj.properties, indent=2))
    # Display the media associated with each object (e.g., image or video)
    display_media(obj.properties)
Step 8: Video-to-Media Search
For video-to-media search, use this code:
from IPython.display import Video
# Display the video
Video("./test/test-meerkat.mp4", width=400)
from weaviate.classes.query import NearMediaType
# Perform a multimodal search based on a video
response = animals.query.near_media(
    media=to_base64("./test/test-meerkat.mp4"),  # Convert the video file to its base64 representation
    media_type=NearMediaType.VIDEO,  # Specify the media type as video
    return_properties=['name', 'path', 'mediaType'],  # Properties to return in the search results
    limit=3  # Limit the number of search results to 3
)

# Iterate over each object in the search response
for obj in response.objects:
    # Display the media associated with each object (e.g., image or video)
    display_media(obj.properties)
Step 9: Visualizing a Multimodal Vector Space
Load Vector Embeddings and MediaType from Weaviate
To visualize the vector space, we first restore a prepared backup of a larger "Resources" collection and load its embeddings and media-type labels:
# Restore a backup with the specified parameters
client.backup.restore(
    backup_id="resources-img-and-vid",  # ID of the backup to restore
    include_collections="Resources",  # Restore only the "Resources" collection
    backend="filesystem"  # Specify the backend as filesystem
)

# Wait for the "Resources" collection to be ready (may take a few seconds)
import time
time.sleep(5)

# Get the "Resources" collection
collection = client.collections.get("Resources")

# Initialize lists to store embeddings and labels
embs = []
labs = []

# Iterate over each item in the collection and retrieve vector embeddings
for item in collection.iterator(include_vector=True):
    labs.append(item.properties['mediaType'])  # Append the media type label
    embs.append(item.vector)  # Append the vector embedding

# Extract the default (unnamed) vector from each item
embs2 = [emb['default'] for emb in embs]

import pandas as pd

# Create a DataFrame to store embeddings
emb_df = pd.DataFrame(embs2)
# Create a Series to store labels
labels = pd.Series(labs)
Step 10: Embedding Plotting
We then plot the embeddings using UMAP:
import umap
import umap.plot
import matplotlib.pyplot as plt
# Replace labels with numerical values for plotting
labels.replace({'image': 0, 'video': 1}, inplace=True)
# Fit UMAP to the embedding data and plot the points with the specified labels and theme
mapper2 = umap.UMAP().fit(emb_df)
ax = umap.plot.points(mapper2, labels=labels, theme='fire')  # returns the matplotlib Axes it draws on

# Set the title and axis labels for the plot
ax.set_title('UMAP Visualization of Embedding Space')
ax.set_xlabel('UMAP Dimension 1')
ax.set_ylabel('UMAP Dimension 2')
plt.show()  # Display the plot
Step 11: Interactive Plot of Vectors
For an interactive plot of the vectors:
# Enable output to Jupyter Notebook for UMAP plots
umap.plot.output_notebook()
# Create an interactive UMAP plot
p = umap.plot.interactive(mapper2, labels=labels, theme='fire')
# Show the interactive plot
umap.plot.show(p)
# Close the connection to Weaviate
client.close()
Final Thoughts
Multimodal search finds information by combining different types of data. This article covered the basics of multimodal data, how multimodal search works, how to build a multimodal search model, and its real-world uses. By leveraging many modes of input, such as text, images, audio, and video, multimodal search helps us create richer and more intuitive search experiences that bring together different kinds of information.
By integrating these various data types, multimodal search aims to provide more comprehensive and accurate results, improve the user experience, and enable more effective information retrieval. Additionally, multimodal search can revolutionize various industries, such as e-commerce, healthcare, education, and entertainment, by enabling more advanced and efficient search capabilities.
In the next article, we will delve deeper into the technology behind these advancements. We'll start by exploring the foundations of large language models and how they integrate with multimodal capabilities to form Large Multimodal Models (LMMs). Stay tuned to understand the intricate workings of these powerful models and their practical applications.
The full code is available in this Colab notebook: https://colab.research.google.com/drive/11suo_OqVYyH_xAiFF6eKC9_rkd2loFWN?usp=sharing