This article is the first of our series on LLM Applications with Multimodal Search and RAG. This series aims to provide software developers with an arsenal of skills to integrate various data types, including text, images, and videos.
Overview of Multimodality & Multimodal Search
Traditional search techniques fall short now that data takes many forms, such as images, audio, and video. This series will help you learn how to search across these data types by examining the ideas of multimodality and advanced search techniques.
You will learn about the basics and importance of multimodal data, how multimodal search works, and techniques for building and training models that handle different data types.
Definition of Multimodal Data
Multimodal data refers to information represented in many forms, such as text, images, audio, and video. Because so much of today's data is multimodal, traditional, single-modal search methods are inadequate for comprehensive information retrieval.
Modern data analysis and search techniques need to be able to handle and interpret a wide variety of data types to provide accurate and comprehensive results. The ability to process and analyze multimodal data has become important in fields such as artificial intelligence, data science, and information retrieval.
How Multimodality & Multimodal Search Work
Multimodal search combines various data types to provide more accurate and comprehensive search results. It uses models capable of understanding and processing different data types, such as images and text, to deliver a unified search experience.
Multimodal search enables the search engine to consider many forms of data, including visual and textual information, to enhance the relevance and richness of search results. It also offers a more nuanced and holistic understanding of the user’s query, leading to improved search accuracy and user satisfaction.
Models for Multimodal Embedding
Multimodal embedding models are powerful artificial intelligence tools that convert data from various modalities, such as images, text, and audio, into a common vector space. This enables the system to understand and compare different types of data, facilitating effective multimodal searches.
By bringing together information from different modalities, these models can enhance the accuracy and richness of data analysis, leading to better insights and decision-making in various applications such as image recognition, natural language processing, and multimedia retrieval.
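To make the idea of a shared vector space concrete, here is a small illustrative sketch. The three-dimensional vectors below are made up for illustration, not output from a real model: once text, images, and audio are embedded into the same space, cross-modal relevance reduces to a simple vector comparison such as cosine similarity.
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 means same direction, close to 0.0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings produced by a multimodal model (all in the same space)
text_vec  = np.array([0.12, 0.85, 0.31])  # caption: "a dog playing with a stick"
image_vec = np.array([0.10, 0.80, 0.35])  # photo of a dog with a stick
audio_vec = np.array([0.90, 0.05, 0.02])  # unrelated audio clip

print(cosine_similarity(text_vec, image_vec))  # high: same concept across modalities
print(cosine_similarity(text_vec, audio_vec))  # low: different content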
Contrastive Learning for Unifying Multimodal Models
Contrastive learning techniques integrate data from various modalities, such as text, images, audio, and video, ensuring that the embeddings capture the complex relationships between different types of data.
This unification process is critical for creating an effective multimodal search system that can retrieve relevant information across diverse data sources. By combining information from many modalities, the system can provide a more comprehensive understanding of the underlying data and improve the search experience for users.
Model Training
When training multimodal models, it’s essential to work with extensive datasets that contain many data types, such as images, text, and audio. These models are designed to learn and generate embeddings that capture the nuances of the content and context across different modalities.
By doing so, they can represent and understand the information present in the data, regardless of its modality. This capability is crucial for various applications, including image captioning, video analysis, and natural language processing, where understanding and integrating information from diverse sources is essential for accurate and comprehensive analysis.
Contrastive Loss Function
Remember that a contrastive loss function is often used in training multimodal models. It helps the model learn to differentiate between similar and dissimilar data points by minimizing the distance between embeddings of similar data and maximizing the distance between embeddings of dissimilar data.
This is particularly useful in tasks such as image and text matching, where the model needs to understand the relationships between different modalities. By optimizing the embeddings in this way, the model can better capture the similarities and differences between various types of data, leading to more accurate and robust multimodal representations.
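As a rough illustration of the idea (not the exact loss used by any particular model), the classic pairwise contrastive loss can be sketched in a few lines: similar pairs are pulled together, while dissimilar pairs are pushed at least a margin apart.
import numpy as np

def contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    # emb_a, emb_b: embeddings of two items (e.g., an image and a caption)
    # is_similar: True if the pair belongs together, False otherwise
    distance = np.linalg.norm(emb_a - emb_b)
    if is_similar:
        return distance ** 2                     # pull matching pairs together
    return max(0.0, margin - distance) ** 2      # push non-matching pairs at least `margin` apart

# Example: an image embedding and a matching / non-matching caption embedding (made-up values)
image_emb = np.array([0.2, 0.9, 0.1])
caption_emb = np.array([0.25, 0.85, 0.15])
print(contrastive_loss(image_emb, caption_emb, is_similar=True))    # small loss: already close
print(contrastive_loss(image_emb, -caption_emb, is_similar=False))  # zero loss: already farther apart than the margin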
Building a Multimodal Search Model
Step 1: Setting Up the Environment
First, we need to set up our environment by installing the necessary libraries and configuring API keys. Here’s the setup code:
# Ignore warnings for a cleaner output
import warnings
warnings.filterwarnings('ignore')
# Import necessary libraries
import os
from dotenv import load_dotenv, find_dotenv
import weaviate
# Load environment variables from a .env file
load_dotenv(find_dotenv()) # Locate and read local .env file
EMBEDDING_API_KEY = os.getenv("EMBEDDING_API_KEY") # Get the API key from environment variables
# Connect to an embedded Weaviate instance
client = weaviate.connect_to_embedded(
    version="1.24.4",  # Specify the Weaviate version
    environment_variables={
        "ENABLE_MODULES": "backup-filesystem,multi2vec-palm",  # Enable required modules
        "BACKUP_FILESYSTEM_PATH": "/home/jovyan/work/backups",  # Set backup path
    },
    headers={
        "X-PALM-Api-Key": EMBEDDING_API_KEY,  # Provide the API key for authentication
    }
)
# Check if the Weaviate client is ready
client.is_ready()
Step 2: Connect to Weaviate and Create a Collection
Weaviate is a vector search engine that integrates and searches various data modalities, including text, images, and videos. Using Weaviate, you can create a robust multimodal search system that stores and retrieves data based on its content and context. In this step, you will connect to Weaviate and create a collection to store our multimodal data:
from weaviate.classes.config import Configure
# Delete existing collection if it exists
if client.collections.exists("Animals"):
    client.collections.delete("Animals")  # Remove the existing collection to avoid conflicts

# Create a new collection named "Animals" with specific configurations
client.collections.create(
    name="Animals",  # Name of the new collection
    vectorizer_config=Configure.Vectorizer.multi2vec_palm(
        image_fields=["image"],  # Specify which fields will contain images
        video_fields=["video"],  # Specify which fields will contain videos
        project_id="semi-random-dev",  # Set the project ID
        location="us-central1",  # Specify the location
        model_id="multimodalembedding@001",  # Model ID for multimodal embedding
        dimensions=1408,  # Dimensions of the embedding vector
    )
)
Step 3: Upload Images to Weaviate
We then upload the image files to the Weaviate collection:
import base64
# Convert a file to base64 representation
def to_base64(path):
    with open(path, 'rb') as file:
        return base64.b64encode(file.read()).decode('utf-8')

# Get the "Animals" collection from Weaviate
animals = client.collections.get("Animals")

# Get the list of image files from the specified directory
image_source = os.listdir("./source/image/")

# Iterate over each image file and add it to the Weaviate collection
with animals.batch.rate_limit(requests_per_minute=100) as batch:
    for name in image_source:
        print(f"Adding {name}")
        path = f"./source/image/{name}"
        # Add the object to the batch for insertion into Weaviate
        batch.add_object({
            "name": name,  # Name of the object
            "path": path,  # Path to the image file
            "image": to_base64(path),  # Base64 representation of the image
            "mediaType": "image",  # Specify the media type as "image"
        })

# Check for failed objects during insertion
if animals.batch.failed_objects:
    print(f"Failed to import {len(animals.batch.failed_objects)} objects")
    for failed in animals.batch.failed_objects:
        print(f"e.g. Failed to import object with error: {failed.message}")
else:
    print("No errors")  # Print success message if no errors occurred
Step 4: Upload Video Files to Weaviate
Similarly, we upload video files to Weaviate:
video_source = os.listdir("./source/video/")  # Get the list of video files from the specified directory

# Iterate over each video file and add it to the Weaviate collection.
# Each video is inserted individually; data.insert raises an exception if an insert fails,
# so no batch error check is needed here.
for name in video_source:
    print(f"Adding {name}")
    path = f"./source/video/{name}"
    # Add the video object to the Weaviate collection
    animals.data.insert({
        "name": name,  # Name of the object
        "path": path,  # Path to the video file
        "video": to_base64(path),  # Base64 representation of the video
        "mediaType": "video"  # Specify the media type as "video"
    })
Step 5: Text-to-Media Search
Now we build a text-to-media search. The query loops in this and the following steps print each match and render it with a display_media helper, sketched below:
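The display_media helper is not included in the article's listing; the version below is a minimal sketch, assuming each object carries the path and mediaType properties defined above and that the media files are available locally:
from IPython.display import Image, Video, display

def display_media(item):
    # Render a search result inline in the notebook based on its stored mediaType
    path = item["path"]
    if item["mediaType"] == "image":
        display(Image(path, width=300))
    elif item["mediaType"] == "video":
        display(Video(path, width=300))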
import json

# Perform a multimodal search based on a text query
response = animals.query.near_text(
    query="dog playing with stick",  # Text query to search for
    return_properties=['name', 'path', 'mediaType'],  # Properties to return in the search results
    limit=3  # Limit the number of search results to 3
)

# Iterate over each object in the search response
for obj in response.objects:
    # Print the properties of each object in a readable format
    print(json.dumps(obj.properties, indent=2))
    # Display the media associated with each object (e.g., image or video)
    display_media(obj.properties)
Step 6: Image-to-Media Search
For image-to-media search, we use the following code:
from IPython.display import Image
# Display the image
Image("./test/test-cat.jpg", width=300)
# Perform a multimodal search based on an image
response = animals.query.near_image(
    near_image=to_base64("./test/test-cat.jpg"),  # Convert the image file to its base64 representation
    return_properties=['name', 'path', 'mediaType'],  # Properties to return in the search results
    limit=3  # Limit the number of search results to 3
)

# Iterate over each object in the search response
for obj in response.objects:
    # Print the properties of each object in a readable format
    print(json.dumps(obj.properties, indent=2))
    # Display the media associated with each object (e.g., image or video)
    display_media(obj.properties)
Step 7: Search for Pictures with a Web URL
We can also search using an image URL. The query below relies on a url_to_base64 helper to download the image and encode it; a minimal sketch follows:
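url_to_base64 is not defined in the article's listing; this sketch assumes the requests library is installed:
import base64
import requests

def url_to_base64(url):
    # Download the image and return its contents as a base64 string
    response = requests.get(url)
    response.raise_for_status()
    return base64.b64encode(response.content).decode('utf-8')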
# Display the image from the URL
Image("https://raw.githubusercontent.com/weaviate-tutorials/multimodal-workshop/main/2-multimodal/test/test-meerkat.jpg", width=300)
# Perform a multimodal search based on an image URL
response = animals.query.near_image(
    near_image=url_to_base64("https://raw.githubusercontent.com/weaviate-tutorials/multimodal-workshop/main/2-multimodal/test/test-meerkat.jpg"),  # Fetch the image and convert it to base64
    return_properties=['name', 'path', 'mediaType'],  # Properties to return in the search results
    limit=3  # Limit the number of search results to 3
)

# Iterate over each object in the search response
for obj in response.objects:
    # Print the properties of each object in a readable format
    print(json.dumps(obj.properties, indent=2))
    # Display the media associated with each object (e.g., image or video)
    display_media(obj.properties)
Step 8: Video-to-Media Search
For video-to-media search, use this code:
from IPython.display import Video
# Display the video
Video("./test/test-meerkat.mp4", width=400)
from weaviate.classes.query import NearMediaType
# Perform a multimodal search based on a video
response = animals.query.near_media(
    media=to_base64("./test/test-meerkat.mp4"),  # Convert the video file to its base64 representation
    media_type=NearMediaType.VIDEO,  # Specify the media type as video
    return_properties=['name', 'path', 'mediaType'],  # Properties to return in the search results
    limit=3  # Limit the number of search results to 3
)

# Iterate over each object in the search response
for obj in response.objects:
    # Display the media associated with each object (e.g., image or video)
    display_media(obj.properties)
Step 9: Visualizing a Multimodal Vector Space
Load Vector Embeddings and MediaType from Weaviate
To visualize the vector space, we first restore a prepared backup of a larger "Resources" collection and load its embeddings and media-type labels:
# Restore a backup with the specified parameters
client.backup.restore(
    backup_id="resources-img-and-vid",  # ID of the backup to restore
    include_collections="Resources",  # Restore only the "Resources" collection
    backend="filesystem"  # Specify the backend as filesystem
)

# Wait for the "Resources" collection to be ready (may take a few seconds)
import time
time.sleep(5)

# Get the "Resources" collection
collection = client.collections.get("Resources")

# Initialize lists to store embeddings and labels
embs = []
labs = []

# Iterate over each item in the collection and retrieve vector embeddings
for item in collection.iterator(include_vector=True):
    labs.append(item.properties['mediaType'])  # Append the media type label
    embs.append(item.vector)  # Append the vector embedding

# Extract the default (unnamed) vector from each item
embs2 = [emb['default'] for emb in embs]

import pandas as pd

# Create a DataFrame to store embeddings
emb_df = pd.DataFrame(embs2)
# Create a Series to store labels
labels = pd.Series(labs)
Step 10: Embedding Plotting
We then plot the embeddings using UMAP:
import umap
import umap.plot
import matplotlib.pyplot as plt
# Replace labels with numerical values for plotting
labels.replace({'image': 0, 'video': 1}, inplace=True)
# Fit UMAP to the embedding data and plot the points with the specified labels and theme
mapper2 = umap.UMAP().fit(emb_df)
ax = umap.plot.points(mapper2, labels=labels, theme='fire')  # returns the matplotlib Axes it draws on

# Set the title and axis labels for the plot
ax.set_title('UMAP Visualization of Embedding Space')
ax.set_xlabel('UMAP Dimension 1')
ax.set_ylabel('UMAP Dimension 2')
plt.show()  # Display the plot
Step 11: Interactive Plot of Vectors
For an interactive plot of the vectors:
# Enable output to Jupyter Notebook for UMAP plots
umap.plot.output_notebook()
# Create an interactive UMAP plot
p = umap.plot.interactive(mapper2, labels=labels, theme='fire')
# Show the interactive plot
umap.plot.show(p)
# Close the connection to Weaviate
client.close()
Final Thoughts
Multimodal search finds information by combining different types of data. This article covered the basics of multimodal data, how multimodal search works, how to build a multimodal search model, and its real-world uses. By leveraging many modes of input, such as text, images, audio, and video, multimodal search helps us create richer and more intuitive search experiences that bring together different kinds of information.
By integrating these various data types, multimodal search aims to provide more comprehensive and accurate results, improve the user experience, and enable more effective information retrieval. Additionally, multimodal search can revolutionize various industries, such as e-commerce, healthcare, education, and entertainment, by enabling more advanced and efficient search capabilities.
In the next article, we will delve deeper into the technology behind these advancements. We'll start by exploring the foundations of large language models and how they integrate with multimodal capabilities to form Large Multimodal Models (LMMs). Stay tuned to understand the intricate workings of these powerful models and their practical applications.
The full code is available in this Colab notebook: https://colab.research.google.com/drive/11suo_OqVYyH_xAiFF6eKC9_rkd2loFWN?usp=sharing