In this series, we previously discussed the advanced capabilities of Multimodal Retrieval-Augmented Generation (RAG) and its potential to create sophisticated AI systems by integrating various data types. Now, we turn our attention to the roles of search and recommendation, which serve distinct yet complementary purposes.
Search is objective, retrieving items that match a query without considering user preferences, while recommendation is subjective, suggesting items based on user preferences and past interactions.
By combining these approaches through a multimodal system that uses various data types like text and images, we can enhance personalization. Creating vector representations for each modality allows the system to handle both search and recommendation seamlessly.
For instance, a query like “cute pet movies” can return relevant results based on descriptions or images. Integrating user preferences into search algorithms provides personalized recommendations, blending precise information retrieval with individualized suggestions for a more robust user experience.
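As a minimal sketch of that last idea, the snippet below shows one way a query embedding could be blended with a user-preference embedding before running a nearest-neighbour search. The function name, the 0.7/0.3 weighting, and the random vectors are purely illustrative assumptions; they are not part of the Weaviate example built later in this post.
import numpy as np

def personalized_query_vector(query_vec, user_pref_vec, alpha=0.7):
    # Weighted blend of the query embedding and the user's preference embedding
    blended = alpha * query_vec + (1 - alpha) * user_pref_vec
    # Re-normalize so the blended vector can be used for cosine-similarity search
    return blended / np.linalg.norm(blended)

# Toy usage: in practice both vectors would come from the same embedding model
query_vec = np.random.rand(1408)      # e.g. embedding of "cute pet movies"
user_pref_vec = np.random.rand(1408)  # e.g. mean embedding of movies the user liked
search_vec = personalized_query_vector(query_vec, user_pref_vec)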
Definition of a Multimodal Recommender System
A multimodal recommender system is a recommendation system that suggests items by leveraging multiple forms of data, such as text, images, audio, video, and other multimedia content.
By utilizing multiple modalities to analyze user preferences and item features, this approach provides a more comprehensive and personalized recommendation experience.
This means that the system can take into account not only what users say they like or dislike, but also their interactions with different types of content, such as watching videos, listening to audio, reading text, or viewing images, to make recommendations that align with their preferences and habits.
In today’s highly interconnected digital world, where users consume content in various formats and across multiple devices, a multimodal recommender system can provide more accurate and relevant recommendations by considering the diverse ways in which users interact with and consume digital media.
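To make this concrete, here is a small, purely illustrative sketch of how such a system might represent an item's modalities and fold a user's interactions into a single preference vector. The dataclass fields and the interaction weights are assumptions for illustration, not part of the example built below.
from dataclasses import dataclass
import numpy as np

@dataclass
class MultimodalItem:
    item_id: str
    text_embedding: np.ndarray   # e.g. from a title and overview
    image_embedding: np.ndarray  # e.g. from a poster or thumbnail

def user_profile(interactions):
    # interactions: list of (MultimodalItem, weight) pairs, where the weight
    # reflects the strength of the signal (e.g. 1.0 = watched, 0.3 = clicked)
    vectors = [w * np.concatenate([item.text_embedding, item.image_embedding])
               for item, w in interactions]
    profile = np.mean(vectors, axis=0)
    # Normalize so the profile can be compared against item vectors directly
    return profile / np.linalg.norm(profile)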
Building a Multimodal Recommender System
Setup and Connection to Weaviate
First, we set up our environment and connect to Weaviate, a vector database that supports various embeddings and multimodal data.
import warnings
warnings.filterwarnings("ignore")
# Load environment variables and API keys
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
MM_EMBEDDING_API_KEY = os.getenv("EMBEDDING_API_KEY")
TEXT_EMBEDDING_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_BASEURL = os.getenv("OPENAI_BASE_URL")
# Connect to Weaviate
import weaviate
client = weaviate.connect_to_embedded(
version="1.24.4",
environment_variables={
"ENABLE_MODULES": "multi2vec-palm,text2vec-openai"
},
headers={
"X-PALM-Api-Key": MM_EMBEDDING_API_KEY,
"X-OpenAI-Api-Key": TEXT_EMBEDDING_API_KEY,
"X-OpenAI-BaseURL": OPENAI_BASEURL
}
)
# Check if the client is ready
client.is_ready()
Creating a Multivector Collection
We create a collection named “Movies” with properties such as title, overview, vote_average, release_year, tmdb_id, poster, and poster_path. Additionally, we configure two named vector spaces, one for text-based and one for image-based semantic search.
from weaviate.classes.config import Configure, DataType, Property
# Create the Movies collection
client.collections.create(
name="Movies",
properties=[
Property(name="title", data_type=DataType.TEXT),
Property(name="overview", data_type=DataType.TEXT),
Property(name="vote_average", data_type=DataType.NUMBER),
Property(name="release_year", data_type=DataType.INT),
Property(name="tmdb_id", data_type=DataType.INT),
Property(name="poster", data_type=DataType.BLOB),
Property(name="poster_path", data_type=DataType.TEXT),
],
# Define & configure the vector spaces
vectorizer_config=[
# Vectorize the movie title and overview - for text-based semantic search
Configure.NamedVectors.text2vec_openai(
name="txt_vector", # the name of the txt vector space
source_properties=["title", "overview"], # text properties to be used for vectorization
),
# Vectorize the movie poster - for image-based semantic search
Configure.NamedVectors.multi2vec_palm(
name="poster_vector", # the name of the image vector space
image_fields=["poster"], # use poster property for multivec vectorization
project_id="semi-random-dev",
location="us-central1",
model_id="multimodalembedding@001",
dimensions=1408,
),
]
)
Data Upload
We load the movie data from a JSON file and prepare it for uploading to the Weaviate database.
import pandas as pd
df = pd.read_json("movies_data.json")
df.head()
Helper Function
We define a helper function to convert image files to base64 representation, which is required for storing image data in Weaviate.
import base64
# Helper function to convert a file to base64 representation
def toBase64(path):
    with open(path, 'rb') as file:
        return base64.b64encode(file.read()).decode('utf-8')
Introduction of Text and Image Data
We iterate over the movie data and add each movie, along with its poster image, to the Weaviate collection.
from weaviate.util import generate_uuid5
movies = client.collections.get("Movies")
with movies.batch.rate_limit(20) as batch:
    for index, movie in df.iterrows():
        # Skip movies that are already in the database
        if movies.data.exists(generate_uuid5(movie.id)):
            print(f'{index}: Skipping insert. The movie "{movie.title}" is already in the database.')
            continue
        print(f'{index}: Adding "{movie.title}"')
        # Construct the path to the poster image file
        poster_path = f"./posters/{movie.id}_poster.jpg"
        # Generate base64 representation of the poster
        posterb64 = toBase64(poster_path)
        # Build the object payload
        movie_obj = {
            "title": movie.title,
            "overview": movie.overview,
            "vote_average": movie.vote_average,
            "tmdb_id": movie.id,
            "poster_path": poster_path,
            "poster": posterb64
        }
        # Add object to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie.id),
        )
# Check for failed objects
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")
    for failed in movies.batch.failed_objects:
        print(f"e.g. Failed to import object with error: {failed.message}")
else:
    print("Import complete with no errors")
Text Search Using the Text Vector
We perform text-based searches using the text vector space.
from IPython.display import Image, display
# Perform a text search
response = movies.query.near_text(
query="Movie about lovable cute pets",
target_vector="txt_vector", # Search in the txt_vector space
limit=3,
)
# Inspect the response
for item in response.objects:
    print(item.properties["title"])
    print(item.properties["overview"])
    display(Image(item.properties["poster_path"], width=200))
# Perform another text search
response = movies.query.near_text(
query="Epic super hero",
target_vector="txt_vector", # Search in the txt_vector space
limit=3,
)
# Inspect the response
for item in response.objects:
    print(item.properties["title"])
    print(item.properties["overview"])
    display(Image(item.properties["poster_path"], width=200))
Text Searches Within the Poster Vector Space
We now run the same text queries against the poster vector space. The query text is embedded into the multimodal space, so matches are based on the poster images rather than the title and overview.
# Perform a text search in the poster vector space
response = movies.query.near_text(
query="Movie about lovable cute pets",
target_vector="poster_vector", # Search in the poster_vector space
limit=3,
)
# Inspect the response
for item in response.objects:
    print(item.properties["title"])
    print(item.properties["overview"])
    display(Image(item.properties["poster_path"], width=200))
# Perform another text search in the poster vector space
response = movies.query.near_text(
query="Epic super hero",
target_vector="poster_vector", # Search in the poster_vector space
limit=3,
)
# Inspect the response
for item in response.objects:
    print(item.properties["title"])
    print(item.properties["overview"])
    display(Image(item.properties["poster_path"], width=200))
Image Search Through the Poster Vector Space
We perform image-based searches using a sample image to find similar movie posters.
# Load a test image
Image("test/spooky.jpg", width=300)
# Perform an image search
response = movies.query.near_image(
near_image=toBase64("test/spooky.jpg"),
target_vector="poster_vector", # Search in the poster_vector space
limit=3,
)
# Inspect the response
for item in response.objects:
    print(item.properties["title"])
    display(Image(item.properties["poster_path"], width=200))
# Load another test image
Image("test/superheroes.png", width=300)
# Perform another image search
response = movies.query.near_image(
near_image=toBase64("test/superheroes.png"),
target_vector="poster_vector", # Search in the poster_vector space
limit=3,
)
# Inspect the response
for item in response.objects:
    print(item.properties["title"])
    display(Image(item.properties["poster_path"], width=200))
Final Thoughts
In this example, we demonstrated how to build a multimodal recommender system using Weaviate. This system leverages text and image data to provide comprehensive and personalized recommendations.
By capturing different modalities and embedding them into vector spaces, we can perform both text and image-based searches, enhancing the recommendation process. This approach allows for a more holistic understanding of user preferences and interests, resulting in more accurate and relevant recommendations.
Furthermore, the multimodal nature of the system enables it to accommodate diverse types of data, expanding its applicability across various domains such as e-commerce, entertainment, and content recommendation platforms.
By integrating text and image modalities, the system can better capture the nuances and context of user preferences, leading to a more immersive and tailored recommendation experience.
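As a closing illustration, one simple way to combine the two vector spaces built above into a single ranked list is to run the same text query against both txt_vector and poster_vector and fuse the results. The sketch below uses reciprocal rank fusion on the client side; the function name and the k constant are illustrative assumptions, not a built-in feature of the example above.
def multimodal_recommend(movies, query, limit=5, k=60):
    # Query both named vector spaces and merge the result lists with
    # reciprocal rank fusion: items ranked highly in either space score well
    scores = {}
    for space in ("txt_vector", "poster_vector"):
        response = movies.query.near_text(
            query=query,
            target_vector=space,
            limit=limit,
        )
        for rank, item in enumerate(response.objects):
            title = item.properties["title"]
            scores[title] = scores.get(title, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:limit]

# Example: movies that match "lovable cute pets" by description or by poster
for title, score in multimodal_recommend(movies, "Movie about lovable cute pets"):
    print(f"{score:.4f}  {title}")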