Previously, we explored the foundations and functionality of Large Multimodal Models, emphasizing their integration of text and image data. Now, let's move on to Multimodal Retrieval-Augmented Generation (RAG), which combines multiple data modalities, such as text and images, to extend the capabilities of large language models.
By pairing language and vision models, a multimodal RAG pipeline can generate more accurate, context-aware responses. This lesson walks you through implementing a simple multimodal RAG workflow using Weaviate and Google Gemini Pro Vision.

How Multimodal RAG Works
Multimodal RAG (Retrieval-Augmented Generation) enhances a language model by incorporating pertinent information retrieved from a database.
Grounded in that retrieved context, the model can formulate responses that stay rooted in the facts it was given, reducing the risk of inaccurate or unrelated outputs.
Concretely, instead of presenting only a prompt to the language model, this approach supplies both the prompt and relevant records retrieved from a vector database, which can store a variety of data types such as images and text.
This integration lets the language model draw on diverse sources of information, resulting in more robust and contextually relevant output.
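In pseudocode, the pattern reduces to two steps. This is a minimal sketch; retrieve_context and llm are hypothetical stand-ins for the retriever and model we build out below:
def rag_answer(prompt):
    # Hypothetical helpers: retrieve_context queries a vector database,
    # llm is any multimodal language model.
    context = retrieve_context(prompt)  # e.g., the nearest images/text to the prompt
    return llm.generate([prompt, context])  # answer grounded in the retrieved context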
Implementing a Multimodal RAG Process
Setup and Connection to Weaviate
First, ensure the necessary libraries are installed. If running this code locally, execute the following commands:
!pip install -U weaviate-client
!pip install google-generativeai
Next, load the environment variables and API keys:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # Read local .env file
EMBEDDING_API_KEY = os.getenv("EMBEDDING_API_KEY")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
Connect to the Weaviate instance:
import weaviate
client = weaviate.Client(
    url="http://localhost:8080",  # Replace with your Weaviate instance URL
    additional_headers={
        "X-PALM-Api-Key": EMBEDDING_API_KEY,
    },
)
assert client.is_ready()  # Ensure the client is ready
Restoring Pre-vectorized Resources
Restore the resources into Weaviate:
client.backup.restore(
    backup_id="resources-img-and-vid",
    include_classes="Resources",
    backend="filesystem",
)
import time

time.sleep(5)  # Wait for the resources to be ready
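If you prefer not to rely on a fixed delay, the v3 client also exposes a restore-status call you can poll instead; a sketch (status strings may vary by server version):
while client.backup.get_restore_status(
    backup_id="resources-img-and-vid",
    backend="filesystem",
)["status"] != "SUCCESS":
    time.sleep(1)  # Poll until the restore reports success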
Data Preview
Preview the available data in the Weaviate instance:
result = (
    client.query
    .aggregate("Resources")
    .with_group_by_filter(["mediaType"])  # Group objects by their mediaType property
    .with_fields("groupedBy { value } meta { count }")
    .do()
)
for group in result["data"]["Aggregate"]["Resources"]:
    print(f"{group['groupedBy']['value']} count: {group['meta']['count']}")
Retrieving Data from the Database Using a Query
Define a function to retrieve an image based on a query:
def retrieve_image(query_text):
    # Vector search: return the path of the stored object closest to the query text.
    result = (
        client.query
        .get("Resources", ["path"])
        .with_near_text({"concepts": [query_text]})
        .with_limit(1)
        .do()
    )
    resources = result["data"]["Get"]["Resources"]
    if resources:
        return resources[0]["path"]
    else:
        return None
Run the image retrieval:
img_path = retrieve_image("fishing with my buddies")
print(f"Image path: {img_path}")Generating Image Description
Generating Image Description
Use Google Gemini Pro Vision to generate a description of the retrieved image:
import google.generativeai as genai
from google.api_core.client_options import ClientOptions
genai.configure(
    api_key=GOOGLE_API_KEY,
    transport="rest",
    client_options=ClientOptions(
        api_endpoint=os.getenv("GOOGLE_API_BASE"),
    ),
)
import PIL.Image
from IPython.display import Markdown, Image, display

def generate_image_description(image_path, prompt):
    # Send the prompt and the image together to Gemini Pro Vision.
    img = PIL.Image.open(image_path)
    model = genai.GenerativeModel("gemini-pro-vision")
    response = model.generate_content([prompt, img], stream=True)
    response.resolve()  # Wait for the streamed response to complete
    return response.text
Generate the description:
description = generate_image_description(img_path, "Please describe this image in detail.")
print(description)
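In a notebook, you can also render the result as formatted text rather than plain output (presumably why Markdown was imported above):
Markdown(description)  # Render the description as rich text in the notebook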
Executing the Vision Request
Combine the retrieval and description generation steps into a single function:
def mm_rag(query):
    # Step 1: Retrieve an image using Weaviate
    img_path = retrieve_image(query)
    if img_path:
        display(Image(img_path))
    else:
        print("No image found for the query.")
        return
    # Step 2: Generate a description using Google Gemini Pro Vision
    description = generate_image_description(img_path, "Please describe this image in detail.")
    print(description)

# Example usage
mm_rag("paragliding through the mountains")
Final Thoughts
Multimodal RAG significantly extends the capabilities of large language models by integrating multiple data types, yielding responses that are more contextually rich and accurate.
By combining Weaviate's vector database with Google's Gemini Pro Vision, you can build a multimodal AI system that understands and generates content from both text and images, a strong foundation for a wide range of applications.
In our next article, we will explore how multimodal systems can enhance search and recommendation. We'll examine the distinct roles of search and recommendation and how combining them through multimodal technology can deliver a more personalized and effective user experience.