Previously, we explored the foundations and functionality of Large Multimodal Models, emphasizing their integration of text and image data. Now, let's move on to Multimodal Retrieval-Augmented Generation (RAG), which combines multiple data modalities, such as text and images, to extend the capabilities of large language models.
By pairing language and vision models, a multimodal RAG pipeline can generate more accurate, context-aware responses. This lesson walks you through implementing a simple multimodal RAG workflow using Weaviate and Google Gemini Pro Vision.

How Multimodal RAG Works
Multimodal RAG (Retrieval-Augmented Generation) enhances a language model by incorporating pertinent information retrieved from a database.
Grounded in that retrieved context, the model can formulate responses that stay rooted in the facts it was given, reducing the risk of inaccurate or unrelated outputs.
Concretely, instead of presenting only a prompt to the language model, this approach supplies both the prompt and relevant records retrieved from a vector database, which can store a variety of data types such as images and text.
This integration lets the language model draw on diverse sources of information, resulting in more robust and contextually relevant output.
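In pseudocode, the pattern reduces to two steps. This is a minimal sketch; retrieve_context and llm are hypothetical stand-ins for the retriever and model we build out below:
def rag_answer(prompt):
    # Hypothetical helpers: retrieve_context queries a vector database,
    # llm is any multimodal language model.
    context = retrieve_context(prompt)  # e.g., the nearest images/text to the prompt
    return llm.generate([prompt, context])  # answer grounded in the retrieved context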
Implementing a Multimodal RAG Process
Setup and Connection to Weaviate
First, ensure the necessary libraries are installed. If running this code locally, execute the following commands:
!pip install -U weaviate-client
!pip install google-generativeai
Next, load the environment variables and API keys:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # Read local .env file
EMBEDDING_API_KEY = os.getenv("EMBEDDING_API_KEY")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
Connect to the Weaviate instance:
import weaviate
client = weaviate.Client(
    url="http://localhost:8080",  # Replace with your Weaviate instance URL
    additional_headers={
        "X-PALM-Api-Key": EMBEDDING_API_KEY,
    },
)
assert client.is_ready()  # Ensure the client is ready
Restoring Pre-vectorized Resources
Restore the resources into Weaviate:
client.backup.restore(
    backup_id="resources-img-and-vid",
    include_classes="Resources",
    backend="filesystem",
)
import time

time.sleep(5)  # Wait for the resources to be ready
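If you prefer not to rely on a fixed delay, the v3 client also exposes a restore-status call you can poll instead; a sketch (status strings may vary by server version):
while client.backup.get_restore_status(
    backup_id="resources-img-and-vid",
    backend="filesystem",
)["status"] != "SUCCESS":
    time.sleep(1)  # Poll until the restore reports success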
Data Preview
Preview the available data in the Weaviate instance:
result = (
    client.query
    .aggregate("Resources")
    .with_group_by_filter(["mediaType"])  # Group objects by their mediaType property
    .with_fields("groupedBy { value } meta { count }")
    .do()
)
for group in result["data"]["Aggregate"]["Resources"]:
    print(f"{group['groupedBy']['value']} count: {group['meta']['count']}")
Retrieving Data from the Database Using a Query
Define a function to retrieve an image based on a query:
def retrieve_image(query_text):
    # Vector search: return the path of the stored object closest to the query text.
    result = (
        client.query
        .get("Resources", ["path"])
        .with_near_text({"concepts": [query_text]})
        .with_limit(1)
        .do()
    )
    resources = result["data"]["Get"]["Resources"]
    if resources:
        return resources[0]["path"]
    else:
        return None
Run the image retrieval:
img_path = retrieve_image("fishing with my buddies")
print(f"Image path: {img_path}")Generating Image Description
Generating Image Description
Use Google Gemini Pro Vision to generate a description of the retrieved image:
import google.generativeai as genai
from google.api_core.client_options import ClientOptions
genai.configure(
    api_key=GOOGLE_API_KEY,
    transport="rest",
    client_options=ClientOptions(
        api_endpoint=os.getenv("GOOGLE_API_BASE"),
    ),
)
import PIL.Image
from IPython.display import Markdown, Image, display

def generate_image_description(image_path, prompt):
    # Send the prompt and the image together to Gemini Pro Vision.
    img = PIL.Image.open(image_path)
    model = genai.GenerativeModel("gemini-pro-vision")
    response = model.generate_content([prompt, img], stream=True)
    response.resolve()  # Wait for the streamed response to complete
    return response.text
Generate the description:
description = generate_image_description(img_path, "Please describe this image in detail.")
print(description)
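In a notebook, you can also render the result as formatted text rather than plain output (presumably why Markdown was imported above):
Markdown(description)  # Render the description as rich text in the notebook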
Executing the Vision Request
Combine the retrieval and description generation steps into a single function:
def mm_rag(query):
    # Step 1: Retrieve an image using Weaviate
    img_path = retrieve_image(query)
    if img_path:
        display(Image(img_path))
    else:
        print("No image found for the query.")
        return
    # Step 2: Generate a description using Google Gemini Pro Vision
    description = generate_image_description(img_path, "Please describe this image in detail.")
    print(description)

# Example usage
mm_rag("paragliding through the mountains")
Final Thoughts
Multimodal RAG significantly extends the capabilities of large language models by integrating multiple data types, yielding responses that are more contextually rich and accurate.
By combining Weaviate's vector database with Google's Gemini Pro Vision, you can build a multimodal AI system that understands and generates content from both text and images, a strong foundation for a wide range of applications.
In our next article, we will explore how multimodal systems can enhance search and recommendation. We'll examine the distinct roles of search and recommendation and how combining them through multimodal technology can deliver a more personalized and effective user experience.