Introduction to Large Multimodality Models
Welcome to the second part of our series on Large Multimodality Models (LMMs) and Retrieval-Augmented Generation (RAG). Previously, we discussed the fundamentals of multimodal search and its potential to enhance various industries by integrating diverse data types.
In this article, we will provide an overview of how large language models work and introduce the concept of multimodal models. We will then explore the operation of large language models, discuss language vision models, and outline the process of building a large multimodal model.
Finally, we will cover practical steps for analyzing images using LMMs and deciphering hidden messages within images.
How Large Language Models Work
Large Language Models (LLMs) are generative models trained on extensive text data and capable of producing human-like text from a given input.
These models utilize two main types of architectures: autoregressive models and bidirectional models. Autoregressive models generate text by predicting the next word based on the previous words in the sequence.
In contrast, bidirectional models consider the context from both the left and right sides of the input sequence when generating text. These LLMs have gained significant attention and have shown remarkable capabilities in various natural language processing tasks such as language translation, summarization, and text generation.
Autoregressive Models
Autoregressive models are language models that generate text one token at a time, with each token’s generation depending on the previously generated tokens. These models are trained by predicting the next word in a sequence and optimizing the probability distribution over all possible next tokens.
For example, if the prompt is “The rock,” the model will assign higher probabilities to likely continuations like “rolls” or “is” rather than unrelated words like “apple.” Autoregressive models have been widely used in natural language processing tasks such as machine translation, text summarization, and speech recognition.
They have shown promise in capturing complex patterns and dependencies in sequential data, making them a valuable tool in various applications.
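To make this concrete, the sketch below mimics next-token selection for the prompt from the example above. The probability values are invented purely for illustration and do not come from a real model.
import numpy as np

# Hypothetical next-token probabilities for the prompt "The rock" (made-up numbers)
next_token_probs = {"rolls": 0.35, "is": 0.30, "band": 0.20, "sat": 0.14, "apple": 0.01}

tokens = list(next_token_probs.keys())
probs = np.array(list(next_token_probs.values()))
probs = probs / probs.sum()  # normalize so the probabilities sum to 1

# Greedy decoding: always pick the single most likely token
greedy_choice = tokens[int(np.argmax(probs))]

# Sampling: draw a token according to the distribution (varies between runs)
sampled_choice = np.random.choice(tokens, p=probs)

print("Greedy:", greedy_choice)
print("Sampled:", sampled_choice)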
Bidirectional Models
Bidirectional models like BERT leverage information from both the left and right contexts of a target token to make predictions about the missing token in a sequence. By doing so, these models can gain a deeper and more holistic understanding of the context, resulting in more precise and accurate predictions.
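As a hedged illustration, the snippet below uses the Hugging Face transformers fill-mask pipeline (an assumption, not part of this article's setup) to let BERT rank candidate words for a masked position using context from both sides.
from transformers import pipeline

# Load a masked-language-model pipeline backed by BERT
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses the words on both sides of [MASK] to rank candidate tokens
for prediction in fill_mask("The rock [MASK] down the hill."):
    print(prediction["token_str"], round(prediction["score"], 3))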
Introduction to Large Multimodality Models
Large Multimodality Models (LMMs) extend the capabilities of LLMs by incorporating multiple types of data, such as text and images, enabling the models to understand and generate multimodal content. This integration allows LMMs to perform complex tasks that involve both visual and textual information, such as image captioning and visual question answering.
Operation of Large Language Models
LLMs begin by tokenizing the input text into manageable units. These tokens are then embedded into vectors, which the model processes to generate the next word or token by paying attention to the context provided by the previous words or tokens.
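As a rough illustration of tokenization, the snippet below uses OpenAI's tiktoken tokenizer as a stand-in; the Gemini models used later have their own tokenizer, so treat this only as an example of how text becomes integer token IDs.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
token_ids = encoding.encode("The rock rolls down the hill.")

print(token_ids)                                   # a short list of integer token IDs
print([encoding.decode([t]) for t in token_ids])   # the text piece each ID maps to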
The transformer architecture, which underlies many LLMs, uses self-attention mechanisms to weigh the importance of different words in the input sequence, enabling the model to generate coherent and contextually appropriate text.
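The following minimal sketch computes scaled dot-product self-attention over a few toy embedding vectors with NumPy. Real transformers add learned projection matrices, multiple heads, and many stacked layers; this only shows the core weighting idea.
import numpy as np

d = 4                                   # toy embedding size
tokens = ["The", "rock", "rolls"]
X = np.random.rand(len(tokens), d)      # one embedding vector per token

Q, K, V = X, X, X                       # identity projections, for simplicity
scores = Q @ K.T / np.sqrt(d)           # similarity of every token with every other token
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
output = weights @ V                    # each row is a context-aware mixture of the values

print(weights.round(2))                 # attention weights, one row per token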
Language Vision Models
Language vision models combine visual and textual data to enhance the model’s understanding and generation capabilities.
For example, given an image and a textual instruction, the model can generate a description or answer questions about the image. This process involves embedding both the image patches and the text tokens into vectors and training the model to pay attention to both modalities to generate the correct output.
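To illustrate the image side, the sketch below splits a random stand-in image into 16x16 patches and flattens each one, roughly the way Vision Transformer-style encoders prepare patch vectors before projecting them into the same space as text tokens. The image size and patch size are assumptions chosen for illustration.
import numpy as np

image = np.random.rand(224, 224, 3)     # stand-in for a loaded 224x224 RGB image
patch = 16

# Split the image into non-overlapping 16x16 patches and flatten each one
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)

print(patches.shape)                    # (196, 768): 196 patch vectors of length 768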
Building a Large Multimodal Model
Setup
To work with large multimodal models, you need to set up the environment and API keys. For instance, using Google’s Generative AI models requires setting up the GOOGLE_API_KEY.
import warnings
warnings.filterwarnings('ignore')
# Load environment variables and API keys
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # Read local .env file
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')
# Set up the genai library
import google.generativeai as genai
from google.api_core.client_options import ClientOptions
genai.configure(
    api_key=GOOGLE_API_KEY,
    transport="rest",
    client_options=ClientOptions(
        api_endpoint=os.getenv("GOOGLE_API_BASE"),
    ),
)
Helper Functions
These helper functions format text as Markdown and handle image processing.
import textwrap
import PIL.Image
from IPython.display import Markdown, Image
def to_markdown(text):
    text = text.replace('•', ' *')
    return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

def call_LMM(image_path: str, prompt: str) -> Markdown:
    # Load the image
    img = PIL.Image.open(image_path)
    # Call the generative model
    model = genai.GenerativeModel('gemini-pro-vision')
    response = model.generate_content([prompt, img], stream=True)
    response.resolve()
    return to_markdown(response.text)
Analysis of Images Using LMMs
This function analyzes an image and generates text based on the provided prompt.
# Pass in an image and see if the LMM can answer questions about it
Image(url="SP-500-Index-Historical-Chart.jpg")
# Use the LMM function
call_LMM("SP-500-Index-Historical-Chart.jpg", "Explain what you see in this image.")
Deciphering the Message
Using LMMs to decode messages hidden within images involves analyzing the image and generating text that describes its content.
import imageio.v2 as imageio
import numpy as np
import matplotlib.pyplot as plt
# Load the image and convert to NumPy array
image = imageio.imread("blankimage3.png")
image_array = np.array(image)
# Display the processed image
plt.imshow(np.where(image_array[:, :, 0] > 120, 0, 1), cmap='gray')
plt.show()
Model Output
The output generated by the model varies with each run due to the probabilistic nature of the model. Different tokens can be sampled, leading to diverse responses.
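If more repeatable output is needed, the sampling temperature can be lowered. The sketch below passes a dict-style generation_config to the Gemini model; the exact option names are an assumption about the installed google-generativeai SDK version, so treat this as illustrative.
# Lowering temperature makes sampling favor the most likely tokens at each step
img = PIL.Image.open("SP-500-Index-Historical-Chart.jpg")
model = genai.GenerativeModel('gemini-pro-vision')
response = model.generate_content(
    ["Explain what you see in this image.", img],
    generation_config={"temperature": 0.0},  # 0.0 gives near-deterministic output
)
print(response.text)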
# Example of generating hidden message image
def create_image_with_text(text, font_size=20, font_family='sans-serif', text_color='#73D955', background_color='#7ED957'):
    fig, ax = plt.subplots(figsize=(5, 5))
    fig.patch.set_facecolor(background_color)
    ax.text(0.5, 0.5, text, fontsize=font_size, ha='center', va='center', color=text_color, fontfamily=font_family)
    ax.axis('off')
    plt.tight_layout()
    return fig

# Modify the text here to create a new hidden message image!
fig = create_image_with_text("Hello, world!")
plt.show()
fig.savefig("extra_output_image.png")

# Call the LMM function with the generated image
call_LMM("extra_output_image.png", "Read what you see on this image.")
Final Thoughts
In this article, we introduced the concept of Large Multimodality Models and provided an overview of how they work. We discussed the operation of large language models, the integration of text and image data in language vision models, and the practical steps involved in building and using LMMs.
Next, we’ll dive deeper into the detailed workings of large language models, examining their architectures, training processes, and practical applications. This exploration will lay the groundwork for understanding how these models can be enhanced through multimodal retrieval-augmented generation.
Colab Link
https://colab.research.google.com/drive/1zx4dxZ6XJsGrkztXRwC7CyHO7wceeLKw?usp=sharing