Introduction to Large Multimodality Models
Welcome to the second part of our series on Large Multimodality Models (LMMs) and Retrieval-Augmented Generation (RAG). Previously, we discussed the fundamentals of multimodal search and its potential to enhance various industries by integrating diverse data types.
In this article, we will provide an overview of how large language models work and introduce the concept of multimodal models. We will then explore the operation of large language models, discuss language vision models, and outline the process of building a large multimodal model.
Finally, we will cover practical steps for analyzing images using LMMs and deciphering hidden messages within images.
How Large Language Models Work
Large Language Models (LLMs) are generative models trained on extensive text data and capable of producing human-like text from a given input.
These models utilize two main types of architectures: autoregressive models and bidirectional models. Autoregressive models generate text by predicting the next word based on the previous words in the sequence.
In contrast, bidirectional models consider the context from both the left and right sides of the input sequence when generating text. These LLMs have gained significant attention and have shown remarkable capabilities in various natural language processing tasks such as language translation, summarization, and text generation.
Autoregressive Models
Autoregressive models are language models that generate text one token at a time, with each token’s generation depending on the previously generated tokens. These models are trained by predicting the next word in a sequence and optimizing the probability distribution over all possible next tokens.
For example, if the prompt is “The rock,” the model will assign higher probabilities to likely continuations like “rolls” or “is” rather than unrelated words like “apple.” Autoregressive models have been widely used in natural language processing tasks such as machine translation, text summarization, and speech recognition.
They have shown promise in capturing complex patterns and dependencies in sequential data, making them a valuable tool in various applications.
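To make this concrete, the sketch below mimics next-token selection for the prompt from the example above. The probability values are invented purely for illustration and do not come from a real model.
import numpy as np

# Hypothetical next-token probabilities for the prompt "The rock" (made-up numbers)
next_token_probs = {"rolls": 0.35, "is": 0.30, "band": 0.20, "sat": 0.14, "apple": 0.01}

tokens = list(next_token_probs.keys())
probs = np.array(list(next_token_probs.values()))
probs = probs / probs.sum()  # normalize so the probabilities sum to 1

# Greedy decoding: always pick the single most likely token
greedy_choice = tokens[int(np.argmax(probs))]

# Sampling: draw a token according to the distribution (varies between runs)
sampled_choice = np.random.choice(tokens, p=probs)

print("Greedy:", greedy_choice)
print("Sampled:", sampled_choice)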
Bidirectional Models
Bidirectional models like BERT leverage information from both the left and right contexts of a target token to make predictions about the missing token in a sequence. By doing so, these models can gain a deeper and more holistic understanding of the context, resulting in more precise and accurate predictions.
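As a hedged illustration, the snippet below uses the Hugging Face transformers fill-mask pipeline (an assumption, not part of this article's setup) to let BERT rank candidate words for a masked position using context from both sides.
from transformers import pipeline

# Load a masked-language-model pipeline backed by BERT
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses the words on both sides of [MASK] to rank candidate tokens
for prediction in fill_mask("The rock [MASK] down the hill."):
    print(prediction["token_str"], round(prediction["score"], 3))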
Introduction to Large Multimodality Models
Large Multimodality Models (LMMs) extend the capabilities of LLMs by incorporating multiple types of data, such as text and images, enabling the models to understand and generate multimodal content. This integration allows LMMs to perform complex tasks that involve both visual and textual information, such as image captioning and visual question answering.
Operation of Large Language Models
LLMs begin by tokenizing the input text into manageable units. These tokens are then embedded into vectors, which the model processes to generate the next word or token by paying attention to the context provided by the previous words or tokens.
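As a rough illustration of tokenization, the snippet below uses OpenAI's tiktoken tokenizer as a stand-in; the Gemini models used later have their own tokenizer, so treat this only as an example of how text becomes integer token IDs.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
token_ids = encoding.encode("The rock rolls down the hill.")

print(token_ids)                                   # a short list of integer token IDs
print([encoding.decode([t]) for t in token_ids])   # the text piece each ID maps to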
The transformer architecture, which underlies many LLMs, uses self-attention mechanisms to weigh the importance of different words in the input sequence, enabling the model to generate coherent and contextually appropriate text.
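The following minimal sketch computes scaled dot-product self-attention over a few toy embedding vectors with NumPy. Real transformers add learned projection matrices, multiple heads, and many stacked layers; this only shows the core weighting idea.
import numpy as np

d = 4                                   # toy embedding size
tokens = ["The", "rock", "rolls"]
X = np.random.rand(len(tokens), d)      # one embedding vector per token

Q, K, V = X, X, X                       # identity projections, for simplicity
scores = Q @ K.T / np.sqrt(d)           # similarity of every token with every other token
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
output = weights @ V                    # each row is a context-aware mixture of the values

print(weights.round(2))                 # attention weights, one row per token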
Language Vision Models
Language vision models combine visual and textual data to enhance the model’s understanding and generation capabilities.
For example, given an image and a textual instruction, the model can generate a description or answer questions about the image. This process involves embedding both the image patches and the text tokens into vectors and training the model to pay attention to both modalities to generate the correct output.
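To illustrate the image side, the sketch below splits a random stand-in image into 16x16 patches and flattens each one, roughly the way Vision Transformer-style encoders prepare patch vectors before projecting them into the same space as text tokens. The image size and patch size are assumptions chosen for illustration.
import numpy as np

image = np.random.rand(224, 224, 3)     # stand-in for a loaded 224x224 RGB image
patch = 16

# Split the image into non-overlapping 16x16 patches and flatten each one
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)

print(patches.shape)                    # (196, 768): 196 patch vectors of length 768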
Building a Large Multimodal Model
Setup
To work with large multimodal models, you need to set up the environment and API keys. For instance, using Google’s Generative AI models requires setting up the GOOGLE_API_KEY.
import warnings
warnings.filterwarnings('ignore')
# Load environment variables and API keys
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # Read local .env file
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')
# Set up the genai library
import google.generativeai as genai
from google.api_core.client_options import ClientOptions
genai.configure(
    api_key=GOOGLE_API_KEY,
    transport="rest",
    client_options=ClientOptions(
        api_endpoint=os.getenv("GOOGLE_API_BASE"),
    ),
)
Helper Functions
These helper functions format text as Markdown and handle image processing.
import textwrap
import PIL.Image
from IPython.display import Markdown, Image
def to_markdown(text):
    text = text.replace('•', ' *')
    return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

def call_LMM(image_path: str, prompt: str) -> Markdown:
    # Load the image
    img = PIL.Image.open(image_path)
    # Call the generative model
    model = genai.GenerativeModel('gemini-pro-vision')
    response = model.generate_content([prompt, img], stream=True)
    response.resolve()
    return to_markdown(response.text)
Analysis of Images Using LMMs
This function analyzes an image and generates text based on the provided prompt.
# Pass in an image and see if the LMM can answer questions about it
Image(url="SP-500-Index-Historical-Chart.jpg")
# Use the LMM function
call_LMM("SP-500-Index-Historical-Chart.jpg", "Explain what you see in this image.")
Deciphering the Message
Using LMMs to decode messages hidden within images involves analyzing the image and generating text that describes its content.
import imageio.v2 as imageio
import numpy as np
import matplotlib.pyplot as plt
# Load the image and convert to NumPy array
image = imageio.imread("blankimage3.png")
image_array = np.array(image)
# Display the processed image
plt.imshow(np.where(image_array[:, :, 0] > 120, 0, 1), cmap='gray')
plt.show()
Model Output
The output generated by the model varies with each run due to the probabilistic nature of the model. Different tokens can be sampled, leading to diverse responses.
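If more repeatable output is needed, the sampling temperature can be lowered. The sketch below passes a dict-style generation_config to the Gemini model; the exact option names are an assumption about the installed google-generativeai SDK version, so treat this as illustrative.
# Lowering temperature makes sampling favor the most likely tokens at each step
img = PIL.Image.open("SP-500-Index-Historical-Chart.jpg")
model = genai.GenerativeModel('gemini-pro-vision')
response = model.generate_content(
    ["Explain what you see in this image.", img],
    generation_config={"temperature": 0.0},  # 0.0 gives near-deterministic output
)
print(response.text)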
# Example of generating hidden message image
def create_image_with_text(text, font_size=20, font_family='sans-serif', text_color='#73D955', background_color='#7ED957'):
    fig, ax = plt.subplots(figsize=(5, 5))
    fig.patch.set_facecolor(background_color)
    ax.text(0.5, 0.5, text, fontsize=font_size, ha='center', va='center', color=text_color, fontfamily=font_family)
    ax.axis('off')
    plt.tight_layout()
    return fig

# Modify the text here to create a new hidden message image!
fig = create_image_with_text("Hello, world!")
plt.show()
fig.savefig("extra_output_image.png")

# Call the LMM function with the generated image
call_LMM("extra_output_image.png", "Read what you see on this image.")
Final Thoughts
In this article, we introduced the concept of Large Multimodality Models and provided an overview of how they work. We discussed the operation of large language models, the integration of text and image data in language vision models, and the practical steps involved in building and using LMMs.
Next, we’ll dive deeper into the detailed workings of large language models, examining their architectures, training processes, and practical applications. This exploration will lay the groundwork for understanding how these models can be enhanced through multimodal retrieval-augmented generation.
Colab Link
https://colab.research.google.com/drive/1zx4dxZ6XJsGrkztXRwC7CyHO7wceeLKw?usp=sharing