This series will explore different models and techniques for prompt engineering for vision models. We will introduce key concepts and models such as Meta’s Segment Anything Model (SAM), OWL-ViT, and Stable Diffusion 2.0, along with the fine-tuning technique DreamBooth.
- Introduction to Prompt Engineering for Vision Models: This article covers the basics of prompt engineering and its applications in vision models.
- Advanced Techniques in Prompt Engineering for Vision Models: This article explores advanced techniques such as object detection and in-painting, as well as the fine-tuning technique DreamBooth.
Let’s kick off this series by exploring the fundamentals of prompt engineering in vision models and their diverse applications.
To start, we will explore image generation with Stable Diffusion 2.0.

Image Generation with Stable Diffusion 2.0
Stable Diffusion 2.0 is a powerful, open-source text-to-image model. It uses natural language processing and neural networks to generate images from text prompts. The model relies on a diffusion process that gradually transforms a noisy input into a coherent picture, guided by the provided text.
It excels in creating highly detailed and photorealistic images, making it an essential tool for various creative and professional applications.

Prompt Engineering with Text:
Generating images with text prompts involves several key steps. Here’s a guide to get you started with Stable Diffusion 2.0:
Install Necessary Libraries
To start creating images, install essential libraries such as Torch, Transformers, and Diffusers. This step ensures you have a solid foundation for your projects.
!pip install torch transformers diffusers
Load the Model
Learn how to load the Stable Diffusion model using the Hugging Face Diffusers library. This involves importing the pipeline, pointing it at the model checkpoint, and preparing it for image generation.
import torch
from diffusers import StableDiffusionPipeline

# Load the Stable Diffusion 2 pipeline from the Hugging Face Hub
model_name = "stabilityai/stable-diffusion-2-1"
pipe = StableDiffusionPipeline.from_pretrained(model_name)
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")
Generate an Image
Generate an image by defining a text prompt and passing it to the pipeline. The pipeline handles tokenization and the diffusion process internally and returns an image you can display, bringing your ideas to life.
# Define your prompt
prompt = "A serene landscape with mountains and a lake at sunrise"

# Run the pipeline; it returns a list of PIL images
image = pipe(prompt).images[0]

# Display the generated image
image.show()
This snippet outlines the steps required to generate an image using Stable Diffusion 2.0. Experimenting with different prompts can yield varied results, showcasing the model’s versatility.
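To experiment quickly, here is a minimal sketch that reuses the pipe object loaded above; the prompt variations and file names are only examples:
# Generate an image for each prompt variation and save it for comparison
prompts = [
    "A serene landscape with mountains and a lake at sunrise",
    "A serene landscape with mountains and a lake at sunrise, oil painting",
    "A serene landscape with mountains and a lake at sunrise, ultra-detailed photograph",
]
for i, p in enumerate(prompts):
    image = pipe(p).images[0]
    image.save(f"landscape_{i}.png")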
Adjusting Hyperparameters
To fine-tune the results, you can adjust hyperparameters such as the guidance scale, the number of inference steps, and, for image-to-image workflows, strength. Here’s how to tweak these settings:
# Modify the guidance scale
guidance_scale = 8.0  # Higher values make the generated image follow the prompt more closely
# Set the number of inference steps
num_inference_steps = 60  # More steps typically result in higher-quality images, at the cost of speed
# Adjust the strength parameter (image-to-image pipelines only)
strength = 0.9  # Controls how strongly the starting image is altered; values closer to 1.0 change it more
You can optimize the generated images by fine-tuning these parameters to better match your desired output.
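These variables only take effect when they are passed into the pipeline call. Here is a minimal sketch, assuming the pipe object loaded earlier; note that strength applies only to image-to-image pipelines (such as StableDiffusionImg2ImgPipeline), so it is omitted from this text-to-image call:
# Pass the hyperparameters into the pipeline call
result = pipe(
    prompt,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
)
result.images[0].show()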
Next, let’s explore image segmentation with Meta’s Segment Anything Model (SAM).
Image Segmentation with Meta’s Segment Anything Model (SAM)
Meta’s Segment Anything Model (SAM) is a versatile and powerful image segmentation tool that leverages advanced machine learning and is designed to handle a wide range of tasks.
SAM can segment images based on prompts, such as pixel coordinates and bounding boxes, to create detailed segmentation masks.
This capability makes it a valuable asset for image editing, object detection, and many other applications.

Prompting with Coordinates
SAM allows for both positive and negative coordinate prompts to refine segmentation. The walkthrough below uses FastSAM, a lightweight SAM variant available through the Ultralytics library. Here’s a step-by-step guide on how to use it for image segmentation with coordinates:
Import Libraries
from PIL import Image
import torch
from ultralytics import YOLO
# Load the image
raw_image = Image.open("cats.jpg")
raw_image.show()
Resize the Image
from utils import resize_image

# Resize the image to the model's expected input size
resized_image = resize_image(raw_image, input_size=1024)
resized_image.show()
Prepare the Model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = YOLO('FastSAM.pt')  # Load the FastSAM model weights
Define and Visualize Prompt Points
from utils import show_points_on_image
# Define the coordinates for the points
input_points = [[300, 400], [600, 400]]  # Two points on the image
input_labels = [1, 1]  # 1 marks a positive (include) point; 0 would mark a negative (exclude) point
# Visualize the points on the image
show_points_on_image(resized_image, input_points)
Run the Model and Generate Masks
# Run the model on the image
results = model(resized_image, device=device, retina_masks=True)
# Filter the masks based on the points
from utils import format_results, point_prompt
results = format_results(results[0], 0)
masks, _ = point_prompt(results, input_points, input_labels)
# Visualize the generated masks
from utils import show_masks_on_image
show_masks_on_image(resized_image, [masks])
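Both points above are positive. As a sketch of a negative prompt, assuming the same utils helpers and the formatted results from the previous step, setting a label to 0 tells the model to exclude that region from the mask:
# Hypothetical variation: keep the first point, exclude the region around the second
input_points = [[300, 400], [600, 400]]
input_labels = [1, 0]  # 1 = include, 0 = exclude
masks, _ = point_prompt(results, input_points, input_labels)
show_masks_on_image(resized_image, [masks])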
Bounding Box Coordinates
SAM can also segment images based on bounding box coordinates. Here’s how to use bounding boxes for precise segmentation:
Define and Visualize Bounding Boxes
from utils import show_boxes_on_image
# Define bounding box coordinates
input_boxes = [[530, 180, 780, 600]]
# Visualize the bounding box on the image
show_boxes_on_image(resized_image, input_boxes)
Run the Model and Generate Masks
# Run the model
results = model(resized_image, device=device, retina_masks=True)
# Generate masks
masks = results[0].masks.data > 0 # Convert to boolean mask
from utils import box_prompt
masks, _ = box_prompt(masks, input_boxes)
# Visualize the masks
show_masks_on_image(resized_image, [masks])
These steps demonstrate how to use SAM for image segmentation with both coordinate and bounding box prompts, providing precise control over the segmentation process.
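As a follow-up, here is a minimal sketch of putting a mask to use. It assumes masks reduces to a single 2D boolean mask with the same height and width as the resized image; the exact shape depends on the utils helpers, so the squeeze step may need adjusting:
import numpy as np
from PIL import Image
# Reduce to a single 2D boolean mask matching the resized image dimensions
mask = np.array(masks, dtype=bool).squeeze()
img_array = np.array(resized_image)
# Black out everything outside the mask to isolate the segmented object
cutout = np.zeros_like(img_array)
cutout[mask] = img_array[mask]
Image.fromarray(cutout).show()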
Final Thoughts
AI prompt engineering in vision models, as demonstrated with Stable Diffusion 2.0 and SAM, opens up new possibilities for Artificial Intelligence applications. By experimenting with different prompts and adjusting settings, you can optimize your generative AI models for a variety of tasks.
Utilizing text prompts in Stable Diffusion 2.0 allows for the creation of detailed and photorealistic images, while hyperparameter adjustments, such as strength, guidance scale, and inference steps, fine-tune the output.
With SAM, positive and negative coordinate prompts isolate specific parts of an image, and bounding box coordinates enable precise segmentation, enhancing control over the process.
Overall, these techniques show the immense potential of prompt engineering in expanding the capabilities of vision models. By continuing to explore and refine these methods, we can unlock even more sophisticated and powerful AI applications.