In the first part of this series, we covered the basics of prompt engineering for vision models, focusing on image generation and segmentation.
Now, we'll delve into advanced techniques such as object detection and in-painting, along with DreamBooth's fine-tuning technique.
These methods enhance vision models’ capability to identify and manipulate objects within images, opening up new possibilities for AI applications.
To start with, let’s explore the advanced technique of object detection using OWL-ViT.

Object Detection with OWL-ViT
OWL-ViT is a state-of-the-art zero-shot object detection model. It can identify and locate objects in images based on natural language prompts.
This model does not require prior training on the specific objects it needs to detect. This makes it highly versatile and effective for a wide range of applications.
Prompting with Natural Language
Using natural language prompts to detect objects involves a few key steps. Here’s a guide to get you started with OWL-ViT:
Install the necessary libraries.
!pip install -q comet_ml transformers ultralytics torch
Set Up Comet for Experiment Tracking
import comet_ml
comet_ml.init(anonymous=True, project_name="3-OWL-ViT-SAM")
exp = comet_ml.Experiment()
Load the Image
from PIL import Image
# Load and display the image
raw_image = Image.open("dogs.jpg")
raw_image.show()
Load the OWL-ViT Model
from transformers import pipeline
# Define the model checkpoint
OWL_checkpoint = "google/owlvit-base-patch32"
# Build the pipeline for zero-shot object detection
detector = pipeline(model=OWL_checkpoint, task="zero-shot-object-detection")
Detect Objects Using Natural Language Prompts
# Define the text prompt
text_prompt = "dog"
# Use the model to detect objects
output = detector(raw_image, candidate_labels=[text_prompt])
# Print the output to identify the bounding boxes detected
print(output)
Visualize the Detected Bounding Boxes
from utils import preprocess_outputs, show_boxes_and_labels_on_image
# Process the outputs
input_scores, input_labels, input_boxes = preprocess_outputs(output)
# Display the image with bounding boxes
show_boxes_and_labels_on_image(raw_image, input_boxes[0], input_labels, input_scores)
This process lets you detect objects in images with simple text prompts, which makes OWL-ViT a powerful tool for a wide range of applications.
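Note that preprocess_outputs and show_boxes_and_labels_on_image come from the course's utils module, which isn't shown here. The pipeline itself returns a list of dictionaries, each with score, label, and box keys. If you don't have those helpers, here is a minimal sketch of what equivalent post-processing and drawing could look like with PIL; the function names and exact behavior are assumptions, not the actual utils implementation.
from PIL import ImageDraw
# Hypothetical stand-in for preprocess_outputs: split the pipeline output into
# parallel lists of scores, labels, and [xmin, ymin, xmax, ymax] boxes.
def preprocess_outputs_sketch(detections):
    scores = [d["score"] for d in detections]
    labels = [d["label"] for d in detections]
    boxes = [[d["box"]["xmin"], d["box"]["ymin"], d["box"]["xmax"], d["box"]["ymax"]]
             for d in detections]
    return scores, labels, boxes
# Hypothetical stand-in for show_boxes_and_labels_on_image
def show_boxes_sketch(image, boxes, labels, scores):
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for box, label, score in zip(boxes, labels, scores):
        draw.rectangle(box, outline="red", width=3)
        draw.text((box[0], box[1]), f"{label}: {score:.2f}", fill="red")
    annotated.show()
scores, labels, boxes = preprocess_outputs_sketch(output)
show_boxes_sketch(raw_image, boxes, labels, scores)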
Using OWL-ViT for zero-shot object detection showcases the power of combining natural language processing with vision models. This technique simplifies the process of identifying and locating objects within images, enhancing the functionality of AI systems.
Next, let's explore in-painting, a technique that replaces or fills in parts of an image with generated content.
In-painting with Combined Techniques
In-painting is a technique for replacing or filling in parts of an image with generated content. This method can be employed to:
- Remove unwanted objects
- Restore damaged images
- Create entirely new elements within a scene
By combining segmentation and generation techniques, in-painting allows for precise and creative modifications.

Combining Segmentation and Generation
To effectively perform in-painting, we can use a combination of the Segment Anything Model (SAM) for segmentation and a diffusion model like Stable Diffusion 2.0 for generation.
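The walkthrough below loads a ready-made binary mask from disk. If you want to produce such a mask yourself, here is a minimal sketch using the SAM weights available through the ultralytics package; the checkpoint name, the prompt box coordinates, and the output filename are assumptions for illustration, and the box could just as well come from an OWL-ViT detection as in the previous section.
from ultralytics import SAM
from PIL import Image
import numpy as np
# Load a base SAM checkpoint (downloaded automatically on first use)
sam_model = SAM("sam_b.pt")
# Prompt SAM with a rough bounding box around the object to segment
results = sam_model("boy-with-kitten.jpg", bboxes=[100, 120, 200, 220])
# Take the first predicted mask and save it as a black-and-white PNG
# (resize it later to match the working image, as the guide below does)
mask = results[0].masks.data[0].cpu().numpy()
Image.fromarray((mask * 255).astype(np.uint8)).save("cat_binary_mask.png")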
Here’s a step-by-step guide:
Install Necessary Libraries
!pip install torch diffusers transformers ultralytics
Load and Display the Image
from PIL import Image
# Load the image to be edited
image_path = "boy-with-kitten.jpg"
image = Image.open(image_path).resize((256, 256))
image.show()
Load the Mask Image
# Load the mask image
mask_path = "cat_binary_mask.png"
image_mask = Image.open(mask_path).resize((256, 256))
image_mask.show()
Initialize the Stable Diffusion Inpainting Pipeline
from diffusers import StableDiffusionInpaintPipeline
import torch
# Set up the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Initialize the inpainting pipeline
sd_pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.bfloat16,
    low_cpu_mem_usage=True
).to(device)
Define the Prompt and Generate the In-painted Image
import numpy as np
# Set the prompt and seed
prompt = "a realistic phoenix"
seed = 123
generator = torch.manual_seed(seed)
# Perform in-painting
output = sd_pipe(
    image=image,
    mask_image=image_mask,
    prompt=prompt,
    generator=generator,
    num_inference_steps=40,
    guidance_scale=6.5
)
# Display the generated image
generated_image = output.images[0]
generated_image.show()
This approach demonstrates how to replace parts of an image with generated content, producing a seamless and visually appealing result.
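A quick optional experiment: because the result depends heavily on guidance_scale, it can help to sweep a few values and compare the outputs (the values below are arbitrary; if you track experiments with Comet as in the other sections, each image could also be logged with exp.log_image).
# Sweep a few guidance scales and save each result for comparison
for gs in [3.0, 6.5, 10.0]:
    result = sd_pipe(
        image=image,
        mask_image=image_mask,
        prompt=prompt,
        generator=torch.manual_seed(seed),
        num_inference_steps=40,
        guidance_scale=gs
    ).images[0]
    result.save(f"phoenix_gs_{gs}.png")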
In-painting with combined techniques offers a powerful way to manipulate and enhance images.
By using segmentation and generation models, you can achieve precise and visually appealing results. Whether you’re removing unwanted objects, restoring images, or creating new elements within a scene, these techniques offer impressive capabilities.
Following this, we will explore the personalization of image generation through fine-tuning techniques like DreamBooth.
Personalization with Fine-tuning (DreamBooth)
DreamBooth is a fine-tuning technique that enables the generation of personalized images based on specific text labels.
By fine-tuning a diffusion model, DreamBooth can associate unique features or objects with a custom text prompt. This allows for highly personalized and context-specific image generation.
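Concretely, DreamBooth trains on two kinds of prompts: an instance prompt that ties a rare identifier token to your subject, and a class prompt whose generic images act as a prior-preservation regularizer. As a conceptual illustration (these exact strings reappear in the hyperparameters below):
# Instance prompt: "[V]" is a placeholder for a rare identifier token bound to the subject
instance_prompt = "a photo of a [V] man"
# Class prompt: generic images of the class keep the model from forgetting what a man looks like
class_prompt = "a photo of a man"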

Generating Custom Images
To create personalized images using DreamBooth, we need to follow a systematic approach involving data preparation, model initialization, and training. Here’s a step-by-step guide:
Set Up Comet for Experiment Tracking
import comet_ml
comet_ml.init(anonymous=True, project_name="4-diffusion-prompting")
exp = comet_ml.Experiment()
Initialize the Stable Diffusion Inpainting Pipeline
from diffusers import StableDiffusionInpaintPipeline
import torch
# Set up the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Initialize the inpainting pipeline
sd_pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.bfloat16,
    low_cpu_mem_usage=True
).to(device)
Define Hyperparameters and Initialize the Model
# Define hyperparameters
hyperparameters = {
    "instance_prompt": "a photo of a [V] man",
    "class_prompt": "a photo of a man",
    "seed": 4329,
    "pretrained_model_name_or_path": "stabilityai/stable-diffusion-xl-base-1.0",
    "resolution": 1024 if torch.cuda.is_available() else 512,
    "num_inference_steps": 50,
    "guidance_scale": 5.0,
    "num_class_images": 200,
    "prior_loss_weight": 1.0
}
# Initialize the DreamBooth trainer
from utils import DreamBoothTrainer
trainer = DreamBoothTrainer(hyperparameters)
Initialize and Train the Model
# Initialize the model components
tokenizer, text_encoder, vae, unet = trainer.initialize_models()
# Noise scheduler used to add noise to the latents during training
from diffusers import DDPMScheduler
noise_scheduler = DDPMScheduler.from_pretrained(
    trainer.hyperparameters.pretrained_model_name_or_path,
    subfolder="scheduler"
)
# Attach LoRA adapters to the UNet and set up the optimizer
unet = trainer.initialize_lora(unet)
optimizer, params_to_optimize = trainer.initialize_optimizer(unet)
# Prepare dataset and dataloader
train_dataset, train_dataloader = trainer.prepare_dataset(tokenizer, text_encoder)
lr_scheduler = trainer.initialize_scheduler(train_dataloader, optimizer)
Train the Model
from tqdm import tqdm
global_step = 0
progress_bar = tqdm(range(trainer.hyperparameters.max_train_steps), desc="Steps")
for epoch in range(trainer.hyperparameters.num_train_epochs):
    unet.train()
    for step, batch in enumerate(train_dataloader):
        with trainer.accelerator.accumulate(unet):
            # Encode the training images into latents
            pixel_values = batch["pixel_values"].to(dtype=vae.dtype)
            model_input = vae.encode(pixel_values).latent_dist.sample()
            model_input = model_input * vae.config.scaling_factor
            # Sample noise and random timesteps, then add the noise to the latents
            noise = torch.randn_like(model_input)
            timesteps = torch.randint(
                0, noise_scheduler.config.num_train_timesteps,
                (batch["pixel_values"].shape[0],),
                device=model_input.device
            ).long()
            noisy_model_input = noise_scheduler.add_noise(model_input, noise, timesteps)
            # Predict the added noise with the LoRA-augmented UNet
            encoder_hidden_states = batch["input_ids"]
            model_pred = unet(noisy_model_input, timesteps, encoder_hidden_states)[0]
            target = noise
            # The batch stacks instance and class (prior-preservation) examples,
            # so split predictions and targets into the two halves
            model_pred, model_pred_prior = torch.chunk(model_pred, 2, dim=0)
            target, target_prior = torch.chunk(target, 2, dim=0)
            instance_loss = torch.nn.functional.mse_loss(model_pred.float(), target.float(), reduction="mean")
            prior_loss = torch.nn.functional.mse_loss(model_pred_prior.float(), target_prior.float(), reduction="mean")
            loss = instance_loss + trainer.hyperparameters.prior_loss_weight * prior_loss
            trainer.accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
        global_step += 1
        loss_metrics = {"loss": loss.detach().item(), "prior_loss": prior_loss.detach().item(), "lr": lr_scheduler.get_last_lr()[0]}
        exp.log_metrics(loss_metrics, step=global_step)
        progress_bar.set_postfix(**loss_metrics)
        progress_bar.update(1)
        if global_step >= trainer.hyperparameters.max_train_steps:
            break
trainer.save_lora_weights(unet)
exp.add_tag("dreambooth-training")
exp.log_parameters(trainer.hyperparameters)
trainer.accelerator.end_training()
Generate Personalized Images
# Define prompts to generate personalized images
prompts = [
    "a photo of a [V] man playing basketball",
    "a photo of a [V] man riding a horse",
    "a photo of a [V] man at the summit of a mountain",
    "a photo of a [V] man driving a convertible",
    "a photo of a [V] man riding a skateboard on a huge halfpipe",
    "a mural of a [V] man, painted by graffiti artists"
]
This method demonstrates how to fine-tune a diffusion model using the DreamBooth technique. It generates highly personalized images based on specific prompts.
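The walkthrough stops at defining the prompts. As a rough sketch of the final step (the LoRA weights directory below is a placeholder, and the exact loading call depends on how DreamBoothTrainer.save_lora_weights stored the weights), generation could look like this:
from diffusers import DiffusionPipeline
import torch
# Load the base SDXL pipeline and attach the fine-tuned LoRA weights
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
).to("cuda" if torch.cuda.is_available() else "cpu")
pipe.load_lora_weights("dreambooth-lora-weights")  # placeholder path
# Generate one image per personalized prompt
for i, p in enumerate(prompts):
    image = pipe(p, num_inference_steps=50, guidance_scale=5.0).images[0]
    image.save(f"dreambooth_{i}.png")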
Personalization with DreamBooth enables the creation of unique and context-specific images by associating text labels with visual features.
Fine-tuning diffusion models in this way lets you achieve a high degree of customization in image generation, which makes the technique valuable for a wide range of applications.
Final Thoughts
The journey through prompt engineering for vision models unveils a landscape of innovations.
From generating stunning images with Stable Diffusion 2.0 to the precision of object detection with OWL-ViT and the personalized artistry of DreamBooth, these advanced techniques empower developers to push the boundaries of AI.
By embracing these tools and experimenting with new prompts and settings, you can unlock the full potential of vision models and transform creative and practical applications alike.
The future of AI-driven image manipulation and generation is bright, and your exploration is just beginning.