Welcome back to our on-device AI series!
The previous article explored deploying segmentation models on devices: taking a trained segmentation model, optimizing it for inference, and running it on-device.
In this article, we focus on the crucial process of quantizing models to unlock significant performance improvements.
Quantization can speed up inference by almost 4x while also shrinking the model to roughly a quarter of its size!
We will walk through the steps required to quantize a model and the benefits and technical considerations involved.
Let's dive into model quantization and discover how it can turn your on-device AI applications into powerful, efficient, and reliable tools.
Benefits of Quantizing Models

The three main benefits of quantizing models are:
- Reduced Model Size: Enables storage on devices with limited capacity (see the quick size estimate after this list).
- Faster Processing: Integer arithmetic is cheaper than floating-point math, so inference runs faster.
- Lower Power Consumption: Crucial for battery-operated devices.
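To see where the size reduction comes from: an FP32 weight takes 4 bytes, while an INT8 weight takes 1 byte. Here is a quick back-of-the-envelope estimate; the parameter count below is an illustrative assumption, not the exact FFNet-40S figure.
# Rough model-size estimate: FP32 (4 bytes per weight) vs INT8 (1 byte per weight)
num_params = 14_000_000  # hypothetical parameter count, for illustration only
fp32_size_mb = num_params * 4 / 1e6
int8_size_mb = num_params * 1 / 1e6
print(f"FP32: ~{fp32_size_mb:.0f} MB, INT8: ~{int8_size_mb:.0f} MB "
      f"({fp32_size_mb / int8_size_mb:.0f}x smaller)")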
Types of Quantization Techniques
There are several quantization techniques, including weight quantization and activation quantization.
The most aggressive form quantizes both weights and activations, commonly written W8A8 when both are converted to 8-bit integers.
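Under the hood, 8-bit quantization maps floating-point values to integers using a scale factor (and, in asymmetric schemes, a zero point). Below is a minimal sketch of symmetric per-tensor quantization to illustrate the idea; AIMET performs a more sophisticated version of this automatically in the steps that follow.
import torch

def quantize_symmetric_int8(x: torch.Tensor):
    # Choose the scale so the largest magnitude maps to 127
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    # Recover an approximation of the original floating-point values
    return q.to(torch.float32) * scale

weights = torch.randn(4, 4)
q, scale = quantize_symmetric_int8(weights)
print((weights - dequantize(q, scale)).abs().max())  # small quantization error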
Steps for Quantizing Models
Step 1: Preparing the Dataset
The first step is to prepare the dataset that will be used for calibration and testing. This involves loading and splitting the dataset into calibration and test sets.
# Import necessary libraries
from datasets import load_dataset
# Define the input shape of the network
input_shape = (1, 3, 1024, 2048)
# Load 100 RGB images of urban scenes
dataset = load_dataset("UrbanSyn/UrbanSyn",
                       split="train",
                       data_files="rgb/*_00*.png")
dataset = dataset.train_test_split(test_size=1)
# Split the dataset for calibration and testing
calibration_dataset = dataset["train"]
test_dataset = dataset["test"]
# Display a sample image from the calibration dataset
calibration_dataset["image"][0]

Step 2: Setting Up the Calibration and Inference Pipeline
Next, we set up the pipeline for calibration and inference, including preprocessing and post-processing steps.
# Import necessary libraries for preprocessing
import torch
from torchvision import transforms
# Define a transform to convert a PIL image to a Torch tensor
preprocess = transforms.ToTensor()
# Get a sample image from the test dataset and preprocess it
test_sample_pil = test_dataset[0]["image"]
test_sample = preprocess(test_sample_pil).unsqueeze(0)
print(test_sample)
# Define a function to postprocess the model output
import torch.nn.functional as F
import numpy as np
from PIL import Image
def postprocess(output_tensor, input_image_pil):
    # Upsample the output to the original size
    output_tensor_upsampled = F.interpolate(
        output_tensor, input_shape[2:], mode="bilinear",
    )
    # Get the top predicted class per pixel and convert to numpy
    output_predictions = (
        output_tensor_upsampled[0].argmax(0).byte().detach().numpy().astype(np.uint8)
    )
    # Overlay predictions on the original image
    color_mask = Image.fromarray(output_predictions).convert("P")
    # Create a palette for visualization
    palette = [
        128, 64, 128, 244, 35, 232, 70, 70, 70, 102, 102, 156,
        190, 153, 153, 153, 153, 153, 250, 170, 30, 220, 220, 0,
        107, 142, 35, 152, 251, 152, 70, 130, 180, 220, 20, 60,
        255, 0, 0, 0, 0, 142, 0, 0, 70, 0, 60, 100, 0, 80, 100,
        0, 0, 230, 119, 11, 32,
    ]
    palette += (256 * 3 - len(palette)) * [0]
    color_mask.putpalette(palette)
    # Blend the original image with the color mask
    out = Image.blend(input_image_pil, color_mask.convert("RGB"), 0.5)
    return out

Step 3: Configuring the Model in Floating Point
We will configure the model in floating point (FP32) and run a sample inference to verify its performance.
# Import the model
from qai_hub_models.models.ffnet_40s.model import FFNet40S
# Load the pretrained model and set it to evaluation mode
model = FFNet40S.from_pretrained().model.eval()
# Run the model with the test sample
test_output_fp32 = model(test_sample)
test_output_fp32 # Display the output
# Postprocess the output for visualization
postprocess(test_output_fp32, test_sample_pil)

Step 4: Preparing the Quantized Model
Prepare the model for quantization by folding batch normalization layers and setting up the quantization simulator.
# Import necessary libraries for quantization
from qai_hub_models.models._shared.ffnet_quantized.model import FFNET_AIMET_CONFIG
from aimet_torch.batch_norm_fold import fold_all_batch_norms
from aimet_torch.model_preparer import prepare_model
from aimet_torch.quantsim import QuantizationSimModel
# Fold batch normalization layers to improve quantization
fold_all_batch_norms(model, [input_shape])
model = prepare_model(model)
# Setup the quantization simulator
quant_sim = QuantizationSimModel(
    model,
    quant_scheme="tf_enhanced",  # Use TensorFlow Enhanced quantization scheme
    default_param_bw=8,          # Set parameter (weight) bitwidth to 8-bit
    default_output_bw=8,         # Set output (activation) bitwidth to 8-bit
    config_file=FFNET_AIMET_CONFIG,
    dummy_input=torch.rand(input_shape),
)

Step 5: Performing Post-Training Quantization
Perform post-training quantization (PTQ) by calibrating the model with a subset of the dataset.
# Define the size of the calibration dataset
size = 5 # Must be < 100
# Function to pass calibration data through the quantization simulator
def pass_calibration_data(sim_model: torch.nn.Module, args):
    (dataset,) = args
    with torch.no_grad():
        for sample in dataset.select(range(size)):
            pil_image = sample["image"]
            input_batch = preprocess(pil_image).unsqueeze(0)
            # Feed sample through for calibration
            sim_model(input_batch)
# Run Post-Training Quantization (PTQ)
quant_sim.compute_encodings(pass_calibration_data, [calibration_dataset])
# Run the quantized model with the test sample
test_output_int8 = quant_sim.model(test_sample)
# Postprocess the output for visualization
postprocess(test_output_int8, test_sample_pil)
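Before deploying, it is worth a quick sanity check on how closely the quantized predictions match the FP32 ones. Here is a minimal sketch that measures per-pixel agreement between the two outputs already computed above (a rough proxy, not a full mIoU evaluation):
# Compare per-pixel class predictions of the FP32 and INT8 outputs
pred_fp32 = test_output_fp32[0].argmax(0)
pred_int8 = test_output_int8[0].argmax(0)
agreement = (pred_fp32 == pred_int8).float().mean().item()
print(f"FP32 vs INT8 pixel agreement: {agreement:.1%}")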
Step 6: Running the Quantized Model On-Device
Deploy the quantized model on a selected device to test its performance in a real-world scenario.
# Import necessary libraries for deploying the model on-device
import qai_hub
from utils import get_ai_hub_api_token # Utility function to get AI Hub API token
# Configure QAI Hub with the API token
ai_hub_api_token = get_ai_hub_api_token()
!qai-hub configure --api_token $ai_hub_api_token
# List available devices for deployment
devices = [
"Samsung Galaxy S22 Ultra 5G",
"Samsung Galaxy S22 5G",
"Samsung Galaxy S22+ 5G",
"Samsung Galaxy Tab S8",
"Xiaomi 12",
"Xiaomi 12 Pro",
"Samsung Galaxy S23",
"Samsung Galaxy S23+",
"Samsung Galaxy S23 Ultra",
"Samsung Galaxy S24",
"Samsung Galaxy S24 Ultra",
"Samsung Galaxy S24+",
]
# Randomly select a device for deployment
import random
selected_device = random.choice(devices)
print(selected_device) # Print the selected device
# Run the quantized model on the selected device
%run -m qai_hub_models.models.ffnet_40s_quantized.export -- --device "$selected_device"
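As a side note, instead of hard-coding device names you may be able to query the AI Hub device catalog directly. This is a minimal sketch, assuming your version of the qai_hub client exposes get_devices():
# List devices known to AI Hub (assumes qai_hub.get_devices() is available)
import qai_hub
for device in qai_hub.get_devices():
    print(device.name)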
Final Thoughts
Quantizing models is a powerful technique for improving performance and reducing model size, making models much better suited to deployment on resource-constrained devices.
By following the steps outlined above, you can quantize your own models and validate their performance across a range of devices.
In the next part, we will explore integrating AI models into smartphone applications, covering the steps needed to ensure:
- Smooth integration
- Optimal performance
- Effective utilization of smartphone hardware capabilities
Stay tuned!