Welcome back to our on-device AI series!
The previous article explored deploying segmentation models on devices: taking a trained segmentation model, optimizing it for inference, and running it on-device.
In this article, we focus on the crucial process of quantizing models to unlock significant performance improvements.
Quantization can speed up inference by almost 4x while also shrinking the model to roughly a quarter of its size!
We will walk through the steps required to quantize a model and the benefits and technical considerations involved.
Let's dive into model quantization and discover how it can turn your on-device AI applications into powerful, efficient, and reliable tools.
Benefits of Quantizing Models

The three main benefits of quantizing models are:
- Reduced Model Size: Enables storage on devices with limited capacity (see the quick size estimate after this list).
- Faster Processing: Integer arithmetic is cheaper than floating-point math, so inference runs faster.
- Lower Power Consumption: Crucial for battery-operated devices.
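To see where the size reduction comes from: an FP32 weight takes 4 bytes, while an INT8 weight takes 1 byte. Here is a quick back-of-the-envelope estimate; the parameter count below is an illustrative assumption, not the exact FFNet-40S figure.
# Rough model-size estimate: FP32 (4 bytes per weight) vs INT8 (1 byte per weight)
num_params = 14_000_000  # hypothetical parameter count, for illustration only
fp32_size_mb = num_params * 4 / 1e6
int8_size_mb = num_params * 1 / 1e6
print(f"FP32: ~{fp32_size_mb:.0f} MB, INT8: ~{int8_size_mb:.0f} MB "
      f"({fp32_size_mb / int8_size_mb:.0f}x smaller)")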
Types of Quantization Techniques
There are several quantization techniques, including weight quantization and activation quantization.
The most aggressive form quantizes both weights and activations, commonly written W8A8 when both are converted to 8-bit integers.
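Under the hood, 8-bit quantization maps floating-point values to integers using a scale factor (and, in asymmetric schemes, a zero point). Below is a minimal sketch of symmetric per-tensor quantization to illustrate the idea; AIMET performs a more sophisticated version of this automatically in the steps that follow.
import torch

def quantize_symmetric_int8(x: torch.Tensor):
    # Choose the scale so the largest magnitude maps to 127
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    # Recover an approximation of the original floating-point values
    return q.to(torch.float32) * scale

weights = torch.randn(4, 4)
q, scale = quantize_symmetric_int8(weights)
print((weights - dequantize(q, scale)).abs().max())  # small quantization error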
Steps for Quantizing Models
Step 1: Preparing the Dataset
The first step is to prepare the dataset that will be used for calibration and testing. This involves loading and splitting the dataset into calibration and test sets.
# Import necessary libraries
from datasets import load_dataset
# Define the input shape of the network
input_shape = (1, 3, 1024, 2048)
# Load 100 RGB images of urban scenes
dataset = load_dataset("UrbanSyn/UrbanSyn",
                       split="train",
                       data_files="rgb/*_00*.png")
dataset = dataset.train_test_split(test_size=1)
# Split the dataset for calibration and testing
calibration_dataset = dataset["train"]
test_dataset = dataset["test"]
# Display a sample image from the calibration dataset
calibration_dataset["image"][0]

Step 2: Setting Up the Calibration and Inference Pipeline
Next, we set up the pipeline for calibration and inference, including preprocessing and post-processing steps.
# Import necessary libraries for preprocessing
import torch
from torchvision import transforms
# Define a transform to convert a PIL image to a Torch tensor
preprocess = transforms.ToTensor()
# Get a sample image from the test dataset and preprocess it
test_sample_pil = test_dataset[0]["image"]
test_sample = preprocess(test_sample_pil).unsqueeze(0)
print(test_sample)
# Define a function to postprocess the model output
import torch.nn.functional as F
import numpy as np
from PIL import Image
def postprocess(output_tensor, input_image_pil):
    # Upsample the output to the original size
    output_tensor_upsampled = F.interpolate(
        output_tensor, input_shape[2:], mode="bilinear",
    )
    # Get the top predicted class per pixel and convert to numpy
    output_predictions = (
        output_tensor_upsampled[0].argmax(0).byte().detach().numpy().astype(np.uint8)
    )
    # Overlay predictions on the original image
    color_mask = Image.fromarray(output_predictions).convert("P")
    # Create a palette for visualization
    palette = [
        128, 64, 128, 244, 35, 232, 70, 70, 70, 102, 102, 156,
        190, 153, 153, 153, 153, 153, 250, 170, 30, 220, 220, 0,
        107, 142, 35, 152, 251, 152, 70, 130, 180, 220, 20, 60,
        255, 0, 0, 0, 0, 142, 0, 0, 70, 0, 60, 100, 0, 80, 100,
        0, 0, 230, 119, 11, 32,
    ]
    palette += (256 * 3 - len(palette)) * [0]
    color_mask.putpalette(palette)
    # Blend the original image with the color mask
    out = Image.blend(input_image_pil, color_mask.convert("RGB"), 0.5)
    return out

Step 3: Configuring the Model in Floating Point
We will configure the model in floating point (FP32) and run a sample inference to verify its performance.
# Import the model
from qai_hub_models.models.ffnet_40s.model import FFNet40S
# Load the pretrained model and set it to evaluation mode
model = FFNet40S.from_pretrained().model.eval()
# Run the model with the test sample
test_output_fp32 = model(test_sample)
test_output_fp32 # Display the output
# Postprocess the output for visualization
postprocess(test_output_fp32, test_sample_pil)

Step 4: Preparing the Quantized Model
Prepare the model for quantization by folding batch normalization layers and setting up the quantization simulator.
# Import necessary libraries for quantization
from qai_hub_models.models._shared.ffnet_quantized.model import FFNET_AIMET_CONFIG
from aimet_torch.batch_norm_fold import fold_all_batch_norms
from aimet_torch.model_preparer import prepare_model
from aimet_torch.quantsim import QuantizationSimModel
# Fold batch normalization layers to improve quantization
fold_all_batch_norms(model, [input_shape])
model = prepare_model(model)
# Setup the quantization simulator
quant_sim = QuantizationSimModel(
    model,
    quant_scheme="tf_enhanced",  # Use TensorFlow Enhanced quantization scheme
    default_param_bw=8,          # Set parameter (weight) bitwidth to 8-bit
    default_output_bw=8,         # Set output (activation) bitwidth to 8-bit
    config_file=FFNET_AIMET_CONFIG,
    dummy_input=torch.rand(input_shape),
)

Step 5: Performing Post-Training Quantization
Perform post-training quantization (PTQ) by calibrating the model with a subset of the dataset.
# Define the size of the calibration dataset
size = 5 # Must be < 100
# Function to pass calibration data through the quantization simulator
def pass_calibration_data(sim_model: torch.nn.Module, args):
    (dataset,) = args
    with torch.no_grad():
        for sample in dataset.select(range(size)):
            pil_image = sample["image"]
            input_batch = preprocess(pil_image).unsqueeze(0)
            # Feed sample through for calibration
            sim_model(input_batch)
# Run Post-Training Quantization (PTQ)
quant_sim.compute_encodings(pass_calibration_data, [calibration_dataset])
# Run the quantized model with the test sample
test_output_int8 = quant_sim.model(test_sample)
# Postprocess the output for visualization
postprocess(test_output_int8, test_sample_pil)
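Before deploying, it is worth a quick sanity check on how closely the quantized predictions match the FP32 ones. Here is a minimal sketch that measures per-pixel agreement between the two outputs already computed above (a rough proxy, not a full mIoU evaluation):
# Compare per-pixel class predictions of the FP32 and INT8 outputs
pred_fp32 = test_output_fp32[0].argmax(0)
pred_int8 = test_output_int8[0].argmax(0)
agreement = (pred_fp32 == pred_int8).float().mean().item()
print(f"FP32 vs INT8 pixel agreement: {agreement:.1%}")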
Step 6: Running the Quantized Model On-Device
Deploy the quantized model on a selected device to test its performance in a real-world scenario.
# Import necessary libraries for deploying the model on-device
import qai_hub
from utils import get_ai_hub_api_token # Utility function to get AI Hub API token
# Configure QAI Hub with the API token
ai_hub_api_token = get_ai_hub_api_token()
!qai-hub configure --api_token $ai_hub_api_token
# List available devices for deployment
devices = [
"Samsung Galaxy S22 Ultra 5G",
"Samsung Galaxy S22 5G",
"Samsung Galaxy S22+ 5G",
"Samsung Galaxy Tab S8",
"Xiaomi 12",
"Xiaomi 12 Pro",
"Samsung Galaxy S23",
"Samsung Galaxy S23+",
"Samsung Galaxy S23 Ultra",
"Samsung Galaxy S24",
"Samsung Galaxy S24 Ultra",
"Samsung Galaxy S24+",
]
# Randomly select a device for deployment
import random
selected_device = random.choice(devices)
print(selected_device) # Print the selected device
# Run the quantized model on the selected device
%run -m qai_hub_models.models.ffnet_40s_quantized.export -- --device "$selected_device"
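As a side note, instead of hard-coding device names you may be able to query the AI Hub device catalog directly. This is a minimal sketch, assuming your version of the qai_hub client exposes get_devices():
# List devices known to AI Hub (assumes qai_hub.get_devices() is available)
import qai_hub
for device in qai_hub.get_devices():
    print(device.name)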
Final Thoughts
Quantizing models is a powerful technique for improving performance and reducing model size, making models much better suited to deployment on resource-constrained devices.
By following the steps outlined above, you can quantize your own models and validate their performance across a range of devices.
In the next part, we will explore integrating AI models into smartphone applications, covering the steps needed to ensure:
- Smooth integration
- Optimal performance
- Effective utilization of smartphone hardware capabilities
Stay tuned!