Welcome back to our on-device AI series!
In the first part, we introduced the fundamental concepts of on-device AI and the reasons to prioritize it: faster performance, stronger privacy, and the ability to work offline.
This article walks through the concrete steps required to prepare AI models for on-device deployment, making sure they are optimized, validated, and ready for real-world applications.
We will look at how on-device AI differs from traditional cloud-based solutions, focusing on critical technical considerations such as model size, architecture, and computational requirements.
Understanding these aspects is what makes efficient, effective performance possible across a wide range of devices.

Capturing the Neural Network Graph
Capturing the neural network graph means converting the network's computational graph into a portable representation that can be deployed to devices. This step is the foundation for model portability and for the optimizations that follow.
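To make this concrete, here is a minimal sketch of graph capture in plain PyTorch with torch.jit.trace. The tiny two-layer module and the file name are purely illustrative; Step 1 below applies the same idea to a real pre-trained model.
import torch
import torch.nn as nn
# A toy network standing in for a real model (illustrative only)
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(32, 4)
    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))
model = TinyNet().eval()
example_input = torch.rand(1, 16)
# Tracing records the operations executed on the example input and
# produces a self-contained, portable TorchScript graph
traced = torch.jit.trace(model, example_input)
# The traced graph can be saved and reloaded without the original Python
# class, which is what makes it suitable for deployment tooling
traced.save("tiny_net.pt")
reloaded = torch.jit.load("tiny_net.pt")
print(reloaded(example_input).shape)  # torch.Size([1, 4])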
Model Compilation for On-Device Deployment
Model compilation transforms the captured model into a format optimized for the target device's hardware, ensuring that the model runs efficiently on that specific device.
Accelerating Inference with Hardware
Leveraging hardware acceleration, such as GPUs, NPUs, or specialized AI processors, can significantly speed up inference, allowing for real-time performance on devices.
Importance of On-Device Validation
Validating the model on the device ensures that its predictions are consistent with those of the original trained model. This step is crucial for confirming the model's reliability and accuracy in real-world scenarios.
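As a rough illustration of what "consistent" means here, the sketch below compares the original model's output with the on-device output numerically. It assumes the torch_outputs and ondevice_outputs variables produced in Step 5 below, and the tolerance is an arbitrary placeholder; the qai_hub_models printing utilities used in that step report similar metrics for you.
import numpy as np
# Assumes torch_outputs and ondevice_outputs from Step 5; depending on the
# runtime, the on-device layout may differ, so align shapes before comparing
reference = torch_outputs.detach().numpy()
on_device = np.asarray(ondevice_outputs["output_0"][0])
max_abs_diff = np.max(np.abs(reference - on_device))
print(f"Max absolute difference: {max_abs_diff:.6f}")
if max_abs_diff < 1e-2:  # arbitrary example tolerance
    print("On-device outputs are consistent with the original model")
else:
    print("Outputs diverge; revisit runtime or precision settings")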
Steps for Preparing a Model for Deployment
Step 1: Capturing the Trained Model
Capturing the trained model involves tracing its computation graph to create a portable representation that can be compiled for various devices.
# Import the necessary libraries for capturing the trained model
from qai_hub_models.models.ffnet_40s import Model as FFNet_40s
import torch
# Load the pre-trained FFNet 40s model
ffnet_40s = FFNet_40s.from_pretrained()
# Define the input shape for the model
input_shape = (1, 3, 1024, 2048)
# Create example inputs for tracing the model
example_inputs = torch.rand(input_shape)
# Trace the model to capture its computation graph
traced_model = torch.jit.trace(ffnet_40s, example_inputs)
traced_model # Display the traced model
Step 2: Compiling the Model for the Device
Compiling the model involves transforming the traced representation into an optimized format for the target device’s hardware.
# Import QAI Hub library for model compilation
import qai_hub
from utils import get_ai_hub_api_token # Utility function to get AI Hub API token
# Configure QAI Hub with the API token
ai_hub_api_token = get_ai_hub_api_token()
!qai-hub configure --api_token $ai_hub_api_token
# List available devices for deployment
for device in qai_hub.get_devices():
    print(device.name)
# Randomly select a device for compilation
devices = [
    "Samsung Galaxy S22 Ultra 5G",
    "Samsung Galaxy S22 5G",
    "Samsung Galaxy S22+ 5G",
    "Samsung Galaxy Tab S8",
    "Xiaomi 12",
    "Xiaomi 12 Pro",
    "Samsung Galaxy S23",
    "Samsung Galaxy S23+",
    "Samsung Galaxy S23 Ultra",
    "Samsung Galaxy S24",
    "Samsung Galaxy S24 Ultra",
    "Samsung Galaxy S24+",
]
import random
selected_device = random.choice(devices) # Select a random device
print(selected_device) # Print the selected device
# Initialize the selected device
device = qai_hub.Device(selected_device)
# Submit a compile job for the selected device
compile_job = qai_hub.submit_compile_job(
    model=traced_model,                  # Traced PyTorch model
    input_specs={"image": input_shape},  # Input specifications
    device=device,                       # Target device
)
# Download and save the compiled model for on-device use
target_model = compile_job.get_target_model()
Step 3: Experimenting with Different Runtimes
Experimenting with different runtimes helps identify the configuration that works best on the target device and can further optimize the model.
# Experiment with different runtimes for model compilation
# Compile using TensorFlow Lite runtime
compile_options = "--target_runtime tflite"
compile_job_expt = qai_hub.submit_compile_job(
    model=traced_model,                  # Traced PyTorch model
    input_specs={"image": input_shape},  # Input specifications
    device=device,                       # Target device
    options=compile_options,             # Compilation options
)
# Compile using ONNX runtime
compile_options = "--target_runtime onnx"
compile_job_expt = qai_hub.submit_compile_job(
    model=traced_model,                  # Traced PyTorch model
    input_specs={"image": input_shape},  # Input specifications
    device=device,                       # Target device
    options=compile_options,             # Compilation options
)
# Compile using Qualcomm AI Engine runtime
compile_options = "--target_runtime qnn_lib_aarch64_android"
compile_job_expt = qai_hub.submit_compile_job(
    model=traced_model,                  # Traced PyTorch model
    input_specs={"image": input_shape},  # Input specifications
    device=device,                       # Target device
    options=compile_options,             # Compilation options
)
Step 4: Exploring Different Compute Units
Exploring different compute units (CPU, GPU, NPU) helps determine the optimal configuration for running the model efficiently on the target device.
# Import necessary utilities for performance profiling
from qai_hub_models.utils.printing import print_profile_metrics_from_job
# Initialize the selected device for profiling
device = qai_hub.Device(selected_device)
# Submit a performance profiling job on the device
profile_job = qai_hub.submit_profile_job(
    model=target_model,  # Compiled model
    device=device,       # Target device
)
# Download and print profiling data
profile_data = profile_job.download_profile()
print_profile_metrics_from_job(profile_job, profile_data)
# Experiment with different compute units
# Profile using CPU
profile_options = "--compute_unit cpu"
profile_job_expt = qai_hub.submit_profile_job(
    model=target_model,       # Compiled model
    device=device,            # Target device
    options=profile_options,  # Profiling options
)
# Profile using GPU
profile_options = "--compute_unit gpu"
profile_job_expt = qai_hub.submit_profile_job(
    model=target_model,       # Compiled model
    device=device,            # Target device
    options=profile_options,  # Profiling options
)
# Profile using NPU
profile_options = "--compute_unit npu"
profile_job_expt = qai_hub.submit_profile_job(
    model=target_model,       # Compiled model
    device=device,            # Target device
    options=profile_options,  # Profiling options
)
Step 5: Performing On-Device Inference
Finally, performing on-device inference validates the model's real-world performance and confirms that its outputs are accurate and consistent with those of the original model.
# Sample inputs for on-device inference
sample_inputs = ffnet_40s.sample_inputs()
# Convert sample inputs to Torch tensor
torch_inputs = torch.Tensor(sample_inputs['image'][0])
# Perform inference using the original model
torch_outputs = ffnet_40s(torch_inputs)
torch_outputs # Display the outputs
# Submit an inference job on the device
inference_job = qai_hub.submit_inference_job(
    model=target_model,    # Compiled model
    inputs=sample_inputs,  # Sample inputs
    device=device,         # Target device
)
# Download and display the on-device inference outputs
ondevice_outputs = inference_job.download_output_data()
ondevice_outputs['output_0'] # Display the outputs
# Print inference metrics
from qai_hub_models.utils.printing import print_inference_metrics
print_inference_metrics(inference_job, ondevice_outputs, torch_outputs)
Deployment Readiness
After capturing, compiling, and validating the model on the target device, and ensuring its performance meets the required criteria, the model is ready for deployment.
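Before shipping, you will usually pull the compiled artifact down to a local file so it can be bundled with your application. Here is a minimal sketch, assuming the target_model object from Step 2 and the qai_hub client's download helper; the file name and extension are illustrative and should match the runtime you compiled for.
# Save the compiled model locally for packaging with the app
# (file name is illustrative; use .tflite for TensorFlow Lite, etc.)
target_model.download("ffnet_40s.tflite")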
Final Thoughts on On-Device Deployment
Deploying AI models on devices requires careful preparation. This includes capturing the computation graph, compiling the model for the target hardware, and validating its performance.
Following these steps ensures that the models are both efficient and effective when deployed on devices.
In the next article, we will explore model quantization, a technique that further optimizes models for on-device deployment by reducing their size and improving inference speed without significantly sacrificing accuracy.