Welcome to the final part of our Free AI course on Quantization in depth.
The previous article explored the fundamentals of quantization, including symmetric and asymmetric approaches. This article moves beyond the basics to focus on the practical aspects of quantizing weights and activations.
It’ll also explore how to optimize models for real-world applications by improving speed and reducing memory usage.
Additionally, we’ll introduce advanced strategies, such as building custom 8-bit quantizers. These methods offer even greater efficiency in specialized scenarios.

Practical Quantization: Weights and Activations
Quantizing weights and activations is a key step in making deep learning models more efficient. It reduces memory usage and speeds up inference, which makes models practical to deploy on devices with limited resources.
Understanding Weight Quantization
Weights are the parameters learned by the model during training. Converting these weights from 32-bit floating-point numbers to 8-bit integers reduces the model’s size. This also speeds up computation during inference.
However, it’s important to do this carefully to avoid significant drops in accuracy.
Here’s an example of how to quantize weights to 8-bit integers while keeping activations at 32-bit:
import torch
from helper import linear_q_symmetric

# Quantized linear layer (8-bit weights, 32-bit activations) without bias
def quantized_linear_W8A32_without_bias(input, q_w, s_w, z_w):
    assert input.dtype == torch.float32
    assert q_w.dtype == torch.int8
    # Dequantize the int8 weights back to float32 before the matrix multiply
    dequantized_weight = q_w.to(torch.float32) * s_w + z_w
    output = torch.nn.functional.linear(input, dequantized_weight)
    return output

# Example usage
input = torch.tensor([1, 2, 3], dtype=torch.float32)
weight = torch.tensor([[-2, -1.13, 0.42], [-1.51, 0.25, 1.62], [0.23, 1.35, 2.15]])
q_w, s_w = linear_q_symmetric(weight)
output = quantized_linear_W8A32_without_bias(input, q_w, s_w, 0)
print(f"This is the W8A32 output: {output}")
fp32_output = torch.nn.functional.linear(input, weight)
print(f"This is the output if we don't quantize: {fp32_output}")

In this example, the model’s weights are quantized to 8-bit integers, which reduces memory usage and speeds up the model with minimal impact on accuracy.
Activation Quantization Explained
Activations are the outputs of each layer in the network. Quantizing them can further enhance efficiency.
While weights are fixed after training, activations vary with each input. This makes their quantization crucial for efficient real-time inference.
Two main methods are used to quantize activations:
- Static Quantization: The quantization parameters (scales and zero points) are determined ahead of time, typically with a calibration dataset, and stay fixed during inference. It’s straightforward and fast, but may not adapt well to inputs whose range differs from the calibration data.
- Dynamic Quantization: The quantization parameters are computed at inference time from the actual inputs, allowing the model to adjust to each batch dynamically. It usually preserves accuracy better but adds some computation (a minimal sketch follows this list).
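To make the dynamic case concrete, here is a minimal sketch (illustrative only, not code from the course) of symmetric, per-tensor activation quantization where the scale is recomputed from each incoming batch:
import torch

# Illustrative sketch: compute a symmetric, per-tensor scale from the current
# batch of activations at inference time, then round to int8.
def dynamic_quantize_activations(x):
    scale = x.abs().max().clamp(min=1e-8) / 127   # guard against all-zero input
    q_x = torch.round(x / scale).clamp(-128, 127).to(torch.int8)
    return q_x, scale

def dequantize_activations(q_x, scale):
    return q_x.to(torch.float32) * scale

activations = torch.randn(1, 16)
q_a, scale_a = dynamic_quantize_activations(activations)
reconstructed = dequantize_activations(q_a, scale_a)
print("max reconstruction error:", (activations - reconstructed).abs().max())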
Quantizing weights and activations allows for a significant reduction in model size and inference time. However, it’s crucial to test the quantized model thoroughly to ensure it meets performance requirements without sacrificing too much accuracy.

Performance Analysis
Quantizing both weights and activations substantially reduces model size and inference time, making it easier to deploy models on devices with limited resources. However, quantization can cost some accuracy.
The extent of this loss depends on the model architecture, the nature of the data, and the specific quantization method used.
It’s essential to test and validate the quantized model to ensure it meets the required performance criteria.
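As a quick illustration of that validation step, here is a minimal sketch that reuses linear_q_symmetric from the helper module and the quantized_linear_W8A32_without_bias function from the earlier example (both assumed to be in scope) and measures how far the quantized output drifts from the full-precision reference:
import torch
from helper import linear_q_symmetric

# Compare the quantized layer against the fp32 reference on random data
torch.manual_seed(0)
weight = torch.randn(64, 32)
inputs = torch.randn(100, 32)

q_w, s_w = linear_q_symmetric(weight)
fp32_out = torch.nn.functional.linear(inputs, weight)
q_out = quantized_linear_W8A32_without_bias(inputs, q_w, s_w, 0)

abs_err = (fp32_out - q_out).abs()
print(f"mean absolute error: {abs_err.mean().item():.6f}")
print(f"max absolute error:  {abs_err.max().item():.6f}")
Small mean and max errors on representative data are a good sign, but the real test is task accuracy on your evaluation set.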
Advanced Techniques: Building a Custom 8-Bit Quantizer
Standard quantization methods are effective, but there are situations that call for more control.
Building a custom 8-bit quantizer allows for greater precision and adaptability, especially when working with specialized models or hardware.
This section will guide you through creating a custom 8-bit quantizer in PyTorch.
Why Go Custom?
Standard quantization methods use fixed rules and scales, which may not be ideal for every model. A custom 8-bit quantizer gives you the flexibility to tailor the quantization process to your specific needs.
This can be especially useful for models deployed on custom hardware or when dealing with unique data distributions.
Building the Custom Quantizer
Let’s create a custom 8-bit quantizer that works for both weights and activations. This quantizer will allow you to control the precision of your model while minimizing the loss of accuracy.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Forward pass for W8A16 quantization (8-bit weights, 16-bit activations)
def w8_a16_forward(weight, input, scales, bias=None):
    # Cast the int8 weights to the same dtype as the input
    casted_weights = weight.to(input.dtype)
    # Perform the linear operation and rescale the result
    output = F.linear(input, casted_weights) * scales
    # Add the bias if provided
    if bias is not None:
        output = output + bias
    return output

# Example usage
random_int8 = torch.randint(-128, 127, (32, 16)).to(torch.int8)
random_hs = torch.randn((1, 16), dtype=torch.bfloat16)
scales = torch.randn((1, 32), dtype=torch.bfloat16)
bias = torch.randn((1, 32), dtype=torch.bfloat16)
print("With bias:\n", w8_a16_forward(random_int8, random_hs, scales, bias))
print("\nWithout bias:\n", w8_a16_forward(random_int8, random_hs, scales))

This function casts the int8 weights to the input’s data type, applies the scales, and adds the optional bias, so the W8A16 output stays close to the full-precision result.
Creating a PyTorch Module for Custom Quantization
To make this custom quantizer reusable, we can wrap it in a PyTorch module:
class W8A16LinearLayer(nn.Module):
    def __init__(self, in_features, out_features, bias=True, dtype=torch.float32):
        super().__init__()
        # Initialize int8 weights and per-row scales
        self.register_buffer("int8_weights", torch.randint(-128, 127, (out_features, in_features), dtype=torch.int8))
        self.register_buffer("scales", torch.randn((out_features), dtype=dtype))
        # Initialize the bias if needed
        if bias:
            self.register_buffer("bias", torch.randn((1, out_features), dtype=dtype))
        else:
            self.bias = None

    def quantize(self, weights):
        # Convert the weights to 32-bit float and compute per-row scales
        w_fp32 = weights.clone().to(torch.float32)
        scales = w_fp32.abs().max(dim=-1).values / 127
        scales = scales.to(weights.dtype)
        # Quantize the weights to int8
        int8_weights = torch.round(weights / scales.unsqueeze(1)).to(torch.int8)
        # Update the internal buffers
        self.int8_weights = int8_weights
        self.scales = scales

    def forward(self, input):
        # Use the custom forward function defined earlier
        return w8_a16_forward(self.int8_weights, input, self.scales, self.bias)

# Example usage
module = W8A16LinearLayer(16, 32)
dummy_hidden_states = torch.randn(1, 6, 16)
output = module(dummy_hidden_states)
print(output.shape)
print(output.dtype)

This PyTorch module encapsulates the custom quantization process and is easy to apply to different models. The quantize method converts existing floating-point weights to int8 and stores the corresponding scales, giving you precise control over the quantization step.
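To show how this might be applied to a whole model, here is a sketch of one possible approach (the replace_linear_with_w8a16 helper below is hypothetical, not part of the course code) that walks a model’s submodules, swaps each nn.Linear for the W8A16LinearLayer defined above, and quantizes the original weights:
import torch
import torch.nn as nn

# Hypothetical helper: recursively replace every nn.Linear with the
# W8A16LinearLayer defined above and quantize the original float weights.
def replace_linear_with_w8a16(module):
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            quantized = W8A16LinearLayer(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                dtype=child.weight.dtype,
            )
            quantized.quantize(child.weight.detach())   # reuse the trained weights
            if child.bias is not None:
                quantized.bias = child.bias.detach().reshape(1, -1)
            setattr(module, name, quantized)
        else:
            replace_linear_with_w8a16(child)            # recurse into submodules

# Example: a toy two-layer model
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
replace_linear_with_w8a16(model)
print(model)
print(model(torch.randn(1, 16)).shape)
This replace-and-quantize pattern is how custom layers like this are typically applied to pretrained models.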
Beyond Linear Quantization: Exploring Non-Linear Quantization
While linear quantization is widely used, non-linear quantization can offer better results in certain scenarios. Non-linear techniques adapt to the data’s distribution, which gives you more flexibility when optimizing models for specific tasks.
These methods help achieve greater efficiency and precision, especially on specialized hardware.
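As one illustration of a non-linear scheme (a sketch under simplifying assumptions, not a production recipe), the quantile-based quantizer below builds a 256-entry codebook from the empirical distribution of the weights, so code values are spent where the weights are densest:
import torch

# Sketch of non-uniform (quantile-based) 8-bit quantization: instead of one
# linear scale, build a 256-entry codebook from the weight distribution.
def quantile_quantize(weights, num_levels=256):
    flat = weights.flatten().float()
    probs = torch.linspace(0, 1, num_levels)
    codebook = torch.quantile(flat, probs)               # sorted codebook values
    # Map each weight to the bin of its nearest upper quantile
    indices = torch.bucketize(flat, codebook).clamp(max=num_levels - 1)
    codes = (indices - 128).to(torch.int8)               # store codes as int8
    return codes.view(weights.shape), codebook

def quantile_dequantize(codes, codebook):
    return codebook[codes.to(torch.int64) + 128]

weights = torch.randn(32, 16)
codes, codebook = quantile_quantize(weights)
recovered = quantile_dequantize(codes, codebook)
print("max reconstruction error:", (weights - recovered).abs().max())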
Integrating and Optimizing: Combining Techniques
Maximizing model efficiency often requires combining quantization techniques for weights and activations. This integrated approach ensures your model runs smoothly, especially on devices with limited resources.
This section explores how to combine these techniques effectively, using practical examples.
The Power of Joint Quantization
Quantizing weights alone can significantly reduce model size. However, when you also quantize activations, the benefits are amplified.
This dual approach lightens the computational load on the device. It enables faster inference without major accuracy loss.
The challenge is to maintain the right balance between efficiency and precision.
Implementing Joint Quantization in PyTorch
Let’s create a PyTorch module that quantizes both weights and activations. This example builds on the custom quantizer we developed earlier. We’ll adapt it for joint quantization.
import torch
import torch.nn as nn
import torch.nn.functional as F

class W8A8QuantizedLayer(nn.Module):
    def __init__(self, in_features, out_features, bias=True, dtype=torch.float32):
        super().__init__()
        # Initialize int8 weights plus separate scales for weights and activations
        self.register_buffer("int8_weights", torch.randint(-128, 127, (out_features, in_features), dtype=torch.int8))
        self.register_buffer("scales_w", torch.randn((out_features), dtype=dtype))
        if bias:
            self.register_buffer("bias", torch.randn((1, out_features), dtype=dtype))
        else:
            self.bias = None
        self.register_buffer("scales_a", torch.tensor(1.0, dtype=dtype))  # Activation scale

    def quantize(self, weights, activations):
        # Quantize the weights (per-row symmetric scales)
        w_fp32 = weights.clone().to(torch.float32)
        scales_w = w_fp32.abs().max(dim=-1).values / 127
        scales_w = scales_w.to(weights.dtype)
        int8_weights = torch.round(weights / scales_w.unsqueeze(1)).to(torch.int8)
        # Compute the activation scale from a sample batch (per-tensor, symmetric)
        a_fp32 = activations.clone().to(torch.float32)
        scales_a = a_fp32.abs().max() / 127
        scales_a = scales_a.to(activations.dtype)
        int8_activations = torch.round(activations / scales_a).to(torch.int8)
        # Update the internal buffers
        self.int8_weights = int8_weights
        self.scales_w = scales_w
        self.scales_a = scales_a

    def forward(self, input):
        # Rescale the input by the activation scale before the linear layer
        input = input / self.scales_a
        casted_weights = self.int8_weights.to(input.dtype)
        # Linear transformation, then rescale by the weight scales
        output = F.linear(input, casted_weights) * self.scales_w
        if self.bias is not None:
            output = output + self.bias
        return output * self.scales_a  # Scale the output back to the original range

# Example usage
module = W8A8QuantizedLayer(16, 32)
dummy_weights = torch.randn(32, 16, dtype=torch.bfloat16)
dummy_activations = torch.randn(1, 16, dtype=torch.bfloat16)
# Quantize both weights and activations
module.quantize(dummy_weights, dummy_activations)
# Forward pass with quantized weights and activations
output = module(dummy_activations)
print(f"Output Shape: {output.shape}, Output Dtype: {output.dtype}")

In this implementation:
- Both weights and activations are quantized to 8-bit integers.
- Scales are calculated dynamically, allowing the model to adapt to varying inputs.
- The forward function manages quantized inputs and outputs, ensuring that the final output is accurate and efficient.
Real-World Applications
Joint quantization is particularly useful in models deployed on mobile devices or embedded systems. By integrating these techniques, you can achieve a balance between speed and accuracy. This makes your models more practical for real-world use.
Case Studies: MobileNet and BERT
- MobileNet: Known for its lightweight architecture, MobileNet benefits greatly from joint quantization. This enables high performance on mobile devices without significant loss in accuracy.
- BERT: For larger models like BERT, quantizing both weights and activations helps manage the computational load during inference. This makes it feasible to run BERT on more accessible hardware.
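If you want to experiment with this on a Transformer-style model without writing a custom layer, PyTorch's built-in dynamic quantization is one starting point. The snippet below applies it to a small stand-in model rather than a real BERT checkpoint; a pretrained model can be passed in the same way, though supported layers and backends vary by platform.
import torch
import torch.nn as nn

# Apply PyTorch's dynamic quantization to all nn.Linear layers of a model
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized_model)
print(quantized_model(x).shape)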
Final Thoughts
In this series, we’ve covered the essentials of quantization. We started with basic techniques and moved on to advanced 8-bit quantizers. You’ve learned how quantizing weights and activations can reduce model size and boost performance.
This makes AI models more efficient on devices with limited resources.
By combining these methods, you can optimize deep learning models for real-world use.
While challenges like accuracy loss exist, careful implementation can help you overcome them.
This series has equipped you with the tools to deploy quantized models effectively. Keep experimenting, stay curious, and push the limits of what AI can do.