Welcome back to the last article of our free AI course on Quantization with Hugging Face.
In the previous article, we discussed loading models by data size. We covered practical techniques for managing model sizes and quantization methods, looked at how data size affects model loading and deployment, and outlined strategies for optimizing model performance under size constraints.
In this new article, we will explore linear quantization theory. It’s a popular and effective method for reducing the size and enhancing the performance of AI models. We will use the Quanto Python quantization toolkit from Hugging Face to apply this technique to real models.
Linear quantization is a crucial technique in model optimization, especially for deploying models on resource-constrained devices.
By the end of this lesson, you will gain practical knowledge on implementing linear quantization and understanding its effects compared to models without quantization.

Quantization Theory Concept
Quantization plays a pivotal role in the machine learning and data compression domains. It strategically maps a broad spectrum of continuous values to a more compact, discrete range.
This fundamental technique facilitates a significant reduction in the size of a model and markedly improves computational efficiency.
This makes it indispensable for both theoretical research and practical applications.
At its core, quantization involves the conversion of high-precision values into lower-precision formats.
This process is designed to retain as much of the original information as possible while reducing the overall data footprint. By truncating or approximating values to a smaller set of discrete levels, quantization helps compress the model’s parameters. This, then, can lead to faster processing times and decreased memory usage.
It is particularly beneficial in scenarios where computational resources are limited or where real-time performance is critical.
The essence of quantization lies in its ability to maintain the integrity of the model’s predictive accuracy despite the reduction in numerical precision.
This balance between precision and efficiency is crucial for deploying models in resource-constrained environments. Think of environments like mobile devices and embedded systems, where the trade-offs between speed, accuracy, and storage need to be carefully managed.
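To make this concrete, here is a minimal, self-contained sketch of the idea, independent of any model: continuous values are snapped onto a small grid of discrete levels, and we measure how much information is lost. The 3-bit grid and the random data are purely illustrative.
import numpy as np
# Continuous values in [0, 1)
values = np.random.default_rng(0).random(5).astype(np.float32)
# Snap them onto 8 evenly spaced discrete levels (a 3-bit grid)
levels = 8
discrete = np.round(values * (levels - 1)) / (levels - 1)
print("Original: ", values)
print("Discrete: ", discrete)
print("Max error:", np.abs(values - discrete).max())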
8-bit Linear Quantization
Linear quantization is a technique used to map floating-point values, such as float32, to a lower-precision integer format like int8.
This process involves scaling the values to fit within the 8-bit range, which is from -128 to 127. The main goal of linear quantization is to reduce the memory footprint of the model and to accelerate computation without significantly degrading performance.
This technique has gained popularity as a way to optimize and deploy deep learning models on resource-constrained devices.
By reducing the precision of the model’s parameters, linear quantization allows for significant savings in terms of memory and computational resources, while often only having a modest impact on the model’s accuracy.
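Formally, given a tensor with minimum value min and maximum value max, the 8-bit linear (affine) mapping has the form q = round(x * scale + zero_point), with scale = 255 / (max - min) spreading the value range over 256 levels and zero_point = -128 - min * scale shifting it so that min lands on -128 and max on 127. The original value can later be approximated as x ≈ (q - zero_point) / scale. This is the convention used in the example below.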
Example of 8-bit Linear Quantization:
Here’s a simple example to illustrate how to perform 8-bit linear quantization:
import numpy as np
# Example float32 tensor
float32_tensor = np.array([[0.1, 0.4, 0.7], [1.0, 1.5, 2.0]], dtype=np.float32)
# Find min and max values
min_val = float32_tensor.min()
max_val = float32_tensor.max()
# Define quantization parameters for the signed 8-bit range [-128, 127]
scale = 255 / (max_val - min_val) # Scaling factor: spreads the value range over 256 levels
zero_point = -128 - min_val * scale # Zero-point adjustment: shifts min_val onto -128
# Apply the linear mapping and clamp to the int8 range
quantized_tensor = np.clip(np.round(float32_tensor * scale + zero_point), -128, 127).astype(np.int8)
print("Quantized Tensor:\n", quantized_tensor)
In this example, we map float32 values to the signed 8-bit integer range using a linear mapping defined by a scale and a zero point. This compresses the tensor while retaining its essential characteristics.
Performing Linear Quantization
To apply linear quantization to a model using the quanto toolkit, follow these steps:
1. Without Quantization
First, let’s see how to load and run inference with a model without quantization:
# Import libraries
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
# Load pre-trained model
model_name = "google/flan-t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
# Define input and run inference
input_text = "Hello, my name is"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids)
print("Output without quantization:", tokenizer.decode(outputs[0]))
# Compute and print model size
from helper import compute_module_sizes
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")2. Quantize the Model (8-bit Precision)
2. Quantize the Model (8-bit Precision)
Next, apply 8-bit linear quantization to the model:
from quanto import quantize, freeze
import torch
# Quantize the model
quantize(model, weights=torch.int8, activations=None)
print("Model after quantization:", model)
# Freeze the model to apply quantization
freeze(model)
# Compute and print the new model size
module_sizes = compute_module_sizes(model)
print(f"The model size after quantization is {module_sizes[''] * 1e-9} GB")3. Running Inference on the Quantized Model
Finally, run inference on the quantized model to observe its performance:
# Run inference on the quantized model, reusing the same input_ids
outputs = model.generate(input_ids)
print("Output with quantization:", tokenizer.decode(outputs[0]))
Comparing Linear Quantization to Downcasting
Linear Quantization vs. Downcasting:
Linear quantization converts model parameters to a smaller integer data type, such as int8, and stores the scale (and zero point) needed to convert them back to floating point during inference, which is how it maintains performance.
This method retains more information and offers a closer approximation to the original model.
By reducing the precision of the model’s weights and activations, linear quantization allows for more efficient storage and computation. This makes it particularly useful for deploying deep learning models on devices with limited computational resources.
Additionally, linear quantization can help reduce memory bandwidth and energy consumption, making it an attractive optimization technique for various machine-learning applications.
Downcasting refers to the process of converting parameters to a more compact floating-point type, such as bfloat16. Because the model then operates in a lower-precision data type, this approach may lead to some performance degradation.
Unlike quantization, downcasting only applies between floating-point types: you cannot simply cast weights to an integer type like int8, because without a scale and zero point there is no way to recover the original values.
Therefore, carefully weigh the potential impact of downcasting on performance and data integrity before using it in your projects.
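The difference is easy to see at the tensor level. In this small sketch, casting to bfloat16 keeps the values approximately intact, while a naive cast to int8 (with no scale or zero point) simply truncates them:
import torch
x = torch.tensor([0.1, 0.4, 1.5], dtype=torch.float32)
# Downcasting: still floating point, values stay roughly the same (~0.1001, ~0.4004, 1.5)
print(x.to(torch.bfloat16))
# A naive integer cast is NOT quantization: the fractional parts are simply lost
print(x.to(torch.int8)) # tensor([0, 0, 1], dtype=torch.int8)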
Example Comparison
To illustrate the difference between linear quantization and downcasting:
Downcasting Example:
from transformers import AutoModelForCausalLM
import torch
# Load the model with its weights downcast to bfloat16
model_name = "EleutherAI/pythia-410m"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
# Print the model size with the downcast (bfloat16) weights
module_sizes = compute_module_sizes(model)
print(f"Model size after downcasting: {module_sizes[''] * 1e-9} GB")
With downcasting, the model's parameters are converted to a more compact floating-point type, which roughly halves the memory footprint compared to float32 but might affect performance and accuracy.
Final Thoughts
In this article, we covered the fundamentals of 8-bit linear quantization and compared it to downcasting. We walked through using the Quanto toolkit from Hugging Face to quantize a model and then run inference on it.
Quantizing the model let us observe the changes in model size and performance, both of which are pivotal when deploying models on resource-limited devices.
This exploration should give you a clearer understanding of the quantization process and of the impact such techniques can have in real-world applications.