Previously, we covered fine-tuning language models and their impact on performance. Now, let’s move forward to instruction fine-tuning and data prep for machine learning.
We have divided this into two sections. In the first section, we’ll cover adding chat capabilities through instruction fine-tuning. This technique transformed GPT-3 into ChatGPT.
It significantly enhances the model’s ability to follow instructions and interact with users conversationally.
We will explore the process of data preparation, prompt templates, and the comparison between instruction-tuned and non-instruction-tuned models.
In the second section, we will provide a comprehensive guide to data preparation in machine learning, walking through importing libraries, tokenizing text, applying padding and truncation, and preparing train/test splits.
By following these guidelines, you will ensure that your data is of high quality.
Properly formatted data ultimately leads to more effective and robust machine learning models.
Section 1: Instruction Fine-Tuning: Empowering Models with Chatting Abilities
Let’s start our journey with section one.
What is Instruction Fine-Tuning?

Instruction fine-tuning teaches models to follow specific instructions, making them function more like chatbots.
As we mentioned earlier, this was the technique that transformed GPT-3 into ChatGPT. It increased its adoption by enhancing user interaction capabilities.
Data Preparation for Instruction Following
To prepare data for instruction following, you can use datasets such as FAQs, customer support conversations, or Slack messages.

If specific data isn’t available, existing data can be converted into an instruction-response format.
This can be done using a prompt template.
Stanford’s Alpaca project used an OpenAI model (text-davinci-003) to assist in this kind of conversion.
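As a minimal sketch of that idea (using a made-up FAQ entry and a simplified template, not the exact Alpaca templates shown later in the lab), converting a question-answer pair into an instruction-response record could look like this:
# Hypothetical FAQ entry to convert (illustrative data only)
faq_pair = {
    "question": "How do I reset my password?",
    "answer": "Click 'Forgot password' on the login page and follow the emailed link.",
}

# A simplified instruction-style template; the full Alpaca templates appear later
simple_template = "### Instruction:\n{question}\n\n### Response:"

# Hydrate the template to produce an instruction-response record
instruction_example = {
    "input": simple_template.format(question=faq_pair["question"]),
    "output": faq_pair["answer"],
}
print(instruction_example)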
Loading the Instruction-Tuned Dataset
Let’s dive into the lab, where we get a peek at the Alpaca dataset for instruction tuning. We’ll start by importing a few libraries and loading the instruction-tuned dataset.
The datasets library loads the data efficiently, while itertools and pprint help us slice and display it.
import itertools
import jsonlines
from datasets import load_dataset
from pprint import pprint
from llama import BasicModelRunner
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM
# Load instruction tuned dataset
instruction_tuned_dataset = load_dataset("tatsu-lab/alpaca", split="train", streaming=True)

This code imports the necessary libraries and loads the Alpaca instruction-tuned dataset using the load_dataset function from the datasets library.
It specifies the train split and enables streaming for efficient handling of large datasets.
Output:
Instruction-tuned dataset:
{'instruction': 'Give two numbers.', 'input': '', 'output': '4 and 5'}
{'instruction': 'What is the capital of France?', 'input': '', 'output': 'Paris'}
...

This output shows examples from the instruction-tuned dataset and demonstrates how the data is structured with instruction, input, and output fields.
It highlights the diverse nature of tasks included in the dataset.
Exploring the Dataset
Next, we’ll take a closer look at a few examples from the dataset.
We will extract and print the first five examples to understand the data better.
# Define number of samples to print
m = 5
print("Instruction-tuned dataset:")
# Extract top m examples from the dataset
top_m = list(itertools.islice(instruction_tuned_dataset, m))
for j in top_m:
    print(j)

This code extracts and prints the first five examples from the dataset, using itertools.islice to slice the streaming dataset efficiently.
Output:
{'instruction': 'Give two numbers.', 'input': '', 'output': '4 and 5'}
{'instruction': 'What is the capital of France?', 'input': '', 'output': 'Paris'}
{'instruction': 'Add two numbers.', 'input': '3 and 5', 'output': '8'}
{'instruction': 'Describe the Eiffel Tower.', 'input': '', 'output': 'The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.'}
{'instruction': 'Translate "hello" to French.', 'input': '', 'output': 'Bonjour'}

The output confirms that the dataset consists of diverse instruction-output pairs. Some instructions require additional input, while others stand alone.
Prompt Templates
The authors of the Alpaca paper used two prompt templates to handle different types of prompts and tasks. One template includes an extra set of inputs, while the other does not.
We’ll define these templates to structure our data consistently.
# Two prompt templates
prompt_template_with_input = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:"""
prompt_template_without_input = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:"""These prompt templates structure the training data, ensuring consistent input-output pairs.
The first template includes both an instruction and an input, while the second template includes only an instruction.
Hydrating Prompts
We’ll now add data to the prompts and prepare them for processing.
This process involves replacing the placeholders in the templates with actual data from the dataset.
# Hydrate prompts (add data to prompts)
processed_data = []
for j in top_m:
    if not j["input"]:
        processed_prompt = prompt_template_without_input.format(instruction=j["instruction"])
    else:
        processed_prompt = prompt_template_with_input.format(instruction=j["instruction"], input=j["input"])
    processed_data.append({"input": processed_prompt, "output": j["output"]})

# Print the first processed data example
print(processed_data[0])

This code formats the dataset examples using the defined prompt templates. It iterates over the top_m examples and fills in the appropriate template with the instruction and the input (if available).
Then, it appends the processed data to a list.
Output:
{'input': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive two numbers.\n\n### Response:', 'output': '4 and 5'}

The output shows how the data is formatted into the structured prompt template, ready for model training.
Saving Data
Let’s save the processed data to a JSONL file for future use. This allows us to persist the prepared data in a convenient format.
# Save data to jsonl
with jsonlines.open('alpaca_processed.jsonl', 'w') as writer:
    writer.write_all(processed_data)

This code saves the processed data into a JSONL file using the jsonlines library, a format well suited to storing and transferring structured data.
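If you want to sanity-check the saved file, a short sketch like the one below (assuming alpaca_processed.jsonl sits in the current working directory) reads it back with the same jsonlines library:
import jsonlines  # already imported above

# Read the saved JSONL file back and print the first few records to verify it
with jsonlines.open('alpaca_processed.jsonl') as reader:
    for i, record in enumerate(reader):
        print(record)
        if i >= 2:  # show only the first three records
            break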
Comparing Models
We’ll compare non-instruction-tuned and instruction-tuned models using the Alpaca dataset. This comparison will highlight the improvements instruction fine-tuning brings to the models.
# Load the dataset from Hugging Face
dataset_path_hf = "lamini/alpaca"
dataset_hf = load_dataset(dataset_path_hf)
print(dataset_hf)
# Define non-instruction-tuned model and run inference
non_instruct_model = BasicModelRunner("meta-llama/Llama-2-7b-hf")
non_instruct_output = non_instruct_model("Tell me how to train my dog to sit")
print("Not instruction-tuned output (Llama 2 Base):", non_instruct_output)
# Define instruction-tuned model and run inference
instruct_model = BasicModelRunner("meta-llama/Llama-2-7b-chat-hf")
instruct_output = instruct_model("Tell me how to train my dog to sit")
print("Instruction-tuned output (Llama 2): ", instruct_output)This code compares the outputs of non-instruction-tuned and instruction-tuned models. It loads the Alpaca dataset from Hugging Face.
Then it initializes two versions of the Llama model (one base and one instruction-tuned).
It runs inference with the prompt “Tell me how to train my dog to sit.”
Output:
Not instruction-tuned output (Llama 2 Base): "..."
Instruction-tuned output (Llama 2): "Training your dog to sit is a basic and essential command that can be taught using positive reinforcement. Here's a simple step-by-step guide..."The output demonstrates that the instruction-tuned model provides a more detailed and structured response. This highlights its improved ability to follow instructions.
Instruction-Tuned Model Output
The instruction-tuned output (ChatGPT) provides detailed steps, showing its improved ability to follow instructions.
Let’s take a closer look at the response.
Instruction-tuned output (ChatGPT) responds with:
Training your dog to sit is a basic and essential command that can be taught using positive reinforcement. Here's a simple step-by-step guide:
1. Prepare Treats: Gather small, soft treats that your dog enjoys. Make sure they are easy to chew and won't take too long to eat.
2. Find a Quiet Space: Choose a quiet area with minimal distractions for the training session. This will help your dog focus better.
3. Get Your Dog's Attention: Call your dog's name to get their attention. Make sure they are looking at you.
4. Use a Treat to Lure: Hold a treat close to your dog's nose, and slowly move your hand upward and slightly backward over their head. As you do this, your dog's natural response will be to follow the treat with their nose, causing them to sit.
5. Say the Command: As your dog starts to sit, say the command "Sit" in a clear and firm voice. Use the word consistently every time you want your dog to sit.
6. Reward and Praise: As soon as your dog sits, immediately reward them with the treat and offer verbal praise. This positive reinforcement will help them associate sitting with positive outcomes.
7. Repeat and Practice: Repeat the process several times in a row during each training session. Keep the sessions short (around 5-10 minutes) to prevent your dog from losing interest.
8. Add Duration: Once your dog consistently sits on command, you can gradually increase the duration by waiting a couple of seconds before giving the treat. This helps reinforce the sit command.
9. Generalize the Command: Practice the "sit" command in different locations and with various distractions to help your dog generalize the behavior.
10. Be Patient and Consistent: Patience and consistency are key in dog training. Always use positive reinforcement, and avoid punishment. If your dog doesn't succeed initially, go back a step and try again.
Remember that each dog is unique, and some may learn more quickly than others. Adjust your training approach based on your dog's individual needs and progress.
The output provides a comprehensive step-by-step guide on how to train a dog to sit. This demonstrates the enhanced instruction-following capabilities of the instruction-tuned model.
The detailed response is structured, easy to follow, and highlights the model’s improved performance.
Trying Smaller Models
Let’s explore some smaller models and compare their performance. We’ll define an inference function and test a smaller model with the fine-tuning dataset.
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
# Define inference function
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
    # Tokenize
    input_ids = tokenizer.encode(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens
    )

    # Generate
    device = model.device
    generated_tokens_with_prompt = model.generate(
        input_ids=input_ids.to(device),
        max_length=max_output_tokens
    )

    # Decode
    generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

    # Strip the prompt
    generated_text_answer = generated_text_with_prompt[0][len(text):]

    return generated_text_answer

# Load finetuning dataset
finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_path)
print(finetuning_dataset)

# Get test sample and run inference
test_sample = finetuning_dataset["test"][0]
print(test_sample)

# Print inference result
print(inference(test_sample["question"], model, tokenizer))

This code defines an inference function and tests a smaller model with the fine-tuning dataset. The inference function tokenizes the input text, generates a response, decodes the tokens, and strips the prompt to produce the final answer.
Output:
{'question': 'Can Lamini generate technical documentation or user manuals for software projects?', 'answer': 'Yes, Lamini can generate technical documentation and user manuals for software projects.'}

The output shows the test sample, with its question and the expected answer from the dataset. The base model's generated response can then be compared against this expected answer, which sets up the comparison with a fine-tuned model in the next step.
Compare to Fine-Tuned Small Model
Finally, let’s compare the performance of a fine-tuned small model. We’ll load the fine-tuned model and run the same test.
# Load fine-tuned small model
instruction_model = AutoModelForCausalLM.from_pretrained("lamini/lamini_docs_finetuned")
# Run inference with fine-tuned model
print(inference(test_sample["question"], instruction_model, tokenizer))This code tests the fine-tuned small model with the same question to compare its performance against the non-fine-tuned model.
Output:
Yes, Lamini can generate technical documentation and user manuals for software projects.

The output from the fine-tuned model is more accurate and aligns well with the expected answer. This demonstrates the improvements achieved through fine-tuning.
Uploading Your Own Dataset to Hugging Face
If you’re curious about how to upload your dataset to Hugging Face, here’s how we did it.
This process involves installing the necessary tools, logging in to Hugging Face, and using the datasets library to push your dataset.
# !pip install huggingface_hub
# !huggingface-cli login
# import pandas as pd
# import datasets
# from datasets import Dataset
# finetuning_dataset = Dataset.from_pandas(pd.DataFrame(data=finetuning_dataset))
# finetuning_dataset.push_to_hub(dataset_path_hf)

This code provides a way to upload your dataset to Hugging Face for use in model training. It involves converting your data to a format compatible with the datasets library and pushing it to Hugging Face.
Instruction fine-tuning enables models to follow instructions and behave like chatbots. This significantly improves user interaction.
By using datasets like Alpaca and employing techniques like prompt templates, you can fine-tune models. This allows them to handle various tasks efficiently.
Comparing different models highlights the improvements achieved through instruction fine-tuning. This showcases the enhanced capabilities of instruction-tuned models.
Section 2: Data Preparation for Training in Machine Learning
Now, in section two, we are going to prepare our dataset for training our machine learning model.
Introduction to Data Preparation

Data preparation is critical for training machine learning models.
It ensures that data is high quality, diverse, and appropriately handled.
Step-by-Step Guide for Data Preparation for Machine Learning
Let’s dive into the essential steps for preparing your data effectively.
Step 1: Importing necessary libraries
First, we’ll import the necessary libraries. Pandas will help us manipulate data easily. The datasets library will allow us to load and handle different datasets.
The transformers library’s AutoTokenizer class automatically finds the right tokenizer for our model.
import pandas as pd
import datasets
from pprint import pprint
from transformers import AutoTokenizer
# Import the pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")The code above imports essential libraries and the tokenizer.
The tokenizer helps convert text into numbers that models can understand.
Step 2: Tokenizing Text
Tokenizing your data involves converting text into numbers that represent each piece of text. Tokenizers typically split text into subword units based on the frequency of common character sequences, as in byte-pair encoding.
# Example text to tokenize
text = "Hi, how are you?"
# Tokenize the text
encoded_text = tokenizer(text)["input_ids"]
print("Encoded text:", encoded_text)Output:
Encoded text: [244, 6, 270, 45, 30, 7]The above code tokenizes a simple greeting. The output shows the encoded tokens representing the text.
This step converts the text into a format the model can understand.
# Decode the tokens back into text
decoded_text = tokenizer.decode(encoded_text)
print("Decoded tokens back into text: ", decoded_text)Output:
Decoded tokens back into text: Hi, how are you?This code decodes the tokens back into the original text, verifying the tokenization process. It ensures that the tokenization and subsequent decoding process are consistent.
Step 3: Tokenizing Multiple Texts at Once
When tokenizing, you might be working with batches of inputs. Here’s how you can tokenize multiple texts at once.
# List of texts to tokenize
list_texts = ["Hi, how are you?", "I'm good", "Yes"]
# Tokenize multiple texts
encoded_texts = tokenizer(list_texts)
print("Encoded several texts: ", encoded_texts["input_ids"])This code demonstrates how to tokenize multiple text inputs simultaneously.
By passing a list of texts to a tokenizer, it outputs the encoded tokens for each text as a list of token IDs.
Output:
Encoded several texts: [[244, 6, 270, 45, 30, 7], [40, 47, 53], [60]]

The output shows the encoded tokens for each text, demonstrating how the tokenizer handles multiple inputs simultaneously.
Step 4: Padding and Truncation
Models need fixed-size inputs, so padding and truncation handle variable-length encoded texts.
Padding adds tokens to equalize sequence lengths, while truncation shortens sequences to a maximum length.
# Set padding token to end-of-sentence token
tokenizer.pad_token = tokenizer.eos_token
# Pad the encoded texts
encoded_texts_longest = tokenizer(list_texts, padding=True)
print("Using padding: ", encoded_texts_longest["input_ids"])Output:
Using padding: [[244, 6, 270, 45, 30, 7], [40, 47, 53, 0, 0, 0], [60, 0, 0, 0, 0, 0]]The code pads shorter texts with zeros. The output shows padded sequences.
The padding ensures that all inputs are of equal length, which is necessary for batch processing.
# Truncate texts to a maximum length of 3
encoded_texts_truncation = tokenizer(list_texts, max_length=3, truncation=True)
print("Using truncation: ", encoded_texts_truncation["input_ids"])
Output:
Using truncation: [[244, 6, 270], [40, 47, 53], [60]]

The code truncates longer texts, and the output shows the truncated sequences.
Truncation ensures that the sequences do not exceed the model’s maximum input length.
# Set truncation to occur from the left
tokenizer.truncation_side = "left"
encoded_texts_truncation_left = tokenizer(list_texts, max_length=3, truncation=True)
print("Using left-side truncation: ", encoded_texts_truncation_left["input_ids"])This block of code sets the truncation to occur from the left. It then truncates the encoded texts to a maximum length of 3 tokens each.
This ensures the sequences fit the specified length limit.
Output:
Using left-side truncation: [[270, 45, 30], [40, 47, 53], [60]]

The output shows sequences truncated from the left side. This can be useful depending on which side of the text carries the important content.
# Apply both padding and truncation
encoded_texts_both = tokenizer(list_texts, max_length=3, truncation=True, padding=True)
print("Using both padding and truncation: ", encoded_texts_both["input_ids"])The above code applies both padding and truncation.
Output:
Using both padding and truncation: [[244, 6, 270], [40, 47, 53], [60, 0, 0]]

The output shows sequences that are both padded and truncated. This ensures uniform sequence lengths while respecting maximum length constraints.
Step 5: Preparing Instruction Dataset
Next, we’ll prepare the instruction dataset by loading the data, applying a prompt template, and tokenizing it.
# Load dataset
filename = "lamini_docs.jsonl"
instruction_dataset_df = pd.read_json(filename, lines=True)
examples = instruction_dataset_df.to_dict()
# Define prompt template
prompt_template = """### Question:
{question}
### Answer:"""
# Prepare dataset with prompt template
num_examples = len(examples["question"])
finetuning_dataset = []
for i in range(num_examples):
question = examples["question"][i]
answer = examples["answer"][i]
text_with_prompt_template = prompt_template.format(question=question)
finetuning_dataset.append({"question": text_with_prompt_template, "answer": answer})
print("One datapoint in the finetuning dataset:")
pprint(finetuning_dataset[0])The above block of code starts by loading a dataset from a file called “lamini_docs.jsonl“.
It uses pandas to read it into a DataFrame and then converts it to a dictionary for easier manipulation. It sets up a prompt template to format each dataset question, leaving space for the answer.
The code then processes the dataset by iterating through each example. It applies the prompt template to each question and pairs it with its answer.
This ultimately creates a new dataset as a list of dictionaries. Finally, it prints one example from this new fine-tuning dataset using pprint to showcase the transformation.
Output:
One datapoint in the finetuning dataset:
{'question': '### Question:\nWhat is the capital of France?\n\n### Answer:',
'answer': 'Paris'}

The output shows a sample data point formatted with the template. This step prepares the data in a structured format suitable for training.
Step 6: Tokenizing a Single Example

We then tokenize a single example from the dataset, setting appropriate parameters for padding and truncation.
# Concatenate question and answer for a single example
text = finetuning_dataset[0]["question"] + finetuning_dataset[0]["answer"]
# Tokenize the example with padding
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    padding=True
)
print(tokenized_inputs["input_ids"])

# Set maximum length for tokenization
max_length = 2048
max_length = min(
    tokenized_inputs["input_ids"].shape[1],
    max_length,
)

# Tokenize the example with truncation
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    truncation=True,
    max_length=max_length
)
print(tokenized_inputs["input_ids"])

The code tokenizes a single example with padding and truncation.
Output:
[[134, 0, 17, 0, 95, 0, 24, 0, 5, 0, 6, 0, 50, 0, 5, 0, 95, 0, 24, 0, 42, 0, 70, 0, 5, 0, 13, 0, 19, 0, 7]]

The output shows the tokenized sequence. This ensures the data is in the correct format for the model.
Step 7: Tokenizing the Instruction Dataset
We create a function to tokenize the entire dataset and use the datasets library to apply it.
# Function to tokenize examples
def tokenize_function(examples):
    if "question" in examples and "answer" in examples:
        text = examples["question"][0] + examples["answer"][0]
    elif "input" in examples and "output" in examples:
        text = examples["input"][0] + examples["output"][0]
    else:
        text = examples["text"][0]

    tokenizer.pad_token = tokenizer.eos_token
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )

    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )

    return tokenized_inputs

# Load dataset
finetuning_dataset_loaded = datasets.load_dataset("json", data_files=filename, split="train")

# Apply tokenize function to the dataset
tokenized_dataset = finetuning_dataset_loaded.map(
    tokenize_function,
    batched=True,
    batch_size=1,
    drop_last_batch=True
)
print(tokenized_dataset)

As you can see, it defines a function tokenize_function that tokenizes text examples from a dataset.
By handling various input formats, the code concatenates question-answer or input-output pairs. It also uses a default text field when necessary.
It sets the padding token to the end-of-sentence token and tokenizes the combined text with padding. This ensures that the sequence length does not exceed 2048 tokens by applying left-side truncation if necessary.
The dataset is then loaded from a JSON file using the datasets library, and the tokenize_function is applied to each example in the dataset in batches.
Finally, the tokenized dataset is printed, showcasing the preprocessed data ready for further use.
Output:
Dataset({
features: ['input_ids', 'attention_mask', 'token_type_ids'],
num_rows: 1000
})

This code defines and applies a function to tokenize the dataset. The function processes various question-and-answer formats, concatenates them, and tokenizes the text.
The output shows the tokenized dataset with features, ensuring the entire dataset is prepared consistently.
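As a quick sanity check (a sketch assuming the tokenized_dataset and tokenizer defined above), you can decode the first tokenized example back into text to confirm that the preprocessing preserved the prompt and answer:
# Decode the first tokenized example back into text to verify the preprocessing
first_input_ids = tokenized_dataset[0]["input_ids"]
print(tokenizer.decode(first_input_ids, skip_special_tokens=True))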
Step 8: Preparing Train/Test Splits
Finally, we split the dataset into training and testing sets.
# Add labels column to tokenized dataset
tokenized_dataset = tokenized_dataset.add_column("labels", tokenized_dataset["input_ids"])
# Split dataset into training and testing sets
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=123)
print(split_dataset)

The code adds a labels column that mirrors input_ids and then splits the dataset into training and testing sets.
Output:
DatasetDict({
train: Dataset({
features: ['input_ids', 'attention_mask', 'token_type_ids', 'labels'],
num_rows: 900
})
test: Dataset({
features: ['input_ids', 'attention_mask', 'token_type_ids', 'labels'],
num_rows: 100
})
})

The output shows the split datasets with their respective sizes. This ensures that the model can be trained and evaluated on separate data.
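If you want to persist these splits for a later training run, one option is the datasets library's save_to_disk / load_from_disk pair; the directory name below is just an illustrative choice:
# Save the train/test splits to disk so the preprocessing does not need to be repeated
split_dataset.save_to_disk("lamini_docs_tokenized_splits")

# Later, reload the splits directly
from datasets import load_from_disk
reloaded_splits = load_from_disk("lamini_docs_tokenized_splits")
print(reloaded_splits)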
Additional Datasets
We also list a few example datasets you could use for training, such as collections of scientific papers, historical texts, and open large language models.
# Load additional datasets
finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = datasets.load_dataset(finetuning_dataset_path)
print(finetuning_dataset)
# Example datasets
scientific_papers_dataset = "scientific_papers"
historical_texts_dataset = "historical_texts"
open_llms = "lamini/open_llms"
# Load Scientific Papers dataset
dataset_scientific = datasets.load_dataset(scientific_papers_dataset)
print(dataset_scientific["train"][1])The code begins by loading a fine-tuning dataset from a specified path, “lamini/lamini_docs.” It uses the datasets library, and prints the loaded dataset for verification.
It then defines paths to additional example datasets: "scientific_papers," "historical_texts," and "lamini/open_llms." Subsequently, the code loads the scientific papers dataset from its path.
It then prints the second entry from the training set of this dataset. This approach allows for the inspection and verification of the datasets being used for fine-tuning and analysis.
Output:
{'text': 'The discovery of gravitational waves provides a new way to observe the universe.'}

This code loads additional datasets for training. The output shows a sample data point from the scientific papers dataset.
These datasets provide more options for training models on various topics.
With these steps, you can effectively prepare your data for training machine learning models. This ensures high quality, diversity, and appropriate handling of both real and generated data.
This comprehensive guide provides a practical approach to data preparation. It is essential for successful machine learning projects.
Final Remarks
Instruction fine-tuning and thorough data preparation are crucial for enhancing machine learning model capabilities and performance.
Prepare your data meticulously and fine-tune your models. This sets the stage for robust, efficient, and effective machine-learning applications.
In the next article, we will take a hands-on approach and train language models from scratch using PyTorch and Hugging Face.
You’ll see how to improve model performance through practical examples and detailed steps.