Welcome to the second article of our series on mastering Reinforcement Learning from Human Feedback.
The first article provided an introduction to Reinforcement Learning from Human Feedback for large language models (LLMs). In this article, we will guide you through the process of tuning and evaluating your LLM using Reinforcement Learning from Human Feedback techniques.
In the previous article, we covered the fundamentals of Reinforcement Learning from Human Feedback, including how it aligns LLM outputs with human values by using preference datasets and reward models. We also walked through setting up the necessary environment, loading datasets, and configuring Google Cloud.
Now, we will build upon that foundation by compiling and running the Reinforcement Learning from Human Feedback pipeline on Google Cloud’s Vertex AI. This includes defining the pipeline job, configuring the training parameters, and executing the tuning process.
By the end of this article, you will have a comprehensive understanding of how to optimize your LLM to generate human-preferred responses and how to evaluate its performance.
Compiling the Pipeline
To begin with, we need to compile the Reinforcement Learning from Human Feedback pipeline. This involves importing the necessary components and defining the path to the YAML file that describes the pipeline.
# Import RLHF components (RLHF is currently in preview)
from google_cloud_pipeline_components.preview.llm import rlhf_pipeline
# Import the compiler from Kubeflow pipelines
from kfp import compiler
# Define the path to the YAML file
RLHF_PIPELINE_PKG_PATH = "rlhf_pipeline.yaml"
# Execute the compile function to compile the pipeline
compiler.Compiler().compile(
    pipeline_func=rlhf_pipeline,
    package_path=RLHF_PIPELINE_PKG_PATH
)
# Print the first lines of the YAML file to verify the pipeline
!head rlhf_pipeline.yaml
Defining the Vertex AI Pipeline Job
We will define the Vertex AI pipeline job by specifying the location of the training and evaluation data, choosing the foundation model, and calculating the number of training steps for both the reward model and reinforcement learning.
Define the Location of the Training and Evaluation Data
First, we specify the paths to the preference, prompt, and evaluation datasets stored in Google Cloud Storage.
parameter_values = {
    "preference_dataset": "gs://vertex-ai/generative-ai/rlhf/text_small/summarize_from_feedback_tfds/comparisons/train/*.jsonl",
    "prompt_dataset": "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/train/*.jsonl",
    "eval_dataset": "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/val/*.jsonl"
}
We define a dictionary parameter_values that holds the paths to the preference, prompt, and evaluation datasets. These datasets should be stored in the same Google Cloud Storage bucket for consistency.
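Each of these files is in JSON Lines format: one JSON object per line. As a quick illustration (the field names below are hypothetical, not necessarily the dataset's actual schema), a preference-style record can be parsed with the standard json module:

```python
import json

# A hypothetical preference record: a prompt, two candidate
# summaries, and the index of the human-preferred candidate
line = '{"input_text": "Summarize: ...", "candidate_0": "short", "candidate_1": "long", "choice": 0}'
record = json.loads(line)
print(record["choice"])  # 0
```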
Choose the Foundation Model to be Tuned
Next, we specify the foundation model to be tuned. In this case, we are tuning the Llama 2 model.
parameter_values["large_model_reference"] = "llama-2-7b"
Calculate the Number of Reward Model Training Steps
To determine the number of training steps for the reward model, we need to consider the size of the preference dataset and the batch size.
import math
# Preference dataset size
PREF_DATASET_SIZE = 3000
# Batch size
BATCH_SIZE = 64
# Calculate steps per epoch
REWARD_STEPS_PER_EPOCH = math.ceil(PREF_DATASET_SIZE / BATCH_SIZE)
# Number of epochs
REWARD_NUM_EPOCHS = 30
# Calculate total number of training steps
reward_model_train_steps = REWARD_STEPS_PER_EPOCH * REWARD_NUM_EPOCHS
parameter_values["reward_model_train_steps"] = reward_model_train_steps
Calculate the Number of Reinforcement Learning Training Steps
Similarly, we calculate the number of training steps for reinforcement learning based on the size of the prompt dataset.
# Prompt dataset size
PROMPT_DATASET_SIZE = 2000
# Calculate steps per epoch
RL_STEPS_PER_EPOCH = math.ceil(PROMPT_DATASET_SIZE / BATCH_SIZE)
# Number of epochs
RL_NUM_EPOCHS = 10
# Calculate total training steps
reinforcement_learning_train_steps = RL_STEPS_PER_EPOCH * RL_NUM_EPOCHS
parameter_values["reinforcement_learning_train_steps"] = reinforcement_learning_train_steps
Setting Up Google Cloud for the Pipeline
To run the pipeline on Vertex AI, you need to set up Google Cloud properly. This includes installing the necessary packages, authenticating, and initializing Vertex AI.

Ensure the Necessary Packages are Installed
pip install google-cloud-aiplatform
Authenticate and Initialize Vertex AI
Next, we authenticate with Google Cloud and initialize Vertex AI with the project ID, region, and credentials.
# Import the authentication helper (replace with your own authentication function)
from utils import authenticate
# Authenticate and retrieve credentials, project ID, and staging bucket
credentials, PROJECT_ID, STAGING_BUCKET = authenticate()
# RLHF pipeline is available in this region
REGION = "europe-west4"
# Import the aiplatform module
import google.cloud.aiplatform as aiplatform
# Initialize Vertex AI with your project and credentials
aiplatform.init(project=PROJECT_ID, location=REGION, credentials=credentials)
Run the Pipeline Job on Vertex AI
With the setup complete, we can create and run the pipeline job on Vertex AI.
# Look at the path for the YAML file
RLHF_PIPELINE_PKG_PATH
# Create the pipeline job
job = aiplatform.PipelineJob(
    display_name="tutorial-rlhf-tuning",
    pipeline_root=STAGING_BUCKET,
    template_path=RLHF_PIPELINE_PKG_PATH,
    parameter_values=parameter_values
)
# Run the pipeline job
job.run()
Fine-Tuning Code
The next step is to fine-tune the LLM using the compiled Reinforcement Learning from Human Feedback pipeline and predefined parameters. This involves setting up the necessary configuration for the training steps and ensuring the pipeline is executed correctly.
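Before re-running the compilation and setup below, it is worth sanity-checking the step arithmetic from the earlier sections. A minimal check, using the same dataset sizes, batch size, and epoch counts as above:

```python
import math

# Reward model: 3000 preference pairs, batch size 64, 30 epochs
reward_steps = math.ceil(3000 / 64) * 30  # ceil(46.875) = 47 steps per epoch
# Reinforcement learning: 2000 prompts, batch size 64, 10 epochs
rl_steps = math.ceil(2000 / 64) * 10      # ceil(31.25) = 32 steps per epoch

print(reward_steps, rl_steps)  # 1410 320
```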
Compile the Pipeline

To begin with, compile the Reinforcement Learning from Human Feedback pipeline and set up the necessary configurations.
# Import RLHF components
from google_cloud_pipeline_components.preview.llm import rlhf_pipeline
# Import the compiler from Kubeflow pipelines
from kfp import compiler
# Define the path to the YAML file
RLHF_PIPELINE_PKG_PATH = "rlhf_pipeline.yaml"
# Compile the pipeline
compiler.Compiler().compile(
    pipeline_func=rlhf_pipeline,
    package_path=RLHF_PIPELINE_PKG_PATH
)
Set Up Parameter Values
Define the parameter values for the pipeline, including the dataset paths, the foundation model, the training step counts, and the instruction text.
import math
# Define dataset paths
parameter_values = {
    "preference_dataset": "gs://vertex-ai/generative-ai/rlhf/text_small/summarize_from_feedback_tfds/comparisons/train/*.jsonl",
    "prompt_dataset": "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/train/*.jsonl",
    "eval_dataset": "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/val/*.jsonl",
    "large_model_reference": "llama-2-7b",
}
# Calculate the number of reward model training steps
PREF_DATASET_SIZE = 3000
BATCH_SIZE = 64
REWARD_STEPS_PER_EPOCH = math.ceil(PREF_DATASET_SIZE / BATCH_SIZE)
REWARD_NUM_EPOCHS = 30
parameter_values["reward_model_train_steps"] = REWARD_STEPS_PER_EPOCH * REWARD_NUM_EPOCHS
# Calculate the number of reinforcement learning training steps
PROMPT_DATASET_SIZE = 2000
RL_STEPS_PER_EPOCH = math.ceil(PROMPT_DATASET_SIZE / BATCH_SIZE)
RL_NUM_EPOCHS = 10
parameter_values["reinforcement_learning_train_steps"] = RL_STEPS_PER_EPOCH * RL_NUM_EPOCHS
# Additional parameters
# Learning rate multipliers for the reward model and reinforcement learning phases
parameter_values["reward_model_learning_rate_multiplier"] = 1.0
parameter_values["reinforcement_learning_rate_multiplier"] = 1.0
# KL coefficient: penalizes the tuned model for drifting too far from the base model
parameter_values["kl_coeff"] = 0.1
# Instruction prepended to each prompt
parameter_values["instruction"] = "Summarize the following in less than 75 words"
Run the Pipeline Job
With the configuration set, we create and run the pipeline job on Vertex AI.
import google.cloud.aiplatform as aiplatform
# Initialize Vertex AI with your project and credentials
aiplatform.init(project=PROJECT_ID, location=REGION, credentials=credentials)
# Create the pipeline job
job = aiplatform.PipelineJob(
    display_name="tutorial-rlhf-tuning",
    pipeline_root=STAGING_BUCKET,
    template_path=RLHF_PIPELINE_PKG_PATH,
    parameter_values=parameter_values
)
# Run the pipeline job
job.run()
Evaluating the Tuned Model
Now we will evaluate the performance of the tuned model using TensorBoard and compare the results with the untuned model. This involves setting up TensorBoard, exploring the results, and putting the data into a dataframe for analysis.
Install TensorBoard
If you are running this locally, you need to install TensorBoard.
pip install tensorboard
Explore Results with TensorBoard
To explore the results, we load TensorBoard and point it to the logs generated during training.
# Load TensorBoard extension
%load_ext tensorboard
# Set the port for TensorBoard
port = %env PORT1
# Launch TensorBoard to view reward model training logs
%tensorboard --logdir reward-logs --port $port --bind_all
# List the contents of the reward-logs directory
%ls reward-logs
# Launch TensorBoard to view reinforcement learning logs
port = %env PORT2
%tensorboard --logdir reinforcer-logs --port $port --bind_all
# Launch TensorBoard to view full dataset reinforcement learning logs
port = %env PORT3
%tensorboard --logdir reinforcer-fulldata-logs --port $port --bind_all
Load and Print Evaluation Results
We load the evaluation results produced by both the tuned and untuned models and print the results for comparison.
import json
# Load evaluation results for the tuned model
eval_tuned_path = 'eval_results_tuned.jsonl'
eval_data_tuned = []
with open(eval_tuned_path) as f:
    for line in f:
        eval_data_tuned.append(json.loads(line))
# Function to print dictionary in a readable format
def print_d(d):
    for key, value in d.items():
        print(f"Key: {key}\nValue: {value}\n")
# Print results from the tuned model
print_d(eval_data_tuned[0])
# Load evaluation results for the untuned model
eval_untuned_path = 'eval_results_untuned.jsonl'
eval_data_untuned = []
with open(eval_untuned_path) as f:
    for line in f:
        eval_data_untuned.append(json.loads(line))
# Print results from the untuned model
print_d(eval_data_untuned[0])
Compare Results in a DataFrame
Finally, we compare the results side-by-side in a dataframe for better visualization.
import pandas as pd
# Extract all the prompts
prompts = [sample['inputs']['inputs_pretokenized'] for sample in eval_data_tuned]
# Extract completions from the untuned model
untuned_completions = [sample['prediction'] for sample in eval_data_untuned]
# Extract completions from the tuned model
tuned_completions = [sample['prediction'] for sample in eval_data_tuned]
# Create a dataframe to compare results
results = pd.DataFrame(
    data={
        'prompt': prompts,
        'base_model': untuned_completions,
        'tuned_model': tuned_completions
    }
)
# Set display option to show full column width
pd.set_option('display.max_colwidth', None)
# Print the results
print(results)
Final Thoughts
In this article, we walked through the process of tuning and evaluating an LLM using Reinforcement Learning from Human Feedback techniques.
We covered compiling the pipeline, defining the Vertex AI pipeline job, setting up Google Cloud, running the pipeline job, fine-tuning the model, and evaluating the tuned model.
By following these steps, you can optimize your LLM to generate human-preferred responses and effectively evaluate its performance. This process is essential for ensuring that models align with human values and preferences, making them more useful and reliable.
This concludes our guide on tuning and evaluating LLMs with Reinforcement Learning from Human Feedback.
Stay tuned for more articles in this series, where we will delve deeper into advanced techniques and practical applications of Reinforcement Learning from Human Feedback.