This is the second article in our series on LLMOps. Today, we’re diving into data preparation for your text-based LLMOps workflow. Get ready to wrangle some text data!
The last article explored setting up the foundation for your LLMOps journey. In this article, we’ll focus on how to tackle large datasets stored in a data warehouse, specifically using BigQuery. Buckle up for some SQL magic!
Why BigQuery for LLMOps?
BigQuery is a fantastic choice for data warehousing because it lets you handle massive datasets efficiently. It’s particularly useful for LLMOps since we often deal with mountains of text data. Plus, BigQuery uses familiar SQL queries, making data preparation a breeze for those already comfortable with SQL.
Hold on, Isn’t Pandas Better?
While Pandas is excellent for working with in-memory data on your local machine, BigQuery shines when your data lives in a warehouse. BigQuery lets you process and filter your data directly in the warehouse, saving you from downloading the entire dataset, which is a real time-saver (and memory-saver) for those enormous LLMOps datasets.

Exploring Your Data Warehouse with SQL
Alright, let’s get our hands dirty! We’ll use Stack Overflow’s public dataset, a goldmine of questions and answers, to demonstrate how to explore and prepare your data.
The first step is to identify the data you’ll be using. Here’s a code snippet to query BigQuery and list all the available tables in the Stack Overflow dataset:
# Import libraries
from google.cloud import bigquery

# Replace with your project ID; the client picks up your application
# default credentials (e.g. from `gcloud auth application-default login`)
project_id = "your-project-id"

# Initialize BigQuery client
client = bigquery.Client(project=project_id)

# Write your SQL query to list the tables in the Stack Overflow public dataset
query = """
SELECT table_name
FROM `bigquery-public-data.stackoverflow.INFORMATION_SCHEMA.TABLES`
"""

# Run the query and print the results
job = client.query(query)
for row in job.result():
    print(row.table_name)
This code snippet will print out all the tables available in the dataset. Now you can see which tables have the data you need!
posts_answers
users
posts_orphaned_tag_wiki
posts_tag_wiki
stackoverflow_posts
posts_questions
comments
posts_tag_wiki_excerpt
posts_wiki_placeholder
posts_privilege_wiki
post_history
badges
post_links
tags
votes
posts_moderator_nomination
Fetching and Visualizing Data
Let’s say you want to grab some data from the posts_questions table and see what it looks like. Here’s how you can use a SQL query to retrieve the data and inspect it in a Pandas DataFrame:
# Write your SQL query to select data
query = """
SELECT *
FROM `bigquery-public-data.stackoverflow.posts_questions`
LIMIT 10
"""

# Run the query and load the results into a Pandas DataFrame
df = client.query(query).result().to_dataframe()
print(df.head())
This code snippet fetches the first 10 rows from the posts_questions table and displays them with df.head(). You can raise the LIMIT if you want to see more rows, but keep in mind that a local notebook can run into memory limits if you go overboard!
Big Data? No Problem!
When you’re working with truly massive datasets, BigQuery is your best friend. Downloading the entire dataset to your local machine can quickly lead to memory woes. Thankfully, BigQuery allows you to process and filter your data directly in the warehouse using SQL.
For instance, if you want to get all the questions and answers for fine-tuning your model, you can use a WHERE clause to filter the data and retrieve only relevant information. This can significantly improve performance, especially when dealing with massive datasets.
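As a quick illustration (the cutoff date and tag here are arbitrary placeholders, and the column names follow the posts_questions schema), a WHERE clause can narrow the result set before any data leaves the warehouse:
# Filter in the warehouse: Python-tagged questions asked since 2020
# that already have an accepted answer
query = """
SELECT id, title, creation_date
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags LIKE '%python%'
  AND creation_date >= TIMESTAMP('2020-01-01')
  AND accepted_answer_id IS NOT NULL
LIMIT 100
"""
filtered_df = client.query(query).result().to_dataframe()
print(len(filtered_df))
Only the filtered rows are transferred to your notebook; the heavy lifting stays in BigQuery.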
Optimizing Your Queries
Optimizing your queries is crucial to save time and resources when working with large datasets. In the next section of this article, we’ll show you how to combine tables using JOIN clauses to get the specific question-and-answer pairs you need for fine-tuning. We’ll also discuss filtering your data based on criteria like timestamps and keywords.
Let’s explore crafting powerful SQL queries to wrangle your text data from BigQuery for your LLMOps projects!
Wrangling Text Data for LLMs: Extracting Questions and Answers
Now that we’ve explored the basics of using BigQuery, let’s focus on how to extract the questions and answers you’ll need to train your large language model (LLM).
Imagine you want to train your LLM on both questions and answers from Stack Overflow. BigQuery’s JOIN clause lets you combine data from multiple tables. Here’s an example combining the posts_questions and posts_answers tables:
SELECT p.title || ' ' || p.body AS text, a.body AS answer
FROM `bigquery-public-data.stackoverflow.posts_questions` p
JOIN `bigquery-public-data.stackoverflow.posts_answers` a
ON p.accepted_answer_id = a.id
WHERE p.tags LIKE '%python%'
LIMIT 10000;
This query achieves several things:
- Selects Columns: We select specific columns:
- p.title || ' ' || p.body combines the question title and body into a single text column.
- a.body selects the answer text.
- Joins Tables: We use JOIN to connect the posts_questions (p) and posts_answers (a) tables on the accepted_answer_id column. This ensures we get the accepted answer that goes with each question.
- Filters Data: We use WHERE to filter for questions tagged with “python” to focus our training data.
- Limits Results: We use LIMIT 10000 to retrieve only 10,000 examples to keep things efficient.
Remember to adjust the query based on your specific needs!
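To use these pairs in the Python steps that follow, you can run the same query through the BigQuery client from earlier and load the result into a DataFrame (a small sketch; df is the variable the rest of this article works with):
# Run the JOIN query and load the question-answer pairs into a DataFrame
qa_query = """
SELECT p.title || ' ' || p.body AS text, a.body AS answer
FROM `bigquery-public-data.stackoverflow.posts_questions` p
JOIN `bigquery-public-data.stackoverflow.posts_answers` a
  ON p.accepted_answer_id = a.id
WHERE p.tags LIKE '%python%'
LIMIT 10000
"""
df = client.query(qa_query).result().to_dataframe()
print(df.shape)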
Adding Instructions for Your LLM
LLMs benefit from clear instructions. Before splitting the data, we can prepend an instruction template to each question so the model has context for the question-and-answer pairs, and then divide the data into training and evaluation sets.
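Here’s a minimal sketch of one way to do that (the template wording is an assumption; the text and answer columns come from the query above):
# Prepend an instruction template so each training example carries
# explicit context about the task
INSTRUCTION = "Please answer the following Stack Overflow question about Python:\n\n"
df['text'] = INSTRUCTION + df['text']
print(df['text'].iloc[0][:200])  # inspect the first templated example
With the instruction in place, we can split the data into training and evaluation sets: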
from sklearn.model_selection import train_test_split
# Split the data into training (80%) and evaluation (20%) sets
train_data, eval_data = train_test_split(df, test_size=0.2, random_state=42)
This code splits the data into an 80% training set (train_data) and a 20% evaluation set (eval_data). The random_state parameter ensures reproducibility of your split.
Keep your training and evaluation set splits consistent across your experiments for a fair comparison of your LLM’s performance.
Saving Your Wrangled Data: Exporting for LLM Consumption
Once your data is prepared, it’s time to save it in a format your LLM can understand and devour during training. Here, we’ll explore two popular options:
- JSON Lines: A simple, human-readable format where each question-answer pair sits on its own line. Ideal for small to medium datasets.
- TFRecord: A binary format optimized for machine learning frameworks like TensorFlow. More efficient to read during training.
JSON Lines: Simple and Readable
JSON Lines (JSONL) is a human-readable format that stores each question-answer pair as a separate JSON object on a new line. Here’s an example:
{"text": "How do I iterate through a list in Python?", "answer": "You can use a for loop to iterate through each element in a list."}
{"text": "What is the difference between a list and a tuple in Python?", "answer": "Lists are mutable, while tuples are immutable."}
Pros:
- Easy to read and understand for humans.
- No complex tools required to create or read JSONL files.
Cons:
- Can be less efficient for large datasets due to its text-based nature.
- May not be ideal for all LLM frameworks.
Here’s how you can save your wrangled data as JSONL using Python:
# Import library
import json

# Open a file for writing
with open('my_llm_data.jsonl', 'w') as f:
    # Iterate through your data and write each question-answer pair
    # as a JSON object on its own line
    for index, row in df.iterrows():
        data = {'text': row['text'], 'answer': row['answer']}
        f.write(json.dumps(data) + '\n')
This code opens a file named my_llm_data.jsonl and iterates through your Pandas DataFrame (df), writing each question-answer pair as a JSON object on its own line.
TFRecords: Built for Speed (and TensorFlow)
TensorFlow Records (TFRecords) is a binary format specifically designed to efficiently store and retrieve machine learning data. It groups data into records and compresses them for faster training, especially when used with TensorFlow.
Pros:
- Highly efficient for storing and reading large datasets.
- Optimized for TensorFlow framework.
Cons:
- Binary format, not human-readable.
- Requires additional libraries to work with.
Here’s a glimpse of how you can save your data as TFRecords using TensorFlow; a short sketch follows the steps below:
- Import TensorFlow libraries.
- Create a TFRecord writer object.
- Iterate through your data and serialize each question-answer pair into a TFRecord.
- Write the TFRecord to disk.
Note: Working with TFRecords involves more complex code, but the efficiency gains can be significant for massive datasets.
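Here’s that sketch, assuming TensorFlow is installed and df still holds the text and answer columns from earlier:
# Import TensorFlow
import tensorflow as tf

def _bytes_feature(value):
    # Wrap a UTF-8 string as a TFRecord bytes feature
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.encode('utf-8')]))

# Create a TFRecord writer and serialize each question-answer pair
with tf.io.TFRecordWriter('my_llm_data.tfrecord') as writer:
    for _, row in df.iterrows():
        example = tf.train.Example(features=tf.train.Features(feature={
            'text': _bytes_feature(row['text']),
            'answer': _bytes_feature(row['answer']),
        }))
        writer.write(example.SerializeToString())
During training, tf.data.TFRecordDataset can read this file back and parse each record.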
Choosing the Right Format
The best format for you depends on your specific needs. Here’s a quick guide:
- For small to medium datasets and ease of use, JSON Lines is a great choice.
- For massive datasets and TensorFlow-based LLMs, TFRecords offer superior performance.
In the next part of this series, we’ll dive into the world of training your LLM, using the data we’ve wrangled and saved!
Versioning for Reproducibility: Keeping Your Data Tracked
Throughout this article, we’ve emphasized the importance of data preparation for LLMs. But data isn’t static – it can evolve. This is why versioning your data is crucial, especially when working with large datasets in data warehouses like BigQuery.
Versioning allows you to track which specific version of the data you used to train your LLM. This is essential for reproducibility – if you need to retrain your model or compare results across different experiments, you can ensure you’re using the exact same data each time.
Here are some strategies for data versioning:
- Timestamps in Filenames: Include a timestamp in your filenames when saving your wrangled data (JSON Lines or TFRecords). This provides a clear indication of when the data was extracted from BigQuery (see the short sketch after this list).
- Version Control Systems: Consider using a version control system (VCS) like Git to track changes to your data preparation scripts. This allows you to revert to previous versions if necessary.
- BigQuery Snapshots and Time Travel: BigQuery itself can help here; table snapshots give you a point-in-time copy of a table, and time travel lets you query a table as it existed within a recent window, so you can pin an export to a known state of the data.
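For the first strategy, a timestamped filename takes only a couple of lines (the filename prefix is just an example):
from datetime import datetime, timezone

# Build a timestamped filename so each export is uniquely identified
stamp = datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')
filename = f'stackoverflow_qa_{stamp}.jsonl'
print(filename)  # e.g. stackoverflow_qa_20250101_120000.jsonl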
Cloud Storage for Scalable Training
As we saw earlier, JSON Lines and TFRecords are excellent data formats for LLM training. However, storing genuinely massive datasets on your local machine quickly becomes impractical. This is where cloud storage services like Google Cloud Storage (GCS) come in.
By storing your data in GCS, you benefit from several advantages:
- Scalability: GCS scales elastically, meaning you can store vast amounts of data without worrying about storage limitations.
- Accessibility: Data stored in GCS is readily accessible from anywhere with an internet connection, making it ideal for distributed training environments where multiple machines collaborate on training the LLM.
- Integration with Vertex AI: Vertex AI, Google’s AI platform, integrates seamlessly with GCS, allowing you to directly train your LLM on data stored in your cloud storage bucket.
Here’s a high-level overview of the workflow:
- Prepare your data in BigQuery.
- Export your wrangled data to a JSON Lines or TFRecords file and store it in your GCS bucket (a short upload sketch follows this list).
- Train your LLM using Vertex AI, referencing the data in your GCS bucket.
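Here’s a minimal sketch of the upload step using the google-cloud-storage client (the bucket and object names are placeholders):
from google.cloud import storage

# Upload the exported training file to a GCS bucket
storage_client = storage.Client(project=project_id)
bucket = storage_client.bucket('your-llmops-bucket')
blob = bucket.blob('training-data/my_llm_data.jsonl')
blob.upload_from_filename('my_llm_data.jsonl')
print(f'Uploaded to gs://{bucket.name}/{blob.name}')
Vertex AI training jobs can then reference the file by its gs:// URI.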
BigQuery Best Practices for LLMOps
BigQuery is a powerful tool for data preparation in LLMOps workflows. Here are some best practices to keep in mind:
- Filter Data in BigQuery: Take advantage of BigQuery’s SQL capabilities to filter your data directly in the warehouse. This reduces the data you need to transfer for training, saving time and resources.
- Partitioning for Performance: Consider using partitioned tables in BigQuery. Partitioning allows BigQuery to efficiently query specific subsets of your data, especially for large datasets. Imagine you have a table with data for different years: partitioning by year lets you query a specific year much faster than scanning the entire table (see the sketch after this list).
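As a small sketch (the project, dataset, and table names are hypothetical), you can create a date-partitioned copy of the questions you care about directly from SQL:
# Create a date-partitioned table so later queries scan only the
# partitions they actually need
ddl = """
CREATE TABLE `your-project-id.llmops_dataset.python_questions`
PARTITION BY DATE(creation_date) AS
SELECT id, title, body, tags, creation_date
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags LIKE '%python%'
"""
client.query(ddl).result()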
By following these best practices, you can leverage BigQuery’s strengths to streamline your data preparation process for LLMOps.
Final Thoughts
We’ve explored the exciting world of data preparation for Large Language Model (LLM) workflows. We started by diving into BigQuery, Google’s serverless data warehouse, highlighting its efficiency in handling massive text datasets. We contrasted BigQuery with Pandas, emphasizing BigQuery’s advantage of in-warehouse data processing for LLMOps.
Next, we tackled the practicalities of data wrangling, using Stack Overflow’s public dataset as an example. We saw how to leverage SQL queries to explore, filter, and transform your data to meet your LLM’s needs.
To ensure training success, we discussed the importance of crafting clear instructions for your LLM and splitting your data into training and evaluation sets. Finally, we ventured into the world of data saving, exploring two popular options: JSON Lines, ideal for readability, and TFRecords, the champion for efficiency, especially with TensorFlow-based LLMs. We also explored considerations for data versioning and cloud storage using Google Cloud Storage (GCS) for scalability and accessibility during LLM training.
Stay tuned for my next article in this series. We will discuss LLMOps Automation and Orchestration with Pipelines.