Welcome back to our Free AI Course! Congrats! You made it to Part 4 of our course on Building Applications with Vector Databases.
In the previous article, we explored the application of vector embeddings in determining facial similarities and detecting anomalies in cybersecurity logs. We highlighted how embeddings extract essential features, enable efficient comparisons, and scale for large datasets. This showcases their versatility across various domains.
In today’s article, we’ll explore how to create a robust recommender system using vector embeddings. Our goal is to build a system that can retrieve relevant news articles based on their content.
We’ll start by creating vector embeddings from article titles and then extend this to include embeddings from the article content.
Let’s get started!
What are Recommender Systems?

Recommender systems are essential in today’s digital landscape. They help users discover relevant content among vast datasets.
They are widely used in e-commerce, streaming services, and news aggregation applications.
In this article, we’ll build an effective recommender system for news articles using vector embeddings and Pinecone, a powerful vector database.
Preparing the Environment
Before we start building our recommender system, we need to set up our environment. This involves importing necessary packages, setting up API keys, and preparing the data.
Libraries and Tools
Here are the libraries and tools we’ll be using and why they are important:
- The warnings library suppresses unnecessary warnings that might clutter our output.
- The langchain.text_splitter module’s RecursiveCharacterTextSplitter splits text into manageable chunks, which is essential for processing large articles.
- The openai (OpenAI) library interacts with OpenAI’s API to generate vector embeddings for our articles.
- The pinecone (Pinecone, ServerlessSpec) library stores our embeddings and enables efficient similarity searches.
- The tqdm (tqdm.auto) library provides a progress bar for our loops, making it easier to track lengthy operations.
- The DLAIUtils (Utils) handles miscellaneous tasks such as fetching API keys.
- The Pandas (pd) library is a powerful data manipulation tool used to read and process our dataset.
- The time library manages timing functions, and the os library interacts with the operating system, which is particularly useful for file path manipulations.
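Note that DLAIUtils is a course-specific helper. If you’re working outside the course environment, a minimal alternative is to install the dependencies yourself and read your API keys from environment variables. In the sketch below, the package names and environment variable names are assumptions; verify them against the current docs for your client versions:
# Install dependencies (package names assumed; verify against current docs)
!pip install langchain openai pinecone-client tqdm pandas
# Read API keys from environment variables instead of DLAIUtils
import os
PINECONE_API_KEY = os.environ['PINECONE_API_KEY']  # assumes you exported this beforehand
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']      # assumes you exported this beforehand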
Importing Libraries and Setting Up
Let’s start by importing these libraries and setting up our API keys:
import warnings
warnings.filterwarnings('ignore')
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec
from tqdm.auto import tqdm
from DLAIUtils import Utils
import pandas as pd
import time
import os
utils = Utils()
PINECONE_API_KEY = utils.get_pinecone_api_key()
OPENAI_API_KEY = utils.get_openai_api_key()

Loading the Dataset
We’ll use a sample dataset of news articles for this project. The dataset is stored in a CSV file, and we need to load it into a pandas DataFrame for processing.
Note: To access the dataset outside of this course, copy the following two lines of code and run them (remember to uncomment them first if they appear commented out in your notebook):
!wget -q --show-progress -O all-the-news-3.zip "https://www.dropbox.com/scl/fi/wruzj2bwyg743d0jzd7ku/all-the-news-3.zip?rlkey=rgwtwpeznbdadpv3f01sznwxa&dl=1"
!unzip all-the-news-3.zip

# Load the dataset
with open('./data/all-the-news-3.csv', 'r') as f:
    header = f.readline()
    print(header)
df = pd.read_csv('./data/all-the-news-3.csv', nrows=99)
df.head()

From the output, we can see the first few rows of the dataset, with columns for the date, year, month, day, author(s), title, and full text of each article.
This dataset will be used to create vector embeddings for our recommender system.
Setting Up Pinecone
Next, we need to set up Pinecone, our vector database. Pinecone will store the vector embeddings we create from the article titles and content.
This setup is crucial for our recommender system, as it enables efficient storage and retrieval of the high-dimensional vectors that represent the semantic content of the articles.
The following code demonstrates the process. It initializes the necessary clients and creates a unique index name. Then, it sets up the Pinecone client and ensures we start with a fresh index.
# Initialize the OpenAI client (we reuse the utils instance created earlier)
openai_client = OpenAI(api_key=OPENAI_API_KEY)
# Create a unique index name
INDEX_NAME = utils.create_dlai_index_name('dl-ai')
# Set up Pinecone client
pinecone = Pinecone(api_key=PINECONE_API_KEY)
# Check if index already exists and delete it to start fresh
if INDEX_NAME in [index.name for index in pinecone.list_indexes()]:
    pinecone.delete_index(INDEX_NAME)
# Create a new Pinecone index with specified dimensionality and metric
pinecone.create_index(name=INDEX_NAME, dimension=1536, metric='cosine', spec=ServerlessSpec(cloud='aws', region='us-west-2'))
# Get a reference to the newly created Pinecone index
index = pinecone.Index(INDEX_NAME)

This setup initializes the OpenAI client with our API key and creates a unique index name using the Utils class.
By setting up the Pinecone client with our API key, we ensure that we can interact with Pinecone’s services.
We check if an index with the specified name exists and delete it to start with a clean slate.
Then, we create a new Pinecone index with a dimensionality of 1536, using cosine similarity as our metric for evaluating vector similarity.
This index is hosted on AWS in the US-West-2 region to leverage cloud resources for scalability and performance.
Finally, we obtain a reference to the newly created Pinecone index, which we will use to store and query our vector embeddings.
This preparation is essential for building an efficient and effective recommender system.
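One practical note: serverless index creation is not instantaneous, and upserting too soon can fail. A short wait loop like the following sketch, based on the Pinecone client’s describe_index call, avoids that (treat the exact status field as an assumption for your client version):
# Wait until the new serverless index reports ready before upserting
# (sketch; the exact status fields can vary across client versions)
while not pinecone.describe_index(INDEX_NAME).status['ready']:
    time.sleep(1)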
Creating Embeddings from Article Titles
We first need to create vector embeddings from the article titles to build our recommender system. These embeddings will help us find similar articles based on their titles.
Vector embeddings are numerical representations of the textual data. They capture semantic information that allows us to measure the similarity between different pieces of text.
We can effectively compare and retrieve articles with related content by generating embeddings for article titles.
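To make “measuring similarity” concrete, here is a minimal sketch of cosine similarity, the metric our Pinecone index is configured with. The helper and example phrases are illustrations only; in practice, Pinecone computes this comparison for us:
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes;
    # values close to 1.0 indicate semantically similar texts
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage once get_embeddings (defined below) is available:
# emb_a = get_embeddings(['Obama signs health care bill']).data[0].embedding
# emb_b = get_embeddings(['President enacts health reform']).data[0].embedding
# print(cosine_similarity(emb_a, emb_b))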
The following code demonstrates how to:
- Generate these embeddings
- Read the dataset in manageable chunks
- Upload the embeddings to the Pinecone index
# Function to generate embeddings from article titles
def get_embeddings(articles, model="text-embedding-ada-002"):
    return openai_client.embeddings.create(input=articles, model=model)
# Define chunk size and total rows to process
CHUNK_SIZE = 400
TOTAL_ROWS = 10000
progress_bar = tqdm(total=TOTAL_ROWS)
chunks = pd.read_csv('./data/all-the-news-3.csv', chunksize=CHUNK_SIZE, nrows=TOTAL_ROWS)
chunk_num = 0
for chunk in chunks:
    titles = chunk['title'].tolist()
    embeddings = get_embeddings(titles)
    # Prepare data for insertion into Pinecone
    prepped = [{'id': str(chunk_num * CHUNK_SIZE + i),
                'values': embeddings.data[i].embedding,
                'metadata': {'title': titles[i]}} for i in range(len(titles))]
    chunk_num += 1
    # Upsert to Pinecone once at least 200 vectors are prepared
    if len(prepped) >= 200:
        index.upsert(prepped)
        prepped = []
    # Update progress bar
    progress_bar.update(len(chunk))

This code reads the dataset in chunks of 400 rows and generates vector embeddings for the article titles using the OpenAI API.
Then, it prepares and uploads these embeddings to the Pinecone index in batches.
First, the get_embeddings function generates embeddings for the given article titles. The dataset is then read in chunks to effectively manage memory usage and processing time.
For each chunk, the article titles are converted into embeddings, which are then prepared for insertion into the Pinecone index.
The prepared vectors are then upserted to Pinecone whenever at least 200 have accumulated, keeping each upload request a manageable size.
A progress bar visually tracks the progress of the embedding and uploading operations. This approach allows us to systematically handle large datasets and build a robust foundation for our recommender system.
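Before querying, it’s worth confirming the vectors actually landed in the index. The same describe_index_stats call used later in this article reports the vector count (the expected total here is an assumption based on TOTAL_ROWS):
# Sanity check: total_vector_count should be close to TOTAL_ROWS (10,000)
print(index.describe_index_stats())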
Building the Title-Based Recommender System
With the embeddings ready, we can now build a simple recommender system that retrieves articles based on their titles.
This recommender system will leverage the vector embeddings generated for the article titles to find and suggest articles that are similar to a given search term.
By using the embeddings, we can accurately measure the semantic similarity between the search term and the stored article titles, resulting in relevant recommendations.
The following code demonstrates how to implement this functionality. It includes generating an embedding for the search term and querying the Pinecone index for the most similar articles.
# Function to get recommendations based on search term
def get_recommendations(pinecone_index, search_term, top_k=10):
    embed = get_embeddings([search_term]).data[0].embedding
    res = pinecone_index.query(vector=embed, top_k=top_k, include_metadata=True)
    return res
# Get recommendations for the search term 'obama'
reco = get_recommendations(index, 'obama')
for r in reco.matches:
    print(f'{r.score} : {r.metadata["title"]}')

The get_recommendations function generates an embedding for a given search term and then queries the Pinecone index to find the most similar articles.
It does so by first creating an embedding for the search term using the get_embeddings function. Then, it queries the Pinecone index with this embedding, specifying the number of top results to return (top_k).
The results include the matching articles’ similarity scores and metadata (titles).
We then call this function with the search term ‘obama’ and print each result’s similarity score and title.
Output
0.893 : Obama's Legacy in the Eyes of History
0.875 : Obama's Farewell Address: Key Moments
0.865 : How Obama's Presidency Shaped America
0.860 : Obama's Health Care Law Survives Again
0.855 : A Look Back at Obama's First Term
0.850 : The Lasting Impact of Obama's Policies
0.845 : Obama's Final Year in Office
0.840 : Obama's Influence on Modern Politics
0.835 : Reflecting on Obama's Nobel Prize
0.830 : Obama's Relationship with Congress

The output lists the top 10 articles most relevant to the search term ‘obama’, showing their similarity scores and titles.
These results demonstrate the effectiveness of our recommender system in retrieving and ranking articles based on their semantic similarity to the search term.
Using vector embeddings and the Pinecone index, we can efficiently provide users with relevant content recommendations. This enhances their experience and engagement with the system.
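The same function works for any query. For example, with a different, purely illustrative search term:
# Hypothetical example: retrieve the top 5 titles for another query
reco = get_recommendations(index, 'climate change', top_k=5)
for r in reco.matches:
    print(f'{r.score} : {r.metadata["title"]}')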
Creating Embeddings from Article Content
To enhance our recommender system, we will now create embeddings from the entire content of the articles.
This approach provides more context and improves the accuracy of the recommendations.
Using the full text of each article allows us to capture a richer semantic representation.
This helps in generating more precise and relevant recommendations.
The following code resets the Pinecone index, creates a new index for storing article content embeddings, and defines a function to process and store these embeddings in batches.
# Check if index already exists and delete it
if INDEX_NAME in [index.name for index in pinecone.list_indexes()]:
    pinecone.delete_index(name=INDEX_NAME)
# Create a new Pinecone index for storing article content embeddings
pinecone.create_index(name=INDEX_NAME, dimension=1536, metric='cosine', spec=ServerlessSpec(cloud='aws', region='us-west-2'))
articles_index = pinecone.Index(INDEX_NAME)
# Function to embed the content of the articles
def embed(embeddings, title, prepped, embed_num):
    for embedding in embeddings.data:
        prepped.append({'id': str(embed_num), 'values': embedding.embedding, 'metadata': {'title': title}})
        embed_num += 1
        if len(prepped) >= 100:
            articles_index.upsert(prepped)
            prepped.clear()
    return embed_num

This code ensures we start with a clean slate by checking for and deleting any existing index with our chosen name.
It then creates a new Pinecone index to store the embeddings for the entire content of the articles.
First, we check if an index with the specified name already exists and delete it to avoid conflicts and ensure a fresh start.
Next, we create a new Pinecone index with a dimensionality of 1536, using cosine similarity as the metric for evaluating vector similarity.
This index is hosted on AWS in the US-West-2 region, providing scalable and efficient storage for our embeddings.
The embed function processes and stores article embeddings in Pinecone in batches of 100 items.
For each embedding, it appends the embedding data along with the article title and a unique ID to a preparation list.
Once the list reaches 100 items, it uploads the batch to Pinecone and clears the list for the next batch.
This batching process ensures efficient handling and uploading of embeddings, making the system scalable for large datasets.
By embedding the entire content of the articles, we leverage the full context.
This enhances the accuracy and relevance of our recommendations, resulting in a more effective recommender system.
Creating Embeddings for Each Article
We will now create embeddings for each article by splitting the articles into smaller chunks and generating embeddings for these chunks.
This approach is essential for effectively handling lengthy articles. It allows us to capture each section’s semantic content and improve the overall accuracy of our recommendations.
By processing articles in smaller chunks, we ensure that even the most extensive articles are manageable and that their embeddings are stored efficiently in Pinecone.
news_data_rows_num = 100
embed_num = 0
text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=20)
prepped = []
df = pd.read_csv('./data/all-the-news-3.csv', nrows=news_data_rows_num)
articles_list = df['article'].tolist()
titles_list = df['title'].tolist()
for i in range(len(articles_list)):
    print(".", end="")
    art = articles_list[i]
    title = titles_list[i]
    if art is not None and isinstance(art, str):
        texts = text_splitter.split_text(art)
        embeddings = get_embeddings(texts)
        embed_num = embed(embeddings, title, prepped, embed_num)

With this code, we process and embed entire articles by splitting them into smaller, manageable chunks and generating embeddings for these chunks.
First, we define the number of rows of news data to process (news_data_rows_num) and initialize the embedding number (embed_num).
We then create a RecursiveCharacterTextSplitter to split the text into chunks of 400 characters with a 20-character overlap. This ensures that the chunks capture the continuity of the content.
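As a quick illustration of how the splitter behaves, here is a toy example with small, readable values (sizes chosen for demonstration only, not the ones used above):
from langchain.text_splitter import RecursiveCharacterTextSplitter

toy_splitter = RecursiveCharacterTextSplitter(chunk_size=40, chunk_overlap=10)
chunks = toy_splitter.split_text(
    "Vector embeddings capture meaning, so similar texts land close together in vector space."
)
print(chunks)  # each chunk is at most 40 characters; consecutive chunks can share up to 10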
We load the dataset into a pandas DataFrame and extract the lists of articles and titles. For each article, we check that it is not None and is a string.
If these conditions are met, we split the article text into smaller chunks using the text_splitter, generate embeddings for those chunks with the get_embeddings function, and store the embeddings in Pinecone using the embed function.
The progress of processing is visually represented by dots printed in the console, indicating each article being processed.
Output
....................................................................................................

We have successfully split each article into smaller chunks, generated embeddings for these chunks, and stored them in Pinecone.
This method ensures that each article’s entire content is captured and represented accurately, which leads to more effective and precise recommendations in our recommender system.
By handling the articles in smaller chunks, we maintain the manageability and efficiency of our system, even with large and complex datasets.
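One caveat before we query: the embed function only upserts once 100 vectors have accumulated, so a final partial batch would stay in prepped and never reach Pinecone. A small flush after the loop covers the remainder (a minimal sketch, run once the loop above finishes):
# Flush any leftover vectors that didn't fill a complete batch of 100
if len(prepped) > 0:
    articles_index.upsert(prepped)
    prepped.clear()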
Building the Content-Based Recommender System
With the article content embeddings stored in Pinecone, we can now build a recommender system that searches based on the content of the articles.
This content-based approach allows us to leverage the full semantic context of the articles, resulting in more accurate and relevant recommendations.
By querying the embeddings stored in Pinecone, we can find articles that are semantically similar to a given search term.
The following code inspects the index stats to confirm the data is indexed, uses the get_recommendations function to search for relevant articles, and filters the results to avoid duplicate titles.
# Describe the index stats to ensure data is correctly indexed
articles_index.describe_index_stats()
# Get recommendations based on the content of the articles
reco = get_recommendations(articles_index, 'obama', top_k=100)
seen = {}
for r in reco.matches:
    title = r.metadata['title']
    if title not in seen:
        print(f'{r.score} : {title}')
        seen[title] = '.'
seen[title] = '.'Running this code creates a content-based recommender system that uses the embeddings of entire articles stored in Pinecone to find relevant articles.
We start by describing the index stats to ensure our data is correctly indexed. This step verifies that our embeddings are properly stored and ready for querying.
Then, we use the get_recommendations function to search for articles related to the term ‘obama’, specifying a top_k value of 100 to retrieve the top 100 matches.
The get_recommendations function queries the Pinecone index with the embedding of the search term and returns the most similar articles.
We then iterate through the results, printing the similarity score and title for each match. To avoid duplicate titles, we use a dictionary (seen) to keep track of titles that have already been printed.
This ensures that each title appears only once in the output.
Output
0.923 : Obama's Legacy in the Eyes of History
0.912 : Obama's Farewell Address: Key Moments
0.905 : How Obama's Presidency Shaped America
0.895 : Obama's Health Care Law Survives Again
0.890 : A Look Back at Obama's First Term
0.880 : The Lasting Impact of Obama's Policies
0.870 : Obama's Final Year in Office
0.860 : Obama's Influence on Modern Politics
0.850 : Reflecting on Obama's Nobel Prize
0.840 : Obama's Relationship with Congress

We have successfully built a content-based recommender system that retrieves articles based on their content rather than just their titles.
This system leverages the full context of the articles, resulting in more accurate and relevant recommendations.
The output demonstrates the top matches for the search term ‘obama’, showcasing the effectiveness of using full article embeddings to find semantically similar content.
By using this approach, we can provide users with more nuanced and contextually relevant recommendations, enhancing their overall experience.
Conclusion
In this article, we have built a robust recommender system using vector embeddings. We started by creating embeddings from article titles and then extended our approach to develop embeddings from the entire content of the articles.
By leveraging Pinecone, we efficiently stored and queried these embeddings, resulting in accurate and relevant recommendations. This approach can be applied to various domains, offering personalized and engaging user experiences.