This article focuses on creating a LangChain SQL AI agent that interacts with CSV data, and explores integrating it with Azure OpenAI for seamless database interactions.
You’ll learn how to load tabular data from a CSV file and use natural language queries, powered by Azure OpenAI, to extract information quickly.
You’ll be able to reuse this agent to analyze your own CSV files.
AI for Database Agents
When building a data agent on top of the GPT-4 model, there are several considerations and options:
Fine-tuning: You can customize the model using the fine-tuning method to handle SQL tasks.
RAG: You can use retrieval-augmented generation (RAG), a simpler option than fine-tuning. It doesn’t require retraining and uses the database or dataset as its source.
In the first part of the article, you’ll focus on using a CSV file as the data source and a LangChain SQL database toolkit to connect to it.
Alternative options for AI database agents are summarized in the figure. However, the primary focus here is on using RAG to interact with SQL databases (or CSV dumps) via the LangChain SQL Agent.
Interacting with CSV Data
Now, you will create a simple Microsoft AI agent using Python and Jupyter Notebook to interact with CSV data. This tutorial is designed for beginners and will guide you step-by-step through the code and concepts.
Step 1: Setting Up the Environment and Connecting to OpenAI Endpoint
Detailed instructions for these steps and installing the necessary requirements are provided in the previous article.
You can start by importing the OpenAI, LangChain SQL Agent, and Pandas libraries.
```python
import os

import pandas as pd
from IPython.display import Markdown, HTML, display
from langchain.schema import HumanMessage
from langchain_openai import AzureChatOpenAI

model = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_deployment="gpt-4-1106",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)
```
Step 2: Loading the Dataset
In this part, your AI agent will interact with a CSV dataset.
You can download a CSV file from the open data portal and place it in your project folder. Or, you can remotely access the data we are using by following the commands below.
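Loading the file with pandas might look like the sketch below. The tutorial uses the file "./data/all-states-history.csv"; here a small inline sample stands in for it so the snippet is self-contained, but the `read_csv(...).fillna(value=0)` pattern is the same one used later in the article.

```python
import io

import pandas as pd

# Small inline sample standing in for ./data/all-states-history.csv
csv_text = """date,state,hospitalizedIncrease
2020-07-01,TX,100
2020-07-02,TX,
2020-07-01,NY,50
"""

# Load the CSV and replace empty values with 0, as the later steps do
df = pd.read_csv(io.StringIO(csv_text)).fillna(value=0)
print(df.shape)  # (3, 3)
```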
After loading a CSV file, you can begin with exploratory data analysis (EDA). This involves asking questions to understand the statistical properties of the data. These properties include minimum and maximum values, averages, the number of rows and columns, and rows with empty values.
Step 3: Creating the Agent and Asking Questions
To perform these actions, you will create a Pandas DataFrame agent and use its invoke function to ask questions like
# "How many columns are there?"
The expected output should be in the format:
# "There are x columns in the DataFrame."
```python
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent

agent = create_pandas_dataframe_agent(llm=model, df=df, verbose=True)
agent.invoke("How many columns are there?")
```
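Under the hood, the DataFrame agent writes and runs small Python snippets against df. For "How many columns are there?", the generated code amounts to something like the following (a simplified sketch with a synthetic DataFrame standing in for the loaded CSV):

```python
import pandas as pd

# Synthetic stand-in for the loaded CSV
df = pd.DataFrame({"state": ["TX", "NY"], "hospitalizedIncrease": [100, 50]})

# The agent's Python tool would effectively run:
n_columns = df.shape[1]
print(f"There are {n_columns} columns in the DataFrame.")
```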
Step 4: Designing Your Prompt and Asking Your Question
In this section, you will define a prefix and suffix to guide the model in handling a data analysis task using Pandas.
The prefix sets the Pandas display options to show all the columns, ensuring that the column names are visible.
The suffix provides instructions on how the model should interact with the data and validate its results.
The key advantage here is that you can customize the prefix and suffix to suit how you want the model to interact.
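The display-option instruction in the prefix corresponds to a one-line pandas setting (standard pandas API); setting `display.max_columns` to `None` prevents pandas from eliding column names in printed output:

```python
import pandas as pd

# Show every column when a DataFrame is printed, so no names are elided
pd.set_option("display.max_columns", None)

print(pd.get_option("display.max_columns"))  # None
```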
```python
CSV_PROMPT_PREFIX = """
First, set the pandas display options to show all the columns and get the column names.
"""

CSV_PROMPT_SUFFIX = """
- **ALWAYS** before giving the final answer, try another method. Reflect on the answers of the methods you used and ensure they correctly address the original question.
- If the methods yield different results, reflect and try again until you obtain consistent results.
- If you still cannot arrive at a consistent result, indicate that you are unsure of the answer.
- Once you are sure of the correct answer, create a comprehensive response using Markdown.
- **DO NOT FABRICATE AN ANSWER OR USE PRIOR KNOWLEDGE. USE ONLY THE RESULTS OF YOUR CALCULATIONS**.
- **ALWAYS** include an explanation of how you arrived at the answer in a section that starts with "\n\nExplanation:\n". In the explanation, mention the specific column names used to derive the final answer.
"""

QUESTION = """
How many patients were hospitalized during July 2020 in Texas and nationwide (total of all states)?
Use the 'hospitalizedIncrease' column.
"""

agent.invoke(CSV_PROMPT_PREFIX + QUESTION + CSV_PROMPT_SUFFIX)
```
Congratulations! You used your AI agent to interact with CSV data!
Connecting to an SQL Database
You can also implement LangChain database agents to connect to a provided SQL database instead of a CSV file.
Now you will leverage the SQL database and perform the RAG pattern against it.
The result will be a database agent in which generative AI helps you translate natural language into SQL code.
First, take a look at the architecture for connecting to an SQL database shown in the figure below. It should help clarify things for you!
When building AI agents, you will use the baseline GPT-4 model on Azure OpenAI and leverage the RAG pattern.
LangChain will orchestrate the workflow, helping to find information and explaining each step of the process. This time, an SQL database replaces the CSV file as the data source.
You can combine different data sources, such as images, data lakes, and databases. The RAG mechanism will efficiently locate information from the appropriate source.
Now, you will create a simple Microsoft AI agent using Python and Jupyter Notebook to connect to an SQL database.
Step 1: Setting Up the Environment and Connecting to OpenAI Endpoint
You will set up the environment once more, this time to integrate an SQL database.
You will create an SQLAlchemy engine to facilitate communication with the SQL database.
Pandas will be used to handle the data frames.
Step 2: Loading the Dataset
Let’s start by loading the data that you will later move into your SQL database.
```python
from sqlalchemy import create_engine
import pandas as pd

df = pd.read_csv("./data/all-states-history.csv").fillna(value=0)
```
Step 3: Moving the Data to the SQL Database
Here, you will set up an SQLite database stored locally in a folder named ‘database’.
The file ‘test.db’ in this folder will act as your SQLite database.
Next, you will create an engine to connect to this SQLite database using the specified path, ‘database_file_path’.
Next, you will load data from a CSV file located in your data folder into a Pandas data frame named ‘df’. The data frame will hold the data extracted from the CSV file.
The key action is using the ‘to_sql’ function. This function transfers the contents of ‘df’ into the SQLite database.
This process seamlessly integrates the data from the CSV file into your SQLite database. It enables efficient interaction and querying capabilities with the structured dataset.
```python
# Path to your SQLite database file
database_file_path = "./database/test.db"

# Create an engine to connect to the SQLite database
# SQLite only requires the path to the database file
engine = create_engine(f'sqlite:///{database_file_path}')

file_url = "./data/all-states-history.csv"
df = pd.read_csv(file_url).fillna(value=0)
df.to_sql(
    'all_states_history',
    con=engine,
    if_exists='replace',
    index=False,
)
```
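A quick way to sanity-check the transfer is to read the table back through the same engine. A minimal self-contained sketch, using an in-memory SQLite database and a small stand-in DataFrame instead of the file paths above:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database standing in for ./database/test.db
engine = create_engine("sqlite:///:memory:")

df = pd.DataFrame({
    "state": ["TX", "NY"],
    "hospitalizedIncrease": [100, 50],
})
df.to_sql("all_states_history", con=engine, if_exists="replace", index=False)

# Read the table back to confirm the rows arrived intact
df_back = pd.read_sql("SELECT * FROM all_states_history", con=engine)
print(len(df_back))  # 2
```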
Step 4: Preparing the SQL Prompt
Now, you will write the prefix, which will outline how the engine should respond to queries. This includes defining the input, providing context, and delivering the final answer along with an accompanying explanation.
This example will serve to clarify your expectations as a user. It will enable seamless communication with the SQL database using natural language.
```python
MSSQL_AGENT_PREFIX = """
You are an agent designed to interact with a SQL database.

## Instructions:
- Given an input question, generate a syntactically correct {dialect} query to execute. Retrieve and analyze up to {top_k} relevant results based on the query.
- Order results by a relevant column to prioritize the most pertinent examples from the database.
- Avoid querying all columns from a table; instead, request only necessary columns based on the question.
- Utilize available tools for seamless interaction with the database.
- Double-check each query before execution. If errors occur, refine the query and retry.
- Refrain from executing DML statements (INSERT, UPDATE, DELETE, DROP, etc.) on the database.
- Ensure responses are in Markdown format. Exclude Markdown backticks when running SQL queries in "Action Input"; these are solely for response formatting, not command execution.
- Always include an explanation of how the answer was derived under a section labeled "Explanation:". Include the SQL query as part of this explanation.
- If a question seems unrelated to the database, respond with "I don't know."
- Limit query construction to the provided tools and information returned by these tools.
- Use existing table names; do not invent new ones.

## Tools:
"""
```
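Note the {dialect} and {top_k} placeholders in the prefix: LangChain fills these in when it builds the agent's prompt. You can see the effect with plain str.format (a simplified illustration; the real substitution happens inside LangChain's prompt templates):

```python
prefix_fragment = (
    "Given an input question, generate a syntactically correct {dialect} "
    "query to execute. Retrieve and analyze up to {top_k} relevant results."
)

# LangChain performs an equivalent substitution internally
filled = prefix_fragment.format(dialect="sqlite", top_k=30)
print(filled)
```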
The following instructions will guide you through formulating and executing SQL queries.
You will start by defining your query and outlining your strategy.
Next, you will choose the appropriate action, provide the necessary input, analyze the results, and deliver a clear, final answer.
This approach will help you handle data inquiries methodically. It will also enable you to interact efficiently with Azure SQL databases to obtain meaningful insights.
````python
MSSQL_AGENT_FORMAT_INSTRUCTIONS = """
## Use the following format:

Question: [Insert the input question you must answer here.]
Thought: [Explain your thought process and strategy.]
Action: [Specify the action to take, choose one from: {tool_names}.]
Action Input: [Provide the input necessary for the action.]
Observation: [Describe the result obtained from the action. Repeat this section as needed.]
Thought: [Reflect on the observations and indicate readiness to provide the final answer.]
Final Answer: [State the final answer to the original input question.]

Example of Final Answer:
<=== Beginning of example

Action: query_sql_db
Action Input:
SELECT TOP (10) [death] FROM covidtracking WHERE state = 'TX' AND date LIKE '2020%'
Observation:
[(27437.0,), (27088.0,), (26762.0,), (26521.0,), (26472.0,), (26421.0,), (26408.0,)]
Thought: I now know the final answer.
Final Answer: There were 27,437 people who died of COVID in Texas in 2020.

Explanation:
I queried the `covidtracking` table for the `death` column where the state
is 'TX' and the date starts with '2020'. The query returned a list of tuples
with the number of deaths for each day in 2020. To answer the question,
I took the sum of all the deaths in the list, which is 27,437.
I used the following SQL query:

```sql
SELECT [death] FROM covidtracking WHERE state = 'TX' AND date LIKE '2020%'
```
===> End of Example
"""
````
Step 5: Calling the Azure Chat Model and Creating the SQL Agent
You will configure the large language model (LLM) chat instance. This involves connecting it to the database toolkit via its endpoint for seamless integration.
This setup will enable you to handle diverse questions. For example, you could address queries such as:
# "How many patients were hospitalized during October 2020 in New York and nationwide across all states?"
Here, you will set up the SQL executor agent. This involves utilizing LangChain with a specific format defined by your provided prefix and format instructions.
Azure OpenAI will serve as your LLM. It will ensure compatibility with the designated endpoint and SQL database toolkit to fetch customizable information.
Once executed, you will await prompts to interact further with the system.
```python
from langchain_community.agent_toolkits import SQLDatabaseToolkit, create_sql_agent
from langchain_community.utilities import SQLDatabase

llm = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_deployment="gpt-4-1106",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    temperature=0,
    max_tokens=500,
)

db = SQLDatabase.from_uri(f'sqlite:///{database_file_path}')
toolkit = SQLDatabaseToolkit(db=db, llm=llm)

QUESTION = """
How many patients were hospitalized during October 2020
in New York, and nationwide as the total of all states?
Use the hospitalizedIncrease column.
"""

agent_executor_SQL = create_sql_agent(
    prefix=MSSQL_AGENT_PREFIX,
    format_instructions=MSSQL_AGENT_FORMAT_INSTRUCTIONS,
    llm=llm,
    toolkit=toolkit,
    top_k=30,
    verbose=True,
)
```
Step 6: Invoking the SQL Model
Next, you will invoke the process and send your question. Run it and see what happens!
```python
agent_executor_SQL.invoke(QUESTION)
```
Congratulations! You connected to an SQL database to build your AI agent!
Final Thoughts
In this article, you learned how an AI agent can interact with both CSV data and an Azure SQL database.
By following the steps outlined, you can now load datasets, perform exploratory data analysis, and execute queries seamlessly. This is made possible through the combined power of OpenAI and LangChain.
This capability allows for more efficient data handling and the extraction of valuable insights from your data sources.
The ability to customize the AI agent’s interaction through tailored prompts enhances its versatility.
Whether working with CSV files or SQL databases, these agents can adapt to different scenarios. This makes them a valuable tool for a variety of applications.
The flexibility ensures that you can harness the full potential of your data, regardless of its format or complexity.