This article focuses on creating a LangChain SQL AI agent that interacts with CSV data, and explores integrating it with Azure OpenAI for seamless database interactions.
You’ll learn how to load tabular data from a CSV file and use natural language queries, powered by Azure OpenAI, to extract information quickly.
You’ll be able to reuse this agent to analyze your own CSV files.
AI for Database Agents
When building a data agent on top of the GPT-4 model, there are several considerations and options:
Fine-tuning: You can customize the model using the fine-tuning method to handle SQL tasks.
RAG: You can use retrieval-augmented generation (RAG), a simpler option than fine-tuning. It doesn’t require retraining and uses the database or dataset as its source.
In the first part of the article, you’ll focus on using a CSV file as the data source and a LangChain SQL database toolkit to connect to it.
Alternative options for AI database agents are summarized in the figure. However, the primary focus here is on using RAG to interact with SQL databases (or CSV dumps) via the LangChain SQL Agent.
Interacting with CSV Data
Now, you will create a simple Microsoft AI agent using Python and Jupyter Notebook to interact with CSV data. This tutorial is designed for beginners and will guide you step-by-step through the code and concepts.
Step 1: Setting Up the Environment and Connecting to OpenAI Endpoint
Detailed instructions for these steps and installing the necessary requirements are provided in the previous article.
You can start by importing the OpenAI, LangChain SQL Agent, and Pandas libraries.
```python
import os

import pandas as pd
from IPython.display import Markdown, HTML, display
from langchain.schema import HumanMessage
from langchain_openai import AzureChatOpenAI

model = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_deployment="gpt-4-1106",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)
```
Step 2: Loading the Dataset
In this part, your AI agent will interact with a CSV dataset.
You can download a CSV file from the open data portal and place it in your project folder. Or, you can remotely access the data we are using by following the commands below.
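Loading the file with pandas might look like the sketch below. The tutorial uses the file "./data/all-states-history.csv"; here a small inline sample stands in for it so the snippet is self-contained, but the `read_csv(...).fillna(value=0)` pattern is the same one used later in the article.

```python
import io

import pandas as pd

# Small inline sample standing in for ./data/all-states-history.csv
csv_text = """date,state,hospitalizedIncrease
2020-07-01,TX,100
2020-07-02,TX,
2020-07-01,NY,50
"""

# Load the CSV and replace empty values with 0, as the later steps do
df = pd.read_csv(io.StringIO(csv_text)).fillna(value=0)
print(df.shape)  # (3, 3)
```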
After loading a CSV file, you can begin with exploratory data analysis (EDA). This involves asking questions to understand the statistical properties of the data. These properties include minimum and maximum values, averages, the number of rows and columns, and rows with empty values.
Step 3: Creating the Agent and Asking Questions
To perform these actions, you will create a Pandas DataFrame agent and use its invoke function to ask questions like
# "How many columns are there?"
The expected output should be in the format:
# "There are x columns in the DataFrame."
```python
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent

agent = create_pandas_dataframe_agent(llm=model, df=df, verbose=True)
agent.invoke("How many columns are there?")
```
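Under the hood, the DataFrame agent writes and runs small Python snippets against df. For "How many columns are there?", the generated code amounts to something like the following (a simplified sketch with a synthetic DataFrame standing in for the loaded CSV):

```python
import pandas as pd

# Synthetic stand-in for the loaded CSV
df = pd.DataFrame({"state": ["TX", "NY"], "hospitalizedIncrease": [100, 50]})

# The agent's Python tool would effectively run:
n_columns = df.shape[1]
print(f"There are {n_columns} columns in the DataFrame.")
```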
Step 4: Designing Your Prompt and Asking Your Question
In this section, you will define a prefix and suffix to guide the model in handling a data analysis task using Pandas.
The prefix sets the Pandas display options to show all the columns, ensuring that the column names are visible.
The suffix provides instructions on how the model should interact with the data and validate its results.
The key advantage here is that you can customize the prefix and suffix to suit how you want the model to interact.
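The display-option instruction in the prefix corresponds to a one-line pandas setting (standard pandas API); setting `display.max_columns` to `None` prevents pandas from eliding column names in printed output:

```python
import pandas as pd

# Show every column when a DataFrame is printed, so no names are elided
pd.set_option("display.max_columns", None)

print(pd.get_option("display.max_columns"))  # None
```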
```python
CSV_PROMPT_PREFIX = """
First, set the pandas display options to show all the columns and get the column names.
"""

CSV_PROMPT_SUFFIX = """
- **ALWAYS** before giving the final answer, try another method. Reflect on the answers of the methods you used and ensure they correctly address the original question.
- If the methods yield different results, reflect and try again until you obtain consistent results.
- If you still cannot arrive at a consistent result, indicate that you are unsure of the answer.
- Once you are sure of the correct answer, create a comprehensive response using Markdown.
- **DO NOT FABRICATE AN ANSWER OR USE PRIOR KNOWLEDGE. USE ONLY THE RESULTS OF YOUR CALCULATIONS**.
- **ALWAYS** include an explanation of how you arrived at the answer in a section that starts with "\n\nExplanation:\n". In the explanation, mention the specific column names used to derive the final answer.
"""

QUESTION = """
How many patients were hospitalized during July 2020 in Texas and nationwide (total of all states)?
Use the 'hospitalizedIncrease' column.
"""

agent.invoke(CSV_PROMPT_PREFIX + QUESTION + CSV_PROMPT_SUFFIX)
```
Congratulations! You used your AI agent to interact with CSV data!
Connecting to an SQL Database
You can also implement LangChain database agents to connect to a provided SQL database instead of a CSV file.
Now you will leverage the SQL database and perform the RAG pattern against it.
The result will be a database agent in which generative AI helps you translate natural language into SQL code.
First, take a look at the architecture for connecting to an SQL database shown in the figure below. It should help clarify things for you!
When building AI agents, you will use the baseline GPT-4 model on Azure OpenAI and leverage the RAG pattern.
LangChain will orchestrate the workflow, helping to find information and explaining each step of the process. This time, an SQL database replaces the CSV file as the data source.
You can combine different data sources, such as images, data lakes, and databases. The RAG mechanism will efficiently locate information from the appropriate source.
Now, you will create a simple Microsoft AI agent using Python and Jupyter Notebook to connect to an SQL database.
Step 1: Setting Up the Environment and Connecting to OpenAI Endpoint
You will set up the environment once more, this time to integrate an SQL database.
You will create an SQLAlchemy engine to facilitate communication with the SQL database.
Pandas will be used to handle the data frames.
Step 2: Loading the Dataset
Let’s start by loading the data that you will later move into your SQL database.
```python
from sqlalchemy import create_engine
import pandas as pd

df = pd.read_csv("./data/all-states-history.csv").fillna(value=0)
```
Step 3: Moving the Data to the SQL Database
Here, you will set up an SQLite database stored locally in a folder named ‘database’.
The file ‘test.db’ in this folder will act as your SQLite database.
Next, you will create an engine to connect to this SQLite database using the specified path, ‘database_file_path’.
Next, you will load data from a CSV file located in your data folder into a Pandas data frame named ‘df’. The data frame will hold the data extracted from the CSV file.
The key action is using the ‘to_sql’ function. This function transfers the contents of ‘df’ into the SQLite database.
This process seamlessly integrates the data from the CSV file into your SQLite database. It enables efficient interaction and querying capabilities with the structured dataset.
```python
# Path to your SQLite database file
database_file_path = "./database/test.db"

# Create an engine to connect to the SQLite database
# SQLite only requires the path to the database file
engine = create_engine(f'sqlite:///{database_file_path}')

file_url = "./data/all-states-history.csv"
df = pd.read_csv(file_url).fillna(value=0)
df.to_sql(
    'all_states_history',
    con=engine,
    if_exists='replace',
    index=False,
)
```
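A quick way to sanity-check the transfer is to read the table back through the same engine. A minimal self-contained sketch, using an in-memory SQLite database and a small stand-in DataFrame instead of the file paths above:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database standing in for ./database/test.db
engine = create_engine("sqlite:///:memory:")

df = pd.DataFrame({
    "state": ["TX", "NY"],
    "hospitalizedIncrease": [100, 50],
})
df.to_sql("all_states_history", con=engine, if_exists="replace", index=False)

# Read the table back to confirm the rows arrived intact
df_back = pd.read_sql("SELECT * FROM all_states_history", con=engine)
print(len(df_back))  # 2
```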
Step 4: Preparing the SQL Prompt
Now, you will write the prefix, which will outline how the engine should respond to queries. This includes defining the input, providing context, and delivering the final answer along with an accompanying explanation.
This example will serve to clarify your expectations as a user. It will enable seamless communication with the SQL database using natural language.
```python
MSSQL_AGENT_PREFIX = """
You are an agent designed to interact with a SQL database.

## Instructions:
- Given an input question, generate a syntactically correct {dialect} query to execute. Retrieve and analyze up to {top_k} relevant results based on the query.
- Order results by a relevant column to prioritize the most pertinent examples from the database.
- Avoid querying all columns from a table; instead, request only necessary columns based on the question.
- Utilize available tools for seamless interaction with the database.
- Double-check each query before execution. If errors occur, refine the query and retry.
- Refrain from executing DML statements (INSERT, UPDATE, DELETE, DROP, etc.) on the database.
- Ensure responses are in Markdown format. Exclude Markdown backticks when running SQL queries in "Action Input"; these are solely for response formatting, not command execution.
- Always include an explanation of how the answer was derived under a section labeled "Explanation:". Include the SQL query as part of this explanation.
- If a question seems unrelated to the database, respond with "I don't know."
- Limit query construction to the provided tools and information returned by these tools.
- Use existing table names; do not invent new ones.

## Tools:
"""
```
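Note the {dialect} and {top_k} placeholders in the prefix: LangChain fills these in when it builds the agent's prompt. You can see the effect with plain str.format (a simplified illustration; the real substitution happens inside LangChain's prompt templates):

```python
prefix_fragment = (
    "Given an input question, generate a syntactically correct {dialect} "
    "query to execute. Retrieve and analyze up to {top_k} relevant results."
)

# LangChain performs an equivalent substitution internally
filled = prefix_fragment.format(dialect="sqlite", top_k=30)
print(filled)
```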
The following instructions will guide you through formulating and executing SQL queries.
You will start by defining your query and outlining your strategy.
Next, you will choose the appropriate action, provide the necessary input, analyze the results, and deliver a clear, final answer.
This approach will help you handle data inquiries methodically. It will also enable you to interact efficiently with Azure SQL databases to obtain meaningful insights.
````python
MSSQL_AGENT_FORMAT_INSTRUCTIONS = """
## Use the following format:

Question: [Insert the input question you must answer here.]
Thought: [Explain your thought process and strategy.]
Action: [Specify the action to take, choose one from: {tool_names}.]
Action Input: [Provide the input necessary for the action.]
Observation: [Describe the result obtained from the action. Repeat this section as needed.]
Thought: [Reflect on the observations and indicate readiness to provide the final answer.]
Final Answer: [State the final answer to the original input question.]

Example of Final Answer:
<=== Beginning of example

Action: query_sql_db
Action Input:
SELECT TOP (10) [death] FROM covidtracking WHERE state = 'TX' AND date LIKE '2020%'
Observation:
[(27437.0,), (27088.0,), (26762.0,), (26521.0,), (26472.0,), (26421.0,), (26408.0,)]
Thought: I now know the final answer.
Final Answer: There were 27,437 people who died of COVID in Texas in 2020.

Explanation:
I queried the `covidtracking` table for the `death` column where the state
is 'TX' and the date starts with '2020'. The query returned a list of tuples
with the number of deaths for each day in 2020. To answer the question,
I took the sum of all the deaths in the list, which is 27,437.
I used the following SQL query:

```sql
SELECT [death] FROM covidtracking WHERE state = 'TX' AND date LIKE '2020%'
```
===> End of Example
"""
````
Step 5: Calling the Azure Chat Model and Creating the SQL Agent
You will configure the large language model (LLM) chat instance. This involves connecting it to the database toolkit via its endpoint for seamless integration.
This setup will enable you to handle diverse questions. For example, you could address queries such as:
# "How many patients were hospitalized during October 2020 in New York and nationwide across all states?"
Here, you will set up the SQL executor agent. This involves utilizing LangChain with a specific format defined by your provided prefix and format instructions.
Azure OpenAI will serve as your LLM. It will ensure compatibility with the designated endpoint and SQL database toolkit to fetch customizable information.
Once executed, you will await prompts to interact further with the system.
```python
from langchain_community.agent_toolkits import SQLDatabaseToolkit, create_sql_agent
from langchain_community.utilities import SQLDatabase

llm = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_deployment="gpt-4-1106",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    temperature=0,
    max_tokens=500,
)

db = SQLDatabase.from_uri(f'sqlite:///{database_file_path}')
toolkit = SQLDatabaseToolkit(db=db, llm=llm)

QUESTION = """
How many patients were hospitalized during October 2020
in New York, and nationwide as the total of all states?
Use the hospitalizedIncrease column.
"""

agent_executor_SQL = create_sql_agent(
    prefix=MSSQL_AGENT_PREFIX,
    format_instructions=MSSQL_AGENT_FORMAT_INSTRUCTIONS,
    llm=llm,
    toolkit=toolkit,
    top_k=30,
    verbose=True,
)
```
Step 6: Invoking the SQL Model
Next, you will invoke the process and send your question. Run it and see what happens!
```python
agent_executor_SQL.invoke(QUESTION)
```
Congratulations! You connected to an SQL database to build your AI agent!
Final Thoughts
In this article, you learned how an AI agent can interact with both CSV data and an Azure SQL database.
By following the steps outlined, you can now load datasets, perform exploratory data analysis, and execute queries seamlessly. This is made possible through the combined power of OpenAI and LangChain.
This capability allows for more efficient data handling and the extraction of valuable insights from your data sources.
The ability to customize the AI agent’s interaction through tailored prompts enhances its versatility.
Whether working with CSV files or SQL databases, these agents can adapt to different scenarios. This makes them a valuable tool for a variety of applications.
The flexibility ensures that you can harness the full potential of your data, regardless of its format or complexity.