SQL

Author: Jinu Cho
Design: LeeYuChul
Peer Review: JeongHo Shin, Erika Park
Proofread : Juni Lee
This is a part of LangChain Open Tutorial

Overview

This tutorial covers how to use create_sql_query_chain to generate SQL queries, execute them, and derive answers.

Additionally, we'll explore the differences between this method and the SQL Agent.

References

Environment Setup

Setting up your environment is the first step. See the Environment Setup guide for more details.

[Note]

The langchain-opentutorial is a package of easy-to-use environment setup guidance, useful functions and utilities for tutorials. Check out the langchain-opentutorial for more details.

%%capture --no-stderr
%pip install langchain-opentutorial

# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_openai",
        "langchain_community",
    ],
    verbose=False,
    upgrade=False,
)

You can alternatively set OPENAI_API_KEY in .env file and load it.

[Note] This is not necessary if you've already set OPENAI_API_KEY in previous steps.

# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "02-SQL",
    }
)

Environment variables have been set successfully.

# Load environment variables
# Reload any variables that need to be overwritten from the previous cell

from dotenv import load_dotenv

load_dotenv(override=True)

True

Loading SQL Databases

Usage methods for various databases and required library list:

Database

Required Library

Code Example

PostgreSQL

psycopg2-binary

db = SQLDatabase.from_uri("postgresql://:@:/")

MySQL

pymysql

db = SQLDatabase.from_uri("mysql+pymysql://:@:/")

SQLite

Included in standard lib

db = SQLDatabase.from_uri("sqlite:///path/to/your_database.db")

Oracle

cx_Oracle

db = SQLDatabase.from_uri("oracle+cx_oracle://:@:/")

Example for PostgreSQL :

    db = SQLDatabase.from_uri("postgresql://postgre_user_name:password@ip_address:port/db_name")

Load and verify the sample database data.

from langchain_openai import ChatOpenAI
from langchain.chains import create_sql_query_chain
from langchain_community.utilities import SQLDatabase

# Connect to the SQLite database.
db = SQLDatabase.from_uri("sqlite:///data/finance.db")

# Output the database dialect.
print(db.dialect)

# Output the available table names.
print(db.get_usable_table_names())

sqlite
    ['accounts', 'customers', 'transactions']

SQL Query Chain

create_sql_query_chain generates a chain for creating SQL queries based on natural language input.

It leverages LLMs to translate natural language into SQL statements.

[RECOMMEDED] Create an LLM object and generate a chain by providing the LLM and DB as parameters.

Since changing the model may cause unexpected behavior, this tutorial will proceed with gpt-3.5-turbo .

# Create an OpenAI LLM object
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Generate a chain by providing the LLM and DB as parameters.
chain = create_sql_query_chain(
    llm=llm, db=db, k=10
)  # k(for query Limit)'s default value is 5

chain.invoke({"question": "List the all customer names."})

'SELECT "name" FROM customers;'

What if the latest model is used?

Using the latest version of OpenAI's LLM may cause issues with the output.

# Create an OpenAI LLM object
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Generate a chain by providing the LLM and DB as parameters.
bad_case_chain = create_sql_query_chain(
    llm=llm, db=db, k=10
)  # k(for query Limit)'s default values is 5

Unnecessary information, such as 'SQLQuery: ' , is included in the output along with the query.

bad_case_chain.invoke({"question": "List the all customer names."})

'SQLQuery: SELECT "name" FROM customers LIMIT 10;'

(Optional) You can specify the prompt directly using the method below.

When writing it yourself, you can include table_info along with descriptive column descriptions for better explanation.

from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template(
    """Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer. 
    Unless the user specifies in his question a specific number of examples he wishes to obtain, always limit your query to at most {top_k} results. 
    You can order the results by a relevant column to return the most interesting examples in the database.

Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use the following tables:
{table_info}

Here is the description of the columns in the tables:
`cust`: customer name
`prod`: product name
`trans`: transaction date

Question: {input}
"""
).partial(dialect=db.dialect)

# Create an OpenAI LLM object
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Generate a chain by providing the LLM and DB as parameters.
chain = create_sql_query_chain(llm, db, prompt)

Executing the chain generates queries based on the database.

# Execute the chain and display the results.
generated_sql_query = chain.invoke({"question": "List the all customer names."})

# Print the generated query.
print(generated_sql_query.__repr__())

'SELECT name\nFROM customers'

How to use the `get_prompts` method

The chain.get_prompt method allows you to retrieve the current prompt template used in a LangChain chain.

This prompt contains the instructions given to the LLM, including the input structure, expected variables, and contextual guidelines.

Key Features

Prompt Retrieval:

Fetches the active prompt template to inspect or debug the chain's behavior.

Dynamic Variable Substitution:

Displays how variables are dynamically substituted within the template.

Customizability:

Enables users to modify parts of the prompt dynamically.

check the get_prompts 's contents

There are various elements:

input_variables
input_types
partial_variables
template

# Check the prompt template configuration
print(f"input_variables : {chain.get_prompts()[0].input_variables}", "\n")
print(f"input_types : {chain.get_prompts()[0].input_types}", "\n")
print(f"partial_variables : {chain.get_prompts()[0].partial_variables}", "\n")
print(f"template : {chain.get_prompts()[0].template}", "\n")

input_variables : ['input', 'table_info'] 
    
    input_types : {} 
    
    partial_variables : {'dialect': 'sqlite', 'top_k': '5'} 
    
    template : Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer. 
        Unless the user specifies in his question a specific number of examples he wishes to obtain, always limit your query to at most {top_k} results. 
        You can order the results by a relevant column to return the most interesting examples in the database.
    
    Use the following format:
    
    Question: Question here
    SQLQuery: SQL Query to run
    SQLResult: Result of the SQLQuery
    Answer: Final answer here
    
    Only use the following tables:
    {table_info}
    
    Here is the description of the columns in the tables:
    `cust`: customer name
    `prod`: product name
    `trans`: transaction date
    
    Question: {input}

Modify the variable values and check the results.

# Modify the dialect to MySQL
chain.get_prompts()[0].partial_variables["dialect"] = "my_sql"

# check the modified prompt
print(f"input_variables : {chain.get_prompts()[0].input_variables}", "\n")
print(f"input_types : {chain.get_prompts()[0].input_types}", "\n")
print(f"partial_variables : {chain.get_prompts()[0].partial_variables}", "\n")
print(f"template : {chain.get_prompts()[0].template}", "\n")

input_variables : ['input', 'table_info'] 
    
    input_types : {} 
    
    partial_variables : {'dialect': 'my_sql', 'top_k': '5'} 
    
    template : Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer. 
        Unless the user specifies in his question a specific number of examples he wishes to obtain, always limit your query to at most {top_k} results. 
        You can order the results by a relevant column to return the most interesting examples in the database.
    
    Use the following format:
    
    Question: Question here
    SQLQuery: SQL Query to run
    SQLResult: Result of the SQLQuery
    Answer: Final answer here
    
    Only use the following tables:
    {table_info}
    
    Here is the description of the columns in the tables:
    `cust`: customer name
    `prod`: product name
    `trans`: transaction date
    
    Question: {input}

You can specify variables, including dialect, when invoking create_sql_query_chain.

chain.invoke({"question": "List all customer names.", "dialect": "mysql"})

'SELECT name\nFROM customers;'

`QuerySQLDatabaseTool`

Executing SQL Queries:

Executes the provided SQL query on the connected database and retrieves the results.
Encapsulates the functionality for interacting with the database, promoting code reusability.

Integration with LangChain Agents:

LangChain agents are used to convert natural language into SQL queries.
This tool performs the execution step by running the agent-generated SQL query and returning the results.

Database Abstraction:

Supports various types of databases, such as MySQL, PostgreSQL, SQLite, and more.
Handles direct database operations internally, reducing the dependency of user code on database-specific details.

Let's verify if the generated query executes correctly.

from langchain_community.tools import QuerySQLDatabaseTool

# Create a tool to execute the generated query.
execute_query = QuerySQLDatabaseTool(db=db)

execute_query.invoke({"query": generated_sql_query})

"[('Altman',), ('Huang',), ('Zuckerberg',), ('Musk',), ('Hassabis',), ('Chase',)]"

from langchain_community.tools.sql_database.tool import QuerySQLDatabaseTool

# Tool
execute_query = QuerySQLDatabaseTool(db=db)

# SQL query chain
write_query = create_sql_query_chain(llm, db, prompt)

# Create a chain to execute the generated query.
chain = write_query | execute_query

# Check the execution result
chain.invoke({"question": "Retrieve Altman's email address."})

"[('Sam@example.com',)]"

Enhance and generate answers using the LLM

Using the chain created in the previous step results in short, concise answers. This can be adjusted using an LCEL-style chain to provide more natural and detailed responses.

from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Define the prompt for generating answers
answer_prompt = PromptTemplate.from_template(
    """Given the following user question, corresponding SQL query, and SQL result, answer the user question.

Question: {question}
SQL Query: {query}
SQL Result: {result}
Answer: """
)

# Create a pipeline for generating natural answers
answer = answer_prompt | llm | StrOutputParser()

# Create a chain to execute the generated query and produce an answer
chain = (
    RunnablePassthrough.assign(query=write_query).assign(
        result=itemgetter("query") | execute_query
    )
    | answer
)

# Check the execution result
chain.invoke({"question": "Calculate the total of Altman's transactions."})

"The total of Altman's transactions is -965.7."

Using SQL Query Chain with an Agent

What is `agent_toolkits` ?

agent_toolkits in LangChain is a collection of tools designed to simplify the creation and use of Agents optimized for specific domains or use cases.

Each toolkit encapsulates the functionality and workflows needed for specific tasks (e.g., SQL query processing, file system operations, API calls), enabling developers to perform complex tasks with ease.

`create_sql_agent`

create_sql_agent is a specialized function within the agent_toolkits library that simplifies the process of interacting with SQL databases.

It is designed to streamline SQL query generation and execution by leveraging LangChain’s agent capabilities. Developers can integrate this tool to enable agents to:

Connect to SQL databases seamlessly.
Automatically generate SQL queries based on natural language input.
Retrieve and format results for easy consumption.

This functionality is particularly useful for scenarios requiring dynamic database interactions, such as reporting, analytics, or user-facing applications that need query-based responses.

Using an agent, you can generate SQL queries and output the results as answers.

Agents work well with models like gpt-4o and gpt-4o-mini , in contrast to the issues encountered with chains when changing the model.

from langchain_openai import ChatOpenAI
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import create_sql_agent

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Connect to the SQLite database
db = SQLDatabase.from_uri("sqlite:///data/finance.db")

# Create the agent
agent_executor = create_sql_agent(llm, db=db, agent_type="openai-tools", verbose=True)

# Check the execution result
agent_executor.invoke(
    {"input": "Calculate and compare the total transactions of Altman and Zuckerberg."}
)

    
    > Entering new SQL Agent Executor chain...
    
    Invoking: `sql_db_list_tables` with `{}`
    
    
    accounts, customers, transactions
    Invoking: `sql_db_schema` with `{'table_names': 'accounts'}`
    
    
    
    CREATE TABLE accounts (
    	account_id INTEGER, 
    	customer_id INTEGER, 
    	balance REAL, 
    	PRIMARY KEY (account_id), 
    	FOREIGN KEY(customer_id) REFERENCES customers (customer_id)
    )
    
    /*
    3 rows from accounts table:
    account_id	customer_id	balance
    1	1	1000.5
    2	2	2500.75
    3	3	1500.0
    */
    Invoking: `sql_db_schema` with `{'table_names': 'customers'}`
    
    
    
    CREATE TABLE customers (
    	customer_id INTEGER, 
    	name TEXT, 
    	age INTEGER, 
    	email TEXT
    )
    
    /*
    3 rows from customers table:
    customer_id	name	age	email
    1	Altman	40	Sam@example.com
    2	Huang	62	Jensen@example.com
    3	Zuckerberg	41	Mark@example.com
    */
    Invoking: `sql_db_schema` with `{'table_names': 'transactions'}`
    
    
    
    CREATE TABLE transactions (
    	transaction_id INTEGER, 
    	account_id INTEGER, 
    	amount REAL, 
    	transaction_date TEXT, 
    	PRIMARY KEY (transaction_id), 
    	FOREIGN KEY(account_id) REFERENCES accounts (account_id)
    )
    
    /*
    3 rows from transactions table:
    transaction_id	account_id	amount	transaction_date
    1	1	74.79	2024-07-13
    2	1	-224.1	2024-05-13
    3	1	-128.9	2024-01-25
    */
    Invoking: `sql_db_query_checker` with `{'query': "SELECT customer_id, name FROM customers WHERE name IN ('Altman', 'Zuckerberg')"}`
    responded: The relevant tables and their structures are as follows:
    
    1. **accounts**: 
       - `account_id`: INTEGER
       - `customer_id`: INTEGER
       - `balance`: REAL
    
    2. **customers**: 
       - `customer_id`: INTEGER
       - `name`: TEXT
       - `age`: INTEGER
       - `email`: TEXT
    
    3. **transactions**: 
       - `transaction_id`: INTEGER
       - `account_id`: INTEGER
       - `amount`: REAL
       - `transaction_date`: TEXT
    
    To calculate and compare the total transactions of Altman and Zuckerberg, I will need to:
    1. Find the `customer_id` for both Altman and Zuckerberg from the `customers` table.
    2. Use their `customer_id` to find their corresponding `account_id` from the `accounts` table.
    3. Sum the `amount` from the `transactions` table for each `account_id`.
    
    Let's construct the SQL queries for these steps. 
    
    First, I will get the `customer_id` for Altman and Zuckerberg. Then, I will get their `account_id` and finally sum their transactions. 
    
    I will start by querying the `customer_id` for both customers.
    
    ```sql
    SELECT customer_id, name FROM customers WHERE name IN ('Altman', 'Zuckerberg')
    ```
    Invoking: `sql_db_query` with `{'query': "SELECT customer_id, name FROM customers WHERE name IN ('Altman', 'Zuckerberg')"}`
    
    
    [(1, 'Altman'), (3, 'Zuckerberg')]
    Invoking: `sql_db_query_checker` with `{'query': 'SELECT account_id FROM accounts WHERE customer_id IN (1, 3)'}`
    
    
    ```sql
    SELECT account_id FROM accounts WHERE customer_id IN (1, 3)
    ```
    Invoking: `sql_db_query` with `{'query': 'SELECT account_id FROM accounts WHERE customer_id IN (1, 3)'}`
    
    
    [(1,), (3,)]
    Invoking: `sql_db_query_checker` with `{'query': 'SELECT SUM(amount) FROM transactions WHERE account_id IN (1, 3) GROUP BY account_id'}`
    
    
    ```sql
    SELECT SUM(amount) FROM transactions WHERE account_id IN (1, 3) GROUP BY account_id
    ```
    Invoking: `sql_db_query` with `{'query': 'SELECT SUM(amount) FROM transactions WHERE account_id IN (1, 3) GROUP BY account_id'}`
    
    
    [(-965.7,), (656.6400000000002,)]The total transactions for each customer are as follows:
    
    - **Altman** (account_id 1): Total transactions amount to **-965.7**.
    - **Zuckerberg** (account_id 3): Total transactions amount to **656.64**.
    
    In summary, Zuckerberg has a positive total transaction amount, while Altman has a negative total transaction amount.
    
    > Finished chain.

{'input': 'Calculate and compare the total transactions of Altman and Zuckerberg.',
 'output': 'The total transactions for each customer are as follows:\n\n- **Altman** (account_id 1): Total transactions amount to **-965.7**.\n- **Zuckerberg** (account_id 3): Total transactions amount to **656.64**.\n\nIn summary, Zuckerberg has a positive total transaction amount, while Altman has a negative total transaction amount.'}

Differences Between `create_sql_query_chain` and SQL Agent

create_sql_query_chain :
- Translates user input into a single SQL query and executes it directly.
- Best for simple, direct query execution.
SQL Agent :
- Handles more complex workflows, involving multiple queries and reasoning steps.
- Ideal for dynamic or multi-step tasks.
Conclusion: It is recommended to use create_sql_query_chain for simple queries, while SQL Agent is suggested for complex or iterative processes.

Appendix : Chain with `gpt-4o` and a Post-Processing Function

As observed earlier, gpt-4o output can be inconsistent.

To improve this, a chain can be constructed with a post-processing function applied.

This part consists of the following procedure:

Natural language question input → SQL query generation by LLM → Post-processing of the generated query → Query execution on the database → Natural language answer output by LLM

Load and verify the sample database data.

from langchain_openai import ChatOpenAI
from langchain.chains import create_sql_query_chain
from langchain_community.utilities import SQLDatabase

# Connect to the SQLite database.
db = SQLDatabase.from_uri("sqlite:///data/finance.db")

# Output the database dialect.
print(db.dialect)

# Output the available table names.
print(db.get_usable_table_names())

sqlite
    ['accounts', 'customers', 'transactions']

Create an LLM object and use a custom template to generate a chain by providing the LLM and DB as parameters.

from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template(
    """Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer. 
    Unless the user specifies in his question a specific number of examples he wishes to obtain, always limit your query to at most {top_k} results. 
    You can order the results by a relevant column to return the most interesting examples in the database.

[Important] You must respond strictly in the format 'select column_name from table_name [Options:LIMIT,ORDER BY,others]'.

Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use the following tables:
{table_info}

Here is the description of the columns in the tables:
`cust`: customer name
`prod`: product name
`trans`: transaction date

Question: {input}
"""
).partial(dialect=db.dialect)

# Create an OpenAI LLM object
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Generate a chain by providing the LLM and DB as parameters.
write_query = create_sql_query_chain(llm, db, prompt, k=10)

Check the execution result

# The result contains unnecessary text, such as 'SQLQuery: '
write_query.invoke({"question": "List the all customer names."})

'SQLQuery: select name from customers ORDER BY name LIMIT 10'

Define a function for post-processing the SQL query.

# It is necessary to remove unnecessary text.
import re


# Define the regex parsing function
def parse_sqlquery(query):
    match = re.search(r"SQLQuery:\s*(.*)", query)
    if match:
        return match.group(1).strip()
    else:
        return query

Create a tool to execute the generated query.

from langchain_community.tools import QuerySQLDatabaseTool

execute_query = QuerySQLDatabaseTool(db=db)

Combine the chains: SQL query generation → Post-processing of the generated query → Query execution on the database

from langchain_core.runnables import RunnablePassthrough

chain = (
    RunnablePassthrough.assign(query=write_query).assign(
        query=lambda x: parse_sqlquery(x["query"])
    )
    | execute_query
)

# Check the execution result
chain.invoke({"question": "List the all customer names."})

"[('Altman',), ('Chase',), ('Hassabis',), ('Huang',), ('Musk',), ('Zuckerberg',)]"

# Check the execution result
chain.invoke({"question": "Calculate the total of Altman's transactions."})

'[(-965.7,)]'

Combine the chains: SQL query generation → Post-processing of the generated query → Query execution on the database → Natural language answer output by LLM

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

# Define the prompt for generating answers
answer_prompt = PromptTemplate.from_template(
    """Given the following user question, corresponding SQL query, and SQL result, answer the user question.

Question: {question}
SQL Query: {query}
SQL Result: {result}
Answer: """
)

# Create a pipeline for generating natural answers
answer = answer_prompt | llm | StrOutputParser()

chain = (
    RunnablePassthrough.assign(query=write_query).assign(
        query=lambda x: parse_sqlquery(x["query"])
    )
    | RunnablePassthrough.assign(result=execute_query)
    | answer
)

# Check the execution result
chain.invoke({"question": "List the all customer names."})

'The customer names are: Altman, Chase, Hassabis, Huang, Musk, and Zuckerberg.'

# Check the execution result
chain.invoke({"question": "Calculate the total of Altman's transactions."})

"The total of Altman's transactions is -965.7."

PreviousSummarization NextStructured Output Chain

Last updated 5 months ago

Overview

Table of Contents