TitanicQASystem
Author: Taylor (Jihyun Kim)
Peer Review: Jongcheol Kim, Heesun Moon
Proofread: Juni Lee
This is a part of LangChain Open Tutorial
Overview
In this tutorial, we walk through building a Q&A system with the Titanic dataset stored in a Neo4j graph.
Starting with a CSV file, we preprocess passenger data and model both nodes (Passenger) and relationships (MARRIED_TO, SIBLING_OF, PARENT_OF) in Neo4j.
LangChain then transforms user questions into Cypher queries that retrieve key insights, while LangGraph handles invalid queries by allowing an LLM to revise them based on Neo4j feedback.
Ultimately, you gain a robust pipeline to analyze relationships, compute statistics, and explore passenger connections in the Titanic dataset.
Setup & Data Preparation
Acquire and clean the Titanic CSV file.
Preprocess passenger data by handling missing values and extracting relevant fields.
Graph Modeling in Neo4j
Create Passenger nodes with core properties (e.g., age, ticket, survived).
Establish relationships such as MARRIED_TO, SIBLING_OF, PARENT_OF.
Querying with LangChain
Convert natural-language questions into Cypher queries.
Retrieve insights like family ties, survival rates, and ticket usage.
Error Handling with langgraph
Catch and correct invalid queries.
Provide Neo4j feedback to an LLM, automatically revising problematic Cypher statements.
Outcome
By the end, you'll have a robust Q&A system for exploring passenger connections, computing statistics, and discovering meaningful patterns in the Titanic dataset stored in Neo4j.
Environment Setup
Set up the environment. You may refer to Environment Setup for more details.
[Note]
langchain-opentutorial is a package that provides easy-to-use environment setup, along with useful functions and utilities for these tutorials. You can check out langchain-opentutorial for more details.
You can alternatively set API keys such as OPENAI_API_KEY in a .env file and load them.
[Note] This is not necessary if you've already set the required API keys in previous steps.
Load Titanic Data
Data Preparation
In this tutorial, we will use the following CSV file:
Download Link: Kaggle Titanic Dataset
Author : M Yasser H (kaggle ID)
File name: Titanic-Dataset.csv
File path: "./data/Titanic-Dataset.csv"
There are two ways to obtain the dataset:
Download directly from the Kaggle link above
Use the Python code below to automatically download via Kaggle API
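As a sketch of the loading step, the snippet below parses a two-row inline sample so it runs without the download; with the real file you would pass "./data/Titanic-Dataset.csv" to pd.read_csv instead.

```python
import io

import pandas as pd

# Two sample rows inlined for illustration; with the downloaded file you
# would call pd.read_csv("./data/Titanic-Dataset.csv") instead.
sample = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S\n'
    '5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S\n'
)
df = pd.read_csv(sample)
print(df.shape)  # (2, 12)
```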
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Column Descriptions
Key column descriptions:
PassengerId: Unique identifier for each passenger
Survived: Survival status (0 = No, 1 = Yes)
Pclass: Ticket class (1, 2, 3)
Name: Passenger name
Sex: Gender
Age: Age in years
SibSp: Number of siblings/spouses aboard
Parch: Number of parents/children aboard
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Simple Data Preprocessing
Creating LastName Column: Extracts the last name from the Name column and stores it in a new LastName column. The extraction is done by splitting the string at the comma and taking the first element.
Removing Missing Values: Drops rows where the Age column has null values to ensure data completeness.
Data Type Conversion: Converts the Ticket column to string format to maintain consistency in data types.
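The three steps above can be sketched in pandas as follows (the two-row frame and ticket values are illustrative; in the tutorial these steps run on the full dataset):

```python
import pandas as pd

# Tiny illustrative frame; the second row has a missing Age on purpose.
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Moran, Mr. James"],
    "Age": [22.0, None],
    "Ticket": [330877, 12345],  # hypothetical ticket values
})

df["LastName"] = df["Name"].str.split(",").str[0]  # text before the comma
df = df.dropna(subset=["Age"])                     # drop rows missing Age
df["Ticket"] = df["Ticket"].astype(str)            # keep tickets as strings
print(df)
```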
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | LastName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Braund |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Cumings |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | Heikkinen |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | Futrelle |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | Allen |
Neo4j Database Connection
This tutorial is based on Neo4j Desktop. First, install the Neo4j graph database using the Neo4j Desktop Installation link.
[Note] You can set up Neo4j in several ways
Neo4j Desktop: A desktop application for local development
Neo4j Sandbox: A free, cloud-based platform for working with graph databases
Docker: Run Neo4j in a container using the official Neo4j Docker image
[Important] Before importing the CSV file, please follow the setup instructions in the link:
Setup APOC Plugin
Update the neo4j.conf file.
Define Neo4j Credentials
Next, you need to define your Neo4j credentials. If you haven't done this in the previous steps, you can define them using the os package.
[Note] This is not necessary if you've already set the required Neo4j credentials in previous steps.
The default user account information:
Default username: neo4j
Default password: neo4j
You are required to change the password upon your first login.
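Setting the credentials with the os package might look like the following sketch (the URI and password are placeholder values; use the ones you chose after the forced first-login change):

```python
import os

# Placeholder credentials for a local Neo4j Desktop DBMS; replace the
# password with the value you set after your first login.
os.environ["NEO4J_URI"] = "bolt://localhost:7687"
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = "your-new-password"
```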
Import Titanic data
How to Import a CSV File into Neo4j
We will import the CSV file into Neo4j Desktop by adding it to the import folder.
To open a file browser window, hover over the three dots on the right side of the running DBMS, select Open folder, and then click Import.
You can directly drag & drop files into this folder to add them.
Let's verify the data import using Neo4j Browser.
We will use a simple Cypher query to verify that the data has been successfully added:
This query will count the number of rows in the Titanic-Dataset.csv file and return the total count. If the data is accessible and correctly loaded, you will see the total row count in the result.
Cypher:
LOAD CSV FROM 'file:///Titanic-Dataset.csv' AS row RETURN count(row);


It has been successfully loaded!
Consider the data with Arrows.app
When converting a complete tabular dataset like a passenger manifest into a graph, it may seem simple to create nodes for each passenger, ticket, and embarkation point while turning the remaining columns into properties.
However, the flexibility of the graph structure requires careful consideration of how to categorize data into nodes, relationships, or properties. The way the data is structured may vary depending on the types of queries you plan to run on the graph.
To assist with this, Neo4j provides Arrows.app, a tool that lets you visualize graph relationships before loading any data. With Arrows.app, you can explore and experiment with different ways to model the data. To demonstrate this, I will present an example graph that represents a complex data structure.
Defining the Relationship Categories
The first step is to define the categories of relationships we are interested in.
Here are the three relationships I had to define: MARRIED_TO, SIBLING_OF, PARENT_OF.

Both MARRIED_TO and SIBLING_OF imply the same relationship in the other direction between the same nodes.
PARENT_OF implies a reverse CHILD_OF relationship.
Data Restructure
Why We Create Passenger Nodes
We create Passenger nodes to represent each Titanic passenger in the graph database.
This enables us to:
Assign properties (e.g., age, ticket, survived) directly to a node.
Connect these people with relationships to other entities (e.g., SIBLING_OF, MARRIED_TO, PARENT_OF) once we identify family links or other relevant data points.
Query the graph to analyze connections, and run aggregations on survivor counts, family group structures, or other correlations inherent in the Titanic dataset.
By modeling people as nodes, Neo4j can leverage its graph capabilities (like path finding, pattern matching, or graph algorithms) to deliver deeper insights than a traditional relational or tabular approach might.
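As an illustration, creating the nodes from the imported file might look like the Cypher below; the property keys are assumptions chosen to match the examples above, not the tutorial's exact statement.

```cypher
// Sketch: one Passenger node per CSV row, with core properties.
LOAD CSV WITH HEADERS FROM 'file:///Titanic-Dataset.csv' AS row
CREATE (:Passenger {
  passengerId: toInteger(row.PassengerId),
  name: row.Name,
  sex: row.Sex,
  age: toFloat(row.Age),
  ticket: row.Ticket,
  survived: toInteger(row.Survived)
});
```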
Verifying Nodes in Neo4j Browser
Cypher query:
MATCH (n:Passenger) RETURN n LIMIT 10
The nodes have been successfully created!
Why Create These Relationships?
MARRIED_TO: Infers that a couple is married if they share the same Ticket, have the same LastName, have sibsp = 1 (i.e., exactly one sibling/spouse count in the data), differ in sex, and pass a few additional age-based checks.
SIBLING_OF: Among those not married, uses SibSp, LastName, Ticket, and other constraints (e.g., (p2).parch = 1 or 2) to guess that passengers are siblings if they appear to share the same "family" context but are not recognized as spouses.
PARENT_OF (and/or CHILD_OF): If a passenger has parch >= 1 (parents/children on board) and is older than some threshold, or specifically older than the potential child, create PARENT_OF edges.
These queries are heuristics to reconstruct plausible family connections from partial data. They rely on simplified assumptions—such as “If two people share a ticket, they might be family,” “If a passenger’s sibsp=1, that single sibling/spouse is probably a spouse rather than a child,” etc. You can refine or alter the logic to fit your own inference approach.
Key Idea
< MARRIED_TO >
Find passengers (person, other) who share the same ticket using the TRAVELED_ON relationship.
Create a family-members list by collecting others (collect(other)) after ORDER BY other.age DESC.
Consider familyMembers[0] (the oldest person) as the "spouse candidate" or "family representative".
Use a FOREACH(...CREATE...) statement to create relationships only for passengers meeting specific conditions:
- p2 = familyMembers[0] → "Only consider the oldest (or first) person as a spouse candidate"
- (size(familyMembers) = 1 OR p1.age > familyMembers[1].age) → Complex conditions like "If there's only one family member, or if p1 is older than the second-oldest person..."
If passengers share the same ticket + same family + sibsp = 1, they are considered spouses, processing only the first person by family age order as a spouse.
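Under those assumptions, the heuristic could be sketched in Cypher roughly as follows; the TRAVELED_ON relationship and property names such as lastName and sibsp come from the description above, not from a verified schema.

```cypher
// Heuristic sketch of the MARRIED_TO inference described above.
MATCH (p1:Passenger)-[:TRAVELED_ON]->(:Ticket)<-[:TRAVELED_ON]-(other:Passenger)
WHERE p1.lastName = other.lastName
  AND p1.sibsp = 1
  AND p1.sex <> other.sex
WITH p1, other ORDER BY other.age DESC
WITH p1, collect(other) AS familyMembers
WHERE size(familyMembers) = 1 OR p1.age > familyMembers[1].age
WITH p1, familyMembers[0] AS p2   // oldest co-traveler as spouse candidate
MERGE (p1)-[:MARRIED_TO]-(p2);
```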
< SIBLING_OF & PARENT_OF >
Find passengers sharing the same ticket and family name using the TRAVELED_ON relationship.
Create a family-members list ordered by age (ORDER BY other.age DESC).
Identify siblings based on conditions:
- Not married (no MARRIED_TO relationship)
- Has siblings (sibsp >= 1)
- Same sibsp value between passengers
- Family value >= 1
- Parent/child count (parch) is 1 or 2
- Not the oldest family member
Identify children based on conditions:
- Not married
- Not in the siblings list
- Family value >= 1
- Parent/child count is 1 or 2
- Age comparison (p1 older than p2)
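For the parent/child side, a rough Cypher sketch of the conditions above could look like this; again, TRAVELED_ON and the property names are assumptions from the description, not a verified schema.

```cypher
// Sketch of the PARENT_OF heuristic: same ticket and last name, a parch
// value of 1 or 2, no MARRIED_TO edge, and the parent older than the child.
MATCH (p1:Passenger)-[:TRAVELED_ON]->(:Ticket)<-[:TRAVELED_ON]-(p2:Passenger)
WHERE p1.lastName = p2.lastName
  AND NOT (p1)-[:MARRIED_TO]-(p2)
  AND p2.parch IN [1, 2]
  AND p1.age > p2.age
MERGE (p1)-[:PARENT_OF]->(p2);
```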
Consider the Data with Neo4j Desktop Visualization
While we often handle and analyze large datasets using machine learning and deep learning techniques, visualizing data relationships through graph databases like Neo4j offers unique insights. The attached node visualization from Neo4j Desktop demonstrates the intricate connections within our Titanic dataset.
This graph-based approach allows us to:
Discover hidden patterns in passenger relationships
Analyze survival rates based on social connections
Identify clusters of passengers with similar characteristics
Explore complex relationships that might be missed in traditional tabular analysis
By combining these visual insights with ML/DL approaches, we can develop a more comprehensive understanding of the data and potentially uncover novel patterns that might be overlooked using traditional analysis methods alone.
[Attached: Neo4j Desktop visualization of Titanic dataset relationships]

Usage Example
Exploring Titanic Dataset with Neo4j and LangGraph
When converting natural language into Cypher queries, the process doesn’t always succeed on the first try. Queries can fail for various reasons:
Nonexistent columns or properties
Typos in relationship or node labels
Logical mistakes leading to syntax errors
To handle these challenges, this tutorial demonstrates:
Robust error handling for query validation
Property existence checking before query execution
Automated syntax verification using EXPLAIN
Smart query reformulation using LLMs
Step-by-step debugging techniques for complex queries
Usage Example using LangGraph
Why Use langgraph?
Sometimes, natural language can’t be directly or accurately converted to a valid Cypher query on the first try. The query might fail for various reasons:
Nonexistent Columns or Properties
Typos in relationship labels or node labels
Logical Mistakes leading to syntax errors
Instead of manually fixing the query, we can automate the process:
Ask an LLM to generate a query from a user’s question.
Try running the query against
Neo4j.If it fails, capture the error message and feed it back to the LLM so it can revise the query.
Retry until a valid query is produced or we exceed the maximum number of attempts.
This iterative approach significantly improves robustness when handling open-ended user questions.
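The loop above can be sketched in plain Python; here generate_cypher and revise_cypher are hypothetical stand-ins for the LLM calls, and execute stands in for running the query against Neo4j (e.g., graph.query):

```python
MAX_ATTEMPTS = 3  # assumed retry budget for illustration

def run_with_retries(question, execute, generate_cypher, revise_cypher):
    """Generate a Cypher query, run it, and let the LLM revise it on failure."""
    query = generate_cypher(question)
    for _ in range(MAX_ATTEMPTS):
        try:
            return execute(query)  # e.g., graph.query(query) against Neo4j
        except Exception as err:
            # Feed the database's error message back so the LLM can revise.
            query = revise_cypher(question, query, str(err))
    raise RuntimeError("No valid Cypher query after retries")
```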
First, we will create a chain that detects any errors in the Cypher statement and extracts the property values it references.
LLMs often struggle with correctly determining relationship directions in generated Cypher statements. Since we have access to the schema, we can deterministically correct these directions using the CypherQueryCorrector.
Note: The CypherQueryCorrector is an experimental feature and doesn't support all the newest Cypher syntax.
Now we can implement the Cypher validation step. First, we use the EXPLAIN method to detect any syntax errors. Next, we leverage the LLM to identify potential issues and extract the properties used for filtering. For string properties, we validate them against the database using a simple CONTAINS clause.
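One way to sketch the EXPLAIN check is below; graph is assumed to be any client exposing a .query() method, such as a connected Neo4jGraph.

```python
def validate_cypher_syntax(graph, query: str):
    """Return None if the query plans cleanly, else the error message."""
    try:
        # EXPLAIN asks Neo4j to plan the query without executing it,
        # so syntax errors surface without touching any data.
        graph.query(f"EXPLAIN {query}")
        return None
    except Exception as err:
        return str(err)  # feed this back to the LLM for correction
```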
Based on the validation results, the process can take the following paths:
If value mapping fails, we end the conversation and inform the user that we couldn't identify a specific property value (e.g., a passenger name or ticket number).
If errors are found, we route the query for correction.
If no issues are detected, we proceed to the Cypher execution step.
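A conditional-edge function for this routing might look like the sketch below; the state keys next_action and cypher_errors, and the node names, are hypothetical placeholders rather than the tutorial's exact graph definition.

```python
def validate_cypher_condition(state: dict) -> str:
    """Pick the next node based on the validation results stored in state."""
    if state.get("next_action") == "end":   # value mapping failed: stop
        return "end"
    if state.get("cypher_errors"):          # errors found: fix the query
        return "correct_cypher"
    return "execute_cypher"                 # no issues: run the query
```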
The Cypher correction step takes the existing Cypher statement, any identified errors, and the original question to generate a corrected version of the query.
We need to add a step that executes the given Cypher statement. If no results are returned, we should explicitly handle this scenario, as leaving the context empty can sometimes lead to LLM hallucinations.
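The execution step with explicit empty-result handling could be sketched as follows; again, graph is assumed to expose a .query() method, and the fallback message is an illustrative choice.

```python
NO_RESULTS = "I couldn't find any relevant information in the database."

def execute_cypher(graph, query: str) -> str:
    """Run the query; make the empty-result case explicit for the LLM."""
    records = graph.query(query)
    # Returning an explicit message instead of an empty context helps
    # prevent the answer-generation LLM from hallucinating results.
    return str(records) if records else NO_RESULTS
```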
The final step is to generate the answer. This involves combining the initial question with the database output to produce a relevant response.
Next, we will implement the LangGraph workflow, starting with defining the conditional edge functions.
Let's put it all together now.
