🧠 AI with Python – 🚢 Titanic Survival Predictor
Posted on: November 13, 2025
Description:
Predicting who survived the Titanic disaster has become a classic problem in data science — one that demonstrates how feature engineering and machine learning pipelines can turn raw passenger data into accurate predictive insights.
In this project, we use Gradient Boosting with advanced engineered features such as passenger titles, family size, cabin deck, and fare-based normalization to predict survival probabilities.
Understanding the Problem
The Titanic dataset from Kaggle contains information about passengers — including their class, age, gender, and ticket details — along with whether they survived (Survived = 1) or not (Survived = 0).
A simple model might look only at gender or passenger class. But deeper analysis shows that social status, family groups, and deck levels play a big role in survival chances.
Our goal is to:
- Extract meaningful features that represent social and spatial context.
- Handle missing values intelligently.
- Build a robust pipeline for model training and evaluation.
1. Load and Explore the Dataset
We begin with the Kaggle Titanic dataset (titanic.csv), which includes 891 passenger records.
import pandas as pd
import numpy as np

df = pd.read_csv("titanic.csv")
df.head()
Each record contains demographic and ticket details such as:
Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked.
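Before engineering new features, a quick aggregation already confirms the intuition that sex and class strongly separate survival rates. A minimal sketch, assuming the standard Kaggle training columns (including Survived):

# Survival rate by sex and passenger class
print(df.groupby(["Sex", "Pclass"])["Survived"].mean().round(2))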
2. Feature Engineering
Feature engineering is the core of this project — it allows us to capture real-world relationships within the data.
🧩 Extract Titles from Names
Names often encode social class and gender roles.
For example, “Mr.” or “Master” reveals gender and age group, while “Dr.” or “Countess” indicates higher social standing.
def extract_title(name):
    title = name.split(",")[1].split(".")[0].strip()
    mapping = {
        "Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs",
        "Lady": "Rare", "Countess": "Rare", "Sir": "Rare",
        "Capt": "Rare", "Col": "Rare", "Major": "Rare",
        "Rev": "Rare", "Dr": "Rare"
    }
    return mapping.get(title, title)

df["Title"] = df["Name"].apply(extract_title)
👨👩👧 Create Family-Based Features
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
Passengers traveling with family often had better survival odds.
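A quick check of that claim on the engineered column (a small sketch, assuming the Survived column is present):

# Compare survival rates for solo travelers vs. passengers with family aboard
print(df.groupby("IsAlone")["Survived"].mean().round(2))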
💺 Group Tickets and Cabin Decks
Ticket numbers reveal shared groups; cabin letters correspond to decks.
df["TicketGroupSize"] = df.groupby("Ticket")["Ticket"].transform("count")
df["CabinDeck"] = df["Cabin"].apply(lambda x: x[0] if pd.notna(x) else "U")
💰 Normalize Fare per Person
To adjust for family size, we compute fare per individual and add logarithmic transformations.
df["FarePerPerson"] = df["Fare"] / df["FamilySize"]
df["LogFare"] = np.log1p(df["Fare"])
df["LogFarePerPerson"] = np.log1p(df["FarePerPerson"])
3. Handle Missing Values
Missing Age values are filled using the median from passengers with the same Title, Sex, and Pclass combination.
This ensures age estimation aligns with realistic social group behavior.
age_group_median = (
    df.groupby(["Title", "Sex", "Pclass"])["Age"]
    .median()
    .reset_index()
    .rename(columns={"Age": "AgeMedian"})
)

df = df.merge(age_group_median, on=["Title", "Sex", "Pclass"], how="left")
df["Age"] = df["Age"].fillna(df["AgeMedian"])
df.drop(columns=["AgeMedian"], inplace=True)
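An optional sanity check of what is still missing after the grouped-median fill; the remaining gaps in Embarked and Cabin are handled later by the pipeline imputers and the "U" deck placeholder:

# Remaining missing values after the Age fill
print(df[["Age", "Embarked", "Cabin"]].isna().sum())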
4. Build a Preprocessing and Model Pipeline
We define transformations for numeric and categorical columns separately and combine them using a ColumnTransformer.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
numeric_features = ["Age", "Fare", "FamilySize", "IsAlone", "TicketGroupSize",
                    "FarePerPerson", "LogFare", "LogFarePerPerson"]
categorical_features = ["Pclass", "Sex", "Embarked", "Title", "CabinDeck"]

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])
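To see what the transformer actually produces, you can fit it on the engineered feature columns and list the expanded (scaled and one-hot) output names. This is only an inspection sketch, and it assumes a recent scikit-learn version that provides get_feature_names_out:

# Fit the preprocessor on its own and inspect the transformed feature names
feature_frame = df[numeric_features + categorical_features]
preprocessor.fit(feature_frame)
print(len(preprocessor.get_feature_names_out()), "transformed features")
print(preprocessor.get_feature_names_out()[:10])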
5. Model Training and Evaluation
We compare multiple models — Logistic Regression, Random Forest, and Gradient Boosting — using ROC-AUC with 5-fold stratified cross-validation.
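The cross-validation below runs on a held-out training split. A minimal sketch of that split (the same code appears in the full listing at the end), using the feature lists defined above:

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for final evaluation, stratified on the target
X = df[numeric_features + categorical_features]
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)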
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "LogisticRegression": LogisticRegression(max_iter=5000),
    "RandomForest": RandomForestClassifier(n_estimators=400, random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42)
}

for name, clf in models.items():
    pipe = Pipeline([("prep", preprocessor), ("clf", clf)])
    scores = cross_val_score(pipe, X_train, y_train, scoring="roc_auc", cv=cv)
    print(f"{name}: ROC AUC {scores.mean():.3f} ± {scores.std():.3f}")
Gradient Boosting performs the best, so we proceed with it.
6. Hyperparameter Tuning
Using GridSearchCV, we find the best model configuration.
from sklearn.model_selection import GridSearchCV

param_grid = {
    "clf__n_estimators": [100, 200, 300],
    "clf__learning_rate": [0.05, 0.1, 0.2],
    "clf__max_depth": [2, 3],
    "clf__subsample": [0.8, 1.0]
}

gbdt = Pipeline([("prep", preprocessor), ("clf", GradientBoostingClassifier(random_state=42))])
grid = GridSearchCV(gbdt, param_grid, scoring="roc_auc", cv=cv, n_jobs=-1)
grid.fit(X_train, y_train)

print("Best AUC:", grid.best_score_)
print("Best Params:", grid.best_params_)
✅ Example output:
Best AUC: 0.9049
Best Params: {'clf__learning_rate': 0.1, 'clf__max_depth': 3, 'clf__n_estimators': 100, 'clf__subsample': 0.8}
Test AUC (best): 0.8479
7. Evaluate the Final Model
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, roc_auc_score
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]
print("Test ROC AUC:", roc_auc_score(y_test, y_proba))
print("\nClassification Report:\n", classification_report(y_test, y_pred, digits=3))
An AUC of 0.84–0.85 on the test data indicates a strong, generalizable model.
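To visualize the separation that the AUC summarizes, the ROC curve can be drawn directly from the held-out predictions (a small sketch using scikit-learn's RocCurveDisplay):

import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

# ROC curve from the test-set probabilities
RocCurveDisplay.from_predictions(y_test, y_proba, name="GradientBoosting")
plt.title("Titanic – ROC Curve (Test)")
plt.show()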
Key Takeaways
- Feature Engineering Matters – Adding contextual features like Title, FamilySize, and CabinDeck significantly boosts model accuracy.
- Pipelines Ensure Clean Workflow – Using preprocessing pipelines keeps data transformation and model training consistent and reproducible.
- Gradient Boosting Strength – A reliable choice for structured datasets; it handles nonlinear relationships and feature interactions well.
- Cross-Validation + Grid Search – Helps prevent overfitting while finding the most effective hyperparameters.
- AUC as an Evaluation Metric – AUC ≈ 0.85 shows strong separation between survivors and non-survivors.
Conclusion
By combining feature engineering, data preprocessing, and model tuning, we’ve built a high-performing survival predictor that captures hidden relationships in the Titanic dataset.
This project demonstrates how data-driven feature design — not just algorithms — drives real improvements in predictive modeling.
You can now extend it further with XGBoost, LightGBM, or SHAP analysis for feature importance visualization.
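Before reaching for SHAP, the fitted pipeline's own impurity-based importances give a first look at which engineered features matter. A sketch, assuming best_model is the tuned pipeline from the grid search and a scikit-learn version that supports get_feature_names_out:

import pandas as pd

# Map the boosted trees' importances back to the transformed feature names
feature_names = best_model.named_steps["prep"].get_feature_names_out()
importances = best_model.named_steps["clf"].feature_importances_
top10 = pd.Series(importances, index=feature_names).sort_values(ascending=False).head(10)
print(top10)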
Code Snippet:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
import matplotlib.pyplot as plt
# Path to Kaggle Titanic train file (adjust as needed)
CSV_PATH = "datasets/titanic.csv" # e.g., "/path/to/train.csv"
df = pd.read_csv(CSV_PATH)
print("Shape:", df.shape)
df.head(3)
def extract_title(name: str) -> str:
    # Titles appear as "Lastname, Title. Firstname"
    if pd.isna(name):
        return "Unknown"
    title = name.split(",")[1].split(".")[0].strip()
    # Map rare titles
    mapping = {
        "Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs",
        "Lady": "Rare", "Countess": "Rare", "Sir": "Rare",
        "Jonkheer": "Rare", "Don": "Rare", "Dona": "Rare",
        "Capt": "Rare", "Col": "Rare", "Major": "Rare",
        "Rev": "Rare", "Dr": "Rare"
    }
    return mapping.get(title, title)
def cabin_deck(cabin: str) -> str:
    if pd.isna(cabin) or not str(cabin).strip():
        return "U"  # Unknown
    return str(cabin)[0]
def engineer_features(df_raw: pd.DataFrame) -> pd.DataFrame:
    df = df_raw.copy()
    # Title
    df["Title"] = df["Name"].apply(extract_title)
    # Simplify extremely rare titles
    freq = df["Title"].value_counts()
    rare_titles = freq[freq < 10].index
    df["Title"] = df["Title"].replace(dict.fromkeys(rare_titles, "Rare"))
    # Family size & solitude
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
    # Ticket group size
    df["TicketGroupSize"] = df.groupby("Ticket")["Ticket"].transform("count")
    # Cabin deck
    df["CabinDeck"] = df["Cabin"].apply(cabin_deck)
    # Fare per person & log transforms (protect against zero)
    df["FarePerPerson"] = df["Fare"] / df["FamilySize"].replace(0, 1)
    df["LogFare"] = np.log1p(df["Fare"])
    df["LogFarePerPerson"] = np.log1p(df["FarePerPerson"])
    # Age bin (quantiles on available values)
    df["AgeBin"] = pd.qcut(df["Age"], q=4, duplicates="drop")
    return df
df_eng = engineer_features(df)
df_eng.head(3)
df_tmp = df_eng.copy()
age_group_median = (
    df_tmp.groupby(["Title", "Sex", "Pclass"])["Age"]
    .median()
    .reset_index()
    .rename(columns={"Age": "AgeMedian"})
)
df_tmp = df_tmp.merge(age_group_median, on=["Title", "Sex", "Pclass"], how="left")
df_tmp["Age"] = df_tmp["Age"].fillna(df_tmp["AgeMedian"])
df_tmp.drop(columns=["AgeMedian"], inplace=True)
# Recompute AgeBin with filled Age
df_tmp["AgeBin"] = pd.qcut(df_tmp["Age"], q=4, duplicates="drop")
df_eng = df_tmp
df_eng.isna().sum()
TARGET = "Survived"
features = [
    # Numeric
    "Age", "Fare", "FamilySize", "IsAlone", "TicketGroupSize",
    "FarePerPerson", "LogFare", "LogFarePerPerson",
    # Categorical
    "Pclass", "Sex", "Embarked", "Title", "CabinDeck", "AgeBin"
]
X = df_eng[features]
y = df_eng[TARGET]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
numeric_features = ["Age", "Fare", "FamilySize", "IsAlone", "TicketGroupSize",
                    "FarePerPerson", "LogFare", "LogFarePerPerson"]
categorical_features = ["Pclass", "Sex", "Embarked", "Title", "CabinDeck", "AgeBin"]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "LogisticRegression": LogisticRegression(max_iter=5000, C=1.0),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=400, random_state=42)
}
cv_scores = {}
for name, clf in models.items():
    pipe = Pipeline(steps=[("prep", preprocessor), ("clf", clf)])
    scores = cross_val_score(pipe, X_train, y_train, scoring="roc_auc", cv=cv)
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: ROC AUC {scores.mean():.3f} ± {scores.std():.3f}")
# Pick one strong model for final fit (GBDT is often great here)
final_model = Pipeline(steps=[("prep", preprocessor), ("clf", GradientBoostingClassifier(random_state=42))])
final_model.fit(X_train, y_train)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
y_pred = final_model.predict(X_test)
y_proba = final_model.predict_proba(X_test)[:, 1]
print("Test ROC AUC:", roc_auc_score(y_test, y_proba))
print("\nClassification Report:\n", classification_report(y_test, y_pred, digits=3))
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Not Survived", "Survived"])
disp.plot(cmap="Purples")
plt.title("Titanic – Confusion Matrix (Test)")
plt.show()
param_grid = {
    "clf__n_estimators": [100, 200, 300],
    "clf__learning_rate": [0.05, 0.1, 0.2],
    "clf__max_depth": [2, 3],
    "clf__subsample": [0.8, 1.0]
}
gbdt = Pipeline(steps=[("prep", preprocessor), ("clf", GradientBoostingClassifier(random_state=42))])
grid = GridSearchCV(gbdt, param_grid, scoring="roc_auc", cv=cv, n_jobs=-1)
grid.fit(X_train, y_train)
print("Best AUC:", grid.best_score_)
print("Best Params:", grid.best_params_)
best_model = grid.best_estimator_
y_proba_best = best_model.predict_proba(X_test)[:, 1]
print("Test AUC (best):", roc_auc_score(y_test, y_proba_best))
# If you want to create a Kaggle submission:
# Load test.csv, apply the SAME feature engineering, then predict.
TEST_CSV_PATH = "datasets/test.csv"  # adjust if needed
test_df = pd.read_csv(TEST_CSV_PATH)
test_eng = engineer_features(test_df)
# Impute Age via the same grouped median strategy (fit on TRAIN!)
age_group_median_train = (
    df_eng.groupby(["Title", "Sex", "Pclass"])["Age"]
    .median()
    .reset_index()
    .rename(columns={"Age": "AgeMedian"})
)
test_eng = test_eng.merge(age_group_median_train, on=["Title", "Sex", "Pclass"], how="left")
test_eng["Age"] = test_eng["Age"].fillna(test_eng["AgeMedian"])
test_eng.drop(columns=["AgeMedian"], inplace=True)
test_eng["AgeBin"] = pd.qcut(test_eng["Age"], q=4, duplicates="drop")
X_submit = test_eng[features]
pred_submit = best_model.predict(X_submit) if 'best_model' in globals() else final_model.predict(X_submit)
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": pred_submit.astype(int)
})
submission.head()
# submission.to_csv("submission.csv", index=False)