AW Dev Rethought

⚖️ There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies - C.A.R. Hoare

🧠 AI with Python – 🚢 Titanic Survival Predictor


Description:

Predicting who survived the Titanic disaster has become a classic problem in data science — one that demonstrates how feature engineering and machine learning pipelines can turn raw passenger data into accurate predictive insights.

In this project, we use Gradient Boosting with advanced engineered features such as passenger titles, family size, cabin deck, and fare-based normalization to predict survival probabilities.


Understanding the Problem

The Titanic dataset from Kaggle contains information about passengers — including their class, age, gender, and ticket details — along with whether they survived (Survived = 1) or not (Survived = 0).

A simple model might look only at gender or passenger class. But deeper analysis shows that social status, family groups, and deck levels play a big role in survival chances.

Our goal is to:

  • Extract meaningful features that represent social and spatial context.
  • Handle missing values intelligently.
  • Build a robust pipeline for model training and evaluation.

1. Load and Explore the Dataset

We begin with the Kaggle Titanic dataset (titanic.csv), which includes 891 passenger records.

import pandas as pd
import numpy as np

df = pd.read_csv("titanic.csv")
df.head()

Each record contains demographic and ticket details such as:

Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked.
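
Before engineering anything, a quick look at missingness and the headline survival patterns is useful (a minimal sketch; Age, Cabin, and Embarked are the columns with missing values in the Kaggle training file):

# Column types and missing values
df.info()
print(df.isna().sum().sort_values(ascending=False).head())

# Survival rate by sex and passenger class – the two strongest "simple" predictors
print(df.pivot_table(values="Survived", index="Sex", columns="Pclass", aggfunc="mean").round(3))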


2. Feature Engineering

Feature engineering is the core of this project — it allows us to capture real-world relationships within the data.

🧩 Extract Titles from Names

Names often encode social class and gender roles.

For example, “Mr.” or “Master” reveals gender and age group, while “Dr.” or “Countess” indicates higher social standing.

def extract_title(name):
    title = name.split(",")[1].split(".")[0].strip()
    mapping = {
        "Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs",
        "Lady": "Rare", "Countess": "Rare", "Sir": "Rare",
        "Capt": "Rare", "Col": "Rare", "Major": "Rare",
        "Rev": "Rare", "Dr": "Rare"
    }
    return mapping.get(title, title)

df["Title"] = df["Name"].apply(extract_title)

👨‍👩‍👧 Create Family-Based Features

df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

Passengers traveling with family often had better survival odds.
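
A quick sanity check of that claim, using only the columns created above:

# Survival rate by family size and by traveling alone
print(df.groupby("FamilySize")["Survived"].mean().round(3))
print(df.groupby("IsAlone")["Survived"].mean().round(3))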

💺 Group Tickets and Cabin Decks

Ticket numbers reveal shared groups; cabin letters correspond to decks.

df["TicketGroupSize"] = df.groupby("Ticket")["Ticket"].transform("count")
df["CabinDeck"] = df["Cabin"].apply(lambda x: x[0] if pd.notna(x) else "U")

💰 Normalize Fare per Person

To adjust for family size, we compute fare per individual and add logarithmic transformations.

df["FarePerPerson"] = df["Fare"] / df["FamilySize"]
df["LogFare"] = np.log1p(df["Fare"])
df["LogFarePerPerson"] = np.log1p(df["FarePerPerson"])

3. Handle Missing Values

Missing Age values are filled using the median from passengers with the same Title, Sex, and Pclass combination.

This ensures age estimation aligns with realistic social group behavior.

age_group_median = (
    df.groupby(["Title", "Sex", "Pclass"])["Age"]
    .median()
    .reset_index()
    .rename(columns={"Age": "AgeMedian"})
)

df = df.merge(age_group_median, on=["Title", "Sex", "Pclass"], how="left")
df["Age"] = df["Age"].fillna(df["AgeMedian"])
df.drop(columns=["AgeMedian"], inplace=True)
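
A quick check (minimal sketch) confirms how many ages remain missing; a group whose ages are all missing would produce a NaN median, and any such stragglers are caught later by the pipeline's median imputer.

print("Remaining missing ages:", df["Age"].isna().sum())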

4. Build a Preprocessing and Model Pipeline

We define transformations for numeric and categorical columns separately and combine them using a ColumnTransformer.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

numeric_features = ["Age", "Fare", "FamilySize", "IsAlone", "TicketGroupSize", 
                    "FarePerPerson", "LogFare", "LogFarePerPerson"]
categorical_features = ["Pclass", "Sex", "Embarked", "Title", "CabinDeck"]

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])
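
If you want to inspect the expanded feature space the transformer produces, here is a minimal sketch (assuming scikit-learn ≥ 1.0 for get_feature_names_out):

# Fit the preprocessor on the engineered feature columns to inspect the one-hot expansion
preprocessor.fit(df[numeric_features + categorical_features])
encoded_names = preprocessor.get_feature_names_out()
print(len(encoded_names), "features after encoding")
print(encoded_names[:10])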

5. Model Training and Evaluation

We compare multiple models — Logistic Regression, Random Forest, and Gradient Boosting — using ROC-AUC with 5-fold stratified cross-validation.
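
The comparison below assumes the engineered data has already been split into train and test sets; here is a minimal version of the split used in the full code snippet at the end:

from sklearn.model_selection import train_test_split

# Feature matrix and target from the engineered DataFrame
X = df[numeric_features + categorical_features]
y = df["Survived"]

# Hold out 20% for final evaluation, preserving class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)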

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=5000),
    "RandomForest": RandomForestClassifier(n_estimators=400, random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42)
}

for name, clf in models.items():
    pipe = Pipeline([("prep", preprocessor), ("clf", clf)])
    scores = cross_val_score(pipe, X_train, y_train, scoring="roc_auc", cv=cv)
    print(f"{name}: ROC AUC {scores.mean():.3f} ± {scores.std():.3f}")

Gradient Boosting performs the best, so we proceed with it.


6. Hyperparameter Tuning

Using GridSearchCV, we find the best model configuration.

from sklearn.model_selection import GridSearchCV

param_grid = {
    "clf__n_estimators": [100, 200, 300],
    "clf__learning_rate": [0.05, 0.1, 0.2],
    "clf__max_depth": [2, 3],
    "clf__subsample": [0.8, 1.0]
}

gbdt = Pipeline([("prep", preprocessor), ("clf", GradientBoostingClassifier(random_state=42))])
grid = GridSearchCV(gbdt, param_grid, scoring="roc_auc", cv=cv, n_jobs=-1)
grid.fit(X_train, y_train)

print("Best AUC:", grid.best_score_)
print("Best Params:", grid.best_params_)

✅ Example output:

Best AUC: 0.9049
Best Params: {'clf__learning_rate': 0.1, 'clf__max_depth': 3, 'clf__n_estimators': 100, 'clf__subsample': 0.8}
Test AUC (best): 0.8479

7. Evaluate the Final Model

from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, roc_auc_score

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print("Test ROC AUC:", roc_auc_score(y_test, y_proba))
print("\nClassification Report:\n", classification_report(y_test, y_pred, digits=3))

An AUC of 0.84–0.85 on the test data indicates a strong, generalizable model.
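
To see where the model errs, we can also plot the confusion matrix (this mirrors the plotting code in the full snippet below):

import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Not Survived", "Survived"])
disp.plot(cmap="Purples")
plt.title("Titanic – Confusion Matrix (Test)")
plt.show()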


Key Takeaways

  1. Feature Engineering Matters – Adding contextual features like Title, FamilySize, and CabinDeck significantly boosts model accuracy.
  2. Pipelines Ensure Clean Workflow – Using preprocessing pipelines keeps data transformation and model training consistent and reproducible.
  3. Gradient Boosting Strength – A reliable choice for structured datasets; it handles nonlinear relationships and feature interactions well.
  4. Cross-Validation + Grid Search – Helps prevent overfitting while finding the most effective hyperparameters.
  5. AUC as an Evaluation Metric – AUC ≈ 0.85 shows strong separation between survivors and non-survivors.

Conclusion

By combining feature engineering, data preprocessing, and model tuning, we’ve built a high-performing survival predictor that captures hidden relationships in the Titanic dataset.

This project demonstrates how data-driven feature design — not just algorithms — drives real improvements in predictive modeling.

You can now extend it further with XGBoost, LightGBM, or SHAP analysis for feature importance visualization.
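
As a lightweight alternative to SHAP, scikit-learn's permutation_importance works directly on the fitted pipeline; a minimal sketch reusing best_model, X_test, and y_test from above:

from sklearn.inspection import permutation_importance

# Shuffle each raw input column and measure the drop in test ROC AUC
result = permutation_importance(
    best_model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=42
)
importances = pd.Series(result.importances_mean, index=X_test.columns).sort_values(ascending=False)
print(importances.head(10))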


Code Snippet:

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

import matplotlib.pyplot as plt


# Path to Kaggle Titanic train file (adjust as needed)
CSV_PATH = "datasets/titanic.csv"  # e.g., "/path/to/train.csv"

df = pd.read_csv(CSV_PATH)

print("Shape:", df.shape)
df.head(3)


def extract_title(name: str) -> str:
    # Titles appear as "Lastname, Title. Firstname"
    if pd.isna(name):
        return "Unknown"
    title = name.split(",")[1].split(".")[0].strip()
    # Map rare titles
    mapping = {
        "Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs",
        "Lady": "Rare", "Countess": "Rare", "Sir": "Rare",
        "Jonkheer": "Rare", "Don": "Rare", "Dona": "Rare",
        "Capt": "Rare", "Col": "Rare", "Major": "Rare",
        "Rev": "Rare", "Dr": "Rare"
    }
    return mapping.get(title, title)

def cabin_deck(cabin: str) -> str:
    if pd.isna(cabin) or not str(cabin).strip():
        return "U"  # Unknown
    return str(cabin)[0]

def engineer_features(df_raw: pd.DataFrame) -> pd.DataFrame:
    df = df_raw.copy()

    # Title
    df["Title"] = df["Name"].apply(extract_title)
    # Simplify extremely rare titles
    freq = df["Title"].value_counts()
    rare_titles = freq[freq < 10].index
    df["Title"] = df["Title"].replace(dict.fromkeys(rare_titles, "Rare"))

    # Family size & solitude
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

    # Ticket group size
    df["TicketGroupSize"] = df.groupby("Ticket")["Ticket"].transform("count")

    # Cabin deck
    df["CabinDeck"] = df["Cabin"].apply(cabin_deck)

    # Fare per person & log transforms (protect against zero)
    df["FarePerPerson"] = df["Fare"] / df["FamilySize"].replace(0, 1)
    df["LogFare"] = np.log1p(df["Fare"])
    df["LogFarePerPerson"] = np.log1p(df["FarePerPerson"])

    # Age bin (quantiles on available values)
    df["AgeBin"] = pd.qcut(df["Age"], q=4, duplicates="drop")

    return df

df_eng = engineer_features(df)
df_eng.head(3)


df_tmp = df_eng.copy()

age_group_median = (
    df_tmp.groupby(["Title", "Sex", "Pclass"])["Age"]
    .median()
    .reset_index()
    .rename(columns={"Age": "AgeMedian"})
)

df_tmp = df_tmp.merge(age_group_median, on=["Title", "Sex", "Pclass"], how="left")
df_tmp["Age"] = df_tmp["Age"].fillna(df_tmp["AgeMedian"])
df_tmp.drop(columns=["AgeMedian"], inplace=True)

# Recompute AgeBin with filled Age
df_tmp["AgeBin"] = pd.qcut(df_tmp["Age"], q=4, duplicates="drop")

df_eng = df_tmp
df_eng.isna().sum()


TARGET = "Survived"

features = [
    # Numeric
    "Age", "Fare", "FamilySize", "IsAlone", "TicketGroupSize",
    "FarePerPerson", "LogFare", "LogFarePerPerson",
    # Categorical
    "Pclass", "Sex", "Embarked", "Title", "CabinDeck", "AgeBin"
]

X = df_eng[features]
y = df_eng[TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


numeric_features = ["Age", "Fare", "FamilySize", "IsAlone", "TicketGroupSize", 
                    "FarePerPerson", "LogFare", "LogFarePerPerson"]
categorical_features = ["Pclass", "Sex", "Embarked", "Title", "CabinDeck", "AgeBin"]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=5000, C=1.0),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=400, random_state=42)
}

cv_scores = {}
for name, clf in models.items():
    pipe = Pipeline(steps=[("prep", preprocessor), ("clf", clf)])
    scores = cross_val_score(pipe, X_train, y_train, scoring="roc_auc", cv=cv)
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: ROC AUC {scores.mean():.3f} ± {scores.std():.3f}")

# Pick one strong model for final fit (GBDT is often great here)
final_model = Pipeline(steps=[("prep", preprocessor), ("clf", GradientBoostingClassifier(random_state=42))])
final_model.fit(X_train, y_train)


from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_pred = final_model.predict(X_test)
y_proba = final_model.predict_proba(X_test)[:, 1]

print("Test ROC AUC:", roc_auc_score(y_test, y_proba))
print("\nClassification Report:\n", classification_report(y_test, y_pred, digits=3))

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Not Survived", "Survived"])
disp.plot(cmap="Purples")
plt.title("Titanic – Confusion Matrix (Test)")
plt.show()


param_grid = {
    "clf__n_estimators": [100, 200, 300],
    "clf__learning_rate": [0.05, 0.1, 0.2],
    "clf__max_depth": [2, 3],
    "clf__subsample": [0.8, 1.0]
}

gbdt = Pipeline(steps=[("prep", preprocessor), ("clf", GradientBoostingClassifier(random_state=42))])
grid = GridSearchCV(gbdt, param_grid, scoring="roc_auc", cv=cv, n_jobs=-1)
grid.fit(X_train, y_train)

print("Best AUC:", grid.best_score_)
print("Best Params:", grid.best_params_)

best_model = grid.best_estimator_
y_proba_best = best_model.predict_proba(X_test)[:, 1]
print("Test AUC (best):", roc_auc_score(y_test, y_proba_best))


# If you want to create a Kaggle submission:
# Load test.csv, apply the SAME feature engineering, then predict.

TEST_CSV_PATH = "datasets/test.csv"  # Kaggle test.csv; adjust if needed
test_df = pd.read_csv(TEST_CSV_PATH)

test_eng = engineer_features(test_df)

# Impute Age via the same grouped median strategy (fit on TRAIN!)
age_group_median_train = (
    df_eng.groupby(["Title", "Sex", "Pclass"])["Age"]
    .median()
    .reset_index()
    .rename(columns={"Age": "AgeMedian"})
)

test_eng = test_eng.merge(age_group_median_train, on=["Title", "Sex", "Pclass"], how="left")
test_eng["Age"] = test_eng["Age"].fillna(test_eng["AgeMedian"])
test_eng.drop(columns=["AgeMedian"], inplace=True)
test_eng["AgeBin"] = pd.qcut(test_eng["Age"], q=4, duplicates="drop")

X_submit = test_eng[features]
pred_submit = best_model.predict(X_submit) if 'best_model' in globals() else final_model.predict(X_submit)

submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": pred_submit.astype(int)
})
submission.head()
# submission.to_csv("submission.csv", index=False)
