🧠 AI with Python – 🧪 Train/Test Split using train_test_split


Description:

Why Split Data?

To evaluate how well your model generalizes, you need separate training and testing sets. If you train and evaluate on the same data, overfitting goes undetected: the model can score well on data it has already seen and still fail on new data.
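
A minimal sketch of that gap, assuming a synthetic dataset from make_classification and an unpruned DecisionTreeClassifier (both chosen here for illustration, not part of the original example). The tree memorizes the training rows, so the training score looks much better than the score on held-out rows:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of the labels flipped at random (illustrative only)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unpruned tree can memorize the training rows, noise included
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # scored on rows the model has seen
test_acc = model.score(X_test, y_test)     # scored on rows held out from training
print(f"Accuracy on training data: {train_acc:.2f}")
print(f"Accuracy on held-out data: {test_acc:.2f}")
```

Only the second number says anything about how the model might behave on new data.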


The train_test_split Method

Scikit-learn’s train_test_split quickly divides data into:

  • Training set (for model training)
  • Testing set (for performance evaluation)

A common choice, set with test_size=0.2:
Dataset → Train (80%) + Test (20%)
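
As a quick check of the 80/20 proportions, here is a small sketch on a made-up array of 50 rows (the array and the variable names are assumptions for illustration). It also confirms that no row ends up in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 rows, 2 features
y = np.arange(50)                  # one unique value per row, used to track rows

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(len(X_train), len(X_test))  # 40 10

# Every row lands in exactly one of the two sets
overlap = set(y_train) & set(y_test)
print(len(overlap))  # 0
```

If neither test_size nor train_size is given, scikit-learn holds out 25% of the rows by default, so pass test_size=0.2 explicitly when you want an 80/20 split.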

Practical Takeaway

Always keep a dedicated testing dataset. As long as the test rows are never used for training or tuning, a simple train_test_split gives a fair estimate of how the model will perform on unseen data.


Code Snippet:

# Import train_test_split from model_selection for splitting your dataset
from sklearn.model_selection import train_test_split
import pandas as pd

# Creating a simple DataFrame
data = {
    'Age': [22, 25, 47, 52, 46],
    'Salary': [18000, 24000, 52000, 58000, 60000],
    'Purchased': [0, 0, 1, 1, 1]
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)

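# Separate the feature columns from the target column
X = df[['Age', 'Salary']]  # Features
y = df['Purchased']  # Target

# Hold out 20% of the rows for testing (1 of the 5 rows here);
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("\nX_train:")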
print(X_train)

print("\nX_test:")
print(X_test)

print("\ny_train:")
print(y_train)

print("\ny_test:")
print(y_test)
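
A possible next step, sketched under assumptions: fit a model on the training rows only, then score it on the held-out rows. The DecisionTreeClassifier and the variable test_accuracy are not part of the original snippet and are used purely for illustration. The block rebuilds the same DataFrame so it runs on its own:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the snippet above
df = pd.DataFrame({
    'Age': [22, 25, 47, 52, 46],
    'Salary': [18000, 24000, 52000, 58000, 60000],
    'Purchased': [0, 0, 1, 1, 1]
})
X = df[['Age', 'Salary']]
y = df['Purchased']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only; the test rows stay unseen until scoring
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Test rows: {len(X_test)}, test accuracy: {test_accuracy:.2f}")
```

With only five rows the test set holds a single row, so the score can only be 0 or 1. Treat this as a demonstration of the workflow, not a meaningful evaluation.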
