AW Dev Rethought

⚖️ There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. - C.A.R. Hoare

🧠 AI with Python – 🎬 Movie Review Sentiment Analysis


Description:

Sentiment analysis is one of the most practical and popular applications of Natural Language Processing (NLP).

In this project, we use TF-IDF vectorization and Logistic Regression to classify movie reviews as positive or negative — turning raw text into meaningful, measurable insights.


Understanding the Problem

Text data is unstructured, making it difficult for machine learning models to interpret directly.

To analyze sentiments, we first convert text into numerical form using TF-IDF (Term Frequency–Inverse Document Frequency) — a technique that measures how important a word is within a document relative to the entire dataset.

Once the data is numerically represented, we train a Logistic Regression classifier to distinguish between positive and negative reviews.
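
To build intuition before the full project, here is a minimal, self-contained sketch (the two sentences are invented for illustration) of how TF-IDF weights words: terms shared by every document are down-weighted, while discriminative terms score higher.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the movie was great", "the movie was boring"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)   # sparse matrix: 2 documents x vocabulary size

# Words shared by both documents ("the", "movie", "was") get a lower weight;
# the discriminative words ("great", "boring") get a higher one.
for word, weight in zip(vec.get_feature_names_out(), X.toarray()[0]):
    print(f"{word:10s} {weight:.3f}")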


1. Load and Explore the Dataset

We’ll start with a small sample of labeled reviews.

Each record contains a text review and a sentiment label (positive or negative).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

base = pd.DataFrame({
    "review": [
        "This movie was fantastic! The story was engaging and the acting superb.",
        "Worst movie ever. Waste of time and money.",
        "I absolutely loved the visuals and soundtrack!",
        "The plot was dull and predictable.",
        "An enjoyable watch with a heartwarming message.",
        "Terrible direction and poor editing."
    ],
    "sentiment": ["positive","negative","positive","negative","positive","negative"]
})

# Minimal augmentation so phrasing similar to the reviews we test later appears in training
extra = pd.DataFrame({
    "review": [
        "Well made film with engaging story.",                 # +
        "Really enjoyed this movie. Well made overall.",       # +
        "Boring and too long. Very dull pacing.",              # -
        "The movie felt boring and was too long to enjoy."     # -
    ],
    "sentiment": ["positive","positive","negative","negative"]
})

df = pd.concat([base, extra], ignore_index=True)

For larger projects, you can easily replace this dataset with IMDb or Rotten Tomatoes reviews.
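
As a sketch, assuming the larger dataset is a CSV with the same two columns (the filename below is a placeholder, not a real file):

df = pd.read_csv("imdb_reviews.csv")               # placeholder filename
df = df.dropna(subset=["review", "sentiment"])     # basic hygiene on real-world data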


2. Split the Data

We now split the data, stratifying by sentiment so both classes appear in the training and test sets. The text preprocessing itself (tokenization, lowercasing) happens inside the TF-IDF step in the next section.

X_train, X_test, y_train, y_test = train_test_split(
    df["review"], df["sentiment"],
    test_size=0.3,
    random_state=42,
    stratify=df["sentiment"]
)
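
A quick sanity check confirms that stratification kept both labels in each split:

print(y_train.value_counts())
print(y_test.value_counts())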

3. Vectorization + Model Pipeline

We’ll use a Pipeline combining TfidfVectorizer and LogisticRegression.

Using n-grams up to trigrams (ngram_range=(1, 3)) helps capture the short phrases, like "not good", that drive sentiment.
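
To see what that means concretely, here is a quick standalone look at the tokens the vectorizer's analyzer extracts from an invented sentence; note how the negation survives as "not good" and "was not good".

from sklearn.feature_extraction.text import TfidfVectorizer

analyzer = TfidfVectorizer(ngram_range=(1, 3)).build_analyzer()
print(analyzer("the plot was not good"))
# ['the', 'plot', 'was', 'not', 'good', 'the plot', 'plot was', 'was not',
#  'not good', 'the plot was', 'plot was not', 'was not good']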

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 3),     # unigrams + bigrams + trigrams
        stop_words=None,        # keep negations
        sublinear_tf=True
    )),
    ("clf", LogisticRegression(
        C=2.0,
        max_iter=5000,
        random_state=42
    ))
])

pipe.fit(X_train, y_train)  # train on the training split only, keeping the test set unseen

4. Evaluate Model Performance

We’ll display a classification report and plot the confusion matrix.

y_pred = pipe.predict(X_test)

print("Classification Report:\n", classification_report(y_test, y_pred, zero_division=0))

cm = confusion_matrix(y_test, y_pred, labels=["positive", "negative"])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["positive", "negative"])
disp.plot(cmap="Purples")
plt.title("Confusion Matrix – Movie Review Sentiment")
plt.show()

5. Inspect the Learned Features

Let’s look at which words the model associates most strongly with positive and negative sentiment.

clf = pipe.named_steps["clf"]
tfidf = pipe.named_steps["tfidf"]
feature_names = np.array(tfidf.get_feature_names_out())

# Binary LR: coef_[0] points toward classes_[1].
coef = clf.coef_[0]
classes = clf.classes_

# Re-orient coefficients so "positive" always means larger weight ⇒ more positive
pos_oriented = coef if classes[1] == "positive" else -coef

top_n = min(10, len(feature_names))
top_pos_idx = np.argsort(pos_oriented)[-top_n:][::-1]
top_neg_idx = np.argsort(pos_oriented)[:top_n]

print("Classes order:", classes.tolist())

print("\nTop Positive Features:")
for i in top_pos_idx:
    print(f"{feature_names[i]:30s} {pos_oriented[i]:.4f}")

print("\nTop Negative Features:")
for i in top_neg_idx:
    print(f"{feature_names[i]:30s} {pos_oriented[i]:.4f}")

6. Testing the Model on New Data

Now, let’s test the model on two unseen reviews. You’ll also see prediction probabilities, which indicate how confident the model is in each prediction.

new_reviews = [
    "I really enjoyed this film. It was well made!",
    "The movie was boring and too long."
]

proba = pipe.predict_proba(new_reviews)
preds = pipe.predict(new_reviews)
classes = pipe.named_steps["clf"].classes_

for review, p, pred in zip(new_reviews, proba, preds):
    print(f"\n{review}")
    print(f"→ Prediction: {pred}")
    print(f"→ Probabilities: {dict(zip(classes, np.round(p, 3)))}")

Key Takeaways


  1. TF-IDF Vectorization — Transforms text into weighted numerical features, giving more importance to rare but informative words.
  2. Why Logistic Regression — A reliable baseline for sentiment analysis: interpretable, efficient, and effective on small to medium datasets.
  3. Stop Words and Negations — We deliberately kept stop words (stop_words=None) so negations like "not" survive vectorization; removing them can erase the very cues that flip sentiment polarity.
  4. Model Evaluation — Metrics like precision, recall, and F1-score give a balanced understanding of how well the model generalizes to unseen text.
  5. Scalability — The same pipeline scales to larger datasets (like IMDb) with no structural changes.
  6. Improvement Ideas — Tune the regularization strength C, experiment with the n-gram range, or swap in an alternative classifier such as Naive Bayes or LinearSVC (see the sketch after this list).
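
As a sketch of takeaway 6, swapping in a different classifier changes a single line of the pipeline. Keep in mind that LinearSVC has no predict_proba, so the probability step above would need sklearn's CalibratedClassifierCV or another model:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

svc_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3), stop_words=None, sublinear_tf=True)),
    ("clf", LinearSVC(C=1.0, random_state=42)),   # margin-based alternative to LogisticRegression
])

svc_pipe.fit(X_train, y_train)
print("Accuracy:", svc_pipe.score(X_test, y_test))   # mean accuracy on the held-out set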

Conclusion

With just a few lines of code, we’ve built a working sentiment classifier capable of understanding basic emotional tone in text.

This project provides a strong foundation for natural language processing tasks like feedback analysis, chatbot training, or product review classification.
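
To reuse the trained pipeline in such an application, a common pattern is to persist it with joblib (the filename is a placeholder):

import joblib

joblib.dump(pipe, "sentiment_pipeline.joblib")    # save the fitted TF-IDF + LR pipeline
loaded = joblib.load("sentiment_pipeline.joblib")
print(loaded.predict(["What a wonderful film!"]))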


Code Snippet:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt


base = pd.DataFrame({
    "review": [
        "This movie was fantastic! The story was engaging and the acting superb.",
        "Worst movie ever. Waste of time and money.",
        "I absolutely loved the visuals and soundtrack!",
        "The plot was dull and predictable.",
        "An enjoyable watch with a heartwarming message.",
        "Terrible direction and poor editing."
    ],
    "sentiment": ["positive","negative","positive","negative","positive","negative"]
})

# Minimal augmentation so phrasing similar to the reviews we test later appears in training
extra = pd.DataFrame({
    "review": [
        "Well made film with engaging story.",                 # +
        "Really enjoyed this movie. Well made overall.",       # +
        "Boring and too long. Very dull pacing.",              # -
        "The movie felt boring and was too long to enjoy."     # -
    ],
    "sentiment": ["positive","positive","negative","negative"]
})

df = pd.concat([base, extra], ignore_index=True)


X_train, X_test, y_train, y_test = train_test_split(
    df["review"], df["sentiment"],
    test_size=0.3,
    random_state=42,
    stratify=df["sentiment"]
)


pipe = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 3),     # unigrams + bigrams + trigrams
        stop_words=None,        # keep negations
        sublinear_tf=True
    )),
    ("clf", LogisticRegression(
        C=2.0,
        max_iter=5000,
        random_state=42
    ))
])

pipe.fit(X_train, y_train)  # train on the training split only, keeping the test set unseen


y_pred = pipe.predict(X_test)

print("Classification Report:\n", classification_report(y_test, y_pred, zero_division=0))

cm = confusion_matrix(y_test, y_pred, labels=["positive", "negative"])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["positive", "negative"])
disp.plot(cmap="Purples")
plt.title("Confusion Matrix – Movie Review Sentiment")
plt.show()


clf = pipe.named_steps["clf"]
tfidf = pipe.named_steps["tfidf"]
feature_names = np.array(tfidf.get_feature_names_out())

# Binary LR: coef_[0] points toward classes_[1].
coef = clf.coef_[0]
classes = clf.classes_

# Re-orient coefficients so "positive" always means larger weight ⇒ more positive
pos_oriented = coef if classes[1] == "positive" else -coef

top_n = min(10, len(feature_names))
top_pos_idx = np.argsort(pos_oriented)[-top_n:][::-1]
top_neg_idx = np.argsort(pos_oriented)[:top_n]

print("Classes order:", classes.tolist())

print("\nTop Positive Features:")
for i in top_pos_idx:
    print(f"{feature_names[i]:30s} {pos_oriented[i]:.4f}")

print("\nTop Negative Features:")
for i in top_neg_idx:
    print(f"{feature_names[i]:30s} {pos_oriented[i]:.4f}")


new_reviews = [
    "I really enjoyed this film. It was well made!",
    "The movie was boring and too long."
]

proba = pipe.predict_proba(new_reviews)
preds = pipe.predict(new_reviews)
classes = pipe.named_steps["clf"].classes_

for review, p, pred in zip(new_reviews, proba, preds):
    print(f"\n{review}")
    print(f"→ Prediction: {pred}")
    print(f"→ Probabilities: {dict(zip(classes, np.round(p, 3)))}")
