⚡️ Saturday AI Sparks 🤖 - 🌐📝#️⃣ Translate → Summarize → Hashtags



Introduction

AI tasks rarely happen in isolation. In real-world workflows, you often want to chain multiple AI steps together — for example, translate a text, summarize it, and then generate relevant hashtags for sharing.

In this post, we’ll show how to combine these three tasks into a simple end-to-end pipeline using scikit-learn’s Pipeline along with Hugging Face Transformers and Deep Translator.


Why This Matters

  • Translation: Makes content globally accessible.
  • Summarization: Reduces long text into a short, digestible version.
  • Hashtag Generation: Helps in discoverability when sharing on social platforms.

Instead of running these steps separately, we’ll tie them together with a single pipeline.


Step 1 — Translate

We use deep-translator to translate text into English before summarization.

This ensures the summarizer works on consistent input.

from deep_translator import GoogleTranslator

def translate_list(texts, source="auto", target="en"):
    # Reuse a single translator instance instead of building one per text
    translator = GoogleTranslator(source=source, target=target)
    return [translator.translate(t) for t in texts]
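
A quick sanity check (the exact wording returned by the translation service can vary):

print(translate_list(["La IA está transformando la industria.", "Bonjour tout le monde !"]))
# e.g. ['AI is transforming the industry.', 'Hello everyone!']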

Step 2 — Summarize

We apply a pretrained summarization model from Hugging Face (distilbart-cnn-12-6).

The summarizer shortens the text while preserving its meaning.

from transformers import pipeline
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def summarize_list(texts):
    outs = []
    for t in texts:
        # max_length/min_length are token counts; lower them if your inputs are short
        out = summarizer(t, max_length=120, min_length=40, do_sample=False)
        outs.append(out[0]["summary_text"])
    return outs
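
Chaining Steps 1 and 2 on the Spanish sample from this post (the model downloads on first use, and inputs shorter than the length limits may trigger a warning):

texts_es = [
    "La inteligencia artificial está transformando industrias enteras, "
    "desde la salud hasta las finanzas. Permite automatizar tareas, "
    "encontrar patrones complejos y ofrecer experiencias personalizadas a gran escala."
]
print(summarize_list(translate_list(texts_es))[0])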

Step 3 — Generate Hashtags

We create simple hashtags by extracting frequent keywords from the summary.

No extra dependencies are needed — just Python’s built-in libraries.

import collections, string, re

STOPWORDS = {"the","and","is","in","to","of","a","for","on","it","as","with","this","that","by","an","be","are"}

def clean_tokens(text):
    text = text.lower().translate(str.maketrans("", "", string.punctuation + string.digits))
    return [w for w in text.split() if w not in STOPWORDS and len(w) > 2]

def hashtags_from_list(texts, top_k=8):
    all_tags = []
    for t in texts:
        words = clean_tokens(t)
        counter = collections.Counter(words)
        most_common = [w for w, _ in counter.most_common(top_k)]
        tags = ["#" + w.capitalize() for w in most_common]
        all_tags.append(" ".join(tags))
    return all_tags
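
A quick check on a short English sentence (words tied in frequency keep their order of first appearance):

sample = ["Artificial intelligence is transforming industries by automating tasks."]
print(hashtags_from_list(sample, top_k=4))
# e.g. ['#Artificial #Intelligence #Transforming #Industries']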

Step 4 — Combine into a Pipeline

With scikit-learn’s Pipeline, we can chain all steps together into a single, reusable workflow.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

pipe = Pipeline(steps=[
    ("translate_to_en", FunctionTransformer(lambda X: translate_list(X, source="auto", target="en"), validate=False)),
    ("summarize_en", FunctionTransformer(lambda X: summarize_list(X), validate=False)),
    ("hashtags", FunctionTransformer(lambda X: hashtags_from_list(X, top_k=8), validate=False)),
])

Calling the pipeline runs all three steps in order. Note that it returns the output of the final step (the hashtag strings); the translation and summary are intermediate results.
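
A minimal run, assuming the helper functions above are defined (fit is a no-op here because every FunctionTransformer is stateless):

docs = [
    "La inteligencia artificial está transformando industrias enteras, "
    "desde la salud hasta las finanzas."
]
print(pipe.fit_transform(docs)[0])
# e.g. '#Artificial #Intelligence #Transforming #Industries ...'

# To keep the intermediate English summary as well, call the steps directly:
summaries = summarize_list(translate_list(docs))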


Sample Output

Input text (Spanish):

La inteligencia artificial está transformando industrias enteras, desde la salud hasta las finanzas.
Permite automatizar tareas, encontrar patrones complejos y ofrecer experiencias personalizadas a gran escala.

Summary (EN):

Artificial intelligence is transforming industries by automating tasks, finding complex patterns, and enabling personalization at scale.

Hashtags (EN):

#Artificial #Intelligence #Industries #Automating #Tasks #Patterns #Personalization #Scale

Key Takeaways

  • AI tasks like translation, summarization, and keywording can be chained into one workflow.
  • scikit-learn’s Pipeline makes the process modular and reusable.
  • Hashtags can be generated from simple frequency analysis — no heavy NLP needed.
  • This lightweight workflow is practical for social content automation.

Code Snippet:

from deep_translator import GoogleTranslator
from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import string


# Example multilingual input (replace with your own)
original_text = """
La inteligencia artificial está transformando industrias enteras, desde la salud hasta las finanzas.
Permite automatizar tareas, encontrar patrones complejos y ofrecer experiencias personalizadas a gran escala.
"""

src_lang = "auto"  # detect automatically
pivot_lang = "en"  # summarize in English
tgt_lang = "en"  # change (e.g., "es", "fr", "de") to translate final outputs back


def translate_text(text: str, source: str, target: str) -> str:
    """
    Translate `text` from `source` → `target` using GoogleTranslator (no API key).
    Set source="auto" to auto-detect the input language.
    """
    return GoogleTranslator(source=source, target=target).translate(text)


translated_en = translate_text(original_text, src_lang, pivot_lang)
print("=== Translated → English (preview) ===\n", translated_en[:400], "...\n")


# Build the summarization pipeline once (downloads model on first run)
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")


def summarize(text: str, max_chars: int = 1800) -> str:
    """
    Summarize text. If it's very long, we truncate for demo purposes.
    For production, chunk the text and summarize per chunk, then summarize the summaries.
    """
    text = text.strip()
    if len(text) > max_chars:
        text = text[:max_chars]
    out = summarizer(text, max_length=130, min_length=45, do_sample=False)
    return out[0]["summary_text"].strip()


summary_en = summarize(translated_en)
print("=== Summary (EN) ===\n", summary_en, "\n")


def simple_clean(text: str) -> str:
    """Lowercase, remove punctuation/numbers, and collapse spaces."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    text = re.sub(r"\s+", " ", text).strip()
    return text


def top_keywords_tfidf(text: str, k: int = 8):
    """
    Return top-k keywords using TF-IDF on a single document by splitting into sentences.
    (Crude but effective for short social captions.)
    """
    # Split into pseudo-documents (sentences) to let TF-IDF score terms
    sentences = re.split(r"[.!?]\s+", text)
    sentences = [s for s in sentences if s.strip()]
    if not sentences:
        sentences = [text]

    vec = TfidfVectorizer(
        stop_words="english",
        ngram_range=(1, 2),  # allow unigrams + bigrams
        max_features=1000,
        token_pattern=r"(?u)\b[a-zA-Z][a-zA-Z]+\b",  # alphabetic tokens (≥2 letters)
    )
    X = vec.fit_transform(sentences)
    # Aggregate scores across sentences
    scores = X.sum(axis=0).A1
    terms = vec.get_feature_names_out()
    ranked = sorted(zip(terms, scores), key=lambda x: x[1], reverse=True)[:k]
    return [w for w, _ in ranked]


def to_hashtags(words):
    """Convert keyword tokens/phrases to social-friendly #hashtags."""
    tags = []
    for w in words:
        token = re.sub(r"\s+", "", w)  # remove spaces for bigrams
        token = re.sub(r"[^a-zA-Z0-9]", "", token)
        if token:
            tags.append("#" + token[:28])  # keep tags readable
    # De-duplicate while preserving order
    seen = set()
    uniq = []
    for t in tags:
        if t.lower() not in seen:
            uniq.append(t)
            seen.add(t.lower())
    return uniq


keywords = top_keywords_tfidf(simple_clean(summary_en), k=10)
hashtags_en = to_hashtags(keywords)
print("=== Hashtags (EN) ===\n", " ".join(hashtags_en), "\n")


def translate_summarize_hashtag(text: str, src="auto", pivot="en", tgt="en", k=10):
    # 1) translate → English
    en = translate_text(text, src, pivot)
    # 2) summarize (EN)
    summ_en = summarize(en)
    # 3) hashtags from EN summary
    kws = top_keywords_tfidf(simple_clean(summ_en), k=k)
    tags_en = to_hashtags(kws)
    # 4) optionally translate the outputs back to the target language
    out_summary = summ_en if tgt == "en" else translate_text(summ_en, "en", tgt)
    if tgt == "en":
        out_tags = tags_en
    else:
        # Translate the bare keywords, then rebuild well-formed hashtags
        out_tags = to_hashtags([translate_text(t.lstrip("#"), "en", tgt) for t in tags_en])
    return out_summary, out_tags


demo_summary, demo_tags = translate_summarize_hashtag(original_text, src=src_lang, pivot=pivot_lang, tgt=tgt_lang, k=10)
print("=== DEMO SUMMARY ===\n", demo_summary, "\n")
print("=== DEMO HASHTAGS ===\n", " ".join(demo_tags))
