30 — Tokenization & N-grams

Turn raw text into pieces a model can count. Run it through the pipeline: tokenize → lowercase → drop stopwords → stem → form n-grams.
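The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `STOPWORDS` set, the regex in `tokenize`, and the `toy_stem` helper are all assumptions made for the example. In particular, `toy_stem` is a naive suffix stripper, not the Porter algorithm or any library stemmer.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}

def tokenize(text):
    # Keep runs of letters, digits, and apostrophes; case is preserved here.
    return re.findall(r"[A-Za-z0-9']+", text)

def toy_stem(token):
    # Naive suffix stripping for illustration only (NOT the Porter stemmer).
    # A suffix is removed only if at least 3 characters remain.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n):
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def pipeline(text, n=1):
    tokens = tokenize(text)                              # tokenize
    tokens = [t.lower() for t in tokens]                 # lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]   # drop stopwords
    tokens = [toy_stem(t) for t in tokens]               # stem
    return ngrams(tokens, n)                             # form n-grams

unigrams = pipeline("The dogs jumped in the park", n=1)
bigrams = pipeline("The dogs jumped in the park", n=2)
```

Here `unigrams` holds the one-element tuples for "dog", "jump", and "park", and `bigrams` pairs adjacent survivors. The toy stemmer has obvious gaps: it turns "running" into "runn", which is the kind of case real stemmers handle with extra rules.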

Why each step matters: lowercasing collapses "The" and "the" into one entry. Stopword removal cuts common filler ("the", "a"). Stemming strips suffixes so that "running" and "runs" reduce to one root; irregular forms such as "ran" usually need lemmatization instead. N-grams capture multi-word meaning ("New York" ≠ "New" + "York"). Skip steps when you want raw counts; apply them when feeding a count-based classifier.
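To make the effect on counts concrete, the following sketch measures token count and vocabulary size before and after lowercasing, then counts bigrams. The sample sentence and the whitespace `split()` tokenizer are assumptions chosen to keep the example short.

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace tokenization is enough for this punctuation-free sample.
raw = "The cat saw the other cat in New York".split()
lowered = [t.lower() for t in raw]

token_count = len(lowered)         # total tokens, duplicates included
vocab_raw = len(set(raw))          # "The" and "the" count as two entries
vocab_lowered = len(set(lowered))  # lowercasing merges them into one

unigram_counts = Counter(lowered)
bigram_counts = Counter(ngrams(lowered, 2))

print(token_count, vocab_raw, vocab_lowered)
print(unigram_counts["cat"], bigram_counts[("new", "york")])
```

Lowercasing shrinks the vocabulary by one entry here without changing the token count. The bigram `("new", "york")` survives as a single countable unit, while the unigram counts alone would treat "new" and "york" as unrelated words.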