
30 — Tokenization & N-grams

Turn raw text into pieces a model can count. Run it through the pipeline: tokenize → lowercase → drop stopwords → stem → form n-grams.
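As a rough sketch, the five stages can be chained in a few lines of Python. The stopword list, the `toy_stem` suffix stripper, and the function names here are all hypothetical stand-ins chosen for illustration. A real project would more likely use an established stemmer and a curated stopword list.

```python
import re

# Assumption: a tiny illustrative stopword list, not a standard one.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to"}

def toy_stem(token):
    """Crude suffix stripper for illustration only (not the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one space-separated string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def pipeline(text, n=1):
    tokens = re.findall(r"[A-Za-z0-9']+", text)          # tokenize
    tokens = [t.lower() for t in tokens]                 # lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]   # drop stopwords
    tokens = [toy_stem(t) for t in tokens]               # stem
    return ngrams(tokens, n)                             # form n-grams

print(pipeline("The cats are walking in the park"))       # ['cat', 'walk', 'park']
print(pipeline("The cats are walking in the park", n=2))  # ['cat walk', 'walk park']
```

Each stage is a separate line in `pipeline`, so skipping one (for example, to keep raw counts) only means commenting that line out.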

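Why each step matters: lowercasing collapses "The" and "the" into one token. Stopword removal cuts high-frequency filler ("the", "a") that carries little topical signal. Stemming strips suffixes so that "running / runs / run" share one root; irregular forms such as "ran" generally need lemmatization instead. N-grams capture multi-word meaning ("New York" ≠ "New" + "York"). Skip steps when you want raw counts; applying them all is a common baseline when feeding a bag-of-words classifier.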
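To make the n-gram point concrete, here is a minimal sketch that counts unigrams and bigrams for one made-up sentence and reports token count and vocabulary size for each. The sentence and the helper name `ngram_tuples` are illustrative assumptions, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def ngram_tuples(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumption: an invented, already-lowercased example sentence.
tokens = "i moved from new york to new orleans".split()

unigrams = Counter(ngram_tuples(tokens, 1))
bigrams = Counter(ngram_tuples(tokens, 2))

print("token count:", len(tokens))            # token count: 8
print("unigram vocab size:", len(unigrams))   # unigram vocab size: 7
print("bigram vocab size:", len(bigrams))     # bigram vocab size: 7

# As unigrams, both city names collapse into the same "new" count.
print(unigrams[("new",)])                     # 2
# As bigrams, the two places remain distinct entries.
print(bigrams[("new", "york")], bigrams[("new", "orleans")])  # 1 1
```

Token count is the number of items in the sequence, while vocabulary size is the number of distinct items. On a larger corpus, the distinct-bigram count typically grows well beyond the unigram vocabulary, which is the cost of keeping phrases like "new york" intact.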