Turn raw text into pieces a model can count. Run it through the pipeline: tokenize → lowercase → drop stopwords → stem → form n-grams.
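Why each step matters: lowercasing collapses "The" and "the" into one token. Stopword removal cuts common filler ("the", "a"). Stemming strips suffixes so that "running" and "runs" share the root "run"; an irregular form such as "ran" usually survives stemming unchanged and needs lemmatization to map to "run". N-grams capture multi-word meaning ("New York" ≠ "New" + "York"). Skip the steps when you want raw counts; apply them all when feeding a count-based classifier.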
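
The five steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the stopword list is a tiny hand-picked assumption, and toy_stem is a hypothetical suffix stripper written for this example, far cruder than a real stemmer such as Porter's. The function names (tokenize, toy_stem, ngrams, preprocess) are likewise made up here.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}


def tokenize(text):
    """Split text into runs of letters and apostrophes; punctuation is dropped."""
    return re.findall(r"[A-Za-z']+", text)


def toy_stem(word):
    """Naive suffix stripping. This is NOT the Porter algorithm."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant: "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    # Strip a plural/third-person "s", but leave "ss" endings alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def preprocess(text, n=1):
    """tokenize -> lowercase -> drop stopwords -> stem -> form n-grams."""
    tokens = tokenize(text)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [toy_stem(t) for t in tokens]
    return ngrams(tokens, n)


text = "The cat runs and the cats keep running in New York"

unigrams = preprocess(text, n=1)
print(unigrams)            # ['cat', 'run', 'cat', 'keep', 'run', 'new', 'york']
print(len(unigrams))       # token count: 7
print(len(set(unigrams)))  # vocab size: 5

bigrams = preprocess(text, n=2)
print("new york" in bigrams)  # True

print(toy_stem("ran"))  # 'ran' (irregular forms are not merged by stemming)
```

The raw sentence has 11 tokens and 10 distinct surface forms ("The" and "the" count separately); after the pipeline it has 7 tokens drawn from a vocabulary of 5. Setting n=2 keeps "new york" together as a single countable unit.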