NLP Packages Overview

This document provides an overview of three powerful Python packages for Natural Language Processing: NLTK, spaCy, and BERTopic.

NLTK (Natural Language Toolkit)

Overview

NLTK is one of the oldest and most comprehensive Python libraries for NLP, originally created for teaching and research.

Key Features:

  • Extensive collection of text processing tools
  • Access to over 50 corpora and lexical resources (WordNet, TreeBank); a WordNet lookup is sketched below
  • Text classification, tokenization, stemming, tagging, parsing
  • Educational focus with extensive documentation
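
As a taste of those lexical resources, here is a minimal WordNet lookup (a sketch; it assumes the wordnet corpus has been downloaded via nltk.download('wordnet')):

Code
from nltk.corpus import wordnet as wn

# Look up senses ("synsets") of a word and print their definitions
for synset in wn.synsets('bank')[:3]:
    print(synset.name(), '-', synset.definition())

# Synonyms recorded for the first sense of "good"
good = wn.synsets('good')[0]
print("Lemmas:", [lemma.name() for lemma in good.lemmas()])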

Best For:

  • Learning NLP concepts
  • Prototyping and research
  • Working with linguistic data structures
  • Academic projects and teaching

Limitations:

  • Slower than modern alternatives
  • Less suited for production environments
  • Requires more manual pipeline construction

NLTK Example: Basic Text Processing

Code
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required data (run once)
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')

text = """Natural language processing (NLP) is a fascinating field. 
It enables computers to understand and process human language. 
NLTK provides excellent tools for learning NLP concepts."""

# Tokenization
sentences = sent_tokenize(text)
words = word_tokenize(text)

print("Sentences:", len(sentences))
print("Words:", len(words))
print("\nFirst sentence tokens:", word_tokenize(sentences[0]))

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in words if w.lower() not in stop_words and w.isalpha()]
print("\nFiltered words:", filtered_words)

# Stemming vs Lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words_to_process = ['running', 'runs', 'ran', 'easily', 'fairly']
print("\n{:<15} {:<15} {:<15}".format("Original", "Stemmed", "Lemmatized"))
print("-" * 45)
for word in words_to_process:
    stemmed = stemmer.stem(word)
    lemmatized = lemmatizer.lemmatize(word, pos='v')  # v = verb
    print("{:<15} {:<15} {:<15}".format(word, stemmed, lemmatized))
Sentences: 3
Words: 30

First sentence tokens: ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', '.']

Filtered words: ['Natural', 'language', 'processing', 'NLP', 'fascinating', 'field', 'enables', 'computers', 'understand', 'process', 'human', 'language', 'NLTK', 'provides', 'excellent', 'tools', 'learning', 'NLP', 'concepts']

Original        Stemmed         Lemmatized     
---------------------------------------------
running         run             run            
runs            run             run            
ran             ran             run            
easily          easili          easily         
fairly          fairli          fairly         
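
Because NLTK leaves pipeline assembly to the user, follow-on steps are written by hand. A small continuation of the snippet above (a sketch reusing filtered_words) counts word frequencies and lists bigrams:

Code
from nltk import FreqDist, bigrams

# Most frequent content words in the sample text
freq = FreqDist(w.lower() for w in filtered_words)
print(freq.most_common(5))

# Adjacent word pairs (bigrams) from the first few filtered tokens
print(list(bigrams(filtered_words[:5])))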

NLTK Example: Part-of-Speech Tagging

Code
from nltk import pos_tag

sentence = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)

print("Part-of-Speech Tags:")
for word, tag in pos_tags:
    print(f"  {word:10} -> {tag}")
Part-of-Speech Tags:
  The        -> DT
  quick      -> JJ
  brown      -> NN
  fox        -> NN
  jumps      -> VBZ
  over       -> IN
  the        -> DT
  lazy       -> JJ
  dog        -> NN
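
NLTK can also group POS-tagged tokens into named-entity chunks. A minimal sketch (assumes the maxent_ne_chunker and words resources have been downloaded):

Code
from nltk import ne_chunk

# nltk.download('maxent_ne_chunker'); nltk.download('words')
tagged = pos_tag(word_tokenize("Barack Obama visited Paris in 2015"))
tree = ne_chunk(tagged)

# Named-entity chunks appear as labeled subtrees; plain tokens stay as tuples
for node in tree:
    if hasattr(node, 'label'):
        entity = " ".join(token for token, tag in node.leaves())
        print(f"  {entity:15} -> {node.label()}")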

NLTK Example: Sentiment Analysis

Code
from nltk.sentiment import SentimentIntensityAnalyzer

# Download required data
# nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

texts = [
    "I absolutely love this product! It's amazing!",
    "This is terrible. I hate it.",
    "It's okay, nothing special.",
    "The weather is nice today."
]

print("Sentiment Analysis Results:")
print("-" * 60)
for text in texts:
    scores = sia.polarity_scores(text)
    print(f"Text: {text}")
    print(f"  Negative: {scores['neg']:.3f}, Neutral: {scores['neu']:.3f}, Positive: {scores['pos']:.3f}")
    print(f"  Compound Score: {scores['compound']:.3f}\n")
Sentiment Analysis Results:
------------------------------------------------------------
Text: I absolutely love this product! It's amazing!
  Negative: 0.000, Neutral: 0.311, Positive: 0.689
  Compound Score: 0.871

Text: This is terrible. I hate it.
  Negative: 0.694, Neutral: 0.306, Positive: 0.000
  Compound Score: -0.778

Text: It's okay, nothing special.
  Negative: 0.367, Neutral: 0.325, Positive: 0.309
  Compound Score: -0.092

Text: The weather is nice today.
  Negative: 0.000, Neutral: 0.588, Positive: 0.412
  Compound Score: 0.421
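
The compound score is what most applications threshold into a discrete label. A minimal sketch using the ±0.05 cutoffs commonly cited by the VADER authors (the thresholds are a convention, not part of the NLTK API):

Code
def label_sentiment(compound):
    # +/- 0.05 are the conventional VADER cutoffs for positive/negative
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

for text in texts:
    compound = sia.polarity_scores(text)['compound']
    print(f"{label_sentiment(compound):8} ({compound:+.3f})  {text}")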

spaCy

Overview

spaCy is a modern, industrial-strength NLP library designed for production use.

Key Features:

  • Fast and efficient (Cython-optimized)
  • Pre-trained statistical models for multiple languages
  • Built-in support for NER, POS tagging, dependency parsing
  • Easy integration with deep learning frameworks (PyTorch, TensorFlow)
  • Beautiful visualization tools (displaCy)

Best For:

  • Production NLP pipelines
  • Real-time processing
  • Named Entity Recognition
  • Document similarity and classification
  • Information extraction at scale

Limitations:

  • Less flexible than NLTK for research
  • Fewer resources for learning basic concepts
  • Model-dependent (needs pre-trained models)

spaCy Example: Basic Text Analysis

Code
import spacy

# Load English model (run: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

text = """Apple Inc. is planning to open a new store in San Francisco next month. 
The CEO, Tim Cook, announced this during a press conference."""

doc = nlp(text)

# Tokenization and linguistic features
print("Tokens and their attributes:")
print("{:<15} {:<10} {:<10} {:<10}".format("Token", "Lemma", "POS", "Is Stop?"))
print("-" * 50)
for token in doc[:10]:  # First 10 tokens
    print("{:<15} {:<10} {:<10} {:<10}".format(
        token.text, 
        token.lemma_, 
        token.pos_, 
        str(token.is_stop)
    ))
Tokens and their attributes:
Token           Lemma      POS        Is Stop?  
--------------------------------------------------
Apple           Apple      PROPN      False     
Inc.            Inc.       PROPN      False     
is              be         AUX        True      
planning        plan       VERB       False     
to              to         PART       True      
open            open       VERB       False     
a               a          DET        True      
new             new        ADJ        False     
store           store      NOUN       False     
in              in         ADP        True      
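
The examples here call nlp() on one document at a time. For the high-volume, real-time use cases spaCy targets, nlp.pipe streams documents in batches and lets you disable pipeline components you do not need (a sketch; the texts below are placeholders):

Code
texts_stream = [
    "spaCy processes documents in batches.",
    "Batching reduces per-document overhead.",
    "Unneeded components can be disabled for extra speed.",
]

# Stream documents through the pipeline, skipping NER and lemmatization
for doc in nlp.pipe(texts_stream, batch_size=50, disable=["ner", "lemmatizer"]):
    print([token.pos_ for token in doc])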

spaCy Example: Named Entity Recognition

Code
# Named Entity Recognition
print("\n\nNamed Entities:")
print("{:<20} {:<15} {:<30}".format("Entity", "Type", "Explanation"))
print("-" * 70)
for ent in doc.ents:
    print("{:<20} {:<15} {:<30}".format(
        ent.text, 
        ent.label_, 
        spacy.explain(ent.label_)
    ))

# Visualize entities
from spacy import displacy

displacy.render(doc, style="ent", jupyter=True)


Named Entities:
Entity               Type            Explanation                   
----------------------------------------------------------------------
Apple Inc.           ORG             Companies, agencies, institutions, etc.
San Francisco        GPE             Countries, cities, states     
next month           DATE            Absolute or relative dates or periods
Tim Cook             PERSON          People, including fictional   
[displaCy renders the original text with highlighted entity spans: Apple Inc. (ORG), San Francisco (GPE), next month (DATE), Tim Cook (PERSON)]
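
When the statistical model misses domain-specific entities, spaCy's rule-based EntityRuler can be layered in front of it. A minimal sketch (the patterns below are made-up examples, not part of the model):

Code
# Add a rule-based entity matcher before the statistical NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "Vision Pro"},   # hypothetical product name
    {"label": "ORG", "pattern": [{"LOWER": "apple"}, {"LOWER": "inc."}]},
])

doc_ruled = nlp("Apple Inc. may ship the Vision Pro next year.")
print([(ent.text, ent.label_) for ent in doc_ruled.ents])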

spaCy Example: Dependency Parsing

Code
sentence = nlp("The quick brown fox jumps over the lazy dog")

print("\nDependency Parse:")
print("{:<10} {:<10} {:<10} {:<10}".format("Token", "Dependency", "Head", "Children"))
print("-" * 50)
for token in sentence:
    children = ", ".join([child.text for child in token.children])
    print("{:<10} {:<10} {:<10} {:<10}".format(
        token.text,
        token.dep_,
        token.head.text,
        children if children else "-"
    ))

# Visualize dependency tree
displacy.render(
    sentence,
    style="dep",
    jupyter=True,
    options={
        "compact": False,
        "color": "blue",
        "bg": "#fff",
        "distance": 120,
        "width": 700,
        "height": 300,
        "font": "10px Arial"    # Reduce font size
    }
)

Dependency Parse:
Token      Dependency Head       Children  
--------------------------------------------------
The        det        fox        -         
quick      amod       fox        -         
brown      amod       fox        -         
fox        nsubj      jumps      The, quick, brown
jumps      ROOT       jumps      fox, over 
over       prep       jumps      dog       
the        det        dog        -         
lazy       amod       dog        -         
dog        pobj       over       the, lazy 
[displaCy renders the dependency tree for the sentence, with arcs labeled det, amod, nsubj, prep, and pobj]
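
Beyond visualization, the parse is usually consumed programmatically. A small sketch (reusing the sentence Doc above) follows the nsubj -> verb -> prep -> pobj links to recover a rough "who did what to what" reading; for this sentence it yields fox -> jumps -> dog, matching the table above.

Code
for token in sentence:
    if token.dep_ == "nsubj":
        verb = token.head
        # Follow prepositions attached to the verb down to their objects
        objects = [grandchild.text
                   for child in verb.children if child.dep_ == "prep"
                   for grandchild in child.children if grandchild.dep_ == "pobj"]
        print(f"{token.text} --{verb.text}--> {objects}")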

spaCy Example: Document Similarity

Code
# Document similarity using word vectors
# Note: en_core_web_sm has no static word vectors, so these similarity scores
# are rough; use en_core_web_md or en_core_web_lg for more meaningful results
doc1 = nlp("I love programming in Python")
doc2 = nlp("I enjoy coding with Python")
doc3 = nlp("The weather is nice today")

print("\nDocument Similarity (using word vectors):")
print(f"doc1 <-> doc2: {doc1.similarity(doc2):.3f}")
print(f"doc1 <-> doc3: {doc1.similarity(doc3):.3f}")
print(f"doc2 <-> doc3: {doc2.similarity(doc3):.3f}")

# Word similarity
word1 = nlp("king")
word2 = nlp("queen")
word3 = nlp("apple")

print("\nWord Similarity:")
print(f"king <-> queen: {word1.similarity(word2):.3f}")
print(f"king <-> apple: {word1.similarity(word3):.3f}")

Document Similarity (using word vectors):
doc1 <-> doc2: 0.839
doc1 <-> doc3: 0.271
doc2 <-> doc3: 0.322

Word Similarity:
king <-> queen: 0.422
king <-> apple: 0.690

BERTopic

Overview

BERTopic is a modern topic modeling technique that leverages transformer-based embeddings.

Key Features:

  • Uses BERT embeddings for semantic understanding
  • Automatically determines optimal number of topics
  • UMAP for dimensionality reduction
  • HDBSCAN for clustering (both components are swappable, as sketched after this list)
  • Class-based TF-IDF (c-TF-IDF) for topic representation
  • Interactive visualizations
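
Each stage of that pipeline can be replaced with your own configuration. A sketch passing explicit sub-models to the constructor (the parameter names are BERTopic's; the specific settings are illustrative, not recommendations):

Code
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

# Dimensionality reduction and clustering stages, configured explicitly
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=15, prediction_data=True)

topic_model = BERTopic(
    embedding_model="all-MiniLM-L6-v2",   # a sentence-transformers model name
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)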

Best For:

  • Topic discovery in document collections
  • Short text analysis (tweets, reviews, articles)
  • Dynamic topic modeling over time
  • High-quality, interpretable topics
  • Modern alternative to LDA

Limitations:

  • Computationally expensive (needs embeddings)
  • Requires more memory than classical methods
  • Slower than LDA for very large corpora
  • GPU recommended for large datasets

BERTopic Example: Basic Topic Modeling

Code
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load sample data
categories = ['sci.space', 'rec.sport.baseball', 'talk.politics.guns']
newsgroups = fetch_20newsgroups(
    subset='train', 
    categories=categories, 
    remove=('headers', 'footers', 'quotes')
)
docs = newsgroups.data[:500]  # Use subset for speed

# Create and fit BERTopic model
print("Training BERTopic model...")
topic_model = BERTopic(verbose=False, language="english", min_topic_size=10)
topics, probabilities = topic_model.fit_transform(docs)

print(f"\nDiscovered {len(set(topics)) - 1} topics (excluding outliers)")
print(f"Outlier documents (topic -1): {sum(1 for t in topics if t == -1)}")
Training BERTopic model...

Discovered 2 topics (excluding outliers)
Outlier documents (topic -1): 0

BERTopic Example: Explore Topics

Code
# Get topic information
topic_info = topic_model.get_topic_info()
print("\nTopic Information:")
print(topic_info[['Topic', 'Count', 'Name']].head(10))

# Show representative words for each topic
print("\n\nTop Words per Topic:")
print("=" * 80)
for topic_id in range(min(5, len(set(topics)) - 1)):  # Show first 5 topics
    topic_words = topic_model.get_topic(topic_id)
    if topic_words:
        words = [word for word, score in topic_words[:8]]
        print(f"\nTopic {topic_id}: {', '.join(words)}")

Topic Information:
   Topic  Count             Name
0      0    484  0_the_to_of_and
1      1     16     1_anaheim___


Top Words per Topic:
================================================================================

Topic 0: the, to, of, and, in, is, that, for
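
The dominant topic above is mostly stopwords because BERTopic's default CountVectorizer keeps them when building the c-TF-IDF representation. A common remedy (a sketch of an alternative fit, separate from the model used in the remaining examples) is to pass a vectorizer that drops English stopwords:

Code
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Drop English stopwords (and optionally include bigrams) in topic representations
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))

topic_model_clean = BERTopic(vectorizer_model=vectorizer_model, min_topic_size=10)
topics_clean, probs_clean = topic_model_clean.fit_transform(docs)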

BERTopic Example: Topic Visualization

Code
# Visualize topics
fig = topic_model.visualize_topics()
fig.show()

# Visualize topic hierarchy
fig_hierarchy = topic_model.visualize_hierarchy(top_n_topics=10)
fig_hierarchy.show()

# Visualize barchart for top topics
fig_barchart = topic_model.visualize_barchart(top_n_topics=5, n_words=10)
fig_barchart.show()

BERTopic Example: Find Similar Topics

Code
# Find topics most similar to a search query
similar_topics, similarity_scores = topic_model.find_topics(
    "space exploration and satellites", 
    top_n=3
)

print("\nTopics similar to 'space exploration and satellites':")
for topic_id, score in zip(similar_topics, similarity_scores):
    print(f"\nTopic {topic_id} (similarity: {score:.3f}):")
    words = [word for word, _ in topic_model.get_topic(topic_id)[:5]]
    print(f"  Key words: {', '.join(words)}")

Topics similar to 'space exploration and satellites':

Topic 0 (similarity: 0.207):
  Key words: the, to, of, and, in

Topic 1 (similarity: 0.122):
  Key words: anaheim, , , , 

BERTopic Example: Dynamic Topic Modeling

Code
# Topic modeling over time (if you have timestamps)
import pandas as pd
import numpy as np

# Create fake timestamps for demonstration
timestamps = pd.date_range('2020-01-01', periods=len(docs), freq='D')

# Fit dynamic topic model
topics_over_time = topic_model.topics_over_time(
    docs, 
    timestamps, 
    nr_bins=10
)

print("\nTopics Over Time:")
print(topics_over_time.head(15))

# Visualize topics over time
fig_timeline = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=5)
fig_timeline.show()

Topics Over Time:
    Topic                   Words  Frequency               Timestamp
0       0    the, to, of, and, in         48 2019-12-31 12:01:26.400
1       1         anaheim, , , ,           2 2019-12-31 12:01:26.400
2       0    the, of, to, and, in         49 2020-02-19 21:36:00.000
3       1         anaheim, , , ,           1 2020-02-19 21:36:00.000
4       0    the, of, and, to, in         49 2020-04-09 19:12:00.000
5       1         anaheim, , , ,           1 2020-04-09 19:12:00.000
6       0    the, to, and, of, in         49 2020-05-29 16:48:00.000
7       1         anaheim, , , ,           1 2020-05-29 16:48:00.000
8       0  the, to, of, and, that         49 2020-07-18 14:24:00.000
9       1         anaheim, , , ,           1 2020-07-18 14:24:00.000
10      0    the, to, and, of, in         49 2020-09-06 12:00:00.000
11      1         anaheim, , , ,           1 2020-09-06 12:00:00.000
12      0    the, to, and, of, in         47 2020-10-26 09:36:00.000
13      1         anaheim, , , ,           3 2020-10-26 09:36:00.000
14      0    the, to, of, and, in         50 2020-12-15 07:12:00.000

Package Comparison

Quick Comparison Table

Feature              NLTK                    spaCy                        BERTopic
-------------------  ----------------------  ---------------------------  ---------------------------
Primary Use          Education, Research     Production NLP               Topic Modeling
Speed                Slow                    Fast                         Moderate
Ease of Use          Moderate                Easy                         Easy
Pre-trained Models   Limited                 Excellent                    Uses transformer embeddings
Customization        High                    Moderate                     Moderate
Memory Usage         Low                     Low-Moderate                 High
Best For             Learning, Prototyping   NER, Pipelines, Real-time    Topic Discovery
Visualization        Limited                 Excellent (displaCy)         Excellent (interactive)
GPU Support          No                      Yes (for training)           Recommended
Community            Large, Academic         Large, Industry              Growing

When to Use Each Package

Use NLTK when:

  • Learning NLP concepts
  • Need access to linguistic resources (WordNet, TreeBank)
  • Working on academic research
  • Prototyping ideas
  • Need maximum flexibility

Use spaCy when:

  • Building production systems
  • Need fast, accurate NER
  • Processing large volumes of text
  • Want beautiful visualizations
  • Need dependency parsing
  • Building information extraction pipelines

Use BERTopic when:

  • Discovering topics in document collections
  • Working with short texts (tweets, reviews)
  • Need interpretable, coherent topics
  • Want to avoid specifying number of topics
  • Have access to GPU resources
  • Analyzing topic evolution over time

Combining Packages

These packages can be used together effectively:

Code
# Example: Use NLTK for preprocessing, spaCy for NER, BERTopic for topics

import nltk
import spacy
from bertopic import BERTopic

nlp = spacy.load("en_core_web_sm")

documents = [
    "Apple Inc. announced new products in Cupertino yesterday.",
    "Google is developing AI technology in Mountain View.",
    "Microsoft released a new version of Windows in Seattle."
]

# Step 1: Use NLTK for basic preprocessing
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# Step 2: Use spaCy for NER and lemmatization
processed_docs = []
for doc in documents:
    spacy_doc = nlp(doc)
    
    # Extract entities
    entities = [(ent.text, ent.label_) for ent in spacy_doc.ents]
    print(f"\nDocument: {doc}")
    print(f"Entities: {entities}")
    
    # Lemmatize and remove stopwords
    lemmatized = [token.lemma_ for token in spacy_doc 
                  if not token.is_stop and not token.is_punct]
    processed_docs.append(" ".join(lemmatized))

print("\n\nProcessed documents:")
for i, doc in enumerate(processed_docs, 1):
    print(f"{i}. {doc}")

# Step 3: Use BERTopic for topic modeling (would need more documents in practice)
# topic_model = BERTopic()
# topics, probs = topic_model.fit_transform(processed_docs)

Document: Apple Inc. announced new products in Cupertino yesterday.
Entities: [('Apple Inc.', 'ORG'), ('Cupertino', 'GPE'), ('yesterday', 'DATE')]

Document: Google is developing AI technology in Mountain View.
Entities: [('Google', 'ORG'), ('AI', 'ORG'), ('Mountain View', 'GPE')]

Document: Microsoft released a new version of Windows in Seattle.
Entities: [('Microsoft', 'ORG'), ('Windows', 'NORP'), ('Seattle', 'GPE')]


Processed documents:
1. Apple Inc. announce new product Cupertino yesterday
2. Google develop AI technology Mountain View
3. Microsoft release new version Windows Seattle

Summary

  • NLTK: Comprehensive, educational, flexible but slower
  • spaCy: Fast, production-ready, excellent for NER and pipelines
  • BERTopic: Modern topic modeling with transformer embeddings

Choose based on your specific needs: learning (NLTK), production (spaCy), or topic discovery (BERTopic). Often, combining these tools yields the best results!
