NLP Packages Overview

This document provides an overview of three powerful Python packages for Natural Language Processing: NLTK, spaCy, and BERTopic.

NLTK (Natural Language Toolkit)

Overview

NLTK is one of the oldest and most comprehensive Python libraries for NLP, originally created for teaching and research.

Key Features:

  • Extensive collection of text processing tools
  • Access to over 50 corpora and lexical resources (WordNet, TreeBank); a WordNet lookup is sketched below
  • Text classification, tokenization, stemming, tagging, parsing
  • Educational focus with extensive documentation
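
As a taste of those lexical resources, here is a minimal WordNet lookup (a sketch; it assumes the wordnet corpus has been downloaded via nltk.download('wordnet')):

Code
from nltk.corpus import wordnet as wn

# Look up senses ("synsets") of a word and print their definitions
for synset in wn.synsets('bank')[:3]:
    print(synset.name(), '-', synset.definition())

# Synonyms recorded for the first sense of "good"
good = wn.synsets('good')[0]
print("Lemmas:", [lemma.name() for lemma in good.lemmas()])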

Best For:

  • Learning NLP concepts
  • Prototyping and research
  • Working with linguistic data structures
  • Academic projects and teaching

Limitations:

  • Slower than modern alternatives
  • Less suited for production environments
  • Requires more manual pipeline construction

NLTK Example: Basic Text Processing

Code
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required data (run once)
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')

text = """Natural language processing (NLP) is a fascinating field. 
It enables computers to understand and process human language. 
NLTK provides excellent tools for learning NLP concepts."""

# Tokenization
sentences = sent_tokenize(text)
words = word_tokenize(text)

print("Sentences:", len(sentences))
print("Words:", len(words))
print("\nFirst sentence tokens:", word_tokenize(sentences[0]))

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in words if w.lower() not in stop_words and w.isalpha()]
print("\nFiltered words:", filtered_words)

# Stemming vs Lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words_to_process = ['running', 'runs', 'ran', 'easily', 'fairly']
print("\n{:<15} {:<15} {:<15}".format("Original", "Stemmed", "Lemmatized"))
print("-" * 45)
for word in words_to_process:
    stemmed = stemmer.stem(word)
    lemmatized = lemmatizer.lemmatize(word, pos='v')  # v = verb
    print("{:<15} {:<15} {:<15}".format(word, stemmed, lemmatized))
Sentences: 3
Words: 30

First sentence tokens: ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', '.']

Filtered words: ['Natural', 'language', 'processing', 'NLP', 'fascinating', 'field', 'enables', 'computers', 'understand', 'process', 'human', 'language', 'NLTK', 'provides', 'excellent', 'tools', 'learning', 'NLP', 'concepts']

Original        Stemmed         Lemmatized     
---------------------------------------------
running         run             run            
runs            run             run            
ran             ran             run            
easily          easili          easily         
fairly          fairli          fairly         
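
Because NLTK leaves pipeline assembly to the user, follow-on steps are written by hand. A small continuation of the snippet above (a sketch reusing filtered_words) counts word frequencies and lists bigrams:

Code
from nltk import FreqDist, bigrams

# Most frequent content words in the sample text
freq = FreqDist(w.lower() for w in filtered_words)
print(freq.most_common(5))

# Adjacent word pairs (bigrams) from the first few filtered tokens
print(list(bigrams(filtered_words[:5])))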

NLTK Example: Part-of-Speech Tagging

Code
from nltk import pos_tag

sentence = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)

print("Part-of-Speech Tags:")
for word, tag in pos_tags:
    print(f"  {word:10} -> {tag}")
Part-of-Speech Tags:
  The        -> DT
  quick      -> JJ
  brown      -> NN
  fox        -> NN
  jumps      -> VBZ
  over       -> IN
  the        -> DT
  lazy       -> JJ
  dog        -> NN
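
NLTK can also group POS-tagged tokens into named-entity chunks. A minimal sketch (assumes the maxent_ne_chunker and words resources have been downloaded):

Code
from nltk import ne_chunk

# nltk.download('maxent_ne_chunker'); nltk.download('words')
tagged = pos_tag(word_tokenize("Barack Obama visited Paris in 2015"))
tree = ne_chunk(tagged)

# Named-entity chunks appear as labeled subtrees; plain tokens stay as tuples
for node in tree:
    if hasattr(node, 'label'):
        entity = " ".join(token for token, tag in node.leaves())
        print(f"  {entity:15} -> {node.label()}")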

NLTK Example: Sentiment Analysis

Code
from nltk.sentiment import SentimentIntensityAnalyzer

# Download required data
# nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

texts = [
    "I absolutely love this product! It's amazing!",
    "This is terrible. I hate it.",
    "It's okay, nothing special.",
    "The weather is nice today."
]

print("Sentiment Analysis Results:")
print("-" * 60)
for text in texts:
    scores = sia.polarity_scores(text)
    print(f"Text: {text}")
    print(f"  Negative: {scores['neg']:.3f}, Neutral: {scores['neu']:.3f}, Positive: {scores['pos']:.3f}")
    print(f"  Compound Score: {scores['compound']:.3f}\n")
Sentiment Analysis Results:
------------------------------------------------------------
Text: I absolutely love this product! It's amazing!
  Negative: 0.000, Neutral: 0.311, Positive: 0.689
  Compound Score: 0.871

Text: This is terrible. I hate it.
  Negative: 0.694, Neutral: 0.306, Positive: 0.000
  Compound Score: -0.778

Text: It's okay, nothing special.
  Negative: 0.367, Neutral: 0.325, Positive: 0.309
  Compound Score: -0.092

Text: The weather is nice today.
  Negative: 0.000, Neutral: 0.588, Positive: 0.412
  Compound Score: 0.421
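
The compound score is what most applications threshold into a discrete label. A minimal sketch using the ±0.05 cutoffs commonly cited by the VADER authors (the thresholds are a convention, not part of the NLTK API):

Code
def label_sentiment(compound):
    # +/- 0.05 are the conventional VADER cutoffs for positive/negative
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

for text in texts:
    compound = sia.polarity_scores(text)['compound']
    print(f"{label_sentiment(compound):8} ({compound:+.3f})  {text}")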

spaCy

Overview

spaCy is a modern, industrial-strength NLP library designed for production use.

Key Features:

  • Fast and efficient (Cython-optimized)
  • Pre-trained statistical models for multiple languages
  • Built-in support for NER, POS tagging, dependency parsing
  • Easy integration with deep learning frameworks (PyTorch, TensorFlow)
  • Beautiful visualization tools (displaCy)

Best For:

  • Production NLP pipelines
  • Real-time processing
  • Named Entity Recognition
  • Document similarity and classification
  • Information extraction at scale

Limitations:

  • Less flexible than NLTK for research
  • Fewer resources for learning basic concepts
  • Model-dependent (needs pre-trained models)

spaCy Example: Basic Text Analysis

Code
import spacy

# Load English model (run: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

text = """Apple Inc. is planning to open a new store in San Francisco next month. 
The CEO, Tim Cook, announced this during a press conference."""

doc = nlp(text)

# Tokenization and linguistic features
print("Tokens and their attributes:")
print("{:<15} {:<10} {:<10} {:<10}".format("Token", "Lemma", "POS", "Is Stop?"))
print("-" * 50)
for token in doc[:10]:  # First 10 tokens
    print("{:<15} {:<10} {:<10} {:<10}".format(
        token.text, 
        token.lemma_, 
        token.pos_, 
        str(token.is_stop)
    ))
Tokens and their attributes:
Token           Lemma      POS        Is Stop?  
--------------------------------------------------
Apple           Apple      PROPN      False     
Inc.            Inc.       PROPN      False     
is              be         AUX        True      
planning        plan       VERB       False     
to              to         PART       True      
open            open       VERB       False     
a               a          DET        True      
new             new        ADJ        False     
store           store      NOUN       False     
in              in         ADP        True      
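
The examples here call nlp() on one document at a time. For the high-volume, real-time use cases spaCy targets, nlp.pipe streams documents in batches and lets you disable pipeline components you do not need (a sketch; the texts below are placeholders):

Code
texts_stream = [
    "spaCy processes documents in batches.",
    "Batching reduces per-document overhead.",
    "Unneeded components can be disabled for extra speed.",
]

# Stream documents through the pipeline, skipping NER and lemmatization
for doc in nlp.pipe(texts_stream, batch_size=50, disable=["ner", "lemmatizer"]):
    print([token.pos_ for token in doc])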

spaCy Example: Named Entity Recognition

Code
# Named Entity Recognition
print("\n\nNamed Entities:")
print("{:<20} {:<15} {:<30}".format("Entity", "Type", "Explanation"))
print("-" * 70)
for ent in doc.ents:
    print("{:<20} {:<15} {:<30}".format(
        ent.text, 
        ent.label_, 
        spacy.explain(ent.label_)
    ))

# Visualize entities
from spacy import displacy

displacy.render(doc, style="ent", jupyter=True)


Named Entities:
Entity               Type            Explanation                   
----------------------------------------------------------------------
Apple Inc.           ORG             Companies, agencies, institutions, etc.
San Francisco        GPE             Countries, cities, states     
next month           DATE            Absolute or relative dates or periods
Tim Cook             PERSON          People, including fictional   
[displaCy renders the original text with highlighted entity spans: Apple Inc. (ORG), San Francisco (GPE), next month (DATE), Tim Cook (PERSON)]
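
When the statistical model misses domain-specific entities, spaCy's rule-based EntityRuler can be layered in front of it. A minimal sketch (the patterns below are made-up examples, not part of the model):

Code
# Add a rule-based entity matcher before the statistical NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "Vision Pro"},   # hypothetical product name
    {"label": "ORG", "pattern": [{"LOWER": "apple"}, {"LOWER": "inc."}]},
])

doc_ruled = nlp("Apple Inc. may ship the Vision Pro next year.")
print([(ent.text, ent.label_) for ent in doc_ruled.ents])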

spaCy Example: Dependency Parsing

Code
sentence = nlp("The quick brown fox jumps over the lazy dog")

print("\nDependency Parse:")
print("{:<10} {:<10} {:<10} {:<10}".format("Token", "Dependency", "Head", "Children"))
print("-" * 50)
for token in sentence:
    children = ", ".join([child.text for child in token.children])
    print("{:<10} {:<10} {:<10} {:<10}".format(
        token.text,
        token.dep_,
        token.head.text,
        children if children else "-"
    ))

# Visualize dependency tree
displacy.render(
    sentence,
    style="dep",
    jupyter=True,
    options={
        "compact": False,
        "color": "blue",
        "bg": "#fff",
        "distance": 120,
        "width": 700,
        "height": 300,
        "font": "10px Arial"    # Reduce font size
    }
)

Dependency Parse:
Token      Dependency Head       Children  
--------------------------------------------------
The        det        fox        -         
quick      amod       fox        -         
brown      amod       fox        -         
fox        nsubj      jumps      The, quick, brown
jumps      ROOT       jumps      fox, over 
over       prep       jumps      dog       
the        det        dog        -         
lazy       amod       dog        -         
dog        pobj       over       the, lazy 
[displaCy renders the dependency tree for the sentence, with arcs labeled det, amod, nsubj, prep, and pobj]
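
Beyond visualization, the parse is usually consumed programmatically. A small sketch (reusing the sentence Doc above) follows the nsubj -> verb -> prep -> pobj links to recover a rough "who did what to what" reading; for this sentence it yields fox -> jumps -> dog, matching the table above.

Code
for token in sentence:
    if token.dep_ == "nsubj":
        verb = token.head
        # Follow prepositions attached to the verb down to their objects
        objects = [grandchild.text
                   for child in verb.children if child.dep_ == "prep"
                   for grandchild in child.children if grandchild.dep_ == "pobj"]
        print(f"{token.text} --{verb.text}--> {objects}")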

spaCy Example: Document Similarity

Code
# Document similarity using word vectors
# Note: en_core_web_sm has no static word vectors, so these similarity scores
# are rough; use en_core_web_md or en_core_web_lg for more meaningful results
doc1 = nlp("I love programming in Python")
doc2 = nlp("I enjoy coding with Python")
doc3 = nlp("The weather is nice today")

print("\nDocument Similarity (using word vectors):")
print(f"doc1 <-> doc2: {doc1.similarity(doc2):.3f}")
print(f"doc1 <-> doc3: {doc1.similarity(doc3):.3f}")
print(f"doc2 <-> doc3: {doc2.similarity(doc3):.3f}")

# Word similarity
word1 = nlp("king")
word2 = nlp("queen")
word3 = nlp("apple")

print("\nWord Similarity:")
print(f"king <-> queen: {word1.similarity(word2):.3f}")
print(f"king <-> apple: {word1.similarity(word3):.3f}")

Document Similarity (using word vectors):
doc1 <-> doc2: 0.839
doc1 <-> doc3: 0.271
doc2 <-> doc3: 0.322

Word Similarity:
king <-> queen: 0.422
king <-> apple: 0.690

BERTopic

Overview

BERTopic is a modern topic modeling technique that leverages transformer-based embeddings.

Key Features:

  • Uses BERT embeddings for semantic understanding
  • Automatically determines optimal number of topics
  • UMAP for dimensionality reduction
  • HDBSCAN for clustering (both components are swappable, as sketched after this list)
  • Class-based TF-IDF (c-TF-IDF) for topic representation
  • Interactive visualizations
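
Each stage of that pipeline can be replaced with your own configuration. A sketch passing explicit sub-models to the constructor (the parameter names are BERTopic's; the specific settings are illustrative, not recommendations):

Code
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

# Dimensionality reduction and clustering stages, configured explicitly
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=15, prediction_data=True)

topic_model = BERTopic(
    embedding_model="all-MiniLM-L6-v2",   # a sentence-transformers model name
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)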

Best For:

  • Topic discovery in document collections
  • Short text analysis (tweets, reviews, articles)
  • Dynamic topic modeling over time
  • High-quality, interpretable topics
  • Modern alternative to LDA

Limitations:

  • Computationally expensive (needs embeddings)
  • Requires more memory than classical methods
  • Slower than LDA for very large corpora
  • GPU recommended for large datasets

BERTopic Example: Basic Topic Modeling

Code
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load sample data
categories = ['sci.space', 'rec.sport.baseball', 'talk.politics.guns']
newsgroups = fetch_20newsgroups(
    subset='train', 
    categories=categories, 
    remove=('headers', 'footers', 'quotes')
)
docs = newsgroups.data[:500]  # Use subset for speed

# Create and fit BERTopic model
print("Training BERTopic model...")
topic_model = BERTopic(verbose=False, language="english", min_topic_size=10)
topics, probabilities = topic_model.fit_transform(docs)

print(f"\nDiscovered {len(set(topics)) - 1} topics (excluding outliers)")
print(f"Outlier documents (topic -1): {sum(1 for t in topics if t == -1)}")
Training BERTopic model...

Discovered 2 topics (excluding outliers)
Outlier documents (topic -1): 0

BERTopic Example: Explore Topics

Code
# Get topic information
topic_info = topic_model.get_topic_info()
print("\nTopic Information:")
print(topic_info[['Topic', 'Count', 'Name']].head(10))

# Show representative words for each topic
print("\n\nTop Words per Topic:")
print("=" * 80)
for topic_id in range(min(5, len(set(topics)) - 1)):  # Show first 5 topics
    topic_words = topic_model.get_topic(topic_id)
    if topic_words:
        words = [word for word, score in topic_words[:8]]
        print(f"\nTopic {topic_id}: {', '.join(words)}")

Topic Information:
   Topic  Count             Name
0      0    484  0_the_to_of_and
1      1     16     1_anaheim___


Top Words per Topic:
================================================================================

Topic 0: the, to, of, and, in, is, that, for
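
The dominant topic above is mostly stopwords because BERTopic's default CountVectorizer keeps them when building the c-TF-IDF representation. A common remedy (a sketch of an alternative fit, separate from the model used in the remaining examples) is to pass a vectorizer that drops English stopwords:

Code
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Drop English stopwords (and optionally include bigrams) in topic representations
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))

topic_model_clean = BERTopic(vectorizer_model=vectorizer_model, min_topic_size=10)
topics_clean, probs_clean = topic_model_clean.fit_transform(docs)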

BERTopic Example: Topic Visualization

Code
# Visualize topics
fig = topic_model.visualize_topics()
fig.show()

# Visualize topic hierarchy
fig_hierarchy = topic_model.visualize_hierarchy(top_n_topics=10)
fig_hierarchy.show()

# Visualize barchart for top topics
fig_barchart = topic_model.visualize_barchart(top_n_topics=5, n_words=10)
fig_barchart.show()

BERTopic Example: Find Similar Topics

Code
# Find topics most similar to a search query
similar_topics, similarity_scores = topic_model.find_topics(
    "space exploration and satellites", 
    top_n=3
)

print("\nTopics similar to 'space exploration and satellites':")
for topic_id, score in zip(similar_topics, similarity_scores):
    print(f"\nTopic {topic_id} (similarity: {score:.3f}):")
    words = [word for word, _ in topic_model.get_topic(topic_id)[:5]]
    print(f"  Key words: {', '.join(words)}")

Topics similar to 'space exploration and satellites':

Topic 0 (similarity: 0.207):
  Key words: the, to, of, and, in

Topic 1 (similarity: 0.122):
  Key words: anaheim, , , , 

BERTopic Example: Dynamic Topic Modeling

Code
# Topic modeling over time (if you have timestamps)
import pandas as pd
import numpy as np

# Create fake timestamps for demonstration
timestamps = pd.date_range('2020-01-01', periods=len(docs), freq='D')

# Fit dynamic topic model
topics_over_time = topic_model.topics_over_time(
    docs, 
    timestamps, 
    nr_bins=10
)

print("\nTopics Over Time:")
print(topics_over_time.head(15))

# Visualize topics over time
fig_timeline = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=5)
fig_timeline.show()

Topics Over Time:
    Topic                   Words  Frequency               Timestamp
0       0    the, to, of, and, in         48 2019-12-31 12:01:26.400
1       1         anaheim, , , ,           2 2019-12-31 12:01:26.400
2       0    the, of, to, and, in         49 2020-02-19 21:36:00.000
3       1         anaheim, , , ,           1 2020-02-19 21:36:00.000
4       0    the, of, and, to, in         49 2020-04-09 19:12:00.000
5       1         anaheim, , , ,           1 2020-04-09 19:12:00.000
6       0    the, to, and, of, in         49 2020-05-29 16:48:00.000
7       1         anaheim, , , ,           1 2020-05-29 16:48:00.000
8       0  the, to, of, and, that         49 2020-07-18 14:24:00.000
9       1         anaheim, , , ,           1 2020-07-18 14:24:00.000
10      0    the, to, and, of, in         49 2020-09-06 12:00:00.000
11      1         anaheim, , , ,           1 2020-09-06 12:00:00.000
12      0    the, to, and, of, in         47 2020-10-26 09:36:00.000
13      1         anaheim, , , ,           3 2020-10-26 09:36:00.000
14      0    the, to, of, and, in         50 2020-12-15 07:12:00.000

Package Comparison

Quick Comparison Table

Feature              NLTK                    spaCy                        BERTopic
-------------------  ----------------------  ---------------------------  ---------------------------
Primary Use          Education, Research     Production NLP               Topic Modeling
Speed                Slow                    Fast                         Moderate
Ease of Use          Moderate                Easy                         Easy
Pre-trained Models   Limited                 Excellent                    Uses transformer embeddings
Customization        High                    Moderate                     Moderate
Memory Usage         Low                     Low-Moderate                 High
Best For             Learning, Prototyping   NER, Pipelines, Real-time    Topic Discovery
Visualization        Limited                 Excellent (displaCy)         Excellent (interactive)
GPU Support          No                      Yes (for training)           Recommended
Community            Large, Academic         Large, Industry              Growing

When to Use Each Package

Use NLTK when:

  • Learning NLP concepts
  • Need access to linguistic resources (WordNet, TreeBank)
  • Working on academic research
  • Prototyping ideas
  • Need maximum flexibility

Use spaCy when:

  • Building production systems
  • Need fast, accurate NER
  • Processing large volumes of text
  • Want beautiful visualizations
  • Need dependency parsing
  • Building information extraction pipelines

Use BERTopic when:

  • Discovering topics in document collections
  • Working with short texts (tweets, reviews)
  • Need interpretable, coherent topics
  • Want to avoid specifying number of topics
  • Have access to GPU resources
  • Analyzing topic evolution over time

Combining Packages

These packages can be used together effectively:

Code
# Example: Use NLTK for preprocessing, spaCy for NER, BERTopic for topics

import nltk
import spacy
from bertopic import BERTopic

nlp = spacy.load("en_core_web_sm")

documents = [
    "Apple Inc. announced new products in Cupertino yesterday.",
    "Google is developing AI technology in Mountain View.",
    "Microsoft released a new version of Windows in Seattle."
]

# Step 1: Use NLTK for basic preprocessing
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# Step 2: Use spaCy for NER and lemmatization
processed_docs = []
for doc in documents:
    spacy_doc = nlp(doc)
    
    # Extract entities
    entities = [(ent.text, ent.label_) for ent in spacy_doc.ents]
    print(f"\nDocument: {doc}")
    print(f"Entities: {entities}")
    
    # Lemmatize and remove stopwords
    lemmatized = [token.lemma_ for token in spacy_doc 
                  if not token.is_stop and not token.is_punct]
    processed_docs.append(" ".join(lemmatized))

print("\n\nProcessed documents:")
for i, doc in enumerate(processed_docs, 1):
    print(f"{i}. {doc}")

# Step 3: Use BERTopic for topic modeling (would need more documents in practice)
# topic_model = BERTopic()
# topics, probs = topic_model.fit_transform(processed_docs)

Document: Apple Inc. announced new products in Cupertino yesterday.
Entities: [('Apple Inc.', 'ORG'), ('Cupertino', 'GPE'), ('yesterday', 'DATE')]

Document: Google is developing AI technology in Mountain View.
Entities: [('Google', 'ORG'), ('AI', 'ORG'), ('Mountain View', 'GPE')]

Document: Microsoft released a new version of Windows in Seattle.
Entities: [('Microsoft', 'ORG'), ('Windows', 'NORP'), ('Seattle', 'GPE')]


Processed documents:
1. Apple Inc. announce new product Cupertino yesterday
2. Google develop AI technology Mountain View
3. Microsoft release new version Windows Seattle

Summary

  • NLTK: Comprehensive, educational, flexible but slower
  • spaCy: Fast, production-ready, excellent for NER and pipelines
  • BERTopic: Modern topic modeling with transformer embeddings

Choose based on your specific needs: learning (NLTK), production (spaCy), or topic discovery (BERTopic). Often, combining these tools yields the best results!
