Building an Email Spam Detection System – Complete Tutorial

Problem Solved: Protect users from unwanted emails and potential security threats
Time: 1-2 hours
What You’ll Learn: Natural Language Processing, binary classification, model evaluation
Tech Stack: Python, scikit-learn, NLTK
Behind the Scenes: Use Naive Bayes or SVM to classify emails based on word frequency patterns


Project Overview

By the end of this tutorial, you’ll have built a machine learning system that can automatically detect spam emails with over 95% accuracy. You’ll understand how Gmail, Outlook, and other email providers protect users from unwanted messages.

Recommended YouTube Videos (Watch Before Starting)

  1. “What is Natural Language Processing?” – IBM Technology (5 mins)
  2. “Naive Bayes Classifier Explained” – StatQuest with Josh Starmer (15 mins)
  3. “Support Vector Machines (SVM) Explained” – Zach Star (10 mins)
  4. “Precision vs Recall Explained” – StatQuest with Josh Starmer (9 mins)

Exercise 1: Environment Setup (5 minutes)

Step 1.1: Create Project Directory

# Create and navigate to project directory
mkdir email-spam-detection
cd email-spam-detection

# Initialize git repository for your portfolio
git init

Step 1.2: Install Required Libraries

# Install all required Python packages
pip install pandas scikit-learn nltk matplotlib seaborn wordcloud jupyter
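
(Optional) If you prefer to keep these dependencies isolated from other projects, you can create a virtual environment before installing. A minimal sketch using Python's built-in venv module:

# Optional: create and activate a virtual environment first
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate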

Step 1.3: Create Python Files

# Create the main project file
touch spam_detector.py

# Create a Jupyter notebook for interactive development (optional)
touch spam_detection_analysis.ipynb

# Create requirements file for dependencies
touch requirements.txt

# Create README for your GitHub repository
touch README.md

Step 1.4: Set Up Requirements File

# Add this content to requirements.txt
echo "pandas>=1.3.0
scikit-learn>=1.0.0
nltk>=3.6
matplotlib>=3.3.0
seaborn>=0.11.0
wordcloud>=1.8.0
jupyter>=1.0.0" > requirements.txt

Step 1.5: Choose Your Development Environment

Option A: Jupyter Notebook (Recommended for learning)

# Launch Jupyter for interactive development
jupyter notebook
# Then create a new notebook or open spam_detection_analysis.ipynb

Option B: Python Script

# Open spam_detector.py in your favorite code editor
# VS Code: code spam_detector.py
# Or any text editor: nano spam_detector.py

💡 What you learned: Project organization and dependency management for ML projects. You can use either Jupyter notebooks (.ipynb) for interactive exploration or Python scripts (.py) for production code.


Exercise 2: Import Libraries and Load Data (10 minutes)

Step 2.1: Import All Required Libraries

# Data manipulation and analysis
import pandas as pd  # For working with structured data (like Excel but in Python)
import numpy as np   # For numerical operations and arrays

# Visualization libraries
import matplotlib.pyplot as plt  # For creating plots and charts
import seaborn as sns           # For beautiful statistical visualizations
from wordcloud import WordCloud # For creating word clouds

# Natural Language Processing
import nltk                          # Natural Language Toolkit - main NLP library
from nltk.corpus import stopwords    # Common words like 'the', 'and', 'is'
from nltk.tokenize import word_tokenize  # Splits text into individual words
from nltk.stem import PorterStemmer     # Reduces words to their root form
import re    # Regular expressions for text cleaning
import string  # String operations

# Machine Learning libraries
from sklearn.model_selection import train_test_split  # Splits data for training/testing
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer  # Converts text to numbers
from sklearn.naive_bayes import MultinomialNB  # Naive Bayes classifier
from sklearn.svm import SVC                    # Support Vector Machine classifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score  # Model evaluation

# Download required NLTK data (only need to run once)
nltk.download('punkt')      # For tokenization
nltk.download('stopwords')  # For removing common words
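
# Note (assumption): newer NLTK releases may also require the 'punkt_tab' resource
# for word_tokenize; if you see a LookupError later, uncomment the next line:
# nltk.download('punkt_tab')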

print("โœ… All libraries imported successfully!")

Step 2.2: Load the Dataset

# Load SMS Spam Collection dataset (works perfectly for email spam concepts)
# This dataset contains 5,574 messages labeled as 'ham' (legitimate) or 'spam'
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"

# Read the data into a pandas DataFrame
# sep='\t' means the data is separated by tabs
# names parameter gives column names since the file doesn't have headers
df = pd.read_csv(url, sep='\t', names=['label', 'message'])

# Display basic information about our dataset
print("๐Ÿ“Š Dataset Overview:")
print(f"Dataset shape: {df.shape}")  # Shows (rows, columns)
print("\n๐Ÿ“‹ First 5 rows:")
print(df.head())
print("\n๐Ÿ“ˆ Label distribution:")
print(df['label'].value_counts())  # Count of spam vs ham messages
print(f"\n๐Ÿ“Š Spam percentage: {(df['label'] == 'spam').mean() * 100:.1f}%")

🎯 Exercise Task: Run the code above. What percentage of messages are spam? Write your answer in a comment.

💡 What you learned: How to load and inspect text data for machine learning projects.

📺 Recommended Video: “Pandas Tutorial for Beginners” – Corey Schafer (30 mins)


Exercise 3: Exploratory Data Analysis (15 minutes)

Step 3.1: Analyze Message Characteristics

# Add new columns to analyze message patterns
df['message_length'] = df['message'].str.len()        # Count characters in each message
df['word_count'] = df['message'].str.split().str.len() # Count words in each message

# Calculate statistics for spam vs ham messages
print("๐Ÿ“Š Message Statistics:")
print("\n--- HAM (Legitimate) Messages ---")
ham_messages = df[df['label'] == 'ham']
print(f"Average length: {ham_messages['message_length'].mean():.1f} characters")
print(f"Average word count: {ham_messages['word_count'].mean():.1f} words")

print("\n--- SPAM Messages ---")
spam_messages = df[df['label'] == 'spam']
print(f"Average length: {spam_messages['message_length'].mean():.1f} characters")
print(f"Average word count: {spam_messages['word_count'].mean():.1f} words")

Step 3.2: Create Visualizations

# Create a 2x2 grid of visualizations to understand our data better
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Message length distribution
# This shows how long spam vs ham messages typically are
axes[0,0].hist(ham_messages['message_length'], alpha=0.7, label='Ham', bins=30, color='green')
axes[0,0].hist(spam_messages['message_length'], alpha=0.7, label='Spam', bins=30, color='red')
axes[0,0].set_title('Message Length Distribution')
axes[0,0].set_xlabel('Character Count')
axes[0,0].set_ylabel('Frequency')
axes[0,0].legend()

# Plot 2: Word count distribution
# This shows how many words spam vs ham messages typically contain
axes[0,1].hist(ham_messages['word_count'], alpha=0.7, label='Ham', bins=30, color='green')
axes[0,1].hist(spam_messages['word_count'], alpha=0.7, label='Spam', bins=30, color='red')
axes[0,1].set_title('Word Count Distribution')
axes[0,1].set_xlabel('Word Count')
axes[0,1].set_ylabel('Frequency')
axes[0,1].legend()

# Plot 3: Label distribution (pie chart)
# Shows the proportion of spam vs ham in our dataset
label_counts = df['label'].value_counts()
axes[1,0].pie(label_counts.values, labels=label_counts.index, autopct='%1.1f%%', 
              colors=['lightgreen', 'lightcoral'])
axes[1,0].set_title('Ham vs Spam Distribution')

# Plot 4: Box plot comparing message lengths
# Box plots show the distribution spread and outliers
df.boxplot(column='message_length', by='label', ax=axes[1,1])
axes[1,1].set_title('Message Length by Category')
axes[1,1].set_xlabel('Message Type')
axes[1,1].set_ylabel('Character Count')

plt.tight_layout()  # Adjust spacing between plots
plt.show()

Step 3.3: Create Word Clouds

# Word clouds show the most common words visually
# Larger words appear more frequently in the text

# Combine all spam messages into one large text string
spam_text = ' '.join(spam_messages['message'])
# Combine all ham messages into one large text string  
ham_text = ' '.join(ham_messages['message'])

# Create side-by-side word clouds
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Spam word cloud - shows what words appear most in spam
spam_wordcloud = WordCloud(width=400, height=300, 
                          background_color='white',
                          colormap='Reds').generate(spam_text)
axes[0].imshow(spam_wordcloud, interpolation='bilinear')
axes[0].set_title('Most Common Words in SPAM Messages', fontsize=16)
axes[0].axis('off')  # Remove axis labels

# Ham word cloud - shows what words appear most in legitimate messages
ham_wordcloud = WordCloud(width=400, height=300, 
                         background_color='white',
                         colormap='Greens').generate(ham_text)
axes[1].imshow(ham_wordcloud, interpolation='bilinear')
axes[1].set_title('Most Common Words in HAM Messages', fontsize=16)
axes[1].axis('off')  # Remove axis labels

plt.tight_layout()
plt.show()

🎯 Exercise Task: Look at the word clouds. List 3 words that appear prominently in spam but not in ham messages.

💡 What you learned: How to analyze and visualize text data to understand patterns before building ML models.


Exercise 4: Text Preprocessing (15 minutes)

Step 4.1: Create Text Cleaning Function

def clean_text(text):
    """
    Clean and preprocess text data for machine learning.
    Think of this like preparing ingredients before cooking:
    - Remove unwanted parts (punctuation, numbers)
    - Make everything uniform (lowercase)
    - Break into pieces (tokenization)
    - Remove common words that don't help classification
    - Reduce words to their root form (stemming)
    """
    
    # Step 1: Convert to lowercase for consistency
    # "Hello" and "hello" should be treated the same
    text = text.lower()
    
    # Step 2: Remove punctuation and numbers using regular expressions
    # Keep only letters and spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Step 3: Tokenization - split text into individual words
    # "hello world" becomes ["hello", "world"]
    tokens = word_tokenize(text)
    
    # Step 4: Remove stopwords (common words that don't help classification)
    # Words like 'the', 'and', 'is' don't tell us if something is spam
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words and len(token) > 2]
    
    # Step 5: Stemming - reduce words to their root form
    # "running", "runs", "ran" all become "run"
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    
    # Step 6: Join tokens back into a single string
    return ' '.join(tokens)

# Test the function with example messages
print("๐Ÿงช Testing text cleaning function:")
print("\n--- Original Message ---")
original = df['message'].iloc[0]  # Get first message
print(f"'{original}'")

print("\n--- Cleaned Message ---")
cleaned = clean_text(original)
print(f"'{cleaned}'")

print("\n--- Another Example ---")
test_message = "FREE! Win $1000 NOW!!! Call 123-456-7890"
print(f"Original: '{test_message}'")
print(f"Cleaned: '{clean_text(test_message)}'")

Step 4.2: Apply Cleaning to All Messages

# Apply the cleaning function to all messages in our dataset
print("๐Ÿงน Cleaning all messages... This may take a minute.")

# Create a new column with cleaned messages
# .apply() runs our clean_text function on every message
df['cleaned_message'] = df['message'].apply(clean_text)

print("โœ… Text cleaning complete!")

# Show before and after examples
print("\n๐Ÿ“‹ Cleaning Examples:")
for i in range(3):  # Show first 3 examples
    print(f"\n--- Example {i+1} ---")
    print(f"Original:  '{df['message'].iloc[i]}'")
    print(f"Cleaned:   '{df['cleaned_message'].iloc[i]}'")
    print(f"Label:     {df['label'].iloc[i]}")

🎯 Exercise Task: Create your own test message with punctuation, numbers, and capital letters. Run it through the clean_text() function and observe the changes.

💡 What you learned: Text preprocessing is crucial for ML – it standardizes the data and removes noise.

📺 Recommended Video: “Text Preprocessing for NLP” – Krish Naik (20 mins)


Exercise 5: Feature Engineering (20 minutes)

Step 5.1: Prepare Data for Machine Learning

# Separate features (X) and target variable (y)
X = df['cleaned_message']  # Features: the cleaned text messages
y = df['label']           # Target: spam or ham labels

# Convert text labels to numbers (required for machine learning)
# 'ham' becomes 0, 'spam' becomes 1
y_binary = y.map({'ham': 0, 'spam': 1})

print("๐Ÿ“Š Data Preparation Summary:")
print(f"Number of messages: {len(X)}")
print(f"Features type: {type(X)}")
print(f"Target distribution:")
print(f"  Ham (0): {(y_binary == 0).sum()}")
print(f"  Spam (1): {(y_binary == 1).sum()}")

Step 5.2: Convert Text to Numbers (Vectorization)

# Method 1: Count Vectorizer
# Counts how many times each word appears in each message
# Like counting ingredients in recipes
print("๐Ÿ”ข Method 1: Count Vectorizer")
count_vectorizer = CountVectorizer(
    max_features=3000,    # Keep only top 3000 most common words
    ngram_range=(1, 2),   # Use single words and pairs of words
    min_df=2              # Word must appear in at least 2 messages
)

# Transform our text data into a matrix of word counts
X_count = count_vectorizer.fit_transform(X)
print(f"Count Vectorizer shape: {X_count.shape}")
print(f"This means: {X_count.shape[0]} messages, {X_count.shape[1]} features")

# Method 2: TF-IDF Vectorizer (Term Frequency-Inverse Document Frequency)
# Instead of raw counts, TF-IDF weighs how informative each word is:
# common words get less weight, rare but meaningful words get more weight
print("\n📊 Method 2: TF-IDF Vectorizer")
tfidf_vectorizer = TfidfVectorizer(
    max_features=3000,    # Keep only top 3000 most important words
    ngram_range=(1, 2),   # Use single words and pairs of words
    min_df=2,             # Word must appear in at least 2 messages
    max_df=0.95           # Ignore words that appear in >95% of messages
)

# Transform our text data into TF-IDF matrix
X_tfidf = tfidf_vectorizer.fit_transform(X)
print(f"TF-IDF Vectorizer shape: {X_tfidf.shape}")

# Show some example features (words/phrases the model will use)
feature_names = tfidf_vectorizer.get_feature_names_out()
print(f"\n๐Ÿ”ค Example features: {list(feature_names[:10])}")
print(f"๐Ÿ”ค More features: {list(feature_names[1000:1010])}")

Step 5.3: Understand the Vectorization Process

# Let's see how vectorization works with a simple example
example_messages = [
    "free money now",
    "meet for coffee",
    "win free prize money"
]

# Create a simple vectorizer for demonstration
demo_vectorizer = CountVectorizer()
demo_matrix = demo_vectorizer.fit_transform(example_messages)

# Show the feature names (vocabulary)
vocab = demo_vectorizer.get_feature_names_out()
print("๐ŸŽฏ Vectorization Example:")
print(f"Vocabulary: {list(vocab)}")

# Convert to dense array to see the actual numbers
dense_matrix = demo_matrix.toarray()
print("\n๐Ÿ“Š Count Matrix:")
for i, message in enumerate(example_messages):
    print(f"'{message}' โ†’ {dense_matrix[i]}")

print("\n๐Ÿ’ก Each number represents how many times each word appears in that message")

🎯 Exercise Task: Create your own 3 simple messages and run them through the count vectorizer. Predict what the count matrix will look like before running the code.

💡 What you learned: Vectorization converts text into numbers that machine learning algorithms can understand. TF-IDF is usually better than simple counts because it considers word importance.

📺 Recommended Video: “TF-IDF Explained” – Luis Serrano (12 mins)


Exercise 6: Model Training (15 minutes)

Step 6.1: Split Data for Training and Testing

# Split data into training (80%) and testing (20%) sets
# This is like studying for an exam (training) then taking the actual exam (testing)

# Split Count Vectorizer data
X_train_count, X_test_count, y_train, y_test = train_test_split(
    X_count, y_binary,     # Features and target
    test_size=0.2,         # 20% for testing
    random_state=42,       # For reproducible results
    stratify=y_binary      # Maintain same spam/ham ratio in both sets
)

# Split TF-IDF data (same split pattern)
X_train_tfidf, X_test_tfidf, _, _ = train_test_split(
    X_tfidf, y_binary,
    test_size=0.2,
    random_state=42,
    stratify=y_binary
)

print("๐Ÿ“Š Data Split Summary:")
print(f"Training set size: {X_train_count.shape[0]} messages")
print(f"Test set size: {X_test_count.shape[0]} messages")
print(f"Features per message: {X_train_count.shape[1]}")

# Check that split maintained class balance
print(f"\nTraining set spam ratio: {y_train.mean():.3f}")
print(f"Test set spam ratio: {y_test.mean():.3f}")

Step 6.2: Train Naive Bayes Models

# Naive Bayes is great for text classification
# It assumes features are independent and uses probability

print("๐Ÿง  Training Naive Bayes Models...")

# Model 1: Naive Bayes with Count Vectorizer
nb_count = MultinomialNB(alpha=1.0)  # alpha is smoothing parameter
nb_count.fit(X_train_count, y_train)  # Train the model
print("โœ… Naive Bayes (Count) trained")

# Model 2: Naive Bayes with TF-IDF
nb_tfidf = MultinomialNB(alpha=1.0)
nb_tfidf.fit(X_train_tfidf, y_train)  # Train the model
print("โœ… Naive Bayes (TF-IDF) trained")

# Quick accuracy check on training data
train_accuracy_count = nb_count.score(X_train_count, y_train)
train_accuracy_tfidf = nb_tfidf.score(X_train_tfidf, y_train)

print(f"\n๐Ÿ“Š Training Accuracies:")
print(f"Naive Bayes (Count): {train_accuracy_count:.4f}")
print(f"Naive Bayes (TF-IDF): {train_accuracy_tfidf:.4f}")

Step 6.3: Train Support Vector Machine

# SVM finds the best boundary between spam and ham messages
# It's often more accurate but takes longer to train

print("\n๐ŸŽฏ Training Support Vector Machine...")

# SVM with TF-IDF (usually works better than counts for SVM)
svm_model = SVC(
    kernel='linear',      # Linear boundary works well for text
    probability=True,     # Enable probability predictions
    random_state=42       # For reproducible results
)

# Train the SVM (this might take a minute)
print("โณ Training SVM... (this may take a moment)")
svm_model.fit(X_train_tfidf, y_train)
print("โœ… SVM trained successfully")

# Check training accuracy
svm_train_accuracy = svm_model.score(X_train_tfidf, y_train)
print(f"SVM training accuracy: {svm_train_accuracy:.4f}")

print("\n๐ŸŽ‰ All models trained successfully!")

🎯 Exercise Task: Which model had the highest training accuracy? Why do you think training accuracy might be different from real-world performance?

💡 What you learned: Different algorithms have different strengths. Naive Bayes is fast and works well with text. SVM often has higher accuracy but is slower to train.

📺 Recommended Video: “Machine Learning Algorithms Explained” – Zach Star (15 mins)


Exercise 7: Model Evaluation (20 minutes)

Step 7.1: Make Predictions and Calculate Basic Metrics

# Now test our models on unseen data (the test set)
print("๐Ÿ”ฎ Making Predictions on Test Data...")

# Naive Bayes (Count) predictions
nb_count_pred = nb_count.predict(X_test_count)
nb_count_accuracy = accuracy_score(y_test, nb_count_pred)

# Naive Bayes (TF-IDF) predictions  
nb_tfidf_pred = nb_tfidf.predict(X_test_tfidf)
nb_tfidf_accuracy = accuracy_score(y_test, nb_tfidf_pred)

# SVM predictions
svm_pred = svm_model.predict(X_test_tfidf)
svm_accuracy = accuracy_score(y_test, svm_pred)

print("๐Ÿ“ˆ Test Set Accuracies:")
print(f"Naive Bayes (Count):  {nb_count_accuracy:.4f} ({nb_count_accuracy*100:.1f}%)")
print(f"Naive Bayes (TF-IDF): {nb_tfidf_accuracy:.4f} ({nb_tfidf_accuracy*100:.1f}%)")
print(f"SVM (TF-IDF):         {svm_accuracy:.4f} ({svm_accuracy*100:.1f}%)")

Step 7.2: Detailed Classification Reports

# Classification report shows precision, recall, and F1-score for each class
# This is more informative than just accuracy

def show_classification_report(y_true, y_pred, model_name):
    """Display detailed classification metrics"""
    print(f"\n{'='*50}")
    print(f"{model_name} - Detailed Results")
    print(f"{'='*50}")
    
    # Classification report with explanation
    report = classification_report(y_true, y_pred, 
                                 target_names=['Ham', 'Spam'], 
                                 digits=4)
    print(report)
    
    print("\n๐Ÿ’ก Metrics Explanation:")
    print("๐Ÿ“ Precision: Of all messages we predicted as spam, how many were actually spam?")
    print("๐Ÿ“ Recall: Of all actual spam messages, how many did we catch?")
    print("๐Ÿ“ F1-Score: Harmonic mean of precision and recall")
    print("๐Ÿ“ Support: Number of actual messages in each category")

# Show detailed results for each model
show_classification_report(y_test, nb_count_pred, "Naive Bayes (Count Vectorizer)")
show_classification_report(y_test, nb_tfidf_pred, "Naive Bayes (TF-IDF)")
show_classification_report(y_test, svm_pred, "Support Vector Machine")

Step 7.3: Confusion Matrices

# Confusion matrices show exactly where our models make mistakes
def plot_confusion_matrix(y_true, y_pred, model_name):
    """Create and display confusion matrix"""
    
    # Calculate confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    
    # Create the plot
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=['Ham', 'Spam'], 
                yticklabels=['Ham', 'Spam'],
                cbar_kws={'label': 'Count'})
    
    plt.title(f'{model_name} - Confusion Matrix', fontsize=14)
    plt.xlabel('Predicted Label', fontsize=12)
    plt.ylabel('True Label', fontsize=12)
    
    # Add explanatory text
    plt.figtext(0.02, 0.02, 
                "๐Ÿ’ก Perfect predictions would show numbers only on the diagonal.\n" +
                "Off-diagonal numbers represent mistakes.", 
                fontsize=10, ha='left')
    
    plt.tight_layout()
    plt.show()
    
    # Print interpretation
    tn, fp, fn, tp = cm.ravel()  # True Neg, False Pos, False Neg, True Pos
    print(f"\n๐Ÿ“Š {model_name} Confusion Matrix Breakdown:")
    print(f"โœ… True Negatives (Ham correctly identified): {tn}")
    print(f"โŒ False Positives (Ham labeled as Spam): {fp} โ† Bad! Blocks good emails")
    print(f"โŒ False Negatives (Spam labeled as Ham): {fn} โ† Spam gets through")
    print(f"โœ… True Positives (Spam correctly identified): {tp}")

# Create confusion matrices for all models
plot_confusion_matrix(y_test, nb_tfidf_pred, "Naive Bayes (TF-IDF)")
plot_confusion_matrix(y_test, svm_pred, "Support Vector Machine")

Step 7.4: Model Comparison Visualization

# Create a visual comparison of all models
models_comparison = {
    'Model': ['NB (Count)', 'NB (TF-IDF)', 'SVM (TF-IDF)'],
    'Accuracy': [nb_count_accuracy, nb_tfidf_accuracy, svm_accuracy]
}

comparison_df = pd.DataFrame(models_comparison)

# Create bar plot
plt.figure(figsize=(10, 6))
bars = plt.bar(comparison_df['Model'], comparison_df['Accuracy'], 
               color=['lightblue', 'lightgreen', 'lightcoral'],
               edgecolor='black', linewidth=1)

plt.title('Model Accuracy Comparison', fontsize=16)
plt.ylabel('Accuracy Score', fontsize=12)
plt.ylim(0.85, 1.0)  # Focus on the high accuracy range

# Add accuracy values on bars
for bar, accuracy in zip(bars, comparison_df['Accuracy']):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005, 
             f'{accuracy:.4f}', ha='center', va='bottom', fontweight='bold')

plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Print summary
print("๐Ÿ† Model Performance Summary:")
best_model_idx = comparison_df['Accuracy'].idxmax()
best_model = comparison_df.loc[best_model_idx, 'Model']
best_accuracy = comparison_df.loc[best_model_idx, 'Accuracy']
print(f"Best performing model: {best_model} with {best_accuracy:.4f} accuracy")

🎯 Exercise Task: Look at the confusion matrices. Which type of error is worse for spam detection: false positives (blocking good emails) or false negatives (letting spam through)? Why?

💡 What you learned: Accuracy alone isn’t enough. In spam detection, precision and recall matter because the cost of different mistakes varies. Blocking important emails is usually worse than letting some spam through.
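
Because false positives (good email blocked) usually cost more than false negatives (spam slipping through), you may want the classifier to flag spam only when it is very confident. A small optional sketch that raises the spam decision threshold using the SVM's predicted probabilities (the 0.8 cutoff is just an illustrative value) and recomputes precision and recall:

# Optional: trade recall for precision by raising the spam threshold
from sklearn.metrics import precision_score, recall_score

spam_probabilities = svm_model.predict_proba(X_test_tfidf)[:, 1]   # Estimated P(spam) for each test message
strict_predictions = (spam_probabilities >= 0.8).astype(int)        # Flag spam only when at least 80% confident

print(f"Default predictions - precision: {precision_score(y_test, svm_pred):.4f}, recall: {recall_score(y_test, svm_pred):.4f}")
print(f"Threshold 0.8       - precision: {precision_score(y_test, strict_predictions):.4f}, recall: {recall_score(y_test, strict_predictions):.4f}")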


Exercise 8: Model Deployment and Testing (15 minutes)

Step 8.1: Choose Best Model and Create Prediction Function

# Based on our evaluation, let's choose the best model
# (Adjust this based on your results)
best_model = svm_model  # Usually SVM performs best
best_vectorizer = tfidf_vectorizer

def predict_spam(message, model=best_model, vectorizer=best_vectorizer):
    """
    Predict if a single message is spam or ham
    
    Args:
        message (str): The email/message to classify
        model: Trained machine learning model
        vectorizer: Fitted vectorizer to convert text to numbers
    
    Returns:
        tuple: (prediction, confidence_score)
    """
    
    # Step 1: Clean the input message (same preprocessing as training)
    cleaned_message = clean_text(message)
    
    # Step 2: Convert to vector format the model expects
    message_vector = vectorizer.transform([cleaned_message])
    
    # Step 3: Make prediction (0 = ham, 1 = spam)
    prediction = model.predict(message_vector)[0]
    
    # Step 4: Get confidence score (probability)
    if hasattr(model, 'predict_proba'):
        probabilities = model.predict_proba(message_vector)[0]
        confidence = probabilities[prediction]  # Confidence in the prediction
    else:
        confidence = 1.0  # Some models don't provide probabilities
    
    # Step 5: Convert prediction back to text
    result = "SPAM" if prediction == 1 else "HAM"
    
    return result, confidence

print("โœ… Prediction function created!")

Step 8.2: Test with Real Examples

# Test our spam detector with various messages
test_messages = [
    # Obvious spam examples
    "Congratulations! You've won $1000! Click here to claim your prize NOW!",
    "URGENT: Your account will be closed! Verify now: www.fake-bank.com",
    "Get rich quick! Work from home! Earn $5000/week! No experience needed!",
    "FREE MONEY! No strings attached! Call now!!!",
    
    # Obvious ham (legitimate) examples  
    "Hey, are we still meeting for coffee tomorrow at 3pm?",
    "The quarterly report is attached. Please review before the meeting.",
    "Happy birthday! Hope you have a wonderful day!",
    "Can you pick up milk on your way home?",
    
    # Tricky examples (could go either way)
    "Limited time offer on our premium software - 50% off this week only",
    "Your Amazon package has been delivered to your front door",
    "Don't forget about the team meeting tomorrow at 10am",
    "You have 1 new voicemail message"
]

print("๐Ÿ” Testing Our Spam Detection System")
print("="*60)

for i, message in enumerate(test_messages, 1):
    prediction, confidence = predict_spam(message)
    
    # Format output with emojis for better readability
    if prediction == "SPAM":
        emoji = "๐Ÿšจ"
        color_indicator = "โŒ"
    else:
        emoji = "โœ…"
        color_indicator = "โœ…"
    
    print(f"\n{i:2d}. {emoji} {color_indicator} {prediction}")
    print(f"    Confidence: {confidence:.3f} ({confidence*100:.1f}%)")
    print(f"    Message: \"{message}\"")
    
    # Add interpretation
    if confidence > 0.8:
        certainty = "Very confident"
    elif confidence > 0.6:
        certainty = "Moderately confident"
    else:
        certainty = "Less confident"
    
    print(f"    Model is: {certainty}")

print("\n" + "="*60)
print("๐ŸŽฏ Testing Complete!")

Step 8.3: Interactive Testing Function

def interactive_spam_test():
    """
    Interactive function to test custom messages
    """
    print("๐ŸŽฎ Interactive Spam Detector")
    print("Type 'quit' to exit")
    print("-" * 40)
    
    while True:
        # Get user input
        user_message = input("\n๐Ÿ“ Enter a message to test: ")
        
        # Check if user wants to quit
        if user_message.lower() in ['quit', 'exit', 'q']:
            print("๐Ÿ‘‹ Thanks for testing! Goodbye!")
            break
        
        # Skip empty messages
        if not user_message.strip():
            print("โš ๏ธ Please enter a message to test")
            continue
        
        # Make prediction
        try:
            prediction, confidence = predict_spam(user_message)
            
            # Display results
            if prediction == "SPAM":
                print(f"๐Ÿšจ SPAM DETECTED! (Confidence: {confidence:.3f})")
                print("โš ๏ธ This message has characteristics of spam")
            else:
                print(f"โœ… Legitimate message (Confidence: {confidence:.3f})")
                print("๐Ÿ‘ This appears to be a normal message")
                
        except Exception as e:
            print(f"โŒ Error processing message: {e}")

# Uncomment the line below to run interactive testing
# interactive_spam_test()

Step 8.4: Analyze Feature Importance

# Show which words/features are most important for spam detection
def analyze_important_features(model=best_model, vectorizer=best_vectorizer, top_n=20):
    """
    Analyze which features (words) are most important for classification
    """
    
    if hasattr(model, 'coef_'):
        # Get feature names (words/phrases)
        feature_names = vectorizer.get_feature_names_out()
        # For a linear SVM trained on sparse TF-IDF data, coef_ can be a sparse matrix,
        # so flatten it to a dense 1-D array before ranking the features
        coefficients = np.asarray(model.coef_.todense()).ravel() if hasattr(model.coef_, 'todense') else np.ravel(model.coef_)
        
        # Get top spam indicators (positive coefficients)
        spam_features = sorted(zip(coefficients, feature_names), reverse=True)[:top_n]
        
        # Get top ham indicators (negative coefficients)  
        ham_features = sorted(zip(coefficients, feature_names))[:top_n]
        
        print("๐Ÿšจ TOP SPAM INDICATORS (words that suggest spam):")
        print("-" * 50)
        for i, (coef, word) in enumerate(spam_features, 1):
            print(f"{i:2d}. '{word}' (importance: {coef:.4f})")
        
        print("\nโœ… TOP HAM INDICATORS (words that suggest legitimate messages):")
        print("-" * 50)
        for i, (coef, word) in enumerate(ham_features, 1):
            print(f"{i:2d}. '{word}' (importance: {coef:.4f})")
            
        # Create visualization
        plt.figure(figsize=(15, 10))
        
        # Plot spam indicators
        plt.subplot(2, 1, 1)
        spam_words = [word for coef, word in spam_features]
        spam_scores = [coef for coef, word in spam_features]
        
        bars1 = plt.barh(range(len(spam_words)), spam_scores, color='red', alpha=0.7)
        plt.yticks(range(len(spam_words)), spam_words)
        plt.xlabel('Feature Importance (Spam Direction)')
        plt.title('Top 20 Words That Indicate SPAM')
        plt.gca().invert_yaxis()
        
        # Plot ham indicators
        plt.subplot(2, 1, 2)
        ham_words = [word for coef, word in ham_features]
        ham_scores = [abs(coef) for coef, word in ham_features]  # Use absolute value for visualization
        
        bars2 = plt.barh(range(len(ham_words)), ham_scores, color='green', alpha=0.7)
        plt.yticks(range(len(ham_words)), ham_words)
        plt.xlabel('Feature Importance (Ham Direction)')
        plt.title('Top 20 Words That Indicate LEGITIMATE Messages')
        plt.gca().invert_yaxis()
        
        plt.tight_layout()
        plt.show()
        
    else:
        print("โš ๏ธ This model doesn't provide feature importance information")

# Run the analysis
analyze_important_features()

Step 8.5: Save Your Model for Future Use

import pickle
import joblib
from datetime import datetime

# Save the trained model and vectorizer
def save_model(model, vectorizer, filename_prefix="spam_detector"):
    """
    Save the trained model and vectorizer for future use
    """
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    
    # Save using joblib (recommended for scikit-learn models)
    model_filename = f"{filename_prefix}_model_{timestamp}.joblib"
    vectorizer_filename = f"{filename_prefix}_vectorizer_{timestamp}.joblib"
    
    joblib.dump(model, model_filename)
    joblib.dump(vectorizer, vectorizer_filename)
    
    print(f"โœ… Model saved as: {model_filename}")
    print(f"โœ… Vectorizer saved as: {vectorizer_filename}")
    
    # Also save using pickle as backup
    with open(f"{filename_prefix}_model_{timestamp}.pkl", 'wb') as f:
        pickle.dump(model, f)
    
    with open(f"{filename_prefix}_vectorizer_{timestamp}.pkl", 'wb') as f:
        pickle.dump(vectorizer, f)
    
    print("โœ… Backup pickle files also created")
    
    return model_filename, vectorizer_filename

# Save your best model
model_file, vectorizer_file = save_model(best_model, best_vectorizer)

# Function to load saved model
def load_model(model_filename, vectorizer_filename):
    """
    Load a previously saved model and vectorizer
    """
    model = joblib.load(model_filename)
    vectorizer = joblib.load(vectorizer_filename)
    
    print(f"โœ… Model loaded from: {model_filename}")
    print(f"โœ… Vectorizer loaded from: {vectorizer_filename}")
    
    return model, vectorizer

print("\n๐ŸŽ‰ Your spam detection system is now complete and saved!")
print("You can use the saved files to deploy your model in production.")
