Building an Email Spam Detection System – Complete Tutorial

Titus Gitari 22 min read

Problem Solved: Protect users from unwanted emails and potential security threats
Time: 1-2 hours
What You’ll Learn: Natural Language Processing, binary classification, model evaluation
Tech Stack: Python, scikit-learn, NLTK
Behind the Scenes: Use Naive Bayes or SVM to classify emails based on word frequency patterns

Project Overview

By the end of this tutorial, you’ll have built a machine learning system that can automatically detect spam messages, typically with over 95% accuracy on this tutorial’s dataset. You’ll also see the basic text-classification ideas behind the spam filters that Gmail, Outlook, and other email providers use to protect users from unwanted messages.

Exercise 1: Environment Setup (5 minutes)

Step 1.1: Create Project Directory

# Create and navigate to project directory
mkdir email-spam-detection
cd email-spam-detection

# Initialize git repository for your portfolio
git init

Step 1.2: Install Required Libraries

# Install all required Python packages
pip install pandas scikit-learn nltk matplotlib seaborn wordcloud jupyter

Step 1.3: Create Python Files

# Create the main project file
touch spam_detector.py

# (Optional) If you prefer notebooks, create spam_detection_analysis.ipynb from the
# Jupyter interface in Step 1.5. An empty file made with touch is not valid notebook
# JSON and may fail to open.

# Create requirements file for dependencies
touch requirements.txt

# Create README for your GitHub repository
touch README.md

Step 1.4: Set Up Requirements File

# Add this content to requirements.txt
echo "pandas>=1.3.0
scikit-learn>=1.0.0
nltk>=3.6
matplotlib>=3.3.0
seaborn>=0.11.0
wordcloud>=1.8.0
jupyter>=1.0.0" > requirements.txt

Step 1.5: Choose Your Development Environment

Option A: Jupyter Notebook (Recommended for learning)

# Launch Jupyter for interactive development
jupyter notebook
# Then create a new notebook or open spam_detection_analysis.ipynb

Option B: Python Script

# Open spam_detector.py in your favorite code editor
# VS Code: code spam_detector.py
# Or any text editor: nano spam_detector.py

💡 What you learned: Project organization and dependency management for ML projects. You can use either Jupyter notebooks (.ipynb) for interactive exploration or Python scripts (.py) for production code.

Exercise 2: Import Libraries and Load Data (10 minutes)

Step 2.1: Import All Required Libraries

# Data manipulation and analysis
import pandas as pd  # For working with structured data (like Excel but in Python)
import numpy as np   # For numerical operations and arrays

# Visualization libraries
import matplotlib.pyplot as plt  # For creating plots and charts
import seaborn as sns           # For beautiful statistical visualizations
from wordcloud import WordCloud # For creating word clouds

# Natural Language Processing
import nltk                          # Natural Language Toolkit - main NLP library
from nltk.corpus import stopwords    # Common words like 'the', 'and', 'is'
from nltk.tokenize import word_tokenize  # Splits text into individual words
from nltk.stem import PorterStemmer     # Reduces words to their root form
import re    # Regular expressions for text cleaning
import string  # String operations

# Machine Learning libraries
from sklearn.model_selection import train_test_split  # Splits data for training/testing
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer  # Converts text to numbers
from sklearn.naive_bayes import MultinomialNB  # Naive Bayes classifier
from sklearn.svm import SVC                    # Support Vector Machine classifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score  # Model evaluation

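# Download required NLTK data (only need to run once)
nltk.download('punkt')      # Tokenizer models used by word_tokenize
nltk.download('punkt_tab')  # Newer NLTK releases look for this resource instead of 'punkt'
nltk.download('stopwords')  # For removing common words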

print("✅ All libraries imported successfully!")

Step 2.2: Load the Dataset

# Load the SMS Spam Collection dataset (SMS texts rather than emails, but the same techniques carry over)
# The collection is documented as 5,574 messages labeled 'ham' (legitimate) or 'spam';
# df.shape below shows how many rows pandas actually loads
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"

# Read the data into a pandas DataFrame
# sep='\t' means the data is separated by tabs
# names parameter gives column names since the file doesn't have headers
df = pd.read_csv(url, sep='\t', names=['label', 'message'])

# Display basic information about our dataset
print("📊 Dataset Overview:")
print(f"Dataset shape: {df.shape}")  # Shows (rows, columns)
print("\n📋 First 5 rows:")
print(df.head())
print("\n📈 Label distribution:")
print(df['label'].value_counts())  # Count of spam vs ham messages
print(f"\n📊 Spam percentage: {(df['label'] == 'spam').mean() * 100:.1f}%")

🎯 Exercise Task: Run the code above. What percentage of messages are spam? Write your answer in a comment.

💡 What you learned: How to load and inspect text data for machine learning projects.
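
If one label turns out to be much more common than the other, it helps to know what accuracy you would get by always predicting that label, because any model you train later has to beat that number. Here is a minimal sketch of the idea. The function name `majority_baseline_accuracy` and the toy labels are illustrative assumptions, not part of the dataset:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common label."""
    counts = Counter(labels)
    most_common_label, most_common_count = counts.most_common(1)[0]
    return most_common_label, most_common_count / len(labels)

# Toy labels for illustration only; with the tutorial data, pass df['label'] instead
toy_labels = ['ham'] * 8 + ['spam'] * 2
baseline_label, baseline_accuracy = majority_baseline_accuracy(toy_labels)
print(f"Always predicting '{baseline_label}' scores {baseline_accuracy:.1%}")
```

Keep this baseline in mind when you read the accuracy numbers in Exercise 7.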

📺 Recommended Video: “Pandas Tutorial for Beginners” – Corey Schafer (30 mins)

Exercise 3: Exploratory Data Analysis (15 minutes)

Step 3.1: Analyze Message Characteristics

# Add new columns to analyze message patterns
df['message_length'] = df['message'].str.len()        # Count characters in each message
df['word_count'] = df['message'].str.split().str.len() # Count words in each message

# Calculate statistics for spam vs ham messages
print("📊 Message Statistics:")
print("\n--- HAM (Legitimate) Messages ---")
ham_messages = df[df['label'] == 'ham']
print(f"Average length: {ham_messages['message_length'].mean():.1f} characters")
print(f"Average word count: {ham_messages['word_count'].mean():.1f} words")

print("\n--- SPAM Messages ---")
spam_messages = df[df['label'] == 'spam']
print(f"Average length: {spam_messages['message_length'].mean():.1f} characters")
print(f"Average word count: {spam_messages['word_count'].mean():.1f} words")

Step 3.2: Create Visualizations

# Create a 2x2 grid of visualizations to understand our data better
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Message length distribution
# This shows how long spam vs ham messages typically are
axes[0,0].hist(ham_messages['message_length'], alpha=0.7, label='Ham', bins=30, color='green')
axes[0,0].hist(spam_messages['message_length'], alpha=0.7, label='Spam', bins=30, color='red')
axes[0,0].set_title('Message Length Distribution')
axes[0,0].set_xlabel('Character Count')
axes[0,0].set_ylabel('Frequency')
axes[0,0].legend()

# Plot 2: Word count distribution
# This shows how many words spam vs ham messages typically contain
axes[0,1].hist(ham_messages['word_count'], alpha=0.7, label='Ham', bins=30, color='green')
axes[0,1].hist(spam_messages['word_count'], alpha=0.7, label='Spam', bins=30, color='red')
axes[0,1].set_title('Word Count Distribution')
axes[0,1].set_xlabel('Word Count')
axes[0,1].set_ylabel('Frequency')
axes[0,1].legend()

# Plot 3: Label distribution (pie chart)
# Shows the proportion of spam vs ham in our dataset
label_counts = df['label'].value_counts()
axes[1,0].pie(label_counts.values, labels=label_counts.index, autopct='%1.1f%%', 
              colors=['lightgreen', 'lightcoral'])
axes[1,0].set_title('Ham vs Spam Distribution')

# Plot 4: Box plot comparing message lengths
# Box plots show the distribution spread and outliers
df.boxplot(column='message_length', by='label', ax=axes[1,1])
fig.suptitle('')  # Remove the automatic "Boxplot grouped by label" title pandas adds to the figure
axes[1,1].set_title('Message Length by Category')
axes[1,1].set_xlabel('Message Type')
axes[1,1].set_ylabel('Character Count')

plt.tight_layout()  # Adjust spacing between plots
plt.show()

Step 3.3: Create Word Clouds

# Word clouds show the most common words visually
# Larger words appear more frequently in the text

# Combine all spam messages into one large text string
spam_text = ' '.join(spam_messages['message'])
# Combine all ham messages into one large text string  
ham_text = ' '.join(ham_messages['message'])

# Create side-by-side word clouds
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Spam word cloud - shows what words appear most in spam
spam_wordcloud = WordCloud(width=400, height=300, 
                          background_color='white',
                          colormap='Reds').generate(spam_text)
axes[0].imshow(spam_wordcloud, interpolation='bilinear')
axes[0].set_title('Most Common Words in SPAM Messages', fontsize=16)
axes[0].axis('off')  # Remove axis labels

# Ham word cloud - shows what words appear most in legitimate messages
ham_wordcloud = WordCloud(width=400, height=300, 
                         background_color='white',
                         colormap='Greens').generate(ham_text)
axes[1].imshow(ham_wordcloud, interpolation='bilinear')
axes[1].set_title('Most Common Words in HAM Messages', fontsize=16)
axes[1].axis('off')  # Remove axis labels

plt.tight_layout()
plt.show()

🎯 Exercise Task: Look at the word clouds. List 3 words that appear prominently in spam but not in ham messages.

💡 What you learned: How to analyze and visualize text data to understand patterns before building ML models.
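
Word clouds are easy to read but hard to compare precisely. A plain frequency count gives you the same information as numbers. The sketch below uses a hypothetical helper, `top_words`, and two tiny made-up message lists. With the tutorial data you could pass `spam_messages['message']` and `ham_messages['message']` instead:

```python
import re
from collections import Counter

def top_words(messages, n=5):
    """Return the n most frequent lowercase words across a list of messages."""
    counts = Counter()
    for message in messages:
        # Keep runs of letters (and apostrophes) as words; drop everything else
        counts.update(re.findall(r"[a-z']+", message.lower()))
    return counts.most_common(n)

# Made-up examples for illustration only
toy_spam = ["Free prize! Call now", "Call now to claim your free prize"]
toy_ham = ["Call me when you get home", "Are you free for lunch?"]

print("Spam:", top_words(toy_spam, 4))
print("Ham: ", top_words(toy_ham, 4))
```

Words that rank high in one list but not the other are the ones a classifier is likely to find useful.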

Exercise 4: Text Preprocessing (15 minutes)

Step 4.1: Create Text Cleaning Function

def clean_text(text):
    """
    Clean and preprocess text data for machine learning.
    Think of this like preparing ingredients before cooking:
    - Remove unwanted parts (punctuation, numbers)
    - Make everything uniform (lowercase)
    - Break into pieces (tokenization)
    - Remove common words that don't help classification
    - Reduce words to their root form (stemming)
    """
    
    # Step 1: Convert to lowercase for consistency
    # "Hello" and "hello" should be treated the same
    text = text.lower()
    
    # Step 2: Remove punctuation and numbers using regular expressions
    # Keep only letters and spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Step 3: Tokenization - split text into individual words
    # "hello world" becomes ["hello", "world"]
    tokens = word_tokenize(text)
    
    # Step 4: Remove stopwords (common words that don't help classification)
    # Words like 'the', 'and', 'is' don't tell us if something is spam
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words and len(token) > 2]
    
    # Step 5: Stemming - reduce words to their root form
    # "running" and "runs" both become "run" (irregular forms such as "ran" are left unchanged)
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    
    # Step 6: Join tokens back into a single string
    return ' '.join(tokens)

# Test the function with example messages
print("🧪 Testing text cleaning function:")
print("\n--- Original Message ---")
original = df['message'].iloc[0]  # Get first message
print(f"'{original}'")

print("\n--- Cleaned Message ---")
cleaned = clean_text(original)
print(f"'{cleaned}'")

print("\n--- Another Example ---")
test_message = "FREE! Win $1000 NOW!!! Call 123-456-7890"
print(f"Original: '{test_message}'")
print(f"Cleaned: '{clean_text(test_message)}'")

Step 4.2: Apply Cleaning to All Messages

# Apply the cleaning function to all messages in our dataset
print("🧹 Cleaning all messages... This may take a minute.")

# Create a new column with cleaned messages
# .apply() runs our clean_text function on every message
df['cleaned_message'] = df['message'].apply(clean_text)

print("✅ Text cleaning complete!")

# Show before and after examples
print("\n📋 Cleaning Examples:")
for i in range(3):  # Show first 3 examples
    print(f"\n--- Example {i+1} ---")
    print(f"Original:  '{df['message'].iloc[i]}'")
    print(f"Cleaned:   '{df['cleaned_message'].iloc[i]}'")
    print(f"Label:     {df['label'].iloc[i]}")

🎯 Exercise Task: Create your own test message with punctuation, numbers, and capital letters. Run it through the clean_text() function and observe the changes.

💡 What you learned: Text preprocessing is crucial for ML – it standardizes the data and removes noise.
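
One side effect of stripping all digits and punctuation is that prices, phone numbers, and links disappear, even though they can be strong hints of spam. An optional variation is to swap them for placeholder words before cleaning. The sketch below is an assumption layered on top of the tutorial, not part of `clean_text()`. The helper name `replace_signals`, the placeholder words, and the regular expressions are all illustrative and deliberately simple:

```python
import re

def replace_signals(text):
    """Swap URLs, money amounts, and phone-like digit runs for placeholder words."""
    text = text.lower()
    # Links starting with http://, https://, or www.
    text = re.sub(r'(https?://|www\.)\S+', ' urltoken ', text)
    # A currency symbol followed by a number, e.g. $1000 or £2,500.00
    text = re.sub(r'[$£€]\s*\d[\d,.]*', ' moneytoken ', text)
    # Seven or more digits, optionally separated by dashes or spaces
    text = re.sub(r'\d[\d\-\s]{5,}\d', ' phonetoken ', text)
    # Collapse the extra spaces introduced above
    return re.sub(r'\s+', ' ', text).strip()

print(replace_signals("FREE! Win $1000 NOW!!! Call 123-456-7890"))
print(replace_signals("Verify now: www.fake-bank.com"))
```

Because the placeholders contain only letters, they pass through the letters-only filter in `clean_text()` if you call `clean_text(replace_signals(message))`. Whether this improves results on this dataset is something to test rather than assume.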

📺 Recommended Video: “Text Preprocessing for NLP” – Krish Naik (20 mins)

Exercise 5: Feature Engineering (20 minutes)

Step 5.1: Prepare Data for Machine Learning

# Separate features (X) and target variable (y)
X = df['cleaned_message']  # Features: the cleaned text messages
y = df['label']           # Target: spam or ham labels

# Convert text labels to numbers (required for machine learning)
# 'ham' becomes 0, 'spam' becomes 1
y_binary = y.map({'ham': 0, 'spam': 1})

print("📊 Data Preparation Summary:")
print(f"Number of messages: {len(X)}")
print(f"Features type: {type(X)}")
print("Target distribution:")
print(f"  Ham (0): {(y_binary == 0).sum()}")
print(f"  Spam (1): {(y_binary == 1).sum()}")

Step 5.2: Convert Text to Numbers (Vectorization)

# Method 1: Count Vectorizer
# Counts how many times each word appears in each message
# Like counting ingredients in recipes
print("🔢 Method 1: Count Vectorizer")
count_vectorizer = CountVectorizer(
    max_features=3000,    # Keep only top 3000 most common words
    ngram_range=(1, 2),   # Use single words and pairs of words
    min_df=2              # Word must appear in at least 2 messages
)

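# Transform our text data into a matrix of word counts
# Note: fitting on all of X lets the vectorizer see test-set vocabulary. That keeps this
# walkthrough simple, but for a stricter evaluation split the text first, call
# fit_transform on the training messages only, and use transform on the test messages.
X_count = count_vectorizer.fit_transform(X)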
print(f"Count Vectorizer shape: {X_count.shape}")
print(f"This means: {X_count.shape[0]} messages, {X_count.shape[1]} features")

# Method 2: TF-IDF Vectorizer (Term Frequency-Inverse Document Frequency)
# Not just counts, but considers how important each word is
# Common words get less weight, rare but meaningful words get more weight
print("\n📊 Method 2: TF-IDF Vectorizer")
tfidf_vectorizer = TfidfVectorizer(
    max_features=3000,    # Keep only the 3000 most frequent terms (ranked by corpus frequency)
    ngram_range=(1, 2),   # Use single words and pairs of words
    min_df=2,             # Word must appear in at least 2 messages
    max_df=0.95           # Ignore words that appear in >95% of messages
)

# Transform our text data into TF-IDF matrix
X_tfidf = tfidf_vectorizer.fit_transform(X)
print(f"TF-IDF Vectorizer shape: {X_tfidf.shape}")

# Show some example features (words/phrases the model will use)
feature_names = tfidf_vectorizer.get_feature_names_out()
print(f"\n🔤 Example features: {list(feature_names[:10])}")
print(f"🔤 More features: {list(feature_names[1000:1010])}")

Step 5.3: Understand the Vectorization Process

# Let's see how vectorization works with a simple example
example_messages = [
    "free money now",
    "meet for coffee",
    "win free prize money"
]

# Create a simple vectorizer for demonstration
demo_vectorizer = CountVectorizer()
demo_matrix = demo_vectorizer.fit_transform(example_messages)

# Show the feature names (vocabulary)
vocab = demo_vectorizer.get_feature_names_out()
print("🎯 Vectorization Example:")
print(f"Vocabulary: {list(vocab)}")

# Convert to dense array to see the actual numbers
dense_matrix = demo_matrix.toarray()
print("\n📊 Count Matrix:")
for i, message in enumerate(example_messages):
    print(f"'{message}' → {dense_matrix[i]}")

print("\n💡 Each number represents how many times each word appears in that message")

🎯 Exercise Task: Create your own 3 simple messages and run them through the count vectorizer. Predict what the count matrix will look like before running the code.

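💡 What you learned: Vectorization converts text into numbers that machine learning algorithms can understand. TF-IDF often works better than raw counts because it down-weights words that appear in most messages, but it is not guaranteed to win: with Naive Bayes, plain counts can do just as well, so let the test results decide.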
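
To see where TF-IDF weights come from, here is a from-scratch sketch that reuses the three demo messages from Step 5.3. It assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed IDF and L2 normalization) and splits on whitespace, which is simpler than the real tokenizer. The function name `tfidf_by_hand` is hypothetical:

```python
import math
from collections import Counter

def tfidf_by_hand(messages):
    """Compute L2-normalized TF-IDF weights using the smoothed IDF formula."""
    tokenized = [message.split() for message in messages]
    n_docs = len(tokenized)

    # Document frequency: in how many messages does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so rarer words get larger values
    idf = {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        # Raw weight = term count in this message * IDF of the term
        raw = {word: count * idf[word] for word, count in Counter(tokens).items()}
        # Scale each message so its weights have length 1 (L2 norm)
        norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
        rows.append({word: weight / norm for word, weight in raw.items()})
    return idf, rows

example_messages = ["free money now", "meet for coffee", "win free prize money"]
idf, rows = tfidf_by_hand(example_messages)

for message, row in zip(example_messages, rows):
    weights = ", ".join(f"{word}={weight:.3f}" for word, weight in row.items())
    print(f"'{message}' → {weights}")
```

In the first message, "now" gets a higher weight than "free" or "money" because it appears in only one of the three messages.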

📺 Recommended Video: “TF-IDF Explained” – Luis Serrano (12 mins)

Exercise 6: Model Training (15 minutes)

Step 6.1: Split Data for Training and Testing

# Split data into training (80%) and testing (20%) sets
# This is like studying for an exam (training) then taking the actual exam (testing)

# Split Count Vectorizer data
X_train_count, X_test_count, y_train, y_test = train_test_split(
    X_count, y_binary,     # Features and target
    test_size=0.2,         # 20% for testing
    random_state=42,       # For reproducible results
    stratify=y_binary      # Maintain same spam/ham ratio in both sets
)

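# Split TF-IDF data (same random_state and stratify, so the same rows land in each set
# and y_train / y_test from the first split still line up)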
X_train_tfidf, X_test_tfidf, _, _ = train_test_split(
    X_tfidf, y_binary,
    test_size=0.2,
    random_state=42,
    stratify=y_binary
)

print("📊 Data Split Summary:")
print(f"Training set size: {X_train_count.shape[0]} messages")
print(f"Test set size: {X_test_count.shape[0]} messages")
print(f"Features per message: {X_train_count.shape[1]}")

# Check that split maintained class balance
print(f"\nTraining set spam ratio: {y_train.mean():.3f}")
print(f"Test set spam ratio: {y_test.mean():.3f}")

Step 6.2: Train Naive Bayes Models

# Naive Bayes is a strong, fast baseline for text classification
# It assumes words occur independently given the class, then picks the more probable class

print("🧠 Training Naive Bayes Models...")

# Model 1: Naive Bayes with Count Vectorizer
nb_count = MultinomialNB(alpha=1.0)  # alpha is smoothing parameter
nb_count.fit(X_train_count, y_train)  # Train the model
print("✅ Naive Bayes (Count) trained")

# Model 2: Naive Bayes with TF-IDF
nb_tfidf = MultinomialNB(alpha=1.0)
nb_tfidf.fit(X_train_tfidf, y_train)  # Train the model
print("✅ Naive Bayes (TF-IDF) trained")

# Quick accuracy check on training data
train_accuracy_count = nb_count.score(X_train_count, y_train)
train_accuracy_tfidf = nb_tfidf.score(X_train_tfidf, y_train)

print(f"\n📊 Training Accuracies:")
print(f"Naive Bayes (Count): {train_accuracy_count:.4f}")
print(f"Naive Bayes (TF-IDF): {train_accuracy_tfidf:.4f}")

Step 6.3: Train Support Vector Machine

# SVM finds the best boundary between spam and ham messages
# It's often more accurate but takes longer to train

print("\n🎯 Training Support Vector Machine...")

# SVM with TF-IDF (usually works better than counts for SVM)
svm_model = SVC(
    kernel='linear',      # Linear boundary works well for text
    probability=True,     # Enable probability predictions
    random_state=42       # For reproducible results
)

# Train the SVM (this might take a minute)
print("⏳ Training SVM... (this may take a moment)")
svm_model.fit(X_train_tfidf, y_train)
print("✅ SVM trained successfully")

# Check training accuracy
svm_train_accuracy = svm_model.score(X_train_tfidf, y_train)
print(f"SVM training accuracy: {svm_train_accuracy:.4f}")

print("\n🎉 All models trained successfully!")

🎯 Exercise Task: Which model had the highest training accuracy? Why do you think training accuracy might be different from real-world performance?

💡 What you learned: Different algorithms have different strengths. Naive Bayes is fast and works well with text. SVM often has higher accuracy but is slower to train.
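
If you are curious what Naive Bayes does with word counts, the sketch below implements the multinomial version with add-one (Laplace) smoothing in plain Python. It is a teaching aid under simplifying assumptions (whitespace tokens, unknown words ignored), not a replacement for `MultinomialNB`. The function names and the toy messages are made up for illustration:

```python
import math
from collections import Counter

def train_naive_bayes(messages, labels, alpha=1.0):
    """Count words per class; alpha is the smoothing added to every count."""
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for message, label in zip(messages, labels):
        word_counts[label].update(message.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab, alpha

def predict_naive_bayes(model, message):
    """Return the label with the highest log-probability, plus all scores."""
    word_counts, class_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Start from the log prior, then add one log-likelihood per known word
        score = math.log(class_counts[label] / total_docs)
        for word in message.split():
            if word in vocab:
                score += math.log((counts[word] + alpha) / (total_words + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get), scores

# Toy training data for illustration only
toy_messages = ["free money now", "win free prize", "meet for coffee", "coffee at noon"]
toy_labels = ["spam", "spam", "ham", "ham"]
toy_model = train_naive_bayes(toy_messages, toy_labels)

for text in ["win free money", "coffee at noon"]:
    label, _ = predict_naive_bayes(toy_model, text)
    print(f"'{text}' → {label}")
```

Adding log-probabilities instead of multiplying probabilities avoids numerical underflow on long messages, and the smoothing keeps a word that was never seen in one class from zeroing out that class entirely.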

📺 Recommended Video: “Machine Learning Algorithms Explained” – Zach Star (15 mins)

Exercise 7: Model Evaluation (20 minutes)

Step 7.1: Make Predictions and Calculate Basic Metrics

# Now test our models on unseen data (the test set)
print("🔮 Making Predictions on Test Data...")

# Naive Bayes (Count) predictions
nb_count_pred = nb_count.predict(X_test_count)
nb_count_accuracy = accuracy_score(y_test, nb_count_pred)

# Naive Bayes (TF-IDF) predictions  
nb_tfidf_pred = nb_tfidf.predict(X_test_tfidf)
nb_tfidf_accuracy = accuracy_score(y_test, nb_tfidf_pred)

# SVM predictions
svm_pred = svm_model.predict(X_test_tfidf)
svm_accuracy = accuracy_score(y_test, svm_pred)

print("📈 Test Set Accuracies:")
print(f"Naive Bayes (Count):  {nb_count_accuracy:.4f} ({nb_count_accuracy*100:.1f}%)")
print(f"Naive Bayes (TF-IDF): {nb_tfidf_accuracy:.4f} ({nb_tfidf_accuracy*100:.1f}%)")
print(f"SVM (TF-IDF):         {svm_accuracy:.4f} ({svm_accuracy*100:.1f}%)")

Step 7.2: Detailed Classification Reports

# Classification report shows precision, recall, and F1-score for each class
# This is more informative than just accuracy

def show_classification_report(y_true, y_pred, model_name):
    """Display detailed classification metrics"""
    print(f"\n{'='*50}")
    print(f"{model_name} - Detailed Results")
    print(f"{'='*50}")
    
    # Classification report with explanation
    report = classification_report(y_true, y_pred, 
                                 target_names=['Ham', 'Spam'], 
                                 digits=4)
    print(report)
    
    print("\n💡 Metrics Explanation:")
    print("📍 Precision: Of all messages we predicted as spam, how many were actually spam?")
    print("📍 Recall: Of all actual spam messages, how many did we catch?")
    print("📍 F1-Score: Harmonic mean of precision and recall")
    print("📍 Support: Number of actual messages in each category")

# Show detailed results for each model
show_classification_report(y_test, nb_count_pred, "Naive Bayes (Count Vectorizer)")
show_classification_report(y_test, nb_tfidf_pred, "Naive Bayes (TF-IDF)")
show_classification_report(y_test, svm_pred, "Support Vector Machine")

Step 7.3: Confusion Matrices

# Confusion matrices show exactly where our models make mistakes
def plot_confusion_matrix(y_true, y_pred, model_name):
    """Create and display confusion matrix"""
    
    # Calculate confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    
    # Create the plot
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=['Ham', 'Spam'], 
                yticklabels=['Ham', 'Spam'],
                cbar_kws={'label': 'Count'})
    
    plt.title(f'{model_name} - Confusion Matrix', fontsize=14)
    plt.xlabel('Predicted Label', fontsize=12)
    plt.ylabel('True Label', fontsize=12)
    
    # Add explanatory text
    plt.figtext(0.02, 0.02, 
                "Perfect predictions would show numbers only on the diagonal.\n" +
                "Off-diagonal numbers represent mistakes.",  # No emoji here: the default plot font may lack the glyph
                fontsize=10, ha='left')
    
    plt.tight_layout()
    plt.show()
    
    # Print interpretation
    tn, fp, fn, tp = cm.ravel()  # True Neg, False Pos, False Neg, True Pos
    print(f"\n📊 {model_name} Confusion Matrix Breakdown:")
    print(f"✅ True Negatives (Ham correctly identified): {tn}")
    print(f"❌ False Positives (Ham labeled as Spam): {fp} ← Bad! Blocks good emails")
    print(f"❌ False Negatives (Spam labeled as Ham): {fn} ← Spam gets through")
    print(f"✅ True Positives (Spam correctly identified): {tp}")

# Create confusion matrices for all models
plot_confusion_matrix(y_test, nb_count_pred, "Naive Bayes (Count Vectorizer)")
plot_confusion_matrix(y_test, nb_tfidf_pred, "Naive Bayes (TF-IDF)")
plot_confusion_matrix(y_test, svm_pred, "Support Vector Machine")

Step 7.4: Model Comparison Visualization

# Create a visual comparison of all models
models_comparison = {
    'Model': ['NB (Count)', 'NB (TF-IDF)', 'SVM (TF-IDF)'],
    'Accuracy': [nb_count_accuracy, nb_tfidf_accuracy, svm_accuracy]
}

comparison_df = pd.DataFrame(models_comparison)

# Create bar plot
plt.figure(figsize=(10, 6))
bars = plt.bar(comparison_df['Model'], comparison_df['Accuracy'], 
               color=['lightblue', 'lightgreen', 'lightcoral'],
               edgecolor='black', linewidth=1)

plt.title('Model Accuracy Comparison', fontsize=16)
plt.ylabel('Accuracy Score', fontsize=12)
plt.ylim(0.85, 1.0)  # Focus on the high accuracy range

# Add accuracy values on bars
for bar, accuracy in zip(bars, comparison_df['Accuracy']):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005, 
             f'{accuracy:.4f}', ha='center', va='bottom', fontweight='bold')

plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Print summary
print("🏆 Model Performance Summary:")
best_model_idx = comparison_df['Accuracy'].idxmax()
# Store the name separately so a string never shadows a trained model object
best_model_name = comparison_df.loc[best_model_idx, 'Model']
best_accuracy = comparison_df.loc[best_model_idx, 'Accuracy']
print(f"Best performing model: {best_model_name} with {best_accuracy:.4f} accuracy")

🎯 Exercise Task: Look at the confusion matrices. Which type of error is worse for spam detection: false positives (blocking good emails) or false negatives (letting spam through)? Why?

💡 What you learned: Accuracy alone isn’t enough. In spam detection, precision and recall matter because the cost of different mistakes varies. Blocking important emails is usually worse than letting some spam through.
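
To make the difference between these metrics concrete, the sketch below computes them from the four confusion-matrix counts. The counts are hypothetical numbers chosen for illustration, not results from this tutorial's models, and the function name `spam_metrics` is made up:

```python
def spam_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F1 for the spam class."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Precision: of everything flagged as spam, how much really was spam?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all real spam, how much did we flag?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for illustration only
metrics = spam_metrics(tn=960, fp=5, fn=20, tp=130)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.4f}")
```

With these made-up counts, accuracy is close to 98% even though about 13% of the spam slips through, which is the kind of gap that accuracy alone hides.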

Exercise 8: Model Deployment and Testing (15 minutes)

Step 8.1: Choose Best Model and Create Prediction Function

# Based on our evaluation, let's choose the best model
# (Adjust this based on your results)
best_model = svm_model  # Usually SVM performs best
best_vectorizer = tfidf_vectorizer

def predict_spam(message, model=best_model, vectorizer=best_vectorizer):
    """
    Predict if a single message is spam or ham
    
    Args:
        message (str): The email/message to classify
        model: Trained machine learning model
        vectorizer: Fitted vectorizer to convert text to numbers
    
    Returns:
        tuple: (prediction, confidence_score)
    """
    
    # Step 1: Clean the input message (same preprocessing as training)
    cleaned_message = clean_text(message)
    
    # Step 2: Convert to vector format the model expects
    message_vector = vectorizer.transform([cleaned_message])
    
    # Step 3: Make prediction (0 = ham, 1 = spam)
    prediction = model.predict(message_vector)[0]
    
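    # Step 4: Get confidence score (probability)
    # Note: for SVC, predict_proba comes from a separate calibration step, so it can
    # occasionally disagree with predict() on borderline messages
    if hasattr(model, 'predict_proba'):
        probabilities = model.predict_proba(message_vector)[0]
        confidence = probabilities[prediction]  # Probability assigned to the predicted class
    else:
        confidence = 1.0  # Placeholder only: this model provides no probabilities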
    
    # Step 5: Convert prediction back to text
    result = "SPAM" if prediction == 1 else "HAM"
    
    return result, confidence

print("✅ Prediction function created!")
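
Since blocking a good message is usually worse than letting some spam through, you may want to flag a message only when the spam probability is well above 50%. The sketch below shows how a stricter threshold trades false positives for false negatives. The probabilities and labels are made up, and both function names are hypothetical. With the tutorial's model, the spam probability would be `model.predict_proba(message_vector)[0][1]`:

```python
def classify_with_threshold(spam_probability, threshold=0.5):
    """Label a message as SPAM only if its spam probability reaches the threshold."""
    return "SPAM" if spam_probability >= threshold else "HAM"

def count_errors(spam_probabilities, true_labels, threshold):
    """Count false positives (ham flagged) and false negatives (spam missed)."""
    false_positives = 0
    false_negatives = 0
    for probability, truth in zip(spam_probabilities, true_labels):
        predicted = classify_with_threshold(probability, threshold)
        if predicted == "SPAM" and truth == "HAM":
            false_positives += 1
        elif predicted == "HAM" and truth == "SPAM":
            false_negatives += 1
    return false_positives, false_negatives

# Made-up spam probabilities and true labels for illustration only
toy_probs = [0.95, 0.85, 0.65, 0.60, 0.30, 0.10]
toy_truth = ["SPAM", "SPAM", "HAM", "SPAM", "HAM", "HAM"]

for threshold in (0.5, 0.8):
    fp, fn = count_errors(toy_probs, toy_truth, threshold)
    print(f"threshold={threshold}: {fp} good messages blocked, {fn} spam missed")
```

Raising the threshold here removes the blocked good message at the cost of one missed spam. Which trade-off is right depends on how costly each mistake is for your users.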

Step 8.2: Test with Real Examples

# Test our spam detector with various messages
test_messages = [
    # Obvious spam examples
    "Congratulations! You've won $1000! Click here to claim your prize NOW!",
    "URGENT: Your account will be closed! Verify now: www.fake-bank.com",
    "Get rich quick! Work from home! Earn $5000/week! No experience needed!",
    "FREE MONEY! No strings attached! Call now!!!",
    
    # Obvious ham (legitimate) examples  
    "Hey, are we still meeting for coffee tomorrow at 3pm?",
    "The quarterly report is attached. Please review before the meeting.",
    "Happy birthday! Hope you have a wonderful day!",
    "Can you pick up milk on your way home?",
    
    # Tricky examples (could go either way)
    "Limited time offer on our premium software - 50% off this week only",
    "Your Amazon package has been delivered to your front door",
    "Don't forget about the team meeting tomorrow at 10am",
    "You have 1 new voicemail message"
]

print("🔍 Testing Our Spam Detection System")
print("="*60)

for i, message in enumerate(test_messages, 1):
    prediction, confidence = predict_spam(message)
    
    # Format output with emojis for better readability
    if prediction == "SPAM":
        emoji = "🚨"
        color_indicator = "❌"
    else:
        emoji = "✅"
        color_indicator = "✅"
    
    print(f"\n{i:2d}. {emoji} {color_indicator} {prediction}")
    print(f"    Confidence: {confidence:.3f} ({confidence*100:.1f}%)")
    print(f"    Message: \"{message}\"")
    
    # Add interpretation
    if confidence > 0.8:
        certainty = "Very confident"
    elif confidence > 0.6:
        certainty = "Moderately confident"
    else:
        certainty = "Less confident"
    
    print(f"    Model is: {certainty}")

print("\n" + "="*60)
print("🎯 Testing Complete!")

Step 8.3: Interactive Testing Function

def interactive_spam_test():
    """
    Interactive function to test custom messages
    """
    print("🎮 Interactive Spam Detector")
    print("Type 'quit' to exit")
    print("-" * 40)
    
    while True:
        # Get user input
        user_message = input("\n📝 Enter a message to test: ")
        
        # Check if user wants to quit
        if user_message.lower() in ['quit', 'exit', 'q']:
            print("👋 Thanks for testing! Goodbye!")
            break
        
        # Skip empty messages
        if not user_message.strip():
            print("⚠️ Please enter a message to test")
            continue
        
        # Make prediction
        try:
            prediction, confidence = predict_spam(user_message)
            
            # Display results
            if prediction == "SPAM":
                print(f"🚨 SPAM DETECTED! (Confidence: {confidence:.3f})")
                print("⚠️ This message has characteristics of spam")
            else:
                print(f"✅ Legitimate message (Confidence: {confidence:.3f})")
                print("👍 This appears to be a normal message")
                
        except Exception as e:
            print(f"❌ Error processing message: {e}")

# Uncomment the line below to run interactive testing
# interactive_spam_test()

Step 8.4: Analyze Feature Importance

# Show which words/features are most important for spam detection
def analyze_important_features(model=best_model, vectorizer=best_vectorizer, top_n=20):
    """
    Analyze which features (words) are most important for classification
    """
    
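    if hasattr(model, 'coef_'):
        # Get feature names (words/phrases)
        feature_names = vectorizer.get_feature_names_out()
        # A linear SVC trained on sparse input stores coef_ as a sparse matrix,
        # so convert it to a flat NumPy array before sorting
        coefficients = model.coef_
        if hasattr(coefficients, 'toarray'):
            coefficients = coefficients.toarray()
        coefficients = np.ravel(coefficients)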
        
        # Get top spam indicators (positive coefficients)
        spam_features = sorted(zip(coefficients, feature_names), reverse=True)[:top_n]
        
        # Get top ham indicators (negative coefficients)  
        ham_features = sorted(zip(coefficients, feature_names))[:top_n]
        
        print("🚨 TOP SPAM INDICATORS (words that suggest spam):")
        print("-" * 50)
        for i, (coef, word) in enumerate(spam_features, 1):
            print(f"{i:2d}. '{word}' (importance: {coef:.4f})")
        
        print("\n✅ TOP HAM INDICATORS (words that suggest legitimate messages):")
        print("-" * 50)
        for i, (coef, word) in enumerate(ham_features, 1):
            print(f"{i:2d}. '{word}' (importance: {coef:.4f})")
            
        # Create visualization
        plt.figure(figsize=(15, 10))
        
        # Plot spam indicators
        plt.subplot(2, 1, 1)
        spam_words = [word for coef, word in spam_features]
        spam_scores = [coef for coef, word in spam_features]
        
        bars1 = plt.barh(range(len(spam_words)), spam_scores, color='red', alpha=0.7)
        plt.yticks(range(len(spam_words)), spam_words)
        plt.xlabel('Feature Importance (Spam Direction)')
        plt.title(f'Top {top_n} Words That Indicate SPAM')
        plt.gca().invert_yaxis()
        
        # Plot ham indicators
        plt.subplot(2, 1, 2)
        ham_words = [word for coef, word in ham_features]
        ham_scores = [abs(coef) for coef, word in ham_features]  # Use absolute value for visualization
        
        bars2 = plt.barh(range(len(ham_words)), ham_scores, color='green', alpha=0.7)
        plt.yticks(range(len(ham_words)), ham_words)
        plt.xlabel('Feature Importance (Ham Direction)')
        plt.title(f'Top {top_n} Words That Indicate LEGITIMATE Messages')
        plt.gca().invert_yaxis()
        
        plt.tight_layout()
        plt.show()
        
    else:
        print("⚠️ This model doesn't provide feature importance information")

# Run the analysis
analyze_important_features()

Step 8.5: Save Your Model for Future Use

import pickle
import joblib
from datetime import datetime

# Save the trained model and vectorizer
def save_model(model, vectorizer, filename_prefix="spam_detector"):
    """
    Save the trained model and vectorizer for future use
    """
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    
    # Save using joblib (recommended for scikit-learn models)
    model_filename = f"{filename_prefix}_model_{timestamp}.joblib"
    vectorizer_filename = f"{filename_prefix}_vectorizer_{timestamp}.joblib"
    
    joblib.dump(model, model_filename)
    joblib.dump(vectorizer, vectorizer_filename)
    
    print(f"✅ Model saved as: {model_filename}")
    print(f"✅ Vectorizer saved as: {vectorizer_filename}")
    
    # Also save using pickle as backup
    with open(f"{filename_prefix}_model_{timestamp}.pkl", 'wb') as f:
        pickle.dump(model, f)
    
    with open(f"{filename_prefix}_vectorizer_{timestamp}.pkl", 'wb') as f:
        pickle.dump(vectorizer, f)
    
    print("✅ Backup pickle files also created")
    
    return model_filename, vectorizer_filename

# Save your best model
model_file, vectorizer_file = save_model(best_model, best_vectorizer)

# Function to load saved model
def load_model(model_filename, vectorizer_filename):
    """
    Load a previously saved model and vectorizer.
    Only load files you created or trust: unpickling can run arbitrary code.
    """
    model = joblib.load(model_filename)
    vectorizer = joblib.load(vectorizer_filename)
    
    print(f"✅ Model loaded from: {model_filename}")
    print(f"✅ Vectorizer loaded from: {vectorizer_filename}")
    
    return model, vectorizer

print("\n🎉 Your spam detection system is now complete and saved!")
print("To reuse the saved files elsewhere, load both and apply the same clean_text() step before predicting.")
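
Saving the model and vectorizer as separate files works, but the two can drift out of sync. One alternative is to bundle them in a scikit-learn `Pipeline` and save a single file. Fitting the pipeline on training text only also keeps the vectorizer from ever seeing test messages. The sketch below uses made-up messages, a temporary directory, and a hypothetical file name, and it leaves out `clean_text()`, which you would still apply to the text first:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data for illustration only (1 = spam, 0 = ham)
toy_train_text = ["free prize money", "win free money now",
                  "meet for coffee", "coffee tomorrow at noon"]
toy_train_labels = [1, 1, 0, 0]

# The pipeline vectorizes and classifies in one object
spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])
spam_pipeline.fit(toy_train_text, toy_train_labels)

# Save and reload a single file (a temporary directory keeps this demo tidy)
with tempfile.TemporaryDirectory() as tmp_dir:
    pipeline_path = os.path.join(tmp_dir, "spam_pipeline.joblib")
    joblib.dump(spam_pipeline, pipeline_path)
    restored_pipeline = joblib.load(pipeline_path)

# The restored pipeline accepts raw text directly
new_messages = ["win free money", "meet for coffee"]
restored_predictions = restored_pipeline.predict(new_messages)
print(dict(zip(new_messages, restored_predictions.tolist())))
```

A saved pipeline should be loaded with the same scikit-learn version that created it where possible, since pickled models are not guaranteed to work across versions.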