Problem Solved: Protect users from unwanted emails and potential security threats
Time: 1-2 hours
What You’ll Learn: Natural Language Processing, binary classification, model evaluation
Tech Stack: Python, scikit-learn, NLTK
Behind the Scenes: Use Naive Bayes or SVM to classify emails based on word frequency patterns
Project Overview
By the end of this tutorial, you’ll have built a machine learning system that can automatically detect spam emails with over 95% accuracy. You’ll understand how Gmail, Outlook, and other email providers protect users from unwanted messages.
Recommended YouTube Videos (Watch Before Starting)
- “What is Natural Language Processing?” – IBM Technology (5 mins)
- “Naive Bayes Classifier Explained” – StatQuest with Josh Starmer (15 mins)
- “Support Vector Machines (SVM) Explained” – Zach Star (10 mins)
- “Precision vs Recall Explained” – StatQuest with Josh Starmer (9 mins)
Exercise 1: Environment Setup (5 minutes)
Step 1.1: Create Project Directory
# Create and navigate to project directory
mkdir email-spam-detection
cd email-spam-detection
# Initialize git repository for your portfolio
git init
Step 1.2: Install Required Libraries
# Install all required Python packages
pip install pandas scikit-learn nltk matplotlib seaborn wordcloud jupyter
Step 1.3: Create Python Files
# Create the main project file
touch spam_detector.py
# (Optional) Do not create the notebook with touch: an empty file is not valid notebook JSON
# Create spam_detection_analysis.ipynb from inside Jupyter in Step 1.5 instead
# Create requirements file for dependencies
touch requirements.txt
# Create README for your GitHub repository
touch README.md
Step 1.4: Set Up Requirements File
# Add this content to requirements.txt
echo "pandas>=1.3.0
scikit-learn>=1.0.0
nltk>=3.6
matplotlib>=3.3.0
seaborn>=0.11.0
wordcloud>=1.8.0
jupyter>=1.0.0" > requirements.txt
Step 1.5: Choose Your Development Environment
Option A: Jupyter Notebook (Recommended for learning)
# Launch Jupyter for interactive development
jupyter notebook
# Then create a new notebook or open spam_detection_analysis.ipynb
Option B: Python Script
# Open spam_detector.py in your favorite code editor
# VS Code: code spam_detector.py
# Or any text editor: nano spam_detector.py
💡 What you learned: Project organization and dependency management for ML projects. You can use either Jupyter notebooks (.ipynb) for interactive exploration or Python scripts (.py) for production code.
Exercise 2: Import Libraries and Load Data (10 minutes)
Step 2.1: Import All Required Libraries
# Data manipulation and analysis
import pandas as pd # For working with structured data (like Excel but in Python)
import numpy as np # For numerical operations and arrays
# Visualization libraries
import matplotlib.pyplot as plt # For creating plots and charts
import seaborn as sns # For beautiful statistical visualizations
from wordcloud import WordCloud # For creating word clouds
# Natural Language Processing
import nltk # Natural Language Toolkit - main NLP library
from nltk.corpus import stopwords # Common words like 'the', 'and', 'is'
from nltk.tokenize import word_tokenize # Splits text into individual words
from nltk.stem import PorterStemmer # Reduces words to their root form
import re # Regular expressions for text cleaning
import string # String operations
# Machine Learning libraries
from sklearn.model_selection import train_test_split # Splits data for training/testing
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer # Converts text to numbers
from sklearn.naive_bayes import MultinomialNB # Naive Bayes classifier
from sklearn.svm import SVC # Support Vector Machine classifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score # Model evaluation
# Download required NLTK data (only need to run once)
nltk.download('punkt') # For tokenization
nltk.download('punkt_tab') # Tokenizer data that newer NLTK releases look for instead of 'punkt'
nltk.download('stopwords') # For removing common words
print("✅ All libraries imported successfully!")
Step 2.2: Load the Dataset
# Load SMS Spam Collection dataset (works perfectly for email spam concepts)
# This dataset contains 5,574 messages labeled as 'ham' (legitimate) or 'spam'
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
# Read the data into a pandas DataFrame
# sep='\t' means the data is separated by tabs
# names parameter gives column names since the file doesn't have headers
df = pd.read_csv(url, sep='\t', names=['label', 'message'])
# Display basic information about our dataset
print("Dataset Overview:")
print(f"Dataset shape: {df.shape}") # Shows (rows, columns)
print("\nFirst 5 rows:")
print(df.head())
print("\nLabel distribution:")
print(df['label'].value_counts()) # Count of spam vs ham messages
print(f"\nSpam percentage: {(df['label'] == 'spam').mean() * 100:.1f}%")
🎯 Exercise Task: Run the code above. What percentage of messages are spam? Write your answer in a comment.
💡 What you learned: How to load and inspect text data for machine learning projects.
📺 Recommended Video: “Pandas Tutorial for Beginners” – Corey Schafer (30 mins)
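To see what the loading step does without downloading anything, here is a minimal sketch that parses a tiny tab-separated sample using only the standard library. The three rows are made up for illustration; they only mimic the label<TAB>message layout of sms.tsv.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row sample in the same layout as sms.tsv: label<TAB>message, no header row
sample_tsv = "ham\tSee you at lunch?\nspam\tWIN a FREE prize now\nham\tRunning late, sorry\n"

# QUOTE_NONE treats quote characters inside messages as ordinary text
reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [(label, message) for label, message in reader]

# Count how many messages carry each label, then compute the spam share
label_counts = Counter(label for label, _ in rows)
spam_share = label_counts["spam"] / len(rows) * 100

print(label_counts)
print(f"Spam percentage: {spam_share:.1f}%")
```

pandas does the same parsing in one call and adds the column names for you, which is why the tutorial uses pd.read_csv.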
Exercise 3: Exploratory Data Analysis (15 minutes)
Step 3.1: Analyze Message Characteristics
# Add new columns to analyze message patterns
df['message_length'] = df['message'].str.len() # Count characters in each message
df['word_count'] = df['message'].str.split().str.len() # Count words in each message
# Calculate statistics for spam vs ham messages
print("Message Statistics:")
print("\n--- HAM (Legitimate) Messages ---")
ham_messages = df[df['label'] == 'ham']
print(f"Average length: {ham_messages['message_length'].mean():.1f} characters")
print(f"Average word count: {ham_messages['word_count'].mean():.1f} words")
print("\n--- SPAM Messages ---")
spam_messages = df[df['label'] == 'spam']
print(f"Average length: {spam_messages['message_length'].mean():.1f} characters")
print(f"Average word count: {spam_messages['word_count'].mean():.1f} words")
Step 3.2: Create Visualizations
# Create a 2x2 grid of visualizations to understand our data better
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Plot 1: Message length distribution
# This shows how long spam vs ham messages typically are
axes[0,0].hist(ham_messages['message_length'], alpha=0.7, label='Ham', bins=30, color='green')
axes[0,0].hist(spam_messages['message_length'], alpha=0.7, label='Spam', bins=30, color='red')
axes[0,0].set_title('Message Length Distribution')
axes[0,0].set_xlabel('Character Count')
axes[0,0].set_ylabel('Frequency')
axes[0,0].legend()
# Plot 2: Word count distribution
# This shows how many words spam vs ham messages typically contain
axes[0,1].hist(ham_messages['word_count'], alpha=0.7, label='Ham', bins=30, color='green')
axes[0,1].hist(spam_messages['word_count'], alpha=0.7, label='Spam', bins=30, color='red')
axes[0,1].set_title('Word Count Distribution')
axes[0,1].set_xlabel('Word Count')
axes[0,1].set_ylabel('Frequency')
axes[0,1].legend()
# Plot 3: Label distribution (pie chart)
# Shows the proportion of spam vs ham in our dataset
label_counts = df['label'].value_counts()
axes[1,0].pie(label_counts.values, labels=label_counts.index, autopct='%1.1f%%',
colors=['lightgreen', 'lightcoral'])
axes[1,0].set_title('Ham vs Spam Distribution')
# Plot 4: Box plot comparing message lengths
# Box plots show the distribution spread and outliers
df.boxplot(column='message_length', by='label', ax=axes[1,1])
axes[1,1].set_title('Message Length by Category')
axes[1,1].set_xlabel('Message Type')
axes[1,1].set_ylabel('Character Count')
plt.tight_layout() # Adjust spacing between plots
plt.show()
Step 3.3: Create Word Clouds
# Word clouds show the most common words visually
# Larger words appear more frequently in the text
# Combine all spam messages into one large text string
spam_text = ' '.join(spam_messages['message'])
# Combine all ham messages into one large text string
ham_text = ' '.join(ham_messages['message'])
# Create side-by-side word clouds
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
# Spam word cloud - shows what words appear most in spam
spam_wordcloud = WordCloud(width=400, height=300,
background_color='white',
colormap='Reds').generate(spam_text)
axes[0].imshow(spam_wordcloud, interpolation='bilinear')
axes[0].set_title('Most Common Words in SPAM Messages', fontsize=16)
axes[0].axis('off') # Remove axis labels
# Ham word cloud - shows what words appear most in legitimate messages
ham_wordcloud = WordCloud(width=400, height=300,
background_color='white',
colormap='Greens').generate(ham_text)
axes[1].imshow(ham_wordcloud, interpolation='bilinear')
axes[1].set_title('Most Common Words in HAM Messages', fontsize=16)
axes[1].axis('off') # Remove axis labels
plt.tight_layout()
plt.show()
🎯 Exercise Task: Look at the word clouds. List 3 words that appear prominently in spam but not in ham messages.
💡 What you learned: How to analyze and visualize text data to understand patterns before building ML models.
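A word cloud is a picture of word frequencies, so the same comparison can be made with plain counts. The sketch below uses four made-up messages (not rows from the dataset) to show one way of listing words that occur in spam but never in ham.

```python
from collections import Counter

# Hypothetical toy messages; in the tutorial these would come from df
messages = [
    ("spam", "free prize call now"),
    ("spam", "call now to claim free cash"),
    ("ham", "call me when you are free"),
    ("ham", "are you coming home now"),
]

spam_counts = Counter()
ham_counts = Counter()
for label, text in messages:
    target = spam_counts if label == "spam" else ham_counts
    target.update(text.lower().split())

# Words seen in spam messages but never in ham messages
spam_only = sorted(word for word in spam_counts if word not in ham_counts)

print(spam_counts.most_common(3))
print(spam_only)
```

Note that "free" and "call" appear in both classes here, which is a reminder that a single word rarely decides the label on its own.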
Exercise 4: Text Preprocessing (15 minutes)
Step 4.1: Create Text Cleaning Function
def clean_text(text):
    """
    Clean and preprocess text data for machine learning.
    Think of this like preparing ingredients before cooking:
    - Remove unwanted parts (punctuation, numbers)
    - Make everything uniform (lowercase)
    - Break into pieces (tokenization)
    - Remove common words that don't help classification
    - Reduce words to their root form (stemming)
    """
    # Step 1: Convert to lowercase for consistency
    # "Hello" and "hello" should be treated the same
    text = text.lower()
    # Step 2: Remove punctuation and numbers using regular expressions
    # Keep only letters and spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Step 3: Tokenization - split text into individual words
    # "hello world" becomes ["hello", "world"]
    tokens = word_tokenize(text)
    # Step 4: Remove stopwords (common words that don't help classification)
    # Words like 'the', 'and', 'is' don't tell us if something is spam
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words and len(token) > 2]
    # Step 5: Stemming - reduce words to their root form
    # "running" and "runs" both become "run" (irregular forms such as "ran" are left unchanged)
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    # Step 6: Join tokens back into a single string
    return ' '.join(tokens)
# Test the function with example messages
print("🧪 Testing text cleaning function:")
print("\n--- Original Message ---")
original = df['message'].iloc[0] # Get first message
print(f"'{original}'")
print("\n--- Cleaned Message ---")
cleaned = clean_text(original)
print(f"'{cleaned}'")
print("\n--- Another Example ---")
test_message = "FREE! Win $1000 NOW!!! Call 123-456-7890"
print(f"Original: '{test_message}'")
print(f"Cleaned: '{clean_text(test_message)}'")
Step 4.2: Apply Cleaning to All Messages
# Apply the cleaning function to all messages in our dataset
print("🧹 Cleaning all messages... This may take a minute.")
# Create a new column with cleaned messages
# .apply() runs our clean_text function on every message
df['cleaned_message'] = df['message'].apply(clean_text)
print("✅ Text cleaning complete!")
# Show before and after examples
print("\nCleaning Examples:")
for i in range(3): # Show first 3 examples
    print(f"\n--- Example {i+1} ---")
    print(f"Original: '{df['message'].iloc[i]}'")
    print(f"Cleaned: '{df['cleaned_message'].iloc[i]}'")
    print(f"Label: {df['label'].iloc[i]}")
🎯 Exercise Task: Create your own test message with punctuation, numbers, and capital letters. Run it through the clean_text() function and observe the changes.
💡 What you learned: Text preprocessing is crucial for ML: it standardizes the data and removes noise.
📺 Recommended Video: “Text Preprocessing for NLP” – Krish Naik (20 mins)
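If NLTK is not available yet, the core of the cleaning pipeline can still be tried out with the standard library. This is a simplified sketch, not a replacement for clean_text: the function name simple_clean and the six-word stopword set are assumptions made for the example, and stemming is left out.

```python
import re

# Tiny illustrative stopword list (an assumption; NLTK's English list is much longer)
DEMO_STOPWORDS = {"the", "and", "you", "for", "are", "have"}

def simple_clean(text):
    """Lowercase, keep letters only, drop stopwords and words of 1-2 letters."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)   # remove digits and punctuation
    tokens = text.split()                   # whitespace tokenization
    tokens = [t for t in tokens if t not in DEMO_STOPWORDS and len(t) > 2]
    return " ".join(tokens)

print(simple_clean("FREE! Win $1000 NOW!!! Call 123-456-7890"))
print(simple_clean("You have WON the prize"))
```

Because the phone number and the dollar amount are stripped entirely, the cleaned text keeps only the words; whether that loses a useful spam signal is worth thinking about during the exercise.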
Exercise 5: Feature Engineering (20 minutes)
Step 5.1: Prepare Data for Machine Learning
# Separate features (X) and target variable (y)
X = df['cleaned_message'] # Features: the cleaned text messages
y = df['label'] # Target: spam or ham labels
# Convert text labels to numbers (required for machine learning)
# 'ham' becomes 0, 'spam' becomes 1
y_binary = y.map({'ham': 0, 'spam': 1})
print("Data Preparation Summary:")
print(f"Number of messages: {len(X)}")
print(f"Features type: {type(X)}")
print(f"Target distribution:")
print(f" Ham (0): {(y_binary == 0).sum()}")
print(f" Spam (1): {(y_binary == 1).sum()}")
Step 5.2: Convert Text to Numbers (Vectorization)
# Method 1: Count Vectorizer
# Counts how many times each word appears in each message
# Like counting ingredients in recipes
print("🔢 Method 1: Count Vectorizer")
count_vectorizer = CountVectorizer(
    max_features=3000, # Keep only top 3000 most common words
    ngram_range=(1, 2), # Use single words and pairs of words
    min_df=2 # Word must appear in at least 2 messages
)
# Transform our text data into a matrix of word counts
X_count = count_vectorizer.fit_transform(X)
print(f"Count Vectorizer shape: {X_count.shape}")
print(f"This means: {X_count.shape[0]} messages, {X_count.shape[1]} features")
# Method 2: TF-IDF Vectorizer (Term Frequency-Inverse Document Frequency)
# Not just counts, but considers how important each word is
# Common words get less weight, rare but meaningful words get more weight
print("\nMethod 2: TF-IDF Vectorizer")
tfidf_vectorizer = TfidfVectorizer(
    max_features=3000, # Keep only top 3000 most important words
    ngram_range=(1, 2), # Use single words and pairs of words
    min_df=2, # Word must appear in at least 2 messages
    max_df=0.95 # Ignore words that appear in >95% of messages
)
# Transform our text data into TF-IDF matrix
X_tfidf = tfidf_vectorizer.fit_transform(X)
print(f"TF-IDF Vectorizer shape: {X_tfidf.shape}")
# Show some example features (words/phrases the model will use)
feature_names = tfidf_vectorizer.get_feature_names_out()
print(f"\n🔤 Example features: {list(feature_names[:10])}")
print(f"🔤 More features: {list(feature_names[1000:1010])}")
Step 5.3: Understand the Vectorization Process
# Let's see how vectorization works with a simple example
example_messages = [
"free money now",
"meet for coffee",
"win free prize money"
]
# Create a simple vectorizer for demonstration
demo_vectorizer = CountVectorizer()
demo_matrix = demo_vectorizer.fit_transform(example_messages)
# Show the feature names (vocabulary)
vocab = demo_vectorizer.get_feature_names_out()
print("🎯 Vectorization Example:")
print(f"Vocabulary: {list(vocab)}")
# Convert to dense array to see the actual numbers
dense_matrix = demo_matrix.toarray()
print("\nCount Matrix:")
for i, message in enumerate(example_messages):
    print(f"'{message}' → {dense_matrix[i]}")
print("\n💡 Each number represents how many times each word appears in that message")
🎯 Exercise Task: Create your own 3 simple messages and run them through the count vectorizer. Predict what the count matrix will look like before running the code.
💡 What you learned: Vectorization converts text into numbers that machine learning algorithms can understand. TF-IDF often works better than simple counts because it down-weights words that appear in most messages.
📺 Recommended Video: “TF-IDF Explained” – Luis Serrano (12 mins)
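To make the idea of "word importance" concrete, the sketch below computes TF-IDF weights by hand for the three demo messages from Step 5.3. It assumes the smoothed IDF formula that scikit-learn documents as the TfidfVectorizer default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization; the helper names are made up for the example.

```python
import math
from collections import Counter

docs = ["free money now", "meet for coffee", "win free prize money"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of messages that contain each word
df_counts = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency: rarer words get a larger value
    return math.log((1 + n_docs) / (1 + df_counts[word])) + 1

def tfidf(tokens):
    counts = Counter(tokens)
    raw = {w: c * idf(w) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 normalization
    return {w: v / norm for w, v in raw.items()}

weights = tfidf(tokenized[2])  # "win free prize money"
for word, weight in sorted(weights.items()):
    print(f"{word}: {weight:.3f}")
```

In the last message, "win" and "prize" occur in only one message each, so they end up with larger weights than "free" and "money", which occur in two.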
Exercise 6: Model Training (15 minutes)
Step 6.1: Split Data for Training and Testing
# Split data into training (80%) and testing (20%) sets
# This is like studying for an exam (training) then taking the actual exam (testing)
# Split Count Vectorizer data
X_train_count, X_test_count, y_train, y_test = train_test_split(
    X_count, y_binary, # Features and target
    test_size=0.2, # 20% for testing
    random_state=42, # For reproducible results
    stratify=y_binary # Maintain same spam/ham ratio in both sets
)
# Split TF-IDF data (same random_state, so the rows match the split above)
X_train_tfidf, X_test_tfidf, _, _ = train_test_split(
    X_tfidf, y_binary,
    test_size=0.2,
    random_state=42,
    stratify=y_binary
)
print("Data Split Summary:")
print(f"Training set size: {X_train_count.shape[0]} messages")
print(f"Test set size: {X_test_count.shape[0]} messages")
print(f"Features per message: {X_train_count.shape[1]}")
# Check that split maintained class balance
print(f"\nTraining set spam ratio: {y_train.mean():.3f}")
print(f"Test set spam ratio: {y_test.mean():.3f}")
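Steps 5.2 and 6.1 fit the vectorizers on every message and split afterwards, so the vocabulary (and the IDF statistics) are influenced by messages that later land in the test set. A stricter variant learns the vocabulary from the training split only. The sketch below illustrates that idea with hypothetical helpers fit_vocabulary and transform written in plain Python; it is not the tutorial's code.

```python
def fit_vocabulary(messages):
    """Learn a word -> column index mapping from the training messages only."""
    vocab = sorted({word for msg in messages for word in msg.split()})
    return {word: i for i, word in enumerate(vocab)}

def transform(messages, vocab):
    """Turn each message into a count vector; words outside the vocabulary are ignored."""
    rows = []
    for msg in messages:
        row = [0] * len(vocab)
        for word in msg.split():
            if word in vocab:
                row[vocab[word]] += 1
        rows.append(row)
    return rows

train_msgs = ["free money now", "meet for coffee"]  # hypothetical training split
test_msgs = ["free coffee voucher"]                  # "voucher" never appears in training

vocab = fit_vocabulary(train_msgs)
X_train_demo = transform(train_msgs, vocab)
X_test_demo = transform(test_msgs, vocab)

print(vocab)
print(X_test_demo)
```

With scikit-learn the equivalent pattern is to split the cleaned text first, call fit_transform on the training text, and call transform on the test text.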
Step 6.2: Train Naive Bayes Models
# Naive Bayes is great for text classification
# It assumes features are independent and uses probability
print("Training Naive Bayes Models...")
# Model 1: Naive Bayes with Count Vectorizer
nb_count = MultinomialNB(alpha=1.0) # alpha is smoothing parameter
nb_count.fit(X_train_count, y_train) # Train the model
print("✅ Naive Bayes (Count) trained")
# Model 2: Naive Bayes with TF-IDF
nb_tfidf = MultinomialNB(alpha=1.0)
nb_tfidf.fit(X_train_tfidf, y_train) # Train the model
print("✅ Naive Bayes (TF-IDF) trained")
# Quick accuracy check on training data
train_accuracy_count = nb_count.score(X_train_count, y_train)
train_accuracy_tfidf = nb_tfidf.score(X_train_tfidf, y_train)
print("\nTraining Accuracies:")
print(f"Naive Bayes (Count): {train_accuracy_count:.4f}")
print(f"Naive Bayes (TF-IDF): {train_accuracy_tfidf:.4f}")
Step 6.3: Train Support Vector Machine
# SVM finds the best boundary between spam and ham messages
# It's often more accurate but takes longer to train
print("\n🎯 Training Support Vector Machine...")
# SVM with TF-IDF (usually works better than counts for SVM)
svm_model = SVC(
    kernel='linear', # Linear boundary works well for text
    probability=True, # Enable probability predictions
    random_state=42 # For reproducible results
)
# Train the SVM (this might take a minute)
print("⏳ Training SVM... (this may take a moment)")
svm_model.fit(X_train_tfidf, y_train)
print("✅ SVM trained successfully")
# Check training accuracy
svm_train_accuracy = svm_model.score(X_train_tfidf, y_train)
print(f"SVM training accuracy: {svm_train_accuracy:.4f}")
print("\nAll models trained successfully!")
🎯 Exercise Task: Which model had the highest training accuracy? Why do you think training accuracy might be different from real-world performance?
💡 What you learned: Different algorithms have different strengths. Naive Bayes is fast and works well with text. SVM often has higher accuracy but is slower to train.
📺 Recommended Video: “Machine Learning Algorithms Explained” – Zach Star (15 mins)
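To show what "assumes features are independent and uses probability" means in practice, here is a small multinomial Naive Bayes written from scratch. It is a teaching sketch under simple assumptions (four made-up training messages, whitespace tokens, unseen words ignored), not a reimplementation of MultinomialNB.

```python
import math
from collections import Counter

# Hypothetical toy training set
train = [
    ("spam", "free prize call now"),
    ("spam", "free cash prize"),
    ("ham", "call me later"),
    ("ham", "see you at lunch"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for label, text in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocab = set(word_counts["spam"]) | set(word_counts["ham"])
ALPHA = 1.0  # add-one (Laplace) smoothing, the same role alpha plays in MultinomialNB

def log_score(text, label):
    """log P(label) plus the sum of log P(word | label) with add-alpha smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in text.split():
        if word not in vocab:
            continue  # ignore words never seen in training
        score += math.log((word_counts[label][word] + ALPHA) / (total + ALPHA * len(vocab)))
    return score

def classify(text):
    # Pick the label with the higher log score
    return max(("spam", "ham"), key=lambda label: log_score(text, label))

print(classify("free prize"))
print(classify("see you later"))
```

Smoothing matters here: without ALPHA, a word that never appeared in one class would give that class a probability of zero and a log of negative infinity.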
Exercise 7: Model Evaluation (20 minutes)
Step 7.1: Make Predictions and Calculate Basic Metrics
# Now test our models on unseen data (the test set)
print("Making Predictions on Test Data...")
# Naive Bayes (Count) predictions
nb_count_pred = nb_count.predict(X_test_count)
nb_count_accuracy = accuracy_score(y_test, nb_count_pred)
# Naive Bayes (TF-IDF) predictions
nb_tfidf_pred = nb_tfidf.predict(X_test_tfidf)
nb_tfidf_accuracy = accuracy_score(y_test, nb_tfidf_pred)
# SVM predictions
svm_pred = svm_model.predict(X_test_tfidf)
svm_accuracy = accuracy_score(y_test, svm_pred)
print("Test Set Accuracies:")
print(f"Naive Bayes (Count): {nb_count_accuracy:.4f} ({nb_count_accuracy*100:.1f}%)")
print(f"Naive Bayes (TF-IDF): {nb_tfidf_accuracy:.4f} ({nb_tfidf_accuracy*100:.1f}%)")
print(f"SVM (TF-IDF): {svm_accuracy:.4f} ({svm_accuracy*100:.1f}%)")
Step 7.2: Detailed Classification Reports
# Classification report shows precision, recall, and F1-score for each class
# This is more informative than just accuracy
def show_classification_report(y_true, y_pred, model_name):
    """Display detailed classification metrics"""
    print(f"\n{'='*50}")
    print(f"{model_name} - Detailed Results")
    print(f"{'='*50}")
    # Classification report with explanation
    report = classification_report(y_true, y_pred,
                                   target_names=['Ham', 'Spam'],
                                   digits=4)
    print(report)
    print("\n💡 Metrics Explanation:")
    print("Precision: Of all messages we predicted as spam, how many were actually spam?")
    print("Recall: Of all actual spam messages, how many did we catch?")
    print("F1-Score: Harmonic mean of precision and recall")
    print("Support: Number of actual messages in each category")
# Show detailed results for each model
show_classification_report(y_test, nb_count_pred, "Naive Bayes (Count Vectorizer)")
show_classification_report(y_test, nb_tfidf_pred, "Naive Bayes (TF-IDF)")
show_classification_report(y_test, svm_pred, "Support Vector Machine")
Step 7.3: Confusion Matrices
# Confusion matrices show exactly where our models make mistakes
def plot_confusion_matrix(y_true, y_pred, model_name):
    """Create and display confusion matrix"""
    # Calculate confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Create the plot
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Ham', 'Spam'],
                yticklabels=['Ham', 'Spam'],
                cbar_kws={'label': 'Count'})
    plt.title(f'{model_name} - Confusion Matrix', fontsize=14)
    plt.xlabel('Predicted Label', fontsize=12)
    plt.ylabel('True Label', fontsize=12)
    # Add explanatory text (plain text, since default plot fonts may not render emoji)
    plt.figtext(0.02, 0.02,
                "Perfect predictions would show numbers only on the diagonal.\n" +
                "Off-diagonal numbers represent mistakes.",
                fontsize=10, ha='left')
    plt.tight_layout()
    plt.show()
    # Print interpretation
    tn, fp, fn, tp = cm.ravel() # True Neg, False Pos, False Neg, True Pos
    print(f"\n{model_name} Confusion Matrix Breakdown:")
    print(f"✅ True Negatives (Ham correctly identified): {tn}")
    print(f"❌ False Positives (Ham labeled as Spam): {fp} ← Bad! Blocks good emails")
    print(f"❌ False Negatives (Spam labeled as Ham): {fn} ← Spam gets through")
    print(f"✅ True Positives (Spam correctly identified): {tp}")
# Create confusion matrices for the two TF-IDF models
plot_confusion_matrix(y_test, nb_tfidf_pred, "Naive Bayes (TF-IDF)")
plot_confusion_matrix(y_test, svm_pred, "Support Vector Machine")
Step 7.4: Model Comparison Visualization
# Create a visual comparison of all models
models_comparison = {
'Model': ['NB (Count)', 'NB (TF-IDF)', 'SVM (TF-IDF)'],
'Accuracy': [nb_count_accuracy, nb_tfidf_accuracy, svm_accuracy]
}
comparison_df = pd.DataFrame(models_comparison)
# Create bar plot
plt.figure(figsize=(10, 6))
bars = plt.bar(comparison_df['Model'], comparison_df['Accuracy'],
color=['lightblue', 'lightgreen', 'lightcoral'],
edgecolor='black', linewidth=1)
plt.title('Model Accuracy Comparison', fontsize=16)
plt.ylabel('Accuracy Score', fontsize=12)
plt.ylim(0.85, 1.0) # Focus on the high accuracy range
# Add accuracy values on bars
for bar, accuracy in zip(bars, comparison_df['Accuracy']):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
             f'{accuracy:.4f}', ha='center', va='bottom', fontweight='bold')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
# Print summary
print("Model Performance Summary:")
best_model_idx = comparison_df['Accuracy'].idxmax()
best_model = comparison_df.loc[best_model_idx, 'Model']
best_accuracy = comparison_df.loc[best_model_idx, 'Accuracy']
print(f"Best performing model: {best_model} with {best_accuracy:.4f} accuracy")
🎯 Exercise Task: Look at the confusion matrices. Which type of error is worse for spam detection: false positives (blocking good emails) or false negatives (letting spam through)? Why?
💡 What you learned: Accuracy alone isn’t enough. In spam detection, precision and recall matter because the cost of different mistakes varies. Blocking important emails is usually worse than letting some spam through.
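The trade-off between the two kinds of mistakes can be seen with a few numbers. The sketch below computes spam-class precision, recall, and F1 by hand, then shows how raising the decision threshold on a predicted spam probability changes them. The labels, the probabilities, and the two thresholds are invented for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute spam-class precision, recall and F1 from 0/1 labels (1 = spam)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical true labels and predicted spam probabilities for eight messages
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
spam_prob = [0.95, 0.80, 0.55, 0.60, 0.30, 0.20, 0.10, 0.05]

results = {}
for threshold in (0.5, 0.7):
    # Flag a message as spam only if its probability reaches the threshold
    y_pred_demo = [1 if prob >= threshold else 0 for prob in spam_prob]
    results[threshold] = precision_recall_f1(y_true_demo, y_pred_demo)
    prec, rec, f1 = results[threshold]
    print(f"threshold={threshold}: precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

At the higher threshold no legitimate message is blocked in this toy data, but one spam message slips through: precision goes up and recall goes down.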
Exercise 8: Model Deployment and Testing (15 minutes)
Step 8.1: Choose Best Model and Create Prediction Function
# Based on our evaluation, let's choose the best model
# (Adjust this based on your results)
best_model = svm_model # Usually SVM performs best
best_vectorizer = tfidf_vectorizer
def predict_spam(message, model=best_model, vectorizer=best_vectorizer):
    """
    Predict if a single message is spam or ham
    Args:
        message (str): The email/message to classify
        model: Trained machine learning model
        vectorizer: Fitted vectorizer to convert text to numbers
    Returns:
        tuple: (prediction, confidence_score)
    """
    # Step 1: Clean the input message (same preprocessing as training)
    cleaned_message = clean_text(message)
    # Step 2: Convert to vector format the model expects
    message_vector = vectorizer.transform([cleaned_message])
    # Step 3: Make prediction (0 = ham, 1 = spam)
    prediction = model.predict(message_vector)[0]
    # Step 4: Get confidence score (probability)
    if hasattr(model, 'predict_proba'):
        probabilities = model.predict_proba(message_vector)[0]
        confidence = probabilities[prediction] # Confidence in the prediction
    else:
        confidence = 1.0 # Some models don't provide probabilities
    # Step 5: Convert prediction back to text
    result = "SPAM" if prediction == 1 else "HAM"
    return result, confidence
print("✅ Prediction function created!")
Step 8.2: Test with Real Examples
# Test our spam detector with various messages
test_messages = [
# Obvious spam examples
"Congratulations! You've won $1000! Click here to claim your prize NOW!",
"URGENT: Your account will be closed! Verify now: www.fake-bank.com",
"Get rich quick! Work from home! Earn $5000/week! No experience needed!",
"FREE MONEY! No strings attached! Call now!!!",
# Obvious ham (legitimate) examples
"Hey, are we still meeting for coffee tomorrow at 3pm?",
"The quarterly report is attached. Please review before the meeting.",
"Happy birthday! Hope you have a wonderful day!",
"Can you pick up milk on your way home?",
# Tricky examples (could go either way)
"Limited time offer on our premium software - 50% off this week only",
"Your Amazon package has been delivered to your front door",
"Don't forget about the team meeting tomorrow at 10am",
"You have 1 new voicemail message"
]
print("Testing Our Spam Detection System")
print("="*60)
for i, message in enumerate(test_messages, 1):
    prediction, confidence = predict_spam(message)
    # Format output with emojis for better readability
    if prediction == "SPAM":
        emoji = "🚨"
        color_indicator = "❌"
    else:
        emoji = "✅"
        color_indicator = "✅"
    print(f"\n{i:2d}. {emoji} {color_indicator} {prediction}")
    print(f" Confidence: {confidence:.3f} ({confidence*100:.1f}%)")
    print(f" Message: \"{message}\"")
    # Add interpretation
    if confidence > 0.8:
        certainty = "Very confident"
    elif confidence > 0.6:
        certainty = "Moderately confident"
    else:
        certainty = "Less confident"
    print(f" Model is: {certainty}")
print("\n" + "="*60)
print("🎯 Testing Complete!")
Step 8.3: Interactive Testing Function
def interactive_spam_test():
    """
    Interactive function to test custom messages
    """
    print("Interactive Spam Detector")
    print("Type 'quit' to exit")
    print("-" * 40)
    while True:
        # Get user input
        user_message = input("\nEnter a message to test: ")
        # Check if user wants to quit
        if user_message.lower() in ['quit', 'exit', 'q']:
            print("Thanks for testing! Goodbye!")
            break
        # Skip empty messages
        if not user_message.strip():
            print("⚠️ Please enter a message to test")
            continue
        # Make prediction
        try:
            prediction, confidence = predict_spam(user_message)
            # Display results
            if prediction == "SPAM":
                print(f"🚨 SPAM DETECTED! (Confidence: {confidence:.3f})")
                print("⚠️ This message has characteristics of spam")
            else:
                print(f"✅ Legitimate message (Confidence: {confidence:.3f})")
                print("This appears to be a normal message")
        except Exception as e:
            print(f"❌ Error processing message: {e}")
# Uncomment the line below to run interactive testing
# interactive_spam_test()
Step 8.4: Analyze Feature Importance
# Show which words/features are most important for spam detection
def analyze_important_features(model=best_model, vectorizer=best_vectorizer, top_n=20):
    """
    Analyze which features (words) are most important for classification
    """
    if hasattr(model, 'coef_'):
        # Get feature names (words/phrases)
        feature_names = vectorizer.get_feature_names_out()
        coefficients = model.coef_
        # A linear SVC trained on sparse TF-IDF data stores coef_ as a sparse matrix
        if hasattr(coefficients, 'toarray'):
            coefficients = coefficients.toarray()
        coefficients = np.ravel(coefficients) # Flatten to one weight per feature
        # Get top spam indicators (positive coefficients)
        spam_features = sorted(zip(coefficients, feature_names), reverse=True)[:top_n]
        # Get top ham indicators (negative coefficients)
        ham_features = sorted(zip(coefficients, feature_names))[:top_n]
        print("🚨 TOP SPAM INDICATORS (words that suggest spam):")
        print("-" * 50)
        for i, (coef, word) in enumerate(spam_features, 1):
            print(f"{i:2d}. '{word}' (importance: {coef:.4f})")
        print("\n✅ TOP HAM INDICATORS (words that suggest legitimate messages):")
        print("-" * 50)
        for i, (coef, word) in enumerate(ham_features, 1):
            print(f"{i:2d}. '{word}' (importance: {coef:.4f})")
        # Create visualization
        plt.figure(figsize=(15, 10))
        # Plot spam indicators
        plt.subplot(2, 1, 1)
        spam_words = [word for coef, word in spam_features]
        spam_scores = [coef for coef, word in spam_features]
        plt.barh(range(len(spam_words)), spam_scores, color='red', alpha=0.7)
        plt.yticks(range(len(spam_words)), spam_words)
        plt.xlabel('Feature Importance (Spam Direction)')
        plt.title(f'Top {top_n} Words That Indicate SPAM')
        plt.gca().invert_yaxis()
        # Plot ham indicators
        plt.subplot(2, 1, 2)
        ham_words = [word for coef, word in ham_features]
        ham_scores = [abs(coef) for coef, word in ham_features] # Use absolute value for visualization
        plt.barh(range(len(ham_words)), ham_scores, color='green', alpha=0.7)
        plt.yticks(range(len(ham_words)), ham_words)
        plt.xlabel('Feature Importance (Ham Direction)')
        plt.title(f'Top {top_n} Words That Indicate LEGITIMATE Messages')
        plt.gca().invert_yaxis()
        plt.tight_layout()
        plt.show()
    else:
        print("⚠️ This model doesn't provide feature importance information")
# Run the analysis
analyze_important_features()
Step 8.5: Save Your Model for Future Use
import pickle
import joblib
from datetime import datetime
# Save the trained model and vectorizer
def save_model(model, vectorizer, filename_prefix="spam_detector"):
    """
    Save the trained model and vectorizer for future use
    """
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    # Save using joblib (recommended for scikit-learn models)
    model_filename = f"{filename_prefix}_model_{timestamp}.joblib"
    vectorizer_filename = f"{filename_prefix}_vectorizer_{timestamp}.joblib"
    joblib.dump(model, model_filename)
    joblib.dump(vectorizer, vectorizer_filename)
    print(f"✅ Model saved as: {model_filename}")
    print(f"✅ Vectorizer saved as: {vectorizer_filename}")
    # Also save using pickle as backup
    with open(f"{filename_prefix}_model_{timestamp}.pkl", 'wb') as f:
        pickle.dump(model, f)
    with open(f"{filename_prefix}_vectorizer_{timestamp}.pkl", 'wb') as f:
        pickle.dump(vectorizer, f)
    print("✅ Backup pickle files also created")
    return model_filename, vectorizer_filename
# Save your best model
model_file, vectorizer_file = save_model(best_model, best_vectorizer)
# Function to load saved model
def load_model(model_filename, vectorizer_filename):
    """
    Load a previously saved model and vectorizer
    """
    model = joblib.load(model_filename)
    vectorizer = joblib.load(vectorizer_filename)
    print(f"✅ Model loaded from: {model_filename}")
    print(f"✅ Vectorizer loaded from: {vectorizer_filename}")
    return model, vectorizer
print("\nYour spam detection system is now complete and saved!")
print("You can use the saved files to deploy your model in production.")
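As a quick check that saving and loading behave as expected, the round trip can be tried on a small stand-in object first. The sketch below uses pickle with a temporary directory; toy_model is a hypothetical placeholder, not a trained scikit-learn model.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: any picklable Python object works the same way
toy_model = {"spam_words": ["free", "prize"], "threshold": 0.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    # Write the object to disk
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    # Read it back into a new object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_model)
```

Pickle and joblib files can execute code when loaded, so only load model files that you created yourself or that come from a source you trust.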