Unlocking Hidden Patterns: A Practical Guide to Customer Segmentation and Market Basket Analysis
Welcome, data enthusiasts! Ever scrolled through an online store and seen a "Frequently Bought Together" section that seems to read your mind? Or received a marketing email that feels perfectly tailored to you? That’s not magic; it's the power of unsupervised machine learning.
Unlike supervised learning, where we have a clear target to predict (like sales figures), unsupervised learning is about exploring data without a predefined outcome. It's about letting the data tell its own story by finding hidden structures and relationships. Today, we're going to dive into a Python project that showcases two powerhouse techniques in this domain: Association Rule Mining for market basket analysis and K-Means Clustering for customer segmentation.
Let's get started!
The Foundation: Loading and Cleaning Our Retail Data
First things first: we can't build a house on a shaky foundation. Our project begins by loading a real-world dataset called "Online Retail" from the UCI Machine Learning Repository.
Like most real-world data, it needs a little tidying up. We:
Removed any extra spaces in the item descriptions.
Dropped rows where essential information like InvoiceNo or CustomerID was missing.
Filtered out credit transactions (invoices containing 'C'), as they represent returns, not purchases.
Ensured we only looked at transactions with a positive quantity and unit price.
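The cleaning steps above can be sketched in a few lines of pandas. This is a minimal stand-in: the tiny DataFrame below is made up for illustration, but the column names follow the UCI "Online Retail" schema used in the project.

```python
import pandas as pd

# Tiny stand-in for the Online Retail data (values are hypothetical).
df = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380", None],
    "CustomerID": [17850.0, 14527.0, None, 13047.0],
    "Description": [" WHITE HANGING HEART ", "Discount", "JAM JAR", "MUG"],
    "Quantity": [6, -1, 4, 2],
    "UnitPrice": [2.55, 27.50, 1.85, 0.00],
})

df["Description"] = df["Description"].str.strip()      # trim stray spaces
df = df.dropna(subset=["InvoiceNo", "CustomerID"])     # drop incomplete rows
df["InvoiceNo"] = df["InvoiceNo"].astype(str)
df = df[~df["InvoiceNo"].str.contains("C")]            # remove credit notes
df = df[(df["Quantity"] > 0) & (df["UnitPrice"] > 0)]  # keep real purchases

print(df)  # only the first row survives all four filters
```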
With our data sparkling clean, we can move on to the exciting part.
Who Buys What? Association Rules with the Apriori Algorithm
Our first goal is to uncover which products are frequently purchased together. This is called Market Basket Analysis. For this demonstration, we focused our analysis on transactions from France to keep the computation manageable.
1. Creating the Baskets: We restructured the data into a "basket" format. Each row represents a single invoice (a transaction), and each column represents a product.
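The pivot into basket format is a one-liner with `groupby` and `unstack`. The mini transaction log below is invented for illustration; in the project this runs on the cleaned France subset.

```python
import pandas as pd

# Hypothetical mini transaction log (two invoices, three products).
tx = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1002", "1002"],
    "Description": ["RED CLOCK", "GREEN CLOCK",
                    "RED CLOCK", "JAM JAR", "GREEN CLOCK"],
    "Quantity": [2, 1, 3, 6, 1],
})

# One row per invoice, one column per product.
basket = (tx.groupby(["InvoiceNo", "Description"])["Quantity"]
            .sum().unstack(fill_value=0))
basket = basket > 0  # boolean flags: does this basket contain the item?

print(basket)
```

The boolean encoding matters: the apriori implementation only cares whether an item is in a basket, not how many units were bought.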
2. Finding Frequent Pairs with Apriori:
We used the powerful apriori algorithm to find "frequent itemsets"—groups of items that appear together often. The min_support parameter was set to 0.07, meaning we only considered itemsets that appeared in at least 7% of all French transactions.
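In the project this step is a single call to mlxtend's `apriori(basket, min_support=0.07, use_colnames=True)`. To keep this sketch dependency-free, here is the same "frequent itemset" idea computed directly for singles and pairs on a few made-up baskets:

```python
from itertools import combinations

# Four hypothetical baskets; min_support of 0.5 means an itemset must
# appear in at least 2 of the 4 baskets to count as "frequent".
baskets = [
    {"RED CLOCK", "GREEN CLOCK"},
    {"RED CLOCK", "GREEN CLOCK", "JAM JAR"},
    {"RED CLOCK"},
    {"JAM JAR"},
]
min_support = 0.5
n = len(baskets)

items = sorted(set().union(*baskets))
support = {}
for size in (1, 2):
    for itemset in combinations(items, size):
        count = sum(set(itemset) <= b for b in baskets)  # baskets containing it
        if count / n >= min_support:
            support[itemset] = count / n

print(support)  # frequent singles plus the frequent clock pair
```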
3. Generating the Rules:
From these frequent itemsets, we generated association rules, each scored on three metrics:
Support: The percentage of transactions that contain a particular itemset.
Confidence: The likelihood of buying item B if you've already bought item A. Mathematically, it's Confidence(A → B) = Support(A ∪ B) / Support(A).
Lift: This is the most interesting one. It tells us how much more likely you are to buy item B given that you've purchased item A, relative to B's baseline popularity: Lift(A → B) = Confidence(A → B) / Support(B). A lift greater than 1 suggests a strong association; a lift of 8 means the items are bought together 8 times more often than chance would predict!
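The three metrics fall out of simple counts. The numbers below are made up purely to show the arithmetic (they are not the project's actual counts):

```python
# Hypothetical counts for a rule A -> B (e.g. red clock -> green clock).
n_total = 200  # transactions in the subset
n_A = 20       # transactions containing A
n_B = 18       # transactions containing B
n_AB = 16      # transactions containing both A and B

support_AB = n_AB / n_total          # 0.08 -> clears a 7% support threshold
confidence = n_AB / n_A              # P(B | A) = 0.8
lift = confidence / (n_B / n_total)  # 0.8 / 0.09, roughly 8.9

print(support_AB, confidence, lift)
```

A lift near 9 like this hypothetical one says the pairing is far more common than the two items' individual popularities would explain.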
The Results:
The code produced a list of the top 5 rules, sorted by lift.
This is a fantastic insight! Customers who buy the red "Bakelike" alarm clock are over 8.5 times more likely to also buy the green one, and vice versa. An e-commerce manager could use this to create a product bundle, recommend the green clock on the red clock's product page, or even strategically place them together in marketing emails.
Who Are Our Customers? Segmentation with RFM and K-Means
Next, we shift our focus from products to people. The goal of customer segmentation is to group customers into distinct clusters based on their behavior, allowing for more targeted marketing.
1. Engineering RFM Features:
We used a popular marketing technique called RFM analysis, which scores each customer on three dimensions:
Recency (R): How many days ago was their last purchase?
Frequency (F): How many unique transactions have they made?
Monetary (M): What is the total amount of money they have spent?
To calculate recency, we first established a "snapshot date," which was one day after the last transaction in the dataset.
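The RFM table is a single `groupby` aggregation. The five-row log below is invented for illustration; the column names match the Online Retail schema.

```python
import pandas as pd

# Hypothetical purchases by two customers.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2],
    "InvoiceNo": ["A1", "A2", "B1", "B2", "B3"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-11-20", "2011-12-08", "2011-12-05", "2011-10-01"]),
    "Quantity": [2, 1, 3, 1, 4],
    "UnitPrice": [5.0, 10.0, 2.0, 8.0, 1.0],
})
tx["Total"] = tx["Quantity"] * tx["UnitPrice"]

# Snapshot date: one day after the last transaction in the data.
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("InvoiceNo", "nunique"),   # unique invoices, not line items
    Monetary=("Total", "sum"),
)
print(rfm)
```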
2. Prepping for Clustering: Before clustering, we did two more things. We removed statistical outliers to prevent extreme values from skewing the results, and we scaled the RFM features with StandardScaler so that no single feature dominates the distance calculations.
3. Finding the Right Number of Clusters (k): How many customer groups should we create? We used the Elbow Method to find out: we ran K-Means for k values from 1 to 10 and plotted the Sum of Squared Errors (SSE) for each k. The resulting plot clearly showed an elbow around 3 or 4, so we chose 3 for our analysis.
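The elbow loop looks like this with scikit-learn. The synthetic data below stands in for the scaled, outlier-filtered RFM table; the three group centers are made up so that the elbow lands at k = 3.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic RFM-like data with three deliberately separated groups.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal([10, 20, 500], 5, size=(50, 3)),   # recent, frequent, big spenders
    rng.normal([60, 5, 100], 5, size=(50, 3)),    # moderately active
    rng.normal([250, 1, 20], 5, size=(50, 3)),    # long inactive
])
X = StandardScaler().fit_transform(X)

sse = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    sse[k] = km.inertia_  # sum of squared distances to the nearest centroid

print(sse)  # SSE drops sharply up to k = 3, then levels off: the "elbow"
```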
4. Analyzing the Segments:
After running K-Means with k=3, we profiled each cluster by its average Recency, Frequency, and Monetary value.
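The fit-and-profile step can be sketched as follows, again on synthetic RFM-like data (the real run uses the cleaned RFM table; all numbers here are invented):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Three made-up behavioral groups in (Recency, Frequency, Monetary) space.
rng = np.random.default_rng(0)
rfm = pd.DataFrame(np.vstack([
    rng.normal([8, 15, 800], [2, 3, 50], size=(40, 3)),     # recent big spenders
    rng.normal([50, 5, 200], [10, 1, 30], size=(40, 3)),    # steady regulars
    rng.normal([247, 1, 40], [30, 0.5, 10], size=(40, 3)),  # long inactive
]), columns=["Recency", "Frequency", "Monetary"])

# Cluster on the scaled features, then average the raw RFM values per
# cluster so each segment gets an interpretable profile.
X = StandardScaler().fit_transform(rfm)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
profile = rfm.assign(Cluster=labels).groupby("Cluster").mean().round(1)

print(profile)
```

Reading the profile table is the interpretation step: the cluster with the lowest average Recency and highest Monetary is the "Champions" segment, and so on.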
Let’s interpret the results ourselves:
Note: The initial interpretation printed by the script seems to have misaligned the cluster numbers with their descriptions.
Our analysis reveals three distinct groups:
Cluster 2 (Champions): These are the stars. They buy very recently, very frequently, and spend the most. They deserve VIP treatment!
Cluster 1 (Loyal Customers): This is a healthy group of active customers. They could be nurtured with loyalty programs to become Champions.
Cluster 0 (At-Risk): This group is a concern. They haven't purchased in a very long time (average of 247 days) and have low frequency and spending. A targeted re-engagement campaign ("We miss you!") might be in order.
Conclusion
In just one script, we've gone from a raw transaction log to actionable business intelligence. We discovered product associations that can boost sales and identified customer segments that allow for precise, personalized marketing. This is the incredible power of unsupervised learning—finding the hidden story in the data.
This blog presents key insights from our project for the ‘Machine Learning’ course (MBA 2024–26, 4th trimester) at Amrita School of Business, Coimbatore, under the guidance of Dr. Prashobhan Palakkel.