Unlocking Hidden Patterns: A Practical Guide to Customer Segmentation and Market Basket Analysis

Welcome, data enthusiasts! Ever scrolled through an online store and seen a "Frequently Bought Together" section that seems to read your mind? Or received a marketing email that feels perfectly tailored to you? That’s not magic; it's the power of unsupervised machine learning.

Unlike supervised learning, where we have a clear target to predict (like sales figures), unsupervised learning is about exploring data without a predefined outcome. It's about letting the data tell its own story by finding hidden structures and relationships. Today, we're going to dive into a Python project that showcases two powerhouse techniques in this domain: Association Rule Mining for market basket analysis and K-Means Clustering for customer segmentation.

Let's get started!

The Foundation: Loading and Cleaning Our Retail Data

First things first, we can't build a house on a shaky foundation. Our project begins by loading a real-world dataset called "Online Retail" from the UCI Machine Learning Repository. This dataset is a treasure trove of transactional data from a UK-based online retailer.

Like most real-world data, it needs a little tidying up. We performed several key cleaning steps:

  • Removed any extra spaces in the item descriptions.

  • Dropped rows where essential information like InvoiceNo or CustomerID was missing.

  • Filtered out credit transactions (invoices containing 'C'), as they represent returns, not purchases.

  • Ensured we only looked at transactions with a positive quantity and unit price.
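The cleaning steps above can be sketched in pandas. The tiny DataFrame below is a synthetic stand-in for the Online Retail file (column names follow the UCI schema):

```python
import pandas as pd

# Synthetic stand-in for the Online Retail data (column names from the UCI schema)
df = pd.DataFrame({
    "InvoiceNo": ["536365", "C536366", "536367", "536368", None],
    "CustomerID": [17850.0, 17850.0, None, 13047.0, 13047.0],
    "Description": [" WHITE HANGING HEART ", "RED MUG", "BLUE VASE", "RED MUG ", "GREEN BOWL"],
    "Quantity": [6, -1, 4, 0, 3],
    "UnitPrice": [2.55, 3.39, 1.25, 3.39, 4.95],
})

# 1. Trim stray whitespace in item descriptions
df["Description"] = df["Description"].str.strip()

# 2. Drop rows missing InvoiceNo or CustomerID
df = df.dropna(subset=["InvoiceNo", "CustomerID"])

# 3. Remove credit transactions (invoice numbers containing 'C' are returns)
df = df[~df["InvoiceNo"].astype(str).str.contains("C")]

# 4. Keep only positive quantities and unit prices
df = df[(df["Quantity"] > 0) & (df["UnitPrice"] > 0)]
```

Only the first synthetic row survives all four filters, which is exactly the behaviour we want before any analysis.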

With our data sparkling clean, we can move on to the exciting part.

Who Buys What? Association Rules with the Apriori Algorithm

Our first goal is to uncover which products are frequently purchased together. This is called Market Basket Analysis. For this demonstration, we focused our analysis on transactions from France to keep the computation manageable.

1. Creating the Baskets: We restructured the data into a "basket" format. Each row represents a single invoice (a transaction), and each column represents a product. If an item was in a transaction, we marked it with a 1; otherwise, it was a 0. We also removed the 'POSTAGE' item since it's a shipping charge, not a product.
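A minimal sketch of the basket construction, using a toy set of line items in place of the France subset:

```python
import pandas as pd

# Toy line items; each (InvoiceNo, Description) pair is one purchased product
lines = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1002", "1002"],
    "Description": ["RED CLOCK", "POSTAGE", "RED CLOCK", "GREEN CLOCK", "POSTAGE"],
    "Quantity": [2, 1, 1, 1, 1],
})

# One row per invoice, one column per product, quantities summed
basket = (lines.groupby(["InvoiceNo", "Description"])["Quantity"].sum()
               .unstack(fill_value=0))

# One-hot encode: any positive quantity becomes 1, everything else 0
basket = (basket > 0).astype(int)

# POSTAGE is a shipping charge, not a product
basket = basket.drop(columns="POSTAGE")
```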

2. Finding Frequent Pairs with Apriori:

We used the powerful apriori algorithm to find "frequent itemsets"—groups of items that appear together often. The min_support parameter was set to 0.07, meaning we only considered itemsets that appeared in at least 7% of all French transactions.

3. Generating the Rules:

From these frequent itemsets, we generated association rules. To understand these rules, we need three key metrics:

  • Support: The percentage of transactions that contain a particular itemset.

  • Confidence: The likelihood of buying item B if you've already bought item A. Mathematically, Confidence(A → B) = Support(A & B) / Support(A).

  • Lift: This is the most interesting one. It tells us how much more likely you are to buy item B given that you've purchased item A, compared with the baseline chance of buying B: Lift(A → B) = Support(A & B) / (Support(A) × Support(B)). A lift greater than 1 indicates a positive association. A lift of 8 means a customer who buys item A is 8 times more likely to also buy item B than a random customer!

The Results: The code produced a list of the top 5 rules, sorted by lift. Let’s look at the top two:

| Antecedents (If you buy...) | Consequents (...then you buy) | Lift |
| --- | --- | --- |
| (ALARM CLOCK BAKELIKE RED) | (ALARM CLOCK BAKELIKE GREEN) | 8.57 |
| (ALARM CLOCK BAKELIKE GREEN) | (ALARM CLOCK BAKELIKE RED) | 8.57 |

This is a fantastic insight! Customers who buy a red bakelike alarm clock are over 8.5 times more likely to also buy the green one, and vice versa. An e-commerce manager could use this to create a product bundle, recommend the green clock on the red clock's product page, or even strategically place them together in marketing emails.

Who Are Our Customers? Segmentation with RFM and K-Means

Next, we shift our focus from products to people. The goal of customer segmentation is to group customers into distinct clusters based on their behavior, allowing for more targeted marketing.

1. Engineering RFM Features:

We used a popular marketing technique called RFM analysis. We engineered three new features for each customer:

  • Recency (R): How many days ago was their last purchase?

  • Frequency (F): How many unique transactions have they made?

  • Monetary (M): What is the total amount of money they have spent?

To calculate recency, we first established a "snapshot date," which was one day after the last transaction in the dataset.
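The RFM construction can be sketched like this; the transactions and column names are synthetic stand-ins for the real dataset:

```python
import pandas as pd

# Synthetic transactions (columns assumed to match the Online Retail schema)
df = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2],
    "InvoiceNo": ["A1", "A2", "B1", "B2", "B3"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-08", "2011-11-01", "2011-11-20", "2011-12-05"]),
    "Quantity": [2, 1, 3, 1, 2],
    "UnitPrice": [5.0, 10.0, 2.0, 8.0, 4.0],
})
df["TotalPrice"] = df["Quantity"] * df["UnitPrice"]

# Snapshot date: one day after the last transaction in the data
snapshot = df["InvoiceDate"].max() + pd.Timedelta(days=1)

# R: days since last purchase, F: unique invoices, M: total spend
rfm = df.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("TotalPrice", "sum"),
)
```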

2. Prepping for Clustering: Before clustering, we did two more things. We removed statistical outliers to prevent extreme values from skewing the results. Then, we scaled the data using StandardScaler. This is crucial because K-Means is a distance-based algorithm; scaling ensures that one feature (like Monetary Value) doesn't dominate the others (like Frequency).
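A minimal sketch of both prep steps, assuming a 1.5×IQR outlier rule (the post doesn't state which cutoff the project used):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy RFM table (values illustrative only)
rfm = pd.DataFrame({
    "Recency": [5, 40, 200, 2, 1000],
    "Frequency": [12, 3, 1, 20, 1],
    "Monetary": [5000.0, 900.0, 300.0, 8000.0, 150.0],
})

# Drop rows falling outside 1.5 * IQR on any feature (an assumed cutoff)
q1, q3 = rfm.quantile(0.25), rfm.quantile(0.75)
iqr = q3 - q1
mask = ~((rfm < q1 - 1.5 * iqr) | (rfm > q3 + 1.5 * iqr)).any(axis=1)
rfm_clean = rfm[mask]

# Standardize so no feature dominates the Euclidean distances K-Means uses
scaled = StandardScaler().fit_transform(rfm_clean)
```

After scaling, every column has mean 0 and unit variance, so Monetary Value no longer swamps Frequency in the distance computation.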

3. Finding the Right Number of Clusters (k): How many customer groups should we create? We used the Elbow Method to find out. We ran the K-Means algorithm for a range of k values (from 1 to 10) and plotted the Sum of Squared Errors (SSE) for each. The "elbow" of the curve, where the rate of SSE decrease slows down, suggests an optimal k. The resulting plot clearly showed an elbow around 3 or 4, so we chose 3 for our analysis.
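The elbow loop can be sketched as follows, on synthetic blobs standing in for the scaled RFM matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three well-separated blobs standing in for the scaled RFM features
X = np.vstack([rng.normal(c, 0.3, size=(50, 3)) for c in (0.0, 3.0, 6.0)])

sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse.append(km.inertia_)  # sum of squared distances to the closest centroid

# SSE always falls as k grows; the "elbow" is where the drop flattens out
```

Plotting `sse` against `k` (e.g. with matplotlib) makes the elbow visible; here the synthetic data has three blobs, so the curve flattens sharply after k=3.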

4. Analyzing the Segments:

After running K-Means with k=3, we analyzed the average Recency, Frequency, and Monetary value for each of the three clusters.
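A sketch of this final step, run on synthetic RFM values (illustrative only, not the project's real numbers):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic customers drawn around three rough RFM profiles
rfm = pd.DataFrame(
    np.vstack([
        rng.normal([250, 1.5, 430], [20, 0.5, 50], size=(40, 3)),
        rng.normal([45, 3.0, 1000], [10, 1.0, 100], size=(40, 3)),
        rng.normal([20, 12.0, 5500], [5, 2.0, 400], size=(40, 3)),
    ]),
    columns=["Recency", "Frequency", "Monetary"],
)

# Scale, cluster with k=3, and attach the labels
scaled = StandardScaler().fit_transform(rfm)
rfm["Cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)

# Profile each segment by its average R, F, and M
profile = rfm.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean()
```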

Let’s interpret the results ourselves:

| Cluster | Avg. Recency (days) | Avg. Frequency (transactions) | Avg. Monetary Value | Our Interpretation |
| --- | --- | --- | --- | --- |
| 0 | 247.72 | 1.48 | 434.98 | At-Risk / Lapsed Customers |
| 1 | 45.74 | 3.11 | 1029.10 | Loyal / Mid-Tier Customers |
| 2 | 21.15 | 12.66 | 5542.65 | Champions / High-Value Customers |

Note: The initial interpretation printed by the script seems to have misaligned the cluster numbers with their descriptions. Our analysis above correctly matches the data to the segment profiles.

Our analysis reveals three distinct groups:

  • Cluster 2 (Champions): These are the stars. They buy very recently, very frequently, and spend the most. They deserve VIP treatment!

  • Cluster 1 (Loyal Customers): This is a healthy group of active customers. They could be nurtured with loyalty programs to become Champions.

  • Cluster 0 (At-Risk): This group is a concern. They haven't purchased in a very long time (average of 247 days) and have low frequency and spending. A targeted re-engagement campaign ("We miss you!") might be in order.

Conclusion

In just one script, we've gone from a raw transaction log to actionable business intelligence. We discovered product associations that can boost sales and identified customer segments that allow for precise, personalized marketing. This is the incredible power of unsupervised learning—finding the hidden story in the data.


This blog presents key insights from our project for the ‘Machine Learning’ course (MBA 2024–26, 4th trimester) at Amrita School of Business, Coimbatore, under the guidance of Dr. Prashobhan Palakkel.
