Predicting the Future: A Hands-On Tour of Supervised Machine Learning Models

What if you could predict whether a patient has heart disease based on their clinical data? Or forecast the price of a stock? This isn't science fiction; it's the domain of supervised machine learning.

Supervised learning is like teaching a student with a textbook full of questions and answers. We provide the algorithm with labeled data—features (the questions) and a known outcome or target (the answer)—and it learns the relationship between them. Today, we'll walk through a Python project that uses the famous Cleveland Heart Disease dataset to train and compare a whole suite of classification models to predict the presence of heart disease.

Step 1: Setting the Stage with Data Preparation

Every great machine learning project starts with meticulous data preparation. Our dataset comes from the UCI repository, but it needs some work before our models can learn from it.

  • Loading and Labeling: The raw data file has no header row, so we loaded it with pandas and manually assigned the correct column names ('age', 'sex', 'chol', and so on). The data also uses '?' to mark missing values, which we told pandas to treat as such.

  • Cleaning: The dataset was remarkably clean, with only a few missing values. For simplicity, we dropped the 6 rows that had missing data.

  • Defining the Target: The original 'target' column ranged from 0 (no disease) to 4 (varying levels of disease). To make this a clear, binary classification problem, we simplified it: 0 still means no disease, but any value greater than 0 was converted to 1, indicating the presence of heart disease.

  • Splitting and Scaling: This is a critical step. We split our data into a training set (80%) and a testing set (20%). The model learns from the training set, and its performance is evaluated on the unseen testing set. We also scaled our features using StandardScaler. Scaling ensures all our features are on a comparable range, which is very important for models like Logistic Regression, KNN, and SVM.

    With our data ready, it's time to bring in the models.
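The preparation steps above can be sketched in a few lines of pandas and scikit-learn. The column list matches the standard Cleveland dataset, but the inline sample rows below are synthetic stand-ins for the real UCI file ("processed.cleveland.data"), so treat this as an illustration of the pipeline rather than the project's exact script:

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic rows standing in for the headerless UCI file
# (processed.cleveland.data); '?' marks a missing value.
raw = io.StringIO(
    "63,1,1,145,233,1,2,150,0,2.3,3,0,6,0\n"
    "67,1,4,160,286,0,2,108,1,1.5,2,3,3,2\n"
    "67,1,4,120,229,0,2,129,1,2.6,2,2,7,1\n"
    "37,1,3,130,250,0,0,187,0,3.5,3,0,3,0\n"
    "41,0,2,130,204,0,2,172,0,1.4,1,0,3,0\n"
    "56,1,2,120,236,0,0,178,0,0.8,1,0,3,0\n"
    "62,0,4,140,268,0,2,160,0,3.6,3,2,3,3\n"
    "57,0,4,120,354,0,0,163,1,0.6,1,0,3,0\n"
    "63,1,4,130,254,0,2,147,0,1.4,2,1,7,2\n"
    "53,1,4,140,203,1,2,155,1,3.1,3,?,7,1\n"
)

columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Load with explicit column names, treating '?' as missing.
df = pd.read_csv(raw, header=None, names=columns, na_values="?")

# Drop the handful of rows with missing values.
df = df.dropna()

# Binarize the target: 0 stays 0, any value > 0 becomes 1.
df["target"] = (df["target"] > 0).astype(int)

X, y = df.drop(columns="target"), df["target"]

# 80/20 split, then scale; the scaler is fit on the training set only
# so no information leaks from the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Note the order of operations: the scaler is fit on the training split alone and merely applied to the test split, which mirrors how the model will see genuinely unseen data.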

A Quick Detour: Linear Regression

Before diving into our main classification task, the project includes a small regression example. It attempts to predict a person's maximum heart rate ('thalach') using just their 'age' and resting blood pressure ('trestbps'). The model was evaluated using two metrics:

  • Mean Squared Error (MSE): The average of the squared differences between the predicted and actual values. The model had an MSE of 413.46.

  • R-squared (R²): This tells us the proportion of the variance in the target that's predictable from the features. The result was an R² of 0.08, which is very low: age and resting blood pressure alone are not good predictors of maximum heart rate. This is a valuable lesson: not all relationships are strong, and it's important to evaluate your model to understand its limitations.
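This detour can be sketched as below. The DataFrame here is synthetic stand-in data (the real figures came from the Cleveland dataset), so its MSE and R² will differ from the numbers reported above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned heart-disease DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(29, 78, size=200),
    "trestbps": rng.integers(94, 200, size=200),
})
# 'thalach' loosely tied to age, plus noise (illustrative only).
df["thalach"] = 205 - 0.6 * df["age"] + rng.normal(0, 15, size=200)

X = df[["age", "trestbps"]]
y = df["thalach"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit ordinary least squares and score it on the held-out split.
reg = LinearRegression().fit(X_train, y_train)
pred = reg.predict(X_test)

mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"MSE: {mse:.2f}, R2: {r2:.2f}")
```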

The Main Event: A Battle of Classification Models

Now, for our primary goal: predicting the presence of heart disease. We trained seven popular classification models (six algorithms, with the SVM tested using two different kernels) to see which would perform best on our dataset.
The contenders are:

  1. Logistic Regression: A reliable baseline model that predicts probabilities.

  2. K-Nearest Neighbors (KNN): Classifies a data point based on the majority class of its 'neighbors'.

  3. Support Vector Machine (SVM): Finds the optimal boundary (hyperplane) to separate the classes. We tested both a linear and a non-linear rbf kernel.

  4. Naive Bayes: A probabilistic classifier based on Bayes' Theorem with a "naive" assumption of feature independence.

  5. Decision Tree: A flowchart-like model of decisions and their possible consequences.

  6. Random Forest: An "ensemble" model that builds many decision trees and merges their results for a more robust prediction.

Each model was trained on the scaled training data and then evaluated on the unseen scaled test data. We used three key metrics for evaluation:

  • Accuracy: What percentage of predictions were correct?

  • Precision: Of all the patients the model flagged as having heart disease, how many actually did? (TP / (TP + FP))

  • Recall: Of all the patients who truly have heart disease, how many did the model correctly identify? (TP / (TP + FN))
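The train-and-evaluate loop can be sketched as below. The data is a synthetic binary-classification stand-in (via `make_classification`), so the printed scores are illustrative, not the project's actual results:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled heart-disease features.
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The seven contenders: six algorithms, SVM with two kernels.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (rbf)": SVC(kernel="rbf"),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    pred = model.predict(X_test_scaled)
    results[name] = (accuracy_score(y_test, pred),
                     precision_score(y_test, pred, zero_division=0),
                     recall_score(y_test, pred, zero_division=0))
    print(f"{name:20s} acc={results[name][0]:.3f} "
          f"prec={results[name][1]:.3f} rec={results[name][2]:.3f}")
```

Because every model shares the same fit/predict interface in scikit-learn, a single loop over a dictionary is enough to run the whole comparison.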

The Verdict: And the Winner Is...

After running all the models, the results were printed for comparison. Let's summarize them in a table:

[Results table: accuracy, precision, and recall for each of the seven models]

The results are fascinating! The Naive Bayes model emerged as the top performer with an impressive Accuracy of 91.67% and a stunning Precision of 95.24%. This means that when the Naive Bayes model predicts someone has heart disease, it's correct over 95% of the time!

This is a wonderful reminder that in machine learning, newer and more complex models (like Random Forest) aren't always superior. Sometimes, a simpler, classic algorithm like Naive Bayes can be the perfect tool for the job.

Conclusion

This journey through a supervised learning project highlights a complete workflow: from understanding and cleaning the data to transforming the target variable, training a diverse set of models, and finally, using clear metrics to evaluate and compare their performance. By systematically testing multiple approaches, we were able to build a highly accurate model for a critical real-world problem.

This blog presents key insights from our project for the ‘Machine Learning’ course (MBA 2024–26, 4th trimester) at Amrita School of Business, Coimbatore, under the guidance of Dr. Prashobhan Palakkel.
