"Decoding Soccer Club Rankings: A Multivariate Analysis" - NANDA GOPAN

1.     Introduction

1.1 Multivariate Analysis

Multivariate analysis is a statistical methodology employed to examine relationships, patterns, and dependencies among multiple variables within a dataset. This approach encompasses various techniques, including Principal Component Analysis (PCA) for dimensionality reduction, Cluster Analysis for object segmentation, and Regression Models for predictive analysis. These techniques are widely utilized across diverse industries such as finance, healthcare, and marketing, facilitating complex data-driven insights. By leveraging multivariate analysis, organizations can enhance decision-making processes and improve strategic planning through a more comprehensive understanding of data relationships.


     1.2  Project Overview

This report presents a multivariate analysis of soccer club rankings. Multivariate analysis plays a crucial role in extracting insights from complex datasets, allowing for a deeper understanding of the factors influencing club performance. By applying these techniques, we can identify key performance indicators, analyze patterns in team rankings, and assess the impact of various attributes such as player statistics, financial investments, and match outcomes. This project utilizes multivariate methods within a real-world soccer ranking scenario to uncover valuable insights that can inform strategic decisions for clubs and analysts.


      1.3  Objectives

The objectives of the project are to:

·       Analyze the structure, distribution, and key statistics of the soccer club ranking dataset.

·       Use correlation analysis to understand dependencies between ranking factors such as points, previous scores, and yearly changes.

·       Implement multiple regression to predict club point scores based on historical data.

·       Show how statistical data can enhance decision making in sports analytics.

      1.4 Dataset Description

The dataset contains the following variables:

·        ranking: The overall rank of the soccer club based on its performance metrics.

·        point.score: Total score assigned to the club, showing its competitive standing.

·        1 year change: Difference in the club’s ranking points over the past year.

·        previous.points.scored: Total points the club had in the previous period.

·        club.name: Official name of the club.

·        country: Country where the club is based.

·        X1.year.change: An alternative notation for the one-year change in ranking points.

·        Symbol.change: Categorical indicator showing whether the club’s ranking has increased, decreased or remained unchanged.

       1.5  Libraries Used







      1.6  Data Cleaning:

 Steps:

·       Identify Numeric and Non-Numeric Columns: The dataset contains both numerical (e.g., rankings, scores) and categorical (e.g., club names, countries) data.

·       Filter Only Numeric Columns: The cleaning function keeps only the columns with numerical data types and drops the non-numeric ones.
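The two steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used in the project; the sample rows and the helper names `numeric_columns` and `keep_numeric` are assumptions made for the example. With pandas, `DataFrame.select_dtypes(include="number")` performs the same kind of filtering.

```python
# Hypothetical rows standing in for the ranking dataset (values are made up).
rows = [
    {"ranking": 1, "club.name": "Club A", "country": "Spain", "point.score": 2050.0},
    {"ranking": 2, "club.name": "Club B", "country": "England", "point.score": 2011.5},
    {"ranking": 3, "club.name": "Club C", "country": "Germany", "point.score": 1987.0},
]


def numeric_columns(records):
    """Return the names of columns whose values are all int or float."""
    names = list(records[0].keys())
    return [
        name
        for name in names
        if all(
            isinstance(r[name], (int, float)) and not isinstance(r[name], bool)
            for r in records
        )
    ]


def keep_numeric(records):
    """Drop every non-numeric column from each record."""
    keep = numeric_columns(records)
    return [{name: r[name] for name in keep} for r in records]


numeric_rows = keep_numeric(rows)
print(numeric_columns(rows))
print(numeric_rows[0])
```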





2.     Data Exploration

 2.1 Descriptive Statistics

Descriptive statistics summarize the dataset’s main features by providing key numerical insights such as mean, median, standard deviation, and range of variables. These statistics help in understanding the central tendency, dispersion, and distribution of the data. Additionally, measures like skewness and kurtosis provide insights into the shape of the data distribution. Descriptive statistics are essential for detecting outliers, missing values, and data inconsistencies, which can impact further analysis and modeling.
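As a rough illustration of these measures, the sketch below computes them for a single numeric column with the standard library. The score values are made up, and the function name `describe` is an assumption for this example; skewness and kurtosis use the simple moment-based (population) definitions, which may differ slightly from what a given statistics package reports.

```python
import statistics

# Hypothetical point scores (made-up values, not taken from the dataset).
scores = [1500.0, 1620.0, 1580.0, 1710.0, 1490.0, 2050.0]


def describe(values):
    """Summarize central tendency, spread and shape of a numeric list."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    # Moment-based skewness and excess kurtosis (population versions).
    skew = sum((v - mean) ** 3 for v in values) / (n * sd ** 3)
    kurt = sum((v - mean) ** 4 for v in values) / (n * sd ** 4) - 3.0
    return {
        "mean": mean,
        "median": statistics.median(values),
        "std": sd,
        "range": max(values) - min(values),
        "skewness": skew,
        "excess_kurtosis": kurt,
    }


summary = describe(scores)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A positive skewness, as in this made-up sample, points to a long right tail: a few clubs with much higher scores than the rest.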


















 2.2  Correlation Analysis    

Visualizing correlations between numerical features helps identify strong positive or negative relationships among variables. This analysis is essential for feature selection, as it highlights multicollinearity, which can affect model accuracy. Understanding these relationships allows us to identify the most impactful variables for predictions while eliminating redundant or highly correlated features that may distort results.
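To make the idea concrete, here is a small sketch that builds a Pearson correlation matrix and flags highly correlated pairs. It is not the project's code: the column values are invented, and the 0.9 threshold for "highly correlated" is an arbitrary assumption chosen for illustration.

```python
import math

# Hypothetical values for three numeric columns (made up for illustration).
data = {
    "point.score": [2050.0, 2011.0, 1987.0, 1950.0, 1902.0],
    "previous.points.scored": [2040.0, 2015.0, 1970.0, 1960.0, 1900.0],
    "X1.year.change": [10.0, -4.0, 17.0, -10.0, 2.0],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


corr = correlation_matrix(data)

# Flag pairs whose absolute correlation exceeds an (assumed) threshold.
threshold = 0.9
names = list(data)
high_pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(corr[a][b]) > threshold
]
print(high_pairs)
```

Pairs flagged this way are candidates for multicollinearity; keeping both in a regression can make the individual coefficients unstable.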

  Codes:












































3. Dependency Model

  3.1. Multiple Regression

Introduction
The multiple regression model is used in the multivariate analysis of soccer club rankings to model the relationship between the dependent variable, Point Score, and the independent variables, including Previous Point Scored, 1-Year Change, and Other Performance Metrics.


  3.2. Model Explanation

          Model Equation:
          

Y = β0 + β1X1 + β2X2 + β3X3 + ... + βnXn + ε


where:

·     Y = Point Score (dependent variable)

·     X1 = Previous Point Scored

·     X2 = 1-Year Change

·     X3 = Symbol Change (categorical, converted to numerical dummy variables)

·     Xn = Other relevant numerical features

·     β0 = Intercept (constant term)

·     β1, β2, β3, ..., βn = Regression coefficients (weights assigned to each independent variable)

·     ε = Error term (accounts for variations not explained by the model)


Final Model for Soccer Club Ranking Analysis:

 

Point Score = β0 ​+ β1​(Previous Point Scored) + β2​(1-Year Change) + β3​(Symbol Change) + ε

This equation helps predict a club’s ranking score based on its past performance and ranking trends.
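For readers who want to see how the coefficients in such an equation are estimated, the following is a minimal ordinary least squares sketch using the normal equations. It is not the project's code. The data are made up and deliberately follow an exact linear rule so the recovered coefficients can be checked, and the helper names are assumptions for this example.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(features, target):
    """Least-squares coefficients [b0, b1, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in features]  # prepend intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def r_squared(features, target, coefs):
    """Share of the variance in the target explained by the fitted model."""
    preds = [coefs[0] + sum(c * v for c, v in zip(coefs[1:], row)) for row in features]
    mean_y = sum(target) / len(target)
    ss_res = sum((y - p) ** 2 for y, p in zip(target, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot


# Hypothetical predictors: (previous points, 1-year change). Values are made up.
features = [(20.0, 1.0), (18.5, -0.5), (17.0, 2.0), (16.2, 0.3), (15.0, -1.2), (14.1, 0.8)]
# Target built from a known rule (2 + 0.9*previous + 1.5*change) for checking.
target = [2.0 + 0.9 * p + 1.5 * c for p, c in features]

coefs = fit_ols(features, target)
fit_quality = r_squared(features, target, coefs)
print([round(c, 4) for c in coefs], round(fit_quality, 4))
```

Real ranking data will not follow an exact rule, so R-squared will fall below 1 and the coefficients will carry standard errors, which is where a statistics package becomes the better tool.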

 
  Code:







·  Dependent Variable: Point_score

·  Independent Variables: previous point scored, 1 year change


The multiple regression analysis shows that the model has low explanatory power. The R-squared value of 22.76% means that only a small share of the variation in Point Score is explained by the independent variables, and the Adjusted R-squared (20.55%), which penalizes the number of predictors, is slightly lower still. The p-value (2.2e-16) is extremely low, which indicates that the model as a whole is statistically significant and that at least one coefficient differs from zero. The value reported as the F-statistic (0.6032) is hard to reconcile with that p-value, because an F-statistic below 1 would ordinarily be accompanied by a large p-value, so this figure should be checked against the model output before it is interpreted. Statistical significance is also not the same as predictive strength: with an R-squared this low, most of the variation in Point Score remains unexplained. The model therefore needs further refinement, such as adding more relevant predictors, addressing multicollinearity, or exploring non-linear relationships. Improvements in feature selection and data transformation could enhance the model’s accuracy in predicting soccer club rankings.
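The relationship between R-squared and Adjusted R-squared can be written out directly. The sketch below applies the standard adjustment formula; the sample size of 100 is a hypothetical placeholder, since the number of observations is not stated here, so the printed value is illustrative only.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    penalty = (n_observations - 1) / (n_observations - n_predictors - 1)
    return 1.0 - (1.0 - r_squared) * penalty


# R-squared is taken from the report; n_observations=100 is an assumed placeholder.
example = adjusted_r_squared(0.2276, n_observations=100, n_predictors=2)
print(round(example, 4))
```

The adjustment always pulls the value down, and the gap widens as predictors are added relative to the number of observations.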

   3.3. Logistic Regression

Introduction

Logistic regression is used to model the probability of a binary outcome. In this analysis, the dependent variable is Club Performance Category (e.g., high or low-ranked clubs), while the independent variables include Previous Point Scored, 1-Year Change, Symbol Change, and Other Performance Metrics. This model helps in predicting the likelihood of a soccer club belonging to a specific performance category based on key ranking factors.


3.4. Model Explanation

 

Model Equation:

 

Logistic Regression Equation:

log(p / (1 - p)) = β0 + β1X1 + β2X2 + β3X3 + ... + βnXn

where:

·  p = Probability that a club belongs to the top tier (Top_tier = 1)

·  X1 = Previous Point Scored

·  X2 = 1-Year Change

·  X3 = Symbol Change (categorical, converted to numerical dummy variables)

·  Xn = Other relevant numerical features

·  β0 = Intercept (constant term)

·  β1, β2, β3, ..., βn = Regression coefficients (change in the log-odds for a one-unit increase in each predictor)
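In logistic regression the linear combination of predictors is passed through the logistic (sigmoid) function to give a probability between 0 and 1. The sketch below shows that step only; the coefficient values, the 0.5 cut-off, and the function names are hypothetical and are not estimates from this project.

```python
import math


def sigmoid(z):
    """Map a linear predictor (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_top_tier_probability(coefs, features):
    """coefs = [b0, b1, ...]; features = [x1, x2, ...] for one club."""
    z = coefs[0] + sum(b * x for b, x in zip(coefs[1:], features))
    return sigmoid(z)


# Hypothetical coefficients: intercept, previous points, 1-year change.
coefs = [-4.0, 0.002, 0.01]
# Hypothetical club with 2000 previous points and a +10 one-year change.
probability = predict_top_tier_probability(coefs, [2000.0, 10.0])
label = 1 if probability >= 0.5 else 0  # assumed 0.5 classification cut-off
print(round(probability, 3), label)
```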

 

Code:










·  Dependent Variable: Top_tier

·  Independent Variables: ranking, club.name, country, point.score, X1.year.change, and previous.point.scored.

 

3.5. Interpretation:

 

The logistic regression output shows a statistically significant intercept (p < 2.2e-16). The intercept is the log-odds of top-tier membership when every predictor is at zero or at its reference level, so its significance says little about which factors drive club rankings. None of the independent variables is statistically significant, which implies that the selected predictors, as modeled, do not individually show a clear effect on whether a club belongs to the top tier. Despite this, the model achieves a high accuracy of 95.66%, meaning it classifies most football clubs into the correct tier. High accuracy alongside non-significant predictors can occur when one tier is much more common than the other or when a predictor separates the tiers almost perfectly, so the accuracy figure alone does not establish meaningful relationships between the predictors and the outcome. Further refinement, such as feature selection, adding interaction terms, or exploring alternative modeling techniques, could improve the interpretability and robustness of the model.
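One way to put an accuracy figure such as 95.66% in context is to compare it with the accuracy of always predicting the more common class. The sketch below uses made-up labels (not the project's data) to show how a model that never predicts the top tier can still score highly when top-tier clubs are rare; the function names are assumptions for this example.

```python
def accuracy(actual, predicted):
    """Fraction of cases where the prediction matches the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def majority_baseline_accuracy(actual):
    """Accuracy obtained by always predicting the most common label."""
    most_common = max(set(actual), key=actual.count)
    return actual.count(most_common) / len(actual)


def recall(actual, predicted, positive=1):
    """Share of truly positive cases that the model identifies."""
    positives = [p for a, p in zip(actual, predicted) if a == positive]
    return sum(p == positive for p in positives) / len(positives)


# Hypothetical labels: 1 = Top_tier, 0 = not. Only 2 of 20 clubs are top tier.
actual = [1, 1] + [0] * 18
predicted = [0] * 20  # a model that never predicts Top_tier

print(accuracy(actual, predicted))
print(majority_baseline_accuracy(actual))
print(recall(actual, predicted))
```

If the model's accuracy is close to the baseline, measures such as recall for the top-tier class give a clearer picture of whether it has learned anything useful.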


4. Dimension Reduction


4.1. Introduction

Dimension reduction techniques, such as Principal Component Analysis (PCA), are used in this soccer club ranking analysis to reduce the number of variables while preserving critical information. PCA transforms the original ranking-related variables, such as Previous Point Scored, 1-Year Change, and Symbol Change, into a new set of uncorrelated variables called Principal Components. These components capture the most significant variance in the dataset, allowing for a more efficient and interpretable analysis while minimizing redundancy and multicollinearity among the original features.


4.2. Explanation

In this soccer club ranking analysis, Principal Component Analysis (PCA) systematically evaluates the directions of maximum variance within the dataset. It projects the original ranking-related variables, such as Previous Point Scored, 1-Year Change, and Symbol Change, onto these principal directions to create a new set of uncorrelated features. The first principal component captures the highest amount of variance in the data, with each subsequent component explaining progressively less variance. This approach helps in simplifying the dataset while retaining the most important information for ranking analysis.
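To illustrate what "directions of maximum variance" means in practice, the sketch below runs PCA by hand on two standardized variables, using the closed-form eigenvalues of a 2x2 covariance matrix. The input values are made up and the function names are assumptions; a library such as scikit-learn would normally be used instead.

```python
import math

# Hypothetical two-column input (made-up values, not the project's data).
previous = [20.0, 18.5, 17.0, 16.2, 15.0, 14.1]
change = [1.0, -0.5, 2.0, 0.3, -1.2, 0.8]


def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def explained_variance_ratio_2d(x, y):
    """Eigenvalues of the 2x2 covariance matrix as shares of total variance."""
    a, b, c = covariance(x, x), covariance(x, y), covariance(y, y)
    mid = (a + c) / 2.0
    half_gap = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    eigenvalues = [mid + half_gap, mid - half_gap]  # largest first
    total = a + c  # trace of the matrix = total variance
    return [ev / total for ev in eigenvalues]


ratios = explained_variance_ratio_2d(standardize(previous), standardize(change))
print([round(r, 4) for r in ratios], round(sum(ratios), 4))
```

With two input variables there are exactly two components, so their explained-variance shares always add up to 1.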


4.3. Examples:

Explained Variance:

·  Explained variance for the first two principal components: [0.63294366 0.36705634]

 

Code:









4.4. Interpretation:

In this soccer club ranking analysis, the first two principal components explain approximately 63.29% and 36.71% of the total variance, respectively, which together account for 100% of the variance. A cumulative total of exactly 100% from two components means the data passed to PCA span only two dimensions, so the two components re-express the input as uncorrelated axes without discarding any information. An actual reduction in dimensionality would come from keeping only the first component, which would retain about 63.29% of the variance and give up the remaining 36.71%.


5. Cluster Analysis


5.1. Introduction
Cluster analysis is used to group similar data points into clusters, revealing patterns and structures within a dataset and providing insights into underlying relationships.

5.2. Model Explanation:

This analysis covers two clustering techniques: K-Means Clustering and Hierarchical Clustering:

·        K-Means Clustering divides data points into K distinct clusters, assigning each point to the cluster with the nearest centroid based on similarity.

·        Hierarchical Clustering constructs a hierarchy of clusters, progressively merging smaller clusters into larger ones to form a tree-like structure, enabling a deeper understanding of data relationships.
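The K-Means procedure described above can be sketched in a few lines. This is a plain-Python illustration of the assign-then-update loop, not the project's code: the points and the starting centroids are made up, and the function names are assumptions for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def k_means(points, centroids, iterations=20):
    """Alternate assignment and centroid update until centroids stop moving."""
    centroids = list(centroids)
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[k])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids


def within_cluster_ss(points, labels, centroids):
    """Total within-cluster sum of squares, the quantity plotted in an elbow chart."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


# Hypothetical 2-D points forming three loose groups (made-up values).
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.2), (5.1, 4.9), (4.8, 5.0),
    (9.0, 1.0), (9.2, 1.1), (8.9, 0.8),
]
initial = [points[0], points[3], points[6]]  # assumed starting centroids

labels, centroids = k_means(points, initial)
print(labels, round(within_cluster_ss(points, labels, centroids), 3))
```

Repeating the run for several values of K and plotting the within-cluster sum of squares against K gives the elbow chart used to choose the number of clusters.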


Code:











5.3. K-means Clustering

Optimal Clusters: Optimal K = 3 (Elbow Method)

·        The results are visualized using a scatterplot of the first two principal components, with points colored according to their cluster.

·        The clusters are well-separated, indicating that the chosen algorithm effectively distinguished different patterns in the data.


















5.4. Hierarchical Clustering

 

A dendrogram is generated to visualize hierarchical relationships between data points.




















Code:









Interpretation:

In the hierarchical clustering analysis of soccer club rankings, the largest merge occurring at a height above 80 suggests that the most significant division between clusters happens at this level. This indicates that the largest groups of clubs differ substantially in their ranking-related features, such as Point Score, Previous Point Scored, and 1-Year Change. The high merge height reflects a clear distinction between the top-tier and lower-tier clubs, meaning that clubs in different clusters have considerably different performance metrics.
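The notion of a merge height can be illustrated with a small agglomerative clustering sketch. It assumes single linkage (distance between the closest members of two clusters), which may not be the linkage method used in the project, and it works on made-up one-dimensional values; the function name is an assumption for this example.

```python
def agglomerate(values):
    """Single-linkage agglomerative clustering on 1-D values.

    Returns a list of (merge_height, merged_cluster) in the order of merging.
    """
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest single-linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        merges.append((d, sorted(merged)))
    return merges


# Hypothetical scores: a tight low group and a tight high group (made-up values).
values = [1.0, 2.0, 2.5, 90.0, 92.0]
merges = agglomerate(values)
heights = [height for height, _ in merges]
for height, members in merges:
    print(height, members)
```

The early merges happen at small heights because the values within each group are close together, while the final merge joins the two groups at a much larger height. A large jump of this kind in a dendrogram is what signals a clear split between groups.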



6. Conclusion


   Top Tier Classification:

·   A logistic regression model was developed to classify clubs into "Top_Tier" and "Not Top_Tier" categories.

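·   The model achieved a high accuracy of 95.66%, although none of the individual predictors was statistically significant, so the result reflects strong classification performance rather than evidence of which factors drive top-tier status.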


    Club Segmentation:

·   K-Means clustering was applied to segment clubs into distinct groups based on ranking-related performance metrics.

·   These clusters provide insights for strategic decision-making, performance evaluation, and targeted improvements in different club categories.


    Overall Impact:

·   The use of multivariate analysis techniques, including logistic regression and clustering, enhances the understanding of soccer club rankings.

·   This data-driven approach aids in classification, segmentation, and performance assessment, supporting informed decision-making. 


    This blog presents key insights from our project report for the ‘Data Analysis using R and Python’ course (MBA 2024–26, 3rd trimester) at Amrita School of Business, Coimbatore.

