Exploratory Data Analysis (EDA) with Python and Pandas

Perform EDA using Python to uncover patterns in employee compensation data. This guide covers data visualization techniques to analyze distributions, identify correlations, and explore trends over time.

Published in itversity, Nov 2, 2024

In the previous articles of this series, we covered Predictive Modeling and Dataset Overview, and Data Cleaning and Preprocessing. Now, with a cleaned dataset, we’re ready to dive into Exploratory Data Analysis (EDA).

EDA helps us better understand the data, uncover patterns, and identify relationships that will guide our feature engineering and model-building phases. By the end of this article, you’ll be equipped with essential EDA techniques to explore any dataset, preparing it for the next stages in our machine-learning pipeline.

Series Recap

Before we proceed, here’s a quick recap of the series so far:

1. Model Development Life Cycle: Introduced the key stages in building a machine learning model.

2. Predictive Modeling and Dataset Overview: Defined the problem and explored our employee compensation dataset.

3. Data Cleaning and Preprocessing: Cleaned the dataset by handling missing values, removing redundant columns, and standardizing data types.

Now, let’s use this cleaned data to perform EDA, setting the foundation for feature engineering and model building.

Objectives

In this article, we will:

Calculate descriptive statistics for key columns in the dataset.
Visualize data distributions, identify outliers, and examine relationships.
Analyze correlations between features.
Use grouping and aggregation to find trends in categorical variables.

Understanding Data Distributions

Why Analyze Distributions?

Understanding the distribution of numerical features is essential for identifying patterns, skewness, and outliers, which can influence model performance. Features like Salaries, Overtime, and Retirement play a direct role in employee compensation, so it’s important to examine their distributions.

Loading the data

First, let’s load the cleaned data into a pandas DataFrame so we can explore it.

import pandas as pd

# Load the dataset
df = pd.read_csv('compensation_data.csv')

# Display the first few rows of the dataset
df.head()
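
With the data loaded, the first objective (descriptive statistics) can be sketched with `describe()`. The example below is a minimal illustration, not the series' own code: the `sample` frame and its values are made up so the snippet runs without the CSV, and `summarize_numeric` is a hypothetical helper name. On the real data you would pass `df` instead.

```python
import pandas as pd

# Hypothetical stand-in for the compensation data; the values are made up.
sample = pd.DataFrame({
    'Salaries': [50000.0, 62000.0, 58000.0, 120000.0],
    'Overtime': [0.0, 1500.0, 300.0, 9000.0],
    'Retirement': [9000.0, 11000.0, 10500.0, 21000.0],
})

def summarize_numeric(frame, columns):
    """Return count, mean, std, min, quartiles, and max, one row per column."""
    return frame[columns].describe().T

summary = summarize_numeric(sample, ['Salaries', 'Overtime', 'Retirement'])

# A mean well above the median ('50%') hints at a right-skewed column.
print(summary[['mean', '50%', 'max']])
```

Transposing the result puts one feature per row, which tends to be easier to scan when there are many columns.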

Visualizing Key Features

To analyze distributions, we’ll use histograms with density plots for each salary and benefit component. These plots help identify the shape of the data (e.g., normal or skewed) and any extreme values.

import matplotlib.pyplot as plt
import seaborn as sns

# List of salary and benefit components
salary_benefit_fields = ['Salaries', 'Overtime', 'Other Salaries', 'Retirement', 'Health/Dental', 'Other Benefits']

# Plot distributions for each component
for field in salary_benefit_fields:
    plt.figure(figsize=(8, 4))
    sns.histplot(df[field], kde=True)
    plt.title(f'Distribution of {field}')
    plt.xlabel(field)
    plt.ylabel('Frequency')
    plt.show()

These distribution plots allow us to see how each component is spread across employees, helping us spot any unusual patterns or outliers.
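
To put numbers on what the histograms suggest, one option is to compute skewness and flag outliers with the interquartile range (IQR) rule. This is a sketch under assumptions: the `overtime` values are invented for illustration, `iqr_outliers` is a hypothetical helper, and the 1.5 multiplier is a common convention rather than something the dataset dictates.

```python
import pandas as pd

# Hypothetical values standing in for one compensation column.
overtime = pd.Series(
    [0, 0, 200, 400, 500, 600, 800, 1000, 1200, 25000],
    name='Overtime',
    dtype='float64',
)

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

skewness = overtime.skew()        # positive means a longer right tail
outliers = iqr_outliers(overtime)

print(f'skewness: {skewness:.2f}')
print(f'outliers: {outliers.tolist()}')
```

On the real data, the same two calls could be run inside the plotting loop for each field, for example `iqr_outliers(df[field].dropna())`.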

Correlation Analysis

Why Analyze Correlations?

Correlation measures the strength and direction of the linear relationship between two numerical features. Examining correlations among features like Salaries, Overtime, and Total Compensation helps us spot strong predictors and flag redundant features that carry nearly the same information.

Creating a Correlation Heatmap

A correlation heatmap visually represents relationships between numerical features. Features with strong correlations to Total Compensation are likely to be effective predictors.

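# Compute and visualize the correlation matrix.
# numeric_only=True skips text columns such as Department;
# without it, pandas 2.0+ raises an error on non-numeric data.
correlation_matrix = df.corr(numeric_only=True)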

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

In the heatmap, values closer to 1 or -1 indicate stronger correlations, while values near 0 indicate little or no linear relationship. Focus on features that show a high absolute correlation with Total Compensation, as they may be valuable for model building.
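
When the heatmap has many cells, it can help to pull out just the column for the target and sort it. The sketch below assumes a tiny made-up `sample` frame and a hypothetical helper, `rank_by_correlation`; on the real data you would pass `df` and `'Total Compensation'`.

```python
import pandas as pd

# Hypothetical miniature of the dataset; the values are made up.
sample = pd.DataFrame({
    'Salaries': [50000.0, 60000.0, 70000.0, 80000.0, 90000.0],
    'Overtime': [5000.0, 1000.0, 4000.0, 2000.0, 3000.0],
    'Department': ['A', 'B', 'A', 'C', 'B'],
})
sample['Total Compensation'] = sample['Salaries'] + sample['Overtime']

def rank_by_correlation(frame, target):
    """Rank numeric features by absolute Pearson correlation with target."""
    numeric = frame.select_dtypes(include='number')   # drop text columns
    corr = numeric.corr()[target].drop(target)        # exclude target itself
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]                            # keep the original signs

ranked = rank_by_correlation(sample, 'Total Compensation')
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top, since they can be just as informative as positive ones.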

Identifying Trends Over Time

Why Examine Time-Based Trends?

Understanding how Total Compensation changes over time can reveal year-over-year patterns that influence employee compensation. These trends can inform our model and help it capture changes from one year to the next.

Visualizing Yearly Trends

A line plot of Total Compensation over the years helps us see how compensation has evolved. This can indicate if there’s a general upward or downward trend in compensation across the dataset.

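# Line plot of average Total Compensation by Year.
# lineplot averages rows that share the same Year; errorbar=None hides
# the confidence band (on seaborn versions before 0.12, use ci=None).
plt.figure(figsize=(10, 6))
sns.lineplot(data=df, x='Year', y='Total Compensation', errorbar=None)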
plt.title('Yearly Trend of Total Compensation')
plt.xlabel('Year')
plt.ylabel('Average Total Compensation')
plt.show()

This plot shows any significant changes in compensation year-over-year, which could be incorporated into model features or serve as a baseline for forecasting.
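
The same trend can be read as a table. By default, `sns.lineplot` averages rows that share an x value, so a `groupby` on Year with `mean()` gives the numbers behind the line. The sketch below uses an invented `sample` frame and a hypothetical helper, `yearly_trend`, which also adds the percentage change from the previous year.

```python
import pandas as pd

# Hypothetical rows; the years and amounts are made up.
sample = pd.DataFrame({
    'Year': [2021, 2021, 2022, 2022, 2023, 2023],
    'Total Compensation': [90000.0, 110000.0, 100000.0,
                           110000.0, 105000.0, 126000.0],
})

def yearly_trend(frame, year_col='Year', value_col='Total Compensation'):
    """Return the mean per year and its percent change from the prior year."""
    mean_by_year = frame.groupby(year_col)[value_col].mean().sort_index()
    return pd.DataFrame({
        'mean': mean_by_year,
        'pct_change': mean_by_year.pct_change() * 100,  # first year is NaN
    })

trend = yearly_trend(sample)
print(trend.round(2))
```

The percent-change column makes it easier to judge whether a visible bump in the plot is large relative to the prior year.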

Examining Relationships with Categorical Features

Why Analyze Categorical Features?

Categorical features like Department or Union may also impact compensation. Using box plots, we can examine how Total Compensation varies across categories, helping us identify which categorical features are most relevant.

Box Plot for Total Compensation by Department

Box plots provide a summary of compensation across different departments, highlighting the median, spread, and outliers for each category.

# Box plot for Total Compensation by Department
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='Department', y='Total Compensation')
plt.title('Total Compensation by Department')
plt.xticks(rotation=90)
plt.show()

This plot helps us compare compensation levels across departments and identify any outliers or patterns, which may guide feature selection for the model.
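
The grouping and aggregation objective can be sketched with `groupby` and `agg`, which summarize in a table what the box plot shows visually. The department names and amounts below are invented, and `compensation_by_category` is a hypothetical helper; the same call should work for other categorical columns such as Union.

```python
import pandas as pd

# Hypothetical rows; department names and amounts are made up.
sample = pd.DataFrame({
    'Department': ['Police', 'Police', 'Library',
                   'Library', 'Library', 'Transit'],
    'Total Compensation': [120000.0, 140000.0, 60000.0,
                           70000.0, 80000.0, 95000.0],
})

def compensation_by_category(frame, category, value='Total Compensation'):
    """Return count, median, and mean of value per category, highest median first."""
    grouped = frame.groupby(category)[value].agg(['count', 'median', 'mean'])
    return grouped.sort_values('median', ascending=False)

by_department = compensation_by_category(sample, 'Department')
print(by_department)
```

Including the count alongside the median is a deliberate choice: a department with very few employees can show an extreme median that should not be over-interpreted.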

Conclusion

Through EDA, we’ve gained valuable insights into our dataset by examining distributions, correlations, trends over time, and differences across categories such as Department. These findings will help guide our feature engineering and model building, ensuring we use the most informative data for predictions.

Next Steps

In the next article, we’ll cover Feature Engineering and Model Building, where we’ll create and select features and build our initial predictive models.

Stay tuned for the next article in the Machine Learning for Beginners series!