Data Science: Exploratory Data Analysis

Image Credit: AnalyticsLearn

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is the process of understanding a data set by summarizing and visualizing its main characteristics. This step is critical, especially before modeling the data with Machine Learning. Typical EDA plots include histograms, box plots, and scatter plots. Exploring the data can take a long time, but it is through EDA that we can define the problem statement for our data set.

What data are we exploring today?

Here we use a data set of cars from Kaggle. The data set can be downloaded from here. In brief, it contains more than 10,000 rows and more than 10 columns describing features of each car, such as Engine Fuel Type, Engine Size, HP, Transmission Type, highway MPG, city MPG, and many more.

1. Importing the required libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

2. Loading the data set

df = pd.read_csv("Data.csv")
df.head(5) # To display the top 5 rows
df.tail(5) # To display the bottom 5 rows

3. Checking the types of data
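
The snippet for this step is not shown, so here is a minimal sketch of what checking the types usually looks like. The toy frame and its values are made up for illustration, and the idea that the price might arrive as text is an assumption, not something confirmed for this data set.

```python
import pandas as pd

# Toy stand-in for the cars data; the values are invented for illustration.
cars = pd.DataFrame({
    "Make": ["BMW", "Audi", "Fiat"],
    "Year": [2011, 2015, 2017],
    "Engine HP": [335.0, 220.0, None],
    "MSRP": ["46135", "35000", "24995"],  # assumption: numbers stored as text
})

# One dtype per column: integers, floats, and text show up differently.
print(cars.dtypes)

# A text column cannot be plotted or averaged, so convert it to integers.
cars["MSRP"] = cars["MSRP"].astype(int)
print(cars.dtypes["MSRP"])
```

On the real data the same check would be `df.dtypes`, with a conversion needed only if a numeric column turns out to be stored as text.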


4. Dropping irrelevant columns

df = df.drop(['Engine Fuel Type', 'Market Category', 'Vehicle Style', 'Popularity', 'Number of Doors', 'Vehicle Size'], axis=1)

5. Renaming the columns

df = df.rename(columns={"Engine HP": "HP", "Engine Cylinders": "Cylinders", "Transmission Type": "Transmission", "Driven_Wheels": "Drive Mode","highway MPG": "MPG-H", "city mpg": "MPG-C", "MSRP": "Price" })
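
One thing worth knowing about rename is that keys which do not match any column are silently ignored, so a small typo in the mapping leaves the old name in place. The sketch below uses a hypothetical, deliberately mistyped mapping on an empty toy frame to show a quick sanity check.

```python
import pandas as pd

# Empty toy frame that only carries column names (no real data needed here).
cars = pd.DataFrame(columns=["Engine HP", "highway MPG", "city mpg", "MSRP"])

# Hypothetical mapping with a deliberate typo: "city MPG" instead of "city mpg".
mapping = {
    "Engine HP": "HP",
    "highway MPG": "MPG-H",
    "city MPG": "MPG-C",
    "MSRP": "Price",
}

# Keys that are not existing columns would be skipped without any error.
unmatched = [old for old in mapping if old not in cars.columns]
print(unmatched)                  # ['city MPG']

renamed = cars.rename(columns=mapping)
print(list(renamed.columns))      # ['HP', 'MPG-H', 'city mpg', 'Price']
```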

6. Dropping the duplicate rows

df.shape
(11914, 10)
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)
number of duplicate rows: (989, 10)
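
To make the behavior of duplicated() concrete, here is a small sketch on an invented three-column frame. A row counts as a duplicate only when every column matches an earlier row.

```python
import pandas as pd

# Invented rows: the second row repeats the first, the last differs in Year.
rows = pd.DataFrame({
    "Make": ["BMW", "BMW", "Audi", "BMW"],
    "Model": ["1 Series", "1 Series", "A4", "1 Series"],
    "Year": [2011, 2011, 2015, 2012],
})

# duplicated() is True for a row when an identical row appeared earlier.
mask = rows.duplicated()
print(mask.tolist())                    # [False, True, False, False]

# drop_duplicates() keeps the first occurrence and removes the repeats.
deduped = rows.drop_duplicates()
print(len(rows), "->", len(deduped))    # 4 -> 3
```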

Now, let us remove the duplicate data.

df.count()
Make 11914
Model 11914
Year 11914
HP 11845
Cylinders 11884
Transmission 11914
Drive Mode 11914
MPG-H 11914
MPG-C 11914
Price 11914
dtype: int64

So there are 11914 rows above, of which we remove the 989 duplicate rows.

df = df.drop_duplicates()
df.count()
Make 10925
Model 10925
Year 10925
HP 10856
Cylinders 10895
Transmission 10925
Drive Mode 10925
MPG-H 10925
MPG-C 10925
Price 10925
dtype: int64

7. Dropping the missing or null values

print(df.isnull().sum())
Make 0
Model 0
Year 0
HP 69
Cylinders 30
Transmission 0
Drive Mode 0
Price 0
dtype: int64

This is why, in the count above, Horsepower (HP) and Cylinders showed 10856 and 10895 values respectively instead of 10925.

df = df.dropna()
df.count()
Make 10827
Model 10827
Year 10827
HP 10827
Cylinders 10827
Transmission 10827
Drive Mode 10827
MPG-H 10827
MPG-C 10827
Price 10827
dtype: int64

We have now removed every row that contained a null (N/A) value in Cylinders or Horsepower (HP).

print(df.isnull().sum())
Make 0
Model 0
Year 0
HP 0
Cylinders 0
Transmission 0
Drive Mode 0
Price 0
dtype: int64
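
The relationship between count() and isnull().sum() used in this step can be checked on a small invented frame: for every column, the non-null count plus the null count equals the number of rows. This is only a sketch with made-up values.

```python
import numpy as np
import pandas as pd

# Invented values with one missing HP and one missing Cylinders entry.
toy = pd.DataFrame({
    "HP": [335.0, np.nan, 220.0, 160.0],
    "Cylinders": [6.0, 4.0, np.nan, 4.0],
    "Price": [46135, 35000, 24995, 19990],
})

missing = toy.isnull().sum()   # number of NaN cells per column
present = toy.count()          # number of non-NaN cells per column

# The two views always add up to the row count.
print(((missing + present) == len(toy)).all())

# dropna() removes any row that has a NaN in at least one column.
cleaned = toy.dropna()
print(len(toy), "->", len(cleaned))   # 4 -> 2
```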

8. Detecting Outliers

Figures: box plots of Price, Horsepower (HP), and Cylinders.
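Q1 = df.quantile(0.25, numeric_only=True)  # first quartile of each numeric column
Q3 = df.quantile(0.75, numeric_only=True)  # third quartile of each numeric column
IQR = Q3 - Q1
print(IQR)
Year 9.0
HP 130.0
Cylinders 2.0
MPG-H 8.0
MPG-C 6.0
Price 21327.5
dtype: float64
num = df[IQR.index]  # compare only the numeric columns
df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]
df.shape
(9191, 10)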
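
The filter above applies the common 1.5 × IQR rule: a value is treated as an outlier when it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. Here is a minimal sketch of the same rule on a single invented price column, so the fences can be checked by hand.

```python
import pandas as pd

# Invented prices with one obviously extreme value.
prices = pd.Series([20000, 22000, 25000, 27000, 30000, 250000], name="Price")

q1 = prices.quantile(0.25)      # 22750.0
q3 = prices.quantile(0.75)      # 29250.0
iqr = q3 - q1                   # 6500.0

# Fences of the 1.5 * IQR rule.
lower = q1 - 1.5 * iqr          # 13000.0
upper = q3 + 1.5 * iqr          # 39000.0

is_outlier = (prices < lower) | (prices > upper)
print(prices[is_outlier].tolist())   # [250000]

# Keep only the rows inside the fences.
kept = prices[~is_outlier]
print(len(kept))                     # 5
```

Note that the rule is a convention rather than a guarantee: whether a flagged car is a data error or simply an expensive car is a judgment call.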

9. Plot different features against one another (scatter), against frequency (histogram)


Histogram

A histogram shows how often values occur within each interval or category. In this case, the data set covers many different car manufacturers (the plot below shows up to 40 of them), and it is often useful to know which of them has the most cars. A bar chart of the counts is one of the simplest ways to see the total number of cars from each manufacturer.

df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
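
The numbers behind that bar chart come from value_counts(), which tallies each make and sorts the result from most to least frequent. A small sketch on invented makes shows the table that would be plotted.

```python
import pandas as pd

# Invented list of makes, one entry per car.
makes = pd.Series(["Ford", "BMW", "Ford", "Audi", "BMW", "Ford"], name="Make")

# Count cars per make; the result is sorted in descending order.
counts = makes.value_counts()

# Keep only the most frequent makes (2 here, 40 in the plot above).
top2 = counts.nlargest(2)
print(top2.index.tolist())   # ['Ford', 'BMW']
print(top2.tolist())         # [3, 2]
```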

Heat Maps

A heat map is a type of plot that helps when we need to find which variables are related to one another. It is one of the best ways to see the relationships between features. In the heat map below, the Price feature correlates mainly with Horsepower and Cylinders.

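c = df.corr(numeric_only=True)  # correlate only the numeric columns
sns.heatmap(c, annot=True)  # draw the correlation matrix with values in each cell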
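
A heat map is just a colored view of the correlation matrix, so the same question can be answered numerically. The sketch below uses invented values and ranks the features by the strength of their correlation with Price; it illustrates the idea and does not reproduce the figures for the real data.

```python
import pandas as pd

# Invented values; "Make" is text and must be excluded from the correlation.
toy = pd.DataFrame({
    "HP": [100, 150, 200, 250, 300],
    "Cylinders": [4, 4, 6, 6, 8],
    "MPG-C": [30, 26, 21, 18, 14],
    "Price": [15000, 22000, 30000, 41000, 55000],
    "Make": ["A", "B", "C", "D", "E"],
})

# Pearson correlation between every pair of numeric columns.
corr = toy.corr(numeric_only=True)

# Correlation of each feature with Price, strongest (by magnitude) first.
price_corr = corr["Price"].drop("Price").sort_values(key=abs, ascending=False)
print(price_corr.round(2))
```

Values near +1 or -1 indicate a strong linear relationship; the sign shows whether the feature rises or falls as Price rises.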


Scatterplot

We generally use scatter plots to examine the correlation between two variables. Here the scatter plot is drawn between Horsepower and Price, as shown below. The points are spread well enough that a trend line can easily be drawn through them.

fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['HP'], df['Price'])
ax.set_xlabel('HP')
ax.set_ylabel('Price')
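
The trend line mentioned above can be estimated with an ordinary least-squares fit of a straight line. This is a minimal sketch using NumPy's polyfit on invented horsepower and price values; the arrays are assumptions for illustration only.

```python
import numpy as np

# Invented horsepower and price values.
hp = np.array([100, 150, 200, 250, 300], dtype=float)
price = np.array([15000, 22000, 30000, 41000, 55000], dtype=float)

# Fit price = slope * hp + intercept by least squares (degree-1 polynomial).
slope, intercept = np.polyfit(hp, price, deg=1)

# Points on the fitted line, one per observation.
trend = slope * hp + intercept
print(round(float(slope), 1), round(float(intercept), 1))

# On the real plot, the line could be overlaid with: ax.plot(hp, trend)
```

With these values the fitted slope is about 198, meaning each additional unit of horsepower is associated with roughly 198 more in price in this toy example.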


These are some of the general steps you should follow to perform Exploratory Data Analysis. There are many more, but for now this is enough to give an idea of how to perform a good EDA on any data set.

Thank You!



