Data Science: Exploratory Data Analysis

5 min read · Oct 28, 2021

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is the process of gaining a better understanding of a data set by analyzing and visualizing its main properties. This stage is critical, especially when preparing data for Machine Learning. Histograms, box plots, scatter plots, and other plotting options are commonly used in EDA. Exploring the data can take a long time, but it helps us define or refine the problem statement for the data we have collected.

What data are we exploring today?

Here we use a data set of cars, which can be downloaded from Kaggle. In brief, the data set contains more than 10,000 rows and more than 10 columns describing features of each car, such as Engine Fuel Type, Engine Size, HP, Transmission Type, highway MPG, city MPG, and many more.

1. Importing the required libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)

2. Loading the data set

df = pd.read_csv("Data.csv")
df.head(5) # To display the top 5 rows

df.tail(5) # To display the bottom 5 rows

3. Checking the types of data

df.dtypes
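As a small sketch of what this check can reveal, the snippet below builds a toy frame in which a price column was read as text, then converts it to a number. The rows are made up for illustration and are not taken from the Kaggle file; `pd.to_numeric` is one common way to do the conversion, not necessarily what this data set needs.

```python
import pandas as pd

# Toy rows, made up for illustration; not taken from the Kaggle data set.
toy = pd.DataFrame({
    "Make": ["BMW", "Audi", "Fiat"],
    "Year": [2011, 2015, 2017],
    "MSRP": ["46135", "40650", "not listed"],  # numbers stored as text
})

print(toy.dtypes)  # MSRP is reported as a text column, not a numeric one

# errors="coerce" turns values that cannot be parsed into NaN
toy["MSRP"] = pd.to_numeric(toy["MSRP"], errors="coerce")
print(toy.dtypes)
```

After the conversion, the unparseable entry becomes a missing value, which the later null-handling step would then pick up.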

4. Dropping irrelevant columns

df = df.drop(['Engine Fuel Type', 'Market Category', 'Vehicle Style', 'Popularity', 'Number of Doors', 'Vehicle Size'], axis=1)
df.head(5)

5. Renaming the columns

df = df.rename(columns={"Engine HP": "HP", "Engine Cylinders": "Cylinders", "Transmission Type": "Transmission", "Driven_Wheels": "Drive Mode","highway MPG": "MPG-H", "city mpg": "MPG-C", "MSRP": "Price" })
df.head(5)
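One thing to watch in the drop and rename steps is that column names must match exactly, including case ("highway MPG" versus "city mpg"). The sketch below, on a made-up toy frame, checks the rename keys first and uses `errors="ignore"` when dropping. The variable names `to_drop`, `renames`, and `missing` are hypothetical helpers for this example.

```python
import pandas as pd

# Toy frame with a few of the original column names (values are made up).
toy = pd.DataFrame({
    "Make": ["BMW", "Audi"],
    "Engine HP": [335.0, 220.0],
    "highway MPG": [26, 31],
    "city mpg": [19, 22],
    "Popularity": [3916, 3105],
})

to_drop = ["Popularity", "Vehicle Size"]  # "Vehicle Size" is absent from the toy frame
renames = {"Engine HP": "HP", "highway MPG": "MPG-H", "city mpg": "MPG-C"}

# rename() silently skips keys that do not match, so check them first
missing = [name for name in renames if name not in toy.columns]
print("rename keys not found:", missing)

# errors="ignore" drops the columns that exist and skips the rest
toy = toy.drop(columns=to_drop, errors="ignore").rename(columns=renames)
print(list(toy.columns))
```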

6. Dropping the duplicate rows

df.shape
(11914, 10)

duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)
number of duplicate rows:  (989, 10)

Now, let us remove the duplicate data.

df.count()
Make            11914
Model           11914
Year            11914
HP              11845
Cylinders       11884
Transmission    11914
Drive Mode      11914
MPG-H           11914
MPG-C           11914
Price           11914
dtype: int64

As shown above, there are 11914 rows, of which 989 are duplicates that we now remove.

df = df.drop_duplicates()
df.head(5)

df.count()
Make            10925
Model           10925
Year            10925
HP              10856
Cylinders       10895
Transmission    10925
Drive Mode      10925
MPG-H           10925
MPG-C           10925
Price           10925
dtype: int64
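To make the behaviour of `duplicated()` and `drop_duplicates()` concrete, here is a minimal sketch on a made-up toy frame. A row counts as a duplicate only when every column matches an earlier row, and the first occurrence is kept by default.

```python
import pandas as pd

# Made-up rows: the first two are identical, the last differs only in Price
toy = pd.DataFrame({
    "Make": ["BMW", "BMW", "Audi", "BMW"],
    "Model": ["1 Series", "1 Series", "A4", "1 Series"],
    "Price": [46135, 46135, 40650, 36350],
})

# duplicated() marks every repeat of an earlier row; the first occurrence is kept
n_dupes = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print(n_dupes)
print(len(toy), "->", len(deduped))
```

This mirrors the counts in the article: 11914 rows minus 989 duplicates leaves 10925.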

7. Dropping the missing or null values

print(df.isnull().sum())
Make             0
Model            0 
Year             0 
HP              69 
Cylinders       30 
Transmission     0 
Drive Mode       0 
MPG-H            0 
MPG-C            0 
Price            0 
dtype: int64

This is why, in the count from the previous step, Horsepower (HP) and Cylinders showed only 10856 and 10895 values respectively, out of 10925 rows.

df = df.dropna()
df.count()
Make            10827
Model           10827 
Year            10827 
HP              10827 
Cylinders       10827 
Transmission    10827 
Drive Mode      10827 
MPG-H           10827 
MPG-C           10827 
Price           10827 
dtype: int64

Now we have removed all the rows that contain Null or N/A values in the Cylinders and Horsepower (HP) columns.

print(df.isnull().sum())
Make            0
Model           0 
Year            0 
HP              0 
Cylinders       0 
Transmission    0 
Drive Mode      0 
MPG-H           0 
MPG-C           0 
Price           0 
dtype: int64
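Dropping rows is reasonable here because the loss is small: going from 10925 to 10827 rows removes 98 rows, a little under 1% of the data. A quick way to judge this before dropping is to look at the share of missing values per column, sketched below on made-up toy data.

```python
import numpy as np
import pandas as pd

# Made-up values with one gap in HP and one in Cylinders
toy = pd.DataFrame({
    "HP": [335.0, np.nan, 220.0, 300.0],
    "Cylinders": [6.0, 4.0, np.nan, 6.0],
    "Price": [46135, 40650, 36350, 29450],
})

# Fraction of missing values per column, as a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# Rows that would be lost if every row with any missing value were dropped
rows_lost = len(toy) - len(toy.dropna())
print(rows_lost)
```

If the share of missing values were much larger, filling the gaps (for example with a column median) might be worth considering instead, but that is a judgment call outside the steps shown here.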

8. Detecting Outliers

sns.boxplot(x=df['Price'])

sns.boxplot(x=df['HP'])

sns.boxplot(x=df['Cylinders'])

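numeric_df = df.select_dtypes(include="number")  # quantiles only make sense for numeric columns
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
Year             9.0
HP             130.0
Cylinders        2.0
MPG-H            8.0
MPG-C            6.0
Price        21327.5
dtype: float64

# Keep only the rows where no numeric column falls outside the 1.5 * IQR fences
df = df[~((numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))).any(axis=1)]
df.shape
(9191, 10)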
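The filter above applies the IQR rule: a value counts as an outlier when it lies more than 1.5 times the interquartile range below the first quartile or above the third. The sketch below shows the same rule for a single column. The helper `iqr_bounds` and the prices are made up for illustration.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the IQR rule for one numeric column."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up prices with one obviously extreme value
prices = pd.Series([20000, 22000, 25000, 27000, 30000, 500000])
lower, upper = iqr_bounds(prices)

# Keep only the values inside the fences
kept = prices[(prices >= lower) & (prices <= upper)]
print(lower, upper)
print(kept.tolist())
```

Note that the article's filter drops a row if any one of its numeric columns is an outlier, which is why the row count falls from 10827 to 9191.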

9. Plot different features against one another (scatter), against frequency (histogram)

Histogram

A histogram shows how often values occur within each interval. Here the variable is categorical, so the plot is strictly a bar chart of counts per manufacturer. The data set covers many car manufacturers, and it is often useful to know which of them have the most cars. The plot below is a simple way to see the number of cars in the data set for each make (up to the 40 most frequent).

df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
plt.xlabel('Make');
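The numbers behind the bars come from `value_counts()`, which can also be inspected directly. A minimal sketch on a made-up list of makes, showing both raw counts and percentage shares:

```python
import pandas as pd

# Made-up makes for illustration
makes = pd.Series(["Ford", "BMW", "Ford", "Audi", "Ford", "BMW"], name="Make")

counts = makes.value_counts()                      # sorted from most to least frequent
share = makes.value_counts(normalize=True) * 100   # same ranking, as percentages

print(counts.head(2))
print(share.round(1))
```

Looking at shares rather than raw counts can make it easier to compare makes once rows have been removed in earlier steps.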

Heat Maps

A heat map is a useful type of plot when we want to see which variables are related to one another. It is one of the most convenient ways to inspect the correlations between features. In the heat map below, we can see that the Price feature is most strongly correlated with Horsepower and Cylinders.

plt.figure(figsize=(10,5))
c = df.corr(numeric_only=True)  # correlate numeric columns only; text columns are skipped
sns.heatmap(c, cmap="BrBG", annot=True)
c
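The same correlation matrix can be read numerically instead of by colour. The sketch below, on made-up toy values, pulls out the Price column of the matrix and ranks the other features by the strength of their correlation with it. The names `with_price` and `ranked` are hypothetical.

```python
import pandas as pd

# Made-up values for illustration; all columns are numeric
toy = pd.DataFrame({
    "HP": [150, 200, 250, 300, 350],
    "Cylinders": [4, 4, 6, 6, 8],
    "MPG-C": [30, 26, 22, 19, 15],
    "Price": [20000, 26000, 33000, 41000, 52000],
})

corr = toy.corr()

# Correlation of every other column with Price, strongest (by magnitude) first
with_price = corr["Price"].drop("Price")
ranked = with_price.reindex(with_price.abs().sort_values(ascending=False).index)
print(ranked)
```

Ranking by absolute value keeps strong negative correlations (such as fuel economy against price in this toy data) near the top, where they would otherwise sort last. Correlation only measures linear association, so it does not by itself show that one feature causes another.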

Scatterplot

We generally use scatter plots to examine the relationship between two variables. Here we plot Horsepower against Price, as shown below. The points follow a clear enough pattern that a trend line can easily be drawn through them.

fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['HP'], df['Price'])
ax.set_xlabel('HP')
ax.set_ylabel('Price')
plt.show()
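One simple way to obtain such a trend line is a least-squares straight-line fit. The sketch below uses `np.polyfit` on made-up (HP, Price) pairs. A straight line is an assumption here, and the real data may be better described by a curve.

```python
import numpy as np

# Made-up (HP, Price) pairs for illustration
hp = np.array([150.0, 200.0, 250.0, 300.0, 350.0])
price = np.array([20000.0, 26000.0, 33000.0, 41000.0, 52000.0])

# A degree-1 polyfit returns the least-squares slope and intercept
slope, intercept = np.polyfit(hp, price, 1)
fitted = slope * hp + intercept  # points on the trend line

print(round(slope, 1), round(intercept, 1))
```

With the article's data, the same fit could be computed from `df['HP']` and `df['Price']`, and the line overlaid on the scatter plot with `ax.plot(...)` before calling `plt.show()`.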

Conclusion

These are some of the general steps you should follow to perform Exploratory Data Analysis. There are many more, but this should give a good idea of how to perform EDA on a given data set.

GitHub - yashptl2611/DS_Practical-12

Thank You!