Data Science: Exploratory Data Analysis
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is the process of gaining a better understanding of a data set by summarizing and visualizing its main properties. This stage is critical, especially when preparing data for Machine Learning. Histograms, box plots, scatter plots, and other plots are common EDA tools. Exploring the data can take a long time, but it helps us define the problem statement for the data we have collected.
What data are we exploring today?
Here we use a data set of cars from Kaggle, where it can be downloaded. In brief, the data set contains more than 10,000 rows and more than 10 columns describing features of each car, such as Engine Fuel Type, Engine Size, HP, Transmission Type, highway MPG, city MPG, and many more.
1. Importing the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)
2. Loading the data set
df = pd.read_csv("Data.csv")
df.head(5) # To display the top 5 rows
df.tail(5) # To display the bottom 5 rows
3. Checking the types of data
df.dtypes
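Checking the types matters because a column that looks numeric may have been read as text, and numeric operations such as quantiles or correlations would then skip it or fail. The following is a minimal sketch on a hypothetical two-row frame (the values and the string-typed MSRP column are assumptions for illustration, not a claim about how the cars data set is stored):

```python
import pandas as pd

# Hypothetical toy frame: MSRP was read as text instead of as a number.
toy = pd.DataFrame({"Make": ["BMW", "Audi"], "MSRP": ["46135", "29450"]})

# Before conversion the column is not numeric.
was_numeric = pd.api.types.is_numeric_dtype(toy["MSRP"])

# pd.to_numeric parses the strings into integers (or floats where needed).
toy["MSRP"] = pd.to_numeric(toy["MSRP"])
is_numeric = pd.api.types.is_numeric_dtype(toy["MSRP"])

print(toy.dtypes)
print("numeric before:", was_numeric, "numeric after:", is_numeric)
```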
4. Dropping irrelevant columns
df = df.drop(['Engine Fuel Type', 'Market Category', 'Vehicle Style', 'Popularity', 'Number of Doors', 'Vehicle Size'], axis=1)
df.head(5)
5. Renaming the columns
df = df.rename(columns={"Engine HP": "HP", "Engine Cylinders": "Cylinders", "Transmission Type": "Transmission", "Driven_Wheels": "Drive Mode","highway MPG": "MPG-H", "city mpg": "MPG-C", "MSRP": "Price" })
df.head(5)
6. Dropping the duplicate rows
df.shape
(11914, 10)
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)
number of duplicate rows:  (989, 10)
Now, let us remove the duplicate data.
df.count()
Make 11914
Model 11914
Year 11914
HP 11845
Cylinders 11884
Transmission 11914
Drive Mode 11914
MPG-H 11914
MPG-C 11914
Price 11914
dtype: int64
The data set has 11914 rows, of which 989 are duplicates, so 10925 rows should remain after removing them.
df = df.drop_duplicates()
df.head(5)
df.count()
Make 10925
Model 10925
Year 10925
HP 10856
Cylinders 10895
Transmission 10925
Drive Mode 10925
MPG-H 10925
MPG-C 10925
Price 10925
dtype: int64
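To see how duplicated() and drop_duplicates() relate, here is a small sketch on hypothetical toy data (the rows are made up for illustration). It shows that duplicated() marks only the repeats, not the first occurrence, which is why the row count drops by exactly the number of flagged rows (11914 - 989 = 10925 in the data set above):

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 3 repeat row 0.
toy = pd.DataFrame(
    {
        "Make": ["BMW", "BMW", "Audi", "BMW"],
        "Model": ["1 Series", "1 Series", "A3", "1 Series"],
        "Year": [2011, 2011, 2012, 2011],
    }
)

# duplicated() flags every repeat of an earlier row; the first occurrence stays False.
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates()

print("duplicate rows:", n_duplicates)
print("rows before:", len(toy), "rows after:", len(deduped))
```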
7. Dropping the missing or null values
print(df.isnull().sum())
Make 0
Model 0
Year 0
HP 69
Cylinders 30
Transmission 0
Drive Mode 0
MPG-H 0
MPG-C 0
Price 0
dtype: int64
This explains why, in the previous step, HP and Cylinders showed counts of 10856 and 10895 rather than 10925: they contain 69 and 30 missing values respectively.
df = df.dropna()
df.count()
Make 10827
Model 10827
Year 10827
HP 10827
Cylinders 10827
Transmission 10827
Drive Mode 10827
MPG-H 10827
MPG-C 10827
Price 10827
dtype: int64
Now we have removed all the rows that contain null or N/A values, which occurred only in the Cylinders and HP columns.
print(df.isnull().sum())
Make 0
Model 0
Year 0
HP 0
Cylinders 0
Transmission 0
Drive Mode 0
MPG-H 0
MPG-C 0
Price 0
dtype: int64
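One detail in the numbers above is worth a closer look: there are 69 + 30 = 99 missing values, yet dropna() removed only 10925 - 10827 = 98 rows, which means one row is missing both HP and Cylinders. The sketch below reproduces that effect on hypothetical toy data (the values are made up): the number of missing cells can exceed the number of rows that dropna() removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 2 is missing both HP and Cylinders.
toy = pd.DataFrame(
    {
        "HP": [335.0, np.nan, np.nan, 230.0, 300.0],
        "Cylinders": [6.0, 6.0, np.nan, np.nan, 4.0],
        "Price": [46135, 40650, 36350, 29450, 34500],
    }
)

# Missing cells per column, as in print(df.isnull().sum()).
missing_per_column = toy.isnull().sum()

# Rows with at least one missing value; this is what dropna() removes by default.
rows_with_missing = int(toy.isnull().any(axis=1).sum())

cleaned = toy.dropna()

print(missing_per_column)
print("missing cells:", int(missing_per_column.sum()))
print("rows dropped:", len(toy) - len(cleaned))
```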
8. Detecting Outliers
sns.boxplot(x=df['Price'])
sns.boxplot(x=df['HP'])
sns.boxplot(x=df['Cylinders'])
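numeric_df = df.select_dtypes(include="number")  # quantiles are only defined for numeric columns
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
Year 9.0
HP 130.0
Cylinders 2.0
MPG-H 8.0
MPG-C 6.0
Price 21327.5
dtype: float64
# Keep only rows where every numeric value lies within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
df = df[~((numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))).any(axis=1)]
df.shape
(9191, 10)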
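The IQR rule used above can be wrapped in a small helper so the thresholds are easier to inspect. This is a minimal sketch, not part of the original notebook: the function name iqr_outlier_mask, the parameter k, and the toy prices are all hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Return True for rows where any numeric value lies outside
    [Q1 - k * IQR, Q3 + k * IQR] of its column."""
    num = frame.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    outside = (num < (q1 - k * iqr)) | (num > (q3 + k * iqr))
    return outside.any(axis=1)


# Hypothetical toy data with one extreme price.
toy = pd.DataFrame(
    {
        "Make": list("ABCDEFG"),
        "Price": [20000, 22000, 24000, 26000, 28000, 30000, 500000],
    }
)

mask = iqr_outlier_mask(toy)
filtered = toy[~mask]

print("outlier rows:", int(mask.sum()))
print(filtered)
```

In this toy data Q1 is 23000 and Q3 is 29000, so the IQR is 6000 and the accepted range is 14000 to 38000; only the 500000 row falls outside it. Note that a row is dropped if any one of its numeric columns is an outlier, which is why the cars data set shrinks so much at this step.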
9. Plot different features against one another (scatter), against frequency (histogram)
Histogram
A histogram shows how often values occur within each interval. The data set contains cars from many different manufacturers, and it is often useful to know which of them has the most cars. A simple way to find out is to plot the number of cars per make. Strictly speaking, because Make is a categorical column, the plot below is a bar chart of counts rather than a histogram of a numeric variable.
df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
plt.xlabel('Make');
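The numbers behind the bar chart come from value_counts(), which counts how often each make appears and sorts the result in descending order. A minimal sketch on a hypothetical toy column (the makes listed are made up for illustration):

```python
import pandas as pd

# Hypothetical toy column of car makes.
makes = pd.Series(["Ford", "Ford", "BMW", "Audi", "Ford", "BMW"], name="Make")

# Count occurrences of each make; the result is sorted from most to least frequent.
counts = makes.value_counts()

# nlargest(n) keeps only the n most frequent makes, as in the plot above.
top2 = counts.nlargest(2)

print(counts)
print(top2)
```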
Heat Maps
A heat map is a useful plot for finding which variables are related to one another. It is one of the best ways to see the relationships between features at a glance. In the heat map below, we can see that the Price feature depends mainly on Engine Size, Horsepower, and Cylinders.
plt.figure(figsize=(10,5))
c = df.corr(numeric_only=True)  # correlation is only defined for the numeric columns
sns.heatmap(c, cmap="BrBG", annot=True)
c
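Besides reading the heat map by eye, the correlation matrix can be sorted to list the features most strongly related to Price. The following is a minimal sketch on hypothetical toy data (the values are made up, and the variable name price_corr is an assumption); numeric_only=True assumes a reasonably recent pandas version.

```python
import pandas as pd

# Hypothetical toy data using the renamed column names from this article.
toy = pd.DataFrame(
    {
        "Make": ["A", "B", "C", "D", "E"],
        "HP": [100, 150, 200, 250, 300],
        "Cylinders": [4, 4, 6, 6, 8],
        "MPG-C": [30, 27, 22, 19, 15],
        "Price": [15000, 21000, 30000, 38000, 52000],
    }
)

# Pearson correlation between the numeric columns; text columns are skipped.
corr = toy.corr(numeric_only=True)

# Correlation of every other feature with Price, strongest (by absolute value) first.
price_corr = (
    corr["Price"]
    .drop("Price")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(price_corr)
```

A negative value, as for MPG-C in this toy data, means the feature tends to fall as Price rises; its absolute value still indicates how strong the linear relationship is.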
Scatterplot
We generally use scatter plots to examine the relationship between two variables. Here Horsepower is plotted against Price. The points are spread well enough that a trend line can easily be drawn through them.
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['HP'], df['Price'])
ax.set_xlabel('HP')
ax.set_ylabel('Price')
plt.show()
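The trend line mentioned above can be estimated with a least-squares fit. This is a minimal sketch that uses np.polyfit on hypothetical, exactly linear toy values (they are not taken from the cars data set); a degree-1 fit returns the slope and intercept of the line.

```python
import numpy as np

# Hypothetical toy values: price rises by 100 for every extra unit of horsepower.
hp = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
price = np.array([15000.0, 20000.0, 25000.0, 30000.0, 35000.0])

# Fit a straight line price = slope * hp + intercept by least squares.
slope, intercept = np.polyfit(hp, price, deg=1)

# Points on the fitted line, one for each hp value.
trend = slope * hp + intercept

print("slope:", slope, "intercept:", intercept)
```

With the real data, the same fit could be computed from df['HP'] and df['Price'], and the line overlaid on the scatter plot with ax.plot(hp, trend) before calling plt.show().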
Conclusion
These are some of the general steps you should follow when performing exploratory data analysis. There are more techniques to explore, but the steps above give a good idea of how to perform EDA on any data set.
Thank You!