# Data Science: Exploratory Data Analysis

---

# What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is the process of getting to know a data set by analyzing and visualizing its main properties. This stage is critical, especially when the data will later be used for Machine Learning. Histograms, box plots, scatter plots, and other plots are the usual tools. Exploring the data can take a long time, but it is also how we arrive at a clear problem statement for the data set at hand.

# What data are we exploring today?

For this walkthrough we use a data set of cars from Kaggle. The data set can be downloaded from **here**. It contains more than 10,000 rows and more than 10 columns describing features of each car, such as Engine Fuel Type, Engine Size, HP, Transmission Type, highway MPG, city MPG, and many more.

**1. Importing the required libraries**

`import pandas as pd`

`import numpy as np`

`import seaborn as sns`

`import matplotlib.pyplot as plt`

`%matplotlib inline`

`sns.set(color_codes=True)`

**2. Loading the data set**

`df = pd.read_csv("Data.csv")`

`df.head(5) # To display the top 5 rows`

`df.tail(5) # To display the bottom 5 rows`

**3. Checking the types of data**

`df.dtypes`
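To illustrate why this check matters, here is a minimal sketch on a tiny made-up CSV (not the Kaggle file). It assumes a numeric column was read as text because of a stray `N/A` string, and uses `pd.to_numeric` to repair it. The rows and values are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real file.
csv_text = """Make,Year,Engine HP,MSRP
BMW,2011,335,46135
Audi,2015,N/A,40650
Fiat,2017,160,27495
"""

# keep_default_na=False leaves "N/A" as a plain string,
# so the whole "Engine HP" column is loaded as text.
toy = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(toy.dtypes)

# Coerce to numbers; anything that cannot be parsed becomes NaN.
toy["Engine HP"] = pd.to_numeric(toy["Engine HP"], errors="coerce")
print(toy.dtypes)
```

After the conversion the column is a float type and the unparseable entry shows up as a missing value, which the later steps can handle.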

**4. Dropping irrelevant columns**

`df = df.drop(['Engine Fuel Type', 'Market Category', 'Vehicle Style', 'Popularity', 'Number of Doors', 'Vehicle Size'], axis=1)`

`df.head(5)`

**5. Renaming the columns**

`df = df.rename(columns={"Engine HP": "HP", "Engine Cylinders": "Cylinders", "Transmission Type": "Transmission", "Driven_Wheels": "Drive Mode","highway MPG": "MPG-H", "city mpg": "MPG-C", "MSRP": "Price" })`

`df.head(5)`
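The drop and rename steps can be tried on a small frame first. The sketch below uses a made-up two-row frame that borrows a few of the original column names; the values are invented.

```python
import pandas as pd

# Hypothetical frame with a few of the original column names.
toy = pd.DataFrame({
    "Make": ["BMW", "Audi"],
    "Engine HP": [335.0, 200.0],
    "Popularity": [3916, 3105],
    "MSRP": [46135, 40650],
})

# Drop a column we do not need, then shorten the remaining names.
toy = toy.drop(columns=["Popularity"])
toy = toy.rename(columns={"Engine HP": "HP", "MSRP": "Price"})

print(list(toy.columns))
```

Note that `rename` silently ignores keys that match no column, so it is worth checking the column list afterwards to catch a misspelled name.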

**6. Dropping the duplicate rows**

`df.shape`

(11914, 10)

`duplicate_rows_df = df[df.duplicated()]`

`print("number of duplicate rows: ", duplicate_rows_df.shape)`

number of duplicate rows: (989, 10)

Now, let us remove the duplicate data.

`df.count()`

Make 11914

Model 11914

Year 11914

HP 11845

Cylinders 11884

Transmission 11914

Drive Mode 11914

MPG-H 11914

MPG-C 11914

Price 11914

dtype: int64

The data set has 11914 rows, and 989 of them are duplicates that we now remove.

`df = df.drop_duplicates()`

`df.head(5)`

`df.count()`

Make 10925

Model 10925

Year 10925

HP 10856

Cylinders 10895

Transmission 10925

Drive Mode 10925

MPG-H 10925

MPG-C 10925

Price 10925

dtype: int64
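The same bookkeeping (11914 rows minus 989 duplicates leaves 10925) can be seen on a toy frame. This is a minimal sketch with invented rows, showing that `duplicated()` flags only the repeats and not the first copy.

```python
import pandas as pd

# Made-up rows: the BMW entry appears three times.
toy = pd.DataFrame({
    "Make": ["BMW", "BMW", "Audi", "BMW"],
    "Model": ["1 Series", "1 Series", "A4", "1 Series"],
    "Price": [46135, 46135, 40650, 46135],
})

# duplicated() marks every repeat of an earlier row; the first copy is not flagged.
dup_mask = toy.duplicated()
print("duplicate rows:", dup_mask.sum())

# drop_duplicates() keeps the first copy of each distinct row.
deduped = toy.drop_duplicates()
print("rows before:", len(toy), "rows after:", len(deduped))
```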

**7. Dropping the missing or null values**

`print(df.isnull().sum())`

Make 0

Model 0

Year 0

HP 69

Cylinders 30

Transmission 0

Drive Mode 0

MPG-H 0

MPG-C 0

Price 0

dtype: int64

This is why, in the count above, Horsepower (HP) and Cylinders showed only 10856 and 10895 values respectively, out of 10925 rows.

`df = df.dropna()`

`df.count()`

Make 10827

Model 10827

Year 10827

HP 10827

Cylinders 10827

Transmission 10827

Drive Mode 10827

MPG-H 10827

MPG-C 10827

Price 10827

dtype: int64

We have now removed every row that contained a Null or N/A value in the Cylinders or Horsepower (HP) columns.

`print(df.isnull().sum())`

Make 0

Model 0

Year 0

HP 0

Cylinders 0

Transmission 0

Drive Mode 0

MPG-H 0

MPG-C 0

Price 0

dtype: int64
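Dropping rows is the approach taken here, but it is not the only one. The sketch below, on made-up values, compares `dropna()` with filling gaps by the column median. The median fill is an assumption added for comparison, not something this walkthrough applies to the car data.

```python
import numpy as np
import pandas as pd

# Invented values with one gap in HP and one in Cylinders.
toy = pd.DataFrame({
    "HP": [335.0, np.nan, 160.0, 200.0],
    "Cylinders": [6.0, 4.0, np.nan, 4.0],
    "Price": [46135, 40650, 27495, 31500],
})

# Count missing values per column.
print(toy.isnull().sum())

# Option used in this article: drop any row with a missing value.
dropped = toy.dropna()

# Alternative (an assumption, not the article's choice): fill with the column median.
filled = toy.fillna(toy.median(numeric_only=True))

print("rows after dropna:", len(dropped), "rows after fillna:", len(filled))
```

Dropping is simple and safe when few rows are affected, as is the case here; filling keeps more rows at the cost of introducing estimated values.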

**8. Detecting Outliers**

`sns.boxplot(x=df['Price'])`

`sns.boxplot(x=df['HP'])`

`sns.boxplot(x=df['Cylinders'])`

`Q1 = df.quantile(0.25, numeric_only=True)`

`Q3 = df.quantile(0.75, numeric_only=True)`

`IQR = Q3 - Q1`

`print(IQR)`

Year 9.0

HP 130.0

Cylinders 2.0

MPG-H 8.0

MPG-C 6.0

Price 21327.5

dtype: float64

`num = df[IQR.index] # numeric columns only`

`df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]`

`df.shape`

(9191, 10)
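The IQR rule used above keeps values inside the fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. Here is a minimal sketch of the same rule on a single made-up price column, so the fences are easy to inspect. The numbers are invented.

```python
import pandas as pd

# Five ordinary prices and one extreme value (all made up).
prices = pd.Series([20000, 22000, 25000, 27000, 30000, 250000], name="Price")

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

kept = prices[(prices >= lower) & (prices <= upper)]
print(f"IQR={iqr}, fences=({lower}, {upper}), kept {len(kept)} of {len(prices)}")
```

Only the 250000 entry falls outside the fences, so it is the one value removed.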

**9. Plot different features against one another (scatter), against frequency (histogram)**

**Histogram**

A histogram shows how often the values of a variable occur within each interval. In this case, the data set contains cars from many different manufacturers, and it is often useful to know which make has the most cars. A simple way to find out is to count the number of cars in the data set for each make and plot the counts as bars.

`df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))`

`plt.title("Number of cars by make")`

`plt.ylabel('Number of cars')`

`plt.xlabel('Make');`
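The counting step behind the bar chart can be checked on its own. This sketch uses a made-up list of makes and shows what `value_counts().nlargest(n)` returns before anything is plotted.

```python
import pandas as pd

# Invented sample of makes.
makes = pd.Series(["Ford", "BMW", "Ford", "Audi", "Ford", "BMW"], name="Make")

# value_counts() tallies each make; nlargest(2) keeps the two most common.
top = makes.value_counts().nlargest(2)
print(top)
```

Calling `.plot(kind='bar')` on a result like `top` draws one bar per make, which is what the plotting code above does with the 40 most common makes.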

**Heat Maps**

A heat map is a useful plot for finding which variables are related to one another. It is one of the best ways to see the relationships between features at a glance, because each cell shows the correlation between a pair of numeric columns. In the heat map below, Price is most strongly correlated with the engine-related features, Horsepower and Cylinders.

`plt.figure(figsize=(10,5))`

`c = df.select_dtypes(include="number").corr() # correlate numeric columns only`

`sns.heatmap(c, cmap="BrBG", annot=True)`

`c`
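The matrix that feeds the heat map can also be read directly. The sketch below builds a correlation matrix on a small made-up frame and ranks the features by the strength of their correlation with Price. The values are invented and the ranking step is an illustrative addition.

```python
import pandas as pd

# Invented rows using the renamed column names.
toy = pd.DataFrame({
    "Make": ["BMW", "Audi", "Fiat", "Ford"],
    "HP": [335, 200, 160, 300],
    "Cylinders": [6, 4, 4, 8],
    "MPG-C": [19, 24, 28, 15],
    "Price": [46135, 40650, 27495, 35000],
})

# Keep numeric columns only; text columns such as Make cannot be correlated.
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))

# Rank the other features by the absolute value of their correlation with Price.
ranking = corr["Price"].drop("Price").abs().sort_values(ascending=False)
print(ranking)
```

Values near 1 or -1 indicate a strong linear relationship and values near 0 a weak one. A high correlation does not by itself show that one feature causes the other.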

**Scatterplot**

We generally use scatter plots to examine the correlation between two variables. Here Horsepower is plotted against Price. The points follow a clear enough pattern that a trend line can easily be drawn through them.

`fig, ax = plt.subplots(figsize=(10,6))`

`ax.scatter(df['HP'], df['Price'])`

`ax.set_xlabel('HP')`

`ax.set_ylabel('Price')`

`plt.show()`
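One way to obtain the trend line mentioned above is a least-squares straight-line fit. This is a minimal sketch using `np.polyfit` on made-up horsepower and price values; choosing a degree-1 fit is an assumption made for illustration.

```python
import numpy as np

# Made-up horsepower and price values, for illustration only.
hp = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
price = np.array([20000.0, 26000.0, 29000.0, 36000.0, 40000.0])

# A degree-1 polyfit returns the slope and intercept of the least-squares line.
slope, intercept = np.polyfit(hp, price, 1)

# Predicted price at each horsepower value along the fitted line.
trend = slope * hp + intercept

print(f"price ~ {slope:.1f} * HP + {intercept:.1f}")
```

Passing the `hp` and `trend` arrays to `ax.plot` on the same axes as the scatter plot would draw the fitted line over the points.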

**Conclusion**

These are some of the general steps involved in Exploratory Data Analysis. There are many more, but the steps above are enough to give a good idea of how to perform EDA on any data set.

Thank You!