Data Science: Exploratory Data Analysis

Image Credit: AnalyticsLearn

What is Exploratory Data Analysis?

What data are we exploring today?

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)
df = pd.read_csv("Data.csv")
df.head(5) # To display the top 5 rows
df.tail(5) # To display the bottom 5 rows
df.dtypes
df = df.drop(['Engine Fuel Type', 'Market Category', 'Vehicle Style', 'Popularity', 'Number of Doors', 'Vehicle Size'], axis=1)
df.head(5)
df = df.rename(columns={"Engine HP": "HP", "Engine Cylinders": "Cylinders", "Transmission Type": "Transmission", "Driven_Wheels": "Drive Mode", "highway MPG": "MPG-H", "city mpg": "MPG-C", "MSRP": "Price"})
df.head(5)
df.shape
(11914, 10)
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)
number of duplicate rows: (989, 10)
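To see what `duplicated()` actually flags, here is a minimal sketch on a made-up three-row frame. The `toy` data is hypothetical and not taken from Data.csv. With the default settings, the first occurrence of a row is kept and only the later copies are marked as duplicates.

```python
import pandas as pd

# Hypothetical toy data, not taken from Data.csv
toy = pd.DataFrame({"Make": ["BMW", "BMW", "Audi"], "HP": [300, 300, 220]})

dup_mask = toy.duplicated()          # True only for the second, repeated BMW row
n_duplicates = int(dup_mask.sum())   # number of rows that repeat an earlier row
deduped = toy.drop_duplicates()      # keeps the first occurrence of each row

print(n_duplicates, deduped.shape)
```

The same pattern applies to the car data: the number of rows in `df[df.duplicated()]` is how many rows `drop_duplicates()` will remove.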
df.count()
Make 11914
Model 11914
Year 11914
HP 11845
Cylinders 11884
Transmission 11914
Drive Mode 11914
MPG-H 11914
MPG-C 11914
Price 11914
dtype: int64
df = df.drop_duplicates()
df.head(5)
df.count()
Make 10925
Model 10925
Year 10925
HP 10856
Cylinders 10895
Transmission 10925
Drive Mode 10925
MPG-H 10925
MPG-C 10925
Price 10925
dtype: int64
print(df.isnull().sum())
Make 0
Model 0
Year 0
HP 69
Cylinders 30
Transmission 0
Drive Mode 0
MPG-H 0
MPG-C 0
Price 0
dtype: int64
df = df.dropna() 
df.count()
Make 10827
Model 10827
Year 10827
HP 10827
Cylinders 10827
Transmission 10827
Drive Mode 10827
MPG-H 10827
MPG-C 10827
Price 10827
dtype: int64
print(df.isnull().sum())
Make 0
Model 0
Year 0
HP 0
Cylinders 0
Transmission 0
Drive Mode 0
MPG-H 0
MPG-C 0
Price 0
dtype: int64
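One detail worth noticing: `dropna()` removes every row that has at least one missing value, so the number of rows dropped does not have to equal the sum of the per-column counts. In the output above, 69 + 30 = 99 cells were missing but the row count only fell by 98 (10925 to 10827), which suggests that one row was missing both HP and Cylinders. A minimal sketch on hypothetical toy data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "HP": [300.0, np.nan, 220.0],
    "Cylinders": [6.0, 4.0, np.nan],
})

missing_per_column = toy.isnull().sum()   # count of NaN values in each column
complete_rows = toy.dropna()              # keep only rows with no NaN at all

print(missing_per_column)
print(len(toy) - len(complete_rows), "rows dropped")
```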
sns.boxplot(x=df['Price'])
Box Plot of Price
sns.boxplot(x=df['HP'])
Box Plot of Horsepower
sns.boxplot(x=df['Cylinders'])
Box Plot of Cylinders
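# Quartiles are only defined for numeric columns, so select those first
numeric_df = df.select_dtypes(include="number")
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1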
print(IQR)
Year 9.0
HP 130.0
Cylinders 2.0
MPG-H 8.0
MPG-C 6.0
Price 21327.5
dtype: float64
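# Compare only the columns that have quartiles (Q1.index); a row is dropped if any of them is an outlier
df = df[~((df[Q1.index] < (Q1 - 1.5 * IQR)) | (df[Q1.index] > (Q3 + 1.5 * IQR))).any(axis=1)]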
df.shape
(9191, 10)
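The filter above treats a value as an outlier when it lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. To make the arithmetic concrete, here is a minimal sketch for a single column. The prices are hypothetical, chosen so that one value is an obvious outlier, and `lower`, `upper` and `kept` are illustrative names only.

```python
import pandas as pd

# Hypothetical prices; 500000 is an obvious outlier
toy = pd.DataFrame({"Price": [20000, 22000, 25000, 27000, 30000, 500000]})

q1 = toy["Price"].quantile(0.25)   # first quartile
q3 = toy["Price"].quantile(0.75)   # third quartile
iqr = q3 - q1                      # interquartile range

lower = q1 - 1.5 * iqr             # values below this are treated as outliers
upper = q3 + 1.5 * iqr             # values above this are treated as outliers

# Keep rows whose price lies inside [lower, upper]
kept = toy[toy["Price"].between(lower, upper)]

print(lower, upper, len(kept))
```

On the full data set the same test is applied to every numeric column at once, and a row is removed as soon as any one of its values falls outside its column's bounds. That is why the row count drops as far as it does.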
df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
plt.xlabel('Make');
Bar chart of the number of cars by make
plt.figure(figsize=(10,5))
c = df.select_dtypes(include="number").corr()  # correlation is only defined for numeric columns
sns.heatmap(c,cmap="BrBG",annot=True)
c
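A heatmap shows every pair of columns at once. When the question is which features move with one particular column, it can help to pull out that column of the correlation matrix and sort it. A minimal sketch, assuming a small made-up frame (the `toy` values are hypothetical and only reuse the column names from this article):

```python
import pandas as pd

# Hypothetical toy data reusing the renamed column names
toy = pd.DataFrame({
    "HP": [100, 150, 200, 250, 300],
    "MPG-C": [30, 26, 22, 19, 15],
    "Price": [15000, 21000, 26000, 33000, 40000],
})

corr = toy.corr()  # Pearson correlation between every pair of columns

# Correlation of each other column with Price, strongest positive first
price_corr = corr["Price"].drop("Price").sort_values(ascending=False)

print(price_corr)
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one. The sign shows the direction.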
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['HP'], df['Price'])
ax.set_xlabel('HP')
ax.set_ylabel('Price')
plt.show()

Conclusion