Exploratory Data Analysis on Haberman Dataset
Exploratory Data Analysis (EDA) is the process of exploring data to draw meaningful conclusions from it: finding patterns, testing hypotheses, reducing dimensionality, handling missing values, and more.
This dataset records the survival of patients who underwent surgery for breast cancer. The case study was conducted between 1958 and 1970. It has 4 attributes:
1. age - Age of the patient at the time of operation
2. year - Year in which the patient was operated on
3. nodes - Number of nodes found (positive axillary nodes)
4. status - 1/2
1 - survived 5 years or longer
2 - died within 5 years
1. DATA EXPLORATION
Data exploration is the first step towards data analysis. We visualize the data and try to understand it.
# importing libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Reading Data
# Reading first 5 rows of Data
   age  year  nodes  status
0   30    64      1       1
1   30    62      3       1
2   30    65      0       1
3   31    59      2       1
4   31    65      4       1
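The cell that loaded the data is not shown here. A minimal sketch of how such a table could be read, assuming a headerless CSV whose columns are in the order age, year, nodes, status. The inline text stands in for the real file and contains only the five rows printed above; the file name mentioned in the comment is an assumption, not something the original states.

```python
import io
import pandas as pd

# Stand-in for the real file: the five rows shown in the output above.
# With the actual dataset, pass a file path (assumed, e.g. "haberman.csv") instead.
csv_text = "30,64,1,1\n30,62,3,1\n30,65,0,1\n31,59,2,1\n31,65,4,1\n"

# The raw file is assumed to have no header row, so column names are supplied.
data_sample = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["age", "year", "nodes", "status"],
)

# head() returns the first 5 rows by default.
first_rows = data_sample.head()
print(first_rows)
```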
# Class Labels
# IMBALANCED DATA
# 1 - lived 5 years or longer
# 2 - lived less than 5 years
Index(['age', 'year', 'nodes', 'status'], dtype='object')
Name: status, dtype: int64
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age 306 non-null int64
year 306 non-null int64
nodes 306 non-null int64
status 306 non-null int64
memory usage: 9.6 KB
IMBALANCED DATASET - The classes are unequally represented. For example, here class 1 has 225 records and class 2 has 81 records.
No. of rows - 306
No. of columns - 4
There are no missing values in any column
Column Names - age, year, nodes, status
All columns are of integer type.
No. of people with status '1' - 225 (Lived 5 years or longer)
No. of people with status '2' - 81 (Lived less than 5 years)
The dataset is imbalanced
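To put a number on the imbalance, the class counts and proportions can be computed with `value_counts`. The sketch below rebuilds the status column from the counts reported above rather than reading the real file, so the Series itself is a stand-in.

```python
import pandas as pd

# Rebuild the status column from the counts reported above (225 and 81).
status = pd.Series([1] * 225 + [2] * 81, name="status")

counts = status.value_counts()                 # absolute class counts
shares = status.value_counts(normalize=True)   # class proportions

# Ratio of majority class to minority class.
imbalance_ratio = counts.loc[1] / counts.loc[2]

print(counts)
print(shares.round(3))
print(round(imbalance_ratio, 2))
```

Roughly 73.5% of the records belong to class 1, so a model that always predicts class 1 would already look about 73.5% accurate. That is why accuracy alone can mislead on this dataset.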
# Dataset Description
age year nodes status
count 306.000000 306.000000 306.000000 306.000000
mean 52.457516 62.852941 4.026144 1.264706
std 10.803452 3.249405 7.189654 0.441899
min 30.000000 58.000000 0.000000 1.000000
25% 44.000000 60.000000 0.000000 1.000000
50% 52.000000 63.000000 1.000000 1.000000
75% 60.750000 65.750000 4.000000 2.000000
max 83.000000 69.000000 52.000000 2.000000
The describe function reports the count, mean, standard deviation, minimum, maximum, and quartiles of every column.
1. Average age of patients - 52 years
2. Average year of operation - 1963 (mean 62.85)
3. Average no. of nodes found in patient - 4
Range of Columns
1. AGE - (30-83)
2. YEAR - (1958-1969)
3. NODES - (0-52)
The mean number of nodes is 4, while the maximum is 52.
Also, the 75th percentile of nodes is only 4, so three quarters of patients have 4 or fewer nodes.
Therefore, there are likely outliers in the nodes column.
Similarly, the 75th percentile of age is about 60 years,
and the maximum age is 83 years.
Hence, there may be outliers there as well.
*NOTE*: These are assumptions based on intuition. They have not been verified yet.
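One common way to check such an intuition is the 1.5 x IQR rule (Tukey's fences). This rule is not used in the original analysis; it is brought in here only as an illustration. The sketch applies it to the quartiles, minima, and maxima printed in the describe table. The function name `tukey_fences` is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences; values outside them are flagged as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles, minima and maxima copied from the describe() table above.
nodes_low, nodes_high = tukey_fences(q1=0.0, q3=4.0)
age_low, age_high = tukey_fences(q1=44.0, q3=60.75)

nodes_has_high_outliers = 52.0 > nodes_high   # max nodes vs upper fence
age_has_high_outliers = 83.0 > age_high       # max age vs upper fence
age_has_low_outliers = 30.0 < age_low         # min age vs lower fence

print(nodes_high, nodes_has_high_outliers)
print(age_low, age_high, age_has_high_outliers, age_has_low_outliers)
```

Under this particular rule the upper fence for nodes is 10, so the maximum of 52 is flagged. The fences for age are 18.875 and 85.875, so neither the minimum (30) nor the maximum (83) is flagged. Other outlier definitions may give different answers.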
2. BIVARIATE ANALYSIS
Analysis of two variables is known as bivariate analysis. We can use 2D scatter plots to plot one variable against another.
2-D Scatter Plot - gives a visual picture of the relationship between two variables.
# Using Seaborn
# Between AGE and year
# hue on status (for different colors)
# 'height' replaces the older 'size' argument in seaborn 0.9 and later
sns.FacetGrid(data, hue="status", height=4).map(plt.scatter, "age", "year").add_legend();
# AGE and NODES
sns.FacetGrid(data, hue="status", height=4).map(plt.scatter, "age", "nodes").add_legend();
# YEAR and NODES
sns.FacetGrid(data, hue="status", height=4).map(plt.scatter, "year", "nodes").add_legend();
1. Most patients have fewer than 5 nodes.
2. Patients younger than 40 have a higher chance of survival and comparatively fewer nodes.
3. Patients with more than 10 nodes who are older than 50 have a lower chance of survival.
4. Most patients are between 40 and 65 years old.
Limitations of 2D scatter plots
Scatter plots are hard to interpret when the points overlap a lot.
As the number of features increases, the number of pairs increases, so plotting a 2D graph for each pair takes time.
To solve this problem we have pair plots, which are built into the seaborn library.
A pair plot draws all the variable pairs with one line of code.
Instead of plotting each pair separately, we can use pair plots.
Total no. of pair plots - nC2
Limitation - Not useful when the number of dimensions is high
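The nC2 count can be worked out directly. A short sketch using the three feature names from this dataset; the larger feature counts are only there to show how quickly the number of pairs grows.

```python
import math
from itertools import combinations

features = ["age", "year", "nodes"]

# Number of distinct feature pairs: n choose 2.
n_pairs = math.comb(len(features), 2)
pairs = list(combinations(features, 2))
print(n_pairs, pairs)

# The count grows quadratically with the number of features,
# which is why pair plots scale poorly to many dimensions.
pairs_for_10 = math.comb(10, 2)
pairs_for_100 = math.comb(100, 2)
print(pairs_for_10, pairs_for_100)
```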
sns.pairplot(data, hue="status", height=4, vars=['age','year','nodes']);
1. The number of patients drops sharply as the number of nodes goes from 0 to 4. Most patients have close to zero nodes.
2. Patients with fewer nodes have a higher chance of survival, and vice versa.
3. Among patients with more than 5 nodes, survivors are about half as many as deaths.
4. The survival rate is higher before 1965 and comparatively lower after that.
4. Univariate Analysis
Analyze each variable separately. Univariate analysis can be done using the following graphs.
Histogram
A histogram represents the distribution of numerical data and gives an estimate of a continuous variable’s probability distribution. The histogram is a univariate analysis. The range of values is divided into intervals, known as bins, and the values falling in each bin are counted. The problem with a 1D scatter plot is that the points overlap, which makes it difficult to choose a threshold. The farther apart the class distributions are, the easier it is to separate the classes.
Probability Density Function (PDF)
A PDF describes how likely values are to fall in a particular range. It is the function used to describe a continuous probability distribution, that is, the probabilities of random variables with continuous outcomes. The area under the curve always sums to 1.
Cumulative Distribution Function (CDF)
A CDF gives the percentage of values that are less than or equal to a particular value. Integrating the Probability Density Function gives the CDF. The CDF is also a univariate analysis.
# 1D scatter plot using one feature (AGE)
# loc with a boolean mask returns the rows where the condition is True
live_more = data.loc[data["status"] == 1]
live_less = data.loc[data["status"] == 2]
plt.plot(live_more["age"], np.zeros_like(live_more['age']), 'o')
plt.plot(live_less["age"], np.zeros_like(live_less['age']), 'o')
1. Too many overlapping points
2. Patients younger than 35 tend to survive longer
Probability Density Function
The PDF is the probability distribution that tells how many values lie within a range.
FacetGrid helps in visualizing one or more variables.
Seaborn's distplot draws a histogram with a density curve over it (newer seaborn releases deprecate distplot in favor of histplot and displot).
sns.FacetGrid(data, hue="status", height=4).map(sns.distplot, "age").add_legend();
1. The overlap indicates that the chance of survival cannot be determined clearly from age.
2. Patients below age 35 have a high chance of surviving.
3. Patients between ages 35 and 40 have almost double the survival rate.
4. Patients between ages 40 and 50 have a lower chance of surviving.
5. Patients between ages 50 and 65 have an almost equal chance of surviving or not.
6. Patients above 65 years have a low survival rate.
7. The number of patients increases up to age 50 and then decreases.
8. Most patients are between 40 and 70 years old.
However, age is not a strong determining factor, and no clear inferences can be made from it.
sns.FacetGrid(data, hue="status", height=4).map(sns.distplot, "year").add_legend();
This shows the survival rate by year of operation, which is not a reliable factor for deciding the chance of survival.
However, it can be seen that operations were less successful until 1960, after which the number of successful operations increased until 1963.
The rate of unsuccessful operations was high again between 1963 and 1967.
sns.FacetGrid(data, hue="status", height=4).map(sns.distplot, "nodes").add_legend();
1. The chance of survival decreases beyond 10 nodes.
2. The survival rate is almost negligible beyond 25 nodes.
Cumulative Distribution Function
It tells what percentage of the population is less than a particular value.
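# density=True returns densities; dividing by their sum gives per-bin probabilities (bins are equal width)
counts, bin_edges = np.histogram(live_more['nodes'], bins=10, density=True)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
print(pdf)
print(bin_edges)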
[0.83555556 0.08 0.02222222 0.02666667 0.01777778 0.00444444
0.00888889 0. 0. 0.00444444]
[ 0. 4.6 9.2 13.8 18.4 23. 27.6 32.2 36.8 41.4 46. ]
# Same computation for patients who lived less than 5 years
counts, bin_edges = np.histogram(live_less['nodes'], bins=10, density=True)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
print(pdf)
print(bin_edges)
[0.56790123 0.14814815 0.13580247 0.04938272 0.07407407 0.
0.01234568 0. 0. 0.01234568]
[ 0. 5.2 10.4 15.6 20.8 26. 31.2 36.4 41.6 46.8 52. ]
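About 83.6% of patients who lived 5 years or longer have fewer than 4.6 nodes, compared with about 56.8% of patients who lived less than 5 years having fewer than 5.2 nodes.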
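The CDF values themselves are not printed above, but they follow from the printed per-bin probabilities by a cumulative sum. A small sketch using the arrays exactly as they appear in the output; the variable names are chosen here for illustration.

```python
import numpy as np

# Per-bin probabilities and bin edges copied from the printed output above.
pdf_more = np.array([0.83555556, 0.08, 0.02222222, 0.02666667, 0.01777778,
                     0.00444444, 0.00888889, 0.0, 0.0, 0.00444444])
edges_more = np.array([0.0, 4.6, 9.2, 13.8, 18.4, 23.0, 27.6, 32.2, 36.8, 41.4, 46.0])

pdf_less = np.array([0.56790123, 0.14814815, 0.13580247, 0.04938272, 0.07407407,
                     0.0, 0.01234568, 0.0, 0.0, 0.01234568])
edges_less = np.array([0.0, 5.2, 10.4, 15.6, 20.8, 26.0, 31.2, 36.4, 41.6, 46.8, 52.0])

# The CDF at the right edge of each bin is the running sum of bin probabilities.
cdf_more = np.cumsum(pdf_more)
cdf_less = np.cumsum(pdf_less)

for edge, value in zip(edges_more[1:], cdf_more):
    print(f"status 1: P(nodes < {edge:.1f}) is about {value:.3f}")
for edge, value in zip(edges_less[1:], cdf_less):
    print(f"status 2: P(nodes < {edge:.1f}) is about {value:.3f}")
```

Each CDF ends at 1 (up to rounding in the printed values), and the status 1 curve rises faster, which matches the observation that patients who lived longer tend to have fewer nodes.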
Mean, Variance and Std-dev
1. Mean is the average of a particular column.
2. Variance measures the spread of the data around the mean.
3. Standard deviation is the square root of the variance. It measures how far the data typically varies from the mean.
These are used for finding outliers. However, a single outlier can distort the mean, so we also use the median. A few outliers cannot corrupt the median, but if more than 50% of the data is corrupt, the median is affected as well.
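The effect of a single outlier on the mean and the median can be seen on a tiny made-up array. The numbers below are purely illustrative and are not taken from the dataset.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])       # illustrative data, no outlier
with_outlier = np.append(values, 100)    # same data plus one extreme value

mean_before, median_before = values.mean(), np.median(values)
mean_after, median_after = with_outlier.mean(), np.median(with_outlier)

print(mean_before, median_before)
print(round(mean_after, 2), median_after)
```

The mean jumps from 3.0 to about 19.17, while the median only moves from 3.0 to 3.5.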
print('Means : ')
print('\nVariance : ')
print('\nStandard Deviation : ')
Standard Deviation :
1. Patients who lived 5 years or longer had only about 2.7 nodes on average.
2. Patients who lived less than 5 years had a much higher average of about 7.4 nodes.
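The computation behind these per-class figures is only partly visible above. One way such statistics could be obtained is a `groupby` on the status column, sketched here on a small hypothetical frame (the values are made up, so the results differ from the real dataset).

```python
import pandas as pd

# Hypothetical toy data, NOT the real dataset: a few node counts per class.
toy = pd.DataFrame({
    "nodes":  [0, 1, 2, 5, 4, 8, 12],
    "status": [1, 1, 1, 1, 2, 2, 2],
})

# Mean, sample variance and sample standard deviation of nodes for each class.
stats = toy.groupby("status")["nodes"].agg(["mean", "var", "std"])
print(stats)
```

Note that pandas uses the sample (n - 1) denominator for `var` and `std` by default, whereas `np.var` and `np.std` default to the population (n) denominator.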
Median, Percentile, Quantile, IQR, MAD
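1. Median is the middle value of the sorted data. Unlike the mean, it is robust to outliers.
2. Quantiles are cut points that split the sorted data at given percentages. The values at 0, 25, 50, and 75 percent are reported here.
3. IQR is the inter-quartile range: the difference between the 75th and 25th percentiles.
4. MAD is the median absolute deviation: the median of the absolute deviations of the values from the median (center).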
[0. 0. 0. 3.]
[ 0. 1. 4. 11.]
1. Patients who lived 5 years or longer had 0 nodes up to the 50th percentile and only 3 nodes at the 75th percentile.
2. Patients who lived less than 5 years had a median of 4 nodes and 11 nodes at the 75th percentile.
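The percentiles, IQR, and MAD described in this section can be sketched with NumPy alone. The array below is hypothetical, and the percentile grid 0, 25, 50, 75 is an assumption inferred from the four values printed above.

```python
import numpy as np

# Hypothetical node counts, NOT the real dataset.
nodes = np.array([0, 0, 0, 1, 2, 3, 4, 11, 30])

# 0th, 25th, 50th and 75th percentiles (assumed grid).
percentiles = np.percentile(nodes, np.arange(0, 100, 25))

# Inter-quartile range: 75th percentile minus 25th percentile.
q1, q3 = np.percentile(nodes, [25, 75])
iqr = q3 - q1

# Median absolute deviation: median of |x - median(x)|, unscaled.
median = np.median(nodes)
mad = np.median(np.abs(nodes - median))

print(percentiles, iqr, median, mad)
```

Some library implementations rescale the MAD so that it is comparable to the standard deviation for normally distributed data; the version here is the plain, unscaled definition. The extreme value 30 has no effect on the median, IQR, or MAD in this example.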