Data scientists explore data and analyze it to gather insights. Exploratory Data Analysis (EDA) is the process of exploring data and drawing meaningful conclusions from it, for example by finding patterns, testing hypotheses, reducing dimensionality and handling missing values. The Haberman dataset comes from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.


2 Answers


Exploratory Data Analysis on Haberman Dataset

Exploratory Data Analysis is the process of exploring data and drawing meaningful conclusions from it, for example by finding patterns, testing hypotheses, reducing dimensionality and handling missing values.


This dataset records the survival of breast cancer patients who underwent surgery. The study was conducted between 1958 and 1970. It has 4 attributes, as follows.


1. age - Age of the patient at the time of operation
2. year - Year in which the patient was operated on
3. nodes - Number of positive axillary nodes detected
4. status - 1/2
           1 - survived 5 years or longer
           2 - died within 5 years


Data exploration is the first step towards data analysis. We visualize the data and try to understand it.

# importing libraries

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

import numpy as np

# Reading Data


# Reading first 5 rows of Data

data.head()

   age  year  nodes  status
0   30    64      1       1
1   30    62      3       1
2   30    65      0       1
3   31    59      2       1
4   31    65      4       1
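The loading step itself is not shown above. A minimal sketch of how it could look, assuming the data is stored as a headerless CSV with the four columns in the order listed earlier. The helper name `load_haberman` and the in-memory sample are illustrative only; the sample rows are copied from the output above.

```python
import io

import pandas as pd

# Column names as used throughout this answer.
COLUMNS = ["age", "year", "nodes", "status"]


def load_haberman(source):
    """Read a headerless Haberman CSV from a path or file-like object."""
    return pd.read_csv(source, header=None, names=COLUMNS)


# In-memory stand-in for the real file (the five rows shown above).
sample = io.StringIO("30,64,1,1\n30,62,3,1\n30,65,0,1\n31,59,2,1\n31,65,4,1\n")
data_head = load_haberman(sample)
print(data_head.shape)
```

If the file already has a header row, `pd.read_csv(path)` without `header=None` and `names` would be the simpler choice.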
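# shape
print(data.shape)

# columns
print(data.columns)

# Class Labels
# 1 - lived 5 years or longer
# 2 - died within 5 years
print(data["status"].value_counts())

data.info()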

(306, 4) 
Index(['age', 'year', 'nodes', 'status'], dtype='object')
1    225
2     81
Name: status, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age       306 non-null int64
year      306 non-null int64
nodes     306 non-null int64
status    306 non-null int64
dtypes: int64(4)
memory usage: 9.6 KB

IMBALANCED DATASET - The classes are unequally represented. Here, class 1 has 225 records and class 2 has 81 records.


  1. No. of rows - 306

  2. No. of columns - 4

  3. There is no missing value in any column

  4. Column Names - age, year, nodes, status

  5. All columns are of integer type.

  6. No. of people with status '1' - 225 (lived 5 years or longer)

  7. No. of people with status '2' - 81 (died within 5 years)

  8. The dataset is imbalanced
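To put a number on the imbalance, the class counts reported above can be turned into shares. This is a small sketch that uses only the two counts from the `value_counts()` output; the variable names are illustrative.

```python
# Class counts reported by value_counts() above.
class_counts = {1: 225, 2: 81}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}
imbalance_ratio = class_counts[1] / class_counts[2]

for label, share in shares.items():
    print(f"status {label}: {share:.1%}")
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

Roughly 73.5% of the patients fall in class 1, so a model that always predicts class 1 would already be right about three times out of four.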

# Dataset Description

data.describe()

              age        year       nodes      status
count  306.000000  306.000000  306.000000  306.000000
mean    52.457516   62.852941    4.026144    1.264706
std     10.803452    3.249405    7.189654    0.441899
min     30.000000   58.000000    0.000000    1.000000
25%     44.000000   60.000000    0.000000    1.000000
50%     52.000000   63.000000    1.000000    1.000000
75%     60.750000   65.750000    4.000000    2.000000
max     83.000000   69.000000   52.000000    2.000000


The describe function reports the count, mean, standard deviation, minimum, percentiles and maximum of every column.


1. Average age of patients - 52 years

2. Average year of operation - about 1963 (mean 62.85)

3. Average no. of nodes found in patient - 4

Range of Columns

1. AGE - (30-83)

2. YEAR - (1958-1969)

3. NODES - (0-52)

Intuitive Inferences

The mean number of nodes is 4, but the maximum is 52.

Also, the 75th percentile of nodes is only 4.

Therefore, the nodes column is likely to contain outliers.

Similarly, the 75th percentile of age is about 60 years,

and the maximum age is 83 years.

Hence, there may be outliers in age as well.

*NOTE* : These are assumptions based on Intuition. They are not proved yet. 
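One common way to check such intuitions is Tukey's rule, which flags values above Q3 + 1.5 * IQR. This rule is not used in the answer itself; the sketch below applies it to the quartiles printed by describe above, and the helper name `tukey_upper_fence` is illustrative.

```python
def tukey_upper_fence(q1, q3, k=1.5):
    """Upper fence of Tukey's rule: values above it are flagged as outliers."""
    return q3 + k * (q3 - q1)


# Quartiles and maxima copied from the describe() output above.
summary = {
    "nodes": {"q1": 0.0, "q3": 4.0, "max": 52.0},
    "age": {"q1": 44.0, "q3": 60.75, "max": 83.0},
}

for column, s in summary.items():
    fence = tukey_upper_fence(s["q1"], s["q3"])
    # Prints: column name, upper fence, whether the maximum exceeds the fence.
    print(column, fence, s["max"] > fence)
# nodes 10.0 True
# age 85.875 False
```

Under this rule the nodes column has high outliers (52 is well above the fence of 10), while the maximum age of 83 stays below its fence of 85.875.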


Analysis of two variables is known as bivariate analysis. We can draw 2D scatter plots to show how two variables relate.

2-D Scatter Plot - provides a visual image of the relationship between two variables.

# Using Seaborn

# Between AGE and year 

# hue on status (for different colors)


sns.FacetGrid(data, hue="status", height=4).map(plt.scatter, "age", "year").add_legend()

[Figure: scatter plot of age vs. year]



sns.FacetGrid(data, hue="status", height=4).map(plt.scatter, "age", "nodes").add_legend()




sns.FacetGrid(data, hue="status", height=4).map(plt.scatter, "year", "nodes").add_legend()

[Figure: scatter plot of year vs. nodes]


1. Most patients have fewer than 5 nodes.

2. Patients younger than 40 have a higher chance of survival and comparatively fewer nodes.

3. Patients older than 50 with more than 10 nodes have a lower chance of survival.

4. Most patients are between 40 and 65 years old.
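Observations read off a scatter plot can be checked numerically by grouping patients into bins and computing the share with status 1 in each bin. The sketch below is one way to do this; the helper name `survival_rate_by_bin`, the bin edges and the toy rows are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd


def survival_rate_by_bin(df, column, bins):
    """Share of patients with status 1 within each half-open bin [a, b)."""
    binned = pd.cut(df[column], bins=bins, right=False)
    survived = df["status"].eq(1)
    return survived.groupby(binned, observed=False).mean()


# Hypothetical rows, only to show the call.
toy = pd.DataFrame({
    "age": [30, 35, 38, 45, 52, 58, 61, 70],
    "status": [1, 1, 1, 2, 1, 2, 2, 1],
})
rates = survival_rate_by_bin(toy, "age", bins=[30, 40, 60, 90])
print(rates)
```

Calling the same helper on the full `data` frame with `column="nodes"` would give a comparable check for the node-based observations.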

Limitations of 2D scatter plots

Scatter plots are hard to interpret when the points overlap a lot.

As the number of features increases, the number of pairs increases, so plotting a 2D graph for each pair takes time.

To solve this problem we have pair plots, which are built into the seaborn library.

A pair plot draws all the variable pairs with one line of code.

3. Pair-plots

Instead of plotting each pair separately, we can use pair plots.

Total no. of pair plots for n features - nC2

Limitation - Not useful when the number of dimensions is high
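The nC2 count can be checked directly. A short sketch using the three feature columns of this dataset; the 10-feature case is added only to show how quickly the count grows.

```python
from itertools import combinations
from math import comb

features = ["age", "year", "nodes"]

# Every unordered pair of features gets one plot.
pairs = list(combinations(features, 2))
print(pairs)        # [('age', 'year'), ('age', 'nodes'), ('year', 'nodes')]
print(comb(10, 2))  # 45
```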



sns.pairplot(data, hue="status", height=4, vars=['age', 'year', 'nodes'])

[Figure: pair plot of age, year and nodes]


1. The number of patients drops sharply as the number of nodes goes from 0 to 4. Most people have close to zero nodes.

2. People with fewer nodes have a higher chance of survival, and vice versa.

3. Among people with more than 5 nodes, survivors are about half as many as deaths.

4. The survival rate is higher before 1965 and comparatively lower after that.

4. Univariate Analysis

Analyze each variable separately. Univariate analysis can be done using the following graphs. 


A histogram represents the distribution of numerical data and gives an estimate of a continuous variable's probability distribution. It is a univariate analysis: we divide the range of values into intervals, known as bins, and count the values in each. The problem with a 1D scatter plot is that points overlap, which makes it difficult to choose a threshold between the classes. The farther apart the class distributions are, the better.

Probability Density Function (PDF)

It describes how densely values fall in a particular range. It is the function used to describe a continuous probability distribution, i.e. the probabilities of random variables with continuous outcomes. The area under the curve always sums to 1.

Cumulative Distribution Function (CDF)

It describes what percentage of values is less than or equal to a particular value. Integrating the Probability Density Function gives the CDF. The CDF is also a univariate analysis.

# 1D scatter plot using one feature (AGE)

# here loc is given a boolean mask and returns the rows where the mask is True

live_more = data.loc[data["status"] == 1]

live_less = data.loc[data["status"] == 2]

plt.plot(live_more["age"], np.zeros_like(live_more['age']), 'o')

plt.plot(live_less["age"], np.zeros_like(live_less['age']), 'o')

[Figure: 1D scatter plot of age]


1. Too many data points overlap.

2. People younger than 35 tend to survive more often.

Probability Density Function

The PDF is the probability distribution that tells how densely values lie within a range.

FacetGrid helps in visualization of one or more variables.

Seaborn's distplot shows a histogram with a density curve drawn over it.

sns.FacetGrid(data, hue="status", height=4).map(sns.distplot, "age").add_legend()



1. The overlap indicates that the chance of survival cannot be determined clearly from age.

2. People below age 35 have a high chance of surviving.

3. People between ages 35 and 40 have almost double the survival rate.

4. People between ages 40 and 50 have a lower chance of surviving.

5. People between ages 50 and 65 have an almost equal chance of surviving or not.

6. People above 65 years have a low survival rate.

7. The number of patients increases up to age 50 and then decreases.

8. Most patients are between ages 40 and 70.

Overall, age is not a very decisive factor and no clear inferences can be made from it.

sns.FacetGrid(data, hue="status", height=4).map(sns.distplot, "year").add_legend()



This shows the survival distribution by year of operation, which cannot by itself decide the chance of survival.

Still, the plot suggests that operations were less successful until 1960, and that the number of successful operations then increased until 1963.

There was again a high rate of unsuccessful operations between 1963 and 1967.

sns.FacetGrid(data, hue="status", height=4).map(sns.distplot, "nodes").add_legend()



1. The chance of survival decreases beyond 10 nodes.

2. The survival rate is almost negligible beyond 25 nodes.

Cumulative Distribution Function

It tells what percentage of the population is less than a particular value.


counts, bin_edges = np.histogram(live_more['nodes'], bins=10, density=True)

# Normalize so the bin values sum to 1 (all bins have the same width)
pdf = counts / sum(counts)

print(pdf)
print(bin_edges)

cdf = np.cumsum(pdf)

plt.plot(bin_edges[1:], cdf)

[0.83555556 0.08       0.02222222 0.02666667 0.01777778 0.00444444
 0.00888889 0.         0.         0.00444444]
[ 0.   4.6  9.2 13.8 18.4 23.  27.6 32.2 36.8 41.4 46. ]


counts, bin_edges = np.histogram(live_less['nodes'], bins=10, density=True)

# Normalize so the bin values sum to 1 (all bins have the same width)
pdf = counts / sum(counts)

print(pdf)
print(bin_edges)

cdf = np.cumsum(pdf)

plt.plot(bin_edges[1:], cdf)

[0.56790123 0.14814815 0.13580247 0.04938272 0.07407407 0.
 0.01234568 0.         0.         0.01234568]
[ 0.   5.2 10.4 15.6 20.8 26.  31.2 36.4 41.6 46.8 52. ]


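About 84% of the patients with status 1 have fewer than 4.6 nodes, compared with about 57% of the patients with status 2 who have fewer than 5.2 nodes.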
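The step from PDF to CDF can be followed by hand on the numbers printed above. The sketch below re-uses the status 1 bin probabilities and bin edges from that output; nothing else is assumed.

```python
import numpy as np

# Bin probabilities and bin edges printed above for patients with status 1.
pdf = np.array([0.83555556, 0.08, 0.02222222, 0.02666667, 0.01777778,
                0.00444444, 0.00888889, 0.0, 0.0, 0.00444444])
bin_edges = np.array([0.0, 4.6, 9.2, 13.8, 18.4, 23.0,
                      27.6, 32.2, 36.8, 41.4, 46.0])

# The CDF at the right edge of a bin is the running sum of the bin probabilities.
cdf = np.cumsum(pdf)

for upper, share in zip(bin_edges[1:4], cdf[:3]):
    print(f"nodes < {upper}: {share:.1%}")
```

The first cumulative value is about 83.6% and the second about 91.6%, so more than nine in ten patients with status 1 have fewer than 9.2 nodes.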

Mean, Variance and Std-dev

1. Mean refers to the average of a particular column.

2. Variance indicates the spread of the data.

3. Standard deviation is the square root of the variance. It measures how far the data varies from the mean.

These can be used for finding outliers. However, a single outlier can shift the mean, which is why we also use the median. A few outliers cannot corrupt the median, but if more than 50% of the data is corrupt, the median is affected as well.
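The effect of a single outlier on the mean and the median can be seen on a tiny example. The values below are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Hypothetical node counts, not taken from the dataset.
clean = np.array([0, 1, 1, 2, 3])
with_outlier = np.array([0, 1, 1, 2, 52])  # same data with one extreme value

print(np.mean(clean), np.median(clean))                # 1.4 1.0
print(np.mean(with_outlier), np.median(with_outlier))  # 11.2 1.0
```

One extreme value moves the mean from 1.4 to 11.2, while the median stays at 1.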


print('Means : ')

print(np.mean(live_more['nodes']))

print(np.mean(live_less['nodes']))

print('\nVariance : ')

print(np.var(live_more['nodes']))

print(np.var(live_less['nodes']))

print('\nStandard Deviation : ')

print(np.std(live_more['nodes']))

print(np.std(live_less['nodes']))

Means : 

Variance : 

Standard Deviation : 


1. People who survived longer had only 2.7 nodes on average.

2. People who survived less had a much higher average of 7.4 nodes.

Median, Percentile, Quantile, IQR, MAD

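1. Median refers to the middle value. Unlike the mean, it is not sensitive to outliers.

2. Quantiles are the values below which a given percentage of the data falls, e.g. the 0th, 25th, 50th and 75th percentiles.

3. IQR is the inter-quartile range, i.e. the difference between the 75th and 25th percentiles.

4. MAD is the median absolute deviation, i.e. the median of the absolute deviations from the median (center).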
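A minimal sketch of how these four statistics can be computed with NumPy alone. The node counts are hypothetical, and the MAD here is the raw median of absolute deviations, without the scaling factor that some libraries apply.

```python
import numpy as np

# Hypothetical node counts, not taken from the dataset.
nodes = np.array([0, 0, 1, 2, 4, 7, 11, 23])

# 25th, 50th and 75th percentiles (default linear interpolation).
q1, median, q3 = np.percentile(nodes, [25, 50, 75])

# Inter-quartile range: width of the middle 50% of the data.
iqr = q3 - q1

# Raw median absolute deviation (no consistency scaling applied).
mad = np.median(np.abs(nodes - median))

print(q1, median, q3)  # 0.75 3.0 8.0
print(iqr, mad)        # 7.25 3.0
```

The largest value, 23, has little influence on the median, the IQR or the MAD, which is what makes these statistics robust.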









print("50th percentile")





[0. 0. 0. 3.]
[ 0.  1.  4. 11.]

50th percentile


1. People who survived longer had 0 nodes up to the 50th percentile and only 3 nodes at the 75th percentile.

2. People who survived less had a median of 4 nodes.


5. Box-Plots

Box plot - depicts the lower to upper quartile values of the data, with a line at the median.

Whiskers - show the range of the data, excluding the points drawn as outliers.

Outliers are the points drawn beyond the whiskers. They lie unusually far from the rest of the data.

# A box plot can be read as a PDF viewed sideways.

sns.boxplot(x='status',y='age', data=data)



1. A lower age slightly indicates a higher rate of survival.

sns.boxplot(x='status',y='year', data=data)



1. As the year of operation increases, the success rate is slightly higher.

sns.boxplot(x='status',y='nodes', data=data)



1. There are many outliers in the nodes column.

2. Few patients have more than 10 nodes.

3. The fewer the nodes, the higher the chance of survival.

4. Beyond 4 nodes the chance of survival is lower, but many people still survived with a higher number of nodes.

6. Violin plots

 A violin plot is the combination of a box plot and a probability density function (PDF).

# Combines the benefits of the previous two plots and simplifies them

# Denser regions of the data are fatter, and sparser ones thinner 

#in a violin plot

sns.violinplot(x="status", y="age", data=data)

[Figure: violin plot of age by status]


1. Most patients are in the 40-60 age group.

2. Above age 82, the chance of survival is low.

3. Below age 30, a patient is more likely to survive.

4. In the 40-50 age group, deaths are relatively more frequent than survivals.


sns.violinplot(x="status", y="year", data=data)

[Figure: violin plot of year by status]


1. Unsuccessful operations were most frequent around 1965.

2. Comparatively more of the patients operated on up to 1960 survived.


sns.violinplot(x="status", y="nodes", data=data)

[Figure: violin plot of nodes by status]


1. Patients with fewer than 1 node are more likely to survive.

2. Despite having a negligible number of nodes (close to 0), some people died.

3. Patients with more than 5 nodes are more likely to die.


Analysis of three or more variables

Contour plot

Contour plots (level plots) show a three-dimensional surface on a two-dimensional plane. Two predictor variables, X and Y, are plotted on the x- and y-axes, and a response variable Z is drawn as contours. In the plot below, Z is the estimated joint density of age and year.


sns.jointplot(x="age", y="year", data = data, kind = "kde")


1. The graph is densest between 1960 and 1963 for ages 45 to 55, so most operations were done on patients in that age group during those years.

2. There were comparatively more operations in 1958-1959 for the 37-43 age group.


1. Most people have fewer than 4 nodes, and the number of patients drops sharply as the number of nodes rises. Most people have between 0 and 1 nodes.

2. A higher number of nodes indicates a lower chance of survival. But a few people survived with a high number of nodes, and some people died with almost no nodes. Hence, the number of nodes alone cannot be a strict deciding factor.

3. Age and year of operation alone cannot be deciding factors. But more people below age 35 survived.

4. People below age 40 were seen with fewer nodes and a higher survival rate.

If the nodes are discovered at an early age and the patient is operated on, the chances of survival are higher.
