# Some basic statistics formulas for machine learning (data science) - Part 1

In this article we will discuss some statistics topics and concepts that are vital in data science. Statistics is itself a large subject to study, but machine learning and data science both depend on it to a considerable extent. So, in this tutorial, we will take a look at some basic concepts that work behind the scenes of machine learning and data science.

Central tendency: mean, median, mode.

Dispersion: range, interquartile range (IQR), standard deviation, variance.

Correlation, frequencies, proportion, hypothesis and inference. It will be helpful if you have basic knowledge of probability and algebra (vectors).

## 1 Answer

by Goeduhub's Expert (3.1k points)

Best answer

In this article we will discuss some statistics formulas that are important from a data science point of view.

Note that some terminology can be confusing because it has different meanings in different disciplines (data science, information technology and statistics).

For example, in statistics the term independent variable or predictor variable refers to a variable used to predict a response or dependent variable. In data science, on the other hand, features are used to predict a target.

### Central Tendency:

A measure of central tendency is an estimate of where most of the data (values) are located.

Mean: The mean is the sum of all values divided by the number of values.

For example, we have the numbers (4, 7, 6, -13, 6, 8) and want to calculate their mean.

(4 + 7 + 6 + (-13) + 6 + 8) / 6 = 18 / 6 = 3.

Note: n refers to the total number of observations or values. In statistics, N is used in formulas for a population (parameter) and n is used in formulas for a sample. In data science this distinction in notation is usually not important, and either symbol may be used.
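The mean calculation above can be checked with a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and `statistics.mean` from the standard library is used as a cross-check.

```python
from statistics import mean

values = [4, 7, 6, -13, 6, 8]

# Sum of all values divided by the number of values.
manual_mean = sum(values) / len(values)
print(manual_mean)  # 3.0

# The standard library gives the same result.
print(mean(values) == manual_mean)  # True
```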

Weighted mean: The weighted mean is another type of mean in which each data value is multiplied by a weight, and the sum of these products is then divided by the sum of the weights.

For example, we want to calculate the weighted mean of the set of numbers (3, 4, 6, 7, 3, 6, 4, 4).

The weight of a number in a set can differ from problem to problem, so you have to identify the weight of each number. Here the weight of each distinct value is the number of times it occurs: 3 occurs twice, 4 three times, 6 twice and 7 once.

Now let's solve the problem mentioned above.

Weighted mean = (2(3) + 3(4) + 2(6) + 1(7)) / 8 = 37 / 8 = 4.625

Here 8 is the sum of the weights: 2 + 3 + 2 + 1 = 8.

Median: The median is the middle number in a sorted set of numbers. If there is an odd number of values, the median is the middle value of the set. If there is an even number of values, the median is the average of the two middle numbers.

For example, in 1, 2, 3, 4, 5, 6, 7 the median = 4, and in 1, 2, 3, 4, 5, 6 the median = (3 + 4) / 2 = 3.5.
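The weighted mean and the two median examples can be reproduced as follows. This is a rough sketch with illustrative variable names; it assumes, as in the worked example, that the weights are the counts of each distinct value.

```python
from statistics import median

# Distinct values of (3, 4, 6, 7, 3, 6, 4, 4) and how often each occurs;
# the counts are used as the weights.
distinct_values = [3, 4, 6, 7]
weights = [2, 3, 2, 1]

# Sum of weight * value, divided by the sum of the weights.
weighted_mean = sum(w * x for w, x in zip(weights, distinct_values)) / sum(weights)
print(weighted_mean)  # 4.625

# Odd number of values: the middle value.
odd_median = median([1, 2, 3, 4, 5, 6, 7])
print(odd_median)  # 4

# Even number of values: the average of the two middle values.
even_median = median([1, 2, 3, 4, 5, 6])
print(even_median)  # 3.5
```

Because the weights here are simple counts, the weighted mean equals the ordinary mean of the full list of eight numbers.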

Mode: The mode is the value that occurs most often in a set of numbers.

For example, in 1, 2, 3, 3, 3, 4, 4, 4, 4 the mode = 4.

In everyday terms, we can say that the mode of religion in India is Hindu. The mode is a simple summary statistic for categorical data and is generally not used for numeric data.
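A short sketch of finding the mode with Python's standard library is shown below. The numeric list is the example above; the list of colours is hypothetical data invented only to illustrate the categorical case.

```python
from statistics import mode

numbers = [1, 2, 3, 3, 3, 4, 4, 4, 4]
numeric_mode = mode(numbers)
print(numeric_mode)  # 4

# Hypothetical categorical data, made up purely for illustration.
colours = ["red", "blue", "red", "green", "red", "blue"]
categorical_mode = mode(colours)
print(categorical_mode)  # red
```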

### Dispersion / Measures of variability:

Measures of dispersion describe the amount by which values are dispersed or scattered in a distribution.

Range: The range is the difference between the largest and smallest values in a set of numbers.

For example, for 23, 2, 3, 4, 5, 6, 7, 0 the range = 23 - 0 = 23.

As you can see, the range depends only on the maximum and minimum values of a set, so it is highly sensitive to outliers (extremely low or high values). It is therefore not very useful as a general measure of dispersion.
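The sensitivity of the range to a single extreme value can be seen in a small sketch like the one below. Treating 23 as the outlier and dropping it is an assumption made here only for illustration.

```python
values = [23, 2, 3, 4, 5, 6, 7, 0]

# Range: largest value minus smallest value.
value_range = max(values) - min(values)
print(value_range)  # 23

# Dropping the single extreme value 23 changes the range sharply.
without_outlier = [v for v in values if v != 23]
range_without_outlier = max(without_outlier) - min(without_outlier)
print(range_without_outlier)  # 7
```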

Mean Deviation: Deviations can be calculated about the mean, median or mode, but the most useful and widely used is the deviation about the mean. The mean deviation is the average of the absolute deviations of each value from the mean of the dataset. It tells us how dispersed the data is around the central value.

Standard deviation and variance: The best known measures of variability are the variance and the standard deviation, which are based on squared deviations. The variance is an average of the squared deviations, and the standard deviation is the square root of the variance.

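In formula form:

Mean deviation = Σ|x - mean| / n

Population variance = Σ(x - mean)² / N, and sample variance = Σ(x - mean)² / (n-1)

Standard deviation = square root of the variance

The value of the standard deviation can never be negative.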

Question: Calculate the mean deviation (MD), standard deviation (SD) and variance for (2, 5, 9, 11, 13).

Answer: mean = 40 / 5 = 8

Mean deviation = (|2-8| + |5-8| + |9-8| + |11-8| + |13-8|) / 5 = 18 / 5 = 3.6

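Variance: the squared deviations sum to 36 + 9 + 1 + 9 + 25 = 80, so the variance is 80 / 5 = 16 when dividing by N (population), or 80 / 4 = 20 when dividing by n-1 (sample).

Standard deviation = √16 = 4 (population), or √20 ≈ 4.47 (sample).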
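The worked answer can be checked with Python's `statistics` module. This is a minimal sketch with illustrative variable names: `pvariance` and `pstdev` divide by N (population), while `variance` and `stdev` divide by n-1 (sample). For this data the squared deviations from the mean sum to 80.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

data = [2, 5, 9, 11, 13]
m = mean(data)  # 8

# Mean deviation: average absolute distance from the mean.
mean_deviation = sum(abs(x - m) for x in data) / len(data)
print(mean_deviation)  # 3.6

# Population formulas divide the sum of squared deviations by N.
population_variance = pvariance(data)  # 80 / 5
population_sd = pstdev(data)

# Sample formulas divide by n - 1 instead.
sample_variance = variance(data)  # 80 / 4
sample_sd = stdev(data)

print(population_variance, sample_variance)
print(round(population_sd, 2), round(sample_sd, 2))
```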

Question: Why is N replaced by n-1 in the sample variance / standard deviation?

Answer: For a detailed explanation, see: Concept of replacing N with n-1 in sample variance / standard deviation.

Interquartile Range (IQR): The range of the middle 50 percent of a sorted set of numbers. It is a common measure of variability, calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1). It is used to avoid the sensitivity to outliers (extremely low or high values) that affects the range.

Q1 is the value at position (n+1)/4 and Q3 is the value at position 3(n+1)/4 in the sorted set of numbers.

Question: Calculate the IQR for (7, 9, 9, 10, 11, 11, 13).

Q1 position = (7+1)/4 = 2 and Q3 position = 3 × 2 = 6, so Q1 and Q3 are the values at the 2nd and 6th positions.

IQR = Q3 - Q1 = 11 - 9 = 2.
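The position rule used in this answer can be sketched in Python as below. The sketch assumes the positions (n+1)/4 and 3(n+1)/4 come out as whole numbers, as they do for n = 7; other sizes would need interpolation between neighbouring values. As a cross-check, `statistics.quantiles` (Python 3.8+) with its default "exclusive" method is also called, and for this data it gives the same IQR.

```python
from statistics import quantiles

data = sorted([7, 9, 9, 10, 11, 11, 13])
n = len(data)

# 1-based positions from the (n + 1) / 4 rule; assumed to be whole numbers here.
q1_position = (n + 1) // 4      # 2
q3_position = 3 * (n + 1) // 4  # 6

# Convert 1-based positions to 0-based list indices.
q1 = data[q1_position - 1]
q3 = data[q3_position - 1]
iqr = q3 - q1
print(q1, q3, iqr)  # 9 11 2

# Cross-check with the standard library's quartile cut points.
q1_lib, _, q3_lib = quantiles(data, n=4)
iqr_lib = q3_lib - q1_lib
print(iqr_lib)
```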