Karl Pearson’s Algorithm in Python
Karl Pearson’s Correlation:
To measure the degree of correlation between two variables, Karl Pearson derived a formula in 1896.
To break this down, Pearson’s correlation coefficient is:
- useful to measure how strong a relationship is between two variables,
- commonly used in linear regression.
Basically, it is a tool from statistics, the field concerned with the collection, manipulation, management, organization and analysis of numeric data.
What is it used for?
It is widely used for data fitting across the sciences, to tell how strongly the variables under consideration are related.
OK, alright! Say you:
- Want to see the relation between your diet and your weight? Use Karl Pearson’s correlation.
This is a positive relationship, as the variables change in the same direction. So is the relationship between your height and weight (yes, I just judged you!).
- Want to see the relation between price and quantity in demand? Use Karl Pearson’s correlation.
This is a negative relationship, as the variables change in opposite directions. So is the relationship between alcohol consumption and driving ability.
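One way to see the direction of a relationship in code is to look at the sign of the summed products of deviations from the mean (the numerator of the covariance). The sketch below uses numbers invented purely for illustration, and `co_deviation` is a hypothetical helper name rather than a library function.

```python
def co_deviation(xs, ys):
    """Sum of (x - mean_x) * (y - mean_y); its sign gives the direction."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Hypothetical data, made up for this sketch.
height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 58, 66, 75, 83]   # rises as height rises
price = [10, 20, 30, 40, 50]
demand = [95, 80, 62, 41, 30]      # falls as price rises

same_direction = co_deviation(height_cm, weight_kg)
opposite_direction = co_deviation(price, demand)

print(same_direction > 0)       # True: positive relationship
print(opposite_direction < 0)   # True: negative relationship
```

The sign only tells you the direction; dividing by the spread of each variable, as Pearson’s formula does, is what turns it into a value between -1 and +1.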
But how exactly do you find a relationship?
Think of your morning alarm. Without the alarm, you probably would have overslept. In this scenario, the alarm had the effect of waking you up at a certain time. This is what is meant by cause and effect: a relationship in which one event (the cause) makes another event happen (the effect), also called causation. A cause-effect relationship between two variables is one reason they can be correlated, though a correlation on its own does not prove that one variable causes the other.
- Linear relationship: the relationship between the variables must be linear, that is, when the data is plotted, the points tend to cluster around a non-horizontal straight line. If they do not, the relationship may be nonlinear, and Pearson’s coefficient will not describe it well.
- Normal distribution: a large number of independent causes should affect the variables under study, so that their values form a normal distribution.
The normal distribution can be characterized by the mean and standard deviation. The mean determines where the peak occurs. The standard deviation is a measure of the spread of the normal probability distribution.
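As a small illustration of those two parameters, Python’s standard `statistics` module can compute the mean and standard deviation of a sample. The sketch below borrows the glucose readings used in the worked example of this post. With only six values it says nothing about whether the data is actually normally distributed.

```python
import statistics

glucose = [99, 55, 63, 59, 87, 81]

mean = statistics.mean(glucose)      # where the peak of the bell curve would sit
spread = statistics.stdev(glucose)   # sample standard deviation (divides by n - 1)

print(f"{mean:.1f}")     # 74.0
print(f"{spread:.2f}")   # 17.61
```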
Linear correlation can be of three types:
The correlation coefficient always lies between -1 and +1. The types are as follows:
- Positive correlation (between 0 and +1): the values of one variable increase as the values of the other increase. In simple words, the data follows a straight line rising from low values of X and Y to high values of X and Y. A coefficient of exactly +1 means every point lies on that line.
- Negative correlation (between -1 and 0): the values of one variable decrease as the values of the other increase. In simple words, the data follows a straight line sloping from high values of Y at low X down to low values of Y at high X. A coefficient of exactly -1 means every point lies on that line.
- No correlation (close to 0): an increase or decrease in one variable has no bearing on the values of the other, for example your age (X) and your Internet bandwidth (Y).
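The three cases above can be reproduced with tiny made-up datasets. This is a minimal sketch, assuming the textbook definition of the coefficient in terms of deviations from the mean; `pearson_r` is a name chosen for this example and the data is invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

# Hypothetical data, made up for this sketch.
x = [1, 2, 3, 4, 5]
rising = [3, 5, 7, 9, 11]     # y = 2x + 1, a perfect upward line
falling = [9, 7, 5, 3, 1]     # y = -2x + 11, a perfect downward line
unrelated = [2, 1, 4, 1, 2]   # no linear trend in either direction

r_pos = pearson_r(x, rising)
r_neg = pearson_r(x, falling)
r_none = pearson_r(x, unrelated)

print(r_pos, r_neg, r_none)   # 1.0 -1.0 0.0
```

Note that the rising line does not pass through the origin: only how tightly the points follow a straight line matters, not where that line sits.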
Using Python to calculate the correlation between AGE and GLUCOSE LEVEL:
Consider two variables, Age (X) and Glucose Level (Y). The correlation between them describes how the two vary together.
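| S.No. | Age (X) | Glucose Level (Y) |
|---|---|---|
| 1 | 47 | 99 |
| 2 | 27 | 55 |
| 3 | 38 | 63 |
| 4 | 34 | 59 |
| 5 | 45 | 87 |
| 6 | 59 | 81 |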
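The formula that helps us deduce the correlation, where n is the number of (X, Y) pairs, is:

$$ r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{n}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{n}\right)}} $$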
import math
age = '47 27 38 34 45 59'
glevel = '99 55 63 59 87 81'
X = list(map(int, age.split()))     ## no need to use list()...
Y = list(map(int, glevel.split()))  ## ..while using Python2.x
sig_X = sum(X)  ## -- ΣX
sig_Y = sum(Y)  ## -- ΣY
numer1 = sum([a*b for a,b in zip(X, Y)])  ## -- ΣXY
numer2 = (sig_X*sig_Y)/len(X)             ## -- (ΣX)(ΣY)/n
numer = numer1 - numer2
denom1 = sum([i**2 for i in X])  ## -- ΣX²
denom2 = (sig_X**2)/len(X)       ## -- (ΣX)²/n
denom3 = sum([j**2 for j in Y])  ## -- ΣY²
denom4 = (sig_Y**2)/len(Y)       ## -- (ΣY)²/n
group1 = denom1 - denom2
group2 = denom3 - denom4
denom = math.sqrt(group1*group2)
kp_coeff = numer/denom
print("%.03f" % kp_coeff)  ## prints 0.742
Our output comes out to be 0.742, indicating a moderate-to-strong positive correlation between the variables. With only six data points, treat this as an indication rather than proof.
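As a cross-check, the same steps can be wrapped in a reusable function and compared with the textbook definition based on deviations from the mean. This is only a sketch: the names `pearson_sums` and `pearson_deviations` are made up for this post. If you are on Python 3.10 or newer, the standard library’s `statistics.correlation` should give the same figure.

```python
import math

def pearson_sums(xs, ys):
    """Pearson's r using the raw-sums (computational) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

def pearson_deviations(xs, ys):
    """Pearson's r using deviations from the means (definition form)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

age = [47, 27, 38, 34, 45, 59]
glucose = [99, 55, 63, 59, 87, 81]

r_sums = pearson_sums(age, glucose)
r_dev = pearson_deviations(age, glucose)

print("%.3f" % r_sums)               # 0.742
print(math.isclose(r_sums, r_dev))   # True: both forms agree
```

The raw-sums form needs only one pass over the data, while the deviation form is usually less sensitive to rounding when the values are large.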