Statistics Pills: Linear Regression

Soledad Musella Rubio
3 min read · Nov 13, 2020

In this blog I would like to clarify, from a mathematical point of view, how linear regression works. This method plays an important role in many machine learning algorithms, so it is extremely useful to understand it deeply.

The following is a common definition of linear regression:

In statistics, linear regression is a method of estimating the conditional expectation of a dependent or endogenous variable given the values of other independent or exogenous variables.

This simply means that through linear regression, it is possible to estimate the value of a “something” that varies as a function of “something else”.

Linear regression can be simple (or univariate), if there is only one independent variable, or multiple (or multivariate), if there is more than one. In linear regression, we assume that there is a linear relationship between the independent variables and the dependent variable. In this blog, I will go through the case of simple linear regression.

Mathematical explanation

Since we have assumed that there is a linear relationship between the independent and the dependent variable, we can write the function as follows:

y = mx + c

Where m is the slope (angular coefficient) of the line, while c is the intercept. These two parameters, which determine one and only one straight line, have a precise meaning, even a geometric one, which is beyond the scope of this discussion. What we want to obtain, through the linear regression method, is the best possible line, the one that minimises the error in the estimates we will make. But how do we find it?
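To make this concrete, here is a minimal Python sketch of such a hypothesis (the function name predict and the sample values are just for illustration):

```python
def predict(x, m, c):
    """Hypothesis of simple linear regression: a line with slope m and intercept c."""
    return m * x + c

# Example: the line y = 2x + 1 evaluated at x = 3
print(predict(3, m=2, c=1))  # 7
```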

Well, if we stop and think for a moment, there are many different ways to measure the error in an estimate. For example, we could say that an underestimate weighs more heavily than an overestimate (since we want to sell), or the other way around. Or we could invent many other measures.
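As a hedged illustration of such an asymmetric measure (the function name and the weighting factor below are arbitrary choices, not a standard definition), one could penalise underestimates more than overestimates:

```python
def asymmetric_error(y_true, y_pred, under_weight=2.0):
    """Toy error measure: underestimates count under_weight times more than overestimates."""
    diff = y_true - y_pred
    if diff > 0:                    # we estimated too low (an underestimate)
        return under_weight * diff
    return -diff                    # we estimated too high (an overestimate)

print(asymmetric_error(10, 8))   # underestimate by 2 -> error 4.0
print(asymmetric_error(10, 12))  # overestimate by 2  -> error 2
```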

Cost Function

The cost function is a function that determines the accuracy of our hypothesis. Given every possible hypothesis (which respects the linear model we gave ourselves at the beginning), we want to find the best one (called “the best fit”), that is, the one that allows us to make the most precise estimates, based on the data in our possession. If we look carefully at the shape of our hypothesis, we see that we can “imagine” infinitely many lines, one for each combination of the two parameters, slope and intercept. The problem, in fact, is how to identify the values of the two parameters m and c that make the error in the estimate smallest.

Mean square error

To evaluate our hypotheses, we will calculate the mean square error between the estimates obtained through the hypothesis and the actual values.

For each hypothesis and for each data point in our set:

  • we calculate the estimate of the dependent variable; we call this value y′
  • we subtract y′ from the y we have as starting data
  • we square this difference; we call this value e
  • we add up all the e’s obtained for each data point in our set
  • we divide this sum by the number of elements in our set (that is, we take the average)
  • this average, which we divide again by 2 for reasons we will see later, is the measure of the accuracy of our hypothesis
  • the hypothesis for which this average (error) is lowest is the best one (see the sketch below)
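Putting these steps together, a minimal sketch of this cost function in Python (the function name cost and the sample data are illustrative) could be:

```python
def cost(m, c, xs, ys):
    """Half mean square error of the line y = m*x + c on the data (xs, ys)."""
    n = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        y_prime = m * x + c          # estimate of the dependent variable
        e = (y_prime - y) ** 2       # squared difference
        total += e
    return total / (2 * n)           # the average, divided again by 2

# The line y = 2x + 1 fits this data perfectly, so its cost is 0
xs = [1, 2, 3]
ys = [3, 5, 7]
print(cost(2, 1, xs, ys))  # 0.0
print(cost(1, 0, xs, ys))  # a worse hypothesis gives a higher cost
```

The pair (m, c) for which this function returns the lowest value is the best fit.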

Below is a graphic representation of a simple linear regression: the data points scatter around the best-fit line.
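A minimal sketch that produces such a plot, assuming NumPy and Matplotlib are available and using synthetic data in place of a real dataset:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: points around the line y = 2x + 1 with some noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 30)
y = 2 * x + 1 + rng.normal(0, 2, size=x.shape)

# Best-fit line via least squares (degree-1 polynomial fit)
m, c = np.polyfit(x, y, 1)

plt.scatter(x, y, label="data")
plt.plot(x, m * x + c, color="red", label=f"best fit: y = {m:.2f}x + {c:.2f}")
plt.xlabel("x (independent variable)")
plt.ylabel("y (dependent variable)")
plt.legend()
plt.show()
```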

Conclusion

This was a brief introduction to simple linear regression; in the next blog we will go through the more articulated and complex case of multivariate linear regression. Stay tuned!
