Hand Coding the Linear Model

4 minute read

In order to understand statistics, you have to do the calculations yourself!

Warnings such as these are often given in statistics courses and for good reason too! Doing the work yourself really cements the understanding of statistics.

Yet when it comes to econometrics, the main take aways from the workshops are primarily in terms of the syntax of yet another computer program.

In this series of post titled: handcoded, I show how many workhorses of the econometrician’s toolbox can be implemented in a very simple manner. This manual implementation greatly helps in keeping an understanding of what actually happens when we call e.g. MCMC.. with a few parameters.

The Linear Model

Then using the Ordinary Least Squares approach to solving a model, we start with the following equation of the OLS model for a univariate regression.

This can be solver for the following (hat denotes the estimator, bar denotes the mean):

We start by loading a basic data set.

data("iris")

Inspect the data set.

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

Assign our variables to objects (in the global environment)

y <- iris$Petal.Length
x <- iris$Petal.Width

We can now estimate the slope parameter:

x_m <- mean(x) # x bar in our equation
y_m <- mean(y) # y bar in our equation
numerator   <- sum( (x-x_m) * (y-y_m) )
denominator <- sum( (x-x_m)*(x-x_m) )
( beta1_hat   <- numerator / denominator )
## [1] 2.22994

Using the slope parameter we can now compute the intercept.

( beta0_hat <- y_m - ( beta1_hat * x_m ) )
## [1] 1.083558

Lets check this using the built in command.

lm( y ~ x )
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept)            x  
##       1.084        2.230

The matrix model

In matrix form we can specify our general equation as:

From which we can derive our estimator:

The matrix estimation

Use the built in command.

lm( y ~ x -1 ) # the -1 ensures that we don't have an intercept
##
## Call:
## lm(formula = y ~ x - 1)
##
## Coefficients:
##     x  
## 2.875

Now we estimate our beta ourselves, the function used to invert is called solve().

solve(t(x)%*%x) %*% t(x)%*%y
##          [,1]
## [1,] 2.874706

Now lets estimate with an intercept

lm( y ~ x )
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept)            x  
##       1.084        2.230

To hand code this, we need to add a vector of ones (1s). Note that the single 1 that we are binding to the vector X will be repeated until it is the same length.

XI <- cbind(x, 1) # create a new object XI (I for Intercept)
head(XI) # inspect the first few rows
##        x  
## [1,] 0.2 1
## [2,] 0.2 1
## [3,] 0.2 1
## [4,] 0.2 1
## [5,] 0.2 1
## [6,] 0.4 1

We also have to discuss the above solve() function, ^-1 is not correct syntax for a matrix inversion. In the above case it would still work correctly because our X matrix is in fact a vector. If we pre-multiply this vector with the transpose of itself, we obtain a scalar.

dim( t(x) %*% x )
## [1] 1 1

However, for matrices wider than one column this is not the case.

dim( t(XI) %*% XI )
## [1] 2 2

This ^-1 will invert every individual number in the matrix, rather than the matrix as a whole.

(t(XI)%*%XI)^-1
##             x            
## x 0.003307644 0.005558644
##   0.005558644 0.006666667

We want to obtain to obtain the inverse of the matrix, because this will allow us to pre-multiply on both sides, eliminating XI on the Right-Hand Side (RHS).

We therefore use a different tool, thesolve() function from the base package. This function implements the QR decomposition, which is an efficient way of deriving an inverse of a matrix.

Now we can use this matrix to estimate a model with an intercept.

solve( (t(XI)%*%XI) ) %*% t(XI)%*%y
##       [,1]
## x 2.229940
##   1.083558

Note that this is programmatically exactly the same the way that the lm() function does this.

We can suppress the automatic intercept and include our XI variable and we will obtain the same results.

lm(y ~ XI -1 )
##
## Call:
## lm(formula = y ~ XI - 1)
##
## Coefficients:
##   XIx     XI  
## 2.230  1.084

We have now constructed a univariate (univariate) model, however, from a programmatic point of view, the hurdles of multivariate modelling have already been overcome by estimating a model with an intercept (making X a matrix).

It is therefore very easy to use the same method in a case with two independent variables.

x1 <- iris$Petal.Width
x2 <- iris$Sepal.Length
y  <- iris$Petal.Length

we start by binding the two independent variables together (with vector of 1s, since we want an intercept).

XI <- cbind(x1, x2, 1)
head(XI)
##       x1  x2  
## [1,] 0.2 5.1 1
## [2,] 0.2 4.9 1
## [3,] 0.2 4.7 1
## [4,] 0.2 4.6 1
## [5,] 0.2 5.0 1
## [6,] 0.4 5.4 1

Now we estimate our model

solve( (t(XI)%*%XI) ) %*% t(XI)%*%y
##          [,1]
## x1  1.7481029
## x2  0.5422556
##    -1.5071384

And that’s all! The leap from univariate to multivariate modelling was truly very small.

EDIT: in tomorrow’s post we use the method we developed here to create an easy to use function.