# Hand Coding Categorical Variables

In last week’s posts we discussed hand coding a linear model and writing a convenient function for it. In today’s post we take this a step further by including a categorical variable.

## Swiss life

Since I live in Geneva we will use a built-in data set that is close to home.

This data set compares fertility rates in 47 different French-speaking regions (sub-Cantonal) of Switzerland around the year 1888 (for more information see `help("swiss")`).

In our study we want to look at the effect of religion on fertility.

As we’re dealing with Switzerland in the 19th century,
there are essentially only two religions, Protestant and Roman Catholic.
As the predominantly Protestant authorities in the cities are generally suspicious of the Catholic mountain folk, this variable is astutely labelled `Catholic` (indicating the percentage of Catholic citizens; the percentage of Protestants is therefore `100 - Catholic`).

## Dichotomous Categorical Variables

Handling categorical variables in a statistical model is a bit different because they generally do not have a direct numerical representation.

A special case of the categorical variable is the dichotomous variable. This categorical variable can take only two values, e.g. `Male|Female`, `Infected|Not-Infected`, `Pre-Intervention|Post-Intervention`, etc.

In Switzerland many issues are decided upon by popular vote, with no first-past-the-post or other democratic atrocities.
This means that having a minimal majority can drastically change policymaking.
We use this as our justification for re-encoding the percentage variable `Catholic` to a dichotomous variable `Catholic_D`.

The `D` here stands not for dichotomous but for dummy. A dummy variable is a variable that takes either the value `1` (TRUE) or `0` (FALSE).
This allows us to put a number on the answer to the question "Catholic?":
the coefficient on the dummy is the estimated effect on fertility of a region being Catholic rather than Protestant.
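The re-encoding could look roughly like the sketch below. The 50% cut-off is an assumption based on the majority argument above, and the intermediate `Religion` column is also an assumption, inferred from the later reference to an existing `Religion` variable.

```r
# `swiss` ships with base R (package datasets), so nothing needs installing.

# Assumption: a region counts as Catholic when more than 50% of its
# citizens are Catholic (simple majority).
swiss$Religion <- ifelse(swiss$Catholic > 50, "Catholic", "Protestant")

# Dummy: 1 (TRUE) for Catholic regions, 0 (FALSE) for Protestant ones
swiss$Catholic_D <- as.integer(swiss$Religion == "Catholic")

# Cross-tabulate to confirm the dummy mirrors the label
table(swiss$Religion, swiss$Catholic_D)
```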

Let’s reload our function from last week.

We can now call the function with the parameters set.
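Last week's function is not reproduced in this post, so the block below uses a hypothetical stand-in. Only the signature `function(y, X, ...)` and the `intercept = TRUE` default come from the text; the name `ols` and the body (solving the normal equations) are assumptions, as is the 50% cut-off for the dummy.

```r
# Hypothetical stand-in for last week's function (name and body assumed).
ols <- function(y, X, intercept = TRUE) {
  X <- as.matrix(X)
  if (intercept) X <- cbind(Intercept = 1, X)  # prepend a column of ones
  # Solve the normal equations (X'X) beta = X'y
  drop(solve(crossprod(X), crossprod(X, y)))
}

# Assumed 50% cut-off for the dummy
Catholic_D <- as.integer(swiss$Catholic > 50)

# Call with every parameter spelled out
fit <- ols(y = swiss$Fertility,
           X = cbind(Catholic_D = Catholic_D),
           intercept = TRUE)
fit
```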

Our model gives us an intercept of `66.22` and a coefficient of `10.24` for the variable `Catholic_D`.
Note that it wasn’t necessary to specify `intercept = TRUE` since this is already the default.
The `y = ` and `X = ` parts are also not necessary as long as we enter our objects in the correct order,
i.e. in the order that we specified when we created the function (`function(y, X, ...)`).

Let’s quickly verify that we obtain the same results from the built-in function.
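A check against `lm()` might look like this, again assuming the 50% cut-off. With a single dummy regressor the intercept is simply the mean fertility of the baseline (Protestant) group, and the coefficient is the difference between the two group means, which the last lines compute directly.

```r
# Assumed 50% cut-off for the dummy
swiss$Catholic_D <- as.integer(swiss$Catholic > 50)

# Built-in linear model
fit_lm <- lm(Fertility ~ Catholic_D, data = swiss)
coef(fit_lm)

# The same two numbers from plain group means
mean_prot <- mean(swiss$Fertility[swiss$Catholic_D == 0])
mean_cath <- mean(swiss$Fertility[swiss$Catholic_D == 1])
c(intercept = mean_prot, Catholic_D = mean_cath - mean_prot)
```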

## General Categorical Variables

A general categorical variable can take more than two values.

Mapping a dichotomous variable to a dummy works so well because the two possible values that the variable can take are both mutually exclusive and collectively exhaustive. In plain terms this means that it is one or the other, never both and never neither.

Knowing this, we can use one dummy variable to describe one value (Catholic),
and the other value is implicit, since not-Catholic (`Catholic == 0`) means Protestant.

For categorical variables with more than two possible values this is not true.
Consider: not `A` could mean `B`, or it could mean `C` (if we had three possible values).
This means that we need to create a dummy variable for each possible value.
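As a toy illustration (the vector `x` is made up and has nothing to do with the `swiss` data), one dummy per value can be built like this:

```r
# Made-up three-valued variable, for illustration only
x <- c("A", "B", "C", "A", "B")

# One 0/1 column per distinct value
values <- sort(unique(x))
dummies <- sapply(values, function(v) as.integer(x == v))
dummies

# Every observation has exactly one value, so each row sums to 1
rowSums(dummies)
```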

## Handcoding

We start by creating a categorical variable with three possible values.
We can use our existing `Religion` variable for this.

Now we have to choose a region for which we are going to change the religion.

The ninth position is occupied by Gruyere.

This is a mountain region known for its dairy products such as cheese and chocolate.

Let us suppose that due to a large inflow this region is now predominantly Eastern-Orthodox.
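A sketch of these steps, assuming `Religion` was derived from `Catholic` with a 50% cut-off; the standalone vector name `religion` is an assumption made to keep the example self-contained:

```r
# Rebuild the two-valued variable (assumed 50% cut-off)
religion <- ifelse(swiss$Catholic > 50, "Catholic", "Protestant")
names(religion) <- rownames(swiss)

# Which region sits in the ninth position?
rownames(swiss)[9]

# Hypothetical recoding: pretend this region is now Eastern Orthodox
religion[9] <- "Orthodox"

# Three possible values now
table(religion)
```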

We are now in a position where we can no longer map our `Religion` variable to a single dummy variable.
We can however create a dummy variable for every possible value.

Let’s check we don’t have any regions for which no value is true, or regions for which more than one variable is true.
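One way to build the three dummies and run that check, under the same assumptions as before (50% cut-off, the ninth region recoded to `Orthodox`):

```r
# Three-valued variable (assumed construction)
religion <- ifelse(swiss$Catholic > 50, "Catholic", "Protestant")
religion[9] <- "Orthodox"

# One dummy column per possible value
dummies <- cbind(
  Catholic   = as.integer(religion == "Catholic"),
  Orthodox   = as.integer(religion == "Orthodox"),
  Protestant = as.integer(religion == "Protestant")
)

# A row sum of 0 would mean no value is true, 2 or more would mean several are
table(rowSums(dummies))
```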

Yup! That looks good. Time to run a regression.
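The attempt can be sketched as follows. It solves the normal equations directly instead of calling last week's function, which is an assumption about how that function works; `tryCatch()` is used so the error message is captured rather than halting the script. The call to `solve()` is expected to stop with an error about a singular system.

```r
# Three-valued variable (assumed construction)
religion <- ifelse(swiss$Catholic > 50, "Catholic", "Protestant")
religion[9] <- "Orthodox"

# Design matrix: intercept plus ALL three dummies
X <- cbind(
  Intercept  = 1,
  Catholic   = as.integer(religion == "Catholic"),
  Orthodox   = as.integer(religion == "Orthodox"),
  Protestant = as.integer(religion == "Protestant")
)

# Try to solve (X'X) beta = X'y and capture the error message if it fails
result <- tryCatch(
  solve(crossprod(X), crossprod(X, swiss$Fertility)),
  error = function(e) conditionMessage(e)
)
result

# The numerical rank is lower than the number of columns
c(rank = qr(X)$rank, columns = ncol(X))
```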

What happened here? Why didn’t this work? All our rows summed up to `1`!

Our matrix is singular for precisely that reason. Because the dummy variables in every row sum to `1`,
and a column of `1`s is also included (remember the intercept), the dummy columns add up exactly to the intercept column. The columns are therefore linearly dependent, which makes the matrix singular.

There are two ways to solve this: we can either drop the intercept or drop one of the dummies. Dropping the intercept changes what the coefficients mean (each becomes the mean of its own group rather than a difference from a reference group), so they are no longer comparable with our earlier results. Dropping one of the dummy variables is the better idea: it makes that value the baseline, which makes comparison easier.

If you think about it, removing the last dummy is what we did implicitly in the dichotomous case.
If we had created two dummies there, the second one would have been implicit in the first one (as its negation).
The same occurs here: if a region is not `Catholic` and not `Orthodox` either, then it must be Protestant.

Since `Protestant` was our baseline in the previous estimation (there was only a dummy for Catholic), we will stick with that.
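With the `Protestant` dummy left out, the hand-coded estimate might look like the sketch below (same assumptions as before: 50% cut-off, ninth region recoded, normal equations solved directly). The intercept is then the mean fertility of the Protestant regions, and each coefficient is a difference from that baseline.

```r
# Three-valued variable (assumed construction)
religion <- ifelse(swiss$Catholic > 50, "Catholic", "Protestant")
religion[9] <- "Orthodox"

# Intercept plus two dummies; Protestant is the omitted baseline
X <- cbind(
  Intercept = 1,
  Catholic  = as.integer(religion == "Catholic"),
  Orthodox  = as.integer(religion == "Orthodox")
)

# Solve the normal equations (X'X) beta = X'y
beta <- drop(solve(crossprod(X), crossprod(X, swiss$Fertility)))
beta
```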

Finally, let’s see if this is what we get if we use the built-in model.
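A possible comparison with `lm()`: when the variable is a factor, R creates the dummies itself and uses the first level as the baseline, so the levels are ordered to put `Protestant` first. The data frame name `swiss_rel` is an assumption, as is the construction of `religion`.

```r
# Three-valued variable (assumed construction)
religion <- ifelse(swiss$Catholic > 50, "Catholic", "Protestant")
religion[9] <- "Orthodox"

# Work on a copy and store the variable as a factor, baseline level first
swiss_rel <- swiss
swiss_rel$Religion <- factor(religion,
                             levels = c("Protestant", "Catholic", "Orthodox"))

# lm() expands the factor into dummies and drops the first level
fit_lm <- lm(Fertility ~ Religion, data = swiss_rel)
coef(fit_lm)
```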

Which gives us the same result!