## Categorical input data

In this article, we look at categorical input data: in other words, how to deal with inputs that are not numbers (categorical variables).

In the introductory lecture on machine learning, we mentioned output data that are categories or classes, as in classification. With regression, the output data were real values.

However, when our input data are categories, we unfortunately cannot feed them into the model as they are.

For example, suppose the output we want to predict is a salary, and the input data are the person’s gender, age, academic degree, work experience and grade point average.

Note that some of these inputs are numbers (age, work experience and grade point average), while the rest are categories. Gender can be male or female; the academic degree can be a bachelor's, a master's or a Doctor of Philosophy. Previously, all of our input data x were numbers.

So what should we do now?

We cannot multiply the categories!

One solution is called *one-hot encoding*. Its essence is that we assign one dimension of x to each possible value of the category.

For example, we represent a degree with three possible values (bachelor, master or Ph.D.) as a row vector of dimension 3: x = [x1, x2, x3]. The bachelor’s degree is [1, 0, 0], the master’s degree is [0, 1, 0], and the Doctor of Philosophy is [0, 0, 1]. Note that exactly one dimension has the value one, so you can never see something like [1, 1, 0] or [0, 1, 1].
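The encoding described above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `one_hot` and the order of the `DEGREES` list are assumptions made for the example, not something fixed by the article.

```python
# Assumed category order; any fixed order works as long as it stays consistent.
DEGREES = ["bachelor", "master", "phd"]


def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 everywhere else."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


bachelor = one_hot("bachelor", DEGREES)
master = one_hot("master", DEGREES)
phd = one_hot("phd", DEGREES)

print(bachelor, master, phd)
```

Because each value matches exactly one entry in the category list, every encoded vector sums to one, which is why a vector such as [1, 1, 0] can never appear.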

What should we do if we have more than one input factor?

For example, suppose we have two factors: an academic degree and gender. The academic degree takes 3 dimensions and gender takes 2. Therefore, the total dimension of the input is 5.
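As a rough sketch of how the two encodings combine, we can simply concatenate the one-hot vector of each factor. The function `encode_person` and the category orderings below are hypothetical choices made for this example.

```python
# Assumed category orders for the two factors.
DEGREES = ["bachelor", "master", "phd"]
GENDERS = ["female", "male"]


def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 everywhere else."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def encode_person(degree, gender):
    """Concatenate the degree vector (3 dims) and the gender vector (2 dims)."""
    return one_hot(degree, DEGREES) + one_hot(gender, GENDERS)


row = encode_person("master", "female")
print(row, len(row))
```

The first three positions describe the degree and the last two describe gender, so each person becomes a vector of length 5.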

Let’s consider a more specific example. Let our output y be the salary, x1 = 1 if gender is female (0 otherwise), and x2 = 1 if gender is male (0 otherwise). Suppose we have used linear regression and have calculated that

$$
y = 50000 - 5000 x_1 + 5000 x_2 .
$$

What is good about working with categories is that the results are very easy to interpret.

In this case, our intercept (the free term) is 50,000, which can be regarded as a kind of starting point. If you are a woman, then we deduct 5,000 from this amount; therefore, the expected salary is 45,000. If you are a man, we add 5,000, so the expected salary is 55,000. It means that your company shows gender discrimination in wages.

Since gender is a binary variable, we can formulate this equation differently. Let y be the salary, but now we only have one x, which has the value 1 if the person is a man and 0 if the person is a woman. So our equation is

$$
y = 45000 + 10000 x .
$$

This gives us the same salary values for a man and a woman, but the effect of being a man is now carried by x, whereas the salary for a woman sits in the intercept.
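To check that the two formulations agree, here is a small sketch that evaluates both for each gender. The coefficients are the ones from the example in the text; the function names are made up for illustration.

```python
def two_column_model(gender):
    """Salary from the model with one column per gender (x1 = female, x2 = male)."""
    x1 = 1 if gender == "female" else 0
    x2 = 1 if gender == "male" else 0
    return 50_000 - 5_000 * x1 + 5_000 * x2


def one_column_model(gender):
    """Salary from the model with a single binary column (x = 1 for male)."""
    x = 1 if gender == "male" else 0
    return 45_000 + 10_000 * x


predictions = {
    gender: (two_column_model(gender), one_column_model(gender))
    for gender in ("female", "male")
}
print(predictions)
```

Both models return 45,000 for a woman and 55,000 for a man; only the way the numbers are split between the intercept and the coefficients differs.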

Note that if you have more than one categorical input variable, this trick no longer gives such a clean reading. If the output depends on both a person's degree and gender, the intercept simultaneously represents the starting point for the degree and the baseline value for gender. In this case, the result is much harder to interpret.
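To make this concrete, suppose, purely as an illustration, that we folded one value of each variable into the intercept, taking "bachelor" and "female" as the baselines. The coefficients $b_0, \dots, b_3$ below are symbolic placeholders, not fitted values:

$$
y = b_0 + b_1 x_{\text{master}} + b_2 x_{\text{PhD}} + b_3 x_{\text{male}} .
$$

Here $b_0$ is the expected salary of a woman with a bachelor's degree, so it describes the baseline of both variables at once. Unlike the single-variable case, it can no longer be read as the salary of one gender or of one degree alone.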
