I am trying to fully understand the difference between categorical and ordinal data in regression analysis. Here is what is clear to me so far:
Categorical feature and data example:
Color: red, white, black
Why categorical: the ordering red < white < black is logically incorrect
Ordinal feature and data example:
Condition: old, renovated, new
Why ordinal: the ordering old < renovated < new is logically correct
Categorical-to-numeric and ordinal-to-numeric encoding methods:
One-Hot encoding for categorical data
Arbitrary numbers for ordinal data
Categorical data to numeric:
data = {'color': ['blue', 'green', 'green', 'red']}
Numeric format after One-Hot encoding:
   color_blue  color_green  color_red
0           1            0          0
1           0            1          0
2           0            1          0
3           0            0          1
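As a minimal sketch of how the table above can be produced, here is a pure-Python version. The helper `one_hot` is a hypothetical name introduced for illustration; in practice a library routine such as pandas' `get_dummies` is the usual route, and alphabetical column order is an assumption made here to match the table.

```python
# Pure-Python sketch of one-hot encoding; `one_hot` is an illustrative helper,
# not a library function.
data = {'color': ['blue', 'green', 'green', 'red']}

def one_hot(values, prefix):
    """Return (column_names, rows) with one 0/1 indicator column per category."""
    categories = sorted(set(values))  # assumed alphabetical order: blue, green, red
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows

columns, rows = one_hot(data['color'], 'color')
print(columns)
for index, row in enumerate(rows):
    print(index, row)
```

Each row contains exactly one 1, which marks the color of that item.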
Ordinal data to numeric:
data = {'con': ['old', 'new', 'new', 'renovated']}
Numeric format after applying the mapping old < renovated < new → 0, 1, 2:
   con
0    0
1    2
2    2
3    1
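The mapping step can be sketched in a few lines of plain Python. The dictionary name `condition_order` is an assumption introduced for illustration; the codes 0, 1, 2 are the ones stated above.

```python
data = {'con': ['old', 'new', 'new', 'renovated']}

# Order-preserving codes: old < renovated < new
condition_order = {'old': 0, 'renovated': 1, 'new': 2}

# Replace each label with its integer code
encoded = [condition_order[value] for value in data['con']]
print(encoded)
```

Unlike one-hot encoding, this keeps a single numeric column, so the model sees both the order and the spacing of the codes.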
In my data I have a 'color' feature. As the color changes from white to black, the price increases. By the rules above, I should probably use one-hot encoding for the categorical 'color' data. But why can't I use an ordinal representation? Below are the observations from which my question arose.
Let me start by introducing the formula for linear regression:
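Assuming the standard linear hypothesis with an intercept and one parameter per feature (this form is an assumption, chosen because it is consistent with the thetas used in the predictions further down), the formula can be written as:

$$
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n
$$

Here $x_1, \dots, x_n$ are the encoded feature values and $\theta_0$ is the intercept.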
Let's have a look at the data representations for color:
Let's predict the price of the 1st and 2nd items using the formula with both data representations:
One-hot encoding:
In this case each color gets its own theta. I assume the thetas have already been estimated by the regression (20, 50 and 100). The predictions will be:
Price (item 1) = 0 + 20*1 + 50*0 + 100*0 = $20 (thetas are assumed for the example)
Price (item 2) = 0 + 20*0 + 50*1 + 100*0 = $50
Ordinal encoding for color: in this case all colors share one common theta, but the codes I assigned (10, 20, 30) differ:
Price (item 1) = 0 + 20*10 = $200 (theta is assumed for the example)
Price (item 2) = 0 + 20*20 = $400 (theta is assumed for the example)
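The arithmetic of both encodings can be checked with a short sketch. All numbers are the assumed example values from above (intercept 0, one-hot thetas 20/50/100, ordinal codes 10/20/30 with a single theta of 20). Which color each theta and code belongs to is my assumption, following the price order white < red < black with item 1 white and item 2 red.

```python
# Assumed example values, not fitted parameters.
THETA_0 = 0
ONE_HOT_THETAS = {'white': 20, 'red': 50, 'black': 100}  # one theta per color
ORDINAL_CODES = {'white': 10, 'red': 20, 'black': 30}    # hand-assigned codes
ORDINAL_THETA = 20                                       # single shared theta

def predict_one_hot(color):
    # Only the indicator column of the item's own color equals 1.
    return THETA_0 + sum(
        theta * (1 if c == color else 0) for c, theta in ONE_HOT_THETAS.items()
    )

def predict_ordinal(color):
    # One theta multiplies the numeric code of the color.
    return THETA_0 + ORDINAL_THETA * ORDINAL_CODES[color]

for color in ['white', 'red', 'black']:
    print(color, predict_one_hot(color), predict_ordinal(color))
```

With these values the ordinal predictions (200, 400, 600) are equally spaced by construction, because one theta scales equally spaced codes, while the one-hot predictions (20, 50, 100) can take any spacing.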
In my model, prices follow the order white < red < black. It seems that the relationship is captured correctly and the predictions are logical in both cases, for the ordinal and the categorical representation alike. So can I use either encoding in my regression regardless of the data type (categorical or ordinal)? Is this distinction between data representations just a matter of convention and software-oriented representation rather than a matter of the regression logic itself?