I have been advised that it is best to order categorical variables where appropriate (e.g. short less than medium less than long). I am wondering, what is the specific advantage of treating a categorical variable as ordered as opposed to just simple categorical, in the context of modelling it as an explanatory variable? What does it mean mathematically (in lay terms preferably!)?
Many thanks!
Among other things, it allows you to compare values from those factors:
> ord.fac <- ordered(c("small", "medium", "large"), levels=c("small", "medium", "large"))
> fac <- factor(c("small", "medium", "large"), levels=c("small", "medium", "large"))
> ord.fac[[1]] < ord.fac[[2]]
[1] TRUE
> fac[[1]] < fac[[2]]
[1] NA
Warning message:
In Ops.factor(fac[[1]], fac[[2]]) : < not meaningful for factors
Documentation suggests there is quite an impact from a modeling perspective:
Ordered factors differ from factors only in their class, but methods and the model-fitting functions treat the two classes quite differently
but I'll have to let someone familiar with those use cases provide the details on that.
You should use ordinal data only when it makes sense from the data's point of view (i.e. the data is naturally ordered like in the case of small, medium and large).
In modeling terms, a categorical variable has a dummy variable created for each level but one of the possible values it can take. The effect of a dummy variable essentially gives you the effect of that level compared to the reference level (the level without a dummy variable). In general, dealing with a categorical variable is easier than dealing with ordinal data.
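To make the dummy-variable idea concrete, here is a minimal sketch in base R. The toy vector `size` is made up for illustration, and the output assumes R's default contrast settings (treatment contrasts for unordered factors).

```r
# Hypothetical toy data: an unordered factor with three levels
size <- factor(c("small", "medium", "large", "medium"),
               levels = c("small", "medium", "large"))

# model.matrix() builds the design matrix that lm() and friends would use.
# With the default treatment contrasts, the first level ("small") is the
# reference level and gets no column of its own.
X <- model.matrix(~ size)
print(colnames(X))  # "(Intercept)" "sizemedium" "sizelarge"
print(X)
```

Each of the `sizemedium` and `sizelarge` columns is 1 for rows at that level and 0 otherwise, so the corresponding coefficient is the difference from "small".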
Ordinal data is not modeled in the same way as continuous or categorical data (unless you treat the values as continuous, which is often done). In R, the ordinal package has several functions to perform the modeling that are based on a cumulative link function (a link function maps the quantity being modeled, here the cumulative probabilities of the ordered categories, onto a scale where a linear predictor can be used, much as in linear regression).
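As a rough sketch of what a cumulative link model looks like in practice, assuming the ordinal package is installed: the example uses `clm()` and the `wine` data set that ships with that package, where `rating` is an ordered factor. The choice of predictors is only for illustration.

```r
# Assumes the 'ordinal' package is available; skipped otherwise
if (requireNamespace("ordinal", quietly = TRUE)) {
  data(wine, package = "ordinal")

  # rating is an ordered factor, used here as the response
  stopifnot(is.ordered(wine$rating))

  # Cumulative link model; the link is logit unless another is requested
  fit <- ordinal::clm(rating ~ temp + contact, data = wine)
  print(summary(fit))
}
```

Note that this models an ordinal response; for an ordered explanatory variable, the contrasts used in the design matrix are what change.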
The advantage of recoding categorical data as ordinal is that the inferences made from the data better represent the data and have a more intuitive interpretation.
The most useful difference is in displaying results. If we have levels low, med, and high and create an appropriate ordered factor, then boxplots, barplots, tables, etc. will display the results in the order low, med, high. But if we create an unordered factor and go with the default (alphabetical) ordering of the levels, then the plots/tables will put things in the order high, low, med, which makes less sense.
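A small illustration of the display order, using a made-up vector `x`:

```r
x <- c("low", "high", "med", "low", "high")  # hypothetical data

# Default: levels are the sorted unique values, i.e. alphabetical
default_fac <- factor(x)

# Explicit level order, marked as ordered
ordered_fac <- factor(x, levels = c("low", "med", "high"), ordered = TRUE)

levels(default_fac)  # "high" "low"  "med"
levels(ordered_fac)  # "low"  "med"  "high"

# Tables (and plots) follow the level order
table(default_fac)
table(ordered_fac)
```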
The default contrasts/dummy variable encoding is different for ordered and non-ordered factors: by default R uses treatment contrasts (contr.treatment) for unordered factors and polynomial contrasts (contr.poly) for ordered ones. You can change the encoding, so this only matters if you use the defaults. The encoding can change the interpretation of individual coefficients, but in general it will not affect the overall fit for the linear model and its extensions; other tools, such as trees, could behave differently.
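Here is a sketch of that point on simulated data (the variable names and the data-generating step are made up, and R's default contrast options are assumed). The coefficients differ between the two encodings, but the fitted values agree.

```r
set.seed(1)  # simulated data, purely illustrative
lev <- c("low", "med", "high")
g <- rep(lev, each = 20)
y <- rnorm(60, mean = match(g, lev))

unord <- factor(g, levels = lev)
ord   <- factor(g, levels = lev, ordered = TRUE)

# Default encodings: treatment contrasts vs. polynomial contrasts
contrasts(unord)
contrasts(ord)

fit_unord <- lm(y ~ unord)
fit_ord   <- lm(y ~ ord)

# Different coefficients: differences from "low" vs. linear/quadratic trend
coef(fit_unord)  # (Intercept), unordmed, unordhigh
coef(fit_ord)    # (Intercept), ord.L, ord.Q

# Same overall fit
all.equal(unname(fitted(fit_unord)), unname(fitted(fit_ord)))
```

With three levels, the ordered factor's `.L` and `.Q` terms describe a linear and a quadratic trend across the levels rather than level-versus-reference differences.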