I have the following data with all categorical variables:
class education income social_standing
1 basic low good
0 low high v_good
1 high low not_good
0 v_high high good
Here education has four levels (basic, low, high and v_high), income has two levels (low and high), and social_standing has three levels (good, v_good and not_good).
As far as I understand, converting the above data to VW format gives something like this:
1 |person education_basic income_low social_standing_good
0 |person education_low income_high social_standing_v_good
1 |person education_high income_low social_standing_not_good
0 |person education_v_high income_high social_standing_good
Here, 'person' is the namespace and all the others are feature values, prefixed by their respective feature names. Am I correct? Somehow this representation of feature values is quite perplexing to me. Is there any other way to represent features? I would be grateful for help.
Yes, you are correct.
This representation would definitely work with vowpal wabbit but, depending on the data, it may not be optimal.
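For reference, the conversion described in the question can be scripted. Below is a minimal Python sketch, not an official vw tool: the names ROWS, COLUMNS and to_vw_onehot are made up for this illustration, and the rows are simply the four examples from the question.

```python
# Rows from the question: (class label, education, income, social_standing).
ROWS = [
    (1, "basic", "low", "good"),
    (0, "low", "high", "v_good"),
    (1, "high", "low", "not_good"),
    (0, "v_high", "high", "good"),
]
COLUMNS = ("education", "income", "social_standing")


def to_vw_onehot(label, values, namespace="person"):
    """Build one VW input line with a name_value feature per column."""
    features = " ".join(f"{col}_{val}" for col, val in zip(COLUMNS, values))
    return f"{label} |{namespace} {features}"


# One VW line per row, in the same order as the input rows.
VW_LINES = [to_vw_onehot(row[0], row[1:]) for row in ROWS]

if __name__ == "__main__":
    print("\n".join(VW_LINES))
```

Each (name, value) pair becomes its own feature name, which is exactly the representation shown in the question.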
To represent non-ordered, categorical variables (with discrete values), the standard vowpal wabbit trick is to use logical/boolean values for each possible (name, value) combination (e.g. person_is_good, color_blue, color_red). The reason this works is that vw implicitly assumes a value of 1 wherever a value is missing. There's no practical difference between color_red, color=red, color_is_red, or even (color,red) and color_red:1 except hash locations in memory. The only characters you cannot use in a variable name are the special separators (: and |) and white-space.

But in this case the variable values may not be "strictly categorical". They may be:
- ordered (e.g. low < basic < high < v_high)
- monotonically related to the label you are trying to predict
so by making them "strict categorical" (my term for a variable with a discrete range which doesn't have the two properties above) you may be losing some information that may help learning.
In your particular case, you may get better results by converting the values to numeric, e.g. (1, 2, 3, 4) for education, i.e. you could write the feature with a numeric value, such as education:2 for basic, instead of education_basic.

The training set in the question should work fine, because even when you convert all your discrete variables into boolean variables like you did, vw should self-discover both the ordering and the monotonicity with the label from the data itself, as long as the two properties above are true, and there's enough data to deduce them.

Here's the short cheat-sheet for encoding variables in vowpal wabbit:
Final notes:
- In vw all variables are numeric. The encoding tricks are just practical ways to make things appear as categorical or boolean. Boolean variables are simply numeric 0 or 1; categorical variables can be encoded as boolean: name+value:1.
- A variable whose value is 0 makes no difference to the model (unless the --initial_weight <value> option is used), so it can be dropped from the training set.
- The : character is considered a special separator (between the variable name and its numeric value); anything else is considered part of the name, and the whole name string is hashed to a location in memory. A missing :<value> part implies :1.
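To make these notes concrete, here is a small Python sketch that formats features the way the notes describe. It is only an illustration: format_feature, to_vw_line and the EDUCATION_RANK mapping are hypothetical names, and the mapping just follows the ordering low < basic < high < v_high mentioned earlier.

```python
# Assumed level-to-number mapping for education, following the ordering
# low < basic < high < v_high given in this answer.
EDUCATION_RANK = {"low": 1, "basic": 2, "high": 3, "v_high": 4}


def format_feature(name, value):
    """Format one feature following the notes above.

    A value of 0 is dropped (returns None), a value of 1 is written as the
    bare name because vw assumes :1, anything else becomes name:value.
    """
    if value == 0:
        return None
    if value == 1:
        return name
    return f"{name}:{value}"


def to_vw_line(label, features, namespace="person"):
    """Join the surviving features into a single VW input line."""
    parts = [format_feature(n, v) for n, v in features]
    body = " ".join(p for p in parts if p is not None)
    return f"{label} |{namespace} {body}"


# First row of the question: education as an ordinal number, the other two
# variables as one-hot booleans (only the true case ends up in the line).
EXAMPLE = to_vw_line(1, [
    ("education", EDUCATION_RANK["basic"]),
    ("income_low", 1),
    ("income_high", 0),
    ("social_standing_good", 1),
])

if __name__ == "__main__":
    print(EXAMPLE)
```

Whether the numeric form of education actually helps depends on the data, so it is worth comparing both encodings on a held-out set.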
Edit: what about name-spaces?
Name spaces are prepended to feature names with a special-char separator so they map identical features to different hash locations. Example:
Is essentially equivalent to the (no name spaces flat example):
The main use of name-spaces is to easily redefine all members of a name-space to something else, ignore a full name space of features, cross features of a name space with another etc. (see -q, --cubic, --redefine, --ignore, --keep options).
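As a rough illustration of the idea that a name-space keeps identical feature names apart, here is a Python sketch that puts each variable in its own name-space and also shows a flat view with the name-space prepended. This is an assumption-laden sketch: the ^ separator is used here only as notation, vw's internal hashing is not reproduced, and with_namespaces and flat_names are hypothetical helper names.

```python
def with_namespaces(label, row):
    """One name-space per variable, e.g. '|education low |income low'."""
    groups = " ".join(f"|{ns} {val}" for ns, val in row.items())
    return f"{label} {groups}"


def flat_names(row, sep="^"):
    """Flat view: name-space prepended to each feature name.

    The separator is an assumption made for this illustration only.
    """
    return [f"{ns}{sep}{val}" for ns, val in row.items()]


# The same value 'low' appears under two different variables.
ROW = {"education": "low", "income": "low"}
NAMESPACED = with_namespaces(1, ROW)
FLAT = flat_names(ROW)

if __name__ == "__main__":
    print(NAMESPACED)
    print(FLAT)
```

The two low features end up with different full names, so they would not share a weight, and each group can then be targeted as a whole by the options listed above.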