I have a dataset of about 105,000 rows and 30 columns. I have a categorical variable that I would like to map to a number. In Excel, I would probably do something with `VLOOKUP` and fill.

How would I go about doing the same thing in R?

Essentially, what I have is a `HouseType` variable, and I need to calculate the `HouseTypeNo`. Here are some sample data:
HouseType HouseTypeNo
Semi 1
Single 2
Row 3
Single 2
Apartment 4
Apartment 4
Row 3
I think you can also use `match()`. This still works if I scramble the order of `lookup`.

The poster didn't ask about looking up values if `exact=FALSE`, but I'm adding this as an answer for my own reference and possibly others. If you're looking up categorical values, use the other answers.
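For the exact, categorical case, the `match()` approach mentioned above could be sketched roughly as follows. The `lookup` table and `df` are assumptions built from the sample data in the question, and `scrambled` and `recoded` are hypothetical names used only to show that row order in the lookup table does not matter.

```r
# Hypothetical lookup table built from the question's sample data
lookup <- data.frame(
  HouseType   = c("Semi", "Single", "Row", "Apartment"),
  HouseTypeNo = c(1, 2, 3, 4)
)

# The column to be coded, taken from the question's sample rows
df <- data.frame(
  HouseType = c("Semi", "Single", "Row", "Single",
                "Apartment", "Apartment", "Row")
)

# match() gives, for each df$HouseType, its row position in lookup$HouseType;
# that position is then used to pull the corresponding HouseTypeNo
df$HouseTypeNo <- lookup$HouseTypeNo[match(df$HouseType, lookup$HouseType)]

# Scrambling the rows of the lookup table leaves the result unchanged
scrambled <- lookup[c(3, 1, 4, 2), ]
recoded <- scrambled$HouseTypeNo[match(df$HouseType, scrambled$HouseType)]

print(df)
```

Values of `HouseType` that are missing from the lookup table come back as `NA`, which is roughly the counterpart of `#N/A` in Excel.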
Excel's `vlookup` also allows you to match approximately for numeric values with the 4th argument (1), `match=TRUE`. I think of `match=TRUE` like looking up values on a thermometer. Passing FALSE instead requests an exact match, which is perfect for categorical values; note that Excel assumes TRUE (approximate match) when the argument is omitted.

If you want to match approximately (perform a lookup), R has a function called `findInterval`, which (as the name implies) will find the interval / bin that contains your continuous numeric value.

However, let's say that you want to `findInterval` for several values. You could write a loop or use an apply function. However, I've found it more efficient to take a DIY vectorized approach.

Let's say that you have a grid of values indexed by x and y:
and you have some values you want to look up by x and y:
Here is the example visualized:
You can find the x intervals and y intervals with this type of formula:
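As a minimal sketch of that kind of formula: the grid and the lookup points below are hypothetical stand-ins (none of these numbers come from the original example). The index is computed arithmetically rather than by searching, so this assumes the grid breaks are evenly spaced.

```r
# Hypothetical, evenly spaced grid breaks (assumption, for illustration only)
grid <- list(x = c(0, 1, 2, 3, 4),
             y = c(10, 20, 30, 40))

# Hypothetical points to look up by x and y
df <- data.frame(x = c(0.5, 1.25, 3.75, 2.5),
                 y = c(12, 35, 28, 15))

# Range and number of bins along each axis
xrng  <- range(grid$x)
yrng  <- range(grid$y)
xbins <- length(grid$x) - 1
ybins <- length(grid$y) - 1

# Scale each value to [0, bins), truncate, and shift to a 1-based bin index
df$ix <- trunc((df$x - min(xrng)) / diff(xrng) * xbins) + 1
df$iy <- trunc((df$y - min(yrng)) / diff(yrng) * ybins) + 1

print(df)
# ix is 1 2 4 3 and iy is 1 3 2 1 for these points
```

A point lying exactly on the upper edge of the grid would come out as bin `xbins + 1` (or `ybins + 1`), so it may need special handling.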
You could take it one step further and perform a (simplistic) interpolation on the z values in `grid` like this:

Which gives you these values:
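One possible form of such an interpolation is sketched here, under stated assumptions: the grid, its `z` values, and the lookup points are all hypothetical, and the scheme (a planar estimate built from the lower corner of each cell and its two neighbours) is one simple choice, not necessarily the one used in the original example. `z` is deliberately made planar so the result is easy to verify by hand.

```r
# Hypothetical evenly spaced grid; z is planar (2 * x + 0.1 * y) by construction
grid <- list(x = c(0, 1, 2, 3, 4),
             y = c(10, 20, 30, 40))
grid$z <- outer(grid$x, grid$y, function(x, y) 2 * x + 0.1 * y)

# Hypothetical points to look up
df <- data.frame(x = c(0.5, 1.25, 3.75, 2.5),
                 y = c(12, 35, 28, 15))

# Bin indices from the vectorized formula
df$ix <- trunc((df$x - min(grid$x)) / diff(range(grid$x)) * (length(grid$x) - 1)) + 1
df$iy <- trunc((df$y - min(grid$y)) / diff(range(grid$y)) * (length(grid$y) - 1)) + 1

# Fractional position of each point inside its cell (0 = lower edge, 1 = upper edge)
tx <- (df$x - grid$x[df$ix]) / (grid$x[df$ix + 1] - grid$x[df$ix])
ty <- (df$y - grid$y[df$iy]) / (grid$y[df$iy + 1] - grid$y[df$iy])

# z at the lower corner of the cell and at its neighbours along x and y
# (a two-column matrix index picks one z[row, col] element per point)
z00 <- grid$z[cbind(df$ix,     df$iy)]
z10 <- grid$z[cbind(df$ix + 1, df$iy)]
z01 <- grid$z[cbind(df$ix,     df$iy + 1)]

# Planar estimate: start at the corner, then move along x and along y
df$z <- z00 + tx * (z10 - z00) + ty * (z01 - z00)

print(df)
```

Because `z` is planar here, the interpolated values reproduce `2 * x + 0.1 * y` at each point; on a curved surface the same code would only give an approximation.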
Note that `ix` and `iy` could also have been found with a loop using `findInterval`, e.g. here's one example for the second row:

Which matches `ix` and `iy` in `df[2, ]`.
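A sketch of that cross-check, again using a hypothetical grid and hypothetical points rather than the original data. The names `row2_ix`, `row2_iy`, `loop_ix`, and `loop_iy` are made up for illustration.

```r
# Hypothetical evenly spaced grid breaks and lookup points (assumptions)
grid <- list(x = c(0, 1, 2, 3, 4),
             y = c(10, 20, 30, 40))
df <- data.frame(x = c(0.5, 1.25, 3.75, 2.5),
                 y = c(12, 35, 28, 15))

# Bin indices from the vectorized formula
df$ix <- trunc((df$x - min(grid$x)) / diff(range(grid$x)) * (length(grid$x) - 1)) + 1
df$iy <- trunc((df$y - min(grid$y)) / diff(range(grid$y)) * (length(grid$y) - 1)) + 1

# The same lookup for the second row only, using findInterval()
row2_ix <- findInterval(df$x[2], grid$x)
row2_iy <- findInterval(df$y[2], grid$y)

# And for every row, one value at a time in a loop
loop_ix <- integer(nrow(df))
loop_iy <- integer(nrow(df))
for (i in seq_len(nrow(df))) {
  loop_ix[i] <- findInterval(df$x[i], grid$x)
  loop_iy[i] <- findInterval(df$y[i], grid$y)
}

print(df[2, ])
print(c(ix = row2_ix, iy = row2_iy))
```

When the breaks are the same for every value, `findInterval()` also accepts a whole vector, so `findInterval(df$x, grid$x)` would return all the x indices in one call, and it does not require evenly spaced breaks.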
Footnote: (1) The fourth argument of vlookup was previously called "match", but after they introduced the ribbon it was renamed to "[range_lookup]".