I have a dataset about 105000 rows and 30 columns. I have a categorical variable that I would like to assign it to a number. In Excel, I would probably do something with VLOOKUP
and fill.
How would I go about doing the same thing in R
?
Essentially, what I have is a HouseType
variable, and I need to calculate the HouseTypeNo
. Here are some sample data:
HouseType HouseTypeNo
Semi 1
Single 2
Row 3
Single 2
Apartment 4
Apartment 4
Row 3
You could use
mapvalues()
from the plyr package.Initial data:
Lookup / crosswalk table:
Create the new variable:
Or for simple replacements you can skip creating a long lookup table and do this directly in one step:
Result:
Solution #2 of @Ben's answer is not reproducible in other more generic examples. It happens to give the correct lookup in the example because the unique
HouseType
inhouses
appear in increasing order. Try this:Bens solution#2 gives
which when
when the correct answer is 17 from the lookup table
The correct way to do it is
Now the lookups are performed correctly
I tried to edit Bens answer but it gets rejected for reasons I cannot understand.
If I understand your question correctly, here are four methods to do the equivalent of Excel's
VLOOKUP
and fill down usingR
:Here are four methods to fill the
HouseTypeNo
in thelargetable
using the values in thelookup
table:First with
merge
in base:A second method with named vectors in base:
Third, using the
plyr
package:Fourth, using the
sqldf
packageIf it's possible that some house types in
largetable
do not exist inlookup
then a left join would be used:Corresponding changes to the other solutions would be needed too.
Is that what you wanted to do? Let me know which method you like and I'll add commentary.
Using
merge
is different from lookup in Excel as it has potential to duplicate (multiply) your data if primary key constraint is not enforced in lookup table or reduce the number of records if you are not usingall.x = T
.To make sure you don't get into trouble with that and lookup safely, I suggest two strategies.
First one is to make a check on a number of duplicated rows in lookup key:
This will force you to de-dupe lookup dataset before using it:
Second option is to reproduce Excel behaviour by taking the first matching value from the lookup dataset:
These functions are slightly different from
lookup
as they add multiple columns.Starting with:
... you can use
... to give a unique number for each house type. You can see the result here:
... so you end up with different numbers on the rows (because the factors are ordered alphabetically) but the same pattern.
(EDIT: the remaining text in this answer is actually redundant. It occurred to me to check and it turned out that
read.table()
had already made houses$HouseType into a factor when it was read into the dataframe in the first place).However, you may well be better just to convert HouseType to a factor, which would give you all the same benefits as HouseTypeNo, but would be easier to interpret because the house types are named rather than numbered, e.g.:
I also like using
qdapTools::lookup
or shorthand binary operator%l%
. It works identically to an Excel vlookup, but it accepts name arguments opposed to column numbers