可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have a dataset about 105000 rows and 30 columns. I have a categorical variable that I would like to assign it to a number. In Excel, I would probably do something with VLOOKUP
and fill.
How would I go about doing the same thing in R
?
Essentially, what I have is a HouseType
variable, and I need to calculate the HouseTypeNo
. Here are some sample data:
HouseType HouseTypeNo
Semi 1
Single 2
Row 3
Single 2
Apartment 4
Apartment 4
Row 3
回答1:
If I understand your question correctly, here are four methods to do the equivalent of Excel\'s VLOOKUP
and fill down using R
:
# load sample data from Q
hous <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text=\"HouseType HouseTypeNo
Semi 1
Single 2
Row 3
Single 2
Apartment 4
Apartment 4
Row 3\")
# create a toy large table with a \'HouseType\' column
# but no \'HouseTypeNo\' column (yet)
largetable <- data.frame(HouseType = as.character(sample(unique(hous$HouseType), 1000, replace = TRUE)), stringsAsFactors = FALSE)
# create a lookup table to get the numbers to fill
# the large table
lookup <- unique(hous)
HouseType HouseTypeNo
1 Semi 1
2 Single 2
3 Row 3
5 Apartment 4
Here are four methods to fill the HouseTypeNo
in the largetable
using the values in the lookup
table:
First with merge
in base:
# 1. using base
base1 <- (merge(lookup, largetable, by = \'HouseType\'))
A second method with named vectors in base:
# 2. using base and a named vector
housenames <- as.numeric(1:length(unique(hous$HouseType)))
names(housenames) <- unique(hous$HouseType)
base2 <- data.frame(HouseType = largetable$HouseType,
HouseTypeNo = (housenames[largetable$HouseType]))
Third, using the plyr
package:
# 3. using the plyr package
library(plyr)
plyr1 <- join(largetable, lookup, by = \"HouseType\")
Fourth, using the sqldf
package
# 4. using the sqldf package
library(sqldf)
sqldf1 <- sqldf(\"SELECT largetable.HouseType, lookup.HouseTypeNo
FROM largetable
INNER JOIN lookup
ON largetable.HouseType = lookup.HouseType\")
If it\'s possible that some house types in largetable
do not exist in lookup
then a left join would be used:
sqldf(\"select * from largetable left join lookup using (HouseType)\")
Corresponding changes to the other solutions would be needed too.
Is that what you wanted to do? Let me know which method you like and I\'ll add commentary.
回答2:
I think you can also use match()
:
largetable$HouseTypeNo <- with(lookup,
HouseTypeNo[match(largetable$HouseType,
HouseType)])
This still works if I scramble the order of lookup
.
回答3:
I also like using qdapTools::lookup
or shorthand binary operator %l%
. It works identically to an Excel vlookup, but it accepts name arguments opposed to column numbers
## Replicate Ben\'s data:
hous <- structure(list(HouseType = c(\"Semi\", \"Single\", \"Row\", \"Single\",
\"Apartment\", \"Apartment\", \"Row\"), HouseTypeNo = c(1L, 2L, 3L,
2L, 4L, 4L, 3L)), .Names = c(\"HouseType\", \"HouseTypeNo\"),
class = \"data.frame\", row.names = c(NA, -7L))
largetable <- data.frame(HouseType = as.character(sample(unique(hous$HouseType),
1000, replace = TRUE)), stringsAsFactors = FALSE)
## It\'s this simple:
library(qdapTools)
largetable[, 1] %l% hous
回答4:
Solution #2 of @Ben\'s answer is not reproducible in other more generic examples. It happens to give the correct lookup in the example because the unique HouseType
in houses
appear in increasing order. Try this:
hous <- read.table(header = TRUE, stringsAsFactors = FALSE, text=\"HouseType HouseTypeNo
Semi 1
ECIIsHome 17
Single 2
Row 3
Single 2
Apartment 4
Apartment 4
Row 3\")
largetable <- data.frame(HouseType = as.character(sample(unique(hous$HouseType), 1000, replace = TRUE)), stringsAsFactors = FALSE)
lookup <- unique(hous)
Bens solution#2 gives
housenames <- as.numeric(1:length(unique(hous$HouseType)))
names(housenames) <- unique(hous$HouseType)
base2 <- data.frame(HouseType = largetable$HouseType,
HouseTypeNo = (housenames[largetable$HouseType]))
which when
unique(base2$HouseTypeNo[ base2$HouseType==\"ECIIsHome\" ])
[1] 2
when the correct answer is 17 from the lookup table
The correct way to do it is
hous <- read.table(header = TRUE, stringsAsFactors = FALSE, text=\"HouseType HouseTypeNo
Semi 1
ECIIsHome 17
Single 2
Row 3
Single 2
Apartment 4
Apartment 4
Row 3\")
largetable <- data.frame(HouseType = as.character(sample(unique(hous$HouseType), 1000, replace = TRUE)), stringsAsFactors = FALSE)
housenames <- tapply(hous$HouseTypeNo, hous$HouseType, unique)
base2 <- data.frame(HouseType = largetable$HouseType,
HouseTypeNo = (housenames[largetable$HouseType]))
Now the lookups are performed correctly
unique(base2$HouseTypeNo[ base2$HouseType==\"ECIIsHome\" ])
ECIIsHome
17
I tried to edit Bens answer but it gets rejected for reasons I cannot understand.
回答5:
Starting with:
houses <- read.table(text=\"Semi 1
Single 2
Row 3
Single 2
Apartment 4
Apartment 4
Row 3\",col.names=c(\"HouseType\",\"HouseTypeNo\"))
... you can use
as.numeric(factor(houses$HouseType))
... to give a unique number for each house type. You can see the result here:
> houses2 <- data.frame(houses,as.numeric(factor(houses$HouseType)))
> houses2
HouseType HouseTypeNo as.numeric.factor.houses.HouseType..
1 Semi 1 3
2 Single 2 4
3 Row 3 2
4 Single 2 4
5 Apartment 4 1
6 Apartment 4 1
7 Row 3 2
... so you end up with different numbers on the rows (because the factors are ordered alphabetically) but the same pattern.
(EDIT: the remaining text in this answer is actually redundant. It occurred to me to check and it turned out that read.table()
had already made houses$HouseType into a factor when it was read into the dataframe in the first place).
However, you may well be better just to convert HouseType to a factor, which would give you all the same benefits as HouseTypeNo, but would be easier to interpret because the house types are named rather than numbered, e.g.:
> houses3 <- houses
> houses3$HouseType <- factor(houses3$HouseType)
> houses3
HouseType HouseTypeNo
1 Semi 1
2 Single 2
3 Row 3
4 Single 2
5 Apartment 4
6 Apartment 4
7 Row 3
> levels(houses3$HouseType)
[1] \"Apartment\" \"Row\" \"Semi\" \"Single\"
回答6:
The poster didn\'t ask about looking up values if exact=FALSE
, but I\'m adding this as an answer for my own reference and possibly others.
If you\'re looking up categorical values, use the other answers.
Excel\'s vlookup
also allows you to match match approximately for numeric values with the 4th argument(1) match=TRUE
. I think of match=TRUE
like looking up values on a thermometer. The default value is FALSE, which is perfect for categorical values.
If you want to match approximately (perform a lookup), R has a function called findInterval
, which (as the name implies) will find the interval / bin that contains your continuous numeric value.
However, let\'s say that you want to findInterval
for several values. You could write a loop or use an apply function. However, I\'ve found it more efficient to take a DIY vectorized approach.
Let\'s say that you have a grid of values indexed by x and y:
grid <- list(x = c(-87.727, -87.723, -87.719, -87.715, -87.711),
y = c(41.836, 41.839, 41.843, 41.847, 41.851),
z = (matrix(data = c(-3.428, -3.722, -3.061, -2.554, -2.362,
-3.034, -3.925, -3.639, -3.357, -3.283,
-0.152, -1.688, -2.765, -3.084, -2.742,
1.973, 1.193, -0.354, -1.682, -1.803,
0.998, 2.863, 3.224, 1.541, -0.044),
nrow = 5, ncol = 5)))
and you have some values you want to look up by x and y:
df <- data.frame(x = c(-87.723, -87.712, -87.726, -87.719, -87.722, -87.722),
y = c(41.84, 41.842, 41.844, 41.849, 41.838, 41.842),
id = c(\"a\", \"b\", \"c\", \"d\", \"e\", \"f\")
Here is the example visualized:
contour(grid)
points(df$x, df$y, pch=df$id, col=\"blue\", cex=1.2)
You can find the x intervals and y intervals with this type of formula:
xrng <- range(grid$x)
xbins <- length(grid$x) -1
yrng <- range(grid$y)
ybins <- length(grid$y) -1
df$ix <- trunc( (df$x - min(xrng)) / diff(xrng) * (xbins)) + 1
df$iy <- trunc( (df$y - min(yrng)) / diff(yrng) * (ybins)) + 1
You could take it one step further and perform a (simplistic) interpolation on the z values in grid
like this:
df$z <- with(df, (grid$z[cbind(ix, iy)] +
grid$z[cbind(ix + 1, iy)] +
grid$z[cbind(ix, iy + 1)] +
grid$z[cbind(ix + 1, iy + 1)]) / 4)
Which gives you these values:
contour(grid, xlim = range(c(grid$x, df$x)), ylim = range(c(grid$y, df$y)))
points(df$x, df$y, pch=df$id, col=\"blue\", cex=1.2)
text(df$x + .001, df$y, lab=round(df$z, 2), col=\"blue\", cex=1)
df
# x y id ix iy z
# 1 -87.723 41.840 a 2 2 -3.00425
# 2 -87.712 41.842 b 4 2 -3.11650
# 3 -87.726 41.844 c 1 3 0.33150
# 4 -87.719 41.849 d 3 4 0.68225
# 6 -87.722 41.838 e 2 1 -3.58675
# 7 -87.722 41.842 f 2 2 -3.00425
Note that ix, and iy could have also been found with a loop using findInterval
, e.g. here\'s one example for the second row
findInterval(df$x[2], grid$x)
# 4
findInterval(df$y[2], grid$y)
# 2
Which matches ix
and iy
in df[2]
Footnote:
(1) The fourth argument of vlookup was previously called \"match\", but after they introduced the ribbon it was renamed to \"[range_lookup]\".
回答7:
You could use mapvalues()
from the plyr package.
Initial data:
dat <- data.frame(HouseType = c(\"Semi\", \"Single\", \"Row\", \"Single\", \"Apartment\", \"Apartment\", \"Row\"))
> dat
HouseType
1 Semi
2 Single
3 Row
4 Single
5 Apartment
6 Apartment
7 Row
Lookup / crosswalk table:
lookup <- data.frame(type_text = c(\"Semi\", \"Single\", \"Row\", \"Apartment\"), type_num = c(1, 2, 3, 4))
> lookup
type_text type_num
1 Semi 1
2 Single 2
3 Row 3
4 Apartment 4
Create the new variable:
dat$house_type_num <- plyr::mapvalues(dat$HouseType, from = lookup$type_text, to = lookup$type_num)
Or for simple replacements you can skip creating a long lookup table and do this directly in one step:
dat$house_type_num <- plyr::mapvalues(dat$HouseType,
from = c(\"Semi\", \"Single\", \"Row\", \"Apartment\"),
to = c(1, 2, 3, 4))
Result:
> dat
HouseType house_type_num
1 Semi 1
2 Single 2
3 Row 3
4 Single 2
5 Apartment 4
6 Apartment 4
7 Row 3
回答8:
Using merge
is different from lookup in Excel as it has potential to duplicate (multiply) your data if primary key constraint is not enforced in lookup table or reduce the number of records if you are not using all.x = T
.
To make sure you don\'t get into trouble with that and lookup safely, I suggest two strategies.
First one is to make a check on a number of duplicated rows in lookup key:
safeLookup <- function(data, lookup, by, select = setdiff(colnames(lookup), by)) {
# Merges data to lookup making sure that the number of rows does not change.
stopifnot(sum(duplicated(lookup[, by])) == 0)
res <- merge(data, lookup[, c(by, select)], by = by, all.x = T)
return (res)
}
This will force you to de-dupe lookup dataset before using it:
baseSafe <- safeLookup(largetable, house.ids, by = \"HouseType\")
# Error: sum(duplicated(lookup[, by])) == 0 is not TRUE
baseSafe<- safeLookup(largetable, unique(house.ids), by = \"HouseType\")
head(baseSafe)
# HouseType HouseTypeNo
# 1 Apartment 4
# 2 Apartment 4
# ...
Second option is to reproduce Excel behaviour by taking the first matching value from the lookup dataset:
firstLookup <- function(data, lookup, by, select = setdiff(colnames(lookup), by)) {
# Merges data to lookup using first row per unique combination in by.
unique.lookup <- lookup[!duplicated(lookup[, by]), ]
res <- merge(data, unique.lookup[, c(by, select)], by = by, all.x = T)
return (res)
}
baseFirst <- firstLookup(largetable, house.ids, by = \"HouseType\")
These functions are slightly different from lookup
as they add multiple columns.