Question:
I'm still learning how to translate SAS code into R and I get warnings. I need to understand where I'm making mistakes. What I want to do is create a variable that summarizes and differentiates 3 statuses of a population: mainland, overseas, foreigner.
I have a database with 2 variables:
- id nationality: idnat (french, foreigner)
If idnat is french, then:
- id birthplace: idbp (mainland, colony, overseas)
I want to summarize the info from idnat and idbp into a new variable called idnat2:
- status: k (mainland, overseas, foreigner)
All these variables use "character type".
Results expected in column idnat2:
    idnat     idbp   idnat2
1  french mainland mainland
2  french   colony overseas
3  french overseas overseas
4 foreign  foreign  foreign
Here is the SAS code I want to translate into R:
if idnat = "french" then do;
    if idbp in ("overseas","colony") then idnat2 = "overseas";
    else idnat2 = "mainland";
end;
else idnat2 = "foreigner";
run;
Here is my attempt in R:
if(idnat=="french"){
    idnat2 <- "mainland"
} else if(idbp=="overseas"|idbp=="colony"){
    idnat2 <- "overseas"
} else {
    idnat2 <- "foreigner"
}
I receive this warning:
Warning message:
In if (idnat=="french") { :
  the condition has length > 1 and only the first element will be used
I was advised to use a "nested ifelse" instead because it is simpler, but I get more warnings:
idnat2 <- ifelse (idnat=="french", "mainland",
    ifelse (idbp=="overseas"|idbp=="colony", "overseas")
    )
else (idnat2 <- "foreigner")
According to the warning message, the length is greater than 1, so only what's between the first brackets will be taken into account. Sorry, but I don't understand what this length has to do with it. Does anybody know where I'm going wrong?
Answer 1:
If you use any spreadsheet application, you will know the basic function if() with the syntax:
if(<condition>, <yes>, <no>)
The syntax is exactly the same for ifelse() in R:
ifelse(<condition>, <yes>, <no>)
The only difference from if() in a spreadsheet application is that R's ifelse() is vectorized (it takes vectors as input and returns a vector as output). Consider the following comparison of formulas in a spreadsheet application and in R, where we would like to test whether a > b and return 1 if yes and 0 if not.
In a spreadsheet:
   A  B  C
1  3  1  =if(A1 > B1, 1, 0)
2  2  2  =if(A2 > B2, 1, 0)
3  1  3  =if(A3 > B3, 1, 0)
In R:
> a <- 3:1; b <- 1:3
> ifelse(a > b, 1, 0)
[1] 1 0 0
ifelse() can be nested in many ways:
ifelse(<condition>, <yes>, ifelse(<condition>, <yes>, <no>))
ifelse(<condition>, ifelse(<condition>, <yes>, <no>), <no>)
ifelse(<condition>,
ifelse(<condition>, <yes>, <no>),
ifelse(<condition>, <yes>, <no>)
)
ifelse(<condition>, <yes>,
ifelse(<condition>, <yes>,
ifelse(<condition>, <yes>, <no>)
)
)
To calculate column idnat2 you can:
df <- read.table(header=TRUE, text="
idnat idbp idnat2
french mainland mainland
french colony overseas
french overseas overseas
foreign foreign foreign"
)
with(df,
  ifelse(idnat=="french",
    ifelse(idbp %in% c("overseas","colony"),"overseas","mainland"),"foreign")
)
See the R documentation for ifelse() (?ifelse).
What does "the condition has length > 1 and only the first element will be used" mean? Let's see:
> # What is the first condition really testing?
> with(df, idnat=="french")
[1]  TRUE  TRUE  TRUE FALSE
> # This is the result of a vectorized function: each element of idnat is compared
> # with the string "french".
> # A vector of logical values is returned (it has the same length as idnat)
> df$idnat2 <- with(df,
+   if(idnat=="french"){
+   idnat2 <- "xxx"
+   }
+   )
Warning message:
In if (idnat == "french") { :
  the condition has length > 1 and only the first element will be used
> # Note that the first element of the comparison is TRUE and that's why we get:
> df
    idnat     idbp idnat2
1  french mainland    xxx
2  french   colony    xxx
3  french overseas    xxx
4 foreign  foreign    xxx
> # There really is logic in it, you just have to get used to it
Can I still use if()? Yes, you can, but the syntax is not so cool :)
test <- function(x) {
  if(x=="french") {
    "french"
  } else{
    "not really french"
  }
}
apply(array(df[["idnat"]]),MARGIN=1, FUN=test)
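For what it's worth, here is a minimal sketch of the same idea using vapply() rather than apply(array(...)). It calls the scalar function once per element and checks that each result is a single string. The vector idnat is a hypothetical stand-in for df[["idnat"]] and is assumed to be a character vector, not a factor.

```r
# Hypothetical data mirroring the example above (assumed character, not factor)
idnat <- c("french", "french", "french", "foreign")

# Same scalar function as in the answer: it expects a single string
test <- function(x) {
  if (x == "french") {
    "french"
  } else {
    "not really french"
  }
}

# vapply() calls test() once per element and requires each result
# to be a character vector of length 1
res <- vapply(idnat, test, FUN.VALUE = character(1), USE.NAMES = FALSE)
res
```

Because test() only ever sees one element at a time, the condition inside if() always has length 1.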
If you are familiar with SQL, you can also use a CASE statement via the sqldf package.
Answer 2:
Try something like the following:
# some sample data
idnat <- sample(c("french","foreigner"),100,TRUE)
idbp <- rep(NA,100)
idbp[idnat=="french"] <- sample(c("mainland","overseas","colony"),sum(idnat=="french"),TRUE)
# recoding
out <- ifelse(idnat=="french" & !idbp %in% c("overseas","colony"), "mainland",
        ifelse(idbp %in% c("overseas","colony"),"overseas",
        "foreigner"))
cbind(idnat,idbp,out) # check result
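A side note on the recoding above, as a sketch on a small hand-made version of the sample data (assumed here, not drawn with sample()): idbp is NA for foreigners, and the recoding still works because %in% treats NA as "no match" and returns FALSE, whereas == would propagate the NA.

```r
# Small deterministic version of the sample data: idbp is NA for foreigners
idnat <- c("french", "french", "french", "foreigner")
idbp  <- c("mainland", "colony", "overseas", NA)

# %in% never returns NA: a missing idbp simply does not match
overseas <- idbp %in% c("overseas", "colony")

out <- ifelse(idnat == "french" & !overseas, "mainland",
              ifelse(overseas, "overseas", "foreigner"))
out

# With == the missing value propagates into the result instead
out_eq <- ifelse(idbp == "overseas" | idbp == "colony", "overseas", "other")
out_eq
```

The last element of out_eq is NA, which is the kind of surprise the %in% form avoids.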
Your confusion comes from how SAS and R handle if-else constructions. In R, if and else are not vectorized, meaning they check whether a single condition is true (i.e., if("french"=="french") works) and cannot handle multiple logicals (i.e., if(c("french","foreigner")=="french") doesn't work), so R gives you the warning you're receiving.
By contrast, ifelse is vectorized, so it can take your vectors (aka input variables) and test the logical condition on each of their elements, like you're used to in SAS. An alternative way to wrap your head around this would be to build a loop using if and else statements (as you've started to do here), but the vectorized ifelse approach will be more efficient and generally involves less code.
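To make the "length" in the warning concrete, here is a minimal sketch on a hypothetical four-element idnat. It only inspects the condition and reduces it explicitly with all() and any(), rather than passing the whole vector to if(); as far as I know, since R 4.2.0 doing so is an error rather than a warning.

```r
# Hypothetical data mirroring the question
idnat <- c("french", "french", "french", "foreign")

cond <- idnat == "french"   # one logical value per element
length(cond)                # 4, so if() cannot treat it as a single condition

# if() needs exactly one TRUE/FALSE; reduce the vector explicitly
# when a single overall answer is what you want
if (all(cond)) {
  msg <- "everyone is french"
} else if (any(cond)) {
  msg <- "some are french"
} else {
  msg <- "nobody is french"
}
msg

# For one decision per element, use ifelse() instead
flag <- ifelse(cond, "french", "not french")
flag
```

In short: if() answers one question about the whole vector, ifelse() answers the same question once per element.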
Answer 3:
You can create the vector idnat2 without if and ifelse.
The function replace can be used to replace all occurrences of "colony" with "overseas":
idnat2 <- replace(idbp, idbp == "colony", "overseas")
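A quick check of this on the four example rows, as a sketch. It relies on an assumption taken from the expected-results table in the question: idbp already holds "foreign" for foreigners, so only "colony" has to be recoded.

```r
# The four rows from the expected-results table
idnat <- c("french", "french", "french", "foreign")
idbp  <- c("mainland", "colony", "overseas", "foreign")

# Copy idbp and overwrite only the "colony" entries
idnat2 <- replace(idbp, idbp == "colony", "overseas")
data.frame(idnat, idbp, idnat2)
```

If idbp were missing or coded differently for foreigners, this shortcut would not be enough on its own and idnat would have to be consulted as well.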
Answer 4:
If the data set contains many rows, it might be more efficient to join with a lookup table using data.table instead of nested ifelse().
Provided the lookup table below
lookup
idnat idbp idnat2
1: french mainland mainland
2: french colony overseas
3: french overseas overseas
4: foreign foreign foreign
and a sample data set
library(data.table)
n_row <- 10L
set.seed(1L)
DT <- data.table(idnat = "french",
                 idbp = sample(c("mainland", "colony", "overseas", "foreign"), n_row, replace = TRUE))
DT[idbp == "foreign", idnat := "foreign"][]
idnat idbp
1: french colony
2: french colony
3: french overseas
4: foreign foreign
5: french mainland
6: foreign foreign
7: foreign foreign
8: french overseas
9: french overseas
10: french mainland
then we can do an update while joining:
DT[lookup, on = .(idnat, idbp), idnat2 := i.idnat2][]
idnat idbp idnat2
1: french colony overseas
2: french colony overseas
3: french overseas overseas
4: foreign foreign foreign
5: french mainland mainland
6: foreign foreign foreign
7: foreign foreign foreign
8: french overseas overseas
9: french overseas overseas
10: french mainland mainland
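For readers without data.table, the lookup idea can be sketched in base R with match(). This is a swapped-in technique, not the update join shown above, and the data frame dat is a small hypothetical sample rather than the random one from the answer.

```r
# Lookup table from the answer, rebuilt here as a plain data frame
lookup <- data.frame(
  idnat  = c("french", "french", "french", "foreign"),
  idbp   = c("mainland", "colony", "overseas", "foreign"),
  idnat2 = c("mainland", "overseas", "overseas", "foreign"),
  stringsAsFactors = FALSE
)

# Hypothetical sample data (not the random sample used above)
dat <- data.frame(
  idnat = c("french", "foreign", "french", "french"),
  idbp  = c("colony", "foreign", "mainland", "overseas"),
  stringsAsFactors = FALSE
)

# Build one combined key per row and find its position in the lookup table
key_dat    <- paste(dat$idnat, dat$idbp, sep = "|")
key_lookup <- paste(lookup$idnat, lookup$idbp, sep = "|")
dat$idnat2 <- lookup$idnat2[match(key_dat, key_lookup)]
dat
```

Rows whose combination is not in the lookup table would get NA in idnat2, which makes unexpected categories easy to spot.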
Answer 5:
Using the SQL CASE statement with the dplyr and sqldf packages:
Data
df <- structure(list(idnat = structure(c(2L, 2L, 2L, 1L), .Label = c("foreign",
"french"), class = "factor"), idbp = structure(c(3L, 1L, 4L,
2L), .Label = c("colony", "foreign", "mainland", "overseas"), class = "factor")), .Names = c("idnat",
"idbp"), class = "data.frame", row.names = c(NA, -4L))
sqldf
library(sqldf)
sqldf("SELECT idnat, idbp,
        CASE
          WHEN idbp IN ('colony', 'overseas') THEN 'overseas'
          ELSE idbp
        END AS idnat2
       FROM df")
dplyr
library(dplyr)
df %>%
  mutate(idnat2 = case_when(idbp == "mainland" ~ "mainland",
                            idbp %in% c("colony", "overseas") ~ "overseas",
                            TRUE ~ "foreign"))
Output
idnat idbp idnat2
1 french mainland mainland
2 french colony overseas
3 french overseas overseas
4 foreign foreign foreign
Answer 6:
With data.table, the solution is:
DT[, idnat2 := ifelse(idbp %in% "foreign", "foreign",
        ifelse(idbp %in% c("colony", "overseas"), "overseas", "mainland" ))]
ifelse is vectorized; if-else is not. Here, DT is:
idnat idbp
1 french mainland
2 french colony
3 french overseas
4 foreign foreign
This gives:
idnat idbp idnat2
1: french mainland mainland
2: french colony overseas
3: french overseas overseas
4: foreign foreign foreign
Answer 7:
# Read in the data.
idnat <- c("french","french","french","foreign")
idbp <- c("mainland","colony","overseas","foreign")
# Initialize the new variable with one empty string per case.
idnat2 <- character(length(idnat))
# Evaluate "idnat" and "idbp" for each case, assigning the appropriate level to "idnat2".
# && and || are used because each condition compares single elements.
for(i in seq_along(idnat)) {
  if(idnat[i] == "french" && idbp[i] == "mainland") {
    idnat2[i] <- "mainland"
  } else if (idnat[i] == "french" && (idbp[i] == "colony" || idbp[i] == "overseas")) {
    idnat2[i] <- "overseas"
  } else {
    idnat2[i] <- "foreign"
  }
}
# Create a data frame with the two old variables and the new variable.
data.frame(idnat,idbp,idnat2)
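The same recoding can also be sketched without a loop, using logical indexing: start from the default value and overwrite the subsets that match each rule. This is an alternative formulation on the same four rows, not part of the original answer; the helper name is_french is introduced here for illustration.

```r
# Same four rows as above
idnat <- c("french", "french", "french", "foreign")
idbp  <- c("mainland", "colony", "overseas", "foreign")

# Start from the default and overwrite the subsets that match each rule
idnat2 <- rep("foreign", length(idnat))
is_french <- idnat == "french"          # hypothetical helper vector
idnat2[is_french & idbp == "mainland"] <- "mainland"
idnat2[is_french & idbp %in% c("colony", "overseas")] <- "overseas"

data.frame(idnat, idbp, idnat2)
```

Each assignment touches only the rows selected by its logical vector, so the order of the rules mirrors the if / else if / else chain in the loop.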