I'm still learning how to translate a SAS code into R and I get warnings. I need to understand where I'm making mistakes. What I want to do is create a variable which summarizes and differentiates 3 status of a population: mainland, overseas, foreigner. I have a database with 2 variables:
- id nationality:
idnat
(french, foreigner),
If idnat
is french then:
- id birthplace:
idbp
(mainland, colony, overseas)
I want to summarize the info from idnat
and idbp
into a new variable called idnat2
:
- status: k (mainland, overseas, foreigner)
All these variables use "character type".
Results expected in column idnat2 :
idnat idbp idnat2
1 french mainland mainland
2 french colony overseas
3 french overseas overseas
4 foreign foreign foreign
Here is my SAS code I want to translate in R:
if idnat = "french" then do;
if idbp in ("overseas","colony") then idnat2 = "overseas";
else idnat2 = "mainland";
end;
else idnat2 = "foreigner";
run;
Here is my attempt in R:
if(idnat=="french"){
idnat2 <- "mainland"
} else if(idbp=="overseas"|idbp=="colony"){
idnat2 <- "overseas"
} else {
idnat2 <- "foreigner"
}
I receive this warning:
Warning message:
In if (idnat=="french") { :
the condition has length > 1 and only the first element will be used
I was advised to use a "nested ifelse
" instead for its easiness but get more warnings:
idnat2 <- ifelse (idnat=="french", "mainland",
ifelse (idbp=="overseas"|idbp=="colony", "overseas")
)
else (idnat2 <- "foreigner")
According to the Warning message, the length is greater than 1 so only what's between the first brackets will be taken into account. Sorry but I don't understand what this length has to do with here? Anybody know where I'm wrong?
Using the SQL CASE statement with the dplyr and sqldf packages:
Data
sqldf
dplyr
Output
If the data set contains many rows it might be more efficient to join with a lookup table using
data.table
instead of nestedifelse()
.Provided the lookup table below
and a sample data set
then we can do an update while joining:
If you are using any spreadsheet application there is a basic function
if()
with syntax:Syntax is exactly the same for
ifelse()
in R:The only difference to
if()
in spreadsheet application is that Rifelse()
is vectorized (takes vectors as input and return vector on output). Consider the following comparison of formulas in spreadsheet application and in R for an example where we would like to compare if a > b and return 1 if yes and 0 if not.In spreadsheet:
In R:
ifelse()
can be nested in many ways:To calculate column
idnat2
you can:R Documentation
What is
the condition has length > 1 and only the first element will be used
? Let's see:Can I still use
if()
? Yes, you can, but the syntax is not so cool :)If you are familiar with SQL, you can also use
CASE
statement insqldf
package.You can create the vector
idnat2
withoutif
andifelse
.The function
replace
can be used to replace all occurrences of"colony"
with"overseas"
:With data.table, the solutions is:
The
ifelse
is vectorized. Theif-else
is not. Here, DT is:This gives: