I am looking for a way to change the class of the variables in one data frame, using a second data frame that stores the intended class of each variable.
My data has around 150 variables, all stored as character. I want to convert each variable to its proper class, so we created a separate data frame that records the class for each variable. Let me explain with a sample data frame.
Suppose my original data frame is df, with 5 variables:
df <- data.frame(A="a",B="1",C="111111",D="d",E="e")
Now we have another data frame, "variable_info", which contains just 2 variables: "variable_name" and "variable_class".
variable_info <- data.frame(variable_name=c("A","B","C","D","E"),variable_class=c("character","integer","numeric","character","character"))
Using variable_info, I want to change the class of each variable in df to the one given in variable_info$variable_class, matching variables by variable_info$variable_name.
How can we do this for a data frame? Would data.table be a good fit for this, and if so, how can we do it in data.table?
Thank you!!
Prasad
You could try it like this:
Make sure the rows of variable_info are in the same order as the columns of df:
variable_info <- variable_info[match(names(df), variable_info$variable_name),]
Create a list of conversion functions, one per column:
funs <- sapply(paste0("as.", variable_info$variable_class), match.fun)
Then map them to each column:
df[] <- Map(function(dd, f) f(as.character(dd)), df, funs)
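Putting these steps together on the sample data from the question gives roughly the sketch below. It assumes every entry in variable_class has a matching as.<class> function (as.integer, as.numeric, as.character here), and it sets stringsAsFactors = FALSE explicitly so the result does not depend on the R version. The name col_classes is only there for illustration.

```r
# Sample data from the question; all columns start as character.
df <- data.frame(A = "a", B = "1", C = "111111", D = "d", E = "e",
                 stringsAsFactors = FALSE)
variable_info <- data.frame(
  variable_name  = c("A", "B", "C", "D", "E"),
  variable_class = c("character", "integer", "numeric", "character", "character"),
  stringsAsFactors = FALSE
)

# Reorder variable_info so its rows follow the column order of df.
variable_info <- variable_info[match(names(df), variable_info$variable_name), ]

# Look up one conversion function per column, e.g. as.integer for "integer".
funs <- lapply(paste0("as.", variable_info$variable_class), match.fun)

# Apply each function to its column; df[] keeps the data frame structure.
df[] <- Map(function(dd, f) f(as.character(dd)), df, funs)

# Inspect the resulting class of every column.
col_classes <- sapply(df, class)
print(col_classes)
```

Values that cannot be parsed (for example "abc" in an integer column) would become NA with a warning, so it is worth checking for unexpected NAs afterwards.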
With data.table you could do it almost the same way, except that you replace the last line with:
library(data.table)
dt <- as.data.table(df) # or use setDT(df)
dt[, names(dt) := Map(function(dd, f) f(as.character(dd)), dt, funs)]
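As a self-contained sketch of the data.table variant on the sample data (assuming the data.table package is installed): it uses .SD inside the call instead of referring to dt again, which is a small variation on the line above, and idx and dt_classes are illustrative names.

```r
library(data.table)

# Sample data from the question as data.tables.
dt <- data.table(A = "a", B = "1", C = "111111", D = "d", E = "e")
variable_info <- data.table(
  variable_name  = c("A", "B", "C", "D", "E"),
  variable_class = c("character", "integer", "numeric", "character", "character")
)

# One conversion function per column of dt, in dt's column order.
idx  <- match(names(dt), variable_info$variable_name)
funs <- lapply(paste0("as.", variable_info$variable_class[idx]), match.fun)

# := updates the columns by reference; .SD holds the columns of dt.
dt[, names(dt) := Map(function(dd, f) f(as.character(dd)), .SD, funs)]

# Inspect the resulting class of every column.
dt_classes <- sapply(dt, class)
print(dt_classes)
```

Because := modifies dt in place, no reassignment is needed, which may matter with around 150 columns and many rows.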
An alternative approach is to use a function. It takes any pair of data frames, finds their common columns, and assigns the class of each such column in the first to the matching column in the second.
matchColClasses <- function(df1, df2) {
  # Purpose: protect joins from column type mismatches - a problem with multi-column empty df
  # Input: df1 - master for class assignments, df2 - for col reclass and return.
  # Output: df2 with shared columns classed to match df1
  # Usage: df2 <- matchColClasses(df1, df2)
  sharedColNames <- names(df1)[names(df1) %in% names(df2)]
  # drop = FALSE keeps a data frame even when only one column is shared
  sharedColTypes <- sapply(df1[, sharedColNames, drop = FALSE], class)
  for (n in sharedColNames) {
    # class<- coerces between basic types such as integer, numeric and character
    class(df2[[n]]) <- sharedColTypes[[n]]
  }
  return(df2)
}
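A short usage sketch for this approach follows. The function is repeated in compact form so the block runs on its own, and the data frames template and raw, with their column names, are made up for illustration. It assumes the shared columns only involve basic classes such as integer, numeric and character.

```r
# Compact copy of the function above, so this example is self-contained.
matchColClasses <- function(df1, df2) {
  sharedColNames <- names(df1)[names(df1) %in% names(df2)]
  sharedColTypes <- sapply(df1[, sharedColNames, drop = FALSE], class)
  for (n in sharedColNames) {
    class(df2[[n]]) <- sharedColTypes[[n]]
  }
  df2
}

# Hypothetical empty template that only carries the intended column classes.
template <- data.frame(id = integer(0), score = numeric(0))

# Hypothetical data read in with every column as character.
raw <- data.frame(id = c("1", "2"), score = c("1.5", "2.5"), note = c("x", "y"),
                  stringsAsFactors = FALSE)

# Shared columns (id, score) take the template's classes; note is left alone.
out <- matchColClasses(template, raw)
str(out)
```

For the original question, this would mean building a template data frame with the desired classes first, so the variable_info-based approach above is probably the more direct route there.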