I actually have a data frame with 2000 rows (different days). Each row contains a character "vector" holding binary information on 30 different skills: if a skill has been used, its number appears in the vector. But to simplify:
Suppose I have a data frame with 3 observations (3 days) of 10 different skills, in a column named "S_total":
S_total = [1,3,7,8,9,10], [5,9], []
and a variable Day = 1, 2, 3.
I'd like to construct a data frame with 3 rows and 12 columns, the columns being: Day, S_total, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, where the numbered variables could be of the format true/false.
I have thought in the direction of as.numeric(read.csv) and then a for-loop containing cbind.
But there must be a better way? The tidyverse? I would hope for someone to demonstrate regular expressions and the Map command.
You can simply add a new column by using either dataFrame$newColumn or dataFrame[, "newColumn"]. Then you can use grepl to test whether a skill is found in the vector dataFrame$S_total, e.g.
dataFrame[, "1"] <- grepl("\\b1\\b", dataFrame$S_total)
The word boundaries (\\b) make sure that skill 1 does not also match inside 11 or 12.
To get all different skills that occur in the dataset, you can split the character vectors into single numbers and then use unique. Then you can loop through all different skills and create one new column for each skill:
> dataFrame <- data.frame(S_total = c(toString(c(1,3,7,8,11,20)), toString(c(5,12)), ""),
+                         Day = c(1,2,3),
+                         stringsAsFactors = FALSE)
>
> dataFrame
             S_total Day
1 1, 3, 7, 8, 11, 20   1
2              5, 12   2
3                      3
>
> allSkill <- sort(unique(unlist(strsplit(dataFrame$S_total, ", "))))
> for(i in allSkill){
+   # \\b word boundaries: skill "1" must not also match inside "11" or "12"
+   dataFrame[, i] <- grepl(paste0("\\b", i, "\\b"), dataFrame$S_total)
+ }
> dataFrame
             S_total Day     1    11    12    20     3     5     7     8
1 1, 3, 7, 8, 11, 20   1  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
2              5, 12   2 FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
3                      3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
If your dataset is not that large, this will do. If you have a very large dataset and performance is important, you can create the empty columns first and then fill them in the loop, which is faster than adding one column at a time.
No need to use map or any of the tidyverse packages in my opinion.
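For larger data, one possible variant is to split each row only once and build all skill columns in a single step, rather than growing the data frame inside the loop. The following is only a sketch under assumptions: it reuses the example data from the transcript above, it swaps the grepl test for an exact match with %in%, and the names skillList, skillMatrix and result are made up for illustration.

```r
# Example data, mirroring the data frame used in the answer
dataFrame <- data.frame(
  S_total = c("1, 3, 7, 8, 11, 20", "5, 12", ""),
  Day = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Split every row once, so the strings are not re-scanned for each skill
skillList <- strsplit(dataFrame$S_total, ", ", fixed = TRUE)
allSkill <- sort(unique(unlist(skillList)))

# Logical matrix: one row per day, one column per skill.
# %in% compares whole elements, so "1" cannot match "11" or "12".
skillMatrix <- vapply(
  allSkill,
  function(skill) vapply(skillList, function(s) skill %in% s, logical(1)),
  logical(nrow(dataFrame))
)

# Attach all skill columns at once
result <- cbind(dataFrame, skillMatrix)
```

Because the comparison is done on the split elements, no regular expression is needed here; whether this is actually faster than the loop on a given dataset would have to be measured.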
Very cool solution, just what I needed. I only needed to remove my brackets to get this to work. So, imagining that my vector "S_total" had brackets, I'd have to do:
S_total_nobracket <- gsub("\\[|\\]", "", S_total)
Thanks a mill, for your answer. It was just what I needed :-)
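To make the bracket step concrete, here is a small sketch that applies it to the three bracketed values from the question and then builds the s1 to s10 columns. It assumes the skills are separated by a comma without a space, as written in the question; the names skillList, skills, flags and result are hypothetical.

```r
# Bracketed input in the format shown in the question
S_total <- c("[1,3,7,8,9,10]", "[5,9]", "[]")

# Strip the literal "[" and "]" characters
S_total_nobracket <- gsub("\\[|\\]", "", S_total)

# Split on commas; the empty third entry yields no skills at all
skillList <- strsplit(S_total_nobracket, ",", fixed = TRUE)

# For each day, flag which of the skills 1..10 occur (exact match)
skills <- as.character(1:10)
flags <- t(vapply(skillList, function(s) skills %in% s, logical(length(skills))))
colnames(flags) <- paste0("s", skills)

# 3 rows, 12 columns: Day, S_total, s1, ..., s10
result <- data.frame(Day = 1:3, S_total = S_total, flags,
                     stringsAsFactors = FALSE)
```

Fixing the skill list to 1:10 up front, instead of deriving it from the data, also gives a column for skills that never occur (here s2, s4 and s6).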