I have an R data frame with key-value strings, which looks like this:
quest<-data.frame(city=c("Atlanta","New York","Atlanta","Tampa"), key_value=c("rev=63;qty=1;zip=45987","rev=10.60|34;qty=1|2;zip=12686|12694","rev=12;qty=1;zip=74268","rev=3|24|8;qty=1|6|3;zip=33684|36842|30254"))
which translates to:
city key_value
1 Atlanta rev=63;qty=1;zip=45987
2 New York rev=10.60|34;qty=1|2;zip=12686|12694
3 Atlanta rev=12;qty=1;zip=74268
4 Tampa rev=3|24|8;qty=1|6|3;zip=33684|36842|30254
Based on the above data frame, how can I create a new data frame that looks like the one below?
city rev qty zip
1 Atlanta 63.0 1 45987
2 New York 10.6 1 12686
3 New York 34.0 2 12694
4 Atlanta 12.0 1 74268
5 Tampa 3.0 1 33684
6 Tampa 24.0 6 36842
7 Tampa 8.0 3 30254
"|" is the common delimiter which will determine the number of rows to be created.
Split by ";", then by "=" and "|", and combine into a matrix, using the first part as the column name. Then repeat the rows of the original data frame by however many rows were found for each, and combine. I don't convert any columns to numeric here; they're left as character.
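# One character vector of "key=v1|v2|..." pieces per original row
a <- strsplit(as.character(quest$key_value), ";")
a <- lapply(a, function(x) {
  # Each piece becomes a column: the key on top, its values below
  x <- do.call(cbind, strsplit(x, "[=|]"))
  colnames(x) <- x[1,]
  x[-1,,drop=FALSE]
})
# Repeat each original row once per value row found for it
b <- quest[rep(seq_along(a), sapply(a, nrow)), colnames(quest) != "key_value", drop=FALSE]
out <- cbind(b, do.call(rbind, a), stringsAsFactors=FALSE)
rownames(out) <- NULL
out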
## city rev qty zip
## 1 Atlanta 63 1 45987
## 2 New York 10.60 1 12686
## 3 New York 34 2 12694
## 4 Atlanta 12 1 74268
## 5 Tampa 3 1 33684
## 6 Tampa 24 6 36842
## 7 Tampa 8 3 30254
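Since the columns above are left as character, here is a minimal sketch of one way to convert them afterwards, assuming base R's type.convert with as.is = TRUE is acceptable for your data. It rebuilds out with the same steps so it runs on its own; the conversion line at the end is the only addition and is not part of the original answer.

```r
# Same input as in the question (stringsAsFactors = FALSE keeps city as character)
quest <- data.frame(
  city = c("Atlanta", "New York", "Atlanta", "Tampa"),
  key_value = c("rev=63;qty=1;zip=45987",
                "rev=10.60|34;qty=1|2;zip=12686|12694",
                "rev=12;qty=1;zip=74268",
                "rev=3|24|8;qty=1|6|3;zip=33684|36842|30254"),
  stringsAsFactors = FALSE
)

# Same splitting steps as in the answer above
a <- strsplit(quest$key_value, ";")
a <- lapply(a, function(x) {
  x <- do.call(cbind, strsplit(x, "[=|]"))
  colnames(x) <- x[1, ]
  x[-1, , drop = FALSE]
})
b <- quest[rep(seq_along(a), sapply(a, nrow)), colnames(quest) != "key_value", drop = FALSE]
out <- cbind(b, do.call(rbind, a), stringsAsFactors = FALSE)
rownames(out) <- NULL

# Added step: let R guess a type for each column; as.is = TRUE
# keeps non-numeric columns (city) as character instead of factor
out[] <- lapply(out, type.convert, as.is = TRUE)
str(out)
```

With as.is = TRUE, columns that do not parse as numbers stay character, so only rev, qty and zip change type. If zip codes can have leading zeros, you may prefer to leave zip out of the conversion.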
We can use tidyverse. With separate_rows, split 'key_value' at ";" and expand the rows, then use separate to split the column into two columns ('key', 'value') at "=", and expand the rows at "|" (separate_rows again). Then, grouped by 'city' and 'key', get the sequence number (row_number()) and spread to 'wide' format.
library(tidyverse)
separate_rows(quest, key_value, sep=";") %>%
  separate(key_value, into = c("key", "value"), sep="=") %>%
  separate_rows(value, sep="[|]", convert = TRUE) %>%
  group_by(city, key) %>%
  mutate(rn = row_number()) %>%
  spread(key, value) %>%
  select(-rn)
# A tibble: 7 x 4
# Groups: city [3]
# city qty rev zip
#* <fctr> <dbl> <dbl> <dbl>
#1 Atlanta 1 63.0 45987
#2 Atlanta 1 12.0 74268
#3 New York 1 10.6 12686
#4 New York 2 34.0 12694
#5 Tampa 1 3.0 33684
#6 Tampa 6 24.0 36842
#7 Tampa 3 8.0 30254
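Note that spread sorts the result by city, so the original row order is lost. As a hedged variant (not part of the answer above), the sketch below uses pivot_wider, which supersedes spread in tidyr 1.0.0 and later, and then splits the three value columns in parallel. The id column is a hypothetical helper introduced here only to keep the two Atlanta rows distinct and in their original positions.

```r
library(dplyr)
library(tidyr)

quest <- data.frame(
  city = c("Atlanta", "New York", "Atlanta", "Tampa"),
  key_value = c("rev=63;qty=1;zip=45987",
                "rev=10.60|34;qty=1|2;zip=12686|12694",
                "rev=12;qty=1;zip=74268",
                "rev=3|24|8;qty=1|6|3;zip=33684|36842|30254"),
  stringsAsFactors = FALSE
)

res <- quest %>%
  mutate(id = row_number()) %>%                      # helper: remember the original row
  separate_rows(key_value, sep = ";") %>%            # one row per key=value pair
  separate(key_value, into = c("key", "value"), sep = "=") %>%
  pivot_wider(names_from = key, values_from = value) %>%  # columns rev, qty, zip
  arrange(id) %>%                                    # make the original order explicit
  separate_rows(rev, qty, zip, sep = "[|]", convert = TRUE) %>%  # split "|" in parallel
  select(-id)

res
```

This relies on every key in a row having the same number of "|"-separated values, which holds for the sample data; separate_rows will raise an error if the counts differ.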