How to transform a key/value string into distinct

Posted 2019-02-25 19:39

Question:

I have an R data frame with key/value strings that looks like this:

quest<-data.frame(city=c("Atlanta","New York","Atlanta","Tampa"), key_value=c("rev=63;qty=1;zip=45987","rev=10.60|34;qty=1|2;zip=12686|12694","rev=12;qty=1;zip=74268","rev=3|24|8;qty=1|6|3;zip=33684|36842|30254"))

which translates to:

      city                                  key_value
1  Atlanta                     rev=63;qty=1;zip=45987
2 New York       rev=10.60|34;qty=1|2;zip=12686|12694
3  Atlanta                     rev=12;qty=1;zip=74268
4    Tampa rev=3|24|8;qty=1|6|3;zip=33684|36842|30254

Based on the above data frame, how can I create a new data frame that looks like the one below?

      city  rev qty   zip
1  Atlanta 63.0   1 45987
2 New York 10.6   1 12686
3 New York 34.0   2 12694
4  Atlanta 12.0   1 74268
5    Tampa  3.0   1 33684
6    Tampa 24.0   6 36842
7    Tampa  8.0   3 30254

Within each key, "|" is the delimiter between values, and the number of "|"-separated values determines how many rows are created for that record.

Answer 1:

Split by ;, then by = and |, and combine the pieces into a matrix, using the first element of each piece as the column name. Then repeat each row of the original data frame once per row found in its matrix, and bind the two together. No columns are converted to numeric here; they are left as character.

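# Split each record into its "key=v1|v2|..." pieces
a <- strsplit(as.character(quest$key_value), ";")
a <- lapply(a, function(x) {
    # Split each piece at "=" or "|"; each piece becomes one matrix column
    x <- do.call(cbind, strsplit(x, "[=|]"))
    # The first row holds the keys: use them as column names, then drop that row
    colnames(x) <- x[1,]
    x[-1,,drop=FALSE]
})
# Repeat each original row once per expanded row, keeping all columns but key_value
b <- quest[rep(seq_along(a), sapply(a, nrow)), colnames(quest) != "key_value", drop=FALSE]
out <- cbind(b, do.call(rbind, a), stringsAsFactors=FALSE)
rownames(out) <- NULL
out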
##       city   rev qty   zip
## 1  Atlanta    63   1 45987
## 2 New York 10.60   1 12686
## 3 New York    34   2 12694
## 4  Atlanta    12   1 74268
## 5    Tampa     3   1 33684
## 6    Tampa    24   6 36842
## 7    Tampa     8   3 30254
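
Because the columns above are left as character, one possible follow-up is to convert them with base R's type.convert(). The sketch below is self-contained and repeats the same splitting steps. The helper name parse_kv is hypothetical (it is not part of the original answer), and the code assumes every key in a record carries the same number of "|"-separated values.

```r
# Same example data as in the question.
quest <- data.frame(
  city = c("Atlanta", "New York", "Atlanta", "Tampa"),
  key_value = c("rev=63;qty=1;zip=45987",
                "rev=10.60|34;qty=1|2;zip=12686|12694",
                "rev=12;qty=1;zip=74268",
                "rev=3|24|8;qty=1|6|3;zip=33684|36842|30254"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn one "k=v1|v2;..." string into a character matrix
# with one column per key and one row per "|"-separated value.
parse_kv <- function(s) {
  parts <- strsplit(strsplit(s, ";", fixed = TRUE)[[1]], "[=|]")
  m <- do.call(cbind, parts)
  colnames(m) <- m[1, ]
  m[-1, , drop = FALSE]
}

mats <- lapply(quest$key_value, parse_kv)

# Repeat each city once per expanded row and bind the parsed values on.
out <- cbind(
  quest[rep(seq_along(mats), sapply(mats, nrow)), "city", drop = FALSE],
  do.call(rbind, mats),
  stringsAsFactors = FALSE
)
rownames(out) <- NULL

# Convert each column to the most specific type that fits;
# as.is = TRUE keeps non-numeric columns as character instead of factor.
out[] <- lapply(out, type.convert, as.is = TRUE)
str(out)
```

With as.is = TRUE, city should stay character while rev, qty and zip become numeric or integer columns.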


Answer 2:

We can use tidyverse. With separate_rows, split 'key_value' at ; to expand the rows, then separate the column into two columns ('key' and 'value') at =. Expand the rows again at | with separate_rows, group by 'city' and 'key', add a sequence number with row_number(), and spread to 'wide' format.

library(tidyverse)
separate_rows(quest, key_value, sep=";") %>% 
     separate(key_value, into = c("key", "value"), sep="=") %>% 
     separate_rows(value, sep="[|]", convert = TRUE) %>% 
     group_by(city, key) %>% 
     mutate(rn = row_number()) %>% 
     spread(key, value) %>%
     select(-rn)
# A tibble: 7 x 4
# Groups:   city [3]
#      city   qty   rev   zip
#*   <fctr> <dbl> <dbl> <dbl>
#1  Atlanta     1  63.0 45987
#2  Atlanta     1  12.0 74268
#3 New York     1  10.6 12686
#4 New York     2  34.0 12694
#5    Tampa     1   3.0 33684
#6    Tampa     6  24.0 36842
#7    Tampa     3   8.0 30254
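
In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). As a sketch of a variant (not the answer's own code), the pipeline below tags each input row with an id, pivots the keys into columns first, and only then expands the "|"-separated values of all three columns in parallel. The id column name and the res variable are hypothetical, and the code assumes dplyr and tidyr (>= 1.0.0) are installed and that rev, qty and zip hold the same number of values in each record.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question.
quest <- data.frame(
  city = c("Atlanta", "New York", "Atlanta", "Tampa"),
  key_value = c("rev=63;qty=1;zip=45987",
                "rev=10.60|34;qty=1|2;zip=12686|12694",
                "rev=12;qty=1;zip=74268",
                "rev=3|24|8;qty=1|6|3;zip=33684|36842|30254"),
  stringsAsFactors = FALSE
)

res <- quest %>%
  mutate(id = row_number()) %>%                        # keep duplicate cities apart
  separate_rows(key_value, sep = ";") %>%              # one row per key=value piece
  separate(key_value, into = c("key", "value"), sep = "=") %>%
  pivot_wider(names_from = key, values_from = value) %>%
  # Expand the three columns together; convert = TRUE parses the numbers.
  separate_rows(rev, qty, zip, sep = "[|]", convert = TRUE) %>%
  select(-id)

res
```

Because the row id is carried through the pivot, the result keeps the order of the input rows rather than being sorted by city.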