-->

counting the occurrence of substrings in a column

2019-02-19 19:55发布

问题:

I would like to count the occurrences of a string in a column ....per group. In this case the string is often a substring in a character column.

I have some data e.g.

ID   String              village
1    fd_sec, ht_rm,      A
2    NA, ht_rm           A
3    fd_sec,             B
4    san, ht_rm,         C

The code that I began with is obviously incorrect, but I am failing on my search to find out I could use the grep function in a column and group by village

impacts <- se %>%  group_by(village) %>%
summarise(c_NA = round(sum(sub$en41_1 ==  "NA")),
          c_ht_rm = round(sum(sub$en41_1 ==  "ht_rm")),
          c_san = round(sum(sub$en41_1 ==  "san")),
          c_fd_sec = round(sum(sub$en41_1 ==  "fd_sec")))

Ideally my output would be:

village  fd_sec  NA  ht_rm  san
A        1       1   2 
B        1
C                    1      1

Thank you in advance

回答1:

You can also use cSplit() from my "splitstackshape" package. Since this package also loads "data.table", you can then just use dcast() to tabulate the result.

Example:

library(splitstackshape)
cSplit(mydf, "String", direction = "long")[, dcast(.SD, village ~ String)]
# Using 'village' as value column. Use 'value.var' to override
# Aggregate function missing, defaulting to 'length'
#    village fd_sec ht_rm san NA
# 1:       A      1     2   0  1
# 2:       B      1     0   0  0
# 3:       C      0     1   1  0


回答2:

We can do this with base R by splitting the 'String' column with 'village', then split the 'String' into substrings by splitting at , followed by zero or more spaces (\\s*), stack the list into a two column data.frame and get the frequency with table

table(stack(lapply(split(df1$String, df1$village), 
            function(x) unlist(strsplit(x, ",\\s*"))))[2:1])
#  values
#ind fd_sec ht_rm NA san   
#  A      1     2  1   0
#  B      1     0  0   0
#  C      0     1  0   1

Or using tidyverse, after grouping by 'village', reshape into 'long' format by splitting the 'String' using separate_rows, filter out the rows that have blank values in 'String', count the frequency and spread it to 'wide' format

library(dplyr)
library(tidyr)
df1 %>%
   group_by(village) %>% 
   separate_rows(String, sep=",\\s*") %>%
   filter(nzchar(String)) %>% 
   count(village, String) %>% 
   spread(String, n, fill = 0)
# A tibble: 3 x 5
# Groups: village [3]
#  village fd_sec ht_rm  `NA`   san
#* <chr>    <dbl> <dbl> <dbl> <dbl>
#1 A         1.00  2.00  1.00  0   
#2 B         1.00  0     0     0   
#3 C         0     1.00  0     1.00


标签: r grep summarize