Subsetting tidy data from a vector

Posted 2019-09-04 00:13

Question:

I'm using R to analyse data about antibiotic use from a number of hospitals.

I've imported this data into a data frame, following the tidy data principles.

>head(data)
        date   antibiotic  usage  hospital
1 2006-01-01   amikacin 0.000000 hospital1
2 2006-02-01   amikacin 0.000000 hospital1
3 2006-03-01   amikacin 0.000000 hospital1
4 2006-04-01   amikacin 0.000000 hospital1
5 2006-05-01   amikacin 0.937119 hospital1
6 2006-06-01   amikacin 1.002961 hospital1

(the data set is monthly data x 5 hospitals x 40 antibiotics)

The first thing I would like to do is aggregate the antibiotics into classes.

> head(distinct(select(data, antibiotic)), 8)
                antibiotic
1                 amikacin
2  amoxicillin-clavulanate
3              amoxycillin
4               ampicillin
5             azithromycin
6         benzylpenicillin
7                cefalotin
8                cefazolin

> penicillins <- c("amoxicillin-clavulanate", "amoxycillin", "ampicillin", "benzylpenicillin")
> ceph1 <- c("cefalotin", "cefazolin")

I would then like to subset the data based on these antibiotic class vectors:

filter(data, antibiotic = (any one of the values in the vector "penicillins"))

Thanks to thelatemail for pointing out the way to do this is:

d <- filter(data, antibiotic %in% penicillins)
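
As a minimal sketch of why `%in%` fits here: it tests each row's value for membership in the vector, whereas `==` compares position by position (recycling the shorter vector if needed). The `toy` data frame below is made up for illustration and is not the hospital data.

```r
# Toy data; the values are invented for illustration only.
toy <- data.frame(
  antibiotic = c("ampicillin", "amikacin", "amoxycillin", "cefazolin"),
  usage      = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)
penicillins <- c("amoxicillin-clavulanate", "amoxycillin",
                 "ampicillin", "benzylpenicillin")

# %in% asks, for each row, "does this value appear anywhere in the vector?"
keep <- toy$antibiotic %in% penicillins

# == instead compares element 1 with element 1, element 2 with element 2, ...
positional <- toy$antibiotic == penicillins

toy[keep, ]        # the ampicillin and amoxycillin rows
toy[positional, ]  # no rows: nothing matches at the same position
```

`filter(data, antibiotic %in% penicillins)` applies the same membership test to every row of the data frame.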

I would like to analyse the data in a number of ways.

The key analysis (and ggplot output) is:

x = date

y = usage of antibiotic(s) stratified by (drug | class), filtered by hospital

What I'm not clear on now is how to aggregate the data for this sort of thing.

Example:
I want to analyse the use of class "ceph1" across all the hospitals in the district, resulting in (apologies, I know this is not proper code):

   x         y
Jan-2006   for all in hospitals(usage of cefazolin + usage of cefalotin)
Feb-2006   for all in hospitals(usage of cefazolin + usage of cefalotin)
etc

In the long run, I would also like to be able to pass arguments to a function that lets me select which hospitals and which antibiotic or class of antibiotics to include.

Thanks again - I know this is an order of magnitude more complicated than the original question!

Answer 1:

So after lots of trial and error and heaps of reading, I've managed to sort it out.

>str(data)
'data.frame':   23360 obs. of  4 variables:
 $ date      : Date, format: "2007-09-01" "2012-06-01" ...
 $ antibiotic: Factor w/ 41 levels "amikacin","amoxicillin-clavulanate",..: 17 3 19 30 38 20 20 20 7 25 ...
 $ usage     : num  21.368 36.458 7.226 3.671 0.917 ...
 $ hospital  : Factor w/ 5 levels "hospital1","hospital2",..: 1 3 2 1 4 1 4 3 5 1 ...

So I can subset the data first:

>library(dplyr)
>penicillins <- c("amoxicillin-clavulanate", "amoxycillin", "ampicillin", "benzylpenicillin")
>d <- filter(data, antibiotic %in% penicillins) 

And then make the summary using more of dplyr (thanks, Hadley!):

>d1 <- summarise(group_by(d, date), total = sum(usage))
>d1    
Source: local data frame [122 x 2]

         date    total
       (date)    (dbl)
1  2006-01-01 1669.177
2  2006-02-01 1901.749
3  2006-03-01 2311.008
4  2006-04-01 1921.436
5  2006-05-01 1594.781
6  2006-06-01 2150.997
7  2006-07-01 2052.517
8  2006-08-01 2132.501
9  2006-09-01 1959.916
10 2006-10-01 1751.667
..        ...      ...
> library(ggplot2)
> qplot(date, total, data = d1) + geom_smooth()
[scatterplot as desired!]
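
Applied to the "ceph1" example from the question, the same filter-then-summarise steps might be chained with dplyr's `%>%` pipe, so that no intermediate data frame is needed. This is only a sketch: the `toy` data frame and its values are invented for illustration, and the column names are assumed to match those shown by `str(data)`.

```r
library(dplyr)

# Toy data: 2 months x 2 hospitals x 3 antibiotics, usage values invented.
toy <- data.frame(
  date       = as.Date(rep(c("2006-01-01", "2006-02-01"), each = 6)),
  antibiotic = rep(c("cefalotin", "cefazolin", "amikacin"), times = 4),
  usage      = 1:12,
  hospital   = rep(rep(c("hospital1", "hospital2"), each = 3), times = 2),
  stringsAsFactors = FALSE
)
ceph1 <- c("cefalotin", "cefazolin")

ceph1_total <- toy %>%
  filter(antibiotic %in% ceph1) %>%   # keep only the drugs in the class
  group_by(date) %>%                  # one group per month
  summarise(total = sum(usage))       # sum over all drugs and hospitals

ceph1_total
```

Because the grouping is by `date` only, each monthly total covers every hospital and both drugs in the class.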

The next step will be to build it all into a function and/or do the subsetting in-line, building on what I've worked out here.
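
One possible shape for such a function is sketched below. The name `usage_by_month` and its arguments `drugs` and `hospitals` are hypothetical, the `toy` data is invented, and the sketch assumes the column names `date`, `antibiotic`, `usage` and `hospital` from the data above.

```r
library(dplyr)

# Hypothetical helper (name and arguments are illustrative only).
# drugs:     character vector of antibiotics, e.g. a class vector such as ceph1
# hospitals: character vector of hospitals, or NULL to include all of them
usage_by_month <- function(data, drugs, hospitals = NULL) {
  d <- filter(data, antibiotic %in% drugs)
  if (!is.null(hospitals)) {
    d <- filter(d, hospital %in% hospitals)
  }
  # One row per month, summing usage over the selected drugs and hospitals
  summarise(group_by(d, date), total = sum(usage))
}

# Toy data: 2 months x 2 hospitals x 3 antibiotics, usage values invented.
toy <- data.frame(
  date       = as.Date(rep(c("2006-01-01", "2006-02-01"), each = 6)),
  antibiotic = rep(c("cefalotin", "cefazolin", "amikacin"), times = 4),
  usage      = 1:12,
  hospital   = rep(rep(c("hospital1", "hospital2"), each = 3), times = 2),
  stringsAsFactors = FALSE
)
ceph1 <- c("cefalotin", "cefazolin")

all_hosp <- usage_by_month(toy, ceph1)                           # every hospital
one_hosp <- usage_by_month(toy, ceph1, hospitals = "hospital1")  # one hospital
```

The result has one row per month, so it could be passed to the same `qplot()` call as `d1`.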



Tags: r dplyr