Subsetting tidy data from a vector

Posted 2019-09-04 00:13


I'm using R to analyse data about antibiotic use from a number of hospitals.

I've imported this data into a data frame, following tidy data principles.

        date   antibiotic  usage  hospital
1 2006-01-01   amikacin 0.000000 hospital1
2 2006-02-01   amikacin 0.000000 hospital1
3 2006-03-01   amikacin 0.000000 hospital1
4 2006-04-01   amikacin 0.000000 hospital1
5 2006-05-01   amikacin 0.937119 hospital1
6 2006-06-01   amikacin 1.002961 hospital1

(the data set is monthly data x 5 hospitals x 40 antibiotics)

The first thing I would like to do is aggregate the antibiotics into classes.

> head(distinct(select(data, antibiotic)), 8)
1                 amikacin
2  amoxicillin-clavulanate
3              amoxycillin
4               ampicillin
5             azithromycin
6         benzylpenicillin
7                cefalotin
8                cefazolin

> penicillins <- c("amoxicillin-clavulanate", "amoxycillin", "ampicillin", "benzylpenicillin")
> ceph1 <- c("cefalotin", "cefazolin")

What I would like to do is then subset the data based on these antibiotic class vectors:

filter(data, antibiotic == <any one of the values in the vector "penicillins">)

Thanks to thelatemail for pointing out the way to do this is:

d <- filter(data, antibiotic %in% penicillins)
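For anyone unfamiliar with `%in%`: it returns one TRUE/FALSE per element on its left-hand side, which is exactly what `filter()` needs. A minimal sketch on a made-up data frame (the name `toy` and its usage values are invented for illustration and are not from the real data):

```r
library(dplyr)

# Hypothetical toy data, using the same column names as the real data frame
toy <- data.frame(
  antibiotic = c("amikacin", "amoxycillin", "ampicillin", "cefazolin"),
  usage      = c(0.5, 12.0, 3.2, 7.1),
  stringsAsFactors = FALSE
)

penicillins <- c("amoxicillin-clavulanate", "amoxycillin", "ampicillin", "benzylpenicillin")

# %in% tests each antibiotic against the whole vector, one result per row
keep <- toy$antibiotic %in% penicillins

# filter() keeps only the rows where that test is TRUE
d_toy <- filter(toy, antibiotic %in% penicillins)
```

Note that `antibiotic == penicillins` is not equivalent: `==` compares element by element and recycles the shorter vector, so it would not test membership.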

I would like to analyse the data in a number of ways.

The key analysis (and ggplot output) is:

x = date

y = usage of antibiotic(s) stratified by (drug | class), filtered by hospital
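To stratify by class rather than by individual drug, one possible approach is to attach a `class` column through a lookup table instead of keeping separate vectors. This is only a sketch: `class_lookup`, `toy`, and the usage values are made up for illustration, and only the two classes defined above are included.

```r
library(dplyr)

penicillins <- c("amoxicillin-clavulanate", "amoxycillin", "ampicillin", "benzylpenicillin")
ceph1 <- c("cefalotin", "cefazolin")

# stack() turns a named list into a two-column data frame (values, ind)
class_lookup <- stack(list(penicillins = penicillins, ceph1 = ceph1))
names(class_lookup) <- c("antibiotic", "class")
class_lookup$class <- as.character(class_lookup$class)

# Hypothetical toy data with made-up usage values
toy <- data.frame(
  antibiotic = c("amikacin", "ampicillin", "cefazolin"),
  usage      = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Drugs without a class in the lookup (here amikacin) get NA
toy_with_class <- left_join(toy, class_lookup, by = "antibiotic")
```

With a `class` column in place, grouping or colouring by class works the same way as grouping by `antibiotic`.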

What I'm not clear on now is how to aggregate the data for this sort of thing.

I want to analyse the use of class "ceph1" across all the hospitals in the district, resulting in something like this (apologies, I know this is not proper code):

   x         y
Jan-2006   sum over all hospitals of (usage of cefazolin + usage of cefalotin)
Feb-2006   sum over all hospitals of (usage of cefazolin + usage of cefalotin)
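The table above can be sketched in dplyr as a filter on the class vector followed by a per-date sum. The toy data below (2 months x 2 hospitals x 3 drugs, with usage simply numbered 1 to 12) is invented so the example runs on its own; it is not the real data set.

```r
library(dplyr)

# Hypothetical toy data: 2 months x 3 drugs x 2 hospitals, made-up usage
toy <- data.frame(
  date       = as.Date(rep(c("2006-01-01", "2006-02-01"), times = 6)),
  antibiotic = rep(rep(c("cefalotin", "cefazolin", "amikacin"), each = 2), times = 2),
  hospital   = rep(c("hospital1", "hospital2"), each = 6),
  usage      = 1:12,
  stringsAsFactors = FALSE
)

ceph1 <- c("cefalotin", "cefazolin")

# Keep only the ceph1 drugs, then add up usage across drugs and hospitals per month
ceph1_by_month <- toy %>%
  filter(antibiotic %in% ceph1) %>%
  group_by(date) %>%
  summarise(total = sum(usage))
```

Because the grouping is by `date` only, the sum runs over every hospital and every drug that survived the filter, which is the "for all hospitals" total described above.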

And, in the long run, I'd like to be able to pass arguments to a function that lets me select which hospitals and which antibiotic or class of antibiotics to include.

Thanks again - I know this is an order of magnitude more complicated than the original question!


So after lots of trial and error and heaps of reading, I've managed to sort it out. For reference, this is the structure of the data (output of str(data)):

'data.frame':   23360 obs. of  4 variables:
 $ date      : Date, format: "2007-09-01" "2012-06-01" ...
 $ antibiotic: Factor w/ 41 levels "amikacin","amoxicillin-clavulanate",..: 17 3 19 30 38 20 20 20 7 25 ...
 $ usage     : num  21.368 36.458 7.226 3.671 0.917 ...
 $ hospital  : Factor w/ 5 levels "hospital1","hospital2",..: 1 3 2 1 4 1 4 3 5 1 ...

So I can subset the data first:

> penicillins <- c("amoxicillin-clavulanate", "amoxycillin", "ampicillin", "benzylpenicillin")
> d <- filter(data, antibiotic %in% penicillins)

And then make the summary using more of dplyr (thanks, Hadley!)

> d1 <- summarise(group_by(d, date), total = sum(usage))
> d1
Source: local data frame [122 x 2]

         date    total
       (date)    (dbl)
1  2006-01-01 1669.177
2  2006-02-01 1901.749
3  2006-03-01 2311.008
4  2006-04-01 1921.436
5  2006-05-01 1594.781
6  2006-06-01 2150.997
7  2006-07-01 2052.517
8  2006-08-01 2132.501
9  2006-09-01 1959.916
10 2006-10-01 1751.667
..        ...      ...
> qplot(date, total, data = d1) + geom_smooth()
[scatterplot as desired!]
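As a side note, `qplot()` has been deprecated in recent ggplot2 releases (3.4.0 onwards), so here is a rough equivalent using `ggplot()`. The `d1_toy` data frame is a made-up stand-in for `d1` so that the example is self-contained.

```r
library(ggplot2)

# Hypothetical stand-in for d1: 24 months of made-up totals
d1_toy <- data.frame(
  date  = seq(as.Date("2006-01-01"), by = "month", length.out = 24),
  total = 1500 + 10 * (1:24)
)

# Scatterplot of monthly totals with the default smoother on top
p <- ggplot(d1_toy, aes(x = date, y = total)) +
  geom_point() +
  geom_smooth()
```

Printing `p` (or calling `print(p)` inside a script) draws the plot.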

Next step will be to try and build it all into a function and/or to try and do the subsetting in-line, building on what I've worked out here.
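A minimal sketch of what such a function might look like, assuming the same column names as above. The function name `usage_by_month`, its arguments, and the toy data are all hypothetical; this is one possible shape, not a tested solution for the real data set.

```r
library(dplyr)

# Hypothetical helper: total monthly usage for chosen drugs and hospitals.
# `drugs` and `hospitals` are character vectors; by default all hospitals are used.
usage_by_month <- function(data, drugs, hospitals = unique(data$hospital)) {
  data %>%
    filter(antibiotic %in% drugs, hospital %in% hospitals) %>%
    group_by(date) %>%
    summarise(total = sum(usage))
}

# Made-up toy data: 2 months x 3 drugs x 2 hospitals
toy <- data.frame(
  date       = as.Date(rep(c("2006-01-01", "2006-02-01"), times = 6)),
  antibiotic = rep(rep(c("cefalotin", "cefazolin", "amikacin"), each = 2), times = 2),
  hospital   = rep(c("hospital1", "hospital2"), each = 6),
  usage      = 1:12,
  stringsAsFactors = FALSE
)

ceph1 <- c("cefalotin", "cefazolin")

all_hospitals <- usage_by_month(toy, ceph1)               # whole district
one_hospital  <- usage_by_month(toy, ceph1, "hospital1")  # a single hospital
```

Passing a class vector such as `ceph1` or a single drug name as `drugs` covers both the "drug" and "class" cases, and the result can go straight into a plot.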

Tags: r dplyr