For a project I have received a large amount of confidential patient-level data that I need to fit distributions to for use in a simulation model. I am using R.
The problem is that I need to fit distributions to get the shape/rate parameters for at least 288 separate distributions (at least 48 subsets of 6 variables). The process will vary slightly between variables (depending on how each variable is distributed), but I want to be able to set up a function or loop for each variable and generate the shape and rate parameters for each subset I define.
An example of this: I need to find length-of-stay data for subsets of patients. There are 48 subsets of patients. The way I have currently been doing this is by manually filtering the data, extracting the values to a vector, and then fitting a distribution to that vector using fitdist.
i.e. for a variable that is gamma-distributed:

vector1 <- los_data %>%
  filter(group == 1, setting == 1, diagnosis == 1) %>%
  pull(los)  # pull() gives the numeric vector fitdist() needs; 'los' is the length-of-stay column
fitdist(vector1, "gamma")
I am quite new to data science and data processing, and I know there must be a simpler way to do this than by hand! I'm assuming something to do with a matrix, but I am absolutely clueless about how best to proceed.
One common practice is to split the data using split and then apply the function of interest to each group. Let's assume here we have four columns: group, setting, diagnosis, and stay.length, where the first three each have two levels. Perform split and you will get a list with one element per combination of the grouping factors. Afterwards, we can use lapply to apply whatever function we like to each element of the list. For example, we can apply mean.
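As a minimal sketch of that split/lapply pattern on simulated data (the column names follow the description above; the values are made up):

```r
# Simulated example data: three two-level grouping factors plus a stay length
set.seed(1)
df <- data.frame(
  group       = rep(1:2, each = 40),
  setting     = rep(1:2, times = 40),
  diagnosis   = sample(1:2, 80, replace = TRUE),
  stay.length = rgamma(80, shape = 2, rate = 0.5)
)

# split() the stay lengths by every combination of the three factors,
# giving a list with 2 * 2 * 2 = 8 elements
splitted <- split(df$stay.length, list(df$group, df$setting, df$diagnosis))

# lapply() then applies a function to each group -- here, mean()
group_means <- lapply(splitted, mean)
```

The list elements are named after the factor combinations (e.g. "1.1.1"), so you can see which subset each result belongs to.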
In your case, you can apply fitdist or any other function.

OK, your example isn't quite reproducible here, but I think the answer you want will look something like the following:
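The original code block is not shown here; a sketch of the likely approach, assuming the dplyr/tidyr/purrr and fitdistrplus packages and a length-of-stay column named los (the simulated los_data below is a stand-in for your real data):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(fitdistrplus)

# Simulated stand-in for los_data; 'los' as the column name is an assumption
set.seed(42)
los_data <- expand.grid(group = 1:2, setting = 1:2, diagnosis = 1:2) %>%
  slice(rep(1:n(), each = 30)) %>%
  mutate(los = rgamma(n(), shape = 2, rate = 0.5))

# One row per subset: the raw data and its fitted gamma live in list-columns
fits <- los_data %>%
  group_by(group, setting, diagnosis) %>%
  nest() %>%
  mutate(fit = map(data, ~ fitdist(.x$los, "gamma")))
```

Each element of the fit column is an ordinary fitdist object, so the shape and rate estimates for every subset are all computed in one pipeline instead of by hand.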
This will give you a data frame of all the fits, with columns for group, setting, and diagnosis, as well as a list-column containing the fit for each subset. Since it is a list-column, you will need to use double brackets to extract individual fits. Example:
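For instance, assuming the result described above is stored in a data frame called fits with a list-column named fit (the one-row frame built here is just a self-contained stand-in):

```r
library(fitdistrplus)

# Minimal stand-in: a one-row frame whose list-column holds a fitdist object
set.seed(1)
fits <- data.frame(group = 1)
fits$fit <- list(fitdist(rgamma(100, shape = 2, rate = 0.5), "gamma"))

# Single brackets would return a sub-list; double brackets return the object itself
first_fit <- fits$fit[[1]]
first_fit$estimate  # named vector holding the fitted shape and rate
```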