Splitting data and fitting distributions efficient

For a project I have received a large amount of confidential patient level data that I need to fit a distribution to so as to use it in a simulation model. I am using R.

The problem is that I need is to fit the distribution to get the shape/rate data for at least 288 separate distributions (at least 48 subsets of 6 variables). The process will vary slightly between variables (depending on how that variable is distributed) but I want to be able to set up a function or loop for each variable and generate the shape and rate data for each subset I define.

An example of this: I need to find length of stay data for subsets of patients. There are 48 subsets of patients. The way I have currently been doing this is by manually filtering the data and then extracting those to vectors, and then fitting the data to the vector using fitdist.

i.e. For a variable that is gamma distributed:

vector1 <- los_data %>%
filter(group == 1, setting == 1, diagnosis == 1)

fitdist(vector1, "gamma")

I am quite new to data science and data processing, and I know there must be a simpler way to do this than by hand! I'm assuming something to do with a matrix, but I am absolutely clueless about how best to proceed.

标签： r subset simulation distribution purrr

2条回答

淡お忘

2楼-- · 2019-07-14 12:24

One common practice is to split the data using split and then apply the function of interest on that group. Let's assume here we have four columns, group, settings, diagnosis and stay.length. The first three have two levels.

df <- data.frame(
  group = sample(1:2, 64, TRUE),
  setting  = sample(1:2, 64, TRUE),
  diagnosis  = sample(1:2, 64, TRUE), 
  stay.length = sample(1:5, 64, TRUE)
)
> head(df)
    group setting diagnosis var
1     1       1         1   4
2     1       1         2   5
3     1       1         2   4
4     2       1         2   3
5     1       2         2   3
6     1       1         2   5

Perform split and you will get a splitted List :

dfl <- split(df$stay.length, list(df$group, df$setting, df$diagnosis))

> head(dfl)
$`1.1.1`
[1] 5 3 4 1 4 5 4 2 1

$`2.1.1`
[1] 5 4 5 4 3 1 5 3 1

$`1.2.1`
[1] 4 2 5 4 5 3 5 3

$`2.2.1`
[1] 2 1 4 3 5 4 4

$`1.1.2`
[1] 5 4 4 4 3 2 4 4 5 1 5 5

$`2.1.2`
[1] 5 4 4 5 3 2 4 5 1 2

Afterwards, we can use lapply to perform whatever function on each group in the list. For example we can apply mean

dflm <- lapply(dfl, mean)
> dflm
$`1.1.1`
[1] 3.222222

.
.
.
.

$`2.2.2`
[1] 2.8

In your case, you can apply fitdist or any other function.

dfl.fitdist <- lapply(dfl, function(x) fitdist(x, "gamma"))

> dfl
$`1.1.1`
Fitting of the distribution ' gamma ' by maximum likelihood 
Parameters:
  estimate Std. Error
shape  3.38170  2.2831073
rate   1.04056  0.7573495

.
.
.


$`2.2.2`
Fitting of the distribution ' gamma ' by maximum likelihood 
Parameters:
  estimate Std. Error
shape 4.868843  2.5184018
rate  1.549188  0.8441106

0人赞添加讨论(0) 举报

家丑人穷心不美

3楼-- · 2019-07-14 12:28

OK, your example isn't quite reproducible here, but I think the answer you want will something like the following:

result <- los_data %>%
group_by(group, setting, diagnosis) %>%
do({
  fit <- fitdist(.$my_column, "gamma")
  data_frame(group=.$group[1], setting=.$setting[1], diagnosis=.$diagnosis[1], fit = list(fit))
}) %>%
ungroup()

This will give you a data frame of all the fits, with columns for group, setting, diagnosis as well as a list-column which contains the fits for each one. Since it is a list column, you will need to use double brackets to extract individual fits. Example:

# Get the fit in the first row
result$fit[[1]]

0人赞添加讨论(0) 举报

Splitting data and fitting distributions efficient

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间