Broom/Dplyr error with glance() when using lm inst

Posted 2020-08-26 04:05

Question:

I am using dplyr and broom to fit linear regressions for multiple sensors. The glance() function from broom fails when I use lm() inside the do() statement, but works if I use biglm(). This wouldn't be an issue, except that I would like the R^2, F-statistic and p-value that glance() returns quite beautifully for a traditional lm().

I've looked elsewhere and cannot find a similar case with this error:

Error in data.frame(r.squared = r.squared, adj.r.squared = adj.r.squared,  : 
 object 'fstatistic' not found

Possible hunches:

?Anova 
"The comparison between two or more models will only be valid if they are 
fitted to the same dataset. This may be a problem if there are missing
values and R's default of na.action = na.omit is used."

Here is the code:

library(tidyr)
library(broom)
library(biglm) # if not install.packages("biglm")
library(dplyr)
regressionBig <- tidied_rm_outliers %>%
  group_by(sensor_name, Lot.Tool, Lot.Module, Recipe, Step, Stage, MEAS_TYPE) %>%
  do(fit = biglm(MEAS_AVG ~ value, data = .)) # note: biglm is used

regressionBig 

# extract the r^2 from the list column of fitted models we just stored

glances <- regressionBig %>% glance(fit)
glances %>% 
  ungroup() %>%
  arrange(desc(r.squared))
# biglm works, but if I try the same thing with regular lm(), glance() errors

ErrorDf <- tidied_rm_outliers %>%
  group_by(sensor_name, Lot.Tool, Lot.Module, Recipe, Step, Stage, MEAS_TYPE) %>% 
  do(fit = lm(MEAS_AVG ~ value, data = .)) # note: regular lm is used
ErrorDf %>% glance(fit)

#Error in data.frame(r.squared = r.squared, adj.r.squared = adj.r.squared,  : 
#object 'fstatistic' not found

I hate to upload the entire data frame, as I know that is usually not acceptable on Stack Overflow, but I am not sure I can create a reproducible example without doing so: https://www.dropbox.com/s/pt6xe4jdxj743ka/testdf.Rda?dl=0

R session info on pastebin if you would like it here!

Answer 1:

It looks like there is a bad model in ErrorDf. I diagnosed it by running a for loop, which prints each index and stops at the first model that glance() cannot handle.

for (i in seq_len(nrow(ErrorDf))) {
  print(i)                  # the last index printed is the failing model
  glance(ErrorDf$fit[[i]])
}

It looks like no coefficient for value could be estimated for model #94. I haven't done any further investigation, but it raises the interesting question of how broom should handle that case.
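
As a rough illustration of that diagnosis, the sketch below uses base R only and made-up toy data (the names toy, grp, y, fits, bad and no_f are hypothetical, not from the question). It assumes the failure comes from a group whose predictor is constant, in which case lm() reports an NA slope and summary() returns no fstatistic component.

```r
# Hypothetical toy data: group "b" has a constant predictor,
# so no slope for `value` can be estimated there.
toy <- data.frame(
  grp   = rep(c("a", "b"), each = 4),
  value = c(1, 2, 3, 4, 5, 5, 5, 5),
  y     = c(2.1, 3.9, 6.2, 8.1, 1.0, 2.0, 3.0, 4.0)
)

# Fit one lm() per group, much as do() does in the question
fits <- lapply(split(toy, toy$grp), function(d) lm(y ~ value, data = d))

# Flag models with at least one coefficient that could not be estimated
bad <- vapply(fits, function(m) anyNA(coef(m)), logical(1))

# Flag models whose summary carries no F-statistic
no_f <- vapply(fits, function(m) is.null(summary(m)$fstatistic), logical(1))

names(fits)[bad]
```

The same two checks could be run over ErrorDf$fit to list every problematic group at once instead of stopping at the first error.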



Answer 2:

I came across this post after encountering the same issue. If lm() is failing because some groupings have too few cases, you can resolve the issue by pre-filtering the data to remove those groupings before running the do() step. The generic code below shows how one might filter out groups with fewer than 30 data points.

library(dplyr)
library(broom)

data_grp = ( data 
    %>% group_by(factor_a, factor_b)
    %>% mutate(grp_cnt=n())      # number of rows in each group
    %>% filter(grp_cnt>=30)      # keep groups with at least 30 data points
)
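
Group size alone may not catch every failing group: a group can have plenty of rows and still have a constant predictor, which would leave the slope inestimable. A minimal sketch of filtering on both conditions, assuming made-up toy data and an illustrative threshold of 3 rows (toy, grp, y and usable are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data: "b" has enough rows but a constant predictor,
# "c" has too few rows.
toy <- tibble(
  grp   = rep(c("a", "b", "c"), times = c(4, 4, 1)),
  value = c(1, 2, 3, 4, 5, 5, 5, 5, 7),
  y     = c(2.1, 3.9, 6.2, 8.1, 1, 2, 3, 4, 9)
)

usable <- toy %>%
  group_by(grp) %>%
  # keep groups with enough rows AND more than one distinct predictor value
  filter(n() >= 3, n_distinct(value) > 1) %>%
  ungroup()

unique(usable$grp)
```

Only group "a" survives both conditions here; the thresholds would need to be adapted to the real data.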


Answer 3:

I wrote a function to deal with this after finding this post while troubleshooting. The package maintainers probably have (or will have) a more clever solution, but I think it should work for most cases. Thanks to @Benjamin for the loop inspiration.

library(dplyr)
library(broom)

collect_glance <- function(mdldF) {
    # mdldF should be a data frame from dplyr/broom with the column 'mdl' holding the model objects
    mdlglance <- tibble() # initialize empty data frame
    # data frame with only the group info (everything except the model column)
    metaAll <- mdldF %>% ungroup() %>% select(-mdl)
    for (i in seq_len(nrow(mdldF))) {
        metadF <- metaAll[i, ] # metadata for this group
        # attempt glance(). if successful, bind to metadata. if not, keep only the metadata row
        gtmp <- tryCatch(
            bind_cols(metadF, glance(mdldF$mdl[[i]])),
            error = function(e) metadF
        )
        # bind_rows() fills the glance columns with NA for groups where glance() failed
        mdlglance <- bind_rows(mdlglance, gtmp)
    }
    return(mdlglance)
}


Tags: r dplyr