Filtering na or missing value rows(observations) o

Posted 2019-08-26 09:33

Question:

(EDIT: totally refined question)

Using the mitools and survey packages and following Anthony Damico's code, I have been working with the Survey of Consumer Finances dataset for several days. The original list of datasets is "scf_imp", and the survey design built on the imputation list is "scf_design". The problem is the following:

The 5 multiply imputed data frames have different values in some columns, so if I subset the sample on such a column ("houses" in my case), data frames with a missing value in the "houses" column behave differently from the other data frames.

What I tried:

  1. Subsetting the whole list by the criteria (houses > 0 & income > 0) and including all=TRUE, as advised in the last line here (http://r-survey.r-forge.r-project.org/survey/svymi.html), to keep only those observations that are in the subset for all imputations.

    scf_design_owner <- subset(scf_design, houses > 0 & income > 0, all=TRUE)

or

  2. Dealing with the NA values even before creating the imputation list (here by replacing them with 0), as follows:

    lapply(scf_imp, function(x){replace_na(x,list(houses=0, income=0))})

I also tried filter, but some of these things do not work on an imputation list.

After those attempts, I still get this message: Warning message: In subset.svyimputationList(scf_design, houses > 0 & income > 0, : subset differed between imputations

I am completely stuck; I have spent more than three days on this. In short, my plan is to filter the imputation list by "houses > 0 and income > 0" (both are column names in the list) and use only the observations (rows) that satisfy the condition in all five imputation data frames.
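The goal described above (keep a row only if it meets the condition in every imputation) can be sketched in base R on toy data. This is only an illustration under assumptions: `toy_imp` is a made-up stand-in for `scf_imp`, and the data frames are assumed to have identical row order across imputations.

```r
# Toy stand-in for scf_imp: imputed data frames with identical row order.
# All values are made up for illustration.
toy_imp <- list(
  data.frame(id = 1:5, houses = c(1, 0, 2, NA, 1), income = c(10, 5, 0, 7, 3)),
  data.frame(id = 1:5, houses = c(1, 1, 2, 1, 1), income = c(10, 5, 4, 7, 3)),
  data.frame(id = 1:5, houses = c(1, 0, 2, 1, 1), income = c(10, 5, 4, 0, 3))
)

# One logical column per imputation; an NA comparison counts as "not in the subset".
in_subset <- sapply(toy_imp, function(df) {
  keep <- df$houses > 0 & df$income > 0
  keep & !is.na(keep)
})

# Keep a row only if it qualifies in every imputation.
keep_all <- apply(in_subset, 1, all)

# Apply the same row filter to each imputed data frame.
toy_owner <- lapply(toy_imp, function(df) df[keep_all, , drop = FALSE])

sapply(toy_owner, nrow)
```

If this were applied to the real data before building the design, the replicate-weight table would presumably need the same row filter, assuming its rows are aligned with the imputed data frames.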


I’m just a beginner with R, so please bear with me. I am stuck using the SCF datasets for simple statistical analysis. I have to trim the data so that the sample only includes observations with positive values of houses and income.

First, I tried to do it by adding an additional column to the list of data frames, as Anthony Damico describes under Variable Recoding (http://asdfree.com/survey-of-consumer-finances-scf.html). I wasn’t able to do it that way, so I decided to restrict the whole list of data frames (scf_design) with the condition, as follows:

Here is my R code (up to subset):

setwd( "D:/Dropbox/Data/SCF 2016" )
library(mitools)    # allows analysis of multiply-imputed survey data
library(survey)     # load survey package (analyzes complex design surveys)
library(downloader) # downloads and then runs the source() function on scripts from github
library(foreign)    # load foreign package (converts data files into R)
library(Hmisc)      # load Hmisc package (loads a simple wtd.quantile function)

scf_imp <- readRDS("scf 2016.rds" )
scf_rw <- readRDS("scf 2016 rw.rds" )

scf_design <- svrepdesign( 

     # use the main weight within each of the imp# objects
     weights = ~wgt , 

     # use the 999 replicate weights stored in the separate replicate weights file, -1 drops first id column
     repweights = scf_rw[ , -1 ] , 

     # read the data directly from the scf data, list of all five imputation data frames
     data = imputationList( scf_imp ) , 

     scale = 1 ,

     rscales = rep( 1 / 998 , 999 ) ,

     # use the mean of the replicate statistics as the center
     # when calculating the variance, as opposed to the main weight's statistic
     mse = TRUE ,

     type = "other" ,

     combined.weights = TRUE
 )

 scf_design_owner <- subset(scf_design, houses > 0 & income > 0)  

If you are short on time, please just look at the last line:

scf_design_owner <- subset(scf_design, houses > 0 & income > 0)

It seemed to work at first (when I used only one criterion). However, it shows the following warning.

Warning message:
In subset.svyimputationList(scf_design, houses > 0 & income > 0) :
subset differed between imputations

The problem is that the number of observations in each imputation data frame seems to be different. (There are five imputation data frames created from the SCF, which uses a multiple imputation technique, so the design object holds a list of five data frames.)

> lodown:::scf_MIcombine( with( scf_design_owner , svyby( ~ one , ~ one , 
unwtd.count ) ) )
Multiple imputation results:
  with(scf_design_owner, svyby(~one, ~one, unwtd.count))
  lodown:::scf_MIcombine(with(scf_design_owner, svyby(~one, ~one, unwtd.count)))
  results        se
1  4131.6 0.9797959

The original number of observations was 6248. It has certainly decreased, but now it has decimals. I suspect this is due to a different number of observations in each imputation data frame.
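The suspicion above is consistent with how multiple-imputation combining rules work: the pooled point estimate is the average of the per-imputation estimates, so a count that differs between imputations will generally come out fractional. A minimal sketch, using hypothetical counts (not the real SCF numbers):

```r
# Hypothetical unweighted counts, one per imputation (made up for illustration).
n_per_imputation <- c(7, 8, 8, 7, 8)

# The pooled estimate is the mean across imputations,
# which need not be a whole number.
pooled_count <- mean(n_per_imputation)
pooled_count
```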

I am stuck here. So long story short, here are my questions.

  1. Is there a way to subset the data frames correctly, so that all the revised imputation data frames have the same number of observations?

  2. If my method is not efficient, how can I do it in the “Variable Recoding” part instead (which was my original attempt)? I was able to add an additional variable for houses, since there is a variable hhouses in the SCF macros, a logical variable identifying home owners. But I couldn’t find a similar variable for income, so I gave up there. (Income in the SCF starts from 0, so there are observations at exactly 0.)

    What I mean by the variable recoding part is what Anthony Damico has written, as below:

Example:

scf_design <- 
    update( 
    scf_design , 
    hhsex = factor( hhsex , labels = c( "male" , "female" ) ) ,
    married = as.numeric( married == 1 ) ,
    edcl = 
        factor( 
            edcl , 
            labels = 
                c( 
                    "less than high school" , 
                    "high school or GED" , 
                    "some college" , 
                    "college degree" 
                ) 
        )

)
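For the income variable mentioned in question 2, one possible approach is to derive the flag directly from the income column instead of looking for a ready-made SCF macro variable. The sketch below adds the columns to each imputed data frame before the design is built. `toy_imp`, `pos_income`, and `owner_pos_income` are hypothetical names used for illustration only.

```r
# Toy stand-in for scf_imp; values are made up for illustration.
toy_imp <- list(
  data.frame(houses = c(1, 0, 2), income = c(10, 5, 0)),
  data.frame(houses = c(1, 1, 2), income = c(10, 5, 4))
)

# Add the same derived columns to every imputed data frame.
toy_imp <- lapply(toy_imp, function(df) {
  # 1 if income is observed and strictly positive, else 0
  df$pos_income <- as.numeric(!is.na(df$income) & df$income > 0)
  # 1 if the household owns a house and has positive income, else 0
  df$owner_pos_income <- as.numeric(
    !is.na(df$houses) & df$houses > 0 & df$pos_income == 1
  )
  df
})

toy_imp[[1]]$owner_pos_income
```

The same expression, for example `pos_income = as.numeric( income > 0 )`, could presumably be placed inside the `update()` call shown in the example above. That has not been checked against the real SCF data here.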

(addition)

I found this on the page linked above, and it solved the problem: "If the subset differs between the multiple imputations the default is to take the observations that are in the subset for any imputations, with a warning."

    d3 <- subset(des, HAB1MI > 3)
    Warning message: In subset.svyimputationList(des, HAB1MI > 3) :
    subset differed between imputations

"To keep only those observations in the subset for all imputations use the all=TRUE argument to subset."