Question:
As you would expect from a DSL aimed at data analysis, R handles missing/incomplete data very well, for instance:
Many R functions have an na.rm flag that, when set to TRUE, removes the NAs before computing:
> v = mean(c(5, NA, 6, 12, NA, 87, 9, NA, 43, 67), na.rm=TRUE)
> v
[1] 32.71429
# i.e., the mean of (5, 6, 12, 87, 9, 43, 67)
But if you want to deal with NAs before the function call, you need to do something like this:
to remove each 'NA' from a vector:
vx = vx[!is.na(vx)]
to replace each 'NA' in a vector with a '0':
vx = ifelse(is.na(vx), 0, vx)
to remove every row that contains an 'NA' from a data frame:
dfx = dfx[complete.cases(dfx),]
All of these permanently remove 'NA's, or rows with an 'NA' in them, once the result is assigned back to the original name.
Sometimes this isn't quite what you want, though. Making an 'NA'-excised copy of the data frame might be necessary for the next step in the workflow, but in subsequent steps you often want those rows back (e.g., to calculate a column-wise statistic for a column that has no 'NA' values itself but lost rows to a prior call to complete.cases()).
To be as clear as possible about what I'm looking for: Python/NumPy has a class, masked array, with a mask method, which lets you conceal (but not remove) NAs during a function call. Is there an analogous function in R?
Answer 1:
Exactly what to do with missing data, which may be flagged as NA if we know it is missing, may well differ from domain to domain.
Take an example related to time series, where you may want to skip, or fill, or interpolate, or interpolate differently: the (very useful and popular) zoo package has all of these functions related to NA handling:
zoo::na.approx zoo::na.locf
zoo::na.spline zoo::na.trim
These let you approximate (using different algorithms), carry the last observation forward or backward, use spline interpolation, or trim leading and trailing NAs.
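To make the four zoo functions above concrete, here is a minimal sketch on a plain numeric vector. It assumes the zoo package is installed (the guard simply skips the demo otherwise), and the series `x` is made up for illustration.

```r
# Assumes the zoo package is installed; the guard skips the demo otherwise.
if (requireNamespace("zoo", quietly = TRUE)) {
  x <- c(NA, 1, NA, 3, NA, NA, 6, NA)   # made-up series with gaps

  # Drop leading and trailing NAs only; interior gaps are kept.
  trimmed <- zoo::na.trim(x)             # 1 NA 3 NA NA 6

  # Last observation carried forward, and next observation carried backward.
  locf <- zoo::na.locf(trimmed)                    # 1 1 3 3 3 6
  nocb <- zoo::na.locf(trimmed, fromLast = TRUE)   # 1 3 3 6 6 6

  # Linear interpolation across the interior gaps.
  approx_ <- zoo::na.approx(trimmed)     # 1 2 3 4 5 6

  # Cubic spline interpolation across the interior gaps.
  spline_ <- zoo::na.spline(trimmed)

  print(rbind(trimmed, locf, nocb, approx_, spline_))
}
```

Each call returns a filled copy, so `x` keeps its NAs and you can compare strategies side by side.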
Another example would be the numerous missing-data imputation packages on CRAN, often providing domain-specific solutions. [ So if you call R a DSL, what is this? "Sub-domain specific solutions for domain specific languages" or SDSSFDSL? Quite a mouthful :) ]
But for your specific question: no, I am not aware of a bit-level flag in base R that allows you to mark observations as 'to be excluded'. I presume most R users would resort to functions like na.omit() et al. or use the na.rm=TRUE option you mentioned.
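Short of a real masked-array type, one common workaround is to keep the data untouched and carry a logical mask alongside it. The sketch below is only an illustration of that idea in base R, not an equivalent of NumPy's masked arrays; the data frame `dfx` and the variable names are made up.

```r
dfx <- data.frame(a = c(1, NA, 3, 4), b = c(10, 20, 60, NA))

# Logical mask: TRUE for rows with no NA anywhere. dfx itself is not modified.
keep <- complete.cases(dfx)

# Apply the mask only for the step that needs complete rows.
complete_rows <- dfx[keep, ]

# Later steps still see every row: the mean of b uses row 2 as well,
# even though row 2 has an NA in column a.
mean_b_all      <- mean(dfx$b, na.rm = TRUE)   # (10 + 20 + 60) / 3 = 30
mean_b_complete <- mean(complete_rows$b)       # (10 + 60) / 2 = 35

# na.omit() also returns a copy, and records which rows it dropped.
dropped <- as.integer(attr(na.omit(dfx), "na.action"))   # rows 2 and 4
```

The point is simply that nothing is lost as long as the filtered result goes into a new object instead of overwriting `dfx`.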
Answer 2:
It's good practice to look at the data and, from that, infer the type of missing values: is it MCAR (missing completely at random), MAR (missing at random) or MNAR (missing not at random)? Based on these three types, you can study the underlying structure of the missing values and conclude whether imputation is applicable at all (you're lucky if it's not MNAR, because under MNAR the missing values are considered non-ignorable and are related to some unknown underlying influence, factor, process, variable... whatever).
Chapter 3 in "Interactive and Dynamic Graphics for Data Analysis: With R and GGobi" by Di Cook and Deborah Swayne is a great reference on this topic.
You'll see the norm package in action in this chapter, but the Hmisc package also has data imputation routines. See also Amelia, cat (for imputation of categorical missing values), mi, mitools, VIM, vmv (for missing data visualisation).
Honestly, I still don't quite understand whether your question is about statistics or about R's missing-data imputation capabilities. I reckon I've provided good references on the second one; as for the first: you can replace your NAs with a measure of central tendency (mean, median, or similar), which reduces the variability; or with a random value "pulled out" of the observed (recorded) cases; or you can fit a regression with the variable that contains NAs as the criterion and the other variables as predictors, then fill the NAs with the model's predictions... it's an elegant way to deal with NAs, but quite often it will not go easy on your CPU (I have a 1.1GHz Celeron, so I have to be gentle).
This is an optimization problem... there's no definite answer; you should decide which method you're sticking with and why. But it's always good practice to look at the data! =)
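The three strategies just listed can be sketched in base R as follows. This is a toy illustration under stated assumptions, not a recommendation: the data frame `d` is made up, the regression variant fills NAs with predicted values from a simple `lm()` fit, and `set.seed()` is there only to make the random draw reproducible.

```r
set.seed(1)  # only so the random draw below is reproducible

d <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                y = c(2.1, NA, 6.2, 7.9, NA, 12.1))  # made-up data
miss <- is.na(d$y)

# 1. Central tendency: replace NAs with the mean of the observed values.
#    This shrinks the variance of the filled column.
y_mean <- ifelse(miss, mean(d$y, na.rm = TRUE), d$y)

# 2. Random draw: replace each NA with a value sampled from the observed cases.
y_draw <- d$y
y_draw[miss] <- sample(d$y[!miss], sum(miss), replace = TRUE)

# 3. Regression: model y on the other variable(s) using the complete rows,
#    then fill the NAs with the model's predictions.
fit <- lm(y ~ x, data = d, na.action = na.omit)
y_reg <- d$y
y_reg[miss] <- predict(fit, newdata = d[miss, , drop = FALSE])

print(cbind(d, y_mean, y_draw, y_reg))
```

Comparing `var(y_mean)` with `var(d$y, na.rm = TRUE)` shows the variance reduction that mean imputation causes. The dedicated packages mentioned above do this far more carefully.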
Be sure to check Cook & Swayne - it's an excellent, skilfully written guide. "Linear Models with R" by Faraway also contains a chapter about missing values.
So there.
Good luck! =)
Answer 3:
The function na.exclude() sounds like what you want, although it's only an option for some (important) functions.
In the context of fitting and working with models, R has a family of generic functions for dealing with NAs: na.fail(), na.pass(), na.omit(), and na.exclude(). These are, in turn, arguments for some of R's key modeling functions, such as lm(), glm(), and nls(), as well as functions in the MASS, rpart, and survival packages.
All four generic functions basically act as filters. na.fail() will only pass the data through if there are no NAs, otherwise it fails. na.pass() passes all cases through. na.omit() and na.exclude() will both leave out cases with NAs and pass the other cases through. But na.exclude() has a different attribute that tells functions processing the resulting object to take into account the NAs. You could see this attribute if you did attributes(na.exclude(some_data_frame)). Here's a demonstration of how na.exclude() alters the behavior of predict() in the context of a linear model.
fakedata <- data.frame(x = c(1, 2, 3, 4), y = c(0, 10, NA, 40))
## We can tell the modeling function how to handle the NAs
r_omitted <- lm(x~y, na.action="na.omit", data=fakedata)
r_excluded <- lm(x~y, na.action="na.exclude", data=fakedata)
predict(r_omitted)
# 1 2 4
# 1.115385 1.846154 4.038462
predict(r_excluded)
# 1 2 3 4
# 1.115385 1.846154 NA 4.038462
Your default na.action, by the way, is determined by options("na.action") and begins as na.omit() but you can set it.
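As a small follow-up sketch, reusing the same fakedata, here is how you might inspect the attribute that distinguishes the two filters, see its effect on residuals(), and change the session default. The object names are just for illustration.

```r
fakedata <- data.frame(x = c(1, 2, 3, 4), y = c(0, 10, NA, 40))

# Both filters drop row 3, but they tag the result with a different class.
omit_class    <- class(attr(na.omit(fakedata), "na.action"))     # "omit"
exclude_class <- class(attr(na.exclude(fakedata), "na.action"))  # "exclude"

r_omitted  <- lm(x ~ y, data = fakedata, na.action = na.omit)
r_excluded <- lm(x ~ y, data = fakedata, na.action = na.exclude)

# Same fit either way; only the padding of per-case results differs.
length(residuals(r_omitted))    # 3
length(residuals(r_excluded))   # 4, with NA in position 3

# Inspect the session default, change it temporarily, then restore it.
getOption("na.action")
old <- options(na.action = "na.exclude")
r_default <- lm(x ~ y, data = fakedata)   # now behaves like r_excluded
options(old)
```

Because the padded results line up with the original rows, they can be bound back onto fakedata with cbind() without any index bookkeeping.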