I'd like to specify different types of missing values. I have data with several kinds of missing values and I am trying to code them as missing in R, but I am looking for a solution where I can still distinguish between them.
Say I have some data that looks like this,
set.seed(667)
df <- data.frame(
  a = sample(c("Don't know/Not sure", "Unknown", "Refused", "Blue", "Red", "Green"), 20, replace = TRUE),
  b = sample(c(1, 2, 3, 77, 88, 99), 10, replace = TRUE),
  f = round(rnorm(n = 10, mean = .90, sd = .08), digits = 2),
  g = sample(c("C", "M", "Y", "K"), 10, replace = TRUE)
)
df
#                      a  b    f g
# 1              Unknown  2 0.78 M
# 2              Refused  2 0.87 M
# 3                  Red 77 0.82 Y
# 4                  Red 99 0.78 Y
# 5                Green 77 0.97 M
# 6                Green  3 0.99 K
# 7                  Red  3 0.99 Y
# 8                Green 88 0.84 C
# 9              Unknown 99 1.08 M
# 10             Refused 99 0.81 C
# 11                Blue  2 0.78 M
# 12               Green  2 0.87 M
# 13                Blue 77 0.82 Y
# 14 Don't know/Not sure 99 0.78 Y
# 15             Unknown 77 0.97 M
# 16             Refused  3 0.99 K
# 17                Blue  3 0.99 Y
# 18               Green 88 0.84 C
# 19             Refused 99 1.08 M
# 20                 Red 99 0.81 C
If I now make two tables, my missing values ("Don't know/Not sure", "Unknown", "Refused" and 77, 88, 99) are included as regular data,
table(df$a,df$g)
#                       C K M Y
#   Blue                0 0 1 2
#   Don't know/Not sure 0 0 0 1
#   Green               2 1 2 0
#   Red                 1 0 0 3
#   Refused             1 1 2 0
#   Unknown             0 0 3 0
and
table(df$b,df$g)
#      C K M Y
#   2  0 0 4 0
#   3  0 2 0 2
#   77 0 0 2 2
#   88 2 0 0 0
#   99 2 0 2 2
I now recode the three factor levels "Don't know/Not sure", "Unknown", "Refused" into <NA>
is.na(df[,c("a")]) <- df[,c("a")]=="Don't know/Not sure"|df[,c("a")]=="Unknown"|df[,c("a")]=="Refused"
and remove the empty levels
df$a <- factor(df$a)
and the same is done with the numeric values 77, 88, and 99
is.na(df) <- df=="77"|df=="88"|df=="99"
table(df$a, df$g, useNA = "always")
#         C K M Y <NA>
#   Blue  0 0 1 2    0
#   Green 2 1 2 0    0
#   Red   1 0 0 3    0
#   <NA>  1 1 5 1    0
table(df$b,df$g, useNA = "always")
#        C K M Y <NA>
#   2    0 0 4 0    0
#   3    0 2 0 2    0
#   <NA> 4 0 4 4    0
Now the missing categories are recoded into NA, but they are all lumped together. Is there a way to recode something as missing but retain the original values? I want R to treat "Don't know/Not sure", "Unknown", "Refused" and 77, 88, 99 as missing, but I still want to have that information in the variable.
If you are willing to stick to numeric values, then NA, Inf, -Inf, and NaN could be used for different missing values. You can then use is.finite to distinguish between them and normal values.

You could have a special print function that displays them in a more meaningful way, or even create a special class, but even without that this would divide the data into finite values and several kinds of non-finite values.
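A minimal sketch of that idea in base R. The mapping of 77, 88 and 99 to NaN, Inf and -Inf is an assumption made for illustration; the answer does not say which sentinel should stand for which code.

```r
# A few values of column b from the question
b <- c(2, 2, 77, 99, 77, 3, 3, 88, 99, 99)

# Hypothetical mapping of survey codes to non-finite sentinels
b_coded <- b
b_coded[b == 77] <- NaN
b_coded[b == 88] <- Inf
b_coded[b == 99] <- -Inf

# is.finite() is FALSE for NA, NaN, Inf and -Inf alike
valid <- is.finite(b_coded)
mean_valid <- mean(b_coded[valid])

# The kinds of missingness remain distinguishable
n_nan     <- sum(is.nan(b_coded))
n_pos_inf <- sum(b_coded == Inf, na.rm = TRUE)
n_neg_inf <- sum(b_coded == -Inf, na.rm = TRUE)
```

Note that is.na() is TRUE for NaN but FALSE for Inf and -Inf, so functions that rely on is.na() or na.rm will not drop the infinite sentinels on their own; filtering with is.finite() first is the safer route.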
To my knowledge, base R doesn't have a built-in way to handle different NA types. (editor: It does: NA_integer_, NA_real_, NA_complex_, and NA_character_. See ?base::NA.)

One option is to use a package which does so, for instance "memisc". It's a little bit of extra work, but it seems to do what you're looking for.
Here's an example:
First, your data. I've made a copy since we will be making some pretty significant changes to the dataset, and it's always nice to have a backup.
Let's factor variable "a":
Load the "memisc" library:
Now, convert variables "a" and "b" to items in "memisc":

By doing this, we have a new data type. Compare the following:
We can use this information to create tables behaving the way you describe, while retaining all the original information.
The tables for the numeric column with missing data behave the same way.
As a bonus, you get the facility to generate nice codebooks:

However, I do also suggest you read the comment from @Maxim.K about what really constitutes missing values.
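The original code for the memisc steps above did not survive, so the following is only a rough sketch of the approach for the numeric column, assuming the memisc package is installed. It uses as.item(), missing.values<-, include.missings() and codebook() as I understand them from the package documentation; check ?memisc::items before relying on the details.

```r
library(memisc)  # assumption: memisc is installed

# A few values of column b from the question, stored as a memisc item
b <- as.item(c(2, 2, 77, 99, 77, 3, 3, 88, 99, 99))

# Declare the survey codes as missing; the codes themselves stay stored
missing.values(b) <- c(77, 88, 99)

# Declared-missing codes are expected to come back as NA here
b_valid <- as.numeric(b)
table(b_valid, useNA = "always")

# include.missings() should make the original codes visible again
b_all <- as.numeric(include.missings(b))
table(b_all)

# Summary of the item, including its declared missing values
codebook(b)
```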
To retain the original values, you can create new columns where you code the NA information, for example:
Then you can do something like this:
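As a minimal base R sketch of the extra-column idea: the column name a_reason and the shortened data are made up for illustration and are not from the answer.

```r
# A few values of column a from the question
dat <- data.frame(
  a = c("Unknown", "Refused", "Red", "Red", "Green", "Don't know/Not sure"),
  stringsAsFactors = FALSE
)
miss_levels <- c("Don't know/Not sure", "Unknown", "Refused")

# Hypothetical helper column: keeps the reason where a is a missing code
dat$a_reason <- ifelse(dat$a %in% miss_levels, dat$a, NA)

# Now recode the original column to NA
dat$a[dat$a %in% miss_levels] <- NA

# Analyses see plain NA values ...
table(dat$a, useNA = "always")

# ... while the reasons can still be tabulated separately
table(dat$a_reason[is.na(dat$a)])
```

The cost of this design is one extra column per variable, but every base R function that understands NA keeps working unchanged on the recoded column.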
Another option, without creating new columns, is to use the exclude option like this, to set the undesired values to NULL (different from missing values):

You can define some global constants (even though it is not recommended) to group your "missing values", and use them in the rest of your program. Something like this:
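A small sketch combining both suggestions, assuming hypothetical constant names MISSING_A and MISSING_B and a shortened version of the data. The original values stay in the vectors; they are only left out when tabulating.

```r
# A few values of columns a, b and g from the question
a <- c("Unknown", "Refused", "Red", "Red", "Green", "Don't know/Not sure")
b <- c(2, 2, 77, 99, 77, 3)
g <- c("M", "M", "Y", "Y", "M", "Y")

# Hypothetical global constants grouping the "missing" codes
MISSING_A <- c("Don't know/Not sure", "Unknown", "Refused")
MISSING_B <- c(77, 88, 99)

# Leave the missing codes out of the table without touching the data;
# useNA = "no" is set explicitly so the excluded rows are not counted as NA
tab_a <- table(a, g, exclude = MISSING_A, useNA = "no")
tab_a

# The same constants can be reused elsewhere, e.g. to flag observations
b_is_missing <- b %in% MISSING_B
mean_b_valid <- mean(b[!b_is_missing])
```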