How to fill NA in R for quasi-same row?

2019-08-27 22:13发布

I'm looking for a way to fillNA in duplicated() rows. There are totally same rows and at one time there is a NA, so I decide to fill this one by value of complete row but I don't see how to deal with it.

Using the duplicated() function, I could have a data frame like that:

 df <- data.frame(
   Year = rnorm(5), 
   hour = rnorm(5), 
   LOT = rnorm(5), 
   S123_AA = c('ABF4576','ABF4576','ABF4576','ABF4576','ABF4576'), 
   S135_AA = c('ABF5403',NA,'ABF5403','ABF5403','ABF5403'), 
   S13_BB = c('BF50343','BF50343','BF50343','BF50343',NA),  
   S1763_BB = c('AA3489','AA3489','AA3489','AA3489','AA3489'), 
   S173_BB = c('BQA0478','BQA0478','BQA0478','BQA0478','BQA0478'),
   S234543 = c('AD4352','AD4352','AD4352','AD4352','AD4352'),
   S1265UU5 = c('AZERTY', 'AZERTY', 'AZERTY', 'AZERTY','AZERTY')
 )

The rows are similar, so how could I feel the NA by the value of the preceding raw (which is not an NA) ? There is no complete.cases()rows.

标签: r duplicates na
3条回答
叛逆
2楼-- · 2019-08-27 22:27

You could loop through the data and find the first none NA value and replace the NA values with that value

# Loop through the data
for(c in 1:ncol(df)) {
    vals <- df[,c]
    noneNA <- vals[!is.na(vals)][1]
    vals[is.na(vals)] <- noneNA
    df[,c] <- vals
}

Or alternatively you could review your data element by element and take a none NA value from either above or below the relevant cell using nested for loops.

for(c in 1:ncol(df)) {
    for(r in 1:nrow(df)) {
        if (is.na(df[r,c])) {
            nearVals <- df[c(r-1, r+1),c]
            noneNA <- nearVals[!is.na(nearVals)][1]
            df[r,c] <- noneNA
        }
    }
}
查看更多
该账号已被封号
3楼-- · 2019-08-27 22:36

You can do the following:

library(zoo)

# get cols with missing values
na_cols <- names(df)[colSums(is.na(df)) > 0]

# fill the missing value backwards
for (i in na_cols){
    df[[i]] <- na.locf(df[[i]])
}
查看更多
看我几分像从前
4楼-- · 2019-08-27 22:37

reading your question made me think of an imputation problem for the dataframe.

In other terms you need to fill the NAs with some sort of value to be able to "save" records in the dataframe. The simplest way is to select the value of a particular column by searching the mean (when dealing with cardinal values) or the mode (when dealing with categorical values) [you may also execute a regression, but I guess it's a more complex method].

In this case we may choose the mode replacement because the attributes are categorical. By running your code we obtain the dataframe df:

         Year       hour         LOT S123_AA S135_AA  S13_BB S1763_BB S173_BB S234543 S1265UU5
1 -0.32837526  0.7930541 -1.10954824 ABF4576 ABF5403 BF50343   AA3489 BQA0478  AD4352   AZERTY
2  0.55379245 -0.7320060 -0.95088434 ABF4576    <NA> BF50343   AA3489 BQA0478  AD4352   AZERTY
3  0.36442118  0.9920967 -0.07345038 ABF4576 ABF5403 BF50343   AA3489 BQA0478  AD4352   AZERTY
4 -0.02546781 -0.1127828 -1.78241434 ABF4576 ABF5403 BF50343   AA3489 BQA0478  AD4352   AZERTY
5  1.92550340 -1.0531371  0.88318695 ABF4576 ABF5403    <NA>   AA3489 BQA0478  AD4352   AZERTY

We can then create a function to calculate the mode of a particular column:

getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}

And then use it to fill the missing values. Below the code to impute the missing values for the column S135_AA (I created a new dataframe named workdf) :

workdf <- df
workdf[is.na(workdf$S135_AA),c('S135_AA')] <- getmode(workdf[,'S135_AA'])

This is the output where you can see that the column S135_AA NAs took the most recurring value of the colum:

         Year       hour         LOT S123_AA S135_AA  S13_BB S1763_BB S173_BB S234543 S1265UU5
1 -0.32837526  0.7930541 -1.10954824 ABF4576 ABF5403 BF50343   AA3489 BQA0478  AD4352   AZERTY
2  0.55379245 -0.7320060 -0.95088434 ABF4576 ABF5403 BF50343   AA3489 BQA0478  AD4352   AZERTY
3  0.36442118  0.9920967 -0.07345038 ABF4576 ABF5403 BF50343   AA3489 BQA0478  AD4352   AZERTY
4 -0.02546781 -0.1127828 -1.78241434 ABF4576 ABF5403 BF50343   AA3489 BQA0478  AD4352   AZERTY
5  1.92550340 -1.0531371  0.88318695 ABF4576 ABF5403    <NA>   AA3489 BQA0478  AD4352   AZERTY

If your objective was data cleaning I guess that you should use an imputation method to deal with it.

查看更多
登录 后发表回答