subset in parallel using a list of dataframes and

Posted 2019-09-16 10:26

Question:

This works:

onion$yearone$id %in% mask$yearone

This doesn't:

onion[1][1] %in% mask[1]
onion[1]['id'] %in% mask[1]

Why? I can't find an obvious way to vectorize this across the parallel columns in DF and memberids (onion and mask in the example data), so that within each year I keep only the rows whose ids are present in both. For now I'm using a for loop, but I haven't had any luck finding the right way to express the index. Help?

Example data:

yearone <- data.frame(id=c("b","b","c","a","a"),v=rnorm(5))
onion <- list()
onion[[1]] <- yearone
names(onion) <- 'yearone'
mask <- list()
mask[[1]] <- c('a','c')
names(mask) <- 'yearone'

Answer 1:

Here is an approach using Map

# some data: five data frames and five masks, one pair per year
onion <- replicate(5, data.frame(id = sample(letters[1:3], 5, replace = TRUE), v = 1:5),
                   simplify = FALSE)
mask <- replicate(5, sample(letters[1:3], 2), simplify = FALSE)
names(onion) <- names(mask) <- paste0('year', seq_along(onion))

A function that will do the matching

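get_matches <- function(data, id, mask){
   # TRUE for each row whose value in column `id` appears in `mask`
   rows <- data[[id]] %in% mask
   # keep matching rows; drop = FALSE keeps the result a data frame
   data[rows, , drop = FALSE]
}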


Map(get_matches , data = onion, mask = mask, MoreArgs = list(id = 'id'))
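
As a quick check, here is a minimal sketch that runs the same Map call on the small example from the question. The function is repeated so the block runs on its own, and v = 1:5 replaces rnorm(5) purely to make the output predictable (that substitution is an assumption, not part of the question).

```r
# Same matching function as above, repeated so this block is self-contained
get_matches <- function(data, id, mask) {
  rows <- data[[id]] %in% mask
  data[rows, , drop = FALSE]
}

# Question's example data; v = 1:5 is a stand-in for rnorm(5)
onion <- list(yearone = data.frame(id = c("b", "b", "c", "a", "a"), v = 1:5,
                                   stringsAsFactors = FALSE))
mask <- list(yearone = c("a", "c"))

# Map pairs onion[[i]] with mask[[i]]; id = 'id' is passed unchanged to every call
res <- Map(get_matches, data = onion, mask = mask, MoreArgs = list(id = "id"))

res$yearone
#   id v
# 3  c 3
# 4  a 4
# 5  a 5
```

The result keeps the names of onion, so each year's subset can be retrieved by name.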


Answer 2:

The '$' operator is not the same as the '[' operator. If 'yearone' and 'id' are in fact the first items in those lists, you should see that this gives the same result as the first call:

DF[[1]][[1]] %in% memberids[[1]]

Why accessing 'yearpathall' should give the same results is entirely unclear at this point, but using the "[[" operator can return an atomic vector, whereas using "[" on a list never will. Applied to a list, the "[" operator returns a result of the same class as its first argument, so in this case you get a list rather than a vector, for both 'DF' and 'memberids'. The %in% operator is just an infix version of match and is meant to compare atomic vectors: a list argument is not compared element by element against the values stored inside it.
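
The difference between "[" and "[[" can be seen on the question's own example data. This is only an illustrative sketch; stringsAsFactors = FALSE is added here (an assumption, not in the question) so the id column is a character vector in every R version.

```r
# Example data from the question
yearone <- data.frame(id = c("b", "b", "c", "a", "a"), v = rnorm(5),
                      stringsAsFactors = FALSE)
onion <- list(yearone = yearone)
mask  <- list(yearone = c("a", "c"))

class(onion[1])         # "list": still a list wrapping the data frame
class(onion[[1]])       # "data.frame": the element itself
class(onion[[1]][1])    # "data.frame": a one-column data frame
class(onion[[1]][[1]])  # "character": the id column as an atomic vector

# Lists on both sides: a single value, not one per row
length(onion[1][1] %in% mask[1])  # 1

# Atomic vectors on both sides: one logical per row
hits <- onion[[1]][["id"]] %in% mask[[1]]
hits  # FALSE FALSE  TRUE  TRUE  TRUE

# Rows of the first data frame whose id appears in the first mask
onion[[1]][hits, ]
```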



Answer 3:

This seems to be the answer I was seeking:

merge(mask[1],onion[[1]], by.x = names(mask[1]), by.y = names(onion[[1]][1]))

And applied to parallel lists of dataframes:

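# one result slot per year
result <- vector("list", length(onion))
# seq_along() is safe even when onion is empty, unlike 1:length(...)
for (i in seq_along(onion)) {
  result[[i]] <- merge(mask[i], onion[[i]], by.x = names(mask[i]), by.y = names(onion[[i]][1]))
}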
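
For comparison, the same merge can probably be written without an explicit index by using Map, which also keeps the list names. This is a sketch under assumptions: merge_mask is a hypothetical helper name, the mask is wrapped in a one-column data frame named id so a single by = "id" works, and v = 1:5 stands in for rnorm(5). Note that merge returns rows sorted by the key column rather than in their original order.

```r
# Question's example data; v = 1:5 is a stand-in for rnorm(5)
onion <- list(yearone = data.frame(id = c("b", "b", "c", "a", "a"), v = 1:5,
                                   stringsAsFactors = FALSE))
mask  <- list(yearone = c("a", "c"))

# Hypothetical helper: turn one mask vector into a one-column data frame
# and inner-join it to the matching data frame on "id"
merge_mask <- function(ids, data) {
  merge(data.frame(id = ids, stringsAsFactors = FALSE), data, by = "id")
}

# Map walks mask and onion in parallel and keeps the names of mask
result <- Map(merge_mask, ids = mask, data = onion)

names(result)      # "yearone"
result$yearone$id  # "a" "a" "c"
```

If a mask contained duplicate ids, merge would repeat the matching rows, whereas subsetting with %in% would not; which behavior is wanted depends on the data.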