Expanding a Frequency Table Where the Variable Nam

Posted 2019-09-06 14:21

Question:

I am working with a dataframe where each observation is linked to a specific ID, and I have a set of variables that define the "values" as if I had a factor variable. However, the value in the "cell" is the frequency. Here is a simplified version:

ID  1  2  3
A   2  3  2
B   1  4  1

I would like to get two vectors that expand the frequencies so that I can calculate an interpolated median for each ID. That is, I'd like something of the form:

A  B
1  1
1  2
2  2
2  2
2  2
3  3
3

The psych package has a function interp.median that could then take each vector and return the interpolated median for each ID that I would like to include as a new variable in the original dataframe. I checked out the vcdExtra package which could maybe do this with its expand.dft function, but I'm not sure exactly how it would work.

Any help would be greatly appreciated!

EDIT: To refine a bit more, interp.median would work best if the final result was a data frame, with NAs padded at the end. That is, something of the form:

A  B
1  1
1  2
2  2
2  2
2  2
3  3
3  NA

Answer 1:

If dat is the dataset (its definition is given under "Data" at the end of this answer):

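  # For each ID, repeat the column position by its frequency.
  # seq_along(x) yields 1, 2, 3, which here coincide with the column names.
  lst <- by(dat[,-1], dat[,1], function(x) rep(seq_along(x), x))
  lst
  #dat[, 1]: A
  #[1] 1 1 2 2 2 3 3
  #------------------------------------------------------------
  #dat[, 1]: B
  #[1] 1 2 2 2 2 3

  # Pad each vector with NA up to the longest length, then combine
  indx <- max(sapply(lst,length))
  dat2 <- do.call(data.frame,lapply(lst, function(x) c(x,rep(NA,indx-length(x)))))
  dat2
  #  A  B
  #1 1  1
  #2 1  2
  #3 2  2
  #4 2  2
  #5 2  2
  #6 3  3
  #7 3 NA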

Or

  lst2 <- lapply(split(dat[,-1], dat$ID), function(x) rep(seq_along(unlist(x)), unlist(x)))

  do.call(data.frame,lapply(lst2, function(x) c(x,rep(NA,indx-length(x)))))

Data:

  dat <- structure(list(ID = c("A", "B"), `1` = c(2L, 1L), `2` = 3:4,
    `3` = c(2L, 1L)), .Names = c("ID", "1", "2", "3"), class = "data.frame",
    row.names = c(NA, -2L))
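
To finish the task from the question, the padded data frame can be summarised column by column and the result attached to the original data. The following is only a sketch: stat_fun is a hypothetical placeholder that uses base R's median, and the values are read from the column names instead of being assumed to be 1..k. If the psych package is installed, psych::interp.median could presumably be swapped in for the placeholder.

```r
# Example data from the question: counts of the values 1, 2, 3 per ID
dat <- data.frame(ID = c("A", "B"), `1` = c(2L, 1L), `2` = 3:4,
                  `3` = c(2L, 1L), check.names = FALSE,
                  stringsAsFactors = FALSE)

# Take the values from the column names (assumes they are numeric labels)
vals <- as.numeric(names(dat)[-1])

# One expanded vector per ID
lst <- lapply(split(dat[-1], dat$ID), function(x) rep(vals, unlist(x)))

# Pad to a common length with NA and collect into a data frame
n <- max(lengths(lst))
dat2 <- as.data.frame(lapply(lst, function(x) c(x, rep(NA, n - length(x)))))

# Placeholder statistic (hypothetical name); replace with the function you need
stat_fun <- function(x) median(x, na.rm = TRUE)

# One value per ID, matched back to the rows of the original data
dat$med <- unname(sapply(dat2, stat_fun)[dat$ID])
dat
```

Matching by dat$ID rather than by position keeps the new column aligned even if split() orders the IDs differently from the original rows.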


Answer 2:

Here is one way:

# your data, transposed: ID holds the values, A and B hold the frequencies
df <- data.frame(ID=c(1,2,3), A=c(2,3,2), B=c(1,4,1))

# function to repeat each ID a given number of times,
# as specified in 'colname' of df
rep_id <- function(colname) {
  unname(unlist(apply(df[, c('ID',colname)], 1, function(x) rep(x[1], x[2]))))
}

# apply this function to all columns (except the first, which is ID)
sapply(names(df)[-1], rep_id)

Yields:

$A
[1] 1 1 2 2 2 3 3

$B
[1] 1 2 2 2 2 3
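
This answer works on a transposed copy of the data (values in rows, groups in columns). As a rough sketch, and assuming the wide layout from the question with numeric column names, the transposed table could be built like this; the per-column expansion at the end is a variation that avoids apply and is not the answerer's exact code.

```r
# Wide layout from the question: one row per ID, one column per value
dat <- data.frame(ID = c("A", "B"), `1` = c(2L, 1L), `2` = 3:4,
                  `3` = c(2L, 1L), check.names = FALSE,
                  stringsAsFactors = FALSE)

# Transpose the counts so that each group becomes a column
counts <- t(as.matrix(dat[-1]))
colnames(counts) <- dat$ID

# The former column names become the values to repeat
df <- data.frame(ID = as.numeric(rownames(counts)), counts, row.names = NULL)

# Repeat each value by its frequency, one vector per group
expanded <- lapply(df[-1], function(n) rep(df$ID, n))
expanded
```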


Answer 3:

Sample data:

df <- read.table(text="
ID  1  2  3
A   2  3  2
B   1  4  1", header=TRUE, check.names=FALSE)

Use apply:

(newlist <- apply(df[2:4], 1, function(x) rep(names(x), x)))
#[[1]]
#[1] "1" "1" "2" "2" "2" "3" "3"
#
#[[2]]
#[1] "1" "2" "2" "2" "2" "3"

names(newlist) <- df$ID
newlist
#$A
#[1] "1" "1" "2" "2" "2" "3" "3"
#
#$B
#[1] "1" "2" "2" "2" "2" "3"

This outputs characters, but you could output numbers like this:

newlist <- apply(df[2:4], 1, function(x) rep(as.numeric(names(x)), x))
names(newlist) <- df$ID

Edit:

To address the OP's new request that the vectors be padded with NAs and combined, call this after running either of the options above. Note that the result is a matrix; wrap it in as.data.frame() if a data frame is needed:

newlist <- sapply(newlist, function(x) x[1:max(sapply(newlist, length))])
#     A  B
#[1,] 1  1
#[2,] 1  2
#[3,] 2  2
#[4,] 2  2
#[5,] 2  2
#[6,] 3  3
#[7,] 3 NA
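
The padding works because indexing a vector past its end returns NA. A small sketch of that behaviour, together with an alternative that assigns to length() and returns a data frame directly (the list below simply re-enters the numeric result from above by hand):

```r
# Expanded vectors from above, entered by hand for a self-contained example
newlist <- list(A = c(1, 1, 2, 2, 2, 3, 3), B = c(1, 2, 2, 2, 2, 3))
n <- max(lengths(newlist))

# Indexing beyond the last element yields NA, which is what pads B
newlist$B[1:n]

# Alternative: extend each vector by assigning its length, then combine
padded <- as.data.frame(lapply(newlist, function(x) {
  length(x) <- n   # new trailing elements are filled with NA
  x
}))
padded
```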