How to apply a custom function over each column of

Posted 2019-08-01 03:43

Question:

I have been trying to use a custom function that I found on here to recalculate median household income from census tracts aggregated to neighborhoods. My data looks like this:

> inc_df[, 1:5]
          San Francisco Bayview Hunters Point Bernal Heights Castro/Upper Market Chinatown
2500-9999             22457                  1057            287                 329      1059
10000-14999           20708                   920            288                 463      1327
1500-19999            12701                   626            145                 148       867
20000-24999           12106                   491            285                 160       689
25000-29999           10129                   554            238                 328       167
30000-34999           10310                   338            257                 179       289
35000-39999            9028                   383            184                 163       326
40000-44999            9532                   472            334                 173       264
45000-49999            8406                   394            345                 241       193
50000-59999           17317                   727            367                 353       251
60000-74999           25947                  1037            674                 794       236
75000-99999           36378                  1185            980                 954       289
100000-124999         33890                   990            640                1208       199
125000-149999         24935                   522            666                 957       234
150000-199999         37190                   814           1310                1535       150
200000-250001         65763                   796           2122                3175       302

The function is as follows:

GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
  # If "sep" is specified, the function will try to create the 
  #   required "intervals" matrix. "trim" removes any unwanted 
  #   characters before attempting to convert the ranges to numeric.
  if (!is.null(sep)) {
    if (is.null(trim)) pattern <- ""
    else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
    else pattern <- trim
    intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
  }

  Midpoints <- rowMeans(intervals)
  cf <- cumsum(frequencies)
  Midrow <- findInterval(max(cf)/2, cf) + 1
  L <- intervals[1, Midrow]      # lower class boundary of median class
  h <- diff(intervals[, Midrow]) # size of median class
  f <- frequencies[Midrow]       # frequency of median class
  cf2 <- cf[Midrow - 1]          # cumulative frequency class before median class
  n_2 <- max(cf)/2               # total observations divided by 2

  unname(L + (n_2 - cf2)/f * h)
}

And the code to apply the function looks like this:

GroupedMedian(inc_df[, "Bernal Heights"], rownames(inc_df), sep="-", trim="cut")

This all works fine but I can't figure out how to apply this to each column of the matrix instead of typing out each column name and running it again and again. I have tried this:

> minc_hood <- data.frame(apply(inc_df, 2, function(x) GroupedMedian(inc_df[, x], 
rownames(inc_df), sep="-", trim="cut")))

But I get this error message

Error in inc_df[, x] : subscript out of bounds

Answer 1:

There are a couple of things at play here:

  • advice: never use apply with a data.frame (unless you are absolutely certain you don't mind the overhead of converting to matrix^1 and can accept the potential data loss^2).

  • even if you're going to use apply, you're doing it a little "off": when you say apply(df, 2, func), it takes each column of df in turn and passes that column's values as the first argument to func, so for instance

    apply(mtcars, 2, mean)
    

    will make calls like

    mean(c(21, 21, 22.8, 21.4, 18.7, ...)) # mpg
    mean(c(6, 6, 4, 6, 8, ...))            # cyl
    mean(c(160, 160, 108, 258, 360, ...))  # disp
    # ... etc
    

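    In that context, your use of apply(inc_df, 2, function(x) GroupedMedian(inc_df[, x], ...)) is wrong, since x is replaced by all values of the first column of inc_df (and then all values of the 2nd column, etc), not by a column name or index. inc_df[, x] then tries to select columns numbered 22457, 20708, and so on, which is why you get "subscript out of bounds".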
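    To see what apply actually hands to the function, here is a small sketch on a made-up matrix. The object m and its values are hypothetical stand-ins, not the asker's data:

```r
# Hypothetical 3 x 2 matrix standing in for inc_df
m <- matrix(c(10, 20, 30, 400, 500, 600), nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("A", "B")))

# apply() passes each column's VALUES as x, not the column's name or index
seen <- apply(m, 2, function(x) paste(x, collapse = ","))
print(seen)

# Using those values as a column subscript is what fails:
# this asks for columns 10, 20 and 30 of a 2-column matrix
res <- tryCatch(m[, m[, "A"]], error = function(e) conditionMessage(e))
print(res)
```

    Inside the anonymous function, x is already the column, so it can be passed straight to the function being applied.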

Since your function looks like it accepts a vector of values (plus some other arguments), I suggest you try something like

inc_df[] <- lapply(inc_df, GroupedMedian, rownames(inc_df), sep="-", trim="cut")

If you want to apply this function to a subset of those columns, then something like this works well:

ind <- c(1,3,7)
inc_df[ind] <- lapply(inc_df[ind], GroupedMedian, rownames(inc_df), sep="-", trim="cut")

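The use of inc_df[] <- ... (when not doing a column-subset) ensures that we replace the values of the columns without losing the attribute that it is a data.frame. It is effectively the same as inc_df <- as.data.frame(...) with some other minor nuances. Note that because GroupedMedian returns a single number per column, this assignment recycles that number down every row of the column; if you want one median per neighborhood, store the result of lapply (or sapply) in a new object instead of assigning it back into inc_df.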
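If the goal is one median per neighborhood rather than overwriting inc_df, one possible sketch is to collect the results with sapply. The function below is a condensed copy of the GroupedMedian from the question, and the toy data frame (AreaA, AreaB and their counts) is made up for illustration:

```r
# Condensed copy of the GroupedMedian function from the question
GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
  if (!is.null(sep)) {
    pattern <- if (is.null(trim)) "" else if (trim == "cut") "\\[|\\]|\\(|\\)" else trim
    intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
  }
  cf <- cumsum(frequencies)
  Midrow <- findInterval(max(cf) / 2, cf) + 1
  L <- intervals[1, Midrow]        # lower boundary of the median class
  h <- diff(intervals[, Midrow])   # width of the median class
  f <- frequencies[Midrow]         # frequency of the median class
  cf2 <- cf[Midrow - 1]            # cumulative frequency before the median class
  unname(L + (max(cf) / 2 - cf2) / f * h)
}

# Made-up counts for two hypothetical areas (not the asker's data)
toy <- data.frame(
  AreaA = c(10, 20, 30, 40),
  AreaB = c(40, 30, 20, 10),
  row.names = c("0-9", "10-19", "20-29", "30-39"),
  check.names = FALSE
)

# One median per column, returned as a named numeric vector
medians <- sapply(toy, GroupedMedian, intervals = rownames(toy),
                  sep = "-", trim = "cut")
print(medians)
```

With the asker's data, the equivalent call would presumably be sapply(inc_df, GroupedMedian, intervals = rownames(inc_df), sep = "-", trim = "cut"), assuming inc_df is a data.frame; if it is a matrix, apply(inc_df, 2, GroupedMedian, ...) with the same extra arguments is the analogous form.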

Notes:

^1: apply will always convert a data.frame to a matrix. This might be alright, but with larger data will take a non-zero amount of time. It also may have consequences, see next ...

^2: a matrix can have only one class, unlike a data.frame. That means that all columns will be up-converted to the highest common type, in the order of logical < integer < numeric < character; columns of other classes, such as factors or POSIXct date-times, generally end up as character as well. This means that if you have all numeric columns and one character, then the function you are applying on it will see all character data. This can be mitigated by only selecting those columns with the types you expect, perhaps with:

isnum <- sapply(inc_df, is.numeric)
inc_df[isnum] <- apply(inc_df[isnum], 2, GroupedMedian, ...)

and in that case, the worst conversion you will get will be integer-to-numeric, likely an acceptable (and reversible) conversion.
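
As a quick illustration of the conversion described in note ^2, the sketch below compares what apply and sapply see for a mixed-type data frame. The data frame df and its columns are hypothetical:

```r
# Hypothetical mixed-type data frame (names and values are made up)
df <- data.frame(n = c(1.5, 2.5), s = c("a", "b"), stringsAsFactors = FALSE)

# apply() goes through as.matrix(), so the numeric column arrives as character
classes_apply <- apply(df, 2, class)
print(classes_apply)

# sapply() works column by column and sees each column's own type
classes_sapply <- sapply(df, class)
print(classes_sapply)
```

Here apply reports "character" for both columns, while sapply still sees n as numeric.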



Tags: r apply