
cost function in cv.glm of boot library in R

Posted 2019-03-16 12:19

Question:

I am trying to use the cross-validation function cv.glm from the boot library in R to determine the number of misclassifications when a glm logistic regression is applied.

The function has the following signature:

cv.glm(data, glmfit, cost, K)

where the first two arguments denote the data and the fitted model, and K specifies the number of folds. My problem is the cost parameter, which is defined as:

cost: A function of two vector arguments specifying the cost function for the crossvalidation. The first argument to cost should correspond to the observed responses and the second argument should correspond to the predicted or fitted responses from the generalized linear model. cost must return a non-negative scalar value. The default is the average squared error function.
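For reference, that documented default can be written out explicitly. Here is a minimal sketch of a cost function equivalent to the average squared error default (the name default_cost is mine, not boot's):

## Equivalent to cv.glm's default: average squared error between the
## observed responses (y) and the predicted/fitted responses (yhat)
default_cost <- function(y, yhat) mean((y - yhat)^2)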

I guess for classification it would make sense to have a function that returns the misclassification rate, something like:

nrow(subset(data, (predict >= 0.5  & data$response == "no") | 
                  (predict <  0.5  & data$response == "yes")))

which is of course not even syntactically correct.

Unfortunately, my limited R knowledge has already cost me hours here, and I was wondering if someone could point me in the right direction.

Answer 1:

It sounds like you might do well to just use the cost function (i.e. the one named cost) defined further down in the "Examples" section of ?cv.glm. Quoting from that section:

 # [...] Since the response is a binary variable an
 # appropriate cost function is
 cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)

This does essentially what you were trying to do with your example. Replacing your "no" and "yes" with 0 and 1, let's say you have two vectors, predict and response. Then cost() is nicely designed to take them and return the mean misclassification rate:

## Simulate some reasonable data
set.seed(1)
predict <- seq(0.1, 0.9, by=0.1)
response <-  rbinom(n=length(predict), prob=predict, size=1)
response
# [1] 0 0 0 1 0 0 0 1 1

## Demonstrate the function 'cost()' in action
cost(response, predict)
# [1] 0.3333333  ## Which is right, as 3/9 elements (4, 6, & 7) are misclassified
                 ## (assuming you use 0.5 as the cutoff for your predictions).

I'm guessing the trickiest bit of this will be just getting your mind fully wrapped around the idea of passing a function in as an argument. (At least that was for me, for the longest time, the hardest part of using the boot package, which requires that move in a fair number of places.)
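To make the function-as-argument step concrete, here is a hedged sketch of passing cost() into cv.glm(), using the built-in mtcars data as a stand-in since the question's data isn't available:

library(boot)

cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

## Logistic regression on a built-in dataset (stand-in for the real data)
fit <- glm(am ~ wt + hp, family = binomial, data = mtcars)

## Pass the function object 'cost' itself (no parentheses);
## cv.glm() calls it internally on each fold
set.seed(1)
cv.err <- cv.glm(mtcars, fit, cost, K = 10)
cv.err$delta[1]  ## raw cross-validated misclassification rate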


Added on 2016-03-22:

The function cost() given above is, in my opinion, unnecessarily obfuscated; the following alternative does exactly the same thing in a more expressive way:

cost <- function(r, pi = 0) { 
        mean((pi < 0.5) & r==1 | (pi > 0.5) & r==0)
}
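As a quick sanity check (my own addition, with simulated data in the spirit of the example above), the two formulations give the same answer:

cost_orig <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
cost_alt  <- function(r, pi = 0) mean((pi < 0.5) & r == 1 | (pi > 0.5) & r == 0)

set.seed(1)
pred <- runif(100)                          ## predicted probabilities
resp <- rbinom(100, size = 1, prob = pred)  ## simulated 0/1 responses
cost_orig(resp, pred) == cost_alt(resp, pred)
## [1] TRUE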


Answer 2:

I will try to explain the cost function in simple words. Let's take the cv.glm(data, glmfit, cost, K) arguments step by step:

  1. data: The data consists of many observations; think of it as a series of numbers.

  2. glmfit: The generalized linear model fitted to the data. There is a catch, though: cv.glm splits the data into K roughly equal parts. Each part in turn serves as the test set, the model is refit on the remaining parts (the training set), and predictions are made for the held-out observations, so the predictions have the same number of elements as the test set. (A sketch of this mechanism appears after this list.)

  3. cost: The cost function. It takes two arguments: first the observed responses of the test set, and second the predictions from glmfit for that test set. It must return a single non-negative number. The default is the average squared error function: conceptually, a loop runs over the test set (the observed and predicted vectors have the same length), computes each difference, squares it, accumulates the sum, and finally averages it.

  4. K: The number of parts into which the data is split. The default (one part per observation) gives leave-one-out cross-validation.
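To illustrate points 2-4 (the sketch promised above), here is a simplified manual version of the K-fold mechanism. This is my own illustration, not boot's actual implementation; manual_cv, formula, and response are hypothetical names:

## Simplified K-fold cross-validation, for illustration only
manual_cv <- function(data, formula, response, cost, K) {
  folds <- sample(rep(1:K, length.out = nrow(data)))  ## assign each row a fold
  costs <- numeric(K)
  for (k in 1:K) {
    train <- data[folds != k, ]  ## training set: all other folds
    test  <- data[folds == k, ]  ## held-out test set
    fit   <- glm(formula, family = binomial, data = train)
    pred  <- predict(fit, newdata = test, type = "response")
    costs[k] <- cost(test[[response]], pred)  ## observed first, predicted second
  }
  mean(costs)  ## average cost over the K folds
}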

Judging from your description, the observations (y) are 'yes' or 'no' labels, while the predictions (x) are numbers between 0 and 1 (below 0.5 meaning "no", above 0.5 meaning "yes"). Note that cv.glm passes the observed responses as the first argument and the predictions as the second, so the cost function should take them in that order. The error (e) between observation and prediction would then be:

cost <- function(y, x){
  e <- 0
  for (i in 1:length(x)){
    if (x[i] > 0.5) {
      # predicted "yes": add squared error only if the observation is "no"
      if (y[i] != 'yes') e <- e + (x[i] - 0.5)^2
    } else {
      # predicted "no": add squared error only if the observation is "yes"
      if (y[i] != 'no') e <- e + (0.5 - x[i])^2
    }
  }
  e / length(x)  # mean square error
}

Source: http://www.cs.cmu.edu/~schneide/tut5/node42.html



Answer 3:

The cost function can optionally be defined if there is one you prefer over the default average squared error. To do so, you would write a function taking two inputs: (1) the vector of known labels that you are predicting, and (2) the vector of predicted probabilities from your model for those labels. For the cost function that (I think) you described in your post, you want the proportion of correct classifications (the accuracy), which would look something like this:

cost <- function(labels,pred){
 mean(labels==ifelse(pred > 0.5, 1, 0))
}

With that function defined you can then pass it into your cv.glm() call, although I wouldn't recommend using your own cost function over the default one unless you have a reason to. Your example isn't reproducible, so here is another one:

> library(boot)
> 
> cost <- function(labels,pred){
+   mean(labels==ifelse(pred > 0.5, 1, 0))
+ }
> 
> #make model
> nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal)
> #run cv with your cost function
> (nodal.glm.err <- cv.glm(nodal, nodal.glm, cost, nrow(nodal)))
$call
cv.glm(data = nodal, glmfit = nodal.glm, cost = cost, K = nrow(nodal))

$K
[1] 53

$delta
[1] 0.8113208 0.8113208

$seed
  [1]         403         213 -2068233650  1849869992 -1836368725 -1035813431  1075589592  -782251898
...
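As a closing note on the output: $delta is a vector of length two, where the first element is the raw cross-validation estimate of the cost and the second is an adjusted estimate that compensates for bias; with leave-one-out cross-validation, as here, the two are equal (0.8113, i.e. 43 of the 53 observations correctly classified).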