What is cross-entropy?

Published 2019-01-29 17:03

I know that there are a lot of explanations of what cross-entropy is, but I'm still confused.

Is it only a method to describe the loss function? Then we can use, for example, the gradient descent algorithm to find the minimum. Or is it the whole process, including the algorithm that finds the minimum?

1 answer
三岁会撩人
#2 · 2019-01-29 17:30

Cross-entropy is commonly used to quantify the difference between two probability distributions. Usually the "true" distribution (the one that your machine learning algorithm is trying to match) is expressed in terms of a one-hot distribution.

For example, suppose for a specific training instance, the label is B (out of the possible labels A, B, and C). The one-hot distribution for this training instance is therefore:

Pr(Class A)  Pr(Class B)  Pr(Class C)
        0.0          1.0          0.0

You can interpret the above "true" distribution to mean that the training instance has 0% probability of being class A, 100% probability of being class B, and 0% probability of being class C.

Now, suppose your machine learning algorithm predicts the following probability distribution:

Pr(Class A)  Pr(Class B)  Pr(Class C)
      0.228        0.619        0.153

How close is the predicted distribution to the true distribution? That is what the cross-entropy loss determines. Use this formula:

H(p, q) = -sum_x p(x) * ln(q(x))

where p(x) is the true (target) probability and q(x) is the predicted probability. The sum runs over the three classes A, B, and C. In this case the loss is 0.479:

H = - (0.0*ln(0.228) + 1.0*ln(0.619) + 0.0*ln(0.153)) = 0.479
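That worked example can be checked in a few lines of Python (a minimal sketch using only the standard library; the variable names are just for illustration):

```python
import math

# True (one-hot) distribution and the model's predicted distribution
p = [0.0, 1.0, 0.0]        # Pr(A), Pr(B), Pr(C) -- the label is B
q = [0.228, 0.619, 0.153]  # predicted probabilities

# Cross-entropy: H(p, q) = -sum_x p(x) * ln(q(x))
loss = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
print(loss)  # about 0.479, matching the worked example above
```

The zero entries of p wipe out every term except the one for the true class, so the loss reduces to -ln(q of the correct class).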

So that is how "wrong" or "far away" your prediction is from the true distribution.

Cross-entropy is one of many possible loss functions (another popular one is the SVM hinge loss). These loss functions are typically written as J(theta) and can be used within gradient descent, an iterative procedure for moving the parameters (or coefficients) towards the optimum values. In the equation below, you would replace J(theta) with H(p, q). But note that you need to compute the derivative of H(p, q) with respect to the parameters first.

theta_j := theta_j - alpha * dJ(theta)/dtheta_j    (repeat until convergence)
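To make the connection concrete, here is a small sketch (an illustration, not production code) of gradient descent minimizing the cross-entropy loss, where the parameters are the logits feeding a softmax. It uses the known identity that for softmax followed by cross-entropy, the gradient with respect to logit j is simply q_j - p_j; the starting logits and learning rate are arbitrary choices:

```python
import math

def softmax(z):
    # Numerically stable softmax: subtract the max before exponentiating
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) * ln(q(x)); skip zero-probability terms
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.0, 1.0, 0.0]   # one-hot target: the label is B
z = [0.1, 0.2, 0.3]   # hypothetical starting logits (the "theta")
alpha = 0.5           # learning rate

initial_loss = cross_entropy(p, softmax(z))
for _ in range(200):
    q = softmax(z)
    # For softmax + cross-entropy, dH/dz_j = q_j - p_j
    grad = [qj - pj for pj, qj in zip(p, q)]
    z = [zj - alpha * gj for zj, gj in zip(z, grad)]
final_loss = cross_entropy(p, softmax(z))

print(initial_loss, final_loss)  # the loss shrinks towards 0
```

Each iteration nudges the logits in the direction that raises the predicted probability of the true class B, which is exactly the update rule above with J(theta) replaced by H(p, q).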

So to answer your original questions directly:

Is it only a method to describe the loss function?

Correct, cross-entropy describes the loss between two probability distributions. It is one of many possible loss functions.

Then we can use, for example, the gradient descent algorithm to find the minimum.

Yes, the cross-entropy loss function can be used as part of gradient descent.

Further reading: one of my other answers related to TensorFlow.
