Using binary_crossentropy loss in Keras (Tensorflo

Posted 2020-06-18 04:02

Question:

In the training example in Keras documentation,

https://keras.io/getting-started/sequential-model-guide/#training

binary_crossentropy is used and a sigmoid activation is added in the network's last layer. Is it necessary to add the sigmoid in the last layer? This is what I found in the source code:

def binary_crossentropy(output, target, from_logits=False):
  """Binary crossentropy between an output tensor and a target tensor.
  Arguments:
      output: A tensor.
      target: A tensor with the same shape as `output`.
      from_logits: Whether `output` is expected to be a logits tensor.
          By default, we consider that `output`
          encodes a probability distribution.
  Returns:
      A tensor.
  """
  # Note: nn.softmax_cross_entropy_with_logits
  # expects logits, Keras expects probabilities.
  if not from_logits:
    # transform back to logits
    epsilon = _to_tensor(_EPSILON, output.dtype.base_dtype)
    output = clip_ops.clip_by_value(output, epsilon, 1 - epsilon)
    output = math_ops.log(output / (1 - output))
  return nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)

Keras invokes sigmoid_cross_entropy_with_logits in Tensorflow, but inside sigmoid_cross_entropy_with_logits, sigmoid(logits) is calculated again.

https://www.tensorflow.org/versions/master/api_docs/python/tf/nn/sigmoid_cross_entropy_with_logits

So I don't think it makes sense to add a sigmoid at the end, yet seemingly all the binary/multi-label classification examples and tutorials for Keras that I found online add one. Besides, I don't understand the meaning of this comment:

# Note: nn.softmax_cross_entropy_with_logits
# expects logits, Keras expects probabilities.

Why does Keras expect probabilities? Doesn't it use the nn.softmax_cross_entropy_with_logits function? Does that make sense?

Thanks.

Answer 1:

You're right, that's exactly what's happening. I believe this is due to historical reasons.

Keras was created before Tensorflow, as a wrapper around Theano. In Theano, one has to compute sigmoid/softmax manually and then apply the cross-entropy loss function. Tensorflow does everything in one fused op, but the API with a sigmoid/softmax layer had already been adopted by the community.

If you want to avoid unnecessary logit <-> probability conversions, call the binary_crossentropy loss with from_logits=True and don't add the sigmoid layer.
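
To illustrate why the two setups are equivalent, here is a minimal pure-Python sketch that mirrors the backend function quoted in the question for a single scalar. It is not the Keras implementation: the names sigmoid, raw and label are hypothetical, EPSILON = 1e-7 is an assumed value for _EPSILON, and the loss uses the numerically stable formula given in the Tensorflow documentation for sigmoid_cross_entropy_with_logits.

```python
import math

EPSILON = 1e-7  # assumption: stands in for Keras' _EPSILON


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_cross_entropy_with_logits(label, logit):
    """Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def binary_crossentropy(output, target, from_logits=False):
    """Scalar mirror of the backend function quoted in the question."""
    if not from_logits:
        # Probability -> clip -> logit, as in the quoted source.
        output = min(max(output, EPSILON), 1.0 - EPSILON)
        output = math.log(output / (1.0 - output))
    return sigmoid_cross_entropy_with_logits(target, output)


raw = 0.8    # hypothetical pre-activation output of the last layer
label = 1.0

# Path A: sigmoid "layer" first, then the loss with from_logits=False.
loss_a = binary_crossentropy(sigmoid(raw), label)
# Path B: no sigmoid layer, loss called with from_logits=True.
loss_b = binary_crossentropy(raw, label, from_logits=True)

print(loss_a, loss_b)
```

For a moderate raw output like this one, both paths give the same loss up to floating-point error; path B simply skips the sigmoid, the clipping and the logarithm.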



Answer 2:

In categorical cross-entropy:

  • if the input is a prediction (probabilities), it computes the cross-entropy directly
  • if the input is logits, it applies softmax cross-entropy with logits

In binary cross-entropy:

  • if the input is a prediction (probabilities), it converts it back to logits and then applies sigmoid cross-entropy with logits
  • if the input is logits, it applies sigmoid cross-entropy with logits directly
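
The categorical case above can be sketched in plain Python as follows. This is a simplified illustration for a single sample, not the Keras backend code: the function names, the eps value of 1e-7 and the example logits are assumptions made for the sketch.

```python
import math


def softmax(logits):
    """Softmax with the usual max-subtraction for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(output, target, from_logits=False, eps=1e-7):
    """Single-sample sketch of the two branches described above."""
    if from_logits:
        # Logits: fused softmax cross-entropy via log-sum-exp.
        m = max(output)
        log_z = m + math.log(sum(math.exp(v - m) for v in output))
        return -sum(t * (v - log_z) for t, v in zip(target, output))
    # Probabilities: clip, then compute the cross-entropy directly.
    clipped = [min(max(p, eps), 1.0 - eps) for p in output]
    return -sum(t * math.log(p) for t, p in zip(target, clipped))


logits = [2.0, 0.5, -1.0]   # hypothetical raw outputs for three classes
target = [1.0, 0.0, 0.0]    # one-hot label

loss_from_probs = categorical_crossentropy(softmax(logits), target)
loss_from_logits = categorical_crossentropy(logits, target, from_logits=True)

print(loss_from_probs, loss_from_logits)
```

Unlike the binary case, the probability branch here never converts back to logits; it only takes the logarithm of the clipped probabilities.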


Answer 3:

In Keras, by default we use a sigmoid activation on the output layer and then the Keras binary_crossentropy loss function, independent of the backend implementation (Theano, Tensorflow or CNTK).

If you look more closely at the pure Tensorflow case, you find that the Tensorflow backend binary_crossentropy function (which you pasted in your question) uses tf.nn.sigmoid_cross_entropy_with_logits. The latter function also applies the sigmoid activation. To avoid a double sigmoid, the Tensorflow backend binary_crossentropy will by default (with from_logits=False) calculate the inverse sigmoid, logit(x) = log(x / (1 - x)), to bring the output back to the raw state it had when it left the network, before any activation.
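
The inverse-sigmoid step can be checked numerically with a short sketch. The helper names are hypothetical and EPSILON = 1e-7 is an assumed value for the backend's _EPSILON; the clipping follows the source quoted in the question.

```python
import math

EPSILON = 1e-7  # assumption: stands in for Keras' _EPSILON


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def recovered_logit(p, eps=EPSILON):
    """Clip a probability, then invert the sigmoid: log(p / (1 - p))."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


# Largest logit that can survive the clipping step.
max_logit = math.log((1.0 - EPSILON) / EPSILON)

for x in (-3.0, 0.0, 2.5, 30.0):
    # Moderate values come back unchanged; 30.0 is capped near max_logit.
    print(x, recovered_logit(sigmoid(x)))
```

With the assumed epsilon, the recovered value saturates at roughly 16.1 in magnitude, so the round trip is exact only for raw outputs inside that range.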

The extra sigmoid activation and the inverse sigmoid calculation can both be avoided by using no sigmoid activation in your last layer and then calling the Tensorflow backend binary_crossentropy with from_logits=True (or by using tf.nn.sigmoid_cross_entropy_with_logits directly).