For example, I have a net that takes a tensor of shape [N, 7] (N is the number of samples) as input and produces a tensor of shape [N, 4] as output, where the "4" represents the probabilities of the different classes.
The training data's labels are a tensor of shape [N], with values from 0 to 3 (the ground-truth class index).
Here's my question: I've seen some demos that directly apply the loss function to the output tensor and the label tensor. I wonder why this works, since they have different sizes, and their sizes don't seem to fit the "broadcasting semantics".
Here’s the minimal demo.
import torch
import torch.nn as nn
import torch.optim as optim

if __name__ == '__main__':
    features = torch.randn(2, 7)   # input batch, shape [N, 7]
    gt = torch.tensor([1, 1])      # ground-truth class indices, shape [N]

    model = nn.Sequential(
        nn.Linear(7, 4),
        nn.ReLU(),
        nn.Linear(4, 4)            # output scores, shape [N, 4]
    )

    optimizer = optim.SGD(model.parameters(), lr=0.005)
    f = nn.CrossEntropyLoss()

    for epoch in range(1000):
        optimizer.zero_grad()
        output = model(features)
        loss = f(output, gt)       # [N, 4] output vs. [N] labels
        loss.backward()
        optimizer.step()
This works because nn.CrossEntropyLoss does not rely on broadcasting at all: by design it expects the input to be raw, unnormalized scores (logits) of shape [N, C] and the target to be class indices of shape [N], one index per sample. In PyTorch the loss is defined as follows.
Link to the documentation: https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss
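For a single sample with logit vector x and target index class, the documented formula is roughly:

loss(x, class) = -log( exp(x[class]) / sum_j exp(x[j]) ) = -x[class] + log( sum_j exp(x[j]) )

With the default reduction='mean', this per-sample loss is averaged over the batch.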
So, implementing this formula by hand in PyTorch and comparing it with nn.CrossEntropyLoss is a good sanity check. Below is a minimal sketch, assuming a made-up batch of 2 samples and 4 classes (the same shapes as in your demo); it only uses standard torch / torch.nn calls:
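import torch
import torch.nn as nn

# Hypothetical example values: 2 samples, 4 classes (same shapes as in the question)
logits = torch.randn(2, 4)     # raw scores from the last Linear layer, shape [N, C]
target = torch.tensor([1, 1])  # ground-truth class indices, shape [N]

# Manual computation of the documented formula, averaged over the batch
log_sum_exp = torch.log(torch.exp(logits).sum(dim=1))    # log(sum_j exp(x[j])) for each sample
picked = logits[torch.arange(logits.size(0)), target]    # x[class] for each sample
manual_loss = (-picked + log_sum_exp).mean()

# Built-in version
builtin_loss = nn.CrossEntropyLoss()(logits, target)

print(manual_loss.item(), builtin_loss.item())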
Output: both printed values should be identical, which shows that nn.CrossEntropyLoss simply applies this formula to the [N, C] logits and the [N] class indices and averages over the batch.
I hope this helps and sorry for the confusion.