As an exercise in the PyTorch framework (0.4.1), I am trying to display the gradient of X (gX or dSdX) in a simple Linear layer (Z = X.W + B). To simplify my toy example, I call backward() on the sum of Z (not on a loss).
To sum up, I want gX (dSdX) of S = sum(XW + B).
The problem is that the gradient of Z (dSdZ) is None. As a result, gX is wrong too, of course.
import torch
X = torch.tensor([[0.5, 0.3, 2.1], [0.2, 0.1, 1.1]], requires_grad=True)
W = torch.tensor([[2.1, 1.5], [-1.4, 0.5], [0.2, 1.1]])
B = torch.tensor([1.1, -0.3])
Z = torch.nn.functional.linear(X, weight=W.t(), bias=B)
S = torch.sum(Z)
S.backward()
print("Z:\n", Z)
print("gZ:\n", Z.grad)
print("gX:\n", X.grad)
Result:
Z:
tensor([[2.1500, 2.9100],
[1.6000, 1.2600]], grad_fn=<ThAddmmBackward>)
gZ:
None
gX:
tensor([[ 3.6000, -0.9000, 1.3000],
[ 3.6000, -0.9000, 1.3000]])
I have exactly the same result if I use nn.Module as below:
class Net1Linear(torch.nn.Module):
    def __init__(self, wi, wo, W, B):
        super(Net1Linear, self).__init__()
        self.linear1 = torch.nn.Linear(wi, wo)
        self.linear1.weight = torch.nn.Parameter(W.t())
        self.linear1.bias = torch.nn.Parameter(B)

    def forward(self, x):
        return self.linear1(x)
net = Net1Linear(3,2,W,B)
Z = net(X)
S = torch.sum(Z)
S.backward()
print("Z:\n", Z)
print("gZ:\n", Z.grad)
print("gX:\n", X.grad)
First of all, gradients are only computed for tensors where you enable them by setting requires_grad to True.
So your output is just as one would expect: you get the gradient for X.
PyTorch does not save the gradients of intermediate results, for performance reasons. Z is such an intermediate (non-leaf) result, so Z.grad stays None; only tensors you created yourself with requires_grad set to True, like X, keep their gradient after backward().
However, you can use register_hook to extract the intermediate gradient during the backward pass or to save it manually. Here I just save it to the grad attribute of tensor Z:
import torch
# function to extract grad
def set_grad(var):
    def hook(grad):
        var.grad = grad
    return hook
X = torch.tensor([[0.5, 0.3, 2.1], [0.2, 0.1, 1.1]], requires_grad=True)
W = torch.tensor([[2.1, 1.5], [-1.4, 0.5], [0.2, 1.1]])
B = torch.tensor([1.1, -0.3])
Z = torch.nn.functional.linear(X, weight=W.t(), bias=B)
# register_hook for Z
Z.register_hook(set_grad(Z))
S = torch.sum(Z)
S.backward()
print("Z:\n", Z)
print("gZ:\n", Z.grad)
print("gX:\n", X.grad)
This will output:
Z:
tensor([[2.1500, 2.9100],
[1.6000, 1.2600]], grad_fn=<ThAddmmBackward>)
gZ:
tensor([[1., 1.],
[1., 1.]])
gX:
tensor([[ 3.6000, -0.9000, 1.3000],
[ 3.6000, -0.9000, 1.3000]])
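As an alternative to the hook, a non-leaf tensor can be asked to keep its own gradient by calling retain_grad() on it before backward(). This is a minimal sketch that reuses the tensors above and assumes a PyTorch version that provides Tensor.retain_grad():

```python
import torch

X = torch.tensor([[0.5, 0.3, 2.1], [0.2, 0.1, 1.1]], requires_grad=True)
W = torch.tensor([[2.1, 1.5], [-1.4, 0.5], [0.2, 1.1]])
B = torch.tensor([1.1, -0.3])

Z = torch.nn.functional.linear(X, weight=W.t(), bias=B)
Z.retain_grad()  # ask autograd to keep the gradient of this non-leaf tensor

S = torch.sum(Z)
S.backward()

print("gZ:\n", Z.grad)  # populated because retain_grad() was called before backward()
print("gX:\n", X.grad)
```

The hook version is more flexible (it can also print or modify the gradient), while retain_grad() is enough if you only want to read Z.grad afterwards.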
Hope this helps!
By the way: normally you would want gradients enabled for your parameters, that is, your weights and biases. As it stands, an optimizer would alter your inputs X and not your weights W and bias B. So usually requires_grad is set to True for W and B in such a case.
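To illustrate, here is a small sketch of the same toy example with gradients enabled on W and B instead of X. The values in the comments follow from S = sum(XW + B): every entry in row i of gW is the sum of column i of X, and every entry of gB is the number of samples.

```python
import torch

X = torch.tensor([[0.5, 0.3, 2.1], [0.2, 0.1, 1.1]])  # plain input, no gradient
W = torch.tensor([[2.1, 1.5], [-1.4, 0.5], [0.2, 1.1]], requires_grad=True)
B = torch.tensor([1.1, -0.3], requires_grad=True)

Z = torch.nn.functional.linear(X, weight=W.t(), bias=B)
S = torch.sum(Z)
S.backward()

print("gW:\n", W.grad)  # rows are approximately [0.7, 0.7], [0.4, 0.4], [3.2, 3.2]
print("gB:\n", B.grad)  # [2., 2.], one contribution per sample
print("gX:\n", X.grad)  # None, because X does not require a gradient here
```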
blue-phoenox, thanks for your answer. I am pretty happy to have learned about register_hook().
What led me to think that my gX was wrong is that it was independent of the values of X. I will have to do the math to understand it. But using a CCE loss instead of SUM makes things much cleaner, so I updated the example for those who might be interested. Using SUM was a bad idea in this case.
T_dec = torch.tensor([0, 1])
X = torch.tensor([[0.5, 0.8, 2.1], [0.7, 0.1, 1.1]], requires_grad=True)
W = torch.tensor([[2.7, 0.5], [-1.4, 0.5], [0.2, 1.1]])
B = torch.tensor([1.1, -0.3])
Z = torch.nn.functional.linear(X, weight=W.t(), bias=B)
print("Z:\n", Z)
L = torch.nn.CrossEntropyLoss()(Z,T_dec)
Z.register_hook(lambda gZ: print("gZ:\n",gZ))
L.backward()
print("gX:\n", X.grad)
Result:
Z:
tensor([[1.7500, 2.6600],
[3.0700, 1.3100]], grad_fn=<ThAddmmBackward>)
gZ:
tensor([[-0.3565, 0.3565],
[ 0.4266, -0.4266]])
gX:
tensor([[-0.7843, 0.6774, 0.3209],
[ 0.9385, -0.8105, -0.3839]])
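For anyone who wants to do the math by hand: for Z = XW + B the chain rule gives gX = gZ · W^T, so X can only influence gX through gZ. With S = sum(Z), gZ is all ones, which is why every row of gX was the same (the row sums of W). With cross-entropy averaged over the batch, gZ = (softmax(Z) - one_hot(T)) / N, which does depend on X. The plain-Python sketch below reproduces both results. The helper names (matmul, transpose, softmax) are made up for illustration and are not PyTorch functions.

```python
import math

def matmul(a, b):
    # Plain nested-list matrix product a @ b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    # Numerically stable softmax of one row of logits.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Case 1: S = sum(XW + B) with the W of the first example. gZ is all ones.
W_sum = [[2.1, 1.5], [-1.4, 0.5], [0.2, 1.1]]
gZ_sum = [[1.0, 1.0], [1.0, 1.0]]
gX_sum = matmul(gZ_sum, transpose(W_sum))
print("gX (sum):", gX_sum)  # each row is approximately [3.6, -0.9, 1.3]

# Case 2: cross-entropy (mean over the batch) with the updated example.
T_dec = [0, 1]
X = [[0.5, 0.8, 2.1], [0.7, 0.1, 1.1]]
W = [[2.7, 0.5], [-1.4, 0.5], [0.2, 1.1]]
B = [1.1, -0.3]

Z = [[z + b for z, b in zip(row, B)] for row in matmul(X, W)]
N = len(Z)
# gZ = (softmax(Z) - one_hot(T)) / N
gZ = [
    [(p - (1.0 if j == t else 0.0)) / N for j, p in enumerate(softmax(row))]
    for row, t in zip(Z, T_dec)
]
gX = matmul(gZ, transpose(W))
print("gZ (CE):", gZ)  # approximately [[-0.3565, 0.3565], [0.4266, -0.4266]]
print("gX (CE):", gX)  # approximately [[-0.7843, 0.6774, 0.3209], [0.9385, -0.8105, -0.3839]]
```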