I'm currently building a model with TensorFlow (version 1.8, OS: Ubuntu MATE 16.04). The model's purpose is to detect and match keypoints of the human body. While training, the error "No gradients for any variable" occurred, and I'm having difficulty fixing it.
Background of the model: Its basic ideas came from these two papers:
- Deep Learning of Binary Hash Codes for fast Image Retrieval
- Learning Compact Binary Descriptors with Unsupervised Deep Neural Networks
They showed it's possible to match images according to Hash codes generated from a convolutional network. The similarity of two pictures is determined by the Hamming distance between their corresponding hash codes.
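To make the similarity measure concrete, here is a minimal sketch of the Hamming distance between two binary hash codes. The function name and the 8-bit example codes are made up for illustration; the model described below uses 32-bit codes.

```python
def hamming_distance(code_a, code_b):
    """Count the bit positions at which two equal-length hash codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("hash codes must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


# Two hypothetical 8-bit codes (illustrative values only).
code_x = [1, 0, 1, 1, 0, 0, 1, 0]
code_y = [1, 1, 1, 0, 0, 0, 1, 1]
print(hamming_distance(code_x, code_y))  # 3
```

A smaller distance means the two image patches are considered more similar.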
I think it's possible to develop an extremely lightweight model to perform real-time human pose estimation on a video with a "constant human subject" and a "fixed background".
Model Structure
01.Data source:
3 images from one video with the same human subject and a similar background. All human keypoints in each image are labeled. 2 of the images are used as the "hint sources" and the last image is the target for keypoint detection/matching.
02.Hints:
23x23-pixel ROIs are cropped from the "hint source" images at the locations of the human keypoints, so that each ROI is centered on a keypoint.
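The cropping step can be sketched roughly as follows. This is not the getPaddedROI helper imported in the code further down (its implementation is not shown here); roi_bounds and crop_roi are hypothetical names, the image is a single-channel list of rows for brevity, and padding with a constant outside the image is an assumption.

```python
def roi_bounds(center_x, center_y, roi_size=23):
    """Return (left, top, right, bottom) of a square ROI centered on a keypoint.

    right and bottom are exclusive, so right - left == roi_size.
    """
    if roi_size % 2 == 0:
        raise ValueError("roi_size must be odd so the keypoint is the exact center")
    half = roi_size // 2
    return (center_x - half, center_y - half,
            center_x + half + 1, center_y + half + 1)


def crop_roi(image, center_x, center_y, roi_size=23, pad_value=0):
    """Crop the ROI from a 2-D list-of-rows image.

    Pixels that fall outside the image are filled with pad_value (assumed behavior).
    """
    left, top, right, bottom = roi_bounds(center_x, center_y, roi_size)
    height, width = len(image), len(image[0])
    return [[image[y][x] if 0 <= y < height and 0 <= x < width else pad_value
             for x in range(left, right)]
            for y in range(top, bottom)]


# Hypothetical 30x30 image whose pixel value encodes its own position.
image = [[y * 100 + x for x in range(30)] for y in range(30)]
roi = crop_roi(image, center_x=15, center_y=12)
print(len(roi), len(roi[0]))  # 23 23
print(roi[11][11])            # 1215, the keypoint pixel sits at the ROI center
```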
03.Convolutional network "for Hints":
A simple 3-layer structure. The first two layers are 3x3 convolutions with stride [2,2]. The last layer is a 5x5 convolution on a 5x5 input with no padding (equivalent to a fully connected layer).
This turns a 23x23-pixel hint ROI into one 32-bit hash code. One hint source image generates a set of 16 hash codes.
04.Convolutional network "for target image":
The network shares the same weights with the hint network, but in this case each convolution layer uses padding. The 301x301-pixel image is turned into a 76x76 "hash map".
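The sizes quoted above (one hash code per 23x23 ROI, and a 76x76 hash map for a 301x301 image) can be checked with a short calculation. This is a sketch that assumes TensorFlow's usual 'valid' and 'same' output-size rules; the helper names are made up for illustration.

```python
import math


def conv_output_size(input_size, kernel_size, stride, padding):
    """Spatial output size of one convolution under 'valid' or 'same' padding."""
    if padding == "valid":
        return (input_size - kernel_size) // stride + 1
    if padding == "same":
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'valid' or 'same'")


# (kernel_size, stride) of the three layers described above.
LAYERS = [(3, 2), (3, 2), (5, 1)]


def network_output_size(input_size, padding):
    """Apply the three layers in sequence and return the final spatial size."""
    size = input_size
    for kernel_size, stride in LAYERS:
        size = conv_output_size(size, kernel_size, stride, padding)
    return size


print(network_output_size(23, "valid"))  # 1   (23 -> 11 -> 5 -> 1)
print(network_output_size(301, "same"))  # 76  (301 -> 151 -> 76 -> 76)
```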
05.Hash matching:
I made a function called "locateMin_and_get_loss" to calculate the Hamming distance between the "hint hash" and the hash code at each point of the hash map. This function creates a "distance map". The location of the point with the lowest distance value is treated as the location of the keypoint.
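As a rough illustration of the matching idea (not the actual locateMin_and_get_loss implementation), the sketch below builds a distance map for one hint code and picks the position with the smallest distance. The function names, the 4-bit codes and the 2x3 hash map are hypothetical; the real map is 76x76 with 32-bit codes.

```python
def distance_map(hint_code, hash_map):
    """Hamming distance between one hint code and the code at every map position.

    hash_map is a list of rows; each cell is a list of bits.
    """
    return [[sum(a != b for a, b in zip(hint_code, cell)) for cell in row]
            for row in hash_map]


def locate_min(dist_map):
    """Return (row, col) of the smallest distance; ties go to the first one found."""
    best = None
    for r, row in enumerate(dist_map):
        for c, value in enumerate(row):
            if best is None or value < best[0]:
                best = (value, r, c)
    return best[1], best[2]


hint = [1, 0, 1, 1]
tiny_hash_map = [
    [[0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 0]],
    [[1, 0, 0, 1], [1, 0, 1, 1], [0, 0, 1, 1]],
]
dists = distance_map(hint, tiny_hash_map)
print(dists)              # [[3, 1, 4], [1, 0, 1]]
print(locate_min(dists))  # (1, 1)
```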
06.Loss calculation:
I made a function "get_total_loss_and_result" to calculate the total loss over the 16 keypoints. The loss is the normalized Euclidean distance between the ground-truth label points and the points located by the model.
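A minimal sketch of such a loss is shown below. It is not the get_total_loss_and_result function itself: the normalization constant is not stated above, so dividing by the diagonal of the 76x76 map is an assumption, and the function names are hypothetical.

```python
import math


def normalized_distance(pred, truth, map_size=76):
    """Euclidean distance between two (row, col) points divided by the map diagonal.

    Normalizing by the diagonal is an assumption made for this sketch.
    """
    diagonal = math.sqrt(2.0) * (map_size - 1)
    return math.hypot(pred[0] - truth[0], pred[1] - truth[1]) / diagonal


def total_loss(pred_points, truth_points, map_size=76):
    """Sum of the normalized distances over all keypoints."""
    if len(pred_points) != len(truth_points):
        raise ValueError("need one prediction per ground-truth keypoint")
    return sum(normalized_distance(p, t, map_size)
               for p, t in zip(pred_points, truth_points))


# Two hypothetical keypoints: one located exactly, one off by (3, 4) cells.
predicted = [(10, 20), (33, 44)]
ground_truth = [(10, 20), (30, 40)]
print(total_loss(predicted, ground_truth))
```

With this normalization each keypoint contributes a value between 0 (exact match) and 1 (opposite corners of the map).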
07.Proposed workflow:
Before initializing this model, the user takes two pictures of the target human subject from different angles. The pictures are labeled by a state-of-the-art model such as OpenPose or DeepPose, and hint hashes are generated from them with the convolutional network described in 03.
Finally, the video stream is started and processed by the model.
08.Why "Two" sets of hints?
One human joint/keypoint observed from different angles can look very different. Instead of increasing the dimensionality of the neural network, I want to "cheat the game" by gathering two hints instead of one. I want to know whether this can increase the precision and generalization capacity of the model.
The problems I faced:
01.The "No gradients for any variable" error (the main question of this post):
As mentioned above, I get this error while training the model. I tried to learn from posts like this and this and this, but I still have no clue, even after checking the computational graph.
02.The "Batch" problem:
Due to the model's unusual structure, it's hard to use a conventional placeholder to hold the input data for multiple batches. I worked around it by setting the batch size to 3 and manually combining the values of the loss functions.
2018.10.28 Edit:
The simplified version with only one hint set:
import tensorflow as tf
import numpy as np
import time
from imageLoader import getPaddedROI,training_data_feeder
import math
'''
created by Cid Zhang
a sub-model for human pose estimation
'''
def truncated_normal_var(name,shape,dtype):
    return(tf.get_variable(name=name, shape=shape, dtype=dtype, initializer=tf.truncated_normal_initializer(stddev=0.01)))
def zero_var(name,shape,dtype):
    return(tf.get_variable(name=name, shape=shape, dtype=dtype, initializer=tf.constant_initializer(0.0)))
roi_size = 23
image_input_size = 301
#input placeholders
#batch1 hints
inputs_b1h1 = tf.placeholder(tf.float32, ( 16, roi_size, roi_size, 3), name='inputs_b1h1')
inputs_s = tf.placeholder(tf.float32, (None, image_input_size, image_input_size, 3), name='inputs_s')
labels = tf.placeholder(tf.float32,(16,76,76), name='labels')
#define the model
def paraNet(input):
    out_l1 = tf.layers.conv2d(input, 8, [3, 3],strides=(2, 2), padding ='valid' ,name='para_conv_1')
    out_l1 = tf.nn.relu6(out_l1)
    out_l2 = tf.layers.conv2d(out_l1, 16, [3, 3],strides=(2, 2), padding ='valid' ,name='para_conv_2')
    out_l2 = tf.nn.relu6(out_l2)
    out_l3 = tf.layers.conv2d(out_l2, 32, [5, 5],strides=(1, 1), padding ='valid' ,name='para_conv_3')
    return out_l3
#network pipeline to create the first Hint Hash Sets (Three batches)
with tf.variable_scope('conv'):
    out_b1h1_l3 = paraNet(inputs_b1h1)
    #flatten and binarize the hashes
    out_b1h1_l3 =tf.squeeze( tf.round(tf.nn.sigmoid(out_b1h1_l3)) )
with tf.variable_scope('conv', reuse=True):
    out_2_l1 = tf.layers.conv2d(inputs_s, 8, [3, 3],strides=(2, 2), padding ='same' ,name='para_conv_1')
    out_2_l1 = tf.nn.relu6(out_2_l1)
    out_2_l2 = tf.layers.conv2d(out_2_l1, 16, [3, 3],strides=(2, 2), padding ='same' ,name='para_conv_2')
    out_2_l2 = tf.nn.relu6(out_2_l2)
    out_2_l3 = tf.layers.conv2d(out_2_l2, 32, [5, 5],strides=(1, 1), padding ='same' ,name='para_conv_3')
    #binarize the value into hash code
    out_2_l3 = tf.round(tf.nn.sigmoid(out_2_l3))
orig_feature_map_size = tf.shape(out_2_l3)[1]
#calculate Hamming distance maps
map0 = tf.reduce_sum ( tf.abs (tf.subtract( out_b1h1_l3[0] , out_2_l3 ) ) , axis=3 )
map1 = tf.reduce_sum ( tf.abs (tf.subtract( out_b1h1_l3[1] , out_2_l3 ) ) , axis=3 )
map2 = tf.reduce_sum ( tf.abs (tf.subtract( out_b1h1_l3[2] , out_2_l3 ) ) , axis=3 )
map3 = tf.reduce_sum ( tf.abs (tf.subtract( out_b1h1_l3[3] , out_2_l3 ) ) , axis=3 )
map4 = tf.reduce_sum ( tf.abs (tf.subtract( out_b1h1_l3[4] , out_2_l3 ) ) , axis=3 )
map5 = tf.reduce_sum ( tf.abs (tf.subtract( out_b1h1_l3[5] , out_2_l3 ) ) , axis=3 )
map6 = tf.reduce_sum ( tf.abs (tf.subtract( out_b1h1_l3[6] , out_2_l3 ) ) , axis=3 )
map7 = tf.reduce_sum ( tf.abs (tf.subtract( out_b1h1_l3[7] , out_2_l3 ) ) , axis=3 )
map8 = tf.reduce_sum ( tf.abs (tf.subtract( out_b1h1_l3[8] , out_2_l3 ) ) , axis=3 )
map9 = tf.reduce_sum ( tf.abs (tf.subtract( out_b1h1_l3[9] , out_2_l3 ) ) , axis=3 )
map10 = tf.reduce_sum ( tf.abs (tf.subtract( out_b1h1_l3[10] , out_2_l3 ) ) , axis=3 )
map11 = tf.reduce_sum ( tf.abs (tf.subtract( out_b1h1_l3[11] , out_2_l3 ) ) , axis=3 )
map12 = tf.reduce_sum ( tf.abs (tf.subtract( out_b1h1_l3[12] , out_2_l3 ) ) , axis=3 )
map13 = tf.reduce_sum ( tf.abs (tf.subtract( out_b1h1_l3[13] , out_2_l3 ) ) , axis=3 )
map14 = tf.reduce_sum ( tf.abs (tf.subtract( out_b1h1_l3[14] , out_2_l3 ) ) , axis=3 )
map15 = tf.reduce_sum ( tf.abs (tf.subtract( out_b1h1_l3[15] , out_2_l3 ) ) , axis=3 )
totoal_map =tf.div( tf.concat([map0, map1, map2, map3, map4, map5, map6, map7,
                               map8, map9, map10,map11,map12, map13, map14, map15], 0) , 32)
loss = tf.nn.l2_loss(totoal_map - labels , name = 'loss' )
#ValueError: No gradients provided for any variable, check your graph for ops that do not support gradients, between variables
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss )
init = tf.global_variables_initializer()
batchsize = 3
with tf.Session() as sess:
    #writer = tf.summary.FileWriter("./variable_graph",graph = sess.graph)
    sess.run(init)
    #load image from dataset(train set)
    joint_data_path = "./custom_data.json"
    train_val_path = "./train_val_indices.json"
    imgpath = "./000/"
    input_size = 301
    hint_roi_size = 23
    hintSet01_norm_batch = []
    hintSet02_norm_batch = []
    t_img_batch = []
    t_label_norm_batch = []
    #load data
    hintSet01,hintSet02,t_img,t_label_norm = training_data_feeder(joint_data_path, train_val_path, imgpath, input_size, hint_roi_size )
    #Normalize the image pixel values to 0~1
    hintSet01_norm = []
    hintSet02_norm = []
    t_img = np.float32(t_img /255.0)
    for rois in hintSet01:
        tmp = np.float32(rois / 255.0)
        hintSet01_norm.append(tmp.tolist())
    for rois in hintSet02:
        tmp = np.float32(rois / 255.0)
        hintSet02_norm.append(tmp.tolist())
    print(tf.trainable_variables())
    temp = sess.run(totoal_map , feed_dict={inputs_s: [t_img] ,
                                            inputs_b1h1: hintSet01_norm,
                                            labels: t_label_norm
                                            })
    print(temp)
    print(np.shape(temp))
The code: https://github.com/gitpharm01/Parapose/blob/master/paraposeNetworkV3.py
The Tensorflow graph: https://github.com/gitpharm01/Parapose/blob/master/variable_graph/events.out.tfevents.1540296979.pharmboy-K30AD-M31AD-M51AD
The Dataset:
It's a custom dataset generated from the MPII dataset. It has 223 clusters of images. Each cluster has one constant human subject in various poses, and the background remains the same. Each cluster has at least 3 pictures. It's about 627MB and I'll try to pack it and upload it later.
2018.10.26 Edit:
You can download it from Google Drive; the whole dataset is divided into 9 parts. (I can't post more than 8 links in this article.) The links are in this file: https://github.com/gitpharm01/Parapose/blob/master/000/readme.md