I want to provide a mask the same size as the input image and adjust the weights learned from the image according to this mask (similar to attention, but pre-computed for each input image). How can I do this with Keras (or TensorFlow)?

Question

How can I add another feature layer to an image, such as a mask, and have the neural network take this new feature layer into account?

Answer

The short answer is to add it as another colour channel to the image. If your image already has 3 colour channels (red, green and blue), then adding a fourth channel holding the mask's 1's and 0's gives the neural network extra information to use when making decisions.
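As a minimal sketch of that idea, assuming channels-last NumPy arrays, the mask can be appended as a fourth channel before the image is fed to the network. The helper name `add_mask_channel` and the random image are made up for illustration.

```python
import numpy as np


def add_mask_channel(image, mask):
    """Append a binary mask as an extra channel on the last axis.

    image: array of shape (height, width, channels)
    mask:  array of shape (height, width) holding 0's and 1's
    """
    if image.shape[:2] != mask.shape:
        raise ValueError("mask must match the image height and width")
    mask_channel = mask.astype(image.dtype)[..., np.newaxis]
    return np.concatenate([image, mask_channel], axis=-1)


# Hypothetical data: a random 32x32 RGB image and a mask covering its left half.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3)).astype(np.float32)
mask = np.zeros((32, 32), dtype=np.float32)
mask[:, :16] = 1.0

stacked = add_mask_channel(image, mask)
print(stacked.shape)  # (32, 32, 4)
```

The only change on the model side is the input shape, which becomes (height, width, 4) instead of (height, width, 3).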

Thought Experiment

As a thought experiment, let's tackle MNIST. MNIST images are 28x28. Let's take 1 image, the 'true' image, and 3 other images, the 'distractions', and tile the four 28x28 images into one 56x56 image. MNIST is greyscale, so it has only 1 colour channel, brightness. Let's now add a second channel which is a mask: 1's in the area of the 56x56 image where the 'true' image is and 0's elsewhere.

If we use the same architecture as usual for solving MNIST, convolutions all the way down, we can imagine that the network can use this new information to learn to pay attention only to the 'true' area and classify the image correctly.
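The input for this thought experiment could be built roughly like this. It is only a sketch: random arrays stand in for real MNIST digits, and the function name `make_composite` is made up for illustration.

```python
import numpy as np


def make_composite(true_digit, distractions, rng):
    """Tile one 'true' 28x28 image and three distractions into a 56x56x2 array.

    Channel 0 holds brightness; channel 1 is the mask (1 over the 'true' image).
    """
    tiles = [true_digit] + list(distractions)
    order = rng.permutation(4)  # random assignment of tiles to quadrants
    composite = np.zeros((56, 56, 2), dtype=np.float32)
    for quadrant, tile_index in enumerate(order):
        row, col = divmod(quadrant, 2)
        rows = slice(row * 28, (row + 1) * 28)
        cols = slice(col * 28, (col + 1) * 28)
        composite[rows, cols, 0] = tiles[tile_index]
        if tile_index == 0:  # tile 0 is the 'true' image
            composite[rows, cols, 1] = 1.0
    return composite


rng = np.random.default_rng(0)
# Stand-ins for MNIST digits: random brightness values in [0, 1).
true_digit = rng.random((28, 28)).astype(np.float32)
distractions = [rng.random((28, 28)).astype(np.float32) for _ in range(3)]

composite = make_composite(true_digit, distractions, rng)
print(composite.shape)  # (56, 56, 2)
```

A convolutional network would then take inputs of shape (56, 56, 2) rather than (28, 28, 1), with the label being the class of the 'true' digit.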

Code Example

In this example we try to solve the XOR problem. We take the classic XOR inputs, double the length of the input by adding noise, and add a channel that is 1 for the real (non-noise) values and 0 for the noise.


# Adapted from https://github.com/panchishin/learn-to-tensorflow/blob/master/solutions/04-xor-2d.py

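import numpy as np

# -- The xor problem --
# Four input pairs and their one-hot labels: [1, 0] = XOR is 0, [0, 1] = XOR is 1
x = np.array([[0., 0.], [1., 1.], [1., 0.], [0., 1.]])
y_ = [[1., 0.], [1., 0.], [0., 1.], [0., 1.]]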


def makeBatch() :
    # Each sample has 4 positions and 2 channels (value, mask).
    # The two real 'x' values are placed at a random offset; the other
    # two positions keep random 0/1 noise.
    global x
    rx = np.random.rand(4,4,2) > 0.5
    # set the mask to 0 for all items
    rx[:,:,1] = 0
    index = int(np.random.random()*3)
    rx[:,index:index+2,0] = x
    # set the mask to 1 for 'real' values
    rx[:,index:index+2,1] = 1
    return rx

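# -- imports --
# Note: this uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session).
# On TensorFlow 2.x, replace the import below with
#   import tensorflow.compat.v1 as tf
#   tf.disable_v2_behavior()
import tensorflow as tf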

# print numpy values with 2 decimal places and without scientific notation
np.set_printoptions(precision=2, suppress=True)


# -- induction --

# Layer 0
x0 = tf.placeholder(dtype=tf.float32, shape=[None, 4, 2])
y0 = tf.placeholder(dtype=tf.float32, shape=[None, 2])

# Layer 1
f1 = tf.reshape(x0,shape=[-1,8])
m1 = tf.Variable(tf.random_uniform([8, 9], minval=0.1, maxval=0.9, dtype=tf.float32))
b1 = tf.Variable(tf.random_uniform([9], minval=0.1, maxval=0.9, dtype=tf.float32))
h1 = tf.sigmoid(tf.matmul(f1, m1) + b1)

# Layer 2
m2 = tf.Variable(tf.random_uniform([9, 2], minval=0.1, maxval=0.9, dtype=tf.float32))
b2 = tf.Variable(tf.random_uniform([2], minval=0.1, maxval=0.9, dtype=tf.float32))
y_out = tf.nn.softmax(tf.matmul(h1, m2) + b2)


# -- loss --

# loss : sum of the squares of y0 - y_out
loss = tf.reduce_sum(tf.square(y0 - y_out))

# training step : gradient descent (learning rate 1.0) to minimize loss
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)



# -- training --
# run 5000 training steps, each on a freshly generated batch
# print out the loss every 1000 steps, then the final weights and outputs
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    print("\nloss")
    for step in range(5000):
        sess.run(train, feed_dict={x0: makeBatch(), y0: y_})
        if (step + 1) % 1000 == 0:
            print(sess.run(loss, feed_dict={x0: makeBatch(), y0: y_}))

    results = sess.run([m1, b1, m2, b2, y_out, loss], feed_dict={x0: makeBatch(), y0: y_})
    labels = "m1,b1,m2,b2,y_out,loss".split(",")
    for label, result in zip(labels, results):
        print("")
        print(label)
        print(result)

print("")
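To make the input layout concrete, here is a NumPy-only sketch of the same batch construction that prints each sample and checks that the mask channel is 1 exactly over the two real XOR inputs. The name `make_batch` and the use of `np.random.default_rng` are my own choices for this check, not part of the code above.

```python
import numpy as np

x = np.array([[0., 0.], [1., 1.], [1., 0.], [0., 1.]])


def make_batch(rng):
    # shape (4 samples, 4 positions, 2 channels): channel 0 = value, channel 1 = mask
    batch = (rng.random((4, 4, 2)) > 0.5).astype(np.float32)
    batch[:, :, 1] = 0.0                    # mask off everywhere
    index = rng.integers(0, 3)              # offset of the real values: 0, 1 or 2
    batch[:, index:index + 2, 0] = x        # write the real XOR inputs
    batch[:, index:index + 2, 1] = 1.0      # mask on over the real values
    return batch


rng = np.random.default_rng(0)
batch = make_batch(rng)
for sample in batch:
    values, mask = sample[:, 0], sample[:, 1]
    real = values[mask == 1]                # the values the mask points at
    print("values", values, "mask", mask, "-> real inputs", real)
```

Selecting the values where the mask is 1 recovers the original XOR pair for every sample, which is the information the network has to learn to exploit.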

Output

We can see that the network correctly solves the problem and gives the correct output with high certainty.

y_ (truth) = [[1., 0.], [1., 0.], [0., 1.], [0., 1.]]

y_out
[[0.99 0.01]
 [0.99 0.01]
 [0.01 0.99]
 [0.01 0.99]]

loss
0.00056630466

Confirmation that the mask is doing something

Let's change makeBatch so that the mask channel is just random, by commenting out the lines that set 0's for noise and 1's for signal

def makeBatch() :
    global x
    rx = np.random.rand(4,4,2) > 0.5
    #rx[:,:,1] = 0
    index = int(np.random.random()*3)
    rx[:,index:index+2,0] = x
    #rx[:,index:index+2,1] = 1
    return rx

and then rerun the code. Indeed, we can see that the network no longer solves the problem: without a meaningful mask, two of the four outputs are uncertain or wrong and the loss stays high.

y_out
[[0.99 0.01]
 [0.76 0.24]
 [0.09 0.91]
 [0.58 0.42]]

loss
0.8080765

Conclusion

If you have some signal and noise in an image (or other data structure), and you add another channel (a mask) that indicates where the signal is and where the noise is, a neural net can leverage that mask to focus on the signal while still having access to the noise.