Convolutional neural network not converging

Posted 2019-08-09 17:58

Question:

I've been watching some videos on deep learning/convolutional neural networks, like here and here, and I tried to implement my own in C++. I tried to keep the input data fairly simple for my first attempt, so the idea is to differentiate between a cross and a circle. I have a small data set of around 25 of each (64*64 images), and they look like this:

The network itself is five layers:

Convolution (5 filters, size 3, stride 1, with a ReLU)
MaxPool (size 2) 
Convolution (1 filter, size 3, stride 1, with a ReLU)
MaxPool (size 2)
Linear Regression classifier
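
If I've done the arithmetic right (assuming no padding on the convolutions and non-overlapping pooling, flooring odd dimensions), the feature map sizes should work out roughly as:

    64*64 input
    -> Conv (5 filters, size 3)  -> 62*62*5
    -> MaxPool (size 2)          -> 31*31*5
    -> Conv (1 filter, size 3)   -> 29*29*1
    -> MaxPool (size 2)          -> 14*14
    -> Linear classifier over ~196 inputs (plus a bias)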

My issue is that my network isn't converging on anything. None of the weights appear to change. If I run it, the predictions mostly stay the same, other than the occasional outlier which jumps up before returning on the next iteration.

The convolutional layer training looks something like this (I've removed some loops to make it cleaner):

// Yeah, I know I should change the shared_ptr<float>
void ConvolutionalNetwork::Train(std::shared_ptr<float> input,std::shared_ptr<float> outputGradients, float label)
{
    float biasGradient = 0.0f;

    // Calculate the deltas with respect to the input.
    for (int layer = 0; layer < m_Filters.size(); ++layer)
    {
        // Pseudo-code, each loop on its own line in actual code
        For z < depth, x < width - filterSize, y < height - filterSize
        {               
            int newImageIndex = layer*m_OutputWidth*m_OutputHeight+y*m_OutputWidth + x;

            For the bounds of the filter (U,V)
            {
                // Find the index in the input image
                int imageIndex = x + (y+v)*m_OutputWidth + z*m_OutputHeight*m_OutputWidth;
                int kernelIndex = u + v*m_FilterSize + z*m_FilterSize*m_FilterSize;
                m_pGradients.get()[imageIndex] += outputGradients.get()[newImageIndex]*input.get()[imageIndex];
                m_GradientSum[layer].get()[kernelIndex] += m_pGradients.get()[imageIndex] * m_Filters[layer].get()[kernelIndex];

                biasGradient += m_GradientSum[layer].get()[kernelIndex];
            }       
        }
    }

    // Update the weights
    for (int layer = 0; layer < m_Filters.size(); ++layer)
    {
        For z < depth, U & V < filterSize
        {
            // Find the index in the kernel
            int kernelIndex = u + v*m_FilterSize + z*m_FilterSize*m_FilterSize;
            m_Filters[layer].get()[kernelIndex] -= learningRate*m_GradientSum[layer].get()[kernelIndex];
        }
        m_pBiases.get()[layer] -= learningRate*biasGradient;
    }
}

So I create a buffer (m_pGradients) with the same dimensions as the input buffer to feed the gradients back to the previous layer, but use the gradient sum to adjust the weights.
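
In case it makes the intent clearer, this is roughly the indexing scheme I'm aiming for, reduced to a single channel and a single filter (a simplified sketch, not the actual class; the names are just placeholders):

#include <vector>

// Simplified sketch: backward pass of one unpadded convolution with a single
// channel and a single filter. inputGrad has the input's dimensions and is
// assumed to be zeroed before the call.
void ConvBackwardSketch(const std::vector<float>& input,      // inputW * inputH
                        const std::vector<float>& outGrad,    // outW * outH
                        const std::vector<float>& filter,     // filterSize * filterSize
                        std::vector<float>& inputGrad,        // inputW * inputH
                        std::vector<float>& filterGrad,       // filterSize * filterSize
                        int inputW, int outW, int outH, int filterSize)
{
    for (int y = 0; y < outH; ++y)
    {
        for (int x = 0; x < outW; ++x)
        {
            float g = outGrad[y*outW + x]; // gradient flowing in from the next layer
            for (int v = 0; v < filterSize; ++v)
            {
                for (int u = 0; u < filterSize; ++u)
                {
                    int imageIndex  = (x + u) + (y + v)*inputW;
                    int kernelIndex = u + v*filterSize;
                    filterGrad[kernelIndex] += g * input[imageIndex];   // dLoss/dFilter
                    inputGrad[imageIndex]   += g * filter[kernelIndex]; // dLoss/dInput, passed back
                }
            }
        }
    }
}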

The max pooling layer propagates the gradients back like so (it saves the max indices and zeroes out all the other gradients):

void MaxPooling::Train(std::shared_ptr<float> input,std::shared_ptr<float> outputGradients, float label)
{
    for (int outputVolumeIndex = 0; outputVolumeIndex < m_OutputVolumeSize; ++outputVolumeIndex)
    {
        int inputIndex = m_Indices.get()[outputVolumeIndex];
        m_pGradients.get()[inputIndex] = outputGradients.get()[outputVolumeIndex];
    }
}

And the final regression layer calculates its gradients like this:

void LinearClassifier::Train(std::shared_ptr<float> data,std::shared_ptr<float> output, float y)
{
    float * x  = data.get();

    float biasError = 0.0f;
    float h = Hypothesis(output) - y;

    for (int i = 1; i < m_NumberOfWeights; ++i)
    {
        float error = h*x[i];
        m_pGradients.get()[i] = error;
        biasError += error;
    }

    float cost = h;
    m_Error = cost*cost;

    for (int theta = 1; theta < m_NumberOfWeights; ++theta)
    {
        m_pWeights.get()[theta] = m_pWeights.get()[theta] - learningRate*m_pGradients.get()[theta];
    }

    m_pWeights.get()[0] -= learningRate*biasError;
}

After 100 iterations of training on the two examples, the prediction for each is the same as the other and unchanged from the start.

  1. Should a convolutional network like this be able to discriminate between the two classes?
  2. Is this the correct approach?
  3. Should I be accounting for the ReLU (max) in the convolution layer backpropagation?

Answer 1:

  1. Should a convolutional network like this be able to discriminate between the two classes?

Yes. In fact, even a linear classifier by itself should be able to discriminate very easily (if the images are more or less centered).

  2. Is this the correct approach?

The most probable cause is an error in your gradient formulas. Always follow two easy rules:

  1. Start with a basic model. Do not start with a two-convolution network. Start your code without any convolutions. Does it work now? Once you have one working linear layer, add a single convolution. Does it work now? And so on.
  2. Always check your gradients numerically. This is so easy to do and will save you hours of debugging! Recall from analysis that

    [grad f(x)]_i ~ (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)
    

    where by [ ]_i I mean the i-th coordinate, and by e_i I mean the i-th canonical basis vector (a zero vector with a one in the i-th coordinate); see the sketch after this list.
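
A minimal sketch of such a check, assuming you can evaluate your network's loss as a function of a flat weight vector (all names below are placeholders, not your actual classes):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Central-difference gradient check: compares an analytic gradient against
// (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps) for every parameter i and
// returns the largest absolute difference found.
float CheckGradient(const std::function<float(const std::vector<float>&)>& loss,
                    std::vector<float> weights,
                    const std::vector<float>& analyticGrad,
                    float eps = 1e-4f)
{
    float worst = 0.0f;
    for (std::size_t i = 0; i < weights.size(); ++i)
    {
        const float original = weights[i];

        weights[i] = original + eps;
        const float lossPlus = loss(weights);

        weights[i] = original - eps;
        const float lossMinus = loss(weights);

        weights[i] = original; // restore the parameter

        const float numericGrad = (lossPlus - lossMinus) / (2.0f * eps);
        worst = std::max(worst, std::fabs(numericGrad - analyticGrad[i]));
    }
    return worst; // a large value here usually means a broken gradient formula
}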

  3. Should I be accounting for the ReLU (max) in the convolution layer backpropagation?

Yes, the ReLU alters your gradient, as it is a nonlinearity that you need to differentiate through. Again, back to point 1: start with simple models and add each element separately to find which one causes your gradients/model to break.
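
As a rough sketch of what that looks like (placeholder names, not your code), the ReLU backward pass only lets the gradient through where the forward input was positive:

#include <cstddef>
#include <vector>

// Sketch of a ReLU backward pass: preActivation holds the values fed into the
// ReLU on the forward pass, upstreamGrad holds dLoss/d(ReLU output).
std::vector<float> ReluBackward(const std::vector<float>& preActivation,
                                const std::vector<float>& upstreamGrad)
{
    std::vector<float> grad(preActivation.size(), 0.0f);
    for (std::size_t i = 0; i < preActivation.size(); ++i)
    {
        // Gradient passes through only where the forward input was > 0.
        grad[i] = (preActivation[i] > 0.0f) ? upstreamGrad[i] : 0.0f;
    }
    return grad;
}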