There have been a number of papers (particularly for image captioning) that use CNN and LSTM architectures jointly for prediction and generation tasks. However, they all seem to train the CNN independently from the LSTM. I was looking through Torch and TensorFlow (with Keras), and couldn't find a reason why end-to-end training shouldn't be possible (at least from an architecture design point of view), but there doesn't seem to be any documentation for such a model.
So, can it be done? Does Torch or TensorFlow (or even Theano or Caffe) support jointly training an end-to-end CNN-LSTM neural network? If so, is it as simple as connecting the CNN's output to the LSTM's input and running SGD over the whole thing? Or is there more complexity to it?
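To make it concrete, here is a minimal sketch of the kind of model I mean, in Keras with the TensorFlow backend. The layer sizes, sequence length, and input shape are arbitrary assumptions on my part, not taken from any particular paper; the point is just that the CNN is applied per frame and its features feed the LSTM, so backprop would (I assume) update both parts together.

```python
# Minimal sketch (Keras, TensorFlow backend); shapes and sizes are placeholders.
from tensorflow.keras import layers, models, optimizers

# Input: a sequence of 10 frames, each 64x64 RGB.
inputs = layers.Input(shape=(10, 64, 64, 3))

# CNN applied to every frame via TimeDistributed.
x = layers.TimeDistributed(layers.Conv2D(32, (3, 3), activation="relu"))(inputs)
x = layers.TimeDistributed(layers.MaxPooling2D((2, 2)))(x)
x = layers.TimeDistributed(layers.Flatten())(x)

# LSTM consumes the per-frame CNN features.
x = layers.LSTM(128)(x)
outputs = layers.Dense(10, activation="softmax")(x)

model = models.Model(inputs, outputs)

# Plain SGD; gradients should flow through the LSTM back into the CNN.
model.compile(optimizer=optimizers.SGD(learning_rate=0.01),
              loss="categorical_crossentropy")
```

Is joint training really as simple as compiling and fitting a model like this, or are there practical reasons (gradient flow, memory, convergence) why people pretrain and freeze the CNN instead?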