
Multivariate LSTM Forecast Loss and evaluation

Posted 2019-02-20 21:32

Question:

I have a CNN-RNN model architecture with bidirectional LSTMs for a time series regression problem. My loss does not converge over 50 epochs; each epoch has 20k samples, and the loss keeps bouncing between 0.001 and 0.01.

batch_size = 1
epochs = 50
model.compile(loss='mean_squared_error', optimizer='adam')
trainingHistory = model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, shuffle=False)
  1. I tried to train the model with incorrectly paired X and Y data, for which the loss stays around 0.5. Is it a reasonable conclusion that my X and Y have a nonlinear relationship which can be learned by my model over more epochs?
  2. The predictions of my model capture the pattern but with an offset. I use dynamic time warping (DTW) distance to manually check the accuracy of the predictions, roughly as in the sketch below. Is there a better way?
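For reference, this is roughly what I mean by the DTW check; it is a minimal pure-NumPy sketch, where trainY and preds stand in for my ground truth and model output (placeholder names, not my exact code):

import numpy as np

def dtw_distance(a, b):
    # classic O(len(a)*len(b)) dynamic time warping distance between
    # two 1-D sequences, using absolute difference as the local cost
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three allowed warping steps
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# e.g. compare predictions against the true series (placeholder names)
# preds = model.predict(trainX).ravel()
# print(dtw_distance(trainY.ravel(), preds))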

Model:

model = Sequential()
model.add(LSTM(units=128, dropout=0.05, recurrent_dropout=0.35, return_sequences=True,
               batch_input_shape=(batch_size, featureSteps, input_dim)))
model.add(LSTM(units=32, dropout=0.05, recurrent_dropout=0.35, return_sequences=False))
model.add(Dense(units=2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])  # opt is defined elsewhere

Answer 1:

If you tested with:

  • Wrong data: loss ~0.5
  • Correct data: loss ~0.01

Then your model is actually capable of learning something.

There are a few possibilities here:

  1. Your output data does not fit in the range of the last layer's activation.
  2. Your model reached a limit for the current learning rate (the gradient update steps are too big and can't improve the model anymore).
  3. Your model is not good enough for the task.
  4. Your data has some degree of randomness.

Case 1:

Make sure your Y is within the range of your last activation function.

  • For a tanh (the LSTM's default), all Y data should be between -1 and +1
  • For a sigmoid, between 0 and 1
  • For a softmax, between 0 and 1, but make sure your last dimension is not 1, otherwise all results will always be 1
  • For a relu, between 0 and infinity
  • For linear, any value

Convergence tends to be better with a bounded activation than with one that goes to infinity.
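A quick way to check case 1 is to compare the target range with the output activation and rescale the targets if they fall outside it. A minimal sketch with scikit-learn's MinMaxScaler, assuming trainY is a 2-D array of regression targets:

from sklearn.preprocessing import MinMaxScaler

# compare the target range against the range of the output activation
print(trainY.min(), trainY.max())

# e.g. for a sigmoid output, squash targets into [0, 1];
# keep the scaler so predictions can be mapped back afterwards
scaler = MinMaxScaler(feature_range=(0, 1))
trainY_scaled = scaler.fit_transform(trainY)

# later: scaler.inverse_transform(model.predict(testX))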

Case 2:

If the data is OK, try decreasing the learning rate once your model stagnates.

The default learning rate for Adam in Keras is 0.001; a common step is to divide it by 10:

from keras.optimizers import Adam

# after training enough with the default value (0.001):
model.compile(loss='mse', optimizer=Adam(lr=0.0001))
trainingHistory2 = model.fit(.........)

# you can even do this again if you notice that the loss decreased and stopped again:
model.compile(loss='mse', optimizer=Adam(lr=0.00001))

If the problem was the learning rate, this will let your model learn more than it already did (there might be some difficulty at the beginning until the optimizer readjusts itself).

Case 3:

If that brings no success, it may be time to increase the model's capacity: add more units to the layers, add more layers, or even change the model (a sketch of a larger variant is shown below).
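As an illustration only (the unit counts here are arbitrary examples, not a recommendation), a larger variant of the model above might look like this:

from keras.models import Sequential
from keras.layers import LSTM, Dense

# a deeper and wider variant of the original architecture
bigger = Sequential()
bigger.add(LSTM(units=256, dropout=0.05, recurrent_dropout=0.35, return_sequences=True,
                batch_input_shape=(batch_size, featureSteps, input_dim)))
bigger.add(LSTM(units=128, dropout=0.05, recurrent_dropout=0.35, return_sequences=True))
bigger.add(LSTM(units=32, dropout=0.05, recurrent_dropout=0.35, return_sequences=False))
bigger.add(Dense(units=2, activation='softmax'))
bigger.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])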

Case 4:

There's probably nothing you can do about this...

But if you increased the model's capacity as in case 3, be careful with overfitting (keep some test data aside to compare the test loss against the training loss).

Models that are too powerful can simply memorize your data instead of learning the important patterns in it.
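One simple way to watch for this is to hold out part of the data and compare the training loss with the validation loss during training; a minimal sketch using Keras's validation_split (just one option, your own split may differ):

# hold out 20% of the data as a validation set and watch both losses;
# a growing gap between loss and val_loss is a sign of overfitting
history = model.fit(trainX, trainY,
                    epochs=epochs,
                    batch_size=batch_size,
                    shuffle=False,
                    validation_split=0.2)

print(history.history['loss'][-1], history.history['val_loss'][-1])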