I started with a simple implementation of single-variable gradient descent for linear regression, but I don't know how to extend it to a multivariate stochastic gradient descent algorithm.
Single-variable linear regression:
import tensorflow as tf
import numpy as np
# create random data
x_data = np.random.rand(100).astype(np.float32)
y_data = x_data * 0.5
# Find values for W that compute y_data = W * x_data
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
y = W * x_data
# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
# Before starting, initialize the variables
init = tf.initialize_all_variables()
# Launch the graph.
sess = tf.Session()
sess.run(init)
# Fit the line.
for step in xrange(2001):
    sess.run(train)
    if step % 200 == 0:
        print(step, sess.run(W))
You have two parts in your question:
To get to a higher-dimensional setting, you can define your linear problem as y = <x, w>. Then you just need to change the dimension of your Variable W to match that of w and replace the multiplication W * x_data with the scalar product tf.matmul(x_data, W), and your code should run just fine.
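For instance, here is a minimal sketch of that change, reusing the imports from your snippet and assuming a 3-dimensional input with an arbitrary true weight vector w:

# x_data becomes a (100, 3) matrix and the true w a (3, 1) column vector
x_data = np.random.rand(100, 3).astype(np.float32)
w = np.array([[0.5], [1.0], [-2.0]], dtype=np.float32)  # arbitrary true weights
y_data = np.matmul(x_data, w)
# W now matches the dimension of w
W = tf.Variable(tf.random_uniform([3, 1], -1.0, 1.0))
y = tf.matmul(x_data, W)  # replaces y = W * x_data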
To change the learning method to stochastic gradient descent, you need to abstract the input of your cost function by using tf.placeholder. Once you have defined X and y_ to hold your input at each step, you can construct the same cost function. Then, you need to run your training step by feeding it the proper mini-batch of your data. An example of how you could implement such behavior is given after the two side notes below; it should show that W quickly converges to w.

Two side notes:
The implementation below is called mini-batch gradient descent since, at each step, the gradient is computed using a subset of the data of size mini_batch_size. This is a variant of stochastic gradient descent that is usually used to stabilize the estimate of the gradient at each step. Stochastic gradient descent can be obtained by setting mini_batch_size = 1.

The dataset can be shuffled at every epoch to get an implementation closer to the theoretical setting. Some recent work also considers making only one pass through the dataset, as it prevents over-fitting. For a more mathematical and detailed explanation, see Bottou12. This can easily be changed according to your problem setup and the statistical properties you are looking for.
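Here is a minimal sketch of such a mini-batch implementation (the input dimension, the true w, the learning rate, and mini_batch_size = 10 are arbitrary choices for illustration):

import tensorflow as tf
import numpy as np

# create random data: 100 samples of dimension 3
n_samples, n_dim = 100, 3
x_data = np.random.rand(n_samples, n_dim).astype(np.float32)
w = np.array([[0.5], [1.0], [-2.0]], dtype=np.float32)  # arbitrary true weights
y_data = np.matmul(x_data, w)

# placeholders abstract the input so a different mini-batch can be fed at each step
X = tf.placeholder(tf.float32, shape=[None, n_dim])
y_ = tf.placeholder(tf.float32, shape=[None, 1])

# Find values for W that compute y_ = tf.matmul(X, W)
W = tf.Variable(tf.random_uniform([n_dim, 1], -1.0, 1.0))
y = tf.matmul(X, W)

# Minimize the mean squared errors, same cost function as before
loss = tf.reduce_mean(tf.square(y - y_))
optimizer = tf.train.GradientDescentOptimizer(0.1)
train = optimizer.minimize(loss)

init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

mini_batch_size = 10
for step in xrange(2001):
    # draw a random mini-batch and feed it through the placeholders
    indices = np.random.choice(n_samples, mini_batch_size)
    sess.run(train, feed_dict={X: x_data[indices], y_: y_data[indices]})
    if step % 200 == 0:
        print(step, sess.run(W).flatten())

Setting mini_batch_size = 1 in this sketch recovers plain stochastic gradient descent, as noted above.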