I'm working with TensorFlow, hoping to train a deep CNN to do move prediction for the game Go. The dataset I created consists of 100,000 binary data files, where each data file corresponds to a recorded game and contains roughly 200 training samples (one for each move in the game). I believe it will be very important to get good mixing when using SGD: I'd like my batches to contain samples from different games AND samples from different stages of the games. For example, simply reading one sample from the start of 100 files and shuffling isn't good, because those 100 samples will all be the first move of their respective games.
I have read the tutorial on feeding data from files, but I'm not sure whether the provided libraries do what I need. If I were to hard-code it myself, I would basically initialize a bunch of file pointers to random locations within each file and then pull samples from random files, incrementing the file pointers accordingly; something like the sketch below.
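A rough sketch of what I mean (the sample size and the way records are parsed are placeholders, since they depend on how I end up encoding the binary samples):

```
import os
import random

# Placeholder: the real size/encoding of one training sample depends on
# how the binary game files are written.
SAMPLE_BYTES = 1024

def random_sample_stream(filenames, num_open_files=100):
    """Yield raw samples, mixing across games and across game stages."""
    # Open a pool of files, each seeked to a random sample boundary.
    handles = []
    for name in random.sample(filenames, num_open_files):
        num_samples = os.path.getsize(name) // SAMPLE_BYTES
        if num_samples == 0:
            continue
        f = open(name, 'rb')
        f.seek(random.randrange(num_samples) * SAMPLE_BYTES)
        handles.append(f)
    while handles:
        f = random.choice(handles)       # pick a random game
        record = f.read(SAMPLE_BYTES)    # advance that game's file pointer
        if len(record) < SAMPLE_BYTES:   # this game is exhausted
            f.close()
            handles.remove(f)
            continue
        yield record
```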
So, my question is: does TensorFlow provide this sort of functionality, or would it be easier to write my own code for creating batches?
Yes - what you want is to use a combination of two things.
First, randomly shuffle the order in which you read your data files, by feeding the list of filenames into a tf.train.string_input_producer with shuffle=True that in turn feeds whatever input method you use (if you can put your examples into tf.Example proto format, that's easy to use with parse_example). To be very clear, you put the list of filenames in the string_input_producer and then read them with another method such as read_file, etc.
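As a rough sketch of that first stage, assuming the games have been converted to TFRecord files of tf.Example protos (the file pattern and the 'board'/'move' feature names are just placeholders):

```
import tensorflow as tf

# Filenames are shuffled each epoch by the string_input_producer.
filenames = tf.train.match_filenames_once('/data/go/*.tfrecord')
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)

# One reader pulling serialized tf.Example protos from the shuffled
# filename queue, parsed into (board, move) tensors.
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
features = tf.parse_single_example(
    serialized_example,
    features={
        'board': tf.FixedLenFeature([19 * 19], tf.float32),
        'move': tf.FixedLenFeature([], tf.int64),
    })
board, move = features['board'], features['move']
```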
Second, you need to mix at a finer granularity. You can accomplish this by feeding the input examples into a tf.train.shuffle_batch node with a large capacity and a large value of min_after_dequeue. One particularly nice way is to use a shuffle_batch_join that receives input from multiple files, so that you get a lot of mixing. Set the capacity of the batch queue big enough to mix well without exhausting your RAM; tens of thousands of examples usually works pretty well.

Keep in mind that the batch functions add a QueueRunner to the QUEUE_RUNNERS collection, so you need to run tf.train.start_queue_runners().
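Building on the sketch above (same imports and filename_queue; the number of readers, batch size, and queue sizes are just illustrative values):

```
def read_example(queue):
    # Each call creates its own reader pulling from the shared,
    # shuffled filename queue.
    reader = tf.TFRecordReader()
    _, serialized = reader.read(queue)
    parsed = tf.parse_single_example(
        serialized,
        features={
            'board': tf.FixedLenFeature([19 * 19], tf.float32),
            'move': tf.FixedLenFeature([], tf.int64),
        })
    return [parsed['board'], parsed['move']]

# Several readers feeding one large shuffling queue gives good mixing
# across games and across stages of games.
example_list = [read_example(filename_queue) for _ in range(4)]
board_batch, move_batch = tf.train.shuffle_batch_join(
    example_list,
    batch_size=128,
    capacity=50000,
    min_after_dequeue=20000)

with tf.Session() as sess:
    # match_filenames_once stores its result in a local variable.
    sess.run([tf.global_variables_initializer(),
              tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    # ... run training steps that consume board_batch / move_batch ...
    coord.request_stop()
    coord.join(threads)
```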
In your case it should not be a problem to do some preprocessing and create one file out of all the files you have. For this type of game, where the history is not important and the position determines everything, your dataset can consist of just position -> next_move pairs.
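A rough sketch of that preprocessing, writing everything into a single TFRecord file (parse_game is a hypothetical helper that yields (position, next_move) pairs from one of your binary game files):

```
import tensorflow as tf

def write_merged_dataset(game_files, output_path):
    # Merge every game into one TFRecord file of position -> next_move
    # examples; downstream shuffling then works on this single file.
    with tf.python_io.TFRecordWriter(output_path) as writer:
        for game_file in game_files:
            for position, next_move in parse_game(game_file):  # hypothetical helper
                example = tf.train.Example(features=tf.train.Features(feature={
                    'position': tf.train.Feature(
                        float_list=tf.train.FloatList(value=position)),
                    'next_move': tf.train.Feature(
                        int64_list=tf.train.Int64List(value=[next_move])),
                }))
                writer.write(example.SerializeToString())
```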
For the more general case, TF provides everything needed for the shuffling you want. There are two types of shuffling, which serve different purposes and shuffle different things:

tf.train.string_input_producer with shuffle=True: "If true, the strings are randomly shuffled within each epoch." So if you have a few files ['file1', 'file2', ..., 'filen'], this randomly selects a file from that list. If shuffle is false, the files are read one after the other.
tf.train.shuffle_batch: creates batches by randomly shuffling tensors. It accumulates your single-example tensors in an internal queue and dequeues batch_size shuffled examples at a time (the queue is created for you; you just need to start the threads that fill it with tf.train.start_queue_runners).
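For example, a minimal sketch combining both kinds of shuffling, assuming your data is in TFRecord files of tf.Example protos (the filenames, feature names, and queue sizes are illustrative):

```
import tensorflow as tf

# File-level shuffling: the order of the files is shuffled each epoch.
filename_queue = tf.train.string_input_producer(
    ['file1.tfrecord', 'file2.tfrecord'], shuffle=True)

# Parse one position -> next_move example per read.
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
parsed = tf.parse_single_example(
    serialized,
    features={
        'position': tf.FixedLenFeature([19 * 19], tf.float32),
        'next_move': tf.FixedLenFeature([], tf.int64),
    })

# Example-level shuffling: a large min_after_dequeue mixes positions
# from many different games and stages.
position_batch, move_batch = tf.train.shuffle_batch(
    [parsed['position'], parsed['next_move']],
    batch_size=128,
    capacity=50000,
    min_after_dequeue=20000)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    positions, moves = sess.run([position_batch, move_batch])
    coord.request_stop()
    coord.join(threads)
```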