Let's say I've read in a textfile using a TextLineReader. Is there some way to split this into train and test sets in TensorFlow? Something like:
def read_my_file_format(filename_queue):
    reader = tf.TextLineReader()
    key, record_string = reader.read(filename_queue)
    raw_features, label = tf.decode_csv(record_string)
    features = some_processing(raw_features)
    features_train, labels_train, features_test, labels_test = tf.train_split(
        features, labels, frac=.1)
    return features_train, labels_train, features_test, labels_test
Something like the following should work:
tf.split_v(tf.random_shuffle(...
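The shuffle-then-split idea behind that one-liner can be sketched in plain NumPy (the helper name here is hypothetical, not a TensorFlow API): permute the rows once, then slice off the first `frac` of them as the test set.

```python
import numpy as np

def train_test_split_np(features, labels, frac=0.1, seed=0):
    """Shuffle, then split off a `frac` fraction as the test set.
    Mirrors the tf.random_shuffle + split approach above."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(features))          # random row order
    features, labels = features[perm], labels[perm]
    n_test = int(len(features) * frac)             # size of the test slice
    return (features[n_test:], labels[n_test:],    # train portion
            features[:n_test], labels[:n_test])    # test portion

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_tr, y_tr, X_te, y_te = train_test_split_np(X, y, frac=0.2)
# 8 training rows, 2 test rows, together covering all 10 originals
```

Because the shuffle happens before the slice, every example has the same chance of landing in the test set.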
As elham mentioned, you can use scikit-learn to do this easily. scikit-learn is an open-source machine-learning library with many tools for data preparation, including the model_selection module, which handles comparing, validating, and choosing parameters. Its model_selection.train_test_split() method is designed specifically to split your data into train and test sets randomly, by a given fraction.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.33, random_state=42)
test_size is the fraction of the data to reserve for testing, and random_state seeds the random sampling so the split is reproducible.
I typically use this to produce train and validation sets, and keep the true test data separate. You could also just run train_test_split twice: first split the data into (train + validation) and test, then split train + validation into two separate sets.
import sklearn.model_selection as sk

X_train, X_test, y_train, y_test = sk.train_test_split(
    features, labels, test_size=0.33, random_state=42)
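The two-stage split described above can be sketched like this (the 60/20/20 ratio is just an illustrative choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(20).reshape(10, 2)
labels = np.arange(10)

# First split: hold out 20% as the true test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

# Second split: carve a validation set out of the remainder.
# 0.25 of the remaining 80% equals 20% of the original data,
# giving a 60/20/20 train/validation/test split.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)
```

Keeping the test set out of the second split ensures it is never touched during model selection.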
I managed to get a nice result using the map and filter functions of the tf.data.Dataset API. Use the map function to randomly tag each example as train or test: for each example, draw a sample from a uniform distribution and check whether it falls below the train rate.
def split_train_test(parsed_features, train_rate):
    # Tag the example as a training example with probability train_rate.
    parsed_features['is_train'] = tf.gather(
        tf.random_uniform([1], maxval=100, dtype=tf.int32)
        < tf.cast(train_rate * 100, tf.int32), 0)
    return parsed_features

def grab_train_examples(parsed_features):
    return parsed_features['is_train']

def grab_test_examples(parsed_features):
    return ~parsed_features['is_train']
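The tag-then-filter logic above can be mirrored in plain Python to see what it does (the example dicts and the 0.8 rate are illustrative, not from the original):

```python
import random

def split_train_test(example, train_rate, rng):
    # Tag the example as train with probability train_rate,
    # mirroring the uniform-sample comparison in the TF version.
    example = dict(example)
    example['is_train'] = rng.random() < train_rate
    return example

def grab_train_examples(example):
    return example['is_train']

def grab_test_examples(example):
    return not example['is_train']

rng = random.Random(0)
dataset = [{'x': i} for i in range(100)]
tagged = [split_train_test(ex, 0.8, rng) for ex in dataset]
train = [ex for ex in tagged if grab_train_examples(ex)]
test = [ex for ex in tagged if grab_test_examples(ex)]
# train and test partition the dataset; roughly 80% land in train
```

Note that this gives a split that is only correct in expectation; each pass over the data draws fresh samples, so cache the tagged dataset if you need a stable split.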