I'm trying to build a model that gives reddit_score = f('subreddit', 'comment'), mainly as an example I can then build on for a work project.
My code is here.
My problem is that the canned estimators, e.g. DNNLinearCombinedRegressor, require feature_columns that are instances of the FeatureColumn class.
I have my vocab file, and I know that if I were to limit myself to just the first word of a comment I could do something like:
    tf.feature_column.categorical_column_with_vocabulary_file(
        key='comment',
        vocabulary_file='{}/vocab.csv'.format(INPUT_DIR)
    )
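For reference, a single-word column like that could already be wired into the canned estimator, with the categorical column wrapped in an embedding for the deep side. A minimal sketch, where the subreddit vocabulary list, embedding dimension, and hidden units are all illustrative assumptions:

    import tensorflow as tf

    comment_first_word = tf.feature_column.categorical_column_with_vocabulary_file(
        key='comment',
        vocabulary_file='{}/vocab.csv'.format(INPUT_DIR)
    )
    subreddit = tf.feature_column.categorical_column_with_vocabulary_list(
        'subreddit', ['news', 'ireland', 'pics']  # hypothetical vocabulary
    )
    estimator = tf.estimator.DNNLinearCombinedRegressor(
        linear_feature_columns=[subreddit],  # wide side: sparse columns
        dnn_feature_columns=[
            tf.feature_column.embedding_column(comment_first_word, dimension=10)
        ],  # deep side: dense embeddings
        dnn_hidden_units=[64, 16]
    )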
But if I'm passing in, say, the first 10 words of a comment, I'm not sure how to go from a string like "this is a pre padded 10 word comment xyzpadxyz xyzpadxyz" to a feature_column, such that I can then build an embedding to pass to the deep features in a wide and deep model.
It seems like it must be something really obvious or simple, but I can't for the life of me find any existing examples with this particular setup (canned wide and deep model, dataset API, and a mix of features, e.g. subreddit plus a raw text feature like comment).
I was even thinking about doing the vocab integer lookup myself, so that the comment feature I pass in would be something like [23,45,67,12,1,345,7,99,999,999], and then maybe I could get it in via numeric_column with a shape and work with it from there. But this feels a bit odd.
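If I did go that route, categorical_column_with_identity rather than numeric_column might be the fit, since integer ids that are already looked up can feed straight into an embedding. A sketch, where the feature key 'comment_ids' and the vocab size are assumptions:

    import tensorflow as tf

    VOCAB_SIZE = 1000  # assumption: number of entries in vocab.csv

    # 'comment_ids' would carry the pre-looked-up ids, e.g. [23, 45, ..., 999]
    comment_ids = tf.feature_column.categorical_column_with_identity(
        key='comment_ids', num_buckets=VOCAB_SIZE
    )
    # embedding_column averages the ten word embeddings per comment by default
    comment_embeds = tf.feature_column.embedding_column(comment_ids, dimension=10)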
Adding an answer as per the approach from the post @Lak did, but adapted a little for the dataset API.
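A minimal sketch of the decode_csv() part, assuming the CSV schema below and that comments arrive pre-padded to at least 10 tokens, as in the question:

    import tensorflow as tf

    CSV_COLUMNS = ['reddit_score', 'subreddit', 'comment']  # assumed schema
    LABEL_COLUMN = 'reddit_score'
    DEFAULTS = [[0.0], ['null'], ['null']]
    PADWORD = 'xyzpadxyz'
    MAX_DOCUMENT_LENGTH = 10

    def decode_csv(value_column):
        columns = tf.decode_csv(value_column, record_defaults=DEFAULTS)
        features = dict(zip(CSV_COLUMNS, columns))
        # split the raw comment string into word tokens
        words = tf.string_split([features['comment']])
        dense_words = tf.sparse_tensor_to_dense(words, default_value=PADWORD)
        # comments are pre-padded in the data, so slicing to a fixed length
        # is enough; otherwise tf.pad() the tokens first as @Lak suggests
        features['comment_words'] = tf.slice(dense_words, [0, 0],
                                             [-1, MAX_DOCUMENT_LENGTH])
        label = features.pop(LABEL_COLUMN)
        return features, label

This gets applied in the input function via dataset.map(decode_csv).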
Then in the function below we can reference the field we made as part of decode_csv().
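A sketch of that function, with the subreddit vocabulary list and embedding size as illustrative values:

    def get_wide_deep():
        EMBEDDING_SIZE = 10  # illustrative
        # wide side: sparse columns with a linear relationship to the output
        subreddit = tf.feature_column.categorical_column_with_vocabulary_list(
            'subreddit', ['news', 'ireland', 'pics']  # hypothetical vocabulary
        )
        # deep side: embed the comment_words field created in decode_csv()
        comment_embeds = tf.feature_column.embedding_column(
            categorical_column=tf.feature_column.categorical_column_with_vocabulary_file(
                key='comment_words',
                vocabulary_file='{}/vocab.csv'.format(INPUT_DIR)
            ),
            dimension=EMBEDDING_SIZE
        )
        wide = [subreddit]
        deep = [comment_embeds]
        return wide, deep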
You could use tf.string_split(), then do tf.slice() to slice it, taking care to tf.pad() the strings with zeros first. Look at the title preprocessing operations in: https://towardsdatascience.com/how-to-do-text-classification-using-tensorflow-word-embeddings-and-cnn-edae13b3e575
Once you have the words, you can then create ten feature columns.
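One way to read that, assuming decode_csv() exposed each word slot as its own field (word_1 through word_10, hypothetical keys) and sharing a single embedding across the ten positions:

    import tensorflow as tf

    word_columns = [
        tf.feature_column.categorical_column_with_vocabulary_file(
            key='word_{}'.format(i),
            vocabulary_file='{}/vocab.csv'.format(INPUT_DIR)
        )
        for i in range(1, 11)
    ]
    # share one set of embedding weights across all ten word positions
    word_embeds = tf.feature_column.shared_embedding_columns(
        word_columns, dimension=10
    )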