Getting free text features into TensorFlow canned estimators

Asked 2019-02-15 07:05

I'm trying to build a model that gives reddit_score = f('subreddit','comment')

Mainly this is an example I can then build on for a work project.

My code is here.

My problem is that I see that canned estimators, e.g. DNNLinearCombinedRegressor, must have feature_columns that are part of the FeatureColumn class.

I have my vocab file and know that if I were to just limit to the first word of a comment I could do something like:

tf.feature_column.categorical_column_with_vocabulary_file(
        key='comment',
        vocabulary_file='{}/vocab.csv'.format(INPUT_DIR)
        )

But if I'm passing in, say, the first 10 words from a comment, then I'm not sure how to go from a string like "this is a pre padded 10 word comment xyzpadxyz xyzpadxyz" to a feature_column such that I can then build an embedding to pass to the deep features in a wide and deep model.

It seems like it must be something really obvious or simple, but I can't for the life of me find any existing examples with this particular setup (canned wide and deep, Dataset API, and a mix of features, e.g. subreddit and a raw text feature like comment).

I was even thinking about doing the vocab integer lookup myself, such that the comment feature I pass in would be something like [23,45,67,12,1,345,7,99,999,999], and then maybe I could get it in via numeric_column with a shape and do something with it from there. But this feels a bit odd.
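
For what it's worth, a minimal sketch of what that do-it-yourself integer-ID idea could look like; the comment_ids key and VOCAB_SIZE are placeholder assumptions, and it assumes the input function already emits a dense [batch, 10] tensor of word ids:

import tensorflow as tf

VOCAB_SIZE = 10000  # assumed: number of ids in the vocab, including pad/OOV

# Each of the 10 pre-looked-up word ids is treated as a categorical value...
comment_ids = tf.feature_column.categorical_column_with_identity(
    key='comment_ids',       # hypothetical key emitted by the input_fn
    num_buckets=VOCAB_SIZE)

# ...and embedded, so the column can feed the deep part of the model.
comment_embeds = tf.feature_column.embedding_column(
    categorical_column=comment_ids,
    dimension=10)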

2 Answers

Answered by Explosion°爆炸 · 2019-02-15 07:17

Adding an answer as per the approach from the post @Lak did, but adapted a little for the Dataset API.

# Create an input function reading a file using the Dataset API
# Then provide the results to the Estimator API
def read_dataset(prefix, mode, batch_size):

    def _input_fn():

        def decode_csv(value_column):

            columns = tf.decode_csv(value_column, field_delim='|', record_defaults=DEFAULTS)
            features = dict(zip(CSV_COLUMNS, columns))

            # Split the raw comment into words, pad with PADWORD, then slice
            # so every comment ends up exactly MAX_DOCUMENT_LENGTH words long
            words = tf.string_split([features['comment']])
            words = tf.sparse_tensor_to_dense(words, default_value=PADWORD)
            padding = tf.constant([[0, 0], [0, MAX_DOCUMENT_LENGTH]])
            padded = tf.pad(words, padding)
            features['comment_words'] = tf.slice(padded, [0, 0], [-1, MAX_DOCUMENT_LENGTH])

            label = features.pop(LABEL_COLUMN)

            return features, label

        # Use prefix to create file path
        file_path = '{}/{}*{}*'.format(INPUT_DIR, prefix, PATTERN)

        # Create list of files that match pattern
        file_list = tf.gfile.Glob(file_path)

        # Create dataset from file list
        dataset = (tf.data.TextLineDataset(file_list)  # Read text file
                    .map(decode_csv))  # Transform each elem by applying decode_csv fn

        tf.logging.info("...dataset.output_types={}".format(dataset.output_types))
        tf.logging.info("...dataset.output_shapes={}".format(dataset.output_shapes))

        if mode == tf.estimator.ModeKeys.TRAIN:

            num_epochs = None # indefinitely
            dataset = dataset.shuffle(buffer_size = 10 * batch_size)

        else:

            num_epochs = 1 # end-of-input after this

        dataset = dataset.repeat(num_epochs).batch(batch_size)

        return dataset.make_one_shot_iterator().get_next()

    return _input_fn
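
For context, the input function returned here is what later gets handed to the Estimator; for example (the 'train'/'eval' prefixes and the batch size of 512 are just illustrative):

train_input_fn = read_dataset('train', tf.estimator.ModeKeys.TRAIN, 512)
eval_input_fn = read_dataset('eval', tf.estimator.ModeKeys.EVAL, 512)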

Then in the function below we can reference the field we created in decode_csv():

# Define feature columns
def get_wide_deep():

    EMBEDDING_SIZE = 10

    # Define column types
    subreddit = tf.feature_column.categorical_column_with_vocabulary_list('subreddit', ['news', 'ireland', 'pics'])

    comment_embeds = tf.feature_column.embedding_column(
        categorical_column = tf.feature_column.categorical_column_with_vocabulary_file(
            key='comment_words',
            vocabulary_file='{}/vocab.csv-00000-of-00001'.format(INPUT_DIR),
            vocabulary_size=100
            ),
        dimension = EMBEDDING_SIZE
        )

    # Sparse columns are wide, have a linear relationship with the output
    wide = [ subreddit ]

    # Continuous columns are deep, have a complex relationship with the output
    deep = [ comment_embeds ]

    return wide, deep
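
To complete the picture, a minimal sketch of handing these columns to the canned estimator from the question and training it with the read_dataset() input functions above; OUTPUT_DIR, the hidden-unit sizes and the step count are placeholder assumptions:

# Wide columns feed the linear part, embedding columns feed the DNN part
wide, deep = get_wide_deep()
estimator = tf.estimator.DNNLinearCombinedRegressor(
    model_dir=OUTPUT_DIR,                 # placeholder output directory
    linear_feature_columns=wide,
    dnn_feature_columns=deep,
    dnn_hidden_units=[64, 16])            # placeholder layer sizes

train_spec = tf.estimator.TrainSpec(
    input_fn=read_dataset('train', tf.estimator.ModeKeys.TRAIN, 512),
    max_steps=1000)                       # placeholder step count
eval_spec = tf.estimator.EvalSpec(
    input_fn=read_dataset('eval', tf.estimator.ModeKeys.EVAL, 512),
    steps=None)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)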
Answered 2019-02-15 07:38

You could use tf.string_split(), then do tf.slice() to slice it, taking care to tf.pad() the strings with zeros first. Look at the title preprocessing operations in: https://towardsdatascience.com/how-to-do-text-classification-using-tensorflow-word-embeddings-and-cnn-edae13b3e575

Once you have the words, you can then create ten feature columns.
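
A minimal sketch of that idea, assuming the input function exposes each of the ten padded word positions under its own key; the word_0 ... word_9 keys and the embedding dimension are placeholders, not part of the original answer:

# One categorical column per padded word position (hypothetical keys)
word_columns = [
    tf.feature_column.categorical_column_with_vocabulary_file(
        key='word_{}'.format(i),
        vocabulary_file='{}/vocab.csv'.format(INPUT_DIR))
    for i in range(10)]

# Each word column can then be wrapped in an embedding for the deep part
word_embeds = [tf.feature_column.embedding_column(col, dimension=10)
               for col in word_columns]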
