How many classes does the H2O deep learning algorithm accept?

Posted 2020-04-20 08:21

Question:

I want to predict the response variable, and it has 700 classes.

Deep learning model parameters

from h2o.estimators import deeplearning

dl_model = deeplearning.H2ODeepLearningEstimator(
    hidden=[200, 200],
    epochs=10,
    missing_values_handling='MeanImputation',
    max_categorical_features=4,
    distribution='multinomial'
)

# Train the model
dl_model.train(
    x=Content_vecs.names,
    y='tags',
    training_frame=data_split[0],
    validation_frame=data_split[1]
)

Original response variable (tags):
apps, email, mail
finance,freelancers,contractors,zen99
genomes
gogovan
brazil,china,cloudflare
hauling,service,moving
ferguson,crowdfunding,beacon
cms,naytev
y,combinator
in,store,
conversion,logic,ad,attribution

Response variable tags: 
[74]
[156, 89]
[153, 13, 133, 40]
[150]
[474, 277, 113]
[181, 117]
[15, 87, 8, 11]

Error:

OSError: Job with key $03017f00000132d4ffffffff$_8355bcac0e9e98a86257f45c180e4898 failed with an exception: java.lang.UnsupportedOperationException: error cannot be computed: too many classes

stacktrace: java.lang.UnsupportedOperationException: error cannot be computed: too many classes at hex.ConfusionMatrix.err(ConfusionMatrix.java:92)

But the source in h2o-core/src/main/java/hex/ConfusionMatrix.java states that it can compute 1000 classes.

Answer 1:

When you say you have 700 classes, do you mean your response variable is made up of arrays of those 700 unique numbers? Because you gave this example:

Response variable tags: 
[74]
[156, 89]
[153, 13, 133, 40]
[150]
[474, 277, 113]
[181, 117]
[15, 87, 8, 11]

H2O cannot predict arrays. Each unique combination of numbers will be counted as a single class. You therefore likely have a lot more than 700 classes, from H2O's point of view.
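To make the distinction concrete, here is a small sketch in plain Python (no H2O involved) using the seven example rows quoted above. It assumes each row of 'tags' ends up as one value, represented here by its string form; the variable names are only illustrative.

```python
# The seven example rows from the question, each a list of tag IDs.
rows = [
    [74],
    [156, 89],
    [153, 13, 133, 40],
    [150],
    [474, 277, 113],
    [181, 117],
    [15, 87, 8, 11],
]

# Individual tag IDs: what is being counted when you say "700 classes".
unique_tags = {tag for row in rows for tag in row}

# Whole-row combinations: what a categorical column would treat as levels,
# assuming each row is stored as a single value (here, its string form).
unique_combinations = {str(row) for row in rows}

print(len(unique_tags))          # 17 distinct tag IDs in this sample
print(len(unique_combinations))  # 7 distinct levels, one per row
```

In this tiny sample every row is its own level, so the number of levels follows the number of distinct rows rather than the number of distinct tag IDs. Over a full data set, that count can grow far beyond 700.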

If you look at the data in Flow ( http://127.0.0.1:54321/ ), it will tell you how many unique levels there are in 'tags'. (You can also get this from the Python API, using describe() on the frame, or categories() on the column in question, which lists all the levels.)
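A minimal sketch of the Python-side check, assuming the training frame is an H2OFrame and 'tags' has already been parsed as a categorical (enum) column. The helper name count_levels is made up for illustration; categories() is the call mentioned above.

```python
def count_levels(frame, column="tags"):
    """Return how many levels a categorical column has, plus a short preview.

    `frame` is assumed to be an H2OFrame, and `column` a single
    categorical (enum) column, since categories() is meant for those.
    """
    levels = frame[column].categories()  # list of every level in the column
    return len(levels), levels[:5]


# Hypothetical usage, with the question's data_split[0] as the training frame:
# n_levels, preview = count_levels(data_split[0], "tags")
# print(n_levels, preview)
```

If the count comes back far larger than 700, that is consistent with each distinct array being treated as its own class.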

Your next question is going to be what to do about this. I suggest making that a new question, in which you explain what the 700 values and the arrays represent; the solution is almost certainly going to involve some domain-specific pre-processing. However, you could try playing with categorical_encoding: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html