PredictionIO evaluation fails with empty.maxBy exc

Posted 2019-07-14 06:06

I downloaded the latest update of the text classification template. I created a new app and imported stopwords.json and emails.json, specifying the app ID:

$ pio import --appid <appID> --input data/stopwords.json
$ pio import --appid <appID> --input data/emails.json

Then I edited engine.json and set my app name in it:

{
  "id": "default",
  "description": "Default settings",
  "engineFactory": "org.template.textclassification.TextClassificationEngine",
  "datasource": {
    "params": {
      "appName": "<myapp>",
      "evalK": 3
    }

But the next step, i.e. evaluation, fails with the error empty.maxBy. Part of the error output is pasted below:

[INFO] [Engine$] Preparator:  org.template.textclassification.Preparator@79a13920
[INFO] [Engine$] AlgorithmList: List(org.template.textclassification.LRAlgorithm@420a8042)
[INFO] [Engine$] Serving: org.template.textclassification.Serving@faea4da
Exception in thread "main" java.lang.UnsupportedOperationException:  empty.maxBy
at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:223)
at scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
at org.template.textclassification.PreparedData.<init> (Preparator.scala:160)
at org.template.textclassification.Preparator.prepare(Preparator.scala:39)
at org.template.textclassification.Preparator.prepare(Preparator.scala:35)
at io.prediction.controller.PPreparator.prepareBase(PPreparator.scala:34)
at io.prediction.controller.Engine$$anonfun$25.apply(Engine.scala:758)
at scala.collection.MapLike$MappedValues.get(MapLike.scala:249)
at scala.collection.MapLike$MappedValues.get(MapLike.scala:249)
at scala.collection.MapLike$class.apply(MapLike.scala:140)
at scala.collection.AbstractMap.apply(Map.scala:58)

Then I tried pio train, but training also fails after printing some observations. The error is java.lang.OutOfMemoryError: Java heap space. Part of the output is pasted below:

[INFO] [Engine$] Data santiy check is on.
[INFO] [Engine$] org.template.textclassification.TrainingData supports data sanity check. Performing check.

Observation 1 label: 1.0
Observation 2 label: 0.0
Observation 3 label: 0.0
Observation 4 label: 1.0
Observation 5 label: 1.0

[INFO] [Engine$] org.template.textclassification.PreparedData does not support data sanity check. Skipping check.
[WARN] [BLAS] Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
[WARN] [BLAS] Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
[INFO] [Engine$] org.template.textclassification.NBModel does not support data sanity check. Skipping check.
[INFO] [Engine$] EngineWorkflow.train completed
[INFO] [Engine] engineInstanceId=AU3g4XyhTrUUakX3xepP
[INFO] [CoreWorkflow$] Inserting persistent model
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at  java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:36)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at  com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:29)

Is this caused by a memory shortage? I ran the previous version of the same template on more than 40 MB of text classification data without issues. Is evaluation required before training? Also, could you please explain how the evaluation is performed?

1 answer
趁早两清
Reply #2 · 2019-07-14 06:52

I was just able to run the evaluation without hitting the first error; the second error is related to memory usage.

The empty.maxBy error occurs when your data isn't being read in via the DataSource. My first guess: if you're using an appName other than MyTextApp, make sure you also reflect that change in the EngineParamsList object in the Evaluation.scala script. You'll see that a DataSourceParams object is created there for evaluation, so changing the app name in engine.json alone may not be enough.
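To illustrate the failure mode in miniature, here is a small Python analogy (not the template's Scala code). In Scala, calling maxBy on an empty collection throws UnsupportedOperationException: empty.maxBy; Python's max raises ValueError in the same situation. The function name and the error message are hypothetical and only show why checking for empty input first gives a clearer diagnosis.

```python
def most_frequent_label(labels):
    """Return the label that occurs most often, failing loudly on empty input."""
    if not labels:
        # An empty collection usually means no events were read at all,
        # for example because the configured app name did not match.
        raise ValueError("no training data was read; check the app name")
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    # Without the guard above, max() over an empty dict would raise ValueError.
    return max(counts, key=counts.get)
```

If the guard fires, the place to look is the data source configuration rather than the preparation step where the exception surfaces.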

For the OutOfMemoryError, increase the driver memory before training or evaluation. The options after the bare -- are passed through to Spark:

pio train -- --driver-memory xG --executor-memory yG
pio eval org.template.textclassification.AccuracyEvaluation org.template.textclassification.EngineParamsList -- --driver-memory xG --executor-memory yG

Setting --driver-memory to 1G or 2G should suffice.

As for how the evaluation is carried out, PredictionIO performs k-fold cross-validation by default. Your data is split into k parts of roughly equal size. Say k is 3 for illustration. A model is trained on 2/3 of the data, and the remaining 1/3 is used as a test set to estimate prediction performance. This is repeated so that each third serves as the test set once, and the average of the 3 performance estimates is used as the final estimate (in a general setting you must decide for yourself which metric is appropriate). The whole procedure is repeated for each parameter setting and model that you specify for testing.
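The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not PredictionIO's implementation: assigning rows to folds by index modulo k is one simple way to get roughly equal parts, and the train and score callables are hypothetical placeholders supplied by the caller.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds of roughly equal size."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds


def cross_validate(data, k, train, score):
    """Average a score over k train/test splits.

    train(train_rows) returns a model; score(model, test_rows) returns a number.
    Both are hypothetical callables provided by the caller.
    """
    scores = []
    for held_out in k_fold_indices(len(data), k):
        held = set(held_out)
        test_rows = [data[i] for i in held_out]
        train_rows = [row for i, row in enumerate(data) if i not in held]
        model = train(train_rows)          # fit on the other k-1 folds
        scores.append(score(model, test_rows))  # evaluate on the held-out fold
    return sum(scores) / k
```

For model selection, one would call cross_validate once per candidate parameter setting and keep the setting with the best average score.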

Evaluation is not a required step for training and deploying. It is, however, a way to select which parameters and algorithms should be used for training and deployment. This is known as model selection in machine learning and statistics.


Edit: As for the text vectorization, each document is vectorized in the following way:

Say my document is:

"I am Marco."

The first step is to tokenize this, which would result in the following Array/List output:

["I", "am", "Marco"]

Next comes bigram extraction, which stores the following set of token arrays/lists (the bigrams together with the single words):

["I", "am"], ["am", "Marco"], ["I"], ["am"], ["Marco"]

Each one of these is used as a feature to build vectors of bigram and word counts, to which a tf-idf transformation is then applied. Note that to build a vector, we must extract the bigrams from every single document, so these feature vectors can turn out to be quite large. You can cut out a lot of this by increasing/decreasing the inverseIdfMin/inverseIdfMax values in the Preparator stage.
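The vectorization steps above can be sketched as follows. This is a minimal plain-Python illustration, not the template's Scala code: the function names are hypothetical, the idf formula log(N / df) is one common variant, and the filtering controlled by inverseIdfMin/inverseIdfMax is not modeled here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)


def ngrams(tokens, max_n=2):
    """Return all n-grams of length 1..max_n as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, max_n=2):
    """Map each document to a dict of n-gram -> tf-idf weight."""
    grams_per_doc = [ngrams(tokenize(d), max_n) for d in docs]
    # Document frequency: the number of documents each n-gram appears in.
    df = Counter(g for grams in grams_per_doc for g in set(grams))
    n_docs = len(docs)
    vectors = []
    for grams in grams_per_doc:
        counts = Counter(grams)  # raw term counts for this document
        vectors.append({g: c * math.log(n_docs / df[g])
                        for g, c in counts.items()})
    return vectors
```

Because every distinct word and bigram across the whole corpus becomes a feature, the number of features grows quickly with corpus size, which is why pruning by idf can reduce memory use considerably.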
