Consider the following example:
dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "Chinese Macao",
                              "Tokyo Japan Chinese"),
                     doc_id = 1:4,
                     class = c(1, 1, 1, 0))
dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)
> dtrain_spark
# Source: table<dtrain> [?? x 3]
# Database: spark_connection
text doc_id class
<chr> <int> <dbl>
1 Chinese Beijing Chinese 1 1
2 Chinese Chinese Shanghai 2 1
3 Chinese Macao 3 1
4 Tokyo Japan Chinese 4 0
Here I have the classic Naive Bayes example, where class identifies documents falling into the China category.

I am able to run a Naive Bayes classifier in sparklyr by doing the following:
dtrain_spark %>%
  ft_tokenizer(input_col = "text", output_col = "tokens") %>%
  ft_count_vectorizer(input_col = "tokens", output_col = "myvocab") %>%
  select(myvocab, class) %>%
  ml_naive_bayes(label_col = "class",
                 features_col = "myvocab",
                 prediction_col = "pcol",
                 probability_col = "prcol",
                 raw_prediction_col = "rpcol",
                 model_type = "multinomial",
                 smoothing = 0.6,
                 thresholds = c(0.2, 0.4))
which outputs:
NaiveBayesModel (Transformer)
<naive_bayes_5e946aec597e>
(Parameters -- Column Names)
features_col: myvocab
label_col: class
prediction_col: pcol
probability_col: prcol
raw_prediction_col: rpcol
(Transformer Info)
num_classes: int 2
num_features: int 6
pi: num [1:2] -1.179 -0.368
theta: num [1:2, 1:6] -1.417 -0.728 -2.398 -1.981 -2.398 ...
thresholds: num [1:2] 0.2 0.4
However, I have two major questions:
How can I assess the performance of this classifier in-sample? Where are the accuracy metrics?
Even more importantly, how can I use this trained model to predict new values, say, in the following Spark test data frame?
Test data:
dtest <- data_frame(text = c("Chinese Chinese Chinese Tokyo Japan",
                             "random stuff"))
dtest_spark <- copy_to(sc, dtest, overwrite = TRUE)
> dtest_spark
# Source: table<dtest> [?? x 1]
# Database: spark_connection
text
<chr>
1 Chinese Chinese Chinese Tokyo Japan
2 random stuff
Thanks!
In general (there are some models which provide some form of summary), evaluation on the training dataset is a separate step in Apache Spark. This fits nicely into the native Pipeline API.

Background:
Spark ML Pipelines are primarily built from two types of objects:

- Transformers - objects which provide a transform method, mapping a DataFrame to an updated DataFrame. You can transform using a Transformer with the ml_transform method.
- Estimators - objects which provide a fit method, mapping a DataFrame to a Transformer. By convention, corresponding Estimator / Transformer pairs are called Foo / FooModel. You can fit an Estimator in sparklyr using the ml_fit method.

Additionally, ML Pipelines can be combined with Evaluators (see the ml_*_evaluator and ml_*_eval methods), which can be used to compute different metrics on the transformed data, based on columns generated by a model (usually a probability column or raw prediction). You can apply an Evaluator using the ml_evaluate method.

Related components include cross validators and train-validation splits, which can be used for parameter tuning.
Examples:

sparklyr PipelineStages can be evaluated eagerly (as in your own code), by passing data directly, or lazily, by passing a spark_connection instance and calling the aforementioned methods (ml_fit, ml_transform, etc.).
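For instance, the same stage can be used both ways (a minimal sketch, reusing the columns from your data):

# Eager: applied to a tbl_spark, the stage is executed immediately
dtrain_spark %>%
  ft_tokenizer(input_col = "text", output_col = "tokens")

# Lazy: applied to the connection, this only returns a pipeline stage
ft_tokenizer(sc, input_col = "text", output_col = "tokens")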
This means you can define a Pipeline as follows:
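A minimal sketch, assuming the sparklyr pipeline API (version 0.7 or later) and keeping your non-default column names; the uid = "nb" is my own addition, so the stage can be addressed by name during tuning below:

pipeline <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = "tokens", output_col = "myvocab"),
  ml_naive_bayes(sc,
                 label_col = "class",
                 features_col = "myvocab",
                 prediction_col = "pcol",
                 probability_col = "prcol",
                 raw_prediction_col = "rpcol",
                 model_type = "multinomial",
                 smoothing = 0.6,
                 thresholds = c(0.2, 0.4),
                 uid = "nb")
)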
Fit the PipelineModel:
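Using the pipeline and training data defined above:

model <- ml_fit(pipeline, dtrain_spark)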
Transform, and apply one of the available Evaluators:
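For example, a binary classification evaluator applied directly to the transformed data (areaUnderROC is its default metric; the column names match the pipeline above):

ml_transform(model, dtrain_spark) %>%
  ml_binary_classification_evaluator(
    label_col = "class",
    raw_prediction_col = "rpcol",
    metric_name = "areaUnderROC")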
or
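an explicit Evaluator object passed to ml_evaluate; the multiclass evaluator and the f1 metric here are just one possible choice:

evaluator <- ml_multiclass_classification_evaluator(
  sc,
  label_col = "class",
  prediction_col = "pcol",
  metric_name = "f1")

ml_evaluate(evaluator, ml_transform(model, dtrain_spark))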
Use either ml_transform or ml_predict (the latter one is a convenience wrapper, which applies further transformations on the output):
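On your test data that could look like this (a sketch):

ml_transform(model, dtest_spark)

# or the convenience wrapper, which may additionally expand the
# probability vector into per-class columns
ml_predict(model, dtest_spark)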
Cross validation:

There is not enough data in the example, but you can cross-validate and fit hyperparameters as shown below:
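A sketch, assuming ml_cross_validator; the names in estimator_param_maps refer to stage uids (hence the hypothetical uid = "nb" set above), and the smoothing grid is arbitrary:

cv <- ml_cross_validator(
  sc,
  estimator = pipeline,
  estimator_param_maps = list(nb = list(smoothing = c(0.4, 0.6, 0.8))),
  evaluator = evaluator,
  num_folds = 3)

cv_model <- ml_fit(cv, dtrain_spark)

# One row of metrics per parameter combination
ml_validation_metrics(cv_model)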
Notes:
If you use Pipelines with Vector columns (not formula-based calls), I strongly recommend using the standardized (default) column names:

- label for the dependent variable.
- features for the assembled independent variables.
- rawPrediction, prediction, probability for the raw prediction, prediction and probability columns respectively.
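With the defaults everything lines up without any *_col arguments; a hypothetical variant of the pipeline above (it assumes the dependent variable column is named label):

ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = "tokens", output_col = "features"),
  ml_naive_bayes(sc, model_type = "multinomial")
)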