I was trying to create and save a Pipeline with custom stages. I need to add a column to my DataFrame by using a UDF. Therefore, I was wondering if it was possible to convert a UDF or a similar action into a Transformer?
My custom UDF looks like this, and I'd like to learn how to do it using a UDF as a custom Transformer.
def getFeatures(n: String) = {
  val NUMBER_FEATURES = 4
  val name = n.split(" +")(0).toLowerCase
  (1 to NUMBER_FEATURES)
    .filter(size => size <= name.length)
    .map(size => name.substring(name.length - size))
}

val tokenizeUDF = sqlContext.udf.register("tokenize", (name: String) => getFeatures(name))
I initially tried to extend the Transformer and UnaryTransformer abstract classes, but ran into trouble because my application could not access DefaultParamsWritable.

As an example that may be relevant to your problem, I created a simple term normalizer as a UDF, following along from this example. My goal is to match terms against patterns and sets and replace them with generic terms. For example:
This is the class:
I use it like this:
Now that I read the question a little closer, it sounds like you're asking how to avoid doing it this way lol. Anyways, I'll still post it in case someone in the future is looking for an easy way to apply transformer-like functionality.
If you wish to make the transformer writable as well, you can re-implement traits such as HasInputCol from the sharedParams library in a public package of your choice, and then use them together with the DefaultParamsWritable trait to make the transformer persistable.
This way you avoid having to place part of your code inside the Spark ML packages, at the cost of maintaining a parallel set of params in your own package. This isn't really a problem, given that they hardly ever change.
But do track the bug in their JIRA board here, which asks for some of the common sharedParams to be made public instead of private to the ml package, so that people can use them directly from outside classes.
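A minimal sketch of that approach, assuming Spark 2.0 or later (where, as far as I know, DefaultParamsWritable and DefaultParamsReadable are public). The traits below are my own re-implementations, not Spark's private ones, and SuffixTokenizer is a hypothetical name; in a real project the traits would live in a package of your own.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap, Params}
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Re-implemented shared params (assumption: a parallel copy kept in your own package).
trait HasInputCol extends Params {
  final val inputCol: Param[String] = new Param[String](this, "inputCol", "input column name")
  final def getInputCol: String = $(inputCol)
}

trait HasOutputCol extends Params {
  final val outputCol: Param[String] = new Param[String](this, "outputCol", "output column name")
  final def getOutputCol: String = $(outputCol)
}

// Hypothetical transformer wrapping the question's getFeatures logic.
class SuffixTokenizer(override val uid: String)
    extends Transformer with HasInputCol with HasOutputCol with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("suffixTokenizer"))

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  // Adds the output column by applying the function as a UDF.
  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema)
    val featurize = udf((n: String) => SuffixTokenizer.getFeatures(n))
    dataset.withColumn($(outputCol), featurize(col($(inputCol))))
  }

  // Checks the input column type and appends the output field.
  override def transformSchema(schema: StructType): StructType = {
    require(schema($(inputCol)).dataType == StringType,
      s"Column ${$(inputCol)} must be of type StringType")
    schema.add(StructField($(outputCol), ArrayType(StringType, containsNull = true), nullable = true))
  }

  override def copy(extra: ParamMap): SuffixTokenizer = defaultCopy(extra)
}

// The companion provides read/load, so the stage can be restored after saving.
object SuffixTokenizer extends DefaultParamsReadable[SuffixTokenizer] {
  val NumberFeatures = 4

  // Same logic as the question's getFeatures: suffixes of the first token.
  def getFeatures(n: String): Seq[String] = {
    val name = n.split(" +")(0).toLowerCase
    (1 to NumberFeatures)
      .filter(size => size <= name.length)
      .map(size => name.substring(name.length - size))
  }
}
```

With this in place, something like new SuffixTokenizer().setInputCol("name").setOutputCol("features") can be used as a Pipeline stage, saved with write.overwrite().save(path) and restored with SuffixTokenizer.load(path). Note that the class has to be a compiled, top-level class with a constructor taking the uid string for loading to work.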
It is not a fully featured solution, but you can start with something like this:
Quick check:
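A minimal sketch of both pieces, the transformer and a quick check, assuming Spark 2.0 or later and a local SparkSession. NGramTokenizer and QuickCheck are hypothetical names chosen for illustration; the function body is the getFeatures from the question.

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

// Hypothetical name: wraps the plain function (not a registered UDF) in a UnaryTransformer.
class NGramTokenizer(override val uid: String)
    extends UnaryTransformer[String, Seq[String], NGramTokenizer] {

  def this() = this(Identifiable.randomUID("ngramtokenizer"))

  // The function applied to every value of the input column.
  override protected def createTransformFunc: String => Seq[String] =
    (n: String) => NGramTokenizer.getFeatures(n)

  // Rejects input columns that are not strings.
  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Expected StringType but got $inputType")

  // Type of the generated output column.
  override protected def outputDataType: DataType =
    ArrayType(StringType, containsNull = true)
}

object NGramTokenizer {
  val NumberFeatures = 4

  // Suffixes (up to NumberFeatures characters) of the lower-cased first token.
  def getFeatures(n: String): Seq[String] = {
    val name = n.split(" +")(0).toLowerCase
    (1 to NumberFeatures)
      .filter(size => size <= name.length)
      .map(size => name.substring(name.length - size))
  }
}

object QuickCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("quick-check").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "Foo Bar"), (2, "Ab")).toDF("id", "name")
    val tokenizer = new NGramTokenizer().setInputCol("name").setOutputCol("features")

    // Adds a "features" column holding the suffixes of the first token.
    tokenizer.transform(df).show(false)
    spark.stop()
  }
}
```

For the two sample rows, the features column should hold o, oo, foo for "Foo Bar" and b, ab for "Ab". As written, this sketch is not persistable; it only shows how to wrap the function.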
You can even try to generalize it to something like this:
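One way such a generalization could look, as a sketch only: a transformer parameterized by the function it wraps. UnaryUDFTransformer is a hypothetical name, and the sketch leans on ScalaReflection.schemaFor from the catalyst package, which is an internal Spark API and may change between versions.

```scala
import scala.reflect.runtime.universe.TypeTag

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.sql.catalyst.ScalaReflection.schemaFor
import org.apache.spark.sql.types.DataType

// Hypothetical generic wrapper: turns any T => U into a UnaryTransformer.
// The uid is passed explicitly to keep the constructor simple.
class UnaryUDFTransformer[T: TypeTag, U: TypeTag](override val uid: String, f: T => U)
    extends UnaryTransformer[T, U, UnaryUDFTransformer[T, U]] {

  // The wrapped function is applied to every value of the input column.
  override protected def createTransformFunc: T => U = f

  // Derives the expected input type from T (internal catalyst API, assumption).
  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == schemaFor[T].dataType,
      s"Expected ${schemaFor[T].dataType} but got $inputType")

  // Derives the output column type from U.
  override protected def outputDataType: DataType = schemaFor[U].dataType
}
```

A usage example under the same assumptions: new UnaryUDFTransformer[String, Int]("strlen", (s: String) => s.length).setInputCol("name").setOutputCol("len"). The strict equality check on the input type may be too restrictive for array or struct columns, where nullability flags can differ, and since the function is not a Param this version cannot be saved with the default params writer.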
If you want to use the UDF itself rather than the wrapped function, you'll have to extend Transformer directly and override the transform method. Unfortunately, the majority of the useful classes are private, so it can be rather tricky.
Alternatively, you can register the UDF and use it through SQLTransformer.
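A minimal sketch of the SQLTransformer route, assuming Spark 2.0 or later with a SparkSession (the question uses the older sqlContext). The object name, the column names and the UDF name "tokenize" are illustrative choices; __THIS__ is the placeholder SQLTransformer uses for the input dataset.

```scala
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.sql.SparkSession

object SqlTransformerSketch {
  val NumberFeatures = 4

  // Same logic as the question's getFeatures.
  def getFeatures(n: String): Seq[String] = {
    val name = n.split(" +")(0).toLowerCase
    (1 to NumberFeatures)
      .filter(size => size <= name.length)
      .map(size => name.substring(name.length - size))
  }

  // Registers the UDF on the session and returns a transformer that calls it by name.
  def buildTokenizer(spark: SparkSession): SQLTransformer = {
    spark.udf.register("tokenize", (name: String) => getFeatures(name))
    new SQLTransformer()
      .setStatement("SELECT *, tokenize(name) AS features FROM __THIS__")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sql-transformer").getOrCreate()
    import spark.implicits._

    val df = Seq("John Smith", "Al").toDF("name")
    buildTokenizer(spark).transform(df).show(false)
    spark.stop()
  }
}
```

SQLTransformer only stores the SQL statement, so a saved Pipeline containing it still needs the UDF to be registered again, under the same name, in whatever session loads and runs it.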