I have two datasets. The first is a large reference dataset; for each record in the second dataset, I want to find the best match in the first using the MinHash algorithm.
val dataset1 =
+-------------+----------+------+------+-----------------------+
|           x'|        y'|    a'|    b'|   dataString(x'+y'+a')|
+-------------+----------+------+------+-----------------------+
|         John|     Smith| 55649| 28200|       John|Smith|55649|
|         Emma|   Morales| 78439| 34200|     Emma|Morales|78439|
|        Janet|  Alvarado| 89488| 29103|   Janet|Alvarado|89488|
|    Elizabeth|         K| 36935| 38101|      Elizabeth|K|36935|
|      Cristin|      Cruz| 75716| 70015|     Cristin|Cruz|75716|
|         Jack|   Colello| 94552| 15609|     Jack|Colello|94552|
|     Anatolie|     Trifa| 63011| 51181|   Anatolie|Trifa|63011|
|      Jaromir|      Plch| 51237| 91798|     Jaromir|Plch|51237|
+-------------+----------+------+------+-----------------------+
// very_large
val dataset2 =
+-------------+----------+------+-----------------------+
|            x|         y|     a|      dataString(x+y+a)|
+-------------+----------+------+-----------------------+
|         John|     Smith| 28200|       John|Smith|28200|
|         Emma|   Morales| 17706|     Emma|Morales|17706|
|        Janet|  Alvarado| 98809|   Janet|Alvarado|98809|
|    Elizabeth|   Keatley| 36935|Elizabeth|Keatley|36935|
|     Cristina|      Cruz| 75716|    Cristina|Cruz|75716|
|         Jake|   Colello| 15609|     Jake|Colello|15609|
|     Anatolie|     Trifa| 63011|   Anatolie|Trifa|63011|
|         Rune|      Eide| 41907|        Rune|Eide|41907|
|    Hortensia|   Brumaru| 33836|Hortensia|Brumaru|33836|
|       Adrien|     Payet| 40463|     Adrien|Payet|40463|
|       Ashley|    Howard| 12445|    Ashley|Howard|12445|
|       Pamela|      Dean| 81311|      Pamela|Dean|81311|
|        Laura|     Calvo| 82682|      Laura|Calvo|82682|
|        Flora|   Parghel| 81206|    Flora|Parghel|81206|
|      Jaromír|      Plch| 91798|     Jaromír|Plch|91798|
+-------------+----------+------+-----------------------+
For string similarity, I created a pipe-separated (|) dataString column.
Here is the code that finds the similarity between dataString(x' + y' + a') and dataString(x + y + a); it works fine:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{CountVectorizer, MinHashLSH, MinHashLSHModel, RegexTokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Split dataString on "|" and turn the tokens into term-count vectors.
val tokenizer = new RegexTokenizer().setPattern("\\|").setInputCol("dataString").setOutputCol("dataStringWords")
val vectorizer = new CountVectorizer().setInputCol("dataStringWords").setOutputCol("features")
val pipelineTV = new Pipeline().setStages(Array(tokenizer, vectorizer))
val modelTV = pipelineTV.fit(dataset1)
// MinHashLSH rejects all-zero vectors, so drop rows without any known token.
val isNonZeroVector = udf((v: Vector) => v.numNonzeros > 0)
val dataset1_TV = modelTV.transform(dataset1).filter(isNonZeroVector(col("features")))
val dataset2_TV = modelTV.transform(dataset2).filter(isNonZeroVector(col("features")))
// Hash both datasets with the same fitted MinHash model.
val lsh = new MinHashLSH().setNumHashTables(20).setInputCol("features").setOutputCol("hashValues")
val pipelineLSH = new Pipeline().setStages(Array(lsh))
val modelLSH = pipelineLSH.fit(dataset1_TV)
val dataset1_LSH = modelLSH.transform(dataset1_TV)
val dataset2_LSH = modelLSH.transform(dataset2_TV)
// Keep only the pairs whose Jaccard distance is below 0.5.
val finalResult = modelLSH.stages.last.asInstanceOf[MinHashLSHModel].approxSimilarityJoin(dataset1_LSH, dataset2_LSH, 0.5)
finalResult.show
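For reference, the 0.5 threshold passed to approxSimilarityJoin is a Jaccard distance over the token sets. Below is a minimal plain-Scala sketch of that distance on pipe-separated strings; JaccardSketch, tokens and jaccardDistance are illustrative names and not part of Spark. It computes the exact distance on the raw tokens. The Spark pipeline can return different values, because the CountVectorizer vocabulary is fitted on dataset1 only and tokens unseen there are dropped.

```scala
object JaccardSketch {
  // Lower-case and split on "|", mirroring the RegexTokenizer settings used above.
  def tokens(s: String): Set[String] =
    s.toLowerCase.split("\\|").filter(_.nonEmpty).toSet

  // Jaccard distance = 1 - |A intersect B| / |A union B|.
  def jaccardDistance(a: String, b: String): Double = {
    val ta = tokens(a)
    val tb = tokens(b)
    val union = ta union tb
    if (union.isEmpty) 0.0
    else 1.0 - (ta intersect tb).size.toDouble / union.size
  }

  def main(args: Array[String]): Unit = {
    // Two shared tokens out of four distinct ones: 1 - 2/4 = 0.5
    println(jaccardDistance("John|Smith|55649", "John|Smith|28200"))
    // Identical strings: 0.0
    println(jaccardDistance("Anatolie|Trifa|63011", "Anatolie|Trifa|63011"))
  }
}
```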
The MinHash code above gives the expected result, but my requirement is to compare a with a' OR b', i.e.
x' + y' + (a' OR b')
x + y + (a)
I cannot simply join the two datasets because they have no common field; the join would become a cross join.
Is there any way to achieve string similarity with an OR condition on grouped data in Apache Spark 2.2.0?
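To make the requirement concrete, here is a small sketch that expands each reference row into two candidate strings, one built with a' and one with b'. It is only an illustration of the idea, not a confirmed solution: the column names x1, y1, a1, b1 are hypothetical stand-ins for x', y', a', b', the helper expand is made up for this example, and it assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, concat_ws, explode}

object OrCandidates {
  // Emits two rows per reference record: one dataString using a1, one using b1.
  // Column names are hypothetical stand-ins for x', y', a', b'.
  def expand(reference: DataFrame): DataFrame =
    reference
      .withColumn("aOrB", explode(array(col("a1"), col("b1"))))
      .withColumn("dataString", concat_ws("|", col("x1"), col("y1"), col("aOrB")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("or-candidates").getOrCreate()
    import spark.implicits._

    // Two rows taken from dataset1 in the question.
    val dataset1 = Seq(
      ("John", "Smith", "55649", "28200"),
      ("Emma", "Morales", "78439", "34200")
    ).toDF("x1", "y1", "a1", "b1")

    expand(dataset1).show(false)
    spark.stop()
  }
}
```

Under this reading, the existing tokenizer, CountVectorizer and MinHashLSH stages could be fitted on the expanded reference data, and for each dataset2 row the pair with the smallest distCol kept. Whether that is acceptable for a very large reference dataset, which doubles in row count, is an open assumption.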