Efficient string matching in Apache Spark

Posted 2019-01-01 00:53

Using an OCR tool I extracted texts from screenshots (about 1-5 sentences each). However, when manually verifying the extracted text, I noticed several recurring errors.

Given the text "Hello there! I really like Spark!", the OCR tool may instead produce something garbled like "Hello there 7l | real|y like Spark!". What is an efficient way to match such noisy extracted strings against a database of the correct texts in Spark?

1 Answer

ら面具成の殇う · answered 2019-01-01 01:43

I wouldn't use Spark for this in the first place, but if you are really committed to that particular stack, you can combine a number of ml transformers to get the best matches. You'll need a Tokenizer (or split):

import org.apache.spark.ml.feature.RegexTokenizer

val tokenizer = new RegexTokenizer()
  .setPattern("")
  .setInputCol("text")
  .setMinTokenLength(1)
  .setOutputCol("tokens")
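
As a quick sanity check (assuming a SparkSession available as spark, as in spark-shell, and a hypothetical sample DataFrame), the empty pattern splits each string into single characters; note that RegexTokenizer lowercases by default:

import spark.implicits._

val sample = Seq("Hello!").toDF("text")
tokenizer.transform(sample).select("tokens").show(false)
// prints a single row: [h, e, l, l, o, !]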

an NGram (for example, a 3-gram):

import org.apache.spark.ml.feature.NGram

val ngram = new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams")
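
Chaining it with the tokenizer on the same hypothetical sample shows the character 3-grams; NGram joins consecutive tokens with spaces:

ngram.transform(tokenizer.transform(sample)).select("ngrams").show(false)
// prints a single row: [h e l, e l l, l l o, l o !]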

Vectorizer (for example CountVectorizer or HashingTF):

import org.apache.spark.ml.feature.HashingTF

val vectorizer = new HashingTF().setInputCol("ngrams").setOutputCol("vectors")
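
If you'd rather build an explicit n-gram vocabulary than rely on feature hashing, CountVectorizer is a drop-in replacement over the same columns (a sketch; it gets fitted by the Pipeline like any other stage):

import org.apache.spark.ml.feature.CountVectorizer

val countVectorizer = new CountVectorizer()
  .setInputCol("ngrams")
  .setOutputCol("vectors")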

and LSH:

import org.apache.spark.ml.feature.{MinHashLSH, MinHashLSHModel}

// Increase numHashTables in practice.
val lsh = new MinHashLSH().setInputCol("vectors").setOutputCol("lsh")
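
For example, to use five hash tables instead of the default single one (five is an arbitrary illustration, not a tuned value), the definition above becomes:

val lsh = new MinHashLSH()
  .setNumHashTables(5)
  .setInputCol("vectors")
  .setOutputCol("lsh")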

Combine the stages with a Pipeline:

import org.apache.spark.ml.Pipeline

val pipeline = new Pipeline().setStages(Array(tokenizer, ngram, vectorizer, lsh))

Fit on example data:

val query = Seq("Hello there 7l | real|y like Spark!").toDF("text")
val db = Seq("Hello there! I really like Spark!").toDF("text")  // hypothetical reference texts

val model = pipeline.fit(db)