Using an OCR tool, I extracted text from screenshots (about 1-5 sentences each). However, when manually verifying the extracted text, I noticed several errors that occur from time to time.
Given the text "Hello there
I wouldn't use Spark in the first place, but if you are really committed to the particular stack, you can combine a bunch of ml transformers to get best matches. You'll need:
- Tokenizer (or split)
- NGram (for example 3-gram)
- Vectorizer (for example CountVectorizer or HashingTF)
- LSH
Combine with Pipeline.
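To make it clearer what these stages compute, here is a minimal plain-Python sketch of the same idea, not the Spark code itself. It assumes character-level tokens, 3-grams, and an exact Jaccard distance in place of LSH (LSH only approximates such a distance so that it scales). The function names and the sample strings are hypothetical and chosen for illustration.

```python
def char_ngrams(text, n=3):
    """Tokenizer + NGram steps: lowercase, split into characters, build n-grams."""
    chars = list(text.lower())  # one token per character (assumption)
    # Set of all contiguous n-character windows; empty if the text is shorter than n.
    return {"".join(chars[i:i + n]) for i in range(len(chars) - n + 1)}


def jaccard_distance(a, b):
    """Exact Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def best_match(query, candidates, n=3):
    """Return the (candidate, distance) pair closest to the query."""
    q = char_ngrams(query, n)
    scored = [(c, jaccard_distance(q, char_ngrams(c, n))) for c in candidates]
    return min(scored, key=lambda pair: pair[1])


# Hypothetical reference texts and a hypothetical OCR result ("l" read as "1").
known_texts = ["Hello there", "General Kenobi", "You are a bold one"]
ocr_output = "Hel1o there"

match, distance = best_match(ocr_output, known_texts)
print(match, distance)  # prints: Hello there 0.5
```

A single wrong character changes three of the nine 3-grams, so the distance to the correct text is 0.5 while the unrelated texts stay at 1.0. In Spark, classes such as RegexTokenizer, NGram, HashingTF and MinHashLSH from pyspark.ml.feature could plausibly fill these roles, but that mapping is an assumption rather than something the text above spells out.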
Fit on example data: