I'm trying to do a preg_replace that tolerates small differences between the pattern and the text. My goal is to remove a given string from text produced by OCR software, where some letters may have been confused.
Let's give a code example:
$ocr = 'Appartamento sito in Vioolo San Vincenzo, n.4 e censito al';
echo preg_replace('#\bVicolo San Vincenzo[, ]+([0-9]+|n[\.]? ?[0-9]+)?\b#', '<removed text>', $ocr);
NB: the OCR confused the third letter, a c, with an o.
Improving the OCR is not an option here.
Input string:
Appartamento sito in Vioolo San Vincenzo, n.4 e censito al
Expected result after the above call to preg_replace:
Appartamento sito in <removed text> e censito al
Actual result:
Appartamento sito in Vioolo San Vincenzo, n.4 e censito al
Texts should be considered similar in the sense of PHP functions like levenshtein() and similar_text() (I'm not considering soundex() or metaphone(), as the texts aren't in English).
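To illustrate what I mean by "similar", here is a minimal sketch comparing the street name from the pattern with what the OCR produced. The variable names are only for illustration, and this compares two fixed strings; it does not cover the regex part of the pattern (the optional house number).

```php
<?php
// Illustrative names only: $expected is the literal from the pattern,
// $scanned is the same fragment as the OCR returned it.
$expected = 'Vicolo San Vincenzo';
$scanned  = 'Vioolo San Vincenzo';

// levenshtein() counts the single-character insertions, deletions
// and substitutions needed to turn one string into the other.
$distance = levenshtein($expected, $scanned);

// similar_text() returns the number of matching characters and
// writes a similarity percentage into its third argument.
$matching = similar_text($expected, $scanned, $percent);

echo $distance, PHP_EOL; // 1: the single c -> o substitution
echo round($percent, 1), PHP_EOL;
```

A distance this small relative to the string length is what I would like to count as a match.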
Using preg_replace is not mandatory, but I need at least the ability to evaluate strings against something equivalent to that pattern.