preg_replace accounting for similar texts

Posted 2019-08-14 04:19

Question:

I'm trying to run a preg_replace whose pattern also matches similar text, not only identical text. My goal is to remove a given string from text produced by OCR software, where some letters may have been misread.

Here is a code example:

$ocr = 'Appartamento sito in Vioolo San Vincenzo, n.4 e censito al';
preg_replace('#\bVicolo San Vincenzo[, ]+([0-9]+|n[\.]? ?[0-9]+)?\b#', '<removed text>', $ocr);

NB: the OCR confused the third letter, a c, with an o.

Improving the OCR is not an option here.

Input string:

Appartamento sito in Vioolo San Vincenzo, n.4 e censito al

Expected result after the above call to preg_replace:

Appartamento sito in <removed text> e censito al

Actual result:

Appartamento sito in Vioolo San Vincenzo, n.4 e censito al

Texts should be considered similar in the sense of PHP functions like levenshtein() or similar_text(). I'm not considering soundex() or metaphone(), as the texts aren't in English.
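To make that notion of similarity concrete, here is a small sketch comparing the intended street name with the OCR output from the example above. The variable names are only illustrative. Both functions compare bytes, so multibyte characters would count as more than one position.

```php
<?php
// Intended text and the OCR misreading from the example above.
$expected = 'Vicolo San Vincenzo';
$scanned  = 'Vioolo San Vincenzo';

// levenshtein() returns the number of single-character edits
// (insertions, deletions, substitutions) between the two strings.
$distance = levenshtein($expected, $scanned);

// similar_text() returns the count of matching characters and,
// through its third argument, a similarity percentage.
$common = similar_text($expected, $scanned, $percent);

printf("distance=%d common=%d percent=%.1f\n", $distance, $common, $percent);
```

One substituted letter gives an edit distance of 1, which suggests that a small distance threshold could separate OCR noise from genuinely different strings.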

Using preg_replace is not mandatory, but I need at least the ability to evaluate strings against something equivalent to that pattern.
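One possible direction, shown only as a rough sketch and not as a drop-in equivalent of the pattern above: slide a window the length of the literal part over the text, compare each window with levenshtein(), and then use a regular expression only for the variable house-number suffix. The function name fuzzy_remove, its parameters, and the tolerance of 2 edits are assumptions made for illustration. The suffix regex is a simplified version of the one in the question.

```php
<?php
// Hypothetical helper: replaces the closest fuzzy occurrence of $needle,
// plus an optional house-number suffix, with $replacement.
// $maxDistance is an assumed tolerance, not a recommended value.
function fuzzy_remove(string $text, string $needle, string $replacement, int $maxDistance = 2): string
{
    $len = strlen($needle);
    $bestStart = null;
    $bestDistance = PHP_INT_MAX;

    for ($i = 0; $i + $len <= strlen($text); $i++) {
        // Rough stand-in for the leading \b: do not start inside a word.
        if ($i > 0 && ctype_alnum($text[$i - 1])) {
            continue;
        }
        $distance = levenshtein($needle, substr($text, $i, $len));
        if ($distance <= $maxDistance && $distance < $bestDistance) {
            $bestStart = $i;
            $bestDistance = $distance;
        }
    }

    if ($bestStart === null) {
        return $text; // nothing similar enough was found
    }

    // Exact regex for the part that follows the fuzzy hit (simplified).
    $tail = substr($text, $bestStart + $len);
    $suffixLen = 0;
    if (preg_match('#^[, ]+(?:n\.? ?)?[0-9]+\b#', $tail, $m)) {
        $suffixLen = strlen($m[0]);
    }

    return substr($text, 0, $bestStart)
        . $replacement
        . substr($text, $bestStart + $len + $suffixLen);
}

$ocr = 'Appartamento sito in Vioolo San Vincenzo, n.4 e censito al';
echo fuzzy_remove($ocr, 'Vicolo San Vincenzo', '<removed text>'), "\n";
```

The fixed-length window only tolerates substitutions well. OCR errors that insert or drop characters would need windows of several lengths, and only the single best match is replaced.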