Levenshtein distance in regular expression

Is there a way to include Levenshtein distance in a regular expression query?

Other than taking the union of all the variants, that is. For example, searching for "hello" with a Levenshtein distance of 1:

.ello | h.llo | he.lo | hel.o | hell.

This is clumsy and unusable for larger Levenshtein distances.

Tags: regex levenshtein-distance

2 Answers

一夜七次

#2 · 2019-01-27 19:31

Is there a way to include Levenshtein distance in a regular expression query?

No, not in a sane way. Implementing a Levenshtein distance algorithm, or using an existing one, is the way to go.
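For reference, here is a minimal sketch of what "implementing a Levenshtein distance algorithm" can look like, using the standard two-row dynamic programming approach. The function name `levenshtein` is just an illustrative choice, not something from a particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein (edit) distance between a and b."""
    # Keep the inner loop over the shorter string to save memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the first i-1 chars of a and first j chars of b
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # deleting the first i chars of a yields the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("hello", "hallo"))
    print(levenshtein("kitten", "sitting"))
```

A fuzzy search then becomes a filter such as `levenshtein(query, candidate) <= k`, which works for any distance k rather than only 1.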


We Are One

#3 · 2019-01-27 19:34

You can generate the regex programmatically. I will leave that as an exercise for the reader, but for the output of this hypothetical function (given an input of "word") you want something like this string:

"^(?>word|wodr|wrod|owrd|word.|wor.d|wo.rd|w.ord|.word|wor.?|wo.?d|w.?rd|.?ord)$"

In English, first you try to match on the word itself, then on every possible single transposition, then on every possible single insertion, then on every possible single omission or substitution (can be done simultaneously).

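The number of alternatives in that string, given a word of length n, is linear in n (3n + 1 of them, 13 for "word"), so the string itself grows roughly quadratically with n, and notably not exponentially.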

Which is reasonable, I think.

You pass this to your regex constructor (in Ruby it would be Regexp.new(str)) and bam, you've got a matcher for ANY word with a Damerau-Levenshtein distance of at most 1 from a given word.
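A rough sketch of such a generator in Python, under a few assumptions: the names `dl1_pattern` and `dl1_matcher` are made up for illustration, the word is escaped with `re.escape` so that regex metacharacters in it are taken literally, and a plain non-capturing group `(?:` with `fullmatch` is used in place of `^(?>...)$`, since Python's `re` module only supports atomic groups from version 3.11 onward.

```python
import re


def dl1_pattern(word: str) -> str:
    """Build a regex source matching strings within Damerau-Levenshtein
    distance 1 of `word` (hypothetical helper, not from any library)."""
    n = len(word)
    esc = re.escape
    # The word itself.
    alts = [esc(word)]
    # Every single transposition of two adjacent characters.
    for i in range(n - 1):
        alts.append(esc(word[:i] + word[i + 1] + word[i] + word[i + 2:]))
    # Every single insertion: one arbitrary character at any of n + 1 positions.
    for i in range(n + 1):
        alts.append(esc(word[:i]) + "." + esc(word[i:]))
    # Every single omission or substitution: character i becomes an optional wildcard.
    for i in range(n):
        alts.append(esc(word[:i]) + ".?" + esc(word[i + 1:]))
    return "(?:" + "|".join(alts) + ")"


def dl1_matcher(word: str) -> re.Pattern:
    # DOTALL lets the wildcard stand for a newline as well.
    return re.compile(dl1_pattern(word), re.DOTALL)


if __name__ == "__main__":
    matcher = dl1_matcher("word")
    for candidate in ["word", "wrod", "words", "wor", "ward", "wards"]:
        print(candidate, bool(matcher.fullmatch(candidate)))
```

Using `fullmatch` anchors the pattern at both ends, so the `^` and `$` from the string above are not needed here.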

(Damerau-Levenshtein distances of 2 are far more complicated.)

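Note the use of the (?> non-backtracking (atomic) construct, which means the order of the individual |'d expressions in that output matters. Once one alternative has matched, the others are never retried, so with "word" listed first an input like "words" is rejected: "word" matches, $ fails, and "word." is never attempted. Listing the insertion alternatives before the exact word, or using a plain (?: group instead, avoids this.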

I could not think of a way to "compact" that expression.

EDIT: I got it to work, at least in Elixir! https://github.com/pmarreck/elixir-snippets/blob/master/damerau_levenshtein_distance_1.exs

I wouldn't necessarily recommend this though (except for educational purposes) since it will only get you to distances of 1; a legit D-L library will let you compute distances > 1. Although since this is regex, it would probably work pretty fast once constructed (note that you should save the "compiled" regex somewhere since this code currently reconstructs it on EVERY comparison!)
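One way to avoid rebuilding the regex on every comparison is to memoize the compiled pattern per word. The sketch below is an assumption-laden illustration in Python rather than the linked Elixir code: `compiled_dl1` and `within_distance_1` are hypothetical names, and the cache size of 1024 is arbitrary.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=1024)  # arbitrary size; one compiled pattern is kept per distinct word
def compiled_dl1(word: str) -> re.Pattern:
    """Compile (once per word) a distance-1 Damerau-Levenshtein matcher."""
    esc = re.escape
    n = len(word)
    alts = [esc(word)]
    # Adjacent transpositions.
    alts += [esc(word[:i] + word[i + 1] + word[i] + word[i + 2:]) for i in range(n - 1)]
    # Insertions.
    alts += [esc(word[:i]) + "." + esc(word[i:]) for i in range(n + 1)]
    # Omissions or substitutions.
    alts += [esc(word[:i]) + ".?" + esc(word[i + 1:]) for i in range(n)]
    return re.compile("(?:" + "|".join(alts) + ")", re.DOTALL)


def within_distance_1(word: str, candidate: str) -> bool:
    # Repeated calls with the same word reuse the cached compiled pattern.
    return compiled_dl1(word).fullmatch(candidate) is not None


if __name__ == "__main__":
    for candidate in ["hello", "helo", "hlelo", "help"]:
        print(candidate, within_distance_1("hello", candidate))
    print(compiled_dl1.cache_info())
```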

