How to efficiently match an input string against s

How would one efficiently match one input string against any number of regular expressions?

One scenario where this might be useful is with REST web services. Let's assume that I have come up with a number of URL patterns for a REST web service's public interface:

/user/with-id/{userId}
/user/with-id/{userId}/profile
/user/with-id/{userId}/preferences
/users
/users/who-signed-up-on/{date}
/users/who-signed-up-between/{fromDate}/and/{toDate}
…

where {…} are named placeholders (like regular expression capturing groups).
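
To make the placeholder idea concrete, here is a minimal Python sketch of turning one such pattern into a regular expression. It assumes each {name} stands for exactly one non-empty path segment and that placeholder names are valid group names; the helper name template_to_regex is made up for illustration.

```python
import re

def template_to_regex(template: str) -> re.Pattern:
    """Turn '/user/with-id/{userId}' into a compiled regex with named groups."""
    # re.split with one capturing group yields literal text at even indexes
    # and placeholder names at odd indexes.
    pieces = re.split(r"\{(\w+)\}", template)
    regex = ""
    for index, piece in enumerate(pieces):
        if index % 2 == 0:
            regex += re.escape(piece)          # literal text, matched verbatim
        else:
            regex += f"(?P<{piece}>[^/]+)"     # assumption: one non-empty segment
    return re.compile(regex)

pattern = template_to_regex("/users/who-signed-up-between/{fromDate}/and/{toDate}")
match = pattern.fullmatch("/users/who-signed-up-between/2019-01-01/and/2019-03-19")
print(match.groupdict() if match else None)
```

Using fullmatch rather than match keeps /user/with-id/{userId} from also accepting /user/with-id/42/profile.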

Note: This question is not about whether the above REST interface is well designed. (It probably isn't, but that shouldn't matter in the context of this question.)

It may be assumed that placeholders usually do not appear at the very beginning of a pattern (but they could). It can also be safely assumed that it is impossible for any string to match more than one pattern.

Now the web service receives a request. Of course, one could sequentially match the requested URI against one URL pattern, then against the next one, and so on; but that probably won't scale well for a larger number of patterns that must be checked.

Are there any efficient algorithms for this?

Inputs:

An input string
A set of "mutually exclusive" regular expressions (i.e. no input string may match more than one expression)

Output:

The regular expression (if any) that the input string matched against.

Tags: regex rest pattern-matching

5 answers

唯我独甜

#2 · 2019-03-19 04:06

The Aho-Corasick algorithm is a very fast way to match an input string against a set of patterns (strictly speaking, literal keywords). The keywords are preprocessed and organized in a trie to speed up matching.

There are variations of the algorithm that support regex patterns (e.g. http://code.google.com/p/esmre/, to name just one) and are probably worth a look.

Or you could split the URL patterns into chunks at each "/", organize the chunks in a tree, then split the URL to be matched the same way and walk the tree one chunk at a time. A placeholder such as {userId} can be treated as a wildcard, or be required to match some specific format (e.g. be an int).

When you reach a leaf, you know which URL pattern you matched.
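
The tree walk described above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: chunks are the "/"-separated segments, a {placeholder} matches any single non-empty segment, and the names Node and Router are invented here. A pattern may also end at an inner node (/users next to /users/who-signed-up-on/{date}), so each node simply records the pattern that ends there.

```python
from typing import Dict, List, Optional, Tuple

class Node:
    def __init__(self) -> None:
        self.literal: Dict[str, "Node"] = {}     # exact segment -> child node
        self.wildcard: Optional["Node"] = None   # child reached via a {placeholder}
        self.param: Optional[str] = None         # placeholder name on that edge
        self.template: Optional[str] = None      # pattern that ends at this node

class Router:
    def __init__(self) -> None:
        self.root = Node()

    def add(self, template: str) -> None:
        node = self.root
        for seg in template.strip("/").split("/"):
            if seg.startswith("{") and seg.endswith("}"):
                if node.wildcard is None:
                    node.wildcard = Node()
                    node.param = seg[1:-1]       # simplification: first name wins
                node = node.wildcard
            else:
                node = node.literal.setdefault(seg, Node())
        node.template = template

    def match(self, path: str) -> Optional[Tuple[str, Dict[str, str]]]:
        return self._walk(self.root, path.strip("/").split("/"), 0, {})

    def _walk(self, node: Node, segs: List[str], i: int,
              params: Dict[str, str]) -> Optional[Tuple[str, Dict[str, str]]]:
        if i == len(segs):
            return (node.template, params) if node.template else None
        seg = segs[i]
        child = node.literal.get(seg)
        if child is not None:                    # try the exact segment first
            found = self._walk(child, segs, i + 1, params)
            if found:
                return found
        if node.wildcard is not None and seg:    # then fall back to the wildcard
            return self._walk(node.wildcard, segs, i + 1,
                              {**params, node.param: seg})
        return None

router = Router()
for t in ["/user/with-id/{userId}", "/user/with-id/{userId}/profile", "/users",
          "/users/who-signed-up-between/{fromDate}/and/{toDate}"]:
    router.add(t)

print(router.match("/user/with-id/42/profile"))
```

The work per request depends on the number of segments in the URL rather than on the number of registered patterns, as long as wildcard fallbacks are rare.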


混吃等死

#3 · 2019-03-19 04:08

At first I thought there was no good optimization for this process.

However, if you have a really large number of regexes, you might want to partition them (I'm not sure whether this is technically partitioning).

What I suggest is:

Suppose that you have 20 possible urls that start with user:

/user/with-id/X
/user/with-id/X/preferences # instead of preferences, you could have another 10 possibilities like /friends, /history, etc

Then, you also have 20 possible urls starting with users:

/users/who-signed-up-on
/users/who-signed-up-on-between     #others: /registered-for, /i-might-like, etc

And the list goes on for /products, /companies, etc instead of users.

What you could do in this case is use "multi-level" matching.

First, match the start of the string. You'd be matching for /products, /companies, /users, one at a time and ignoring the rest of the string. This way, you don't have to test all the 100 possibilities.

After you know the url starts with /users, you can match only the possible urls that start with users.

This way, you avoid a lot of unneeded matches: a URL that starts with /users is never tested against any of the /products possibilities.
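
A rough Python sketch of this two-level idea: bucket the regexes by their first path segment, then only try the regexes in the matching bucket. It assumes every pattern starts with a literal segment; the /products pattern and the helper names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

PATTERNS = [
    r"/user/with-id/(?P<userId>[^/]+)",
    r"/user/with-id/(?P<userId>[^/]+)/preferences",
    r"/users/who-signed-up-on/(?P<date>[^/]+)",
    r"/products/(?P<productId>[^/]+)",           # hypothetical extra resource
]

def first_segment(path: str) -> str:
    """Return the text between the first and second '/'."""
    return path.lstrip("/").split("/", 1)[0]

# Level 1: group the compiled regexes by their literal first segment.
buckets: Dict[str, List[re.Pattern]] = defaultdict(list)
for source in PATTERNS:
    buckets[first_segment(source)].append(re.compile(source))

def match(path: str) -> Optional[Tuple[str, Dict[str, str]]]:
    # Level 2: test only the regexes that share the URL's first segment.
    for regex in buckets.get(first_segment(path), []):
        m = regex.fullmatch(path)
        if m:
            return regex.pattern, m.groupdict()
    return None

print(match("/user/with-id/7/preferences"))
```

Here the first level is a dictionary lookup instead of matching /products, /companies, /users one at a time, which is a small variation on the description above.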


We Are One

#4 · 2019-03-19 04:12

Use named expressions and the OR operator, i.e. "(?P<re1>...)|(?P<re2>...)|...".
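
In Python, whose re module uses this (?P<name>...) syntax, that might look like the sketch below. The route names are made up, and because group names must be unique within one pattern, the inner placeholder groups carry a hypothetical "route__" prefix that is stripped again after matching.

```python
import re
from typing import Dict, Optional, Tuple

# Route name -> regex. Inner groups are prefixed with "<route>__" (an
# illustrative convention) so no group name repeats in the combined pattern.
ROUTES = {
    "user": r"/user/with-id/(?P<user__userId>[^/]+)",
    "user_profile": r"/user/with-id/(?P<user_profile__userId>[^/]+)/profile",
    "users": r"/users",
    "signed_up_on": r"/users/who-signed-up-on/(?P<signed_up_on__date>[^/]+)",
}

COMBINED = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in ROUTES.items()))

def match(path: str) -> Optional[Tuple[str, Dict[str, str]]]:
    m = COMBINED.fullmatch(path)
    if m is None:
        return None
    route = m.lastgroup              # the outer group closes last, so this is the route
    prefix = route + "__"
    params = {key[len(prefix):]: value
              for key, value in m.groupdict().items()
              if key.startswith(prefix) and value is not None}
    return route, params

print(match("/user/with-id/42/profile"))
```

Note that Python's re is a backtracking engine, so the alternatives are still tried one after another internally; the gain is a single call and a single compiled object, not a guaranteed algorithmic speedup.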


chillily

#5 · 2019-03-19 04:14

The standard solution for matching multiple regular expressions against an input stream is a lexer generator such as Flex (there are lots of these available, typically several for each programming language).

These tools take a set of regular expressions associated with "tokens" (think of tokens as just names for whatever a regular expression matches) and generate an efficient finite-state automaton that matches all the regexes at the same time. Matching takes time linear in the size of the input stream, with a very small constant; it is hard to ask for "faster" than this. You feed it a character stream, and it emits the token name of the regex that matches "best" (this handles the case where two regexes can match the same string; see the lexer generator's documentation for how "best" is defined), and advances the stream past what was recognized. So you can apply it again and again to match the input stream as a series of tokens.

Different lexer generators let you capture different bits of the recognized stream in different ways, so after recognizing a token you can pick out the part you care about (e.g., for a literal string in quotes, you only care about the string content, not the quotes).


做个烂人

#6 · 2019-03-19 04:29

If there is a hierarchy in the URL structure, it should be used to maximize performance. Only a URL that starts with /user/ can match any of the first three patterns, and so on.

I suggest storing the patterns in a tree that mirrors the URL hierarchy, where each node matches one level of the hierarchy. To match a URL, first test it against the roots of the tree, which here are just the nodes with regexes for "user" and "users". A URL that matches a node is then tested against that node's children, and so on, until a match is found in a leaf node. A successful match can be returned as the list of nodes from the root to the leaf. Named groups with property values such as {user-id} can be fetched from the nodes of the successful match.

