most efficient way to find partial string matches

2019-02-17 05:13发布

I downloaded the Wikipedia article titles file which contains the name of every Wikipedia article. I need to search for all the article titles that may be a possible match. For example, I might have the word "hockey", but the Wikipedia article for hockey that I would want is "Ice_hockey". It should be a case-insensitive search too.

I'm using Python, and is there a more efficient way than to just do a line by line search? I'll be performing this search like 500 or a 1000 times per minute ideally. If line by line is my only option, are there some optimizations I can do within this?

I think there are several million lines in the file.

Any ideas?

Thanks.

3条回答
在下西门庆
2楼-- · 2019-02-17 06:06

I'd suggest you put your data into an sqlite database, and use the SQL 'like' operator for your searches.

查看更多
狗以群分
3楼-- · 2019-02-17 06:12

If you've got a fixed data set and variable queries, then the usual technique is to reorganise the data set into something that can be searched more easily. At an abstract level, you could break up each article title into individual lowercase words, and add each of them to a Python dictionary data structure. Then, whenever you get a query, convert the query word to lower case and look it up in the dictionary. If each dictionary entry value is a list of titles, then you can easily find all the titles that match a given query word.

This works for straightforward words, but you will have to consider whether you want to do matching on similar words, such as finding "smoking" when the query is "smoke".

查看更多
家丑人穷心不美
4楼-- · 2019-02-17 06:16

Greg's answer is good if you want to match on individual words. If you want to match on substrings you'll need something a bit more complicated, like a suffix tree (http://en.wikipedia.org/wiki/Suffix_tree). Once constructed, a suffix tree can efficiently answer queries for arbitrary substrings, so in your example it could match "Ice_Hockey" when someone searched for "hock".

查看更多
登录 后发表回答