High performance “contains” search in list of stri

I have a list of approx. 500,000 strings, each approx. 100 characters long. Given a search term, I want to identify all strings in the list that contain the search term. At the moment I am doing this with a plain old dataset using the Select method ("MATCH %term%"). This takes about 600ms on my laptop. I'd like to make it faster, maybe 100-200ms.

What would be a recommended approach?

Performance is critical so I can trade memory footprint for better performance if necessary (within reason). The list of strings will not change once initialised so calculating hashes would also be an option.

Does anyone have a recommendation and which C# data structures are best suited to the task?

标签： c# string search sorting contains

7条回答

▲ chillily

2楼-- · 2019-01-14 03:57

A trie or suffix tree would help in making this faster - this is essentially what fulltext search (usually) is using.

There are implementations in C# you can use, also see this SO thread: Looking for the suffix tree implementation in C#?

Also as mentioned by @leppie parallel execution will likely be already provide you with the x3 performance gain you are looking for. But then again you will have to measure closely, without that it's anyone's guess.

0人赞添加讨论(0) 举报

戒情不戒烟

3楼-- · 2019-01-14 03:57

Have you tried loading your strings into a List<string> and then using the Linq extensions Contains method?

var myList = new List<string>();
//Code to load your list goes here...

var searchTerm = "find this";
var match = myList.Contains(searchTerm);

0人赞添加讨论(0) 举报

乱世女痞

4楼-- · 2019-01-14 03:59

According to these benchmarks, the fastest way to check if a string occurs in a string is the following:

for (int x = 0; x < ss.Length; x++)
    for (int y = 0; y < sf.Length; y++
        c[y] += ((ss[x].Length - ss[x].Replace(sf[y], String.Empty).Length) / sf[y].Length > 0 ? 1 : 0);

Thus, you could:

Loop through the list using a Parallel.For construct
Implement the code above to check if a string contains what you're looking for. "SS" is the string[] of strings to search; "SF" is the string[] of strings to search for; c[y] is the total count of each one found.

Obviously you'd have to adapt them to your List[string] (or whatever data structure you're using).

0人赞添加讨论(0) 举报

唯我独甜

5楼-- · 2019-01-14 04:04

I've heard good things about Lucene.NET when it comes to performing quick full-text searches. They've done the work to figure out the fastest data structures and such to use. I'd suggest giving that a shot.

Otherwise, you might just try something like this:

var matches = list.AsParallel().Where(s => s.Contains(searchTerm)).ToList();

But it probably won't get you down to 100ms.

0人赞添加讨论(0) 举报

Deceive 欺骗

6楼-- · 2019-01-14 04:05

You should try to use Dictionary class. It's much faster than List because it's an indexed search.

Dictionary<String, String> ldapDocument = new Dictionary<String, String>();
//load your list here
//Sample -> ldapDocument.Add("014548787","014548787");
var match = ldapDocument.ContainsKey(stringToMatch);

0人赞添加讨论(0) 举报

我只想做你的唯一

7楼-- · 2019-01-14 04:11

Have you tried the following?

list.FindAll(x => x.Contains("YourTerm")).ToList();

For some reason the List.AsParallel().Where(...) is slower than list.FindAll(...) on my PC.

list.AsParallel().Where(x => x.Contains("YourTerm")).ToList();

Hope this will help you.

0人赞添加讨论(0) 举报

1 2 下一页

High performance “contains” search in list of stri

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间