Extract keywords from text in .NET

Posted 2019-04-09 20:54

I need to count how many times each keyword occurs in a string and sort the results by count, highest first. What is the fastest algorithm available in .NET for this?

4 answers
做自己的国王
#2 · 2019-04-09 21:24

Simple algorithm: Split the string into an array of words, iterate over this array, and store the count of each word in a hash table. Sort by count when done.
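
A minimal sketch of that approach, in case it helps. The class name WordCounter, the method name CountWords, the separator list, and the case-insensitive comparison are all assumptions for illustration, not something the question specifies:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCounter
{
    // Hypothetical helper: counts every word and returns (word, count)
    // pairs sorted by count, highest first.
    public static List<KeyValuePair<string, int>> CountWords(string text)
    {
        // Assumed separator set; adjust it to the input being processed.
        char[] separators = { ' ', '\t', '\r', '\n', ',', '.', ';', ':', '!', '?', '(', ')' };

        // The hash table: word -> number of occurrences (case-insensitive by assumption).
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            counts.TryGetValue(word, out int current); // current stays 0 for a new word
            counts[word] = current + 1;
        }

        // Sort once at the end; ties are broken alphabetically so the order is stable.
        return counts.OrderByDescending(pair => pair.Value)
                     .ThenBy(pair => pair.Key, StringComparer.OrdinalIgnoreCase)
                     .ToList();
    }
}
```

Counting is a single pass over the words, so the sort over the distinct words is the only step that is more than linear.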

"骚年 ilove
#3 · 2019-04-09 21:28

EDIT: The code below groups the tokens and counts each unique one:

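string[] target = src.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

// Group identical tokens first, then count each group, instead of
// rescanning the whole array for every token.
var results = target.GroupBy(t => t)
    .Select(g => new { str = g.Key, count = g.Count() })
    .OrderByDescending(x => x.count);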

This is finally starting to make more sense to me...

EDIT: The code below pairs each target substring with its count:

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";
string[] target = {"string", "the", "in"};

var results = target.Select((t, index) => new
{
    str = t,
    count = src.Select((c, i) => src.Substring(i))
               .Count(sub => sub.StartsWith(t))
});

Results is now:

+       [0] { str = "string", count = 4 }   <Anonymous Type>
+       [1] { str = "the", count = 4 }  <Anonymous Type>
+       [2] { str = "in", count = 6 }   <Anonymous Type>

Original code below:

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";
string[] target = {"string", "the", "in"};

var results = target.Select(t => src.Select((c, i) => src.Substring(i)).
    Count(sub => sub.StartsWith(t))).OrderByDescending(t => t);

with grateful acknowledgement to this previous response.

Results from debugger (which need extra logic to include the matching string with its count):

-       results {System.Linq.OrderedEnumerable<int,int>}    
-       Results View    Expanding the Results View will enumerate the IEnumerable   
        [0] 6   int
        [1] 4   int
        [2] 4   int
一纸荒年 Trace。
#4 · 2019-04-09 21:43

Dunno about fastest, but Linq is probably the most understandable:

var myListOfKeywords = new [] {"struct", "public", ...};

var keywordCount = from keyword in myProgramText.Split(new [] {" ", "(", ...}, StringSplitOptions.RemoveEmptyEntries)
   group keyword by keyword into g
   where myListOfKeywords.Contains(g.Key)
   select new { g.Key, Count = g.Count() };

foreach(var element in keywordCount)
   Console.WriteLine(String.Format("Keyword: {0}, Count: {1}", element.Key, element.Count));

You can write this in a non-Linq-y way, but the basic premise is the same; split the string up into words, and count the occurrences of each word of interest.
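
For what it's worth, here is a rough sketch of the non-LINQ version, counting only the words of interest. The names KeywordCounter and CountKeywords, and passing the separators in as a parameter, are assumptions made for illustration:

```csharp
using System;
using System.Collections.Generic;

public static class KeywordCounter
{
    // Hypothetical helper: counts only the tokens that appear in `keywords`
    // and returns (keyword, count) pairs sorted by count, highest first.
    public static List<KeyValuePair<string, int>> CountKeywords(
        string text, IEnumerable<string> keywords, char[] separators)
    {
        // A HashSet makes the "is this a keyword?" check a hash lookup
        // rather than a scan over the keyword array.
        var wanted = new HashSet<string>(keywords, StringComparer.Ordinal);
        var counts = new Dictionary<string, int>(StringComparer.Ordinal);

        foreach (string token in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (!wanted.Contains(token))
                continue; // not a word of interest

            counts.TryGetValue(token, out int current); // current stays 0 for a new keyword
            counts[token] = current + 1;
        }

        // Copy the pairs into a list and sort it descending by count.
        // List.Sort is not stable, so the order of tied keywords is unspecified.
        var sorted = new List<KeyValuePair<string, int>>(counts);
        sorted.Sort((a, b) => b.Value.CompareTo(a.Value));
        return sorted;
    }
}
```

Keywords that never occur are simply absent from the result; if zero counts are needed, the dictionary could be seeded with every keyword before the loop.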

够拽才男人
#5 · 2019-04-09 21:48

You could break the string into a collection of strings, one for each word, and then do a LINQ query on the collection. While I doubt it would be the fastest, it would probably be faster than regex.
