Followup on answer to an earlier question.
Is there a way to further reduce this, avoiding the external String.Split
call? The goal is an associative container of {token, count}.
string src = "for each character in the string, take the rest of the " +
"string starting from that character " +
"as a substring; count it if it starts with the target string";
string[] target = src.Split(new char[] { ' ' });
var results = target.GroupBy(t => new
{
str = t,
count = target.Count(sub => sub.Equals(t))
});
As you have it right now, it will work (to some extent) but is terribly inefficient. As written, the result is an enumeration of groupings, not the (word, count) pairs you might be expecting.
That overload of `GroupBy()` takes a function to select the key, so you are effectively performing that count for every item in the collection. Without going the route of using regular expressions to ignore punctuation, it should be written like so:

While 3-4 times slower, the Regex method is arguably more accurate:
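Both variants can be sketched roughly as follows. This is a minimal illustration, not the original benchmarked code: the `\w+` pattern and the variable names `bySplit` and `byRegex` are assumptions chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

string src = "for each character in the string, take the rest of the " +
             "string starting from that character " +
             "as a substring; count it if it starts with the target string";

// Split-based: group identical tokens, then count each group exactly once.
Dictionary<string, int> bySplit = src
    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
    .GroupBy(word => word)
    .ToDictionary(g => g.Key, g => g.Count());

// Regex-based: \w+ (an assumed pattern) matches runs of word characters,
// so trailing punctuation never becomes part of a token.
Dictionary<string, int> byRegex = Regex.Matches(src, @"\w+")
    .Cast<Match>()
    .GroupBy(m => m.Value)
    .ToDictionary(g => g.Key, g => g.Count());

Console.WriteLine(bySplit["string"]);  // 2 ("string," is a separate key here)
Console.WriteLine(byRegex["string"]);  // 3
```

The key selector here is just the token itself, so the counting happens once per group rather than once per element.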
For instance, the Regex method won't count `string` and `string,` as two separate entries, and will correctly tokenise `substring` instead of `substring;`.

EDIT: Read your previous question and realise my code doesn't quite conform to your spec. Regardless, it still demonstrates the advantage/cost of using Regex.
Here's a LINQ version without `ToDictionary()`, which may add unnecessary overhead depending on your needs...

Or in query syntax...
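A sketch of what both forms might look like, assuming a plain space split and an anonymous type with hypothetical property names `Word` and `Count`:

```csharp
using System;
using System.Linq;

string src = "for each character in the string, take the rest of the " +
             "string starting from that character " +
             "as a substring; count it if it starts with the target string";

string[] words = src.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

// Method syntax: project each group to a (Word, Count) pair.
// Nothing is materialised until the query is enumerated.
var counts = words
    .GroupBy(w => w)
    .Select(g => new { Word = g.Key, Count = g.Count() });

// Query syntax: the same query written as a query expression.
var countsQuery =
    from w in words
    group w by w into g
    select new { Word = g.Key, Count = g.Count() };

foreach (var entry in counts)
    Console.WriteLine($"{entry.Word}: {entry.Count}");
```

Because the result stays a lazy sequence, no dictionary is built unless the caller asks for one.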
Getting rid of `String.Split` doesn't leave many options on the table. One option is `Regex.Matches`, as spender demonstrated, and another is `Regex.Split` (which doesn't give us anything new).

Rather than grouping you could use either of these approaches:
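Two non-grouping variants might look like the sketch below. The separator set and the names `pairs` and `lookup` are assumptions for illustration; the idea is simply to take the distinct words and count the occurrences of each.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string src = "for each character in the string, take the rest of the " +
             "string starting from that character " +
             "as a substring; count it if it starts with the target string";

// Assumed separator set: space plus a few punctuation characters.
char[] separators = { ' ', ',', ';', '.' };
string[] words = src.Split(separators, StringSplitOptions.RemoveEmptyEntries);

// Variant 1: one anonymous (Word, Count) pair per distinct word.
var pairs = words
    .Distinct()
    .Select(w => new { Word = w, Count = words.Count(x => x == w) })
    .ToList();

// Variant 2: the same idea materialised as a dictionary.
// Without Distinct(), ToDictionary would throw on the first repeated key.
Dictionary<string, int> lookup = words
    .Distinct()
    .ToDictionary(w => w, w => words.Count(x => x == w));

Console.WriteLine(lookup["string"]);  // 3
```

Note that each distinct word triggers a full scan of `words`, so this trades the grouping step for repeated counting.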
The `Distinct` call is needed to avoid duplicate items. I went ahead and expanded the characters to split on to get the actual words devoid of punctuation. I found the first approach to be the quickest using spender's benchmarking code.

Back to the requirement to order the results from your previously referenced question, you could easily extend the first approach as follows:
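A possible shape for that extension, assuming the wanted order is by descending count (the alphabetical tie-break and the separator set are further assumptions, not taken from the earlier question):

```csharp
using System;
using System.Linq;

string src = "for each character in the string, take the rest of the " +
             "string starting from that character " +
             "as a substring; count it if it starts with the target string";

// Assumed separator set: space plus a few punctuation characters.
char[] separators = { ' ', ',', ';', '.' };
string[] words = src.Split(separators, StringSplitOptions.RemoveEmptyEntries);

// Distinct words with their counts, most frequent first;
// ties are broken alphabetically (an assumption for a stable order).
var ordered = words
    .Distinct()
    .Select(w => new { Word = w, Count = words.Count(x => x == w) })
    .OrderByDescending(p => p.Count)
    .ThenBy(p => p.Word, StringComparer.Ordinal)
    .ToList();

foreach (var p in ordered)
    Console.WriteLine($"{p.Word}: {p.Count}");
```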
EDIT: got rid of the Tuple since the anonymous type was close at hand.