I'm trying to read a large text file and output the distinct words in it along with their counts. I've tried a couple of approaches so far, and this is by far the fastest solution I have come up with.
private static readonly char[] separators = { ' ' };

public IDictionary<string, int> Parse(string path)
{
    var wordCount = new Dictionary<string, int>();

    using (var fileStream = File.Open(path, FileMode.Open, FileAccess.Read))
    using (var streamReader = new StreamReader(fileStream))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            var words = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);

            foreach (var word in words)
            {
                if (wordCount.ContainsKey(word))
                {
                    wordCount[word] = wordCount[word] + 1;
                }
                else
                {
                    wordCount.Add(word, 1);
                }
            }
        }
    }

    return wordCount;
}
How I am measuring my solution
I have a 200 MB text file for which I know the total word count (via a text editor). I'm using the Stopwatch class and counting the words to ensure accuracy while measuring the time taken. So far, it takes around 9 seconds.
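A rough sketch of the harness (WordParser and the file path are just placeholders for my actual class and test file):

// Rough sketch of the measurement harness (WordParser and the path are placeholders).
// Requires System and System.Diagnostics.
var stopwatch = Stopwatch.StartNew();
var wordCount = new WordParser().Parse(@"C:\data\bigfile.txt");
stopwatch.Stop();

long totalWords = 0;
foreach (var pair in wordCount)
{
    totalWords += pair.Value;
}

Console.WriteLine("Words: {0}, elapsed: {1} ms", totalWords, stopwatch.ElapsedMilliseconds);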
Other attempts
- I have tried to utilise multithreading to split out the work via the TPL: batching multiple lines, sending each batch of lines to a separate task, and locking the read/write operations on the dictionary. This, however, did not provide any performance improvement.
- It took around 30 seconds. I suspect the locking around dictionary reads/writes is too costly to gain any performance.
- I also had a look at the ConcurrentDictionary type, but, from my understanding, the AddOrUpdate method requires the calling code to handle the synchronization, and it brought no performance benefit (a sketch of this kind of attempt follows below).
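A simplified sketch of that kind of attempt (not the exact code I ran):

// Simplified sketch: ConcurrentDictionary with AddOrUpdate driven by Parallel.ForEach.
// Requires System.Collections.Concurrent, System.IO, System.Threading.Tasks.
// Note: the dictionary itself handles synchronization; only the update delegate
// may be invoked more than once under contention.
public IDictionary<string, int> ParseConcurrent(string path)
{
    var separators = new[] { ' ' };
    var wordCount = new ConcurrentDictionary<string, int>();

    Parallel.ForEach(File.ReadLines(path), line =>
    {
        foreach (var word in line.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Adds 1 for a new word, otherwise increments the existing count.
            wordCount.AddOrUpdate(word, 1, (key, count) => count + 1);
        }
    });

    return wordCount;
}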
I am sure there is a faster way to achieve this! Is there a better data structure to use for this problem?
Any suggestions/criticisms to my solution are welcome - trying to learn and improve here!
Cheers.
UPDATE: Here is the link to the test file I'm using.
If you're trying to count a specific word, you can use the strtok function linked here and compare each word with the word you are evaluating. I don't think this is very costly, but I have never tried it with a big file...
Your approach seems to be in line with how most people would tackle it. You're right to notice that using multithreading did not offer any significant gains: the process is most likely I/O-bound, and no matter what kind of hardware you have, you cannot read faster than your hardware supports.
If you're really looking for speed improvements (I doubt you will get any), you could try to implement a producer-consumer pattern where one thread reads the file and other threads process the lines (perhaps then parallelising the checking of words within a line). The trade-off here is that you're adding much more complex code in exchange for marginal improvements (only benchmarking can determine whether it pays off).
http://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem
edit: also have a look at ConcurrentDictionary
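A rough sketch of that pattern, using a BlockingCollection as the queue between the reader thread and the worker tasks. Each worker keeps its own dictionary and the partial results are merged at the end, so no locking on a shared dictionary is needed; whether it actually beats the single-threaded version is something only benchmarking will tell.

// Sketch only - producer-consumer with per-worker dictionaries merged at the end.
// Requires System.Collections.Concurrent, System.Collections.Generic, System.IO, System.Threading.Tasks.
public static IDictionary<string, int> ParseProducerConsumer(string path, int workerCount = 4)
{
    var lines = new BlockingCollection<string>(boundedCapacity: 10000);
    var separators = new[] { ' ' };

    // Producer: the only thread touching the file.
    var producer = Task.Run(() =>
    {
        foreach (var line in File.ReadLines(path))
        {
            lines.Add(line);
        }
        lines.CompleteAdding();
    });

    // Consumers: each counts into its own local dictionary (no shared state, no locks).
    var workers = new Task<Dictionary<string, int>>[workerCount];
    for (var i = 0; i < workerCount; i++)
    {
        workers[i] = Task.Run(() =>
        {
            var local = new Dictionary<string, int>();
            foreach (var line in lines.GetConsumingEnumerable())
            {
                foreach (var word in line.Split(separators, StringSplitOptions.RemoveEmptyEntries))
                {
                    int count;
                    local.TryGetValue(word, out count);
                    local[word] = count + 1;
                }
            }
            return local;
        });
    }

    producer.Wait();

    // Merge the partial dictionaries.
    var merged = new Dictionary<string, int>();
    foreach (var worker in workers)
    {
        foreach (var pair in worker.Result)
        {
            int count;
            merged.TryGetValue(pair.Key, out count);
            merged[pair.Key] = count + pair.Value;
        }
    }

    return merged;
}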
Using a 200 MB text file, the following took a little over 5 seconds on my machine.
I gained quite a lot (from 25 seconds to 20 seconds on a 200 MB file) simply by changing:
A variant based on ConcurrentDictionary<> and Parallel.ForEach (using the IEnumerable<> overload). Note that instead of using an int, I'm using an InterlockedInt, which uses Interlocked.Increment to increment itself. Being a reference type, it works correctly with ConcurrentDictionary<>.GetOrAdd...

Note that the use of Parallel.ForEach is less efficient than directly using one thread per physical core (you can see how in the history). While both solutions take less than 10 seconds of "wall clock" time on my PC, the Parallel.ForEach version uses 55 seconds of CPU time against the 33 seconds of the Thread solution.

There is another trick that is worth around 5-10%: "group" the rows into packets (choose a number between 10 and 100) so that there is less inter-thread communication. The workers then have to do a foreach over the received rows.
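In outline (a simplified sketch, not the exact code), the idea looks like this:

// Simplified sketch: a mutable reference-type counter plus GetOrAdd.
// Requires System.Collections.Concurrent, System.IO, System.Threading, System.Threading.Tasks.
public class InterlockedInt
{
    private int count;

    public int Value { get { return count; } }

    public void Increment()
    {
        Interlocked.Increment(ref count);
    }
}

public static ConcurrentDictionary<string, InterlockedInt> ParseParallel(string path)
{
    var separators = new[] { ' ' };
    var wordCount = new ConcurrentDictionary<string, InterlockedInt>();

    Parallel.ForEach(File.ReadLines(path), line =>
    {
        foreach (var word in line.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // GetOrAdd returns the existing counter (or inserts a fresh one);
            // incrementing the counter never writes to the dictionary again.
            wordCount.GetOrAdd(word, k => new InterlockedInt()).Increment();
        }
    });

    return wordCount;
}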
I recommend setting your stream buffer sizes larger, and matching them.
First of all, your code results in buffers too small for this kind of work. Second, since the reader's buffer is smaller than the stream's buffer, the data is copied first to the stream's buffer, then to the reader's buffer. This can be a performance destroyer for the type of work you're doing.
When the buffer sizes match, the stream's buffer will never be used - in fact it will never be allocated.
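For example (the exact size is something to benchmark; 64 KB is just a starting point), allocate the stream and the reader with the same buffer size:

// Same buffer size for the FileStream and the StreamReader (64 KB here as a guess).
// Requires System.IO and System.Text.
const int BufferSize = 1 << 16;

using (var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, BufferSize))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize))
{
    // ... same line-by-line loop as before ...
}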
The best short answer I can give is to measure, measure, measure. Stopwatch is nice for getting a feeling for where time is spent, but eventually you'll end up sprinkling large swaths of your code with it, or you will have to find a better tool for this purpose. I would suggest getting a dedicated profiler tool; there are many available for C# and .NET.

I've managed to shave off about 43% of the total runtime in three steps.
First I measured your code. The profile indicated two hotspots that we can try to combat:

- string splitting (SplitInternal)
- dictionary lookups (FindEntry and get_Item)

The last piece of the time is spent reading the file, and I really doubt we can gain much by changing that part of the code. One other answer here mentions using specific buffer sizes; I tried this and could not measure any difference.
The first, string splitting, is somewhat easy but involves rewriting a very simple call to string.Split into a bit more code: the loop that processes one line now scans for the separators by hand (the full loop is shown a bit further down). I then rewrote the processing of one word.
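Roughly, it becomes a TryGetValue followed by a single assignment:

int currentCount;
wordCount.TryGetValue(word, out currentCount);
wordCount[word] = currentCount + 1;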
This relies on the fact that:

- TryGetValue is cheaper than checking if the word exists and then retrieving its current count
- if TryGetValue fails to get the value (the key does not exist), it will initialize the currentCount variable to its default value, which is 0. This means that we don't really need to check whether the word actually existed.

The final loop thus looks like this:
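Something along these lines (a sketch: the separator scan is done by hand and each word is cut out with Substring):

string line;
while ((line = streamReader.ReadLine()) != null)
{
    var lastPos = 0;
    for (var index = 0; index <= line.Length; index++)
    {
        // Treat the end of the line as a final separator.
        if (index == line.Length || line[index] == ' ')
        {
            if (index > lastPos)
            {
                var word = line.Substring(lastPos, index - lastPos);

                int currentCount;
                wordCount.TryGetValue(word, out currentCount);
                wordCount[word] = currentCount + 1;
            }
            lastPos = index + 1;
        }
    }
}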
The new measurement shows this:

- time we no longer spend: SplitInternal, FindEntry and get_Item
- new time we spend: TryGetValue and Substring

We lost more time than we gained in new time, which resulted in a net improvement.
However, we can do better. We're doing two dictionary lookups here, each of which involves calculating the hash code of the word and comparing it to keys in the dictionary. The first lookup is part of the TryGetValue and the second is part of wordCount[word] = ....

We can remove the second dictionary lookup by creating a smarter data structure inside the dictionary, at the cost of more heap memory.
We can use Xanatos' trick of storing the count inside an object so that we can remove that second dictionary lookup:
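Roughly like this (a sketch; the exact layout may differ):

// The dictionary now maps each word to a mutable WordCount object.
public class WordCount
{
    public int Count;
}

// Per word, inside the loop (wordCount is now a Dictionary<string, WordCount>):
WordCount counter;
if (!wordCount.TryGetValue(word, out counter))
{
    counter = new WordCount();
    wordCount.Add(word, counter);
}
counter.Count++;  // incrementing an existing word no longer touches the dictionary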
This will only retrieve the count from the dictionary; adding 1 extra occurrence does not involve the dictionary. The result of the method will also change to return this WordCount type as part of the dictionary instead of just an int.

Net result: ~43% savings.
Final piece of code:
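A self-contained sketch combining the pieces above (manual splitting, TryGetValue, and the WordCount object); details may differ from the original listing:

// Requires System.Collections.Generic and System.IO.
public class WordCount
{
    public int Count;
}

public class WordParser   // hypothetical wrapper name
{
    private const char Separator = ' ';

    public IDictionary<string, WordCount> Parse(string path)
    {
        var wordCounts = new Dictionary<string, WordCount>();

        using (var fileStream = File.Open(path, FileMode.Open, FileAccess.Read))
        using (var streamReader = new StreamReader(fileStream))
        {
            string line;
            while ((line = streamReader.ReadLine()) != null)
            {
                var lastPos = 0;
                for (var index = 0; index <= line.Length; index++)
                {
                    if (index == line.Length || line[index] == Separator)
                    {
                        if (index > lastPos)
                        {
                            var word = line.Substring(lastPos, index - lastPos);

                            WordCount counter;
                            if (!wordCounts.TryGetValue(word, out counter))
                            {
                                counter = new WordCount();
                                wordCounts.Add(word, counter);
                            }
                            counter.Count++;
                        }
                        lastPos = index + 1;
                    }
                }
            }
        }

        return wordCounts;
    }
}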