I have a text file with a size of over 50 GB. Now I want to delete the duplicate words. But I have heard that I would need a huge amount of RAM to load every word from the text file into a HashSet. Can you tell me a good way to delete every duplicate word from the text file? The words are separated by a white space, like this.
word1 word2 word3 ... ...
This approach uses a database to buffer the words found.
It also assumes that words are considered equal regardless of case.
The H2 documentation states that a database on a non-FAT filesystem has a maximum size of 4 TB (using the default page size of 2KB), which is more than enough for this purpose.
You need to have the H2 driver jar on the classpath.
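The code from the answer isn't reproduced above, but a minimal sketch of the idea might look like this (the file names and the database path are placeholders, words are lower-cased to match the case-insensitivity assumption, and H2's MERGE statement is used so duplicates are silently dropped):

```java
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Scanner;

public class DeduplicateWords {
    public static void main(String[] args) throws Exception {
        // Placeholder paths - adjust to your input/output files.
        String input = "input.txt";
        String output = "output.txt";

        // Embedded H2 database stored on disk, used as the buffer for seen words.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./words_buffer");
             Scanner scanner = new Scanner(Files.newBufferedReader(Paths.get(input)));
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get(output)))) {

            conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS words (word VARCHAR PRIMARY KEY)");

            // MERGE inserts the word only if it is not already present,
            // so duplicates are dropped without any extra bookkeeping.
            try (PreparedStatement merge = conn.prepareStatement(
                    "MERGE INTO words KEY(word) VALUES (?)")) {
                while (scanner.hasNext()) {
                    // Lower-case so that words are equal regardless of case.
                    merge.setString(1, scanner.next().toLowerCase());
                    merge.executeUpdate();
                }
            }

            // Write every distinct word back out, separated by single spaces.
            try (ResultSet rs = conn.createStatement().executeQuery("SELECT word FROM words")) {
                while (rs.next()) {
                    out.print(rs.getString(1));
                    out.print(' ');
                }
            }
        }
    }
}
```

Wrapping the inserts in JDBC batches (addBatch/executeBatch) would likely speed this up considerably for a file of that size.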
Note that I only tested this with a small file consisting of 10 words or so. You should try this approach with your 50 GB file and report back any errors.
Please be aware that the time this approach takes grows roughly linearly with the number of words in the file, since each word results in a single indexed database lookup and insert.
The H2 answer is good, but maybe overkill. All the words in the English language won't take more than a few MB. Just use a set. You could use this in RAnders00's program.
As an estimate of common words, I added this for War and Peace (from Project Gutenberg).
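The snippet referred to here isn't shown above, but a sketch of that kind of measurement might look like this (the file path is hypothetical, and the distinct-word count depends on how you tokenize):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class WordCountEstimate {
    public static void main(String[] args) throws Exception {
        // Placeholder path - point this at a local copy of the Gutenberg text.
        Set<String> words = new HashSet<>();
        long start = System.currentTimeMillis();

        // Files.lines streams the file one line at a time.
        try (Stream<String> lines = Files.lines(Paths.get("war_and_peace.txt"))) {
            lines.forEach(line -> {
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty()) {
                        words.add(word.toLowerCase());
                    }
                }
            });
        }

        System.out.println("Distinct words: " + words.size());
        System.out.println("Took " + (System.currentTimeMillis() - start) + " ms");
    }
}
```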
It completed in 0 seconds. You can't use Files.lines unless your huge source file has line breaks. With line breaks, it will process the file line by line, so it won't use too much memory.
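As a side note (not part of the original answer): if the file really is one enormous line, java.util.Scanner can stream whitespace-separated tokens without reading the whole file into memory, for example:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

public class ScanWords {
    public static void main(String[] args) throws Exception {
        Set<String> words = new HashSet<>();
        // Scanner's default delimiter is whitespace, so it yields one word
        // at a time even when the entire file is a single line.
        try (Scanner scanner = new Scanner(Files.newBufferedReader(Paths.get("huge.txt")))) {
            while (scanner.hasNext()) {
                words.add(scanner.next().toLowerCase());
            }
        }
        System.out.println("Distinct words: " + words.size());
    }
}
```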