As part of a project I'm working on, I'd like to clean up a file I generate by removing duplicate line entries. These duplicates often won't occur near each other, however. I came up with a method of doing so in Java (which basically made a copy of the file, then used a nested while-statement to compare each line in one file with the rest of the other). The problem is that my generated file is pretty big and text heavy (about 225k lines of text, and around 40 megs). I estimate my current process to take 63 hours! This is definitely not acceptable.
I need an integrated solution for this, however. Preferably in Java. Any ideas? Thanks!
If you could use UNIX shell commands you could do something like the following:
This would iterate through your whole file and only pass each unique occurrence once per sed call. This way you're not doing a bunch of searches you've done before.
You could use Set in the Collections library to store unique, seen values as you read the file.
There are two scalable solutions, depending on whether the procedure should be stable or not. By scalable I mean disk-based rather than memory-based, and by stable I mean that the surviving lines keep their original order after the duplicates are removed. If scalability isn't an issue, then simply use memory for the same sort of method.
For the non-stable solution, first sort the file on the disk. This is done by splitting the file into smaller files, sorting the smaller chunks in memory, and then merging the files in sorted order, where the merge ignores duplicates.
The merge itself can be done using almost no memory, by comparing only the current line in each file, since the next line is guaranteed to be greater.
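A rough sketch of the sort-and-merge approach described above, in Java. The class name ExternalDedup, the method names, and the linesPerChunk parameter are made up for illustration, and UTF-8 input is assumed. The merge keeps only one line per chunk in memory and drops a line whenever it equals the last line written.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class ExternalDedup {
    // One open chunk file plus the line currently at its head.
    private static final class Cursor {
        final BufferedReader reader;
        String line;

        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.line = reader.readLine();
        }

        void advance() throws IOException {
            line = reader.readLine();
        }
    }

    // Sorts one chunk in memory and writes it to a temporary file.
    private static Path writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        Path chunk = Files.createTempFile("dedup-chunk", ".txt");
        Files.write(chunk, lines, StandardCharsets.UTF_8);
        return chunk;
    }

    // Writes the distinct lines of 'in' to 'out' in sorted (not original) order.
    static void dedup(Path in, Path out, int linesPerChunk) throws IOException {
        // Phase 1: split the input into sorted chunk files.
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
            List<String> buf = new ArrayList<>();
            String line;
            while ((line = r.readLine()) != null) {
                buf.add(line);
                if (buf.size() >= linesPerChunk) {
                    chunks.add(writeSortedChunk(buf));
                    buf.clear();
                }
            }
            if (!buf.isEmpty()) {
                chunks.add(writeSortedChunk(buf));
            }
        }

        // Phase 2: k-way merge, always taking the smallest head line.
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.line));
        List<Cursor> open = new ArrayList<>();
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (Path chunk : chunks) {
                Cursor c = new Cursor(Files.newBufferedReader(chunk, StandardCharsets.UTF_8));
                open.add(c);
                if (c.line != null) {
                    heap.add(c);
                }
            }
            String last = null;
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();
                // Equal lines come out of the merge back to back, so one
                // comparison against the previous output is enough.
                if (!c.line.equals(last)) {
                    w.write(c.line);
                    w.newLine();
                    last = c.line;
                }
                c.advance();
                if (c.line != null) {
                    heap.add(c);
                }
            }
        } finally {
            for (Cursor c : open) {
                c.reader.close();
            }
            for (Path chunk : chunks) {
                Files.deleteIfExists(chunk);
            }
        }
    }
}
```

A call such as ExternalDedup.dedup(input, output, 50_000) would cap memory at roughly one chunk of lines; the chunk size is a tuning knob, not a recommendation.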
The stable solution is slightly trickier. First, sort the file in chunks as before, but indicate in each line the original line number. Then, during the "merge" don't bother storing the result, just the line numbers to be deleted.
Then copy the original file line by line, ignoring the line numbers you have stored above.
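The bookkeeping for the stable variant might look roughly like the sketch below. To keep it short, it sorts the numbered lines in memory, whereas the description above does that sort in chunks on disk; only the "record line numbers, then copy while skipping them" logic is the point here. StableDedup, Numbered, findDuplicates and copySkipping are hypothetical names, and UTF-8 input is assumed.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;

class StableDedup {
    // A line of text tagged with its original (0-based) line number.
    private static final class Numbered {
        final String text;
        final int lineNo;

        Numbered(String text, int lineNo) {
            this.text = text;
            this.lineNo = lineNo;
        }
    }

    // Returns the line numbers of every occurrence after the first.
    // NOTE: sorted in memory for brevity; a disk-based version would
    // sort and merge chunks instead.
    static BitSet findDuplicates(Path in) throws IOException {
        List<Numbered> tagged = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
            String line;
            int n = 0;
            while ((line = r.readLine()) != null) {
                tagged.add(new Numbered(line, n++));
            }
        }
        // Sort by text, then by line number, so the earliest occurrence
        // of each text comes first within its run of equal lines.
        tagged.sort(Comparator.comparing((Numbered x) -> x.text)
                .thenComparingInt(x -> x.lineNo));

        BitSet toDelete = new BitSet();
        for (int i = 1; i < tagged.size(); i++) {
            if (tagged.get(i).text.equals(tagged.get(i - 1).text)) {
                toDelete.set(tagged.get(i).lineNo);
            }
        }
        return toDelete;
    }

    // Copies 'in' to 'out' line by line, skipping the marked line numbers.
    static void copySkipping(Path in, Path out, BitSet toDelete) throws IOException {
        try (BufferedReader r = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            int n = 0;
            while ((line = r.readLine()) != null) {
                if (!toDelete.get(n)) {
                    w.write(line);
                    w.newLine();
                }
                n++;
            }
        }
    }
}
```

A BitSet is used for the stored line numbers because it costs about one bit per line, which stays small even for very large files.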
Try a simple HashSet that stores the lines you have already read. Then iterate over the file. If you come across duplicates they are simply ignored (as a Set can only contain each element once).
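As a minimal sketch of that idea (the class name StreamingDedup and the UTF-8 charset are assumptions, not anything from the question), each line is written out the first time it is seen, so the original order is kept:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

class StreamingDedup {
    // Copies 'in' to 'out', keeping only the first occurrence of each line.
    static void dedup(Path in, Path out) throws IOException {
        Set<String> seen = new HashSet<>();
        try (BufferedReader r = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                // Set.add returns false if the line was already present.
                if (seen.add(line)) {
                    w.write(line);
                    w.newLine();
                }
            }
        }
    }
}
```

This is a single pass with one hash lookup per line, at the cost of holding every distinct line in memory.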
Hmm... 40 megs seems small enough that you could build a Set of the lines and then print them all back out. This would be way, way faster than doing O(n^2) I/O work.
It would be something like this (ignoring exceptions):
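A sketch of the build-a-Set-then-print idea, not necessarily the code originally posted here. SetDedup, uniqueLines and writeLines are illustrative names, UTF-8 is assumed, and unlike the wording above the sketch declares IOException rather than ignoring it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

class SetDedup {
    // Reads every line of 'in' into a Set, which drops duplicates.
    static Set<String> uniqueLines(Path in) throws IOException {
        Set<String> lines = new HashSet<>();
        try (BufferedReader r = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Prints the set back out, one element per line.
    // With a plain HashSet the output order is unspecified.
    static void writeLines(Set<String> lines, Path out) throws IOException {
        Files.write(out, lines, StandardCharsets.UTF_8);
    }
}
```

Because the whole input is read before anything is written, the output path can be the same as the input path if overwriting the file in place is acceptable.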
If the order is important, you could use a LinkedHashSet instead of a HashSet. Since the elements are stored by reference, the overhead of an extra linked list should be insignificant compared to the actual amount of data.
Edit: As Workshop Alex pointed out, if you don't mind making a temporary file, you can simply print out the lines as you read them. This allows you to use a simple HashSet instead of LinkedHashSet. But I doubt you'd notice the difference on an I/O bound operation like this one.