Deleting duplicate lines in a file using Java

As part of a project I'm working on, I'd like to remove duplicate lines from a file I generate. The duplicates often won't occur near each other, however. I came up with a way to do this in Java: it makes a copy of the file, then uses a nested while loop to compare each line in one file with every line in the other. The problem is that my generated file is pretty big and text-heavy (about 225k lines of text, around 40 MB). I estimate that my current process would take 63 hours! This is definitely not acceptable.

I need an integrated solution for this, however, preferably in Java. Any ideas? Thanks!

Tags: java file text file-io duplicates

14 answers

SAY GOODBYE

Reply #2 · 2019-01-22 12:50

Okay, most answers are a bit silly and slow, since they involve adding lines to some hash set and then copying them back out of that set again. Let me show the optimal solution in pseudocode:

Create a hashset for just strings.
Open the input file.
Open the output file.
while not EOF(input)
  Read Line.
  If not(Line in hashSet)
    Add Line to hashset.
    Write Line to output.
  End If.
End While.
Free hashset.
Close input.
Close output.
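
The pseudocode above could be sketched in Java roughly as follows. This is only an illustration: the class name `DedupeLines`, the method name `dedupe`, and the UTF-8 encoding are assumptions, not something the answer specifies.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; the name is illustrative only.
class DedupeLines {
    /**
     * Copies input to output, writing each distinct line once, in the order
     * it first appears. Returns the number of lines written.
     * UTF-8 is an assumed encoding.
     */
    static int dedupe(Path input, Path output) throws IOException {
        Set<String> seen = new HashSet<>();
        int written = 0;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Set.add returns false if the line was already in the set.
                if (seen.add(line)) {
                    writer.write(line);
                    writer.newLine();
                    written++;
                }
            }
        }
        return written;
    }
}
```

Every distinct line stays in memory until the file has been read, so the heap has to hold roughly the unique part of the input. The file is read in a single pass, and each line costs one hash lookup on average.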

Please, guys, don't make it more difficult than it needs to be. :-) Don't even bother with sorting; you don't need it.

叛逆

Reply #3 · 2019-01-22 12:51

A similar approach

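// Needs Apache Commons IO (IOUtils); UTF-8 is an assumed encoding.
public void stripDuplicatesFromFile(String filename) throws IOException {
    try (InputStream in = new FileInputStream(filename);
         OutputStream out = new FileOutputStream(filename + ".uniq")) {
        // LinkedHashSet drops repeated lines and keeps first-seen order.
        IOUtils.writeLines(
            new LinkedHashSet<String>(IOUtils.readLines(in, StandardCharsets.UTF_8)),
            "\n", out, StandardCharsets.UTF_8);
    }
}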

兄弟一词,经得起流年.

Reply #4 · 2019-01-22 12:52

Read in the file, storing the line number and the line: O(n)
Sort it into alphabetical order: O(n log n)
Remove duplicates: O(n)
Sort it into its original line number order: O(n log n)
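
The four steps above might look something like this in Java. It is a minimal in-memory sketch: the names `SortDedupe` and `Numbered` are made up for illustration, and "alphabetical order" is taken to mean Java's natural `String` ordering.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper class; the name is illustrative only.
class SortDedupe {
    // A line together with its original position in the file.
    private static final class Numbered {
        final int index;
        final String text;

        Numbered(int index, String text) {
            this.index = index;
            this.text = text;
        }
    }

    static List<String> dedupe(List<String> lines) {
        // Step 1: store the line number with each line, O(n).
        List<Numbered> items = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            items.add(new Numbered(i, lines.get(i)));
        }
        // Step 2: sort by text, O(n log n). Ties are broken by line number,
        // so the earliest copy of each line comes first.
        items.sort(Comparator.comparing((Numbered n) -> n.text)
                .thenComparingInt(n -> n.index));
        // Step 3: keep only the first entry of each run of equal lines, O(n).
        List<Numbered> unique = new ArrayList<>();
        for (Numbered n : items) {
            if (unique.isEmpty() || !unique.get(unique.size() - 1).text.equals(n.text)) {
                unique.add(n);
            }
        }
        // Step 4: restore the original line order, O(n log n).
        unique.sort(Comparator.comparingInt((Numbered n) -> n.index));
        List<String> result = new ArrayList<>();
        for (Numbered n : unique) {
            result.add(n.text);
        }
        return result;
    }
}
```

For the 40 MB file in the question, the input list could come from something like `Files.readAllLines`, which assumes the whole file fits in memory.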

祖国的老花朵

Reply #5 · 2019-01-22 12:52

If the order does not matter, the simplest way is shell scripting:

<infile sort | uniq > outfile

Explosion°爆炸

Reply #6 · 2019-01-22 12:53

I have made two assumptions for this efficient solution:

There is a blob equivalent of a line, or we can process lines as binary.
We can save the offset of, or a pointer to, the start of each line.

Based on these assumptions, the solution is: read a line and use its length as the key in a hashmap, which keeps the hashmap light. The value stored for each key is the list of offsets of all lines with that length. Building this hashmap is O(n). While adding each line's offset to the hashmap, compare the line's blob with every existing entry in the list for that length, skipping entries whose offset is -1. If a duplicate is found, remove both lines and store -1 as the offset in their places in the list.

So consider the complexity and memory usage:

Hashmap memory (space complexity): O(n), where n is the number of lines.

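Time complexity: in the worst case there are no duplicates and all n lines have the same length m, so every line falls into the same bucket. Building the map is still O(n) insertions, and since we assume blobs can be compared directly, m does not matter. Note, however, that each new line is then compared against every earlier one, so the number of comparisons grows as O(n^2).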

In other cases we save on comparisons, at the cost of a little extra space in the hashmap.

Additionally, we can use MapReduce on the server side to split the set and merge the results later, using the line length or the start of the line as the mapper key.

Emotional °昔

Reply #7 · 2019-01-22 12:54

The hash set approach is OK, but you can tweak it so that it does not store all the strings in memory. Instead, store a logical pointer to each line's location in the file, and go back to read the actual value only when you need it.
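
One way to picture the pointer idea is the sketch below, which keeps only each line's hash code and file offset in memory and re-reads a line from disk when a hash matches. The names `OffsetDedupe` and `dedupe` are hypothetical. It uses `RandomAccessFile.readLine`, which is unbuffered and maps each byte to one char, so the output is written as ISO-8859-1 to reproduce the original bytes. Treat it as an illustration of the idea rather than a tuned implementation.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; the name is illustrative only.
class OffsetDedupe {
    /** Writes each distinct line once, in first-seen order; returns the count. */
    static int dedupe(Path input, Path output) throws IOException {
        // Hash code of a line -> offsets of the distinct lines with that hash.
        Map<Integer, List<Long>> index = new HashMap<>();
        int written = 0;
        try (RandomAccessFile file = new RandomAccessFile(input.toFile(), "r");
             BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.ISO_8859_1)) {
            while (true) {
                long offset = file.getFilePointer();
                String line = file.readLine();
                if (line == null) {
                    break; // end of file
                }
                long next = file.getFilePointer();
                List<Long> candidates =
                        index.computeIfAbsent(line.hashCode(), k -> new ArrayList<>());
                boolean duplicate = false;
                // Go back to the file only for lines whose hash code matches.
                for (long candidate : candidates) {
                    file.seek(candidate);
                    if (line.equals(file.readLine())) {
                        duplicate = true;
                        break;
                    }
                }
                file.seek(next); // resume scanning where we left off
                if (!duplicate) {
                    candidates.add(offset);
                    writer.write(line);
                    writer.newLine();
                    written++;
                }
            }
        }
        return written;
    }
}
```

The trade-off is extra seeks and re-reads in exchange for not keeping the line contents in memory. In Java the boxed `Integer` and `Long` entries still carry overhead, so the saving matters mainly when lines are long.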

Another creative approach is to append the line number to each line, sort all the lines, remove the duplicates (ignoring the last token, which is the number), then sort the file again by that last token and strip it out in the output.
