Deleting duplicate lines in a file using Java (answers, page 3)

Posted 2019-01-22 12:25

As part of a project I'm working on, I'd like to clean up a file I generate by removing duplicate lines. The duplicates often won't occur near each other, however. I came up with a way to do this in Java: it makes a copy of the file, then uses nested while loops to compare each line in one file with every line in the other. The problem is that my generated file is pretty big and text-heavy (about 225k lines of text, around 40 MB). I estimate my current process would take 63 hours! This is definitely not acceptable.

I need an integrated solution for this, though, preferably in Java. Any ideas? Thanks!

Tags: java file text file-io duplicates

14 answers

放我归山

#2 · 2019-01-22 13:11

Something like this, perhaps:

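import java.io.BufferedReader;
import java.io.PrintWriter;
import java.util.LinkedHashSet;
import java.util.Set;

BufferedReader in = ...;
Set<String> lines = new LinkedHashSet<>(); // remembers each distinct line in first-seen order
for (String line; (line = in.readLine()) != null;)
    lines.add(line); // does nothing if duplicate is already added
PrintWriter out = ...;
for (String line : lines)
    out.println(line);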

LinkedHashSet keeps insertion order, whereas HashSet (while slightly faster for lookup and insert) makes no guarantee about iteration order, so the lines would generally come out reordered.
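For completeness, here is a minimal self-contained sketch of the same idea, assuming the distinct lines fit comfortably in the heap (which seems plausible for a file of about 40 MB) and that the file is UTF-8. The class name DedupLines, the helper names uniqueLines and dedupFile, and the command-line arguments are made up for illustration; they are not from the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Stream;

class DedupLines {

    // Returns the distinct lines in first-seen order.
    // Each add() is an expected O(1) hash lookup, so the whole pass is
    // roughly linear in the number of lines instead of quadratic.
    static Set<String> uniqueLines(Iterable<String> lines) {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : lines) {
            seen.add(line); // no effect if the line was already added
        }
        return seen;
    }

    // Hypothetical helper: reads input, writes the distinct lines to output.
    // UTF-8 is an assumption; adjust the charset to match the real file.
    static void dedupFile(Path input, Path output) throws IOException {
        try (Stream<String> lines = Files.lines(input, StandardCharsets.UTF_8)) {
            Files.write(output, uniqueLines(lines::iterator), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java DedupLines <input> <output>");
            return;
        }
        dedupFile(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Writing to a separate output file rather than overwriting the input is a deliberate choice here, so a failed run does not destroy the original.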

"骚年 ilove

#3 · 2019-01-22 13:14

Does the order of the lines matter, and how many duplicates do you expect to see?

If order doesn't matter and you expect a lot of dupes (i.e. a lot more reading than writing), I'd also think about parallelizing the hash set solution, with the set as a shared resource.
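A rough sketch of what a shared set could look like, assuming output order really doesn't matter and the lines are already in memory. ParallelDedup and uniqueLines are illustrative names only. Whether this beats the single-threaded version is not obvious, since reading the file is likely the bottleneck, so treat it as something to measure rather than a guaranteed win.

```java
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class ParallelDedup {

    // Collects distinct lines into a thread-safe set shared by all worker threads.
    // The iteration order of the result is unspecified.
    static Set<String> uniqueLines(Collection<String> lines) {
        Set<String> seen = ConcurrentHashMap.newKeySet();
        lines.parallelStream().forEach(seen::add); // concurrent adds are safe on this set
        return seen;
    }
}
```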

