How to remove duplicate words from a plain text file

Posted 2019-03-09 10:29

I have a plain text file with words separated by commas, for example:

word1, word2, word3, word2, word4, word5, word3, word6, word7, word3

I want to delete the duplicates so that it becomes:

word1, word2, word3, word4, word5, word6, word7

Any ideas? I think egrep can help me, but I'm not sure how to use it exactly...

10 answers
Melony?
#2 · 2019-03-09 11:19

Open the file with vim (vim filename) and run the sort command with the unique flag (:sort u).

虎瘦雄心在
#3 · 2019-03-09 11:25

I had the very same problem today: a word list with 238,000 words, of which about 40,000 were duplicates. I already had them on individual lines by doing

cat filename | tr " " "\n" | sort 

To remove the duplicates (uniq only removes adjacent duplicates, so the sort above matters) I simply did

cat filename | uniq > newfilename

Worked perfectly, no errors, and now my file is down from 1.45 MB to 1.01 MB.
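As a side note, sort's -u flag can fold both steps into one. A minimal sketch, assuming the words are already one per line in a file named filename (a hypothetical name here):

```shell
# sort -u sorts the lines and keeps only the first of each run of
# duplicates, so a separate uniq pass is not needed.
sort -u filename > newfilename
```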

迷人小祖宗
#4 · 2019-03-09 11:28

I'd think you'll want to replace the separators with newlines, use the uniq command to find unique lines, then join the lines back together with commas.
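That idea can be sketched as one pipeline. Assumptions: the input is a single comma-separated line in a file called words.txt (a hypothetical name), and sort -u is used for deduplication, which also reorders the words alphabetically:

```shell
# Split "word1, word2, ..." into one word per line, deduplicate,
# then rejoin the unique words with ", ".
tr -d ' ' < words.txt \
  | tr ',' '\n' \
  | sort -u \
  | paste -sd, - \
  | sed 's/,/, /g'
```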

聊天终结者
#5 · 2019-03-09 11:33

I presumed you wanted the words to be unique on a single line, rather than throughout the file. If this is the case, then the Perl script below will do the trick.

#!/usr/bin/perl
use strict;
use warnings;

while (<DATA>)
{
    chomp;
    my %seen  = ();                      # words already seen on this line
    my @words = split(m!,\s*!);          # split on comma plus optional spaces
    @words = grep { $seen{$_} ? 0 : ($seen{$_} = 1) } @words;
    print join(", ", @words), "\n";
}

__DATA__
word1, word2, word3, word2, word4, word5, word3, word6, word7, word3

If you want uniqueness over the whole file, you can just move the %seen hash outside the while (){} loop.
