How to remove duplicate words from a plain text file

Posted 2019-03-09 10:29

I have a plain text file with words separated by commas, for example:

word1, word2, word3, word2, word4, word5, word3, word6, word7, word3

I want to delete the duplicates so that it becomes:

word1, word2, word3, word4, word5, word6, word7

Any ideas? I think egrep can help me, but I'm not sure how to use it exactly...

10 answers
Melony?
#2 · 2019-03-09 11:19

Open the file with vim (vim filename) and run the sort command with the unique flag (:sort u).

虎瘦雄心在
#3 · 2019-03-09 11:25

I had the very same problem today: a word list with 238,000 words, of which about 40,000 were duplicates. I already had them on individual lines by doing

cat filename | tr " " "\n" | sort 

To remove the duplicates (uniq only removes adjacent duplicates, so the sort above matters) I simply did

cat filename | uniq > newfilename

Worked perfectly, no errors, and now my file is down from 1.45 MB to 1.01 MB.
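As a side note, sort's -u flag can fold both steps into one. A minimal sketch, assuming the words are already one per line in a file named filename (a hypothetical name here):

```shell
# sort -u sorts the lines and keeps only the first of each run of
# duplicates, so a separate uniq pass is not needed.
sort -u filename > newfilename
```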

迷人小祖宗
#4 · 2019-03-09 11:28

I'd think you'll want to replace the separators with newlines, use the uniq command to find unique lines, then join the lines back together with commas.
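That idea can be sketched as one pipeline. Assumptions: the input is a single comma-separated line in a file called words.txt (a hypothetical name), and sort -u is used for deduplication, which also reorders the words alphabetically:

```shell
# Split "word1, word2, ..." into one word per line, deduplicate,
# then rejoin the unique words with ", ".
tr -d ' ' < words.txt \
  | tr ',' '\n' \
  | sort -u \
  | paste -sd, - \
  | sed 's/,/, /g'
```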

聊天终结者
#5 · 2019-03-09 11:33

I presumed you wanted the words to be unique on a single line, rather than throughout the file. If this is the case, then the Perl script below will do the trick.

#!/usr/bin/perl
use strict;
use warnings;

while (<DATA>)
{
    chomp;
    my %seen  = ();                      # words already seen on this line
    my @words = split(m!,\s*!);          # split on comma plus optional spaces
    @words = grep { $seen{$_} ? 0 : ($seen{$_} = 1) } @words;
    print join(", ", @words), "\n";
}

__DATA__
word1, word2, word3, word2, word4, word5, word3, word6, word7, word3

If you want uniqueness over the whole file, you can just move the %seen hash outside the while (){} loop.
