I have a plain text file with words separated by commas, for example:
word1, word2, word3, word2, word4, word5, word 3, word6, word7, word3
I want to delete the duplicates so that it becomes:
word1, word2, word3, word4, word5, word6, word7
Any ideas? I think egrep can help me, but I'm not sure how to use it exactly.
Creating a unique list is pretty easy thanks to uniq, although most Unix commands like one entry per line instead of a comma-separated list, so we have to start by converting it to that.
The harder part is putting this on one line again with commas as separators and not terminators. I used a perl one-liner to do this, but if someone has something more idiomatic, please edit me. :)
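The commands from this answer did not survive here, so what follows is only a sketch of the pipeline it describes, not the original. It swaps the perl one-liner for paste and sed, and the name dedupe_csv_sorted is made up for the example. The result comes out sorted, so the original word order is lost.

```shell
# Sketch only: dedupe_csv_sorted is a hypothetical name, and paste/sed
# stand in for the perl one-liner mentioned above.
dedupe_csv_sorted() {
  tr ',' '\n' |                        # one entry per line
    sed -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d' |                   # trim blanks, drop empty entries
    sort | uniq |                      # uniq only removes adjacent duplicates
    paste -s -d, - |                   # back onto one line, commas as separators
    sed 's/,/, /g'                     # restore the space after each comma
}

# Usage: dedupe_csv_sorted < filename
```

With the sample line from the question, "word 3" (with the space) is a different string from "word3", so it survives as its own entry.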
And don't forget the -c option for the uniq utility if you're interested in getting a count of the words as well.
Assuming that the words are one per line, and the file is already sorted:
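The command that followed is missing; for sorted, one-word-per-line input it was presumably plain uniq. A minimal sketch, with the -c variant for counts next to it. The function names are invented for the example.

```shell
# Sketch only: the function names are made up for this example.
# Both expect input that is already sorted, one word per line,
# either from the named file(s) or from stdin.

# Collapse runs of identical adjacent lines into one.
unique_sorted() {
  uniq "$@"
}

# Same, but prefix every line with the number of times it occurred.
count_sorted() {
  uniq -c "$@"
}

# Usage: unique_sorted filename
#        count_sorted filename
```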
If the file's not sorted:
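Again the command is missing. Since uniq only compares adjacent lines, the usual fix is to sort first; a sketch under that assumption (unique_unsorted is a hypothetical name):

```shell
# Sketch only: unique_unsorted is a made-up name.
# Sorting first puts identical lines next to each other so uniq can drop them.
unique_unsorted() {
  sort "$@" | uniq
}

# Usage: unique_unsorted filename
```

sort -u does the same job in a single command.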
If they're not one per line, and you don't mind them being one per line:
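One way to do this, sketched here as a guess at the missing command, is to let tr turn every run of whitespace into a single newline before sorting. words_unique is a hypothetical name.

```shell
# Sketch only: words_unique is a made-up name; reads from stdin.
words_unique() {
  tr -s '[:space:]' '\n' |   # each run of whitespace becomes one newline
    sort | uniq              # then remove duplicates as before
}

# Usage: words_unique < filename
```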
That doesn't remove punctuation, though, so maybe you want:
But that removes the hyphen from hyphenated words. "man tr" for more options.
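A sketch of the punctuation-stripping variant, assuming it simply adds the [:punct:] class to the set that tr translates. The function name is invented for the example.

```shell
# Sketch only: words_unique_no_punct is a made-up name; reads from stdin.
words_unique_no_punct() {
  tr -s '[:space:][:punct:]' '\n' |   # whitespace and punctuation both split words
    sort | uniq
}

# Usage: words_unique_no_punct < filename
```

This also shows the hyphen caveat: an input word such as "b-c" comes out as two separate words, "b" and "c".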
Came across this thread while trying to solve much the same problem. I had concatenated several files containing passwords, so naturally there were a lot of doubles. Also, many non-standard characters. I didn't really need them sorted, but it seemed that was gonna be necessary for uniq.
I tried:
Tried:
And even tried passing it through cat first, just so I could see if we were getting a proper input.
I'm not sure what's happening. The strings "t\203tonnement" and "t\203tonner" aren't found in the file, though "t/203" and "tonnement" are found, but on separate, non-adjoining lines. Same with "zon\351s".
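A guess at what is going on, not something the post confirms: \203 and \351 look like octal escapes for single raw bytes, which is how sort reports bytes that are not valid text in the current locale. That would explain why searching the file for a literal backslash finds nothing. Under that assumption, forcing plain byte-wise comparison with the C locale should sidestep the problem. sort_unique_bytewise is a hypothetical name.

```shell
# Sketch only: sort_unique_bytewise is a made-up name.
# LC_ALL=C makes sort compare raw bytes instead of locale-aware characters,
# so mixed or broken encodings no longer matter. -u keeps one copy of each line.
sort_unique_bytewise() {
  LC_ALL=C sort -u "$@"
}

# Usage: sort_unique_bytewise filename > deduplicated_file
```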
What finally worked for me was:
It also preserved words whose only difference was case, which is what I wanted. I didn't need the list sorted, so it was fine that it wasn't.
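The command itself is missing here. One well-known awk idiom matches the description (no sorting, first occurrences kept in their original order, case-sensitive), so here is a sketch of it; whether it is what the poster actually ran is an assumption, and unique_keep_order is a made-up name.

```shell
# Sketch only: unique_keep_order is a hypothetical name.
# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
# is true exactly once per distinct line, and awk prints it that one time.
unique_keep_order() {
  awk '!seen[$0]++' "$@"
}

# Usage: unique_keep_order filename > deduplicated_file
```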
ruby -lpi.bak -e '$_ = $_.split(/,\s*/).uniq.join(", ")' filename
Something like this? I'll admit the two kinds of quotations are ugly.
Here's an awk script that will leave each line intact, only removing the duplicate words:
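The script did not make it into this copy. The following is a stand-in sketch rather than the original: it splits each line on commas, keeps the first occurrence of every word in order, and rejoins with ", ". The function and variable names are invented for the example.

```shell
# Sketch only: dedupe_words_per_line is a hypothetical name.
dedupe_words_per_line() {
  awk -F', *' '
  {
    split("", seen)                      # reset the lookup table for every line
    out = ""
    for (i = 1; i <= NF; i++) {
      if ($i != "" && !($i in seen)) {   # first time this word shows up
        seen[$i] = 1
        out = (out == "" ? $i : out ", " $i)
      }
    }
    print out                            # one output line per input line
  }' "$@"
}

# Usage: dedupe_words_per_line filename
```

Each line is handled on its own, so a word may still appear on more than one line.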