Is there a way to delete duplicate lines in a file in Unix?
I can do it with the sort -u and uniq commands, but I want to use sed or awk.
Is that possible?
awk '!seen[$0]++' file.txt
seen is an associative array keyed by the full text of each line ($0). The first time a line appears, seen[$0] is unset, which Awk treats as false. The ! is the logical NOT operator and inverts that false to true. Awk prints the lines for which the expression evaluates to true. The ++ is a post-increment that runs after the value has been tested, so seen[$0] == 1 after the first time a line is found, then seen[$0] == 2, and so on.
Awk evaluates everything except 0 and "" (the empty string) as true. When a duplicate line comes in, seen[$0] is already nonzero, so !seen[$0] evaluates to false and the line is not written to the output.
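To make those steps explicit, the one-liner can be spelled out as a longer Awk program. This is only an illustrative sketch: the function name dedupe_keep_first and the sample input are made up here and are not part of the original answer.

```shell
# Hypothetical helper: the same logic as awk '!seen[$0]++', written out step by step.
dedupe_keep_first() {
    awk '{
        if (seen[$0] == 0) {   # an unset entry compares equal to 0: first time we see this line
            print $0
        }
        seen[$0]++             # count the line so that later copies are skipped
    }' "$@"
}

# Made-up sample input: "a" and "b" each appear twice, not adjacently.
printf 'a\nb\na\nc\nb\n' | dedupe_keep_first
# prints a, b, c (one per line), in their original order
```

Unlike sort -u, this keeps the lines in their original order, at the cost of holding every distinct line in memory.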
From http://sed.sourceforge.net/sed1line.txt: (Please don't ask me how this works ;-) )
# delete duplicate, consecutive lines from a file (emulates "uniq").
# First line in a set of duplicate lines is kept, rest are deleted.
sed '$!N; /^\(.*\)\n\1$/!P; D'
# delete duplicate, nonconsecutive lines from a file. Beware not to
# overflow the buffer size of the hold space, or else use GNU sed.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
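The difference between the two one-liners shows up on input where a duplicate is not adjacent to its first occurrence. The sketch below uses made-up sample data and assumes GNU sed; the wrapper function names are hypothetical.

```shell
# Hypothetical wrappers around the two one-liners quoted above.
dedupe_consecutive() {
    sed '$!N; /^\(.*\)\n\1$/!P; D'
}
dedupe_anywhere() {
    sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
}

# Made-up sample: "a" repeats back to back and then once more later on.
sample='a
a
b
a
c'

echo "consecutive only:"
printf '%s\n' "$sample" | dedupe_consecutive   # a, b, a, c: the late "a" survives
echo "nonconsecutive too:"
printf '%s\n' "$sample" | dedupe_anywhere      # a, b, c
```

The second command keeps every line seen so far in the hold space, which is why the original comment warns about its buffer size.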
Perl one-liner similar to @jonas's awk solution:
perl -ne 'print if ! $x{$_}++' file
This variation removes trailing whitespace before comparing:
perl -lne 's/\s*$//; print if ! $x{$_}++' file
This variation edits the file in-place:
perl -i -ne 'print if ! $x{$_}++' file
This variation edits the file in-place, and makes a backup file.bak:
perl -i.bak -ne 'print if ! $x{$_}++' file
The one-liner that Andre Miller posted above works, except with recent versions of sed when the input file ends with a blank line that contains no characters. On my Mac the CPU just spins.
Infinite loop if the last line is blank and has no characters:
sed '$!N; /^\(.*\)\n\1$/!P; D'
Doesn't hang, but you lose the last line:
sed '$d;N; /^\(.*\)\n\1$/!P; D'
The explanation is at the very end of the sed FAQ:
The GNU sed maintainer felt that despite the portability problems
this would cause, changing the N command to print (rather than
delete) the pattern space was more consistent with one's intuitions
about how a command to "append the Next line" ought to behave.
Another fact favoring the change was that "{N;command;}" will
delete the last line if the file has an odd number of lines, but
print the last line if the file has an even number of lines.

To convert scripts which used the former behavior of N (deleting
the pattern space upon reaching the EOF) to scripts compatible with
all versions of sed, change a lone "N;" to "$d;N;".
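The behaviour the FAQ describes can be observed with a three-line input. This is a small sketch with made-up input and a hypothetical helper name; what the first command prints depends on the sed implementation, as the quoted passage explains, so no output is claimed for it here.

```shell
# Made-up three-line input (an odd number of lines).
three_lines() {
    printf '1\n2\n3\n'
}

# Lone N: on the last line there is no next line to append.
# Per the FAQ, GNU sed prints the leftover line while older seds delete it.
three_lines | sed 'N; s/\n/,/'

# Form suggested by the FAQ: $d deletes the last line before N can run on it,
# so the unpaired last line is dropped regardless of the sed version.
three_lines | sed '$d; N; s/\n/,/'   # prints 1,2
```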
An alternative way using Vim (Vi compatible):
Delete duplicate, consecutive lines from a file:
vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq
Delete duplicate, nonconsecutive and nonempty lines from a file:
vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' | sed -nr '$!N;/^(.*)\n\1$/!P;D'
1
2
3
4
5
The core idea: print each run of consecutive duplicate lines only once, at its LAST appearance, and use the D command to implement the loop.
Explanation:
$!N; : if the current line is NOT the last line, use the N command to read the next line into the pattern space.
/^(.*)\n\1$/!P : if the pattern space holds two identical strings separated by \n, the next line is the same as the current line, so by the core idea we must NOT print it. Otherwise the current line is the LAST appearance in its run of consecutive duplicates, and the P command prints the pattern space up to the first \n (the \n is printed too).
D : delete the pattern space up to the first \n (the \n is deleted too), so the pattern space then holds the next line. The D command also forces sed to jump back to its FIRST command, $!N, without reading the next line from the file or the standard input stream.
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' | sed -nr 'p;:loop;$!N;s/^(.*)\n\1$/\1/;tloop;D'
1
2
3
4
5
The core idea: print each run of consecutive duplicate lines only once, at its FIRST appearance, and use the : and t commands to implement the loop.
Explanation:
:loop : set a label named loop.
N : read the next line into the pattern space.
s/^(.*)\n\1$/\1/ : if the next line is the same as the current line, delete it. The s command does the deletion by collapsing the two copies into one.
tloop : if the s command made a substitution, the tloop command forces sed to jump to the label named loop, which repeats the same steps for the following lines until no consecutive duplicate of the most recently printed line remains.
D : otherwise, the D command deletes the line that is the same as the most recently printed line and forces sed to jump to the first command, which is the p command. The pattern space then holds the next new line.
This can be achieved using awk.
The line below displays the unique values (uniq only collapses duplicates that are adjacent):
awk '{ print }' file_name | uniq
You can write these unique values to a new file:
awk '{ print }' file_name | uniq > uniq_file_name
The new file uniq_file_name will then contain the unique values, with no adjacent duplicates.
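One caveat worth checking before relying on this: uniq only compares adjacent lines, so a duplicate is removed only when it directly follows its twin. A minimal sketch with made-up data and a hypothetical helper name:

```shell
# Made-up sample: "x" appears once, then again as an adjacent pair.
sample_uniq() {
    printf 'x\ny\nx\nx\n'
}

# uniq alone: the first "x" is not adjacent to the later pair, so "x" is printed twice.
sample_uniq | uniq          # prints x, y, x

# Sorting first makes all copies adjacent, at the cost of losing the original order.
sample_uniq | sort | uniq   # prints x, y
```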
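cat filename | sort | uniq -c | awk -F" " '$1<2 {print $2}'
This uses awk to print only the lines that occur exactly once, so every copy of a duplicated line is dropped. Note that $2 is only the first whitespace-separated field of each line.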
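To see what this pipeline keeps, it helps to run it on a small sample. The data and the helper name sample_counts below are made up for illustration.

```shell
# Made-up sample: "b" occurs twice, "a" and "c" once each.
sample_counts() {
    printf 'b\na\nb\nc\n'
}

# Intermediate step: each distinct line prefixed by how often it occurs.
sample_counts | sort | uniq -c

# Keep only the lines whose count is below 2, i.e. lines that occur exactly once.
sample_counts | sort | uniq -c | awk -F" " '$1<2 {print $2}'   # prints a, c
```

Here "b" disappears entirely instead of being reduced to a single copy, which is different from what sort -u or awk '!seen[$0]++' would produce.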