I can remove duplicate entries from small text files, but not large text files.
I have a file that's 4MB.
The beginning of the file looks like this:
aa
aah
aahed
aahed
aahing
aahing
aahs
aahs
aal
aalii
aalii
aaliis
aaliis
...
I want to remove the duplicates.
For example, "aahed" shows up twice, and I would only like it to show up once.
No matter which one-liner I try, the duplicates in the big list never go away.
If I type:
sort big_list.txt | uniq | less
I see:
aa
aah
aahed
aahed <-- didn't get rid of it
aahing
aahing <-- didn't get rid of it
aahs
aahs <-- didn't get rid of it
aal
...
However, if I copy a small chunk of words from the top of this text file and re-run the command on just that chunk, it does what's expected.
Are these programs refusing to sort because the file is too big? I didn't think 4MB was very big. It doesn't output a warning or anything.
I quickly wrote my own "uniq" program, and it has the same behavior. It works on a small subset of the list, but doesn't do anything to the 4MB text file. What's my issue?
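(For reference, the core of it was just an adjacent-duplicate filter; a rough awk sketch of the same idea, not my actual program, would be
awk 'NR == 1 || $0 != prev { print } { prev = $0 }' big_list.txt
which prints a line only when it differs from the previous one.)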
EDIT: Here is a hex dump:
00000000 61 61 0a 61 61 68 0a 61 61 68 65 64 0a 61 61 68 |aa.aah.aahed.aah|
00000010 65 64 0d 0a 61 61 68 69 6e 67 0a 61 61 68 69 6e |ed..aahing.aahin|
00000020 67 0d 0a 61 61 68 73 0a 61 61 68 73 0d 0a 61 61 |g..aahs.aahs..aa|
00000030 6c 0a 61 61 6c 69 69 0a 61 61 6c 69 69 0d 0a 61 |l.aalii.aalii..a|
00000040 61 6c 69 69 73 0a 61 61 6c 69 69 73 0d 0a 61 61 |aliis.aaliis..aa|
Comparing the two "aahed" entries: the first ends in 0a (LF), while the second ends in 0d 0a (CR+LF):
61 61 68 65 64 0a
a  a  h  e  d  \n
61 61 68 65 64 0d 0a
a  a  h  e  d  \r \n
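A quick way to confirm this across the whole file (assuming a bash-like shell for the $'\r' quoting):
grep -c $'\r' big_list.txt
prints the number of lines that contain a carriage return, and cat -A big_list.txt | less (GNU cat) shows each one as a trailing ^M.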
Solved: Different line delimiters
The sort(1) command accepts a -u option for uniqueness of key. Just use that instead of piping the output through uniq.
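Concretely, the pipeline from the question becomes:
sort -u big_list.txt | less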
Since the file mixes LF and CR+LF endings, sort -u alone will still treat "aahed" and "aahed" followed by a CR as different lines, so you also need to normalize the line delimiters (convert CR+LF to LF).
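For example with tr (dos2unix or a sed substitution would work just as well):
tr -d '\r' < big_list.txt | sort -u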
To answer max taldykin's question about awk '!_[$0]++' file:
awk '!_[$0]++' file
is the same as
awk '!_[$0]++ { print; }' file
which is the same as
awk '{ if (!_[$0]++) print; }' file
which means: print each input line the first time it is seen and skip every later occurrence.
Important points here:
$0 means the current record, which usually is the current line.
In awk the ACTION part of a rule is optional, and the default action is { print; }.
_ is just an (associative) array; its elements start out as 0, so !_[$0]++ is true only the first time a given line appears.
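A quick way to see the idiom in action (the sample input here is made up):
printf 'aahed\naahed\naa\naahed\n' | awk '!_[$0]++'
prints aahed and then aa: the first occurrence of each line is kept, later ones are dropped, and the original order is preserved, so the input does not even need to be sorted first.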
Apart from sort -u, you can also use
awk '!_[$0]++' yourfile
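Given the mixed line endings in this particular file, a variant that strips a trailing CR before comparing is worth sketching (assuming an awk such as gawk or mawk that accepts \r in a regex):
awk '{ sub(/\r$/, "") } !_[$0]++' big_list.txt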