“sort filename | uniq” does not work on a large file

Posted 2019-05-11 00:50

Question:

I can remove duplicate entries from small text files, but not large text files.
I have a file that's 4MB.
The beginning of the file looks like this:

aa
aah
aahed
aahed
aahing
aahing
aahs
aahs
aal
aalii
aalii
aaliis
aaliis
...

I want to remove the duplicates.
For example, "aahed" shows up twice, and I would only like it to show up once.

No matter what one-liner I've tried, the big list will not change.

If I type: sort big_list.txt | uniq | less
I see:

aa
aah
aahed
aahed   <-- didn't get rid of it
aahing
aahing   <-- didn't get rid of it
aahs
aahs   <-- didn't get rid of it
aal
...

However, if I copy a small chunk of words from the top of this text file and re-run the command on that chunk, it does what's expected.

Are these programs refusing to sort because the file is too big? I didn't think 4MB was very big. It doesn't output a warning or anything.

I quickly wrote my own "uniq" program, and it has the same behavior. It works on a small subset of the list, but doesn't do anything to the 4MB text file. What's my issue?

EDIT: Here is a hex dump:

00000000  61 61 0a 61 61 68 0a 61  61 68 65 64 0a 61 61 68  |aa.aah.aahed.aah|
00000010  65 64 0d 0a 61 61 68 69  6e 67 0a 61 61 68 69 6e  |ed..aahing.aahin|
00000020  67 0d 0a 61 61 68 73 0a  61 61 68 73 0d 0a 61 61  |g..aahs.aahs..aa|
00000030  6c 0a 61 61 6c 69 69 0a  61 61 6c 69 69 0d 0a 61  |l.aalii.aalii..a|
00000040  61 6c 69 69 73 0a 61 61  6c 69 69 73 0d 0a 61 61  |aliis.aaliis..aa|

61 61 68 65 64 0a
a  a  h  e  d  \n

61 61 68 65 64 0d
a  a  h  e  d  \r

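Solved: Different line delimiters. Some lines end in LF (0a) and others in CR+LF (0d 0a), so the "duplicates" are not actually identical lines.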
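
One way to confirm this diagnosis without reading a hex dump is to count the lines that end in a carriage return. The following is only a minimal sketch: sample.txt is a hypothetical file that imitates the mixed endings shown in the hex dump above, and the variable names are made up for illustration.

```shell
# Build a small sample that mixes LF and CR+LF endings, like the hex dump.
# sample.txt is a hypothetical file name used only for this sketch.
printf 'aa\naah\naahed\naahed\r\naahing\naahing\r\n' > sample.txt

# A literal carriage return, produced portably with printf.
cr=$(printf '\r')

# Count the lines that have a CR right before the newline.
crlf_lines=$(grep -c "${cr}\$" sample.txt)
echo "lines ending in CR: $crlf_lines"

# "aahed" and "aahed<CR>" are different lines to uniq, so nothing is removed.
remaining=$(sort sample.txt | uniq | wc -l | tr -d ' ')
echo "lines after sort | uniq: $remaining"
```

If the first count is greater than zero but smaller than the total number of lines, the file mixes both delimiters, which matches the behavior described in the question.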

Answer 1:

You can normalize the line delimiters (convert CR+LF to LF):

sed 's/\r//' big_list.txt | sort -u
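
The \r escape in a sed expression is not specified by POSIX, so it may not be understood by every sed implementation. As a hedged alternative, tr can delete the carriage returns instead. This is a sketch, not the answerer's method: sample.txt and deduped.txt are hypothetical file names, and note that tr removes every CR in the file, not only those at the end of a line.

```shell
# sample.txt is a hypothetical stand-in for big_list.txt.
printf 'aa\naah\naahed\naahed\r\naahing\naahing\r\n' > sample.txt

# Delete every carriage return, then sort and drop duplicate lines.
tr -d '\r' < sample.txt | sort -u > deduped.txt

cat deduped.txt
```

For a word list like the one in the question, where CR only appears as part of a line ending, both approaches should give the same result.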


Answer 2:

The sort(1) command accepts a -u option that keeps only one line for each distinct key.

Just use

 sort -u big_list.txt


Answer 3:

To answer max taldykin's question about awk '!_[$0]++' file:

awk '!_[$0]++' file is the same as

awk '!seen[$0]++' file

which is the same as

awk '!seen[$0]++ { print; }' file

which in turn means

awk '
    {
        if (!seen[$0]) {
            print;
        }
        seen[$0]++;
    }' file

Important points here:

  1. $0 is the current record, which is usually the current line
  2. In awk, the ACTION part is optional, and the default action is { print; }
  3. In an arithmetic context, an uninitialized variable evaluates to 0
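
The points above can be tried on a small input. The sketch below also strips a trailing carriage return from each record first, since mixed line endings were the cause of the original problem. The file names words.txt and unique_words.txt are hypothetical, and the sub() call is an addition for this example rather than part of the one-liner being explained.

```shell
# words.txt is a hypothetical, unsorted sample with one CR+LF line.
printf 'pear\napple\npear\r\napple\nfig\n' > words.txt

# First rule: remove a trailing CR from the current record ($0).
# Second rule: the pattern is true only the first time a record is seen,
# and the default action { print; } runs because no action is given.
awk '{ sub(/\r$/, "") } !seen[$0]++' words.txt > unique_words.txt

cat unique_words.txt
```

Unlike sort -u, this keeps the lines in their original order, at the cost of holding every distinct line in memory in the seen array.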


Answer 4:

Apart from sort -u, you can also use awk '!_[$0]++' yourfile