How to find single entries in a txt file?

Posted 2019-08-14 11:48

Question:

I have a txt file with 12 columns. Some lines are duplicated and some are not. As an example, I copied the first 4 columns of my data.

0       0       chr12   48548073  
0       0       chr13   80612840
2       0       chrX    4000600 
2       0       chrX    31882528 
3       0       chrX    3468481 
4       0       chrX    31882726
4       0       chr3    75007624

Based on the first column, you can see that every entry is duplicated except '3'. I would like to print only the single entries, in this case '3'.

The output will be

3       0       chrX    3468481

Is there a quick way of doing this with awk or Perl? I can only think of using a for loop in Perl, but given that I have around 1.5M entries it will probably take some time.

Answer 1:

Try this awk one-liner:

awk '{a[$1]++;b[$1]=$0}END{for(x in a)if(a[x]==1)print b[x]}' file
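
For readers who want the logic of that one-liner spelled out, here is a minimal Python sketch of the same idea: count how often each first-column value occurs, remember the line that goes with it, and keep only the lines whose value was counted once. The function name single_entries is made up for this illustration. One difference is assumed on purpose: awk's for (x in a) visits keys in an unspecified order, while this version follows input order.

```python
from collections import Counter


def single_entries(lines):
    """Return the lines whose first whitespace-separated field occurs exactly once."""
    counts = Counter()   # plays the role of the awk array a
    last_line = {}       # plays the role of the awk array b
    for line in lines:
        fields = line.split()
        if not fields:   # skip blank lines
            continue
        key = fields[0]
        counts[key] += 1
        last_line[key] = line
    # dicts preserve insertion order, so output follows first appearance in the input
    return [last_line[key] for key in last_line if counts[key] == 1]


sample = [
    "0       0       chr12   48548073",
    "0       0       chr13   80612840",
    "2       0       chrX    4000600",
    "2       0       chrX    31882528",
    "3       0       chrX    3468481",
    "4       0       chrX    31882726",
    "4       0       chr3    75007624",
]

for line in single_entries(sample):
    print(line)
```

Both versions make a single pass over the data plus a pass over the distinct keys, so 1.5M entries should not be a problem as long as the distinct keys fit in memory.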


Answer 2:

Here is another way:

uniq -uw8 inputFile
  • -w8 will compare the first 8 characters (that is your first column) for uniqueness.
  • -u option will print only lines that appear once.
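
Note that uniq only compares adjacent lines, so this relies on equal first columns sitting next to each other, as they do in the sample. A rough Python sketch of what -u combined with -w8 does follows. The helper name uniq_u and its width parameter are invented here for illustration, and this is not how uniq itself is implemented.

```python
from itertools import groupby


def uniq_u(lines, width=8):
    """Keep lines whose first `width` characters differ from both neighbours."""
    kept = []
    # groupby only merges *consecutive* items with the same key, like uniq
    for _, group in groupby(lines, key=lambda line: line[:width]):
        run = list(group)
        if len(run) == 1:      # a run of length one is a "unique" line
            kept.append(run[0])
    return kept


grouped = [
    "0       0       chr12   48548073",
    "0       0       chr13   80612840",
    "2       0       chrX    4000600",
    "2       0       chrX    31882528",
    "3       0       chrX    3468481",
    "4       0       chrX    31882726",
    "4       0       chr3    75007624",
]
print(uniq_u(grouped))

# The same key split across non-adjacent lines is not detected as a duplicate.
scattered = ["1       a", "2       b", "1       c"]
print(uniq_u(scattered))
```

If the file is not already grouped by its first column, sorting it first is the usual remedy.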

Test:

$ cat file
0       0       chr12   48548073  
0       0       chr13   80612840
2       0       chrX    4000600 
2       0       chrX    31882528 
3       0       chrX    3468481 
4       0       chrX    31882726
4       0       chr3    75007624

$ uniq -uw8 file
3       0       chrX    3468481 


Answer 3:

Not a one-liner but this small Perl script accomplishes the same task:

#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';

# get filehandle
open( my $fh, '<', 'test.txt' ) or die "Cannot open test.txt: $!";

# lines whose key has been seen exactly once so far
my %line_map;
# how many times each key has been seen
my %seen;

while( my $line = <$fh> ) { # read a line

   # split on whitespace; the first field is the key
   my ($key, @values) = split(' ', $line);
   next unless defined $key; # skip blank lines

   if( $seen{$key}++ ) {
       # key seen before: it is a duplicate, so drop it for good
       # (without %seen, a third occurrence would be added back)
       delete $line_map{$key};
   }
   else { # first occurrence: remember the line
      $line_map{$key} = join("\t", @values);
   }
}
close($fh);

# now the map should only contain non-duplicates
for my $k ( keys %line_map ) {
   print "$k\t", $line_map{$k}, "\n";
}


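Answer 4:

I can't format this properly as a comment. @JS웃 might be relying on GNU uniq (the -w option is not part of POSIX uniq). This seems to work with BSD-derived versions, as long as exactly one first-column value is unique and the columns are separated by spaces:

grep ^`cut -d" " -f1 file.txt | uniq -u` file.txt

There simply must be a shorter Perl answer :-)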



Answer 5:

I knew there must be a Perl one-liner response. Here it is, not heavily tested, so caveat emptor ;-)

perl -anE 'push @AoA,[@F]; $S{$_}++ for @F[0];}{for $i (0..$#AoA) {for $j (grep {$S{$_}==1} keys %S) {say "@{$AoA[$i]}" if @{$AoA[$i]}[0]==$j}}' data.txt

The disadvantage of this approach is that it outputs the data in a slightly modified format (easy enough to fix, I think), and it uses two for loops and a "butterfly operator" (!!). It also uses grep(), which introduces an implicit loop (i.e. one that the code runs even though you don't have to write it yourself), so it may be slow with 1.5 million records. I would like to see it compared to awk and uniq, though.

On the plus side, it uses no modules and should run on Windows and OS X. It works when there are several dozen similar records with a unique first column, and it doesn't require the input to be sorted before checking for unique lines. The solution is mostly cribbed from the one-liner examples near the end of Effective Perl Programming by Joseph Hall, Josh McAdams, and brian d foy (a great book; when the dust settles on smart match ~~ and given/when, I hope a new edition appears).

Here's how (I think) it works:

  • Since we're using -a, we get the @F array for free, so we use it instead of splitting the line ourselves.
  • Since we're using -n, we're inside a while() {} loop, so we push a copy of @F into @AoA as a reference to an anonymous array (the [] acts as an "anonymous array constructor"). That way the rows hang around and we can refer to them later (does this even make sense ???).
  • Use the $seen{$_}++ idiom (we use %S instead of %seen) from the book mentioned above, described so well by @Axeman here on SO, to look at the first field, @F[0], and set/increment the matching key in our %S hash according to how many times we have seen a line with that value in its first column.
  • Use a "butterfly" }{ to break out of the while loop. Then, in a separate block, two for loops do the work: the outer loop walks the outer array by index $i (each element is an anonymous array, one per line), and the inner loop, for $j (grep {$S{$_}==1} keys %S), greps out the keys whose count in the %S hash we built earlier is 1 and places them in $j one after another.
  • Finally, inside those loops, we print any anonymous array whose first element equals the current $j. We do that with: (@{$AoA[$i]}[0]==$j).
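
The steps above amount to two passes: remember every row while counting each first-column value, then print the rows whose value was counted once. As a rough cross-check of that reading, here is a Python sketch. It is not a translation of the one-liner, rows_with_unique_key is a name invented for this example, and the sample rows are made up. It looks up each row's key in the count table directly, which avoids the nested loop and keeps the work roughly linear in the number of rows.

```python
def rows_with_unique_key(lines):
    """Two passes: count first fields, then keep rows whose first field occurs once."""
    rows = []    # like @AoA: one list of fields per input line
    seen = {}    # like %S: first field -> number of occurrences
    for line in lines:
        fields = line.split()          # like perl -a filling @F
        if not fields:                 # skip blank lines
            continue
        rows.append(fields)
        seen[fields[0]] = seen.get(fields[0], 0) + 1

    # second pass (the part after the "butterfly"): one dictionary lookup per row
    return [" ".join(fields) for fields in rows if seen[fields[0]] == 1]


# made-up, deliberately unsorted rows for illustration
data = [
    "4 0 chr3 75007624",
    "3 0 chrX 3468481",
    "4 0 chrX 31882726",
    "7 0 chr1 100",
]
for row in rows_with_unique_key(data):
    print(row)
```

Like the one-liner, this joins the fields with single spaces, so the original column spacing is not preserved.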

awk in the hands of @Kent is a bit more pithy. If anyone has suggestions on how to shorten or document my "line noise" (and I never say that about Perl!), please add constructive comments!

Thanks for reading.