I have a txt file with 12 columns. Some lines are duplicated and some are not. As an example, I copied the first 4 columns of my data.
0 0 chr12 48548073
0 0 chr13 80612840
2 0 chrX 4000600
2 0 chrX 31882528
3 0 chrX 3468481
4 0 chrX 31882726
4 0 chr3 75007624
Based on the first column, you can see that every entry is duplicated except '3'.
I would like to print only the entries that occur once, in this case '3'.
The output will be
3 0 chrX 3468481
Is there a quick way of doing this with awk or perl? I can only think of using a for loop in perl, but given that I have around 1.5M entries it will probably take some time.
Try this awk one-liner:
awk '{a[$1]++;b[$1]=$0}END{for(x in a)if(a[x]==1)print b[x]}' file
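One thing to keep in mind: for (x in a) visits the keys in an unspecified order, so with several unique entries the output may not follow the input order. A minimal sketch of an order-preserving variant, assuming the file can be read twice (the temporary sample file and the variable name result are made up for illustration):

```shell
# Sample data copied from the question, written to a throwaway file.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.txt" <<'EOF'
0 0 chr12 48548073
0 0 chr13 80612840
2 0 chrX 4000600
2 0 chrX 31882528
3 0 chrX 3468481
4 0 chrX 31882726
4 0 chr3 75007624
EOF

# First pass (NR == FNR): count each first-column value.
# Second pass: print lines whose key was counted exactly once, in input order.
result=$(awk 'NR == FNR { count[$1]++; next } count[$1] == 1' \
    "$tmpdir/sample.txt" "$tmpdir/sample.txt")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Reading the file twice keeps only the counts in memory rather than a copy of every line, which may matter with 1.5M entries.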
Here is another way:
uniq -uw8 inputFile
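-w8 compares only the first 8 characters of each line (that is your first column, provided it is padded to that width). Note that -w is a GNU extension.
-u prints only lines that appear once. uniq compares adjacent lines only, so the duplicates must already be grouped together, as they are in your data.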
Test:
$ cat file
0 0 chr12 48548073
0 0 chr13 80612840
2 0 chrX 4000600
2 0 chrX 31882528
3 0 chrX 3468481
4 0 chrX 31882726
4 0 chr3 75007624
$ uniq -uw8 file
3 0 chrX 3468481
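Because uniq compares adjacent lines and -w counts characters rather than columns, the command above depends on the input being grouped by key and on the first column fitting the chosen width. A hedged sketch that sidesteps both assumptions by extracting the key column first (the shuffled sample and the variable name unique_keys are invented for illustration):

```shell
tmpfile=$(mktemp)
# The question's data, deliberately shuffled so duplicates are not adjacent.
cat > "$tmpfile" <<'EOF'
4 0 chrX 31882726
0 0 chr12 48548073
3 0 chrX 3468481
2 0 chrX 4000600
0 0 chr13 80612840
4 0 chr3 75007624
2 0 chrX 31882528
EOF

# Pull out column 1, sort so equal keys become adjacent, keep keys seen once.
unique_keys=$(awk '{ print $1 }' "$tmpfile" | sort | uniq -u)
printf '%s\n' "$unique_keys"
rm -f "$tmpfile"
```

This only lists the keys that occur once; the full lines still have to be selected with one of the other approaches in this thread.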
Not a one-liner but this small Perl script accomplishes the same task:
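#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';
# get filehandle, and stop if the file cannot be opened
open( my $fh, '<', 'test.txt' ) or die "Cannot open test.txt: $!";
# lines whose key (first column) has been seen exactly once so far
my %line_map;
# every key seen so far, so a third occurrence is not mistaken for a first
my %seen;
while( my $line = <$fh> ) { # read a line
    # split on whitespace, ignoring leading whitespace and the newline
    my ($key, @values) = split(' ', $line);
    next unless defined $key; # skip blank lines
    # delete a line if its key has been seen before
    if( $seen{$key}++ ) {
        delete $line_map{$key};
    }
    else { # first occurrence: remember the rest of the line
        $line_map{$key} = join("\t", @values);
    }
}
close($fh);
# now the map should only contain non-duplicates
# (hash order is arbitrary; the output is tab-separated)
for my $k ( keys %line_map ) {
    print "$k\t", $line_map{$k}, "\n";
}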
I can't format this properly in a comment. @JS웃 might be relying on GNU uniq; this seems to work in BSD-derived versions:
grep ^`cut -d" " -f1 col_data.txt | uniq -u` file.txt
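The backtick substitution works as long as exactly one key is unique; with several, grep would treat the extra words as file names. A sketch that copes with any number of unique keys by feeding them to awk on standard input (the file and variable names are placeholders, the last sample line is made up to provide a second unique key, and duplicates are still assumed to be adjacent, as uniq requires):

```shell
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
0 0 chr12 48548073
0 0 chr13 80612840
2 0 chrX 4000600
2 0 chrX 31882528
3 0 chrX 3468481
4 0 chrX 31882726
4 0 chr3 75007624
5 0 chr7 1000
EOF

# cut + uniq -u list the keys that occur once; awk reads them from stdin ("-")
# into the keep array, then prints lines of the file whose first field is kept.
matches=$(cut -d" " -f1 "$tmpfile" | uniq -u |
    awk 'NR == FNR { keep[$1]; next } $1 in keep' - "$tmpfile")
printf '%s\n' "$matches"
rm -f "$tmpfile"
```

Matching on the whole first field also avoids a key such as 3 accidentally matching lines that merely start with 3.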
There simply must be a shorter perl answer :-)
I knew there must be a perl one-liner response. Here it is; not heavily tested, so caveat emptor ;-)
perl -anE 'push @AoA,[@F]; $S{$_}++ for @F[0];}{for $i (0..$#AoA) {for $j (grep {$S{$_}==1} keys %S) {say "@{$AoA[$i]}" if @{$AoA[$i]}[0]==$j}}' data.txt
The disadvantage of this approach is that it outputs the data in a slightly modified format (this is easy enough to fix, I think), and it uses two for loops and a "butterfly operator" (!!). It also uses grep(), which introduces an implicit loop (i.e. one that runs even though you don't have to code up a loop yourself), so it may be slow with 1.5 million records. I would like to see it compared to awk and uniq, though.
On the plus side, it uses no modules and should run on Windows and OSX. It works when there are several dozen similar records with a unique first column, and it doesn't require the input to be sorted prior to checking for unique lines. The solution is mostly cribbed from the one-liner examples near the end of Effective Perl Programming by Joseph Hall, Josh McAdams, and brian d foy (a great book; when the smart match ~~ and given when dust settles I hope a new edition appears).
Here's how (I think) it works:
- Since we're using -a, we get the @F array for free, so we use it instead of splitting the line ourselves.
- Since we're using -n, we're inside a while() {} loop, so we push the elements of @F into @AoA as a reference to an anonymous array (the [] acts as an "anonymous array constructor"). That way they hang around and we can refer to them later (does this even make sense ???).
- We use the $seen{$_}++ idiom (with $S instead of $seen) from the book mentioned above, described so well by @Axeman here on SO, to look at the first field, @F[0], and set/increment keys in our %S hash according to how many times we see a line with that value in its first column.
- We use a "butterfly" }{ to break out of the while loop. Then, in a separate block, we use two for loops. The outer loop goes through the outer array by index $i and examines each element (each is itself an anonymous array, one per line). The inner loop, for $j (grep {$S{$_}==1} keys %S), uses grep to pick the keys whose count in the %S hash we created previously equals 1, and places them one after another in $j.
- Finally, for each combination we print the anonymous array whose first element equals $j. We do that with the test @{$AoA[$i]}[0]==$j.
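The steps above boil down to: remember every line, count the first-column values, then, once the input ends, print the remembered lines whose key was counted once. As a rough cross-check of that logic, here is the same idea sketched in awk (the array names line, key, and seen are illustrative stand-ins for @AoA and %S, and the sample file is the data from the question):

```shell
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
0 0 chr12 48548073
0 0 chr13 80612840
2 0 chrX 4000600
2 0 chrX 31882528
3 0 chrX 3468481
4 0 chrX 31882726
4 0 chr3 75007624
EOF

out=$(awk '
    # Per-line work: store the line and its key, and count the key.
    { line[NR] = $0; key[NR] = $1; seen[$1]++ }
    # After the last line: replay the stored lines in input order
    # and print those whose key was counted exactly once.
    END {
        for (i = 1; i <= NR; i++)
            if (seen[key[i]] == 1) print line[i]
    }' "$tmpfile")
printf '%s\n' "$out"
rm -f "$tmpfile"
```

A single loop over the stored lines with a hash lookup replaces the nested for/grep loops, and the lines come out unmodified and in their original order.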
awk in the hands of @Kent is a bit more pithy. If anyone has suggestions on how to shorten or document my "line noise" (and I never say that about perl!), please add constructive comments!
Thanks for reading.