I have a txt file with 12 columns. Some lines are duplicated and some are not. As an example, I copied the first 4 columns of my data:
```
0 0 chr12 48548073
0 0 chr13 80612840
2 0 chrX 4000600
2 0 chrX 31882528
3 0 chrX 3468481
4 0 chrX 31882726
4 0 chr3 75007624
```
Based on the first column, you can see that there are duplicates for every entry except '3'. I would like to print only the single entries - in this case, '3'.
The output will be:

```
3 0 chrX 3468481
```
Is there a quick way of doing this with awk or perl? I can only think of using a for loop in Perl, but given that I have around 1.5M entries, it will probably take some time.
Not a one-liner, but this small Perl script accomplishes the same task:
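A sketch of such a script (two passes over the file: the first counts the first-column keys, the second prints the lines whose key occurred exactly once; the file-as-argument and whitespace-separated columns are assumptions):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $file = shift or die "usage: $0 FILE\n";

# Pass 1: count how many times each first-column key appears.
my %count;
open my $fh, '<', $file or die "$file: $!\n";
while (<$fh>) {
    my ($key) = split;    # first whitespace-separated column
    $count{$key}++;
}
close $fh;

# Pass 2: print only the lines whose key appeared exactly once.
open $fh, '<', $file or die "$file: $!\n";
while (<$fh>) {
    my ($key) = split;
    print if $count{$key} == 1;
}
close $fh;
```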
Try this awk one-liner:
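Something along these lines (it reads the file twice: the first pass counts first-column keys, the second prints the lines whose key was seen exactly once - note the file name appears twice):

```
awk 'NR == FNR { count[$1]++; next } count[$1] == 1' file file
```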
Can't format properly for a comment. @JS웃 might be relying on GNU `uniq`; this seems to work in BSD-derived versions:
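A sketch that sticks to options GNU and BSD `uniq` share: count the first column with `uniq -c`, turn the singleton keys into anchored patterns, and pull the full lines back out with `grep`:

```
cut -d' ' -f1 file | sort | uniq -c |
awk '$1 == 1 { print "^" $2 " " }' |
grep -f /dev/stdin file
```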
There simply must be a shorter `perl` answer :-)

Here is another way:
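Something like this with GNU `uniq` (`uniq` needs sorted input, hence the `sort`):

```
sort file | uniq -u -w8
```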
`-w8` will compare the first 8 characters (that is, your first column) for uniqueness. The `-u` option will print only lines that appear once.

Test:
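With the four-column sample above, the first column is a single digit, so the comparison width there is 2 (the digit plus the space after it):

```
$ sort file | uniq -u -w2
3 0 chrX 3468481
```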
I knew there must be a `perl` one-liner response. Here it is - not heavily tested, so caveat emptor ;-)
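Something like this (`file` being the input; each piece is unpacked below):

```
perl -lane 'push @AoA, [@F]; $S{$F[0]}++; }{ for $i (0..$#AoA) { for $j (grep { $S{$_} == 1 } keys %S) { print "@{$AoA[$i]}" if @{$AoA[$i]}[0] == $j } }' file
```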
The disadvantage of this approach is that it outputs the data in a slightly modified format (easy enough to fix, I think) and it uses two `for` loops and a "butterfly operator" (!!). It also uses `grep()` (which introduces an implicit loop - i.e. one that runs even though you don't have to write it yourself), so it may be slow with 1.5 million records; I would like to see it compared to `awk` and `uniq`, though.

On the plus side, it uses no modules and should run on Windows and OSX. It works when there are several dozen similar records with a unique first column, and it doesn't require the input to be sorted before checking for unique lines. The solution is mostly cribbed from the one-liner examples near the end of Effective Perl Programming by Joseph Hall, Josh McAdams, and brian d foy (a great book - when the smart match `~~` and `given`/`when` dust settles, I hope a new edition appears).

Here's how (I think) it works:
- With `-a` we get the `@F` array for free, so we use it instead of splitting.
- With `-n` we're inside a `while () {}` loop, so we `push` the elements of `@F` into `@AoA` as anonymous arrays of references (the `[]` acts as an "anonymous array constructor"). That way they hang around and we can refer to them later (does this even make sense ???).
- We use the `$seen{$_}++` idiom (with `$S` instead of `$seen`) from the book mentioned above, described so well by @Axeman here on SO, to look at the first-column values (`$F[0]`) and set/increment the keys in our `%S` hash according to how many times we see an element (or line) with a given value (i.e. the line contents).
- We use `}{` to break out of the `while`; then, in a separate block, two `for` loops go through the outer array and examine each of its elements (which are themselves anonymous arrays - `$i` indexes them, one per line). For each one, we `grep` which values go with `keys` equal to "1" in the `%S` hash we created previously (the `for $j (grep {$S{$_}==1} keys %S)`, or inner loop), consecutively placing those values in `$j`.
- Finally, we print each stored line whose first element matches a unique key. We do that with `@{$AoA[$i]}[0] == $j`.
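Run against the four-column sample, it prints just the singleton, in the space-joined format mentioned above:

```
$ perl -lane 'push @AoA, [@F]; $S{$F[0]}++; }{ for $i (0..$#AoA) { for $j (grep { $S{$_} == 1 } keys %S) { print "@{$AoA[$i]}" if @{$AoA[$i]}[0] == $j } }' file
3 0 chrX 3468481
```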
The `awk` in the hands of @Kent is a bit more pithy. If anyone has suggestions on how to shorten or document my "line noise" (and I never say that about `perl`!), please add constructive comments! Thanks for reading.