how to trim file - remove the columns with the same value

Posted 2020-03-02 04:41

I would like your help trimming a file by removing every column that holds the same value on all lines.

# the file I have (tab-delimited, millions of columns)
jack 1 5 9
john 3 5 0
lisa 4 5 7

# the file I want (remove the columns with the same value in all lines)
jack 1 9
john 3 0
lisa 4 7

Could you please point me in the right direction on this problem? I would prefer a sed or awk solution, or maybe a perl one.

Thanks in advance. Best,

Tags: perl unix sed awk
8 answers
叛逆
#2 · 2020-03-02 04:43

Not fully tested, but this seems to work for the provided test set. Note that it overwrites the original file...

#!/bin/bash

# change 4 below to match the number of columns
for i in {2..4}; do
    # a column is constant when sort -u collapses it to a single value
    if [ "$(cut -f "$i" input | sort -u | wc -l)" -eq 1 ]; then
        awk -v field="$i" 'BEGIN{FS=OFS="\t"} {$field="_"; print}' input > tmp2
        mv tmp2 input
    fi
done

$ cat input
jack    1   5   9
john    3   5   0
lisa    4   5   7

$ ./cnt.sh 

$ cat input
jack 1 _ 9
john 3 _ 0
lisa 4 _ 7

Using _ to make the output clearer...

Explosion°爆炸
#3 · 2020-03-02 04:44

If you know in advance which column to strip out, then cut will be helpful (--complement requires GNU cut):

cut --complement -f 3 filename   # tab is cut's default delimiter
爷、活的狠高调
#4 · 2020-03-02 04:49

You can select the columns to cut out like this:

# using bash/awk
# 1000000 stands in for the number of columns (you mentioned millions); adjust it to your file
for cols in `seq 2 1000000` ; do
    cut -d DELIMITER -f $cols FILE | awk -v c=$cols 'NR==1 {first=$0} $0!=first {varies=1} END {if (!varies) printf("%i,",c)}'
done | sed 's/,$//' > tmplist
cut --complement -d DELIMITER -f `cat tmplist` FILE

But it can be REALLY slow, because it's not optimized, and reads the file several times... so be careful with huge files.

Or you can read the whole file once with awk, collect the droppable columns, and then use cut:

cut --complement -d DELIMITER -f `awk -F DELIMITER 'NR==1 {for (i=1;i<=NF;i++) first[i]=$i} {for (i=1;i<=NF;i++) if ($i!=first[i]) varies[i]=1} END {for (i=1;i<=NF;i++) if (!varies[i]) printf("%i,",i)}' FILE | sed 's/,$//'` FILE

HTH

Root(大扎)
#5 · 2020-03-02 04:52

As I understand it, you want to check every column for variance and remove each column whose value never varies. If that is the case, I have a suggestion: not a ready-made script, but I think you'll be able to figure it out. Look at cut, which extracts parts of a line. Extract, say, column one, run the output through sort and uniq, and if only one value remains, then all values in that column are identical. That way you can collect the numbers of the columns that have no variance. You will need a shell script to see how many columns your file has (e.g. take head -n 1 and count the delimiters), run that procedure on every column while storing the constant column numbers in an array, and at the end craft a cut line that removes the columns of no interest; a rough sketch follows below. Granted, it's not awk or perl, but it should work using only traditional Unix tools. Well, you can call them from a perl script if you want :)

And if I misunderstood the question, maybe cut will still be useful :) It seems to be one of the lesser-known tools.
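For what it's worth, here is a minimal bash sketch of that recipe; the file name input.txt is a stand-in, the file is assumed tab-delimited, and --complement needs GNU cut:

#!/bin/bash
file=input.txt                                        # stand-in name, point this at your data
ncols=$(head -n 1 "$file" | awk -F'\t' '{print NF}')  # count columns from the first line

constant=()                                           # numbers of the columns with no variance
for ((i = 2; i <= ncols; i++)); do
    # a column is constant when sort -u collapses it to a single value
    if [ "$(cut -f "$i" "$file" | sort -u | wc -l)" -eq 1 ]; then
        constant+=("$i")
    fi
done

# craft the cut line that drops the uninteresting columns
if [ "${#constant[@]}" -gt 0 ]; then
    cut --complement -f "$(IFS=,; echo "${constant[*]}")" "$file"
else
    cat "$file"
fi

It still reads the file once per column, so the caveat above about huge files applies here too.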

够拽才男人
#6 · 2020-03-02 04:52

As far as I can tell, you'll need to make this a multi-pass program to meet your needs without blowing through memory. For starters, load the first line of the file into an array.

open FH, 'datafile.txt' or die "$!";
my @mask;                               # per-column flags: true once a column is seen to vary
my @first_line = split /\s+/, <FH>;     # the reference row to compare against

Then you'll want to sequentially read in the other lines

while (my $line = <FH>) {
    my @next_line = split /\s+/, $line;
    # flag every column whose value differs from the first line;
    # a flagged column has variance and must be kept
    $mask[$_] = 1 for grep { $next_line[$_] ne $first_line[$_] } 0 .. $#first_line;
}

When you get to the bottom of the file, go back to the top and use @mask to determine which columns to print.
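A rough sketch of that second pass, under the same assumptions as above (whitespace-split columns, @mask true for every column that varies):

seek FH, 0, 0 or die "$!";              # rewind to the top of the file
while (my $row = <FH>) {
    chomp $row;
    my @cols = split /\s+/, $row;
    # keep only the flagged (varying) columns; the name column is
    # flagged too, since the names differ from line to line
    print join("\t", @cols[grep { $mask[$_] } 0 .. $#cols]), "\n";
}
close FH;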

做自己的国王
#7 · 2020-03-02 04:58

Here's a quick perl script to figure out which columns can be cut.

open FH, "file" or die $!;
chomp(my $first = <FH>);                #snag the first row
my @baseline = split /\t/, $first;
my @linemap = 0..$#baseline;            #list all equivalent columns (all of them)

while (<FH>) {                          #loop over the file
    chomp;                              #so the last column compares cleanly
    my @line = split /\t/;
    @linemap = grep { $baseline[$_] eq $line[$_] } @linemap; #filter out any that aren't equal
}
print join(" ", @linemap), "\n";

You can use many of the above recommendations to actually remove the columns. My favorite would probably be the cut implementation, partly because the above perl script can be modified to give you the precise command (or even run it for you).

@linemap = map {$_+1} @linemap;                   #Cut is 1-index based
print "cut --complement -f ".join(",",@linemap)." file\n";