How to pull out all lines of a file matching each

Posted 2019-09-10 17:15

Question:

This is similar to a question asked previously (linked below), but this time I would like the matching strings written out as rows instead of columns, as shown below.

I have two files, each with one entry per line, that look like this:

File 1

chr1 106623434
chr1 106623436
chr1 106623442
chr1 106623468
chr1 10699400
chr1 10699405
chr1 10699408
chr1 10699415
chr1 10699426
chr1 10699448
chr1 110611528
chr1 110611550
chr1 110611552
chr1 110611554
chr1 110611560

File 2

chr1 1066234
chr1 106994
chr1 1106115

I want to search file 1, pull out every line that matches line 1 of file 2 (as in the sample, every line that starts with that string), and output all of those matches on one row. Then I want to do the same for line 2 of file 2, and so on, until every line of file 2 has been searched for in file 1 and its matches written to their own row. I am also working with very large files, so I need something that does not require file 2 to be stored completely in memory; otherwise it will not run to completion. Hopefully the output will look something like this:

chr1 106623434  chr1 106623436  chr1 106623442  chr1 106623468
chr1 10699400   chr1 10699405   chr1 10699408   chr1 10699415   chr1 10699426  chr1 10699448 
chr1 110611528  chr1 110611550  chr1 110611552  chr1 110611554  chr1 110611560  

Similar question at: How to move all strings in one file that match the lines of another to columns in an output file?

Answer 1:

As long as your patterns don't overlap completely, this should work:

$ while IFS= read -r p; do grep "^$p" file1 | tr '\n' '\t'; echo; done < file2
chr1 106623434  chr1 106623436  chr1 106623442  chr1 106623468
chr1 10699400   chr1 10699405   chr1 10699408   chr1 10699415   chr1 10699426   chr1 10699448
chr1 110611528  chr1 110611550  chr1 110611552  chr1 110611554  chr1 110611560
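
The caveat above arises partly because grep treats each line of file2 as a regular expression rather than as a literal string. Below is a minimal sketch of a stricter variant, assuming the goal is a literal "starts with" match (an assumption drawn from the sample data, not something the question states outright). The helper name `match_prefix` and the trimmed sample files are hypothetical and only there to make the sketch runnable.

```shell
#!/bin/sh
# Sketch only: literal prefix matching, one output row per line of file2.
# The sample files are a trimmed copy of the question's data.
tmp=$(mktemp -d)
printf '%s\n' 'chr1 106623434' 'chr1 106623436' 'chr1 10699400' 'chr1 110611528' > "$tmp/file1"
printf '%s\n' 'chr1 1066234' 'chr1 106994' 'chr1 1106115' > "$tmp/file2"

# match_prefix is a hypothetical helper: print, tab-separated on one row,
# every line of file $2 that begins with the literal string $1.
match_prefix() {
    # The prefix is passed through ENVIRON so awk does not interpret
    # backslashes in it, and index() compares it literally (no regexp).
    prefix=$1 awk '
        index($0, ENVIRON["prefix"]) == 1 { printf "%s%s", sep, $0; sep = "\t" }
        END { if (sep != "") print "" }
    ' "$2"
}

result=$(
    while IFS= read -r p; do
        match_prefix "$p" "$tmp/file1"
    done < "$tmp/file2"
)
printf '%s\n' "$result"
rm -rf "$tmp"
```

Like the loop above, this still rereads file1 once per line of file2, so it only changes how a match is decided, not the running time.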


Answer 2:

You could do this. It uses close to zero memory, but it will be very slow, since it reads the whole of "file1" once for every line of "file2":

$ cat tst.awk
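{
    ofs = ors = ""                  # no separator or newline until a match is found
    while ( (getline line < "file1") > 0 ) {
        if (line ~ "^"$0) {         # file2 line used as a regexp anchored at line start
            printf "%s%s", ofs, line
            ofs = "\t"              # later matches on this row are tab-separated
            ors = "\n"              # end the row only if something was printed
        }
    }
    printf "%s", ors
    close("file1")                  # so the next file2 line rereads file1 from the top
}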

$ awk -f tst.awk file2
chr1 106623434  chr1 106623436  chr1 106623442  chr1 106623468
chr1 10699400   chr1 10699405   chr1 10699408   chr1 10699415   chr1 10699426   chr1 10699448
chr1 110611528  chr1 110611550  chr1 110611552  chr1 110611554  chr1 110611560
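
If both files happen to be sorted, as the sample data appears to be, the repeated rereading of file1 could in principle be avoided. The sketch below is not the method of the answer above but a different technique, a single-pass merge. It assumes both files are sorted in the C locale and that no line of file2 is a prefix of another line of file2; if either assumption fails, it will miss matches. The function name `next_key`, the variable names, and the trimmed sample files are all hypothetical.

```shell
#!/bin/sh
# Sketch only: one pass over both files, assuming both are sorted with
# LC_ALL=C and that no line of file2 is a prefix of another line of file2.
tmp=$(mktemp -d)
printf '%s\n' 'chr1 106623434' 'chr1 106623436' 'chr1 10699400' \
              'chr1 10699405' 'chr1 110611528' > "$tmp/file1"
printf '%s\n' 'chr1 1066234' 'chr1 106994' 'chr1 1106115' > "$tmp/file2"

result=$(
    LC_ALL=C awk -v keyfile="$tmp/file2" '
        # Print the finished row (if any) and read the next key from file2;
        # have_key becomes 0 once file2 is exhausted.
        function next_key() {
            if (row != "") print row
            row = ""
            have_key = ((getline key < keyfile) > 0)
        }
        BEGIN { next_key() }
        {
            # A key that sorts before this line and is not its prefix can
            # never match a later line, so move past it.
            while (have_key && index($0, key) != 1 && ($0 "") > (key ""))
                next_key()
            if (!have_key) exit      # no keys left, nothing more can match
            if (index($0, key) == 1)
                row = (row == "" ? "" : row "\t") $0
        }
        END { if (row != "") print row }
    ' "$tmp/file1"
)
printf '%s\n' "$result"
rm -rf "$tmp"
```

Only the current key and the current output row are held in memory, and each file is read once. Unsorted input would need a `LC_ALL=C sort` of both files first, which changes the order of the output rows.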


Answer 3:

You can try:

awk -v OFS='\t' '
NR==FNR {                     # first file on the command line (file2) only
    keys[++i] = $0            # keys[] stores the patterns to search for; i counts them
    next                      # skip the rest of the script for this record
}
{
    for (j = 1; j <= i; ++j)
        # if the line starts with this key, append it to that key's row
        if ($0 ~ "^" keys[j])
            a[keys[j]] = a[keys[j]] (a[keys[j]] != "" ? OFS : "") $0
}
END {
    for (j = 1; j <= i; ++j) print a[keys[j]]   # print one formatted row per key
}' file2 file1

You get:

chr1 106623434  chr1 106623436  chr1 106623442  chr1 106623468
chr1 10699400   chr1 10699405   chr1 10699408   chr1 10699415   chr1 10699426   chr1 10699448
chr1 110611528  chr1 110611550  chr1 110611552  chr1 110611554  chr1 110611560


Tags: awk, grep