This is similar to a previously asked question (linked below), but this time I would like to output the matching strings in rows instead of columns, as shown below.
I have two files, each with one column that look like this:
File 1
chr1 106623434
chr1 106623436
chr1 106623442
chr1 106623468
chr1 10699400
chr1 10699405
chr1 10699408
chr1 10699415
chr1 10699426
chr1 10699448
chr1 110611528
chr1 110611550
chr1 110611552
chr1 110611554
chr1 110611560
File 2
chr1 1066234
chr1 106994
chr1 1106115
I want to search file 1, pull out all lines that start with the string on line 1 of file 2, and output all of those matches on a single row. Then I want to do the same for line 2 of file 2, and so on, until the matches for every line of file 2 have been found in file 1 and each set has been output on its own row. I am also working with very large files, so I need something that does not require file 2 to be stored completely in memory; otherwise it will not run to completion. Hopefully the output will look something like this:
chr1 106623434 chr1 106623436 chr1 106623442 chr1 106623468
chr1 10699400 chr1 10699405 chr1 10699408 chr1 10699415 chr1 10699426 chr1 10699448
chr1 110611528 chr1 110611550 chr1 110611552 chr1 110611554 chr1 110611560
Similar question at:
How to move all strings in one file that match the lines of another to columns in an output file?
As long as your patterns don't overlap completely, this should work:
$ while read p; do grep "$p" file1 | tr '\n' '\t'; echo ""; done < file2
chr1 106623434 chr1 106623436 chr1 106623442 chr1 106623468
chr1 10699400 chr1 10699405 chr1 10699408 chr1 10699415 chr1 10699426 chr1 10699448
chr1 110611528 chr1 110611550 chr1 110611552 chr1 110611554 chr1 110611560
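If the keys might contain regex metacharacters, or should only match at the start of a line, a variant along these lines may be safer. This is only a sketch: `match_rows` is a made-up helper name, and it assumes a POSIX `sh` and `awk`. It still rereads the data file once per key, so it is no faster than the loop above.

```shell
# match_rows KEYS DATA (hypothetical helper name, not from the original answer)
# For every line of KEYS, print the lines of DATA that start with that key,
# joined by tabs on one row. A key with no matches yields an empty row.
match_rows() {
    while IFS= read -r key; do
        # The key is passed through the environment so awk does not process
        # backslash escapes in it; index(...) == 1 is a literal
        # "starts with" test, so the key is never treated as a regex.
        key="$key" awk '
            index($0, ENVIRON["key"]) == 1 { printf "%s%s", sep, $0; sep = "\t" }
            END { print "" }
        ' "$2"
    done < "$1"
}
```

With the files from the question it would be called as `match_rows file2 file1`.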
You could do this. It uses close to zero memory, but it will be very slow because it reads the whole of "file1" once for every line of "file2":
$ cat tst.awk
{
    ofs = ors = ""
    while ( (getline line < "file1") > 0 ) {
        if (line ~ "^"$0) {
            printf "%s%s", ofs, line
            ofs = "\t"
            ors = "\n"
        }
    }
    printf "%s", ors
    close("file1")
}
$ awk -f tst.awk file2
chr1 106623434 chr1 106623436 chr1 106623442 chr1 106623468
chr1 10699400 chr1 10699405 chr1 10699408 chr1 10699415 chr1 10699426 chr1 10699448
chr1 110611528 chr1 110611550 chr1 110611552 chr1 110611554 chr1 110611560
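If both files happen to be sorted in the same byte order (for example with `LC_ALL=C sort`, which the sample data already satisfies) and no key is a prefix of another key, a single pass over each file might be enough. The following is a sketch under those assumptions, not a drop-in replacement: `merge_rows` is a made-up name, and the result will be wrong if the inputs are not sorted that way.

```shell
# merge_rows KEYS DATA (hypothetical helper name)
# ASSUMPTIONS: KEYS and DATA are both sorted bytewise (LC_ALL=C sort order)
# and no key is a prefix of another key.
# Each file is read once; only the current key and current row are kept.
merge_rows() {
    LC_ALL=C awk -v keyfile="$1" '
        # Read the next key from the key file; have_key becomes 0 at EOF.
        function next_key() { have_key = ((getline key < keyfile) > 0) }

        BEGIN { next_key() }
        {
            # The line sorts after the current key and does not start with
            # it, so that key is finished: emit its row and move on.
            while (have_key && index($0, key) != 1 && ($0 "") > (key "")) {
                print row
                row = ""
                next_key()
            }
            # Literal prefix match: append the line to the current row.
            if (have_key && index($0, key) == 1)
                row = (row == "" ? "" : row "\t") $0
        }
        END {
            # Emit the row for the current key, then an empty row for
            # every key that is left over.
            while (have_key) { print row; row = ""; next_key() }
        }
    ' "$2"
}
```

It would be called as `merge_rows file2 file1`. Lines of the data file that match no key are skipped, and a key with no matches produces an empty row.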
You can try:
awk -v OFS="\t" '
NR==FNR {                 # first file on the command line (file2) only
    keys[++i] = $0        # keys[] stores the patterns to search for; i counts them
    next                  # stop processing this record and go on to the next one
}
{
    for (j = 1; j <= i; ++j)
        # if the line starts with this key, append it to the row for that key
        if ($0 ~ "^" keys[j])
            a[keys[j]] = a[keys[j]] (a[keys[j]] != "" ? OFS : "") $0
}
END {
    for (j = 1; j <= i; ++j) print a[keys[j]]   # print the assembled rows
}' file2 file1
You get:
chr1 106623434 chr1 106623436 chr1 106623442 chr1 106623468
chr1 10699400 chr1 10699405 chr1 10699408 chr1 10699415 chr1 10699426 chr1 10699448
chr1 110611528 chr1 110611550 chr1 110611552 chr1 110611554 chr1 110611560