Using bash script (Ubuntu 16.04), I'm trying to compare 2 lists of ranges: does any number in any of the ranges in file1 coincide with any number in any of the ranges in file2? If so, print the row in the second file. Here I have each range as 2 tab-delimited columns (in file1, row 1 represents the range 1-4, i.e. 1, 2, 3, 4). The real files are quite big.
file1:
1 4
5 7
8 11
12 15
file2:
3 4
8 13
20 24
Desired output:
3 4
8 13
My best attempt has been:
awk 'NR=FNR { x[$1] = $1+0; y[$2] = $2+0; next};
{for (i in x) {if (x[i] > $1+0); then
{for (i in y) {if (y[i] <$2+0); then
{print $1, $2}}}}}' file1 file2 > output.txt
This returns an empty file.
I'm thinking that the script will need to involve range comparisons using if-then conditions and iterate through each line in both files. I've found examples of each concept, but can't figure out how to combine them.
Any help appreciated!
If Perl solution is preferred, then below one-liner would work
Break down:
Basically we set file = 1 when we are processing the first file and file = 2 when we are processing the second file. When we are in the first file, read the line into an array keyed on each field of the line. When we are in the second file, process the array (nums) and check if there is an entry for each field on the line. If there is, print it.
For GNU awk as I'm controlling the
for
scanning order for optimizing time:Test data:
Output:
and
If the ranges are ordered according to their lower bounds, we can use this to make the algorithms more efficient. The idea is to alternately proceed through the ranges in file1 and file2. More precisely, when we have a certain range R in
file2
, we take further and further ranges infile1
until we know whether these overlap with R. Once we know this, we switch to the next range infile2
.The script can be called with
./script.sh file2 file1
awk solution:
The output:
It depends on how big your files are, of course. If they are not big enough to exhaust the memory, you can try this 100% bash solution:
This is just a starting point. There are many possible performance / memory footprint improvements. But they strongly depend on the sizes of your files and on the distributions of your ranges.
EDIT 1: improved the range overlap test.
EDIT 2: reused the excellent optimization proposed by RomanPerekhrest (unset already printed ranges from
file2
). The performance should be better when the probability that ranges overlap is high.EDIT 3: performance comparison with the
awk
version proposed by RomanPerekhrest (after fixing the initial small bugs):awk
is between 10 and 20 times faster thanbash
on this problem. If performance is important and you hesitate betweenawk
andbash
, prefer: