Obtain patterns in one file from another using ack

Published 2019-01-15 03:38

Question:

Is there a way to search one file for a list of patterns stored in another file using ack, the way the -f option works in grep? I see ack has a -f option too, but it does something different from grep's -f.

Perhaps an example will give you a better idea. Suppose I have file1:

file1:
a
c
e

And file2:

file2:
a  1
b  2
c  3
d  4
e  5

And I want to find every line of file2 that matches a pattern from file1, giving:

a  1
c  3
e  5

Can ack do this? If not, is there a better way to handle the job (such as awk, or using a hash), given that I have millions of records in both files and really need an efficient approach? Thanks!

Answer 1:

Here's a Perl one-liner that uses a hash to hold the set of wanted keys from file1, giving O(1) (amortized) lookups per line of file2. It therefore runs in O(m+n) time, where m is the number of lines in your key set and n is the number of lines in the file you're testing.

perl -ne'BEGIN{open K,shift@ARGV;chomp(@a=<K>);@hash{@a}=()}m/^(\p{alpha}+)\s/&&exists$hash{$1}&&print' file1 file2

The key set will be held in memory while file2 is tested line by line against the keys.

Here's the same thing using Perl's -a command line option:

perl -ane'BEGIN{open G,shift@ARGV;chomp(@a=<G>);@h{@a}=();}exists$h{$F[0]}&&print' file1 file2

The second version is probably a little easier on the eyes. ;)

One thing to remember here is that you're more likely to be IO bound than processor bound, so the goal should be to minimize IO. Holding the entire lookup key set in a hash gives O(1) amortized lookups. The advantage this solution may have over others is that some (slower) solutions must scan the key file (file1) once for every line of file2; that sort of solution runs in O(m*n) time, where m is the size of the key file and n is the size of file2. The hash approach, by contrast, runs in O(m+n) time, which is an asymptotic difference. It wins by eliminating linear searches through the key set, and wins again by reading the keys from disk only once.



Answer 2:

Well okay, if we've switched from comments to answers... ;-)

Here's an awk one-liner that does the same as DavidO's Perl one-liner, but in awk. Awk is smaller and possibly leaner than Perl. There are several different implementations of awk, though, and I have no idea whether yours will perform better than the others, or better than Perl. You'll need to benchmark.

awk 'NR==FNR{a[$0]=1;next} {n=0;for(i in a){if($0~i){n=1}}} n' file1 file2

What does (should) this do?

The first part of the awk script matches only lines in file1 (NR==FNR holds when the record number within the current file equals the overall record number, which is only true while reading the first file) and populates the array. The second part (which runs on subsequent files) steps through each item in the array and sees whether it can be used as a regexp to match the current input line.

The second block of code is followed by a bare "n", which was set to either 0 or 1 in the previous block. In awk, 1 evaluates as true, and a pattern with no action block is equivalent to {print}, so if the previous block found a match, this one prints the current line.

If file1 contains strings instead of regexps, then you can change this to make it run faster by replacing the first comparison with if(index($0,i))....
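Under that assumption (file1 holds fixed strings rather than regexps), the faster index() variant described above might look like this, run against the question's sample data:

```shell
# Sample data from the question.
printf 'a\nc\ne\n' > file1
printf 'a  1\nb  2\nc  3\nd  4\ne  5\n' > file2

# Same structure as the one-liner above, but index() does a plain
# substring test instead of treating each key as a regexp.
awk 'NR==FNR{a[$0]=1;next} {n=0;for(i in a){if(index($0,i)){n=1}}} n' file1 file2
# Expected output:
# a  1
# c  3
# e  5
```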

Use with caution. Your mileage may vary. Created in a facility that may contain nuts.



Answer 3:

nawk 'FNR==NR{a[$0];next}($1 in a)' file3 file4

tested:

pearl.384> cat file3
a
c
e
pearl.385> cat file4
a  1 
b  2 
c  3 
d  4 
e  5
pearl.386> nawk 'FNR==NR{a[$0];next}($1 in a)' file3 file4
a  1 
c  3 
e  5
pearl.387>


Answer 4:

TXR may be another option for handling your requirements. I'm too new to it to write what you need, but its author is a frequent contributor to Stack Overflow. I'm certain you can do what you need with TXR; I'm just not certain it would perform better. You'd need to test.

Worth a look, if you're interested in an entire language devoted to pattern matching. :)



Answer 5:

You can convert the file into a regex for ack with tr. I used sed to remove the trailing pipe character.

ack "`tr '\n' '|' < patts | sed 's/.$//'`"

Note that this needs a couple of extra processes, so the awk solution is probably more efficient, but this one is quite easy to remember.
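As a concrete sketch against the question's sample data (substituting grep -E for ack here, since both accept the same alternation syntax for this regex and grep is more universally installed; "patts" is the pattern-file name used above):

```shell
# Patterns and data from the question.
printf 'a\nc\ne\n' > patts
printf 'a  1\nb  2\nc  3\nd  4\ne  5\n' > file2

# tr joins the patterns with '|' (yielding "a|c|e|"), and sed strips
# the trailing '|', producing the alternation regex "a|c|e".
grep -E "$(tr '\n' '|' < patts | sed 's/.$//')" file2
# Expected output:
# a  1
# c  3
# e  5
```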