Use Perl to extract specific output lines

Posted 2019-09-04 06:37

Question:

I'm trying to build a system that generalizes rules from input text, and I'm using reVerb to create my initial set of rules. For instance, I run the following command[*]:

$ echo "Bananas are an excellent source of potassium." | ./reverb -q | tr '\t' '\n' | cat -n

To generate output of the form:

    1  stdin
    2  1
    3  Bananas
    4  are an excellent source of
    5  potassium
    6  0
    7  1
    8  1
    9  6
   10  6
   11  7
   12  0.9999999997341693
   13  Bananas are an excellent source of potassium .
   14  NNS VBP DT JJ NN IN NN .
   15  B-NP B-VP B-NP I-NP I-NP I-NP I-NP O
   16  bananas
   17  be source of
   18  potassium

I'm currently piping the output to a file, which includes the preceding white space and numbers as depicted above.

What I'm really after is just the simple rule at the end, i.e. lines 16, 17 and 18. I've been trying to write a script that extracts just that component and writes it to a new file in the form of a Prolog clause, i.e. be source of(bananas, potassium).

Is that feasible? Can Prolog rules contain white space like that?

I think I'm locked into getting all that output from reVerb, so what would be the best way to extract the part I want? With a Perl script? Or maybe sed?

*Later I plan to replace this with a larger input file as opposed to just single sentences.

Answer 1:

This seems wasteful. Why not leave the tabs as they are, and use:

$ echo "Bananas are an excellent source of potassium." \
  | ./reverb -q | cut --fields=16,17,18

And yes, you can have rules like this in Prolog. See the answer by @mat. You need to know a bit of Prolog before you move on, I guess.

It is easier, however, to just make the string a valid name for a predicate:

  • be_source_of with underscores instead of spaces
  • or 'be source of' with spaces, and enclosed in single quotes.
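As a rough illustration of the underscore option, here is a small shell sketch. The helper name to_atom is made up for this example, and it assumes the relation phrase starts with a letter (a result starting with a digit would not be a valid unquoted atom).

```shell
# Hypothetical helper (not part of reVerb): turn a relation phrase into an
# unquoted Prolog atom by lowercasing it and replacing every run of
# characters other than a-z, 0-9 and _ with a single underscore.
# Assumption: the phrase starts with a letter.
to_atom() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9_]\{1,\}/_/g'
}

to_atom "be source of"   # prints be_source_of
```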

You can probably use awk to do what you want with the three fields. See for example the printf command in awk. Or, you can parse it again from Prolog directly. Both are beyond the scope of your current question, I feel.
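The awk route could look roughly like the sketch below. It is only a sketch under a few assumptions: the function name reverb_to_prolog is invented here, fields 16 to 18 are taken to be argument, relation, argument as in the question's sample output, and the sample line is simulated with printf in place of ./reverb -q. It uses the single-quoted form for all three parts and doubles any embedded single quote; backslashes inside a field are not handled.

```shell
# Hypothetical helper: read reVerb's tab-separated output on stdin and
# print one Prolog fact per extraction, using single-quoted atoms.
reverb_to_prolog() {
  awk -F'\t' -v q="'" '
    # Wrap s in single quotes, doubling any single quote it contains.
    function quote(s) {
      gsub(q, q q, s)
      return q s q
    }
    # Assumption: fields 16, 17, 18 are arg1, relation, arg2.
    NF >= 18 { printf "%s(%s, %s).\n", quote($17), quote($16), quote($18) }
  '
}

# Simulated reVerb line (the 18 fields from the question), standing in
# for: echo "Bananas are ..." | ./reverb -q
printf 'stdin\t1\tBananas\tare an excellent source of\tpotassium\t0\t1\t1\t6\t6\t7\t0.9999999997341693\tBananas are an excellent source of potassium .\tNNS VBP DT JJ NN IN NN .\tB-NP B-VP B-NP I-NP I-NP I-NP I-NP O\tbananas\tbe source of\tpotassium\n' \
  | reverb_to_prolog
# prints: 'be source of'('bananas', 'potassium').
```

Because it works line by line, the same function should also cope with a larger input file that produces one extraction per output line.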



Answer 2:

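sed -n ':cycle
# keep appending lines until the last one has been read
$!{N
   # once more than 3 lines are held, drop the oldest one
   4,$D
   b cycle
   }
# the pattern space now holds the last 3 lines: arg1, relation, arg2
s/\(.*\)\n\(.*\)\n\(.*\)/\2 (\1,\3)/p' YourFile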

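If the line numbers are actually in the file and not just shown for reference, replace the last sed action with:

s/^[[:blank:]]*[0-9]\{1,\}[[:blank:]]\{1,\}\(.*\)\n[[:blank:]]*[0-9]\{1,\}[[:blank:]]\{1,\}\(.*\)\n[[:blank:]]*[0-9]\{1,\}[[:blank:]]\{1,\}\(.*\)/\2 (\1,\3)/p

([[:blank:]] matches both spaces and the tab that cat -n puts after each number.)

This assumes the last 3 lines are the source of your "rules".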
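For the larger input file mentioned in the question, taking only the last three lines would keep just the final extraction. A possible awk sketch for that case follows. It assumes the file was produced exactly as in the question (tr '\t' '\n' | cat -n), so every extraction occupies 18 consecutive numbered lines; the function name numbered_to_rules is made up for this example.

```shell
# Hypothetical helper: read "cat -n"-numbered lines on stdin, 18 lines per
# extraction, and print "relation (arg1,arg2)" for each extraction.
numbered_to_rules() {
  awk '
    { sub(/^[ \t]*[0-9]+[ \t]+/, "") }    # drop the "cat -n" number prefix
    NR % 18 == 16 { arg1 = $0 }           # 16th line of the block: arg1
    NR % 18 == 17 { rel  = $0 }           # 17th line of the block: relation
    NR % 18 == 0  { printf "%s (%s,%s)\n", rel, arg1, $0 }   # 18th: arg2
  '
}

# Usage (file names are placeholders):
#   numbered_to_rules < numbered_output.txt > rules.txt
```

The output keeps the shape produced by the sed command above, so the relation name would still need quoting or underscores before Prolog accepts it.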



Answer 3:

Regarding the Prolog part of the question:

Yes, Prolog facts can contain whitespace like this, with suitable operator declarations present.

For example:

:- op(700, fx, be).
:- op(650, fx, source).
:- op(600, fx, of).

Example query and its result, to let you see the shape of terms that are created with this syntax:

?- write_canonical(be source of(a, b)).
be(source(of(a,b))).

Therefore, with these operator declarations, a fact like:

be source of(a, b).

is exactly the same as stating:

be(source(of(a,b))).

Depending on your use case and other definitions, it may even be an advantage to create facts of this kind (i.e., facts of the form be/1 instead of source_of/2). However, if this is the only kind of fact you need, you can simply write:

source_of(a, b).

This creates no redundant wrappers and is easier to use.

Or, as Boris suggested, you can use single quotes as in 'be source of'/2.