How to extract content between two patterns in Uni

2019-05-30 20:23发布

问题:

I have a file test.txt containing below code.

select * from emp where empid=1;  

select *   
from dep  
where jkdsfj  

select *   
from sal   
where jkdsfj  

I need to extract the content between "from" and "where".

Note: If "where" is on a new line it must still select the material between "from" and "where".

Output should be like:

emp  
dep  
sal  

How can I do it?

回答1:

Given the need to handle multiple lines, you can choose sed, or awk, or one of the more complex scripting languages like Perl or Python.

With a bit of care, sed is adequate. I created a file script.4 (having created script, script2, and losing most of what little hair was left on my head**, and restarting with script.1, script.2 and script.3, which were deliberately incomplete) like this:

/from.*where/  { s/.*from *//; s/ *where.*//;          p; n; }
/from/,/where/ { s/.*from *//; s/ *where.*//; /^ *$/d; p;    }

And I created a test file, data, like this:

select * from emp where empid=1;  

select *   
from dep  
where jkdsfj  

select *   
from sal   
where jkdsfj  

select elephants
from abject poverty
join flying tigers
where abelone = shellfish;

select mouse
from toolset
join animals where tail = cord
and buttons = legs

and ran the command like this, to get the output shown:

$ sed -n -f script.4 data
emp
dep  
sal   
abject poverty
join flying tigers
toolset
join animals
$

The script is 'simple'. For lines which contain both from and where, delete everything up to the from (plus any spaces after it), delete everything from the where onward (plus any spaces before it), print what's left, and go to the next line of input.

Otherwise, between a line which contains from and a line that contains where, delete everything up to the from (plus any spaces after it), delete everything from the where onward (plus any spaces before it), if the line is empty, delete it; otherwise print it. Note that adding an n command to the second line makes the script misbehave (I need to spend time working out why), but the delete operation can be added to the first command line without doing any harm (if a line contains from where, nothing is printed).

Note that there are many SELECT statements that would be mishandled by this code.

For example:

SELECT *
  FROM Table1 AS T1
  JOIN (SELECT T2.A, T3.B
          FROM Table2 AS T2
          JOIN Table3 AS T3 ON T2.PK = T3.FK
         WHERE T2.ColumnN > T3.ColumnM
       ) AS T4
    ON T1.A = T4.B
 WHERE T1.DateOfBirth > DATE(2000-01-01)

Quite apart from the upper-case keywords, the WHERE in the sub-query would be where the matching between FROM and WHERE stopped.


** In case you're curious about the cause of hair loss, look at Why does an n instead of a b or d or nothing change the behaviour of sed in this script?.



回答2:

The caveat noted in Jonathan Leffler's answer applies: won't work with nested SQL statements.

Here's a pragmatic solution that combines tr and sed:

With GNU Sed:

tr -s ' ' '\n' < test.txt | sed -n '/^from$/I,/^where$/I { s///; t; p; }'

With BSD Sed (also used on OSX; this is a POSIX-compliant solution that will also work with GNU Sed) - note that using I for case-insensitive matching is, unfortunately, not supported in BSD Sed, so the following only matches all-lowercase from and where:

tr -s ' ' '\n' < test.txt | sed -n -e '/^from$/,/^where$/ { s///; t' -e 'p; }'
  • tr -s ' ' '\n' effectively splits the input into individual tokens, each on a separate line.
  • The sed command then extracts the table names:
    • Note that the only difference between the GNU and the BSD command is that BSD Sed requires a newline after t, which in this case is implicitly provided by providing the rest of the script as a separate -e option:
    • /^from$/,/^where$/ matches ranges of lines, including the from and where lines.
    • s///; t is a trick that skips the from and where lines, effectively only printing (p) what's between them:
      • s/// is a dummy substitution:
        • Not specifying a regex inside // implies that the same regex that matched the current line is reused.
        • The result of the substitution is irrelevant - the only point is to perform any substitution, which will only happen for the endpoints of the range.
      • t then branches - in the absence of a target label name - to the end of the script, IF a substitution took place - which will only be the case for the endpoints of the range, effectively skipping them.
      • p - i.e., printing the current line - is then only performed for the lines between the from and where lines.

Caveat: If there are multiple whitespace-separated tokens between from and where, they are each output on a separate line.



回答3:

The caveat noted in Jonathan Leffler's answer applies: won't work with nested SQL statements.

If you're using GNU grep (as found on Linux), try the following:

tr -s '\n' ' ' < test.txt | grep -Pio '(?<= from ).*?(?= where )'
  • tr -s '\n' ' ' replaces runs of newlines with a space each, resulting in a single line.
    • (A side effect of this is that if tokens inside a from/where pair span multiple lines, they are output as a single-line, space-separated list.)
  • grep command:
    • -P activates support for PCREs (Perl-Compatible Regular Expressions), which provide advanced features such as look-ahead and look-behind assertions.
    • -o causes Grep to only output the matching part(s) of each line, and i performs case-insensitive matching.
    • (?<= from ) uses a look-behind assertion to match from without including it in the match
    • (?<= where ) uses a look-ahead assertion to match where without including it in the match
    • .*? non-greedily matches any sequence of characters; the non-greedy modifier ? is required to prevent .* from matching through the very last occurrence of where on the input line.

BSD grep (as also used on OSX), doesn't support -P, so an extra extraction step is needed to remove the from and where keyword from the matches, using awk:

tr -s '\n' ' ' < test.txt | grep -Eio ' from .*? where ' | 
  awk -F ' from | where ' '{ print $2 }'


回答4:

Not pretty, but working (on the same line):

grep "from.*where" test.txt | awk '{ print $2 }'

For universal solution, I'd use python, because grep doesn't work with multilines.



标签: shell unix