I have a file test.txt
containing below code.
select * from emp where empid=1;
select *
from dep
where jkdsfj
select *
from sal
where jkdsfj
I need to extract the content between "from" and "where".
Note: If "where" is on a new line it must still select the material between "from" and "where".
Output should be like:
emp
dep
sal
How can I do it?
Given the need to handle multiple lines, you can choose sed
, or awk
, or one of the more complex scripting languages like Perl or Python.
With a bit of care, sed
is adequate. I created a file script.4
(having created script
, script2
, and losing most of what little hair was left on my head**, and restarting with script.1
, script.2
and script.3
, which were deliberately incomplete) like this:
/from.*where/ { s/.*from *//; s/ *where.*//; p; n; }
/from/,/where/ { s/.*from *//; s/ *where.*//; /^ *$/d; p; }
And I created a test file, data
, like this:
select * from emp where empid=1;
select *
from dep
where jkdsfj
select *
from sal
where jkdsfj
select elephants
from abject poverty
join flying tigers
where abelone = shellfish;
select mouse
from toolset
join animals where tail = cord
and buttons = legs
and ran the command like this, to get the output shown:
$ sed -n -f script.4 data
emp
dep
sal
abject poverty
join flying tigers
toolset
join animals
$
The script is 'simple'. For lines which contain both from
and where
, delete everything up to the from
(plus any spaces after it), delete everything from the where
onward (plus any spaces before it), print what's left, and go to the next line of input.
Otherwise, between a line which contains from
and a line that contains where
,
delete everything up to the from
(plus any spaces after it), delete everything from the where
onward (plus any spaces before it), if the line is empty, delete it; otherwise print it. Note that adding an n
command to the second line makes the script misbehave (I need to spend time working out why), but the delete operation can be added to the first command line without doing any harm (if a line contains from where
, nothing is printed).
Note that there are many SELECT statements that would be mishandled by this code.
For example:
SELECT *
FROM Table1 AS T1
JOIN (SELECT T2.A, T3.B
FROM Table2 AS T2
JOIN Table3 AS T3 ON T2.PK = T3.FK
WHERE T2.ColumnN > T3.ColumnM
) AS T4
ON T1.A = T4.B
WHERE T1.DateOfBirth > DATE(2000-01-01)
Quite apart from the upper-case keywords, the WHERE in the sub-query would be where the matching between FROM and WHERE stopped.
** In case you're curious about the cause of hair loss, look at Why does an n
instead of a b
or d
or nothing change the behaviour of sed
in this script?.
The caveat noted in Jonathan Leffler's answer applies: won't work with nested SQL statements.
Here's a pragmatic solution that combines tr
and sed
:
With GNU Sed:
tr -s ' ' '\n' < test.txt | sed -n '/^from$/I,/^where$/I { s///; t; p; }'
With BSD Sed (also used on OSX; this is a POSIX-compliant solution that will also work with GNU Sed) - note that using I
for case-insensitive matching is, unfortunately, not supported in BSD Sed, so the following only matches all-lowercase from
and where
:
tr -s ' ' '\n' < test.txt | sed -n -e '/^from$/,/^where$/ { s///; t' -e 'p; }'
tr -s ' ' '\n'
effectively splits the input into individual tokens, each on a separate line.
- The
sed
command then extracts the table names:
- Note that the only difference between the GNU and the BSD command is that BSD Sed requires a newline after
t
, which in this case is implicitly provided by providing the rest of the script as a separate -e
option:
/^from$/,/^where$/
matches ranges of lines, including the from
and where
lines.
s///; t
is a trick that skips the from
and where
lines, effectively only printing (p
) what's between them:
s///
is a dummy substitution:
- Not specifying a regex inside
//
implies that the same regex that matched the current line is reused.
- The result of the substitution is irrelevant - the only point is to perform any substitution, which will only happen for the endpoints of the range.
t
then branches - in the absence of a target label name - to the end of the script, IF a substitution took place - which will only be the case for the endpoints of the range, effectively skipping them.
p
- i.e., printing the current line - is then only performed for the lines between the from
and where
lines.
Caveat: If there are multiple whitespace-separated tokens between from
and where
, they are each output on a separate line.
The caveat noted in Jonathan Leffler's answer applies: won't work with nested SQL statements.
If you're using GNU grep
(as found on Linux), try the following:
tr -s '\n' ' ' < test.txt | grep -Pio '(?<= from ).*?(?= where )'
tr -s '\n' ' '
replaces runs of newlines with a space each, resulting in a single line.
- (A side effect of this is that if tokens inside a
from
/where
pair span multiple lines, they are output as a single-line, space-separated list.)
grep
command:
-P
activates support for PCREs (Perl-Compatible Regular Expressions), which provide advanced features such as look-ahead and look-behind assertions.
-o
causes Grep to only output the matching part(s) of each line, and i
performs case-insensitive matching.
(?<= from )
uses a look-behind assertion to match from
without including it in the match
(?<= where )
uses a look-ahead assertion to match where
without including it in the match
.*?
non-greedily matches any sequence of characters; the non-greedy modifier ?
is required to prevent .*
from matching through the very last occurrence of where
on the input line.
BSD grep
(as also used on OSX), doesn't support -P
, so an extra extraction step is needed to remove the from
and where
keyword from the matches, using awk
:
tr -s '\n' ' ' < test.txt | grep -Eio ' from .*? where ' |
awk -F ' from | where ' '{ print $2 }'
Not pretty, but working (on the same line):
grep "from.*where" test.txt | awk '{ print $2 }'
For universal solution, I'd use python, because grep doesn't work with multilines.