I regularly use regex to transform text.
To transform giant text files from the command line, Perl lets me do this:
perl -pe 's/.../.../' < in.txt > out.txt
But this inherently works line by line. Occasionally, I want to match patterns that span multiple lines.
How can I do this on the command line?
To slurp a file instead of processing it line by line, use the -0777 switch:
perl -0777 -pe 's/.../.../g' in.txt > out.txt
As documented in perlrun #Command Switches:
The special value -00 will cause Perl to slurp files in paragraph mode. Any value -0400 or above will cause Perl to slurp files whole, but by convention the value -0777 is the one normally used for this purpose.
Obviously, this may not work well for large files, since the whole file is held in memory. In that case you'll need to code some kind of buffering to do the replacement. We can't advise any further without knowing more about your intent.
Grepping across line boundaries
So you want to grep across line boundaries...
You quite possibly already have pcregrep installed. As you may know, PCRE stands for Perl-Compatible Regular Expressions, and the library is definitely Perl-style, though not identical to Perl.
To match across multiple lines, you have to turn on the multi-line mode -M, which is not the same as (?m): the inline (?m) flag only changes where ^ and $ match.
Run
pcregrep -M "(?s)^b.*\d+" text.txt
on this text file:
a
b
c11
The output will be:
b
c11
whereas plain grep would return nothing.
Excerpt from the doc:
-M, --multiline
Allow patterns to match more than one line. When this option is given, patterns may usefully contain literal newline characters and internal occurrences of ^ and $ characters. The output for a successful match may consist of more than one line, the last of which is the one in which the match ended. If the matched string ends with a newline sequence the output ends at the end of that line. When this option is set, the PCRE library is called in "multiline" mode. There is a limit to the number of lines that can be matched, imposed by the way that pcregrep buffers the input file as it scans it. However, pcregrep ensures that at least 8K characters or the rest of the document (whichever is the shorter) are available for forward matching, and similarly the previous 8K characters (or all the previous characters, if fewer than 8K) are guaranteed to be available for lookbehind assertions. This option does not work when input is read line by line (see --line-buffered.)