Can awk patterns match multiple lines?

2019-01-05 17:38发布

站内文章 / Linux

10 0

我命由我不由天

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I have some complex log files that I need to write some tools to process them. I have been playing with awk but I am not sure if awk is the right tool for this.

My log files are print outs of OSPF protocol decodes which contain a text log of the various protocol pkts and their contents with their various protocol fields identified with their values. I want to process these files and print out only certain lines of the log that pertain to specific pkts. Each pkt log can consist of a varying number of lines for that pkt's entry.

awk seems to be able to process a single line that matches a pattern. I can locate the desired pkt but then I need to match patterns in the lines that follow in order to determine if it is a pkt I want to print out.

Another way to look at this is that I would want to isolate several lines in the log file and print out those lines that are the details of a particular pkt based on pattern matches on several lines.

Since awk seems to be line-based, I am not sure if that would be the best tool to use.

If awk can do this, how it is done? If not, any suggestions on which tool to use for this?

回答1:

Awk can easily detect multi-line combinations of patterns, but you need to create what is called a state machine in your code to recognize the sequence.

Consider this input:

how
second half #1
now
first half
second half #2
brown
second half #3
cow

As you have seen, it's easy to recognize a single pattern. Now, we can write an awk program that recognizes second half only when it is directly preceded by a first half line. (With a more sophisticated state machine you could detect an arbitrary sequence of patterns.)

/second half/ {
  if(lastLine == "first half") {
    print
  }
}

{ lastLine = $0 }

If you run this you will see:

second half #2

Now, this example is absurdly simple and only barely a state machine. The interesting state lasts only for the duration of the if statement and the preceding state is implicit, depending on the value of lastLine. In a more canonical state machine you would keep an explicit state variable and transition from state-to-state depending on both the existing state and the current input. But you may not need that much control mechanism.

回答2:

Awk is really record-based. By default it thinks of a line as a record, but you can alter that with the RS (record separator) variable.

One way to approach this would be to do a first pass using sed (you could do this with awk, too, if you prefer), to separate the records with a different character like a form-feed. Then you can write your awk script where it will treat the group of lines as a single record.

For example, if this is your data:

animal 0
name: joe
type: dog
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat

To separate the records with form-feeds:

$ cat data | sed $'s|^\(animal.*\)|\f\\1|'

Now we'll take that and pass it through awk. Here's an example of conditionally printing a record:

$ cat data | sed $'s|^\(animal.*\)|\f\\1|' | awk '
      BEGIN { RS="\f" }                                     
      /type: cat/ { print }'

outputs:

animal 1
name: bill
type: cat

animal 2
name: ed
type: cat

Edit: as a bonus, here's how to do it with awk-ward ruby (-014 means use form-feed (octal code 014) as the record separator):

$ cat data | sed $'s|^\(animal.*\)|\f\\1|' |
      ruby -014 -ne 'print if /type: cat/'

回答3:

awk is able to process from start pattern until end pattern

/start-pattern/,/end-pattern/ {
  print
}

I was looking for how to match

 * Implements hook_entity_info_alter().
 */
function file_test_entity_type_alter(&$entity_types) {

so created

/\* Implements hook_/,/function / {
  print
}

which the content I needed. A more complex example is to skip lines and scrub off non-space parts. Note awk is a record(line) and word(split by space) tool.

# start,end pattern match using comma
/ \* Implements hook_(.*?)\./,/function (.\S*?)/ {
  # skip PHP multi line comment end
  $0 ~ / \*\// skip

  # Only print 3rd word
  if ($0 ~ /Implements/) {
    hook=$3
    # scrub of opening parenthesis and following.
    sub(/\(.*$/, "", hook)
    print hook
  }

  # Only print function name without parenthesis
  if ($0 ~ /function/) {
    name=$2

    # scrub of opening parenthesis and following.
    sub(/\(.*$/, "", name)

    print name
    print ""
  }
}

Hope this helps too.

See also ftp://ftp.gnu.org/old-gnu/Manuals/gawk-3.0.3/html_chapter/gawk_toc.html

回答4:

I do this sort of thing with sendmail logs, from time to time.

Given:

Jan 15 22:34:39 mail sm-mta[36383]: r0B8xkuT048547: to=<www@web3>, delay=4+18:34:53, xdelay=00:00:00, mailer=esmtp, pri=21092363, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:39 mail sm-mta[36383]: r0B8hpoV047895: to=<www@web3>, delay=4+18:49:22, xdelay=00:00:00, mailer=esmtp, pri=21092556, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:51 mail sm-mta[36719]: r0G3Youh036719: from=<obfTaIX3@nickhearn.com>, size=0, class=0, nrcpts=0, proto=ESMTP, daemon=IPv4, relay=[50.71.152.178]
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: lost input channel from [190.107.98.82] to IPv4 after rcpt
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: from=<amahrroc@europe.com>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=[190.107.98.82]
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<clunch.hilarymas@javagame.ru>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)

I use a script something like this:

#!/usr/bin/awk -f

BEGIN {
  search=ARGV[1];  # Grab the first command line option
  delete ARGV[1];  # Delete it so it won't be considered a file
}

# First, store every line in an array keyed on the Queue ID.
# Obviously, this only works for smallish log segments, as it uses up memory.
{
  line[$6]=sprintf("%s\n%s", line[$6], $0);
}

# Next, keep a record of Queue IDs with substrings that match our search string.
index($0, search) {
  show[$6];
}

# Finally, once we've processed all input data, walk through our array of "found"
# Queue IDs, and print the corresponding records from the storage array.
END {
  for(qid in show) {
    print line[qid];
  }
}

to get the following output:

$ mqsearch airtel /var/log/maillog

Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<clunch.hilarymas@javagame.ru>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)

The idea here is that I'm printing all lines that match the Sendmail Queue ID of the string I want to search for. The structure of the code is of course a product of the structure of the log file, so you'll need to customize your solution for the data you're trying to analyse and extract.

回答5:

`pcregrep -M` works pretty well for this.

From pcregrep(1):

-M, --multiline

Allow patterns to match more than one line. When this option is given, patterns may usefully contain literal newline characters and internal occurrences of ^ and $ characters. The output for a successful match may consist of more than one line, the last of which is the one in which the match ended. If the matched string ends with a newline sequence the output ends at the end of that line.

When this option is set, the PCRE library is called in “multiline” mode. There is a limit to the number of lines that can be matched, imposed by the way that pcregrep buffers the input file as it scans it. However, pcregrep ensures that at least 8K characters or the rest of the document (whichever is the shorter) are available for forward matching, and similarly the previous 8K characters (or all the previous characters, if fewer than 8K) are guaranteed to be available for lookbehind assertions. This option does not work when input is read line by line (see --line-buffered.)

标签：