Regexp in Grok sometimes catches a value, sometimes not

Posted 2019-09-11 11:00

Question:

I have a grok filter that inspects messages and tags those that meet a given criterion.

My problem is that this filter sometimes works in testing and sometimes does not. The regexp in question is the following:

^(?!(?:\d\d\d\d-\d\d-\d\d.\d\d:\d\d:\d\d)).*$

This line checks if the given message does not begin with a given time stamp format. In other words: if the given message does not begin with this time stamp, then it gets a tag.
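As a rough illustration of what the negative lookahead does, here is a minimal Python sketch that applies the same pattern to a handful of sample lines. Python's `re` module is not the regex engine that grok uses, so this only shows the intended logic; the helper name `lines_without_timestamp` is made up for this example.

```python
import re

# The pattern from the question, unchanged.
NO_TIMESTAMP = re.compile(r"^(?!(?:\d\d\d\d-\d\d-\d\d.\d\d:\d\d:\d\d)).*$")

def lines_without_timestamp(lines):
    """Return the lines that do NOT start with a yyyy-MM-dd HH:mm:ss timestamp."""
    return [line for line in lines if NO_TIMESTAMP.match(line)]

sample = ["test", "2016-09-23 18:26:49,714", "2016-09-23 18:26:40,244", "test"]
print(lines_without_timestamp(sample))  # ['test', 'test']
```

With clean input, only the lines that lack a leading timestamp come back, which is the behaviour I expect from the filter.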

You can test it yourself with this online application: http://grokconstructor.appspot.com/do/match#result

For these test values, the regexp captures all messages that meet the criterion, so the two lines with "test" are highlighted in green:

test
2016-09-23 18:26:49,714
2016-09-23 18:26:40,244
test

However, it also captures the first date when the input is something like this:

2016-09-23 18:26:49,714
2016-09-23 18:26:40,244
test

I would like to understand the reason behind this behaviour and how I can prevent it.

Answer 1:

Why not just match the timestamp in a sensible way? You can match multiple date formats like this:

date {
  match => [ "log_timestamp", "dd/MMM/YYYY HH:mm:ss", "dd/MMM/YYYY HH:mm:ss.SSS" ]
  timezone => "Etc/UTC"
  locale => "en-US"
}

This would match 23/SEP/2016 15:15:00 or 23/SEP/2016 15:15:00.123 (we made a change when we were versioning)

So long as it wouldn't appear elsewhere in the line this should pretty much cover you.
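The idea of trying several formats in order can be sketched in Python as follows. This is only an analogy to the date filter, not how Logstash implements it: the `strptime` format strings are my approximate translation of the two patterns above, the function name `parse_log_timestamp` is hypothetical, and month-name parsing assumes an English locale.

```python
from datetime import datetime, timezone

# Approximate strptime equivalents of the two patterns in the date filter.
FORMATS = ["%d/%b/%Y %H:%M:%S", "%d/%b/%Y %H:%M:%S.%f"]

def parse_log_timestamp(text):
    """Return a UTC datetime for the first format that fits, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # this format did not fit, try the next one
    return None
```

A value that fits neither format, such as a plain "test" line, simply yields `None`, which is where a tag could be applied.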



Answer 2:

I think I know what causes the behavior in the online tester, although I'm not sure why it happens or what pattern it follows exactly. (I'm familiar with regex, but nothing else here. Feel free to shed some light on this in the comments, if you know more.)

To replicate, put the following in the box labelled "Some log lines you want to match":

2016-09-23 18:26:49,714
2016-09-23 18:26:40,244
test

In the place where it says "The pattern that should match all logfile lines", put your regex:

^(?!(?:\d\d\d\d-\d\d-\d\d.\d\d:\d\d:\d\d)).*$

Don't mess with the checkboxes (I'm not sure what they do, but they should all be checked).

Hit Go! and you get this result, as revo mentioned in the comments:

To get the other result, set up things the exact same way (if you just submitted the regex it should still be set up), but add that same regex to the area that says "If you want to use logstash's multiline filter please specify the used pattern".

Hit Go! and you get this result:

The simple way to avoid this is not to use Logstash's multiline filter. (At least that's what I would assume.)



Answer 3:

It turned out that some messages began with a BOM (byte order mark), which I could capture with the following regexp in Grok:

^(?:\xEF\xBB\xBF).*$

I could keep this mark on the clipboard, but it looks like Stack Overflow strips it out, which is why my example didn't work for everyone.
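To make the effect of the BOM concrete, here is a small Python sketch working on raw bytes. It assumes the input is UTF-8 and that the BOM sits in front of the first timestamp; `is_tagged` and its `strip_bom` parameter are names invented for this illustration, and Python's `re` stands in for grok's regex engine.

```python
import re

BOM = b"\xef\xbb\xbf"  # UTF-8 byte order mark

# Bytes version of the pattern from the question.
NO_TIMESTAMP = re.compile(rb"^(?!(?:\d\d\d\d-\d\d-\d\d.\d\d:\d\d:\d\d)).*$")

def is_tagged(raw_line, strip_bom=False):
    """True if the line would be treated as 'does not start with a timestamp'."""
    if strip_bom and raw_line.startswith(BOM):
        raw_line = raw_line[len(BOM):]
    return NO_TIMESTAMP.match(raw_line) is not None

first_line = BOM + b"2016-09-23 18:26:49,714"
print(is_tagged(first_line))                  # True: the BOM hides the timestamp
print(is_tagged(first_line, strip_bom=True))  # False: the timestamp is seen again
```

The line looks like a timestamp on screen, but the three invisible bytes in front of it make the negative lookahead succeed, so the line gets tagged. Removing the BOM before matching avoids that.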