Question:

I need to parse a log file that contains FIX protocol messages.

Each line contains header information (timestamp, logging level, endpoint), followed by a FIX payload.

I've used a regex to parse the header information into named groups, e.g.:

 (?P<datetime>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) (?P<process_id>\d{4}/\d{1,2})\s*(?P<logging_level>\w*)\s*(?P<endpoint>\w*)\s*
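
For reference, here is a minimal Python sketch of how that header pattern could be applied to one log line, leaving the FIX payload as the remainder. The sample line, the `split_log_line` helper and the field values are hypothetical; the real log layout may differ.

```python
import re

# Header pattern from the question, compiled once.
HEADER_RE = re.compile(
    r"(?P<datetime>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r"(?P<process_id>\d{4}/\d{1,2})\s*"
    r"(?P<logging_level>\w*)\s*"
    r"(?P<endpoint>\w*)\s*"
)

def split_log_line(line):
    """Return (header_dict, payload), or None if the header does not match."""
    m = HEADER_RE.match(line)
    if m is None:
        return None
    # Everything after the matched header is treated as the FIX payload.
    return m.groupdict(), line[m.end():]

# Hypothetical sample line; "\x01" is the separator shown as ^A.
sample = "01/02/13 10:11:12.123456 1234/5 INFO gateway1 8=FIX.4.2\x019=61\x0135=A\x01"
header, payload = split_log_line(sample)
```

With this split, the payload can be handed to whatever tag extraction you choose without the header getting in the way.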

I then come to the FIX payload itself (^A is the separator between tags), e.g.:

8=FIX.4.2^A9=61^A35=A...^A11=blahblah...

I need to extract specific tags from this (e.g. "A" from 35=, or "blahblah" from 11=) and ignore everything else. Basically, I need to skip anything before "35=A", then anything between that and "11=blahblah", then anything after that, and so on.

I do know there are libraries that might be able to parse each and every tag (http://source.kentyde.com/fixlib/overview); however, I was hoping for a simple approach using regex here if possible, since I really only need a couple of tags.

Is there a good way in regex to extract the tags I require?

Cheers, Victor

Answer 1:

Use a regex tool like Expresso or RegexBuddy.
Why don't you split on ^A and then match ([^=]+)=(.*) against each piece, putting the results into a hash? You could also filter with a switch that by default doesn't add the tags you're not interested in and that falls through for all the tags you are interested in.
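
A rough Python sketch of that split-and-filter idea, using `str.partition` in place of the per-field regex. The `extract_tags` helper and the sample payload are made up for illustration.

```python
SOH = "\x01"  # the separator that shows up as ^A

def extract_tags(payload, wanted):
    """Split the payload on SOH and keep only the tags listed in `wanted`."""
    fields = {}
    for pair in payload.split(SOH):
        # partition splits on the first "=" only, so values may contain "=".
        tag, sep, value = pair.partition("=")
        if sep and tag in wanted:
            fields[tag] = value
    return fields

# Hypothetical payload, modelled on the one in the question.
payload = "8=FIX.4.2\x019=61\x0135=A\x0111=blahblah\x01"
result = extract_tags(payload, {"35", "11"})
# result == {"35": "A", "11": "blahblah"}
```

Note that a tag that repeats (for example inside a repeating group) would overwrite the earlier value in this dict.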

Answer 2:

There's no need to split on "\x01", then regex, then filter. If you want just tags 34, 49 and 56 (MsgSeqNum, SenderCompID and TargetCompID), you could use:

dict(re.findall("(?:^|\x01)(34|49|56)=(.*?)\x01", raw_msg))

Simple regexes like this will work if you know your sender does not send embedded data that can contain the separator and so break any simple regex. Specifically:

No raw data fields, i.e. a data-length field paired with a data field, such as RawDataLength/RawData (95/96) or XmlDataLen/XmlData (212/213).
No encoded fields for Unicode strings, such as EncodedTextLen/EncodedText (354/355).

Handling those cases takes a lot of additional parsing. I use a custom Python parser, but even the fixlib code you referenced above gets these cases wrong. If your data is clear of these exceptions, though, the regex above should return a nice dict of your desired fields.
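
To give an idea of what that extra parsing involves, here is a minimal sketch of a length-aware tokenizer. It is not the custom parser mentioned above, only an illustration that assumes the three length/data pairs listed and works on bytes, since the declared lengths are byte counts. The names `parse_fix` and `LENGTH_TO_DATA` and the sample message are hypothetical.

```python
SOH = b"\x01"
# Length tag -> data tag, for the pairs named above (assumed complete here).
LENGTH_TO_DATA = {b"95": b"96", b"212": b"213", b"354": b"355"}

def parse_fix(msg):
    """Return a list of (tag, value) pairs, honouring length-prefixed fields."""
    fields = []
    pending = {}  # data tag -> number of bytes announced by its length field
    pos = 0
    while pos < len(msg):
        eq = msg.index(b"=", pos)
        tag = msg[pos:eq]
        start = eq + 1
        if tag in pending:
            # Data field: take exactly the declared number of bytes,
            # even if they contain SOH or "=".
            end = start + pending.pop(tag)
        else:
            end = msg.index(SOH, start)
        if msg[end:end + 1] != SOH:
            raise ValueError("field %r is not terminated by SOH" % tag)
        value = msg[start:end]
        fields.append((tag, value))
        if tag in LENGTH_TO_DATA:
            pending[LENGTH_TO_DATA[tag]] = int(value)
        pos = end + 1
    return fields

# Hypothetical message whose RawData (96) contains both SOH and "=".
msg = b"8=FIX.4.2\x0135=B\x0195=5\x0196=a\x01b=c\x0110=000\x01"
parsed = dict(parse_fix(msg))
```

A truncated or malformed message makes `bytes.index` raise `ValueError`, which is probably what you want for a log scanner; a production parser would also need checksum and body-length validation.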

Edit: I've left the re.findall regex above as-is, but it should be revised so that the final element of the match is the lookahead (?=\x01) rather than a consumed \x01. The explanation can be found in @tropleee's answer here.
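
A small comparison shows why the lookahead matters. When two wanted tags are adjacent, the original pattern consumes the separator that the next match needs as its prefix, so the second tag is skipped. The sample message below is made up.

```python
import re

# Hypothetical message with the three wanted tags next to each other.
raw_msg = "8=FIX.4.2\x0134=7\x0149=SENDER\x0156=TARGET\x0110=000\x01"

# Original pattern: the trailing \x01 is consumed by each match.
consuming = dict(re.findall("(?:^|\x01)(34|49|56)=(.*?)\x01", raw_msg))

# Revised pattern: the trailing \x01 is only asserted, not consumed,
# so it is still available as the prefix of the next match.
lookahead = dict(re.findall("(?:^|\x01)(34|49|56)=(.*?)(?=\x01)", raw_msg))

# consuming == {"34": "7", "56": "TARGET"}   (tag 49 is missed)
# lookahead == {"34": "7", "49": "SENDER", "56": "TARGET"}
```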

Answer 3:

^A is actually \x{01}; that's just how it shows up in vim. In Perl, I did this with a split on hex 01 and then a split on "=" for each field; after the second split, element [0] of the array is the tag and element [1] is the value.