I am working on HL7 messages and I need a regex. This doesn't work:
HL7 message=MSH|^~\&|DATACAPTOR|123|123|20100816171948|ORU^R01|081617194802900|P|2.3|8859/1
My regex is:
MSH|^~\&|DATACAPTOR|\d{3}|\d{3}|(\d{4}\d{2}\d{2}\d{2}\d{2}\d{2})|ORU\\^R01|\d{20}|P|2.3|8859/1
Can anybody suggest a regex for special characters? I am using this code:
strRegex = "\\vMSH|^~\\&|DATACAPTOR|\\d{3}|\\d{3}|(\\d{4}\\d{2}\\d{2}\\d{2}\\d{2}\\d{2})|ORU\\^R01|\\d{20}|P|2.3|8859/1";
Regex rx = new Regex(strRegex, RegexOptions.Compiled | RegexOptions.IgnoreCase );
|, ^, and \ are all special characters in regular expressions, so you'd have to escape them with \. Remember that \ is also an escape character within a regular string literal, so you'd have to escape that, too. It's generally a lot easier to use a verbatim string literal (@"…").

Finally, note that (\d{4}\d{2}\d{2}\d{2}\d{2}\d{2}) can be simplified to (\d{14}).

However, for a structure like this, it's probably easier to just use the Split method.

Warning: HL7 messages may use different control characters. The 4th character of the MSH segment is the field separator, and the characters that follow it define the remaining delimiters (in this case |^~\& are the control characters). It's best to parse the control characters first if you don't control your input and these control characters may change.

For me, your question describes two distinct problems.
Problem 1) "..I need a regex..this doesn't work..My regex is..anybody suggest a (better) regex..?"
This is the good part of your question.
As already pointed out by @p-s-w-g, some special characters in regular expressions must be escaped. The Microsoft Developer Network page "Character Escapes in Regular Expressions" tells you which characters are special and how to escape them.
To easily test whether your regex recognizes the grammar, you may find interactive regex testing tools useful, e.g. Regex Hero or The Regulator.
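To make the escaping advice concrete, here is a minimal sketch of the question's pattern with the special characters escaped, written once as a regular string literal and once as a verbatim one. The class and member names (MshRegexDemo, ExtractTimestamp) are made up for illustration. Two things in it are my own adjustments, not part of the original regex: the dot in 2.3 is escaped as well, and the control ID is matched with \d{15} because the sample value 081617194802900 has 15 digits, so \d{20} could never match it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MshRegexDemo
{
    // The sample header from the question, as a verbatim literal.
    public const string Sample =
        @"MSH|^~\&|DATACAPTOR|123|123|20100816171948|ORU^R01|081617194802900|P|2.3|8859/1";

    // Regular string literal: every regex backslash has to be doubled for C#.
    public const string EscapedRegular =
        "MSH\\|\\^~\\\\&\\|DATACAPTOR\\|\\d{3}\\|\\d{3}\\|(\\d{14})\\|ORU\\^R01\\|\\d{15}\\|P\\|2\\.3\\|8859/1";

    // Verbatim string literal: the same pattern, backslashes written once.
    public const string EscapedVerbatim =
        @"MSH\|\^~\\&\|DATACAPTOR\|\d{3}\|\d{3}\|(\d{14})\|ORU\^R01\|\d{15}\|P\|2\.3\|8859/1";

    // Returns the 14-digit timestamp captured by group 1, or an empty string on no match.
    public static string ExtractTimestamp(string message)
    {
        Match m = Regex.Match(message, EscapedVerbatim);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // Both literals describe the same regex.
        Console.WriteLine(EscapedRegular == EscapedVerbatim);

        // Prints the captured timestamp of the sample header.
        Console.WriteLine(ExtractTimestamp(Sample));

        // Regex.Escape builds an escaped pattern from literal text,
        // which avoids hand-escaping when the text is fixed.
        Console.WriteLine(Regex.IsMatch(Sample, "^" + Regex.Escape(Sample) + "$"));
    }
}
```

This only shows that the escaped pattern matches this one sample line; it says nothing about whether a regex is the right tool for HL7 in general.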
Problem 2) "I am working on HL7 messages..this doesn't work..My regex is..anybody suggest a (better) regex..?"
This is the bad part of your question.
The example shown in your question,
MSH|^~\&|DATACAPTOR|123|123|20100816171948|ORU^R01|081617194802900|P|2.3|8859/1
is already not a valid HL7 message fragment. It is something similar to HL7, but it was probably already damaged by some text pre-processing code. HL7 v2 messages are not transmitted using a text protocol that can be manipulated using text tools. The protocol is binary, but at the same time partially readable and thus controllable by humans without any special tools. It is still a binary protocol and must be processed as such. Regex is a tool for working with text strings, not binary strings. And although it may seem possible to outsmart some ancient 20-year-old protocol with a new-age regex one-liner, it is not a good approach. I have tried to explain why not in the comments on your question.
Basic decoding of the fragment is:
The "! missing !" pieces are really missing. In a normal MSH segment they should be there at their corresponding positions, just having a default empty value.

By reading Health Level Seven, Version 2.3.1 © 1999, Chapter 2.24.1 "MSH - message header segment", we can see that the message was created 4 years ago in 2010, probably by Capsule Tech, Inc.'s DataCaptor™, and formatted by rules defined by Health Level Seven, Version 2.3 © 1997, a standard that is 17 years old and has been updated several times. It was supposed to be used in one of the countries listed in Wikipedia: ISO/IEC 8859-1.
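For orientation only, the positional reading of the MSH segment can be sketched in a few lines of C#. This is a rough illustration and not a replacement for a real parser: it reads the field separator from the 4th character instead of assuming |, splits the rest on it, and labels each piece with the MSH-1 to MSH-12 field names. The names MshSketch and SplitMsh are hypothetical, and the code ignores repetitions, escape sequences and message framing entirely.

```csharp
using System;
using System.Collections.Generic;

public static class MshSketch
{
    // Names of MSH-1 .. MSH-12 from the HL7 v2.3 message header definition.
    static readonly string[] FieldNames =
    {
        "Field Separator", "Encoding Characters", "Sending Application",
        "Sending Facility", "Receiving Application", "Receiving Facility",
        "Date/Time Of Message", "Security", "Message Type",
        "Message Control ID", "Processing ID", "Version ID"
    };

    // Splits an MSH segment into fields, where index 0 holds MSH-1.
    public static List<string> SplitMsh(string segment)
    {
        // "MSH" + field separator + four encoding characters = at least 8 chars.
        if (segment == null || segment.Length < 8 ||
            !segment.StartsWith("MSH", StringComparison.Ordinal))
        {
            throw new FormatException("Not an MSH segment.");
        }

        // MSH-1 is the separator itself, taken from the message, not hard-coded.
        char fieldSeparator = segment[3];

        var fields = new List<string> { fieldSeparator.ToString() };
        // Everything after the separator, starting with MSH-2, is split on it.
        fields.AddRange(segment.Substring(4).Split(fieldSeparator));
        return fields;
    }

    public static void Main()
    {
        const string sample =
            @"MSH|^~\&|DATACAPTOR|123|123|20100816171948|ORU^R01|081617194802900|P|2.3|8859/1";

        List<string> fields = SplitMsh(sample);
        for (int i = 0; i < fields.Count; i++)
        {
            string name = i < FieldNames.Length ? FieldNames[i] : "(not named in this sketch)";
            Console.WriteLine($"MSH-{i + 1} {name}: {fields[i]}");
        }

        // The first encoding character is the component separator (^ here).
        char componentSeparator = fields[1][0];
        Console.WriteLine(string.Join(" / ", fields[6].Split(componentSeparator)));
    }
}
```

Read naively by position like this, the timestamp from the question's fragment ends up in the MSH-6 (Receiving Facility) slot and ORU^R01 in MSH-7, which is consistent with fields having been dropped somewhere before the text reached the regex.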
From your question I can't see more. But whatever you are trying to do, and whatever data you are going to process for whatever reason, the code fragment you are starting with is already wrong, and in general the HL7 regex parsing approach is strange. If you're working on serious software to be used anywhere in the healthcare industry, please consider writing or using a serious and tested parser, e.g. the one used by the NHapi library: http://sourceforge.net/p/nhapi/code/HEAD/tree/NHapi20/NHapi.Base/Parser/PipeParser.cs