I'm trying to extract the tag from the From: header of a SIP message.
My regex: ^(From:|f:)((?!\\n\\w).)*;[ ]*tag[ ]*=[ ]*([[:alnum:]]*)
RFC 3261 allows multi-line headers; each continuation line must start with whitespace. My regex fails on such headers: if the tag is on a continuation line, it does not match.
Example of a valid SIP message:
INVITE sip:13@10.10.1.13 SIP/2.0
Via: SIP/2.0/UDP 10.10.1.99:5060;branch=z9hG4bK343bf628;rport
Contact: <sip:15@10.10.1.99>
Call-ID: 326371826c80e17e6cf6c29861eb2933@10.10.1.99
CSeq: 102 INVITE
User-Agent: Asterisk PBX
Max-Forwards: 70
Date: Wed, 06 Dec 2009 14:12:45 GMT
Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY
Supported: replaces
Content-Type: application/sdp
Content-Length: 258
From: "Test 15" <sip:15@10.10.1.99>
; tag = fromtag
To: <sip:13@10.10.1.13>;tag=totag
v=0
o=root 1821 1821 IN IP4 10.10.1.99
s=session
c=IN IP4 10.10.1.99
t=0 0
m=audio 11424 RTP/AVP 0 8 101
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=silenceSupp:off - - - -
a=ptime:20
a=sendrecv
How can I properly parse the multi-line headers? Thanks in advance.
I'd second the motion to use/generate a proper parser.
There's nothing stopping you from parsing the headers in a separate step, but you can still specify the grammar declaratively, which is the main point.
The best parts here are indeed
- the declarative style, which makes it easier to extend with more grammar (the surrounding bits, or more details like disallowing CTL characters)
- the "free" debugging tools (#define BOOST_SPIRIT_DEBUG, done)
Here's a simple take on the multiline header syntax:
RFC 2616
Header fields can be extended over multiple lines by preceding each extra line with at least one SP or HT.
RFC 822
field       =  field-name ":" [ field-body ] CRLF
field-name  =  1*<any CHAR, excluding CTLs, SPACE, and ":">
field-body  =  field-body-contents
               [CRLF LWSP-char field-body]
field-body-contents =
               <the ASCII characters making up the field-body, as
                defined in the following sections, and consisting
                of combinations of atom, quoted-string, and
                specials tokens, or else consisting of texts>
So without further ado, here's a simple grammar for roughly that, parsing from any range of input iterators into a std::map:
using Headers = std::map<std::string, std::string>;
Here's the core of the parser:
auto& crlf = "\r\n";
auto& tspecials = " \t><@,;:\\\"/][?=}{:";
rule<It, std::string()> token, value;
token = +~char_(tspecials); // FIXME? should filter CTLs
value = *(char_ - (crlf >> &(~blank | eoi)));
Headers headers;
bool ok = phrase_parse(first, last, (token >> ':' >> value) % crlf >> omit[*lit(crlf)], blank, headers);
#ifdef DEBUG
if (ok) std::cerr << "DEBUG: Parse success\n";
else std::cerr << "DEBUG: Parse failed\n";
if (first!=last) std::cerr << "DEBUG: Remaining unparsed input: '" << std::string(first,last) << "'\n";
#endif
You can see a live demo parsing the sample headers from your question:
Live On Coliru
Printing:
Key: 'Allow', Value: 'INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY'
Key: 'CSeq', Value: '102 INVITE'
Key: 'Call-ID', Value: '326371826c80e17e6cf6c29861eb2933@10.10.1.99'
Key: 'Contact', Value: '<sip:15@10.10.1.99>'
Key: 'Content-Length', Value: '258'
Key: 'Content-Type', Value: 'application/sdp'
Key: 'Date', Value: 'Wed, 06 Dec 2009 14:12:45 GMT'
Key: 'From', Value: '"Test 15" <sip:15@10.10.1.99>
; tag = fromtag'
Key: 'Max-Forwards', Value: '70'
Key: 'Supported', Value: 'replaces'
Key: 'To', Value: '<sip:13@10.10.1.13>;tag=totag'
Key: 'User-Agent', Value: 'Asterisk PBX'
Key: 'Via', Value: 'SIP/2.0/UDP 10.10.1.99:5060;branch=z9hG4bK343bf628;rport'
Note that the \r\n combo is kept as-is in the value for the From header. If you want to normalize that to some other LWS character, such as a simple ' ', use e.g.
value = *(omit[ crlf >> !(~blank | eoi) ] >> attr(' ') | (char_ - crlf));
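Once the From value has been normalized to a single line like that, pulling out the tag no longer needs any multi-line handling. The following is only a sketch using std::regex (C++17 for std::optional); extract_tag is a hypothetical helper name, and it assumes the tag consists of alphanumerics only, as in the question's regex. It also does not guard against a ";tag=" sequence appearing inside a quoted display name.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper: return the value of the ";tag=..." parameter from an
// already unfolded header value, or std::nullopt if there is none.
// Assumption: the tag is made of alphanumeric characters only.
std::optional<std::string> extract_tag(const std::string& value) {
    // ';' then optional blanks, "tag" (case-insensitive), '=', optional blanks, the tag
    static const std::regex re(R"(;[ \t]*tag[ \t]*=[ \t]*([[:alnum:]]+))",
                               std::regex::icase);
    std::smatch m;
    if (std::regex_search(value, m, re))
        return m[1].str();
    return std::nullopt;
}
```

For example, calling extract_tag on the normalized From value from the sample message should yield fromtag, and on the To value totag.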
A continuation line starts with a space or a horizontal tab, so depending on your regex engine you can match a line continuation with \r\n[ \t]. That said, your regex might become quite complex if it has to test for that wherever linear whitespace can occur; you might be better off with a custom parser that breaks up the header lines and tests for what you need.
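A rough sketch of such a custom parser, using only the standard library (C++17): first unfold the continuation lines, then split on line ends and look for the header by name. The helper names unfold, iequals and header_value are made up for illustration, and the code assumes lines end in CRLF and that a blank line ends the header section. Compact forms such as f: are not mapped automatically; the caller would have to ask for "f" as well.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: replace every CRLF that is followed by SP or HT
// (a line continuation), plus the leading blanks, with a single space.
std::string unfold(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i) {
        const bool folded = raw.compare(i, 2, "\r\n") == 0 && i + 2 < raw.size() &&
                            (raw[i + 2] == ' ' || raw[i + 2] == '\t');
        if (!folded) { out += raw[i]; continue; }
        out += ' ';
        i += 2;  // now on the first blank of the continuation line
        while (i + 1 < raw.size() && (raw[i + 1] == ' ' || raw[i + 1] == '\t')) ++i;
    }
    return out;
}

// Case-insensitive comparison for header names.
bool iequals(const std::string& a, const std::string& b) {
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(), [](unsigned char x, unsigned char y) {
               return std::tolower(x) == std::tolower(y);
           });
}

// Hypothetical helper: value of the first header called `name`, unfolded and
// stripped of leading blanks, or std::nullopt if the header is absent.
std::optional<std::string> header_value(const std::string& message, const std::string& name) {
    std::istringstream lines(unfold(message));
    std::string line;
    while (std::getline(lines, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) break;  // blank line: end of headers, body follows
        const auto colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::string key = line.substr(0, colon);
        while (!key.empty() && (key.back() == ' ' || key.back() == '\t')) key.pop_back();
        if (!iequals(key, name)) continue;
        const auto start = line.find_first_not_of(" \t", colon + 1);
        return start == std::string::npos ? std::string() : line.substr(start);
    }
    return std::nullopt;
}
```

With the header value on a single line, a simple single-line pattern for the ;tag= parameter is enough, and the continuation handling stays out of the regex entirely.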
Please try with the following pattern:
/(?i)(f(?:rom):(?:(?!^[^\r?\n]+)[\S\s])*((?:\s;\s*([^=]+)\s*=\s*([^\r?\n]+))))/g
Working demo at RegEx101