Question:

I'm trying to extract the tag from the From: header of a SIP message.

My regex: ^(From:|f:)((?!\\n\\w).)*;[ ]*tag[ ]*=[ ]*([[:alnum:]]*)

RFC 3261 allows multi-line headers; each continuation line must start with whitespace. My regex fails on such headers: if the tag is on a continuation line, it does not match.

Example of a valid SIP message:

INVITE sip:13@10.10.1.13 SIP/2.0
Via: SIP/2.0/UDP 10.10.1.99:5060;branch=z9hG4bK343bf628;rport
Contact: <sip:15@10.10.1.99>
Call-ID: 326371826c80e17e6cf6c29861eb2933@10.10.1.99
CSeq: 102 INVITE
User-Agent: Asterisk PBX
Max-Forwards: 70
Date: Wed, 06 Dec 2009 14:12:45 GMT
Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY
Supported: replaces
Content-Type: application/sdp
Content-Length: 258
From: "Test 15" <sip:15@10.10.1.99>
 ; tag   =    fromtag
To: <sip:13@10.10.1.13>;tag=totag

v=0
o=root 1821 1821 IN IP4 10.10.1.99
s=session
c=IN IP4 10.10.1.99
t=0 0
m=audio 11424 RTP/AVP 0 8 101
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=silenceSupp:off - - - -
a=ptime:20
a=sendrecv

How can I properly parse the multi-line headers? Thanks in advance.

Answer 1:

I'd second the motion to use/generate a proper parser.

There's nothing stopping you from parsing the headers in a separate step, but you can still specify the grammar declaratively, which is the main point.

The best parts here are indeed:

- the declarative style, which makes it easier to extend with more grammar (the surrounding bits, or more details like disallowing CTL characters)
- the "free" debugging tools (#define BOOST_SPIRIT_DEBUG, done)

Here's a simple take on the multi-line header syntax:

RFC 2616:

Header fields can be extended over multiple lines by preceding each extra line with at least one SP or HT

RFC 822:

 field       =  field-name ":" [ field-body ] CRLF

 field-name  =  1*<any CHAR, excluding CTLs, SPACE, and ":">

 field-body  =  field-body-contents
                [CRLF LWSP-char field-body]

 field-body-contents =
               <the ASCII characters making up the field-body, as
                defined in the following sections, and consisting
                of combinations of atom, quoted-string, and
                specials tokens, or else consisting of texts>

So without further ado, here's a simple grammar for roughly that, parsing from any range of input iterators into a std::map:

using Headers = std::map<std::string, std::string>;

Here's the core of the parser:

    auto& crlf       = "\r\n";
    auto& tspecials = " \t><@,;:\\\"/][?=}{:";

    rule<It, std::string()> token, value;

    token = +~char_(tspecials); // FIXME? should filter CTLs
    value = *(char_ - (crlf >> &(~blank | eoi)));

    Headers headers;
    bool ok = phrase_parse(first, last, (token >> ':' >> value) % crlf >> omit[*lit(crlf)], blank, headers);

#ifdef DEBUG
    if (ok)          std::cerr << "DEBUG: Parse success\n";
    else             std::cerr << "DEBUG: Parse failed\n";
    if (first!=last) std::cerr << "DEBUG: Remaining unparsed input: '" << std::string(first,last) << "'\n";
#endif

You can see a live demo parsing the sample headers from your question:

Live On Coliru

Printing:

Key: 'Allow', Value: 'INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY'
Key: 'CSeq', Value: '102 INVITE'
Key: 'Call-ID', Value: '326371826c80e17e6cf6c29861eb2933@10.10.1.99'
Key: 'Contact', Value: '<sip:15@10.10.1.99>'
Key: 'Content-Length', Value: '258'
Key: 'Content-Type', Value: 'application/sdp'
Key: 'Date', Value: 'Wed, 06 Dec 2009 14:12:45 GMT'
Key: 'From', Value: '"Test 15" <sip:15@10.10.1.99>
; tag   =    fromtag'
Key: 'Max-Forwards', Value: '70'
Key: 'Supported', Value: 'replaces'
Key: 'To', Value: '<sip:13@10.10.1.13>;tag=totag'
Key: 'User-Agent', Value: 'Asterisk PBX'
Key: 'Via', Value: 'SIP/2.0/UDP 10.10.1.99:5060;branch=z9hG4bK343bf628;rport'

Note that the \r\n combo is kept as-is in the value for the From header. If you want to normalize that to some other LWS character, such as a simple ' ', use e.g.

value = *(omit[ crlf >> !(~blank | eoi) ] >> attr(' ') | (char_ - crlf));
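Once the From value is in the map, the tag itself still has to be pulled out of it. The following is a minimal sketch using only the standard library; extract_tag is a hypothetical helper, not part of the parser above. It assumes the tag is a header parameter that follows the closing '>' of the address, and it uses the token character set from RFC 3261 rather than only alphanumerics. Because \s also matches CR and LF, it works on both the raw and the normalized value.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the value of the ";tag=" header parameter,
// or an empty string if there is none (a real tag is never empty).
// Assumption: header parameters follow the last '>' of the address, so
// parameters inside the <...> URI are skipped.
std::string extract_tag(const std::string& value) {
    // ';' , optional whitespace (including a folded CRLF), "tag", '=', token.
    static const std::regex tag_re(
        R"(;\s*tag\s*=\s*([[:alnum:].!%*_+`'~\-]+))", std::regex::icase);

    std::size_t start = value.rfind('>');
    if (start == std::string::npos) start = 0;  // no angle brackets at all

    std::smatch m;
    auto from = value.begin() + static_cast<std::string::difference_type>(start);
    if (std::regex_search(from, value.end(), m, tag_re)) return m[1].str();
    return std::string();
}
```

For the sample message, extract_tag(headers["From"]) yields fromtag and extract_tag(headers["To"]) yields totag. A quoted parameter value containing '>' would defeat the rfind shortcut, so treat this as a sketch rather than a complete SIP parameter parser.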

Answer 2:

A continuation line starts with a space or a horizontal tab, so depending on your regex engine you can match a line continuation with \r\n[ \t]. That said, your regex may become quite complex, because it has to allow for that everywhere linear whitespace can occur. You might be better off with a custom parser that breaks the message into header lines and then tests each one for what you need.
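A custom splitter along those lines could look like the sketch below. unfold_header_lines is a hypothetical name, and the code assumes it is handed the message starting at the first header; if the request line is included, it simply comes back as the first element. Each continuation line is joined to the previous header with a single space, and parsing stops at the blank line that separates headers from the body.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: splits the header section into logical header lines,
// joining folded (continuation) lines onto the header they belong to.
std::vector<std::string> unfold_header_lines(const std::string& raw) {
    std::vector<std::string> headers;
    std::size_t pos = 0;
    while (pos < raw.size()) {
        // Cut out one physical line; accept CRLF as well as a bare LF.
        std::size_t eol = raw.find('\n', pos);
        if (eol == std::string::npos) eol = raw.size();
        std::string line = raw.substr(pos, eol - pos);
        pos = eol + 1;
        if (!line.empty() && line.back() == '\r') line.pop_back();

        if (line.empty()) break;  // blank line: headers end, body follows

        if ((line[0] == ' ' || line[0] == '\t') && !headers.empty()) {
            // Continuation: drop the leading whitespace, join with one space.
            std::size_t first = line.find_first_not_of(" \t");
            if (first == std::string::npos) continue;  // whitespace-only line
            headers.back() += ' ';
            headers.back().append(line, first, std::string::npos);
        } else {
            headers.push_back(line);
        }
    }
    return headers;
}
```

With the headers unfolded this way, a single-line regex such as the one in the question can be applied to each returned string without special handling for line breaks.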

Answer 3:

Please try the following pattern:

/(?i)(f(?:rom):(?:(?!^[^\r?\n]+)[\S\s])*((?:\s;\s*([^=]+)\s*=\s*([^\r?\n]+))))/g

Working demo at RegEx101
