Trying to parse out vCard name entry with Regex

Posted 2019-08-08 01:28

Question:

I have the following regex (VB.NET) to parse the name entry out of a vCard:

        ' Requires: Imports System.Text.RegularExpressions
        ' IgnorePatternWhitespace lets the pattern use spaces purely for layout.
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Multiline Or RegexOptions.IgnorePatternWhitespace
        Dim regex As New Regex("(?<strElement>(N)) (;[^:]*)? (;CHARSET=UTF-8)? (:(?<strSurname>([^;\n\r]*))) (;(?<strGivenName>([^;\n\r]*)))? (;(?<strMidName>([^;\n\r]*)))? (;(?<strPrefix>([^;\n\r]*)))? (;(?<strSuffix>[^;\n\r]*))?", options)
        Dim m As Match = regex.Match(s)
        If m.Success Then
            ' Copy each named group into the corresponding name component.
            Surname = m.Groups("strSurname").Value
            GivenName = m.Groups("strGivenName").Value
            MiddleName = m.Groups("strMidName").Value
            Prefix = m.Groups("strPrefix").Value
            Suffix = m.Groups("strSuffix").Value
        End If

It works when I have a vCard like:

BEGIN:VCARD
VERSION:2.1
N:Bacon;Kevin;Francis;Mr.;Jr.
FN: Mr. Kevin Francis Bacon Jr.
ORG:Movies.com

But it doesn't work correctly when the vCard is like this:

BEGIN:VCARD
VERSION:2.1
N:Bacon;Kevin
FN:Kevin Bacon
ORG:Movies.com

The regex assigns Kevin to <strSuffix> rather than to <strGivenName>, which is what I wanted. How can I fix this?

The regex was adapted from this question: vCard regex

Answer 1:

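The following regex pattern should work. It anchors the match to the start of the line with ^ and makes each ; separator optional, so the name components are filled from left to right and any missing trailing ones come back empty: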

^N(?:;(?!CHARSET=UTF-8)[^:]*|)(?:;CHARSET=UTF-8|):(?<strSurname>[^;\n\r]*);?(?<strGivenName>[^;\n\r]*);?(?<strMidName>[^;\n\r]*);?(?<strPrefix>[^;\n\r]*);?(?<strSuffix>[^;\n\r]*)

See this example and this example.
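
As a rough illustration of how this pattern could be dropped into the code from the question, here is a minimal VB.NET sketch. The module name, the local variable names and the hard-coded sample input are assumptions made for the demo; only the pattern itself comes from this answer. IgnorePatternWhitespace is left out on the assumption that it is no longer needed, since the pattern contains no layout spaces.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module VCardNameDemo
    Sub Main()
        ' Sample input (assumption): the short vCard from the question.
        Dim s As String = String.Join(vbLf, "BEGIN:VCARD", "VERSION:2.1",
                                      "N:Bacon;Kevin", "FN:Kevin Bacon", "ORG:Movies.com")

        ' Multiline makes ^ match at the start of every line, not only at the start of s.
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Multiline

        ' The pattern from this answer, split across lines only for readability.
        Dim pattern As String =
            "^N(?:;(?!CHARSET=UTF-8)[^:]*|)(?:;CHARSET=UTF-8|):" &
            "(?<strSurname>[^;\n\r]*);?(?<strGivenName>[^;\n\r]*);?" &
            "(?<strMidName>[^;\n\r]*);?(?<strPrefix>[^;\n\r]*);?" &
            "(?<strSuffix>[^;\n\r]*)"

        Dim m As Match = Regex.Match(s, pattern, options)
        If m.Success Then
            ' Components missing from the N line come back as empty strings.
            Console.WriteLine("Surname:    " & m.Groups("strSurname").Value)
            Console.WriteLine("GivenName:  " & m.Groups("strGivenName").Value)
            Console.WriteLine("MiddleName: " & m.Groups("strMidName").Value)
            Console.WriteLine("Prefix:     " & m.Groups("strPrefix").Value)
            Console.WriteLine("Suffix:     " & m.Groups("strSuffix").Value)
        End If
    End Sub
End Module
```

For this input, Surname is "Bacon" and GivenName is "Kevin", while the remaining three groups are empty strings.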



Answer 2:

Rather than parsing each kind of line with its own regex, I would tokenize every line and let the code that consumes the tokens decide whether any optional items are missing. Here is a pattern that simply splits each line into its code and its data items (use the ExplicitCapture and Multiline options):

^(?<Code>[^:]+)(:)((?<Tokens>[^;\r\n]+)(;?))+

That puts the emphasis on creating individual code objects that handle the business logic of whether data is missing or not. Failures are then no longer regex failures but business-logic post-processing failures, which IMHO are easier to debug and maintain.
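
One way the tokenizing idea could look in VB.NET is sketched below. It relies on .NET keeping every repetition of a named group in its Captures collection. The module name, the TokenAt helper and the sample input are hypothetical and only serve the illustration; the pattern is the one given above.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module VCardTokenizeDemo
    ' Hypothetical helper: returns the token at index i, or "" when the line has fewer tokens.
    Function TokenAt(tokens As CaptureCollection, i As Integer) As String
        Return If(i < tokens.Count, tokens(i).Value, "")
    End Function

    Sub Main()
        ' Sample input (assumption): the short vCard from the question.
        Dim s As String = String.Join(vbLf, "BEGIN:VCARD", "VERSION:2.1",
                                      "N:Bacon;Kevin", "FN:Kevin Bacon", "ORG:Movies.com")

        ' ExplicitCapture keeps only the named groups; Multiline lets ^ match on every line.
        Dim options As RegexOptions = RegexOptions.ExplicitCapture Or RegexOptions.Multiline
        Dim lineRegex As New Regex("^(?<Code>[^:]+)(:)((?<Tokens>[^;\r\n]+)(;?))+", options)

        For Each m As Match In lineRegex.Matches(s)
            ' Code may carry parameters (for example N;CHARSET=UTF-8), so keep only the part before the first ';'.
            Dim code As String = m.Groups("Code").Value.Split(";"c)(0)
            If code = "N" Then
                ' Each repetition of the Tokens group is stored as a separate capture.
                Dim tokens As CaptureCollection = m.Groups("Tokens").Captures
                Console.WriteLine("Surname:    " & TokenAt(tokens, 0))
                Console.WriteLine("GivenName:  " & TokenAt(tokens, 1))
                Console.WriteLine("MiddleName: " & TokenAt(tokens, 2))
                Console.WriteLine("Prefix:     " & TokenAt(tokens, 3))
                Console.WriteLine("Suffix:     " & TokenAt(tokens, 4))
            End If
        Next
    End Sub
End Module
```

For the sample input the N line yields two tokens, "Bacon" and "Kevin", and the other three components fall back to empty strings. Note that the Tokens group requires at least one character, so an empty component in the middle of a line (as in N:Bacon;;Francis) ends the match at that point; the post-processing code would need to account for that.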



Tags: .net regex vcard