Regex - Extracting volume and chapter numbers from

Posted 2019-07-09 13:33

Question:

Hey,
I'm trying to import some legacy data into a brand new system. It's almost done, but there's one huge problem! Assume data like this:

Blabla Vol.1 chapter 2
ABCD in the era of XYZ volume 2 First Chapter  
A really useless book Eighth vol  
Blala Sixth Vol Chapter 5  
Lablah V6C7 2002  
FooBar Vol6 C3 by Dr. Foo Bar
Regex: A tool in Hell V1 Eleventh Chapter

I'm confused! I tried to write a regex to extract the volume and chapter numbers, but you know, it's REGEX! Can anyone please guide me through this?

Answer 1:

Here is a regular expression that will match your examples:

/^.+?(?|(?:\bVol\.?|\bvolume[ ]+|V)(\d+)|[ ]+([a-z]+)[ ]+vol\b).*?(?:(?|(?:C|chapter[ ]+)(\d+)|[ ]+([a-z]+)[ ]+Chapter\b).*?)?$/im

You can live edit the regex and/or add tests here.

In this link:

  • element [0] in the array is the array of full matches
  • element [1] is the array of volumes
  • element [2] is the array of chapters

  • I assumed that the volume always comes before the chapter, as in your examples.
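For anyone not on a PCRE engine: the `(?|...)` branch-reset groups above are a PCRE/Perl feature, and Python's standard `re` module does not support them. Here is a rough Python sketch of the same idea using named groups instead. The group names and the `extract_tokens` helper are made up for illustration, and it returns the raw tokens (digits or ordinal words) rather than numbers.

```python
import re

# Same structure as the PCRE pattern, but each alternative gets its own
# named group because Python's re has no branch-reset groups.
TITLE_RE = re.compile(r"""
    ^.+?
    (?: (?:\bVol\.?|\bvolume[ ]+|V)(?P<vol_num>\d+)
      | [ ]+(?P<vol_word>[a-z]+)[ ]+vol\b )
    .*?
    (?: (?: (?:C|chapter[ ]+)(?P<ch_num>\d+)
          | [ ]+(?P<ch_word>[a-z]+)[ ]+Chapter\b )
        .*? )?
    $
""", re.IGNORECASE | re.VERBOSE)


def extract_tokens(title):
    """Return (volume, chapter) as raw strings, or None if no volume is found."""
    m = TITLE_RE.match(title)
    if m is None:
        return None
    # Only one alternative per pair can have matched, so pick whichever did.
    volume = m.group("vol_num") or m.group("vol_word")
    chapter = m.group("ch_num") or m.group("ch_word")
    return volume, chapter


print(extract_tokens("Lablah V6C7 2002"))
print(extract_tokens("A really useless book Eighth vol"))
```

As in the original regex, the chapter part is optional, so a title with only a volume gives `None` for the chapter.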



Answer 2:

In my opinion, it is always best to break this into separate steps. In the first pass, you might convert the titles matching the pattern "/Vol\.[0-9]+\s+chapter\s[0-9]+$/i". In the second pass, you might convert the titles matching the pattern "/[a-z]+(th|nd|rd|st)\svol/i". And so on.

Trying to write one regular expression to capture all of these cases usually does not end well and is almost always buggy. Here's an interesting article I found the other day detailing the perils of overly complex regexes.
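A minimal sketch of that step-by-step approach, written in Python for illustration (the same patterns should carry over to `preg_match` with little change). The pass lists, the `ORDINALS` table, and the function names are all assumptions made up for this example, and the patterns only cover the formats shown in the question.

```python
import re

# Hypothetical lookup table for ordinal words; extend it to cover your data.
ORDINALS = {
    "first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5,
    "sixth": 6, "seventh": 7, "eighth": 8, "ninth": 9, "tenth": 10,
    "eleventh": 11, "twelfth": 12,
}

# Each pass handles one format; passes are tried in order, first hit wins.
VOLUME_PASSES = [
    re.compile(r"\bvol(?:ume)?\.?\s*(\d+)", re.I),                   # Vol.1, Vol6, volume 2
    re.compile(r"(?<![a-z])V(\d+)", re.I),                           # V6C7, V1
    re.compile(r"\b([a-z]+(?:st|nd|rd|th))\s+vol(?:ume)?\b", re.I),  # Eighth vol
]
CHAPTER_PASSES = [
    re.compile(r"\bchapter\s*(\d+)", re.I),                          # chapter 2
    re.compile(r"(?<![a-z])C(\d+)", re.I),                           # C3, V6C7
    re.compile(r"\b([a-z]+(?:st|nd|rd|th))\s+chapter\b", re.I),      # First Chapter
]


def first_match(passes, title):
    """Return the captured token from the first pass that matches, else None."""
    for pattern in passes:
        m = pattern.search(title)
        if m:
            return m.group(1)
    return None


def to_number(token):
    """Turn '7' or 'Seventh' into 7; unknown words and None become None."""
    if token is None:
        return None
    if token.isdigit():
        return int(token)
    return ORDINALS.get(token.lower())


def extract_numbers(title):
    return (to_number(first_match(VOLUME_PASSES, title)),
            to_number(first_match(CHAPTER_PASSES, title)))


print(extract_numbers("Blala Sixth Vol Chapter 5"))  # (6, 5)
```

Each pattern stays small enough to test on its own, and a new format only means adding one more entry to a list.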



Answer 3:

As these expressions are not "regular" at all, a single regular expression will be difficult to write. If there is a finite set of ways the chapter and volume are written, then you could use multiple regular expressions to try to extract that information.

Or, if you can define some rules such as "the chapter number is always in the format [chapter #]", then that would also help!
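One way to find out whether the set of formats really is finite is to audit the legacy titles first. Below is a rough Python sketch that labels each title with the volume formats it matches and lists the ones nothing recognises. The format names and both patterns are hypothetical and only cover the styles in the question.

```python
import re

# Hypothetical catalogue of known volume formats; add one entry per format
# you discover in the legacy data.
FORMATS = {
    "numeric_volume": re.compile(r"(?<![a-z])v(?:ol(?:ume)?\.?)?\s*\d+", re.I),
    "ordinal_volume": re.compile(r"\b[a-z]+(?:st|nd|rd|th)\s+vol(?:ume)?\b", re.I),
}


def classify(title):
    """Return the names of all known formats found in the title."""
    return [name for name, pattern in FORMATS.items() if pattern.search(title)]


def unmatched(titles):
    """Return the titles that no known format recognises."""
    return [t for t in titles if not classify(t)]


sample = ["Blabla Vol.1 chapter 2", "Just a title", "Blala Sixth Vol Chapter 5"]
print(unmatched(sample))  # ['Just a title']
```

Whatever ends up in the unmatched list is either a title with no volume at all or a format that still needs its own rule.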



Answer 4:

If the output always has the same things on the same lines, the first thing I would do is explode("\n", $data) and work with the correct line. If the format is consistent, you could then match for

'/ (.*) Vol Chapter ([0-9]*)/'

or something.

BTW, this page has always helped me with regex testing: http://www.quanetic.com/Regex
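The split-then-match idea above could look roughly like this in Python, where `str.split("\n")` plays the role of `explode("\n", $data)`. The `match_lines` name is made up, and the pattern is the one from this answer taken literally, so it is case-sensitive and only catches lines of the "Sixth Vol Chapter 5" kind.

```python
import re

# The pattern from the answer, unchanged: a word before " Vol Chapter "
# and the digits after it.
LINE_PATTERN = re.compile(r" (.*) Vol Chapter ([0-9]*)")


def match_lines(data):
    """Split the raw text into lines and collect (volume, chapter) per match."""
    results = []
    for line in data.split("\n"):
        m = LINE_PATTERN.search(line)
        if m:
            results.append((m.group(1), m.group(2)))
    return results


data = "Blabla Vol.1 chapter 2\nBlala Sixth Vol Chapter 5\nLablah V6C7 2002"
print(match_lines(data))  # [('Sixth', '5')]
```

Only the second line matches here, which shows why one pattern per format is needed for the other styles.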