Hey,
I'm trying to import some legacy data into a brand new system. It's almost done, but there's one huge problem! Assume data like this:
Blabla Vol.1 chapter 2
ABCD in the era of XYZ volume 2 First Chapter
A really useless book Eighth vol
Blala Sixth Vol Chapter 5
Lablah V6C7 2002
FooBar Vol6 C3 by Dr. Foo Bar
Regex: A tool in Hell V1 Eleventh Chapter
I'm confused! I tried to write a regex to extract the volume and chapter numbers, but you know, it's REGEX! Can anyone please guide me through this?
Here is a regular expression that will match your examples:
/^.+?(?|(?:\bVol\.?|\bvolume[ ]+|V)(\d+)|[ ]+([a-z]+)[ ]+vol\b).*?(?:(?|(?:C|chapter[ ]+)(\d+)|[ ]+([a-z]+)[ ]+Chapter\b).*?)?$/im
You can live-edit the regex and add tests here.
In the results at that link:
element [0] is the array of full matches
element [1] is the array of volumes
element [2] is the array of chapters
I assumed that the volume always comes before the chapter, as in your examples.
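For anyone who wants to try the single-pattern idea outside PCRE, here is a rough Python sketch of it. Python's re module has no branch-reset group (?|...), so each alternative gets its own named group, and the parts are joined with .*? so trailing text such as a year or an author name is tolerated. The ORDINALS table and the extract() helper are my own additions for illustration, not part of the answer above, and the table only covers first through twelfth.

```python
import re

# Volume first, then an optional chapter, as in the sample titles.
PATTERN = re.compile(
    r"""^.+?
        (?: (?:\bVol\.?|\bvolume[ ]+|V)(?P<vol_num>\d+)    # Vol.1, volume 2, V6
          | [ ]+(?P<vol_word>[a-z]+)[ ]+vol\b              # Eighth vol
        )
        .*?
        (?: (?: (?:C|chapter[ ]+)(?P<chap_num>\d+)         # C7, chapter 2
              | [ ]+(?P<chap_word>[a-z]+)[ ]+chapter\b     # First Chapter
            )
            .*?
        )?$""",
    re.IGNORECASE | re.VERBOSE,
)

# Assumption: ordinals beyond "twelfth" do not occur in the data.
ORDINALS = {
    word: number
    for number, word in enumerate(
        ["first", "second", "third", "fourth", "fifth", "sixth",
         "seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth"],
        start=1,
    )
}


def to_number(digits, word):
    """Return an int from either a digit string or an ordinal word."""
    if digits is not None:
        return int(digits)
    if word is not None:
        return ORDINALS.get(word.lower())  # None for unknown words
    return None


def extract(title):
    """Return (volume, chapter), or None when no volume is found."""
    m = PATTERN.match(title)
    if m is None:
        return None
    return (
        to_number(m.group("vol_num"), m.group("vol_word")),
        to_number(m.group("chap_num"), m.group("chap_word")),
    )
```

With the sample titles, extract("Lablah V6C7 2002") gives (6, 7) and extract("A really useless book Eighth vol") gives (8, None), since that title has no chapter.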
In my opinion, it is always best to break this into separate steps. In the first pass, you might convert the titles matching the pattern "/Vol\.[0-9]+\s+chapter\s[0-9]+$/i". In the second pass, you might convert the titles matching the pattern "/[a-z]+(th|nd|st|rd)\svol/i". And so on.
Trying to write one regular expression to capture all of these cases usually does not end well and is almost always buggy. Here's an interesting article I found the other day detailing the perils of overly complex regexes.
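A minimal sketch of the step-by-step idea, written in Python and assuming the only formats are the ones in the sample titles. Each small pattern handles one format, and the first one that matches wins. The names VOLUME_PASSES, CHAPTER_PASSES, first_match and split_title are made up for this example, and real data will probably need more passes.

```python
import re

FLAGS = re.IGNORECASE

# One small pattern per format, tried in order.
VOLUME_PASSES = [
    re.compile(r"\bvol(?:ume)?\.?\s*(\d+)", FLAGS),               # Vol.1, Vol6, volume 2
    re.compile(r"(?<![a-z])V(\d+)", FLAGS),                       # V1, V6C7
    re.compile(r"\b([a-z]+(?:st|nd|rd|th))\s+vol\b", FLAGS),      # Eighth vol
]

CHAPTER_PASSES = [
    re.compile(r"\bchapter\s+(\d+)", FLAGS),                      # chapter 2
    re.compile(r"(?<![a-z])C(\d+)", FLAGS),                       # C3, V6C7
    re.compile(r"\b([a-z]+(?:st|nd|rd|th))\s+chapter\b", FLAGS),  # First Chapter
]


def first_match(passes, title):
    """Return the captured text from the first pass that matches, else None."""
    for pattern in passes:
        m = pattern.search(title)
        if m:
            return m.group(1)
    return None


def split_title(title):
    """Return (volume, chapter) as the raw captured strings."""
    return first_match(VOLUME_PASSES, title), first_match(CHAPTER_PASSES, title)
```

Because every pass is tiny, a title that slips through is easy to debug: add one more pattern to the list instead of reworking a single large expression.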
As these expressions are not "regular" at all, a single regular expression will be difficult to write. If you have a finite set of "ways" the chapter and volume are displayed, then you could use multiple regular expressions to attempt to extract that information.
Or if you can define some rules such as "the chapter number is always in the format [chapter #]" then that would also help!
If the output is always the same things on the same lines, the first thing I would do is explode("\n", $data) and work with the correct line. If the format is consistent, you could then match with something like
'/ (.*) Vol Chapter ([0-9]*)/'
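Roughly the same thing in Python, as a sketch: str.splitlines() plays the role of explode("\n", $data), and one simple pattern is applied per line. LINE_RE assumes a single consistent "Title Vol N Chapter N" format, which is my guess at what a "consistent" line would look like. Lines that don't fit are collected separately so they can be handled by hand or by another pattern.

```python
import re

# Assumed consistent format: "<title> Vol[.]<n> Chapter <n>"
LINE_RE = re.compile(r"^(.*?)\s+Vol\.?\s*(\d+)\s+Chapter\s+(\d+)\s*$", re.IGNORECASE)


def parse_lines(data):
    """Split data into lines and return (parsed, unmatched)."""
    parsed, unmatched = [], []
    for line in data.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        m = LINE_RE.match(line)
        if m:
            parsed.append((m.group(1), int(m.group(2)), int(m.group(3))))
        else:
            unmatched.append(line)
    return parsed, unmatched
```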
BTW, this page has always helped me with regex testing.
http://www.quanetic.com/Regex