Optional Capture Group Not Capturing

Posted 2019-07-20 02:38

<tr><td align=right>Name:</td><td align=left><b><font color=black>Nathan</font></b></td></tr>
<tr><td align=right>Extension:</td><td align=left><b>222</b></td></tr>

I have the above blob of HTML text (it can't be changed), and I'd like a regular expression that returns three capturing groups: the label (Name|Extension), the font color (black|red), and the data (\w+).

I'm having some trouble returning capture group 2, the font color. As you can see, it's not present on the "Extension" row of the table, so I've made the capture group optional. When I do that, group 2 no longer captures anything on the first row either. I've tried many combinations of quantifiers by trial and error, but I still can't get the result I'm looking for.

Here's the pattern I have so far: (Name|Extension):.*?(?:<font color=(black|red)>)?.*?>(\w+)

I believe the .*? is consuming the text that the optional capture group should match, so only the 1st and 3rd groups are captured. If someone could explain to me where I've gone wrong, that would be great.

Edit: As someone who is trying to learn more about regular expressions, I would appreciate it if people could interpret the data I have above as immutable text rather than HTML.

Tags: regex, parsing
2 answers
手持菜刀,她持情操
#2 · 2019-07-20 03:34

Here's the atrocity you're looking for:

 (Name|Extension).*?<b>[<font color=]{0,12}(black|red)?>?(.*?)</.*

It's fragile as hell and I would absolutely not expect it to work if the format of the HTML with which you're dealing differs even slightly from the example you provided. If that HTML is reliably awful, though, you should be OK, I think.

Do note that this is not to be taken as evidence that Signor Mendoza is wrong with regard to the inherent impossibility of parsing HTML with regexes; quite the contrary, it is evidence that he is absolutely correct in every particular. This isn't parsing; this is cheating, and like I say, you're only going to get away with it if the source HTML you're working with is as ugly throughout as it is in the sample you gave.

Test case:

 <tr><td align=right>Name:</td><td align=left><b><font color=black>Nathan</font></b></td></tr>
 <tr><td align=right>Extension:</td><td align=left><b>222</b></td></tr>
 <tr><td align=right>Name:</td><td align=left><b><font color=red>Thomas</font></b></td></tr>
 <tr><td align=right>Extension:</td><td align=left><b>223</b></td></tr>
 <tr><td align=right>Name:</td><td align=left><b><font color=black>Frank</font></b></td></tr>
 <tr><td align=right>Extension:</td><td align=left><b>224</b></td></tr>
 <tr><td align=right>Name:</td><td align=left><b><font color=red>Steve</font></b></td></tr>
 <tr><td align=right>Extension:</td><td align=left><b>225</b></td></tr>
 <tr><td align=right>Name:</td><td align=left><b><font color=black>Tony</font></b></td></tr>
 <tr><td align=right>Extension:</td><td align=left><b>226</b></td></tr>

Result:

 Name black Nathan
 Extension  222
 Name red Thomas
 Extension  223
 Name black Frank
 Extension  224
 Name red Steve
 Extension  225
 Name black Tony
 Extension  226
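
For anyone who wants to reproduce the test above, here is a minimal sketch using Python's `re` module. The choice of Python is an assumption (the thread never names a language), and the `lines` helper variable is purely illustrative. Unmatched optional groups come back as `None` in Python, so they are printed as empty strings to mirror the result listing.

```python
import re

# Pattern from this answer, unchanged.
pattern = re.compile(
    r'(Name|Extension).*?<b>[<font color=]{0,12}(black|red)?>?(.*?)</.*'
)

# The ten sample rows from the test case, one per line.
html = """\
<tr><td align=right>Name:</td><td align=left><b><font color=black>Nathan</font></b></td></tr>
<tr><td align=right>Extension:</td><td align=left><b>222</b></td></tr>
<tr><td align=right>Name:</td><td align=left><b><font color=red>Thomas</font></b></td></tr>
<tr><td align=right>Extension:</td><td align=left><b>223</b></td></tr>
<tr><td align=right>Name:</td><td align=left><b><font color=black>Frank</font></b></td></tr>
<tr><td align=right>Extension:</td><td align=left><b>224</b></td></tr>
<tr><td align=right>Name:</td><td align=left><b><font color=red>Steve</font></b></td></tr>
<tr><td align=right>Extension:</td><td align=left><b>225</b></td></tr>
<tr><td align=right>Name:</td><td align=left><b><font color=black>Tony</font></b></td></tr>
<tr><td align=right>Extension:</td><td align=left><b>226</b></td></tr>
"""

# "." does not cross newlines by default, so each match stays on one row.
lines = [
    ' '.join(group or '' for group in match.groups())
    for match in pattern.finditer(html)
]

for line in lines:
    print(line)
```

The first two printed lines are `Name black Nathan` and `Extension  222`, matching the listing above (the double space marks the empty color group).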
Juvenile、少年°
#3 · 2019-07-20 03:35

The problem is the reluctant quantifiers. The first .*? consumes nothing at first, allowing the next part of the regex to try matching the FONT tag right after the :. It doesn't find one, but that's okay because that part's optional. Then the second .*? takes over, consuming only as much as it has to until the >(\w+) can match. So if there is a FONT tag, it's getting matched by the second .*?, not by the optional group as you intended.
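
The behavior described above is easy to observe directly. The following is a small sketch in Python (an assumption on my part, since the question does not say which regex engine is in use; any backtracking engine with reluctant quantifiers should behave the same way here). It runs the pattern from the question against the two sample rows and shows that group 2 stays empty even when a FONT tag is present.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'(Name|Extension):.*?(?:<font color=(black|red)>)?.*?>(\w+)')

rows = [
    '<tr><td align=right>Name:</td><td align=left><b><font color=black>Nathan</font></b></td></tr>',
    '<tr><td align=right>Extension:</td><td align=left><b>222</b></td></tr>',
]

# The optional group is only attempted right after the colon, where the
# text is "</td>...", so it matches zero times and group 2 is None.
results = [pattern.search(row).groups() for row in rows]

for groups in results:
    print(groups)
# ('Name', None, 'Nathan')
# ('Extension', None, '222')
```

Both rows match as a whole, but the FONT tag on the first row is swallowed by the second `.*?`, which is why the color never shows up in group 2.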

But don't bother making the quantifiers greedy; it might work, but more likely it will just fail less efficiently. Try this instead:

<td[^>]*>(Name|Extension):</td><td[^>]*><b>(?:<font color=(black|red)>)?([^<]*)<

Because I explicitly matched all the tags following the label, it's in the correct position to match the FONT tag if there is one. If it's there, group(2) will contain the color; otherwise it will be null.
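
Here is a quick check of that pattern, sketched in Python as an assumed host language. One difference from the wording above: Python's `re` reports an unmatched group as `None` rather than null, and the call is `match.group(2)` on the match object.

```python
import re

# Pattern from this answer, split across two literals for readability only.
pattern = re.compile(
    r'<td[^>]*>(Name|Extension):</td><td[^>]*><b>'
    r'(?:<font color=(black|red)>)?([^<]*)<'
)

rows = [
    '<tr><td align=right>Name:</td><td align=left><b><font color=black>Nathan</font></b></td></tr>',
    '<tr><td align=right>Extension:</td><td align=left><b>222</b></td></tr>',
]

# Every tag between the label and the data is matched explicitly, so the
# optional group is tried exactly where the FONT tag would start.
results = [pattern.search(row).groups() for row in rows]

for label, color, data in results:
    print(label, color, data)
# Name black Nathan
# Extension None 222
```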
