Extract content from each first TD in a Table

I've got some HTML that looks like this:

<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><a href="deluserconfirm.html?user=abcde"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><a href="deluserconfirm.html?user=efgh"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-even">
    <td align="center">ijkl</td>
    <td align="center"><a href="deluserconfirm.html?user=ijkl"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>

And I need to retrieve the values abcde, efgh, and ijkl.

This is the regex I'm currently using:

preg_match_all('/(<tr class="row-even">|<tr class="row-odd">)<td align="center">(.*)<\/td><\/tr>/xs', $html, $matches);

Yes, I'm not very good at them. As with most of my regex attempts, this is not working. Can anyone tell me why?

Also, I know about html/xml parsers, but it would require a significant code revisit to make that happen. So that's for later. We need to stick with regex for now.

EDIT: To clarify, I need the values between the first <td align="center"></td> tag after either <tr class="row-even"> or <tr class="row-odd">

Tags: php regex preg-match-all

6 Answers

甜甜的少女心

Reply #2 · 2019-09-04 08:44

Actually, you don't need to change much of your codebase. Fetching text nodes always works the same way with DOM and XPath; only the XPath expression changes, so you could wrap the DOM code in a function that replaces your preg_match_all call. That would be just a tiny change, e.g.

include_once "dom.php";
$matches = dom_match_all('//tr/td[1]', $html);

where dom.php just contains:

// dom.php
function dom_match_all($query, $html, array $matches = array()) {
    $dom = new DOMDocument;
    libxml_use_internal_errors(TRUE);
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xPath = new DOMXPath($dom);
    foreach( $xPath->query($query) as $node ) {
        $matches[] = $node->nodeValue;
    }
    return $matches;
}

and would return

Array
(
    [0] => abcde
    [1] => efgh
    [2] => ijkl
)

But if you want a Regex, use a Regex. I am just giving ideas.


可以哭但决不认输i

Reply #3 · 2019-09-04 08:51

~<tr class="row-(even|odd)">\s*<td align="center">(.*?)</td>~m

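Notice the use of \s*, which skips the newline and indentation between the two tags, and the lazy .*?. The m modifier only changes how ^ and $ behave, so it is optional here.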

Also, you can make the first group non-capturing via ?:. I.e., (?:even|odd) as you're probably not interested in the class attribute :)
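As a rough illustration of the non-capturing variant, here is a minimal sketch run against two rows copied from the question. The variable names ($html, $pattern, $users) are only for this example, and the m modifier is left out since the pattern has no ^ or $ anchors.

```php
<?php
// Sample markup copied from the question (two rows are enough to show the idea).
$html = <<<'HTML'
<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><a href="deluserconfirm.html?user=abcde"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><a href="deluserconfirm.html?user=efgh"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
HTML;

// (?:even|odd) groups without capturing, so the cell text ends up in group 1.
// "~" is the delimiter, which means "/" in "</td>" needs no escaping.
$pattern = '~<tr class="row-(?:even|odd)">\s*<td align="center">(.*?)</td>~';

preg_match_all($pattern, $html, $matches);
$users = $matches[1];   // ['abcde', 'efgh']

print_r($users);
```

With the capturing group from the original pattern, the same values would be in $matches[2] instead, and $matches[1] would hold "even" or "odd".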


Summer. ? 凉城

Reply #4 · 2019-09-04 08:56

This is just a quick and dirty regex to meet your needs. It could easily be cleaned up and optimized, but it's a start.

<tr[^>]+>[^\n]*\n               #Match the opening <tr> tag
  \s*<td[^>]+>([^<]+)[^\n]+\n   #Group the wanted data
  [^\n]+\n                      #Match next line
</tr>                           #Match closing tag

Here is an alternative way, which may be more robust:

deluserconfirm.html\?user=([^"]+)
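A minimal sketch of the alternative, assuming the user name in the delete link always equals the text of the first cell, as it does in the question's sample. The dot in "deluserconfirm.html" is escaped here so it matches literally, which is a small tightening of the pattern above; the variable names are only for illustration.

```php
<?php
// Sample markup copied from the question.
$html = <<<'HTML'
<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><a href="deluserconfirm.html?user=abcde"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><a href="deluserconfirm.html?user=efgh"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
HTML;

// "." and "?" are escaped so they match literally; [^"]+ stops at the
// closing quote of the href attribute.
preg_match_all('~deluserconfirm\.html\?user=([^"]+)~', $html, $matches);
$users = $matches[1];   // ['abcde', 'efgh']

print_r($users);
```

This variant does not depend on the table layout at all, but if user names can contain characters that get URL-encoded in the link, the captured value would need decoding before it matches the cell text.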


叼着烟拽天下

Reply #5 · 2019-09-04 08:57

This is what I came up with

<td align="center">([^<]+)</td>

I'll explain. One of the challenges here is that what's between the <td> tags could be either the text you're looking for or an <a> tag. In the regex, [^<]+ says to match one or more characters that are not the < character. That's great, because it means the cell containing the <a> won't match, and the group will only match up to the point where the </td> tag is found.
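To make that concrete, here is a small sketch using the three rows from the question, with the link cells shortened for readability. The variable names are only for this example.

```php
<?php
// Rows from the question; the second cell of each row starts with a tag.
$html = <<<'HTML'
<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><a href="deluserconfirm.html?user=abcde">delete</a></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><a href="deluserconfirm.html?user=efgh">delete</a></td>
</tr>
<tr class="row-even">
    <td align="center">ijkl</td>
    <td align="center"><a href="deluserconfirm.html?user=ijkl">delete</a></td>
</tr>
HTML;

// [^<]+ needs at least one non-"<" character right after the opening tag,
// so a cell whose content begins with "<a ..." cannot match.
preg_match_all('~<td align="center">([^<]+)</td>~', $html, $matches);
$users = $matches[1];   // ['abcde', 'efgh', 'ijkl']

print_r($users);
```

The same property is also the limitation: an empty first cell, or one whose text is wrapped in markup, would be skipped silently.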


一夜七次

Reply #6 · 2019-09-04 09:04

Try this:

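preg_match_all('/(?:<tr class="row-even">|<tr class="row-odd">)\s*<td align="center">(.*?)<\/td>/s', $html, $matches);

Changes made:

You had not accounted for the newline and indentation between the tags; \s* skips any whitespace there.
You don't need the x modifier: it makes the regex engine ignore literal whitespace in the pattern, so the spaces inside <tr class="..."> and <td align="..."> could never match.
Make the matching non-greedy by using .*? in place of .*.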


欢心

Reply #7 · 2019-09-04 09:10

Disclaimer: Using regexps to parse HTML is dangerous.

To get the inner HTML of the first TD in each TR, use this regexp:

/<tr[^>]*>\s*<td[^>]*>(.+?)<\/td>/si
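A short sketch of how such a generic pattern might be used, assuming the attribute part of the td tag is written as [^>]* so that any number of attribute characters is allowed. The second row is made up for this example (uppercase tags and a <b> wrapper) to show what the i modifier and the "inner HTML" capture do; strip_tags is an optional extra step, not part of the answer.

```php
<?php
// First row follows the question's markup; the second row is hypothetical
// and only exists to exercise the case-insensitive match and nested markup.
$html = <<<'HTML'
<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><a href="deluserconfirm.html?user=abcde">x</a></td>
</tr>
<TR CLASS="row-odd">
    <TD ALIGN="center"><b>efgh</b></TD>
    <TD ALIGN="center"><a href="deluserconfirm.html?user=efgh">x</a></TD>
</TR>
HTML;

// s: "." also matches newlines; i: tag names match in any letter case.
preg_match_all('/<tr[^>]*>\s*<td[^>]*>(.+?)<\/td>/si', $html, $matches);

$innerHtml = $matches[1];                      // ['abcde', '<b>efgh</b>']
$text = array_map('strip_tags', $innerHtml);   // ['abcde', 'efgh']

print_r($text);
```

Because the capture is the raw inner HTML, any markup inside the cell comes back with it, which is why the sketch strips tags afterwards.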
