I use DOMXPath to remove HTML tags that are empty (no text and no child nodes) while keeping <br/> tags:
$xpath = new DOMXPath($dom);
// Repeat until no empty elements remain, since removing a child can leave its parent empty.
while (($nodeList = $xpath->query('//*[not(text()) and not(node()) and not(self::br)]')) && $nodeList->length > 0)
{
    foreach ($nodeList as $node)
    {
        $node->parentNode->removeChild($node);
    }
}
It worked perfectly until I came across another problem:
$content = '<p><br/><br/><br/><br/></p>';
How do I remove this kind of messy <br/> and <p> markup? In other words, I don't want to allow a <p> that contains nothing but <br/> tags, but I do want to allow <br/> alongside proper text, like this:
$content = '<p>first break <br/> second break <br/> the last line</p>';
Is that possible?
Or is it better with a regular expression?
I tried something like this,
$nodeList = $xpath->query("//p[text()=<br\s*\/?>\s*]");
foreach ($nodeList as $node)
{
    $node->parentNode->removeChild($node);
}
but it returns this error:
Warning: DOMXPath::query() [domxpath.query]: Invalid expression in...
You can select the unwanted p using XPath:
"//p[count(*)=count(br) and br and normalize-space(.)='']"
Note that to select elements with no text, you may be better off using:
"//*[normalize-space(.)='' and not(self::br)]"
This will select any element (except br) that contains no non-whitespace text, including nodes like:
<p><b/><i/></p>
or
<p> <br/> <br/>
</p>
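As a rough illustration of how the first expression above could be applied, here is a minimal sketch. The helper name removeBrOnlyParagraphs is hypothetical, and the sketch assumes the input is a body fragment in an encoding that loadHTML handles by default; the serialized output may normalize <br/> to <br>.

```php
<?php
// Hypothetical helper (not from the original posts): removes every <p>
// whose only children are <br> elements and whitespace.
function removeBrOnlyParagraphs(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html); // the HTML parser wraps the fragment in <html><body>
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $query = "//p[count(*)=count(br) and br and normalize-space(.)='']";

    // Copy the matches into an array before removing anything.
    foreach (iterator_to_array($xpath->query($query)) as $p) {
        $p->parentNode->removeChild($p);
    }

    // Serialize only the children of <body>, dropping the wrapper elements.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}
```

Calling it on '<p><br/><br/><br/><br/></p>' followed by the paragraph with real text should drop the first paragraph and keep the second.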
I had almost the same situation. I use:
$document->loadHTML(str_replace('<br>', urlencode('<br>'), $string_or_file));
and then urldecode() to change it back for display or before inserting into the database.
It works for me.
You could get rid of them all by simply checking that the only things within a paragraph are whitespace and <br /> tags (note that preg_replace needs pattern delimiters, here /):
$content = preg_replace("/\<p\>(\s|\<br\s*\/\>)*\<\/p\>/", "", $content);
Broken down:
\<p\>          # Match <p>
(              # Beginning of a group
\s             # Match a whitespace character
|              # or...
\<br\s*\/\>    # match a <br /> tag, with any number (including 0) of whitespace characters between <br and />
)*             # Match this whole group (whitespace or <br /> tags) 0 or more times
\<\/p\>        # Match </p>
I will mention, however, that unless your HTML is well-formatted (one-line, no strange spaces or paragraph classes, etc), you should not use regex to parse this. If it is, this regex should work just fine.
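For a quick check of the regex approach, here is a minimal sketch. The function name stripBrOnlyParagraphs is hypothetical, and the pattern is a slight variation on the one above: it uses ~ as the delimiter so the slashes need no escaping, and it makes the slash in the br tag optional so a plain <br> also matches (an assumption about the markup you may encounter). Note that it also removes completely empty <p></p> pairs.

```php
<?php
// Hypothetical helper: strips <p> elements that contain only whitespace
// and <br> / <br/> / <br /> tags. Assumes bare <p> tags without attributes.
function stripBrOnlyParagraphs(string $html): string
{
    $pattern = '~<p>(?:\s|<br\s*/?>)*</p>~i';
    // preg_replace returns null on failure; fall back to the input in that case.
    return preg_replace($pattern, '', $html) ?? $html;
}

$messy = '<p><br/><br/><br/><br/></p>';
$clean = '<p>first break <br/> second break <br/> the last line</p>';

var_dump(stripBrOnlyParagraphs($messy));            // string(0) ""
var_dump(stripBrOnlyParagraphs($clean) === $clean); // bool(true)
```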