I am trying to get the text-to-HTML ratio of a given webpage. I use a strip_html_tags function to strip out the HTML tags and compare the result with the original page content to get the ratio. My concern is that strip_html_tags may not catch every tag on the page. Is there a better way to do this, maybe something that simply removes everything that starts with < and ends with >? I can already see that the regexes miss a lot of tags that should be stripped, so there has to be a better way to do all this.
function strip_html_tags($text)
{
    $text = preg_replace(array(
        // Remove invisible content together with its tags
        '@<head[^>]*?>.*?</head>@siu',
        '@<style[^>]*?>.*?</style>@siu',
        '@<script[^>]*?>.*?</script>@siu',
        '@<object[^>]*?>.*?</object>@siu',
        '@<embed[^>]*?>.*?</embed>@siu',
        '@<applet[^>]*?>.*?</applet>@siu',
        '@<noframes[^>]*?>.*?</noframes>@siu',
        '@<noscript[^>]*?>.*?</noscript>@siu',
        '@<noembed[^>]*?>.*?</noembed>@siu',
        // Put a line break before block-level tags
        '@</?((address)|(blockquote)|(center)|(del))@iu',
        '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
        '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
        '@</?((table)|(th)|(td)|(caption))@iu',
        '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
        '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
        '@</?((frameset)|(frame)|(iframe))@iu',
        '#<[\/\!]*?[^<>]*?>#siu', // Strip out HTML tags
        '#<![\s\S]*?--[ \t\n\r]*>#siu' // Strip multi-line comments including CDATA
    ), array(
        // One replacement per pattern, in the same order as above
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0",
        '',
        ''
    ), $text);
    return strip_tags($text);
}
function check_ratio($url)
{
    $file_content = // getting data from curl request here
    $page_size = mb_strlen($file_content, '8bit');
    $content = strip_html_tags($file_content);
    $text_size = mb_strlen($content, '8bit');
    $content = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", " ", $content);
    $len_real = strlen($file_content);
    $len_strip = strlen($content);
    return round((($len_strip / $len_real) * 100), 2);
}
DOMNode::$textContent can be a starting point:
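A minimal sketch of that idea, not a definitive implementation. The function name text_to_html_ratio and the sample markup are made up for illustration. It parses the page with DOMDocument, removes <script> and <style> nodes because textContent would otherwise include their contents, and compares byte lengths:

```php
<?php
// Hypothetical helper: percentage of the page's bytes that is text,
// measured through DOMNode::$textContent.
function text_to_html_ratio(string $html): float
{
    if ($html === '') {
        return 0.0;
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->documentElement;
    if ($root === null) {
        return 0.0; // nothing parseable in the input
    }

    foreach (['script', 'style'] as $tag) {
        // Copy the live node list into an array before removing nodes from the tree.
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    // Collapse whitespace runs so source indentation does not count as text.
    $text = trim(preg_replace('/\s+/', ' ', $root->textContent));

    return round(strlen($text) / strlen($html) * 100, 2);
}

// Sample markup, invented for this example.
$html = '<html><head><title>T</title><style>p{color:red}</style></head>'
      . '<body><p>Hello world</p><script>var x = 1;</script></body></html>';
echo text_to_html_ratio($html), "\n";
```

One caveat: loadHTML() assumes ISO-8859-1 unless the page declares its charset, so byte counts for UTF-8 pages without a charset declaration may come out inflated.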
It also includes data from tags you probably won't consider "text", such as <style> or <script>, but it shouldn't be difficult to take that into account.
This one uses a regex.
Update 1:
- Added an atomic group around the tag body of invisible content; without it, unbalanced quotes could cause catastrophic backtracking.
- Added the list of invisible content it will remove: script, style, head, object, embed, applet, noframes, noscript, noembed.
If one of these has no closing tag, only the tag itself is removed; otherwise its content is removed along with the tags.
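The regex from this answer is not reproduced here, so the following is only a rough sketch of the behaviour described above, under my own assumptions. The function name strip_markup_regex and the patterns are illustrative, not the answer's actual regex. The attribute part of each tag sits in an atomic group so that quoted values may contain > without opening the door to runaway backtracking:

```php
<?php
// Hypothetical helper: regex-based stripping along the lines described above.
function strip_markup_regex(string $html): string
{
    $invisible = 'script|style|head|object|embed|applet|noframes|noscript|noembed';

    // Tag body: quoted attribute values (which may contain ">") or any other
    // character except ">" and quotes. (?>...) makes the group atomic.
    $attrs = '(?>(?:"[^"]*"|\'[^\']*\'|[^>"\'])*)';

    $patterns = [
        // 1. HTML comments
        '~<!--.*?-->~s',
        // 2. Invisible elements that have a closing tag: remove tags and content
        '~<(' . $invisible . ')\b' . $attrs . '>.*?</\1\s*>~is',
        // 3. Every remaining tag, including unclosed invisible ones and <!DOCTYPE ...>
        '~<[/!]?[a-z]' . $attrs . '>~i',
    ];

    // preg_replace() returns null on a PCRE error; fall back to an empty string.
    return preg_replace($patterns, '', $html) ?? '';
}

// Sample markup, invented for this example.
$html = '<html><head><title>T</title></head><body><p class="a>b">Hi</p>'
      . '<script>var x = "<b>";</script><!-- note --> there</body></html>';
echo strip_markup_regex($html), "\n";
```

A bare < that is not followed by a letter, as in "a < b", is left alone by the third pattern. A tag with unbalanced quotes is left in place rather than guessed at.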
Find Raw Regex
Replace with nothing.
Various stringed / delimited representations
Expanded
Benchmark:
Sample analysis, page size: 126,000 bytes
Why are you reinventing the wheel?
Here's the better way, strip_tags(): http://php.net/manual/en/function.strip-tags.php
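For comparison, a minimal sketch of the ratio computed with strip_tags() alone. The helper name strip_tags_ratio is made up for illustration. Keep in mind that strip_tags() removes only the tags themselves, so the text inside <script> and <style> still counts as content:

```php
<?php
// Hypothetical helper: text-to-HTML ratio using only the built-in strip_tags().
function strip_tags_ratio(string $html): float
{
    if ($html === '') {
        return 0.0; // avoid dividing by zero on an empty page
    }

    $text = strip_tags($html);
    // Collapse whitespace runs so source indentation does not count as text.
    $text = trim(preg_replace('/\s+/', ' ', $text));

    return round(strlen($text) / strlen($html) * 100, 2);
}

// "Hello world" is 11 of the 25 bytes in this snippet.
echo strip_tags_ratio('<p>Hello <b>world</b></p>'), "\n";
```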