Text to html ratio of a page issue

2019-06-13 19:54发布

I am trying to get Text to HTML Ratio on a given webpage. I am using a strip_html_tags to strip out the html tags and comparing it to the original content on the page to get the ratio. My issue is that I feel like my strip_html_tags function may not get all the tags on webpage. Is there a better way to do this... maybe that just replaces everything that starts with < and >. I can already point out that I am missing a lot of tags that should be stripped in the regex but there has to be a better way to do all this.

function strip_html_tags($text)
{
    $text = preg_replace(array(
        '@<head[^>]*?>.*?</head>@siu',
        '@<style[^>]*?>.*?</style>@siu',
        '@<script[^>]*?.*?</script>@siu',
        '@<object[^>]*?.*?</object>@siu',
        '@<embed[^>]*?.*?</embed>@siu',
        '@<applet[^>]*?.*?</applet>@siu',
        '@<noframes[^>]*?.*?</noframes>@siu',
        '@<noscript[^>]*?.*?</noscript>@siu',
        '@<noembed[^>]*?.*?</noembed>@siu',
        '@</?((address)|(blockquote)|(center)|(del))@iu',
        '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
        '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
        '@</?((table)|(th)|(td)|(caption))@iu',
        '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
        '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
        '@</?((frameset)|(frame)|(iframe))@iu',
        '#<[\/\!]*?[^<>]*?>#siu', // Strip out HTML tags
        '#<![\s\S]*?--[ \t\n\r]*>#siu' // Strip multi-line comments including CDATA
    ), array(
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0"
    ), $text);
    return strip_tags($text);
}

function check_ratio($url)
{
    $file_content = // getting data from curl request here
    $page_size    = mb_strlen($file_content, '8bit');
    $content      = strip_html_tags($file_content);
    $text_size    = mb_strlen($content, '8bit');
    $content      = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", " ", $content);
    $len_real     = strlen($file_content);
    $len_strip    = strlen($content);
    return round((($len_strip / $len_real) * 100), 2);
}

3条回答
小情绪 Triste *
2楼-- · 2019-06-13 20:28

DOMNode::$textContent can be a starting point:

$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML(file_get_contents('http://www.google.com'));
libxml_use_internal_errors(false);

$items = $domd->getElementsByTagName('body');
var_dump($items[0]->textContent);

It also includes data from tags you probably won't consider "text", such as <style> or <script> but it shouldn't be difficult to take that into account.

查看更多
三岁会撩人
3楼-- · 2019-06-13 20:31

This is using a regex.

Update 1:

-Have to add an atomic group around the tag body of invisible content,
or could cause catastrophic backtracking if quotes are unbalanced.

-Added list of invisible content it will remove:

script, style, head, object, embed, applet, noframes, noscript, noembed

If no closing tag, just the tag will be removed, otherwise it's content is removed with the tags.

DEMO


Find Raw Regex

<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>  

Replace with nothing.


Various stringed / delimited representations

Delimiter only:  /<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/
Single Quote & Delimiter:  '/<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|\'[\S\s]*?\'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|\'[\S\s]*?\'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/'
Double Quote only:  "<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\\s+(?>\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+)?\\s*>)[\\S\\s]*?</\\1\\s*(?=>))|(?:/?[\\w:]+\\s*/?)|(?:[\\w:]+\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>"

Expanded

 # <(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>

 <
 (?:
      (?:
           (?:
                # Invisible content; end tag req'd
                (                             # (1 start)
                     script
                  |  style
                  |  head
                  |  object
                  |  embed
                  |  applet
                  |  noframes
                  |  noscript
                  |  noembed 
                )                             # (1 end)
                (?:
                     \s+ 
                     (?>
                          " [\S\s]*? "
                       |  ' [\S\s]*? '
                       |  (?:
                               (?! /> )
                               [^>] 
                          )?
                     )+
                )?
                \s* >
           )

           [\S\s]*? </ \1 \s* 
           (?= > )
      )

   |  (?: /? [\w:]+ \s* /? )
   |  (?:
           [\w:]+ 
           \s+ 
           (?:
                " [\S\s]*? " 
             |  ' [\S\s]*? ' 
             |  [^>]? 
           )+
           \s* /?
      )
   |  \? [\S\s]*? \?
   |  (?:
           !
           (?:
                (?: DOCTYPE [\S\s]*? )
             |  (?: \[CDATA\[ [\S\s]*? \]\] )
             |  (?: -- [\S\s]*? -- )
             |  (?: ATTLIST [\S\s]*? )
             |  (?: ENTITY [\S\s]*? )
             |  (?: ELEMENT [\S\s]*? )
           )
      )
 )
 >

Benchmark:

Regex1:   <(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
Options:  < none >
Completed iterations:   3  /  3     ( x 1000 )
Matches found per iteration:   3780
Elapsed Time:    43.52 s,   43523.08 ms,   43523084 µs

Sample Analysis, page size 126,000 bytes:

      3,780 tags / page
  x   3,000 iterations
--------------------------
 11,340,000 total tags
  /  43.52  seconds
--------------------------
    260,569 tags / second  
  /   3,780 tags / page
--------------------------
   70 pages / second
查看更多
女痞
4楼-- · 2019-06-13 20:32

why are you reinventing the wheel?

here's the better way: http://php.net/manual/en/function.strip-tags.php

查看更多
登录 后发表回答