Apply wordwrap to html content, excluding html att

I'm not used to regular expressions, so this may be easy for others but is tricky for me.

Basically, I'm applying wordwrap() to content that contains ordinary HTML tags.

  $text = wordwrap($text, $cutLength, " ", $wordCut);
  $text = nl2br(bbcode_parser($text));
  return $text;

As you can see, my problem is pretty simple: all I want is to apply wordwrap() to my content while excluding anything inside HTML attributes such as href, src, ...

Could someone help me out? Thanks a lot!

Tags: php regex html-parsing word-wrap

2 Answers

别忘想泡老子

#2 · 2019-07-30 13:58

You shouldn't use regex to parse HTML, of course, but if you want to, this should separate the text content from the tags.
I have limited knowledge of PHP, so this only illustrates the procedure.

$tags = 
'  <
   (?:
       /?\w+\s*/?
     | \w+\s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*/?
     | !(?:DOCTYPE.*?|--.*?--)
   )>
';

$scripts =
'   <
   (?:
       (?:script|style) \s*
     | (?:script|style) \s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*
   )>
   .*?
   </(?:script|style)\s*>
';

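// "~" is used as the delimiter because the patterns contain "/".
// PHP has no /g flag: preg_replace_callback() and preg_match_all() already match globally.
$regex = "~ ($scripts | $tags) | ((?:(?!$tags).)+) ~xs";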

The replacement string is group 1 concatenated with the return value of your word-wrap function, which is passed the content (the group 2 string). So something like: replacement = \1 . textwrap( \2 )
Inside textwrap you decide what to do with the content.

Tested in Perl (by the way, it's very slow and watered down for clarity):

use strict;
use warnings;

my $tags = 
'  <
   (?:
       /?\w+\s*/?
     | \w+\s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*/?
     | !(?:DOCTYPE.*?|--.*?--)
   )>
';

my $scripts =
'   <
   (?:
       (?:script|style) \s*
     | (?:script|style) \s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*
   )>
   .*?
   </(?:script|style)\s*>
';

my $html = join '', <DATA>;

while ( $html =~ / ($scripts | $tags) | ((?:(?!$tags).)+) /xsg ) {
    if (defined $2 && $2 !~ /^\s+$/) {
        print $2,"\n";
    }
}
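
For PHP, the replacement step described above might be sketched with preg_replace_callback(). This is only an illustration under some assumptions: the function name wordwrapOutsideTags is made up, and the pattern is a simplified stand-in for the $scripts / $tags patterns above (it handles quoted attribute values, comments, and script/style blocks, but it is still a regex and will not cope with every kind of markup).

```php
<?php
// Hypothetical helper, not part of the original answer.
// Group 1 = a tag, comment, or whole script/style block (returned untouched).
// Group 2 = a run of text between tags (passed to wordwrap()).
function wordwrapOutsideTags(string $html, int $width, string $break = ' ', bool $cut = true): string
{
    $pattern = '~(<(?:script|style)\b.*?</(?:script|style)\s*>'  // script/style blocks
             . '|<!--.*?-->'                                      // comments
             . '|<(?:"[^"]*"|\'[^\']*\'|[^\'">])+>)'              // any other tag
             . '|([^<]+)~is';                                     // text content

    return preg_replace_callback(
        $pattern,
        function (array $m) use ($width, $break, $cut): string {
            // $m[2] is only present when the text branch matched.
            if (isset($m[2])) {
                return wordwrap($m[2], $width, $break, $cut);
            }
            return $m[1];
        },
        $html
    );
}
```

A call such as wordwrapOutsideTags($text, $cutLength, " ", $wordCut) would then take the place of the plain wordwrap() call. A stray "<" that does not start a tag matches neither branch and is simply left as it is.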


We Are One

#3 · 2019-07-30 14:10

Use any DOM parser capable of extracting the text nodes from the document. Iterate over the text nodes, apply wordwrap on them and write them back to their respective text nodes.

The approach is identical to the one given in

How to replace text URLs and exclude URLs in HTML tags?

except that instead of checking the text content for links, you apply wordwrap() to it.

A more general phrasing of your problem would be: "How to (selectively) fetch the text content of an HTML document in order to apply a function to it"
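
A minimal sketch of the text-node approach with PHP's bundled DOM extension could look like the following. The function name wordwrapTextNodes and the wrapper id "wordwrap-root" are made up for illustration, and it assumes the input is a UTF-8 HTML fragment rather than a full document.

```php
<?php
// Hypothetical helper: applies wordwrap() to text nodes only,
// so tag names and attribute values (href, src, ...) are never touched.
function wordwrapTextNodes(string $html, int $width = 75, string $break = "\n", bool $cut = false): string
{
    $previous = libxml_use_internal_errors(true); // keep sloppy markup from raising warnings
    $doc = new DOMDocument();
    // The XML declaration is a common workaround so loadHTML() reads the input as UTF-8.
    // The wrapper div lets us serialize just the fragment afterwards.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="wordwrap-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="wordwrap-root"]')->item(0);

    // Every text node under the wrapper, except script and style content.
    $textNodes = $xpath->query('.//text()[not(ancestor::script) and not(ancestor::style)]', $root);
    foreach ($textNodes as $node) {
        $node->data = wordwrap($node->data, $width, $break, $cut);
    }

    // Serialize the wrapper's children, dropping the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Note that wordwrap() counts bytes rather than characters, so multibyte text may need a different wrapping function inside the loop. The parser may also normalize the markup slightly when it is written back out.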

