Cleaning HTML by removing extra/redundant formatti

2019-01-16 18:17发布

I have been using CKEditor wysiwyg editor for a website where users are allowed to use the HTML editor to add some comments. I ended up having some extremely redundant nested HTML code in my database that is slowing down the viewing/editing of these comments.

I have comments that look like this (this is a very small example. I have comments with over 100 nested tags):

<p>
 <strong>
  <span style="font-size: 14px">
   <span style="color: #006400">
     <span style="font-size: 14px">
      <span style="font-size: 16px">
       <span style="color: #006400">
        <span style="font-size: 14px">
         <span style="font-size: 16px">
          <span style="color: #006400">This is a </span>
         </span>
        </span>
       </span>
      </span>
     </span>
    </span>
    <span style="color: #006400">
     <span style="font-size: 16px">
      <span style="color: #b22222">Test</span>
     </span>
    </span>
   </span>
  </span>
 </strong>
</p>

My questions are:

  • Is there any library/code/software that can do a smart (i.e. format-aware) clean-up of the HTML code, removing all redundant tags that have no effect on the formatting (because they're overridden by inner tags) ? I've tried many existing online solutions (such as HTML Tidy). None of them do what I want.

  • If not, I'll need to write some code for HTML parsing and cleaning. I am planning to use PHP Simple HTML DOM to traverse the HTML tree and find all tags that have no effect. Do you suggest any other HTML parser that is more suitable for my purpose?

Thanks

.

Update:

I have written some code to analyze the HTML code that I have. All the HTML tags that I have are:

  • <span> with styles for font-size and/or color
  • <font> with attributes color and/or size
  • <a> for links (with href)
  • <strong>
  • <p> (single tag to wrap the whole comment)
  • <u>

I can easily write some code to convert the HTML code into bbcode (e.g. [b], [color=blue], [size=3], etc). So I above HTML will become something like:

[b][size=14][color=#006400][size=14][size=16][color=#006400]
[size=14][size=16][color=#006400]This is a [/color][/size]
[/size][/color][/size][/size][color=#006400][size=16]
[color=#b22222]Test[/color][/size][/color][/color][/size][/b]

The question now is: Is there an easy way (algorithm/library/etc) to clean-up the messy (as messy as that original HTML) bbcode that will be generated?

thanks again

11条回答
Juvenile、少年°
2楼-- · 2019-01-16 18:51

I don't have time to finish this... maybe someone else can help. This javascript removes exact duplicate tags and disallowed tags too...

There are a few problems/things to be done,
1) regenerated tags need to be closed
2) it will only remove a tag if the tag-name & attributes are identical to another within that nodes children, so its not 'smart' enough to remove all unnecessary tags.
3) it will look through the allowed CSS variables and extract ALL those values from an element, and then write it to the output HTML, so for example:

var allowed_css = ["color","font-size"];
<span style="font-size: 12px"><span style="color: #123123">

Will be translated into:

<span style="color:#000000;font-size:12px;"> <!-- inherited colour from parent -->
<span style="color:#123123;font-size:12px;"> <!-- inherited font-size from parent -->

Code:

<html>

<head>
<script type="text/javascript">
var allowed_css = ["font-size", "color"];
var allowed_tags = ["p","strong","span","br","b"];
function initialise() {
    var comment = document.getElementById("comment");
    var commentHTML = document.getElementById("commentHTML");
    var output = document.getElementById("output");
    var outputHTML = document.getElementById("outputHTML");
    print(commentHTML, comment.innerHTML, false);
    var out = getNodes(comment);
    print(output, out, true);
    print(outputHTML, out, false);
}
function print(out, stringCode, allowHTML) {
    out.innerHTML = allowHTML? stringCode : getHTMLCode(stringCode);
}
function getHTMLCode(stringCode) {
    return "<code>"+((stringCode).replace(/</g,"&lt;")).replace(/>/g,"&gt;")+"</code>";
}
function getNodes(elem) {
    var output = "";
    var nodesArr = new Array(elem.childNodes.length);
    for (var i=0; i<nodesArr.length; i++) {
        nodesArr[i] = new Array();
        nodesArr[i].push(elem.childNodes[i]);
        getChildNodes(elem.childNodes[i], nodesArr[i]);
        nodesArr[i] = removeDuplicates(nodesArr[i]);
        output += nodesArr[i].join("");
    }
    return output;
}
function removeDuplicates(arrayName) {
    var newArray = new Array();
    label:
    for (var i=0; i<arrayName.length; i++) {  
        for (var j=0; j<newArray.length; j++) {
            if(newArray[j]==arrayName[i])
                continue label;
        }
        newArray[newArray.length] = arrayName[i];
    }
    return newArray;
}
function getChildNodes(elemParent, nodesArr) {
    var children = elemParent.childNodes;
    for (var i=0; i<children.length; i++) {
        nodesArr.push(children[i]);
        if (children[i].hasChildNodes())
            getChildNodes(children[i], nodesArr);
    }
    return cleanHTML(nodesArr);
}
function cleanHTML(arr) {
    for (var i=0; i<arr.length; i++) {
        var elem = arr[i];
        if (elem.nodeType == 1) {
            if (tagNotAllowed(elem.nodeName)) {
                arr.splice(i,1);
                i--;
                continue;
            }
            elem = "<"+elem.nodeName+ getAttributes(elem) +">";
        }
        else if (elem.nodeType == 3) {
            elem = elem.nodeValue;
        }
        arr[i] = elem;
    }
    return arr;
}
function tagNotAllowed(tagName) {
    var allowed = " "+allowed_tags.join(" ").toUpperCase()+" ";
    if (allowed.search(" "+tagName.toUpperCase()+" ") == -1)
        return true;
    else
        return false;
}
function getAttributes(elem) {
    var attributes = "";
    for (var i=0; i<elem.attributes.length; i++) {
      var attrib = elem.attributes[i];
      if (attrib.specified == true) {
        if (attrib.name == "style") {
            attributes += " style=\""+getCSS(elem)+"\"";
        } else {
            attributes += " "+attrib.name+"=\""+attrib.value+"\"";
        }
      }
    }
    return attributes
}
function getCSS(elem) {
    var style="";
    if (elem.currentStyle) {
        for (var i=0; i<allowed_css.length; i++) {
            var styleProp = allowed_css[i];
            style += styleProp+":"+elem.currentStyle[styleProp]+";";
        }
    } else if (window.getComputedStyle) {
        for (var i=0; i<allowed_css.length; i++) {
            var styleProp = allowed_css[i];
            style += styleProp+":"+document.defaultView.getComputedStyle(elem,null).getPropertyValue(styleProp)+";";
        }
    }
    return style;
}
</script>
</head>

<body onload="initialise()">

<div style="float: left; width: 300px;">
<h2>Input</h2>
<div id="comment">
<p> 
 <strong> 
  <span style="font-size: 14px"> 
   <span style="color: #006400"> 
     <span style="font-size: 14px"> 
      <span style="font-size: 16px"> 
       <span style="color: #006400"> 
        <span style="font-size: 14px"> 
         <span style="font-size: 16px"> 
          <span style="color: #006400">This is a </span> 
         </span> 
        </span> 
       </span> 
      </span> 
     </span> 
    </span> 
    <span style="color: #006400"> 
     <span style="font-size: 16px"> 
      <span style="color: #b22222"><b>Test</b></span> 
     </span> 
    </span> 
   </span> 
  </span> 
 </strong> 
</p> 
<p>Second paragraph.
<span style="color: #006400">This is a span</span></p>
</div>
<h3>HTML code:</h3>
<div id="commentHTML"> </div>
</div>

<div style="float: left; width: 300px;">
<h2>Output</h2>
<div id="output"> </div>
<h3>HTML code:</h3>
<div id="outputHTML"> </div>
</div>

<div style="float: left; width: 300px;">
<h2>Tasks</h2>
<big>
<ul>
<li>Close Tags</li>
<li>Ignore inherited CSS style in method getCSS(elem)</li>
<li>Test with different input HTML</li>
</ul>
</big>
</div>

</body>

</html>
查看更多
可以哭但决不认输i
3楼-- · 2019-01-16 18:52

Cleanup HTML collapses tags which seems to be what you are asking for. However, it creates a validated HTML document with CSS moved to inline styles. Many other HTML formatters won't do this because it changes the structure of the HTML document.

查看更多
神经病院院长
4楼-- · 2019-01-16 18:59

Here is a solution that uses the browser to get the nested element's properties. No need to cascade the properties up, since the css computed styles is ready to read from the browser.

Here is an example: http://jsfiddle.net/mmeah/fUpe8/3/

var fixedCode = readNestProp($("#redo"));
$("#simp").html( fixedCode );

function readNestProp(el){
 var output = "";
 $(el).children().each( function(){
    if($(this).children().length==0){
        var _that=this;
        var _cssAttributeNames = ["font-size","color"];
        var _tag = $(_that).prop("nodeName").toLowerCase();
        var _text = $(_that).text();
        var _style = "";
        $.each(_cssAttributeNames, function(_index,_value){
            var css_value = $(_that).css(_value);
            if(typeof css_value!= "undefined"){
                _style += _value + ":";
                _style += css_value + ";";
            }
        });
        output += "<"+_tag+" style='"+_style+"'>"+_text+"</"+_tag+">";
    }else if(
        $(this).prop("nodeName").toLowerCase() !=
        $(this).find(">:first-child").prop("nodeName").toLowerCase()
    ){
        var _tag = $(this).prop("nodeName").toLowerCase();
        output += "<"+_tag+">" + readNestProp(this) + "</"+_tag+">";
    }else{
        output += readNestProp(this);
    };
 });
 return output;
}

A better solution to typing in all possible css attributes like:
var _cssAttributeNames = ["font-size","color"];
Is to use a solution like mentioned here: Can jQuery get all CSS styles associated with an element?

查看更多
欢心
5楼-- · 2019-01-16 19:00

I know you're looking for an HTML DOM cleanser, but maybe js can help?

function getSpans(){ 
var spans=document.getElementsByTagName('span') 
    for (var i=0;i<spans.length;i++){ 
    spans[i].removeNode(true);
        if(i == spans.length) {
        //add the styling you want here
        }
    } 
} 
查看更多
孤傲高冷的网名
6楼-- · 2019-01-16 19:00

If you want to use jQuery, try this:

<p>
<strong>
  <span style="font-size: 14px">
   <span style="color: #006400">
     <span style="font-size: 14px">
      <span style="font-size: 16px">
       <span style="color: #006400">
        <span style="font-size: 14px">
         <span style="font-size: 16px">
          <span style="color: #006400">This is a </span>
         </span>
        </span>
       </span>
      </span>
     </span>
    </span>
    <span style="color: #006400">
     <span style="font-size: 16px">
      <span style="color: #b22222">Test</span>
     </span>
    </span>
   </span>
  </span>
 </strong>
</p>
<br><br>
<div id="out"></div> <!-- Just to print it out -->


$("span").each(function(i){
    var ntext = $(this).text();
    ntext = $.trim(ntext.replace(/(\r\n|\n|\r)/gm," "));
    if(i==0){
        $("#out").text(ntext);
    }        
});

You get this as a result:

<div id="out">This is a                                                                    Test</div>

You could then format it anyway you want. Hope that helps you think a little differently about it...

查看更多
爱情/是我丢掉的垃圾
7楼-- · 2019-01-16 19:03

Rather than waste your precious server time parsing bad HTML I would suggest you fix the root of the problem instead.

A simple solution would be to limit the characters each commentor could make to include the entire html char count as opposed to just the text count (at least that would stop infinately-large nested tags).

You could improve on that by allowing the user to switch between HTML-view and text-view - I'm sure most people would see a load of junk when in the HTML view and simply CTRL+A & DEL it.

I think it would be best if you had your own formatting chars you would parse and replace with the formatting i.e. like stack-overflow has **bold text**, visible to the poster. Or just BB-code would do, visibile to the poster.

查看更多
登录 后发表回答