I have been using the CKEditor WYSIWYG editor on a website where users can add comments through the HTML editor. I ended up with extremely redundant nested HTML in my database, which is slowing down the viewing and editing of these comments.
I have comments that look like this (this is a very small example; I have comments with over 100 nested tags):
<p>
<strong>
<span style="font-size: 14px">
<span style="color: #006400">
<span style="font-size: 14px">
<span style="font-size: 16px">
<span style="color: #006400">
<span style="font-size: 14px">
<span style="font-size: 16px">
<span style="color: #006400">This is a </span>
</span>
</span>
</span>
</span>
</span>
</span>
<span style="color: #006400">
<span style="font-size: 16px">
<span style="color: #b22222">Test</span>
</span>
</span>
</span>
</span>
</strong>
</p>
My questions are:
Is there any library/code/software that can do a smart (i.e. format-aware) clean-up of the HTML code, removing all redundant tags that have no effect on the formatting (because they're overridden by inner tags)? I've tried many existing online solutions (such as HTML Tidy). None of them does what I want.
If not, I'll need to write some code for HTML parsing and cleaning. I am planning to use PHP Simple HTML DOM to traverse the HTML tree and find all tags that have no effect. Do you suggest any other HTML parser that is more suitable for my purpose?
Thanks.
Update:
I have written some code to analyze the HTML code that I have. All the HTML tags that I have are:
- <span> with styles for font-size and/or color
- <font> with attributes color and/or size
- <a> for links (with href)
- <strong>
- <p> (single tag to wrap the whole comment)
- <u>
I can easily write some code to convert the HTML code into bbcode (e.g. [b], [color=blue], [size=3], etc.). So the above HTML will become something like:
[b][size=14][color=#006400][size=14][size=16][color=#006400]
[size=14][size=16][color=#006400]This is a [/color][/size]
[/size][/color][/size][/size][color=#006400][size=16]
[color=#b22222]Test[/color][/size][/color][/color][/size][/b]
The question now is: is there an easy way (algorithm, library, etc.) to clean up the messy bbcode that will be generated (it will be as messy as the original HTML)?
Thanks again.
This may not address your exact problem, but what I would do in your place is simply eliminate all HTML tags and retain only plain text and line breaks.
After that, switch to Markdown or bbcode to format your comments better. A WYSIWYG editor is rarely useful.
The reason is that, as you said, all you have in the comments is presentational data, which frankly isn't that important.
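As a rough illustration of the "keep only plain text and line breaks" idea, here is a minimal sketch using Python's standard html.parser module. The names TextOnly and strip_to_text are made up for this example, and treating </p> and <br> as line breaks is an assumption about what should count as a line break; the same approach should port to PHP or any other language with an event-based HTML parser.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keeps text content only; <br> and </p> become newlines (an assumption)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_to_text(html):
    """Return the text of an HTML fragment with every tag removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

Calling strip_to_text on a comment returns a string that can be stored as-is or used as the starting point for Markdown or bbcode. Whitespace between tags is kept verbatim, so pretty-printed input may need extra normalisation.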
Instead of parsing the HTML with DOM, consider SAX (http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm).
SAX parses a document from the beginning and sends events such as 'start of element' and 'end of element' to callback functions that you define.
You can then build a kind of stack from these events. Whenever you reach text, you can record the effect of the current stack on that text.
After that, you process the recorded results to build new HTML that contains only the effects you want.
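The event-and-stack idea can be sketched roughly as follows. This is not a SAX parser in the strict sense: it uses Python's standard html.parser, which delivers the same kind of start/end/text events. The names StyleFlattener, render and flatten are invented for this sketch, and it makes several assumptions: only <strong>, <u> and <span style="color/font-size"> are interpreted (<font> and <a> are left out), whitespace-only text between tags is dropped, and the result is wrapped in a single <p>.

```python
from html import escape
from html.parser import HTMLParser


class StyleFlattener(HTMLParser):
    """Records each text run together with the formatting in effect for it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.current = {}  # effective formatting at this point in the document
        self.saved = []    # stack of (tag, formatting before that tag opened)
        self.runs = []     # list of [formatting, text]

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            return  # void elements are ignored in this sketch
        self.saved.append((tag, self.current))
        fmt = dict(self.current)
        if tag == "strong":
            fmt["bold"] = True
        elif tag == "u":
            fmt["underline"] = True
        elif tag == "span":
            for decl in (dict(attrs).get("style") or "").split(";"):
                name, _, value = decl.partition(":")
                name = name.strip().lower()
                if name in ("color", "font-size") and value.strip():
                    fmt[name] = value.strip()  # inner value overrides outer
        self.current = fmt

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; a stray close tag is ignored.
        for i in range(len(self.saved) - 1, -1, -1):
            if self.saved[i][0] == tag:
                self.current = self.saved[i][1]
                del self.saved[i:]
                break

    def handle_data(self, data):
        if not data.strip():
            return  # skip indentation between tags
        if self.runs and self.runs[-1][0] == self.current:
            self.runs[-1][1] += data  # same formatting: extend previous run
        else:
            self.runs.append([self.current, data])


def render(runs):
    """Rebuild HTML with at most one span per run of identical formatting."""
    out = []
    for fmt, text in runs:
        html = escape(text, quote=False)
        css = "; ".join(f"{k}: {fmt[k]}" for k in ("color", "font-size") if k in fmt)
        if css:
            html = f'<span style="{escape(css)}">{html}</span>'
        if fmt.get("underline"):
            html = f"<u>{html}</u>"
        if fmt.get("bold"):
            html = f"<strong>{html}</strong>"
        out.append(html)
    return "<p>" + "".join(out) + "</p>"


def flatten(html):
    parser = StyleFlattener()
    parser.feed(html)
    parser.close()
    return render(parser.runs)
```

Each run is rendered independently, so a wrapper shared by neighbouring runs (such as <strong>) is repeated rather than hoisted; factoring out common wrappers would be a further step. The nesting depth of the output no longer depends on the nesting depth of the input, which is the main point of the approach.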
You should look into HTMLPurifier; it's a great tool for parsing HTML and removing unnecessary and unsafe content from it. Look into the configuration options for removing empty spans and the like. I admit it can be a bit of a beast to configure, but that's only because it's so versatile.
It's also quite heavy, so you'd want to save its output to the database (as opposed to reading the raw HTML from the database and then parsing it with the purifier every time).
Introduction
The best solution I have seen so far is HTML Tidy (http://tidy.sourceforge.net/). It also ensures that the HTML document is XHTML compatible.
Example
If you RUN
Output
You can get the CSS
Output
Or the full HTML
Output
Function Used
================================================
Edit 1 : Dirty Hack (Not Recommended)
================================================
Based on your last comment, it seems you want to retain the deprecated style.
HTML Tidy might not allow you to do that, since it is deprecated, but you can do this
Output
Class Used
I remember that Adobe (Macromedia) Dreamweaver, at least in slightly older versions, had a 'Clean up HTML' option and also a 'Clean up Word HTML' option to remove redundant tags etc. from any web page.