I am using HTML Purifier (http://htmlpurifier.org/)
I just want to remove <script> tags only.
I don't want to remove inline formatting or any other things.
How can I achieve this?
One more thing: is there any other way to remove script tags from HTML?
I had been struggling with this question. I discovered you only really need one function: explode('>', $html). The single common denominator of any tag is < and >. After that, it's usually quotation marks ("). You can extract information easily once you find the common denominator. This is what I came up with:
I see this really only working for script tags because you will never have nested script tags. Of course, you can easily add more code that does the same check and gather nested tags.
I call it accordion coding. implode() and explode() are the easiest ways to get your logic flowing if you have a common denominator.
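To illustrate the explode('>') idea described above, here is a minimal sketch. It is not the original poster's code; the function name strip_scripts_by_explode is made up for this example, and it assumes script tags are never nested. An unclosed script tag causes everything after it to be dropped.

```php
<?php
// Hypothetical helper: removes <script>...</script> by splitting the markup on '>'.
function strip_scripts_by_explode(string $html): string
{
    $chunks = explode('>', $html);
    $last = count($chunks) - 1;
    $out = '';
    $inScript = false;

    foreach ($chunks as $i => $chunk) {
        if ($inScript) {
            // Inside a script: discard chunks until one ends with "</script".
            if (preg_match('~</script\s*$~i', $chunk)) {
                $inScript = false;
            }
            continue;
        }

        // Every chunk except the last one was terminated by a '>'.
        $lt = strrpos($chunk, '<');
        if ($i < $last && $lt !== false
            && preg_match('~^<script\b~i', substr($chunk, $lt))) {
            // Keep the text before the opening tag, then start skipping.
            $out .= substr($chunk, 0, $lt);
            $inScript = true;
            continue;
        }

        // Ordinary chunk: put back the '>' that explode() removed.
        $out .= $chunk . ($i < $last ? '>' : '');
    }

    return $out;
}
```

Because the skipping state only ends on a chunk that finishes with "</script", a '>' inside the script body (for example in `a > b`) does not end the removal early.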
An example modifying ctf0's answer. This should only do the preg_replace once, but it also checks for errors and blocks the character code for the forward slash.
If you are using PHP 7, you can use the null coalescing operator to simplify it even more.
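The snippet for this answer is not shown here, so the following is only a sketch of the idea, not the poster's code. The function name and the exact pattern are assumptions: the pattern matches from "script" to "/script" and also accepts the slash written as the character code &#47; or &#x2F;. preg_replace() returns null on failure, which is what makes the null coalescing fallback work.

```php
<?php
// Hypothetical helper; the pattern illustrates the "script ... /script" idea.
function remove_script_spans(string $html): string
{
    // i: case-insensitive, u: treat input as UTF-8, s: dot also matches newlines.
    // The slash may appear literally or as the character code &#47; / &#x2F;.
    $pattern = '~script.*?(?:/|&#47;|&#x0*2F;)script~ius';

    // preg_replace() returns null on error (for example invalid UTF-8 with the
    // u flag), so fall back to the untouched input in that case.
    return preg_replace($pattern, '', $html) ?? $html;
}
```

Note that this kind of pattern leaves the surrounding arrows behind (`<script>x</script>` becomes `<>`) and will also remove ordinary text that happens to contain "script ... /script". Returning the untouched input on error also means nothing was removed, so the caller should not treat the result as clean in that case.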
I would use BeautifulSoup if it's available. Makes this sort of thing very easy.
Don't try to do it with regexps. That way lies madness.
The problem with the script tag arrows is that they can have more than one variant,
so instead of creating a pattern array with a bazillion variants, IMHO a better solution would be:
This will remove anything that looks like script.../script, regardless of the arrow code/variant. You can test it here: https://regex101.com/r/lK6vS8/1
A simple way by manipulating the string.
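The string-manipulation snippet itself is not shown, so here is one possible sketch of the approach using only stripos(), strpos(), and substr(). The function name is hypothetical, and the sketch assumes non-nested script tags. It matches on the literal prefix "<script", so it would also catch a made-up tag such as "<scripture>".

```php
<?php
// Hypothetical helper: cuts out each "<script ... </script>" span by position.
function strip_scripts_by_position(string $html): string
{
    $offset = 0;
    // stripos() is case-insensitive, so <SCRIPT> is found as well.
    while (($start = stripos($html, '<script', $offset)) !== false) {
        $close = stripos($html, '</script', $start);
        if ($close === false) {
            // Unclosed script tag: drop everything from the opening tag onward.
            return substr($html, 0, $start);
        }

        // Find the '>' that ends the closing tag.
        $end = strpos($html, '>', $close);
        $end = ($end === false) ? strlen($html) : $end + 1;

        // Splice the span out and continue searching from the same position.
        $html = substr($html, 0, $start) . substr($html, $end);
        $offset = $start;
    }

    return $html;
}
```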
Because this question is tagged with regex, I'm going to answer with a poor man's solution for this situation:
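The pattern for this answer is not shown, so the following is a sketch of what such a quick regex fix commonly looks like, not necessarily the poster's exact expression. The function name is hypothetical.

```php
<?php
// Hypothetical helper: quick regex removal of <script>...</script> blocks.
function strip_script_tags_regex(string $html): string
{
    // i: case-insensitive; s: let "." match newlines so multi-line scripts match.
    // \b keeps the pattern from matching tags that merely start with "script".
    $clean = preg_replace('#<script\b[^>]*>.*?</script\s*>#is', '', $html);

    // preg_replace() returns null on error; fall back to the original input.
    return $clean ?? $html;
}
```

Inline formatting such as style attributes on other elements is left untouched, since only the script element and its contents are matched.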
However, regular expressions are not for parsing HTML/XML. Even if you write the perfect expression, it will break eventually, so it's not worth it. In some cases it's useful for quickly fixing some markup, but as with any quick fix, forget about security. Use regex only on content/markup you trust.
Remember, anything a user inputs should be considered unsafe.
A better solution here would be to use DOMDocument, which is designed for this. Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable and (nearly) safe it is to do the same:
I have removed the HTML intentionally because even this can bork.
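Since the DOMDocument snippet is not shown, here is a minimal sketch of the approach, assuming the dom extension is available and the input is non-empty. The function name is hypothetical. Note that loadHTML() wraps fragments in a full document, so saveHTML() returns the markup with a doctype, html, and body around it.

```php
<?php
// Hypothetical helper: removes all <script> elements using DOMDocument.
function strip_scripts_dom(string $html): string
{
    // Collect libxml warnings internally instead of emitting them on sloppy markup.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so copy it to an array
    // before removing nodes; otherwise elements get skipped.
    foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    return $doc->saveHTML();
}
```

Copying the node list first is the design choice that matters here: removing nodes while indexing into the live list shifts the remaining items and leaves some script elements in place.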