Note: I take care of SQL injection and output escaping elsewhere - this question is about input filtering only, thanks.
I'm in the middle of refactoring my user input filtering functions. Before passing the GET/POST parameter to a type-specific filter with filter_var() I do the following:
- check the parameter encoding with mb_detect_encoding()
- convert to UTF-8 with iconv() (with //IGNORE) if it's not ASCII or UTF-8
- clean white-spaces with a function found on GnuCitizen.org
- pass the result thru strip_tags() - no tags allowed at all, Markdown only
Now the question: does it still make sense to pass the parameter to a filter like htmLawed or HTML Purifier, or can I think of the input as safe? It seems to me that these two differ mostly on the granularity of allowed HTML elements and attributes (which I'm not interested into, as I remove everything), but htmLawed docs have a section about 'dangerous characters' that suggests there might be a reason to use it. In this case, what would be a sane configuration for it?