I use Owasp Anti samy with Ebay policy file to prevent XSS attacks on my website.
I also use Hibernate search to index my objects.
When I use this code:
String html = "special word: été";
// use the Ebay configuration file
Policy policy = Policy.getInstance(xssPolicyFile.getInputStream());
AntiSamy as = new AntiSamy();
CleanResults cr = as.scan(html, policy);
// result is now : "special word: été"
result = cr.getCleanHTML();
As you can see all chars "é" has been transformed to their html entity equivalent "é
"
My page is on UTF-8, so I don't need this transformation. Moreover, when I index this text with Hibernate Search, it indexes the word with html entities, so I can't find word "été" on my index.
How can I force antisamy to not transform special chars to their html entity equivalent ?
thanks
PS: an issue has been opened : http://code.google.com/p/owaspantisamy/issues/detail?id=99
I ran into the same problem this morning.
I have encapsulated antisamy in a class and I use apache StringEscapeUtil from apache common-lang to restore special characters.
CleanResults cleanResults = antiSamy.scan(taintedHtml);
cleanedHtml = cleanResults.getCleanHTML();
return StringEscapeUtils.unescapeHtml(cleanedHtml)
The result is a cleaned up HTML without the HTML escaping of special characters.
Hope this helps.
Like Mohamad said it in a comment, Antisamy has just released a new directive named : entityEncodeIntlChars
here is the detail : http://code.google.com/p/owaspantisamy/source/detail?r=240
It seems that this directive solves the problem.
After scouring the AntiSamy source code, I found no way of changing this behavior apart from modifying AntiSamy.
Check out this one: http://code.google.com/p/owaspantisamy/source/browse/#svn/trunk/dotNet/current/source/owaspantisamy/html/scan
Grab the source and notice that key classes (AntiSamyDOMScanner, CleanResults) use standard framework objects (like XmlDocument). Compile and run with the binary you compiled - so that you can see everything in a debugger - as in which of the major classes actually corrupts your data. With that in hand you'll be able to either change a few properties on major objects to make it stop or inject your own post-processing to revert the wrongdoing (say with a regexp). Latter you can expose that as additional top-level property, say one named NoMess :-)
Chances are that behavior in that respect is different between languages (there's 3 in that trunk) but the same tactics will work no matter which one you have to deal with.