I have strings like:
Avery® Laser &amp; Inkjet Self-Adhesive
I need to convert them to
Avery Laser & Inkjet Self-Adhesive.
I.e. remove the special characters and convert HTML special chars (entities) to regular ones.
You can use the StringEscapeUtils class from the Apache Commons Text project.
Maybe you can use something like:
In some project I did something like:
In case you want to mimic what the PHP function htmlspecialchars_decode() does, use the PHP function get_html_translation_table() to dump the table and then use Java code like:
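A minimal sketch of that idea, not the original poster's code. The class name HtmlSpecialCharsDecoder is hypothetical, and the five table entries are my assumption of what get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES) would dump; check them against your own PHP output. It needs Java 9 or newer and decodes in a single pass, so "&amp;lt;" becomes "&lt;" rather than "<".

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes only the entities listed in TABLE.
final class HtmlSpecialCharsDecoder {

    // Assumed contents of the dumped PHP translation table (entity -> character).
    private static final Map<String, String> TABLE = Map.of(
            "&amp;", "&",
            "&quot;", "\"",
            "&#039;", "'",
            "&lt;", "<",
            "&gt;", ">");

    // Matches exactly the entities that appear as keys in TABLE.
    private static final Pattern ENTITY = Pattern.compile("&(?:amp|quot|#039|lt|gt);");

    // Replaces each known entity once, left to right; anything else is left untouched.
    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(TABLE.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Laser &amp; Inkjet")); // prints "Laser & Inkjet"
    }
}
```

Entities that are not in the table (for example &reg; or &nbsp;) pass through unchanged, which is also how htmlspecialchars_decode() behaves as far as I know.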
First use StringEscapeUtils#unescapeHtml4() (or #unescapeXml(), depending on the original format) to unescape the &amp; into a &. Then use String#replaceAll() with [^\x20-\x7e] to get rid of characters which aren't inside the printable ASCII range.
Summarized:
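A minimal sketch of those two steps that runs without extra dependencies. The class name CleanProductName and the method unescapeAmp() are hypothetical; unescapeAmp() is only a stand-in that handles &amp;, so in a real project you would presumably call org.apache.commons.text.StringEscapeUtils.unescapeHtml4() in its place.

```java
// Hypothetical demo class for the unescape-then-strip approach.
final class CleanProductName {

    // Stand-in for StringEscapeUtils.unescapeHtml4(): handles &amp; only.
    static String unescapeAmp(String s) {
        return s.replace("&amp;", "&");
    }

    static String clean(String raw) {
        // Step 1: turn HTML entities back into regular characters.
        String unescaped = unescapeAmp(raw);
        // Step 2: drop everything outside the printable ASCII range 0x20-0x7E.
        return unescaped.replaceAll("[^\\x20-\\x7e]", "");
    }

    public static void main(String[] args) {
        // \u00AE is the registered sign from the question.
        System.out.println(clean("Avery\u00AE Laser &amp; Inkjet Self-Adhesive"));
    }
}
```

Note that the character class also removes legitimate non-ASCII letters such as é or ü, so it only fits when plain ASCII output is really what you want.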
...which produces
Avery Laser & Inkjet Self-Adhesive
(without the trailing dot as in your example, but that wasn't present in the original ;) )
That said, this looks more like a request for a workaround than a request for a solution. If you elaborate on the functional requirement and/or where this string originated, we may be able to provide the right solution. The ® looks like it was caused by using the wrong encoding to read the string in, and the &amp; looks like it was caused by using a text-based parser to read the string in instead of a full-fledged HTML parser.
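To illustrate the encoding point, here is a small sketch of how a stray character can appear. It assumes (my assumption, not something stated in the question) that the source bytes were UTF-8 and were read as ISO-8859-1; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: the same bytes decoded with the wrong and the right charset.
final class EncodingMismatchDemo {

    // Encodes as UTF-8, then decodes as ISO-8859-1 (the assumed mistake).
    static String misread(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encodes and decodes with the same charset, so the text survives intact.
    static String readCorrectly(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Avery\u00AE"; // "Avery" followed by the registered sign
        // In UTF-8 the sign is the two bytes C2 AE; read as ISO-8859-1 they become
        // two characters, U+00C2 followed by U+00AE.
        System.out.println(misread(original).equals("Avery\u00C2\u00AE")); // prints true
        System.out.println(readCorrectly(original).equals(original));      // prints true
    }
}
```

If that is what happened, reading the input with the correct charset fixes the problem at the source, and no characters need to be stripped afterwards.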