I have strings like:
Avery® Laser &amp; Inkjet Self-Adhesive
I need to convert them to
Avery Laser & Inkjet Self-Adhesive.
That is, remove special characters and convert HTML special characters (entities) to regular ones.
First use StringEscapeUtils#unescapeHtml4() (or #unescapeXml(), depending on the original format) to unescape the &amp; into a &. Then use String#replaceAll() with [^\x20-\x7e] to get rid of characters which aren't inside the printable ASCII range.
Summarized:
String clean = StringEscapeUtils.unescapeHtml4(dirty).replaceAll("[^\\x20-\\x7e]", "");
..which produces
Avery Laser & Inkjet Self-Adhesive
(without the trailing dot as in your example, but that wasn't present in the original ;) )
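To make the effect of the second step concrete, here is a minimal JDK-only sketch of the stripping regex on its own. The class and method names are hypothetical and only serve the illustration. Note that the pattern removes every character outside the range from space to tilde, so accented letters, tabs and line breaks disappear too, which may or may not be what you want.

```java
class PrintableAsciiDemo {
    // Removes every character outside the printable ASCII range (space through tilde).
    static String stripNonPrintableAscii(String s) {
        return s.replaceAll("[^\\x20-\\x7e]", "");
    }

    public static void main(String[] args) {
        // The registered sign (U+00AE) is outside the range and is removed.
        System.out.println(stripNonPrintableAscii("Avery\u00AE Laser & Inkjet"));
        // An accented e (U+00E9) is removed as well, so the word loses a letter.
        System.out.println(stripNonPrintableAscii("Caf\u00E9"));
        // Tabs are below 0x20 and are removed, too.
        System.out.println(stripNonPrintableAscii("a\tb"));
    }
}
```

If you need to keep accented letters, this regex is too aggressive and a narrower character class would be the safer choice.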
That said, this looks more like a request for a workaround than for a solution. If you elaborate on the functional requirement and/or where this string originated, we may be able to provide the right solution. The ® looks like it was caused by using the wrong encoding to read the string in, and the &amp; looks like it was caused by using a text-based parser to read the string in instead of a full-fledged HTML parser.
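As an illustration of the encoding point, the following sketch shows how a character such as ® gets garbled when bytes are decoded with a different charset than they were written in. The charset pair (UTF-8 written, ISO-8859-1 read) is an assumption chosen for the demo, since the question does not say which encodings are involved; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1,
    // simulating a reader that assumes the wrong charset.
    static String misread(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses the mistake: recovers the original bytes and decodes them as UTF-8.
    static String repair(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Avery\u00AE";      // "Avery" followed by the registered sign
        String garbled = misread(original);   // the sign turns into two characters
        System.out.println(garbled.length() - original.length()); // prints 1
        System.out.println(repair(garbled).equals(original));     // prints true
    }
}
```

If this is what happened to your data, fixing the charset at the point where the string is read is cleaner than stripping characters afterwards.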
You can use the StringEscapeUtils class from the Apache Commons Text project.
Maybe you can use something like:
yourTxt = yourTxt.replaceAll("&amp;", "&");
In some project I did something like:
// Replaces the HTML entities for Spanish accented letters with the literal characters.
public String replaceAcutesHTML(String str) {
    str = str.replaceAll("&aacute;", "á");
    str = str.replaceAll("&eacute;", "é");
    str = str.replaceAll("&iacute;", "í");
    str = str.replaceAll("&oacute;", "ó");
    str = str.replaceAll("&uacute;", "ú");
    str = str.replaceAll("&Aacute;", "Á");
    str = str.replaceAll("&Eacute;", "É");
    str = str.replaceAll("&Iacute;", "Í");
    str = str.replaceAll("&Oacute;", "Ó");
    str = str.replaceAll("&Uacute;", "Ú");
    str = str.replaceAll("&ntilde;", "ñ");
    str = str.replaceAll("&Ntilde;", "Ñ");
    return str;
}
In case you want to mimic what the PHP function htmlspecialchars_decode does, use the PHP function get_html_translation_table() to dump the table and then use Java code like:
import java.util.LinkedHashMap;
import java.util.Map;

// Insertion-ordered so that &amp; is decoded last; decoding it first
// would turn "&amp;lt;" into "<" instead of "&lt;".
static final Map<String, String> html_specialchars_table = new LinkedHashMap<>();
static {
    html_specialchars_table.put("&lt;", "<");
    html_specialchars_table.put("&gt;", ">");
    html_specialchars_table.put("&amp;", "&");
}

static String htmlspecialchars_decode_ENT_NOQUOTES(String s) {
    for (Map.Entry<String, String> entry : html_specialchars_table.entrySet()) {
        // replace() treats the key literally; replaceAll() would parse it as a regex.
        s = s.replace(entry.getKey(), entry.getValue());
    }
    return s;
}
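The table-driven approaches above apply one replacement after another, so the result can depend on the order in which entries are processed (for example, input like &amp;lt; may be decoded twice). A possible alternative, sketched here with only the JDK, is to scan the string once and look each matched entity up in the table. The class name, the three-entry table and the letters-only entity pattern are assumptions for the illustration; unknown entities are left untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SinglePassEntityDecoder {
    // Illustrative table; extend it with whatever entities your input contains.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&amp;", "&");
    }

    // Matches named entities such as &amp; (an ampersand, letters, a semicolon).
    private static final Pattern ENTITY = Pattern.compile("&[A-Za-z]+;");

    static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Unknown entities map to themselves, so they pass through unchanged.
            String replacement = ENTITIES.getOrDefault(m.group(), m.group());
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Laser &amp; Inkjet")); // prints: Laser & Inkjet
        System.out.println(decode("&amp;lt;b&gt;"));      // prints: &lt;b>
    }
}
```

Because each entity is consumed exactly once, the output of one replacement is never re-scanned, and the iteration order of the table no longer matters.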