When I parse the XML, it contains abnormal hex characters.
I tried to replace them with a space, but it doesn't work at all.
Original character : �
hex code : (253, 255)
code :
xmlData = String.replace(String.fromCharCode(253,255)," ");
return xmlData;
I'd like to remove "ýÿ" characters from description.
Has anyone else had trouble replacing a hex character with a space?
Based on the answers, I've modified the code as follows:
testData = String.fromCharCode(253,255);
xmlData = xmlData.replace(String.fromCharCode(253,255), " ");
console.log(xmlData);
But it still shows '�' on the screen.
Do you know why this still happens?
The character code is actually 255 * 256 + 253 = 65533, so you would get something like this:
xmlData = xmlData.replace(String.fromCharCode(65533)," ");
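Note that replace() with a plain string as its first argument only replaces the first occurrence. If the text may contain several of these characters, a global regular expression is one option. This is a minimal sketch; the function name stripReplacementChars and the sample string are made up for illustration:

```javascript
// U+FFFD (decimal 65533) is the Unicode replacement character.
const REPLACEMENT_CHAR = /\uFFFD/g;

// Hypothetical helper: swap every U+FFFD in `text` for `substitute`.
function stripReplacementChars(text, substitute = " ") {
  return text.replace(REPLACEMENT_CHAR, substitute);
}

const sample = "abc\uFFFDdef\uFFFDghi";
console.log(stripReplacementChars(sample));     // "abc def ghi"
console.log(stripReplacementChars(sample, "")); // "abcdefghi"
```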
The string String.fromCharCode(253,255) is two characters long.
You should call replace() on a string instance, not on String:
var testData = String.fromCharCode(253,255);
var xmlData = testData.replace(String.fromCharCode(253,255), " ");
alert(xmlData);
Working example: http://jsfiddle.net/StURS/2/
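To see why the two-argument call never matches a single '�', it may help to compare the two strings directly. A small sketch, with variable names and the sample text chosen only for illustration:

```javascript
// Two code units (253 and 255) give the two-character string "ýÿ".
const twoChars = String.fromCharCode(253, 255);
// One code unit (65533) gives the single character U+FFFD.
const oneChar = String.fromCharCode(65533);

console.log(twoChars.length); // 2
console.log(oneChar.length);  // 1

const original = "bad\uFFFDchar";
// Searching for "ýÿ" finds nothing, so the string comes back unchanged.
const unchanged = original.replace(twoChars, " ");
// Searching for U+FFFD finds the character and replaces it.
const cleaned = original.replace(oneChar, " ");

console.log(unchanged === original); // true
console.log(cleaned);                // "bad char"
```

Strings are immutable, so replace() returns a new string; the result has to be assigned to something for the change to be visible.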
I just had this problem with a messed-up SQL dump that contained both valid and invalid UTF-8 codes, which forced a more manual conversion. Since the examples above don't cover replacing the characters or finding better matches, I figured I'd put my two cents in for those struggling with similar encoding problems. The following code:
- parses my sql-dump
- splits according to queries
- finds character codes above 256
- outputs the codes and the string with context where the code appears
- replaces the Swedish ÅÄÖ with correct codes using regular expressions
- outputs the replaced string for control
"use strict";
const readline = require("readline");
const fs = require("fs");

var fn = "my_problematic_sql_dump.sql";
// Split the dump into individual queries
var lines = fs.readFileSync(fn).toString().split(/;\n/);

// 65533 is U+FFFD; the characters that follow it decide the intended letter
const Aring = new RegExp(String.fromCharCode(65533) +
    "\\" + String.fromCharCode(46) + "{1,3}", 'g');
const Auml = new RegExp(String.fromCharCode(65533) +
    String.fromCharCode(44) + "{1,3}", 'g');
const Ouml = new RegExp(String.fromCharCode(65533) +
    String.fromCharCode(45) + "{1,3}", 'g');

for (let i in lines){
    let l = lines[i];
    for (let ii = 0; ii < l.length; ii++){
        if (l.charCodeAt(ii) > 256){
            // Report the suspicious code together with the three codes after it
            console.log("\n Invalid code at line " + i + ":")
            console.log("Code: ", l.charCodeAt(ii), l.charCodeAt(ii + 1),
                l.charCodeAt(ii + 2), l.charCodeAt(ii + 3))

            let core_str = l.substring(ii, ii + 20)
            console.log("String: ", core_str)

            // Strip line breaks, then substitute the Swedish letters
            core_str = core_str.replace(/[\r\n]/g, "")
                .replace(Ouml, "Ö")
                .replace(Auml, "Ä")
                .replace(Aring, "Å")
            console.log("After replacements: ", core_str)
        }
    }
}
The resulting output will look something like this:
Invalid code at line 18:
Code: 65533 45 82 65533
String: �-R�,,LDRALEDIGT', N
After replacements: ÖRÄLDRALEDIGT', N
Invalid code at line 18:
Code: 65533 44 44 76
String: �,,LDRALEDIGT', NULL
After replacements: ÄLDRALEDIGT', NULL
Invalid code at line 19:
Code: 65533 46 46 46
String: �...ker med fam till
After replacements: Åker med fam till
A few things that I found worth noting:
- The 65533 is sometimes followed by a varying number of regular characters that decide the actual character, hence the {1,3}
- The Aring pattern contains a ., which would otherwise match any character, hence the additional \\
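Both points can be folded into a small helper that escapes the trailing character before building the pattern. This is only a sketch: escapeRegExp, buildPattern and repair are names invented here, and the mapping of trailing characters to letters is taken from the sample output above, so it may not hold for other dumps.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// U+FFFD followed by one to three copies of the trailing character.
function buildPattern(trailing) {
  return new RegExp("\uFFFD" + escapeRegExp(trailing) + "{1,3}", "g");
}

// Assumed mapping, based on the output shown above.
const fixes = [
  [buildPattern("-"), "Ö"],
  [buildPattern(","), "Ä"],
  [buildPattern("."), "Å"],
];

// Apply every substitution in turn.
function repair(text) {
  return fixes.reduce((acc, [pattern, letter]) => acc.replace(pattern, letter), text);
}

console.log(repair("\uFFFD-R\uFFFD,,LDRALEDIGT")); // "ÖRÄLDRALEDIGT"
console.log(repair("\uFFFD...ker med fam till"));  // "Åker med fam till"
```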