I am programmatically building a URI with the help of the encodeURIComponent
function using user provided input. However, when the user enters invalid unicode characters (such as U+DFFF
), the function throws an exception with the following message:
The URI to be encoded contains an invalid character
I looked this up on MSDN, but that didn't tell me anything I didn't already know.
To correct this error
- Ensure the string to be encoded contains only valid Unicode sequences.
My question is, is there a way to sanitize the user provided input to remove all invalid Unicode sequences before I pass it on to the encodeURIComponent
function?
Taking the programmatic approach to discover the answer, the only range that turned up any problems was \ud800-\udfff, the range for high and low surrogates:
for (var regex = '/[', firstI = null, lastI = null, i = 0; i <= 65535; i++) {
try {
encodeURIComponent(String.fromCharCode(i));
}
catch(e) {
if (firstI !== null) {
if (i === lastI + 1) {
lastI++;
}
else if (firstI === lastI) {
regex += '\\u' + firstI.toString(16);
firstI = lastI = i;
}
else {
regex += '\\u' + firstI.toString(16) + '-' + '\\u' + lastI.toString(16);
firstI = lastI = i;
}
}
else {
firstI = i;
lastI = i;
}
}
}
if (firstI === lastI) {
regex += '\\u' + firstI.toString(16);
}
else {
regex += '\\u' + firstI.toString(16) + '-' + '\\u' + lastI.toString(16);
}
regex += ']/';
alert(regex); // /[\ud800-\udfff]/
I then confirmed this with a simpler example:
for (var i = 0; i <= 65535 && (i <0xD800 || i >0xDFFF ) ; i++) {
try {
encodeURIComponent(String.fromCharCode(i));
}
catch(e) {
alert(e); // Doesn't alert
}
}
alert('ok!');
And this fits with what MSDN says because indeed all those Unicode characters (even valid Unicode "non-characters") besides surrogates are all valid Unicode sequences.
You can indeed filter out high and low surrogates, but when used in a high-low pair, they become legitimate (as they are meant to be used in this way to allow for Unicode to expand (drastically) beyond its original maximum number of characters):
alert(encodeURIComponent('\uD800\uDC00')); // ok
alert(encodeURIComponent('\uD800')); // not ok
alert(encodeURIComponent('\uDC00')); // not ok either
So, if you want to take the easy route and block surrogates, it is just a matter of:
urlPart = urlPart.replace(/[\ud800-\udfff]/g, '');
If you want to strip out unmatched (invalid) surrogates while allowing surrogate pairs (which are legitimate sequences but the characters are rarely ever needed), you can do the following:
function stripUnmatchedSurrogates (str) {
return str.replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])/g, '').split('').reverse().join('').replace(/[\uDC00-\uDFFF](?![\uD800-\uDBFF])/g, '').split('').reverse().join('');
}
var urlPart = '\uD801 \uD801\uDC00 \uDC01'
alert(stripUnmatchedSurrogates(urlPart)); // Leaves one valid sequence (representing a single non-BMP character)
If JavaScript had negative lookbehind the function would be a lot less ugly...