I am programmatically building a URI with the help of the encodeURIComponent function using user-provided input. However, when the user enters invalid Unicode characters (such as U+DFFF), the function throws an exception with the following message:
The URI to be encoded contains an invalid character
I looked this up on MSDN, but it didn't tell me anything I didn't already know:
To correct this error
- Ensure the string to be encoded contains only valid Unicode sequences.
My question is: is there a way to sanitize the user-provided input to remove all invalid Unicode sequences before I pass it on to the encodeURIComponent function?
I took the programmatic approach to discover the answer, and the only range that turned up any problems was \ud800-\udfff, the range for high and low surrogates:
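A minimal sketch of such a scan, assuming it is enough to test each UTF-16 code unit on its own (the helper name `findFailingRanges` is illustrative, not part of any API):

```javascript
// Encode every UTF-16 code unit by itself and collect the contiguous
// ranges for which encodeURIComponent throws.
function findFailingRanges() {
  const ranges = [];
  let start = -1; // first code unit of the current failing run, or -1
  for (let code = 0; code <= 0xffff; code++) {
    let failed = false;
    try {
      encodeURIComponent(String.fromCharCode(code));
    } catch (e) {
      failed = true;
    }
    if (failed && start === -1) {
      start = code; // a failing run begins here
    } else if (!failed && start !== -1) {
      ranges.push([start, code - 1]); // the run ended on the previous unit
      start = -1;
    }
  }
  if (start !== -1) ranges.push([start, 0xffff]);
  return ranges;
}

const failingRanges = findFailingRanges().map(
  ([lo, hi]) => lo.toString(16) + '-' + hi.toString(16)
);
console.log(failingRanges); // [ 'd800-dfff' ]
```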
I then confirmed this with a simpler example:
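For instance, probing the code units just inside and just outside the surrogate range (the helper `throwsOnEncode` is a hypothetical name used only for this sketch):

```javascript
// Returns true if encodeURIComponent rejects the string with a URIError.
function throwsOnEncode(str) {
  try {
    encodeURIComponent(str);
    return false;
  } catch (e) {
    return e instanceof URIError;
  }
}

console.log(throwsOnEncode('\ud7ff')); // false: last code unit before the surrogates
console.log(throwsOnEncode('\ud800')); // true: lone high surrogate
console.log(throwsOnEncode('\udfff')); // true: lone low surrogate
console.log(throwsOnEncode('\ue000')); // false: first code unit after the surrogates
console.log(throwsOnEncode('\uffff')); // false: a "non-character", but still encodable
```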
This fits with what MSDN says: apart from unpaired surrogates, all of those Unicode characters (even the valid Unicode "non-characters") are valid Unicode sequences.
You can indeed filter out high and low surrogates, but when used as a high-low pair they are legitimate; they are meant to be used this way so that Unicode can expand (drastically) beyond its original maximum number of characters:
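A short illustration, using the pair \ud800\udc00 (which encodes U+10000) as an arbitrary example; the variable names are only for this sketch:

```javascript
// A high surrogate followed by a low surrogate is a valid pair.
const validPair = '\ud800\udc00'; // U+10000
const encodedPair = encodeURIComponent(validPair);
console.log(encodedPair); // "%F0%90%80%80"

// The same two code units in the wrong order are two unmatched surrogates.
let reversedThrows = false;
try {
  encodeURIComponent('\udc00\ud800');
} catch (e) {
  reversedThrows = e instanceof URIError;
}
console.log(reversedThrows); // true
```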
So, if you want to take the easy route and block surrogates, it is just a matter of:
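A minimal sketch, assuming it is acceptable to drop every surrogate code unit, including those that form valid pairs (`stripSurrogates` is an illustrative name):

```javascript
// Remove every surrogate code unit, paired or not.
// Without the "u" flag the character class matches individual UTF-16 code units.
function stripSurrogates(str) {
  return str.replace(/[\ud800-\udfff]/g, '');
}

console.log(stripSurrogates('a\udfffb'));       // "ab"
console.log(stripSurrogates('a\ud800\udc00b')); // "ab" (the valid pair is removed too)
```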
If you want to strip out unmatched (invalid) surrogates while allowing surrogate pairs (which are legitimate sequences, although the characters they encode are rarely needed), you can do the following:
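One way to sketch this without lookbehind is a single regular expression that tries to match a valid pair first and only then falls back to a lone surrogate (`stripLoneSurrogates` is a hypothetical name, and this is not necessarily the form the original answer used):

```javascript
// Keep valid high-low pairs and remove any surrogate that is not part of one.
// The first alternative captures a valid pair; the second matches a lone
// surrogate. The replacement callback returns the pair unchanged, or an
// empty string when only the lone-surrogate alternative matched.
function stripLoneSurrogates(str) {
  return str.replace(
    /([\ud800-\udbff][\udc00-\udfff])|[\ud800-\udfff]/g,
    (match, pair) => pair || ''
  );
}

console.log(stripLoneSurrogates('a\udfffb') === 'ab');                        // true
console.log(stripLoneSurrogates('a\ud800\udc00b') === 'a\ud800\udc00b');      // true
console.log(stripLoneSurrogates('\ud800\ud800\udc00') === '\ud800\udc00');    // true
console.log(encodeURIComponent(stripLoneSurrogates('x\udc00\ud800y')));       // "xy"
```

Because the regular expression scans left to right and prefers the pair alternative, a high surrogate is kept only when a low surrogate immediately follows it.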
If JavaScript had negative lookbehind the function would be a lot less ugly...