A user can copy and paste into an HTML textarea input, and sometimes pastes invalid UTF-8 characters, for example when copying from an RTF file that contains tabs.
How can I check if a string is valid UTF-8?
I think you misunderstand what "UTF-8 characters" means. UTF-8 is an encoding of Unicode, which can represent pretty much every character and glyph that has ever existed in recorded human history, so to that extent there are no "invalid" UTF-8 characters.
RTF is a formatting system which works independently of the underlying encoding system - you can use RTF with ASCII, UTF-8, UTF-16 and others. Textboxes in HTML only respect plain text, so any RTF formatting will be automatically stripped (unless you're using a "rich-edit" component, which I assume you're not).
But the things you describe, such as whitespace characters (like tabs: \t), are represented in Unicode (and so, UTF-8). A string containing those characters is still "valid UTF-8"; it's just invalid as far as your business requirements are concerned.
I suggest just stripping out unwanted characters using a regular expression that matches everything outside the printable ASCII range (from here: Match non printable/non ascii characters and remove from text):
textBoxContent = textBoxContent.replace(/[^\x20-\x7E]+/g, '');
The expression [^\x20-\x7E] matches any character NOT in the codepoint range 0x20 (32, a normal space character ' ') to 0x7E (126, the tilde '~' character); everything it matches is removed. Note that this also removes newlines and every non-ASCII character, such as accented letters.
Unicode's first 128 codepoints (0 to 127) are identical to ASCII and can be seen here: http://www.asciitable.com/
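To make the effect of that range concrete, here is a small sketch comparing the regex above with a more lenient variant. The function names and the sample string are made up for illustration, and the lenient character class is only one possible choice, assuming you want to drop tabs and other control characters but keep line breaks and non-ASCII text:

```javascript
// Strict version from the answer above: keeps only printable ASCII (0x20-0x7E).
function stripToPrintableAscii(text) {
  return text.replace(/[^\x20-\x7E]+/g, '');
}

// Hypothetical lenient variant: removes C0 control characters (including tab)
// and DEL, but keeps \n (0x0A), \r (0x0D) and all non-ASCII characters.
function stripControlChars(text) {
  return text.replace(/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]+/g, '');
}

// Sample input: a tab, an accented letter (U+00E9) and a newline.
const pasted = 'Name:\tJos\u00E9\nDone~';

console.log(stripToPrintableAscii(pasted)); // "Name:JosDone~"
console.log(stripControlChars(pasted));     // "Name:José" + newline + "Done~"
```

Which variant fits depends on whether the field is supposed to accept multi-line or non-English text.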
Just an idea:
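function checkUTF8(text) {
    var utf8Text = text;
    try {
        // escape() percent-encodes characters up to U+00FF as %XX (it is deprecated,
        // but still widely supported); decodeURIComponent() then reads those
        // escapes as UTF-8 bytes.
        utf8Text = decodeURIComponent(escape(text));
        // Success: text held raw UTF-8 bytes and has now been decoded
    } catch (e) {
        // console.log(e.message); // URI malformed
        // Failure: text is not a valid UTF-8 byte sequence (for example, it was
        // already decoded), so it is returned unchanged
    }
    return utf8Text; // either the decoded text or the original input
}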
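If the data is available as raw bytes rather than as a JavaScript string, a stricter check is possible with the standard TextDecoder API, which throws when constructed with `fatal: true` and given malformed input. This is only a sketch: `isValidUtf8` and the sample byte arrays are illustrative names, and it assumes you have a Uint8Array (for example from a file upload or a fetch response), which is not the case for text already pasted into a textarea.

```javascript
// Returns true when `bytes` (a Uint8Array) is a well-formed UTF-8 sequence.
function isValidUtf8(bytes) {
  try {
    // With fatal: true, decode() throws a TypeError on malformed input
    // instead of silently substituting U+FFFD.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

// A string containing a tab encodes to perfectly valid UTF-8.
const good = new TextEncoder().encode('tab\there');
// 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
const bad = new Uint8Array([0xC3, 0x28]);

console.log(isValidUtf8(good)); // true
console.log(isValidUtf8(bad));  // false
```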