I've read about how Zalgo text works, and I'm looking to learn how chat or forum software could prevent that kind of annoyance. More precisely, what is the complete set of Unicode combining characters that needs to be either:
a) stripped, assuming chat participants are to use only languages that don't require combining marks (i.e. you could write "fiancé" with a combining mark, but you'd be a bit Zalgo'ed yourself if you insisted on doing so); or
b) reduced to a maximum of 8 consecutive characters (the maximum encountered in actual languages)?
EDIT: In the meantime I found a completely differently phrased question ("How to protect against... diacritics?"), which is essentially the same as this one. I made its title more explicit so others will find it as well.
Using PHP and the mindset of a demolition worker you can get rid of the Zalgo with the iconv function. Of course that will kill any other UTF-8 chars too.
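For comparison, a similarly blunt approach can be sketched in Python. This is not a port of the PHP call, just an illustration of the same "drop everything that isn't ASCII" idea; the function name `to_ascii` is made up for this example.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Reduce text to plain ASCII, dropping everything else."""
    # NFKD splits precomposed letters such as "é" into "e" + U+0301,
    # so the base letter survives once the non-ASCII part is dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    # errors="ignore" silently discards every code point outside ASCII,
    # which removes combining marks but also CJK, Cyrillic, emoji, etc.
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    print(to_ascii("fianc\u00e9"))
    print(to_ascii("Z\u0351\u036ba\u0321lgo"))
```

As with the PHP version, any script that isn't Latin-based is wiped out entirely, so this only suits communities that really are ASCII-only.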
Assuming you're very serious about this and want a technical solution you could do as follows:
This could be fun to implement but in practice it would likely be better to go to step four straight away.
Edit: Here's a more practical, if blunt, solution in Python 2.7. Unicode characters classified as "Mark, nonspacing" and "Mark, enclosing" appear to be the main tools used to create the Zalgo effect. Unlike the above idea this won't try to determine the "aesthetics" of the text but will instead simply remove all such characters. (Needless to say, this will trash text in many, many languages. Read on for a better solution.) To filter out more character categories add them to `ZALGO_CHAR_CATEGORIES`.
Example input:
Output:
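A minimal sketch of the removal approach described above, written for Python 3 rather than 2.7. The constant name `ZALGO_CHAR_CATEGORIES` comes from the text; the function name `strip_zalgo` and everything else are assumptions made for illustration.

```python
import unicodedata

# "Mn" = Mark, nonspacing; "Me" = Mark, enclosing.
# Append further Unicode general categories here to filter more.
ZALGO_CHAR_CATEGORIES = ["Mn", "Me"]


def strip_zalgo(text: str) -> str:
    """Remove every character whose category is listed above."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) not in ZALGO_CHAR_CATEGORIES
    )


if __name__ == "__main__":
    # Each base letter is followed by ten combining acute accents.
    noisy = "".join(ch + "\u0301" * 10 for ch in "Zalgo")
    print(strip_zalgo(noisy))
```

Precomposed letters such as U+00E9 ("é") are in category "Ll" and pass through untouched; only the decomposed form loses its accent.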
Finally, if you're looking to detect, rather than unconditionally remove, Zalgo text you could perform character frequency analysis. The program below does that for each line of the input file. The function `is_zalgo` calculates a "Zalgo score" for each word of the string it is given (the score is the number of potential Zalgo characters divided by the total number of characters). It then looks if the third quartile of the words' scores is greater than `THRESHOLD`. If `THRESHOLD` equals `0.5` it means we're trying to detect if one out of each four words has more than 50% Zalgo characters. (The `THRESHOLD` of 0.5 was guessed and may require adjustment for real-world use.) This type of algorithm is probably the best in terms of payoff/coding effort.
Sample output:
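The detection idea can be sketched roughly as follows in Python 3. The names `is_zalgo` and `THRESHOLD` come from the description above; the `percentile` helper, its linear interpolation, and the handling of empty input are assumptions, so results near the threshold may differ from another implementation.

```python
import unicodedata

ZALGO_CHAR_CATEGORIES = ["Mn", "Me"]
THRESHOLD = 0.5  # guessed value, as noted in the text


def percentile(values, fraction):
    """Linearly interpolated percentile (fraction between 0 and 1)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * fraction
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def is_zalgo(text):
    """True if the third quartile of per-word Zalgo scores exceeds THRESHOLD."""
    scores = []
    for word in text.split():
        marks = sum(
            1 for ch in word
            if unicodedata.category(ch) in ZALGO_CHAR_CATEGORIES
        )
        # Score = potential Zalgo characters / total characters in the word.
        scores.append(marks / len(word))
    if not scores:  # assumption: empty or blank input is not Zalgo
        return False
    return percentile(scores, 0.75) > THRESHOLD


if __name__ == "__main__":
    noisy = "".join(ch + "\u0301" * 10 for ch in "hello")
    for line in ["hello world", noisy, "fiance\u0301 is fine"]:
        print(is_zalgo(line))
```

Because the score is per word, a single legitimately accented word barely moves the quartile, which is what lets ordinary text in languages with combining marks pass.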
Make the box `overflow:hidden`. It doesn't actually disable Zalgo text, but it prevents it from damaging other comments.
Preview on JSFiddle
You can get rid of Zalgo text in your application using strip-combining-marks by Mathias Bynens.
The module strip-combining-marks is available for browsers (via Bower) and Node.js applications (via npm).
Here is an example of how to use it with npm:
A related question was asked before: https://stackoverflow.com/questions/5073191/how-is-zalgo-text-implemented but it's interesting to go into prevention here.
In terms of preventing this you can choose several strategies:
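One strategy that fits here is the cap proposed in the question: keep combining marks, but allow at most 8 in a row. A minimal Python sketch, assuming that "combining mark" means Unicode categories Mn and Me; the names `limit_combining_marks`, `MAX_MARKS` and `MARK_CATEGORIES` are invented for this example.

```python
import unicodedata

MAX_MARKS = 8  # the limit suggested in the question
MARK_CATEGORIES = ("Mn", "Me")  # assumption: nonspacing + enclosing marks


def limit_combining_marks(text, max_marks=MAX_MARKS):
    """Drop combining marks beyond max_marks in any consecutive run."""
    result = []
    run = 0  # length of the current run of combining marks
    for ch in text:
        if unicodedata.category(ch) in MARK_CATEGORIES:
            run += 1
            if run > max_marks:
                continue  # excess mark: skip it
        else:
            run = 0  # a base character ends the run
        result.append(ch)
    return "".join(result)


if __name__ == "__main__":
    noisy = "a" + "\u0301" * 20 + "b"
    cleaned = limit_combining_marks(noisy)
    print(len(noisy), len(cleaned))
```

Unlike outright stripping, this leaves ordinary accented text alone and only trims the stacks that make Zalgo text spill over neighbouring lines.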