I've read about how Zalgo text works, and I'd like to learn how chat or forum software could prevent that kind of annoyance. More precisely, what is the complete set of Unicode combining characters that needs to be either:
a) stripped, assuming chat participants use only languages that don't require combining marks (i.e. you could write "fiancé" with a combining mark, but you'd be a bit Zalgo'ed yourself if you insisted on doing so); or,
b) reduced to a maximum of 8 consecutive characters (the maximum encountered in actual languages)?
EDIT: In the meantime I found a completely differently phrased question ("How to protect against... diacritics?"), which is essentially the same as this one. I made its title more explicit so others will find it as well.
Assuming you're very serious about this and want a technical solution, you could do as follows:
- Split the incoming text into smaller units (words or sentences);
- Render each unit on the server with your font of choice (with a huge line height and lots of space below the baseline where the Zalgo "noise" would go);
- Train a machine learning algorithm to judge if it looks too "dark" and "busy";
- If the algorithm's confidence is low, defer to human moderators.
This could be fun to implement but in practice it would likely be better to go to step four straight away.
Edit: Here's a more practical, if blunt, solution in Python 2.7. Unicode characters classified as "Mark, nonspacing" and "Mark, enclosing" appear to be the main tools used to create the Zalgo effect. Unlike the above idea, this won't try to determine the "aesthetics" of the text but will instead simply remove all such characters. (Needless to say, this will trash text in many, many languages. Read on for a better solution.) To filter out more character categories, add them to ZALGO_CHAR_CATEGORIES.
#!/usr/bin/env python
import unicodedata
import codecs

ZALGO_CHAR_CATEGORIES = ['Mn', 'Me']

with codecs.open("zalgo", 'r', 'utf-8') as infile:
    for line in infile:
        # Decompose to NFD so accents become separate marks, then drop the marks
        print ''.join([c for c in unicodedata.normalize('NFD', line) if unicodedata.category(c) not in ZALGO_CHAR_CATEGORIES]),
Example input:
H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
Output:
How does Zalgo text work?
How does Zalgo text work?
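Since Python 2.7 is no longer maintained, the same blunt filter might look roughly like this in Python 3. This is only a sketch: the function name strip_marks and its categories parameter are made up for illustration and are not part of the original script.

```python
import unicodedata

# Categories treated as Zalgo material; 'Mn' and 'Me' mirror the script above.
ZALGO_CHAR_CATEGORIES = ("Mn", "Me")


def strip_marks(text, categories=ZALGO_CHAR_CATEGORIES):
    """Return text with every character in the given Unicode categories removed.

    The text is decomposed (NFD) first so that precomposed letters such as
    U+00E9 are split into a base letter plus a combining mark.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) not in categories)


if __name__ == "__main__":
    # "fiancé" loses its accent: the collateral damage mentioned above.
    print(strip_marks("fianc\u00e9"))  # prints "fiance"
```

Passing a longer tuple as categories widens the filter in the same way that extending ZALGO_CHAR_CATEGORIES does in the original script.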
Finally, if you're looking to detect, rather than unconditionally remove, Zalgo text, you could perform character frequency analysis. The program below does that for each line of the input file. The function is_zalgo calculates a "Zalgo score" for each word of the string it is given (the score is the number of potential Zalgo characters divided by the total number of characters). It then checks whether the third quartile of the words' scores is greater than THRESHOLD. If THRESHOLD equals 0.5, it means we're trying to detect whether roughly one out of every four words consists of more than 50% Zalgo characters. (The THRESHOLD of 0.5 was guessed and may require adjustment for real-world use.) This type of algorithm is probably the best in terms of payoff/coding effort.
#!/usr/bin/env python
from __future__ import division
import unicodedata
import codecs
import numpy

ZALGO_CHAR_CATEGORIES = ['Mn', 'Me']
THRESHOLD = 0.5
DEBUG = True

def is_zalgo(s):
    word_scores = []
    for word in s.split():
        cats = [unicodedata.category(c) for c in word]
        # Fraction of the word's characters that are potential Zalgo marks
        score = sum([cats.count(banned) for banned in ZALGO_CHAR_CATEGORIES]) / len(word)
        word_scores.append(score)
    # Empty or whitespace-only input has no words to score
    if not word_scores:
        return False
    # Third quartile of the per-word scores
    total_score = numpy.percentile(word_scores, 75)
    if DEBUG:
        print total_score
    return total_score > THRESHOLD

with codecs.open("zalgo", 'r', 'utf-8') as infile:
    for line in infile:
        print is_zalgo(unicodedata.normalize('NFD', line)), "\t", line
Sample output:
0.911483990148
True Señor, could you or your fiancé explain, H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
0.333333333333
False Příliš žluťoučký kůň úpěl ďábelské ódy.
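For readers on Python 3 who would rather avoid the NumPy dependency, the same scoring idea can be sketched with the standard library alone. The helpers third_quartile and zalgo_score are illustrative names, not part of the original program. third_quartile interpolates linearly between the two nearest ranks, which is my understanding of what numpy.percentile does by default. The threshold is the same guessed 0.5.

```python
import unicodedata

ZALGO_CHAR_CATEGORIES = ("Mn", "Me")
THRESHOLD = 0.5  # guessed value, as in the original program


def third_quartile(values):
    """75th percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    position = 0.75 * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def zalgo_score(word):
    """Fraction of the word's characters that are combining marks."""
    marks = sum(1 for c in word if unicodedata.category(c) in ZALGO_CHAR_CATEGORIES)
    return marks / len(word)


def is_zalgo(text, threshold=THRESHOLD):
    """True if the third quartile of the per-word scores exceeds threshold."""
    words = unicodedata.normalize("NFD", text).split()
    if not words:
        # Empty or whitespace-only input has nothing to score
        return False
    return third_quartile([zalgo_score(w) for w in words]) > threshold
```

With the guessed threshold, ordinary accented text such as the Czech sentence above stays well below 0.5, while a line in which most words are buried in marks is flagged.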
Make the box overflow:hidden. It doesn't actually disable Zalgo text, but it prevents it from damaging other comments.
<style>
.comment {
  /* the overflow: hidden is what prevents one comment's combining marks from affecting its siblings */
  overflow: hidden;
  /* the padding gives space for any legitimate combining marks */
  padding: 0.5em;
  /* the rest are just to visually divide the three comments */
  border: solid 1px #ccc;
  margin-top: -1px;
  margin-bottom: -1px;
}
</style>
<div class=comment>The below comment looks awful.</div>
<div class=comment>H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡</div>
<div class=comment>The above comment looks awful.</div>
A related question was asked before (https://stackoverflow.com/questions/5073191/how-is-zalgo-text-implemented), but it's interesting to go into prevention here.
In terms of prevention, you can choose from several strategies:
- prevent combining diacritics entirely (and piss off many international users),
- filter out combining characters using whitelisting or blacklisting (and piss off a smaller percentage of international users),
- allow only a limited number of consecutive combining characters (and piss off an even smaller percentage of users),
- have a healthy moderator community (with all the downsides that has; see your question as an example here).
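The third strategy, capping the number of consecutive combining marks, can be sketched in Python 3 as follows. The function name limit_combining_marks is invented for illustration, and the default cap of 8 is simply the figure quoted in the question; what cap is actually safe would need to be checked against the scripts you intend to support.

```python
import unicodedata

# Nonspacing and enclosing marks; an assumption about which categories matter.
MARK_CATEGORIES = ("Mn", "Me")


def limit_combining_marks(text, max_marks=8):
    """Drop combining marks beyond max_marks in any consecutive run."""
    kept = []
    run_length = 0  # number of marks seen since the last non-mark character
    for char in text:
        if unicodedata.category(char) in MARK_CATEGORIES:
            run_length += 1
            if run_length > max_marks:
                continue  # excess mark: skip it
        else:
            run_length = 0
        kept.append(char)
    return "".join(kept)
```

This keeps the first marks in each run and discards the rest. It does not normalize the text, so precomposed letters such as é do not count towards the cap.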
You can get rid of Zalgo text in your application using strip-combining-marks by Mathias Bynens.
The module strip-combining-marks is available for browsers (via Bower) and Node.js applications (via npm).
Here is an example of how to use it with npm:
var stripCombiningMarks = require("strip-combining-marks");
var zalgoText = 'U̼̥̻̮͍͖n͠i͏c̯̮o̬̝̠͉̤d͖͟e̫̟̗͟ͅ';
var strippedText = stripCombiningMarks(zalgoText); // "Unicode"
Using PHP and the mindset of a demolition worker, you can get rid of the Zalgo with the iconv function. Of course, that will also kill every other character that has no ISO-8859-1 equivalent, and the result is encoded as ISO-8859-1 rather than UTF-8.
$unZalgoText = iconv("UTF-8", "ISO-8859-1//IGNORE", $zalgoText);
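To see what this kind of conversion keeps and drops, here is a rough Python 3 analogue. It is not the PHP iconv call itself, only an approximation using str.encode with errors="ignore", and the function name to_latin1_only is invented for the example. Unlike the PHP line, it decodes the result back into a regular string.

```python
def to_latin1_only(text):
    """Drop every character that cannot be encoded as ISO-8859-1."""
    return text.encode("iso-8859-1", errors="ignore").decode("iso-8859-1")


if __name__ == "__main__":
    # Precomposed é (U+00E9) exists in ISO-8859-1 and survives.
    print(to_latin1_only("fianc\u00e9"))
    # A decomposed accent (U+0301) has no ISO-8859-1 equivalent and is dropped.
    print(to_latin1_only("fiance\u0301"))
    # ř and š are not in ISO-8859-1, so the Czech word is mangled.
    print(to_latin1_only("P\u0159\u00edli\u0161"))
```

The last call shows the collateral damage: legitimate letters outside ISO-8859-1 disappear along with the Zalgo marks.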