regex email invisible text

Posted 2019-09-05 01:13

Question:

I am getting lots of spam with so-called "invisible" text - large blocks of gibberish hidden from view with white font color on white background or in comment tags. In cPanel "account level filters" I am trying to build a regex filter on the email body.

This one (to catch gibberish in comment tags) results in too many false positives because it catches legitimate HTML text which contains occasional comment tags:

\<![ \r\n\t]*--[\S\s]{400,6000}--[ \r\n\t]*\>

These two (for white text on white background) are not very effective - because there are so many ways to write the offending HTML - and I can't figure out how to write clever enough regex:

\<div style=\"color:white\">[ \r\n\t]*.{1500,6000}[ \r\n\t]*\<\/div>

color=[\"\']*\#FFFFF[0-9A-E]

Thanks in advance for your suggestions!


Examples:

<div style="color:white">
Several paragraphs of gibberish designed to fool filters.
</div>


<!--
Several paragraphs of gibberish designed to fool filters.
-->

Answer 1:

These are decent but weak indicators of spam, and I strongly advise against blocking messages on any one of them alone. Consider a system like SpamAssassin instead, which already has regexps like the ones you're trying to write. SpamAssassin assigns a small number of points to each indicator and then sums them to see whether a message has accumulated enough to be labeled as spam.

SpamAssassin rules of note:

  • __HTML_COMMENT_10000
  • HTML_FONT_TINY
  • HTML_FONT_LOW_CONTRAST
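As a rough illustration of the additive scoring idea described above, here is a minimal Python sketch. The indicator names, patterns, weights, and threshold are all made up for this example; they are not SpamAssassin's actual rules or scores.

```python
import re

# Hypothetical indicators: (name, pattern, points). The names, patterns and
# weights are illustrative only, not SpamAssassin's real rules or scores.
INDICATORS = [
    ("LONG_HTML_COMMENT", re.compile(r"<!--[\s\S]{400,}?-->"), 1.0),
    ("WHITE_TEXT", re.compile(r"""\bcolor[:=][\s"']{0,5}(?:white|#[ef]{3})""", re.I), 0.5),
    ("TINY_FONT", re.compile(r"font-size\s*:\s*[01]px", re.I), 1.0),
]

THRESHOLD = 2.0  # assumed cutoff for this sketch


def score_message(body):
    """Return (total points, names of the indicators that fired)."""
    hits = [(name, points) for name, pattern, points in INDICATORS
            if pattern.search(body)]
    return sum(points for _, points in hits), [name for name, _ in hits]


def is_spam(body):
    """A message is flagged only when enough weak indicators add up."""
    total, _ = score_message(body)
    return total >= THRESHOLD
```

With these assumed weights, a message that only contains white text scores 0.5 and passes, while one that also hides a long comment and uses a 1px font reaches 2.5 and is flagged. No single weak indicator can block a message by itself.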

Here is a SpamAssassin rule definition to more exactly address your white-on-white issue:

rawbody  __JOE_COLOR_WHITE   /\bcolor[:=][\s\"\']{0,5}(?:white|\#[ef]{3}|\#[ef].[ef].[ef].)/i
rawbody  __JOE_BGCOLOR_WHITE /\b(?:bgcolor|background(?:-color)?)[:=][\s\'\"]{0,5}(?:white|\#[ef]{3}|\#[ef].[ef].[ef].)/i
meta     JOE_WHITE_ON_WHITE  __JOE_COLOR_WHITE && __JOE_BGCOLOR_WHITE
score    JOE_WHITE_ON_WHITE  0.5
describe JOE_WHITE_ON_WHITE  Part of the email has white text, another part has white bg

I'm matching a somewhat broader definition of "white", but that appears to be your intent ("FFFFF0" has slightly less blue). My regex is twice as broad, applies to all three RGB channels, and also matches the shorter three-digit hex form. The weakness of the rule I defined above is that it doesn't ensure the white text is actually rendered on a white background. This should be "close enough", but it may accidentally hit some non-spam marketing/newsletter mail.
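To see how the two patterns behave before deploying them, they can be tried out in Python, whose `re` syntax is compatible with these particular expressions. This is only a test harness, not how SpamAssassin evaluates rawbody rules. The function names and the stricter `COLOR_WHITE_STRICT` variant are assumptions added for illustration.

```python
import re

# The two patterns from the rule above, ported as-is (case-insensitive).
WHITE = r"(?:white|\#[ef]{3}|\#[ef].[ef].[ef].)"
COLOR_WHITE = re.compile(r"""\bcolor[:=][\s"']{0,5}""" + WHITE, re.I)
BGCOLOR_WHITE = re.compile(
    r"""\b(?:bgcolor|background(?:-color)?)[:=][\s'"]{0,5}""" + WHITE, re.I)

# Hypothetical stricter variant: refuse "color" when it is the tail of
# "background-color", so a white background alone cannot satisfy both halves.
COLOR_WHITE_STRICT = re.compile(
    r"""(?<![-\w])color[:=][\s"']{0,5}""" + WHITE, re.I)


def white_on_white(raw_body):
    """Mirror of the meta rule: both sub-patterns appear somewhere."""
    return bool(COLOR_WHITE.search(raw_body) and BGCOLOR_WHITE.search(raw_body))


def white_on_white_strict(raw_body):
    """Same check using the stricter text-color pattern."""
    return bool(COLOR_WHITE_STRICT.search(raw_body)
                and BGCOLOR_WHITE.search(raw_body))
```

One thing this makes visible: `\bcolor` also matches the `color` inside `background-color`, because the hyphen counts as a word boundary. A message that merely sets `background-color: #fefefe` therefore satisfies both patterns as written. If that proves too noisy, a negative lookbehind like the one in the stricter variant is one possible tweak (Perl regexes support it too), though I have not tested it inside SpamAssassin itself.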



Tags: regex spam