I have many HTML pages in which I need to check for the existence of blacklisted words. I know that the built-in `in` is much faster than regular expressions, but here I'm trying to compare many `in` checks against a single regular expression.

Since `re.match()` checks for a match only at the beginning of the string, I used a regex of the form `.*(word|word|...)` and replaced newline characters with a space.
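For clarity, here is a minimal illustration of why the `.*` prefix and the newline replacement are needed (the strings here are just for the demo):

```python
import re

# re.match anchors at position 0, so a bare alternation only matches when a
# blacklisted word starts the string; the leading .* lets it scan ahead.
print re.match("bar", "foo bar")                  # None
print re.match(".*(bar)", "foo bar")              # <match object>
# By default .* stops at newlines, hence the newline replacement;
# re.DOTALL would be an alternative to replacing "\n" with " ".
print re.match(".*(bar)", "foo\nbar")             # None
print re.match(".*(bar)", "foo\nbar", re.DOTALL)  # <match object>
```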
### Code
```python
from timeit import timeit
import re
from urllib2 import urlopen

html = urlopen('http://en.wikipedia.org/wiki/Main_Page').read()

# Random reversed strings to avoid unwanted matches + one guaranteed match
words = [
    "zihw", "elbadartnu", "retlob", "ssenenif", "nnub", "detartsehcro",
    "elbappirnu", "banehc", "rebmunbus", "gnizilodi", "noituac", "deludehcsnu",
    "/body", "latnanosnocerp", "cihportomeh"
]

def in_test(html, blacklist):
    html_lower = html.lower()
    return any(k in html_lower for k in blacklist)

def search_test(html, pattern):
    if re.search(pattern, html):
        return True
    return False

def match_test(html, pattern):
    # Flatten the page to one line so that .* can reach a match anywhere
    html_line = html.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")
    if re.match(pattern, html_line):
        return True
    return False

# patternX is word|word|word...; patternX_exc is .*(word|word|...)
pattern5 = re.compile("|".join(words[:5]), re.I)
pattern5_exc = re.compile(".*(" + "|".join(words[:5]) + ")", re.I)
pattern10 = re.compile("|".join(words[:10]), re.I)
pattern10_exc = re.compile(".*(" + "|".join(words[:10]) + ")", re.I)
pattern15a = re.compile("|".join(words[:15]), re.I)
pattern15a_exc = re.compile(".*(" + "|".join(words[:15]) + ")", re.I)

words[12] = "doctype"  # A guaranteed match at the beginning of the page
pattern15b = re.compile("|".join(words[:15]), re.I)
pattern15b_exc = re.compile(".*(" + "|".join(words[:15]) + ")", re.I)

words[12] = "featured list"  # A guaranteed match at ~half page
pattern15c = re.compile("|".join(words[:15]), re.I)
pattern15c_exc = re.compile(".*(" + "|".join(words[:15]) + ")", re.I)
```
### `in` vs `re.match` vs `re.search` with no match
print timeit("in_test(html, words[:5])", "from __main__ import *")
print timeit("search_test(html, pattern5)", "from __main__ import *")
print timeit("match_test(html, pattern5_exc)", "from __main__ import *")
0.127397060394
2.05020999908
2.17416286469
print timeit("in_test(html, words[:10])", "from __main__ import *")
print timeit("search_test(html, pattern10)", "from __main__ import *")
print timeit("match_test(html, pattern10_exc)", "from __main__ import *")
0.210324048996
3.73544692993
3.8765540123
These tests don't match any words. `in` is clearly the winner, and its runtime seems to grow linearly with the number of words.
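My understanding of why the gap grows (a rough sketch with made-up words, not the pages above): with no match, the regex engine has to try every alternative at every position, while each `in` is a single fast C-level scan.

```python
import re

words = ["foo", "bar", "baz"]          # hypothetical stand-ins for the blacklist
pattern = re.compile("|".join(words))
haystack = "x" * 1000000               # no match anywhere

# The alternation is tried at every position, so cost grows with len(words);
# `in` scans once per word, but each scan is fast.
print pattern.search(haystack)             # None
print any(w in haystack for w in words)    # False
```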
### `in` vs `re.match` vs `re.search` with a match in different positions
print timeit("in_test(html, words[:15])", "from __main__ import *")
# Match at the end
print timeit("search_test(html, pattern15a)", "from __main__ import *")
print timeit("match_test(html, pattern15a_exc)", "from __main__ import *")
# Match at the beginning
print timeit("search_test(html, pattern15b)", "from __main__ import *")
print timeit("match_test(html, pattern15b_exc)", "from __main__ import *")
# Match at ~half page
print timeit("search_test(html, pattern15c)", "from __main__ import *")
print timeit("match_test(html, pattern15c_exc)", "from __main__ import *")
The output is

```
0.258332967758
5.9074420929
0.0433299541473
0.000770807266235
6.0548210144
2.47815990448
3.25421690941
```
When a match occurs, the regex can be much faster than `in`, but it depends on the position of the match: for a match at the beginning of the page `re.search` is better, for one at the end `re.match` is the better choice, and for one at ~half page both are significantly slower than `in`.
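The position effect can be reproduced without Wikipedia; here is a synthetic sketch (the strings and repeat counts are made up): `re.search` returns at the first match, so an early match is nearly free while a late one forces a full scan.

```python
from timeit import timeit
import re

pattern = re.compile("needle")
early = "needle" + "x" * 1000000   # match at the beginning
late = "x" * 1000000 + "needle"    # match at the end

# search() stops as soon as it finds a match, so `early` returns almost
# immediately while `late` must scan the whole string first.
print timeit(lambda: pattern.search(early), number=1000)
print timeit(lambda: pattern.search(late), number=1000)
```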
Regular expressions will help me avoid duplicating words (e.g. `è`, `È`, ...) and let me forget about upper/lower case (especially with non-ASCII characters). But the speed seems too variable and, on average, slower than `in`.
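One caveat I'm aware of (if I understand Python 2's `re` correctly, so treat this as an assumption): case-insensitive matching of non-ASCII letters only works on `unicode` strings with `re.UNICODE` set, so the pages would need to be decoded first.

```python
# -*- coding: utf-8 -*-
import re

# re.I alone folds only ASCII case; re.UNICODE extends the folding to
# letters like è/È.
print re.search(u"è", u"È", re.I)               # None
print re.search(u"è", u"È", re.I | re.UNICODE)  # <match object>
```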
Are these tests correct? If so, are there any other built-in methods I could test, or other approaches that would help in this scenario? The blacklist is going to grow, so I need to take that into account.