I'm trying to 'defrontpagify' the HTML of an MS FrontPage-generated website, and I'm writing a BeautifulSoup script to do it.
However, I've gotten stuck on the part where I try to strip a particular attribute (or list of attributes) from every tag in the document that contains them. The code snippet:
REMOVE_ATTRIBUTES = ['lang','language','onmouseover','onmouseout','script','style','font',
                     'dir','face','size','color','style','class','width','height','hspace',
                     'border','valign','align','background','bgcolor','text','link','vlink',
                     'alink','cellpadding','cellspacing']
# remove all attributes in REMOVE_ATTRIBUTES from all tags,
# but preserve the tag and its content.
for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.findAll(attribute=True):
        del(tag[attribute])
It runs without error, but doesn't actually strip any of the attributes. When I run it without the outer loop, just hard-coding a single attribute (soup.findAll('style'=True)), it works.
Does anyone see the problem here?
PS - I don't much like the nested loops either. If anyone knows a more functional, map/filter-ish style, I'd love to see it.
I use this one:
or
Thanks to https://stackoverflow.com/a/22497855/1907997
I am using BeautifulSoup 4 with Python 2.7, and for me tag.attrs is a dictionary rather than a list. Therefore I had to modify this code:
Just for the record: the problem here is that if you pass HTML attributes as keyword arguments, the keyword is the name of the attribute. So your code is searching for tags with an attribute named attribute, because the variable does not get expanded. This is why the loop runs without error but strips nothing, while hard-coding a single attribute works [0].
To fix the problem, pass the attribute you are looking for as a dict:
Hope this helps someone in the future, dtk
[0]: Although it needs to be find_all(style=True) in your example, without the quotes; with the quotes you get SyntaxError: keyword can't be an expression
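The keyword-versus-dict difference described in this answer can be sketched roughly as follows. This is only an illustration, assuming BeautifulSoup 4 (the bs4 package) with the built-in html.parser; the sample HTML and the shortened attribute list are made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, just for illustration
html = '<p style="color:red" lang="en">Hi <b class="x">there</b></p>'
soup = BeautifulSoup(html, "html.parser")

REMOVE_ATTRIBUTES = ["lang", "style", "class"]

# The keyword form looks for an attribute literally named "attribute",
# so it matches nothing in this markup.
literal_matches = soup.find_all(attribute=True)

for attribute in REMOVE_ATTRIBUTES:
    # The dict key is the variable's value, so each real attribute name is searched for
    for tag in soup.find_all(attrs={attribute: True}):
        del tag[attribute]

result = str(soup)
print(result)
```

Passing the name through attrs keeps the original nested-loop structure and only changes how the attribute name reaches find_all.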
The line for tag in soup.findAll(attribute=True): does not find any tags. There might be a way to use findAll; I'm not sure. However, this works:
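One way to sidestep attribute matching altogether is to visit every tag once and rebuild its attribute dictionary. The sketch below is not necessarily what this answer originally used; it assumes BeautifulSoup 4, where tag.attrs is a dict, and strip_attributes is a hypothetical helper name chosen for the example.

```python
from bs4 import BeautifulSoup

def strip_attributes(soup, names):
    """Hypothetical helper: drop the named attributes from every tag in place."""
    unwanted = set(names)
    # find_all(True) matches every tag in the document
    for tag in soup.find_all(True):
        # Keep only the attributes that are not in the unwanted set
        tag.attrs = {k: v for k, v in tag.attrs.items() if k not in unwanted}
    return soup

# Made-up sample markup, just for illustration
html = '<table width="100" id="t"><tr><td align="left" bgcolor="#fff">x</td></tr></table>'
soup = strip_attributes(BeautifulSoup(html, "html.parser"), ["width", "align", "bgcolor"])
print(soup)
```

This makes a single pass over the document instead of one search per attribute, and attributes that are not listed (id here) are left untouched.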