using regex on beautiful soup tags

2019-08-06 13:13发布

I have been recently using beautiful soup 4 and I have been struggling to understand some basics of this (I was quite ok with bs3.x for some reason). So, for example, lets start off by doing something simple like:

data=soup.find_all('h2')

which yields me something like:

<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\">more-accurate-data</a></h2>

which is fine. But when I want to regex the above string, using something along the lines off (assuming the above is stored in "temp"):

t=str(re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""").search(str(temp)).group(1))

I get:

AttributeError: 'NoneType' object has no attribute 'group'

which I find strange - because, when I do on the python interpretter, something like:

k=r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\">more-accurate-data</a></h2>"""

and then use the above regex, everything works fine. I am wondering why the "tags" type generated by bs4 seems non regex'able. Now I feel maybe I am doing something stupid or maybe something has changed between bs3.x and bs4 which I am not aware of. Any help on this would be appreciated. Thanks.

1条回答
啃猪蹄的小仙女
2楼-- · 2019-08-06 14:10

You should try to see the repr of the string:

>>> a=r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\">more-accurate-data</a></h2>"""
>>> print repr(a)
'<h2><a href=\\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\\">more-accurate-data</a></h2>'

And the regex works with this representation:

>>> regex = re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""")
>>> regex.match(a)
<_sre.SRE_Match object at 0x20fbf30>

The problem is that the result from beautiful soup is different, because you did not print its repr. When dealing with regexes it's a good idea to check the repr of the strings involved to avoid things like this.

查看更多
登录 后发表回答