Using regex on Beautiful Soup tags

I have recently started using Beautiful Soup 4 and I am struggling to understand some of its basics (I was quite OK with bs3.x, for some reason). For example, let's start with something simple like:

data=soup.find_all('h2')

which yields something like:

<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\">more-accurate-data</a></h2>

which is fine. But when I try to run a regex over the above string, using something along the lines of the following (assuming the above is stored in "temp"):

t=str(re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""").search(str(temp)).group(1))

I get:

AttributeError: 'NoneType' object has no attribute 'group'

which I find strange, because when I type something like this into the Python interpreter:

k=r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\">more-accurate-data</a></h2>"""

and then use the above regex, everything works fine. I am wondering why the Tag objects generated by bs4 don't seem to work with regexes. Maybe I am doing something stupid, or maybe something has changed between bs3.x and bs4 that I am not aware of. Any help would be appreciated. Thanks.

Tags: python regex python-2.7 beautifulsoup

1 Answer

啃猪蹄的小仙女

Reply #2 · 2019-08-06 14:10

Try looking at the repr of the string:

>>> a=r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\">more-accurate-data</a></h2>"""
>>> print repr(a)
'<h2><a href=\\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\\">more-accurate-data</a></h2>'

And the regex works with this representation:

>>> regex = re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""")
>>> regex.match(a)
<_sre.SRE_Match object at 0x20fbf30>
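
To make the difference concrete, here is a minimal sketch using only the re module. The variable names are made up for illustration, and the second string is an assumption about what str() of the tag looks like (ordinary quotes, no backslashes); check the repr of your own data to confirm.

```python
import re

# Pattern from the question: it demands a literal backslash before each quote.
pattern = re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""")

# Raw-string literal as typed at the interpreter: the backslashes are real characters.
with_backslashes = r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\">more-accurate-data</a></h2>"""

# Assumed shape of str(temp): the same markup, but with plain double quotes.
without_backslashes = """<h2><a href="/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023">more-accurate-data</a></h2>"""

match_raw = pattern.search(with_backslashes)
match_plain = pattern.search(without_backslashes)

print(repr(with_backslashes))     # the repr shows the backslashes doubled
print(repr(without_backslashes))  # no backslashes at all
print(match_raw.group(1))         # the pattern matches the raw-string version
print(match_plain)                # None, so calling .group(1) would raise AttributeError
```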

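The problem is that the string you get from Beautiful Soup is different from the one you typed by hand. The output you pasted shows escaped quotes (\"), but str(temp) most likely contains plain double quotes with no backslashes. Your pattern requires a literal backslash before each quote, so search() finds nothing and returns None, and that is where the AttributeError comes from. When dealing with regexes, it's a good idea to check the repr of the strings involved to avoid things like this.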
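
If the goal is just to get at the link, one possible approach (a sketch, not the only way) is to read the href attribute from the tag instead of matching its string form, or to drop the backslashes from the pattern. The HTML snippet, the "html.parser" choice, and the variable names below are assumptions made for illustration.

```python
import re
from bs4 import BeautifulSoup

# Assumed input: the markup from the question, with ordinary quotes.
html = '<h2><a href="/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023">more-accurate-data</a></h2>'

soup = BeautifulSoup(html, "html.parser")
temp = soup.find_all("h2")[0]

# Always worth checking what the regex will actually see.
print(repr(str(temp)))

# Option 1: ask the tag for the attribute, no regex needed.
href = temp.a["href"]
print(href)

# Option 2: keep the regex, but without the literal backslashes.
match = re.search(r'<h2><a href="/accurate(.*?)">', str(temp))
t = match.group(1) if match else None  # guard against None instead of crashing
print(t)
```

Option 1 tends to be more robust, since it does not depend on how the tag happens to be serialized.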
