I have recently been using Beautiful Soup 4, and I have been struggling to understand some of its basics (I was quite OK with bs3.x for some reason). So, for example, let's start off by doing something simple like:
data=soup.find_all('h2')
which yields me something like:
<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&skuId=210-0023\">more-accurate-data</a></h2>
which is fine. But when I want to run a regex on the above string, using something along the lines of (assuming the above is stored in "temp"):
t=str(re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""").search(str(temp)).group(1))
I get:
AttributeError: 'NoneType' object has no attribute 'group'
which I find strange, because when I do something like this in the Python interpreter:
k=r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&skuId=210-0023\">more-accurate-data</a></h2>"""
and then use the above regex, everything works fine. I am wondering why the "tags" type generated by bs4 seems non regex'able. Now I feel maybe I am doing something stupid or maybe something has changed between bs3.x and bs4 which I am not aware of. Any help on this would be appreciated. Thanks.
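You should look at the repr of the string you typed into the interpreter. Because it is a raw string, the backslashes before the quotes are literal characters, and the regex works with that representation.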
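A minimal sketch of that check, using only the string and pattern from the question (the names pattern and captured are introduced here for illustration):

```python
import re

# The raw string from the question. In a raw string, \" is two characters:
# a literal backslash followed by a double quote.
k = r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&skuId=210-0023\">more-accurate-data</a></h2>"""

# The pattern from the question. In the regex, \\ matches one literal backslash.
pattern = re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""")

# repr shows each literal backslash doubled, which makes them easy to spot.
print(repr(k))

match = pattern.search(k)
captured = match.group(1)
print(captured)
```

The pattern matches here only because k really contains the backslashes that the pattern asks for.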
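The problem is that the result from Beautiful Soup is different, which you did not notice because you did not print its repr. If that string does not contain the backslashes your pattern requires, search() returns None, and calling .group(1) on None raises the AttributeError. When dealing with regexes it's a good idea to check the repr of the strings involved to avoid things like this.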
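To illustrate the difference without needing Beautiful Soup installed, here is a rough sketch. The variable tag_text is a hand-written stand-in for str(temp), on the assumption that the serialized tag contains no backslashes; the names strict, loose and href_tail are hypothetical as well.

```python
import re

# Hand-written stand-in for str(temp). Assumption: the serialized tag has
# plain double quotes around the href value and no backslashes.
tag_text = '<h2><a href="/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023">more-accurate-data</a></h2>'

# Pattern from the question: it requires a literal backslash before each quote.
strict = re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""")
result = strict.search(tag_text)

print(repr(tag_text))  # no doubled backslashes appear in the repr
print(result)          # no match object, hence the AttributeError on .group(1)

# The same pattern without the backslashes, plus a guard against a failed match.
loose = re.compile(r'<h2><a href="/accurate(.*?)">')
m = loose.search(tag_text)
href_tail = m.group(1) if m else None
print(href_tail)
```

If the goal is only to get the link target, reading the attribute from the tag itself (for example temp.a["href"], assuming temp is a single h2 tag) avoids the regex altogether.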