I'm a bit confused about how CSS selectors with combinators behave in BeautifulSoup. Here is some simple code to illustrate what I mean:
from bs4 import BeautifulSoup as bs
import requests
response = requests.get('https://stackoverflow.com/questions/tagged/python')
soup = bs(response.text)
print(len(soup.select('#mainbar > div')))
returns 6 children, but
print(len(soup.select('#mainbar>div')))
returns 0 children.
The same happens with '#mainbar ~ div' (found 1 sibling) and '#mainbar~div' (found nothing).
According to the documentation, those spaces are optional, but BeautifulSoup gives me different output for what I thought were the same selectors.
So is this a bs4 bug, or does this behavior depend on the CSS version or something else?
This is confirmed as a bug here: https://bugs.launchpad.net/beautifulsoup/+bug/1717851
From a CSS perspective, the selector is valid with or without the spaces.
I will see if I can find further evidence.
The individual reporting the bug states:
The issue, as far as I see, is that since the code is only doing a shlex.split, it doesn't treat div, >, and span as separate entities if a space is left out on either side of >.
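To make the quoted explanation concrete, here is a minimal sketch using only the standard library. It is not bs4's actual code path, just the same shlex.split call applied to the two selectors from the question, so you can see that the combinator is only a separate token when it is surrounded by whitespace:

```python
import shlex

# With spaces, the combinator '>' becomes its own token.
spaced = shlex.split('#mainbar > div')
print(spaced)   # ['#mainbar', '>', 'div']

# Without spaces, shlex only splits on whitespace, so the whole
# selector stays one token and the '>' is never seen as a combinator.
compact = shlex.split('#mainbar>div')
print(compact)  # ['#mainbar>div']
```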
In case you want to patch it, see bs4/element.py, line 1440, and replace
tokens = shlex.split(selector)
with
selector = re.sub(r'\s*([+>~])\s*', r' \1 ', selector)
tokens = shlex.split(selector)
Demo:
<script type="text/javascript" src="//cdn.datacamp.com/dcl-react.js.gz"></script>
<div data-datacamp-exercise data-lang="python">
<code data-type="sample-code">
import re, shlex
def testSelect(selector):
    selector = re.sub(r'\s*([+>~])\s*', r' \1 ', selector)
    tokens = shlex.split(selector)
    print(tokens)
testSelect('#mainbar > div ~ p') # default
testSelect('#mainbar>div~p')
testSelect('#mainbar >div+ p')
testSelect('#mainbar.classA')
testSelect('#mainbar p')
</code>
</div>
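If you would rather not edit the installed library, the same substitution can probably be applied on the caller side before the selector reaches select. The sketch below is only an illustration: normalize_combinators is a hypothetical helper name (not part of bs4), and it assumes simple selectors, since the regex would also pad a ~ or + that appears inside an attribute selector or a quoted value.

```python
import re

def normalize_combinators(selector):
    """Pad the >, + and ~ combinators with single spaces.

    Hypothetical helper, not part of bs4. Assumes the selector has no
    attribute selectors or quoted strings containing these characters.
    """
    return re.sub(r'\s*([+>~])\s*', r' \1 ', selector)

print(normalize_combinators('#mainbar>div'))      # #mainbar > div
print(normalize_combinators('#mainbar >div+ p'))  # #mainbar > div + p
print(normalize_combinators('#mainbar p'))        # #mainbar p (unchanged)
```

With the soup object from the question, the call would then look like soup.select(normalize_combinators('#mainbar>div')).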