I'm trying to parse some HTML with BeautifulSoup4 and Python 2.7.6, but the string is returning "None". The HTML I'm trying to parse is:
<div class="booker-booking">
2 rooms
·
USD 0
<!-- Commission: USD -->
</div>
The Python snippet I have is:
data = soup.find('div', class_='booker-booking').string
I've also tried the following two:
data = soup.find('div', class_='booker-booking').text
data = soup.find('div', class_='booker-booking').contents[0]
Which both return:
u'\n\t\t2\xa0rooms \n\t\t\xb7\n\t\tUSD\xa00\n\t\t\n'
I'm ultimately trying to get the first line into a variable just saying "2 Rooms", and the third line into another variable just saying "USD 0".
.string returns None because the text node is not the only child (there is a comment).
To remove Unicode whitespace:
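One possible way to do this, sketched on the string shown in the question rather than on a live soup object. The names text and cleaned are illustrative, and the sample value is copied from the question's output:

```python
# Sample text copied from the question (the u'' prefix works on Python 2.7 and 3.3+).
text = u'\n\t\t2\xa0rooms \n\t\t\xb7\n\t\tUSD\xa00\n\t\t\n'

# split() with no argument splits on runs of whitespace; for Unicode strings
# that includes the non-breaking space \xa0 as well as tabs and newlines.
cleaned = u' '.join(text.split())
# cleaned == u'2 rooms \xb7 USD 0'
```

Joining with a single space collapses every whitespace run, so the non-breaking spaces end up as ordinary spaces.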
To get your final variables:
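A minimal sketch of that last step, assuming the middle dot (\xb7) always separates the two fields as it does in the sample. The names text, rooms, and price are illustrative:

```python
# Sample text copied from the question (the u'' prefix works on Python 2.7 and 3.3+).
text = u'\n\t\t2\xa0rooms \n\t\t\xb7\n\t\tUSD\xa00\n\t\t\n'

# Collapse all whitespace (including \xa0) into single spaces first.
cleaned = u' '.join(text.split())

# Split on the middle dot and trim the space left on each side of it.
rooms, price = [part.strip() for part in cleaned.split(u'\xb7')]
# rooms == u'2 rooms', price == u'USD 0'
```

The tuple unpacking raises ValueError if the text does not contain exactly one middle dot, which may be a useful early warning if the page layout changes.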
After you have done
data = soup.find('div', class_='booker-booking').text
you've extracted the data you need from the HTML. Now you just need to format it to get "2 Rooms" and "USD 0". The first step is probably splitting the data by line, which will give:
[u'', u'\t\t2\xa0rooms ', u'\t\t\xb7', u'\t\tUSD\xa00', u'\t\t', u'']
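The split that produces the list above can be sketched as follows. The value of data is copied from the question's output, and lines is an illustrative name:

```python
# Text as returned by .text in the question (the u'' prefix works on Python 2.7 and 3.3+).
data = u'\n\t\t2\xa0rooms \n\t\t\xb7\n\t\tUSD\xa00\n\t\t\n'

# Splitting on the newline character keeps the empty strings produced by
# the leading and trailing newlines.
lines = data.split(u'\n')
```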
Now you need to get rid of the whitespace, replace the non-breaking spaces (\xa0) with regular spaces, and remove the lines that don't have data:
You will be left with the data you want:
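A minimal sketch of the cleanup described above, starting from the split list. The names lines, cleaned, rooms, and price are illustrative, and treating the first and last remaining entries as the two values is an assumption based on the sample HTML:

```python
# The list produced by splitting the text on newlines, as shown above.
lines = [u'', u'\t\t2\xa0rooms ', u'\t\t\xb7', u'\t\tUSD\xa00', u'\t\t', u'']

# Replace non-breaking spaces with regular spaces, then trim tabs and spaces.
cleaned = [line.replace(u'\xa0', u' ').strip() for line in lines]

# Drop the entries that are empty after trimming.
cleaned = [line for line in cleaned if line]
# cleaned == [u'2 rooms', u'\xb7', u'USD 0']

# The first entry is the room count and the last is the price.
rooms, price = cleaned[0], cleaned[-1]
```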