BeautifulSoup 4 + Python: string returns 'None'

Posted 2019-08-07 11:31

I'm trying to parse some HTML with BeautifulSoup4 and Python 2.7.6, but .string is returning None. The HTML I'm trying to parse is:

<div class="booker-booking">
    2&nbsp;rooms
    &#0183;
    USD&nbsp;0
    <!-- Commission: USD  -->
</div>

The snippet from python I have is:

data = soup.find('div', class_='booker-booking').string

I've also tried the following two:

data = soup.find('div', class_='booker-booking').text
data = soup.find('div', class_='booker-booking').contents[0]

Which both return:

u'\n\t\t2\xa0rooms \n\t\t\xb7\n\t\tUSD\xa00\n\t\t\n'

I'm ultimately trying to get the first line into a variable just saying "2 Rooms", and the third line into another variable just saying "USD 0".

2 Answers
chillily
Reply #2 · 2019-08-07 12:18

.string returns None because the text node is not the only child (there is a comment).

from bs4 import BeautifulSoup, Comment

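# html holds the markup from the question; naming the parser avoids a warning
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', 'booker-booking')
# join every text node in the div, skipping comments
text = " ".join(div.find_all(text=lambda t: not isinstance(t, Comment)))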
# -> u'\n    2\xa0rooms\n    \xb7\n    USD\xa00\n     \n'

To collapse all whitespace, including the non-breaking space \xa0:

text = " ".join(text.split())
# -> u'2 rooms \xb7 USD 0'
print text
# -> 2 rooms · USD 0
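
The whitespace step works because split() with no arguments treats any run of Unicode whitespace, including the non-breaking space \xa0, as a separator. A minimal sketch of that behaviour, assuming Python 3 (where every str is Unicode); the names raw, normalized and naive are illustrative only:

```python
raw = "\n    2\xa0rooms\n    \xb7\n    USD\xa00\n     \n"

# split() with no argument breaks on any run of Unicode whitespace,
# so newlines, spaces and \xa0 all disappear in one pass.
normalized = " ".join(raw.split())
print(normalized)

# Splitting on a literal space only would leave \xa0 embedded in the tokens.
naive = raw.split(" ")
print(any("\xa0" in token for token in naive))
```

On Python 2 this should depend on the value being a unicode object rather than a byte string, which is the case for strings that come out of BeautifulSoup.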

To get your final variables:

var1, var2 = [s.strip() for s in text.split(u"\xb7")]
# -> u'2 rooms', u'USD 0'
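
For anyone on Python 3, the steps above can be put together roughly as follows. This is a sketch rather than a drop-in replacement: it assumes the markup from the question, the built-in html.parser, and a bs4 version that accepts the string= argument (older releases use text=, as in the code above). The names rooms and price are made up for illustration.

```python
from bs4 import BeautifulSoup, Comment

html = """<div class="booker-booking">
    2&nbsp;rooms
    &#0183;
    USD&nbsp;0
    <!-- Commission: USD  -->
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="booker-booking")

# The div has three children (text, comment, text), so .string is None.
print(len(div.contents), div.string)

# Collect every text node that is not a comment, then collapse whitespace.
parts = [t for t in div.find_all(string=True) if not isinstance(t, Comment)]
text = " ".join(" ".join(parts).split())

# Split on the middle dot (U+00B7) to get the two fields.
rooms, price = [s.strip() for s in text.split("\xb7")]
print(rooms)
print(price)
```

If the comment were not there, the div would have a single text child and .string would return it directly.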
We Are One
Reply #3 · 2019-08-07 12:19

After you have done data = soup.find('div', class_='booker-booking').text you've extracted the data you need from the HTML. Now you just need to format it to get "2 Rooms" and "USD 0". The first step is probably splitting the data by line:

lines = data.split('\n')

Which will give [u'', u'\t\t2\xa0rooms ', u'\t\t\xb7', u'\t\tUSD\xa00', u'\t\t', u'']

Now you need to get rid of the whitespace and remove the lines that don't have data. BeautifulSoup has already unescaped the HTML entities, so no separate unescaping step is needed:

# keep only lines longer than 3 characters, which drops the blank lines and the lone separator
formatted_lines = [line.strip() for line in lines if len(line) > 3]

You will be left with the data you want:

print formatted_lines[0]
#2 rooms
print formatted_lines[1]
#USD 0
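
The length check works for this input but is a bit fragile (a short value such as "USD" alone would be dropped). A variant that filters on content instead, sketched for Python 3 and assuming data holds exactly the string shown in the question; the names fields, rooms and price are illustrative:

```python
data = "\n\t\t2\xa0rooms \n\t\t\xb7\n\t\tUSD\xa00\n\t\t\n"

# Strip each line, then drop blank lines and the lone middle-dot separator.
lines = [line.strip() for line in data.splitlines()]
fields = [line.replace("\xa0", " ") for line in lines if line not in ("", "\xb7")]

rooms, price = fields
print(rooms)
print(price)
```

Replacing \xa0 with a plain space is optional; it only matters if the values are later compared against strings typed with ordinary spaces.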