BeautifulSoup 4 + Python: .string returns None

I'm trying to parse some HTML with BeautifulSoup 4 and Python 2.7.6, but .string is returning None. The HTML I'm trying to parse is:

<div class="booker-booking">
    2&nbsp;rooms
    &#0183;
    USD&nbsp;0
    <!-- Commission: USD  -->
</div>

The Python snippet I have is:

data = soup.find('div', class_='booker-booking').string

I've also tried the following two:

data = soup.find('div', class_='booker-booking').text
data = soup.find('div', class_='booker-booking').contents[0]

Both return:

u'\n\t\t2\xa0rooms \n\t\t\xb7\n\t\tUSD\xa00\n\t\t\n'

I'm ultimately trying to get the first line into a variable just saying "2 Rooms", and the third line into another variable just saying "USD 0".

Tags: python parsing html-parsing beautifulsoup

2 Answers

▲ chillily

Reply #2 · 2019-08-07 12:18

.string returns None because the text node is not the only child (there is a comment).

from bs4 import BeautifulSoup, Comment

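# html holds the markup from the question; naming the parser avoids a warning
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', 'booker-booking')
# join every text node except comments
text = " ".join(div.find_all(text=lambda t: not isinstance(t, Comment)))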
# -> u'\n    2\xa0rooms\n    \xb7\n    USD\xa00\n     \n'

To collapse runs of Unicode whitespace (including the non-breaking space \xa0) into single spaces:

text = " ".join(text.split())
# -> u'2 rooms \xb7 USD 0'
print text
# -> 2 rooms · USD 0

To get your final variables:

var1, var2 = [s.strip() for s in text.split(u"\xb7")]
# -> u'2 rooms', u'USD 0'
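
For readers on Python 3, the same clean-up can be sketched with plain string methods. This is a minimal sketch rather than the answer's exact code: the helper name split_booking and the hard-coded raw string (copied from the join output shown above) are assumptions for illustration. It relies on str.split() with no arguments treating the non-breaking space \xa0 as whitespace.

```python
def split_booking(raw, separator="\xb7"):
    """Collapse whitespace in raw, then split it on the middle-dot separator."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # including the non-breaking space (\xa0) and newlines.
    collapsed = " ".join(raw.split())
    # Split on the separator and trim the spaces left around each part.
    return [part.strip() for part in collapsed.split(separator)]


# Hypothetical input: the comment-free text extracted from the div.
raw = "\n    2\xa0rooms\n    \xb7\n    USD\xa00\n     \n"
rooms, price = split_booking(raw)
print(rooms)  # 2 rooms
print(price)  # USD 0
```

Wrapping the two steps in one function keeps the separator configurable in case the page uses a different delimiter.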


We Are One

Reply #3 · 2019-08-07 12:19

After you have run data = soup.find('div', class_='booker-booking').text, you have extracted the data you need from the HTML. Now you just need to format it to get "2 Rooms" and "USD 0". The first step is probably splitting the data by line:

lines = data.split('\n')

This gives [u'', u'\t\t2\xa0rooms ', u'\t\t\xb7', u'\t\tUSD\xa00', u'\t\t', u'']

Now you need to get rid of the whitespace, unescape the HTML characters, and remove the lines that don't have data:

import HTMLParser
h = HTMLParser.HTMLParser()
# keep only lines longer than 3 characters, which drops the blank and separator lines
formatted_lines = [h.unescape(line).strip() for line in lines if len(line) > 3]

You will be left with the data you want:

print formatted_lines[0]
#2 rooms
print formatted_lines[1]
#USD 0
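
The HTMLParser module and the print statement used above exist only in Python 2. A rough Python 3 equivalent might look like the following; it is a sketch under assumptions, not the answer's own code. The helper name clean_lines and the sample data string are made up for illustration, html.unescape is the standard-library replacement for HTMLParser.unescape, and the filter drops the separator line explicitly instead of relying on line length. BeautifulSoup has normally already decoded entities such as &nbsp; by this point (the \xa0 in the output shows that), so the unescape call is kept only to mirror the original steps.

```python
import html


def clean_lines(data, separator="\xb7"):
    """Return the non-empty, non-separator lines of data with whitespace normalized."""
    cleaned = []
    for line in data.split("\n"):
        # Decode any leftover entities, then collapse whitespace
        # (including the non-breaking space \xa0) into single spaces.
        text = " ".join(html.unescape(line).split())
        # Skip blank lines and the lone separator line.
        if text and text != separator:
            cleaned.append(text)
    return cleaned


# Hypothetical input, copied from the .text output shown in the question.
data = "\n\t\t2\xa0rooms \n\t\t\xb7\n\t\tUSD\xa00\n\t\t\n"
formatted_lines = clean_lines(data)
print(formatted_lines[0])  # 2 rooms
print(formatted_lines[1])  # USD 0
```

Unlike the length-based filter, this version also turns the non-breaking space into a regular space, so the results compare equal to ordinary strings such as "2 rooms".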
