Using BeautifulSoup to search html for string

I am using BeautifulSoup to look for user entered strings on a specific page. For example, I want to see if the string 'Python' is located on the page: http://python.org

When I used: find_string = soup.body.findAll(text='Python') find_string returned []

But when I used: find_string = soup.body.findAll(text=re.compile('Python'), limit=1) find_string returned [u'Python Jobs'] as expected

What is the difference between these two statements that makes the second statement work when there are more than one instances of the word to be searched

标签： python beautifulsoup

3条回答

▲ chillily

2楼-- · 2020-01-27 02:38

text='Python' searches for elements that have the exact text you provided:

import re
from BeautifulSoup import BeautifulSoup

html = """<p>exact text</p>
   <p>almost exact text</p>"""
soup = BeautifulSoup(html)
print soup(text='exact text')
print soup(text=re.compile('exact text'))

Output

[u'exact text']
[u'exact text', u'almost exact text']

"To see if the string 'Python' is located on the page http://python.org":

import urllib2
html = urllib2.urlopen('http://python.org').read()
print 'Python' in html # -> True

If you need to find a position of substring within a string you could do html.find('Python').

0人赞添加讨论(0) 举报

smile是对你的礼貌

3楼-- · 2020-01-27 02:39

I have not used BeuatifulSoup but maybe the following can help in some tiny way.

import re
import urllib2
stuff = urllib2.urlopen(your_url_goes_here).read()  # stuff will contain the *entire* page

# Replace the string Python with your desired regex
results = re.findall('(Python)',stuff)

for i in results:
    print i

I'm not suggesting this is a replacement but maybe you can glean some value in the concept until a direct answer comes along.

0人赞添加讨论(0) 举报

Evening l夕情丶

4楼-- · 2020-01-27 02:53

The following line is looking for the exact NavigableString 'Python':

>>> soup.body.findAll(text='Python')
[]

Note that the following NavigableString is found:

>>> soup.body.findAll(text='Python Jobs') 
[u'Python Jobs']

Note this behaviour:

>>> import re
>>> soup.body.findAll(text=re.compile('^Python$'))
[]

So your regexp is looking for an occurrence of 'Python' not the exact match to the NavigableString 'Python'.

0人赞添加讨论(0) 举报

Using BeautifulSoup to search html for string

Output

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间