Using BeautifulSoup to search html for string

2020-01-27 01:56发布

I am using BeautifulSoup to look for user entered strings on a specific page. For example, I want to see if the string 'Python' is located on the page: http://python.org

When I used: find_string = soup.body.findAll(text='Python') find_string returned []

But when I used: find_string = soup.body.findAll(text=re.compile('Python'), limit=1) find_string returned [u'Python Jobs'] as expected

What is the difference between these two statements that makes the second statement work when there are more than one instances of the word to be searched

3条回答
▲ chillily
2楼-- · 2020-01-27 02:38

text='Python' searches for elements that have the exact text you provided:

import re
from BeautifulSoup import BeautifulSoup

html = """<p>exact text</p>
   <p>almost exact text</p>"""
soup = BeautifulSoup(html)
print soup(text='exact text')
print soup(text=re.compile('exact text'))

Output

[u'exact text']
[u'exact text', u'almost exact text']

"To see if the string 'Python' is located on the page http://python.org":

import urllib2
html = urllib2.urlopen('http://python.org').read()
print 'Python' in html # -> True

If you need to find a position of substring within a string you could do html.find('Python').

查看更多
smile是对你的礼貌
3楼-- · 2020-01-27 02:39

I have not used BeuatifulSoup but maybe the following can help in some tiny way.

import re
import urllib2
stuff = urllib2.urlopen(your_url_goes_here).read()  # stuff will contain the *entire* page

# Replace the string Python with your desired regex
results = re.findall('(Python)',stuff)

for i in results:
    print i

I'm not suggesting this is a replacement but maybe you can glean some value in the concept until a direct answer comes along.

查看更多
Evening l夕情丶
4楼-- · 2020-01-27 02:53

The following line is looking for the exact NavigableString 'Python':

>>> soup.body.findAll(text='Python')
[]

Note that the following NavigableString is found:

>>> soup.body.findAll(text='Python Jobs') 
[u'Python Jobs']

Note this behaviour:

>>> import re
>>> soup.body.findAll(text=re.compile('^Python$'))
[]

So your regexp is looking for an occurrence of 'Python' not the exact match to the NavigableString 'Python'.

查看更多
登录 后发表回答