Question:
I am using BeautifulSoup to scrape a URL, and I have the following code:
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td',attrs={'class':'empformbody'})
In the above code I can use findAll to get tags and the information related to them, but I want to use XPath instead. Is it possible to use XPath with BeautifulSoup? If so, could anyone please provide example code?
Answer 1:
Nope, BeautifulSoup, by itself, does not support XPath expressions.
An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup-compatible mode where it will try to parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster.
Once you've parsed your document into an lxml tree, you can use the .xpath()
method to search for elements.
import urllib2
from lxml import etree
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
# xpathselector is a placeholder for your own XPath expression,
# e.g. '//td[@class="empformbody"]'
tree.xpath(xpathselector)
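For the OP's td lookup, a concrete call could look like the sketch below (assuming the page really contains td elements with class empformbody):
cells = tree.xpath('//td[@class="empformbody"]')
for cell in cells:
    # itertext() walks the cell and its children and yields all text nodes
    print("".join(cell.itertext()))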
Of possible interest to you is the CSS Selector support; the CSSSelector
class translates CSS statements into XPath expressions, making your search for td.empformbody
that much easier:
from lxml.cssselect import CSSSelector
td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    print("".join(elem.itertext()))
Coming full circle: BeautifulSoup itself does have pretty decent CSS selector support:
for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells (requires BeautifulSoup 4).
    print(cell.get_text(strip=True))
Answer 2:
I can confirm that there is no XPath support within Beautiful Soup.
Answer 3:
Martijn's code no longer works properly (it is 4+ years old by now...); the etree.parse() line prints to the console and doesn't assign the value to the tree variable. Referencing this, I was able to figure out that the following works using requests and lxml:
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')
print 'Buyers: ', buyers
print 'Prices: ', prices
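For comparison with the original question, the same two lookups can also be written with BeautifulSoup's CSS selectors instead of XPath; a minimal sketch, assuming bs4 is installed and the same example page is used:
from bs4 import BeautifulSoup
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
soup = BeautifulSoup(page.content, 'html.parser')
# attribute selector, equivalent to //div[@title="buyer-name"]/text()
buyers = [div.get_text() for div in soup.select('div[title="buyer-name"]')]
# class selector, equivalent to //span[@class="item-price"]/text()
prices = [span.get_text() for span in soup.select('span.item-price')]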
Answer 4:
BeautifulSoup has a function named findNext that searches forward from the current element, so:
father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a')
The code above roughly imitates the following XPath:
//div[@class="class_value"]//div[@id="id_value"]//a
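A minimal sketch of that chain on a made-up snippet (the element names, class and id values are placeholders, not from the original question):
from bs4 import BeautifulSoup
html_doc = """
<h1>Listing</h1>
<div class="class_value">
  <div id="id_value">
    <a href="/first">first</a>
    <a href="/second">second</a>
  </div>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
father = soup.find('h1')
# each findNext searches forward in document order from the previous match
links = father.findNext('div', {'class': 'class_value'}).findNext('div', {'id': 'id_value'}).findAll('a')
for a in links:
    print(a['href'])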
Answer 5:
I've searched through their docs and it seems there is no XPath option. Also, as you can see here on a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be: no, there is no XPath parsing available.
Answer 6:
This is a pretty old thread, but there is a work-around solution now, which may not have been in BeautifulSoup at the time.
Here is an example of what I did. I use the "requests" module to read an RSS feed and get its text content in a variable called "rss_text". With that, I run it through BeautifulSoup, search for the XPath /rss/channel/title, and retrieve its contents. It's not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.
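A minimal sketch of that first step with requests; the feed URL is a placeholder, not from the original answer:
import requests
# Placeholder URL; substitute the feed you actually want to read.
rss_text = requests.get('https://example.com/feed.xml').text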
from bs4 import BeautifulSoup
# The 'xml' parser requires lxml to be installed.
rss_obj = BeautifulSoup(rss_text, 'xml')
# Follow the tag names like a basic path: /rss/channel/title
# ('cls' here is from the answerer's own class context.)
cls.title = rss_obj.rss.channel.title.get_text()
Answer 7:
With lxml everything is simple:
import lxml.html
# html holds the page source as a string
tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')
but with BeautifulSoup (BS4) it is simple too:
- first, remove "//" and "@"
- second, add a star before "="
try this magic:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select('a[class*="shared-components"]')
As you can see, this does not support the attribute step, so I removed the "/@href" part.
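To still get the href values with BS4, a small follow-up sketch reading the attribute off each matched tag:
for a in soup.select('a[class*="shared-components"]'):
    print(a.get('href'))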