BeautifulSoup：提取锚文本标签(BeautifulSoup: extract text

我想提取：

从下面的文字SRC image标签和
锚标签的文本是里面div类数据

我成功设法提取IMG SRC，但我有麻烦提取锚标记上的文字。

<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>

这里是链接整个HTML页面。

这里是我的代码：

for div in soup.findAll('div', attrs={'class':'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class':'data'}):
        for a in data.findAll('a', attrs={'class':'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']

我所试图做的是提取图像SRC（链接），里面的标题div class=data ，因此，例如：

 <a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>

应提取：

Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)

Answer 1:

这将帮助：

from bs4 import BeautifulSoup

data = '''<div class="image">
        <a href="http://www.example.com/eg1">Content1<img  
        src="http://image.example.com/img1.jpg" /></a>
        </div>
        <div class="image">
        <a href="http://www.example.com/eg2">Content2<img  
        src="http://image.example.com/img2.jpg" /> </a>
        </div>'''

soup = BeautifulSoup(data)

for div in soup.findAll('div', attrs={'class':'image'}):
    print(div.find('a')['href'])
    print(div.find('a').contents[0])
    print(div.find('img')['src'])

如果您正在寻找进入亚马逊产品，那么你应该使用官方的API。至少有一个Python包，这将缓解你刮的问题和使用条款中保留您的活动。

Answer 2:

在我的情况下，它的工作这样的：

from BeautifulSoup import BeautifulSoup as bs

url="http://blabla.com"

soup = bs(urllib.urlopen(url))
for link in soup.findAll('a'):
        print link.string

希望能帮助到你！

Answer 3:

我建议去了LXML路线和使用XPath。

from lxml import etree
# data is the variable containing the html
data = etree.HTML(data)
anchor = data.xpath('//a[@class="title"]/text()')

Answer 4:

以上所有的答案真的帮我建立我的答案，因为这样我投所有的答案，其他用户把它拿出来：但我终于凑齐了我自己的答案，我正在处理确切的问题：

由于问题明确规定，我不得不进入一些兄弟姐妹及其子女在DOM结构：该解决方案将遍历图像中的DOM结构，并使用产品标题构建映像名称和图像保存到本地目录。

import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
from BeautifulSoup import BeautifulSoup as bs
import requests

def getImages(url):
    #Download the images
    r = requests.get(url)
    html = r.text
    soup = bs(html)
    output_folder = '~/amazon'
    #extracting the images that in div(s)
    for div in soup.findAll('div', attrs={'class':'image'}):
        modified_file_name = None
        try:
            #getting the data div using findNext
            nextDiv =  div.findNext('div', attrs={'class':'data'})
            #use findNext again on previous object to get to the anchor tag
            fileName = nextDiv.findNext('a').text
            modified_file_name = fileName.replace(' ','-') + '.jpg'
        except TypeError:
            print 'skip'
        imageUrl = div.find('img')['src']
        outputPath = os.path.join(output_folder, modified_file_name)
        urlretrieve(imageUrl, outputPath)

if __name__=='__main__':
    url = r'http://www.amazon.com/s/ref=sr_pg_1?rh=n%3A172282%2Ck%3Adigital+camera&keywords=digital+camera&ie=UTF8&qid=1343600585'
    getImages(url)

Answer 5:

>>> txt = '<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> '
>>> fragment = bs4.BeautifulSoup(txt)
>>> fragment
<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 
>>> fragment.find('a', {'class': 'title'})
<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
>>> fragment.find('a', {'class': 'title'}).string
u'Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)'

文章来源: BeautifulSoup: extract text from anchor tag