Extract links for certain section only from Blogspot

I am trying to extract the links from one particular section of a Blogspot page. But the output shows that the code extracts every link on the page.

Here is the code:

import urlparse
import urllib
from bs4 import BeautifulSoup

url = "http://ellywonderland.blogspot.com/"

urls = [url]
visited = [url]

while len(urls) > 0:
    try:
        htmltext = urllib.urlopen(urls[0]).read()
    except:
        print urls[0]

    soup = BeautifulSoup(htmltext)

    urls.pop(0)
    print len(urls)

    for tags in soup.find_all(attrs={'class': "post-title entry-title"}):
        for tag in soup.findAll('a', href=True):
            tag['href'] = urlparse.urljoin(url, tag['href'])
            if url in tag['href'] and tag['href'] not in visited:
                urls.append(tag['href'])
                visited.append(tag['href'])

print visited

Here is the HTML for the section that I want to extract:

<h3 class="post-title entry-title" itemprop="name">
<a href="http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html">Pre-wedding * Vintage*</a>

Thank you.

Tags: python beautifulsoup web-crawler

2 Answers

霸刀☆藐视天下

Reply #2 · 2019-02-28 08:43

You need to call .get on the tag object:

print Objecta.get('href')

Example from http://www.crummy.com/software/BeautifulSoup/bs4/doc/:

for link in soup.find_all('a'):
    print(link.get('href'))
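
For the question as asked, the inner loop calls soup.findAll, which searches the whole page again for every heading. Searching from each matched heading instead keeps the results to that section. Here is a minimal sketch of that idea, assuming Python 3 and bs4 with the built-in "html.parser"; SAMPLE_HTML and extract_post_links are names made up for this example, and the inline HTML only stands in for the downloaded page:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up sample modelled on the snippet in the question; the sidebar link
# and the relative href are assumptions added to show the filtering.
SAMPLE_HTML = """
<div class="sidebar"><a href="/p/about.html">About</a></div>
<h3 class="post-title entry-title" itemprop="name">
<a href="http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html">Pre-wedding * Vintage*</a>
</h3>
<h3 class="post-title entry-title" itemprop="name">
<a href="/2010/12/kawin.html">Kawin</a>
</h3>
"""


def extract_post_links(html, base_url):
    """Return absolute hrefs of <a> tags found inside post-title headings only."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # select() matches h3 elements that carry both classes.
    for heading in soup.select("h3.post-title.entry-title"):
        # Search within the heading, not the whole soup.
        for anchor in heading.find_all("a", href=True):
            link = urljoin(base_url, anchor.get("href"))
            if link not in links:
                links.append(link)
    return links


post_links = extract_post_links(SAMPLE_HTML, "http://ellywonderland.blogspot.com/")
for link in post_links:
    print(link)
```

The sidebar link is skipped because it does not sit inside a post-title heading. In the crawler from the question, the same change would be to call find_all on the heading variable (tags) rather than on soup.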


Root（大扎）

Reply #3 · 2019-02-28 09:00

If you don't necessarily need to use BeautifulSoup, I think it would be easier to do something like this:

import feedparser

url = feedparser.parse('http://ellywonderland.blogspot.com/feeds/posts/default?alt=rss')
for x in url.entries:
    print str(x.link)

Output:

http://ellywonderland.blogspot.com/2011/03/my-vintage-pre-wedding.html
http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html
http://ellywonderland.blogspot.com/2010/12/tissue-paper-flower-crepe-paper.html
http://ellywonderland.blogspot.com/2010/12/menguap-menurut-islam.html
http://ellywonderland.blogspot.com/2010/12/weddings-idea.html
http://ellywonderland.blogspot.com/2010/12/kawin.html
http://ellywonderland.blogspot.com/2010/11/vitamin-c-collagen.html
http://ellywonderland.blogspot.com/2010/11/port-dickson.html
http://ellywonderland.blogspot.com/2010/11/ellys-world.html

feedparser can parse the RSS feed of the blogspot page and can return the data you want, in this case the href for the post titles.
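
If installing feedparser is not an option, the same idea can be sketched with the standard library only. This is a swapped-in technique, not feedparser itself: it assumes Python 3 and a plain RSS 2.0 layout where each post is an item element with a link child. SAMPLE_RSS and extract_item_links are made-up names, and the inline feed stands in for the downloaded one:

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample; a real feed would be downloaded first.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample blog</title>
    <link>http://ellywonderland.blogspot.com/</link>
    <item>
      <title>Pre-wedding * Vintage*</title>
      <link>http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html</link>
    </item>
    <item>
      <title>Kawin</title>
      <link>http://ellywonderland.blogspot.com/2010/12/kawin.html</link>
    </item>
  </channel>
</rss>
"""


def extract_item_links(rss_text):
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    links = []
    # iter("item") visits post entries only, so the channel-level
    # <link> (the blog home page) is not collected.
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


item_links = extract_item_links(SAMPLE_RSS)
for link in item_links:
    print(link)
```

feedparser remains the more forgiving choice for real feeds, since it copes with several feed formats and malformed XML, which this sketch does not attempt.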
