I am trying to extract links from only a certain section of a Blogspot page, but the output shows that the code extracts every link on the page.
Here is the code:
    import urlparse
    import urllib
    from bs4 import BeautifulSoup

    url = "http://ellywonderland.blogspot.com/"
    urls = [url]
    visited = [url]

    while len(urls) > 0:
        try:
            htmltext = urllib.urlopen(urls[0]).read()
        except:
            print urls[0]
        soup = BeautifulSoup(htmltext)
        urls.pop(0)
        print len(urls)
        for tags in soup.find_all(attrs={'class': "post-title entry-title"}):
            for tag in soup.findAll('a', href=True):
                tag['href'] = urlparse.urljoin(url, tag['href'])
                if url in tag['href'] and tag['href'] not in visited:
                    urls.append(tag['href'])
                    visited.append(tag['href'])
    print visited
Here is the HTML for the section that I want to extract:
<h3 class="post-title entry-title" itemprop="name">
<a href="http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html">Pre-wedding * Vintage*</a>
Thank you.
You need to call .get on the tag object:
print Objecta.get('href')
Example from http://www.crummy.com/software/BeautifulSoup/bs4/doc/:
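The quick-start in that documentation loops over `soup.find_all('a')` and prints `link.get('href')` for each tag. Below is a minimal sketch of the same idea applied to the markup from the question, written for Python 3 and using an inline HTML string in place of the downloaded page. The helper name `post_title_links`, the `SAMPLE_HTML` string, and the extra sidebar link are made up for illustration. The important detail is that the inner `find_all` is called on each matched heading rather than on the whole `soup`; searching `soup` again is what makes the original loop pick up every link on the page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: one post title heading plus an unrelated sidebar link.
SAMPLE_HTML = """
<h3 class="post-title entry-title" itemprop="name">
<a href="http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html">Pre-wedding * Vintage*</a>
</h3>
<div class="sidebar"><a href="http://example.com/other">Other</a></div>
"""


def post_title_links(html):
    """Return the href of every link found inside a post-title heading."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # Match only the headings that carry the post-title class.
    for heading in soup.find_all("h3", class_="post-title"):
        # Search inside the heading, not inside the whole document.
        for anchor in heading.find_all("a", href=True):
            links.append(anchor.get("href"))
    return links


print(post_title_links(SAMPLE_HTML))
```

The sidebar link is skipped because it does not sit inside an `h3` with the `post-title` class.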
If you don't necessarily need to use BeautifulSoup, I think it would be easier to do something like this:
Output:
feedparser can parse the RSS feed of the Blogspot page and return the data you want, in this case the href for the post titles.