I am trying to extract the dates from <li> tags and store them in an Excel file.
<li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>
Code:
import urllib2
import os
from datetime import datetime
import re
os.environ["LANG"]="en_US.UTF-8"
from bs4 import BeautifulSoup
page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
li =soup.find_all("li")
count = 0
while count < len(li):
    soup = BeautifulSoup(li[count])
    date_string, rest = soup.li.text.split(':', 1)
    print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
    count+=1
Error:
Traceback (most recent call last):
  File "C:\Users\sony\Desktop\Trash\Crawler Try\trytest.py", line 13, in <module>
    soup =BeautifulSoup(li[count])
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 161, in __init__
    markup = markup.read()
TypeError: 'NoneType' object is not callable
[Finished in 4.0s with exit code 1]
I also don't know how to write each extracted date to Excel, so I haven't included that part in the code. Related question: Web crawler to extract in between the list
The problem is that there are irrelevant li tags that don't contain the data you need. Be more specific. For example, if you want to get the list of events from the "20th century", first find the header and get the list of events from its parent's following ul sibling. Also, not every item in the list has the date in the %B %d, %Y format, so you need to handle that with a try/except block:
Prints:
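A minimal sketch of the try/except date handling described above, written for Python 3 with only the standard library. The helper names parse_event_date and write_dates_csv and the second and third sample strings are hypothetical, and writing CSV (which Excel can open) rather than a native Excel file is an assumption. Extracting the text of the li tags from the page is left out here.

```python
import csv
import sys
from datetime import datetime

DATE_FORMAT = "%B %d, %Y"  # e.g. "January 13, 1991"


def parse_event_date(item_text):
    """Return the leading date of a list item as dd/mm/YYYY, or None."""
    # Everything before the first colon is expected to be the date.
    date_string = item_text.split(":", 1)[0].strip()
    try:
        return datetime.strptime(date_string, DATE_FORMAT).strftime("%d/%m/%Y")
    except ValueError:
        # Not in the "%B %d, %Y" format (or not a date at all): skip it.
        return None


def write_dates_csv(item_texts, out_file):
    """Write one parsed date per row to an open text file; return the row count."""
    writer = csv.writer(out_file)
    writer.writerow(["date"])
    count = 0
    for text in item_texts:
        date = parse_event_date(text)
        if date is not None:
            writer.writerow([date])
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical sample items standing in for the text of the <li> tags.
    samples = [
        "January 13, 1991: At least 40 people",
        "1993: an item with only a year",
        "An item with no date at all",
    ]
    write_dates_csv(samples, sys.stdout)
```

To save the result instead of printing it, pass a file opened with open("dates.csv", "w", newline="") in place of sys.stdout; the file name is only an example.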
Updated version (getting all ul groups under a century):