I am trying to extract the dates from <li> tags and store them in an Excel file. The tags look like this:
<li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>
Code:
import urllib2
import os
from datetime import datetime
import re
os.environ["LANG"]="en_US.UTF-8"
from bs4 import BeautifulSoup
page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
li = soup.find_all("li")
count = 0
while count < len(li):
    soup = BeautifulSoup(li[count])
    date_string, rest = soup.li.text.split(':', 1)
    print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
    count += 1
Error:
Traceback (most recent call last):
File "C:\Users\sony\Desktop\Trash\Crawler Try\trytest.py", line 13, in <module>
soup =BeautifulSoup(li[count])
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 161, in __init__
markup = markup.read()
TypeError: 'NoneType' object is not callable
[Finished in 4.0s with exit code 1]
I also don't know how to write each piece of extracted text to Excel, so I haven't included that code here. See my related question: Web crawler to extract in between the list
First, the TypeError happens because you pass a Tag object to the BeautifulSoup() constructor, which expects a string or a file-like object; there is no need to re-parse each item, since you can call .text on the Tag directly. The bigger problem is that there are irrelevant li tags on the page that don't contain the data you need.
Be more specific. For example, if you want to get the list of events from the "20th century" section, first find the section header and then get the list of events from its parent's following ul sibling. Also, not every item in the list has a date in the %B %d, %Y format, so you need to handle those with a try/except block:
import urllib2
from datetime import datetime
from bs4 import BeautifulSoup
page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
events = soup.find('span', id='20th_century').parent.find_next_sibling('ul')
for event in events.find_all('li'):
    try:
        date_string, rest = event.text.split(':', 1)
        print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
    except ValueError:
        print event.text
Prints:
19/09/1902
30/12/1903
11/01/1908
24/12/1913
23/10/1942
09/03/1946
1954 500-800 killed at Kumbha Mela, Allahabad.
01/01/1956
02/01/1971
03/12/1979
20/10/1982
29/05/1985
13/03/1988
20/08/1988
Updated version (getting all ul groups under a century):
events = soup.find('span', id='20th_century').parent.find_next_siblings()
for tag in events:
    if tag.name == 'h2':
        break
    for event in tag.find_all('li'):
        try:
            date_string, rest = event.text.split(':', 1)
            print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
        except ValueError:
            print event.text
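As for writing the extracted dates to an Excel file: here is a minimal sketch, assuming the third-party xlwt package is installed (it writes legacy .xls files and works with Python 2); the sheet and file names are just placeholders:

import urllib2
from datetime import datetime

import xlwt  # pip install xlwt
from bs4 import BeautifulSoup

page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)

workbook = xlwt.Workbook()
sheet = workbook.add_sheet('Stampedes')  # placeholder sheet name

row = 0
events = soup.find('span', id='20th_century').parent.find_next_sibling('ul')
for event in events.find_all('li'):
    try:
        date_string, rest = event.text.split(':', 1)
        date = datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
    except ValueError:
        # no parseable date - keep the raw text so the row is not lost
        date, rest = event.text, ''
    sheet.write(row, 0, date)          # column A: date (or raw text)
    sheet.write(row, 1, rest.strip())  # column B: rest of the description
    row += 1

workbook.save('stampedes.xls')  # placeholder file name

If you need .xlsx output instead, openpyxl or XlsxWriter would be the usual choices; the writing logic stays the same.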