Web crawler to extract from list elements

Posted 2020-05-09 01:50

Question:

I am trying to extract the dates from <li> tags and store them in an Excel file.

<li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>

Code:

import urllib2
import os 
from datetime import datetime
import re
os.environ["LANG"]="en_US.UTF-8"
from bs4 import BeautifulSoup

page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
li =soup.find_all("li")
count = 0
while count < len(li):
   soup = BeautifulSoup(li[count])
   date_string, rest = soup.li.text.split(':', 1)
   print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
   count+=1

Error:

Traceback (most recent call last):
  File "C:\Users\sony\Desktop\Trash\Crawler Try\trytest.py", line 13, in <module>
    soup =BeautifulSoup(li[count])
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 161, in __init__
    markup = markup.read()
TypeError: 'NoneType' object is not callable
[Finished in 4.0s with exit code 1]

I also don't know how to write each extracted text to Excel, so that part is not included in the code. Related question: Web crawler to extract in between the list

Answer 1:

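There are two problems. The traceback comes from passing a Tag to the BeautifulSoup constructor: each element returned by find_all is already a parsed Tag, so you can read its text directly instead of re-parsing it. Beyond that, the page contains many irrelevant li tags that don't hold the data you need.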

Be more specific. For example, if you want the list of events from the "20th century", first find the header, then get the list of events from its parent's following ul sibling. Also, not every item in the list has its date in the %B %d, %Y format, so you need to handle those items with a try/except block:

import urllib2
from datetime import datetime
from bs4 import BeautifulSoup


page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)

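# The span with this id sits inside the section heading; the event list is the next ul after that heading.
events = soup.find('span', id='20th_century').parent.find_next_sibling('ul')
for event in events.find_all('li'):
    try:
        # Split off the text before the first colon and parse it as a date.
        date_string, rest = event.text.split(':', 1)
        print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
    except ValueError:
        # No colon, or the prefix is not in "Month day, Year" form: print the raw text.
        print event.text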

Prints:

19/09/1902
30/12/1903
11/01/1908
24/12/1913
23/10/1942
09/03/1946
1954 500-800 killed at Kumbha Mela, Allahabad.
01/01/1956
02/01/1971
03/12/1979
20/10/1982
29/05/1985
13/03/1988
20/08/1988
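
The fallback for the 1954 entry can be checked without fetching the page. Below is a minimal sketch of the same parsing step, assuming a hypothetical helper named parse_event_date that returns None instead of printing the raw text. Both failure modes raise ValueError: unpacking fails when the text has no colon, and strptime fails when the prefix is not in the "Month day, Year" form. The strip() call is an addition that is not in the code above, and print() is written so the sketch also runs under Python 3.

```python
from datetime import datetime


def parse_event_date(text):
    """Return the leading date of a list item as dd/mm/YYYY, or None.

    Hypothetical helper for illustration; not part of the original answer.
    """
    try:
        # Unpacking raises ValueError when the text contains no colon.
        date_string, rest = text.split(':', 1)
        # strptime raises ValueError when the prefix is not "Month day, Year".
        parsed = datetime.strptime(date_string.strip(), '%B %d, %Y')
        return parsed.strftime('%d/%m/%Y')
    except ValueError:
        return None


samples = [
    "January 13, 1991: At least 40 people",
    "1954 500-800 killed at Kumbha Mela, Allahabad.",
]
for sample in samples:
    print(parse_event_date(sample))
```

The first sample yields 13/01/1991 and the second yields None, which the caller can use to decide whether to keep the raw text.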

Updated version (getting all ul groups under a century):

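# Walk every sibling that follows the "20th century" heading.
events = soup.find('span', id='20th_century').parent.find_next_siblings()
for tag in events:
    # The next h2 marks the start of the following section, so stop there.
    if tag.name == 'h2':
        break
    for event in tag.find_all('li'):
        try:
            date_string, rest = event.text.split(':', 1)
            print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
        except ValueError:
            # No colon, or a date in another format: print the raw text.
            print event.text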
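
The question also asks how to store the extracted dates in Excel, which the code above does not cover. One simple option is a CSV file, which Excel opens directly. This is a rough sketch and not part of the original answer: the function name write_events_csv and the column names are assumptions chosen for illustration. It is written for Python 3; under Python 2 the csv module expects the file to be opened in 'wb' mode instead.

```python
import csv
import io


def write_events_csv(rows, path):
    """Write (date, description) pairs to a CSV file that Excel can open.

    Hypothetical helper for illustration; rows is any iterable of 2-tuples.
    """
    # newline='' lets the csv module control line endings itself.
    with io.open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(['date', 'description'])  # assumed header names
        writer.writerows(rows)
```

To use it, collect (date, text) pairs in a list inside the loop instead of printing them, then call write_events_csv(rows, 'events.csv'), where events.csv is a placeholder file name.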