Web crawler to extract from list elements

I am trying to extract the dates from <li> tags and store them in an Excel file.

<li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>

Code:

import urllib2
import os 
from datetime import datetime
import re
os.environ["LANG"]="en_US.UTF-8"
from bs4 import BeautifulSoup

page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
li =soup.find_all("li")
count = 0
while count < len(li):
   soup = BeautifulSoup(li[count])
   date_string, rest = soup.li.text.split(':', 1)
   print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
   count+=1

Error:

Traceback (most recent call last):
  File "C:\Users\sony\Desktop\Trash\Crawler Try\trytest.py", line 13, in <module>
    soup =BeautifulSoup(li[count])
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 161, in __init__
    markup = markup.read()
TypeError: 'NoneType' object is not callable
[Finished in 4.0s with exit code 1]

I also don't know how to write each extracted date to Excel, so that part is not included in the code. Related question: Web crawler to extract in between the list

Tags: python parsing web-scraping beautifulsoup web-crawler

1 answer

仙女界的扛把子

Reply #2 · 2020-05-09 02:04

The problem is that the page contains irrelevant li tags that don't hold the data you need. The traceback itself comes from passing li[count], which is already a parsed Tag, back into BeautifulSoup(); work with the tag directly instead.

Be more specific. For example, if you want the list of events from the "20th century", first find the header, then get the list of events from its parent's following ul sibling. Also, not every item in the list has the date in the %B %d, %Y format, so you need to handle that with a try/except block:

import urllib2
from datetime import datetime
from bs4 import BeautifulSoup


page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)

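# Find the "20th century" header, then take the first <ul> after its parent heading
events = soup.find('span', id='20th_century').parent.find_next_sibling('ul')
for event in events.find_all('li'):
    try:
        # The text before the first colon should be the date
        date_string, rest = event.text.split(':', 1)
        print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
    except ValueError:
        # No colon, or the date is not in "%B %d, %Y" form: print the item unchanged
        print event.text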

Prints:

19/09/1902
30/12/1903
11/01/1908
24/12/1913
23/10/1942
09/03/1946
1954 500-800 killed at Kumbha Mela, Allahabad.
01/01/1956
02/01/1971
03/12/1979
20/10/1982
29/05/1985
13/03/1988
20/08/1988
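
The date handling can be tried offline, without fetching the page. Below is a minimal sketch, assuming a hypothetical helper named format_event_date (not part of the answer's code) and two sample strings taken from the question and from the output above. It returns None instead of printing when the leading text is not a date in the %B %d, %Y format:

```python
from datetime import datetime


def format_event_date(item_text):
    """Return the leading date of a list item as DD/MM/YYYY, or None."""
    # Everything before the first colon is treated as the date candidate.
    date_string = item_text.split(':', 1)[0].strip()
    try:
        return datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
    except ValueError:
        # Items such as "1954 500-800 killed ..." do not match the format.
        return None


# Sample strings copied from the question and from the printed output.
samples = [
    "January 13, 1991: At least 40 people",
    "1954 500-800 killed at Kumbha Mela, Allahabad.",
]
for text in samples:
    # Fall back to the raw text when no date could be parsed.
    print(format_event_date(text) or text)
```

Returning None rather than printing inside the helper leaves it to the caller to decide what to do with items that have no parseable date.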

Updated version (getting all ul groups under a century):

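# Take every sibling after the "20th century" heading, not just the first <ul>
events = soup.find('span', id='20th_century').parent.find_next_siblings()
for tag in events:
    # The next <h2> marks the start of the following section
    if tag.name == 'h2':
        break
    for event in tag.find_all('li'):
        try:
            date_string, rest = event.text.split(':', 1)
            print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
        except ValueError:
            # No colon, or the date is not in "%B %d, %Y" form
            print event.text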
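
The question also asks about storing the dates in Excel, which the code above does not cover. One simple option is to write a CSV file, which Excel opens directly. This is only a sketch using the standard csv module, written for Python 3 (unlike the Python 2 snippets above); the function name write_events_csv, the column names, and the sample row are assumptions made for illustration:

```python
import csv
import io


def write_events_csv(rows, fileobj):
    """Write (date, description) pairs to fileobj as CSV with a header row."""
    writer = csv.writer(fileobj)
    writer.writerow(["date", "description"])
    for date, description in rows:
        writer.writerow([date, description])


# Hypothetical row; in practice the rows would be collected in the loop above.
rows = [("13/01/1991", "At least 40 people")]

# An in-memory buffer stands in for a real file in this example.
buffer = io.StringIO()
write_events_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("events.csv", "w", newline="") in Python 3; the file name here is just a placeholder.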
