Making a basic web scraper in Python with only bui

Posted 2019-07-21 17:49

Question:

While learning Python, I'm trying to make a web scraper without any third-party libraries, so that the process isn't simplified for me and I know what I am doing. I looked through several online resources, but all of them have left me confused about certain things.

The HTML looks something like this:

<html>
<head>...</head>
<body>
    *lots of other <div> tags*
<div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal"">
<form class ="subform">...</form>
<div class = "subdiv1" >...</div>
<div class = "subdiv2" >...</div>
    *lots of other <div> tags*
</body>
</html>

I want the scraper to extract the <div class = "want"...>*content*</div> and save that into an HTML file.

I have a very basic idea of how I need to go about this.

import urllib
from urllib import request
#import re
#from html.parser import HTMLParser

response = urllib.request.urlopen("http://website.com")
html = response.read()

# Somehow extract the wanted data

f = open('page.html', 'w')
f.write(data)
f.close()

Answer 1:

The standard library comes with a variety of Structured Markup Processing Tools, which you can use for parsing the HTML and then searching it to extract your div.

There are a lot of choices there. Which one do you use?

html.parser looks like the obvious choice, but I'd actually start with ElementTree instead. It's a very nice and very powerful API, and there's tons of documentation and sample code all over the web to get you started, and a lot of experts using it on a daily basis who can help you with your problems. If it turns out that etree can't parse your HTML, you will have to use something else… but try it first.
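In case etree does reject your page, here is a minimal sketch of the html.parser fallback. The names DivGrabber, grab_div and VOID_TAGS are made up for this illustration, and the depth counting assumes the tags inside the wanted div are properly nested; comments inside the div are dropped.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DivGrabber(HTMLParser):
    """Collect the markup of the first <div> whose class list contains wanted_class."""

    def __init__(self, wanted_class):
        # Keep entities such as &amp; as they are instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.wanted_class = wanted_class
        self.depth = 0       # nesting depth inside the wanted div; 0 means outside
        self.done = False    # True once the wanted div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag == "div" and self.wanted_class in classes:
                self.depth = 1
                self.chunks.append(self.get_starttag_text())
            return
        self.chunks.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied without changing the depth.
        if self.depth > 0:
            self.chunks.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID_TAGS:
            return
        self.chunks.append("</%s>" % tag)
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

    def handle_entityref(self, name):
        if self.depth > 0:
            self.chunks.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth > 0:
            self.chunks.append("&#%s;" % name)


def grab_div(page, wanted_class):
    """Return the wanted div as a string, or None if it was not found or never closed."""
    parser = DivGrabber(wanted_class)
    parser.feed(page)
    parser.close()
    return "".join(parser.chunks) if parser.done else None
```

Calling grab_div(page, "want") on the decoded page text should give you the div as a string even when the page contains things like an unclosed <br>, which an XML parser would refuse. The rest of this answer sticks with ElementTree.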

For example, with a few minor fixes to your snipped HTML so that it's actually valid (and parses with ElementTree), and so there's actually some text worth getting out of your div:

<html>
<head>...</head>
<body>
    *lots of other <div /> tags*
<div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal">spam spam spam
<form class ="subform">...</form>
<div class = "subdiv1" >...</div>
<div class = "subdiv2" >...</div>
    *lots of other <div /> tags*
</div>
</body>
</html>

You can use code like this (I'm assuming you know, or are willing to learn, XPath; note that ElementTree supports only a subset of it):

from xml.etree import ElementTree
# page holds the HTML document shown above, as a string or bytes
tree = ElementTree.fromstring(page)
mydiv = tree.find('.//div[@class="want"]')

Now you've got a reference to the div with class "want". You can get its direct text with this:

print(mydiv.text)
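"Direct text" here means only the text that comes before the first child element. A small sketch of the difference, using a made-up snippet rather than your real page:

```python
from xml.etree import ElementTree

# Hypothetical snippet, just to show where each piece of text ends up.
snippet = '<div class="want">spam spam spam<form class="subform">eggs</form> and ham</div>'
mydiv = ElementTree.fromstring(snippet)

direct_text = mydiv.text               # text before the first child only
after_form = mydiv[0].tail             # text following the <form> child is stored on the child
all_text = "".join(mydiv.itertext())   # text of the div and all its descendants, in document order

print(repr(direct_text))  # 'spam spam spam'
print(repr(after_form))   # ' and ham'
print(repr(all_text))     # 'spam spam spameggs and ham'
```

So if you need all the text inside the div rather than just the leading part, itertext() is probably what you want.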

But if you want to extract the whole subtree, that's even easier:

data = ElementTree.tostring(mydiv, encoding='unicode')  # str rather than bytes, so f.write(data) works
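Putting the steps so far together, a minimal sketch might look like this. The function name extract_div is made up for this example, and returning None on failure is just one possible design choice. Note that the [@class="..."] predicate matches the exact attribute value only, so it will not find class="want big".

```python
from xml.etree import ElementTree


def extract_div(page, wanted_class):
    """Return the markup of the first <div class="wanted_class"> as a str, or None."""
    try:
        tree = ElementTree.fromstring(page)
    except ElementTree.ParseError:
        return None  # the page is not well-formed XML; etree cannot handle it
    mydiv = tree.find('.//div[@class="%s"]' % wanted_class)
    if mydiv is None:
        return None  # no matching div below the root element
    # tostring() also emits the text that follows the closing tag (the "tail"),
    # so drop it to get just the element itself.
    mydiv.tail = None
    # method="html" writes <div></div> instead of <div />, which browsers expect.
    return ElementTree.tostring(mydiv, encoding="unicode", method="html")
```

The result is a str, so it can go straight into the f.write(data) call from your question. If you get None back, the page either did not parse as XML or has no div with exactly that class value.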

If you want to wrap that up in a valid <html> and <body> and/or remove the <div> itself, you'll have to do that part manually. The documentation explains how to build up elements using a simple tree API: you create a head and a body to put in the html, then stick the div in the body, then tostring the html, and that's about it.
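A rough sketch of that wrapping step follows. The function name wrap_in_page and the title text are assumptions made for illustration, not anything your page requires.

```python
from xml.etree import ElementTree


def wrap_in_page(div, title="Extracted content"):
    """Build <html><head/><body/></html> around an existing element and serialize it."""
    html = ElementTree.Element("html")
    head = ElementTree.SubElement(html, "head")
    ElementTree.SubElement(head, "title").text = title  # hypothetical title
    body = ElementTree.SubElement(html, "body")
    body.append(div)  # stick the extracted div into the new body
    return ElementTree.tostring(html, encoding="unicode", method="html")


# Example with a tiny stand-in for the div you found with tree.find(...)
div = ElementTree.fromstring('<div class="want">spam</div>')
print(wrap_in_page(div))
```

To drop the <div> itself and keep only its contents, you could set body.text to the div's text and append each of the div's children to body instead of the div.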