How to get an HTML file using Python?

2019-02-03 05:07发布

问题:

I am not very familiar with Python. I am trying to extract the artist names (for a start :)) from the following page: http://www.infolanka.com/miyuru_gee/art/art.html.

How do I retrieve the page? My two main concerns are; what functions to use and how to filter out useless links from the page?

回答1:

Example using urlib and lxml.html:

import urllib
from lxml import html

url = "http://www.infolanka.com/miyuru_gee/art/art.html"
page = html.fromstring(urllib.urlopen(url).read())

for link in page.xpath("//a"):
    print "Name", link.text, "URL", link.get("href")

output >>
    [('Aathma Liyanage', 'athma.html'),
     ('Abewardhana Balasuriya', 'abewardhana.html'),
     ('Aelian Thilakeratne', 'aelian_thi.html'),
     ('Ahamed Mohideen', 'ahamed.html'),
    ]


回答2:

I think "eyquem" way would be my choice too, but I like to use httplib2 instead of urllib. urllib2 is too low level lib for this work.

import httplib2, re
pat = re.compile('<DT><a href="[^"]+">(.+?)</a>') http = httplib2.Http() headers, body = http.request("http://www.infolanka.com/miyuru_gee/art/art.html")
li = pat.findall(body) print li



回答3:

  1. Use urllib2 to get the page.

  2. Use BeautifulSoup to parse the HTML (the page) and get what you want!



回答4:

Check this my friend

import urllib.request

import re

pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')

url = 'http://www.infolanka.com/miyuru_gee/art/art.html'

sock = urllib.request.urlopen(url).read().decode("utf-8")

li = pat.findall(sock)

print(li)


回答5:

Or go straight forward:

import urllib

import re
pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')

url = 'http://www.infolanka.com/miyuru_gee/art/art.html'
sock = urllib.urlopen(url)
li = pat.findall(sock.read())
sock.close()

print li


回答6:

And respect robots.txt and throttle your requests :)

(Apparently urllib2 does already according to this helpful SO post).



回答7:

Basically, there's a function call:

render_template()

You can easly return single page or list of pages with it and it reads all files automaticaly from a your_workspace\templates .

Example:

/root_dir /templates /index1.html, /index2.html /other_dir /

routes.py

@app.route('/') def root_dir(): return render_template('index1.html')

@app.route(/<username>) def root_dir_with_params(username): retun render_template('index2.html', user=username)

index1.html - without params

<html> <body> <h1>Hello guest!</h1> <button id="getData">Get Data!</button> </body> </html>

index2.html - with params

<html> <body> <!-- Built-it conditional functions in the framework templates in Flask --> {% if name %} <h1 style="color: red;">Hello {{ user }}!</h1> {% else %} <h1>Hello guest.</1> <button id="getData">Get Data!</button> </body> </html>