Extracting and parsing HTML from a secure website

Posted 2019-05-17 18:04

Question:

Let's dive into this, shall we?

OK, I need to write a script (I don't care what language; I'd prefer something like Python or JavaScript, but I will take the time to learn whatever works). The script will access multiple URLs, extract the text from each site, and store it in a folder on my PC. (From there I am manipulating the data with Python, which I know how to do.)

EDIT: Currently I am using Python's NLTK module. Here is a simple version of my code:

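import nltk
from urllib2 import urlopen  # Python 2; on Python 3 use urllib.request instead

url = "<URL HERE>"
html = urlopen(url).read()   # download the raw HTML
raw = nltk.clean_html(html)  # strip the markup (NLTK 2.x only; removed in NLTK 3)
print(raw)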

This code works fine for both HTTP and HTTPS, but it fails when the site requires authentication.

Is there a Python module which deals with secure authentication?

Thanks in advance for the help! And to the mods who will view this as a bad question, please just give me ways to make it better. I need ideas... from people, not Google.

Answer 1:

Mechanize is one option; the other is to do it directly with urllib2.
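For the urllib2 route, something along these lines might work. This is only a rough sketch, and it assumes the site uses HTTP Basic authentication; a site that logs you in through an HTML form needs a different approach (that is the kind of case Mechanize is aimed at). It is written against urllib.request, the Python 3 successor to urllib2. The helper names build_auth_opener and fetch_html are made up for illustration, not part of any library.

```python
import urllib.request


def build_auth_opener(top_level_url, username, password):
    """Return an opener that answers HTTP Basic auth challenges.

    Hypothetical helper: credentials apply to every URL under top_level_url.
    """
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # A realm of None means "use these credentials for any realm at this URL".
    password_mgr.add_password(None, top_level_url, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)


def fetch_html(opener, url, timeout=30):
    """Fetch one page through the opener and return it as decoded text.

    Hypothetical helper: falls back to UTF-8 if the server names no charset.
    """
    with opener.open(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

You would then call something like opener = build_auth_opener("https://example.com/", "my_user", "my_password") once, followed by fetch_html(opener, page_url) for each page (the URL and credentials here are placeholders). The returned HTML can go through whatever text extraction step you already use before being written to your folder.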