I looked at previous similar questions and got only more confused.
In python 3.4, I want to read an html page as a string, given the url.
In perl I do this with LWP::Simple, using get().
A matplotlib 1.3.1 example says: import urllib; u1=urllib.urlretrieve(url).
python3 can't find urlretrieve.
I tried u1 = urllib.request.urlopen(url), which appears to get an HTTPResponse object, but I can't print it or get a length on it or index it.
u1.body doesn't exist. I can't find a description of the HTTPResponse in python3.
Is there an attribute in the HTTPResponse object which will give me the raw bytes of the html page?
(Irrelevant stuff from other questions includes urllib2, which doesn't exist in my python, csv parsers, etc.)
Edit:
I found something in a prior question which partially (mostly) does the job:
import urllib.request
u2 = urllib.request.urlopen('http://finance.yahoo.com/q?s=aapl&ql=1')
for lines in u2.readlines():
    print(lines)
I say 'partially' because I don't want to read separate lines, but just one big string.
I could just concatenate the lines, but every line printed has a character 'b' prepended to it.
Where does that come from?
Again, I suppose I could delete the first character before concatenating, but that does get to be a kludge.
urllib.request.urlopen(url).read() should return the raw HTML page as bytes; call .decode() on the result to get a string. You could also try the 'requests' module, it's a lot simpler.
More info here: http://docs.python-requests.org/en/master/
Reading an html page with urllib is fairly simple to do. Since you want to read it as a single string I will show you.
Import urllib.request:
Prepare our request
Always use a "try/except" when requesting a web page as things can easily go wrong. urlopen() requests the page.
type() is a handy built-in function that tells us what type a variable is. Here, response is an http.client.HTTPResponse object.
The read function for our response object will store the html as bytes to our variable. Again type() will verify this.
Now we use the decode function for our bytes variable to get a single string.
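The steps above can be put together roughly like this. It is only a sketch: read_page and default_encoding are names made up for illustration, and falling back to UTF-8 when the server does not declare a charset is an assumption that will not suit every page.

```python
import urllib.error
import urllib.request


def read_page(url, default_encoding="utf-8"):
    """Return the page at `url` as one str, or None if the request fails.

    `read_page` and `default_encoding` are illustrative names, not urllib API.
    """
    try:
        # urlopen() requests the page; for http(s) URLs the result is an
        # http.client.HTTPResponse object.
        with urllib.request.urlopen(url) as response:
            raw = response.read()  # the whole body, as bytes
            # Prefer the charset from the Content-Type header, if there is one.
            charset = response.headers.get_content_charset() or default_encoding
    except urllib.error.URLError as err:
        print("Request failed:", err)
        return None
    # decode() turns the bytes into a single string.
    return raw.decode(charset, errors="replace")
```

Calling html = read_page(url) should then give one big string for a reachable page, and None if the request could not be made.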
If you do want to split up this string into separate lines, you can do so with the split() function. In this form we can easily iterate through to print out the entire page or do any other processing.
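For the splitting step, here is a small sketch on a made-up string (the html value below is sample data, not a real page). I used splitlines() rather than split("\n") because it also copes with "\r\n" line endings and does not leave an empty string at the end.

```python
# Sample data standing in for a decoded page.
html = "<html>\n<body>\r\n<p>hi</p>\n</body>\n</html>\n"

# splitlines() handles both "\n" and "\r\n" and drops the line endings.
lines = html.splitlines()

for number, line in enumerate(lines, start=1):
    print(number, line)

# split("\n") would keep the "\r" and add a trailing empty string.
naive = html.split("\n")
```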
Hopefully this provides a slightly more detailed answer. The Python documentation and tutorials are great; I would use them as a reference because they will answer most questions you might have.
This will work similarly to urllib.urlopen. Note that Python3 does not read the html code as a string but as a bytes object, so you need to convert it to a string with decode.
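This is also where the 'b' in the question comes from: it is not a character in the data, just the way Python prints a bytes object. A quick sketch with made-up sample bytes, assuming the content is UTF-8:

```python
# Sample bytes standing in for what read() or readlines() returns.
raw = b"<p>caf\xc3\xa9</p>"
print(raw)  # printed form starts with b'...' because raw is bytes

# decode() gives a normal str; no 'b' prefix and no need to strip anything.
text = raw.decode("utf-8")
print(text)

# If you already have separate lines as bytes, join first, then decode once.
chunks = [b"<html>\n", b"<body>\n", b"</body>\n", b"</html>\n"]
page = b"".join(chunks).decode("utf-8")
```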