I have read a lot of answers about web scraping that mention BeautifulSoup, Scrapy, etc.
Is there a way to do the equivalent of saving a page's source from a web browser?
That is, is there a way in Python to point it at a website and have it save the page's source to a text file using just the standard library?
Here is where I got to:
import urllib
f = open('webpage.txt', 'w')
html = urllib.urlopen("http://www.somewebpage.com")
#somehow save the web page source
f.close()
Not much, I know - but I'm looking for code that actually pulls the source of the page so I can write it out. I gather that urlopen just makes the connection.
Perhaps there is a readlines() equivalent for reading the lines of a web page?
You may try urllib2:
import urllib2
page = urllib2.urlopen('http://stackoverflow.com')
page_content = page.read()
with open('page_content.html', 'w') as fid:
    fid.write(page_content)
Updated code, for Python 3 (where urllib2 is deprecated):
from urllib.request import urlopen
html = urlopen("http://www.google.com/")
with open('page_content.html', 'w') as fid:
    fid.write(html)
The answer from SoHei will not work because it's missing html.read(), and the file must be opened in 'wb' mode instead of just 'w'. The 'b' indicates that the data will be written in binary mode (since .read() returns a sequence of bytes).
The fully working code is:
from urllib.request import urlopen
html = urlopen("http://www.google.com/")
page_content = html.read()
with open('page_content.html', 'wb') as fid:
    fid.write(page_content)
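If you'd rather end up with a text file than raw bytes, another option is to decode the response explicitly. This is a sketch wrapping the approach above in a helper (save_page is my own name, not from the answers); it reads the charset from the Content-Type header via get_content_charset() and falls back to UTF-8 when none is declared:

```python
from urllib.request import urlopen

def save_page(url, path):
    """Fetch a URL and save its decoded source to a text file."""
    with urlopen(url) as resp:
        # Use the charset declared in the response headers, if any
        charset = resp.headers.get_content_charset() or "utf-8"
        text = resp.read().decode(charset, errors="replace")
    with open(path, "w", encoding="utf-8") as fid:
        fid.write(text)
    return text
```

Decoding with errors="replace" means a page with a wrong or missing charset declaration still saves instead of raising UnicodeDecodeError, at the cost of replacement characters in the output.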