I have written my first bit of python code to scrape a website.
import csv
import urllib2
from BeautifulSoup import BeautifulSoup

c = csv.writer(open("data.csv", "wb"))
soup = BeautifulSoup(urllib2.urlopen('http://www.kitco.com/kitco-gold-index.html').read())
table = soup.find('table', id="datatable_main")
rows = table.findAll('tr')[1:]
for tr in rows:
    cols = tr.findAll('td')
    text = []
    for td in cols:
        text.append(td.find(text=True))
    c.writerow(text)
When I test it locally in my IDE, PyCharm, it works fine, but when I try it on my server, which runs CentOS, I get the following error:
domainname.com [~/public_html/livegold]# python scraper.py
Traceback (most recent call last):
  File "scraper.py", line 8, in <module>
    rows = table.findAll('tr')[:]
AttributeError: 'NoneType' object has no attribute 'findAll'
I'm guessing I don't have a module installed remotely. I've been hung up on this for two days; any help would be greatly appreciated! :)
You are ignoring any errors that could occur in urllib2.urlopen. If for some reason you get an error fetching that page on your server which you don't get when testing locally, you are effectively passing an empty string ('') or a page you don't expect (such as a 404 page) to BeautifulSoup. That in turn makes your soup.find('table', id="datatable_main") return None, since the document isn't what you expect. You should either make sure you can fetch the page you want from your server, or handle the exceptions properly.
There is no table with id datatable_main in the page that the script read. Try printing the returned page to the terminal; perhaps your script is failing to contact the web server. Sometimes hosting services prevent outgoing HTTP connections.