Python newbie here, using Python 2.7 with BeautifulSoup 4.
I am trying to parse a webpage with BeautifulSoup and extract a column of data. The page has tables nested inside tables; the one I want is table 8 (index 8 in the code below), and it has no header row or th tags. I want to pull a full column out of that table.
from bs4 import BeautifulSoup
import urllib2
url = 'http://finance.yahoo.com/q/op?s=aapl+Options'
htmltext = urllib2.urlopen(url).read()
soup = BeautifulSoup(htmltext)
# Table 8 has the data I need, though it is nested inside other tables.
# A fully specified reference works:
print soup.findAll('table')[8].findAll('tr')[2].findAll('td')[2].contents
# The loop below errors out:
for row in soup.findAll('table')[8].findAll('tr'):
    column2 = row.findAll('td')[2].contents
    print column2
# The second line of the loop raises "IndexError: list index out of range".
I saw this approach working in another example, but it didn't work for me. I also tried iterating over the tr elements:
mytr = soup.findAll('table')[8].findAll('tr')
for row in mytr:
    print row.find('td')  # works, but gives only the first td, as expected
    print row.findAll('td')[2]
which fails with an error saying the list index is out of range.
So:
- the first findAll('table') works
- the second findAll('tr') works
- the third findAll('td') works only if ALL of the [ ] indices are literal numbers rather than loop variables.
e.g.
print soup.findAll('table')[8].findAll('tr')[2].findAll('td')[2].contents
The line above works because it is a fully specified reference, but the same lookup fails when the row comes from a loop variable. I need it inside a loop to get the full column.
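
One possible explanation, not confirmed against the actual page: some tr rows in the table may contain fewer than three td cells (for example a title row with a single cell spanning several columns), so [2] fails on those rows while the hard-coded row 2 happens to be a full row. Below is a minimal sketch of guarding the index inside the loop. The HTML string is a made-up stand-in for the real page, and the names html, cells and column are only illustrative.

```python
from __future__ import print_function  # keeps print() working on Python 2.7 too
from bs4 import BeautifulSoup

# Hypothetical stand-in markup (NOT the real Yahoo page): the first row has
# a single cell, so row.find_all('td')[2] would raise IndexError on it.
html = """
<table>
  <tr><td colspan="3">Calls</td></tr>
  <tr><td>a1</td><td>a2</td><td>a3</td></tr>
  <tr><td>b1</td><td>b2</td><td>b3</td></tr>
</table>
"""

# 'html.parser' is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, 'html.parser')

column = []
for row in soup.find_all('table')[0].find_all('tr'):
    cells = row.find_all('td')
    if len(cells) > 2:  # skip rows that do not have a third cell
        column.append(cells[2].get_text(strip=True))

print(column)
```

If this assumption holds for the real page, printing len(row.findAll('td')) for each row of table 8 should show which rows are shorter than expected.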