I'm trying to grab a table from the following webpage
http://www.bloomberg.com/markets/companies/country/hong-kong/
I have some sample code which was kindly provided by Phil Bozak here:
grabbing table from html using Google script
which grabs the table for this website:
http://www.airchina.com.cn/www/en/html/index/ir/traffic/
As you can see from Phil's code, there are a lot of "getElement()" calls. Looking at the HTML for the Air China website, the table appears to be nested four levels deep. Is that why there's a string of .getElement calls?
Now I look at the source code for the Bloomberg page, and it is loaded with "div" elements...
The question is: can someone show me how to grab the table from the Bloomberg page?
A brief explanation of the theory would also be useful. Thanks a bunch.
Let's flip your question upside down, and start with the theory. Methodology might be a better word for it.
You want to get at something specific in a structured page. To do that, you either need a way to zap right to the element (which can be done if it's labeled in a unique way that we can access), OR you need to navigate the structure more-or-less manually. You already know how to look at the source of a page, so you're familiar with this step. Here's a screenshot of Firefox Inspector, highlighting the element we're interested in.
We can see the hierarchy of elements that lead to the table: html, body, div, div, div.ticker, table.ticker_data. We can also see the source:
Neat! It's labeled! Unfortunately, that class info gets dropped when we process the HTML in our script. Bummer. If it was id="ticker_data" instead, we could use the getElementByVal() utility from this answer to reach it, and give ourselves some immunity from future restructuring of the page. Put a pin in that - we'll come back to it.
It can help to visualize this in the debugger. Here's a utility script for that - run it in debug mode, and you'll have your HTML document laid out to explore:
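If you'd rather not depend on the debugger, here's a rough sketch of the same idea in plain JavaScript: print an indented outline of the tree, numbering repeated siblings. The outline() function and the { name, children } node shape are assumptions made for illustration, not part of any Apps Script service.

```javascript
// Hypothetical node shape: { name: "div", children: [ ... ] }.
// Returns an array of indented lines, one per element.
function outline(node, depth, label, lines) {
  depth = depth || 0;
  lines = lines || [];
  label = label || node.name;
  lines.push(new Array(depth + 1).join("  ") + label);

  var kids = node.children || [];
  var seen = Object.create(null); // siblings of each tag name printed so far
  kids.forEach(function (child) {
    var total = kids.filter(function (k) { return k.name === child.name; }).length;
    var n = seen[child.name] || 0;
    seen[child.name] = n + 1;
    // Only repeated tags get an index, mirroring the numbered entries in the debugger.
    var childLabel = total > 1 ? child.name + "[" + n + "]" : child.name;
    outline(child, depth + 1, childLabel, lines);
  });
  return lines;
}

// Tiny made-up page, just to exercise the function.
var samplePage = {
  name: "html",
  children: [{
    name: "body",
    children: [
      { name: "div" },
      { name: "div", children: [{ name: "table" }] }
    ]
  }]
};
var outlineText = outline(samplePage).join("\n");
```

Logging outlineText gives one line per element, with div[0] and div[1] showing up under body because there are two of them at that level.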
This is what our page looks like in the debugger:
You might be wondering what the numbered elements are, since you don't see them in the source. When there are multiples of an element type at the same level in an XML document, the parser presents them as an array, numbered 0..n. Thus, when we see 0 under a div in the debugger, that's telling us that there are multiple <div> tags in the HTML source at that level, and we can access them as an array, for example .div[0].
Ok, theory behind us, let's go ahead and see how we can access the table by brute-force.
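One quick sketch of that array numbering before we move on. The groupChildren() helper and the { tag, id } node shape are made up for illustration; this only mimics the shape the parser hands back, not how it works internally.

```javascript
// Hypothetical helper: a tag that appears once becomes a single property,
// a tag that appears more than once becomes an array numbered 0..n.
function groupChildren(children) {
  var grouped = {};
  children.forEach(function (child) {
    var name = child.tag;
    if (!Object.prototype.hasOwnProperty.call(grouped, name)) {
      grouped[name] = child;                   // first occurrence: plain property
    } else if (Array.isArray(grouped[name])) {
      grouped[name].push(child);               // third or later: append
    } else {
      grouped[name] = [grouped[name], child];  // second occurrence: promote to array
    }
  });
  return grouped;
}

// Made-up children of a <body>: two divs and one table.
var body = groupChildren([
  { tag: "div", id: "header" },
  { tag: "table" },
  { tag: "div", id: "content" }
]);

var firstDiv = body.div[0]; // the divs are an array, so they need an index
var onlyTable = body.table; // the lone table is reached without one
```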
Knowing the hierarchy, including the div arrays shown in the debugger, we could do this, à la Phil's previous answer. I'll do some weird indenting to illustrate the document structure:
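A self-contained sketch of that call chain follows. To keep it runnable anywhere, it uses a small stand-in El class that only mimics the getElement()/getElements() method names; in a real script the object would come from the parser. The page layout, the [0] index on the outer div, and the "example_id" value are assumptions; only div[5] and the html, body, div, div, div, table hierarchy come from the debugger view described here.

```javascript
// Stand-in for a parsed element. This is NOT the Apps Script class; it only
// mimics the two method names so the sketch runs outside Apps Script.
function El(name, children, attrs) {
  this.name = name;
  this.children = children || [];
  this.attrs = attrs || {};
}
// All direct children with the given tag name, in document order.
El.prototype.getElements = function (name) {
  return this.children.filter(function (c) { return c.name === name; });
};
// First direct child with the given tag name, or null if there is none.
El.prototype.getElement = function (name) {
  return this.getElements(name)[0] || null;
};

// Hypothetical page: body > div > six divs, the last of which wraps the table.
var filler = [0, 1, 2, 3, 4].map(function () { return new El("div"); });
var tableEl = new El("table", [], { note: "target" });
var wrapper = new El("div", filler.concat([
  new El("div", [new El("div", [tableEl])], { id: "example_id" })
]));
var html = new El("html", [new El("head"), new El("body", [wrapper])]);

// One call per level, indented to mirror the nesting.
var table = html
  .getElement("body")
    .getElements("div")[0]       // outer wrapper (index assumed)
      .getElements("div")[5]     // sixth div, the one carrying an id
        .getElement("div")       // div.ticker on the real page
          .getElement("table");  // table.ticker_data on the real page
```

Each level costs one call, which is why a deeply nested page produces such a long string of them.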
As a much more compact alternative to all those .getElement() calls, we can navigate using dot notation.
And that's that.
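For illustration, dot-notation navigation might look like the sketch below. The pageDoc object here is a hand-built stand-in made of plain objects, the indices other than div[5] are assumed, and dig() is a hypothetical convenience for the case where a level might be missing.

```javascript
// Hypothetical stand-in for the parsed document: plain objects, with repeated
// sibling tags stored as arrays.
var innerTable = { kind: "table" };
var pageDoc = {
  html: {
    body: {
      div: [
        { div: [{}, {}, {}, {}, {}, { id: "example_id", div: { table: innerTable } }] },
        {}
      ]
    }
  }
};

// Same path as a chain of getElement() calls, one property per level.
var table = pageDoc.html.body.div[0].div[5].div.table;

// Defensive variant: walk a list of keys and return null if any step is missing,
// instead of throwing a TypeError halfway down.
function dig(root, path) {
  var node = root;
  for (var i = 0; i < path.length; i++) {
    if (node == null) return null;
    node = node[path[i]];
  }
  return node === undefined ? null : node;
}

var sameTable = dig(pageDoc, ["html", "body", "div", 0, "div", 5, "div", "table"]);
var missing = dig(pageDoc, ["html", "body", "div", 1, "div", 5, "div", "table"]);
```

The plain chain is shorter, but it throws as soon as one level isn't where we expect it; dig() trades a little verbosity for a null we can test.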
Let's go back to that pinned idea. In the debugger, we can see that there are various attributes attached to elements. In particular, there's an "id" on that div[5] that contains the div that contains the table. Remember, in the source we saw "class" attributes, but note that they don't make it this far.
Still, the fact that a kindly programmer put this "id" in place means we can do this, with getDivById() from that earlier question:
If they move things around, we might still be able to find that table, without changing our code.
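To show why an id lookup survives a reshuffle, here's a minimal sketch of the idea. It is not the getDivById() from the linked question: findDivById(), the { name, attrs, children } node shape, the two layouts, and the "example_id" value are all assumptions for illustration.

```javascript
// Build a hypothetical node: { name, attrs, children }.
function el(name, attrs, children) {
  return { name: name, attrs: attrs || {}, children: children || [] };
}

// Depth-first search for the first <div> whose id attribute matches.
function findDivById(node, id) {
  if (node.name === "div" && node.attrs.id === id) {
    return node;
  }
  for (var i = 0; i < node.children.length; i++) {
    var hit = findDivById(node.children[i], id);
    if (hit) return hit;
  }
  return null;
}

// The labeled div wraps a div that wraps the table, as in the debugger view.
function tickerDiv() {
  return el("div", { id: "example_id" }, [el("div", {}, [el("table")])]);
}

// Two made-up layouts: the labeled div sits at a different depth and index in each.
var layoutA = el("html", {}, [el("body", {}, [
  el("div", {}, [el("div"), tickerDiv()])
])]);
var layoutB = el("html", {}, [el("body", {}, [
  el("div"),
  el("div", {}, [el("div", {}, [tickerDiv()])])
])]);

// From the labeled div, the table is always two levels down.
function tableUnder(div) {
  return div.children[0].children[0];
}

var tableA = tableUnder(findDivById(layoutA, "example_id"));
var tableB = tableUnder(findDivById(layoutB, "example_id"));
```

The same two lines find the table in both layouts, whereas a hard-coded index path would have to be rewritten for layoutB.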
You already know what to do once you have the table element, so we're done here!