Scraping HTML from Financial Statements

Published 2019-08-17 00:51

Question:

First attempt at learning to work with HTML in Visual Studio and C#. I am using the Html Agility Pack library to do the parsing.

From this page, I am attempting to pull information from various places and save it as correctly formatted strings.

Here is my current code (taken from: shriek):

// 'document' is an HtmlDocument already loaded from the page
HtmlNode tdNode = document.DocumentNode.DescendantNodes().FirstOrDefault(n => n.Name == "td"
    && n.InnerText.Trim() == "Net Income");
if (tdNode != null)
{
    HtmlNode trNode = tdNode.ParentNode;
    foreach (HtmlNode node in trNode.DescendantNodes().Where(n => n.NodeType == HtmlNodeType.Element))
    {
        Console.WriteLine(node.InnerText.Trim());
        //Output:
        //Net Income
        //265.00
        //298.00
        //601.00
        //672.00
        //666.00
    }
}

It works correctly; however, I want to get more information and am unsure how to search through the HTML correctly. First, I would like to be able to select these numbers from the annual data as well, not only from the quarterly data (the View option at the top of the page).

I would also like to get the dates for each column of numbers, both quarterly and annual (the "As of ..." headings at the top of each column).

Also, for future projects, does Google provide an API for this?

Answer 1:

If you take a close look at the original input HTML source, you will see its data is organized around 6 main sections that are DIV elements with one of the following 'id' attributes: "incinterimdiv", "incannualdiv", "balinterimdiv", "balannualdiv", "casinterimdiv", "casannualdiv". Obviously, these match the Income Statement, Balance Sheet, and Cash Flow sections, for quarterly ("interim") or annual data.
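So switching from quarterly to annual data is just a matter of targeting a different section. Here is a minimal sketch that checks all six sections are present, using Html Agility Pack's GetElementbyId (the URL is the same one used in the sample further down):

        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://www.google.com/finance?q=NASDAQ:TXN&fstype=ii");

        // the six section ids listed above; "interim" = quarterly
        foreach (string id in new[] { "incinterimdiv", "incannualdiv",
                                      "balinterimdiv", "balannualdiv",
                                      "casinterimdiv", "casannualdiv" })
        {
            HtmlNode section = doc.GetElementbyId(id);
            Console.WriteLine(id + ": " + (section == null ? "not found" : "found"));
        }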

Now, when you're scraping a site with Html Agility Pack, I suggest you use XPath, which is the easiest way to get to any node inside the HTML, without any dependency on XML, as Html Agility Pack supports plain XPath over HTML.

XPath has to be learned, for sure, but it is very elegant because it does so many things in just one line. I know this may look old-fashioned next to the new cool C#-oriented XLinq syntax :), but XPath is much more concise. It also lets you concentrate the bindings between your code and the input HTML in plain old strings, and avoid recompiling the code when the input source evolves (for example, when the IDs change). This makes your scraping code more robust and future-proof. You could also put the XPath bindings in an XSL(T) file, to be able to transform the HTML into the data presented as XML.
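For instance, the expressions can live in App.config instead of the code; a minimal sketch (the key name and expression are illustrative):

        // in App.config:
        // <add key="CashFlowAnnualRows"
        //      value="//div[@id='casannualdiv']/table[@id='fs-table']//td[@class='lft lm']" />
        string rowsXPath = System.Configuration.ConfigurationManager.AppSettings["CashFlowAnnualRows"];
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes(rowsXPath);

This way, when the page layout changes, only the configuration file needs updating.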

Anyway, enough digression :) Here is sample code that allows you to get the financial data from a specific line title, and another sample that gets all data from all lines (from one of the 6 main sections):

        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://www.google.com/finance?q=NASDAQ:TXN&fstype=ii");

        // How to get a specific line:
        // 1) recursively get all DIV elements with the 'id' attribute set to 'casannualdiv'
        // 2) get all TABLE elements under, with the 'id' attribute set to 'fs-table'
        // 3) recursively get all TD elements containing the given text (trimmed)
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@id='casannualdiv']/table[@id='fs-table']//td[normalize-space(text()) = 'Deferred Taxes']"))
        {
            Console.WriteLine("Title:" + node.InnerHtml.Trim());

            // get all following sibling TD elements
            foreach (HtmlNode sibling in node.SelectNodes("following-sibling::td"))
            {
                Console.WriteLine(" data:" + sibling.InnerText.Trim()); // InnerText works also for negative values
            }
        }

        // How to get all lines:
        // 1) recursively get all DIV elements with the 'id' attribute set to 'casannualdiv'
        // 2) get all TABLE elements under, with the 'id' attribute set to 'fs-table'
        // 3) recursively get all TD elements containing the class 'lft lm'
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@id='casannualdiv']/table[@id='fs-table']//td[@class='lft lm']"))
        {
            Console.WriteLine("Title:" + node.InnerHtml.Trim());
            foreach (HtmlNode sibling in node.SelectNodes("following-sibling::td"))
            {
                Console.WriteLine(" data:" + sibling.InnerText.Trim());
            }
        }
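
The "As of ..." dates the question asks about can be fetched the same way. This is a sketch that assumes they sit in TH elements inside the same 'fs-table' (verify against the actual markup); note that SelectNodes returns null when nothing matches, so guard the result:

        // sketch: grab the "As of ..." column headings, assuming they are TH
        // elements inside the same table (check the real markup first)
        HtmlNodeCollection headings = doc.DocumentNode.SelectNodes(
            "//div[@id='casannualdiv']/table[@id='fs-table']//th");
        if (headings != null) // SelectNodes returns null when there is no match
        {
            foreach (HtmlNode th in headings)
            {
                Console.WriteLine("column: " + th.InnerText.Trim());
            }
        }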


Answer 2:

You have two options. One is to reverse engineer the HTML page: figure out what JavaScript code runs when you click on Annual Data, see where it gets the data from, and request that data directly.

The second solution, which is more robust, is to use a platform such as Selenium, which actually drives a web browser and runs the JavaScript for you.
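A minimal Selenium WebDriver sketch in C# (NuGet package Selenium.WebDriver); the By.LinkText locator is an assumption about how the Annual Data link is rendered:

    using OpenQA.Selenium;
    using OpenQA.Selenium.Firefox;

    IWebDriver driver = new FirefoxDriver();
    driver.Navigate().GoToUrl("http://www.google.com/finance?q=NASDAQ:TXN&fstype=ii");
    driver.FindElement(By.LinkText("Annual Data")).Click(); // runs the page's JavaScript
    string html = driver.PageSource; // hand this to HtmlDocument.LoadHtml for parsing
    driver.Quit();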

As far as I could tell, there's no CSV interface to the financial statements. Perhaps Yahoo! has one.



Answer 3:

If you need to navigate around to get to the right page, then you probably want to look into WatiN. WatiN was designed as an automated testing tool for web pages; it drives a selected web browser to fetch the page, and it also lets you identify input fields, enter text into textboxes, and push buttons. It's a lot like HtmlAgilityPack, so you shouldn't find it too difficult to master.
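A minimal WatiN sketch (it drives a real Internet Explorer instance, and the calling thread must be STA); Find.ByText is an assumption about the link's text:

    using WatiN.Core;

    using (var browser = new IE("http://www.google.com/finance?q=NASDAQ:TXN&fstype=ii"))
    {
        browser.Link(Find.ByText("Annual Data")).Click(); // follow the Annual Data link
        string html = browser.Html; // parse with Html Agility Pack as before
    }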



Answer 4:

I would highly recommend against this approach. The HTML that Google is spitting out is likely highly volatile, so even once you solidify your parsing approach to get all of the data you need, in a day, a week, or a month the HTML format could change completely and you would need to rewrite your parsing logic.

You should try to use something more static, like XBRL.

The SEC publishes XBRL for each publicly traded company here: http://xbrl.sec.gov/

You can use this toolkit to work with the data programmatically: http://code.google.com/p/xbrlware/
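Independent of any toolkit, an XBRL instance document is plain XML, so LINQ to XML can read individual facts. A minimal sketch; the file name is illustrative, and the us-gaap namespace varies by taxonomy year (check it in the actual filing):

    using System;
    using System.Xml.Linq;

    XDocument xbrl = XDocument.Load("txn-xbrl-instance.xml"); // a downloaded filing
    XNamespace usGaap = "http://xbrl.us/us-gaap/2009-01-31";  // verify in the file
    foreach (XElement fact in xbrl.Descendants(usGaap + "NetIncomeLoss"))
    {
        // contextRef ties the fact to a reporting period defined elsewhere in the file
        Console.WriteLine(fact.Attribute("contextRef").Value + ": " + fact.Value);
    }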

EDIT: The path of least resistance is actually using http://www.xignite.com/xFinancials.asmx, but this service costs money.