How to scrape invisible HTML?

Posted 2019-01-12 11:39

Question:

Is it possible?

For example, the values of this data table are hidden in the HTML source:

http://www.cmegroup.com/trading/energy/crude-oil/european-dated-brent-swap-futures_quotes_settlements_futures.html

Answer 1:

Technically, they are not invisible: the values you are looking for are simply not in the initial HTML document that you requested. For more explanation, read "How do you scrape AJAX pages?"



Answer 2:

Take a look at the example below. Import the JSON.bas module into the VBA project for JSON processing.

Option Explicit

Sub Test()

    Dim sJSONString As String
    Dim vJSON
    Dim sState As String
    Dim aData()
    Dim aHeader()

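    ' Request the JSON that the page itself loads via XHR (False = synchronous call)
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "http://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/5081/FUT?tradeDate=04/06/2018&strategy=DEFAULT&pageSize=500", False
        .send
        sJSONString = .responseText
    End With
    ' Parse the response text; sState receives the parse status
    JSON.Parse sJSONString, vJSON, sState
    ' Keep only the "settlements" element of the parsed response
    vJSON = vJSON("settlements")
    ' Convert it into a 2D data array plus a header array
    JSON.ToArray vJSON, aData, aHeader
    ' Clear the first sheet, then write the header row and the data rows
    With Sheets(1)
        .Cells.Delete
        .Cells.WrapText = False
        OutputArray .Cells(1, 1), aHeader
        Output2DArray .Cells(2, 1), aData
        .Columns.AutoFit
    End With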

End Sub

Sub OutputArray(oDstRng As Range, aCells As Variant)

    With oDstRng
        .Parent.Select
        With .Resize(1, UBound(aCells) - LBound(aCells) + 1)
            .NumberFormat = "@"
            .Value = aCells
        End With
    End With

End Sub

Sub Output2DArray(oDstRng As Range, aCells As Variant)

    With oDstRng
        .Parent.Select
        With .Resize( _
                UBound(aCells, 1) - LBound(aCells, 1) + 1, _
                UBound(aCells, 2) - LBound(aCells, 2) + 1)
            .NumberFormat = "@"
            .Value = aCells
        End With
    End With

End Sub

The scraping is based on parsing the XHR response from the URL http://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/5081/FUT?tradeDate=04/06/2018&strategy=DEFAULT&pageSize=500. You can find this URL among the logged requests on the Network tab of the browser's developer tools (e.g. in Chrome) after the page has loaded.
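Since the parameters are plain query-string values, the request URL can be assembled from its parts, for example to request a different trade date. The following is only a minimal sketch: BuildSettlementsUrl and DemoBuildUrl are hypothetical helpers that are not part of the original answer, and treating 5081 as a per-product code is an assumption based solely on its position in the URL. The trade date is passed as a string in the same format as it appears in the logged request.

```vba
' Hypothetical helper (not from the original answer): assembles the
' settlements URL from its parts. sCode is the numeric path segment
' (5081 in the logged request); its exact meaning is an assumption.
Function BuildSettlementsUrl(ByVal sCode As String, _
                             ByVal sTradeDate As String, _
                             Optional ByVal lPageSize As Long = 500) As String

    Const BASE_URL As String = _
        "http://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/"

    ' sTradeDate is inserted as-is, in the format seen in the logged request
    BuildSettlementsUrl = BASE_URL & sCode & "/FUT" & _
        "?tradeDate=" & sTradeDate & _
        "&strategy=DEFAULT" & _
        "&pageSize=" & CStr(lPageSize)

End Function

Sub DemoBuildUrl()

    ' Prints the URL built from the values used in the logged request
    Debug.Print BuildSettlementsUrl("5081", "04/06/2018")

End Sub
```

The returned string can be passed to .Open in place of the hard-coded URL. Whether the server accepts other dates or codes has to be checked against the requests your own browser logs.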

For me, the output of the Test procedure above, with the request parameters tradeDate=04/06/2018&strategy=DEFAULT&pageSize=500, is as follows:

[output image]

BTW, a similar approach is applied in the following answers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14.