how to get at this data

Posted 2019-09-20 13:27

I am looking to scrape the three items that are highlighted and bordered from the html sample below. I've also highlighted a few markers that look useful.

How would you do this?

[Image: screenshot of the HTML sample with the target items highlighted]

A Solution

Ok so this wasn't a great question and I'm actually surprised it didn't get down-voted more! Oh well, here are some bread crumbs for someone else.

Three of the four items I want are the inner text of a span element with a known id (e.g., $0.83 for "yfs_l10_gm150220c00036500"), so the following helper function seems to be a decent and direct shot:

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' GetSpanTextForId
'
' Returns the inner text of the span element with the passed id, converted to a Double
'
' param doc:     the source HTMLDocument
' param spanId:  the id of the span element to read
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function GetSpanTextForId(ByRef doc As HTMLDocument, ByVal spanId As String) As Double
'   Error Handling
    On Error GoTo ErrHandler
    Dim sRoutine        As String
    sRoutine = cModule & ".GetSpanTextForId"
     
    CheckArgNotNothing doc, "doc"
    CheckArgNotBadString spanId, "spanId"
'   Procedure
    Dim oSpan As HTMLSpanElement
    Set oSpan = doc.getElementById(spanId)
    Check Not oSpan Is Nothing, "Could not find span with id: " & Bracket(spanId)
    GetSpanTextForId = CDbl(oSpan.innerText)    ' explicit String -> Double conversion
    
    Exit Function

ErrHandler:
    Select Case DspErrMsg(sRoutine)
         Case Is = vbAbort:  Stop: Resume    'Debug mode - Trace
         Case Is = vbRetry:  Resume          'Try again
         Case Is = vbIgnore:                 'End routine
     End Select


End Function
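
For readers working in JavaScript rather than VBA, the same id-based lookup could be sketched roughly as below. This is only an illustration under assumptions: `getSpanNumberById` is a hypothetical name, `doc` is any object exposing the standard `getElementById` method (a browser `document`, or a stand-in), and the span text is assumed to be a plain decimal, optionally prefixed with `$`.

```javascript
// Hypothetical helper: read the text of the element with the given id
// and convert it to a number. `doc` only needs a getElementById method.
function getSpanNumberById(doc, spanId) {
  var span = doc.getElementById(spanId);
  if (!span) {
    throw new Error("Could not find span with id: [" + spanId + "]");
  }
  // Drop surrounding whitespace, a leading "$", and thousands separators.
  var text = String(span.textContent).trim().replace(/^\$/, "").replace(/,/g, "");
  var value = Number(text);
  if (text === "" || isNaN(value)) {
    throw new Error("Span [" + spanId + "] does not contain a number: " + text);
  }
  return value;
}
```

In a browser this would be called as `getSpanNumberById(document, "yfs_b00_gm150220c00036500")` to read the Bid span from the sample markup.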

The only item not directly identified by a span id is the Open Interest, which sits in a table that is the 2nd child of an element with an id. The following functions find the cell whose text starts with the label I want (i.e., "Open Interest:") and return the text of the cell that immediately follows it:

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' GetOpenInterest
'
' The latest available Open Interest.
'
' param doc:     the source HTMLDocument
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function GetOpenInterest(ByRef doc As HTMLDocument) As Long
    Dim tbl As IHTMLTable
    Set tbl = GetSummaryDataTable(doc, 1)
    Dim k As Integer
    k = mWebScrapeHelpers.GetCellNumberForTextStartingWith(tbl, "Open Interest:")
    Check k >= 0, "Could not find the Open Interest cell"
    ' CLng rather than CInt: open interest can exceed the Integer maximum of 32,767
    GetOpenInterest = CLng(mWebScrapeHelpers.GetCellTextFromCellNumber(tbl, k + 1))
End Function


Function GetCellNumberForTextStartingWith(ByRef tbl As IHTMLTable, ByRef s As String) As Integer
'   Error Handling
    On Error GoTo ErrHandler
    Dim sRoutine        As String
    sRoutine = cModule & ".GetCellNumberForTextStartingWith"
    
    CheckArgNotNothing tbl, "tbl"
    
'   Procedure
    Dim tblCell As HTMLTableCell
    Dim k As Integer

    For Each tblCell In tbl.Cells
        ' s & "*" matches text that starts with s ("*" & s would match text ending with s)
        If Trim$(tblCell.innerText) Like (s & "*") Then
            GetCellNumberForTextStartingWith = k
            Exit Function
        End If
        k = k + 1
    Next
    
    ' if we got here it was not found so
    GetCellNumberForTextStartingWith = -1
    Exit Function

ErrHandler:
    Select Case DspErrMsg(sRoutine)
         Case Is = vbAbort:  Stop: Resume    'Debug mode - Trace
         Case Is = vbRetry:  Resume          'Try again
         Case Is = vbIgnore:                 'End routine
     End Select
     
End Function

Function GetCellTextFromCellNumber(ByRef tbl As IHTMLTable, ByRef nbr As Integer) As String
'   Error Handling
    On Error GoTo ErrHandler
    Dim sRoutine        As String
    sRoutine = cModule & ".GetCellTextFromCellNumber"
    
    CheckArgNotNothing tbl, "tbl"
    Check tbl.Cells.Length > 0, "table is empty"
    ' Cells is zero-based, so the largest valid index is Length - 1
    Check nbr >= 0 And nbr < tbl.Cells.Length, "table only has " & tbl.Cells.Length & " cells; can't get cell number " & nbr
    
'   Procedure
    GetCellTextFromCellNumber = tbl.Cells(nbr).innerText
    Exit Function

ErrHandler:
    Select Case DspErrMsg(sRoutine)
         Case Is = vbAbort:  Stop: Resume    'Debug mode - Trace
         Case Is = vbRetry:  Resume          'Try again
         Case Is = vbIgnore:                 'End routine
     End Select


End Function
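
The label-then-next-cell lookup used by the functions above can also be illustrated without a DOM. The sketch below is JavaScript working on a plain array of cell texts (in a browser these could be collected from the table's cells); `findCellStartingWith` and `valueAfterLabel` are hypothetical names and not part of the original module.

```javascript
// Return the index of the first cell whose trimmed text starts with
// `label`, or -1 if no cell matches.
function findCellStartingWith(cellTexts, label) {
  for (var i = 0; i < cellTexts.length; i++) {
    if (cellTexts[i].trim().indexOf(label) === 0) {
      return i;
    }
  }
  return -1;
}

// Return the trimmed text of the cell immediately after the labelled cell.
function valueAfterLabel(cellTexts, label) {
  var k = findCellStartingWith(cellTexts, label);
  if (k < 0 || k + 1 >= cellTexts.length) {
    throw new Error("No value cell found after label: " + label);
  }
  return cellTexts[k + 1].trim();
}

// Cell texts as they appear in the sample's Open Interest table.
var cells = ["Open Interest:", "11,579"];
// Remove the thousands separator before converting to an integer.
var openInterest = parseInt(valueAfterLabel(cells, "Open Interest:").replace(/,/g, ""), 10);
console.log(openInterest);
```

Returning -1 for "not found" mirrors the VBA version; the caller has to check for it before indexing, which `valueAfterLabel` does here.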

These methods work fine, but there seem to be lots of different approaches that would work, including the regex parsing approach suggested as an answer. The excellent link by RedShift got more to the point of analyzing the HTML and coming up with a strategy.

Cheers

1 answer
孤傲高冷的网名
#2 · 2019-09-20 13:58

I would probably use an XML parser to get the text content first (or this: xmlString.replace(/<[^>]+>/g, "") to replace all tags with empty strings), then use the following regexes to extract the information you need:

/-OPR\s+(\d+\.\d+)/
/Bid:\s+(\d+\.\d+)/
/Ask:\s+(\d+\.\d+)/
/Open Interest:\s+(\d+,\d+)/

This process can easily be done in Node.js or in any other language that supports regex.
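
As a rough sketch of those two steps in plain JavaScript (runnable in Node.js without a DOM): strip the tags, then apply each pattern. The names `sampleHtml`, `patterns`, and `extractQuote` are made up for illustration, the markup is a trimmed copy of the sample from the question, and stripping tags with a regex is assumed to be acceptable only because the HTML here is simple.

```javascript
// Trimmed copy of the sample markup; line breaks between cells are kept
// because the patterns expect whitespace between a label and its value.
var sampleHtml = [
  '<span class="rtq_exch"><span class="rtq_dash">-</span>OPR</span>',
  '<span>0.83</span>',
  '<table>',
  '<tr><th>Bid:</th>',
  '<td><span>0.76</span></td></tr>',
  '<tr><th>Ask:</th>',
  '<td><span>0.90</span></td></tr>',
  '<tr><th>Open Interest:</th>',
  '<td>11,579</td></tr>',
  '</table>'
].join("\n");

// The four patterns from the answer, keyed by illustrative field names.
var patterns = {
  last: /-OPR\s+(\d+\.\d+)/,
  bid: /Bid:\s+(\d+\.\d+)/,
  ask: /Ask:\s+(\d+\.\d+)/,
  openInterest: /Open Interest:\s+(\d+,\d+)/
};

function extractQuote(html) {
  // Step 1: replace every tag with an empty string, leaving the text.
  var text = html.replace(/<[^>]+>/g, "");
  // Step 2: apply each pattern; record null when a pattern is absent.
  var result = {};
  Object.keys(patterns).forEach(function (name) {
    var m = text.match(patterns[name]);
    result[name] = m ? Number(m[1].replace(/,/g, "")) : null;
  });
  return result;
}

var quote = extractQuote(sampleHtml);
console.log(quote);
```

Note that the Open Interest pattern as written requires a comma, so a value below 1,000 would come back as `null` in this sketch.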


live demo:

  • Waits 1 second, then removes tags.
  • Waits another second, then finds all patterns and creates a table.

var wait = true; // Set to false to execute instantly.

var elem = document.getElementById("parsingStuff");
var str = elem.textContent;

var keywords = ["-OPR", "Bid:", "Ask:", "Open Interest:"];
var output = {};
var timeout = 0;

if (wait) timeout = 1000;

setTimeout(function() { // Removing tags.
  elem.innerHTML = elem.textContent;
}, timeout);

if (wait) timeout = 2000;

setTimeout(function() { // Looking for patterns.
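  for (var i = 0; i < keywords.length; i++) {
    // match() returns null when the keyword is absent, so guard before indexing.
    var match = str.match(RegExp(keywords[i] + "\\s+(\\d+[\\.,]\\d+)"));
    output[keywords[i]] = match ? match[1] : "not found";
  }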

  // Creating basic table of found data.
  elem.innerHTML = "";
  var table = document.createElement("table");
  for (var k in output) {
    var tr = document.createElement("tr");
    var th = document.createElement("th");
    var td = document.createElement("td");

    th.style.border = "1px solid gray";
    td.style.border = "1px solid gray";

    th.textContent = k;
    td.textContent = output[k];

    tr.appendChild(th);
    tr.appendChild(td);

    table.appendChild(tr);
  }
  elem.appendChild(table);
}, timeout);

<div id="parsingStuff">
  <div class="yfi_rt_quote_summary" id="yfi_rt_quote_summary">
    <div class="hd">
      <div class="title">
        <h2>GM Feb 2015 36.500 call (GM150220C00036500)</h2>
        <span class="rtq_exch">
        <span class="rtq_dash">-</span>OPR
        </span>
        <span class="wl_sign"></span>
      </div>
    </div>
    <div class="yfi_rt_quote_summary_rt_top sigfig_promo_1">
      <div>
        <span class="time_rtq_ticker">

        <span id="yfs_110_gm150220c00036500">0.83</span>
        </span>
      </div>
    </div>
  </div>
  <div class="yui-u first yfi-start-content">
    <div class="yfi_quote_summary">
      <div id="yfi_quote_summary_data" class="rtq_table">
        <table id="table1">
          <tr>
            <th scope="row" width="48%">Bid:</th>
            <td class="yfnc_tabledata1">

              <span id="yfs_b00_gm150220c00036500">0.76</span>
            </td>
          </tr>
          <tr>
            <th scope="row" width="48%">Ask:</th>
            <td class="yfnc_tabledata1">

              <span id="yfs_a00_gm150220c00036500">0.90</span>
            </td>
          </tr>
        </table>
        <table id="table2">
          <tr>
            <th scope="row" width="48%">Open Interest:</th>

            <td class="yfnc_tabledata1">11,579</td>
          </tr>
        </table>
      </div>
    </div>
  </div>
</div>
