I am looking to scrape the three items that are highlighted and bordered in the HTML sample below. I've also highlighted a few markers that look useful.
How would you do this?
A Solution
Ok so this wasn't a great question and I'm actually surprised it didn't get down-voted more! Oh well, here are some bread crumbs for someone else.
Three of the four items of info I want are the inner text of a span element with a known id (i.e., $0.83 for "yfs_l10_gm150220c00036500"), so the following helper function seems to be a decent and direct shot:
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' GetSpanTextForId
'
' Returns the inner text from a span element known by the passed id,
' converted to a Double
'
' param doc: the source HTMLDocument
' param spanId: the id attribute of the span element to read
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function GetSpanTextForId(ByRef doc As HTMLDocument, ByVal spanId As String) As Double
    ' Error Handling
    On Error GoTo ErrHandler
    Dim sRoutine As String
    sRoutine = cModule & ".GetSpanTextForId"
    CheckArgNotNothing doc, "doc"
    CheckArgNotBadString spanId, "spanId"
    ' Procedure
    Dim oSpan As HTMLSpanElement
    Set oSpan = doc.getElementById(spanId)
    Check Not oSpan Is Nothing, "Could not find span with id: " & Bracket(spanId)
    ' innerText is a String; it is implicitly converted to the Double return
    ' type here, so text such as "$0.83" depends on the regional settings
    GetSpanTextForId = oSpan.innerText
    Exit Function
ErrHandler:
    Select Case DspErrMsg(sRoutine)
        Case Is = vbAbort: Stop: Resume 'Debug mode - Trace
        Case Is = vbRetry: Resume 'Try again
        Case Is = vbIgnore: 'End routine
    End Select
End Function
The only item that can't be reached directly through a span id is the Open Interest, which sits in a table that is the 2nd child of an element with an id. The following functions find the cell containing the label I want (i.e., "Open Interest") and return the text of the cell that immediately follows it.
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' GetOpenInterest
'
' The latest available Open Interest.
'
' param doc: the source HTMLDocument
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
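Function GetOpenInterest(ByRef doc As HTMLDocument) As Long
    Dim tbl As IHTMLTable
    Set tbl = GetSummaryDataTable(doc, 1)
    Dim k As Integer
    k = mWebScrapeHelpers.GetCellNumberForTextStartingWith(tbl, "Open Interest:")
    ' -1 means the label was not found; without this check k + 1 would silently read cell 0
    Check k >= 0, "Could not find a cell starting with: " & Bracket("Open Interest:")
    ' Long and CLng, because open interest can exceed the 32,767 limit of Integer
    GetOpenInterest = CLng(mWebScrapeHelpers.GetCellTextFromCellNumber(tbl, k + 1))
End Function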
Function GetCellNumberForTextStartingWith(ByRef tbl As IHTMLTable, ByRef s As String) As Integer
    ' Error Handling
    On Error GoTo ErrHandler
    Dim sRoutine As String
    sRoutine = cModule & ".GetCellNumberForTextStartingWith"
    CheckArgNotNothing tbl, "tbl"
    ' Procedure
    Dim tblCell As HTMLTableCell
    Dim k As Integer ' zero-based index of the current cell
    For Each tblCell In tbl.Cells
        ' s & "*" matches text that starts with s ("*" & s would match text ending with s)
        If tblCell.innerText Like (s & "*") Then
            GetCellNumberForTextStartingWith = k
            Exit Function
        End If
        k = k + 1
    Next
    ' if we got here it was not found so
    GetCellNumberForTextStartingWith = -1
    Exit Function
ErrHandler:
    Select Case DspErrMsg(sRoutine)
        Case Is = vbAbort: Stop: Resume 'Debug mode - Trace
        Case Is = vbRetry: Resume 'Try again
        Case Is = vbIgnore: 'End routine
    End Select
End Function
Function GetCellTextFromCellNumber(ByRef tbl As IHTMLTable, ByRef nbr As Integer) As String
    ' Error Handling
    On Error GoTo ErrHandler
    Dim sRoutine As String
    sRoutine = cModule & ".GetCellTextFromCellNumber"
    CheckArgNotNothing tbl, "tbl"
    Check tbl.Cells.Length > 0, "table is empty"
    ' Cell numbers are zero-based, so the last valid number is Length - 1
    Check nbr >= 0 And tbl.Cells.Length > nbr, "table only has " & tbl.Cells.Length & " cells; can't get cell number " & nbr
    ' Procedure
    GetCellTextFromCellNumber = tbl.Cells(nbr).innerText
    Exit Function
ErrHandler:
    Select Case DspErrMsg(sRoutine)
        Case Is = vbAbort: Stop: Resume 'Debug mode - Trace
        Case Is = vbRetry: Resume 'Try again
        Case Is = vbIgnore: 'End routine
    End Select
End Function
These methods work fine, but there are clearly lots of different approaches that would work, including the regex parsing approach suggested as an answer. The excellent link by RedShift got more to the point of analyzing the HTML and coming up with a strategy.
Cheers
I would probably use an XML parser to get the text content first (or this: xmlString.replace(/<[^>]+>/g, "") to replace all tags with empty strings), then use the following regexes to extract the information you need:
This process can easily be done in Node.js (more info) or in any other language that supports regex.
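For anyone who wants something concrete to start from, here is a minimal sketch of the strip-the-tags-then-regex idea. These are not the regexes from the answer: the sample markup, the `stripTags` and `extractNumberAfterLabel` helpers, and the pattern itself are all assumptions made up for illustration, and the real page will need its own pattern. One deliberate difference: tags are replaced with a space rather than an empty string, so the text of neighbouring cells doesn't run together.

```javascript
// Hypothetical sample markup; the real page will look different.
const sampleHtml =
  '<table><tr><th>Open Interest:</th><td>1,234</td></tr>' +
  '<tr><th>Volume:</th><td>56</td></tr></table>';

// Replace every tag with a space, then collapse runs of whitespace.
function stripTags(html) {
  return html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
}

// Hypothetical helper: find "<label>:" in the plain text and capture the
// number (digits and thousands separators) that follows it.
function extractNumberAfterLabel(text, label) {
  // Escape regex metacharacters so the label is matched literally.
  const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const match = new RegExp(escaped + ":\\s*([\\d,]+)").exec(text);
  return match ? Number(match[1].replace(/,/g, "")) : null;
}

const text = stripTags(sampleHtml);
const openInterest = extractNumberAfterLabel(text, "Open Interest");
console.log(openInterest); // 1234
```

Returning `null` when the label is missing keeps a "not found" result distinct from a genuine value of 0, which plays the same role as the -1 check in the VBA version.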
live demo: