Parsing HTML with VB DOTNET

2019-08-01 22:25发布

问题:

I am trying to parse some data from a website to get specific items from their tables. I know that any tag with the bgcolor attribute set to #ffffff or #f4f4ff is where I want to start and my actual data sits in the 2nd within that .

Currently I have:

Private Sub runForm()


    Dim theElementCollection As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("TR")
    For Each curElement As HtmlElement In theElementCollection
        Dim controlValue As String = curElement.GetAttribute("bgcolor").ToString
        MsgBox(controlValue)
        If controlValue.Equals("#f4f4ff") Or controlValue.Equals("#ffffff") Then

        End If
    Next
End Sub

This code gets the TR element that I need, but I have no idea how (if it is possible) to then investigate the inner elements. If not, what do you think would be the best route to take? The site does not really label any of their tables. The 's i am looking for basically look like:

<td><b><font size="2"><a href="/movie/?id=movieTitle.htm">The Movie</a></font></b></td>

I want to pull out "The Movie" text and add it to a text file.

回答1:

Use the InnerHtml property of the HtmlElement object (curElement) you have, like this:

For Each curElement As HtmlElement In theElementCollection
    Dim controlValue As String = curElement.GetAttribute("bgcolor").ToString
    MsgBox(controlValue)
    If controlValue.Equals("#f4f4ff") Or controlValue.Equals("#ffffff") Then
        Dim elementValue As String = curElement.InnerHtml
    End If
Next

Read the documentation of HtmlElement.InnerHtml Property for more information.

UPDATE:

To get the second child of the <tr> HTML element, use a combination of FirstChild and then NextSibling, like this:

For Each curElement As HtmlElement In theElementCollection
    Dim controlValue As String = curElement.GetAttribute("bgcolor").ToString
    MsgBox(controlValue)
    If controlValue.Equals("#f4f4ff") Or controlValue.Equals("#ffffff") Then
        Dim firstChildElement = curElement.FirstChild
        Dim secondChildElement = firstChildElement.NextSibling

        ' secondChildElement should be the second <td>, now get the value of the inner HTML
        Dim elementValue As String = secondChildElement.InnerHtml
    End If
Next