Dual class in html scrapping vba

2019-08-01 14:42发布

I am trying to extract prices from this HTML page using the VBA code below:

Here's the HTML snippet:

<div class="box-text box-text-products">
    <div class="title-wrapper">
        <p class="category uppercase is-smaller no-text-overflow product-cat op-7">
    Xikar Lighters      
        </p>
        <p class="name product-title">
            <a href="https://www.havanahouse.co.uk/product/xikar-allume-single-jet-flame-racing-cigar-lighter-bluewhite-stripe/">Xikar Allume Single Jet Flame Racing Cigar Lighter &#8211; Blue/White Stripe</a>
        </p>
    </div>
    <div class="price-wrapper">
        <span class="price">
            <del>
                <span class="woocommerce-Price-amount amount">
                    <span class="woocommerce-Price-currencySymbol">&pound;</span>48.00
                </span>
            </del>
            <ins>
                <span class="woocommerce-Price-amount amount">
                    <span class="woocommerce-Price-currencySymbol">&pound;</span>45.00
                </span>
            </ins>
        </span>
    </div>
</div>
<!-- box-text -->undefined</div>undefined<!-- box -->undefined</div>undefined<!-- .col-inner -->undefined</div>undefined<!-- col -->

I am using the below code but I get an error:

For Each oElement In oHtml.getElementsByClassName("woocommerce-Price-amoun t amount")
    If oElement.getElementsByTagName("del") Then Exit For

    If oElement.innerText <> 0  Then
        Cells(counter, 3) = CDbl(oElement.innerText)
        counter = counter + 1
    End If
Next oElement

1条回答
放荡不羁爱自由
2楼-- · 2019-08-01 15:24

Take a look at the below example:

Option Explicit

Sub Test()

    Dim sUrl As String
    Dim oWS As Worksheet
    Dim i As Long
    Dim sResp As String
    Dim sCont As String
    Dim oMatch

    sUrl = "https://www.havanahouse.co.uk/?post_type=product"
    Set oWS = ThisWorkbook.Sheets(1)
    oWS.Cells.Delete
    i = 1
    Do
        With CreateObject("MSXML2.XMLHTTP")
            .Open "GET", sUrl, False
            .send
            sResp = .ResponseText
        End With
        With CreateObject("VBScript.RegExp")
            .Global = True
            .MultiLine = True
            .IgnoreCase = True
            .Pattern = "<div class=""shop-container"">([\s\S]*?)<div class=""container"">"
            With .Execute(sResp)
                If .Count = 0 Then Exit Do
                sCont = .Item(0).SubMatches(0)
            End With
            .Pattern = "<div class=""title-wrapper"">([\s\S]*?)</div><div class=""price-wrapper"">([\s\S]*?)</div>"
            For Each oMatch In .Execute(sCont)
                oWS.Cells(i, 1) = GetInnerText(oMatch.SubMatches(0))
                oWS.Cells(i, 2) = GetInnerText(oMatch.SubMatches(1))
                oWS.Columns.AutoFit
                i = i + 1
                DoEvents
            Next
            oWS.Cells(i, 1).Select
            .Pattern = "<a class=""next page-number""[\s\S]*?href=""([^""]*)"""
            With .Execute(sResp)
                If .Count = 0 Then Exit Do
                sUrl = .Item(0).SubMatches(0)
            End With
        End With
    Loop

End Sub

Function GetInnerText(sText As String) As String

    Static oHtmlfile As Object
    Static oDiv As Object

    If oHtmlfile Is Nothing Then
        Set oHtmlfile = CreateObject("htmlfile")
        oHtmlfile.Open
        Set oDiv = oHtmlfile.createElement("div")
    End If
    oDiv.innerHTML = sText
    GetInnerText = oDiv.innerText

End Function

The output for me is as follows:

output

Generally RegEx's aren't recommended for HTML parsing, so there is disclaimer. Data being processed in this case is quite simple that is why it is parsed with RegEx. About RegEx: introduction (especially syntax), introduction JS, VB flavor.

BTW there are another answers using the similar approach: 1, 2, 3, 4, 5.

查看更多
登录 后发表回答