Excel macro to search a website with excel data an

2020-08-01 08:32发布

问题:

I am hoping someone can help....

I have 8000 values in a excel spreadsheet that I need to search in a website and then record a specific line of data from the website to be inputted back into the excel spreadsheet.

I have found a previous post which searches for the data I am looking at excel macro to search a website and extract results

with the code being;

Sub URL_Get_ABN_Query()
    strSearch = Range("a1")
    With ActiveSheet.QueryTables.Add( _
                      Connection:="URL;http://www.abr.business.gov.au/SearchByABN.aspx?SearchText=" & strSearch & "&safe=active", _
                     Destination:=Range("a5"))

        .BackgroundQuery = True
        .TablesOnlyFromHTML = True
        .Refresh BackgroundQuery:=False
        .SaveData = True
    End With
    `enter code here`
End Sub

However, when I run then Macro in Excel it collects all the data from the website like this.

I only want the 'entity type' data line to be inputted. I have searched everywhere and can not seem to find how to extend the code to only grab this line of information and input to the corresponding cell (i.e. ABN(b2)search, find input 'entity type' and paste into Company Type(c2).

Alternatively, I have also tried to find how to get the macro to fill the information vertically instead of horizontally as then I could delete the columns that are not needed, I thought this may be a simpler way to run this macro but alas again I could not find help to do it. I also tried to record the macro with developer but that did not work either.

I also need to loop it to run the next ABN and populate the corresponding field and so on (B3>C3, B4>C4, etc etc).

I would love some help figuring, I am a beginner in VBA and think what I want to do is beyond my skill level at this time. I am trying to understand through tutorials, google searches and help pages but can not seem to find how or if this can be done.

My alternative is to do this manually for each of the 8000 data points, copying each abn, searching in the website and then copying the entity type and pasting into excel, I did try this first but after a while started searching for a better way. Can you help????

回答1:

This is absolutely possible. You've got what I often find the hardest part, sourcing the information from another platform. To make this work I would separate it out a little bit and for simplicity use 2 sheets (Sheet1 with your known data and Sheet2 for the web data).

Loop through your table of ~8000 businesses. We can identify this from the UsedRange number of Rows. We know that the ABN is in column 2 (also known as B) so we copy that into the variable to pass to the function. The function will return the "Entity type:" to column 3 (C) of the same row.

Sub LoopThroughBusinesses() 
    Dim i As Integer
    Dim ABN As String
    For i = 2 To Sheet1.UsedRange.Rows.Count
        ABN = Sheet1.Cells(i, 2)
        Sheet1.Cells(i, 3) = URL_Get_ABN_Query(ABN)
    Next i
End Sub

Change the subroutine you created to a Function so it returns the entity type you are after. The function will save the data into Sheet2 and then return just the Entity data that we are after.

Function URL_Get_ABN_Query(strSearch As String) As String   ' Change it from a Sub to a Function that returns the desired string
    ' strSearch = Range("a1") ' This is now passed as a parameter into the Function
    Dim entityRange As Range
    With Sheet2.QueryTables.Add( _
            Connection:="URL;http://www.abr.business.gov.au/SearchByABN.aspx?SearchText=" & strSearch & "&safe=active", _
            Destination:=Sheet2.Range("A1"))   ' Change this destination to Sheet2

        .BackgroundQuery = True
        .TablesOnlyFromHTML = True
        .Refresh BackgroundQuery:=False
        .SaveData = True
    End With

    ' Find the Range that has "Entity Type:"
    Set entityRange = Sheet2.UsedRange.Find("Entity type:")

    ' Then return the value of the cell to its' right
    URL_Get_ABN_Query = entityRange.Offset(0, 1).Value2

    ' Clear Sheet2 for the next run
    Sheet2.UsedRange.Delete

End Function


回答2:

You do not want a load of connections (queryTables) set up in this way. It will be so slow if even possible. At 8000 requests, provided xmlhttp is not blocked or throttled, the below method will be significantly faster. If there does appear to be slowing/blocking then add in a small wait every x requests.

If possible use xmlhttp to gather data. Use css selectors to specifically target the entity type. Store values in an array and write out with loop at end. Use a class to hold the xmlhttp object for greater efficiency. Provide your class with methods including how to handle not found (example given). Add some further optimizations e.g. given is switching off screen-updating. This assumes your search numbers are in column B from B2. The code below also does some basic checks that there is something present in column B and handles the case of there being 1 or more numbers.

Good code is modular and you want a function to return something and a sub to perform actions. A single sub/function shouldn't complete lots of tasks. You want to easily debug with code that follows the principle of single responsibility (or close to it).

class clsHTTP

Option Explicit

Private http As Object  
Private Sub Class_Initialize()
    Set http = CreateObject("MSXML2.XMLHTTP")
End Sub

Public Function GetHTML(ByVal URL As String) As String
    Dim sResponse As String
    With http
        .Open "GET", URL, False
        .send
        GetHTML = StrConv(.responseBody, vbUnicode)
    End With
End Function

Public Function GetEntityType(ByVal html As HTMLDocument) As String
    On Error GoTo errhand:
     GetEntityType = html.querySelector("a[href*='EntityTypeDescription']").innerText
    Exit Function
errhand:
    GetEntityType = "Not Found"
End Function

Standard module:

Option Explicit 
Public Sub GetInfo()
    Dim http As clsHTTP, sResponse As String, lastRow As Long, groupResults(), i As Long, html As HTMLDocument
    Set html = New HTMLDocument
    Set http = New clsHTTP
    Const BASE_URL As String = "http://www.abr.business.gov.au/ABN/View/"
    Application.ScreenUpdating = False

    With ThisWorkbook.Worksheets("Sheet1")
        lastRow = .Cells(.rows.Count, "B").End(xlUp).Row
        Select Case lastRow
        Case 1
            Exit Sub
        Case 2
            ReDim arr(1, 1): arr(1, 1) = .Range("B2").Value
        Case Else
            arr = .Range("B2:B" & lastRow).Value
        End Select

        ReDim groupResults(1 To lastRow - 1)

        With http
            For i = LBound(arr, 1) To UBound(arr, 1)
                If Len(BASE_URL & arr(i, 1)) > Len(BASE_URL) Then
                    sResponse = .GetHTML(BASE_URL & arr(i, 1))
                    html.body.innerHTML = sResponse
                    groupResults(i) = .GetEntityType(html)
                    sResponse = vbNullString: html.body.innerHTML = vbNullString
                End If
            Next
        End With
        For i = LBound(groupResults) To UBound(groupResults)
            .Cells(i + 1, "C") = groupResults(i)
        Next
    End With
    Application.ScreenUpdating = True
End Sub

References (VBE> Tools > References):

  1. Microsoft HTML Object Library

CSS selectors:

I use the fact the entity description is a hyperlink (a tag) and that its value contains the string EntityTypeDescription to use a css attribute = value with contains (*) operator to target.