Loop through links and download PDF's

2020-05-06 15:04发布

问题:

I have a code that has been here for a while with different types of questions. This is getting closer to it's final version. However now I have a problem that there is mistake in the code and part of it is not functioning correct.

The idea is to go through the links and grab PDF files. Links are getting stored in sLinks, see line with comment "Check that links are stored in sLinks". Code goes forward and files are getting stored in C:\temp\, but then after 12 PDF's are in folder I am getting an error and debugger is pointing to xHttp.Open "GET", sLink.

I went through PDF's and it looks like all are downloaded... as there are some are the same on several pages and also there is one Policy PDF on two pages at least. That's why there are 17 links and 12 files. Anyway why it is throwing an error?

What might be the problem?

Here is my code:

Sub DownloadFiles()
    Dim xHttp       As Object: Set xHttp = CreateObject("Microsoft.XMLHTTP")
    Dim hDoc        As MSHTML.HTMLDocument
    Dim Anchors     As Object
    Dim Anchor      As Variant
    Dim sPath       As String
    Dim wholeURL    As String

    Dim internet As InternetExplorer
    Dim internetdata As HTMLDocument
    Dim internetlink As Object
    Dim internetinnerlink As Object
    Dim arrLinks As Variant
    Dim sLink As String
    Dim iLinkCount As Integer
    Dim iCounter As Integer
    Dim sLinks As String

    Set internet = CreateObject("InternetExplorer.Application")
    internet.Visible = False
    internet.navigate ("https://www.nordicwater.com/products/waste-water/")

        Do While internet.Busy
          DoEvents
        Loop
        Do Until internet.readyState = READYSTATE_COMPLETE
            DoEvents
        Loop

        Set internetdata = internet.document
        Set internetlink = internetdata.getElementsByTagName("a")

        i = 1

        For Each internetinnerlink In internetlink
            If Left$(internetinnerlink, 36) = "https://www.nordicwater.com/product/" Then

                sLinks = sLinks & internetinnerlink.href & vbCrLf
                i = i + 1

            Else
            End If

    ThisWorkbook.Worksheets("Sheet1").range("B1").Value = sLinks ' Check that links are stored in sLinks

    Next internetinnerlink

    wholeURL = "https://www.nordicwater.com/"
    sPath = "C:\temp\"

    arrLinks = Split(sLinks, vbCrLf)
    iLinkCount = UBound(arrLinks) + 1

    For iCounter = 1 To iLinkCount
    sLink = arrLinks(iCounter - 1)
        'Get the directory listing
        xHttp.Open "GET", sLink ' DEBUGGER IS POINTING HERE
        xHttp.send

        'Wait for the page to load
        Do Until xHttp.readyState = 4
            DoEvents
        Loop

        'Put the page in an HTML document
        Set hDoc = New MSHTML.HTMLDocument
        hDoc.body.innerHTML = xHttp.responseText

        'Loop through the hyperlinks on the directory listing
        Set Anchors = hDoc.getElementsByTagName("a")

        For Each Anchor In Anchors

            'test the pathname to see if it matches your pattern
            If Anchor.pathname Like "*.pdf" Then

                xHttp.Open "GET", wholeURL & Anchor.pathname, False
                xHttp.send

                With CreateObject("Adodb.Stream")
                    .Type = 1
                    .Open
                    .write xHttp.responseBody
                    .SaveToFile sPath & getName(wholeURL & Anchor.pathname), 2 '//overwrite
                End With

            End If

        Next

    Next

End Sub

Function to build file name out of link:

Function getName(pf As String) As String
    getName = Split(pf, "/")(UBound(Split(pf, "/")))
End Function

EDIT:

I have fixed first issue. arrLinks = Split(p_sLinks, vbCrLf) changed to arrLinks = Split(sLinks, vbCrLf) as it should be. Now I am facing another problem.

回答1:

I would add a If Len(sLink) > 0 check before calling the HTTP GET.

The problem is with this line:

sLinks = sLinks & internetinnerlink.href & vbCrLf

It will add an extra vbCrLf at the end of the sLinks list. It should be:

If sLinks <> "" Then sLinks = sLinks & vbCrLf
sLinks = sLinks & internetinnerlink.href

This way there won't be a vbCrLf after the last link