On making MATCH function like FIND function

2019-09-05 18:29发布

问题:

I'm trying to make MATCH function work like FIND function. First of all, I generate the dummy data to be use for testing. Here is the routine I use:

Sub Data_Generator()
Randomize
Dim Data(1 To 100000, 1 To 1)
Dim p As Single
For i = 1 To 100000
    p = Rnd()
    If p < 0.4 Then
        Data(i, 1) = "A"
    ElseIf p >= 0.4 And p <= 0.7 Then
        Data(i, 1) = "B"
    Else
        Data(i, 1) = "C"
    End If
Next i
Range("A1:A100000") = Data
End Sub

Now, I create a sub-routine to find the string A in the range Data. There are two methods I use here that employ MATCH function. The first method is to reset the range of lookup array like the following code:

Sub Find_Match_1()
T0 = Timer
Dim i As Long, j As Long, k As Long, Data As Range
Dim Output(1 To 100000, 1 To 1)
On Error GoTo Finish
Do
    Set Data = Range(Cells(j + 1, 1), "A100000")   'Reset the range of lookup array
    i = WorksheetFunction.Match("A", Data, 0)
    j = j + i
    Output(j, 1) = j                               'Label the position of A
    k = k + 1                                      'Counting the number of [A] found
Loop
Finish:
    Range("B1:B100000") = Output
    InputBox "The number of [A] found are " & k & " in", "Process is complete", Timer - T0
End Sub

And for the second method, I assign the cell of range where A is located by value vbNullString instead of resetting Range("A1:A100000"). The idea is to delete the string A after being found and to expect MATCH function to find the next string A in the Range("A1:A100000"). Here is the code to implement the second method:

Sub Find_Match_2()
T0 = Timer
Dim n As Long, i As Long, j As Long
Dim Data_Store()
Dim Output(1 To 100000, 1 To 1)
Data_Store = Range("A1:A100000")
On Error GoTo Finish
Do
    j = WorksheetFunction.Match("A", Range("A1:A100000"), 0)
    Output(j, 1) = j
    Cells(j, 1) = vbNullString
    n = n + 1
Loop
Finish:
Range("A1:A100000") = Data_Store
Range("B1:B100000") = Output
InputBox "The number of [A] found  are " & n & " in", "Process is complete", Timer - T0
End Sub

The goal is to determine which method is better at employing MATCH function in its performance. It turns out the first method only completes less than 0.4 seconds meanwhile the second method completes about a minute on my PC. So my questions are:

  1. Why does the second method take time too long to complete?
  2. How does one improve the performance of the second method?
  3. Can MATCH function be used in an array?

回答1:

I agree that this is more of a Code Review question, but I chose to look into it for my own curiosity, so I'll share what I found.

I think you're hitting a very classic case of N vs N^2 computational complexity. Look at your two methods, which seem remarkably similar, and consider what they're actually doing, keeping in mind that the MATCH function is probably just a linear search when you use Match_type=0 (because your data is unsorted, whereas other match types could do a binary search on your sorted data).

Method 1:

  • Start at A1
  • Continue down the range until an "A" is found
  • Restart at the cell below the MATCH

Method 2:

  • Start at A1
  • Continue down the range until an "A" is found
  • Clear the "A"
  • Restart at A1

It should be instantly apparent that while one method is continually shrinking the range it searches, the other is always starting at the first cell and searching the whole range. This will account for some of the speedup, and already boosts Method 1 to a nice lead, but it's not even nearly the full story.

The real key lies in the amount of work Match has to do for each situation. Because its range constantly shrinks and moves its start further down the list, whichever cell Method 1's Match starts from, it only has to search a small number of cells before it hits an A and resumes the outer loop. Meanwhile, Method 2 is continually destroying A's, making them less and less dense and forcing itself to search more and more of the range before getting any hits. By the end, Method 2 is looping through almost 100,000 empty cells/B's/C's before finding its next A.

So on average, the Match for Method 1 is only looking through a couple of cells each time, while the Match for Method 2 is taking longer and longer as time goes on, until the end when it is forced to loop through the entire range. On top of that, Method 2 is doing a bunch of writes to cell values, which is slower than you might think when you have to do it tens of thousands of times.

In all honesty, your best bet would be to just loop through the cells yourself once, looking for A's and handling them as you go. MATCH brings no advantage to the table, and Method 1 is basically just a more complicated version of the loop I described.

I'd write this something like:

Sub Find_Match_3()
    T0 = Timer
    Dim k As Long, r As Range
    Dim Output(1 To 100000, 1 To 1)
    For Each r In Range("A1:A100000").Cells
        If r.Value = "A" Then
            Output(r.Row, 1) = r.Row        'Label the position of A
            k = k + 1                       'Counting the number of [A] found
        End If
    Next

    Range("B1:B100000") = Output
    InputBox "The number of [A] found are " & k & " in", "Process is complete", Timer - T0
End Sub

Which is about 30% faster on my machine.