
How to remove duplicates that are case SENSITIVE in Excel

Published 2019-09-18 20:18

Question:

I have tried using this formula and copying it down to all >100k records I have.

=IF(SUMPRODUCT(--EXACT(A2,$B$1:B1)),"",A2)

where:

  1. column A = the column containing all the data, including duplicates
  2. column B = the column that displays a value from column A if it is unique, otherwise an empty string
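In plain terms, the formula performs a quadratic scan: each row is compared, case sensitively, against every value already written to column B. A rough sketch of that logic in Python (illustrative only; the function and variable names are made up, and this runs outside Excel):

```python
# Sketch of the SUMPRODUCT/EXACT formula's logic: each row is compared,
# case sensitively, against every value emitted above it, so 100k rows
# means on the order of 5 billion comparisons.
def filter_unique_case_sensitive(column_a):
    """Return column B: the value if not already in B above, else ''."""
    column_b = []
    for value in column_a:
        # EXACT(...) is a case-sensitive comparison, unlike plain "="
        if any(value == previous for previous in column_b if previous != ""):
            column_b.append("")         # duplicate -> empty string
        else:
            column_b.append(value)      # first occurrence -> keep it
    return column_b

print(filter_unique_case_sensitive(["abc", "ABC", "abc", "Def"]))
# → ['abc', 'ABC', '', 'Def']
```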

However, I hit this issue:

Yes, my Excel 2016 is 32-bit, and yes, my laptop has only 8GB of RAM. But I have read that people with 64-bit Excel and 16GB of RAM experienced the same error as me.

I know there is a built-in feature in Excel: Data > Select Column(s) > Remove Duplicates. However, that feature compares values case INSENSITIVELY, so it also deletes entries that differ only in case.

Please advise me how I can overcome this issue. I am open to using tools like Crystal Reports or some sort of freeware to solve it.

Answer 1:

You may try something like this. Back up your data before running it. The code below removes the duplicates from column A, and it is case sensitive.

Sub GetUniqueValues()
    Dim x, dict
    Dim i As Long, lr As Long
    'Find the last used row in column A
    lr = Cells(Rows.Count, 1).End(xlUp).Row
    x = Range("A2:A" & lr).Value
    'Dictionary keys are case sensitive by default (vbBinaryCompare)
    Set dict = CreateObject("Scripting.Dictionary")
    For i = 1 To UBound(x, 1)
        dict.Item(x(i, 1)) = ""
    Next i
    Range("A2:A" & lr).ClearContents
    'Write the unique keys back as a single column
    Range("A2").Resize(dict.Count).Value = Application.Transpose(dict.Keys)
End Sub

Edited code, which builds the output array manually instead of relying on Application.Transpose (Transpose fails on arrays longer than roughly 65,536 elements, so the version above can break on 100k+ rows):

Sub GetUniqueValues()
    Dim x, dict, y, it
    Dim i As Long, lr As Long
    Application.ScreenUpdating = False

    lr = Cells(Rows.Count, 1).End(xlUp).Row
    x = Range("A2:A" & lr).Value
    'Dictionary keys are case sensitive by default (vbBinaryCompare)
    Set dict = CreateObject("Scripting.Dictionary")
    For i = 1 To UBound(x, 1)
        dict.Item(x(i, 1)) = ""
    Next i

    'Copy the keys into a 2-D array so they can be written back in one step
    ReDim y(1 To dict.Count, 1 To 1)
    i = 0
    For Each it In dict.Keys
        i = i + 1
        y(i, 1) = it
    Next it

    Range("A2:A" & lr).ClearContents
    Range("A2").Resize(dict.Count).Value = y

    Application.ScreenUpdating = True
End Sub
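For comparison, the same dictionary technique outside Excel (an illustrative Python sketch, not part of the original answer). Dictionary keys are compared case sensitively and, like Scripting.Dictionary here, preserve insertion order:

```python
# Sketch: the dictionary dedup technique in Python. Dict keys are compared
# case sensitively and (since Python 3.7) keep insertion order, mirroring
# what the Scripting.Dictionary does in the VBA above.
def unique_case_sensitive(values):
    return list(dict.fromkeys(values))

print(unique_case_sensitive(["abc", "ABC", "abc", "Def"]))
# → ['abc', 'ABC', 'Def']
```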


Answer 2:

For a general solution, the VBA approach already suggested is probably preferable. But for a one-off job, you can probably make it work the way you intended with only a little adaptation in how you apply =IF(SUMPRODUCT(--EXACT(A2,$B$1:B1)),"",A2). I also tried a COUNTIF approach, which is much faster than SUMPRODUCT, but it is not case sensitive.

Since I am also running 32-bit Excel with 8GB of memory, I was curious whether I could replicate the memory issue. I generated a list of 100,000 random 5-letter strings in column A. Only 10 letters were used (ABCDEFGHJK), so among 100,000 strings some would occur more than once. I then applied the formula suggested by the OP in column B to filter out only the unique values. It did indeed work, but it took quite some time; however, I never ran into the memory issue that the OP did.

Proposed solution:
Based on these observations, one possible solution to your particular problem might be to copy column A to a new, temporary workbook and run your SUMPRODUCT formula there while all other workbooks are closed. Once it has finished, you could paste the result back into the original column in the original file. Actually removing the duplicates could then be done by filtering on that column so that all duplicates (empty cells) are grouped together, and then deleting those rows. Details of my replication attempt follow.

SUMPRODUCT: Approximately 1 hour
First I tried the same formula as in the OP, =IF(SUMPRODUCT(--EXACT(A2,$B$1:B1)),"",A2), but doing only 10,000 rows at a time (by inserting empty rows at row 10,000, 20,000, etc. and copying down ten thousand rows at a time). Each set of 10,000 rows took a couple of minutes to complete. When I did the whole shebang as one giant copy operation for all 100,000 cells at once, it took around one hour to complete, and Excel was unresponsive in the meantime. Memory usage was 1.4 GB and the CPU averaged over 50% capacity (monitored with the Windows Task Manager). I also tried running the formula after I had already manipulated the data in various ways (thus consuming more memory), which pushed CPU usage to 100% and caused a couple of crashes. I avoided that by simply closing Excel to clear the memory and rerunning the operation from a fresh restart with no other workbooks open.

As you can see in the following screenshots, the formula worked, and the unique entries became rarer further down the list (as expected, since the strings are random). I assigned 1 to cells containing duplicates so I could count them easily. There were 36,843 such instances.

First rows, no duplicates:

Last rows, mostly duplicates (cells with 1):

COUNTIF: 8.5 minutes
Compared to the SUMPRODUCT approach, which took around one hour to complete, the following COUNTIF formula completed the same job in only 8.5 minutes, but it does not distinguish between lower and upper case. This approach requires a help column. COUNTIF returns the number of times a particular string occurs in the range from the top down to and including the current cell, so the first time a string is encountered it returns 1. Cell B2 contains =COUNTIF($A$2:$A2,A2), and copying this down for all 100,000 rows took around eight and a half minutes. Then, in a separate column, I used a simple IF formula to filter out the unique values from column A; cell C2 contains =IF(B2=1,A2,1), which returns the string in column A if it is unique; otherwise 1 is returned (to allow easy comparison with SUMPRODUCT). Copying this IF formula down all 100,000 rows is practically instantaneous. Reassuringly, the sum of 1s in column C after this operation was the same as with SUMPRODUCT: 36,843.
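The running-count idea can be sketched outside Excel as well (illustrative Python; the names are mine). Lowercasing the key mimics COUNTIF's case-insensitive comparison, and the single pass over a dictionary is what makes this so much faster than the quadratic SUMPRODUCT scan:

```python
# Sketch: the COUNTIF help-column idea as a running count per value.
# COUNTIF compares case insensitively, mimicked here by counting on a
# lowercased key; one linear pass replaces the quadratic scan.
def countif_style(column_a):
    counts = {}       # running occurrence count per (lowercased) value
    column_b = []     # help column: occurrences so far, inclusive
    for value in column_a:
        key = value.lower()              # COUNTIF ignores case
        counts[key] = counts.get(key, 0) + 1
        column_b.append(counts[key])
    # column C: keep the value where the running count is 1, else 1
    column_c = [a if b == 1 else 1 for a, b in zip(column_a, column_b)]
    return column_b, column_c

b, c = countif_style(["abc", "ABC", "abc"])
print(b)  # [1, 2, 3] - "ABC" counts as a repeat of "abc"
print(c)  # ['abc', 1, 1]
```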

INDEX: Failure
I also played around with an array formula using the INDEX and MATCH functions. This formula does the same job as COUNTIF but also filters out the empty rows: =INDEX($A$2:$A$100001,MATCH(0,COUNTIF($E$1:E1,$A$2:$A$100001),0)). It should be entered in cell B2 as an array formula (Ctrl + Shift + Enter) and then copied down. Copying individual cells one at a time worked fine for a few dozen rows, but anything more than that caused Excel to crash. I even tried running it overnight, but the operation never finished. (The formula could be extended to be case sensitive, but I didn't try.)

One thing to note about the failed INDEX formula is that the behavior described above occurred when the formula was applied in a separate workbook. I also tried running it in column D of the same workbook as the COUNTIF formula. There I actually did run into the memory issue described in the OP, which, unsurprisingly, suggests that the memory problem depends on the rest of the data in the workbook.



Answer 3:

This uses the power of sorting to get unique values. No libraries are required, and it can easily be converted to select complete rows:

Sub GetUniqueValues()
    Dim count As Long, i As Long, LastCell As Long

    'Sort once so we can run through the list without nested loops
    Sheet1.Range("$A:$A").Sort Key1:=Sheet1.Range("$A:$A"), Header:=xlYes, MatchCase:=True
    count = Application.WorksheetFunction.CountA(Sheet1.Range("$A:$A"))
    LastCell = 1

    For i = 2 To count
        'Option Compare Binary (the default) makes "=" case sensitive here
        If Sheet1.Cells(i, 1).Value = Sheet1.Cells(LastCell, 1).Value Then
            'Remove second/third/fourth occurrences
            Sheet1.Cells(i, 1).Clear
        Else
            'If it's the first occurrence of this value, note its position
            LastCell = i
        End If
    Next

    'Sort again to move the emptied cells to the bottom
    Sheet1.Range("$A:$A").Sort Key1:=Sheet1.Range("$A:$A"), Header:=xlYes, MatchCase:=True
End Sub
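The sort-then-scan idea can be sketched outside Excel like this (illustrative Python; the names are mine). A case-sensitive sort places equal strings, and only equal strings, next to each other, so one linear pass removes the duplicates; as with the VBA above, the original row order is not preserved:

```python
# Sketch: sort-then-scan dedup. Python's default string sort is case
# sensitive (code-point order), so duplicates end up adjacent, just as
# with MatchCase:=True in the VBA sort above.
def sorted_unique_case_sensitive(values):
    result = []
    for value in sorted(values):
        # keep a value only if it differs from the last one kept
        if not result or value != result[-1]:
            result.append(value)
    return result

print(sorted_unique_case_sensitive(["abc", "ABC", "abc", "Def"]))
# → ['ABC', 'Def', 'abc']  (uppercase sorts before lowercase in ASCII)
```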