Using OR logic on an array as argument in Sumprodu

2019-04-22 13:20发布

问题:

I'm having a fairly large dataset where I need to combine multiple entries into a single value. My dataset contains data on the combination of two datasets, each using their own ID's and keys.

I thought of using a Sumproduct() function like this:

=SUMPRODUCT(--('Raw data'!C:C=Landgebruik!A2);--('Raw data'!O:O={20;21;22;23;40});'Raw data'!S:S)

With Landgebruik!A2 holding an ID for the first dataset, which I need to aggregate the second dataset to.

'Raw data'!O:O contains the ID's from the second dataset. In the case above I need to sum the area (in 'Raw data'!S:S) when the value of the second ID is any of these values: {20;21;22;23;40}. (OR logic) The column only contains integer values.

Is there any other way of fixing this then duplicating --('Raw data'!O:O=20) for all values in the array?

EDIT:

I went with the work-around for now, which was: =SUMPRODUCT(--('Raw data'!C:C=Landgebruik!A2);--('Raw data'!O:O=20)+('Raw data'!O:O=20)+('Raw data'!O:O=21)+('Raw data'!O:O=22)+('Raw data'!O:O=23)+('Raw data'!O:O=40);'Raw data'!S:S). But I feel that there should be a more elegant way of doing this.

回答1:

You could make a small change to your current formula; change the ; to * (-- are also unneeded in that particular case):

=SUMPRODUCT(('Raw data'!C:C=Landgebruik!A2)*('Raw data'!O:O={20;21;22;23;40})*'Raw data'!S:S)

And that should work.


When you feed separate parameters to SUMPRODUCT, each parameter has to be of the same size. But when you multiply them like this, it forces evaluation and the arrays expand.

For example, if you take two arrays, 5x1 and 1x5, you get a 5x5 resulting array:



回答2:

You can use text search for this:

--NOT(ISERROR(FIND('Raw data'!O:O,"2021222340")))

But you have to be careful that a shorter ID is not found incorrectly in a longer ID, e.g. if you want to search among the IDs { 123, 456, 789 } then 12 is not considered to be among the IDs. So a simple text search like the above would not work. You need a delimiter character to break up the string of IDs. Usually I use the pipe character for this purpose, since I cannot remember any case when it occurred in the original text of an Excel file, and because it makes the formula human-readable:

--NOT(ISERROR(FIND("|"&'Raw data'!O:O&"|","|20|21|22|23|40|")))

Examples:

'Raw data'!O:O is 20 => |21| is found in |20|21|22|23|40|

'Raw data'!O:O is 2 => |2| is not found in |20|21|22|23|40|

(If your IDs may include the pipe character, then you can use CHR(1), a long forgotten ASCII code for SOH meaning start of header; of course, it's less readable.)

The whole formula:

=SUMPRODUCT(--('Raw data'!C:C=Landgebruik!A2),--NOT(ISERROR(FIND("|"&'Raw data'!O:O&"|","|20|21|22|23|40|"))),'Raw data'!S:S)

(Sorry, my Excel uses , instead of ;)



回答3:

Even though this has been done hundreds of times before, hey maybe microsoft switched up the formulas or something.

I am partial to the method Jerry and Me suggested as they are simple as hell and concise, but you pay a heavy performance cost.

Tom's formula looks ugly to me but was fastest by far, about 4x faster than my initial example. We were able to incorporate the {}s with Tom's formula, but to get it to work we had to wrap the sumifs function with a sum function. This slowed down the formula considerably but made it prettier.

z32a7ul had a great solution too. I really like the use of -- and learned how to use |s to search for a text and only that text. At first glance I thought that it would not work on a number such as 2323 but it does.

Mock up example was as follows:

A1:A5000 was filled with LandgeBruik,

B1:B5000 was filled with 40's

C1:5000 was filled with 1's.


The results:

=SUMPRODUCT((A1:A5000="LandgeBruik")*(B1:B5000={20,21,22,23,40})*C1:C5000)

19.186031 seconds elapsed

59,818,073 ticks

{=SUM(IF(A1:A5000="Landgebruik",1,0)*IF(B1:B5000={20,21,22,23,40},1,0)*C1:C5000)}

26.124411 seconds elapsed

81,450,506 ticks

{=SUM((A1:A5000=""Landgebruik"")*(B1:B5000={20,21,22,23,40})*C1:C5000)}

21.111835 seconds elapsed

65,822,330 ticks

"=SUMIFS(C1:C5000,B1:B5000,"">=20"",B1:B5000,""<=23"",A1:A5000,""=Landgebruik"")+SUMIFS(C1:C5000,B1:B5000,""=40"",A1:A5000,""=Landgebruik"")"

6.732804 seconds elapsed

20,991,490 ticks

"=SUM(SUMIFS(C1:C5000,A1:A5000,"Landgebruik",B1:B5000,{21,22,23,24,40}))"

16.954528 seconds elapsed

52,860,709 ticks

"=SUMPRODUCT(--(A1:A5000=""Landgebruik""),--NOT(ISERROR(FIND(""|""&B1:B5000&""|"",""|20|21|22|23|40|""))),C1:C5000)"

11.822379 seconds elapsed

36,859,729 ticks


Handy class TimerWin64 used to time these:

Option Explicit

Private Declare PtrSafe Function QueryPerformanceCounter Lib "kernel32" (lpPerformanceCount As LongInteger) As Long
Private Declare PtrSafe Function QueryPerformanceFrequency Lib "kernel32" (lpFrequency As LongInteger) As Long

Private Type LongInteger
    First32Bits As Long
    Second32Bits As Long
End Type

Private Type TimerAttributes
    CounterInitial As Double
    CounterNow As Double
    PerformanceFrequency As Double
End Type

Private Const MaxValue_32Bits = 4294967296#
Private this As TimerAttributes

    Private Sub Class_Initialize()
        PerformanceFrequencyLet
    End Sub

        Private Sub PerformanceFrequencyLet()
            Dim TempFrequency As LongInteger
            QueryPerformanceFrequency TempFrequency
            this.PerformanceFrequency = ParseLongInteger(TempFrequency)
        End Sub

    Public Sub CounterInitialLet()
        Dim TempCounterIntital As LongInteger
        QueryPerformanceCounter TempCounterIntital
        this.CounterInitial = ParseLongInteger(TempCounterIntital)
    End Sub

    Public Sub PrintTimeElapsed()
        CounterNowLet
        If CounterInitalIsSet = True Then
            Dim TimeElapsed As Double
            TimeElapsed = (this.CounterNow - this.CounterInitial) / this.PerformanceFrequency
            Debug.Print Format(TimeElapsed, "0.000000"); " seconds elapsed "

            Dim TicksElapsed As Double
            TicksElapsed = (this.CounterNow - this.CounterInitial)
            Debug.Print Format(TicksElapsed, "#,##0"); " ticks"
        End If
    End Sub

        Private Function CounterNowLet()
            Dim TempTimeNow As LongInteger
            QueryPerformanceCounter TempTimeNow
            this.CounterNow = ParseLongInteger(TempTimeNow)
        End Function

        Private Function CounterInitalIsSet() As Boolean
            If this.CounterInitial = 0 Then
                MsgBox "Counter Initial Not Set"
                CounterInitalIsSet = False
            Else
                CounterInitalIsSet = True
            End If
        End Function

        Private Function ParseLongInteger(ByRef LongInteger As LongInteger) As Double
            Dim First32Bits As Double
            First32Bits = LongInteger.First32Bits

            Dim Second32Bits As Double
            Second32Bits = LongInteger.Second32Bits

            If First32Bits < 0 Then First32Bits = First32Bits + MaxValue_32Bits
            If Second32Bits < 0 Then Second32Bits = First32Bits + MaxValue_32Bits

            ParseLongInteger = First32Bits + (MaxValue_32Bits * Second32Bits)
        End Function

Here is the code used to run the test:

Option Explicit

Sub testFunctions()

    With Application
        .Calculation = xlCalculationManual
        .ScreenUpdating = False
        .DisplayStatusBar = False
        .EnableEvents = False
    End With

    Dim ws As Worksheet
    Set ws = Sheets("Test")

    Dim Index As Long

    Dim Timer As TimerWin64
    Set Timer = New TimerWin64

    Timer.CounterInitialLet

    For Index = 1 To 5000

    Dim rngAddress As String
        rngAddress = "D" & Index
        ws.Range(rngAddress).Formula = "=SUMPRODUCT(--(A1:A5000=""Landgebruik""),--NOT(ISERROR(FIND(""|""&B1:B5000&""|"",""|20|21|22|23|40|""))),C1:C5000)"
    Next Index

    Timer.PrintTimeElapsed

    With Application
        .Calculation = xlCalculationAutomatic
        .ScreenUpdating = True
        .DisplayStatusBar = True
        .EnableEvents = True
    End With
End Sub


回答4:

You could split it into two SUMIFS as mentioned in the comment. If all values are integers, then comparing 'Raw data'!O:O to 20,21,22 and 23 is the same as testing it for >=20 and <=23. The value 40 has to be done separately.

=SUMIFS('Raw Data'!S:S,'Raw Data'!C:C,Landgebruik!A2,'Raw Data'!O:O,">="&20,'Raw Data'!O:O,"<="&23)
+SUMIFS('Raw Data'!S:S,'Raw Data'!C:C,Landgebruik!A2,'Raw Data'!O:O,40)

in my locale

or

=SUMIFS('Raw Data'!S:S;'Raw Data'!C:C;Landgebruik!A2;'Raw Data'!O:O;">="&20;'Raw Data'!O:O;"<="&23)
+SUMIFS('Raw Data'!S:S;'Raw Data'!C:C;Landgebruik!A2;'Raw Data'!O:O;40)

in your locale.

This only works when several of the criteria are consecutive integers.

Speed considerations

SUMIFS is thought to be about five times faster than sumproduct so may be the preferred option for large datasets as demonstrated here

You could argue that the more general suggestion of (effectively) five SUMIFS within a SUM from @ BrakNicku should be about as fast as one SUMPRODUCT, but the SUM(SUMIFS) would probably still win because formulas like SUMIFS handle full-column references more efficiently than array formulas.



回答5:

I want to give a shot to this question, after asking OP for some clarifications, because English is not my main language and I think I have misunderstood something.

So, what I did to simulate situation, made a new workbook with 2 sheets.

One sheet is named Landgebruik and got a value in A2 and I did this:

The second sheet is named Raw data. I hide some columns to use only columns C, O and S. In column S I input just values equal to 1. In column O I did randomly values equal to {20,21,22,23,40} and in Column C I did randomly values which were A or B. And it looks like this (please, note I hide some columns):

And question would like to sum values in column S but only If column O is equal to 20 or 21 or 22 or 23 o 40 and column C is equal to Landgebruik!A2 (in my test, value in there is letter A)

We can use an array formula to filter the data in column S and then, once filtered, sum values that meet requirements. In my test, the correct result would be 8, beause only 8 values in column S meet requirements of column C and O. In the image, the right rows are highlighted in yellow.

OP already did this, but wants to know if there is a shorter/elegant formula.

Shortest formula I found is like this:

=SUM(IF($O$2:$O$28={20;21;22;23;40};IF($C$2:$C$28=Landgebruik!$A$2;$S$2:$S$28)))

This is an array formula, so it must be inserted pressing CTRL+SHIFT+ENTER or it won't work!

HOW IT WORKS:

First IF takes all values in column S and ignores all where equivalent in Column O are not 20 or 21 or 22 or 23 or 40. Second IF takes that new array, and ignores all values where equivalent in column C are not equal to Landgebruik!$A$2. Final array is sumed up by the function SUM

I've tried to explain the best I can. I hope you can adapt this to your needs.



回答6:

If you are interested in performance (calculation speed) and are not afraid of matrix calculation, you can use MMULT:

=SUMPRODUCT(--('Raw data'!C:C=Landgebruik!A2),MMULT(--('Raw data'!O:O={20,21,22,23,24}),TRANSPOSE({1,1,1,1,1})),'Raw data'!S:S)

Explanation:

First, you create a 1048576×5 matrix, where the value in the i-th row and j-th column is 1 if the ID in 'Raw data'!O:O's i-th line is the same as the j-th value in the enumeration {20,21,22,23,24}, 0 otherwise.

Second, you multiply this by a vector of 1s (5 1s because {20,21,22,23,24} contains five elements), which means that you accept all the five values.

Third, from the above you get a vector where the i-th element is 1 if the ID is among the accepted values, 0 otherwise, and you put this vector next to the others in your SUMPRODUCT.

(Sorry, my Excel uses ',' instead of ';'. If you want to shorten the formula, you may write {1;1;1;1;1} instead of TRANSPOSE({1,1,1,1,1}). But you have to find out what your Excel uses instead of ';' to separate rows, most probably '.'.)

Note: It may improve the speed of the calculation if you refer to the range which actualy contain values, not the whole column, e.g. 'Raw data'!C1:C123 instead of 'Raw data'!C:C.

If you insert new rows with Shift+Space Ctrl++ above the last row already included, then the references in your formulas will be updated automatically. Alternatively, you may use Names with special formulas that grow the Range referred to by determining the last non-empty cell.

Update

I made some measurements to compare the efficiency of these approaches. I used random data of 10000 rows and I recalculated each formula 1000 times. You can see the elapsed time in the second column.

I commented out the other formulas while I ran this VBA code to measure the time:

Public Sub MeasureCalculationTime()
    Dim datStart As Date: datStart = Now

    Dim i As Long: For i = 1 To 1000
        Application.Calculate
    Next i

    Dim datFinish As Date: datFinish = Now
    Dim dblSeconds As Double: dblSeconds = (datFinish - datStart) * 24 * 60 * 60
    Debug.Print "Calculation finished at " & datFinish; " took " & dblSeconds & " seconds"
End Sub

In this scenario, MMULT was not the fastest.

However, I would like to point out that it is the most flexible because

  1. You may use it with switches: You refer to a cell range instead of the {1,1,1,1,1}, and you will be able to include / exclude IDs in the selection very quickly. Like you put into A1:A5 {20,21,22,23,24} and next to it, into B1:B5 {1,1,1,1,1}. If you want to exclude 21, then you rewrite B2 to 0, if you want to include it, you write it back to 1.

  2. You may use more complicated criteria, where you have to compare multiple levels. Like:

    =SUMPRODUCT(MMULT(--(CarId=CarOwner),--(CarOwner=ListOfJobs),--(ListOfJobs=JobsByDepartment),--(DepartmentIncludedInSelection=1)),FuelConsumption)

Note: The above line is just pseudocode, MMULT has only two parameters.



回答7:

This could work:

={SUMPRODUCT(--('Raw data'!C:C=Landgebruik!A2);--IFERROR(MATCH('Raw data'!O:O;{20;21;22;23;40};0)>0;0);'Raw data'!S:S)}

This needs to be entered as an array formula.