Efficiently merge large object datasets having multiple matching keys

Posted 2020-05-09 06:13

In a PowerShell script, I have two data sets that have multiple columns. Not all of these columns are shared.

For example, data set 1:

A B    XY   ZY  
- -    --   --  
1 val1 foo1 bar1
2 val2 foo2 bar2
3 val3 foo3 bar3
4 val4 foo4 bar4
5 val5 foo5 bar5
6 val6 foo6 bar6

and data set 2:

A B    ABC  GH  
- -    ---  --  
3 val3 foo3 bar3
4 val4 foo4 bar4
5 val5 foo5 bar5
6 val6 foo6 bar6
7 val7 foo7 bar7
8 val8 foo8 bar8

I want to merge these two datasets, specifying which columns act as the key (A and B in my simple case). The expected result is:

A B    XY   ZY   ABC  GH  
- -    --   --   ---  --  
1 val1 foo1 bar1          
2 val2 foo2 bar2          
3 val3 foo3 bar3 foo3 bar3
4 val4 foo4 bar4 foo4 bar4
5 val5 foo5 bar5 foo5 bar5
6 val6 foo6 bar6 foo6 bar6
7 val7           foo7 bar7
8 val8           foo8 bar8

The concept is very similar to a SQL full outer join query.

I have been able to successfully write a function that merges objects. Unfortunately, the computation time grows quadratically with the input size.

If I generate my data sets using:

$dsLength = 10
$dataset1 = 0..$dsLength | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}

I get these results:

  • $dsLength = 10 ==> 33ms (fine)
  • $dsLength = 100 ==> 89ms (fine)
  • $dsLength = 1000 ==> 1563ms (acceptable)
  • $dsLength = 5000 ==> 35764ms (too much)
  • $dsLength = 10000 ==> 138047ms (too much)
  • $dsLength = 20000 ==> 573614ms (far too much)

How can I merge the data sets efficiently when they are large (my target is around 20K items)?

Right now, I have this function defined:

function Merge-Objects{
    param(
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset1,
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset2,
        [Parameter()]
        [string[]]$Properties
    )

    $result = @()

    $ds1props = $Dataset1 | gm -MemberType Properties
    $ds2props = $Dataset2 | gm -MemberType Properties
    $ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
    $ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }

    foreach($row1 in $Dataset1){
        $result += $row1
        $ds2propsNotInDs1Props | % {
            $row1 | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
        }
    }

    foreach($row2 in $Dataset2){
        $existing = foreach($candidate in $result){
            $match = $true
            foreach($prop in $Properties){
                if(-not ($row2.$prop -eq $candidate.$prop)){
                    $match = $false                   
                    break                  
                }
            }
            if($match){
                $candidate
                break
            }
        }
        if(!$existing){
            $ds1propsNotInDs2Props | % {
                $row2 | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
            }
            $result += $row2
        }else{
            $ds2propsNotInDs1Props | % {
                $existing.$($_.Name) = $row2.$($_.Name)
            }

        }
    }

    $result
}

I call this function like this:

Measure-Command -Expression {

    $data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B" 

}

My feeling is that the slowness is due to the second loop, where I try to match an existing row in each iteration.

[Edit] Second approach, using a hash as an index. Surprisingly, it's even slower than the first try:

$dsLength = 1000
$dataset1 = 0..$dsLength | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}

function Get-Hash{
    param(
        [Parameter(Mandatory=$true)]
        [object]$InputObject,
        [Parameter()]
        [string[]]$Properties    
    )

    $InputObject | Select-object $properties | Out-String
}


function Merge-Objects{
    param(
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset1,
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset2,
        [Parameter()]
        [string[]]$Properties
    )

    $result = @()
    $index = @{}

    $ds1props = $Dataset1 | gm -MemberType Properties
    $ds2props = $Dataset2 | gm -MemberType Properties
    $allProps = $ds1props + $ds2props | select -Unique

    $ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
    $ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }

    $ds1index = @{}

    foreach($row1 in $Dataset1){
        $tempObject = new-object psobject
        $result += $tempObject
        $ds2propsNotInDs1Props | % {
            $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
        }
        $ds1props | % {
            $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $row1.$($_.Name)
        }

        $hash1 = Get-Hash -InputObject $row1 -Properties $Properties
        $ds1index.Add($hash1, $tempObject)

    }

    foreach($row2 in $Dataset2){
        $hash2 = Get-Hash -InputObject $row2 -Properties $Properties

        if($ds1index.ContainsKey($hash2)){
            # merge object
            $existing = $ds1index[$hash2]
            $ds2propsNotInDs1Props | % {
                $existing.$($_.Name) = $row2.$($_.Name)
            }
            $ds1index.Remove($hash2)

        }else{
            # add object
            $tempObject = new-object psobject
            $ds1propsNotInDs2Props | % {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
            }
            $ds2props | % {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $row2.$($_.Name)
            }
            $result += $tempObject
        }
    }

    $result
}

Measure-Command -Expression {

    $data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B" 

}

[Edit2] Putting Measure-Command around the two loops shows that even the first loop is slow. In fact, the first loop takes more than 50% of the total time.

2 Answers

贪生不怕死 · 2020-05-09 06:51

I had a lot of doubts about incorporating a binary search (a hash table) into my Join-Object cmdlet (see also: In PowerShell, what's the best way to join two tables into one?), as there are a few issues to overcome that are conveniently left out of the example in the question.

Unfortunately, I can't compete with the performance of @mhhollomon's solution:

dsLength Steve1 Steve2 mhhollomon Join-Object
-------- ------ ------ ---------- -----------
      10     19    129         21          50
     100    145    915        158         329
    1000   2936   9646       1575        3355
    5000  56129  69558       5814       12653
   10000 183813  95472      14740       25730
   20000 761450 265061      36822       80644

But I think that I can add some value:

String keys

Hash keys are strings, which means that you need to cast the related key properties to strings, which is a little questionable, simply because:

$Left -eq $Right ≠ "$Left" -eq "$Right"

In most cases it will work, especially when the source is a .csv file, but it might go wrong, e.g. when the data comes from a cmdlet where $Null means something other than an empty string (''). Therefore, I recommend explicitly representing $Null keys, e.g. with a control character.
And as property values could easily contain a colon (:), I would also recommend using a control character for separating (joining) multiple keys.
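A minimal sketch of both pitfalls (the Get-Key helper and the choice of the escape character as separator are my own illustration, not part of Join-Object):

```powershell
# Two rows that are genuinely different ($null -eq '' is $false in PowerShell),
# but whose keys collide once the key properties are cast to strings:
$row1 = [pscustomobject]@{ A = $null; B = 'x' }
$row2 = [pscustomobject]@{ A = '';    B = 'x' }

$naive1 = "$($row1.A):$($row1.B)"   # ':x'
$naive2 = "$($row2.A):$($row2.B)"   # ':x' -- same key, different rows

# Marking $null explicitly and joining with a control character avoids both issues:
$Esc = [char]27
function Get-Key([psobject]$Row, [string[]]$Props) {
    ($Props | ForEach-Object {
        if ($null -eq $Row.$_) { "${Esc}null" } else { "$($Row.$_)" }
    }) -join $Esc
}
```

With this, the keys for $row1 and $row2 no longer collide, and a value containing a colon can no longer be confused with the separator.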

Multiple matches

There is another pitfall by using a hash table which actually doesn't have to be an issue: what if the left ($dataset1) and/or right ($dataset2) have multiple matches. Take e.g. the following data sets:

$dataset1 = ConvertFrom-SourceTable '

    A B    XY    ZY  
    - -    --    --  
    1 val1 foo1  bar1
    2 val2 foo2  bar2
    3 val3 foo3  bar3
    4 val4 foo4  bar4
    4 val4 foo4a bar4a
    5 val5 foo5  bar5
    6 val6 foo6  bar6
'

$dataset2 = ConvertFrom-SourceTable '

    A B    ABC   GH  
    - -    ---   --  
    3 val3 foo3  bar3
    4 val4 foo4  bar4
    5 val5 foo5  bar5
    5 val5 foo5a bar5a
    6 val6 foo6  bar6
    7 val7 foo7  bar7
    8 val8 foo8  bar8
'

In this case, I would expect an outcome similar to the SQL join, and not an "Item has already been added. Key in dictionary" error:

$Dataset1 | FullJoin $dataset2 -On A, B | Format-Table

A B    XY    ZY    ABC   GH
- -    --    --    ---   --
1 val1 foo1  bar1
2 val2 foo2  bar2
3 val3 foo3  bar3  foo3  bar3
4 val4 foo4  bar4  foo4  bar4
4 val4 foo4a bar4a foo4  bar4
5 val5 foo5  bar5  foo5  bar5
5 val5 foo5  bar5  foo5a bar5a
6 val6 foo6  bar6  foo6  bar6
7 val7             foo7  bar7
8 val8             foo8  bar8

Streaming

As you might have figured out, there is no reason to put both sides in a hash table. Instead, you might consider streaming the left side (rather than choking the input). In the example in the question, both datasets are loaded directly into memory, which is hardly a real-world use case. It is more common that your data comes from somewhere else, e.g. remotely from Active Directory, where you might be able to look up each incoming object in the hash table before the next object comes in. The same counts for the next cmdlet in the pipeline: it can start processing the output directly and doesn't have to wait until your cmdlet is finished (note that the data is released from the Join-Object cmdlet as soon as it is ready). In such a case, measuring the performance with Measure-Command requires a completely different approach...
See also: Computer Programming: Is the PowerShell pipeline sequential mode more memory efficient? Why or why not?
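As a rough sketch of what such a streaming join could look like (Merge-Streaming is a hypothetical name, not part of Join-Object; it indexes only the right side and emits each left row as soon as it is processed):

```powershell
function Merge-Streaming {
    param(
        [Parameter(ValueFromPipeline)][psobject]$Left,
        [Parameter(Mandatory)][object[]]$Right,
        [Parameter(Mandatory)][string[]]$On
    )
    begin {
        # Index the right side once; NUL-joined keys (simplification: no $null
        # handling, and duplicate right-side keys overwrite each other).
        $index = @{}
        foreach ($r in $Right) {
            $index[(($On | ForEach-Object { "$($r.$_)" }) -join "`0")] = $r
        }
    }
    process {
        $key = ($On | ForEach-Object { "$($Left.$_)" }) -join "`0"
        $out = $Left.PSObject.Copy()
        if ($index.ContainsKey($key)) {
            # Copy the non-key columns of the matching right row onto the output.
            foreach ($p in $index[$key].PSObject.Properties) {
                if ($p.Name -notin $On) {
                    $out | Add-Member -NotePropertyName $p.Name -NotePropertyValue $p.Value -Force
                }
            }
            $index.Remove($key)
        }
        $out   # emitted immediately -- downstream cmdlets can start working on it
    }
    end {
        # Flush the unmatched right-side rows last.
        $index.Values
    }
}
```

Usage: `$dataset1 | Merge-Streaming -Right $dataset2 -On 'A','B' | Format-Table`. Note that in this sketch, unmatched left rows keep only their own columns.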

2020-05-09 06:57

I agree with @Matt: use a hashtable -- something like the one below. This should run in m + 2n rather than mn time.

Timings on my system:

Original solution (from the question):

#10    TotalSeconds      :   0.07788
#100   TotalSeconds      :   0.37937
#1000  TotalSeconds      :   5.25092
#10000 TotalSeconds      : 242.82018
#20000 TotalSeconds      : 906.01584

This definitely looks O(n^2)

Solution below:

#10    TotalSeconds      :  0.094
#100   TotalSeconds      :  0.425
#1000  TotalSeconds      :  3.757
#10000 TotalSeconds      : 45.652
#20000 TotalSeconds      : 92.918

This looks linear.

Solution

I used three techniques to increase the speed:

  1. Change over to a hashtable. This allows constant time lookups so that you don't have to have nested loops. This is the only change really needed to go from O(n^2) to linear time. It does have the disadvantage that there is more setup work done. So, the advantage of linear time won't be seen until the loop count is large enough to pay for the setup.
  2. Use ArrayList instead of a native array. Adding an item to a native array requires the array to be reallocated and all the items to be copied, so building the result this way is also an O(n^2) operation. Since the copying is done at the engine level, the constant is very small, so it really won't make a difference until much later.
  3. Use PsObject.Copy to create the new object. This is a small optimization compared to the other two, but it cut the run time in half for me.
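Points 1 and 2 can be seen in isolation with a throwaway micro-benchmark (the absolute numbers will vary per machine; only the ratio matters):

```powershell
$n = 5000

# Pattern 1: native array with += ; every append reallocates the array
# and copies all existing items -> O(n^2) total.
$t1 = Measure-Command {
    $arr = @()
    foreach ($i in 1..$n) { $arr += $i }
}

# Pattern 2: ArrayList.Add appends in place -> amortized O(1) per item.
$t2 = Measure-Command {
    $list = [System.Collections.ArrayList]::new()
    foreach ($i in 1..$n) { [void]$list.Add($i) }
}

'+= : {0:n0} ms; ArrayList.Add: {1:n0} ms' -f $t1.TotalMilliseconds, $t2.TotalMilliseconds
```

At n = 5000 the `+=` loop is typically slower by a couple of orders of magnitude.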

--

function Get-Hash{
    param(
        [Parameter(Mandatory=$true)]
        [object]$InputObject,
        [Parameter()]
        [string[]]$Properties
    )

    $arr = [System.Collections.ArrayList]::new()

    # Use .Add() here: `$arr += ...` would silently replace the ArrayList
    # with a freshly copied plain array on every iteration.
    foreach($p in $Properties) { [void]$arr.Add($InputObject.$p) }

    return ( $arr -join ':' )
}

function Merge-Objects{
    param(
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset1,
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset2,
        [Parameter()]
        [string[]]$Properties
    )

    $results = [System.Collections.ArrayList]::new()

    $ds1props = $Dataset1 | gm -MemberType Properties
    $ds2props = $Dataset2 | gm -MemberType Properties
    $ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
    $ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }


    $hash = @{}
    $Dataset2 | % { $hash.Add( (Get-Hash $_ $Properties), $_) }

    foreach ($row in $dataset1) {

        $key = Get-Hash $row $Properties

        $tempObject = $row.PSObject.Copy()

        if ($hash.containskey($key)) {
            $r2 = $hash[$key]

            $hash.remove($key)
            $ds2propsNotInDs1Props | % {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $r2.$($_.Name)
            }

        } else {
            $ds2propsNotInDs1Props | % {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
            }
        }
        [void]$results.Add($tempObject)
    }

    foreach ($row in $hash.values ) {
        # add missing dataset2 objects and extend
        $tempObject = $row.PSObject.Copy()

        $ds1propsNotInDs2Props | % {
            $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
        }

        [void]$results.Add($tempObject)
    }

    $results
}

########

$dsLength = 10000
$dataset1 = 0..$dsLength | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}

Measure-Command -Expression {

    $data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B" 

}