In a PowerShell script, I have two data sets that have multiple columns. Not all of these columns are shared.
For example, data set 1:
A B XY ZY
- - -- --
1 val1 foo1 bar1
2 val2 foo2 bar2
3 val3 foo3 bar3
4 val4 foo4 bar4
5 val5 foo5 bar5
6 val6 foo6 bar6
and data set 2:
A B ABC GH
- - --- --
3 val3 foo3 bar3
4 val4 foo4 bar4
5 val5 foo5 bar5
6 val6 foo6 bar6
7 val7 foo7 bar7
8 val8 foo8 bar8
I want to merge these two datasets, specifying which columns act as the key (A and B in my simple case). The expected result is:
A B XY ZY ABC GH
- - -- -- --- --
1 val1 foo1 bar1
2 val2 foo2 bar2
3 val3 foo3 bar3 foo3 bar3
4 val4 foo4 bar4 foo4 bar4
5 val5 foo5 bar5 foo5 bar5
6 val6 foo6 bar6 foo6 bar6
7 val7 foo7 bar7
8 val8 foo8 bar8
The concept is very similar to a SQL full outer join query.
I have been able to write a function that merges the objects successfully. Unfortunately, the computation time grows quadratically with the data set size.
If I generate my data sets using :
$dsLength = 10
$dataset1 = 0..$dsLength | %{
New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}
I get these results:
$dsLength = 10    ==> 33 ms (fine)
$dsLength = 100   ==> 89 ms (fine)
$dsLength = 1000  ==> 1563 ms (acceptable)
$dsLength = 5000  ==> 35764 ms (too much)
$dsLength = 10000 ==> 138047 ms (too much)
$dsLength = 20000 ==> 573614 ms (far too much)
How can I merge datasets efficiently when the data sets are large (my target is around 20K items)?
Right now, I have this function defined:
function Merge-Objects {
    param(
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset1,
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset2,
        [Parameter()]
        [string[]]$Properties
    )
    $result = @()
    $ds1props = $Dataset1 | gm -MemberType Properties
    $ds2props = $Dataset2 | gm -MemberType Properties
    $ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
    $ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }
    foreach ($row1 in $Dataset1) {
        $result += $row1
        $ds2propsNotInDs1Props | % {
            $row1 | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
        }
    }
    foreach ($row2 in $Dataset2) {
        # linear scan over $result looking for a row with matching key properties
        $existing = foreach ($candidate in $result) {
            $match = $true
            foreach ($prop in $Properties) {
                if (-not ($row2.$prop -eq $candidate.$prop)) {
                    $match = $false
                    break
                }
            }
            if ($match) {
                $candidate
                break
            }
        }
        if (!$existing) {
            $ds1propsNotInDs2Props | % {
                $row2 | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
            }
            $result += $row2
        } else {
            $ds2propsNotInDs1Props | % {
                $existing.$($_.Name) = $row2.$($_.Name)
            }
        }
    }
    $result
}
I call this function like this:
Measure-Command -Expression {
    $data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B"
}
My feeling is that the slowness is due to the second loop, where I try to match an existing row in each iteration.
[Edit] Second approach, using a hash table as an index. Surprisingly, it's even slower than the first try:
$dsLength = 1000
$dataset1 = 0..$dsLength | %{
New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}
function Get-Hash {
    param(
        [Parameter(Mandatory=$true)]
        [object]$InputObject,
        [Parameter()]
        [string[]]$Properties
    )
    $InputObject | Select-Object $Properties | Out-String
}
function Merge-Objects {
    param(
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset1,
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset2,
        [Parameter()]
        [string[]]$Properties
    )
    $result = @()
    $index = @{}
    $ds1props = $Dataset1 | gm -MemberType Properties
    $ds2props = $Dataset2 | gm -MemberType Properties
    $allProps = $ds1props + $ds2props | select -Unique
    $ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
    $ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }
    $ds1index = @{}
    foreach ($row1 in $Dataset1) {
        $tempObject = New-Object psobject
        $result += $tempObject
        $ds2propsNotInDs1Props | % {
            $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
        }
        $ds1props | % {
            $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $row1.$($_.Name)
        }
        $hash1 = Get-Hash -InputObject $row1 -Properties $Properties
        $ds1index.Add($hash1, $tempObject)
    }
    foreach ($row2 in $Dataset2) {
        $hash2 = Get-Hash -InputObject $row2 -Properties $Properties
        if ($ds1index.ContainsKey($hash2)) {
            # merge object
            $existing = $ds1index[$hash2]
            $ds2propsNotInDs1Props | % {
                $existing.$($_.Name) = $row2.$($_.Name)
            }
            $ds1index.Remove($hash2)
        } else {
            # add object
            $tempObject = New-Object psobject
            $ds1propsNotInDs2Props | % {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
            }
            $ds2props | % {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $row2.$($_.Name)
            }
            $result += $tempObject
        }
    }
    $result
}
Measure-Command -Expression {
    $data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B"
}
[Edit2] Putting Measure-Command around the two loops shows that even the first loop is slow. Actually, the first loop takes more than 50% of the total time.
I had a lot of doubts about incorporating a binary search (a hash table) into my Join-Object cmdlet (see also: In Powershell, what's the best way to join two tables into one?), as there are a few issues to overcome that are conveniently left out of the example in the question.
Unfortunately, I can't compete with the performance of @mhhollomon's solution, but I think I can add some value:
Not right
Hash keys are strings, which means that you need to cast the related properties to strings; that is a little questionable, simply because:
In most cases it will work, especially when the source is a .csv file, but it might go wrong, e.g. when the data comes from a cmdlet where $Null means something other than an empty string (''). Therefore, I recommend explicitly defining $Null keys, e.g. with a control character. And as property values could easily contain a colon (:), I would also recommend using a control character for separating (joining) multiple keys.
Also right
There is another pitfall in using a hash table, which actually doesn't have to be an issue: what if the left ($dataset1) and/or right ($dataset2) side have multiple matches? Take e.g. the following data sets:
$dataset1 = ConvertFrom-SourceTable '
$dataset2 = ConvertFrom-SourceTable '
In this case, I would expect an outcome similar to the SQL join and no "Item has already been added. Key in dictionary" error:
Only right
As you might have figured out, there is no reason to put both sides in a hash table, but you might consider streaming the left side (rather than choking the input). In the example in the question, both datasets are loaded directly into memory, which is hardly a real use case. It is more common that your data comes from somewhere else, e.g. remote from Active Directory, where you might be able to concurrently search for each incoming object in the hash table before the next object comes in. The same counts for the following cmdlet: it might directly start processing the output and doesn't have to wait until your cmdlet is finished (note that the data is immediately released from the Join-Object cmdlet when it is ready). In such a case, measuring the performance using Measure-Command requires a completely different approach...
See also: Computer Programming: Is the PowerShell pipeline sequential mode more memory efficient? Why or why not?
I agree with @Matt. Use a hashtable -- something like the below. This should run in m + 2n rather than m*n time.
Timings on my system:
Original solution above: this definitely looks O(n^2).
Solution below: this looks linear.
Solution
I used three techniques to increase the speed:
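The code and the list of techniques are truncated above; the following is only a rough sketch of the m + 2n approach described, with a hypothetical Merge-Fast and the question's column names hard-coded: one pass indexes dataset 1, one pass merges or appends dataset 2 rows via O(1) lookups, and one pass emits the result.

```powershell
function Merge-Fast {
    param([object[]]$Dataset1, [object[]]$Dataset2, [string[]]$Properties)
    $map = [ordered]@{}
    # Pass 1 (n): index dataset 1 by its key, pre-creating the extra columns.
    foreach ($row in $Dataset1) {
        $key = ($Properties | ForEach-Object { [string]$row.$_ }) -join [char]1
        $map[$key] = [pscustomobject]@{
            A = $row.A; B = $row.B; XY = $row.XY; ZY = $row.ZY; ABC = $null; GH = $null
        }
    }
    # Pass 2 (m): merge matching rows or append right-only rows, O(1) per lookup.
    foreach ($row in $Dataset2) {
        $key = ($Properties | ForEach-Object { [string]$row.$_ }) -join [char]1
        if ($map.Contains($key)) {
            $map[$key].ABC = $row.ABC
            $map[$key].GH  = $row.GH
        } else {
            $map[$key] = [pscustomobject]@{
                A = $row.A; B = $row.B; XY = $null; ZY = $null; ABC = $row.ABC; GH = $row.GH
            }
        }
    }
    # Pass 3 (n): emit in insertion order.
    $map.Values
}
```

Using [ordered] keeps the output in insertion order, matching the expected result in the question; a plain hashtable would be marginally faster but unordered.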