Using SSIS, How do I find the cities with the larg

2020-02-07 04:44发布

I have a dataflow task with information that looks something like this:

Province | City    | Population
-------------------------------
Ontario  | Toronto | 7000000
Ontario  | London  |  300000
Quebec   | Quebec  |  300000
Quebec   | Montreal| 6000000

How do I use the Aggregate transformation to get the city with the largest population in each province:

Province | City    | Population
-------------------------------
Ontario  | Toronto | 7000000
Quebec   | Montreal| 6000000

If I set "Province" as the Group-By column and "Population" to the "Max" aggregate, what do I do with the City column?

2条回答
家丑人穷心不美
2楼-- · 2020-02-07 05:22

Completely agree with @PaulStock that aggregates are best left to source systems. An aggregate in SSIS is a fully blocking component much like a sort and I've already made my argument on that point.

But there are times when doing those operations in the source system just aren't going to work. The best I've been able to come up with is to basically double process the data. Yes, ick but I was never able to find a way to pass a column through unaffected. For Min/Max scenarios, I'd want that as an option but obviously something like a Sum would make it hard for the component to know what the "source" row it'd tie to.

2005

A 2005 implementation would look like this. Your performance is not going to be good, in fact a few orders of magnitude from good as you'll have all these blocking transforms in there in addition to having to reprocess your source data.

2005 aggregate

Merge join 2005 Merge Inner Join

2008

In 2008, you have the option of using the Cache Connection Manager which would help eliminate the blocking transformations, at least where it matters, but you're still going to have to pay the cost of double processing your source data.

Drag two data flows onto the canvas. The first will populate the cache connection manager and should be where the aggregate takes place.

Warm cache connection manager

Now that the cache has the aggregated data in there, drop a lookup task in your main data flow and perform a lookup against the cache.

General lookup tab

General lookup tab

Select the cache connection manager

Connection manager lookup tab

Map the appropriate columns

Columns lookup tab

Great success Great success 2008 data flow

Script task

The third approach I can think of, 2005 or 2008, is to write it your own self. As a general rule, I try to avoid the script tasks but this is a case where it probably makes sense. You will need to make it an asynchronous script transformation but simply handle your aggregations in there. More code to maintain but you can save yourself the trouble of reprocessing your source data.

Finally, as a general caveat, I'd investigate what the impact of ties will do to your solution. For this data set, I would expect something like Guelph to suddenly swell and tie Toronto but if it did, what should the package do? Right now, both will result in 2 rows for Ontario but is that the intended behaviour? Script, of course, allows you to define what happens in the case of ties. You could probably stand the 2008 solution on its head by caching the "normal" data and using that as your lookup condition and using the aggregates to pull back just one of the ties. 2005 can probably do the same just by putting the aggregate as the left source for the merge join

Edits

Jason Horner had a good idea in his comment. A different approach would be to use a multicast transformation and perform the aggregation in one stream and bring it back together. I couldn't figure out how to make it work with a union all but we could use sorts and merge join much like in the above. This is probably a better approach as it saves us the trouble of reprocessing the source data.

enter image description here

查看更多
Anthone
3楼-- · 2020-02-07 05:33

Instead of using the Aggregate transformation, could you use a SQL query instead?

SELECT
    p.province,
    p.city,
    p.[population]
FROM
    temp_pop P
    JOIN ( SELECT
            province,
            [population] = MAX([POPULATION])
           FROM
            temp_pop
           GROUP BY
            province
         ) AS M ON p.province = M.province AND
                   p.[population] = M.[population]
查看更多
登录 后发表回答