20/01/2011

SSIS and distinct union

 

How to join data from several sources knowing that there are or might be duplicates in both sources?

In a SQL query one can use UNION (instead of UNION ALL) to merge several sources and to remove duplicates.

e.g. (knowing that both sources have same columns)

SELECT *
FROM SourceA
UNION
SELECT *
FROM SourceB

In SSIS there’s no such component to accomplish this task immediately.

To accomplish the same behavior in SSIS as in a SQL query, one should combine a “UNION ALL”-component with a “SORT”-component.

The “SORT”-component provides an option to remove the duplicate rows.

image

In our example above, edit the “SORT”-component to specify the sorting order based on the column or columns that uniquely identifies a record (for example the record-ID column).

In the SORT-editor (the normal editor or the advanced editor) one can select the option to remove the rows with duplicate sort values as shown below.

image image
Normal Editor Advanced Editor

Once this property is set to true, the combination of the “UNION ALL”-component and the “SORT”-component achieves the same thing as our “UNION” query, so your output from the “SORT”-component will no longer contain duplicate rows.