How to join data from several sources knowing that there are or might be duplicates in both sources?
In a SQL query one can use UNION (instead of UNION ALL) to merge several sources and to remove duplicates.
e.g. (knowing that both sources have same columns)
SELECT * FROM SourceA UNION SELECT * FROM SourceB |
In SSIS there’s no such component to accomplish this task immediately.
To accomplish the same behavior in SSIS as in a SQL query, one should combine a “UNION ALL”-component with a “SORT”-component.
The “SORT”-component provides an option to remove the duplicate rows.
In our example above, edit the “SORT”-component to specify the sorting order based on the column or columns that uniquely identifies a record (for example the record-ID column).
In the SORT-editor (the normal editor or the advanced editor) one can select the option to remove the rows with duplicate sort values as shown below.
Normal Editor | Advanced Editor |
Once this property is set to true, the combination of the “UNION ALL”-component and the “SORT”-component achieves the same thing as our “UNION” query, so your output from the “SORT”-component will no longer contain duplicate rows.