| title | Identify Similar Data Rows with the Fuzzy Grouping Transformation | ||||
|---|---|---|---|---|---|
| description | Identify Similar Data Rows with the Fuzzy Grouping Transformation | ||||
| author | chugugrace | ||||
| ms.author | chugu | ||||
| ms.date | 03/14/2017 | ||||
| ms.service | sql | ||||
| ms.subservice | integration-services | ||||
| ms.topic | how-to | ||||
| helpviewer_keywords |
|
[!INCLUDEsqlserver-ssis]
To add and configure a Fuzzy Grouping transformation, the package must already include at least one Data Flow task and a source.
-
In [!INCLUDEssBIDevStudioFull], open the [!INCLUDEssISnoversion] project that contains the package you want.
-
In Solution Explorer, double-click the package to open it.
-
Click the Data Flow tab, and then, from the Toolbox, drag the Fuzzy Grouping transformation to the design surface.
-
Connect the Fuzzy Grouping transformation to the data flow by dragging the connector from the data source or a previous transformation to the Fuzzy Grouping transformation.
-
Double-click the Fuzzy Grouping transformation.
-
In the Fuzzy Grouping Transformation Editor dialog box, on the Connection Manager tab, select an OLE DB connection manager that connects to a [!INCLUDEssNoVersion] database.
[!NOTE]
The transformation requires a connection to a [!INCLUDEssNoVersion] database to create temporary tables and indexes. -
Click the Columns tab and, in the Available Input Columns list, select the check box of the input columns to use to identify similar rows in the dataset.
-
Select the check box in the Pass Through column to identify the input columns to pass through to the transformation output. Pass-through columns are not included in the process of identification of duplicate rows.
[!NOTE]
Input columns that are used for grouping are automatically selected as pass-through columns, and they cannot be unselected while used for grouping. -
Optionally, update the names of output columns in the Output Alias column.
-
Optionally, update the names of cleaned columns in the Group OutputAlias column.
[!NOTE]
The default names of columns are the names of the input columns with a "_clean" suffix. -
Optionally, update the type of match to use in the Match Type column.
[!NOTE]
At least one column must use fuzzy matching. -
Specify the minimum similarity level columns in the Minimum Similarity column. The value must be between 0 and 1. The closer the value is to 1, the more similar the values in the input columns must be to form a group. A minimum similarity of 1 indicates an exact match.
-
Optionally, update the names of similarity columns in the Similarity Output Alias column.
-
To specify the handling of numbers in data values, update the values in the Numerals column.
-
To specify how the transformation compares the string data in a column, modify the default selection of comparison options in the Comparison Flags column.
-
Click the Advanced tab to modify the names of the columns that the transformation adds to the output for the unique row identifier (_key_in), the duplicate row identifier (_key_out), and the similarity value (_score).
-
Optionally, adjust the similarity threshold by moving the slider bar.
-
Optionally, clear the token delimiter check boxes to ignore delimiters in the data.
-
Click OK.
-
To save the updated package, click Save Selected Items on the File menu.
Fuzzy Grouping Transformation
Integration Services Transformations
Integration Services Paths
Data Flow Task