Automated Filtering of duplicate records during dataset appends

You can prevent duplicate records from being appended to a Dataset by defining one or more Match Keys that specify the rules for identifying Duplicates. Once Match Keys are defined, any uploaded record that is identified during the import process as a duplicate of another record in the same file, or of a record already existing in the Dataset, is filtered out and reported as a Duplicate.

The automated filtering process works in the following way:

1) Users create definitions for generating Match Keys for each Dataset

2) When data is imported, Ipiphany generates Match Keys for both the existing Dataset records and the new records to be imported

3) During the import process, any record to be imported whose generated Match Keys match those of another record in the import file, or of a record already existing in the Dataset, is flagged as a Duplicate and filtered out

4) Any filtered Duplicates are reported in the import notification email and are available for download from the import request

How the Duplicate matching works

The Duplicate matching process works as follows for each Match Key defined for a Dataset:

1) Match Key values are generated for all records being imported and for all records already existing in the Dataset

A Match Key value is generated only if the record has a non-null value for every Attribute in the Match Key
If a Match Key is defined with a single Attribute, then its value is the value of the data in the record for that Attribute
If a Match Key is defined with multiple Attributes, then its value is a concatenation of the values for those Attributes

2) The system then checks the data to be appended to determine whether two or more records share a Match Key value. Where this occurs, the Duplicate records are removed, i.e. only one is retained

3) Finally, the system checks whether any of the records to be appended shares a Match Key value with a record already existing in the Dataset. Where this occurs, the record to be appended is removed, i.e. only the existing record in the Dataset is retained

4) Any records removed as Duplicates are written to the Error Report file along with information on the Match Key and matching record that caused them to be removed.
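
The four steps above can be sketched in Python as follows. This is an illustration of the described behaviour only, not Ipiphany's implementation. The function names, the list-of-dict record format, the use of a tuple in place of the concatenated value, and the choice to retain the first of several Duplicates within a file are all assumptions made for the example.

```python
def match_key_value(record, attributes):
    """Return the Match Key value for a record, or None if any Attribute is null."""
    values = [record.get(name) for name in attributes]
    if any(value is None for value in values):
        return None  # step 1: no value is generated when an Attribute is null
    # Assumption: a tuple stands in for the concatenated value.
    return tuple(str(value) for value in values)


def key_values(record, match_keys):
    """Return (Match Key name, value) pairs for every Match Key that has a value."""
    pairs = []
    for name, attributes in match_keys.items():
        value = match_key_value(record, attributes)
        if value is not None:
            pairs.append((name, value))
    return pairs


def first_clash(record, match_keys, index):
    """Return the first (name, value) pair of the record found in index, else None."""
    for pair in key_values(record, match_keys):
        if pair in index:
            return pair
    return None


def filter_duplicates(existing, incoming, match_keys):
    """Return (accepted, rejected); rejected holds (record, key name, matching record)."""
    rejected = []  # step 4: what an error report would need to contain

    # Step 2: remove Duplicates within the data to be appended (first one is kept).
    seen_in_file, survivors = {}, []
    for record in incoming:
        clash = first_clash(record, match_keys, seen_in_file)
        if clash is not None:
            rejected.append((record, clash[0], seen_in_file[clash]))
            continue
        for pair in key_values(record, match_keys):
            seen_in_file[pair] = record
        survivors.append(record)

    # Step 3: remove records that match a record already in the Dataset.
    in_dataset = {}
    for record in existing:
        for pair in key_values(record, match_keys):
            in_dataset.setdefault(pair, record)
    accepted = []
    for record in survivors:
        clash = first_clash(record, match_keys, in_dataset)
        if clash is not None:
            rejected.append((record, clash[0], in_dataset[clash]))
        else:
            accepted.append(record)
    return accepted, rejected


# Hypothetical data: one single-Attribute key and one composite key.
match_keys = {"email": ["email"], "full_name": ["first", "last"]}
existing = [{"email": "a@x.com", "first": "Ann", "last": "Lee"}]
incoming = [
    {"email": "b@x.com", "first": "Bob", "last": "Ray"},
    {"email": "b@x.com", "first": "Rob", "last": "Ray"},   # same email as the row above
    {"email": "a@x.com", "first": "Anna", "last": "Lee"},  # same email as an existing record
    {"email": None, "first": "Cy", "last": "Poe"},         # null email: no "email" key value
]
accepted, rejected = filter_duplicates(existing, incoming, match_keys)
```

In this example the first and last incoming records are accepted. The second is rejected against the first incoming record, and the third against the existing record, both on the "email" Match Key.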

Note: If you define Match Keys after appending data, any Duplicate records already in the Dataset are not automatically removed. To remove them, either reimport the data, or identify the existing Duplicates and use the Purge function to remove them.
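
One way to identify Duplicates that are already in a Dataset is to group its records by Match Key value and look for groups with more than one record. The sketch below assumes the records have been obtained outside Ipiphany as a list of dicts; the function name and record format are hypothetical, and the Purge itself is still performed in the application.

```python
from collections import defaultdict


def find_existing_duplicates(records, attributes):
    """Map each shared Match Key value to the positions of the records that share it."""
    groups = defaultdict(list)
    for position, record in enumerate(records):
        values = [record.get(name) for name in attributes]
        if any(value is None for value in values):
            continue  # no Match Key value is generated when an Attribute is null
        # Assumption: a tuple stands in for the concatenated value.
        groups[tuple(values)].append(position)
    return {value: positions for value, positions in groups.items() if len(positions) > 1}


# Hypothetical records checked against a single-Attribute Match Key on "email".
records = [
    {"email": "a@x.com"},
    {"email": "b@x.com"},
    {"email": "a@x.com"},
    {"email": None},
    {"email": None},
]
duplicates = find_existing_duplicates(records, ["email"])
```

Here `duplicates` contains one entry, for the value shared by the records at positions 0 and 2. The two records with a null email are not treated as Duplicates of each other.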

Defining Dataset Match Keys

There are three ways you can define Match Keys:

From the Dataset Settings/Match Keys tab you can Create, Modify and Delete Match Keys
From the Dataset/Attributes Screen you can specify an Attribute as a Unique Key (Match Key with a single Attribute)
On the last step of creating a Dataset, appending data or setting up a Connection you can add new Match Keys

Defining Match Keys for an Existing Dataset

The first way you can define Match Keys or edit existing ones is from the Dataset Match Keys screen.

1) Select 'Dataset settings' from the Dataset Menu.

2) Select the Match Keys tab

3) Create a Match Key

When defining a Match Key, keep the following considerations in mind:

Each Match Key consists of one or more Attributes
A composite Match Key can use a maximum of five Attributes
Each dataset can have a maximum of ten Match Keys
Only Attributes of type 'Other' can be used to create single Attribute Match Keys (Unique Keys)
Attributes of type 'Verbatim' cannot be used in Match Keys
For Attributes of type 'Other', the options to trim or ignore case are available
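
The considerations above can be expressed as a small validation sketch. The `MatchKey` class, the `validate_match_keys` and `normalize` functions and the attribute names are hypothetical and are not part of Ipiphany. The sketch also assumes that "trim" means removing leading and trailing whitespace and that "ignore case" can be modelled with case folding.

```python
from dataclasses import dataclass

MAX_ATTRIBUTES_PER_KEY = 5
MAX_KEYS_PER_DATASET = 10


@dataclass(frozen=True)
class MatchKey:
    name: str
    attributes: tuple
    trim: bool = False
    ignore_case: bool = False


def validate_match_keys(keys, attribute_types):
    """Return a list of messages for every rule that the definitions break."""
    errors = []
    if len(keys) > MAX_KEYS_PER_DATASET:
        errors.append(f"a dataset can have at most {MAX_KEYS_PER_DATASET} Match Keys")
    for key in keys:
        if not key.attributes:
            errors.append(f"{key.name}: needs at least one Attribute")
        if len(key.attributes) > MAX_ATTRIBUTES_PER_KEY:
            errors.append(f"{key.name}: at most {MAX_ATTRIBUTES_PER_KEY} Attributes allowed")
        for attribute in key.attributes:
            if attribute_types.get(attribute) == "Verbatim":
                errors.append(f"{key.name}: Verbatim Attribute '{attribute}' not allowed")
        if len(key.attributes) == 1 and attribute_types.get(key.attributes[0]) != "Other":
            errors.append(f"{key.name}: a Unique Key must use an Attribute of type 'Other'")
    return errors


def normalize(value, trim=False, ignore_case=False):
    """Apply the assumed meaning of the trim and ignore-case options to a text value."""
    if trim:
        value = value.strip()
    if ignore_case:
        value = value.casefold()
    return value


# Hypothetical attribute types and Match Key definitions.
attribute_types = {"email": "Other", "first": "Other", "last": "Other", "comment": "Verbatim"}
valid_keys = [
    MatchKey("email", ("email",), trim=True, ignore_case=True),
    MatchKey("full_name", ("first", "last")),
]
invalid_keys = [MatchKey("comment", ("comment",))]

valid_errors = validate_match_keys(valid_keys, attribute_types)
invalid_errors = validate_match_keys(invalid_keys, attribute_types)
```

With these definitions `valid_errors` is empty, while `invalid_errors` reports that a Verbatim Attribute cannot be used and that a Unique Key needs an Attribute of type 'Other'.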

Defining Unique Match Keys from the Attributes and Attribute Mapping dialogs

The second way you can define Match Keys is from the Attribute or Attribute Mapping dialogs.

The majority of Duplicates can be identified using the value of a single Attribute, i.e. a Unique Key. The easiest way to define this type of Match Key is from the Attribute or Attribute Mapping dialog.

If the Attribute is of type 'Other', you can tick the 'Unique Key' checkbox, which creates a Match Key for this single Attribute.

Defining Match Keys during a Data Append or Connection setup

The third way to define Match Keys is when creating a Dataset, appending data or setting up a Connection. In each of these workflows, you can add new Match Keys on the last step of the process.

Reporting Filtered Duplicates

If any records were filtered as Duplicates during the data import process, this is detailed in the Import Summary Notification email.

Note: You can download the Error Report to view the Duplicate records and details of why they were excluded.
