Actions
Javier Otegui edited this page Dec 28, 2016 · 1 revision
Depending on the `action` parameter, there are different ways of handling duplicates.
When using the `report` method, users will receive a JSON-like document specifying the duplicates in their records. No new file is generated, and the original file is left unchanged. The document will have the following structure:
```
{
  "email": "Email to send notification to",
  "fields": "Number of fields of the data set",
  "records": "Number of records parsed (with duplicates)",
  "warnings": "List of warning messages, if any",
  "file": "Link to the generated file. Only if 'action' is 'remove' or 'flag'",
  "strict_duplicates": {
    "count": "Number of rows that are exact copies of other rows",
    "ids": "List consisting of the IDs of the duplicate rows. Only if ID field is provided or can be determined",
    "index_pairs": "List consisting of the positions of duplicate record pairs. Only if 'count' > 0"
  },
  "partial_duplicates": {
    "count": "Number of rows that are partial copies of other rows",
    "ids": "List consisting of the IDs of the duplicate rows. Only if ID field is provided or can be determined",
    "index_pairs": "List consisting of the positions of duplicate record pairs. Only if 'count' > 0"
  },
  ... # More duplicate types as developed
}
```
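As an illustration of how a client might read such a report, here is a minimal Python sketch. The sample values are made up, `summarize_report` is a hypothetical helper (not part of the service), and it assumes that `count` is an integer and that `index_pairs` is a list of two-element lists.

```python
import json

# Hypothetical report shaped like the structure above; all values are made up.
SAMPLE_REPORT = json.loads("""
{
  "email": "user@example.org",
  "fields": 5,
  "records": 4,
  "warnings": [],
  "strict_duplicates": {"count": 1, "ids": ["rec-3"], "index_pairs": [[0, 2]]},
  "partial_duplicates": {"count": 0}
}
""")


def summarize_report(report):
    """Return {duplicate_type: (count, index_pairs)} for every '*_duplicates' key."""
    summary = {}
    for key, value in report.items():
        if key.endswith("_duplicates") and isinstance(value, dict):
            # 'ids' and 'index_pairs' are optional, so fall back to an empty list.
            pairs = [tuple(pair) for pair in value.get("index_pairs", [])]
            summary[key] = (value.get("count", 0), pairs)
    return summary


if __name__ == "__main__":
    for dup_type, (count, pairs) in summarize_report(SAMPLE_REPORT).items():
        print(dup_type, count, pairs)
```

Matching on the `_duplicates` suffix, rather than hard-coding the two known keys, is one way to tolerate further duplicate types being added later.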
The `remove` action is pretty self-explanatory: the resulting file will omit duplicate rows.
The `flag` action is the default, as of Aug-4. With this action, three new fields will be added to the response file:
- `isDuplicate` is a boolean field indicating whether or not the row is a duplicate of another row.
- `duplicateType` is a controlled vocabulary indicating the type of duplicate: `full`, `partial` or any other that might come in the future.
- `duplicateOf` is a list of all the other record IDs for which the current record is a duplicate. Even for strict duplicates, for the sake of consistency, it makes sense to make this field a list.
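The three fields can be illustrated with a small Python sketch. It is not the service's implementation: `flag_strict_duplicates` is a hypothetical helper, rows are assumed to be dictionaries with an `id` key, the type field is spelled `duplicateType`, and a row counts as a duplicate only when an earlier row has identical content apart from the ID (so the first occurrence stays unflagged). Partial duplicates are not handled.

```python
def flag_strict_duplicates(rows, id_field="id"):
    """Return copies of the rows with the three flag fields added.

    Assumption: later exact copies are flagged and point back to the first
    row with the same content; the first occurrence itself is not flagged.
    """
    first_seen = {}  # row content (without the ID) -> ID of the first such row
    flagged = []
    for row in rows:
        # Build a hashable, order-independent key from everything but the ID.
        key = tuple(sorted((k, v) for k, v in row.items() if k != id_field))
        out = dict(row)
        if key in first_seen:
            out["isDuplicate"] = True
            out["duplicateType"] = "full"
            out["duplicateOf"] = [first_seen[key]]  # always a list
        else:
            first_seen[key] = row[id_field]
            out["isDuplicate"] = False
            out["duplicateType"] = None
            out["duplicateOf"] = None
        flagged.append(out)
    return flagged


if __name__ == "__main__":
    sample = [
        {"id": "a", "name": "Puma concolor"},
        {"id": "b", "name": "Lynx rufus"},
        {"id": "c", "name": "Puma concolor"},
    ]
    for flagged_row in flag_strict_duplicates(sample):
        print(flagged_row)
```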
The current list of duplicate types is:
Duplicate type | Code |
---|---|
No duplicate | 0 |
Strict duplicate | 1 |
Partial duplicate | 2 |
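
Client code could mirror these codes with an enumeration. The following Python sketch is only a suggestion; the names `DuplicateType`, `LABELS` and `label` are hypothetical and do not come from the service.

```python
from enum import IntEnum


class DuplicateType(IntEnum):
    """Duplicate type codes, copied from the table above."""
    NONE = 0
    STRICT = 1
    PARTIAL = 2


# Human-readable labels, as they appear in the table.
LABELS = {
    DuplicateType.NONE: "No duplicate",
    DuplicateType.STRICT: "Strict duplicate",
    DuplicateType.PARTIAL: "Partial duplicate",
}


def label(code):
    """Translate a numeric code into its label; unknown codes raise ValueError."""
    return LABELS[DuplicateType(code)]
```

Because unknown codes raise `ValueError`, a client written this way would fail loudly when a new duplicate type is introduced instead of silently mislabeling it.
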

The effect of each action on a row depends on whether the row is a duplicate:

Action | isDuplicate | Effect |
---|---|---|
report | no | Nothing |
report | yes | Add to list |
remove | no | Write row with no flags |
remove | yes | Don't write row |
flag | no | Write row with `[0, null, null]` |
flag | yes | Write row with `[1, type, list_of_dupes]` |
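
The decision table above can be read as a simple dispatch. The Python sketch below is an illustration under assumptions, not the service's code: `process` is a hypothetical function, rows are plain lists, and the per-row duplicate information is assumed to be computed beforehand as `(is_duplicate, dup_type, list_of_dupes)` tuples.

```python
def process(rows, flags, action="flag"):
    """Apply one action to every row.

    rows  : list of rows, each a list of values
    flags : one (is_duplicate, dup_type, list_of_dupes) tuple per row
    Returns (written_rows, reported_indexes).
    """
    if action not in ("report", "remove", "flag"):
        raise ValueError("unknown action: %r" % (action,))

    written, reported = [], []
    for index, (row, (is_dup, dup_type, dupes)) in enumerate(zip(rows, flags)):
        if action == "report":
            # Nothing is written; duplicates are only added to the list.
            if is_dup:
                reported.append(index)
        elif action == "remove":
            # Non-duplicates are written unchanged; duplicates are dropped.
            if not is_dup:
                written.append(list(row))
        else:  # "flag"
            # Every row is written, followed by the three flag values.
            extra = [1, dup_type, dupes] if is_dup else [0, None, None]
            written.append(list(row) + extra)
    return written, reported


if __name__ == "__main__":
    rows = [["a", "x"], ["b", "x"]]
    flags = [(False, None, None), (True, 1, ["a"])]
    for action in ("report", "remove", "flag"):
        print(action, process(rows, flags, action))
```
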
This repository is part of the VertNet project.
For more information, please check out the project's home page and GitHub organization page