Prevent duplicate sibling contact capture #9601

Open
ChinHairSaintClair opened this issue Nov 1, 2024 · 9 comments
Labels
Type: Feature Add something new

Comments

@ChinHairSaintClair
Contributor

ChinHairSaintClair commented Nov 1, 2024

Is your feature request related to a problem? Please describe.
Currently, there appears to be no built-in deterrent against creating records with names similar to existing siblings.

Describe the solution you'd like
Prevent duplicate place/person creation and display possible duplicates for consideration. On record submission (through create or edit flow), we want to show the possible duplicate items to our user. They can then navigate to the possible duplicate item via a link, and proceed with record changes there, or circumvent the duplicate check & proceed with record submission.

Describe alternatives you've considered
Despite improving our search functionality, and training the CHWs to use it, usage of the search feature before creating new records remains low, leading to frequent duplicates.

Additional context
We noticed that our CHWs either forget they've previously captured items or miss previously captured items because they were slightly mistyped. This has resulted in quite a few duplicate records being created on all user-created levels of our hierarchy. We want to fix this at the source before tasks are rolled out to make sure no unnecessary/incorrect tasks fill up our CHWs' worklists. This will naturally also improve the accuracy of our data for reporting purposes.

We have a working prototype that we will soon upstream, which can produce the following:
Image

Please see the following related discussions for more info:
https://forum.communityhealthtoolkit.org/t/mitigate-duplicate-data-capture/3313
#6363

As a "damage control" step, as discussed with the medic team, we plan to use Databricks (our tool primarily responsible for pulling CouchDB data into our community database) to also push "flags" to potential duplicate items. That will, in turn, cause tasks to trigger on the app. CHWs are then expected to confirm/deny the possible duplicate and determine what should happen to the record (delete, merge, other).

@jkuester
Contributor

Overview

First I just want to lay out the space and existing problem-set to clarify exactly what should be addressed in this issue. I think there are three separate, but related, problems to solve under the heading of "Duplicate prevention":

  1. Prevent user from creating a duplicate contact - (this current issue)
    • Runs in the webapp. Works offline.
    • Only checks against contact's siblings due to performance and data-access considerations.
  2. Flag/merge duplicate contacts on the server - Prevent and/or merge duplicate contacts #6363
    • Sentinel transition? Historically, not possible to do much interesting here because of the limitations of Couch views. However, Lucene may change the game.
    • Checks contacts across entire instance. Flags for resolution by admin (or auto-merges if possible).
  3. Prevent user from creating duplicate reports - Workflow to merge likely duplicates upon submission #6309
    • Proposed approach is to warn/block users when they try to create a new report for a contact that already has the same report existing in a particular time frame.

Details

With that summary in mind, we can dig into the details of how to specifically prevent a (potentially offline) user from creating a duplicate contact. The prototype PR provides a great starting point for this conversation. My goal here is to synthesize/generalize the details from that prototype into a design summary here that is easier for folks to understand and discuss (also including some of my own suggestions and editorializing). Once we coalesce on a particular design approach, we can return to the actual code and make that happen.

Configuration

Since different types of contacts contain different levels of data that might need to be checked for duplication, it makes sense to configure the rules for dup checking individually for each contact type. I think it would be logical to include this config in the app-settings contact_types config.

For each contact type, we need to define the rules for what constitutes a duplicate. The most flexible way to do this would be to allow a custom function to be included in the config that accepts two contacts and returns a boolean indicating if the contacts are duplicate or not. The Levenshtein library could be in-context for this function's logic to make use of. This would allow for the dupe logic to be as complex or simple as necessary. The main downside of this function-based approach is that I guess it would be almost impossible to re-use this configuration for any kind of server-side dupe checking (in any future solution for #6363). It is not feasible from a performance perspective to run a function like this against every contact in the db. (Honestly, though, the more I consider this, the less I think we should try to optimize this config for any kind of server-side reuse... It seems like the most likely possibilities for server-side dupe checking functionality in the near future are not going to happen with Couch data, but via an external data store, e.g. DOT or Databricks)
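A minimal sketch of the function-based option, in TypeScript. The `ContactDoc` shape, the `isDuplicate` name, and the maximum distance of 2 are illustrative assumptions rather than part of any existing CHT config; the Levenshtein implementation is inlined only to keep the example self-contained.

```typescript
// Hypothetical, trimmed-down contact shape; real contact docs carry more fields.
interface ContactDoc {
  name: string;
  phone?: string;
}

// Classic dynamic-programming Levenshtein distance
// (insert, delete, and substitute each cost 1).
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) {
    prev.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed rule: names within 2 edits of each other (ignoring case and
// surrounding whitespace), or an identical phone number, count as a duplicate.
function isDuplicate(a: ContactDoc, b: ContactDoc): boolean {
  const namesClose =
    levenshtein(a.name.trim().toLowerCase(), b.name.trim().toLowerCase()) <= 2;
  const samePhone = a.phone !== undefined && a.phone === b.phone;
  return namesClose || samePhone;
}

const typoMatch = isDuplicate({ name: 'Katherine' }, { name: 'katharine ' });
const differentPeople = isDuplicate({ name: 'Jade' }, { name: 'Savanah' });
```

Because the rule is ordinary code, inter-field logic (for example "same name and same phone") is just a different boolean expression, which is the flexibility this option buys at the cost of server-side reuse.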

The prototype PR presents an alternative approach where, instead of a function, for each contact type we define which fields should be compared from the contact docs and which algorithm (e.g. Levenshtein vs NormalizedLevenshtein) should be used to compare them. (Note that I think it would be best to hold off on including the queryParams functionality or any other kind of override config in the initial MVP for duplicate checking. The goal is to add the minimum viable functionality first and then we can extend it further in future PRs.) Some considerations around this declarative approach are that it does not really allow for inter-field dependencies (e.g. specifying that contacts could have the same name or the same phone but not both). Also, unlike the approach in the PR, we probably need to be able to define the comparison algorithm for each field. Levenshtein is great for string fields, but we might need to compare things like dates or numbers in the future.

I think more discussion about the best form of configuration would be valuable here! One thing I would love to see is a simple way to enable some "default" dupe checks (e.g. something that just validates the name field).

Functionality

Moving on to the functionality of how this should actually work in webapp. I think the example from the PR is a solid approach where, when opening a new contact form, it also triggers a lookup of the existing sibling contacts of the new contact (via medic-client/contacts_by_parent). When the contact form is submitted, the configured duplicate checking is run against those sibling contacts to determine if the new contact is a duplicate of any existing ones. Checking against the direct siblings seems like the most reasonable balance of functionality and performance. An offline user can only know about what is visible on their device, anyway, so it is impossible to guarantee the new contact will be unique for the whole instance. This means we have to stop checking somewhere. Also, we need to do the check at the end of the form (once the new contact info has been entered), but before the form has been closed (so the user can go back and change data if necessary).

I am not sure if this is covered in the current PR functionality or not, but I think it will be important to do the same duplicate checking when editing a contact.
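A rough TypeScript sketch of the submit-time check, covering both the create and edit flows. It assumes the sibling docs have already been loaded when the form opened; `findDuplicateSiblings`, `DuplicateRule`, and the trimmed `SiblingDoc` shape are hypothetical names for illustration, and the exact-name rule is only a stand-in for whatever rule ends up being configured.

```typescript
// Hypothetical, trimmed-down contact shape.
interface SiblingDoc {
  _id?: string; // absent while a new contact has not been saved yet
  name: string;
}

type DuplicateRule = (candidate: SiblingDoc, sibling: SiblingDoc) => boolean;

// Runs the configured rule against the already-loaded siblings.
// When editing, the contact itself is among its parent's children,
// so it is skipped to avoid flagging a contact as its own duplicate.
function findDuplicateSiblings(
  candidate: SiblingDoc,
  siblings: SiblingDoc[],
  isDuplicate: DuplicateRule
): SiblingDoc[] {
  return siblings.filter((sibling) => {
    if (candidate._id !== undefined && sibling._id === candidate._id) {
      return false;
    }
    return isDuplicate(candidate, sibling);
  });
}

// Stand-in rule: case-insensitive exact name match.
const exactName: DuplicateRule = (a, b) =>
  a.name.toLowerCase() === b.name.toLowerCase();

const siblings: SiblingDoc[] = [
  { _id: 'a', name: 'Jade' },
  { _id: 'b', name: 'Savanah' },
];

const foundOnCreate = findDuplicateSiblings({ name: 'jade' }, siblings, exactName);
const foundOnEdit = findDuplicateSiblings({ _id: 'a', name: 'Jade' }, siblings, exactName);
```

Here the create flow finds the existing "Jade" record, while editing that same record finds nothing because the contact is excluded from its own comparison.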

UX

As demonstrated in the screenshot above, the current UX in the PR is to present the user with a list of the found duplicates (along with links to go to their profile page). One incremental enhancement that might make sense here would be to present the duplicates more as a proper list of contacts (or even "contact cards") with various important identifying information displayed (instead of highlighting specifically which data is being matched for the duplicate check). If a user clicks into one of the other contacts, they will lose all the data they have entered into the contact form, so we want to give them as much info as is feasible about the contacts before they navigate away from the form.

Instead of including the list of contacts inline in the form page, it might be better to pop a modal containing the list. 🤔 (Either way, we should be able to use a xforms-value-changed listener to clear the dupe error and list of contacts when the user updates a value in form.)

Another consideration is what the default behavior should be if duplicate contacts are found. Should we warn the user, but still let them submit the new contact? Or, should we totally disable the submit button and prevent the new contact from being added? Ultimately, this is something that we could make configurable, but it would be good to have a simple approach in the MVP and add configuration later...


@ChinHairSaintClair @fardarter Please weigh in here with anything that I have missed or mistaken, or any additional thoughts or considerations that you have!

@garethbowen let me know what you think about this proposed approach. What other stake-holders should we pull into this conversation to make sure we can maintain momentum on this feature?

@garethbowen
Member

> Should we warn the user, but still let them submit the new contact?

I think this is a must. We can't assume anything about naming conventions and it's quite possible for two people in the same family to have the same name. The point of this feature is to stop a CHW doing the wrong thing accidentally, not prevent them from doing an action on purpose.

> Prevent user from creating duplicate reports

I'm really interested in seeing if we can make this workflow generic enough that it works for all report types, not just contact creation. We have so many examples of duplicates being created accidentally across all types that this would be powerful. I worry that some of the thoughts here (like Levenshtein distance, using multiple fields, etc) are over and above what's actually needed. Can we just check the exact name and family? For reports, can we just check report code, reported date, and subject? If we can simplify it enough, can it work out-of-the-box without configuration?

> When the contact form is submitted, the configured duplicate checking is run against those sibling contacts to determine if the new contact is a duplicate of any existing ones.

The ideal solution would warn about duplicates as early as possible. Some forms are very long and forcing a user to enter all the details before telling them about dupes we could have found after the first input field was complete would be a very frustrating UX. In my head this looks like a validation error on the name input with a checkbox to bypass the check, but implementing it as an enketo validation would be difficult I think? But however it's done, notifying as early as possible would be a huge win.

> What other stake-holders should we pull into this conversation to make sure we can maintain momentum on this feature?

I think the eCHIS Kenya team would be interested in this too.

@jkuester
Contributor

> I'm really interested in seeing if we can make this workflow generic enough that it works for all report types, not just contact creation. ... For reports, can we just check report code, reported date, and subject? If we can simplify it enough, can it work out-of-the-box without configuration?

These were my initial thoughts as well, but then I was convinced by your comment on the other issue that the "duplicate report" workflow was quite different from duplicate contacts and perhaps there is not much overlap in the configs/logic. Specifically, detecting a "duplicate report" is probably much less about the contents of the report than, just as you said, the type of report, who it is for, and when it is submitted. Basically for reports we would be looking for other reports of the same type that were created for the same contact in a particular timeframe. These checks can happen up-front before even loading the form. All of this is pretty different from the "duplicate contact" flow where the most important thing is the content that the user enters for the new contact. So we cannot do an upfront check for a duplicate contact. Also, it is likely that more config is necessary to allow the contact dupe checking to be really useful. (Maybe we can find some sensible pre-sets, but it seems like lots of tuning may be needed for some cases...)

Because of this, I am skeptical of a "one-size-fits-all" solution for dupe-doc checking that covers both reports and contacts. (And even if we do decide to go that route, we would not need to support dupe-checking reports in this MVP PR.) It seems like the most important thing to decide at this point is if we think report dupe checking will need to allow for the same level of flexible configuration as contacts (e.g. specifying which fields should be dupe checked). If so, then yeah it probably makes sense to at least design the contact dupe-checking to be extended later for also checking reports. If not, then I think we probably just leave the report dupe checking to its own issue and not worry about it here.
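To make the contrast concrete, here is a small TypeScript sketch of the up-front report check described above: same form, same contact, within a time window. The `ReportSummary` shape, its field names, and the 7-day window are assumptions for illustration and do not mirror the real report doc schema.

```typescript
// Hypothetical, flattened view of a report; not the actual CHT doc layout.
interface ReportSummary {
  form: string;         // report type / form code
  subjectId: string;    // contact the report is about
  reportedDate: number; // epoch milliseconds
}

// Returns existing reports of the same form for the same subject that were
// submitted within `windowMs` before `now`. No form content is needed, so
// this can run before the form is even loaded.
function findRecentSameReports(
  existing: ReportSummary[],
  form: string,
  subjectId: string,
  now: number,
  windowMs: number
): ReportSummary[] {
  return existing.filter((report) => {
    const age = now - report.reportedDate;
    return report.form === form &&
      report.subjectId === subjectId &&
      age >= 0 &&
      age <= windowMs;
  });
}

const DAY_MS = 24 * 60 * 60 * 1000;
const now = Date.UTC(2024, 10, 27);
const existingReports: ReportSummary[] = [
  { form: 'pregnancy', subjectId: 'p1', reportedDate: now - 2 * DAY_MS },
  { form: 'pregnancy', subjectId: 'p1', reportedDate: now - 40 * DAY_MS },
  { form: 'pregnancy', subjectId: 'p2', reportedDate: now - 1 * DAY_MS },
];

const recent = findRecentSameReports(existingReports, 'pregnancy', 'p1', now, 7 * DAY_MS);
```

Only the two-day-old report for `p1` falls inside the window; nothing here depends on what the user would type into the new report, which is the key difference from the contact flow.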

> The ideal solution would warn about duplicates as early as possible. .... but implementing it as an enketo validation would be difficult I think?

Okay, this got me thinking that maybe I have been coming at this from the wrong direction! What if, instead of configuring the dupe-checking in the app-settings, we did it in the actual form xlsx files? In the form config we could use a custom column to mark all the fields that should be dupe-checked (and maybe even indicate what comparison algorithm to use). Then we could have a custom Enketo widget that would listen for changes to any of these fields and trigger the dupe-checking logic when any of the values change. (Then, the widget could trip the enketo validation logic to make the error look like a constraint violation if we wanted to go in that direction.) The data/logic flow is going to be quite a bit more complex (e.g. how to get the docs to check against, how the dupe-checking logic knows all the fields that are supposed to be included, what to do if we actually find a duplicate, etc).

The main downside I see to configuring things in the form is that for contact forms it would be important that only the fields that map directly to the contact doc are eligible for dupe checking. Fields in the inputs group or the intro group cannot be dupe-checked. But, it seems feasible to include a validation in cht-conf to help prevent this.

(Tell me if I am wrong here, though! 😅 I feel like I am seeing Enketo widget + custom xlsx column as the solution to all problems lately....)

> I think the eCHIS Kenya team would be interested in this too.

@eljhkrr just putting this on your radar! Please jump in to the discussion here if you have any specific concerns, requirements, or ideas!

@ChinHairSaintClair
Contributor Author

ChinHairSaintClair commented Nov 27, 2024

First, thank you, @jkuester , for explaining the concept and the prototype PR so clearly.

Our operational needs

Let's headline the things we need from this solution:

  1. We need the text duplicate check to handle various common sources of duplication, eg:
    a. shortened names (Kat versus Katherine, Masi versus Masiphumelele)
    b. spelling errors and casing differences.
    c. multiple valid references, eg str, str., st, st., street, ave, ave., avenue or even "ave (avenue)" or "rd (road)"
  2. We need a way to mark items as duplicates/canonical.
  3. We need to offer CHWs the ability to submit despite a notification that there are duplicates -- they are the expert on the ground in the moment.

Critical alignment issue

We'll cover a number of issues in this response (thank you very much for your engagement), but what I think we're looking for with the greatest focus is rapid alignment on a proposed data structure. The idea is that if we agree on a data structure, we can safely implement our own solution that will align. At worst, we might need to adjust the config load source. What we won't (we hope) need to do is a fundamental rethink/rewrite.

This will let us progress operationally on our side (where the need is urgent) while still making space for necessary discussions around the best UX/UI, code implementation (none of this is implemented yet) and other architectural concerns.

Please let us know if you agree with this approach or have an alternative theory.

Proposal

Config data structure for duplicate checks

Below is our proposed data structure for a duplicate check system. Leaving aside the question of where the config should live, a declarative data structure should be portable to contact_types, <form>.properties, or to the .xlsx context. There are a few elements, which we'll delve into further below, but first let me provide a broad overview.

Assume the following person-edit form:

| type | name |
| --- | --- |
| begin group | person |
| string | name |
| select_one yes_no | is_user_flagged_duplicate |
| end group | person |

And its accompanying duplicate check structure:

{
   "props": { // The returned array will contain the items that evaluate to true for these conditions, i.e. the duplicates.
      "logic": "or",
      "conditions": [
         {
            "formPropPath": "/data/person/name", // The form question path whose resolved value will be compared to the corresponding value in sibling database documents.
            "dbDocRef": "name", // The name of the property to be compared in the sibling database document.
            // Strategy ALWAYS returns bool - implies status of "yes matches" or "no it doesn't match".
            "strategy": {
               "type": "Levenshtein", // Various
               "threshold": 1.0 // Config overrides
            }
         },
         {
            "formPropPath": "/data/person/phone_number",
            "dbDocRef": "phone_number",
            "strategy": {
               "type": "Equals"
            }
         }
      ]
   },
   "shouldCheckForDuplicates": { // Optional. Defaults to "true"
      // Takes the same structure as "props", except for "dbDocRef"
      "logic": "or",
      "conditions": [
         {
            "formPropPath": "/data/person/is_user_flagged_duplicate",
            "strategy": {
               "type": "Equals",
               "value": "no"
            }
         }
      ]
   }
}

Duplicate conditions

Every condition we define is used to potentially identify a sibling database document as a duplicate. For example, if we match on the name property, every doc of the same type within the parent place will be flagged as a duplicate if the names are similar. These conditions can be grouped to evaluate together.

{
    "logic": "or",
    "conditions": [
        {
            "formPropPath": "/data/person/phone_number",
            "dbDocRef": "phone_number",
            "strategy": {
                "type": "Equals"
            }
        },
        {
            "formPropPath": "/data/person/name",
            "dbDocRef": "name",
            "strategy": {
                "type": "Levenshtein",
                "threshold": 1.0
            }
        }
    ]
}
| prop | purpose |
| --- | --- |
| formPropPath | The path of the field/question in the currently open form whose value is of interest. |
| dbDocRef | The property name of a sibling database document whose value is of interest. |
| strategy | ALWAYS returns a bool as it is a status check -- "yes, it matches" or "no, it does not match". |

The values from formPropPath and the sibling dbDocRef definitions are compared using the specified strategy.

Note: The "Equals" strategy doesn't take a value here like it does in the above shouldCheckForDuplicates object. This is because the current form value is being compared to sibling values rather than to a fixed value.

Any properties (eg threshold) could be provided via config, with sensible defaults.
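As a rough illustration of how a single condition might be evaluated, the TypeScript sketch below looks up both values and hands them to a strategy chosen by `type`. Everything here is an assumption layered on the proposal: the `conditionMatches` name, the flat `formValues` map keyed by form path, the default threshold of 3, and reading `threshold` as a maximum edit distance.

```typescript
interface StrategyConfig {
  type: string;
  threshold?: number;
}

interface DuplicateCondition {
  formPropPath: string;
  dbDocRef: string;
  strategy: StrategyConfig;
}

// Every comparator answers "does it match?" with a boolean.
type Comparator = (formValue: string, docValue: string, threshold?: number) => boolean;

function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) {
    prev.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

const DEFAULT_MAX_DISTANCE = 3; // assumed "sensible default"

const comparators: { [type: string]: Comparator } = {
  Equals: (a, b) => a === b,
  Levenshtein: (a, b, threshold) =>
    levenshtein(a, b) <= (threshold === undefined ? DEFAULT_MAX_DISTANCE : threshold),
};

function conditionMatches(
  condition: DuplicateCondition,
  formValues: { [path: string]: string },
  siblingDoc: { [prop: string]: string }
): boolean {
  const strategyType = condition.strategy.type;
  if (!Object.prototype.hasOwnProperty.call(comparators, strategyType)) {
    throw new Error('Unknown strategy: ' + strategyType);
  }
  const formValue = formValues[condition.formPropPath];
  const docValue = siblingDoc[condition.dbDocRef];
  if (formValue === undefined || docValue === undefined) {
    return false; // a missing value on either side is treated as "no match"
  }
  return comparators[strategyType](formValue, docValue, condition.strategy.threshold);
}

const nameCondition: DuplicateCondition = {
  formPropPath: '/data/person/name',
  dbDocRef: 'name',
  strategy: { type: 'Levenshtein', threshold: 1 },
};
const nameMatches = conditionMatches(
  nameCondition,
  { '/data/person/name': 'Katherine' },
  { name: 'Katharine' }
);
```

Keeping the comparators in a lookup keyed by `type` means new strategies (dates, numbers) could be added without touching the evaluation code.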

Doing logic

Consider the following:

| prop | purpose |
| --- | --- |
| props | Declares the set of conditions that determine which siblings count as duplicates and are returned in the list of candidates. |
| logic | States how to evaluate a nested group. Support for "or" and "and" is proposed. Defaults to undefined, but will error if undefined with more than one item. |
| conditions | Accepts nested groups, single "duplicate conditions" (as shown above), or a combination of the two for evaluation. Single items in the array will just evaluate to their value. |
| shouldCheckForDuplicates | Prevents the duplicate check (props) from firing if the provided set of conditions evaluates to false. Default value is true. |

With the below we can, as per @jkuester 's example, identify duplicates when both the name and phone_number match a sibling's:

"props": {
    "logic": "or",
    "conditions": [
        {
            "logic": "and",
            // Where first_name + last_name = name
            "conditions": [
                {
                    "formPropPath": "/data/person/first_name",
                    "dbDocRef": "first_name",
                    "strategy": {
                        "type": "Levenshtein"
                    }
                },
                {
                    "formPropPath": "/data/person/last_name",
                    "dbDocRef": "last_name",
                    "strategy": {
                        "type": "Levenshtein"
                    }
                },
                {
                    "formPropPath": "/data/person/phone_number",
                    "dbDocRef": "phone_number",
                    "strategy": {
                        "type": "Equals"
                    }
                }
            ]
        },
        {
            "logic": "and",
            "conditions": [
                {
                    "formPropPath": "/data/person/first_name",
                    "dbDocRef": "first_name",
                    "strategy": {
                        "type": "Levenshtein"
                    }
                },
                {
                    "formPropPath": "/data/person/last_name",
                    "dbDocRef": "last_name",
                    "strategy": {
                        "type": "Levenshtein"
                    }
                },
                {
                    "formPropPath": "/data/person/id_number",
                    "dbDocRef": "id_number",
                    "strategy": {
                        "type": "Equals"
                    }
                }
            ]
        }
    ]
},
"shouldCheckForDuplicates": { // Optional -- defaults to "true"
    "logic": "or",
    "conditions": [
        ...
    ]
}

For the above example, let's imagine we have a mom and a child registered on our system. The mother, Jade, will be saved with her phone number and the child, Savanah, will share their parent's phone number. With the above config, should we attempt to create an entry of either person where both the name and phone_number are the same as an existing one, the duplicate list should be populated with their entry.

Notice we can also simultaneously express that we ALSO want duplicates listed where the name and id_number are the same.
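The nested "logic"/"conditions" structure lends itself to a small recursive evaluator. The TypeScript sketch below is one possible reading of the proposal, not existing code: `evaluateGroup` and the type names are hypothetical, and the per-field comparison is injected as a callback so the example can focus on the and/or grouping.

```typescript
interface LeafCondition {
  formPropPath: string;
  dbDocRef: string;
  strategy: { type: string; threshold?: number };
}

interface ConditionGroup {
  logic?: 'and' | 'or';
  conditions: Array<LeafCondition | ConditionGroup>;
}

function isGroup(node: LeafCondition | ConditionGroup): node is ConditionGroup {
  return 'conditions' in node;
}

// Evaluates a group against one sibling doc. `leafMatches` decides whether a
// single condition matches; groups combine those answers with "and"/"or".
function evaluateGroup(
  group: ConditionGroup,
  leafMatches: (leaf: LeafCondition) => boolean
): boolean {
  if (group.conditions.length > 1 && group.logic === undefined) {
    throw new Error('"logic" is required when a group has more than one condition');
  }
  const results = group.conditions.map((node) =>
    isGroup(node) ? evaluateGroup(node, leafMatches) : leafMatches(node)
  );
  // A single item without "logic" simply evaluates to its own value.
  return group.logic === 'and'
    ? results.every((r) => r)
    : results.some((r) => r);
}

function leaf(prop: string, strategyType: string): LeafCondition {
  return { formPropPath: '/data/person/' + prop, dbDocRef: prop, strategy: { type: strategyType } };
}

// Mirrors the example above: (name AND phone_number) OR (name AND id_number).
const props: ConditionGroup = {
  logic: 'or',
  conditions: [
    {
      logic: 'and',
      conditions: [leaf('first_name', 'Levenshtein'), leaf('last_name', 'Levenshtein'), leaf('phone_number', 'Equals')],
    },
    {
      logic: 'and',
      conditions: [leaf('first_name', 'Levenshtein'), leaf('last_name', 'Levenshtein'), leaf('id_number', 'Equals')],
    },
  ],
};

// Stand-in for real comparisons: which fields of one sibling happen to match.
const fieldMatches: { [prop: string]: boolean } = {
  first_name: true,
  last_name: true,
  phone_number: false,
  id_number: true,
};
const flagged = evaluateGroup(props, (l) => fieldMatches[l.dbDocRef] === true);
```

With these stand-in results the first group fails on the phone number, the second group passes on the ID number, and the outer "or" flags the sibling as a possible duplicate.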

Marking canonical entries

Unfortunately, since the go-live last November, our users have created duplicates. At times, this was for legitimate reasons, such as when no house number or land parcel number (locally, ERF number) is available. The CHWs often use 'Sa number not known' in this circumstance (and type it variably). Because of elements like these, or due to typos, it is difficult to accurately search for existing documents before creating new ones (search is also not especially accurate).

We needed some way of identifying the gold/canonical document and marking it as canonical, so the UI should not alert about other duplicate siblings. By marking certain documents as canonical, we will be able to make decisions about updating/deleting/merging duplicates on the backend. By evaluating the current form values in the shouldCheckForDuplicates, we can opt out of the siblings database lookup when a document is marked as canonical.

We can express this in the same structure as in the props object. In this case, a comparator value (the value property) is required for the Equals strategy, and threshold is overridden for NormalizedLevenshtein (a default could be set).

{
   "props":{
      "logic": "or",
      "conditions": [
         ...
      ]
   },
   "shouldCheckForDuplicates": {
      "logic": "or",
      "conditions": [
         {
            "formPropPath": "/data/person/is_canonical",
            "strategy": {
               "type": "Equals",
               "value": "no"
            }
         },
         {
            "formPropPath": "/data/person/name",
            "strategy": {
               "type": "NormalizedLevenshtein",
               "value": "Sibo",
               "threshold": 0.5 
            }
         }
      ]
   }
}

In the above example, we're saying that if the current form's is_canonical field is "yes" or if the person's name is some threshold distance from Sibo (just an example of the logic), the document is a known duplicate and we should not fire the database query or prompt a user to consider duplicates.
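A minimal TypeScript sketch of the gating step, restricted to the equality-with-value case. The `shouldCheckForDuplicates` function name, the `GateConfig` types, and the flat `formValues` map are assumptions for illustration; the fuzzy name condition from the example is deliberately left out.

```typescript
interface GateCondition {
  formPropPath: string;
  strategy: { type: string; value: string };
}

interface GateConfig {
  logic?: 'and' | 'or';
  conditions: GateCondition[];
}

// Decides whether the sibling lookup and duplicate check should run at all.
// Only value equality is sketched here; other strategies would plug in the
// same way. A missing gate means "always check", matching the stated default.
function shouldCheckForDuplicates(
  gate: GateConfig | undefined,
  formValues: { [path: string]: string }
): boolean {
  if (gate === undefined) {
    return true;
  }
  const results = gate.conditions.map(
    (condition) => formValues[condition.formPropPath] === condition.strategy.value
  );
  return gate.logic === 'and'
    ? results.every((r) => r)
    : results.some((r) => r);
}

const gate: GateConfig = {
  logic: 'or',
  conditions: [
    { formPropPath: '/data/person/is_canonical', strategy: { type: 'Equals', value: 'no' } },
  ],
};

const checkOrdinaryDoc = shouldCheckForDuplicates(gate, { '/data/person/is_canonical': 'no' });
const checkCanonicalDoc = shouldCheckForDuplicates(gate, { '/data/person/is_canonical': 'yes' });
```

A document marked canonical ("yes") makes the gate evaluate to false, so the duplicate check is skipped; an ordinary document ("no") still goes through the check.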

We haven't thought through all cases, but this structure may well be able to handle the inter-field deps @jkuester is thinking about.

Default checks

Possible default strategies per field.

Contact

Below is a proposal of the default strategies per field, using the person as an example.

"props": {
    "logic": "and",
    "conditions": [
        {
            "formPropPath": "/data/person/first_name",
            "dbDocRef": "first_name",
            "strategy": {
                "type": "Levenshtein",
                "threshold": 3
            }
        },
        {
            "formPropPath": "/data/person/middle_name",
            "dbDocRef": "middle_name",
            "strategy": {
                "type": "Levenshtein",
                "threshold": 3
            }
        },
        {
            "formPropPath": "/data/person/last_name",
            "dbDocRef": "last_name",
            "strategy": {
                "type": "Levenshtein",
                "threshold": 3
            }
        },
        {
            "logic": "or",
            "conditions": [
                {
                    "formPropPath": "/data/person/date_of_birth",
                    "dbDocRef": "date_of_birth",
                    "strategy": {
                        "type": "Equals"
                    }
                },
                {
                    "formPropPath": "/data/person/phone_number",
                    "dbDocRef": "phone_number",
                    "strategy": {
                        "type": "Equals"
                    }
                }
            ]
        }
    ]
}

Report

Thinking about @garethbowen 's suggestions around default checks, with this language/system, consider the following default for record_type:

"props": {
    "logic": "or",
    "conditions": [
        {
            "formPropPath": "/data/report_code",
            "dbDocRef": "report_code",
            "strategy": {
                "type": "Equals"
            }
        },
        {
            "formPropPath": "/data/reported_date",
            "dbDocRef": "reported_date",
            "strategy": {
                "type": "Equals"
            }
        },
        {
            "formPropPath": "/data/subject",
            "dbDocRef": "subject",
            "strategy": {
                "type": "Equals"
            }
        }
    ]
}

Strategies

Suggested strategies:

| type | configuration variable | example | reasoning |
| --- | --- | --- | --- |
| Levenshtein | threshold | 'cat' ~ 'caterpillar' = 8 | Fuzzy matching |
| NormalizedLevenshtein | threshold | 'cat' ~ 'caterpillar' = 0.727 | Fuzzy matching with length normalisation |
| Equals | value | '1994-08-23' === '1994-08-23' | Equality check. |

In a form, suppose there's a check for something like an id_number. This could be checked with the equality strategy.

For something like dwelling names, eg. "123 Stevenson way", "Stven Way 23", and "Peter Stevenson Road", Levenshtein, or NormalizedLevenshtein could be helpful in catching duplicates.
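The two example figures in the strategies table can be reproduced with the short TypeScript sketch below. It assumes the normalised score is the edit distance divided by the length of the longer string, which is consistent with 8 / 11 ≈ 0.727; a particular library may normalise differently.

```typescript
// Plain Levenshtein distance (insert, delete, substitute each cost 1).
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) {
    prev.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalisation: distance / length of the longer string,
// giving 0 for identical strings and 1 for completely different ones.
function normalizedLevenshtein(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 0 : levenshtein(a, b) / longest;
}

const distance = levenshtein('cat', 'caterpillar');             // 8 appended characters
const normalized = normalizedLevenshtein('cat', 'caterpillar'); // 8 / 11
console.log(distance, normalized.toFixed(3));                   // prints: 8 0.727
```

The raw distance grows with string length, so a fixed threshold behaves very differently for short names and long addresses; the normalised form keeps the threshold on a 0 to 1 scale, which may be easier to configure per field.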

Locations

Proposed storage locations for the various pieces of code that comprise this solution.

Config

Our first suggestion for where to store the config is app_settings.json or <form>.properties.json. We already manage form fields (listing items to hide from reports) in <form>.properties.json, so a "duplicate" block for fields feels like a natural fit. Any advice here would be appreciated.

Because the structure is declarative, it's possible to put the structure in the .xlsx files, but because it's not source controlled and adds complexity we'd resist that as an initial implementation. However, a declarative approach doesn't shut that door.

That being said, we’re a bit uncertain how to load these values and make it available within contacts-edit.component/report.component. We would appreciate any guidance to accelerate us here.

Code

Our initial proposal is to locate all the code processing the declarations in a modules/utils/deduplicate.ts file, which could then be made available to contacts-edit.component and reports.component. Once again, advice here is welcome.

Early duplicate detection

We're not exactly sure how to implement this. Maybe, since the sibling request gets triggered during form initialization, we could filter the results through onBlur input field listeners once it's resolved. After a certain number/set of fields are marked as dirty, we could take action. We've suggested how the results might be shown to the user, please see the display duplicate modal point in the UX section for details.
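One way the "act once a certain set of fields is dirty" idea might look, sketched in TypeScript. `createEarlyCheckTrigger` and its parameters are hypothetical; wiring it to real blur events and to the resolved sibling lookup is left out, and it simply re-runs the check on every blur once all required fields have been touched.

```typescript
// Returns a blur handler that marks fields as dirty and calls `runCheck`
// whenever every field in `requiredFields` has been touched at least once.
function createEarlyCheckTrigger(
  requiredFields: string[],
  runCheck: () => void
): (field: string) => void {
  const dirty: { [field: string]: boolean } = {};
  return (field: string): void => {
    dirty[field] = true;
    if (requiredFields.every((f) => dirty[f] === true)) {
      runCheck();
    }
  };
}

let checkRuns = 0;
const handleBlur = createEarlyCheckTrigger(['first_name', 'last_name'], () => {
  checkRuns += 1; // stand-in for filtering the loaded siblings
});

handleBlur('first_name');   // only one required field dirty: no check yet
const runsAfterFirst = checkRuns;
handleBlur('phone_number'); // not a required field: still no check
handleBlur('last_name');    // both required fields dirty: check runs
const runsAfterAll = checkRuns;
```

This keeps the duplicate warning from firing on half-entered data while still surfacing it well before the end of a long form.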

UX

Default behaviour on duplicate found

It is a hard requirement for us that users can bypass the duplicate check. There may be a case for a harder check on something like reports, but that can be handled in validation, and perhaps not a default.

One idea for the UI is to add a section to the enketo.component.html just above the submit button with a checkbox to warn the users that there are potential duplicates, making them acknowledge the warning before moving forward.

Here's how the flow might look:

  1. On form load the sibling lookup is initiated and the submit button is disabled.
  2. Users enter details that match sibling data.
  3. An inline list is displayed, above/next to the checkbox, showing duplicate results as contact cards with detailed information and a link to navigate to the duplicate (possibly with a data loss warning).
  4. An acknowledgement checkbox is displayed above the submit button, which remains disabled.
  5. Once the checkbox is ticked, the submit button becomes enabled, allowing the form to be saved.
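The flow above boils down to a small piece of state that decides whether the submit button is enabled. A TypeScript sketch follows; the `DuplicateUiState` shape and `canSubmit` name are hypothetical, and it assumes the button stays disabled until the sibling lookup has resolved.

```typescript
interface DuplicateUiState {
  siblingsLoaded: boolean; // has the sibling lookup from step 1 resolved?
  duplicateCount: number;  // possible duplicates currently listed (steps 2-3)
  acknowledged: boolean;   // has the user ticked the checkbox (steps 4-5)?
}

function canSubmit(ui: DuplicateUiState): boolean {
  if (!ui.siblingsLoaded) {
    return false; // step 1: disabled while the lookup is in flight
  }
  if (ui.duplicateCount === 0) {
    return true; // nothing to warn about
  }
  return ui.acknowledged; // duplicates found: require the explicit bypass
}

const whileLoading = canSubmit({ siblingsLoaded: false, duplicateCount: 0, acknowledged: false });
const noDuplicates = canSubmit({ siblingsLoaded: true, duplicateCount: 0, acknowledged: false });
const unacknowledged = canSubmit({ siblingsLoaded: true, duplicateCount: 2, acknowledged: false });
const bypassed = canSubmit({ siblingsLoaded: true, duplicateCount: 2, acknowledged: true });
```

The CHW is never blocked outright: with duplicates listed, ticking the checkbox is enough to re-enable submission.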

The benefits are:

  • Flexibility for many use cases. The CHT is used by many people with different needs.
  • Minimise disruptions. Disabling the form submission could frustrate users, a warning could allow them to make an informed decision.
  • Provide the field worker with the necessary agency.

Display duplicates

> Instead of including the list of contacts inline in the form page, it might be better to pop a modal containing the list.

Image

An inline list of contacts (see above screenshot) allows us to leaf through the current form values and compare them against the duplicates.

Modals are great for decisions that can be made quickly where all context can be presented on the modal. However, it's clear from our users that they need to be able to compare against what is already on the page when taking decisions, and a modal flow isn't appropriate for that, especially (but not only) in multi-page registration forms.

With the current modal implementations, CHWs would have to dismiss it every time a different form value needs checking, since it blocks user interaction (no scrolling or clicking through).

One of our suggestions for the future is to have some sort of "bottom drawer" that allows user interaction with the current form while also being able to pull up, dismiss and resize the duplicate result pane. The best of both worlds. Something like:

Image
What do you think?

Duplicate items as contact cards

We like the idea of presenting duplicates as a list of "contact cards"; it is something the community might be comfortable with due to familiarity. It might be advantageous to make these cards expandable (starting off with less space), since multiple duplicates could be returned. That way, as @jkuester suggested, we could easily include various key identifiers without worrying too much about vertical space. It will allow the user to make an informed decision before navigating away from the current form and losing entered data.

Modification to ListTiles

We'd like to suggest making a subtext line available on ListTiles, which might be configurable through the contact_type section in the app_settings. @garethbowen made an excellent point around assuming anything about naming conventions. We should prevent accidental duplication, not restrict intentional actions. For example, multiple family members could share the same name.

Using the example above, adding a configurable extra line of text, like showing date_of_birth, could make it much easier for CHWs to tell a father and a child apart at a glance. As not all records are created during the same visit, it would be an easy way to see that the father has already been added and now it's the child's turn. It would also help them quickly spot real duplicates in their workspace.

Endnotes

Duplicate check on edit

I am not sure if this is covered in the current PR functionality or not, but I think it will be important to do the same duplicate checking when editing a contact.

Short answer: yes, it is. The contacts-edit.component.ts serves both contact creation and editing, so the duplicate prevention prototype was written to handle both cases. Everything works the same, with just a slightly different lookup strategy when loading the page. Here's how we do it:

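// A docId is only present when an existing contact is being edited.
const docId = this.enketoContact.docId;
this.isEdit = docId != null;
// Prefer values from the loaded contact/type; otherwise fall back to the route params and form.
this.parentId = contact?.parent?._id ?? this.routeSnapshot.params.parent_id;
this.contactType = contactType?.id ?? this.enketoContact?.type;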

Re-fuzzy matching

There are many other strategies one could consider, like the Jaro-Winkler distance and N-gram similarity. Those mentioned in the strategies section are simply what is initially necessary to meet our operational requirements. We've noticed that typos are a frequent occurrence, perhaps because we have a diverse group of users with varying levels of technical know-how. A user might also use a slightly different naming scheme every other time they go out to enumerate their assigned areas. For this reason, we need a form of evaluation that allows for 'likeness', for which Levenshtein seems a good fit. In cases where we need to make exact comparisons, like on id_number, we could use Equals.
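To make the two strategies concrete, here is a minimal sketch of a Levenshtein-based "likeness" check next to an exact comparison. The function names, the trim/lowercase normalization, and the sample values are assumptions for illustration, not the prototype's code.

```typescript
// Classic dynamic-programming edit distance, keeping only two rows in memory.
function levenshteinDistance(a: string, b: string): number {
  let previous: number[] = [];
  for (let j = 0; j <= b.length; j++) {
    previous.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const row: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = previous[j - 1] + (a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1);
      row.push(Math.min(previous[j] + 1, row[j - 1] + 1, substitution));
    }
    previous = row;
  }
  return previous[b.length];
}

// "Likeness": values match if they are within `threshold` edits of each other.
// Trimming and lowercasing first is an assumption, not a stated requirement.
function withinDistance(threshold: number, a: string, b: string): boolean {
  const normalize = (value: string): string => value.trim().toLowerCase();
  return levenshteinDistance(normalize(a), normalize(b)) <= threshold;
}

// "Equals": for identifiers such as id_number, where near-misses must not match.
function exactMatch(a: string, b: string): boolean {
  return a === b;
}

const typoStillMatches = withinDistance(1, 'Jon Doe', 'john doe '); // one missing letter
const idsDiffer = exactMatch('ID-123', 'ID-124');
```

Note that an absolute threshold is more forgiving for short values than for long ones, which is one reason a normalized variant may also be worth having.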

Forum threads that started this discussion:
https://forum.communityhealthtoolkit.org/t/unexpected-search-results/3288
https://forum.communityhealthtoolkit.org/t/mitigate-duplicate-data-capture/3313/6?u=anro


Special thanks to @fardarter for helping to draw this up 🥇 .

@jkuester
Contributor

Thanks for this brilliant write-up! I have been working my way through it, making notes, and thinking about things. Unfortunately, I am OOO the rest of the week, so I do not have time for a full response here. But, just wanted to let you know that I will be circling back here with more detailed thoughts early next week.

@ChinHairSaintClair
Contributor Author

ChinHairSaintClair commented Nov 28, 2024

The anchor tags in our response have been fixed.

@jkuester
Contributor

jkuester commented Dec 3, 2024

what I think we're looking for with the greatest focus is rapid alignment on a proposed data structure.

Agreed! Let's put the main focus on this. I think I have two remaining functional questions here:

  1. Do you foresee the need for "cross-comparing" fields as a part of the duplicate check (comparing the values of separate fields)? For example, comparing the alternate_phone value to the phone value of other contacts?
  2. Can you confirm that the logical structuring of field comparisons you have described (e.g. a contact is duplicate if it has matching: (first_name && last_name && phone) || (first_name && last_name && id_number)) is a requirement for your duplicate checking needs? (As compared to a simple field list approach where a contact is duplicate if it has matching: first_name && last_name && phone && id_number.)
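The difference between the two options in question 2 can be shown with a small sketch. The type and function names are hypothetical, and plain equality stands in for whatever comparison would actually be used.

```typescript
type ContactFields = { [field: string]: string | undefined };

// Simple field-list approach: every listed field must be present and equal.
function matchesAllFields(fields: string[], current: ContactFields, existing: ContactFields): boolean {
  return fields.every((f) => current[f] !== undefined && current[f] === existing[f]);
}

// Grouped approach: a duplicate if any one group of fields fully matches,
// i.e. (first_name && last_name && phone) || (first_name && last_name && id_number).
function matchesAnyGroup(groups: string[][], current: ContactFields, existing: ContactFields): boolean {
  return groups.some((group) => matchesAllFields(group, current, existing));
}

const entered: ContactFields = { first_name: 'John', last_name: 'Doe', phone: '0123', id_number: 'A1' };
const sibling: ContactFields = { first_name: 'John', last_name: 'Doe', phone: '0123', id_number: 'B2' };

// The flat list misses this pair because id_number differs; the grouped check catches it.
const flatResult = matchesAllFields(['first_name', 'last_name', 'phone', 'id_number'], entered, sibling);
const groupedResult = matchesAnyGroup(
  [['first_name', 'last_name', 'phone'], ['first_name', 'last_name', 'id_number']],
  entered,
  sibling
);
```

Neither helper can express the cross-comparison from question 1 (for example alternate_phone against phone), since both assume the same field name on each side.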

There is a ton of functionality that we could build here, but I really want to try and scope things to what we actually expect to be useful...

If the answer to both #1 and #2 is "no", then I think our data structure could be a simple list of fields to include in the duplicate checks. (Either way, I would like to try and avoid the formPropPath vs dbDocRef duplication. The system should be able to do this field mapping automatically.)

On the other hand, if we need one or both of #1 and #2, then I think we should seriously consider avoiding any kind of declarative data structure for defining the duplicate checking and instead go for a Javascript expression like the existing context.expression functionality in the form properties.json file. We could do a bunch of work coding up a super flexible declarative data structure and there would still be edge-cases missing and other kinds of unsupported workflows. If we need flexibility, we can have that while still keeping the code-base maintainable by supporting JS expression-style config. To go into a bit more detail on this idea, I am thinking the duplicate_check expression could have the following things available to reference:

  • current - the doc being created/edited
  • existing - the existing doc to compare with (e.g. a sibling contact)
  • levenshteinEq - function for comparing values
  • normalizedLevenshteinEq - function for comparing values

Then, in your form.properties.json file you could have:

{
    "duplicate_check": {
        "expression": "(levenshteinEq(3, current.first_name, existing.first_name) && levenshteinEq(3, current.last_name, existing.last_name)) || (current.phone === existing.phone && current.id_number === existing.id_number)"
    }
}

This approach is somewhat obtuse (you have to rely on the documentation to know what references are available in the expression), but I guess any approach is going to involve needing to reference docs to understand what config is available....
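A minimal sketch of how such an expression might be evaluated follows. It uses the standard Function constructor purely for illustration; how the CHT evaluates context.expression internally is not assumed here. The compileDuplicateCheck name and the semantics of normalizedLevenshteinEq (distance divided by the longer length) are also assumptions.

```typescript
type DocFields = { [field: string]: unknown };

// Standard dynamic-programming edit distance.
function editDistance(a: string, b: string): number {
  let previous: number[] = [];
  for (let j = 0; j <= b.length; j++) {
    previous.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const row: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = previous[j - 1] + (a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1);
      row.push(Math.min(previous[j] + 1, row[j - 1] + 1, substitution));
    }
    previous = row;
  }
  return previous[b.length];
}

// Non-string values never match, so a missing field cannot produce a duplicate here.
function levenshteinEq(threshold: number, a: unknown, b: unknown): boolean {
  return typeof a === 'string' && typeof b === 'string' && editDistance(a, b) <= threshold;
}

// Assumed semantics: threshold is a ratio of edits to the longer value's length.
function normalizedLevenshteinEq(threshold: number, a: unknown, b: unknown): boolean {
  if (typeof a !== 'string' || typeof b !== 'string') {
    return false;
  }
  const longest = Math.max(a.length, b.length);
  return longest === 0 || editDistance(a, b) / longest <= threshold;
}

// Compiles the configured expression once; the result can be run against each sibling.
function compileDuplicateCheck(expression: string): (current: DocFields, existing: DocFields) => boolean {
  const evaluate = new Function(
    'current', 'existing', 'levenshteinEq', 'normalizedLevenshteinEq',
    'return (' + expression + ');'
  );
  return (current, existing) =>
    Boolean(evaluate(current, existing, levenshteinEq, normalizedLevenshteinEq));
}

const isDuplicate = compileDuplicateCheck(
  '(levenshteinEq(3, current.first_name, existing.first_name) && levenshteinEq(3, current.last_name, existing.last_name))' +
  ' || (current.phone === existing.phone && current.id_number === existing.id_number)'
);
```

One thing this makes visible: with plain ===, two docs that both lack phone and id_number would satisfy the second clause, so expressions probably need to guard against empty values. The expression is also executed as code, so it should only ever come from trusted app configuration.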


The other crucial piece of configuration to decide on is the method of marking a contact as "canonical". You have proposed some declarative configuration for shouldCheckForDuplicates that would allow the system to dynamically decide if a doc should be checked for duplicates. I am wondering if instead it would be possible to include the dynamic logic in the form and toggle the duplicate checking based on the value of a field on the contact. We could have a special skip_duplicate_check field. If that field value is true when the form is submitted, the dupe checking would not happen at all. If the field is not true and the dupe checking finds dups, but the user creates the contact anyway, then the contact can be written with skip_duplicate_check = true so that subsequent edits to the contact do not re-trigger the dupe checks. Are there other aspects of this functionality that I am missing with this approach?
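The skip_duplicate_check behaviour described above could look roughly like this. Only the skip_duplicate_check field name comes from the proposal; the types, the decideSave function, and the injected findDuplicates callback are hypothetical.

```typescript
interface ContactDocSketch {
  _id?: string;
  skip_duplicate_check?: boolean;
  [field: string]: unknown;
}

type SaveDecision =
  | { action: 'save'; doc: ContactDocSketch }
  | { action: 'warn'; duplicates: ContactDocSketch[] };

function decideSave(
  doc: ContactDocSketch,
  findDuplicates: (candidate: ContactDocSketch) => ContactDocSketch[],
  userBypassed: boolean
): SaveDecision {
  // The flag is already set: no duplicate checking happens at all.
  if (doc.skip_duplicate_check === true) {
    return { action: 'save', doc };
  }
  const duplicates = findDuplicates(doc);
  if (duplicates.length === 0) {
    return { action: 'save', doc };
  }
  // Duplicates were found but the user chose to continue: persist the flag
  // so that later edits do not re-trigger the check.
  if (userBypassed) {
    return { action: 'save', doc: { ...doc, skip_duplicate_check: true } };
  }
  return { action: 'warn', duplicates };
}
```

For example, calling decideSave with a doc that has one duplicate and userBypassed set to true yields a save decision whose doc carries skip_duplicate_check: true.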

Once again, my goal here is to make things just flexible enough to be useful, while still keeping everything maintainable.

@jkuester
Contributor

jkuester commented Dec 3, 2024

Default checks

I guess just automatically enabling some kind of default duplicate checking for all contacts could be seen as a breaking change. I wonder if it might not be better to focus on some way to allow folks to easily enable "best practice" dupe checks for each of their forms. For example, if we go with a duplicate_check object in the form's properties.json file, we could also allow other properties besides expression. Even just having a default property on duplicate_check that could be set to true if you wanted the default duplicate check to apply to this form might work fine....
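One way the opt-in default could be resolved is sketched below. The config shape follows the duplicate_check object discussed above, but the default expression shown is only a placeholder, and letting an explicit expression take precedence over default is an assumption.

```typescript
// Hypothetical shape of the duplicate_check object in a form's properties.json.
interface DuplicateCheckConfig {
  expression?: string;
  default?: boolean;
}

// Placeholder only: the actual "best practice" check has not been decided.
const DEFAULT_DUPLICATE_EXPRESSION = 'levenshteinEq(3, current.name, existing.name)';

// Returns the expression to run, or undefined when the form has not opted in.
// Forms without any duplicate_check config keep today's behaviour (no check),
// which avoids the breaking-change concern.
function resolveDuplicateExpression(config?: DuplicateCheckConfig): string | undefined {
  if (!config) {
    return undefined;
  }
  if (config.expression) {
    return config.expression;
  }
  return config.default === true ? DEFAULT_DUPLICATE_EXPRESSION : undefined;
}
```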

Reports/Early duplicate detection

So far I do not perceive our design discussion as headed in a direction that would prevent us from implementing this functionality in the future. I group these together here because I think the early duplicate detection functionality could be super useful when dupe checking reports. (As previously noted, the most common case of "duplicate" reports would be the same form being submitted within a particular time window. So, we should know right at the start of a form that there are "duplicates".)

Additionally, the proposed "bottom drawer" UI could also pair well with the early duplicate detection since it would give a clean way to see the duplicate contacts while still in the middle of a form.

Display duplicates

I really appreciate your thoughts on the modal approach and agree that an in-line/drawer UI would be superior! My suggestion is that for the initial MVP we just display a list of the duplicate contacts in-line at the end of the form (with a few relevant details such as name, phone, date_of_birth, etc) with a check-box for bypassing the dupe-check. We can get fancier in future iterations.

@ChinHairSaintClair
Contributor Author

Thank you for your response @jkuester, we're busy working through the proposed solution and hope to have something for your consideration soon. That being said, it is the end of the year and many people are going on leave which makes it a little tricky to make meaningful progress.

Form evolution problem

The structure and names of stored properties can vary between form versions, and this makes it difficult to guarantee field comparison between a current form and a sibling doc. We've not yet come up with a way to solve for this problem without some kind of historical mapping.

Consider the following sequence of hhm form versions:

First, v1 is created with this structure:

Create (v1)

type name
begin group hhm
string name
end group hhm

Edit (v1)

type name
begin group hhm
string name
end group hhm

Imagine we have 10 records saved with the above structure. Then the form structure is modified (v2):

Create (v2)

type name
begin group hhm
begin group personal_info
string name
end group personal_info
end group hhm

Edit (v2)

type name calculate
begin group init
calculate prev_name once(coalesce(../../hhm/personal_info/name, ../../hhm/name))
string name once(../prev_name)
end group init
begin group hhm
string name
begin group personal_info
calculate name ../../init/name
end group personal_info
end group hhm

To edit a doc saved in v1 using the v2 edit component, should our newest reference contain no value, we fall back to the previously stored name from v1 and populate the correct new field. In this example, we use the coalesce XPath function to check the current (v2) ../../hhm/personal_info/name field for a value. If it is null or "", we fall back to the v1 ../../hhm/name reference. So, if ../../hhm/personal_info/name is empty, the value from ../../hhm/name (e.g. "John Doe") is assigned to the name variable in the init group.

One advantage of this is that it gives us a progressive migration of data between form versions (the 10 records would be incrementally moved to v2). A significant downside is that the resolution logic gets more and more complicated the more form versions are issued with changes. The current edit form component needs to be able to resolve the correct values given incoming data saved with an arbitrary version. This is a worry for us looking forward.

Regardless, for deduplication, the loaded siblings do not benefit from this same “migration” or mapping.

Approach

We’re still in the prototyping phase, but the approach we're currently playing with is to extract the property mapping from the XML and apply it to siblings before the comparison step.

We're still not sure how complex getting those values will be.

Some expected challenges:

  1. Distinguishing "mappings" from regular calculations. Let's say the field we're interested in has a calculation where a household_name is set to the name of the registered household head (should the household_name not be defined). No two implementers code their forms the same way, so coalesce might work as an identifier for us, but it might not be unique for another.
  2. Loading the edit form db doc while in the create form flow since only the edit form contains the "mapping" info.
  3. Correctly parsing the XML.

Alternative

Consider the following structure we're looking at storing with the expression in the duplicate_check of the .properties.json file. A fields block defines the paths and db refs for the current and existing keys we're interested in for duplicate checking, without which we were left attempting to regex them out of the expression. We'd really like to not include paths, but the form evolution problem means we need some way of expressing the correct way to resolve the value. The only alternative we can think of at the moment is to capture the possible references as an array for each existing expression key.

{
    "expression": "levenshteinEq(3, current.first_name, existing.first_name) ...",
    "fields": {
        "current": {
            "first_name": "/data/hhm/first_name"
        },
        "existing": {
            "first_name": ["first_name", "personal_details/first_name"]
        }
    }
}

We don't think this is ideal as the value mapping would need to be kept in step in two different locations (.xml & .properties.json) and it adds complexity to the declaration.
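To illustrate how the existing block might be applied to a sibling doc, here is a minimal sketch that walks each candidate path in order and keeps the first non-empty value, mirroring the coalesce fallback used in the edit form. The helper names are hypothetical, and ordering the paths by preference is an assumption.

```typescript
type CandidatePaths = { [key: string]: string[] };

// Walks a slash-separated path (e.g. "personal_details/first_name") through a doc.
function getByPath(doc: unknown, path: string): unknown {
  let node: unknown = doc;
  for (const segment of path.split('/')) {
    if (typeof node !== 'object' || node === null) {
      return undefined;
    }
    node = (node as { [key: string]: unknown })[segment];
  }
  return node;
}

// Builds the flat `existing` object for the expression by taking, for each key,
// the first candidate path that holds a non-empty value.
function resolveExistingFields(doc: unknown, fields: CandidatePaths): { [key: string]: unknown } {
  const resolved: { [key: string]: unknown } = {};
  for (const key of Object.keys(fields)) {
    for (const path of fields[key]) {
      const value = getByPath(doc, path);
      if (value !== undefined && value !== null && value !== '') {
        resolved[key] = value;
        break;
      }
    }
  }
  return resolved;
}

const existingPaths: CandidatePaths = { first_name: ['first_name', 'personal_details/first_name'] };
// A sibling saved by an older form version and one saved by a newer version.
const fromOlderDoc = resolveExistingFields({ first_name: 'John' }, existingPaths);
const fromNewerDoc = resolveExistingFields({ personal_details: { first_name: 'Jane' } }, existingPaths);
```

Both siblings resolve to a first_name the expression can compare, regardless of which form version saved them, at the cost of maintaining the path list by hand.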

We're not sure of the best approach and wanted to expose the problem early. At some stage some new code will follow.

We would appreciate your input on this and if you have ideas for any of the other challenges.


Other items

logic

Yes, we need the ability to do logic. Using household_member as an example, a duplicate could be identified by checking (current.name === existing.name && current.id === existing.id) || (current.name === existing.name && current.pmi === existing.pmi).
Here, id is similar to a social security number, and pmi (patient master index) is used to link to other clinical data downstream.

is_canonical

It's our view that is_canonical would make more sense as a field name to the downstream db users versus skip_duplicate_check as it explains the status of the doc, rather than app behaviour.

Default checks

How about simply setting the "duplicate_check":true as the default for the simple cases like checking names? The default doc for contacts could just check the name field, and reports could check the stricter cases that are relevant. We've not come up with a user profile that we think wouldn't want it, but are happy to defer if you know better. The only real wrench in the gears for that is the form evolution issue.
