Prevent duplicate sibling contact capture #9601
Comments
Overview
First, I want to map out the space and the existing problem set so that it is clear exactly what should be addressed in this issue. I think there are three separate, but related, problems to solve under the heading of "Duplicate prevention":
Details
With that summary in mind, we can dig into the details of how to specifically prevent a (potentially offline) user from creating a duplicate contact. The prototype PR provides a great starting point for this conversation. My goal here is to synthesize/generalize the details from that prototype into a design summary that is easier for folks to understand and discuss (also including some of my own suggestions and editorializing). Once we coalesce on a particular design approach, we can return to the actual code and make that happen.
Configuration
Since different types of contacts contain different levels of data that might need to be checked for duplication, it makes sense to configure the rules for dupe checking individually for each contact type. I think it would be logical to include this config in the app-settings For each contact type, we need to define the rules for what constitutes a duplicate. The most flexible way to do this would be to allow a custom function to be included in the config that accepts two contacts and returns a boolean indicating whether the contacts are duplicates. The Levenshtein library could be in-context for this function's logic to make use of. This would allow the dupe logic to be as complex or as simple as necessary.
The main downside of this function-based approach is that I guess it would be almost impossible to re-use this configuration for any kind of server-side dupe checking (in any future solution for #6363). It is not feasible from a performance perspective to run a function like this against every contact in the db. (Honestly, though, the more I consider this, the less I think we should try to optimize this config for any kind of server-side reuse... It seems like the most likely possibilities for server-side dupe checking functionality in the near future are not going to happen with Couch data, but via an external data store, e.g.
DOT or Databricks)
The prototype PR presents an alternative approach where, instead of a function, for each contact type we define which fields should be compared from the contact docs and which algorithm (e.g. I think more discussion about the best form of configuration would be valuable here! One thing I would love to see is a simple way to enable some "default" dupe checks (e.g. something that just validates the
Functionality
Moving on to the functionality of how this should actually work in webapp. I think the example from the PR is a solid approach where, when opening a new contact form, it also triggers a lookup of the existing sibling contacts of the new contact (via I am not sure if this is covered in the current PR functionality or not, but I think it will be important to do the same duplicate checking when editing a contact.
UX
As demonstrated in the screenshot above, the current UX in the PR is to present the user with a list of the found duplicates (along with links to go to their profile page). One incremental enhancement that might make sense here would be to present the duplicates more as a proper list of contacts (or even "contact cards") with various important identifying information displayed (instead of highlighting specifically which data is being matched for the duplicate check). If a user clicks into one of the other contacts, they will lose all the data they have entered into the contact form, so we want to give them as much info as is feasible about the contacts before they navigate away from the form. Instead of including the list of contacts inline in the form page, it might be better to pop a modal containing the list. 🤔 (Either way, we should be able to use a
Another consideration is what the default behavior should be if duplicate contacts are found. Should we warn the user, but still let them submit the new contact? Or, should we totally disable the submit button and prevent the new contact from being added?
Ultimately, this is something that we could make configurable, but it would be good to have a simple approach in the MVP and add configuration later...
@ChinHairSaintClair @fardarter Please weigh in here with anything that I have missed or mistaken, or any additional thoughts or considerations that you have! @garethbowen let me know what you think about this proposed approach. What other stakeholders should we pull into this conversation to make sure we can maintain momentum on this feature?
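The function-based configuration described under Configuration in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: the `duplicateRules` map, the `findDuplicates` helper, the field names (`name`, `date_of_birth`, `type`) and the distance thresholds are all hypothetical, and the Levenshtein implementation stands in for whatever library would actually be made available in-context.

```javascript
// Classic dynamic-programming Levenshtein distance (two-row variant).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical per-contact-type rules: each accepts two contacts and
// returns a boolean. Field names and thresholds are assumptions.
const duplicateRules = {
  person: (current, existing) =>
    levenshtein(current.name.toLowerCase(), existing.name.toLowerCase()) <= 2 &&
    current.date_of_birth === existing.date_of_birth,
  clinic: (current, existing) =>
    levenshtein(current.name.toLowerCase(), existing.name.toLowerCase()) <= 3,
};

// Return the siblings that the rule for this contact type flags as duplicates.
// A doc is never compared against itself, which matters when editing.
function findDuplicates(current, siblings, rules = duplicateRules) {
  const isDuplicate = rules[current.type];
  if (!isDuplicate) return []; // no rule configured for this contact type
  return siblings.filter(
    (sibling) => sibling._id !== current._id && isDuplicate(current, sibling)
  );
}

const siblings = [
  { _id: 'a', type: 'person', name: 'Jade Smith', date_of_birth: '1990-01-01' },
  { _id: 'b', type: 'person', name: 'Savanah Smith', date_of_birth: '2015-06-01' },
];
const draft = { type: 'person', name: 'Jade Smyth', date_of_birth: '1990-01-01' };
console.log(findDuplicates(draft, siblings).map((c) => c._id));
```

Because each rule is an arbitrary function, it runs once per sibling on the client, which fits the offline case but is consistent with the concern above about reusing the same config server-side.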
I think this is a must. We can't assume anything about naming conventions and it's quite possible for two people in the same family to have the same name. The point of this feature is to stop a CHW doing the wrong thing accidentally, not prevent them from doing an action on purpose.
I'm really interested in seeing if we can make this workflow generic enough that it works for all report types, not just contact creation. We have so many examples of duplicates being created accidentally across all types that this would be powerful. I worry that some of the thoughts here (like Levenshtein distance, using multiple fields, etc) are over and above what's actually needed. Can we just check the exact name and family? For reports, can we just check report code, reported date, and subject? If we can simplify it enough, can it work out-of-the-box without configuration?
The ideal solution would warn about duplicates as early as possible. Some forms are very long and forcing a user to enter all the details before telling them about dupes we could have found after the first input field was complete would be a very frustrating UX. In my head this looks like a validation error on the name input with a checkbox to bypass the check, but implementing it as an enketo validation would be difficult I think? But however it's done, notifying as early as possible would be a huge win.
I think the eCHIS Kenya team would be interested in this too.
These were my initial thoughts as well, but then I was convinced by your comment on the other issue that the "duplicate report" workflow was quite different from duplicate contacts and perhaps there is not much overlap in the configs/logic. Specifically, detecting a "duplicate report" is probably much less about the contents of the report than, just as you said, the type of report, who it is for, and when it is submitted. Basically for reports we would be looking for other reports of the same type that were created for the same contact in a particular timeframe. These checks can happen up-front before even loading the form. All of this is pretty different from the "duplicate contact" flow where the most important thing is the content that the user enters for the new contact. So we cannot do an upfront check for a duplicate contact. Also, it is likely that more config is necessary to allow the contact dupe checking to be really useful. (Maybe we can find some sensible pre-sets, but it seems like lots of tuning may be needed for some cases...) Because of this, I am skeptical of a "one-size-fits-all" solution for dupe-doc checking that covers both reports and contacts. (And even if we do decide to go that route, we would not need to support dupe-checking reports in this MVP PR.) It seems like the most important thing to decide at this point is whether we think report dupe checking will need to allow for the same level of flexible configuration as contacts (e.g. specifying which fields should be dupe checked). If so, then yeah it probably makes sense to at least design the contact dupe-checking to be extended later for also checking reports. If not, then I think we probably just leave the report dupe checking to its own issue and not worry about it here.
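To make the contrast concrete, the up-front report check described above (same report type, same contact, within a timeframe) might be sketched as follows. This is a minimal illustration, not existing CHT code: `findRecentReports`, the `subject_id` field and the seven-day window are assumptions chosen for the example.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: reports of the same form, for the same subject,
// reported no more than `windowMs` before `now`. Needs no form content,
// so it can run before the form is even loaded.
function findRecentReports(reports, { form, subjectId, now, windowMs }) {
  return reports.filter((report) => {
    const age = now - report.reported_date;
    return (
      report.form === form &&
      report.subject_id === subjectId &&
      age >= 0 &&
      age <= windowMs
    );
  });
}

const now = Date.UTC(2024, 0, 10);
const reports = [
  { _id: 'r1', form: 'pregnancy', subject_id: 'p1', reported_date: now - 2 * DAY_MS },
  { _id: 'r2', form: 'pregnancy', subject_id: 'p1', reported_date: now - 40 * DAY_MS },
  { _id: 'r3', form: 'delivery', subject_id: 'p1', reported_date: now - 1 * DAY_MS },
];
const recent = findRecentReports(reports, {
  form: 'pregnancy',
  subjectId: 'p1',
  now,
  windowMs: 7 * DAY_MS,
});
console.log(recent.map((report) => report._id));
```

Nothing here depends on what the user types into the form, which is the key difference from the contact flow.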
Okay, this got me thinking that maybe I have been coming at this from the wrong direction! What if, instead of configuring the dupe-checking in the app-settings, we did it in the actual form The main downside I see to configuring things in the form is that for contact forms it would be important that only the fields that map directly to the contact doc are eligible for dupe checking. Fields in the (Tell me if I am wrong here, though! 😅 I feel like I am seeing Enketo widget + custom xlsx column as the solution to all problems lately....)
@eljhkrr just putting this on your radar! Please jump in to the discussion here if you have any specific concerns, requirements, or ideas!
First, thank you, @jkuester, for explaining the concept and the prototype PR so clearly.
Our operational needs
Let's headline the things we need from this solution:
Critical alignment issue
We'll cover a number of issues in this response (thank you very much for your engagement), but what I think we're looking for with the greatest focus is rapid alignment on a proposed data structure. The idea is that if we agree on a data structure, we can safely implement our own solution that will align. At worst, we might need to adjust the config load source. What we won't (we hope) need to do is a fundamental rethink/rewrite. This will let us progress operationally on our side (where the need is urgent) while still making space for necessary discussions around the best UX/UI, code implementation (none of this is implemented yet) and other architectural concerns. Please let us know if you agree with this approach or have an alternative theory.
Proposal
Config data structure for duplicate checks
Below is our proposed data structure for a duplicate check system. Leaving aside the question of where the config should live, a declarative data structure should be portable to Assume the following person-edit form:
And its accompanying duplicate check structure:
Duplicate conditions
Every condition we define is used to potentially identify a sibling database document as a duplicate. For example, if we match on the
The values from
Any properties (eg
Doing logic
Consider the following:
With the below we can, as per @jkuester 's example, identify duplicates when both the
For the above example, let's imagine we have a mom and a child registered on our system. The mother, Jade, will be saved with her phone number and the child, Savanah, will share their parent's phone number. With the above config, should we attempt to create an entry of either person where both the Notice we can also simultaneously express that we ALSO want duplicates listed where the
Marking canonical entries
Unfortunately, since the go-live last November, our users have created duplicates. At times, this was for legitimate reasons, such as when no house number or land parcel number (locally, ERF number) is available. The CHWs often use 'Sa number not known' in this circumstance (and type it variably). Because of elements like these, or due to typos, it is difficult to accurately search for existing documents before creating new ones (search is also not especially accurate). We needed some way of identifying the gold/canonical document and marking it as canonical, so the UI should not alert about other duplicate siblings. By marking certain documents as canonical, we will be able to make decisions about updating/deleting/merging duplicates on the backend. By evaluating the current form values in the We can express this in the same structure as in the
In the above example, we're saying that if the current form's We haven't thought through all cases, but this structure may well be able to handle the inter-field deps @jkuester is thinking about.
Default checks
Possible default strategies per field.
Contact
Below is a proposal of the default strategies per field, using the
Report
Thinking about @garethbowen's suggestions around default checks, with this language/system, consider the following default for
Strategies
Suggested strategies:
In a form, suppose there's a check for something like an For something like dwelling names, eg. "123 Stevenson way", "Stven Way 23", and "Peter Stevenson Road",
Locations
Proposed storage locations for the various pieces of code that comprise this solution.
Config
Our first suggestion for where to store config is in Because the structure is declarative, it's possible to put the structure in the That being said, we’re a bit uncertain how to load these values and make them available within
Code
Our initial proposal is to locate all the code processing the declarations in a
Early duplicate detection
We're not exactly sure how to implement this. Maybe, since the sibling request gets triggered during form initialization, we could filter the results through
UX
Default behaviour on duplicate found
It is a hard requirement for us that users can bypass the duplicate check. There may be a case for a harder check on something like reports, but that can be handled in validation, and perhaps not a default. One idea for the UI is to add a section to the Here's how the flow might look:
The benefits are:
Display duplicates
An inline list of contacts (see above screenshot) allows us to leaf through the current form values and compare against the duplicates. Modals are great for decisions that can be made quickly where all context can be presented on the modal. However, it's clear from our users that they need to be able to compare against what is already on the page when taking decisions, and a modal flow isn't appropriate for that, especially (but not only) in multi-page registration forms. With the current modal implementations, CHWs would have to dismiss it every time a different form value needs checking, since it blocks user interaction (no scrolling or clicking through). One of our suggestions for the future is to have some sort of "bottom drawer" that allows user interaction with the current form while also being able to pull up, dismiss and resize the duplicate result pane. The best of both worlds. Something like:
Duplicate items as contact cards
We like the idea of presenting duplicates as a list of "contact cards"; it is something the community might be comfortable with due to familiarity. It might be advantageous to make these cards expandable (starting off with less space), since multiple duplicates could be returned. That way, as @jkuester suggested, we could easily include various key identifiers without worrying too much about vertical space. It will allow the user to make an informed decision before navigating off the current form and losing entered data.
Modification to ListTiles
We'd like to suggest making a subtext line available on Using the example above, adding a configurable extra line of text, like showing
Endnotes
Duplicate check on edit
Short answer is yes it is. The
Re-fuzzy matching
There are many other strategies one could consider, like the Jaro-Winkler distance and N-Gram Similarity; those mentioned in the strategies section are simply what is initially necessary to meet our operational requirements. We've noticed that typos are a frequent occurrence, perhaps because we have a diverse group of users with varying levels of technical know-how. A user might also use a slightly different naming scheme every other time they go out to enumerate their assigned areas. For this reason, we need a form of evaluation that allows for 'likeness', for which
Forum threads that started this discussion:
Special thanks to @fardarter for helping to draw this up 🥇.
Thanks for this brilliant write-up! I have been working my way through it, making notes, and thinking about things. Unfortunately, I am OOO the rest of the week, so I do not have time for a full response here. But, just wanted to let you know that I will be circling back here with more detailed thoughts early next week.
The anchor tags in our response have been fixed.
Agreed! Let's put the main focus on this. I think I have two remaining functional questions here:
There is a ton of functionality that we could build here, but I really want to try and scope things to what we actually expect to be useful... If the answer to both On the other hand, if we need one or both of
Then, in your
{
  "duplicate_check": {
    "expression": "(levenshteinEq(3, current.first_name, existing.first_name) && levenshteinEq(3, current.last_name, existing.last_name)) || (current.phone === existing.phone && current.id_number === existing.id_number)"
  }
}
This approach is somewhat opaque (you have to rely on the documentation to know what references are available in the expression), but I guess any approach is going to involve needing to reference docs to understand what config is available....
The other crucial piece of configuration to decide on is the method of marking a contact as "canonical". You have proposed some declarative configuration for Once again, my goal here is to make things just flexible enough to be useful, while still keeping everything maintainable.
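For discussion, here is one way the `expression` string from the config above could be evaluated on the client. It is a sketch under assumptions rather than a proposal for the real implementation: `compileDuplicateCheck` is a hypothetical helper, `levenshteinEq(threshold, a, b)` is assumed to mean "edit distance is at most `threshold`", and `new Function` is used only because it is the simplest way to show `current`, `existing` and the helper being put in scope.

```javascript
// Dynamic-programming Levenshtein distance (two-row variant).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed semantics: true when the edit distance is at most `threshold`.
// Missing values are treated as empty strings.
const levenshteinEq = (threshold, a, b) =>
  levenshtein(String(a ?? ''), String(b ?? '')) <= threshold;

// Hypothetical helper: compile the configured expression once, exposing only
// `levenshteinEq`, `current` and `existing` as named inputs.
function compileDuplicateCheck(expression) {
  const fn = new Function(
    'levenshteinEq', 'current', 'existing',
    `return (${expression});`
  );
  return (current, existing) => Boolean(fn(levenshteinEq, current, existing));
}

const config = {
  duplicate_check: {
    expression:
      '(levenshteinEq(3, current.first_name, existing.first_name) && ' +
      'levenshteinEq(3, current.last_name, existing.last_name)) || ' +
      '(current.phone === existing.phone && current.id_number === existing.id_number)',
  },
};

const isDuplicate = compileDuplicateCheck(config.duplicate_check.expression);
const jade = { first_name: 'Jade', last_name: 'Smith', phone: '1', id_number: 'A' };
const jayde = { first_name: 'Jayde', last_name: 'Smyth', phone: '2', id_number: 'B' };
console.log(isDuplicate(jade, jayde));
```

Two things stand out when the expression is run literally. A fixed threshold of 3 is very permissive for short names, since any two strings of three characters or fewer are within distance 3. And `current.phone === existing.phone` is also true when both values are missing, so an expression like this would probably need presence checks. Evaluating configured strings this way also assumes the config is trusted.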
Default checks
I guess just automatically enabling some kind of default duplicate checking for all contacts could be seen as a breaking change. I wonder if it might not be better to focus on some way to allow folks to easily enable "best practice" dupe checks for each of their forms. For example, if we go with a
Reports/Early duplicate detection
So far I do not perceive our design discussion as headed in a direction that would prevent us from implementing this functionality in the future. I group these together here because I think the early duplicate detection functionality could be super useful when dupe checking reports. (As previously noted, the most common case of "duplicate" reports would be the same Additionally, the proposed "bottom drawer" UI could also pair well with the early duplicate detection since it would give a clean way to see the duplicate contacts while still in the middle of a form.
Display duplicates
I really appreciate your thoughts on the modal approach and agree that an in-line/drawer UI would be superior! My suggestion is that for the initial MVP we just display a list of the duplicate contacts in-line at the end of the form (with a few relevant details such as
Thank you for your response, @jkuester. We're busy working through the proposed solution and hope to have something for your consideration soon. That being said, it is the end of the year and many people are going on leave, which makes it a little tricky to make meaningful progress.
Form evolution problem
The structure and names of stored properties can vary between form versions, and this makes it difficult to guarantee field comparison between a current form and a sibling doc. We've not yet come up with a way to solve this problem without some kind of historical mapping. Consider the following sequence of First,
Imagine we have 10 records saved with the above structure. Then the form structure is modified (
To edit a doc saved in One advantage of this is that it gives us a progressive migration of data between form versions (the 10 records would be incrementally moved to Regardless, for deduplication, the loaded siblings do not benefit from this same “migration” or mapping.
Approach
We’re still in the prototyping phase, but the approach we're currently playing with is to extract the property mapping from the XML and apply it to siblings before the comparison step. We're still not sure how complex getting those values will be. Some expected challenges:
Alternative
Consider the following structure we're looking at storing with the
We don't think this is ideal as the value mapping would need to be kept in step in two different locations (.xml & .properties.json) and it adds complexity to the declaration. We're not sure of the best approach and wanted to expose the problem early. At some stage some new code will follow. We would appreciate your input on this and if you have ideas for any of the other challenges.
Other items
logic
Yes, we need the ability to do logic. Let's use the
is_canonical
It's our view that
Default checks
How about simply setting the
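As a rough illustration of the Approach described in this comment (apply a property mapping to siblings before the comparison step), a sketch could look like the following. The mapping shape, the `normalizeDoc` helper and the field names are all assumptions made for the example; how the mapping would actually be extracted from the form XML is not covered, and only flat, top-level fields are handled.

```javascript
// Hypothetical mapping from a field name used by an older form version
// to the name used by the current form version.
const fieldMapping = { name_first: 'first_name', surname: 'last_name' };

// Return a copy of `doc` with old field names renamed to current ones.
// If a doc already has the current field, that value wins and the old
// field is dropped. The input doc is not mutated.
function normalizeDoc(doc, mapping) {
  const result = { ...doc };
  for (const [oldKey, newKey] of Object.entries(mapping)) {
    if (oldKey in result) {
      if (!(newKey in result)) {
        result[newKey] = result[oldKey];
      }
      delete result[oldKey];
    }
  }
  return result;
}

// A sibling saved by the older form version.
const legacySibling = { _id: '1', name_first: 'Jade', surname: 'Smith' };
console.log(normalizeDoc(legacySibling, fieldMapping));
```

Because the normalized copy only exists in memory for the comparison, this would not migrate the stored sibling docs; it only makes their fields line up with the current form for the duplicate check.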
Is your feature request related to a problem? Please describe.
Currently, there appears to be no built-in deterrent against creating records with names similar to existing siblings.
Describe the solution you'd like
Prevent duplicate place/person creation and display possible duplicates for consideration. On record submission (through create or edit flow), we want to show the possible duplicate items to our user. They can then navigate to the possible duplicate item via a link, and proceed with record changes there, or circumvent the duplicate check & proceed with record submission.
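A minimal sketch of that submission flow, assuming a hypothetical `trySubmit` helper and a deliberately simple name comparison (the real matching rules are the subject of the discussion in this issue):

```javascript
// Illustrative matching rule only: case- and whitespace-insensitive name equality.
const sameName = (a, b) =>
  a.name.trim().toLowerCase() === b.name.trim().toLowerCase();

// Hypothetical submit guard: when possible duplicates are found, the record
// is not saved until the user has acknowledged them.
function trySubmit(draft, siblings, isDuplicate, acknowledged = false) {
  const duplicates = siblings.filter(
    (sibling) => sibling._id !== draft._id && isDuplicate(draft, sibling)
  );
  if (duplicates.length > 0 && !acknowledged) {
    return { saved: false, duplicates }; // show these to the user with links
  }
  return { saved: true, duplicates };
}

const existing = [{ _id: 'a', name: 'Jade Smith' }];
const first = trySubmit({ name: 'jade smith ' }, existing, sameName);
const second = trySubmit({ name: 'jade smith ' }, existing, sameName, true);
console.log(first.saved, second.saved);
```

The first attempt is held back and returns the possible duplicates; the second attempt, made after the user has acknowledged them, goes through.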
Describe alternatives you've considered
Despite improving our search functionality, and training the CHWs to use it, usage of the search feature before creating new records remains low, leading to frequent duplicates.
Additional context
We noticed that our CHWs either forget they've previously captured items or miss previously captured items because they were slightly mistyped. This has resulted in quite a few duplicate records being created on all user-created levels of our hierarchy. We want to fix this at the source before tasks are rolled out, to make sure no unnecessary/incorrect tasks fill up our CHWs' worklists. This will naturally also improve the accuracy of our data for reporting purposes.
We have a working prototype that we will soon upstream, which can produce the following:
Please see the following related discussions for more info:
https://forum.communityhealthtoolkit.org/t/mitigate-duplicate-data-capture/3313
#6363
As a "damage control" step, as discussed with the Medic team, we plan to use Databricks (our tool primarily responsible for pulling CouchDB data into our community database) to also push "flags" to potential duplicate items. That will, in turn, cause tasks to trigger on the app. CHWs are then expected to confirm/deny possible duplicates and determine what should happen to the record (delete, merge, other).