Research freetext view use to inform hosting TCO reduction #9293
Comments
An additional "next step" that I think would be informative is taking a representative database and deleting the freetext views from the ddoc, calling view_cleanup, and then comparing the on disk size of the db. If it's not significant then we're barking up the wrong btree. |
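The comparison described above could be scripted roughly as follows. This is only a sketch: the base URL, database name and design doc name are placeholders passed in by the caller, not values from this thread. It relies on CouchDB's `GET /{db}/_design/{ddoc}/_info` (which reports `view_index.sizes.file`) and `POST /{db}/_view_cleanup`, and assumes Node 18+ for the global `fetch`.

```javascript
// Sketch only: baseUrl, db and ddoc are placeholders supplied by the caller.

// Percentage of disk space reclaimed between two measurements (in bytes).
function percentSaved(bytesBefore, bytesAfter) {
  if (bytesBefore <= 0) {
    throw new RangeError('bytesBefore must be positive');
  }
  return ((bytesBefore - bytesAfter) / bytesBefore) * 100;
}

// Reads the on-disk size of one design document's view index.
async function viewIndexFileSize(baseUrl, db, ddoc, headers = {}) {
  const res = await fetch(`${baseUrl}/${db}/_design/${ddoc}/_info`, { headers });
  if (!res.ok) {
    throw new Error(`_info request failed with status ${res.status}`);
  }
  const info = await res.json();
  return info.view_index.sizes.file;
}

// Asks CouchDB to delete index files that no design document references any more.
async function viewCleanup(baseUrl, db, headers = {}) {
  const res = await fetch(`${baseUrl}/${db}/_view_cleanup`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', ...headers },
  });
  if (!res.ok) {
    throw new Error(`_view_cleanup request failed with status ${res.status}`);
  }
  return res.json();
}
```

The idea would be to record `viewIndexFileSize` before removing the freetext views, let the trimmed ddoc finish indexing, call `viewCleanup`, measure again, and feed both numbers to `percentSaved`. Sizes are probably only comparable if compaction has been run the same way before each measurement.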
I've done some research over the last few days with a relatively large dataset to give us an idea of how much disk space we could save for large instances. I've generated 5.94M contacts and 840k reports with the test data generator that amounts to a 5.1GB
NB: when measuring the
mrjones sez:
In this PR I managed to shrink
NB2: The PR doesn't pass CI because some integration and e2e tests are failing. There are only a few, but they rely on the fulltext views indexing every field of every document: they query arbitrary fields and expect the values in those fields to be picked up by the index so that those documents turn up in the search. But basic searches work as expected and their respective tests pass 🟢
Out of curiosity I deleted the 3
What to do next:
|
If anyone needs to reproduce testing on a large data set, here's the steps from @m5r : Download the archive (private google drive link)
services:
  couchdb:
    image: public.ecr.aws/medic/cht-couchdb:4.9.0
    volumes:
      - ${COUCHDB_DATA:-./srv}:/opt/couchdb/data
      - cht-credentials:/opt/couchdb/etc/local.d/
    environment:
      - "COUCHDB_USER=${COUCHDB_USER:-admin}"
      - "COUCHDB_PASSWORD=${COUCHDB_PASSWORD:?COUCHDB_PASSWORD must be set}"
      - "COUCHDB_SECRET=${COUCHDB_SECRET}"
      - "COUCHDB_UUID=${COUCHDB_UUID}"
      - "SVC_NAME=${SVC_NAME:-couchdb}"
      - "COUCHDB_LOG_LEVEL=${COUCHDB_LOG_LEVEL:-error}"
    restart: always
    logging:
      driver: "local"
      options:
        max-size: "${LOG_MAX_SIZE:-50m}"
        max-file: "${LOG_MAX_FILES:-20}"
    networks:
      cht-net:

volumes:
  cht-credentials:

networks:
  cht-net:
    name: ${CHT_NETWORK:-cht-net}

and the

services:
  couchdb:
    ports:
      - "5984:5984"
      - "5986:5986"

run it with:
|
@m5r Fantastic data, thanks! I had a quick look at the PR and I think we can reduce disk space a little further by not emitting fields with
This will also be important to keep an eye on - thanks for calling it out. I know the PR is in very draft state, but there are many improvements we can make to it to improve compile time with this new approach, for example, not iterating over all fields, but only checking those that are explicitly to be indexed. So all that is to say, it might be better to measure this once we've decided this is the right approach and have made the other improvements to the view. It would also be good to check the total indexing time compared to the current freetext views because new documents will have to be indexed either way, and if this one is significantly faster then that will be another upside to consider. |
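To make the "only check the fields that are explicitly to be indexed" idea concrete, here is a minimal sketch of what the key generation could look like. The field names in `INDEXED_FIELDS`, the minimum word length, and the rule for skipping empty values are all assumptions for illustration; none of them are taken from the PR. In a real CouchDB view this logic would sit inside the map function and call `emit` for each key.

```javascript
// Hypothetical allow-list: which fields to index is still being discussed in this thread.
const INDEXED_FIELDS = ['name', 'patient_id'];

// Splits a value into lower-cased words and drops words of two characters or fewer.
function tokenize(value) {
  return String(value)
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word.length > 2);
}

// Returns the keys an allow-list freetext map function would emit for one doc.
// Only the listed fields are read, so the cost no longer grows with the doc's field count.
function freetextKeys(doc, fields = INDEXED_FIELDS) {
  const keys = new Set();
  for (const field of fields) {
    const value = doc[field];
    // Assumption: missing or empty values are skipped so they take no space in the index.
    if (value === undefined || value === null || value === '') {
      continue;
    }
    // Only strings and numbers are indexed; objects and arrays are ignored.
    if (typeof value !== 'string' && typeof value !== 'number') {
      continue;
    }
    for (const word of tokenize(value)) {
      keys.add(word);
    }
  }
  return [...keys];
}
```

With this shape, a doc such as `{ name: 'Jane Doe', patient_id: 12345, notes: 'free text' }` would only contribute keys from `name` and `patient_id`; `notes` is never looked at.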
Out of an abundance of caution, I have taken the time to drill down a bit further into the source code that is using
|
Based on the above findings that we do not have any programmatic requirements around which fields are available in the freetext search, I would like to start a more detailed design discussion regarding exactly which fields we should index. @garethbowen I would appreciate your feedback on this proposal! We should only index a field if:
Given these criteria, here are the fields that I propose indexing:
Contact docs
Report docs
Additionally, I propose that we index a new Also, for completeness I just want to include the further recommendations Gareth made above. When we re-work the freetext view code, we should:
|
Are all the |
Those are all "shortcodes" and not UUIDs. |
Perfect! thanks for the confirmation. |
I don't think this is correct - there is a very special use case which searches by
I don't know if this is even used, so we could investigate to see if we can drop it, but this is a big job. Alternatively keep
I'm not sure about indexing contact phone number. I assume we'd index the standardised format, but nobody is going to type that in, so we'll need to also standardise the input, which won't work if anyone uses space separation. At this point I would leave it out, and we can add it back in if someone misses it. Likewise I don't understand indexing the contact dob. It's skipped in the current implementation but I don't know how you would format it and how users would work out how to type it in the correct format. Also would it need to be exact or would we fuzzy match, looking for someone born in "september 2023". Can you give some examples of how this would work, what use case this would serve, and how users would discover it?
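A small sketch of the space-separation problem with phone numbers, under the assumption that the search term is split into words on whitespace before the view is queried. The phone number is made up, and `normalisePhone` is a hypothetical helper, not something that exists in the codebase.

```javascript
// Assumption: the search term is lower-cased and split on whitespace into separate words.
function searchWords(term) {
  return term.trim().toLowerCase().split(/\s+/);
}

// Hypothetical normaliser: keeps a leading "+" and strips everything that is not a digit.
function normalisePhone(input) {
  const trimmed = input.trim();
  const digits = trimmed.replace(/\D/g, '');
  return trimmed.startsWith('+') ? `+${digits}` : digits;
}

const indexedKey = '+15550100123'; // made-up number in a standardised form
const typed = '+1 555 010 0123';   // the same number as a user might type it

// Split on whitespace, none of the four words equals the indexed key.
const words = searchWords(typed);
const matchesAsTyped = words.includes(indexedKey);

// Normalising the whole input first does produce the indexed key.
const matchesNormalised = normalisePhone(typed) === indexedKey;
```

So the input would have to be recognised as a phone number and normalised as a whole before any word splitting happens, which is extra handling on top of the generic freetext flow.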
This is a nice idea, and means there's some extensibility for use cases we haven't thought of. It feels like a different issue, but if we do it in two releases it'll mean two expensive view reindexes so I appreciate you bringing it up now - maybe it's cleanest as two commits in the same release. How would app developers populate this field? Would it actually be |
😓 Thanks for pointing this out. I was even originally looking for the IMHO, the "proper" way to handle this functionality would be to rework it as a filter with a more dedicated view. However, I doubt we want to invest that kind of effort into this functionality right now. The most straightforward thing would be to just leave a special case in our freetext view code for On a related note, @garethbowen do you know what is up with the search on the reports tab via URL query param functionality? Diana seemed to think it was related to the
I agree with the formatting concerns here and do believe that the "freetext" search is not the best way to implement a "find-contact-by-phone" feature. That being said, my thinking was that the user would likely be able to see the formatted version of the phone number in the contact-summary. So, they would at least have a chance of knowing what to search for. I am fine not including phone in the index, particularly since if anyone misses it, it will give us an opportunity to perhaps give a proper solution for that workflow.
🤔 I don't think this is accurate. If you are thinking of this check for fields that end with
As with the phone number, I 100% agree that the formatting here is the main issue. At the same time, I guess it is possible to display that DoB to the user in the standard Mostly, my consideration around all these index fields was simply for the kinds of things that folks usually use as reference for finding people. Name/phone/DoB are def the most common that I know of. After reading what you wrote and thinking about this more, I think our generic contact search is probably not super useful for anything but searching by name and shortcode. More custom handling (and UI) would be needed to make searching by phone/DoB really useful or intuitive... In the meantime, if folks really want to be able to search by phone/DoB, they can always just add that data to their
Issue logged: #9378. Will raise a discussion on Slack. 👍 |
Commented in Slack but repeating here for posterity and visibility. I've traced it back to this commit (which is exactly 10 years old this Sunday 🍰 ). This was added for this issue. This was done so you could link to the reports tab and show all reports about a given patient.
I was thinking of the type check but realising now that dob will either be a number or a string - thanks for refreshing my memory.
Yes this is basically where I've landed too. |
Rebasing the PR against |
can't wait for a passing build so that we can try this branch on the copy of production data! in the meanwhile to facilitate local testing on dev instances, I've added everyone here's SSH key to the test box and I'm creating a massive To download the 33GB file:
Anyone who can SSH in to watchdog can copy this file. Here's the last ~50 chars of the SSH keys on it:
|
Thanks for the prod data and the words of caution @mrjones-plip! I pulled it down, made CouchDB index it and warm up the views, and then pushed the new freetext search views. |
Here are my notes on the "next steps" from the meeting today with @dianabarsan @m5r and @sugat009:
Context
We have collected a large amount of data regarding the impact it would have to scope the freetext views to just index a limited allow-list of fields. We know the improvements it offers in saved disk space and the costs it will require in terms of re-indexing during upgrades. We also have collected a large amount of information about how users are currently using the search functionality and unfortunately have come to the conclusion that a non-trivial percentage of users are likely to perform searches that depend on custom fields on reports and contacts being indexed. Because of this, it seems unwise to move forwards with this project as it is currently defined.
Proposed next steps
|
One more interesting data point potentially supporting a separation in implementation for offline and online search is that Couch has introduced a new Lucene-based freetext search for the server. |
Allies recently retested 3 production deployments by upgrading them all to 4.12, running compaction, upgrading them to the freetext lite branch and running compaction again. We recorded the output of
We've concluded that, while it is possible to get up to 25% of disk space back, the risk of either breaking workflows or only getting 5% of disk space back outweighs the benefit. We're hoping that CouchDB Nouveau search will yield the results we're looking for. |
Describe the issue
It's been decided that storage is the highest cost in the hosting total cost of ownership (TCO). It's suspected that CHT's use of CouchDB freetext views consumes a large amount of disk space. But how much? How might the size of the views be reduced? Where does the UI surface the use of these views?
Describe the improvement you'd like
Research and document the use of freetext views so we know how to safely reduce the disk use they cause.
Describe alternatives you've considered
na
Next Steps:
Open Questions:
CouchDB views for freetext searching
These get queried by the @medic/search shared lib:
These generated queries are then executed one level higher in shared-libs/search/src/search.js. This is then imported and run by:
Logs are VERY informative, even showing the keyword freetext! Here's couch logs when searching for bbbbbbbbbbbbbbbbbbbbbbbbbb:
As well, it shows up in HA proxy for online users when searching for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: