Reduce disk space with CouchDB Nouveau (TCO) #9542

Open
mrjones-plip opened this issue Oct 14, 2024 · 7 comments · May be fixed by #9541
Labels
Type: Improvement Make something better

Comments

@mrjones-plip
Contributor

mrjones-plip commented Oct 14, 2024

What feature do you want to improve?
Reduce the total disk space that CHT takes up, thus reducing the total cost of ownership (TCO).

Describe the improvement you'd like
Use Nouveau search instead of existing CouchDB search

Describe alternatives you've considered
refactoring/removing freetext views

Additional context
We should determine how viable it is to use CouchDB's Nouveau search both to improve search and to reduce the disk use of the indexes. This should include, but not be limited to:

  • how fast is it to index?
  • how hard will it be to bifurcate offline search using the old view from online search using this new search?
  • how hard will it be to package up and maintain the Java file which powers Nouveau?
  • how hard will it be to measure database disk use?
  • how hard will it be to upgrade the index?
  • how much disk savings do we see?
  • what else are we forgetting?
@m5r
Member

m5r commented Nov 7, 2024

I'll try to answer the questions asked in the original issue here, along with any others that come up along the way.

how fast is it to index?

Much, much faster than our regular couch views! I remade the contacts_by_freetext view using Nouveau, and with MoH Zanzibar's dataset it took my computer 5.5 minutes to index, using ~400MB of disk space.
In comparison, reindexing the regular contacts_by_freetext view, which means reindexing every view in medic-client, took about 2.5 hours. The dataset has roughly 2M contacts and 2.8M reports.
This comparison is not apples to apples yet but it gives us an idea of what to expect.

TODO: also index key:value pairs then measure again but with the contacts_by_type_freetext and reports_by_freetext views as nouveau indexes.
Update on this todo: I indexed all 3 views and together they take up 842MB on disk. I haven't gotten around to indexing key:value pairs yet because Nouveau seems to complain about malformed keys. I will provide an update on this at a later point.

how hard will it be to bifurcate offline search using the old view from online search using this new search

TBD

how hard will it be to package up and maintain the java file which powers Nouveau

Easy using the docker image the couchdb team publishes.

how hard will it be to measure database disk use?

It's all contained in a single docker volume, so it's easy to check manually. I don't know if APIs expose this data to measure programmatically but I will check later and report back here.
Update: the /{db}/_design/{ddoc}/_nouveau_info/{index} API does expose this data! It's pretty neat, and even more granular than the _info API for regular views: we can see how much disk space each Nouveau index takes, instead of only having the data for the ddoc as a whole.
So how hard will it be to measure disk use? Pretty easy.
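
As a rough illustration of reading that endpoint, here is a minimal JavaScript sketch (Node 18+ for the built-in fetch). The endpoint path comes from the comment above; the `search_index.disk_size` field in the response, the helper names, and the example database name `medic` are assumptions that should be checked against the CouchDB version in use.

```javascript
// Build the path of the Nouveau info endpoint for one index.
function nouveauInfoPath(db, ddoc, index) {
  return '/' + [db, '_design', ddoc, '_nouveau_info', index]
    .map((part) => encodeURIComponent(part))
    .join('/');
}

// Pull the on-disk size out of a parsed response body.
// ASSUMPTION: the body looks like { search_index: { disk_size: <bytes> } }.
function diskSizeFromInfo(info) {
  const size = info && info.search_index && info.search_index.disk_size;
  if (typeof size !== 'number') {
    throw new Error('Unexpected _nouveau_info response shape');
  }
  return size;
}

// Query CouchDB and return the index size in bytes.
// `headers` can carry an Authorization header if the server requires one.
async function fetchNouveauDiskSize(baseUrl, db, ddoc, index, headers = {}) {
  const res = await fetch(baseUrl + nouveauInfoPath(db, ddoc, index), { headers });
  if (!res.ok) {
    throw new Error('CouchDB responded with HTTP ' + res.status);
  }
  return diskSizeFromInfo(await res.json());
}

// Example call (hypothetical host and database name):
// fetchNouveauDiskSize('http://localhost:5984', 'medic', 'medic-client', 'contacts_by_freetext')
//   .then((bytes) => console.log(bytes + ' bytes'));
```

Keeping the path building and response parsing separate from the network call makes it easy to poll several indexes in a loop and sum their sizes.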

how hard will it be to upgrade the index?

First, upgrading Nouveau alongside CouchDB can currently break existing indexes. It broke during the 3.4.1 => 3.4.2 upgrade, and the CouchDB team quickly fixed the underlying problem. That said, they are committed to upgrades not forcing a Nouveau reindex and are working towards automatic view reindexing. From their Slack:

there will be a control on concurrency of rebuild, couchdb/nouveau will keep track of which indexes need rebuilding, and will then rebuild them over time, switching to the new one once complete and deleting the old.

So hopefully upgrades will be smooth when nouveau stabilizes.

And second, changing the view's code so that it indexes documents differently should be as straightforward as changing a regular couch view. It's essentially a JS function that lives in a ddoc, same as the regular views we already have.
I still need to double-check whether the view gets reindexed as soon as the ddoc is changed or on the first query.
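
For illustration only, a Nouveau index living in a ddoc might look roughly like the sketch below. The ddoc id, the indexed field, and the analyzer choice are hypothetical, and the `nouveau` ddoc key and `index(type, name, value, options)` call reflect my understanding of the CouchDB Nouveau documentation rather than anything confirmed in this thread.

```javascript
// Hypothetical index function. `index` is provided by CouchDB when the
// function runs server-side; here the function is only serialized, never called.
const indexFn = function (doc) {
  if (!doc.name) {
    return;
  }
  // ASSUMPTION: Nouveau's index(type, name, value, options) signature.
  index('text', 'name', doc.name, { store: true });
};

// Design document carrying the index function as a string,
// the same way regular view map functions are stored.
const ddoc = {
  _id: '_design/example-nouveau', // hypothetical ddoc name
  nouveau: {
    contacts_by_freetext: {
      default_analyzer: 'standard',
      index: indexFn.toString(),
    },
  },
};

console.log(JSON.stringify(ddoc, null, 2));
```

Changing the function body and saving the ddoc is then the whole "upgrade" from the author's side, which is why the workflow resembles that of regular views.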

how much disk savings do we see?

TL;DR: I saw ~25% disk savings for MoH Zanzibar.

Starting with the existing snapshot of MoH Zanzibar after compaction and views cleanup, disk usage of CouchDB was 55GB.
Removing the 3 freetext views from medic-client and re-running compaction, disk usage went down to 40GB.
After indexing those 3 freetext views with Nouveau, Nouveau's disk usage was 800MB but let's round it up to 1GB.
So we went down from 55GB to (40 + 1)GB, netting roughly 25% savings.

I haven't noticed Nouveau's disk usage going above 1GB during the 5 minutes it took to index the views but I'm planning to make changes to https://github.com/jkuester/chtoolbox/ to monitor that as well. As mentioned earlier, couch exposes an API to help with this. This will come in handy on top of our current disk monitoring functionality.

what else are we forgetting?

how important is the order value emitted in the freetext views and how can we replicate this behavior with nouveau?

var order = dead + ' ' + muted + ' ' + idx + ' ' + (doc.name && doc.name.toLowerCase());

It seems to be used to put the dead and muted contacts at the end of the search results and order the results alphabetically.
By default, Nouveau sorts results by relevance. At query time, we can pass a sort parameter that tells nouveau to sort results based on the field(s) passed in that parameter.
Since sorting relies on fields that are in the document, we will have to find a workaround to keep this working. The most obvious workaround would be to migrate every contact document to add a sorting_order field and have a transition that catches death reports (and their undos) and muting to update this field. This is not ideal; it would be better if this could be calculated in the view like it is today.
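
To make the effect of that order value concrete, here is a small sketch of how the composite string sorts. The expression is the one quoted above; how `dead`, `muted` and `idx` are derived is not shown in this thread, so the field names `date_of_death` and `muted` and the constant `idx` of 0 are assumptions for the example.

```javascript
// Same expression as the freetext views, wrapped in a function.
// ASSUMPTION: dead/muted come from doc.date_of_death and doc.muted.
function sortingOrder(doc, idx) {
  const dead = !!doc.date_of_death;
  const muted = !!doc.muted;
  return dead + ' ' + muted + ' ' + idx + ' ' + (doc.name && doc.name.toLowerCase());
}

// Plain code-unit comparison, so the result does not depend on locale.
function compareStrings(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

const contacts = [
  { name: 'Zoe' },
  { name: 'Adam', date_of_death: 1700000000000 },
  { name: 'Bob', muted: true },
  { name: 'Alice' },
];

const sortedNames = contacts
  .map((doc) => ({ name: doc.name, order: sortingOrder(doc, 0) }))
  .sort((a, b) => compareStrings(a.order, b.order))
  .map((entry) => entry.name);

console.log(sortedNames); // [ 'Alice', 'Zoe', 'Bob', 'Adam' ]
```

Because 'false' sorts before 'true', live and unmuted contacts come first, and the lowercased name breaks ties alphabetically. Any Nouveau replacement would need a sort field that preserves this ordering.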

more to come...

@mrjones-plip
Contributor Author

this is great - thanks @m5r !

One question that I didn't ask, which I've just added to the body, is:

How much disk savings do we see?

In more detail: If we start with the freetext view we have now for online users, delete it, recreate it in Nouveau, what's the savings? As well, when upgrading/recreating the index in Nouveau, how much spare ephemeral disk do we need? Do we know the percent of total disk the freetext view takes up so we can try and compare to what Nouveau will take up?

Feel free to break this into its own sub-ticket if the research seems like a rabbit hole unto itself!

@m5r
Member

m5r commented Nov 13, 2024

Great question @mrjones-plip 👀 I updated my previous comment with this week's updates and an answer to your question

@mrjones-plip
Contributor Author

Thanks for the write up @m5r !

we went down from 55GB to (40 + 1)GB, netting roughly 25% savings

To be clear, this is a 25% savings on the medic-client database, not on all couchdb + Nouveau databases, correct?

I'm planning to make changes to chtoolbox to monitor [view creation disk use] as well

yes! i love this idea.

@m5r
Member

m5r commented Nov 18, 2024

this is a 25% savings on the medic-client database, not on all couchdb + Nouveau databases, correct?

No, this is a 25% savings on all of CouchDB + Nouveau. medic-client itself went down by 74%, from exactly 16,025,080,004 bytes to 4,114,916,430 bytes on disk (roughly 16GB => 4GB, meaning the 3 freetext views accounted for ~12GB of storage).
And if you take a step back and look at CouchDB as a whole, its volume went down from 55GB to 40GB. Add to that Nouveau's volume, which went up from 0B (nonexistent before) to 1GB, and you go down from 55GB to 41GB total, hence the 25% savings.
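
The arithmetic behind both percentages can be checked with a few lines of JavaScript. The figures are the ones reported in this thread; the helper name is just for illustration.

```javascript
// Percentage of disk space saved when going from `before` to `after`.
function savingsPercent(before, after) {
  return ((before - after) / before) * 100;
}

// medic-client alone, in bytes.
const medicClientSavings = savingsPercent(16025080004, 4114916430);

// CouchDB volume plus Nouveau volume, in GB: 55 before, 40 + 1 after.
const overallSavings = savingsPercent(55, 40 + 1);

console.log('medic-client: ' + medicClientSavings.toFixed(1) + '%'); // medic-client: 74.3%
console.log('overall: ' + overallSavings.toFixed(1) + '%'); // overall: 25.5%
```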

@mrjones-plip
Contributor Author

wow!!! awesome news - thanks Mokhtar

m5r linked a pull request that will close this issue (5 tasks), Dec 16, 2024
mrjones-plip added the "Type: Improvement (Make something better)" label and removed the "Type: Technical issue (Improve something that users won't notice)" label, Dec 18, 2024
mrjones-plip changed the title from "Research CouchDB Nouveau search as way to reduce disk space (TCO)" to "Rreduce disk space with CouchDB Nouveau (TCO)", Dec 18, 2024
mrjones-plip changed the title from "Rreduce disk space with CouchDB Nouveau (TCO)" to "Reduce disk space with CouchDB Nouveau (TCO)", Dec 18, 2024
andrablaj moved this to In Progress in CHT Stewardship, Dec 19, 2024
m5r removed their assignment, Dec 24, 2024
@m5r
Member

m5r commented Dec 24, 2024

Aside from the sub-issues already assigned to this issue, we're also dependent on apache/couchdb#5354 being released

Projects
Status: In Progress
Status: This Week's commitments
4 participants