Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce chunksize when importing Applications, LinkedApplications, RemoteApps #33767

Merged
merged 2 commits into from
Dec 13, 2023

Conversation

minhaminha
Copy link
Contributor

@minhaminha minhaminha commented Nov 16, 2023

Technical Summary

While using the command load_domain_data, a customer ran into a HTTPError: 413 Client Error: Request Entity Too Large for url error, likely because it was trying to save a few individual docs within the chunk that were too big. This PR reduces the chunksize requirement when attempting to import doc_types Application, LinkedApplication and RemoteApps.

Safety Assurance

Safety story

This would only affect how many docs get saved to couch at once, so it's pretty safe.

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@minhaminha minhaminha added the product/invisible Change has no end-user visible impact label Nov 16, 2023
@minhaminha
Copy link
Contributor Author

Would it be ok to squash the first 4 commits into one?

if doc_type not in self._dbs:
couch_db = get_db_by_doc_type(doc_type)
if couch_db is None:
raise DocumentClassNotFound('No Document class with name "{}" could be found.'.format(doc_type))
callback = LoaderCallback(self._success_counter, self.stdout)
db = IterDB(couch_db, new_edits=False, callback=callback)
db = IterDB(couch_db, new_edits=False, callback=callback, chunksize=chunksize)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guessing this is used specifically for applications - they're MUCH much bigger than anything else. Another approach to consider is to pass chunksize=1 if it's an app (as determined by doc_type). Then this doesn't need to be controlled by the user, and everything else can still be loaded in chunks for speed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Ethan, although I'm conflicted.

It is likely although not guaranteed that one or a very few applications are the culprits when the default chunk size is a problem. It would be nice to automatically fall back to a small chunk size automatically and temporarily on an as-needed basis within the loader loop. Is it possible to detect when the chunk size is too large by catching a specific error from Couch or something like that? Or does a too-large chunk cause a too-generic-to-catch error like MemoryError or something along those lines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the screenshots the customer sent, the error they get from couch is:
requests.exceptions.HTTPError: Problem loading data <zip file>: 413 Client Error: Request Entity Too Large for url
which I guess it's something we could catch for. So once caught, should it try saving increasingly smaller portions of the chunk its trying to save/delete (halving the list until it reaches 1 item for example)? Is there a way for myself or QA to test something like this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how best to test it, but we've certainly seen this before for apps. It's not super elegant, but IMO saving all apps with a chunksize of 1 would be sufficient to solve the issue and not overly hinder the performance.

It'd be really cool if IterDB could be retooled to keep track of the chunk size in bytes and commit automatically based on size rather than doc count. That sounds like maybe more trouble than it's worth, though.

@@ -53,7 +53,10 @@ def _get_db_for_doc_type(self, doc_type):
if couch_db is None:
raise DocumentClassNotFound('No Document class with name "{}" could be found.'.format(doc_type))
callback = LoaderCallback(self._success_counter, self.stdout)
db = IterDB(couch_db, new_edits=False, callback=callback)
chunksize = 100
if doc_type == "Application":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I think you might need to consider also LinkedApplication and RemoteApp, and possibly the -Deleted versions of those.

Copy link
Contributor Author

@minhaminha minhaminha Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for pointing those out! The -Deleted suffix is dropped prior to being passed in here but even if it wasn't the new commit would still work.

@@ -54,7 +54,7 @@ def _get_db_for_doc_type(self, doc_type):
raise DocumentClassNotFound('No Document class with name "{}" could be found.'.format(doc_type))
callback = LoaderCallback(self._success_counter, self.stdout)
chunksize = 100
if doc_type == "Application":
if 'App' in doc_type:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think of encoding the specific list of doc types instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, that'd make it pretty clear which ones we're expecting to be large. Just making sure we're only considering Application, LinkedApplication and RemoteApp, right? Is there any other models we should consider that might get too large for the chunk?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's all of the Application doc types, I believe. The next biggest models i can think of are CommCareUser and WebUser, but I doubt they are bad enough to cause issues like Application does.

@minhaminha minhaminha merged commit a8eb55c into master Dec 13, 2023
13 checks passed
@minhaminha minhaminha changed the title Add chunk size argument to load_domain_data Reduce chunksize when importing Applications, LinkedApplications, RemoteApps Dec 13, 2023
@minhaminha minhaminha deleted the ml/add-chunksize-option branch December 13, 2023 16:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
product/invisible Change has no end-user visible impact
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants