Reduce chunksize when importing Applications, LinkedApplications, RemoteApps #33767
Conversation
corehq/apps/dump_reload/management/commands/load_domain_data.py
Would it be ok to squash the first 4 commits into one?
if doc_type not in self._dbs:
    couch_db = get_db_by_doc_type(doc_type)
    if couch_db is None:
        raise DocumentClassNotFound('No Document class with name "{}" could be found.'.format(doc_type))
    callback = LoaderCallback(self._success_counter, self.stdout)
    db = IterDB(couch_db, new_edits=False, callback=callback)
    db = IterDB(couch_db, new_edits=False, callback=callback, chunksize=chunksize)
Guessing this is used specifically for applications - they're MUCH much bigger than anything else. Another approach to consider is to pass chunksize=1 if it's an app (as determined by doc_type). Then this doesn't need to be controlled by the user, and everything else can still be loaded in chunks for speed.
I agree with Ethan, although I'm conflicted.
It is likely, although not guaranteed, that one or a very few applications are the culprits when the default chunk size is a problem. It would be nice to fall back to a small chunk size automatically and temporarily, on an as-needed basis, within the loader loop. Is it possible to detect when the chunk size is too large by catching a specific error from Couch or something like that? Or does a too-large chunk cause a too-generic-to-catch error like MemoryError or something along those lines?
From the screenshots the customer sent, the error they get from couch is:
requests.exceptions.HTTPError: Problem loading data <zip file>: 413 Client Error: Request Entity Too Large for url
which I guess is something we could catch. So once caught, should it try saving increasingly smaller portions of the chunk it's trying to save/delete (halving the list until it reaches 1 item, for example)? Is there a way for myself or QA to test something like this?
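The halving idea floated above could look roughly like the sketch below. This is a hypothetical helper, not code from the PR: `save_chunk` stands in for the couch bulk-save call, and only the 413 `HTTPError` case from the customer's traceback is retried.

```python
import requests


def save_with_backoff(save_chunk, docs, min_size=1):
    """Try to save docs in one request; on a 413 (Request Entity Too
    Large) response, split the chunk in half and retry each half
    recursively, down to single-doc saves.

    `save_chunk` is a hypothetical callable standing in for the couch
    bulk-save; it is expected to raise requests.exceptions.HTTPError
    with a 413 response when the payload is too large.
    """
    try:
        save_chunk(docs)
    except requests.exceptions.HTTPError as err:
        # Only retry the specific "payload too large" failure
        if err.response is None or err.response.status_code != 413:
            raise
        if len(docs) <= min_size:
            raise  # a single doc is still too large; nothing left to split
        mid = len(docs) // 2
        save_with_backoff(save_chunk, docs[:mid], min_size)
        save_with_backoff(save_chunk, docs[mid:], min_size)
```

For testing, a fake `save_chunk` that raises a 413 above some doc count exercises the splitting without needing a real Couch instance.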
Not sure how best to test it, but we've certainly seen this before for apps. It's not super elegant, but IMO saving all apps with a chunksize of 1 would be sufficient to solve the issue and not overly hinder the performance.
It'd be really cool if IterDB could be retooled to keep track of the chunk size in bytes and commit automatically based on size rather than doc count. That sounds like maybe more trouble than it's worth, though.
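For what it's worth, the commit-by-bytes idea could be sketched as a small buffer like the one below. This is an illustration under assumptions, not a proposed change to IterDB itself: `commit` is a caller-supplied callable, and serialized JSON size is used as a proxy for request payload size.

```python
import json


class ByteBoundedBuffer:
    """Sketch of flushing by payload size rather than doc count.

    Docs accumulate until adding the next one would push the
    serialized size past max_bytes; the pending batch is then
    flushed via `commit` (a hypothetical callable here).
    """

    def __init__(self, commit, max_bytes=10 * 1024 * 1024):
        self.commit = commit
        self.max_bytes = max_bytes
        self.docs = []
        self.size = 0

    def add(self, doc):
        # Approximate the doc's wire size by its JSON encoding
        doc_size = len(json.dumps(doc).encode('utf-8'))
        if self.docs and self.size + doc_size > self.max_bytes:
            self.flush()
        self.docs.append(doc)
        self.size += doc_size

    def flush(self):
        if self.docs:
            self.commit(self.docs)
            self.docs = []
            self.size = 0
```

One caveat this sketch shares with the real idea: a single doc larger than the limit still goes out alone and would still 413, so it complements rather than replaces error handling.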
@@ -53,7 +53,10 @@ def _get_db_for_doc_type(self, doc_type):
    if couch_db is None:
        raise DocumentClassNotFound('No Document class with name "{}" could be found.'.format(doc_type))
    callback = LoaderCallback(self._success_counter, self.stdout)
    db = IterDB(couch_db, new_edits=False, callback=callback)
    chunksize = 100
    if doc_type == "Application":
actually I think you might need to consider also LinkedApplication and RemoteApp, and possibly the -Deleted versions of those.
thanks for pointing those out! The -Deleted suffix is dropped prior to being passed in here, but even if it wasn't, the new commit would still work.
@@ -54,7 +54,7 @@ def _get_db_for_doc_type(self, doc_type):
    raise DocumentClassNotFound('No Document class with name "{}" could be found.'.format(doc_type))
    callback = LoaderCallback(self._success_counter, self.stdout)
    chunksize = 100
    if doc_type == "Application":
    if 'App' in doc_type:
What would you think of encoding the specific list of doc types instead?
True, that'd make it pretty clear which ones we're expecting to be large. Just making sure we're only considering Application, LinkedApplication and RemoteApp, right? Are there any other models we should consider that might get too large for the chunk?
That's all of the Application doc types, I believe. The next biggest models I can think of are CommCareUser and WebUser, but I doubt they are bad enough to cause issues like Application does.
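The explicit-list approach settled on above can be sketched as follows. The set and function names are illustrative, not the merged code; the doc types are the three named in this thread, and the default/small values mirror the diff's `chunksize = 100` and the chunksize-of-1 suggestion.

```python
# Doc types known to produce oversized bulk-save payloads (from the
# discussion above); an explicit set keeps the intent obvious, unlike
# the substring check `'App' in doc_type`.
LARGE_DOC_TYPES = {"Application", "LinkedApplication", "RemoteApp"}


def chunksize_for(doc_type, default=100, small=1):
    """Return a per-doc-type chunk size for the IterDB loader sketch."""
    return small if doc_type in LARGE_DOC_TYPES else default
```

Everything else still loads in chunks of 100 for speed, while the application doc types are saved one at a time.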
Technical Summary
While using the command load_domain_data, a customer ran into an
HTTPError: 413 Client Error: Request Entity Too Large for url
error, likely because it was trying to save a few individual docs within the chunk that were too big. This PR reduces the chunksize when importing the doc_types Application, LinkedApplication and RemoteApp.
Safety Assurance
Safety story
This would only affect how many docs get saved to couch at once, so it's pretty safe.
Rollback instructions
Labels & Review