-
-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP - Reject duplicate submissions #876
Conversation
9b39237
to
53c6320
Compare
53c6320
to
019e7bf
Compare
@@ -52,7 +52,7 @@ jobs: | |||
- name: Install Python dependencies | |||
run: pip-sync dependencies/pip/dev_requirements.txt | |||
- name: Run pytest | |||
run: pytest -vv -rf | |||
run: pytest -vv -rf --disable-warnings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔
@@ -32,6 +32,7 @@ test: | |||
POSTGRES_PASSWORD: kobo | |||
POSTGRES_DB: kobocat_test | |||
SERVICE_ACCOUNT_BACKEND_URL: redis://redis_cache:6379/4 | |||
GIT_LAB: "True" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe something more descriptive like SKIP_TESTS_WITH_CONCURRENCY
@@ -40,7 +41,7 @@ test: | |||
script: | |||
- apt-get update && apt-get install -y ghostscript gdal-bin libproj-dev gettext openjdk-11-jre | |||
- pip install -r dependencies/pip/dev_requirements.txt | |||
- pytest -vv -rf | |||
- pytest -vv -rf --disable-warnings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔
fakeredis.FakeStrictRedis(), | ||
): | ||
with patch( | ||
'onadata.apps.django_digest_backends.cache.RedisCacheNonceStorage._get_cache', | ||
fakeredis.FakeStrictRedis, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is one instantiated and the other isn't?
results[result] += 1 | ||
|
||
assert results[status.HTTP_201_CREATED] == 1 | ||
assert results[status.HTTP_409_CONFLICT] == DUPLICATE_SUBMISSIONS_COUNT - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the OpenRosa spec allow returning a 409? and do Enketo and Collect handle a 409 properly? i can't find the code, but i remember wanting to return a 40x that wasn't 400 but being forced to use 400 only because without it Collect wouldn't display the error message i was sending. could've been Enketo, though, or i might be misremembering entirely
# The start-time requirement protected submissions with identical responses | ||
# from being rejected as duplicates *before* KoBoCAT had the concept of | ||
# submission UUIDs. Nowadays, OpenRosa requires clients to send a UUID (in | ||
# `<instanceID>`) within every submission; if the incoming XML has a UUID | ||
# and still exactly matches an existing submission, it's certainly a | ||
# duplicate (https://docs.opendatakit.org/openrosa-metadata/#fields). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should reject outright any submission without a UUID
if existing_instance: | ||
existing_instance.check_active(force=False) | ||
# ensure we have saved the extra attachments | ||
new_attachments = save_attachments(existing_instance, media_files) | ||
if not new_attachments: | ||
raise DuplicateInstanceError() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't Nevermind; I misunderstood, and that exception only has to do with concurrent processing of submissions with identical XMLConflictingXMLHashInstanceError
be raised already and block us from getting here?
if is_postgresql: | ||
cur = connection.cursor() | ||
cur.execute('SELECT pg_try_advisory_lock(%s::bigint);', (int_lock,)) | ||
acquired = cur.fetchone()[0] | ||
else: | ||
prefix = os.getenv('KOBOCAT_REDIS_LOCK_PREFIX', 'kc-lock') | ||
key_ = f'{prefix}:{int_lock}' | ||
redis_lock = settings.REDIS_LOCK_CLIENT.lock(key_, timeout=60) | ||
acquired = redis_lock.acquire(blocking=False) | ||
yield acquired |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think support for something other than PostgreSQL is needed. We already depend on Postgres for many things.
Moot point, then, but just to say it: I think all os.getenv()
for configuration (and default values) should be in the settings files
except DuplicateInstance: | ||
response = OpenRosaResponse(t("Duplicate submission")) | ||
except ConflictingXMLHashInstanceError: | ||
response = OpenRosaResponse(t('Conflict with already existing instance')) | ||
response.status_code = 409 | ||
response['Location'] = request.build_absolute_uri(request.path) | ||
error = response | ||
except DuplicateInstanceError: | ||
response = OpenRosaResponse(t('Duplicate instance')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A conflicting XML hash with no additional attachments is the same thing as a duplicate instance. If the XML is the same but there are additional attachments, no exception should be raised; this is normal operation. I don't think a new exception class is needed. I also don't think that [was based on a misunderstanding]DuplicateInstanceError
can be reached anymore, but that's addressed by a different comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing that confused me about this is that the verbiage is wrong. It's not a really a conflict with an existing instance, it's unwanted concurrent processing of two (or more) instances with identical XML.
I think ideally we should wait (a short amount of time) to try to acquire the lock before returning an error. I don't know what happens in Enketo now, but it's easy to imagine that a client would asynchronously send the following requests concurrently:
- Submission XML (hash abc123) + dog.jpg
- Submission XML (hash abc123) + cat.jpg
- Submission XML (hash abc123) + gecko.jpg
Ideally, all three requests would succeed. Let's assume request 1 arrives first and is still processing while requests 2 and 3 are received by the server: what I understand this PR would do is reject requests 2 and 3 immediately with an error code. I think we should wait (again, briefly) for request 1 to finish1 before returning an immediate rejection.
If the waiting takes too long, then we do have to reject in order to avoid sapping the worker pool with useless spinning. The message we return should be effectively "try again later", not something about a conflict with an existing instance. The HTTP code has to be dependent on the OpenRosa specification and compatible with Enketo and ODK Collect. Hopefully, those requirements are one and the same, but we'll have to test.
Footnotes
-
Addendum: Whoops, we don't really need to spin until request 1 finishes, we just need to wait until the row has been created in
logger_instance
. Imagine onePOST
has 50 files: locking while all 50 are written to storage is not what we want. We should still avoid storing the same attachment multiple times for a submission, so within a single submission, we should have some kind of attachment uniqueness constraint (if we don't already). ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's assume request 1 arrives first and is still processing while requests 2 and 3 are received by the server: what I understand this PR would do is reject requests 2 and 3 immediately with an error code.
So my interpretation of the spec would be that we would only need to (immediately!) reject if the same submission XML was subsequently received [ie duplicate hash] and it includes no attachment (!). Because a client should never be attempting to resend just the submission XML on its own more than once.
Whereas if the same submission XML hash was received - but it included an attachment (or multiple!) - then we can then decide whether to or not to reject based on whether any of those attachments have already been received (or are currently being processed). [and just throw away the XML since we know its already been processed, or is currently being processed...]
So any locking would minimally only need to occur for the duration of calculating (then checking for existing, then storing if not) the submission XML hash, right?
@@ -1,6 +1,7 @@ | |||
# coding: utf-8 | |||
import os | |||
|
|||
from fakeredis import FakeConnection, FakeStrictRedis, FakeServer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, was the support for locking without PostgreSQL exclusively for unit testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit testing, at least on GitLab, is already using PostgreSQL, so I don't think we need to support any non-Postgres locking mechanisms for this
Closed in favor of kobotoolbox/kpi#5047 |
Description
TBC
Additional info
would supersede #859