Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP - Reject duplicate submissions #876

Closed
wants to merge 4 commits into from
Closed

Conversation

noliveleger
Copy link
Contributor

@noliveleger noliveleger commented May 8, 2023

Description

TBC

Additional info

would supersede #859

@noliveleger noliveleger requested a review from jnm May 8, 2023 19:46
@noliveleger noliveleger force-pushed the duplicate-submissions branch 3 times, most recently from 9b39237 to 53c6320 Compare May 9, 2023 18:45
@noliveleger noliveleger force-pushed the duplicate-submissions branch from 53c6320 to 019e7bf Compare May 9, 2023 18:47
@@ -52,7 +52,7 @@ jobs:
- name: Install Python dependencies
run: pip-sync dependencies/pip/dev_requirements.txt
- name: Run pytest
run: pytest -vv -rf
run: pytest -vv -rf --disable-warnings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔

@@ -32,6 +32,7 @@ test:
POSTGRES_PASSWORD: kobo
POSTGRES_DB: kobocat_test
SERVICE_ACCOUNT_BACKEND_URL: redis://redis_cache:6379/4
GIT_LAB: "True"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe something more descriptive like SKIP_TESTS_WITH_CONCURRENCY

@@ -40,7 +41,7 @@ test:
script:
- apt-get update && apt-get install -y ghostscript gdal-bin libproj-dev gettext openjdk-11-jre
- pip install -r dependencies/pip/dev_requirements.txt
- pytest -vv -rf
- pytest -vv -rf --disable-warnings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔

Comment on lines +90 to +94
fakeredis.FakeStrictRedis(),
):
with patch(
'onadata.apps.django_digest_backends.cache.RedisCacheNonceStorage._get_cache',
fakeredis.FakeStrictRedis,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is one instantiated and the other isn't?

results[result] += 1

assert results[status.HTTP_201_CREATED] == 1
assert results[status.HTTP_409_CONFLICT] == DUPLICATE_SUBMISSIONS_COUNT - 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the OpenRosa spec allow returning a 409? and do Enketo and Collect handle a 409 properly? i can't find the code, but i remember wanting to return a 40x that wasn't 400 but being forced to use 400 only because without it Collect wouldn't display the error message i was sending. could've been Enketo, though, or i might be misremembering entirely

Comment on lines +171 to +176
# The start-time requirement protected submissions with identical responses
# from being rejected as duplicates *before* KoBoCAT had the concept of
# submission UUIDs. Nowadays, OpenRosa requires clients to send a UUID (in
# `<instanceID>`) within every submission; if the incoming XML has a UUID
# and still exactly matches an existing submission, it's certainly a
# duplicate (https://docs.opendatakit.org/openrosa-metadata/#fields).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should reject outright any submission without a UUID

Comment on lines +189 to +194
if existing_instance:
existing_instance.check_active(force=False)
# ensure we have saved the extra attachments
new_attachments = save_attachments(existing_instance, media_files)
if not new_attachments:
raise DuplicateInstanceError()
Copy link
Member

@jnm jnm Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't ConflictingXMLHashInstanceError be raised already and block us from getting here? Nevermind; I misunderstood, and that exception only has to do with concurrent processing of submissions with identical XML

Comment on lines +229 to +238
if is_postgresql:
cur = connection.cursor()
cur.execute('SELECT pg_try_advisory_lock(%s::bigint);', (int_lock,))
acquired = cur.fetchone()[0]
else:
prefix = os.getenv('KOBOCAT_REDIS_LOCK_PREFIX', 'kc-lock')
key_ = f'{prefix}:{int_lock}'
redis_lock = settings.REDIS_LOCK_CLIENT.lock(key_, timeout=60)
acquired = redis_lock.acquire(blocking=False)
yield acquired
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think support for something other than PostgreSQL is needed. We already depend on Postgres for many things.

Moot point, then, but just to say it: I think all os.getenv() for configuration (and default values) should be in the settings files

Comment on lines -544 to +581
except DuplicateInstance:
response = OpenRosaResponse(t("Duplicate submission"))
except ConflictingXMLHashInstanceError:
response = OpenRosaResponse(t('Conflict with already existing instance'))
response.status_code = 409
response['Location'] = request.build_absolute_uri(request.path)
error = response
except DuplicateInstanceError:
response = OpenRosaResponse(t('Duplicate instance'))
Copy link
Member

@jnm jnm Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A conflicting XML hash with no additional attachments is the same thing as a duplicate instance. If the XML is the same but there are additional attachments, no exception should be raised; this is normal operation. I don't think a new exception class is needed. I also don't think that DuplicateInstanceError can be reached anymore, but that's addressed by a different comment [was based on a misunderstanding]

Copy link
Member

@jnm jnm Jul 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing that confused me about this is that the verbiage is wrong. It's not a really a conflict with an existing instance, it's unwanted concurrent processing of two (or more) instances with identical XML.

I think ideally we should wait (a short amount of time) to try to acquire the lock before returning an error. I don't know what happens in Enketo now, but it's easy to imagine that a client would asynchronously send the following requests concurrently:

  1. Submission XML (hash abc123) + dog.jpg
  2. Submission XML (hash abc123) + cat.jpg
  3. Submission XML (hash abc123) + gecko.jpg

Ideally, all three requests would succeed. Let's assume request 1 arrives first and is still processing while requests 2 and 3 are received by the server: what I understand this PR would do is reject requests 2 and 3 immediately with an error code. I think we should wait (again, briefly) for request 1 to finish1 before returning an immediate rejection.

If the waiting takes too long, then we do have to reject in order to avoid sapping the worker pool with useless spinning. The message we return should be effectively "try again later", not something about a conflict with an existing instance. The HTTP code has to be dependent on the OpenRosa specification and compatible with Enketo and ODK Collect. Hopefully, those requirements are one and the same, but we'll have to test.

Footnotes

  1. Addendum: Whoops, we don't really need to spin until request 1 finishes, we just need to wait until the row has been created in logger_instance. Imagine one POST has 50 files: locking while all 50 are written to storage is not what we want. We should still avoid storing the same attachment multiple times for a submission, so within a single submission, we should have some kind of attachment uniqueness constraint (if we don't already).

Copy link

@tiritea tiritea Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's assume request 1 arrives first and is still processing while requests 2 and 3 are received by the server: what I understand this PR would do is reject requests 2 and 3 immediately with an error code.

So my interpretation of the spec would be that we would only need to (immediately!) reject if the same submission XML was subsequently received [ie duplicate hash] and it includes no attachment (!). Because a client should never be attempting to resend just the submission XML on its own more than once.

Whereas if the same submission XML hash was received - but it included an attachment (or multiple!) - then we can then decide whether to or not to reject based on whether any of those attachments have already been received (or are currently being processed). [and just throw away the XML since we know its already been processed, or is currently being processed...]

So any locking would minimally only need to occur for the duration of calculating (then checking for existing, then storing if not) the submission XML hash, right?

@@ -1,6 +1,7 @@
# coding: utf-8
import os

from fakeredis import FakeConnection, FakeStrictRedis, FakeServer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, was the support for locking without PostgreSQL exclusively for unit testing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit testing, at least on GitLab, is already using PostgreSQL, so I don't think we need to support any non-Postgres locking mechanisms for this

Copy link

@noliveleger
Copy link
Contributor Author

Closed in favor of kobotoolbox/kpi#5047

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants