Duplicate datasets for multiple products on the NCI #166

Open
omad opened this issue Sep 28, 2018 · 2 comments

omad commented Sep 28, 2018

During recent ingest runs (Scenes -> Albers tiles), more than 50,000 duplicate tiles were created for the ls8_nbar_albers product in 2018.

Anyone querying this product may be returned duplicate data!

I'm currently investigating:

  1. How to safely remove the duplicates.
  2. Whether other products are also affected.
  3. What caused the bug in the first place.
omad added the bug label Sep 28, 2018
omad self-assigned this Sep 28, 2018
omad changed the title from Duplicate LS8 NBAR Albers Datasets to Duplicate LS8 NBAR Albers 2018 Datasets Sep 28, 2018

omad commented Sep 28, 2018

There are quite a few products with large numbers of duplicate datasets (2018 only):

| prod_name | actual | num_dupes | total |
| --- | ---: | ---: | ---: |
| s2a_level1c_granule | 42568 | 1298 | 43866 |
| s2b_level1c_granule | 45750 | 1857 | 47607 |
| ls7_fc_albers | 45491 | 13539 | 59030 |
| ls8_fc_albers | 78389 | 4432 | 82821 |
| ls8_nbar_albers | 73076 | 64803 | 137879 |
| ls7_nbar_albers | 42582 | 0 | 42582 |
| ls8_nbart_albers | 78389 | 48049 | 126438 |
| ls7_nbart_albers | 45491 | 0 | 45491 |
| ls8_pq_albers | 77270 | 23711 | 100981 |
| ls7_pq_albers | 43648 | 0 | 43648 |

We've also made some progress in finding the cause. Some of the AWS Lambda functions used to submit jobs on raijin were timing out at 2 minutes and being retried a couple of minutes later, resulting in two separate executions of the same job running at the same time.

As a temporary measure we've raised the timeout to 5 minutes, and we are looking into more rigorous methods to prevent this happening again.
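
One more rigorous option (a sketch only; the ingest_submission table and its columns are hypothetical, not something that exists today) would be to record every submission in a table with a uniqueness constraint, so that a retried Lambda invocation cannot enqueue the same job a second time:

```sql
-- Hypothetical submission log: the unique constraint turns a retried
-- submission of the same product/time-range into a no-op rather than a
-- second, concurrent job.
CREATE TABLE IF NOT EXISTS ingest_submission (
    product    text        NOT NULL,
    time_range tstzrange   NOT NULL,
    submitted  timestamptz NOT NULL DEFAULT now(),
    UNIQUE (product, time_range)
);

-- A retried invocation hits the constraint and inserts nothing;
-- only submit the job if this statement actually returned a row.
INSERT INTO ingest_submission (product, time_range)
VALUES ('ls8_nbar_albers', tstzrange('2018-01-01', '2018-12-31'))
ON CONFLICT DO NOTHING
RETURNING product;
```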

The cause of the s2a and s2b duplicates is unrelated and will need to be addressed separately.

SQL to query a single product:

```sql
SELECT
       COUNT(*) filter (where row_number = 1) as should_exist,
       COUNT(*) filter (where row_number > 1) as num_dupes,
       COUNT(*) as total
FROM (select row_number() over (partition by lat, lon, time ORDER BY metadata_doc ->> 'creation_dt') row_number,
             lat,
             lon,
             time,
             metadata_doc ->> 'creation_dt' as creation_dt,
             id
      from dv_ls8_pq_albers_dataset
      WHERE tstzrange('2018-01-01', '2018-12-31') && time
      ) t;
```
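
The same window function can be turned around to list the ids of the surplus datasets. This is only a sketch: it assumes the same per-product view as above, and that keeping the earliest creation_dt in each (lat, lon, time) cell is the right choice.

```sql
-- Sketch: list the ids of the surplus rows (everything after the first
-- per (lat, lon, time) cell, ordered by creation_dt) for one product.
SELECT id
FROM (select row_number() over (partition by lat, lon, time ORDER BY metadata_doc ->> 'creation_dt') row_number,
             id
      from dv_ls8_pq_albers_dataset
      WHERE tstzrange('2018-01-01', '2018-12-31') && time
      ) t
WHERE row_number > 1;
```

The ids this returns could then be archived (for example with `datacube dataset archive`) rather than deleted outright, once we're confident the ordering picks the right dataset to keep.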

Kirill888 (Contributor) commented

Ingest is not re-entrant. Running ingest a second time on the same product while the first run is still in progress will generate duplicate datasets that differ only in uuid (computed via a non-deterministic method) and creation time.

Roughly, an ingest run:

  1. Figures out what work is needed for a given product
  2. Generates the ingested files (with random uuids)
  3. Adds them to the index

There are no locks of any kind, and the uuid is generated at random. Deterministic uuid computation would prevent the duplicates, but not the wasted compute. Out-of-band measures are needed to ensure that ingest is not invoked concurrently.
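
One possible out-of-band guard (just a sketch, not something ingest does today) is a PostgreSQL advisory lock taken per product and held for the duration of a run; the key values below are arbitrary illustrations.

```sql
-- Sketch: take a session-level advisory lock before starting ingest.
-- (42, 123) is an arbitrary (namespace, product) key pair chosen for
-- illustration; a second run for the same product would see 'false'
-- and bail out instead of doing duplicate work.
SELECT pg_try_advisory_lock(42, 123) AS got_lock;

-- ... run ingest for the product only if got_lock is true ...

-- Release the lock when the run finishes (it is also released
-- automatically when the session disconnects).
SELECT pg_advisory_unlock(42, 123);
```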

omad changed the title from Duplicate LS8 NBAR Albers 2018 Datasets to Duplicate datasets for multiple products on the NCI Oct 12, 2018