Avoid processing info in item IDs #1189
Conversation
The processing information (dates/versions) is also in the native metadata for the observation. This has been a contested subject for a while. This could be a good solution.
Co-authored-by: Pete Gadomski <[email protected]>
CI is passing now.
This is good input for the ID best practices. LGTM.
I don't fully agree with this one. If you use the version extension, then you will need the processing timestamp in the ID, as you'll need two distinct Items that you can link between. Storing them under the same ID would conflict with the unique ID constraint. This should be made clearer in the description; the proposed solution of using the version extension with the same IDs doesn't work with the unique ID best practice.
It all depends on the catalog implementation. With a static catalog, you can still use the same ID but with a different path that includes the version. The reference in the collection will be the "latest" version with the unique ID. Then in the item, you link to the previous version, still with the same ID but at a different path that includes the version. In STAC API, this is even simpler using the version API extension. In my understanding, the main concept is that an item is always unique regardless of its version.
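To make that concrete, here is a minimal sketch of the static-catalog layout being described. The Landsat-style ID, the `v1/` sub-path, and the trimmed-down fields are all illustrative, not something defined by this PR. The collection links to this "latest" file at the canonical path, and the file points back to the older copy, which lives under a versioned sub-path but carries the same `id`:

```json
{
  "type": "Feature",
  "stac_version": "1.0.0",
  "id": "LC08_L2SP_047027_20201204",
  "links": [
    { "rel": "collection", "href": "../collection.json", "type": "application/json" },
    { "rel": "predecessor-version", "href": "./v1/LC08_L2SP_047027_20201204.json", "type": "application/geo+json" }
  ]
}
```

Whether two files sharing one `id` like this satisfies (or sidesteps) the uniqueness requirement is exactly the point debated below.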
Yes, but the spec says:
and the best practice adds:
That's what is written in the spec, just a paragraph above this addition. Reading the addition is then contradictory and confusing. So this should be explained better, and the proposed solution with the version extension should be clarified, or the uniqueness constraint needs to be weakened.
What is proposed does not contradict the principle of ID uniqueness. You can manage multiple versions of the same STAC Item with a unique ID but two different files. In a collection or globally, there is still a single unique STAC Item. It is then up to the implementation to manage which version to retrieve according to the link or the API. On the other hand, this is certainly not what is done de facto within space agencies. Most of them, including NASA and ESA with Landsat, the Sentinels, and many others, include the processing ID, date, or archive version in the filename.
Oh, I may have misunderstood the version extension. I thought you had two items with the same ID:

I guess going back to the thing that originally motivated this: say you have some software that generates a level-2 product from level-1 data (like sen2cor). If I run that at 8:00 and again at 9:00, the actual data assets should be byte-for-byte identical. And while the filenames might differ because they have a processing time, I'd argue that the STAC ID should not include the processing time.

It's a bit more complicated when talking about changes to the actual processing software rather than just different processing times. In that case the outputs might not be byte-for-byte identical, and so you could argue that

As a user, I (probably) want the "latest" (best) version of the assets for a particular spatio-temporal footprint. I (probably) don't want to have to think about choosing between multiple items with the same spatio-temporal footprint. And for the less typical case where you do want the "old" version, we have the version extension.
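For illustration, here is a rough sketch of how the 8:00 and 9:00 runs could instead be published as two distinct Items under the version extension, which is what the unique-ID rule pushes towards. The IDs, version labels, and hrefs are invented, and required fields such as geometry are omitted for brevity. The old run is marked `deprecated` and points forward to its replacement:

```json
[
  {
    "id": "S2B_T55HFA_20210101T000249_v1",
    "properties": { "datetime": "2021-01-01T00:02:49Z", "version": "1", "deprecated": true },
    "links": [
      { "rel": "successor-version", "href": "./S2B_T55HFA_20210101T000249_v2.json" },
      { "rel": "latest-version", "href": "./S2B_T55HFA_20210101T000249_v2.json" }
    ]
  },
  {
    "id": "S2B_T55HFA_20210101T000249_v2",
    "properties": { "datetime": "2021-01-01T00:02:49Z", "version": "2", "deprecated": false },
    "links": [
      { "rel": "predecessor-version", "href": "./S2B_T55HFA_20210101T000249_v1.json" }
    ]
  }
]
```

That keeps a full provenance chain, but it is exactly the "multiple items per footprint" situation that, as argued above, most users don't want to deal with by default.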
I agree with this as a motivating principle, and think that this could eventually be hardened into a Best Practice, i.e.: "Within a single collection, it is considered best practice to only have one non-deprecated item with a given spatio-temporal footprint." (See what I did there w/ the non-deprecated thing? More on that later.)

@m-mohr is correct that the unique-ID constraint forces us to include some sort of version information (whether it's a processing datetime, an incrementing integer, a hash, whatever) in the item ID if we want to support item versions within a single collection. Which leads to three possible solutions (as I see it):
I think 1 is fine, but I think we can do better. My proposal:
Real-world example

Currently, the USGS has The Worst solution to the problem at hand (at least for Landsat). They:
This leads to duplicate items for a given spatio-temporal footprint, where all but the latest items have 404 asset hrefs.

Under scenario 1 above (no processing datetime in item IDs), the USGS would remove the processing datetime from item IDs, and the re-processed items would have updated (presumably, more correct) assets. This is a good thing -- new searches will fetch only a single item per footprint, and that item will have "the best" data. So scenario 1 works. However, if the USGS wanted (in the future) to implement the version extension in its entirety to provide processing provenance, they couldn't -- only one item for a given spatio-temporal footprint could exist in the collection. Additionally, any "frozen" items or feature collections (e.g. part of a publication) would have their assets change, possibly in significant ways, without the knowledge of the user.

Scenario 3 (use deprecated) requires a bit more ecosystem work, but allows us to support the version extension while still providing the best user experience (search for a thing, get one item per footprint).

cc @matthewhanson, @pjhartzell, @ircwaves, and @arthurelmes (who joined me in a chat about this topic this week)
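For scenario 3, a sketch of what the client-side story could look like, assuming a STAC API that implements the filter extension and exposes `deprecated` as a queryable (the collection name and bbox are made up; note that the version extension treats a missing `deprecated` field as false, so a real deployment would have to decide how to handle items that omit it). A search body that returns only non-deprecated items:

```json
{
  "collections": ["landsat-c2-l2"],
  "bbox": [-122.5, 47.5, -122.0, 48.0],
  "filter-lang": "cql2-json",
  "filter": {
    "op": "=",
    "args": [ { "property": "deprecated" }, false ]
  }
}
```

The "ecosystem work" mentioned above is essentially making this the default behavior, so typical users never see deprecated items unless they ask for them.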
Thanks Pete, your proposal sounds pretty solid. I think there are some details to work out (does iterating the items in a collection include deprecated items?) but it sounds workable. It solves my main issue with processing information in the IDs today and has the advantage of not silently changing the assets referenced by an item ID (at least I think that's an advantage... I suppose it's not always clear).
Dev call:
Is there an update on the PR?
This proposes a change to the Item ID best practices, based on some experiences and conversations with folks like @gadomski.
In my experience, many upstream data providers (USGS / Landsat & MODIS, Copernicus / Sentinel) include some kind of "processing timestamp" in their IDs. They'll occasionally reprocess assets, leading to new upstream IDs with the same "acquisition" timestamp but a new "processing" timestamp (what happens to the old assets varies, but I think doesn't matter for this discussion).
It's fundamentally ambiguous whether a reprocessed item is the "same" as an existing item. But I think the best recommendation is that the new, reprocessed item / assets should replace the old item / assets. That satisfies the common case of "Give me the item at this datetime over this area". If the processing datetime is included in the item ID then a provider would either
Between the versioning and processing extensions, STAC has all the building blocks to handle this elegantly. So this PR updates the recommendation to use those instead of stuffing a processing timestamp in the item ID.
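As a rough sketch of what that recommendation points toward, here is a hypothetical Sentinel-2-style Item: the ID, values, asset href, and the exact processing-extension field names are illustrative and should be checked against the current extension releases. The acquisition-based ID stays stable across reprocessings, while the processing details move into properties and the asset filename, and the version extension carries the provenance links:

```json
{
  "type": "Feature",
  "stac_version": "1.0.0",
  "id": "S2B_MSIL2A_20210101T000249_R030_T55HFA",
  "properties": {
    "datetime": "2021-01-01T00:02:49Z",
    "processing:datetime": "2021-01-01T01:14:05Z",
    "processing:software": { "sen2cor": "2.9.0" },
    "version": "2"
  },
  "assets": {
    "product": {
      "href": "https://example.com/S2B_MSIL2A_20210101T000249_N0214_R030_T55HFA_20210101T011405.zip",
      "type": "application/zip"
    }
  },
  "links": [
    { "rel": "predecessor-version", "href": "./versions/1/S2B_MSIL2A_20210101T000249_R030_T55HFA.json" }
  ]
}
```

A reprocessing then updates the assets and bumps `version` (or marks the old copy deprecated), without ever changing the ID a user searches for.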