Add "multi-scene" collecting and publishing #140

Open
pnuu opened this issue May 23, 2023 · 4 comments · May be fixed by #142

@pnuu
Member

pnuu commented May 23, 2023

To create multi-temporal datasets, data need to be collected and published for multiple time slots.

As an example, pytroll/satpy#2488 needs three distinct datasets:

  • files for a dataset at T-2
  • files for a dataset at T-1
  • files for the latest available dataset

The time shift between the datasets can be anything, for example 15/30/60 minutes. It can even be irregular if used for polar satellite data, or if emphasis is needed on one direction or the other.

There are other envisioned needs for this kind of collection/publishing, so the feature needs to be kept as flexible as possible.

Messages

Currently we have the following message types for publishing data:

  • file: plain JSON without nested lists or dictionaries, with everything at the "top level" of the message
    • used for individual files
  • dataset: combined metadata (start/end times, platform, and such) at the top level, and a list named dataset of dictionaries having URI and UID of individual files
    • used for geostationary segments
  • collection: same as above, but there is a list named collection with dictionaries of individual start/end times and datasets
    • used for multi-segment multi-time data, such as granulated VIIRS SDR swaths
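
As a rough illustration of the three shapes listed above, the payloads could look like the plain dictionaries below. This is only a sketch: the list names `dataset` and `collection` come from the description, while every other key (`uri`, `uid`, `start_time`, `end_time`, `platform_name`), all values, and the helper `iter_uris` are assumptions made for the example.

```python
# Illustrative payloads only. Apart from the list names "dataset" and
# "collection", all keys and values here are assumed for the example.
file_msg = {
    "platform_name": "Meteosat-11",
    "start_time": "2023-05-23T12:00:00",
    "uri": "/data/segment_1",
    "uid": "segment_1",
}

dataset_msg = {
    "platform_name": "Meteosat-11",
    "start_time": "2023-05-23T12:00:00",
    "end_time": "2023-05-23T12:15:00",
    "dataset": [
        {"uri": "/data/segment_1", "uid": "segment_1"},
        {"uri": "/data/segment_2", "uid": "segment_2"},
    ],
}

collection_msg = {
    "platform_name": "NOAA-20",
    "collection": [
        {
            "start_time": "2023-05-23T12:00:00",
            "end_time": "2023-05-23T12:01:25",
            "dataset": [{"uri": "/data/granule_1", "uid": "granule_1"}],
        },
        {
            "start_time": "2023-05-23T12:01:25",
            "end_time": "2023-05-23T12:02:50",
            "dataset": [{"uri": "/data/granule_2", "uid": "granule_2"}],
        },
    ],
}


def iter_uris(msg_type, data):
    """Yield every file URI in a payload of the given message type."""
    if msg_type == "file":
        yield data["uri"]
    elif msg_type == "dataset":
        for item in data["dataset"]:
            yield item["uri"]
    elif msg_type == "collection":
        for part in data["collection"]:
            for item in part["dataset"]:
                yield item["uri"]
    else:
        raise ValueError(f"Unhandled message type: {msg_type}")
```

Each level only adds one more layer of nesting around the same URI/UID dictionaries, which is why one more level is a natural candidate for multi-temporal data.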

The collection message type could be used for the multi-temporal data described here, but how would that be distinguished from the existing usage? Should there be a new message type like library (file -> dataset -> collection -> library 😜) or something, with a list named library holding collections with datasets inside?
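
To make the nesting idea concrete, here is a minimal sketch of such a wrapper. Nothing here is decided: the function `build_library`, the default list name `library`, the oldest-first ordering and the `start_time` key are all assumptions for illustration.

```python
def build_library(collections, list_name="library"):
    """Wrap several collection payloads in one nested payload.

    Hypothetical sketch: each item is assumed to be a dictionary with a
    non-empty "collection" list whose entries carry an ISO 8601
    "start_time" string, so plain string sorting orders them by time.
    """
    ordered = sorted(collections, key=lambda c: c["collection"][0]["start_time"])
    return {list_name: ordered}
```

The list name is a parameter because the naming of the new message type is still open.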

Configuration

This is the first crude idea of how to configure which data are published together. The publishing would be triggered after each data collection has terminated.

published_slots:
  - {min_age: 0, max_age: 0}
  - {min_age: 60, max_age: 65}
  - {min_age: 120, max_age: 125}

The min/max ages are relative to the start time of the currently completed collection. Having only the 0/0 combination would equal the current behaviour of publishing the latest completed set. Not all the criteria will always be met: just after a restart, for example, we might not have the earlier slots collected.
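
A minimal sketch of how the age criteria could be evaluated is shown below. It assumes the ages are in minutes, that slot start times are available as `datetime` objects, and that nothing is selected when any criterion is unmet; the function name `select_slots` and that fallback are assumptions, not settled behaviour.

```python
from datetime import datetime, timedelta

# Same values as in the configuration idea above; minutes are assumed.
PUBLISHED_SLOTS = [
    {"min_age": 0, "max_age": 0},
    {"min_age": 60, "max_age": 65},
    {"min_age": 120, "max_age": 125},
]


def select_slots(slot_times, latest_start, published_slots=PUBLISHED_SLOTS):
    """Return one slot start time per criterion, or None if any is unmet.

    Ages are measured backwards from the start time of the collection
    that just completed.
    """
    selected = []
    for criterion in published_slots:
        min_age = timedelta(minutes=criterion["min_age"])
        max_age = timedelta(minutes=criterion["max_age"])
        matches = [t for t in slot_times if min_age <= latest_start - t <= max_age]
        if not matches:
            # For example just after a restart, when earlier slots are missing
            return None
        # If several slots fit, take the youngest one
        selected.append(max(matches))
    return selected
```

With only the `{min_age: 0, max_age: 0}` entry, this reduces to selecting the slot that just completed.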

Internals

Currently the completed Slots are deleted. We need to add a new check that looks at the published_slots config (and timeliness?) to determine which slots are not needed anymore. As the keys in the self.slots dictionary are the nominal or start time (possibly rounded, depending on config) of the slot as a string, comparison is quite easy.
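
The clean-up check could be sketched as follows. A slot younger than the largest `max_age` may still be needed by a later publication, so only slots older than that are reported. The key format, the minute unit and the function name `expired_slot_keys` are assumptions, and timeliness is not considered here.

```python
from datetime import datetime, timedelta

# Assumed format of the string keys in the slots dictionary
KEY_FORMAT = "%Y-%m-%d %H:%M:%S"


def expired_slot_keys(slots, latest_start, published_slots):
    """Return the keys of slots that no age criterion can match any more."""
    largest_age = max(criterion["max_age"] for criterion in published_slots)
    oldest_needed = latest_start - timedelta(minutes=largest_age)
    return [
        key for key in slots
        if datetime.strptime(key, KEY_FORMAT) < oldest_needed
    ]
```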

@pnuu pnuu self-assigned this May 23, 2023
@gerritholl
Member

I didn't know there existed standardised message types with defined data structures. Is this defined/documented and/or enforced/tested anywhere?

> Should there be new message type like library (file -> dataset -> collection -> library 😜) or something that has a list named library with collections with datasets inside?

For what it's worth, in one software package I know the seven dimensions are called Library, Vitrine, Shelf, Book, Page, Row, Column :-)

On a more serious note, if we do use standardised names and a collection collects all granules or segments belonging to a single scene, then "multicollection" would be I think quite clear in its purpose.

@pnuu
Member Author

pnuu commented May 23, 2023

> I didn't know there existed standardised message types with defined data structures. Is this defined/documented and/or enforced/tested anywhere?

I doubt it's documented anywhere. I was thinking the same earlier today. But the above is most of what we have in use in posttroll-based packages. The file message type is the most common. Segment gatherer uses dataset if it receives files, and collection if it receives datasets. Geographic collector always publishes collection messages. There are some other types, at least in Trollmoves (ack, push, error, pong, err, unknown show up in a quick grep), for internal communications.

> On a more serious note, if we do use standardised names and a collection collects all granules or segments belonging to a single scene, then "multicollection" would be I think quite clear in its purpose.

I like that, the data are most likely passed to MultiScene in Satpy, so that'd match.

@mraspaud
Member

I'm thinking that the difference between collection and multicollection is not really obvious, while temporal_collection is more explicit...

@pnuu
Member Author

pnuu commented May 24, 2023

Thanks, I'll think about the naming. I've started with multicollection also for the internals, but changing that isn't too complicated.

@pnuu pnuu moved this to In Progress in PCW Spring 2023 May 26, 2023
Projects
Status: In Progress

3 participants