generic QC function design #353
Mulled this over a bit and realized step one is a fundamental decision: should the API be procedural or object-oriented? My prototype is procedural. It can be described by the following three generic function signatures:
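As a purely hypothetical sketch of what three such generic functions could look like: the names (`compute_qcmetrics`, `save_qcmetrics`, `update_qcmetrics`), their arguments, and the plain dicts standing in for a Metadata container and the database are all assumptions here, not the actual prototype.

```python
# Hypothetical sketch of a procedural QC API.  Plain dicts stand in
# for a MsPASS data object's Metadata and for the database; none of
# these names come from the actual prototype.

def compute_qcmetrics(d, metric="snr"):
    """Compute the named metric for datum d and post it to d's metadata."""
    d["metrics"][metric] = 0.0  # placeholder for a real metric computation
    return d

def save_qcmetrics(d, db, collection="qcmetric"):
    """Save the metrics posted on d to a database collection."""
    db.setdefault(collection, {})[d["id"]] = dict(d["metrics"])
    return d

def update_qcmetrics(d, db, collection="qcmetric"):
    """Update a previously saved metrics document for d."""
    db.setdefault(collection, {}).setdefault(d["id"], {}).update(d["metrics"])
    return d
```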
I have no prototype for the object-oriented design; I only realized it was an option when I started thinking about how this might be made generic. The fact that I could write the procedural API as 3 generic functions screams to me for an abstract base class with a set of abstract (virtual) methods all concrete implementations would need to implement:
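A hypothetical sketch of what such a base class could look like, assuming an illustrative name `QCMetricBase` and a compute/save/update method set mirroring the procedural API above; none of this is the actual MsPASS design:

```python
from abc import ABC, abstractmethod

class QCMetricBase(ABC):
    """Hypothetical abstract base for QC metric calculators.

    Every concrete algorithm would implement this minimal interface;
    algorithm-specific parameters belong in the subclass constructor.
    """

    @abstractmethod
    def compute(self, d):
        """Compute the metric(s) for datum d and post them to its Metadata."""

    @abstractmethod
    def save(self, d, db):
        """Save the computed metric(s) for d to the database."""

    @abstractmethod
    def update(self, d, db):
        """Update previously saved metric(s) for d."""
```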
where I omitted some of the complexity noted above for the save and update methods, assuming that would be handled in the constructor of a concrete class. In this model a concrete implementation would extend the base class, like this skeleton example for the snr algorithm I'm currently using as a prototype:
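A hypothetical sketch of such a skeleton; the class name, the window parameters, and the minimal base class (repeated so the sketch is self-contained) are all illustrative assumptions, not the actual prototype code:

```python
from abc import ABC, abstractmethod

# Minimal base repeated here so the sketch is self-contained;
# all names and parameters below are illustrative assumptions.
class QCMetricBase(ABC):
    @abstractmethod
    def compute(self, d): ...
    @abstractmethod
    def save(self, d, db): ...
    @abstractmethod
    def update(self, d, db): ...

class SnrQC(QCMetricBase):
    """Skeleton snr implementation; parameters load in the constructor."""

    def __init__(self, noise_window=(-120.0, -5.0), signal_window=(-2.0, 60.0)):
        # The key design point:  all algorithm parameters are set here,
        # so compute() needs only the datum itself.
        self.noise_window = noise_window
        self.signal_window = signal_window

    def compute(self, d):
        d["snr"] = 0.0  # placeholder for the real snr calculation
        return d

    def save(self, d, db):
        db.setdefault("qcmetric", {})[d["id"]] = {"snr": d["snr"]}
        return d

    def update(self, d, db):
        db.setdefault("qcmetric", {}).setdefault(d["id"], {})["snr"] = d["snr"]
        return d
```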
which really only shows the point that a concrete class will load its parameters in its constructor; a full implementation would have other methods as well. So the key question this discussion should first decide is this: if we want to create a generic QC API, do we make it procedural or object-oriented?
@wangyinz and I started this discussion offline, but I think it is worth opening this discussion page to preserve some of it for the record.
Quality control is a generic, first-order issue all users of modern broadband, continuous data will face. It is certain to be a prime use of MsPASS, as good quality control filters are an essential way to reduce terabytes of data to a manageable size. There are many reasons for this, but the fundamental one is that records of earthquakes, unlike seismic reflection data, are wildly variable due to the order-of-magnitude variations in amplitude we use to determine earthquake magnitude. The smallest detectable events will always dominate any processing workflow because of the magnitude-frequency relationship. Thus all workflows will need some step to "separate the wheat from the chaff," to use an old cliche.
How to separate the good from the bad is particularly challenging because the input can be expected to be the largest volume of data the workflow must handle. In our offline discussion we concluded there were two fundamentally different ways to handle this problem:

1. Compute QC metrics and save them to the database; the decision of what to keep is made later from the stored metrics.
2. Mark bad data dead immediately; the `filter` method of a dask bag or spark RDD is then used downstream to throw away the debris. A variant of this is the same concept used in the Undertaker module, which currently works only for ensembles: prior to running `filter` we could send the bag/rdd to an Undertaker who would bury the dead. (A cute programming joke, but memorable.)

There are merits to both approaches, and I tend to think we need generic implementations of both, including the Undertaker variant of 2. The main advantage of the first is for exploratory tests on a data subset to sort out what QC parameters to use to clean up a dataset for the application. With the metrics stored in the database it would be straightforward to do QC graphics or prototype tests on a subset of the data to decide on a production set of filter parameters. Then the approach in 2 could be applied to the full data set. At least that is how I see it for now.
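The two approaches can be sketched with plain Python lists standing in for a dask bag or spark RDD; the field names, the boolean live flag on each dict, and the snr threshold of 3.0 are illustrative assumptions (in a real workflow approach 2 would be something like `bag.filter(lambda d: d.live)` on a bag of data objects):

```python
# Toy dataset:  plain dicts stand in for data objects in a dask bag
# or spark RDD; all field names and the threshold are assumptions.
data = [
    {"id": 1, "snr": 8.0},
    {"id": 2, "snr": 1.5},   # below threshold:  should be discarded
    {"id": 3, "snr": 12.0},
]

# Approach 1:  metrics are stored (here just kept on the objects) and
# the keep/discard decision is made later from the saved metrics.
keepers = [d["id"] for d in data if d["snr"] >= 3.0]

# Approach 2:  mark the dead immediately, then filter downstream
# (the analogue of bag.filter on a dask bag) to throw away the debris.
for d in data:
    d["live"] = d["snr"] >= 3.0
survivors = list(filter(lambda d: d["live"], data))
```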
I'm prototyping an example right now with the snr module. When I get the new revisions working, I'll post the approach I used there, and we can hopefully use it as a springboard for designing a more generic set of functions to handle these issues. For now, please comment on this and add any other ideas that might be useful.