generic QC function design #353
Mulled this over a bit and realized step one is a fundamental decision: should the API be procedural or object-oriented? My prototype is procedural. It can be described by the following three generic function signatures:
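As a purely hypothetical sketch of what three such generic functions could look like: the names (`compute_qcmetrics`, `save_qcmetrics`, `update_qcmetrics`), their arguments, and the plain dicts standing in for a Metadata container and the database are all assumptions here, not the actual prototype.

```python
# Hypothetical sketch of a procedural QC API.  Plain dicts stand in
# for a MsPASS data object's Metadata and for the database; none of
# these names come from the actual prototype.

def compute_qcmetrics(d, metric="snr"):
    """Compute the named metric for datum d and post it to d's metadata."""
    d["metrics"][metric] = 0.0  # placeholder for a real metric computation
    return d

def save_qcmetrics(d, db, collection="qcmetric"):
    """Save the metrics posted on d to a database collection."""
    db.setdefault(collection, {})[d["id"]] = dict(d["metrics"])
    return d

def update_qcmetrics(d, db, collection="qcmetric"):
    """Update a previously saved metrics document for d."""
    db.setdefault(collection, {}).setdefault(d["id"], {}).update(d["metrics"])
    return d
```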
I have no prototype for the object-oriented design; I only realized it was an option when I started thinking about how this might be made generic. The fact that I could write the procedural API as 3 generic functions screams to me for an abstract base class with a set of abstract (virtual) methods all concrete implementations would need to implement:
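A hypothetical sketch of what such a base class could look like, assuming an illustrative name `QCMetricBase` and a compute/save/update method set mirroring the procedural API above; none of this is the actual MsPASS design:

```python
from abc import ABC, abstractmethod

class QCMetricBase(ABC):
    """Hypothetical abstract base for QC metric calculators.

    Every concrete algorithm would implement this minimal interface;
    algorithm-specific parameters belong in the subclass constructor.
    """

    @abstractmethod
    def compute(self, d):
        """Compute the metric(s) for datum d and post them to its Metadata."""

    @abstractmethod
    def save(self, d, db):
        """Save the computed metric(s) for d to the database."""

    @abstractmethod
    def update(self, d, db):
        """Update previously saved metric(s) for d."""
```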
where I omitted some of the complexity noted above for the save and update methods, assuming that would be handled in the constructor of a concrete class. In this model a concrete implementation would extend the base class, like this skeleton example for the snr algorithm I'm currently using as a prototype:
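A hypothetical sketch of such a skeleton; the class name, the window parameters, and the minimal base class (repeated so the sketch is self-contained) are all illustrative assumptions, not the actual prototype code:

```python
from abc import ABC, abstractmethod

# Minimal base repeated here so the sketch is self-contained;
# all names and parameters below are illustrative assumptions.
class QCMetricBase(ABC):
    @abstractmethod
    def compute(self, d): ...
    @abstractmethod
    def save(self, d, db): ...
    @abstractmethod
    def update(self, d, db): ...

class SnrQC(QCMetricBase):
    """Skeleton snr implementation; parameters load in the constructor."""

    def __init__(self, noise_window=(-120.0, -5.0), signal_window=(-2.0, 60.0)):
        # The key design point:  all algorithm parameters are set here,
        # so compute() needs only the datum itself.
        self.noise_window = noise_window
        self.signal_window = signal_window

    def compute(self, d):
        d["snr"] = 0.0  # placeholder for the real snr calculation
        return d

    def save(self, d, db):
        db.setdefault("qcmetric", {})[d["id"]] = {"snr": d["snr"]}
        return d

    def update(self, d, db):
        db.setdefault("qcmetric", {}).setdefault(d["id"], {})["snr"] = d["snr"]
        return d
```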
which really only shows the point that a concrete class will load its parameters in its constructor; a full implementation would have other methods as well. So the key question this discussion should first decide is this: if we want to create a generic QC API, do we make it procedural or object-oriented?
@wangyinz and I started this discussion offline, but I think it is worth opening this discussion page to preserve some of it for the record.
Quality control is a generic, first-order issue all users of modern broadband, continuous data will face. It is certain to be a prime use of MsPASS, as good quality control filters are an essential way to reduce terabytes of data to a manageable size. There are many reasons for this, but the fundamental one is that records of earthquakes, unlike seismic reflection data, are wildly variable due to the order-of-magnitude variations in amplitude we use to determine earthquake magnitude. The smallest detectable events will always dominate any processing workflow because of the magnitude-frequency relationship. Thus all workflows will need some step to "separate the wheat from the chaff," to use an old cliche.
How to separate the good from the bad is particularly challenging because the input can be expected to be the largest volume of data the workflow must handle. In our offline discussion we concluded there were two fundamentally different ways to handle this problem:

1. Compute QC metrics and save them to the database; the decision of what to keep is made later from the stored metrics.
2. Mark bad data dead immediately; the `filter` method of a dask bag or spark RDD is then used downstream to throw away the debris. A variant of this is the same concept used in the Undertaker module, which currently works only for ensembles: prior to running `filter` we could send the bag/rdd to an Undertaker who would bury the dead. (A cute programming joke, but memorable.)

There are merits to both approaches, and I tend to think we need generic implementations of both, including the Undertaker variant of 2. The main advantage of the first is for exploratory tests on a data subset to sort out what QC parameters to use to clean up a dataset for the application. With the metrics stored in the database it would be straightforward to do QC graphics or prototype tests on a subset of the data to decide on a production set of filter parameters. Then the approach in 2 could be applied to the full data set. At least that is how I see it for now.
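The two approaches can be sketched with plain Python lists standing in for a dask bag or spark RDD; the field names, the boolean live flag on each dict, and the snr threshold of 3.0 are illustrative assumptions (in a real workflow approach 2 would be something like `bag.filter(lambda d: d.live)` on a bag of data objects):

```python
# Toy dataset:  plain dicts stand in for data objects in a dask bag
# or spark RDD; all field names and the threshold are assumptions.
data = [
    {"id": 1, "snr": 8.0},
    {"id": 2, "snr": 1.5},   # below threshold:  should be discarded
    {"id": 3, "snr": 12.0},
]

# Approach 1:  metrics are stored (here just kept on the objects) and
# the keep/discard decision is made later from the saved metrics.
keepers = [d["id"] for d in data if d["snr"] >= 3.0]

# Approach 2:  mark the dead immediately, then filter downstream
# (the analogue of bag.filter on a dask bag) to throw away the debris.
for d in data:
    d["live"] = d["snr"] >= 3.0
survivors = list(filter(lambda d: d["live"], data))
```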
I'm prototyping an example right now with the snr module. When I get the new revisions working, I'll post the approach I used there, and we can hopefully use it as a springboard for designing a more generic set of functions to handle these issues. For now, please comment on this and add any other ideas that might be useful.