avoiding CGI #99

lachlancoin · 2019-03-05T00:32:37Z

I am sure you considered this, but why not use the FastQsplitter to split into batches -> write to bucket -> send location to AlignerCluster via pubsub -> AlignerCluster has a script which kicks off bwa alignment -> send location of BAM via pubsub -> dataflow proceeds?

I guess the reason is that this becomes asynchronous (i.e. a different dataflow process has to be running to split the fastq, and then another to read the BAM). Is it possible to have these two asynchronous processes running on different threads within Dataflow? Or indeed to have two dataflow jobs running (one splitting, and one processing the BAM)?
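
To make the proposed hand-off concrete, here is a minimal sketch of the message flow, with in-memory queues standing in for the two Pub/Sub topics. Everything named here (the topic variables, the JSON field names, the example bucket path) is an assumption for illustration and is not taken from the existing pipeline; the alignment itself is replaced by a placeholder.

```python
import json
import queue

# Stand-ins for two Pub/Sub topics (assumption: a real deployment would
# publish and subscribe through a Pub/Sub client instead of queue.Queue).
fastq_topic = queue.Queue()
bam_topic = queue.Queue()


def publish_batch(fastq_uri):
    """Splitter side: announce that a FASTQ batch has been written to the bucket."""
    fastq_topic.put(json.dumps({"fastq_uri": fastq_uri}))


def aligner_step():
    """Aligner side: take one FASTQ location and announce the resulting BAM location."""
    message = json.loads(fastq_topic.get())
    fastq_uri = message["fastq_uri"]
    # Placeholder for running bwa on the batch; only the output name is derived here.
    bam_uri = fastq_uri.rsplit(".", 1)[0] + ".bam"
    bam_topic.put(json.dumps({"fastq_uri": fastq_uri, "bam_uri": bam_uri}))


def downstream_step():
    """Downstream side: receive the BAM location so that processing can continue."""
    return json.loads(bam_topic.get())["bam_uri"]


# Hypothetical bucket and object names.
publish_batch("gs://example-bucket/run1/batch_0001.fastq")
aligner_step()
print(downstream_step())
```

Because each stage only reads from one queue and writes to the next, the splitter, the aligner and the downstream job never call each other directly, which is what would allow them to run as separate jobs.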

allenday · 2019-03-05T01:33:29Z

It's possible to do as you describe, yes. We implemented it as it is now in order to have a single pipeline that contains all of the application logic. Other than avoiding CGI, is there an advantage to having two distinct dataflows in the proposed architecture? On a related note, we are exploring having a cluster that communicates with dataflow (or anything else) via pubsub: GCS fastq in, GCS SAM out. This also enables e.g. variant calling using the same pattern. I began implementing a POC; I can give you what I have if you'd like to work on it.

lachlancoin · 2019-03-05T02:23:22Z

The advantage is just avoiding CGI, which as I understand it adds complexity (load balancers etc.) and is also the step where the pipeline gets blocked for us. The pattern you describe sounds great.

-- Group leader, Institute for Molecular Bioscience, University of Queensland Senior Lecturer, Imperial College http://academickarma.org/0000-0002-4300-455X http://orcid.org/0000-0002-4300-455X

lachlancoin · 2019-03-07T00:53:40Z

Hi @allenday @obsh @Pseverin

I am wondering whether it's possible to have a cut-down dataflow pipeline which carves out the post-BAM processing and does not do any of the fastq processing.

On our end we are working on the minimap2 docker, which will independently process any fastq arriving in the UPLOAD_BUCKET (via a mount of the cloud bucket on the instance) and produce a BAM file, which could then go into the cut-down dataflow pipeline. I should point out that we can control how finely the fastq are split on the nanopore device, and we have a client-side script which watches for new fastq and syncs them to the UPLOAD_BUCKET. So it's not strictly necessary to split further on the GCP side.

Another advantage of this is that we could test the post-BAM processing independently of the fastq processing steps. At the moment we are getting stuck in the alignment step.
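
As a rough sketch of what such a watcher could look like (this is not the actual docker entrypoint), the snippet below lists FASTQ files in a mounted directory that have not been handled yet and builds a minimap2 plus samtools command line for each. The function names, the suffix list and the idea of tracking processed files by name are all assumptions; the command is only built as a string and is not executed here.

```python
import shlex
from pathlib import Path

# Assumed set of FASTQ file suffixes; longer suffixes are listed first.
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def strip_fastq_suffix(name):
    """Remove a known FASTQ suffix from a file name, if one is present."""
    for suffix in FASTQ_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def find_new_fastq(upload_dir, already_seen):
    """Return FASTQ files in upload_dir whose names are not in already_seen, sorted."""
    candidates = sorted(
        p for p in Path(upload_dir).iterdir() if p.name.endswith(FASTQ_SUFFIXES)
    )
    return [p for p in candidates if p.name not in already_seen]


def alignment_command(fastq_path, reference, bam_dir):
    """Build a shell pipeline: minimap2 emits SAM, samtools sort writes a sorted BAM."""
    bam_path = Path(bam_dir) / (strip_fastq_suffix(Path(fastq_path).name) + ".bam")
    return (
        f"minimap2 -ax map-ont {shlex.quote(str(reference))} "
        f"{shlex.quote(str(fastq_path))}"
        f" | samtools sort -o {shlex.quote(str(bam_path))} -"
    )
```

A polling loop would call `find_new_fastq` on the mount point, run each command, and add the file name to `already_seen` once the BAM has been written.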

lachlancoin · 2019-03-07T01:10:48Z

There are a few more advantages to this setup.

- We can control upstream processing more easily, e.g. compression/decompression or encryption/decryption.
- I think we could hack minimap2 to continue working on subsequent fastqs uploaded while it's still processing.

obsh · 2019-03-07T20:47:27Z

I think there are two main options: we can add another class with a cut-down pipeline that subscribes to BAM/SAM file upload events, or we can extend the existing pipeline to detect the uploaded file type and bypass the steps that are not needed, e.g. skip the alignment step when a BAM/SAM file is uploaded.
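
The second option can be sketched as a small routing function. The step names and suffix lists below are hypothetical placeholders, not the actual stages of the pipeline.

```python
# Assumed file suffixes for already-aligned and raw uploads.
ALIGNED_SUFFIXES = (".bam", ".sam")
RAW_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")

# Hypothetical step names, in execution order.
FULL_PIPELINE = ["split", "align", "post_alignment"]


def steps_for_upload(object_name):
    """Choose which pipeline steps to run based on the uploaded file's suffix."""
    lowered = object_name.lower()
    if lowered.endswith(ALIGNED_SUFFIXES):
        # Already aligned: bypass splitting and alignment.
        return ["post_alignment"]
    if lowered.endswith(RAW_SUFFIXES):
        return list(FULL_PIPELINE)
    raise ValueError(f"unsupported upload type: {object_name}")
```

Routing on the suffix keeps a single pipeline class, at the cost of trusting the file name rather than inspecting the content.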

obsh · 2019-03-07T20:57:25Z

We haven't worked with BAM files previously. Am I correct that .bam files are always created with a corresponding index file, .bam.bai?

lachlancoin · 2019-03-07T21:23:26Z

We don't need the .bai in this case, as we need to read the whole BAM.
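
For context, a .bai index is generated in a separate step (for example with `samtools index`) and is only needed for random access by region. Reading a BAM from start to finish needs nothing but the file itself, as the following standard-library sketch illustrates. It relies on BAM being a BGZF (gzip-compatible) stream and on the header layout from the SAM/BAM specification; the function names are made up for this example, and a real pipeline would more likely use a library such as htsjdk or pysam.

```python
import gzip
import struct


def read_exact(stream, n):
    """Read exactly n bytes or raise if the stream ends early."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("unexpected end of BAM stream")
    return data


def scan_bam(path):
    """Return (reference names, alignment record count) by reading front to back."""
    with gzip.open(path, "rb") as stream:
        if read_exact(stream, 4) != b"BAM\x01":
            raise ValueError("not a BAM file")
        (l_text,) = struct.unpack("<i", read_exact(stream, 4))
        read_exact(stream, l_text)  # skip the SAM-style header text
        (n_ref,) = struct.unpack("<i", read_exact(stream, 4))
        names = []
        for _ in range(n_ref):
            (l_name,) = struct.unpack("<i", read_exact(stream, 4))
            names.append(read_exact(stream, l_name).rstrip(b"\x00").decode())
            read_exact(stream, 4)  # reference length, not needed here
        records = 0
        while True:
            size_bytes = stream.read(4)
            if not size_bytes:
                break  # clean end of file
            if len(size_bytes) != 4:
                raise EOFError("truncated record size")
            (block_size,) = struct.unpack("<i", size_bytes)
            read_exact(stream, block_size)  # skip the record body
            records += 1
    return names, records
```

No index file is opened at any point; the records are simply consumed in file order.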

obsh · 2019-03-07T23:24:44Z

I've created a separate branch, bam_files, with a pipeline version which skips the alignment steps for BAM/SAM files.
But I believe there will be errors on the k-align step if your pipeline has difficulties connecting to the alignment cluster.

lachlancoin · 2019-03-07T23:37:08Z

Yes, I see, so we still need to provision an alignment cluster, but could probably do so with less memory. Also, I don't believe the k-align step is necessary for the species typing. I'm not sure it's currently required for AMR typing in your pipeline either: it is handy in the AMR pipeline for getting sequences with high base-level accuracy at the end, but we are not exploiting that at the moment, we are just using the counts.

obsh mentioned this issue Mar 7, 2019

Modified pipeline to support uploads of SAM/BAM files. #100

Merged

obsh added the tracked label Sep 27, 2019
