avoiding CGI #99
I am sure you considered this, but why not use the FastQsplitter to split into batches -> write to bucket -> send location to AlignerCluster via pubsub -> AlignerCluster has a script which kicks off bwa alignment -> send location of the BAM via pubsub -> Dataflow proceeds?
I guess the reason is that this becomes asynchronous (i.e. a different Dataflow process has to be running to split the fastq, and then another to read the BAM). Is it possible to have these two asynchronous processes running on different threads within Dataflow? Or indeed to have two Dataflow jobs running (one splitting, and one processing the BAM)?
Comments
It's possible to do as you describe, yes. We implemented it as it is now to have a single pipeline that contains all of the application logic. Other than avoiding CGI, is there an advantage to having two distinct dataflows in the proposed architecture?
On a related note, we are exploring having a cluster that communicates with Dataflow (or anything else) via pubsub: GCS fastq in, GCS SAM out. This also enables e.g. variant calling using the same pattern. I began implementing a POC; I can give you what I have if you'd like to work on it.
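For what it's worth, a minimal sketch of the worker side of that pattern might look like the following (everything here — project id, topic and subscription names, buckets, reference path and the aligner command — is a placeholder, not the actual POC):
```python
# Sketch of a Pub/Sub-driven aligner worker: it receives the GCS location of a
# fastq batch, runs the aligner, writes the SAM back to GCS, and publishes the
# output location so the downstream pipeline can proceed.
# Project, topic, subscription, bucket and reference paths are all placeholders.
import subprocess

from google.cloud import pubsub_v1, storage

PROJECT = "my-project"                 # placeholder project id
SUBSCRIPTION = "fastq-locations-sub"   # placeholder subscription name
OUT_TOPIC = "sam-locations"            # placeholder output topic
OUT_BUCKET = "aligned-output"          # placeholder output bucket

storage_client = storage.Client()
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
out_topic = publisher.topic_path(PROJECT, OUT_TOPIC)
subscription = subscriber.subscription_path(PROJECT, SUBSCRIPTION)


def handle(message):
    gcs_path = message.data.decode("utf-8")   # e.g. "gs://upload-bucket/batch42.fastq"
    bucket_name, blob_name = gcs_path[len("gs://"):].split("/", 1)

    # Fetch the fastq batch locally.
    local_fastq = "/tmp/input.fastq"
    storage_client.bucket(bucket_name).blob(blob_name).download_to_filename(local_fastq)

    # Run the aligner (bwa shown; minimap2 would be analogous).
    local_sam = "/tmp/output.sam"
    with open(local_sam, "w") as out:
        subprocess.run(["bwa", "mem", "/ref/reference.fa", local_fastq],
                       stdout=out, check=True)

    # Upload the SAM and publish its location.
    out_blob = blob_name.rsplit(".", 1)[0] + ".sam"
    storage_client.bucket(OUT_BUCKET).blob(out_blob).upload_from_filename(local_sam)
    publisher.publish(out_topic, f"gs://{OUT_BUCKET}/{out_blob}".encode("utf-8"))
    message.ack()


# Block and process messages as they arrive.
subscriber.subscribe(subscription, callback=handle).result()
```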
|
The advantage is just avoiding CGI (which, as I understand it, adds complexity in terms of load balancers etc.), and it is also the step where the pipeline gets blocked for us. The pattern you describe sounds great.
|
I am wondering whether it's possible to have a cut-down Dataflow pipeline which carves out the post-BAM processing and does not do any of the fastq processing. On our end we are working on the minimap2 Docker image, which will independently process any fastq arriving in the UPLOAD_BUCKET (via mounting the cloud bucket on the instance) and produce a BAM file, which could then go into the cut-down Dataflow pipeline.
I should point out that we can control how finely the fastq are split from the nanopore device, and we have a client-side script which watches for new fastq and syncs them to the UPLOAD_BUCKET (see the sketch below), so it's not strictly necessary to split further on the GCP side. Another advantage of this is that we could test the post-BAM processing independently of the fastq processing steps. At the moment we are getting stuck at the alignment step.
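For reference, the client-side watcher is essentially a loop like the sketch below (the watch directory, bucket name and polling interval are placeholders; the real script may just use gsutil rsync instead):
```python
# Minimal sketch of the client-side watcher: poll the nanopore output directory
# and upload any new fastq to the upload bucket. Directory, bucket name and
# polling interval are placeholders.
import time
from pathlib import Path

from google.cloud import storage

WATCH_DIR = Path("/data/minion/fastq_pass")   # placeholder nanopore output dir
UPLOAD_BUCKET = "upload-bucket"               # placeholder for UPLOAD_BUCKET

bucket = storage.Client().bucket(UPLOAD_BUCKET)
already_synced = set()

while True:
    for fastq in WATCH_DIR.glob("*.fastq"):
        if fastq.name in already_synced:
            continue
        bucket.blob(fastq.name).upload_from_filename(str(fastq))
        already_synced.add(fastq.name)
    time.sleep(30)   # poll every 30 seconds
```
|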
There are a few more advantages to this setup.
|
I think there are two main options: we can add another class with a cut-down pipeline that subscribes to BAM/SAM file upload events, or we can extend the existing pipeline to detect the uploaded file type and bypass the steps that aren't needed, e.g. if a BAM/SAM file is uploaded, bypass the alignment step (see the sketch below).
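To make the second option concrete, the branching could look roughly like this (sketched with the Beam Python SDK for brevity, even if the actual pipeline differs; the upload events and the Align/Downstream steps are placeholders):
```python
# Rough sketch of option 2: partition incoming upload events by file type so
# that BAM/SAM files bypass the alignment step. Beam Python SDK used here for
# brevity; the placeholder Align/Downstream steps stand in for the real ones.
import apache_beam as beam


def by_file_type(path, num_partitions):
    # 1 = already aligned (BAM/SAM), 0 = needs alignment (fastq)
    return 1 if path.lower().endswith((".bam", ".sam")) else 0


with beam.Pipeline() as pipeline:
    uploads = pipeline | "UploadEvents" >> beam.Create([
        "gs://upload-bucket/run1.fastq",   # placeholder upload events
        "gs://upload-bucket/run2.bam",
    ])

    to_align, already_aligned = uploads | beam.Partition(by_file_type, 2)

    # fastq goes through the (stubbed) alignment step; BAM/SAM skips it.
    aligned = to_align | "Align" >> beam.Map(lambda path: path + " [aligned]")
    merged = (aligned, already_aligned) | beam.Flatten()
    merged | "Downstream" >> beam.Map(print)
```
|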
We haven't worked with BAM files previously; am I correct that .bam files are always created with a corresponding index file .bam.bai? |
We don't need the .bai in this case, as we need to read the whole BAM anyway.
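Sequential iteration over a BAM doesn't use the index; only region-based random access does. A tiny illustration in pysam (the path is a placeholder, and the pipeline may of course read the BAM differently):
```python
# pysam sketch, illustrative only: a whole-file scan of a BAM needs no .bai;
# only region queries via fetch() require the index. Path is a placeholder.
import pysam

n_mapped = 0
with pysam.AlignmentFile("input.bam", "rb") as bam:
    for read in bam:                 # sequential read, no index used
        if not read.is_unmapped:
            n_mapped += 1
print(n_mapped)

# bam.fetch("chr1", 0, 1_000_000) is what would require input.bam.bai
```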
|
I've created a separate branch bam_files with a pipeline version which skips the alignment steps for BAM/SAM files. But I believe there will be errors on the k-align step if your pipeline has difficulties connecting to the alignment cluster. |
Yes, I see, so we still need to provision an alignment cluster, but we could probably do so with less memory.
Also, I don't believe the k-align step is necessary for the species typing (and I'm not sure it's currently required for AMR typing in your pipeline either; it is handy in the AMR pipeline in order to get highly accurate base-level sequences at the end, but we are not currently exploiting that in the pipeline, we are just using the counts).
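To spell out what "just using the counts" amounts to, here is a minimal pysam sketch (the BAM path is a placeholder): tally primary alignments per reference sequence and report those tallies, with no base-level polishing step.
```python
# Illustrative sketch of count-based typing: tally primary alignments per
# reference sequence (species or AMR gene) and report the counts, with no
# base-level polishing step. The BAM path is a placeholder.
from collections import Counter

import pysam

counts = Counter()
with pysam.AlignmentFile("aligned.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped or read.is_secondary or read.is_supplementary:
            continue                      # keep primary alignments only
        counts[read.reference_name] += 1

for reference, n in counts.most_common(10):
    print(f"{reference}\t{n}")
```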
|