avoiding CGI #99

Open
lachlancoin opened this issue Mar 5, 2019 · 9 comments

@lachlancoin

I am sure you considered this, but why not use the FastQsplitter to split into batches -> write to bucket -> send location to AlignerCluster via pubsub -> AlignerCluster has a script which kicks off bwa alignment -> send location of BAM via pubsub -> Dataflow proceeds?

I guess the reason is that this becomes asynchronous (i.e. a different Dataflow process has to be running to split the fastq, and then another to read the BAM). Is it possible to have these two asynchronous processes running on different threads within Dataflow? Or indeed to have two Dataflow jobs running (one splitting, and one processing the BAM)?
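
To make the handoff described above a bit more concrete, here is a minimal sketch of the location message that the splitter could publish and the aligner could consume. The field names (`bucket`, `name`, `stage`) and the helper functions are hypothetical and not part of this project; the Pub/Sub client itself is left out, so only the payload encoding is shown.

```python
import json


def encode_location(bucket: str, name: str, stage: str) -> bytes:
    """Serialize an object location as a UTF-8 JSON payload (hypothetical schema)."""
    return json.dumps({"bucket": bucket, "name": name, "stage": stage}).encode("utf-8")


def decode_location(payload: bytes) -> dict:
    """Parse a payload produced by encode_location and check the required keys."""
    message = json.loads(payload.decode("utf-8"))
    missing = {"bucket", "name", "stage"} - message.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return message


def gcs_uri(message: dict) -> str:
    """Build the gs:// URI that the consumer would read from."""
    return f"gs://{message['bucket']}/{message['name']}"
```

The same payload shape could be reused for the second hop (aligner to the BAM-processing job) by changing only the `stage` value.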

@allenday
Owner

allenday commented Mar 5, 2019 via email

@lachlancoin
Author

lachlancoin commented Mar 5, 2019 via email

@lachlancoin
Author

Hi @allenday @obsh @Pseverin

I am wondering whether it's possible to have a cut-down Dataflow pipeline which carves out the post-BAM processing and does not do any of the fastq processing.

On our end we are working on the minimap2 docker, which will independently process any fastq arriving in the UPLOAD_BUCKET (via a mount of the cloud bucket on the instance) and produce a BAM file, which could then go into the cut-down Dataflow pipeline. I should point out that we can control how finely the fastq are split coming off the nanopore device, and we have a client-side script which watches for new fastq and syncs them to the UPLOAD_BUCKET. So it's not strictly necessary to split further on the GCP side.

Another advantage of this is that we could test the post-BAM processing independently of the fastq processing steps. At the moment we are getting stuck in the alignment step.
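
For illustration only, the "watch for new fastq" part of a client-side script like the one mentioned above could look roughly like this. It is a sketch under assumptions, not the actual script: the function name, the suffix list, and the in-memory `seen` set are all hypothetical, and the upload to UPLOAD_BUCKET is deliberately left out.

```python
from pathlib import Path

# Assumed set of FASTQ file name endings; adjust to what the device writes.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def find_new_fastq(directory: Path, seen: set) -> list:
    """Return FASTQ files in `directory` not yet in `seen`, and record them."""
    fresh = []
    for path in sorted(directory.iterdir()):
        if path.is_file() and path.name.endswith(FASTQ_SUFFIXES) and path.name not in seen:
            seen.add(path.name)
            fresh.append(path)
    return fresh
```

Each call returns only files that appeared since the previous call, so a polling loop can sync exactly those. A real watcher would also need to avoid picking up files that are still being written.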

@lachlancoin
Author

There are a few more advantages to this setup.

  1. We can control upstream processing more easily, e.g. compression/decompression or encryption/decryption.

  2. I think we could hack minimap2 to continue working on subsequent fastqs uploaded while it's still processing.

@obsh
Collaborator

obsh commented Mar 7, 2019

I think there are two main options: we can add another class with a cut-down pipeline that subscribes to BAM/SAM file upload events, or we can extend the existing pipeline to detect the uploaded file type and bypass the steps that are not needed. For example, if a BAM/SAM file is uploaded, bypass the alignment step.
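
As a rough sketch of the second option, the file-type check could be a small routing function keyed on the uploaded object name. The `Route` enum, the function name, and the suffix lists below are hypothetical and only illustrate the idea; the existing pipeline is not written this way.

```python
from enum import Enum


class Route(Enum):
    ALIGN = "align"        # FASTQ input: run the alignment step first
    SKIP_ALIGN = "skip"    # BAM/SAM input: go straight to post-alignment steps
    IGNORE = "ignore"      # anything else, e.g. index files


def route_for(object_name: str) -> Route:
    """Pick a route from the object name's extension (case-insensitive)."""
    lower = object_name.lower()
    if lower.endswith((".bam", ".sam")):
        return Route.SKIP_ALIGN
    if lower.endswith((".fastq", ".fq", ".fastq.gz", ".fq.gz")):
        return Route.ALIGN
    return Route.IGNORE
```

Routing on the name alone keeps the check cheap, at the cost of trusting the extension rather than inspecting the file contents.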

@obsh
Collaborator

obsh commented Mar 7, 2019

We haven't worked with BAM files previously. Am I correct that .bam files are always created with a corresponding index file, .bam.bai?
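
For context, an index is not written automatically alongside every BAM; it is typically produced as a separate step (for example with `samtools index`, which needs a coordinate-sorted file), and some tools name it `reads.bai` instead of `reads.bam.bai`. A pipeline could therefore check for an index instead of assuming one exists. The helper names below are hypothetical, and the sketch works on local paths only.

```python
from pathlib import Path


def candidate_index_paths(bam: Path) -> list:
    """Both common index names for a BAM: reads.bam.bai and reads.bai."""
    return [bam.with_name(bam.name + ".bai"), bam.with_suffix(".bai")]


def find_index(bam: Path):
    """Return the first existing index path, or None if the BAM has no index."""
    for candidate in candidate_index_paths(bam):
        if candidate.is_file():
            return candidate
    return None
```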

@lachlancoin
Author

lachlancoin commented Mar 7, 2019 via email

@obsh
Collaborator

obsh commented Mar 7, 2019

I've created a separate branch, bam_files, with a pipeline version which skips the alignment steps for BAM/SAM files.
But I believe there will be errors on the k-align step if your pipeline has difficulties connecting to the alignment cluster.

@lachlancoin
Author

lachlancoin commented Mar 7, 2019 via email

@obsh added the tracked label Sep 27, 2019