Refactor the usage of tbprofiler vcf_profile #104

Closed
TimHHH opened this issue May 31, 2022 · 7 comments · Fixed by #137

@TimHHH
Collaborator

TimHHH commented May 31, 2022

MERGE_WF:RESISTANCE_ANALYSIS:TBPROFILER_VCF_PROFILE__LOFREQ crashes because, when run in parallel, multiple processes are reading and writing the same files in /conda/xbs-nf-env-1-d99876e5fea88a1c4bd18887d111ae27/share/tbprofiler/

A temporary workaround is to keep restarting the run, each time completing more samples. Decreasing queueSize and increasing the number of errorStrategy retries is likely to be a good temporary solution as well.
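In Nextflow itself, retrying is configured through the errorStrategy and maxRetries process directives. As a rough illustration of the same idea outside Nextflow, the sketch below re-runs a command until it succeeds or a retry budget is exhausted. The `retry` function and the `RETRY_DELAY` variable are hypothetical names introduced here, not part of the pipeline or of tbprofiler.

```shell
# Hypothetical helper (not from the pipeline): re-run a command until it
# succeeds or the maximum number of attempts is reached.
retry() {
    local max_attempts="$1"
    shift
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        # Pause between attempts so colliding processes have a chance to finish.
        sleep "${RETRY_DELAY:-1}"
    done
}
```

A call such as `retry 5 some_command arg1 arg2` would try `some_command` up to five times. This only papers over the file collisions; it does not remove the shared writes in the conda environment.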

@abhi18av
Member

@TimHHH pointed out that we could perhaps run MERGE_WF:RESISTANCE_ANALYSIS:TBPROFILER_VCF_PROFILE__LOFREQ on a single merged lofreqVCF file for the cohort.

Please let me know, after your own experiments, whether we should take this path.

@abhi18av abhi18av assigned abhi18av and TimHHH and unassigned abhi18av Jul 12, 2022
@TimHHH
Collaborator Author

TimHHH commented Jul 14, 2022

I tried merging the LoFreq VCFs, but unfortunately some important information (sample and allele frequency) is lost. Hence this route is not feasible.

@abhi18av
Member

As an alternative, we can bundle the WHO database within the containers and at runtime just focus on the main script. TODO @abhi18av

@abhi18av abhi18av added this to the v1.0.0 milestone Sep 27, 2022
@LennertVerboven
Collaborator

LennertVerboven commented Sep 27, 2022

Hi Abhi

Below you can find the flow required for the new lofreq analysis. I have also uploaded the reformat_lofreq.py script.

  • For each lofreqVcf file (from CALL_WF.LOFREQ)
# The inputs for the new script are the same as for the command below i.e. (sampleName / lofreqVcf)
${params.tbprofiler_path} vcf_profile --lofreq_sample_name ${sampleName}  ${lofreqVcf}

# INPUT FOR reformat_lofreq
# outputfile should be ${sampleName}.LoFreq.Reformat.vcf
python reformat_lofreq.py ${lofreqVcf} ${sampleName} ${reformat_output_vcf} 

# output gets stored in ${sampleName}.LoFreq.Reformat.vcf (for example)
# produces the gzip version  ${sampleName}.LoFreq.Reformat.vcf.gz
bgzip -f ${reformat_output_vcf}  

# output gets stored in ${sampleName}.LoFreq.Reformat.vcf.gz.tbi
gatk IndexFeatureFile -I ${sampleName}.LoFreq.Reformat.vcf.gz
  • Once for all files (this step requires the vcf.gz and vcf.gz.tbi files for all samples)
# $LIST_ALL_VCF_FILES holds the reformatted, gzipped VCFs of all samples
bcftools merge -o ${joint_name}.LoFreq.vcf $LIST_ALL_VCF_FILES
bgzip -f ${joint_name}.LoFreq.vcf

# Run the tbprofiler on the merged-and-gzipped file
${params.tbprofiler_path} vcf_profile ${optionalDb} ${joint_name}.LoFreq.vcf.gz
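The per-sample and cohort steps in this comment can be sketched as two shell functions. This is only an outline under stated assumptions: the function names, the `run` wrapper, the `DRY_RUN` switch, and the `TBPROFILER` variable (standing in for `${params.tbprofiler_path}`) are hypothetical, the merged file is assumed to be written uncompressed and then bgzipped, and the optional database argument is left out. By default the sketch only prints the commands, so it can be read without the tools installed.

```shell
set -eu

# Assumption: DRY_RUN=1 (the default here) prints each command instead of
# running it, so reformat_lofreq.py, bgzip, gatk and bcftools are not needed.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Per-sample step: reformat, compress, and index one LoFreq VCF.
prepare_sample() {
    local sample_name="$1"
    local lofreq_vcf="$2"
    local reformatted="${sample_name}.LoFreq.Reformat.vcf"
    run python reformat_lofreq.py "$lofreq_vcf" "$sample_name" "$reformatted"
    run bgzip -f "$reformatted"                       # -> .vcf.gz
    run gatk IndexFeatureFile -I "${reformatted}.gz"  # -> .vcf.gz.tbi
}

# Cohort step: merge all reformatted VCFs, compress, and profile once.
profile_cohort() {
    local joint_name="$1"
    shift
    run bcftools merge -o "${joint_name}.LoFreq.vcf" "$@"
    run bgzip -f "${joint_name}.LoFreq.vcf"
    run "${TBPROFILER:-tb-profiler}" vcf_profile "${joint_name}.LoFreq.vcf.gz"
}
```

For example, `prepare_sample S1 S1.lofreq.vcf` followed by `profile_cohort cohort S1.LoFreq.Reformat.vcf.gz S2.LoFreq.Reformat.vcf.gz` prints the command sequence for two samples. Because tbprofiler is invoked once on the merged file, no two processes touch its database directory at the same time.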

@abhi18av
Member

Sounds good @LennertVerboven !

> Below you can find the flow required for the new lofreq analysis. I have also uploaded the reformat_lofreq.py script

Which branch have you used to upload this?

@LennertVerboven
Collaborator

Given that I just added a file, I added it to master directly.

@abhi18av changed the title from "crash on TBPROFILER_VCF_PROFILE__LOFREQ due to multiple processes reading/writing same DB files" to "Refactor the usage of tbprofiler vcf_profile" on Nov 1, 2022
@abhi18av
Member

abhi18av commented Nov 1, 2022

Update:

  • The original scope of this issue was the problem of loading the WHO database in tbprofiler when running in parallel with conda (cluster + conda).

This has been addressed with #128.

  • The current scope is to remove the need to run tbprofiler in parallel altogether and instead analyse a single set of concatenated VCF files (from lofreq).

@abhi18av abhi18av linked a pull request Nov 1, 2022 that will close this issue
@abhi18av abhi18av removed a link to a pull request Nov 6, 2022
@abhi18av abhi18av linked a pull request Nov 6, 2022 that will close this issue