Pipeline failing with older fastq file #216

Open
RuanSpies21 opened this issue Aug 13, 2024 · 20 comments
@RuanSpies21

RuanSpies21 commented Aug 13, 2024

Hi there,

I am trying to run the pipeline on some older fastq files (circa 2010s) using the docker profile. The reads for the files are relatively short at ~75bp. Following previous advice from Abhinav, I have created a custom.config file with contents:

profiles {
    bwa_k66 {
        params {
            BWA_MEM {
                arguments = " -k 66"
            }
        }
    }
}

which I specify with the -c argument. So my full command is: nextflow run . -params-file params/params.yaml -profile docker,server,bwa_k66 -c custom.config.
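
Since `-k` is BWA-MEM's minimum seed length, it can help to measure the actual read lengths before choosing a value. The following is only a rough sketch, assuming plain four-line gzipped FASTQ records; the function names and the example path are hypothetical and not part of MAGMA:

```python
import gzip
from collections import Counter
from itertools import islice


def read_length_summary(fastq_gz_path, max_reads=10000):
    """Count read lengths in the first `max_reads` records of a gzipped FASTQ."""
    lengths = Counter()
    with gzip.open(fastq_gz_path, "rt") as handle:
        # Each record is 4 lines; the sequence is the 2nd line of every record.
        for seq_line in islice(handle, 1, 4 * max_reads, 4):
            lengths[len(seq_line.rstrip("\n"))] += 1
    return lengths


def fraction_seedable(lengths, min_seed_length):
    """Fraction of sampled reads that are at least `min_seed_length` long."""
    total = sum(lengths.values())
    if total == 0:
        return 0.0
    long_enough = sum(n for length, n in lengths.items() if length >= min_seed_length)
    return long_enough / total


# Hypothetical usage (path is an assumption):
# lengths = read_length_summary("ERR038264_1.fastq.gz")
# print(fraction_seedable(lengths, 66), fraction_seedable(lengths, 100))
```

A read shorter than the seed length cannot contain a seed of that length, so a fraction near zero for a given `-k` would suggest that value is too large for the data.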

However, I get the following error:

ERROR ~ Error executing process > 'UTILS_MERGE_COHORT_STATS (joint_name: joint)'

Caused by:
  Process `UTILS_MERGE_COHORT_STATS (joint_name: joint)` terminated with an error exit status (1)

Command executed:

  generate_merged_cohort_stats.py \
      --relabundance_approved_tsv approved_samples.relabundance.tsv \
      --relabundance_rejected_tsv rejected_samples.relabundance.tsv\
      --call_wf_cohort_stats_tsv joint.cohort_stats.tsv\
      --output_file joint.merged_cohort_stats.tsv

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File "/mnt/volume_data/ruan/walker_2013/MAGMA/bin/generate_merged_cohort_stats.py", line 55, in <module>
      df_final_cohort_stats['ALL_THRESHOLDS_MET'] = df_final_cohort_stats['MAPPED_NTM_FRACTION_16S_THRESHOLD_MET'].astype('bool')  & df_final_cohort_stats['COVERAGE_THRESHOLD_MET'].astype('bool')  & df_final_cohort_stats['BREADTH_OF_COVERAGE_THRESHOLD_MET'].astype('bool')  & df_final_cohort_stats['RELABUNDANCE_THRESHOLD_MET'].astype('bool')
    File "/opt/conda/lib/python3.9/site-packages/pandas/core/generic.py", line 6240, in astype
      new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
    File "/opt/conda/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 445, in astype
      return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    File "/opt/conda/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 347, in apply
      applied = getattr(b, f)(**kwargs)
    File "/opt/conda/lib/python3.9/site-packages/pandas/core/internals/blocks.py", line 526, in astype
      new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    File "/opt/conda/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 299, in astype_array_safe
      new_values = astype_array(values, dtype, copy=copy)
    File "/opt/conda/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 227, in astype_array
      values = values.astype(dtype, copy=copy)
    File "/opt/conda/lib/python3.9/site-packages/pandas/core/arrays/masked.py", line 474, in astype
      raise ValueError("cannot convert float NaN to bool")
  ValueError: cannot convert float NaN to bool

Work dir:
  /mnt/volume_data/ruan/walker_2013/MAGMA/work/92/7f8941da70c68790f747afea230770

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

I think this is due to the samples returning 0 coverage (when I check /mnt/volume_data/ruan/walker_2013/MAGMA/magma-results/QC_statistics/per_sample/coverage, all of them have 0).
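
To see which samples would trigger the failing cast, one could scan a stats TSV for empty threshold flags. This is a minimal sketch using only the standard library; the column names come from the traceback above, while the `SAMPLE` column, the missing-value markers, and the function name `find_missing_flags` are assumptions. Which input file carries which column is not stated here, so an absent column is simply reported as missing:

```python
import csv
import io

# Column names taken from the traceback; everything else is illustrative.
THRESHOLD_COLUMNS = [
    "MAPPED_NTM_FRACTION_16S_THRESHOLD_MET",
    "COVERAGE_THRESHOLD_MET",
    "BREADTH_OF_COVERAGE_THRESHOLD_MET",
    "RELABUNDANCE_THRESHOLD_MET",
]
MISSING_MARKERS = {"", "nan", "na", "none"}


def find_missing_flags(tsv_text):
    """Map each sample to the threshold columns that are empty or absent."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = {}
    for row in reader:
        gaps = [
            col
            for col in THRESHOLD_COLUMNS
            if (row.get(col) or "").strip().lower() in MISSING_MARKERS
        ]
        if gaps:
            missing[row.get("SAMPLE", "?")] = gaps
    return missing


# Hypothetical usage (file name is an assumption):
# with open("joint.cohort_stats.tsv") as handle:
#     print(find_missing_flags(handle.read()))
```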

Any ideas what could be going on here or any workarounds?
ERR038264 is an example fastq

Thanks!
Ruan

@abhi18av
Member

Hi @RuanSpies21 ,

Happy to work on this together, could you please share 5 sample IDs from your dataset?

This way I can test those locally.

@RuanSpies21
Author

Thanks @abhi18av!

ERR038276
ERR038277
ERR038278
ERR038279
ERR038280

@RuanSpies21
Author

Hi @abhi18av - any thoughts on this yet?

@abhi18av
Member

Hi @RuanSpies21 ,

Apologies for the late response on this one. I have been able to reproduce this error on my side using the pipeline's default -k 100 for BWA, which completed in 30 seconds per sample.


This was NOT resolved even when I enabled bwa_k66 on my side with these samples, raising the runtime for BWA to roughly 40 seconds per sample.

The following statistics were generated for the individual files

|SAMPLE                   |AVG_INSERT_SIZE|MAPPED_PERCENTAGE|RAW_TOTAL_SEQS|AVERAGE_BASE_QUALITY|MEAN_COVERAGE|SD_COVERAGE|MEDIAN_COVERAGE|MAD_COVERAGE|PCT_EXC_ADAPTER|PCT_EXC_MAPQ|PCT_EXC_DUPE|PCT_EXC_UNPAIRED|PCT_EXC_BASEQ|PCT_EXC_OVERLAP|PCT_EXC_CAPPED|PCT_EXC_TOTAL|PCT_1X  |PCT_5X  |PCT_10X |PCT_30X |PCT_50X |PCT_100X|MAPPED_NTM_FRACTION_16S|MAPPED_NTM_FRACTION_16S_THRESHOLD_MET|COVERAGE_THRESHOLD_MET|BREADTH_OF_COVERAGE_THRESHOLD_MET|ALL_THRESHOLDS_MET|
|-------------------------|---------------|-----------------|--------------|--------------------|-------------|-----------|---------------|------------|---------------|------------|------------|----------------|-------------|---------------|--------------|-------------|--------|--------|--------|--------|--------|--------|-----------------------|-------------------------------------|----------------------|---------------------------------|------------------|
|MAGMA.ERX015472_ERR038276|366.5          |73.97            |17112670      |34.5                |154.672354   |71.416821  |157            |48          |0              |0.09915     |0.154413    |0               |0.02558      |0.001099       |0             |0.280241     |0.972886|0.96603 |0.961128|0.942092|0.916305|0.78371 |0.0                    |1                                    |1                     |1                                |1                 |
|MAGMA.ERX015473_ERR038277|384.5          |77.16            |18091946      |35.3                |176.428354   |71.317561  |185            |46          |0              |0.086099    |0.148593    |0               |0.020496     |0.000459       |0             |0.255647     |0.973516|0.966587|0.963094|0.950857|0.935023|0.859417|0.0                    |1                                    |1                     |1                                |1                 |
|MAGMA.ERX015474_ERR038278|407.9          |77.47            |13464688      |35.3                |134.847332   |58.306765  |142            |38          |0              |0.084936    |0.132313    |0               |0.020973     |0.000473       |0             |0.238694     |0.966827|0.959598|0.954376|0.933156|0.903887|0.750069|0.0                    |1                                    |1                     |1                                |1                 |
|MAGMA.ERX015475_ERR038279|427.5          |76.3             |16200744      |35.2                |155.460953   |61.334029  |165            |38          |0              |0.09051     |0.147023    |0               |0.021057     |0.000692       |0             |0.259282     |0.97162 |0.964273|0.960278|0.946518|0.928075|0.832158|0.0                    |1                                    |1                     |1                                |1                 |
|MAGMA.ERX015476_ERR038280|478.8          |75.39            |18525588      |35.2                |171.901534   |69.736791  |180            |42          |0              |0.096019    |0.158024    |0               |0.020307     |0.000607       |0             |0.274956     |0.973743|0.967281|0.96324 |0.949922|0.934053|0.859584|0.0                    |1                                    |1                     |1                                |1                 |

I was also able to reproduce the type-casting issue in the Python script:

INFO:    Converting SIF file to temporary sandbox...
Traceback (most recent call last):
  File "/home/abhinav/.nextflow/assets/TORCH-Consortium/MAGMA/bin/generate_merged_cohort_stats.py", line 55, in <module>
    df_final_cohort_stats['ALL_THRESHOLDS_MET'] = df_final_cohort_stats['MAPPED_NTM_FRACTION_16S_THRESHOLD_MET'].astype('bool')  & df_final_cohort_stats['COVERAGE_THRESHOLD_MET'].astype('bool')  & df_final_cohort_stats['BREADTH_OF_COVERAGE_THRESHOLD_MET'].astype('bool')  & df_final_cohort_stats['RELABUNDANCE_THRESHOLD_MET'].astype('bool')
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/generic.py", line 6240, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 448, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 352, in apply
    applied = getattr(b, f)(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 526, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 299, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 227, in astype_array
    values = values.astype(dtype, copy=copy)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/arrays/masked.py", line 474, in astype
    raise ValueError("cannot convert float NaN to bool")
ValueError: cannot convert float NaN to bool
INFO:    Cleaning up image...

NOTE

I am currently working on a patch to address this issue - thank you for bringing it to my attention!
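
For readers hitting the same traceback: the error comes from casting a column that contains missing values to `bool`. The sketch below is not the patch that was pushed; it is one possible NaN-tolerant way to combine the flags, assuming a missing flag should count as "not met". The toy data and the helper name `all_thresholds_met` are hypothetical:

```python
import pandas as pd

# Column names taken from the traceback; the data below is made up.
THRESHOLD_COLUMNS = [
    "MAPPED_NTM_FRACTION_16S_THRESHOLD_MET",
    "COVERAGE_THRESHOLD_MET",
    "BREADTH_OF_COVERAGE_THRESHOLD_MET",
    "RELABUNDANCE_THRESHOLD_MET",
]


def all_thresholds_met(df):
    """Combine threshold flags row-wise, counting a missing flag as not met."""
    flags = df[THRESHOLD_COLUMNS].fillna(0).astype(bool)
    return flags.all(axis=1)


# Toy cohort: the second sample has no relabundance flag, as could happen
# after a merge in which it was absent from one of the input tables.
df = pd.DataFrame(
    {
        "SAMPLE": ["sample_ok", "sample_zero_coverage"],
        "MAPPED_NTM_FRACTION_16S_THRESHOLD_MET": pd.array([1, 1], dtype="Int64"),
        "COVERAGE_THRESHOLD_MET": pd.array([1, 0], dtype="Int64"),
        "BREADTH_OF_COVERAGE_THRESHOLD_MET": pd.array([1, 0], dtype="Int64"),
        "RELABUNDANCE_THRESHOLD_MET": pd.array([1, None], dtype="Int64"),
    }
)

try:
    # Casting a nullable column that holds a missing value directly to bool.
    df["RELABUNDANCE_THRESHOLD_MET"].astype("bool")
except ValueError as err:
    print("direct cast fails:", err)

df["ALL_THRESHOLDS_MET"] = all_thresholds_met(df)
print(df[["SAMPLE", "ALL_THRESHOLDS_MET"]])
```

Whether a missing flag should mean "failed" or should drop the sample entirely is a policy decision for the pipeline; filling with 0 is just the most conservative choice for this illustration.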

@abhi18av abhi18av self-assigned this Sep 16, 2024
@abhi18av
Member

abhi18av commented Sep 16, 2024

@RuanSpies21 , could you please try running the pipeline with the following command? I have pushed a patch to master branch now.

NOTE: Please replace whatever makes sense in your context, but the main snippet is -r master -latest -resume

nextflow run 'https://github.com/TORCH-Consortium/MAGMA' \
		 -profile singularity,bwa_k66 \
		 -r master \
		 -latest \
		 -resume \
		 -params-file params.magma.yaml

@RuanSpies21
Author

RuanSpies21 commented Sep 17, 2024

Thank you so much for the help @abhi18av! I'm so sorry, I am not quite getting it right :(

When I run nextflow run 'https://github.com/TORCH-Consortium/MAGMA' -profile docker,server,bwa_k66 -params-file params.yaml -r master -latest -resume I get: Unknown configuration profile: 'bwa_k66'

If I then add the -c custom.config with the file mentioned above I get ERROR ~ Unknown method invocation splitJson on UnixPath type

Seems to be an issue with sample sheet validation? Here is the format of my sample sheet for reference:

Sample,R1,R2
ERR025842,/mnt/volume_data/ruan/walker_2013/ERR025842_1.fastq.gz,/mnt/volume_data/ruan/walker_2013/ERR025842_2.fastq.gz
ERR025843,/mnt/volume_data/ruan/walker_2013/ERR025843_1.fastq.gz,/mnt/volume_data/ruan/walker_2013/ERR025843_2.fastq.gz

I've also attached the nextflow logs in case helpful.

Thanks again for your help - very sorry to keep bothering!
nextflow.log

@abhi18av
Member

Hi @RuanSpies21

test profile

The samplesheet looks fine to me, but let's make sure that the basics are all set:

nextflow run 'https://github.com/TORCH-Consortium/MAGMA' -profile docker,server,test -r hotfix/bwa_k66

This should use the test profile, download some samples from the original MAGMA publication, and run them through the pipeline.

bwa_k66 profile

I have created a new bwa_k66 profile, which you can use without providing a -c custom.config file.

nextflow run 'https://github.com/TORCH-Consortium/MAGMA' -profile docker,server,bwa_k66 -r hotfix/bwa_k66 --input_samplesheet /path/to/your/samplesheet.csv

> Seems to be an issue with sample sheet validation?

Actually, to me the samplesheet seems valid 🤔

> Thanks again for your help - very sorry to keep bothering!

No worries at all Ruan, this is very helpful. There's no perfect software, but with user feedback and usage, we can keep improving it.

I do thank you for your patience!


If this doesn't work, then perhaps we can meet sometime next week? Here's my academic email abhinavsharma at sun dot ac dot za 📆

@RuanSpies21
Author

RuanSpies21 commented Sep 18, 2024

OK, it looks like it's failing with the same error on the test profile as well.

I ran nextflow run 'https://github.com/TORCH-Consortium/MAGMA' -profile docker,server,test -r hotfix/bwa_k66

Output:

process > SAMPLESHEET_VALIDATION                                                                [  0%] 0 of 1
[-        ] process > VALIDATE_FASTQS_WF:FASTQ_VALIDATOR                                                    -
[-        ] process > VALIDATE_FASTQS_WF:UTILS_FASTQ_COHORT_VALIDATION                                      -
[-        ] process > QUALITY_CHECK_WF:FASTQC                                                               -
[-        ] process > QUALITY_CHECK_WF:NTMPROFILER_PROFILE                                                  -
[-        ] process > QUALITY_CHECK_WF:NTMPROFILER_COLLATE                                                  -
[-        ] process > MAP_WF:BWA_MEM                                                                        -
[-        ] process > CALL_WF:SAMTOOLS_MERGE                                                                -
[-        ] process > CALL_WF:GATK_MARK_DUPLICATES                                                          -
[-        ] process > CALL_WF:SAMTOOLS_INDEX                                                                -
[-        ] process > CALL_WF:GATK_HAPLOTYPE_CALLER                                                         -
[-        ] process > CALL_WF:LOFREQ_CALL__NTM                                                              -
[-        ] process > CALL_WF:LOFREQ_INDELQUAL                                                              -
[-        ] process > CALL_WF:SAMTOOLS_INDEX__LOFREQ                                                        -
[-        ] process > CALL_WF:LOFREQ_CALL                                                                   -
[-        ] process > CALL_WF:LOFREQ_FILTER                                                                 -
[-        ] process > CALL_WF:UTILS_REFORMAT_LOFREQ                                                         -
[-        ] process > CALL_WF:BGZIP__LOFREQ                                                                 -
[-        ] process > CALL_WF:GATK_INDEX_FEATURE_FILE__LOFREQ                                               -
[-        ] process > CALL_WF:SAMTOOLS_STATS                                                                -
[-        ] process > CALL_WF:GATK_COLLECT_WGS_METRICS                                                      -
[-        ] process > CALL_WF:GATK_FLAG_STAT                                                                -
[-        ] process > CALL_WF:UTILS_SAMPLE_STATS                                                            -
[-        ] process > CALL_WF:UTILS_COHORT_STATS                                                            -
[-        ] process > MINOR_VARIANTS_ANALYSIS_WF:BCFTOOLS_MERGE__LOFREQ                                     -
[-        ] process > MINOR_VARIANTS_ANALYSIS_WF:TBPROFILER_VCF_PROFILE__LOFREQ                             -
[-        ] process > MINOR_VARIANTS_ANALYSIS_WF:TBPROFILER_COLLATE__LOFREQ                                 -
[-        ] process > MINOR_VARIANTS_ANALYSIS_WF:UTILS_MULTIPLE_INFECTION_FILTER                            -
[-        ] process > UTILS_MERGE_COHORT_STATS                                                              -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:BWA_MEM__DELLY                                        -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:SAMTOOLS_MERGE__DELLY                                 -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:GATK_MARK_DUPLICATES__DELLY                           -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:SAMTOOLS_INDEX__DELLY                                 -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:DELLY_CALL                                            -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:BCFTOOLS_VIEW__DELLY                                  -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:BCFTOOLS_MERGE__DELLY                                 -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:TBPROFILER_VCF_PROFILE__DELLY                         -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:TBPROFILER_COLLATE__DELLY                             -
[-        ] process > MERGE_WF:PREPARE_COHORT_VCF:GATK_COMBINE_GVCFS                                        -
[-        ] process > MERGE_WF:PREPARE_COHORT_VCF:GATK_GENOTYPE_GVCFS                                       -
[-        ] process > MERGE_WF:PREPARE_COHORT_VCF:SNPEFF                                                    -
[-        ] process > MERGE_WF:PREPARE_COHORT_VCF:BGZIP                                                     -
[-        ] process > MERGE_WF:PREPARE_COHORT_VCF:GATK_INDEX_FEATURE_FILE__COHORT                           -
[-        ] process > MERGE_WF:SNP_ANALYSIS:GATK_SELECT_VARIANTS__SNP                                       -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:GATK_VARIANT_RECALIBRATOR__ANN7  -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:UTILS_ELIMINATE_ANNOTATION__ANN7 -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:GATK_VARIANT_RECALIBRATOR__ANN6  -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:UTILS_ELIMINATE_ANNOTATION__ANN6 -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:GATK_VARIANT_RECALIBRATOR__ANN5  -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:UTILS_ELIMINATE_ANNOTATION__ANN5 -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:GATK_VARIANT_RECALIBRATOR__ANN4  -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:UTILS_ELIMINATE_ANNOTATION__ANN4 -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:GATK_VARIANT_RECALIBRATOR__ANN3  -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:UTILS_ELIMINATE_ANNOTATION__ANN3 -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:GATK_VARIANT_RECALIBRATOR__ANN2  -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:UTILS_ELIMINATE_ANNOTATION__ANN2 -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:UTILS_SELECT_BEST_ANNOTATIONS    -
[-        ] process > MERGE_WF:SNP_ANALYSIS:GATK_APPLY_VQSR__SNP                                            -
[-        ] process > MERGE_WF:SNP_ANALYSIS:GATK_SELECT_VARIANTS__EXCLUSION__SNP                            -
[-        ] process > MERGE_WF:INDEL_ANALYSIS:GATK_SELECT_VARIANTS__INDEL                                   -
[-        ] process > MERGE_WF:GATK_MERGE_VCFS__INC                                                         -
[-        ] process > MERGE_WF:MAJOR_VARIANT_ANALYSIS:TBPROFILER_VCF_PROFILE__COHORT                        -
[-        ] process > MERGE_WF:MAJOR_VARIANT_ANALYSIS:TBPROFILER_COLLATE__COHORT                            -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__EXCOMPLEX:GATK_SELECT_VARIANTS__PHYLOGENY                -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__EXCOMPLEX:GATK_VARIANTS_TO_TABLE                         -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__EXCOMPLEX:SNPSITES                                       -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__EXCOMPLEX:SNPDISTS                                       -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__EXCOMPLEX:IQTREE                                         -
[-        ] process > MERGE_WF:CLUSTER_ANALYSIS__EXCOMPLEX:CLUSTERPICKER__5SNP                              -
[-        ] process > MERGE_WF:CLUSTER_ANALYSIS__EXCOMPLEX:CLUSTERPICKER__12SNP                             -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__INCCOMPLEX:GATK_SELECT_VARIANTS__PHYLOGENY               -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__INCCOMPLEX:GATK_VARIANTS_TO_TABLE                        -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__INCCOMPLEX:SNPSITES                                      -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__INCCOMPLEX:SNPDISTS                                      -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__INCCOMPLEX:IQTREE                                        -
[-        ] process > MERGE_WF:CLUSTER_ANALYSIS__INCCOMPLEX:CLUSTERPICKER__5SNP                             -
[-        ] process > MERGE_WF:CLUSTER_ANALYSIS__INCCOMPLEX:CLUSTERPICKER__12SNP                            -
[-        ] process > REPORTS_WF:MULTIQC                                                                    -
[-        ] process > REPORTS_WF:UTILS_SUMMARIZE_RESISTANCE_RESULTS                                         -
[-        ] process > REPORTS_WF:UTILS_SUMMARIZE_RESISTANCE_RESULTS_MIXED_INFECTION                         -
WARN: There's no process matching config selector: VALIDATE_FASTQS_WF:SAMPLESHEET_VALIDATION
ERROR ~ Unknown method invocation `splitJson` on UnixPath type

 -- Check '.nextflow.log' file for details

@abhi18av
Member

Then I think the problem might be with your Java setup. Could you please confirm you're using an LTS version, as mentioned here: https://github.com/TORCH-Consortium/MAGMA?tab=readme-ov-file#nextflow ?

@RuanSpies21
Author

I can confirm I'm using an LTS version of Java (17).

I don't seem to get the same error when using the alpha pre-release of v2.0.0: nextflow run 'https://github.com/TORCH-Consortium/MAGMA' -profile docker,server -r v2.0.0-alpha -params-file params.yaml

In this case the pipeline runs successfully through the samplesheet validation step.

@abhi18av
Member

abhi18av commented Sep 18, 2024

Mmm, then the next suspect is the version of Nextflow; I think pinning a newer one should fix the problem.

Could you please test with the following command? 🙏

NXF_VER=24.04.4 nextflow run 'https://github.com/TORCH-Consortium/MAGMA' -profile docker,server,test -r hotfix/bwa_k66

If this works, then I will set the minimum Nextflow version to 24.04.x in the pipeline, and you should upgrade by typing nextflow -self-update
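
A quick way to check an installed Nextflow against that minimum before launching is sketched below. It assumes the version text contains `version X.Y.Z` (as printed by `nextflow -version`) and uses 24.04.0 as the threshold based on the "24.04.x" mentioned above; the helper names are hypothetical:

```python
import re


def parse_nextflow_version(text):
    """Extract (major, minor, patch) from text containing 'version X.Y.Z'."""
    match = re.search(r"version\s+(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no Nextflow version found in: %r" % text)
    return tuple(int(part) for part in match.groups())


def meets_minimum(text, minimum=(24, 4, 0)):
    """True if the parsed version is at least `minimum`."""
    return parse_nextflow_version(text) >= minimum


# Hypothetical usage, assuming `nextflow` is on PATH:
# import subprocess
# banner = subprocess.run(["nextflow", "-version"],
#                         capture_output=True, text=True).stdout
# print(meets_minimum(banner))
```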

@RuanSpies21
Author

Ok great! Test seems to have worked. Thanks for the help. Will give it a bash with these old sequences now - holding thumbs, will let you know how it goes.

@RuanSpies21
Author

nextflow run 'https://github.com/TORCH-Consortium/MAGMA' -profile docker,server,test -r hotfix/bwa_k66

Just getting loads of fails for VALIDATE_FASTQS_WF:FASTQ_VALIDATOR
[100%] 14 of 14, failed: 12, retries: 8 ✔ - from test profile. So only 1 of the 3 samples is actually processed
[100%] 830 of 830, failed: 738, retries: 492 ✔ - from my sequences. Only 46/169 samples processed

@abhi18av
Member

Good, so we're past the setup issues.

> [100%] 14 of 14, failed: 12, retries: 8 ✔ - from test profile. So only 1 of the 3 samples is actually processed

I wouldn't worry too much about the test samples: files downloaded from NCBI over FTP often get corrupted in transit when network or disk performance is poor.

> [100%] 830 of 830, failed: 738, retries: 492 ✔ - from my sequences. Only 46/169 samples processed

So it seems likely that these samples were corrupted either while downloading or while being moved across external disks/computers.

⚠️ That is the reason why we ended up adding a separate VALIDATE_FASTQS_WF:FASTQ_VALIDATOR process.

One file you might want to inspect is QC_statistics/cohort/fastq_validation/magma_analysis.json, which should gather information about the files, such as md5sum and size, along with stats generated by seqkit etc. That might be useful in debugging the failing samples.
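
Independently of the pipeline, a suspect file can be checked by hand. The sketch below is not what FASTQ_VALIDATOR does internally; it is a simple standalone check, assuming four-line gzipped FASTQ records, that computes an md5 checksum and walks the whole gzip stream. The function names are hypothetical:

```python
import gzip
import hashlib
import zlib


def md5_of_file(path, chunk_size=1 << 20):
    """md5 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fastq_gz(path):
    """Return (ok, message) after decompressing and walking every record."""
    records = 0
    try:
        with gzip.open(path, "rt", errors="replace") as handle:
            while True:
                header = handle.readline()
                if not header:
                    break  # clean end of file
                seq = handle.readline().rstrip("\n")
                plus = handle.readline()
                qual = handle.readline().rstrip("\n")
                if not header.startswith("@") or not plus.startswith("+"):
                    return False, "malformed record %d" % (records + 1)
                if len(seq) != len(qual):
                    return False, "length mismatch in record %d" % (records + 1)
                records += 1
    except (EOFError, OSError, zlib.error) as err:
        # Truncated or damaged gzip streams end up here.
        return False, "gzip stream unreadable after %d records: %s" % (records, err)
    return True, "%d records OK" % records
```

The md5 digest can be compared with a checksum from the archive where one is available, and a truncated download typically shows up as an unreadable gzip stream.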

@abhi18av
Member

I'd recommend downloading your samples from NCBI/ENA using the nf-core/fetchngs pipeline (https://nf-co.re/fetchngs/1.12.0/docs/usage/), which makes sure the samples are not corrupted.

@RuanSpies21
Author

Thanks for this @abhi18av. It's been a long journey together now 😂. It seems the pipeline really does not like these old files.

I re-downloaded some of them with nf-core/fetchngs, but a large number of failures persist at VALIDATE_FASTQS_WF:FASTQ_VALIDATOR.

Further, those that do pass have 0 coverage.
magma_analysis.json and joint.merged_cohort_stats.tsv attached for interest.

As a sanity check, a batch of newer fastqs processed successfully, so the setup is fine.
[magma_analysis.json](https://github.com/user-attachments/files/17054048/magma_analysis.json)
joint.merged_cohort_stats.txt

@abhi18av
Member

abhi18av commented Sep 19, 2024

Hi @RuanSpies21

> It seems the pipeline really does not like these old files.

Actually, I would need more evidence to believe that, since we've been using MAGMA to analyse all Brazilian and South African sequences from SRA produced in the last 20 years, and unless there's something wrong with the samples themselves, they get through.

That is the reason we added the JSON file: so that we can have a better overview of the samples which failed. Could you please share the QC_statistics/cohort/fastq_validation/magma_analysis.json file with me?

> Further, those that do pass have 0 coverage.

Indeed, the results here are very suspicious. I will try to run these samples on my end to see if they are at least reproducible:

SAMPLE AVG_INSERT_SIZE MAPPED_PERCENTAGE RAW_TOTAL_SEQS AVERAGE_BASE_QUALITY MEAN_COVERAGE SD_COVERAGE MEDIAN_COVERAGE MAD_COVERAGE PCT_EXC_ADAPTER PCT_EXC_MAPQ PCT_EXC_DUPE PCT_EXC_UNPAIRED PCT_EXC_BASEQ PCT_EXC_OVERLAP PCT_EXC_CAPPED PCT_EXC_TOTAL PCT_1X PCT_5X PCT_10X PCT_30X PCT_50X PCT_100X LINEAGES FREQUENCIES MAPPED_NTM_FRACTION_16S MAPPED_NTM_FRACTION_16S_THRESHOLD_MET COVERAGE_THRESHOLD_MET BREADTH_OF_COVERAGE_THRESHOLD_MET RELABUNDANCE_THRESHOLD_MET ALL_THRESHOLDS_MET
MAGMA.ERX023849_ERR046787 0.0 0.0 7272484.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX023851_ERR046789 0.0 0.0 3365516.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX023852_ERR046790 0.0 0.0 7425878.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX023853_ERR046791 0.0 0.0 6207574.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX023885_ERR046823 0.0 0.0 6324052.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX023913_ERR046851 0.0 0.0 6399012.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX023975_ERR046913 0.0 0.0 6674844.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX024002_ERR046940 0.0 0.0 6617920.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX024012_ERR046950 0.0 0.0 6311164.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX049831_ERR072065 0.0 0.0 3248118.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX049843_ERR072077 0.0 0.0 3311862.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX049846_ERR072080 0.0 0.0 2881802.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0

@RuanSpies21
Author

Ah ok I see.

Here is the QC_statistics/cohort/fastq_validation/magma_analysis.json file
magma_analysis.json

@abhi18av
Member

Hi @RuanSpies21, just letting you know that I'm still tracking this; I'm running into some resource constraints on our shared server these days.

@RuanSpies21
Author

No worries @abhi18av! Thank you so much - you have already been so accommodating.
