
Finding microdeletions #36

Open
GreeshmaThulasi opened this issue May 9, 2018 · 8 comments

@GreeshmaThulasi

Hi,
I have single-end sequenced maternal blood data.
I ran WISECONDOR and got a result like this:
```
# BAM information: #
Reads mapped: 4021312
Reads unmapped: 24027
Reads nocoord: 24027
Reads rmdup: 546798
Reads lowqual: 132340

RETRO filtering:

Reads in: 4676255
Reads removed: 1356445
Reads out: 3319810

Z-Score checks:

Z-Score used: 5.08
AvgStdDev: 6.23%
AvgAllStdDev: 31.68%

Test results:

z-score effect mbsize location
6.19 2.07 80.75 2:34250000-115000000
-6.26 -6.52 11.25 7:52500000-63750000
5.14 6.60 7.00 7:75250000-82250000
-6.86 -3.11 49.50 19:9750000-59250000
```
Does a z-score above the 5.08 threshold mean a duplication, and a z-score below the threshold a deletion?
What's the effect field for?
Does the location indicate the chromosome and the start-end base positions?
I read that there are different methods, like:
Single bin, bin test
Single bin, aneuploidy test
Windowed, bin test
Windowed, aneuploidy test
Chromosome wide, aneuploidy test
How do I choose between these methods?
I followed the steps described at https://github.com/VUmcCGP/wisecondor
Please describe the steps to find microdeletions along the chromosomes.

Looking forward to hearing from you.
Thanks in advance,
Greeshma

@rstraver
Collaborator

rstraver commented May 9, 2018

Hi Greeshma,

I'm not sure exactly what size the microdeletions you are looking for would be; WISECONDOR was originally written to target fairly lengthy but barely deviating CNVs. I have found the latest version was able to find a CNV between 3 and 4 Mb, but I'd be hesitant to just take such short CNV results as truth without further testing.

To answer your questions directly:

Does a z-score above the 5.08 threshold mean a duplication, and a z-score below the threshold a deletion?

A z-score above 5.08 means a duplication, and a negative z-score beyond -5.08 means a deletion. Anything between -5.08 and 5.08 is considered unaffected.
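
That rule can be written down directly; a minimal sketch (the 5.08 threshold is the "Z-Score used" value from your run output, not a fixed constant, and the function name is just for illustration):

```python
# Z_THRESHOLD is taken from the "Z-Score used" line of the run output,
# so it varies per run; 5.08 is the value from the output above.
Z_THRESHOLD = 5.08

def classify_call(z_score: float, threshold: float = Z_THRESHOLD) -> str:
    """Label a region as duplication, deletion, or unaffected."""
    if z_score > threshold:
        return "duplication"
    if z_score < -threshold:
        return "deletion"
    return "unaffected"

print(classify_call(6.19))   # duplication
print(classify_call(-6.26))  # deletion
```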

What's the effect field for?

That's the effect size: the determined percentage of copy number change for that particular region. If it says 100, it found twice as many DNA fragments as expected; 5 means it found 5% more DNA fragments.
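
In other words (a back-of-the-envelope sketch of the definition, not WISECONDOR's actual code):

```python
def effect_size(observed: float, expected: float) -> float:
    """Percent change in fragment count relative to the expectation.

    100 means twice as many fragments as expected, 5 means 5% more,
    and negative values mean fewer fragments than expected.
    """
    return (observed / expected - 1.0) * 100.0

print(effect_size(200, 100))  # 100.0 -> twice as many fragments
print(effect_size(105, 100))  # 5.0   -> 5% more fragments
```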

Does the location indicate the chromosome and the start-end base positions?

That is correct, it is chr:start-end.
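
So a result row can be split mechanically; a hypothetical helper, assuming the whitespace-separated format shown in the output above:

```python
def parse_result_line(line: str):
    """Split one 'z-score effect mbsize location' row from the output."""
    z, effect, mbsize, location = line.split()
    chrom, span = location.split(":")
    start, end = span.split("-")
    return float(z), float(effect), float(mbsize), chrom, int(start), int(end)

print(parse_result_line("-6.26 -6.52 11.25 7:52500000-63750000"))
# (-6.26, -6.52, 11.25, '7', 52500000, 63750000)
```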

I read that there are different methods, like:
Single bin, bin test
Single bin, aneuploidy test
Windowed, bin test
Windowed, aneuploidy test
Chromosome wide, aneuploidy test
How do I choose between these methods?

Those were implemented in an older version of WISECONDOR (as described in the paper). If you wish to use that version you can find it in the legacy branch:
https://github.com/VUmcCGP/wisecondor/tree/legacy
The master branch at this point has no such separate tests; instead, it uses a segmentation algorithm to find the optimal CNV boundaries.

If you really aim to find small CNVs, the input data may be a bit limiting: it seems you have ~4 million reads. I'd suggest trying something over ~10 million, and using a fairly large set of training samples if you are unable to find known short CNVs.

Additionally, I believe this fork of WISECONDOR could be of interest to you, as it should contain several improvements over my work:
https://github.com/leraman/wisecondorX
Perhaps that can help you find microdeletions better. It should be faster, and it is actively being developed right now.

Let me know if something is still unclear.

@GreeshmaThulasi
Author

Thank you so much, Roy Straver.
Your reply was very clear and informative.
I will use samples with more than 8 million reads.
One more question:
While creating the reference set, should the samples be normal, i.e. without any microdeletions or microduplications?
If we add reference samples with read counts of, say, 8 million, 10 million, and 12 million, will it affect the efficacy of the tool? Do we need to keep the read counts within a stringent range, e.g. 11-12 million only (by excluding samples with low or high coverage)?
Does this tool use a sliding-window approach?
Is it better to reduce or to increase the bin size?
I will try the extended version, WisecondorX, too.

Thank you
Greeshma.

@rstraver
Collaborator

rstraver commented May 9, 2018

While creating the reference set, should the samples be normal, i.e. without any microdeletions or microduplications?

Training samples should preferably be without any CNVs. However, it's pretty much impossible to ensure that is true, and if you use many reference samples (i.e. hundreds) and a few (one or two) share the same CNV, I highly doubt it will influence your sensitivity much (if at all), as it's not really systematic behaviour.

If we add reference samples with read counts of, say, 8 million, 10 million, and 12 million, will it affect the efficacy of the tool? Do we need to keep the read counts within a stringent range, e.g. 11-12 million only (by excluding samples with low or high coverage)?

Anything from 5 to 20 million reads should be fine; there is no need to be very stringent unless you go to very small bin sizes, in which case you may want to ensure you have enough coverage per bin.
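
A quick sanity check for that (rough arithmetic, assuming reads spread roughly evenly over a ~3 Gb genome, so real per-bin counts will vary with mappability):

```python
GENOME_SIZE = 3_000_000_000  # approximate human genome size in bp

def reads_per_bin(total_reads: int, bin_size: int) -> float:
    """Expected average read count per bin, assuming a uniform spread."""
    return total_reads / (GENOME_SIZE / bin_size)

for bin_size in (50_000, 100_000, 1_000_000):
    avg = reads_per_bin(10_000_000, bin_size)
    print(f"{bin_size:>9} bp bins: ~{avg:.0f} reads/bin")
# For 10M reads: 50 kb bins give ~167 reads/bin, 100 kb ~333, 1 Mb ~3333.
```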

Does this tool use a sliding-window approach?

The master branch does not use the sliding-window approach; it has been replaced by a segmentation step. That step computes a Stouffer's z-score for a region of any possible length, making sure the z-score is the (absolute) maximum possible for that region.
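
The Stouffer combination itself is one line of arithmetic; a minimal sketch (not the actual segmentation code, which also searches over region boundaries to maximise the absolute combined score):

```python
import math

def stouffer_z(bin_z_scores: list[float]) -> float:
    """Combine per-bin z-scores for a candidate region into one z-score."""
    return sum(bin_z_scores) / math.sqrt(len(bin_z_scores))

# A run of mildly negative bins can reach significance together even
# though no single bin does on its own:
print(stouffer_z([-1.8, -2.1, -1.9, -2.0, -2.2]))  # about -4.47
```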

Is it better to reduce or to increase the bin size?

It's a trade-off: smaller bins mean less data per bin, but more bins to use as reference bins. It certainly needs more time per sample and may increase erratic behaviour at low coverage, but if enough training data is available it may also give good results on small CNVs.
Larger bins are the exact opposite, with the upside that the read coverage per bin is likely a bit more stable, allowing analysis of lower-coverage samples. Seeing that few bins are left to use as a reference at bin sizes above 2 Mb, I'd advise staying with smaller bin sizes. I'd guess about 100 kb, or maybe 50 kb, is the smallest you could try; beyond that the reliability and time per sample may not be worth it, but that may be solved in WisecondorX anyway.

@GreeshmaThulasi
Author

Hi @rstraver,
Is the effect of a microdeletion indicated with negative values?
If the effect of a microdeletion is -8.54, what does that indicate? For a reliable microdeletion, should the effect be a large negative value?

@rstraver
Collaborator

Assuming you are talking about the effect size, that value would mean it measured 8.54% fewer fragments than expected, which could indicate a microdeletion that is much smaller than the bin, or one that is only present in a subset of the cells analysed (mosaicism, in our case of cell-free DNA).
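
To put a number on the second case (back-of-the-envelope arithmetic, assuming the event is a heterozygous deletion, i.e. one of two copies lost, in some fraction of the sampled DNA):

```python
def affected_fraction(effect_pct: float, copy_change: float = -0.5) -> float:
    """Fraction of DNA carrying the event, given the measured effect size.

    A heterozygous deletion removes one of two copies (copy_change = -0.5),
    so an effect of -8.54% would fit roughly 17% of the fragments coming
    from cells (or cell-free DNA) carrying the deletion.
    """
    return (effect_pct / 100.0) / copy_change

print(affected_fraction(-8.54))  # ~0.17
```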

@GreeshmaThulasi
Author

For a significant microdeletion, how large should the effect size be?

@rstraver
Collaborator

I'm afraid that is not within my knowledge; I never aimed to find microdeletions and never tested for them. I suggest you set up some experiments to test the reliability of various thresholds for that.

WISECONDOR mostly uses a z-score threshold rather than an effect-size-based one, as the effect size may be quite high due to a not-so-reliable reference set, which the z-score takes into account. Also, you may find spikes in a few bins, or even single bins, that often turn out to be meaningless, so be careful with that...
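
One pragmatic guard against such single-bin spikes is to require calls to span a minimum size in addition to passing the z-score threshold; a hypothetical post-filter over the parsed result rows, not part of WISECONDOR itself:

```python
def keep_call(z: float, mbsize: float,
              z_threshold: float = 5.08, min_mb: float = 1.0) -> bool:
    """Drop calls that are too short to trust, regardless of their z-score.

    The 1 Mb floor is an arbitrary example value; it should be tuned
    against known events in your own data.
    """
    return abs(z) >= z_threshold and mbsize >= min_mb

print(keep_call(6.19, 80.75))  # True: long, significant region
print(keep_call(9.50, 0.25))   # False: single-bin spike, likely noise
```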

@GreeshmaThulasi
Author

Yes.
I think that's why I am getting some microdeletions and microduplications (maybe of no consequence), even for normal samples.
