Is it possible to scaffold with Hi-C data from another individual? #95
Comments
Hi @YPGG1234, The trio binning mode of Hifiasm can typically yield high-quality haplotype-phased assemblies, with contigs from each parent already output into two separate GFA files. At this point, it may be feasible to perform scaffolding using Hi-C sequencing data from another individual, but this would require scaffolding each haplotype separately. This is because the haplotype composition differs between individuals and cannot correctly guide haplotype-phased assembly/scaffolding. For the strategy of scaffolding each haplotype separately, I can think of three potential issues:
In summary, performing phased scaffolding without Hi-C data from the same individual presents challenges, especially when sex chromosomes are involved. Whether this problem can be fully resolved also depends on the characteristics of the species, and at present, I cannot provide a definitive answer. Best,
Thanks for your quick and detailed reply! I knew it would be better to use Hi-C data from the same individual. I understand from your reply that it is better to merge the haplotigs for scaffolding. I was wondering whether scaffolding the merged haplotigs with Hi-C data from a different individual is what causes the filtered BAM to become very small, rather than the low heterozygosity of the species. If the merged haplotigs are scaffolded with Hi-C data from the same individual, does the filtered BAM retain more data?
This should not be the main reason affecting the size of the BAM file. The main factor is still the heterozygosity level. If the heterozygosity level is too low, many Hi-C reads will have multiple alignments and will be filtered out by the MAPQ>=1 criterion. Additionally, if the genome sequence differences between different individuals are very large, such as exceeding 2-3%, the |
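The effect of the MAPQ>=1 criterion described above can be illustrated with a minimal Python sketch. It is not the `filter_bam` implementation: the function name `mapq_pass_fraction` and the toy SAM records are hypothetical, and the sketch only counts how many records would survive a MAPQ threshold, assuming that reads aligning equally well to both haplotypes are reported with MAPQ 0.

```python
def mapq_pass_fraction(sam_lines, min_mapq=1):
    """Return the fraction of SAM alignment records with MAPQ >= min_mapq.

    Illustrative helper only (hypothetical name, not part of HapHiC).
    """
    total = 0
    kept = 0
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])  # column 5 of a SAM record is MAPQ
        total += 1
        if mapq >= min_mapq:
            kept += 1
    return kept / total if total else 0.0


# Toy records (made up for illustration): two reads placed uniquely (MAPQ 60)
# and two reads that align equally well to both haplotypes (MAPQ 0), as is
# common when heterozygosity is low.
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t65\tctg1\t100\t60\t50M\tctg2\t900\t0\t*\t*",
    "r2\t65\tctg1\t200\t0\t50M\tctg3\t500\t0\t*\t*",
    "r3\t65\tctg2\t300\t60\t50M\tctg2\t800\t0\t*\t*",
    "r4\t65\tctg3\t400\t0\t50M\tctg1\t700\t0\t*\t*",
]

print(mapq_pass_fraction(toy_sam))  # 0.5
```

On real data, the same count could be taken from the MAPQ column of the alignments before and after filtering. The lower the heterozygosity, the larger the share of MAPQ 0 records and the smaller the filtered BAM.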
Thank you for your reply!
Hello Zeng, I am still in the process of merging the two haplotigs. The Hi-C data for the same individual is currently being sequenced, so I am still testing with Hi-C data from a different individual. The result from the standard workflow is shown in the figure below: However, when I use the unfiltered BAM file instead of the filtered one and run Juicer with the -q 0 option, the result improves somewhat. The plot looks similar to the one in #34. Does this mean that the abnormal interactions will disappear if I use Hi-C data from the same individual? Or should I use a method in which the haplotigs are assembled separately?
Hi @YPGG1234, The main issue is not with the threshold for MAPQ filtering, but rather with not using Hi-C data from the same individual. This is similar to the issue mentioned in #34: when you switch to Hi-C data from the same individual, the strong, diagonally distributed signals between homologous chromosomes will disappear.
This is also an option worth considering since trio data were used for haplotype phasing (the contig-level assembly should have been correctly phased). However, this approach might overlook structural variations that could exist between haplotypes and introduce differences between individuals. Given that you are already sequencing Hi-C data from the same individual, it would be better to use that data for the final version. Best,
Thank you for your reply!
Dear Zeng, We have now obtained Hi-C data for the offspring (using BGI's T7 platform with 100x coverage) and used it to perform a phased assembly of the two haplotypes. This approach aims to better assemble the sex chromosomes, with parameters consistent with those used for HG002 (#46). As you mentioned, this species exhibits relatively low heterozygosity. Even with our own Hi-C data, the BAM file shrank substantially after filtering (from 271 Gb to 34 Gb). Despite this, I still prefer scaffolding the combined haplotypes over scaffolding each haplotype separately. To find the optimal parameters, I tried two parameter settings for the filtering and juicer steps, as follows:
The results with the default parameters (-q 1 for both steps) were indeed better. However, upon zooming in on the chromosomes, it is apparent that for both configurations, the Hi-C coverage for each chromosome seems insufficient, displaying a mosaic-like interaction pattern along the chromosomes:
Our sequencing depth should be adequate (as seen in the Looking forward to your suggestions. |
I am pleased to see that your results have improved significantly, as evidenced by the noticeable reduction in Hi-C signals between homologous chromosomes, which aligns with our previous expectations. You mentioned that the low heterozygosity of the genome led to a substantial amount of Hi-C links being filtered out; however, from the figures you provided, it does not seem to be a big problem. This is because your contigs are long enough, and thus weakened signals should not greatly impact your scaffolding results. If you wish to retain more Hi-C signals for manual adjustment in Juicebox, you can visualize the scaffolds using the unfiltered BAM file instead. You can achieve this simply by replacing the filtered BAM ( In your first figure, some chromosomes appear not to cluster together as expected, which I suspect might be due to the nchrs parameter being set higher than the actual number of chromosomes. However, these results are overall good and can be easily adjusted in Juicebox without the need for tuning the HapHiC parameters. |
Thank you for your reply, I will try it!
I have another question to ask: when not using |
The |
Dear Zeng, Thank you for your reply! I have now run into another problem when generating the adjusted FASTA.
Got an error
But I have not modified any content of the assembly file, may I ask how this happened?
Besides, I have a question to ask you. I now have another genome (contig size ~2.9 Gb, Hi-C coverage ~66x) with a different issue. I ran the following commands using the default pipeline:
filter_bam HiC.bam 1 --nm 3 --threads 60 | samtools view - -b -@ 60 -o HiC.filtered.bam
haphic pipeline $genome HiC.filtered.bam 14 --threads 8 --processes 8 --quick_view
In the final results, the heatmap appears to show very limited interactions at both ends, while these regions seem to exhibit strong interactions with the ends of other chromosomes. I suspect this might be caused by repetitive sequences. What could be the reason for this? The chromosome number of this species is n=14.
Thank you for developing such excellent software! I am currently using Triobin to assemble offspring individuals, but I don't have Hi-C data from the same individual for scaffolding. Instead, I only have ~80x Hi-C data from another individual. Could you please let me know what potential issues might arise from doing this?
I tested both approaches, combining the two haplotigs and processing them separately, following the officially recommended workflow.
I noticed that after the recommended filter_bam step (MAPQ>=1), the remaining data in the combined approach seemed overly reduced (the Hi-C BAM file shrank from 156 GB to 17 GB after filtering). Might this indicate insufficient heterozygosity? Or did the use of Hi-C data from a different individual cause a large amount of data to be filtered out? As a result, I abandoned the combined-haplotigs approach. The results from processing the two haplotigs separately seem to have some minor issues, but the problems don't appear to be very significant. The main issues in both hap1 and hap2 are concentrated on the relatively complex sex chromosomes of this species.
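To put the reported reduction in perspective, the retained share can be computed directly from the file sizes quoted in this thread (156 GB to 17 GB with Hi-C data from another individual, and 271 Gb to 34 Gb with Hi-C data from the same individual, units as written by the poster). This is only a rough sketch: the helper `retained_fraction` is hypothetical, and BAM file size is only an approximate proxy for the number of retained alignments because of compression.

```python
def retained_fraction(size_before, size_after):
    """Return the share of data (by file size) left after filtering."""
    if size_before <= 0:
        raise ValueError("size_before must be positive")
    return size_after / size_before


# File sizes as reported in this thread; labels are descriptive only.
reports = [
    ("Hi-C from another individual", 156, 17),
    ("Hi-C from the same individual", 271, 34),
]

for label, before, after in reports:
    print(f"{label}: {retained_fraction(before, after):.1%} retained")
# Hi-C from another individual: 10.9% retained
# Hi-C from the same individual: 12.5% retained
```

The two shares are close, which is consistent with the view expressed in the replies that low heterozygosity, rather than the source of the Hi-C data, is the main driver of the reduction.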
hap1 suspicious sex chromosomes (I moved the lower right suspicious sex chromosome to the upper left with the interaction)
hap2 suspicious sex chromosomes
However, the overall Hi-C map doesn’t seem to indicate significant problems.
hap1 whole genome
hap2 whole genome
Could I use Hi-C data from another individual for scaffolding? Do you have any suggestions for assembling such complex sex chromosome structures? Can I use the human HG002 parameters when assembling the two haplotigs?
I would appreciate your reply!