Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is it possible to scaffolding with other individual HiC data? #95

Open
YPGG1234 opened this issue Dec 18, 2024 · 13 comments
Open

Is it possible to scaffolding with other individual HiC data? #95

YPGG1234 opened this issue Dec 18, 2024 · 13 comments

Comments

@YPGG1234
Copy link

YPGG1234 commented Dec 18, 2024

Thank you for developing such an excellent software! I am currently using Triobin to assemble offspring individuals

yak count -k31 -b37 -t16 -o pat.yak <(zcat $paternal) <(zcat $paternal) 
yak count -k31 -b37 -t16 -o mat.yak <(zcat $maternal) <(zcat $maternal)
hifiasm -o $prefix -t 32 -1 pat.yak -2 mat.yak $child_hifi)

but I don't have Hi-C data from the same individual for scaffolding. Instead, I only have ~80x Hi-C data from another individual. Could you please let me know what potential issues might arise from doing this?

I tested the results of combining two haplotigs and processing them separately, following the official recommended workflow.

processing them separately

bwa mem -5SP -t 60 $genome $Hic_1 $Hic_2 | samblaster | samtools view - -@ 60 -S -h -b -F 3340 -o HiC.bam
filter_bam HiC.bam 0 --nm 3 --threads 60 | samtools view - -b -@ 60 -o HiC.filtered.bam
haphic pipeline $genome HiC.filtered.bam $nchr

I noticed that when following the recommended filter_bam step (MAPQ>=1), the remaining data in the combined processing approach seemed overly reduced, which might suggest insufficient heterozygosity? Or perhaps the use of Hi-C data from a different individual resulted in a large amount of sequence being filtered out (Hi-C BAM file size reduced from 156 GB to 17 GB after filtering). As a result, I abandoned using the combined haplotigs approach. The results from processing the two haplotigs separately seem to have some minor issues, but the problems don't appear to be very significant.

The main issues in the results of both hap1 and hap2 are concentrated on the relatively complex sex chromosomes of this species.
hap1 suspicious sex chromosomes (I moved the lower right suspicious sex chromosome to the upper left with the interaction)
image

hap2 suspicious sex chromosomes
image

However, the overall Hi-C map doesn’t seem to indicate significant problems.
hap1 whole genome
image

hap2 whole genome
image

Could I use Hi-C data from another individual for scaffolding? Do you have any suggestions for assembling such complex sex chromosome structures? Can I use the human HG002 parameters when assembling the two haplotigs?

I would appreciate your reply!

@zengxiaofei
Copy link
Owner

zengxiaofei commented Dec 20, 2024

Hi @YPGG1234,

The trio binning mode of Hifiasm can typically yield high-quality haplotype-phased assemblies, with contigs from each parent already output into two separate GFA files. At this point, it may be feasible to perform scaffolding using Hi-C sequencing data from another individual, but this would require scaffolding each haplotype separately. This is because the haplotype composition differs between individuals and cannot correctly guide haplotype-phased assembly/scaffolding.

For the strategy of scaffolding each haplotype separately, I can think of three potential issues:

  1. There may be chromosomal structural variations between different individuals, which could lead to incorrect ordering and orientation of contigs.

  2. Similarly, within a single individual, there can also be significant structural differences between homologous chromosomes, especially between sex chromosomes.

  3. If sex chromosomes have a structure similar to that of the human genome (with PAR and non-PAR regions), this can also lead to certain issues. Firstly, the PAR regions, due to their high similarity, may introduce errors during the upstream assembly and phasing processes. For instance, it's possible that the PAR regions of the two sex chromosomes might be incorrectly assembled into a single sequence (forming collapsed contigs). Moreover, due to low homology, the non-PAR regions of the two sex chromosomes might be incorrectly phased into the same haplotype by upstream assembly software (phasing error). Secondly, because the non-PAR regions exhibit significant differences, their actual sequencing depth is only half that of autosomes and PAR regions. This would be reflected in the heatmap as lower interaction signals for this region.

In summary, performing phased scaffolding without Hi-C data from the same individual presents challenges, especially when sex chromosomes are involved. Whether this problem can be fully resolved also depends on the characteristics of the species, and at present, I cannot provide a definitive answer.

Best,
Xiaofei

@YPGG1234
Copy link
Author

Thanks for your quick and detailed reply! I knew it would be better to use hic data from the same individual. I know from your reply that it is better to merge the haplotigs together for scaffolding. I was wondering if scaffolding the merged haplotigs with hic from different individuals would cause the filter bam to become very small, rather than responding to low species heterozygosity? If the merged haplotigs are scaffolded with hic data from the same individual, does the filter bam retain more data.

@zengxiaofei
Copy link
Owner

This should not be the main reason affecting the size of the BAM file. The main factor is still the heterozygosity level. If the heterozygosity level is too low, many Hi-C reads will have multiple alignments and will be filtered out by the MAPQ>=1 criterion. Additionally, if the genome sequence differences between different individuals are very large, such as exceeding 2-3%, the --nm parameter might filter out many alignments due to mismatches.

@YPGG1234
Copy link
Author

Thank you for your reply!

@YPGG1234
Copy link
Author

YPGG1234 commented Jan 9, 2025

Hello Zeng,

I am still in the process of merging two haplotigs. The Hi-C data for the same individual is currently being sequenced, so I am still testing with Hi-C data from different individuals.

I found that the result from the standard workflow is shown in the figure below:

image

However, when I use the unfiltered BAM file instead of the filtered one, and run Juicer with the -q 0 option, the result improves somewhat.

image

The plot looks similar to #34 . Does this mean that if I use Hi-C data from the same individual, the abnormal interactions will disappear? Or should I use a method where haplotigs are assembled separately?

@YPGG1234 YPGG1234 reopened this Jan 9, 2025
@zengxiaofei
Copy link
Owner

Hi @YPGG1234,

The main issue is not with the threshold for MAPQ filtering, but rather with not using Hi-C data from the same individual. This is similar to the issue mentioned in #34: when you switch to Hi-C data from the same individual, the strong, diagonally distributed signals between homologous chromosomes will disappear.

Or should I use a method where haplotigs are assembled separately?

This is also an option worth considering since trio data were used for haplotype phasing (the contig-level assembly should have been correctly phased). However, this approach might overlook structural variations that could exist between haplotypes and introduce differences between individuals. Given that you are already sequencing Hi-C data from the same individual, it would be better to use that data for the final version.

Best,
Xiaofei

@YPGG1234
Copy link
Author

Thank you for your reply!

@YPGG1234
Copy link
Author

YPGG1234 commented Jan 15, 2025

Dear Zeng,

We have now obtained Hi-C data for the offspring (using BGI's T7 platform with 100x coverage) and used it to perform a phased assembly of the two haplotypes. This approach aims to better assemble the sex chromosomes, with parameters consistent with those used for HG002 (#46). As you mentioned, this species exhibits relatively low heterozygosity. Even with our Hi-C data, the filtered BAM files underwent a significant reduction in size (from 271 Gb to 34 Gb).

Despite this, I still prefer to scaffold combined haplotypes over separate scaffold haplotype. To evaluate the optimal parameters, I tried two parameters for the filtering and juicer steps, as follows:

Filter: -q 1 | Juicer: -q 1 (default)
fig_1

unfiltered BAM | Juicer: -q 1
fig_1

The results with the default parameters (-q 1 for both steps) were indeed better. However, upon zooming in on the chromosomes, it is apparent that for both configurations, the Hi-C coverage for each chromosome seems insufficient, displaying a mosaic-like interaction pattern along the chromosomes:

Filter: -q 1 | Juicer: -q 1 (default)
fig_2

unfiltered BAM | Juicer: -q 1
fig_2

Our sequencing depth should be adequate (as seen in the unfiltered BAM | Juicer: -q 1 parameters). Could you advise on how I might improve the results while ensuring sufficient coverage? Are there specific parameters you recommend adjusting to achieve this?

Looking forward to your suggestions.

@zengxiaofei
Copy link
Owner

I am pleased to see that your results have improved significantly, as evidenced by the noticeable reduction in Hi-C signals between homologous chromosomes, which aligns with our previous expectations. You mentioned that the low heterozygosity of the genome led to a substantial amount of Hi-C links being filtered out; however, from the figures you provided, it does not seem to be a big problem. This is because your contigs are long enough, and thus weakened signals should not greatly impact your scaffolding results.

If you wish to retain more Hi-C signals for manual adjustment in Juicebox, you can visualize the scaffolds using the unfiltered BAM file instead. You can achieve this simply by replacing the filtered BAM (HiC.filtered.bam) in 04.build/juicebox.sh with the unfiltered one (HiC.bam). Although this approach will introduce many incorrectly mapped Hi-C reads, it should be acceptable in your case due to the long contigs and low heterozygosity.

In your first figure, some chromosomes appear not to cluster together as expected, which I suspect might be due to the nchrs parameter being set higher than the actual number of chromosomes. However, these results are overall good and can be easily adjusted in Juicebox without the need for tuning the HapHiC parameters.

@YPGG1234
Copy link
Author

Thank you for your reply, I will try it !

@YPGG1234
Copy link
Author

I have another question to ask: when not using correct_round but applying parameters such as remove_allelic_links, density_upper, remove_concentrated_links, and normalize_by_nlinks, will Haphic behave like round 0 in 3D-DNA, where the contigs' structures are preserved and only their order and orientation are adjusted? Or will it also disrupt the structure of collapsed or chimeric haplotigs?

@zengxiaofei
Copy link
Owner

The --correct_nrounds parameter is used to correct chimeric contigs. If this parameter is not specified, no contigs will be broken. In your case, I do not recommend using the --correct_nrounds parameter because current assemblers typically produce very few chimeric contigs when assembling HiFi data. Moreover, the species you are studying exhibits a relatively complex chromatin conformation, as evidenced by contact maps that show a checkerboard-like pattern—a feature very common in mammals—and many Hi-C links have been filtered out. These factors could result in correctly assembled contigs being incorrectly fragmented.

@YPGG1234
Copy link
Author

YPGG1234 commented Jan 20, 2025

Dear Zeng,

Thank you for your reply!

I have now discovered another problem when generate adjusted fasta.

#! /bin/bash

ln -s allhap.fa .
samtools faidx allhap.fa
juicer pre -a -q 1 -o out_JBAT HiC.bam  scaffolds.raw.agp allhap.fa.fai >out_JBAT.log 2>&1
(juicer_tools.1.9.9_jcuda.0.8.jar pre out_jbat.txt out_jbat.hic. part <(cat out_JBAT.log | grep PRE_C_SIZE | awk '{print $2" "$3}')) && (mv out_JBAT.hic.part out_JBAT.hic)
juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp allhap.fa

Got an error

[I::main_post] writing FASTA file for scaffolds
[E::make_asm_dict_from_agp] invalid sequence component h2tg000301l:1-223719 on scaffold_1
[E::make_asm_dict_from_agp] sequence end position (223719) greater than sequence length (28784)

But I have not modified any content of the assembly file, may I ask how this happened?

out_JABT.zip

Besides, I have a question to ask you. I now have another genome (contig size ~2.9 Gb, Hi-C coverage ~66x) with a different issue. I ran the following commands using the default pipeline:

filter_bam HiC.bam 1 --nm 3 --threads 60 | samtools view - -b -@ 60 -o HiC.filtered.bam
haphic pipeline $genome HiC.filtered.bam 14 --threads 8 --processes 8 --quick_view

In the final results, the heatmap appears to show very limited interactions at both ends, while these regions seem to exhibit strong interactions with the ends of other chromosomes. I suspect this might be caused by repetitive sequences. What could be the reason for this? The chromosome number of this species is n=14.

Image

Image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants