Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible to adjust quality score median/mean of simulated reads? #22

Open
jelber2 opened this issue Sep 27, 2024 · 6 comments
Open

Possible to adjust quality score median/mean of simulated reads? #22

jelber2 opened this issue Sep 27, 2024 · 6 comments

Comments

@jelber2
Copy link

jelber2 commented Sep 27, 2024

EDIT I guess it would help if I were to read the publication to see especially the figures on time and amplitude domains' effects on base identity.

Hi, thank you very much for developing this tool. This issue is perhaps an enhancement, but not necessarily as it may be tied completely to the underlying k-mer model and only changeable through making a new k-mer model (I am not sure).

So here it goes:

Run squigulator

./squigulator --ont-friendly=yes --seed 1 -x dna-r10-prom -o test.blow5 -t 8 -f 10 <(echo -e ">1\nTACAACATGCACATAACTTTTCCATACATAATAAAATGATGCAAAGAAGATGTACATTCTAACCATAGCTGAAATCGGGGCCATTTGTACAAGATTAATTATTAACCACATAAGCCAAGAATTACTAATAAAATGTACTGCAAAATAGCTGAAAAACAATTGCATGATTGCAGCCAATCCAAGTACATAGAAAAACCTAGTGAAAAGAATATATGCCAAAAACCACTCTGCAACTAAGCCAAAAGCAGTTAAATCCCATTTAAAAGATGAAATGGTAATTTGTATAGTTTCTAAAGAAGGATAGGTGTCTAAAGAATCTAAACCACTAAGACAAACACTACAAGGTATAGAACCAGTACAGTAGGTTGCAATAGTGACATTAGTAGAGTTCAAATAGCCTTCTCTGTAACCAGTACAGTAAGAAGGCATGCCTAAATTAGACATTAAAACACCTAAAGCAGCGGTTGAGTAGATTAAAGAACCTAGGCAAACACTTAATAGTAAAAACCAAATTATAATATTTATCAGTTTAGAAAAATTAGGTGACTTCAAATAATTAAATGAAGCCTCTAGACAAAATTTACCGACACTCTTAACAGTATTCTTTGCTATAGTAGTCGGCATAGATGCTTTAATTCTAGAATTTGTACTTCTAGTAAAAGTACACAATTGTAGCAATAAAGTAAAGAAATAAGGCATATAATTAGTACAAACACGGTTTAAACACCGTGTAACTATGTTAGTAGTTGTACTAACAACTTTGTTAAGAAAAGGCTTAGCATAATTAGCTATAGTATCCCAAGGGACACTATTAACAGCAGCTAAACCATGAGTAGCAAGGGTTTTCAAACCTAATACTCTAGATAATTCATTAGGTTTCTTAATAGTAAGACTAGAATTGTCTACATAAGCAGCCATTAGATCTGTGTGGCCAACCTCTTCTGTAATTTTTAAACTATTATTTGCTGGTTTAAGTATAATGTCTCCTACAACTTCGGTAGTTTTCACATTACACTCAAGAACGTCTTTCTGTATGGTAGGATTTTCCACTACTTCTTCAGAGACTGGTTTTAGATCTTCGCAGGCAAGATTATCCATTCCCTGCGCGTCCTCTGACTTCAGTACATCAAACGAATTTGATGTTTCAACTGGTTTTGTGCTCCAAAGACAACGTATACACCAGGTATTTGGTTTATACGTGGCTTTATTAGTTGCATTGTTAACATGCCAAACAATAGGTTTATGTAACAATTTAGCTCCTTTCTTAAAAGAGGGTGTGTAGTGTTTATAATCAATAGCCACCACATCACCATTTAAGTCAGGGAAAAATGTAACTTTAAGCTCTCTTGAAGCAGGTTTCTTATAACCAGTTAACTGGTTTAAATCATCAGCAAATTTGATATTATCACATACAAACTTAAAATTATCGAAGCTTGCGTTTGGATATGGTTGGTTTGGTACAAGATCAATTGGTTGCTCTGTGAAATAAGAATTGTCTTTCTTATAATAATTGTCCAACTTAGGGTCAATTTCTGTACAAACAACACCATCCAATTTATAAGTAACTGGTTTTATGGTTGTTGTGTAACTGTTTTCTTTGTAGAAAACATCCGTAATAGGACCTTTGTATTCTGAGGACTTTGTAAGTAAAGCACCGTCTATGCAATACAAAGTTTCTTTAGAAGTTATATGTTTATAGTGACCACACTGGTAATTACCAGTGTACTCACTAGCACAAGTAAATGTACCATGCTTAAGTTCATACTGAGCAGGTGGTGCTGACATCATAACAAAAGGTGACTCCTGTTGTACTAGATATTTTGTAGCTTGTTTACCACACGTACAAGGTATCTGAACACCTTTCTTAAATTGTTCATAAGAAAGTGTGCCCATGTACATAACAGCTTCTACACCCTTAAGGGTTGTCTGCTGTTGTCCACAAGTTTTACACACCACGTTCAAGACTCTTTTGCAAGAATCTAAATTGGCATGTTGAAACAAGTAACTCATTGTTTCTCTAACATCACCTAACTCACCTACTGTCTTATTACAGTAGGCTAAGATAAGTGCACAAAAGTTAGCAGCTTCACCAGCCCTTGCTCTGTAATAAGCATCTTGTAGAGCAGGTGGATTAAACTTCAACTCTATTTGTTGGAGTGTTAACAATGCAGTGGCAAGATAACAGTTGTTATCTGCCCATTTAATAGAAGTTAAACCATTAACTTGTGGGTATTTCCACTTTTTAGTGTGATTTAATGCTGACATGTACCTACCCAGAAAACTAGGATCAGTTGTGTGGTAGTACTCAAAAGCCTCAACACGTAGAGTGTCATCATTAGGTAAAACATAAAATGTTTTACCTTCATGTGAATTATGAGGTTTTATTTTAGTAACATCAGCTCCATCCAAATAAGTTGGACCAAACTGTTGTCCATATGTCATTGACATGTCCACAACTTGCGTGTGGAGGTTAATGTTGTCTACTGTTGTAAACACCTTAATAGTCCTCACTTCTCTCAAAGAAAGAAGTGTCTTAAGATTGTCAAAGGTGATAACTTCACCATCTAGGTGGAATGTGGTAGGATTACTAGTGTAATATACACTTTTATCACCTCTCTTAAGAAATTCTATACCTAGTTGTGTAGATTGTCCAGAATAGGACCAATCTTTATAGGAACCAGCAAGTGAGATGGTTTCAATAAAATGTTCTTCAGGTGTTTTAGAAGAAGAAGTAAGATAACCATTATACGCTGTAACAGCATCAGGTGAAGAAACAGAAACTGTAGCTGGCACTTTGAGAGATCTCATATACCGAGCAGCTTCTTCCAAATTTAAGCCATGTGTTACATAGCCAAGTGGCATTGTAACAAGAGTTTCATTTAGATCGTTAAGTGTGTTGATAAGTGACGCTACAGTTGTTTTACTGGTGTAAAAGTAAAATCTAGCACCATAATCAACCACACCCTCTTGTATTTTAATACCCTTATATTTACGCTGTATAGTTGAAACTATGGCTTTAGTTTCCACACAGACAGGCATTAATTTGCGTGTTTCTTCTGCATGTGCAAGCATTTCTCGCAAATTCCAAGAAACAGTTCCAAGAATTTCTTGCTTCTCATTAGAGATAATAGATGGTAGAATGTAAAAGGCACTTTTACACTTTTTAAGCACTGTCTTTGCCTCCTCTACAGTGTAACCATTTAAACCCTGACCCGGGTAAGTGGTTATATAATTGTCTGTTGGCACTTTTCTCAAAGCTTTCGCTAGCATTTCAGTAGTGCCACCAGCCTTTTTAGTAGGTATAACCACAGCAGTTAAAACACCCTCTTGAACAACATCACCCACTATATATGGAGCATCTTTCTTTAAGAAAGTGATGTCAATGTCACTAACAAGAGTGGCAGAATCTGGATGAAGATTGCCATTAATGTCAATATAAAGTAACAAGTTTTCTGTGAGGAACTTAGTTTCTTCCAGAGTTGTTGTAACTTCTTCAACACAAGCTTTGATTTTCTTATCATCTTGTTTTCTCTGTTCAACTGAAGGTTTACTTTCAGTTATAAATGGCTTAACTTCCTCTTTAGGAATCTCAGCGATCTTTTGTTCAACTTGCTTTTCACTCTTCATTTCCAAAAAGCTTGAAACAAGTTTGTCATAGAGATTTTTATCAAAGACAGCTAAGTAGACATTTGTGCGAACAGTATCTACACAAACTCTTAAAGAATGTATAGGGTCAGCACCAAAAATACCAGCTGATAATAATGGTGCAAGTAGAACTTCGTGCTGATTAAAATTTTCATAAGCACTCTTAAGAAGTTGAATGTCTTCACCTTTGTTAACATTTGGGCCGACAACATGAAGACAGTGTTTAGCAAGATTGTGTCCGCTTAAAACACAACTACCACCCACTTTAAGTGGTCCATTAGTAGCTATGTAATCATCAGATTCAACTTGCATGGCATTGTTAGTAGCCTTATTTAAGGCTCCTGCAACACCTCCTCCATGTTTAAGGTAAACATTGGCTGCATTAACAACCACTGTTGGTTTTACCTTTTTAGCTTCTTCCACAATGTCTGCATTTTTAATGTATACATTGTCAGTAAGTTTTAAATAACCACTAAAACTATTCACTTCAATAGTCTGAACAACTGGTGTAAGTTCCATCTCTAATTGAGGTTGAACCTCAACAATTGTTTGAATAGTAGTTGTCTGATTGTCCTCACTGCCGTCTTGTTGACCAACAGTTTGTTGACTATCATCATCTAACCAATCTTCTTCTTGCTCTTCTTCAGGTTGAAGAGCAGCAGAAGTGGCACCAAATTCCAAAGGTTTACCTTGGTAATCATCTTCAGTACCATACTCATATTGAGTTGATGGCTCAAACTCTTCTTCTTCACAATCACCTTCTTCTTCATCCTCATCTGGAGGGTAGAAAGAACAATACATATGTGAAGCCAATTTAAACTCACCAGACTCATCAAATAAGTAGTATGTAGCCATACTCCACTCATCTAAATCAATGCCCAGTGGTGTAAGTAATTCAGATACTGGTTGCAAAGTTTTTATGACAGCATCTGCCACAACACAGGCGAACTCATTTACTTCTGTACCGAGTTCAACTGTATAGGCAGAGCACTTCTCATTAAGTACTTTATCAATCCTTTCATCAAGTTCAAAAGTGATATTCACACTCTTGTAACCTTGCACTTCTATCACAGTGTCATCACCAAAAGTAACCTTTGTTGGTGCACCGCCTTTGAGTGTGAAGGTATTGTTTGTTACCATCATATTAGGTGCAAGGGCACAGTACTTTTCTGTGTCTTTGATTTCGAGCAACATAAGCCCGTTAATACAAACTGGTGTACCAACCAATGGAGCTTCAACAGCTTCACTAGTAGGTTGTTCTAATGGTTGTAAATCACCAGTTTTCAAGACAACTTCCTCTGTTAACACTTCTGTGGGAAGTGTTTCTCCCTCTAAGAAGATAATTTCTTTTGGGGCTTTTAGAGGCATGAGTAGGCCAGTTTCTTCTCTGGATTTAACACACTTTCTGTACAATCCCTTTGAGTGCGTGACAAATGTTTCACCTAAATTCAAGGCTTTAAGTTTAGCTCCACCAATAATGATAGAGTCAGCACACAAAGCCAAAAATTTATTTACAAGCTTAAAGAATGTCTGAACACTCTCCTTAATTTCCTTTGCACAGGTGACAATTTGTCCACCGACAATTTCACAAGCACAGGTTGAGATAAATTTAACAATTTCCCAACCGTCTCTAAGAAACTCTACACCTTCCTTAAACTTCTCTTCAAGCCAATCAAGGACGGGTTTGAGTTTTTCATAAACAGTGCCAAAGATGTTAGTTAGCCACTGCGAAGTCAACTGAACAACACCACCTGTAATGTAGGCCATTACAACTAGATTGTTAGTAGCCAAATCAGATGTGAACATCATAGCATCAATGAGTCTCAGTGAATACTGTGAAATTCCATCTAGTATTGTTATAGCGGCCTTCTGTAAAACACGCACAGAATTTTGAGCAGTTTCAAGAGTGCGGGAGAAAATTGATCGTACAACACGAGCAGCCTCTGATGCAAATGCATAAAGAGGACTCAGTATTGATTTCTGTTCACCAATATTCCAGGCACCTTTTTTAGCTTTTCCTTTTGTAACTTTAAAATTACCACAGGATTCAACAATTTGTTTGAATGCTTTATAATCCAAACCTTTCACAGTTTCCACAAAAGCACTTGTGGAAGCAGAAAAAGATGCCAAAATAATGGCGATCTCTTCATTAAGTTTAAAGTCACCAACAATATTGATGTTGACTTTCTCTTTTTGGAGTATTTCAAGAAGGTTGTCATTAAGACCTTCGGAACCTTCTCCAACAACACCTGTATGGTTACAACCTATGTTAGCGCTAGCACGTGGAACCCAATAGGCACACTTGTTATGGCAACCAACATAAGAGAACACACAGCCTCCAAAGGCAATAGTGCGACCACCCTTACGAAGAATGGTTTTCAAGCCAGATTCATTATGGTATTCGGCAAGACTATGCTCAGGTCCTACTTCTGAATTGTGACATGCTGGACAATAAATTTTAACAACAGCATTTTGGGGTAAGTAACCACAAGTAGTGGCACCTTCTTTAGTCAAATTCTCAGTGCCACAAAATTCGCAAGTGGCTTTAACAAAATCGCCCGTCTGCCATGAAGTTTCACCACAATGATCACACTTCATGAGAGTTGAAAGGCACATTTGGTTGCATTCATTTGGTGACGCAACTGGATAGACAGATCGAATTCTACCCATAAAGCCATCAAGCTTTTTCTTTTCAACCCTTGGTTGAATAGTCTTGATTATGGAATTTAAGGGAAATACAAAATTTGGACATTCCCCATTGAAGGTGTCAAATTTCTTTGCCAATTTAATTTCAAAAGGTGTCTGCAATTCATAGCTCTTTTCAGAACGTTCCGTGTACCAAGCAATTTCATGCTCATGTTCACGGCAGCAGTATACACCCCTCTTAGTGTCAATAAAGTCCAGTTGTTCGGACAAAGTGCATGAAGCTTTACCAGCACGTGCTAGAAGGTCTTTAATGCACTCAAGAGGGTAGCCATCAGGGCCACAGAAGTTGTTATCGACATAGCGAGTGTATGCCCCTCCGTTAAGCTCACGCATGAGTTCACGGGTAACACCACTGCTATGTTTAGTGTTCCAGTTTTCTTGAAAATCTTCATAAGGATCAGTGCCAAGCTCGTCGCCTAAGTCAAATGACTTTAGATCGGCGCCGTAACTATGGCCACCAGCTCCTTTATTACCGTTCTTACGAAGAAGAACCTTGCGGTAAGCCACTGGTATTTCGCCCACATGAGGGACAAGGACACCAAGTGTCTCACCACTACGACCGTACTGAATGCCTTCGAGTTCTGCTACCAGCTCAACCATAACATGACCATGAGGTGCAGTTCGAGCATCCGAACGTTTGATGAACACATAGGGCTGTTCAAGTTGAGGCAAAACGCCTTTTTCAACTTCTACTAAGCCACAAGTGCCATCTTTAAGATGTTGACGTGCCTCTGATAAGACCTCCTCCACGGAGTCTCCAAAGCCACGTACGAGCACGTCGCGAACCTGTAAAACAGGCAAACTGAGTTGGACGTGTGTTTTCTCGTTGAAACCAGGGACAAGGCTCTCCATCTTACCTTTCGGTCACACCCGGACGAAACCTAGATGTGCTGATGATCGGCTGCAACACGGACGAAACCGTAAGCAGCCTGCAGAAGATAGACGAGTTACTCGTGTCCTGTCAACGACAGTAATTAGTTATTAATTATACTGCGTGAGTGCACTAAGCATGCAGCCGAGTGACAGCCACACAGATTTTAAAGTTCGTTTAGAGAACAGATCTACAAGAGATCGAAAGTTGGTTGGTTTGTTACCTGGGAAGGTATAAACCTTTAAT")

Convert blow5 to pod5 with blue-crab

blue-crab s2p test.blow5 -o test.pod5

Basecall with dorado-0.8.0, but first download the SUPv5.0.0 model

~/bin/dorado-0.8.0-linux-x64/bin/dorado download
~/bin/dorado-0.8.0-linux-x64/bin/dorado basecaller [email protected] test.pod5 > test.pod5.bam

BAM to FASTQ

samtools fastq test.pod5.bam > test.fastq

Use seqtk fastq check to estimate the average quality score

~/bin/seqtk/seqtk fqchk test.fastq|head
min_len: 1959; max_len: 6591; avg_len: 4663.14; 46 distinct quality values
POS	#bases	%A	%C	%G	%T	%N	avgQ	errQ	%low	%high
ALL	32642	31.8	18.3	19.5	30.4	0.0	16.3	11.4	67.7	32.3
1	7	42.9	0.0	14.3	42.9	0.0	7.1	6.0	100.0	0.0
2	7	42.9	14.3	0.0	42.9	0.0	9.3	6.8	100.0	0.0
3	7	0.0	14.3	28.6	57.1	0.0	11.7	7.9	85.7	14.3
4	7	57.1	14.3	14.3	14.3	0.0	11.0	9.1	100.0	0.0
5	7	28.6	0.0	0.0	71.4	0.0	12.6	9.9	85.7	14.3
6	7	42.9	0.0	28.6	28.6	0.0	14.0	12.3	85.7	14.3
7	7	14.3	28.6	0.0	57.1	0.0	15.7	13.1	71.4	28.6

It appears as though I am getting average quality scores of 16.3 or so based on running seqtk fqchk.

My question is whether it is possible to adjust this mean (or if the median might be adjustable) or whether that can only happen based on the k-mer values provided by Oxford Nanopore themselves? Thus, if I would like to do this, I would somehow need to make a new k-mer frequency file based on the data quality that I desire. Is it as simple as counting the 9-mers from an exemplary data set?

@hasindu2008
Copy link
Owner

This is a great question. Are you looking to increase the quality or decrease the quality?

The quality scores (and identity scores that correlate to quality scores) can be adjusted to a certain degree by tuning the following parameters

  1. --dwell-std FLOAT standard deviation of the number of signal samples per base
  2. --amp-noise FLOAT amplitude domain noise factor

If you look at the squigulator paper, there are some plots showing how increasing the dwell-std reduced the quality and reducing it increased the quality. For amp-noise, there is a sweet spot that gives the best quality and going away from this makes the quality low. Note that the paper has benchmarks from R9, but should equally apply to R10.

Also, keep in mind that the best quality is capped by the quality of the pore model. That means if the pore model is of high quality and accurately represents the real data, we can get simulated reads as close as the real data. For R9 this is the case, but not so for R10 at the moment.

@jelber2
Copy link
Author

jelber2 commented Sep 30, 2024

I was looking into increasing the quality, but perhaps the issue is as you pointed out the quality of the R10 pore model, and further, maybe it has something to do with the random 7Kbp sequence I chose to simulate the reads from. I tried making my own Pore model by the following:

count 9-mers from reads using BBMap/BBTools kmercountexact.sh

kmercountexact.sh k=9 threads=54 -Xmx300g in=~/hapmers/WGS_HG002_EZ1_25kb.pod5.bam out=~/hapmers/WGS_HG002_EZ1_25kb.pod5.bam.9mers.txt

Convert from the format >12345\nAAAAAAAAA to AAAAAAAAA\t12345 and sort from k-mers AAAAAAAAA to TTTTTTTTT

cat ~/hapmers/WGS_HG002_EZ1_25kb.pod5.bam.9mers.txt|paste - - |perl -pe "s/>(\d+)\t(\w*)/\2\t\1/g" |sort > ~/hapmers/WGS_HG002_EZ1_25kb.pod5.bam.9mers2.txt

Turns out from my human WGS data, there are quite a few missing potential 9-mers (I only have 131072 9-mers out of a possible 262144), and your k-mer model input requires all possible 4-base 9-mers, so I used column1 (i.e., the k-mers) from https://raw.githubusercontent.com/nanoporetech/kmer_models/refs/heads/master/dna_r10.4.1_e8.2_400bps/9mer_levels_v1.txt

wget https://raw.githubusercontent.com/nanoporetech/kmer_models/refs/heads/master/dna_r10.4.1_e8.2_400bps/9mer_levels_v1.txt
cut -f 1 9mer_levels_v1.txt > 9-mers.txt

and then the following julialang script to get fill in the table.

using DataFrames, CSV

# Read the input files
test = DataFrame(CSV.File("/home/jelber43/genomics_server/hapmers/WGS_HG002_EZ1_25kb.pod5.bam.9mers2.txt", header=false))
test2 = DataFrame(CSV.File("/home/jelber43/genomics_server/hapmers/9-mers.txt", header=false))

# Create test3 DataFrame
test3 = DataFrame("k-mer" => test2[:, 1], "number" => fill(0, nrow(test2)))

# Update test3 with values from test
for row in eachrow(test)
    idx = findfirst(==(row[1]), test3[!, "k-mer"])
    if !isnothing(idx)
        test3[idx, "number"] = row[2]
    end
end

# Sort test3 by k-mer
test4 = sort(test3, "k-mer")

# Write the output file
CSV.write("/home/jelber43/genomics_server/hapmers/WGS_HG002_EZ1_25kb.pod5.bam.9mers3.txt", test4, delim='\t', header=false)

Python script to get 9-mers with counts transformed by a z-score

import pandas as pd
import numpy as np

# Read the input file
input_file = '~/hapmers/WGS_HG002_EZ1_25kb.pod5.bam.9mers3.txt'
kmer_df = pd.read_csv(input_file, sep='\t', header=None, names=['kmer', 'count'])

# Calculate mean and standard deviation
mean_count = kmer_df['count'].mean()
std_count = kmer_df['count'].std()

# Calculate expected level in standard units
kmer_df['expected_level'] = (kmer_df['count'] - mean_count) / std_count

# Create result table
result_table = kmer_df[['kmer', 'expected_level']].reset_index(drop=True)

# Write output file without header
output_file = '~/hapmers/WGS_HG002_EZ1_25kb.pod5.bam.9mers4.txt'
result_table.to_csv(output_file, sep='\t', header=False, index=False)

Edit output file so that the header has

#k\t9

and that there is a fake level standard deviation of 0 for all k-mers.

## format is:
k-mer\tz-score\t0
## with the following code
cat <(echo -e "#k\t9") <(perl -pe "s/(\w+)\t(.+)/\1\t\2\t0/g" ~/hapmers/WGS_HG002_EZ1_25kb.pod5.bam.9mers4.txt) > ~/hapmers/WGS_HG002_EZ1_25kb.pod5.bam.9mers5.txt
cd ~/pixi/blue-crab

Simulate the reads with squigulator and the new k-mer model

~/bin/squigulator-v0.4.0/squigulator --sample-rate 5000.0 --ont-friendly=yes --seed 1 --kmer-model ~/hapmers/WGS_HG002_EZ1_25kb.pod5.bam.9mers5.txt -o ~/pixi/blue-crab/test2.blow5 -t 8 -f 10 <(echo -e ">1\nTACAACATGCACATAACTTTTCCATACATAATAAAATGATGCAAAGAAGATGTACATTCTAACCATAGCTGAAATCGGGGCCATTTGTACAAGATTAATTATTAACCACATAAGCCAAGAATTACTAATAAAATGTACTGCAAAATAGCTGAAAAACAATTGCATGATTGCAGCCAATCCAAGTACATAGAAAAACCTAGTGAAAAGAATATATGCCAAAAACCACTCTGCAACTAAGCCAAAAGCAGTTAAATCCCATTTAAAAGATGAAATGGTAATTTGTATAGTTTCTAAAGAAGGATAGGTGTCTAAAGAATCTAAACCACTAAGACAAACACTACAAGGTATAGAACCAGTACAGTAGGTTGCAATAGTGACATTAGTAGAGTTCAAATAGCCTTCTCTGTAACCAGTACAGTAAGAAGGCATGCCTAAATTAGACATTAAAACACCTAAAGCAGCGGTTGAGTAGATTAAAGAACCTAGGCAAACACTTAATAGTAAAAACCAAATTATAATATTTATCAGTTTAGAAAAATTAGGTGACTTCAAATAATTAAATGAAGCCTCTAGACAAAATTTACCGACACTCTTAACAGTATTCTTTGCTATAGTAGTCGGCATAGATGCTTTAATTCTAGAATTTGTACTTCTAGTAAAAGTACACAATTGTAGCAATAAAGTAAAGAAATAAGGCATATAATTAGTACAAACACGGTTTAAACACCGTGTAACTATGTTAGTAGTTGTACTAACAACTTTGTTAAGAAAAGGCTTAGCATAATTAGCTATAGTATCCCAAGGGACACTATTAACAGCAGCTAAACCATGAGTAGCAAGGGTTTTCAAACCTAATACTCTAGATAATTCATTAGGTTTCTTAATAGTAAGACTAGAATTGTCTACATAAGCAGCCATTAGATCTGTGTGGCCAACCTCTTCTGTAATTTTTAAACTATTATTTGCTGGTTTAAGTATAATGTCTCCTACAACTTCGGTAGTTTTCACATTACACTCAAGAACGTCTTTCTGTATGGTAGGATTTTCCACTACTTCTTCAGAGACTGGTTTTAGATCTTCGCAGGCAAGATTATCCATTCCCTGCGCGTCCTCTGACTTCAGTACATCAAACGAATTTGATGTTTCAACTGGTTTTGTGCTCCAAAGACAACGTATACACCAGGTATTTGGTTTATACGTGGCTTTATTAGTTGCATTGTTAACATGCCAAACAATAGGTTTATGTAACAATTTAGCTCCTTTCTTAAAAGAGGGTGTGTAGTGTTTATAATCAATAGCCACCACATCACCATTTAAGTCAGGGAAAAATGTAACTTTAAGCTCTCTTGAAGCAGGTTTCTTATAACCAGTTAACTGGTTTAAATCATCAGCAAATTTGATATTATCACATACAAACTTAAAATTATCGAAGCTTGCGTTTGGATATGGTTGGTTTGGTACAAGATCAATTGGTTGCTCTGTGAAATAAGAATTGTCTTTCTTATAATAATTGTCCAACTTAGGGTCAATTTCTGTACAAACAACACCATCCAATTTATAAGTAACTGGTTTTATGGTTGTTGTGTAACTGTTTTCTTTGTAGAAAACATCCGTAATAGGACCTTTGTATTCTGAGGACTTTGTAAGTAAAGCACCGTCTATGCAATACAAAGTTTCTTTAGAAGTTATATGTTTATAGTGACCACACTGGTAATTACCAGTGTACTCACTAGCACAAGTAAATGTACCATGCTTAAGTTCATACTGAGCAGGTGGTGCTGACATCATAACAAAAGGTGACTCCTGTTGTACTAGATATTTTGTAGCTTGTTTACCACACGTACAAGGTATCTGAACACCTTTCTTAAATTGTTCATAAGAAAGTGTGCCCATGTACATAACAGCTTCTACACCCTTAAGGGTTGTCTGCTGTTGTCCACAAGTTTTACACACCACGTTCAAGACTCTTTTGCAAGAATCTAAATTGGCATGTTGAAACAAGTAACTCATTGTTTCTCTAACATCACCTAACTCACCTACTGTCTTATTACAGTAGGCTAAGATAAGTGCACAAAAGTTAGCAGCTTCACCAGCCCTTGCTCTGTAATAAGCATCTTGTAGAGCAGGTGGATTAAACTTCAACTCTATTTGTTGGAGTGTTAACAATGCAGTGGCAAGATAACAGTTGTTATCTGCCCATTTAATAGAAGTTAAACCATTAACTTGTGGGTATTTCCACTTTTTAGTGTGATTTAATGCTGACATGTACCTACCCAGAAAACTAGGATCAGTTGTGTGGTAGTACTCAAAAGCCTCAACACGTAGAGTGTCATCATTAGGTAAAACATAAAATGTTTTACCTTCATGTGAATTATGAGGTTTTATTTTAGTAACATCAGCTCCATCCAAATAAGTTGGACCAAACTGTTGTCCATATGTCATTGACATGTCCACAACTTGCGTGTGGAGGTTAATGTTGTCTACTGTTGTAAACACCTTAATAGTCCTCACTTCTCTCAAAGAAAGAAGTGTCTTAAGATTGTCAAAGGTGATAACTTCACCATCTAGGTGGAATGTGGTAGGATTACTAGTGTAATATACACTTTTATCACCTCTCTTAAGAAATTCTATACCTAGTTGTGTAGATTGTCCAGAATAGGACCAATCTTTATAGGAACCAGCAAGTGAGATGGTTTCAATAAAATGTTCTTCAGGTGTTTTAGAAGAAGAAGTAAGATAACCATTATACGCTGTAACAGCATCAGGTGAAGAAACAGAAACTGTAGCTGGCACTTTGAGAGATCTCATATACCGAGCAGCTTCTTCCAAATTTAAGCCATGTGTTACATAGCCAAGTGGCATTGTAACAAGAGTTTCATTTAGATCGTTAAGTGTGTTGATAAGTGACGCTACAGTTGTTTTACTGGTGTAAAAGTAAAATCTAGCACCATAATCAACCACACCCTCTTGTATTTTAATACCCTTATATTTACGCTGTATAGTTGAAACTATGGCTTTAGTTTCCACACAGACAGGCATTAATTTGCGTGTTTCTTCTGCATGTGCAAGCATTTCTCGCAAATTCCAAGAAACAGTTCCAAGAATTTCTTGCTTCTCATTAGAGATAATAGATGGTAGAATGTAAAAGGCACTTTTACACTTTTTAAGCACTGTCTTTGCCTCCTCTACAGTGTAACCATTTAAACCCTGACCCGGGTAAGTGGTTATATAATTGTCTGTTGGCACTTTTCTCAAAGCTTTCGCTAGCATTTCAGTAGTGCCACCAGCCTTTTTAGTAGGTATAACCACAGCAGTTAAAACACCCTCTTGAACAACATCACCCACTATATATGGAGCATCTTTCTTTAAGAAAGTGATGTCAATGTCACTAACAAGAGTGGCAGAATCTGGATGAAGATTGCCATTAATGTCAATATAAAGTAACAAGTTTTCTGTGAGGAACTTAGTTTCTTCCAGAGTTGTTGTAACTTCTTCAACACAAGCTTTGATTTTCTTATCATCTTGTTTTCTCTGTTCAACTGAAGGTTTACTTTCAGTTATAAATGGCTTAACTTCCTCTTTAGGAATCTCAGCGATCTTTTGTTCAACTTGCTTTTCACTCTTCATTTCCAAAAAGCTTGAAACAAGTTTGTCATAGAGATTTTTATCAAAGACAGCTAAGTAGACATTTGTGCGAACAGTATCTACACAAACTCTTAAAGAATGTATAGGGTCAGCACCAAAAATACCAGCTGATAATAATGGTGCAAGTAGAACTTCGTGCTGATTAAAATTTTCATAAGCACTCTTAAGAAGTTGAATGTCTTCACCTTTGTTAACATTTGGGCCGACAACATGAAGACAGTGTTTAGCAAGATTGTGTCCGCTTAAAACACAACTACCACCCACTTTAAGTGGTCCATTAGTAGCTATGTAATCATCAGATTCAACTTGCATGGCATTGTTAGTAGCCTTATTTAAGGCTCCTGCAACACCTCCTCCATGTTTAAGGTAAACATTGGCTGCATTAACAACCACTGTTGGTTTTACCTTTTTAGCTTCTTCCACAATGTCTGCATTTTTAATGTATACATTGTCAGTAAGTTTTAAATAACCACTAAAACTATTCACTTCAATAGTCTGAACAACTGGTGTAAGTTCCATCTCTAATTGAGGTTGAACCTCAACAATTGTTTGAATAGTAGTTGTCTGATTGTCCTCACTGCCGTCTTGTTGACCAACAGTTTGTTGACTATCATCATCTAACCAATCTTCTTCTTGCTCTTCTTCAGGTTGAAGAGCAGCAGAAGTGGCACCAAATTCCAAAGGTTTACCTTGGTAATCATCTTCAGTACCATACTCATATTGAGTTGATGGCTCAAACTCTTCTTCTTCACAATCACCTTCTTCTTCATCCTCATCTGGAGGGTAGAAAGAACAATACATATGTGAAGCCAATTTAAACTCACCAGACTCATCAAATAAGTAGTATGTAGCCATACTCCACTCATCTAAATCAATGCCCAGTGGTGTAAGTAATTCAGATACTGGTTGCAAAGTTTTTATGACAGCATCTGCCACAACACAGGCGAACTCATTTACTTCTGTACCGAGTTCAACTGTATAGGCAGAGCACTTCTCATTAAGTACTTTATCAATCCTTTCATCAAGTTCAAAAGTGATATTCACACTCTTGTAACCTTGCACTTCTATCACAGTGTCATCACCAAAAGTAACCTTTGTTGGTGCACCGCCTTTGAGTGTGAAGGTATTGTTTGTTACCATCATATTAGGTGCAAGGGCACAGTACTTTTCTGTGTCTTTGATTTCGAGCAACATAAGCCCGTTAATACAAACTGGTGTACCAACCAATGGAGCTTCAACAGCTTCACTAGTAGGTTGTTCTAATGGTTGTAAATCACCAGTTTTCAAGACAACTTCCTCTGTTAACACTTCTGTGGGAAGTGTTTCTCCCTCTAAGAAGATAATTTCTTTTGGGGCTTTTAGAGGCATGAGTAGGCCAGTTTCTTCTCTGGATTTAACACACTTTCTGTACAATCCCTTTGAGTGCGTGACAAATGTTTCACCTAAATTCAAGGCTTTAAGTTTAGCTCCACCAATAATGATAGAGTCAGCACACAAAGCCAAAAATTTATTTACAAGCTTAAAGAATGTCTGAACACTCTCCTTAATTTCCTTTGCACAGGTGACAATTTGTCCACCGACAATTTCACAAGCACAGGTTGAGATAAATTTAACAATTTCCCAACCGTCTCTAAGAAACTCTACACCTTCCTTAAACTTCTCTTCAAGCCAATCAAGGACGGGTTTGAGTTTTTCATAAACAGTGCCAAAGATGTTAGTTAGCCACTGCGAAGTCAACTGAACAACACCACCTGTAATGTAGGCCATTACAACTAGATTGTTAGTAGCCAAATCAGATGTGAACATCATAGCATCAATGAGTCTCAGTGAATACTGTGAAATTCCATCTAGTATTGTTATAGCGGCCTTCTGTAAAACACGCACAGAATTTTGAGCAGTTTCAAGAGTGCGGGAGAAAATTGATCGTACAACACGAGCAGCCTCTGATGCAAATGCATAAAGAGGACTCAGTATTGATTTCTGTTCACCAATATTCCAGGCACCTTTTTTAGCTTTTCCTTTTGTAACTTTAAAATTACCACAGGATTCAACAATTTGTTTGAATGCTTTATAATCCAAACCTTTCACAGTTTCCACAAAAGCACTTGTGGAAGCAGAAAAAGATGCCAAAATAATGGCGATCTCTTCATTAAGTTTAAAGTCACCAACAATATTGATGTTGACTTTCTCTTTTTGGAGTATTTCAAGAAGGTTGTCATTAAGACCTTCGGAACCTTCTCCAACAACACCTGTATGGTTACAACCTATGTTAGCGCTAGCACGTGGAACCCAATAGGCACACTTGTTATGGCAACCAACATAAGAGAACACACAGCCTCCAAAGGCAATAGTGCGACCACCCTTACGAAGAATGGTTTTCAAGCCAGATTCATTATGGTATTCGGCAAGACTATGCTCAGGTCCTACTTCTGAATTGTGACATGCTGGACAATAAATTTTAACAACAGCATTTTGGGGTAAGTAACCACAAGTAGTGGCACCTTCTTTAGTCAAATTCTCAGTGCCACAAAATTCGCAAGTGGCTTTAACAAAATCGCCCGTCTGCCATGAAGTTTCACCACAATGATCACACTTCATGAGAGTTGAAAGGCACATTTGGTTGCATTCATTTGGTGACGCAACTGGATAGACAGATCGAATTCTACCCATAAAGCCATCAAGCTTTTTCTTTTCAACCCTTGGTTGAATAGTCTTGATTATGGAATTTAAGGGAAATACAAAATTTGGACATTCCCCATTGAAGGTGTCAAATTTCTTTGCCAATTTAATTTCAAAAGGTGTCTGCAATTCATAGCTCTTTTCAGAACGTTCCGTGTACCAAGCAATTTCATGCTCATGTTCACGGCAGCAGTATACACCCCTCTTAGTGTCAATAAAGTCCAGTTGTTCGGACAAAGTGCATGAAGCTTTACCAGCACGTGCTAGAAGGTCTTTAATGCACTCAAGAGGGTAGCCATCAGGGCCACAGAAGTTGTTATCGACATAGCGAGTGTATGCCCCTCCGTTAAGCTCACGCATGAGTTCACGGGTAACACCACTGCTATGTTTAGTGTTCCAGTTTTCTTGAAAATCTTCATAAGGATCAGTGCCAAGCTCGTCGCCTAAGTCAAATGACTTTAGATCGGCGCCGTAACTATGGCCACCAGCTCCTTTATTACCGTTCTTACGAAGAAGAACCTTGCGGTAAGCCACTGGTATTTCGCCCACATGAGGGACAAGGACACCAAGTGTCTCACCACTACGACCGTACTGAATGCCTTCGAGTTCTGCTACCAGCTCAACCATAACATGACCATGAGGTGCAGTTCGAGCATCCGAACGTTTGATGAACACATAGGGCTGTTCAAGTTGAGGCAAAACGCCTTTTTCAACTTCTACTAAGCCACAAGTGCCATCTTTAAGATGTTGACGTGCCTCTGATAAGACCTCCTCCACGGAGTCTCCAAAGCCACGTACGAGCACGTCGCGAACCTGTAAAACAGGCAAACTGAGTTGGACGTGTGTTTTCTCGTTGAAACCAGGGACAAGGCTCTCCATCTTACCTTTCGGTCACACCCGGACGAAACCTAGATGTGCTGATGATCGGCTGCAACACGGACGAAACCGTAAGCAGCCTGCAGAAGATAGACGAGTTACTCGTGTCCTGTCAACGACAGTAATTAGTTATTAATTATACTGCGTGAGTGCACTAAGCATGCAGCCGAGTGACAGCCACACAGATTTTAAAGTTCGTTTAGAGAACAGATCTACAAGAGATCGAAAGTTGGTTGGTTTGTTACCTGGGAAGGTATAAACCTTTAAT")

Convert blow5 to pod5

pixi shell
blue-crab s2p test2.blow5 -o test2.pod5

Basecall with dorado-0.8.0 SUP

~/bin/dorado-0.8.0-linux-x64/bin/dorado download
~/bin/dorado-0.8.0-linux-x64/bin/dorado basecaller [email protected] test2.pod5 > test2.pod5.bam

bam to fastq and quality check

samtools fastq test2.pod5.bam > test2.fastq
~/bin/seqtk/seqtk fqchk test2.fastq|head
min_len: 1228; max_len: 3901; avg_len: 2785.86; 2 distinct quality values
POS	#bases	%A	%C	%G	%T	%N	avgQ	errQ	%low	%high
ALL	19501	32.0	17.6	16.5	33.9	0.0	2.6	3.0	100.0	0.0
1	7	14.3	0.0	0.0	85.7	0.0	2.6	3.0	100.0	0.0
2	7	14.3	42.9	0.0	42.9	0.0	2.7	3.0	100.0	0.0
3	7	42.9	0.0	0.0	57.1	0.0	2.7	3.0	100.0	0.0
4	7	0.0	0.0	14.3	85.7	0.0	2.6	3.0	100.0	0.0
5	7	0.0	42.9	14.3	42.9	0.0	2.6	3.0	100.0	0.0
6	7	57.1	0.0	0.0	42.9	0.0	2.6	3.0	100.0	0.0
7	7	14.3	0.0	0.0	85.7	0.0	2.6	3.0	100.0	0.0

I guess I am not surprised the results are so bad with my k-mer model with an average quality score of 2.6 as I pointed out that there were so many missing k-mers. Also I just put a fake standard deviation level of 0 for all of the z-scores that I had, so I have no idea what the values should look like.

@jelber2
Copy link
Author

jelber2 commented Sep 30, 2024

Hmmm... I guess you have based your k-mer models on the legacy k-mer models, and the above I wrote has gone with the assumption of the newer models that does not contain the electrical current information for a k-mer, rather the abundance of that k-mer.

@jelber2
Copy link
Author

jelber2 commented Sep 30, 2024

I will play around with your R10 k-mer model and stop wasting time trying to make my own as that does not seem possible.

@hasindu2008
Copy link
Owner

Hello.

Sorry for taking a bit of time to respond, I was iterating through the other repositories and work.

Great effort in deducing the k-mers levels. Generating nucleotide k-mer levels is a bit of a headache for a number of reasons.

  1. The need for high-quality modification-free control data as the modifications can mess up the stats
  2. The R10 pore being longer and thus 9-mer means that we need to have the above control data from something like a human genome. We did once when 4KHz came, but then this 5KHz came.
  3. The standard deviation values are important for simulation - if they are left at 0, the basecalling accuracy will be worse.

I have not documented how exactly I generated the squigulator model, but I have documented it for f5c which you can see here https://hasindu2008.github.io/f5c/docs/r10train#nucleotide-model. The steps are quite related, for instance, the conversion of the ONT's score normalised value to the pico-ampere levels using a linear transform. From on top of my head (might be not 100% correct, got to see my logs).

  1. Use the https://github.com/nanoporetech/kmer_models/blob/master/dna_r10.4.1_e8.2_260bps/9mer_levels_v1.txt (yes, 260 works better than 400 for even simulation at 400 bps)
  2. scale to pA through a linear transform
  3. standard deviations from the 1st round of training mentioned in https://hasindu2008.github.io/f5c/docs/r10train#nucleotide-model are taken

I believe the standard deviations that I have got through the above training are not great. Also, I believe some of the ONT mean levels are not as good as they should be.

@hiruna72 has been working on an independent pore model generation method:

@skovaka has also been working on an independent pore model generation method:

While these methods focus on generating k-mer models for alignment, I believe the methods could be tweaked to generate k-mer models for simulation as well. While testing these approaches has been on my todo list, have not been able to find some time to go through this model improvement carefully.

@jelber2
Copy link
Author

jelber2 commented Oct 2, 2024

Thank you for your super detailed and helpful response!

I could really only afford the limited time I spent on deducing the k-mer abundances from the reads in making a custom k-mer model. Now that I realize that I would have to take into account the actual raw signal converted to pA available from converting pod5 to s/blow5, I could not afford the time to do this entire process.

We have whole-genome amplification data for human cells (I would need to double-check), so presumably that would be modification-free r10.4.1 data, but I am afraid there will be too many missing k-mers based on k-mer abundances I got from the whole-genome sequencing that I had analyzed here.

Thanks for all of your help. Feel free to close this or leave it as a discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants