libCSAM contains several C++ programs to compress, decompress, and access each of the fields of any SAM format file. Part of this library was taken from Francisco Claude's libcds project (https://github.com/fclaude/libcds/). The Boost C++ library must also be installed on your computer (http://www.boost.org/).
-[CompressSAM]:
Use: ./CompressSAM <arch> <opt>
opt:
-q qm: How the Quality values are stored. q = 0 - lossless, 1 - pblock, 2 - rblock. Default: q = 1
-m mode: Mode used to store the Representative Array. mode = 0 ASCII, 1 Binary Global, 2 Binary Local. Default: mode = 1
-l lossy: Lossy parameter used to compress the quality scores, depending on the mode used. Default: 0
-s sample: Sample rate used for Fields and Quality structure. Default: s = 1000
-p position: Sample position rate used for Seq, Rname, and Pos. Default: p = 1000
output: .csam File
Example: ./CompressSAM ./data/file.sam -q 2 -l 60 -s 500 -p 100
output: file.sam.csam
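The options above can also be assembled from a script. The following is only a sketch: the function name compress_sam_command and its keyword names are hypothetical helpers, not part of libCSAM. It uses only the flags listed above and omits any option that is not given, so the tool's own defaults apply.

```python
import subprocess


def compress_sam_command(sam_path, q=None, m=None, l=None, s=None, p=None,
                         binary="./CompressSAM"):
    """Build the CompressSAM argument list from the options documented above.

    Hypothetical helper: options left as None are not passed at all.
    """
    cmd = [binary, sam_path]
    for flag, value in (("-q", q), ("-m", m), ("-l", l), ("-s", s), ("-p", p)):
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


def run_compress_sam(sam_path, **options):
    # Runs the tool; subprocess raises CalledProcessError on a non-zero exit.
    subprocess.run(compress_sam_command(sam_path, **options), check=True)
```

For instance, compress_sam_command("./data/file.sam", q=2, l=60, s=500, p=100) builds the same command line as the example above.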
-[CompressQual]:
Use: ./CompressQual <arch> <opt>
arch: SAM format file. Note that only the quality field will be used.
opt:
-q mode: How the Quality values are stored.
mode=0 gzip
mode=1 P-Block
mode=2 R-Block
mode=3 Bins based on the LogBinning of the Wan et al.
2011 paper. Note that UniBinning is
also implemented but is not included
in this program.
mode=4 Only one value is stored to represent
all the qualities
default: 0
-m mode: If 'q' is 1 or 2, gives the mode used to store
the Representative Array.
mode=0 ASCII
mode=1 Binary Global
mode=2 Binary Local
default=1
-l lossy: Lossy parameter used to compress the quality
scores, depending on the mode used.
P-Block: maximum distance between the values
and their representative (1,2,3,4)
R-Block: maximum Max/Min difference allowed,
receives the extra overhead pbb
(5, 20, 40, 100)
Bins: number of bins to be used (94, 20,
10, 5)
-r : Reorder the quality scores by reference and
position. The permutation is also stored.
-s sample: Size of the sample rate that will be used.
default: no sample.
output: .cqual File
Example: ./CompressQual ./data/file.sam -q 2 -l 60 -s 500 -r
output: file.sam.cqual
Compresses the quality scores of file.sam using R-Block with r=1.60,
storing a sample every 500 lines and reordering the file
by reference and position in the reference.
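To illustrate the two block modes, here is a minimal sketch and not the code used by CompressQual. It assumes a greedy left-to-right split, the midpoint as P-Block representative, and the rounded geometric mean as R-Block representative; all function names are hypothetical. Values are assumed to be positive integers. In the example above, -l 60 corresponds to r=1.60; here r is passed directly.

```python
import math


def greedy_blocks(values, fits, representative):
    """Split values into consecutive blocks, extending each block while fits(lo, hi) holds.

    Returns a list of (representative, block_length) pairs.
    """
    blocks = []
    lo = hi = None
    length = 0
    for v in values:
        if length == 0:
            lo = hi = v
            length = 1
            continue
        new_lo, new_hi = min(lo, v), max(hi, v)
        if fits(new_lo, new_hi):
            lo, hi = new_lo, new_hi
            length += 1
        else:
            # Close the current block and start a new one at v.
            blocks.append((representative(lo, hi), length))
            lo = hi = v
            length = 1
    if length:
        blocks.append((representative(lo, hi), length))
    return blocks


def p_block(values, p):
    # max - min <= 2p guarantees every value is within p of the midpoint.
    return greedy_blocks(values,
                         lambda lo, hi: hi - lo <= 2 * p,
                         lambda lo, hi: (lo + hi) // 2)


def r_block(values, r):
    # Every block satisfies max / min <= r (values must be positive).
    return greedy_blocks(values,
                         lambda lo, hi: hi <= r * lo,
                         lambda lo, hi: round(math.sqrt(lo * hi)))
```

Each block is then stored as one representative plus its length, which is where the lossy saving comes from: a larger p or r gives fewer, longer blocks.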
-[CompressSeq]:
Use: ./CompressSeq <arch> <opt>
<arch>: must be a .sam or .rps (rname pos seq) file
opt:
-s sample: Size of the sample rate that will be used. Default: no sample
output: .cseq file
-[DecompressSAM]:
Use: ./DecompressSAM <arch>
arch: .csam File
output: .sam File containing the SAM information
-[DecompressQual]:
Use: ./DecompressQual <arch>
arch: .cqual File
output: .qual File containing the quality scores
-[DecompressSeq]:
Use: ./DecompressSeq <arch>
arch: .cseq File
output: .seq File containing only the sequence field
-[CountReadsSample]: Counts the number of reads within each of the intervals in the sample_interval_file. Also gives some stats about the intervals found.
Use: ./CountReadsSample <arch>.csam sample_interval_file
output: On screen
-[GetIntervalSAM]: Extracts from a csam file all the alignment lines within the interval (ref,x,y)
Use: ./GetIntervalSAM <arch> ref_name pos_x pos_y
arch: .csam File
ref: reference name
pos_x, pos_y: interval positions
output: file_name + "_inter.sam" File
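As a reference for what an interval query returns, the sketch below applies the same kind of selection to plain, uncompressed SAM lines. It is not how GetIntervalSAM works internally. It assumes that a line belongs to the interval when its POS field lies in [pos_x, pos_y]; whether the tool also reports reads that only overlap the interval is not stated here. The function name is hypothetical.

```python
def lines_in_interval(sam_lines, ref, pos_x, pos_y):
    """Return the alignment lines whose RNAME is ref and whose POS is in [pos_x, pos_y]."""
    selected = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment line
        rname, pos = fields[2], int(fields[3])
        if rname == ref and pos_x <= pos <= pos_y:
            selected.append(line)
    return selected
```

Running this over the output of DecompressSAM gives a slow baseline against which the result of an interval query can be compared.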
-[GetIntervalSeq]: Same as before but extracting only the SEQ field.
-[GetIntervalSAMSample]: Same as GetIntervalSAM but receives a file containing many intervals to query
Use: ./GetIntervalSAMSample <arch>.csam sample_interval_file [BuffSizeInBytes]
-[GetIntervalSeqSample]: Same as before but only extracting the SEQ field
-[GetIntervalSSN]: Same as GetIntervalSAMSample but extracting only a selection of the fields and replacing the rest with empty values. For the moment it extracts a minimal set (QNAME FLAG RNAME POS MAPQ SEQ). Modify line 113 of the file to change this option (TODO: do it by command line)
Use: ./GetIntervalSSN <arch>.csam sample_interval_file
output: file_name + "_inter.sam" File
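The effect of keeping only the minimal set can be sketched on a plain SAM line as follows. This is an illustration under assumptions, not the code of GetIntervalSSN: the empty values used here ("*" for CIGAR, RNEXT and QUAL, "0" for PNEXT and TLEN) follow the SAM convention for unavailable fields, and dropping the optional tags is also an assumption. The names are hypothetical.

```python
# Column indices of the 11 mandatory SAM fields that are kept as-is:
# QNAME(0) FLAG(1) RNAME(2) POS(3) MAPQ(4) SEQ(9)
EMPTY_VALUES = {5: "*", 6: "*", 7: "0", 8: "0", 10: "*"}  # CIGAR RNEXT PNEXT TLEN QUAL


def keep_minimal_fields(line):
    """Return the line with the non-selected mandatory fields replaced by empty values."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("not a complete SAM alignment line")
    fields = fields[:11]  # optional tags are dropped (assumption)
    for index, empty in EMPTY_VALUES.items():
        fields[index] = empty
    return "\t".join(fields)
```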
This library also contains the following programs in the stats_src directory:
-[Change_qual]: Changes the quality field of a SAM file with the quality file given
Use: python ./Change_qual.py file.sam new_qual.qual
output: newSAM.sam
-[Change_qual_letter]: Changes the quality field of a SAM file to only one quality score value
Use: python ./Change_qual_letter.py file.sam letter name_output.sam
output: name_output.sam
-[ComputeEntroHist]: Computes the order-0 entropy of a file and returns the histogram of each symbol
(compile first: g++ -o ComputeEntroHist ComputeEntroHist.cpp)
Use: ./ComputeEntroHist <arch> <out_arch>
output: Prints the entropy of the file on screen, and writes the histogram of the symbols to <out_arch>
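For reference, the order-0 entropy is H = -sum(p_i * log2(p_i)), where p_i is the relative frequency of symbol i. The sketch below computes it together with the histogram. It treats every byte as a symbol, which is an assumption about what the program counts, and the function name is hypothetical.

```python
import math
from collections import Counter


def entropy_and_histogram(data):
    """Return (order-0 entropy in bits per symbol, histogram) for a bytes object."""
    histogram = Counter(data)  # maps byte value -> number of occurrences
    total = len(data)
    if total == 0:
        return 0.0, histogram
    entropy = -sum((count / total) * math.log2(count / total)
                   for count in histogram.values())
    return entropy, histogram
```

Multiplying the entropy by the number of symbols gives the size in bits that an ideal order-0 coder would reach, which is a useful baseline for the compressed quality files.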
-[ComputeMetrics]: Compares two quality files and computes some distance metrics
(compile first: g++ -o ComputeMetrics ComputeMetrics.cpp)
Use: ./ComputeMetrics qualityFile.qual referenceFile.qual
output: Prints the Manhattan, Max:Min, MSE, Chebyshev, Soergel and Lorentzian metrics on screen.
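The metrics named above can be sketched as follows for two equal-length lists of positive quality values. The definitions used are the common ones and are assumed here, not taken from ComputeMetrics.cpp; in particular, Max:Min is assumed to be the largest ratio max(x_i, y_i) / min(x_i, y_i), and any normalization the program applies may differ. The function name is hypothetical.

```python
import math


def distance_metrics(x, y):
    """Compute several distances between two equal-length lists of positive numbers."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    return {
        "manhattan": sum(diffs),                                   # sum |x_i - y_i|
        "max_min": max(max(a, b) / min(a, b) for a, b in zip(x, y)),
        "mse": sum(d * d for d in diffs) / len(diffs),             # mean squared error
        "chebyshev": max(diffs),                                   # max |x_i - y_i|
        "soergel": sum(diffs) / sum(max(a, b) for a, b in zip(x, y)),
        "lorentzian": sum(math.log(1 + d) for d in diffs),         # sum ln(1 + |x_i - y_i|)
    }
```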
-[Get_qual]: Extracts the quality field of a SAM file
Use: python ./Get_qual.py file.sam
output: file.sam.qual
-[Get_seq]: Extracts the reference, position and sequence fields of a SAM file
Use: python ./Get_seq.py file.sam
output: file.sam.rps
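The three fields are columns 3, 4 and 10 of each SAM alignment line (RNAME, POS, SEQ). A minimal sketch of the extraction is shown below. It is not the code of Get_seq.py, the function name is hypothetical, and the exact layout of the .rps file is not reproduced; the records are returned as tuples instead.

```python
def extract_rps(sam_lines):
    """Return a list of (rname, pos, seq) tuples, one per alignment line."""
    records = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment line
        records.append((fields[2], int(fields[3]), fields[9]))
    return records
```

The quality field used by Get_qual can be read the same way from column 11 (fields[10]).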
-[getVCF]: Simple example of how to generate the vcf file of a BAM file using mpileup and bcftools.
Use: ./getVCF reference_file file.bam
output: file.bam.vcf
-[get_vcf_stats]: Compares two vcf files computing true positive, false positive, false negative, precision, recall, and MSE.
Use: python ./get_vcf_stats.py original.vcf second.vcf
output: Prints the stats on screen
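The counting part of such a comparison can be sketched as below. This is an illustration under assumptions, not the code of get_vcf_stats.py: a variant is identified here by (CHROM, POS, REF, ALT), the first file is taken as the ground truth, and the MSE is left out because the quantity it is computed over is not described here. The function names are hypothetical.

```python
def read_variants(vcf_lines):
    """Collect the (chrom, pos, ref, alt) key of every record in a VCF."""
    variants = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank, meta-information or header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        variants.add((chrom, int(pos), ref, alt))
    return variants


def compare_variants(original, second):
    """Compare two variant sets, treating `original` as the truth."""
    tp = len(original & second)   # called in both
    fp = len(second - original)   # only in the second file
    fn = len(original - second)   # missed by the second file
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}
```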
Note: These programs assume that the computer has enough RAM to read and store the complete input.