-
Notifications
You must be signed in to change notification settings - Fork 2
/
hicup_truncater
executable file
·659 lines (548 loc) · 25.8 KB
/
hicup_truncater
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
use POSIX ":sys_wait_h"; #for nonblocking read
use File::Basename;
use FindBin '$Bin';
use lib $Bin;
use hicup_module;
use hicup_module qw(hashVal);
use Data::Dumper;
###################################################################################
###################################################################################
##This file is Copyright (C) 2023, Steven Wingett ##
## ##
## ##
##This file is part of HiCUP. ##
## ##
##HiCUP is free software: you can redistribute it and/or modify ##
##it under the terms of the GNU General Public License as published by ##
##the Free Software Foundation, either version 3 of the License, or ##
##(at your option) any later version. ##
## ##
##HiCUP is distributed in the hope that it will be useful, ##
##but WITHOUT ANY WARRANTY; without even the implied warranty of ##
##MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ##
##GNU General Public License for more details. ##
## ##
##You should have received a copy of the GNU General Public License ##
##along with HiCUP. If not, see <http://www.gnu.org/licenses/>. ##
###################################################################################
###################################################################################
##########################################################
#Get user-supplied parameters
#Option variables
my %config = (
nofill => '',
config => '',
datestamp => '',
example => '',
help => '',
keep => '',
outdir => '',
quiet => '',
seq_trunc => '',
re1 => '',
r => '',
threads => '',
version => '',
zip => ''
);
my $config_result = GetOptions(
"nofill" => \$config{nofill},
"config=s" => \$config{config},
"example" => \$config{example},
"datestamp=s" => \$config{datestamp},
"help" => \$config{help},
"keep" => \$config{keep},
"outdir=s" => \$config{outdir},
"quiet" => \$config{quiet},
"re1=s" => \$config{re1},
"sequence=s" => \$config{seq_trunc},
"r=s" => \$config{r},
"threads=i" => \$config{threads},
"version" => \$config{version},
"zip" => \$config{zip}
);
die "Could not parse options" unless ($config_result);
$config{help} = 1 unless ( hashVal(%config) ); #Print help and exit if no command line parameters
if ( $config{help} ) {
print while (<DATA>);
exit(0);
}
#Print version and exit
if ( $config{version} ) {
print "HiCUP+ Truncater v$hicup_module::VERSION\n";
exit(0);
}
if ( $config{example} ) {
print_example_config_file('truncater_example.conf');
exit(0);
}
if ( $config{threads} eq '' ) {
$config{threads} = 1;
}
#########################################################################################
#Read the config file to identify the sequences to be screened and the ligation boundary
#Config file name passed as a command line argument
#Process the configuration file
my @filenames;
if ( hasval( $config{config} ) ) {
@filenames = process_config( $config{config}, \%config ); #Modifies %config and returns an array of the filenames
if ( scalar @filenames % 2 ) {
die "There needs to be an even number of files in the configuration file, see hicup --help for more details.\n";
}
}
if (@ARGV) {
if ( scalar @ARGV % 2 ) {
die "There needs to be an even number of files specified in the command line, see hicup_truncater --help for more details.\n";
}
push( @filenames, @ARGV ); #Add filenames specified in the command line to those in the configuration file
}
unless ( check_files_exist( \@filenames, 'EXISTS' ) ) {
die "Please adjust configuration.\n";
}
my %files = @filenames; #%files : hash of paired forward and reverse files
my $rA_junction_sequences = check_re1();
my @ligation_sequences;
#Check the output directory exists
if ( $config{outdir} ne '' ) {
unless ( -d $config{outdir} ) {
die "Output directory '$config{outdir}' does not exist or is not writable.\n";
}
#Make sure that $outdir ends with the forward slash character
unless ( $config{outdir} =~ /\/$/ ) {
$config{outdir} .= '/';
}
} else {
$config{outdir} = './'; #Make current directory if none specified
}
#Check R installed
checkR( \%config );
#Determine the ligation boundary sequence
my $rA_ligsite;
if ( $config{seq_trunc} eq '' ) { #$origseq and $before_cut only required if Hi-C ligation junction determined computationally
if ($config{nofill}) {
#Fill-in not performed, ligation junction is the original restriction site sequence without the caret
my $rA_cutsites;
foreach my $re1 ( @{ $config{re1} }) {
$rA_cutsites = cutsite_deduce( -seq => $re1); #Passing a hash -seq -> {$re1} to cutsite_deduce, which returns ref to array
push @{$rA_ligsite}, $rA_cutsites;
}
} else {
$rA_ligsite = fill_end( $rA_junction_sequences );
}
}
my %unique_ligation_sites; #Used for de-duplicating, which may occur following new code to calculate sequences containing Ns
foreach my $ligation_sites (@{$rA_ligsite}) {
foreach my $ligation_site ( @{$ligation_sites}){ #Array of arrays
$unique_ligation_sites{$ligation_site} = undef;
}
}
if( $config{nofill} and (scalar (keys %unique_ligation_sites) > 1) ){
die "Option '--nofill' is not supported for restriction enzyme digestion at more than one cut-site\nPlease adjust configuration.\n";
}
my $sequences_string = "[";
foreach my $ligation_site (keys %unique_ligation_sites){
$sequences_string .= join ", ", $ligation_site;
$sequences_string .= ", ";
}
$sequences_string =~ s/, $/]/;
print "Truncating with HiCUP+ Truncater v$hicup_module::VERSION\n" unless ( $config{quiet} );
print "Truncating sequences at occurrence of sequences '$sequences_string'\n" unless ( $config{quiet} );
#Before processing the sequence file check whether the output output files
#(yet to be generated) don't already exist. Doing this now means the
#program will less likely fail during the middle of sequence file processing
foreach my $filename (@filenames) {
$filename = fileNamer( $filename, \%config, 'truncater' );
$filename = $config{outdir} . $filename; #Add output directory filename
if ( -e $filename ) {
die "Outputfile \'$filename\' already exists. Please delete or rename file.\n";
}
}
#Create a date-stamped summary file
if ( $config{datestamp} eq '' ) {
$config{datestamp} = datestampGenerator();
}
my ($summaryfile_temp) = fileNamer( 'SUMMARY', \%config, 'truncater', 0, 0, 0, 1, 0 ); #Will return an array
$summaryfile_temp = $config{outdir} . $summaryfile_temp;
if ( -e "$summaryfile_temp" ) {
die "Summary file \'$summaryfile_temp\' already exists. Please delete or rename file.\n\n";
}
open( SUMMARY_TEMP, ">$summaryfile_temp" ) or die "Could not write to $summaryfile_temp\n";
print SUMMARY_TEMP "File\tTotal_Reads_Processed\tTruncated\t%Truncated\tNot_truncated\t%Not_truncated\tAverage_length_truncated_sequence\n";
close SUMMARY_TEMP or die "Could not close filehandle on summary file '$summaryfile_temp' : $!";
#Process the sequence file(s)
print "Truncating sequences\n" unless ( $config{quiet} eq '1' );
my $terminate = 0; #Instruct script to die if error detected in child process
my %children; #Hash of child processes
foreach my $fileF ( keys %files ) {
my @pair = ( $fileF, $files{$fileF} );
foreach my $inputfile (@pair) {
my $pid = fork();
die "cannot fork" unless defined $pid;
if ( $pid == 0 ) {
process_file($inputfile);
exit 0;
} else {
$children{$pid} = 1;
while ( keys(%children) == $config{threads} ) {
sleep(1);
reaper();
}
}
}
}
#Make sure all child processes have terminated before exiting
do {
sleep(1);
reaper();
} until ( keys(%children) == 0 );
#Process the temporary summary file and order so that paired files are adjacent
#to one another in the final summary file. This is ESSENTIAL, since HiCUP Reporter uses this
#to determine file forward/reverse pairings!!!
my ($summaryfile) = fileNamer( 'SUMMARY', \%config, 'truncater', 0, 1, 0, 0, 0 ); #Will return an array
$summaryfile = $config{outdir} . $summaryfile;
if ( -e "$summaryfile" ) {
die "Summary file '$summaryfile' already exists, please delete or rename file.\n";
}
open(SUMMARY_TEMP, '<', $summaryfile_temp) or die "Could not open '$summaryfile_temp' : $!";
open( SUMMARY, '>', $summaryfile ) or die "Could not write to $summaryfile : $!";
print SUMMARY scalar <SUMMARY_TEMP>; #Print header row
my %summaryfile_data; # %{FASTQ filename} = summary line
while(<SUMMARY_TEMP>){
my $line = $_;
my ($fastq_file) = split(/\t/, $line);
$summaryfile_data{$fastq_file} = $line;
}
close SUMMARY_TEMP or die "Could not close filehandle on '$summaryfile_temp' : $!";
foreach my $fileF ( keys %files ) { #Now extract the file pair data to file summary file
my $fileR = basename($files{$fileF});
$fileF = basename($fileF);
unless(exists $summaryfile_data{$fileF}){
die "Could not find $files{$fileF} in summary hash - this should not happen!\n"; #Internal check
}
unless(exists $summaryfile_data{$fileR}){
die "Could not find $files{$fileR} in summary hash - this should not happen!\n"; #Internal check
}
print SUMMARY $summaryfile_data{$fileF};
print SUMMARY $summaryfile_data{$fileR};
}
close SUMMARY or die "Could not close filehandle on summary file '$summaryfile' : $!";
unlink $summaryfile_temp or warn "Could not delete temporary file '$summaryfile_temp'\n";
#Produce summary graph
unless ( $config{r} eq '0' ) { #R not installed/found
my $command = $config{r} . 'script ' . "$Bin/r_scripts/hicup_truncater_summary.r $config{outdir} $summaryfile";
!system("$command") or warn "Could not produce hicup_truncater summary bar chart: $command: $!";
}
print "Truncating complete\n" unless ( $config{quiet} eq 1 );
exit(0);
#####################################################ls##################################
##################################
#Subroutines #
#######################################################################################
sub check_re1 {
my $parameters_ok = 1;
$config{re1} =~ s/\s//g; #Edit variable (declared outside of subroutine)
$config{re1} =~ tr/atcg/ATCG/;
chomp $config{re1};
#Check restriction enzymes in correct format
if ( $config{re1} eq '' ) { #RE1 present
warn "Please specify restriction enzyme (--re1)\n";
$parameters_ok = 0;
}
#Split re1 into multiple enzyme entries, using an array reference
if ( $config{re1} =~ /:/ ) {
my $rA_res = [ split( /:/, $config{re1} ) ];
$config{re1} = $rA_res;
#Even if there is only one re1 enzyme specified (i.e. no ":" found), we still store it in an array reference
} else {
$config{re1} = [ $config{re1} ];
}
#Check whether the RE names been included
my @re_seq = ();
my @re_name = ();
foreach my $re_string (@{$config{re1}}) {
if ( $re_string =~ /,/ ) { #if "," is present, then the string contains the RE name
my ( $seq, $name ) = split( /,/, $re_string );
$seq = uc $seq; #Standardise by converting to upper case
push @re_seq, $seq;
push @re_name, $name;
} else {
push @re_seq, uc $re_string;
push @re_name, 're1_unspecified';
}
}
$config{re1} = \@re_seq;
$config{re1_name} = \@re_name;
#Check RE are in correct format
for ( my $i = 0; $i < scalar(@{$config{re1}}); $i++ ) {
unless ( $config{re1}->[$i] =~ /^[ACGTN\^]+$/ ) {
warn "Restriction enzyme: '$config{re1}->[$i]' should only contain the characters: 'A','G','C','T', 'N' or '^'\n";
$parameters_ok = 0;
}
unless ( ( $config{re1}->[$i] =~ tr/\^// ) == 1 ) {
warn "Restriction enzyme: '$config{re1}->[$i]' should contain one caret character ('^')\n";
$parameters_ok = 0;
}
unless ( $config{re1_name}->[$i] =~ /^\w*$/ ) {
warn "The restriction enzyme name '$config{re1_name}->[$i]' should only contain alphanumeric characters\n";
$parameters_ok = 0;
}
}
#Create a config term storing the sequence with the caret (the caret will be removed from the original)
my @re_name_with_caret = ();
for ( my $i = 0; $i < scalar(@{$config{re1}}); $i++ ) {
push @re_name_with_caret, $config{re1}->[$i];
}
$config{re1_with_caret} = \@re_name_with_caret;
unless ( $parameters_ok ) {
die "Please change configuration file and/or command-line parameters and/or installation accordingly\n";
}
my @re_caret_position = ();
my @cut_sites = ();
for ( my $i = 0; $i < scalar(@{$config{re1}}); $i++ ) {
#Store the caret position of re1
$config{re1}->[$i] =~ /\^/g;
my $caret_position = pos($config{re1}->[$i]) - 1;
push @re_caret_position, $caret_position;
$config{re1}->[$i] =~ s/\^//g;
#Check if current re1 cut-site contains any Ns
if ( $config{re1}->[$i] =~ m/N/ ) {
#If Ns are found, then deduce all the possible cut-sites and push them to @cut_sites as an array ref
my $rA_cutsites = cutsite_deduce( -seq => $config{re1}->[$i] );
push @cut_sites, $rA_cutsites;
} else {
#If no N, then push the current re1 cut-site still as an array ref, so each case will be treated the same way i.e. as an array
push @cut_sites, [ $config{re1}->[$i] ];
}
}
$config{re1_caret_position} = \@re_caret_position;
$config{re1_deduced_cut_sites} = \@cut_sites;
my @junctions = ();
for ( my $i = 0; $i < scalar(@{$config{re1}}); $i++ ) {
#Then, for each of the deduced cut sites, build the start & end sequences of the ligation junctions
my @junctions_per_site = ();
foreach my $cutsite (@{$config{re1_deduced_cut_sites}->[$i]}) {
my $start_seq = substr($cutsite, 0, length($cutsite) - $config{re1_caret_position}->[$i]);
my $end_seq = substr($cutsite, $config{re1_caret_position}->[$i]);
push @junctions_per_site, { -start => $start_seq,
-end => $end_seq };
}
push @junctions, \@junctions_per_site
}
$config{re1_caret_position} = \@re_caret_position;
$config{re1_start_end_junction} = \@junctions;
return \@junctions;
}
#######################
#Subroutine "fill_end":
#The ligation sequence is the result of cutting DNA with a restriction
#enzyme, filling-in the overhangs and then performing a blunt-ended ligation.
#For example, BglII (5'-A^GATCT-3') cuts between A and G. So, the ligation
#sequence will be: A + GATC + GATC + T = AGATCGATCT.
sub fill_end {
my @junction_sequences = @{$_[0]};
my @ligation_sequences;
for ( my $i = 0; $i < scalar(@junction_sequences); $i++ ) {
my @ligation_sequences_per_site = ();
foreach my $rH_junction1 (@{$junction_sequences[$i]}) {
for ( my $j = 0; $j < scalar(@junction_sequences); $j++ ) {
foreach my $rH_junction2 (@{$junction_sequences[$j]}) {
my $ligation_sequence = $rH_junction1->{-start} . $rH_junction2->{-end}; #Potentially new sequence created at the ligation interface
push @ligation_sequences_per_site, $ligation_sequence;
# print $config{re1_name}->[$i] . "\t" . $ligation_sequence . "\n";
}
}
}
push @ligation_sequences, \@ligation_sequences_per_site;
}
$config{re1_ligation_sequences} = \@ligation_sequences;
return \@ligation_sequences;
}
###########################
#Subroutine "process_file":
#Truncates DNA sequences containing the putative Hi-C ligation sequence
sub process_file {
my $inputfile = $_[0];
#Create variables for counting the number of sequences truncated/not truncated
my $reads_processed = 0;
my $truncated = 0;
my $not_truncated = 0;
#Array to calculate average size of a truncated sequence - [0]:Sum of lengths; [1]:Number of truncated sequences
my @truncated_lengths;
#Check whether the input file is zipped and then open accordingly
if ( $inputfile =~ /\.gz$/ ) {
open( IN, "gunzip -c $inputfile |" ) or die "Couldn't read file \'$inputfile\' : $!";
} elsif ( $inputfile =~ /\.bz2$/ ) {
open( IN, "bzcat $inputfile |" ) or die "Couldn't read file \'$inputfile\' : $!";
} else {
open( IN, $inputfile ) or die "Couldn't read file \'$inputfile\' : $!";
}
print "Truncating $inputfile\n" unless ( $config{quiet} eq '1' );
my $outputfile = fileNamer( $inputfile, \%config, 'truncater' );
$outputfile = $config{outdir} . $outputfile;
if ( $config{zip} eq '1' ) {
open( OUT, "| gzip -c - > $outputfile" ) or die "Couldn't write to file '$outputfile': $!";
} else {
open( OUT, ">$outputfile" ) or die "Couldn't write to file '$outputfile' : $!";
}
#Truncates the sequence (and quality scrore) up until and including the
#restriction site, writing the results into a new file.
while (<IN>) {
if ( /^@/ or /^[ATCG]+_/ or /^NO_BARCODE_/ ) {
my $line1 = $_;
my $line2 = scalar <IN>;
my $line3 = scalar <IN>;
my $line4 = scalar <IN>;
chomp $line2;
chomp $line4;
$reads_processed++;
if ( $config{seq_trunc} ne '' ) { #Process differently depending on whether the user provides ligation sequences
my $truncation_occurred = 0; #Boolean flag to determine whether the truncation occurred
#Loop through all the ligation junctions and truncate at the position where the ligation juction differs from the reference: '_'
my @truncated;
foreach my $sequence (@ligation_sequences) {
my $sequence_no_underscore = $sequence;
$sequence_no_underscore =~ s/_//g; #Remove the '_' delimit symbol from the sequence
my ($length_same) = split( /_/, $sequence );
$length_same = length($length_same); #Number of bases in which the ligation junction and reference genome correspond before first mismatch
my $line2_temp = $line2; #Create temp line to avoid cutting $line2
@truncated = split( /$sequence_no_underscore/, $line2_temp ); #Cut the sequence
unless ( ( length $line2 ) == ( length $truncated[0] ) ) { #Has the sequence been cut?
$length_same += length $truncated[0]; #Length the same is now the length of the read which matched the reference genome
$line2 = substr( $line2, 0, $length_same );
$line4 = substr( $line4, 0, $length_same );
$truncation_occurred = 1;
}
}
if ($truncation_occurred) {
$truncated_lengths[0] += length $line2;
$truncated_lengths[1]++;
$truncated++;
} else {
$not_truncated++;
}
} else {
#Finds the ligation site, if present, and truncates accordingly
#(swapping the ligation sequence for the restriction site up until the cut site).
#Counts the sequences truncated/not-truncated.
my $string = join "|", map { @{$_} } @{$rA_ligsite};
if ( $line2 =~ /($string)/ ) {
my $ligsite = $1;
my $re1;
for (my $i = 0; $i < scalar(@{$config{re1}}); $i++) {
if ( grep $_ eq $ligsite, @{$rA_ligsite->[$i]} ) {
$re1 = $config{re1}->[$i];
last;
}
}
my @truncated = split( /$ligsite/, $line2 );
unless ( ( length $line2 ) == ( length $truncated[0] ) ) {
if(length $truncated[0] == 0) {
$line2 = substr($line2, 0, 1); #Prevent truncated reads being of zero length
}elsif($re1 =~ /N/){ #If re1 contains an N, don't append this to read, else will add an N to the read
$line2 = $truncated[0]
} else {
$line2 = $truncated[0] . $re1;
}
$line4 = substr( $line4, 0, ( length $line2 ) );
$truncated_lengths[0] += length $line2;
$truncated_lengths[1]++;
$truncated++;
} else {
$not_truncated++;
}
} else {
$not_truncated++;
}
}
print OUT $line1 . $line2 . "\n" . $line3 . $line4 . "\n";
}
}
#Close input and output files
close IN or die "Could not close filehandle on '$inputfile' : $!";
close OUT or die "Could not close filehandle on '$outputfile' : $!";
#Write results to the summary file
my $percent_trunc;
my $percent_not_trunc;
my $average_trunc_length;
if ($reads_processed) { #Avoid division by zero errors
$percent_trunc = $truncated / ($reads_processed) * 100;
$percent_not_trunc = $not_truncated / ($reads_processed) * 100;
if ($truncated) { #Avoid division by zero errors
$average_trunc_length = $truncated_lengths[0] / $truncated_lengths[1];
} else {
$average_trunc_length = 0;
}
} else {
$percent_trunc = 0;
$percent_not_trunc = 0;
$average_trunc_length = 0;
}
open( SUMMARY_TEMP, ">>$summaryfile_temp" ) or die "Could not write to $summaryfile_temp : $!";
my $filename_for_summary = ( split( /\//, $inputfile ) )[-1]; #Removes folder extensions
print SUMMARY_TEMP "$filename_for_summary\t$reads_processed\t$truncated\t";
printf SUMMARY_TEMP "%.2f", $percent_trunc;
print SUMMARY_TEMP "\t$not_truncated\t";
printf SUMMARY_TEMP "%.2f", $percent_not_trunc;
print SUMMARY_TEMP "\t";
printf SUMMARY_TEMP "%.2f", $average_trunc_length;
print SUMMARY_TEMP "\n";
close SUMMARY_TEMP or die "Could not close filehandle on '$summaryfile_temp' : $!";
}
#####################
#Subroutine "reaper":
#reaps dead child processes
sub reaper {
#Don't change $! and $? outside handler
local ( $!, $? );
my $pid = waitpid( -1, WNOHANG );
return if $pid == -1;
unless ( defined $children{$pid} ) {
return;
} else {
my $exit_value = $? >> 8;
if ($exit_value) {
$terminate = 1;
}
delete $children{$pid};
}
}
__DATA__
HiCUP homepage: www.bioinformatics.babraham.ac.uk/projects/hicup
SYNOPSIS
hicup_truncater script terminates reads at Hi-C ligation junctions
hicup_truncater [OPTIONS]... -config [CONFIGURATION FILE]...
hicup_truncater [OPTIONS]... [FASTQ FILE PAIRS]...
FASTQ file pairs should be place next to each other when using the command line, or
on adjacent lines in the configuration file.
FUNCTION
Valid Hi-C pairs comprise two DNA fragments from different regions of the genome
ligated together. The hicup_truncater script identifies ligation junctions within
reads and deletes sequences downstream of the restriction enzyme recognition
sequence.
The names of the files to be processed and the restriction enzyme recogniton site
may be passed tonthe scrip using a configuration file or command line arguments.
COMMAND LINE OPTIONS
--config Name of the optional configuration file
--help Print program help and exit
--nofill Hi-C protocol did NOT include a fill-in of sticky ends prior to
re-ligation and therefore reads shall be truncated at
the restriction site sequence. This feature is only supported for
single restriction enzyme Hi-C.
--outdir Directory to write output files
--quiet Suppress all progress reports
--re1 Restriction enzyme recognition sequence. e.g. A^GATCT,BglII
HiCUP can accomodate more than one enzyme and N nucleotides
e.g. A^GATCT,BglII:A^AGCTT,HindIII:^GANTC,myRE.
--sequences Instead of specifying a restriction enzyme recognition sequence,
specify the ligation sequences directly
--threads Number of threads to use, allowing simultaneous processing of
different files
--version Print the program version and exit
--zip Compress output using gzip
Full instructions on running the pipeline can be found at:
www.bioinformatics.babraham.ac.uk/projects/hicup
Steven Wingett, Babraham Institute, Cambridge, UK