forked from eriqande/eca-bioinf-handbook
-
Notifications
You must be signed in to change notification settings - Fork 0
/
2.05-managing-workflows-with-snakemake.Rmd
1697 lines (1473 loc) · 71.6 KB
/
2.05-managing-workflows-with-snakemake.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Managing Workflows with Snakemake {#snakemake-chap}
In the preceding sections we have covered a few approaches for using RStudio
projects and RMarkdown to create reproducible research repositories with visually
pleasing HTML outputs that can help readers quickly see what was required to
generate different figures/tables/outputs. Such an approach
is well suited to projects that are undertaken primarily on one's own desktop or
laptop computer, and which don't take an inordinately long time to run. Unfortunately,
that does not describe most _bioinformatic_ analyses handling whole-genome sequencing
or RAD sequencing data for conservation genetics. Most of those analyses will be
run on a university computer cluster, and trying to manage all of that within
an RMarkdown-based framework is at best rather difficult, and at worst it can
become a shrieking horror.
For managing these sorts of bioinformatic workflows there is a Python-based
framework called [Snakemake](https://snakemake.readthedocs.io/en/stable/)
that is well-supported, flexible, and incredibly powerful.
Understanding how Snakemake works and becoming familiar with its many
features involves a non-trivial learning curve; however, for anyone spending
a sizable chunk of their graduate career managing bioinformatic analyses, or for anyone that is
a lab bioinformatician running bioinformatic workflows across
many species (for example), the benefits of learning
Snakemake will continue to pay dividends for years to come.
Someone who has mastered all the topics in Part I of this handbook (Unix programming,
working on remote computers, etc.) certainly will have the skills to write
what I will call _input-oriented_, _forward-marching_ workflows. I call these
"forward-marching" because they start from a set of
input files (for example files of sequences from a sequencer), and then
the workflow is defined as a series of sequential steps, one after the other.
They are "input-oriented" in the sense that such a workflow starts
with a series of inputs, but the workflow itself doesn't really know
what it is trying to produce from those inputs until it has run all the
way through and actually turned those inputs into outputs.
For example, all the input
files might get trimmed or cleaned, then they would all get mapped to a genome,
and then those mapped sequences would be used to call variants, so as to
find SNPs in the data, etc. If you were writing this in an
input-oriented, forward-marching fashion, then, to deal with the fact that you had multiple files of
DNA sequences (for example, one for each individual bird or fish that had
been sampled), you might write Unix `for` loops to cycle over all the input files
as in Section \@ref(unix-for-loops), or, you could define a SLURM job array to start a separate job
instance for each input file, as in Section \@ref(slurm-job-arrays). In each case,
you would have to do some extra programming to deal with the different input files, and
if one of the steps of your workflow failed on just one, or a few, of the files,
you might spend a large amount of time tracking those failures down and than
manually re-running that small number of jobs.
By contrast, Snakemake takes a different approach to managing workflows.
We will call it an _output-oriented_, _backward-looking_ approach. We call it
that because workflows in Snakemake are defined first and foremost in terms of the _output files_
that are desired, along with instructions on how to create those output files
from necessary input files and bioinformatic programs. They are _backward-looking_
in the sense that, once you have developed a Snakemake workflow, you get results by
telling Snakemake which output files you
want to create, and then it "looks backwards" through the workflow to determine which
input files are needed to create the requested outputs. Sometimes it has to look backwards
through several steps before it identifies all the necessary input files.
Once it has found those necessary inputs, it then runs forward through the steps to
create the output files. In this phase, the workflow looks like it is "forward-marching", in the
sense that outputs are being created from inputs. But, in order to get to that
"forward-running" phase, Snakemake had to look backward to figure out what inputs to use.
The above constitutes some subtle points, that might not be clear upon first reading,
but we will try to summarize it in a few pithy phrases:
- A workflow defined as a typical Unix shell script can be thought of as a process that runs
forward. You give it a lot of input files and it just cranks through a bunch of steps until
the output files are made.
- A workflow defined with Snakemake works differently. First you define the workflow in terms
of a series of "rules." Then, when you request any given set of output files, Snakemake
will look backwards through
the rules of the workflow and figure out exactly which steps must be performed, on which input
files, in order to create the requested output files. Once it has determined that,
it runs just those necessary steps.
There are many advantages to this output-oriented, backward-looking approach:
1. If your workflow has many steps, and some of them have already been run,
then Snakemake automatically recognizes that, and will not re-run steps
in the workflow that have already been completed. In an input-oriented system (like
a traditional Unix script), you would have to spend the time to figure out
which steps had already been run, and then run your script only from that
point forward. Doing so can be a hassle and can also be prone to errors.
2. A workflow defined by Snakemake, being explicit about the inputs needed
for each step, naturally defines "work units" that can be run independently of
one another. Accordingly, Snakemake, itself, can break a huge bioinformatic workflow
into a number of small jobs that can be run in parallel, saving you, the user,
from having to write scripts to launch a series of SLURM job arrays.
3. The fact that Snakemake automatically keeps track of which inputs already
exist---and which might still need to be generated---provides huge benefits when some of your jobs fail.
Anyone who has used a cluster has stories about jobs that inexplicably fail. Without a workflow
management system like Snakemake, you can spend almost as much of your own time managing these
failed jobs as it took to launch all the jobs in the first place.
In addition to these obvious advantages of the output-oriented approach, Snakemake also
includes a number of features that make it easy to use your workflow on a variety of
different platforms. It is tightly integrated with conda (Section \@ref(miniconda)),
letting the user define
conda environments for each step in the workflow. This means that if you move your whole
workflow to a new cluster, you don't have to spend any time coordinating the installation
of the programs you need---if you set things up properly with Snakemake and conda,
that will happen automatically.
If you distribute your workflows to other people to use, this is particularly helpful, since
you will spend far less time assisting them in setting up their computer environment to run
your workflow. Snakemake can also be customized to work in your own cluster environment. Finally,
there are interfaces to allow your Snakemake workflows to run seamlessly in the cloud.
Full documentation for Snakemake can be found at
[https://snakemake.readthedocs.io/en/stable/](https://snakemake.readthedocs.io/en/stable/).
This documentation is comprehensive, but can feel a little daunting at first.
On the other hand, the developers of Snakemake also provide an excellent and accessible
tutorial at
[https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html](https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html).
If you have not run through that tutorial yet, you should do that before
going through the next sections in this chapter.
In the following I present another short tutorial that focuses a little more
on understanding how _wildcards_ in Snakemake work, as this seems to be a
stumbling block for many students. In addition, I try to offer a few tidbits
of advice regarding _input functions_ and simple ways to work interactively with
Python while you are developing your Snakemake workflows. I hope this will be particularly
helpful for students that aren't super familiar with Python.
Before proceeding, you will need to follow the instructions
[here](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda) to install Snakemake into a conda environment named `snakemake`.
Once installed, activate the `snakemake` environment and then add the `tree` Unix utility to
it:
```sh
conda activate snakemake
conda install -c conda-forge tree
```
`tree` is a utility that we will use in the following. It
provides a nice way of visualizing directories and their contents.
Once you have done that, make a directory called `smk-tut` and
`cd` into it to do this tutorial.
```sh
mkdir smk-tut
cd smk-tut
```
## A Simple Snakemake Example
Here is a somewhat make-believe example that will let us grok out the
basics of Snakemake and the logic of simple workflows without worrying about
performing actual operations (like trimming or mapping) on files. This is
a useful excercise because it can also be helpful when you are developing
your own complex workflows to first make sure that the workflow logic (from
Snakemake's point of view) is correct, without having to spend a long
time on the actual processing of files.
We will pretend that we have paired-end sequence files from a sequencing center and
we want to compare the outcome of using two different programs to trim leftover adapters
and low-quality sequence from these files. These two (utterly fictitious) programs
are `TrimmoMcAwesome` and `trim_stupendous`^[The astute reader will note that these names are similar
to the actual programs `Trimmomatic` and `trim_galore`. Just having some fun...].
We will pretend that both `TrimmoMcAwesome` and `trim_stupendous` work with this sort of syntax:
```sh
TrimmoMcAwesome file_read1.fq file_read2.fq outpath1 outpath2
# or
trim_stupendous file_read1.fq file_read2.fq outpath1 outpath2
```
which will trim the paired-end reads `file_read1.fq` and `file_read2.fq` and put the
results for read1 and read2, respectively, into the files `outpath1` and `outpath2`.
For now, let us assume that we have 10 input files: 2 paired-end read files
for each of 5 samples named `001` through `005`. Please create those files
(even though they will be empty) with this Unix command:
```sh
mkdir -p data/raw
touch data/raw/00{1..5}_R{1,2}.fq
```
Please note that the use of curly braces here is an example of "brace expansion" within
the Unix shell. You will see lots of curly braces in the upcoming material, and you should
be aware that the curly braces that you will see in Snakemake code
(in the `Snakefile` as it were) are different. Those are typically serving the
purpose of delimiting wildcards. (Though, they have other roles within Python as well.).
Once you have done that, use `tree` to see what the directory structure in
`smk-tut` looks like:
```sh
% tree .
.
└── data
└── raw
├── 001_R1.fq
├── 001_R2.fq
├── 002_R1.fq
├── 002_R2.fq
├── 003_R1.fq
├── 003_R2.fq
├── 004_R1.fq
├── 004_R2.fq
├── 005_R1.fq
└── 005_R2.fq
```
That handy little summary shows the names of 10 files inside a directory called
`raw` inside a directory called `data`.
### Snakefile #1: A simple rule to TrimmoMcAwesome the files from 001
If you did the tutorial, you will recall that the instructions for the workflow
go into the file, `Snakefile`. Make a Snakefile within the `smk-tut` directory
with these contents, which give it the rule for running TrimmoMcAwesome on
the reads from sample `001`:
```yaml
rule trim_awesome:
input:
"data/raw/001_R1.fq",
"data/raw/001_R2.fq"
output:
"results/awesome/001_R1.fq",
"results/awesome/001_R2.fq"
shell:
"TrimmoMcAwesome {input} {output}"
```
Notice that this rule takes the raw fastq files as input to TrimmoMcAwesome, and
puts the trimmed output files into a diretory `results/awesome`.
Once you have that Snakefile, you can ask snakemake to show you what it
would do if you ran it. Give it the `-n` option (dry run) and the `-p` option
(print the shell commands) by putting them together, preceded by a single dash:
```sh
# do this on your unix terminal in directory smk-tut
snakemake -np
```
The result you get should look like this:
```
Building DAG of jobs...
Job counts:
count jobs
1 trim_awesome
1
[Mon Aug 9 11:43:31 2021]
rule trim_awesome:
input: data/raw/001_R1.fq, data/raw/001_R2.fq
output: results/awesome/001_R1.fq, results/awesome/001_R2.fq
jobid: 0
TrimmoMcAwesome data/raw/001_R1.fq data/raw/001_R2.fq results/awesome/001_R1.fq results/awesome/001_R2.fq
Job counts:
count jobs
1 trim_awesome
1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
```
It is worth sitting down and looking closely at this output. The first block shows how many
jobs will run. The second block shows the rule that will be run in the job, and it
explicitly shows the values of the `input` and `output` of the block. The line at the top
of the third block shows the shell command that would be run. It looks correct. Great!
Another thing worthy of note: the requested output files in this rule are
actual file names and they appear in the first rule of the Snakefile.
Because of this, we were able simplty to run `snakemake -np` without
explicitly requesting any output files on the command line---Snakemake used
the two outputs from that first rule block.
Alternatively, we could have explicitly requested those output files on the command line,
like:
```sh
# do this on your unix terminal in directory smk-tut
snakemake -np results/awesome/001_R1.fq results/awesome/001_R2.fq
```
Try that, and note that you get the same result.
But now, try requesting (on the command line) the TrimmoMcAwesome output for a different sample,
like 002:
```sh
# do this on your unix terminal in directory smk-tut
snakemake -np results/awesome/002_R1.fq results/awesome/002_R2.fq
```
What happened when you did that? Well Snakemake should have replied with an error
message telling you, essentially, "What's your problem asking for those output files? My Snakefile
doesn't tell me how to make that output file!" And that is what it should tell you because
the rule `trim_awesome` only gives instructions on how to make the output files
`results/awesome/001_R1.fq` and `results/awesome/001_R2.fq`
So, what if we want a rule to make `results/awesome/002_R1.fq` and `results/awesome/002_R2.fq` and
awesome-trimmed output for all the other samples too. The naive,
and exceedingly inefficient, approach would be to
write a new rule block for every sample in your Snakefile till it looked like:
```yaml
rule trim_awesome_001:
input:
"data/raw/001_R1.fq",
"data/raw/001_R2.fq"
output:
"results/awesome/001_R1.fq",
"results/awesome/001_R2.fq"
shell:
"TrimmoMcAwesome {input} {output}"
rule trim_awesome_002:
input:
"data/raw/002_R1.fq",
"data/raw/002_R2.fq"
output:
"results/awesome/002_R1.fq",
"results/awesome/002_R2.fq"
shell:
"TrimmoMcAwesome {input} {output}"
# and so forth, until
rule trim_awesome_005:
input:
"data/raw/005_R1.fq",
"data/raw/005_R2.fq"
output:
"results/awesome/005_R1.fq",
"results/awesome/005_R2.fq"
shell:
"TrimmoMcAwesome {input} {output}"
```
Boy! Doing that would really suck! It would require lots of typing, lots of
chances to make an error, and if you wanted to change something in the shell
block, for example, you would have to change all of the rules.
What a hassle! Obviously,
no sane person would do that. And, thankfully no one has to, because
Snakemake allows a single block to apply to multiple instances of output
(and hence input) by using _wildcards_. The next section takes us through
that.
### Snakefile #2: A single TrimmoMcAwesome rule using wildcards
The best way to start thinking about wildcards is to imagine all the different
output files you would want from a rule, and then investigate the parts that
change. With samples 001 through 005, the TrimmoMcAwesome output we would want
would be these pairs of files:
```
results/awesome/001_R1.fq
results/awesome/001_R2.fq
results/awesome/002_R1.fq
results/awesome/002_R2.fq
results/awesome/003_R1.fq
results/awesome/003_R2.fq
results/awesome/004_R1.fq
results/awesome/004_R2.fq
results/awesome/005_R1.fq
results/awesome/005_R2.fq
```
OK! So, these are really similar, except that there is a part pertaining
to the sample number that is changing between each pair of files.
In fact, we could write down the general pattern for all of these output
file names in a sort of shorthand like this:
```
results/awesome/{sample}_R1.fq
results/awesome/{sample}_R2.fq
```
In other words every one of the 10 file names can be generated by replacing
`{sample}` in
```
results/awesome/{sample}_R1.fq
results/awesome/{sample}_R2.fq
```
with a value. For example, if you replace `{sample}` with `003`:
```
results/awesome/003_R1.fq
results/awesome/003_R2.fq
```
This shorthand is exactly the way that Snakemake specifies a whole family
of output files using _wildcards_. We will call
`"results/awesome/{sample}_R1.fq", "results/awesome/{sample}_R2.fq"` the
_wildcard phrase_. The wildcard in that wildcard phrase
is`{sample}`. Each wildcard in a wildcard phrase can take
_values_ or _instances_. When wildcard phrases are used in a Snakemake rule
block, the values/instances of the wildcards within them are found when an output file is requested
that _matches_ the wildcard phrase. This will become more clear as we proceed.
First, let us update our Snakefile to look like this:
```yaml
rule trim_awesome:
input:
"data/raw/{sample}_R1.fq",
"data/raw/{sample}_R2.fq"
output:
"results/awesome/{sample}_R1.fq",
"results/awesome/{sample}_R2.fq"
shell:
"TrimmoMcAwesome {input} {output}"
```
So, go ahead and make that your Snakefile.
Now, see what happens if you do:
```sh
# do this on your unix terminal in directory smk-tut
snakemake -np
```
Aha! Snakemake gets upset at you because it is trying to run rule trim_awesome
but the output requested in it is a wildcard phrase with no values or instances
for the wilcards. Snakemake doesn't really know how to run a rule to obtain a file
that is a wilcard phrase in which the wildcards do not have explicit values.
However, if you explicitly ask for certain output files, like:
```sh
# do this on your unix terminal in directory smk-tut
snakemake -np results/awesome/002_R1.fq results/awesome/002_R2.fq
```
then Snakemake finds the rule to create those output files (because
the requested files match the wildcard phrase with appropriate values
of the wildcard) and it knows
what the input files should be because it _propagates wildcard values from the
requested output into the wildcard phrases of the input_. Diagrammatically
that process looks like this:
```{r wildcard_propagation_1, echo=FALSE, fig.align='center', dpi=45, fig.cap="A diagrammatic view of how wildcard values get propagated."}
knitr::include_graphics("figs/wildcard_propagation_1.svg")
```
The upshot of this is that we can now easily request all the sample output files
like this:
```sh
# do this on your unix terminal in directory smk-tut
snakemake -np results/awesome/00{1..5}_R{1,2}.fq
```
Again, we note that these curly braces in the above shell command line
are different from the curly
braces you have seen in the Snakemake/Python wildcards. These are just
Unix brace-expansion curly braces. They merely expand into `results/awesome/001_R1.fq results/awesome/001_R2.fq results/awesome/002_R1.fq results/awesome/002_R2.fq results/awesome/003_R1.fq results/awesome/003_R2.fq results/awesome/004_R1.fq results/awesome/004_R2.fq results/awesome/005_R1.fq results/awesome/005_R2.fq`
Cool! Snakemake was able to figure out how to generate all of those output files by
running 5 jobs!
Before we leave this section, let's see what kind of error message Snakemake
gives us when we ask for output that it doesn't have what it needs to generate.
Try this:
```sh
# do this on your unix terminal in directory smk-tut
snakemake -np results/awesome/006_R1.fq results/awesome/006_R2.fq
```
From the error message, we see that Snakemake propagated the wildcard
value `{sample} = 006` to the input, but didn't find the appropriate input
files:
```
Building DAG of jobs...
MissingInputException in line 4 of /Users/eriq/Desktop/smk-tut/Snakefile:
Missing input files for rule trim_awesome:
data/raw/006_R1.fq
data/raw/006_R2.fq
```
This is good. Snakemake tells us (in the form of an error message before quitting)
when we are asking for something it can't produce.
### Snakefile #3: Add `trim_stupendous` in there
Now that we know how to deploy wildcards in our output and input file names
we can use them when we make another block for our `trim_stupendous` jobs:
Update your Snakefile so it looks like this:
```yaml
rule trim_awesome:
input:
"data/raw/{sample}_R1.fq",
"data/raw/{sample}_R2.fq"
output:
"results/awesome/{sample}_R1.fq",
"results/awesome/{sample}_R2.fq"
shell:
"TrimmoMcAwesome {input} {output}"
rule trim_stupendous:
input:
fq1="data/raw/{sample}_R1.fq",
fq2="data/raw/{sample}_R2.fq"
output:
"results/stupendous/{sample}_R1.fq",
"results/stupendous/{sample}_R2.fq"
shell:
"trim_stupendous {input.fq1} {input.fq2} {output}"
```
Note that we have written the `input` block a little bit differently
in that we named the different input file wildcard phrases. We named
them as `fq1` and `fq2` using this syntax:
```
input:
fq1="data/raw/{sample}_R1.fq",
fq2="data/raw/{sample}_R2.fq"
```
And then we also explicitly put those files into the shell command by accessing
the `input` with a `.fq1` and `.fq2`, where appropriate, like this:
```
"trim_stupendous {input.fq1} {input.fq2} {output}"
```
The reasons for this are twofold:
1. You should be familiar with the fact that you can give names to particular
parts of the `input` or `output` files (or even the `params` or `log` files),
because this makes it a lot easier to keep track of the inputs when there are many
of them.
2. In a moment, we will be exploring how to use Snakemake's `unpack()` function, and naming
the files this way provides us an opportunity to use it.
For now, we don't worry too much about it, except to note that we can
request all the different TrimmoMcAwesome and trim_stupendous output files
like this:
```sh
# do this on your unix terminal in directory smk-tut
snakemake -np results/awesome/00{1..5}_R{1,2}.fq results/stupendous/00{1..5}_R{1,2}.fq
```
and Snakemake can figure out that it needs to run 10 jobs.
## Ugly, complex file names...a job for Snakemake input functions!
I hope that all of the above has been fine and well for everyone. But I suspect that
the same question is on everyone's mind: _"Hey! What kind of unrealistic BS is this? I
never get files named something like `001_R1.fq` from the sequencing center!!"_
Indeed, more often than not, you will be dealing with FASTQ files that might be named
something like this:
```
kcr-wiwa-885261-L002-HGGXXX_R1.fastq.gz
kcr-wiwa-885261-L002-HGGXXX_R2.fastq.gz
kcr-wiwa-896753-L002-HGGXXX_R1.fastq.gz
kcr-wiwa-896753-L002-HGGXXX_R2.fastq.gz
plate2-WIWA67365-L002-HHHGYY_R1.fastq.gz
plate2-WIWA67365-L002-HHHGYY_R2.fastq.gz
```
where the R1 and R2 parts of the names are referring to read 1 and
read 2 of paired end reads, `885261`, `896753`, and
`WIWA67365` are sample IDs
or field numbers for the animals you are sequencing, and the rest is
not only a bit gibberishy, but also is not necessarily consistent across the
different individuals. In this sort of case, you cannot easily
propagate wildcard values into an input file name that is specified
by a single, consistent, wildcard phrase. Not only that, but
we probably don't want to specify a wildcard like `{sample}` to take values like
`885261`, `896753`, and `WIWA67365`. We certainly could, but because the
these field numbers are somewhat long and complex, if we do so, our
output file names will end up being kind of ugly too.
But don't fear! Snakemake allows for what are called _input functions_ that
make it possible to deal with these types of cases. An input function is merely
a Python function that takes as input some wildcard values, and will map those
to any arbitrary output values. And, used in conjunction with a simple table that
gives clean sample names to the complex field_numbers, it is possible to
keep the whole workflow much cleaner.
To prepare for this part of the lesson, we are going to want to create
files with those names, inside a directory called `raw_nasty_names`. Do that
by evaluating this code in the shell in your `smk-tut` directory:
```sh
mkdir -p data/raw_nasty_names
for i in kcr-wiwa-885261-L002-HGGXXX_R1.fastq.gz \
kcr-wiwa-885261-L002-HGGXXX_R2.fastq.gz \
kcr-wiwa-896753-L002-HGGXXX_R1.fastq.gz \
kcr-wiwa-896753-L002-HGGXXX_R2.fastq.gz \
plate2-WIWA67365-L002-HHHGYY_R1.fastq.gz \
plate2-WIWA67365-L002-HHHGYY_R2.fastq.gz; do
touch data/raw_nasty_names/$i;
done
```
Let's step back and imagine that we have a CSV file that gives us some
sample ID's that go to each of the field_numbers which are also associated with
the paths to the fastq files. It would look something like this:
```sh
sample,field_number,fastq1,fastq2
s001,885261,data/raw_nasty_names/kcr-wiwa-885261-L002-HGGXXX_R1.fastq.gz,data/raw_nasty_names/kcr-wiwa-885261-L002-HGGXXX_R2.fastq.gz
s002,896753,data/raw_nasty_names/kcr-wiwa-896753-L002-HGGXXX_R1.fastq.gz,data/raw_nasty_names/kcr-wiwa-896753-L002-HGGXXX_R2.fastq.gz
s003,WIWA67395,data/raw_nasty_names/plate2-WIWA67365-L002-HHHGYY_R1.fastq.gz,data/raw_nasty_names/plate2-WIWA67365-L002-HHHGYY_R2.fastq.gz
```
In fact, this file is super important. It tells you how the nice tidy
sample names like `s001` and `s002` correspond to the
field numbers of your animals, and also to the paths of the FASTQ files for them.
The next step is very important. Copy the above contents into a file called `samples.csv` within your
`smk-tut` directory. Once you have done that, we are ready to proceed.
Here is where we are going: we are going to read the `samples.csv` table
into aPython variable that will
be accessible by Snakemake, and which
can be used by it to translate wildcards like `{sample} = s001` into paths to FASTQ files.
In order to easily handle the `samples.csv` file as a variable in Python we will rely on the Pandas
package. If you are new to Python, you can think of Pandas as the
Python equivalent of R's 'dplyr' and 'readr' packages.
Let's play with it a little bit to familiarize ourselves with it...
### Playing with Pandas
The Pandas package is included in your `snakemake` conda environment, and we are going to play with
it a little bit, so, open a new terminal window, change the directory in it to your
`smk-tut` directory, activate your snakemake conda environment and then
start an interactive python session:
```sh
# do this in a new unix terminal in directory smk-tut
conda activate snakemake
python
```
Now, you have an interactive Python session in that terminal window. And you can do the following
to import the Pandas package into the session and read in the `samples.csv` file and print it:
```python
# do this in your python interpreter...
# this imports the pandas package functionality in an object named pd
import pandas as pd
# this reads the CSV file and sets an index using the values in the "sample" column.
# For R people you can think of this "index" as serving the role of rownames.
samples_table = pd.read_csv("samples.csv").set_index("sample", drop=False)
# now print the table
samples_table
```
After running the last line, you should have seen something like:
```
sample field_number fastq1 fastq2
sample
s001 s001 885261 data/raw_nasty_names/kcr-wiwa-885261-L002-HGGX... data/raw_nasty_names/kcr-wiwa-885261-L002-HGGX...
s002 s002 896753 data/raw_nasty_names/kcr-wiwa-896753-L002-HGGX... data/raw_nasty_names/kcr-wiwa-896753-L002-HGGX...
s003 s003 WIWA67395 data/raw_nasty_names/plate2-WIWA67365-L002-HHH... data/raw_nasty_names/plate2-WIWA67365-L002-HHH...
```
OK, `samples_table` is just a table of data. Importantly, it is
indexed by the sample name (s001, s002, etc.) This
means that if we want to access values in any particular row, we can find that row according to
the sample name, and then pick values out of particular columns within that row using this syntax:
```python
# do this in your python interpreter...
# get the field number of sample s002
samples_table.loc["s002", "field_number"]
# get the path to fastq file 1 for sample s003:
samples_table.loc["s003", "fastq1"]
```
That is quite tasty! The `.loc` is a method in Pandas to return the location (i.e., the row)
of a table according to its index. And then if you want to get the value from any particular
column, you just put the column name in there.
Note that if you want to get the values in multiple columns as a Python list
you can do like this:
```python
# do this in your python interpreter...
# get the paths to both fastq files for sample s003 in a list:
list(samples_table.loc["s003", ["fastq1", "fastq2"]])
```
This is totally sweet! This python code takes a value like `s002`---which could be the value
of a wildcard---and it returns another value based on it. In fact, it can return the
path of a FASTQ file. This is just the sort of thing we need to convert wildcard values
to FASTQ file pahts in our Snakefile. However, in order to use this type of code
we have to wrap it up into a Snakemake _input function_.^[Note that you can write the
code in a more "inline" fashion in your Snakefile by using what are called "lambda functions"
in Python. However, it can be cleaner and easier to understand explicitly written input functions.]
Before we go big and do that in our Snakefile, we are going to spoof
it in our interactive Python session. This turns out to be a great way to develop
and test your Snakemake input functions.
### A Quick Sidebar: Spoofing a Snakemake input function
Snakemake input functions are simply Python functions that take a single
input which is a Python object called `wildcards`. This object will have different
attributes, which are the particular values/instances of the wildcards themselves.
In our above example with the `trim_awesome` block, if you were requesting the
files `results/awesome/001_R1.fq` and `results/awesome/001_R2.fq`, Snakemake would
match that to the `trim_awesome` block, knowing that the value of the wildcard, `{sample}`
was `001`. In this case, Snakemake would be carrying around "in its brain" a little variable
called `wildcards` such that its `sample` attribute---accessible as `wildcards.sample` would be
`001`.
In order to start understanding this without actually having to be
"inside an ongoing Snakemake invocation/session", we can use our Python interactive session
to create a fictitious `wildcards` object and give
values to its attributes, which will give us a chance to test its use in an input function.
The following is not a formal part of working with Snakemake, and can be omitted; however
I have found this to be very helpful when writing complex input functions.
```python
# do this in your python interpreter...
# create a generic object class (called Foo here)
class Foo(object):
pass
# make a variable, wildcards, which is an object of that class
wildcards = Foo()
# now, if you want to test specific values for different wildcards you can do like:
wildcards.sample = "s002"
# print that value:
wildcards.sample
```
That code has defined an object named `wildcards` and endowed
it with an attribute `sample` whose value is `s002`.
Now, let us write a function that will take that `wildcards`
object and turn it into the path for a fastq2 for sample `s002`.
To define a function in Python: you use the `def`
keyword, then follow it by the name of the function and its
inputs (in this case `wildcards`), then a colon, and then an
indented block of code which constitutes the function,
followed by a non-indented line. Here is the definition for
a function that returns the path to the fastq2 file from
a wildcards object (and it uses the previously defined variable
called `samples_table`)
```python
# this is the function definition
def fq2_from_sample(wildcards):
return samples_table.loc[wildcards.sample, "fastq2"]
# here we use it on our spoofed wildcards object
fq2_from_sample(wildcards)
```
Eureka! That is a function that will take a wildcard value
(stored in a variable `wildcards`) and return a file path
that it finds in the `samples.csv`. We can now put that
all together in a new Snakefile, as shown in the next
section.
Before we leave this section, though, it is important to point
out that these input functions can be named anything that
we want. However, it makes sense to use a name that tells
what the function returns, and also the wildcards or inputs
that are used to create the return value. (Hence the names
chosen here).
### Snakefile #4: an input function for trim_awesome
Now, that we have experimented with writing input functions in our
Python interpreter, it is time to incorporate them into our Snakefile
to turn the wildcards into file paths for the `trim_awesome` block.
To do so, we simply have to make sure that _within the Snakefile_,
before we do anything else, we define the input functions we need,
and then use those functions in the `input` section of the
`trim_awesome` block. So we must:
1. read the `samples.csv` file into the
`samples_table` variable,
2. define the functions
`fq1_from_sample` and `fq2_from_sample`.
3. Put those functions into the `trim_awesome` block in place of the input file
names.
So, update your Snakefile so that it looks like this:
```python
import pandas as pd
samples_table = pd.read_csv("samples.csv").set_index("sample", drop=False)
# fastq1 input function definition
def fq1_from_sample(wildcards):
return samples_table.loc[wildcards.sample, "fastq1"]
# fastq2 input function definition
def fq2_from_sample(wildcards):
return samples_table.loc[wildcards.sample, "fastq2"]
rule trim_awesome:
input:
fq1_from_sample, # <--- Note how the function names go here,
fq2_from_sample # <--- in place of the file names
output:
"results/awesome/{sample}_R1.fq",
"results/awesome/{sample}_R2.fq"
shell:
"TrimmoMcAwesome {input} {output}"
rule trim_stupendous:
input:
fq1="data/raw/{sample}_R1.fq",
fq2="data/raw/{sample}_R2.fq"
output:
"results/stupendous/{sample}_R1.fq",
"results/stupendous/{sample}_R2.fq"
shell:
"trim_stupendous {input.fq1} {input.fq2} {output}"
```
Then invoke that new Snakefile with:
```sh
# do this in your unix terminal in directory smk-tut
snakemake -np results/awesome/s00{1..3}_R{1,2}.fq
```
And, Behold! Look at the shell commands that get scheduled, and
see that they properly include the raw_nasty_names files. Cool.
## Input functions from Python Dictionaries
The previous section showed us how to produce input
files for trim_awesome from the wildcards using the input
functions `fq1_from_sample` and `fq2_from_sample`, and it worked!
For the `trim_stupendous` block, however, note that we are using
_named_ input files:
```python
rule trim_stupendous:
input:
fq1="data/raw/{sample}_R1.fq",
fq2="data/raw/{sample}_R2.fq"
```
and the names get used explicitly in the shell command:
```python
shell:
"trim_stupendous {input.fq1} {input.fq2} {output}"
```
(see the `{input.fq1}` and `{input.fq2}`, there). We could
redefine the `trim_stupendous` block to use our input functions like this:
```python
rule trim_stupendous:
input:
fq1=fq1_from_sample,
fq2=fq2_from_sample
output:
"results/stupendous/{sample}_R1.fq",
"results/stupendous/{sample}_R2.fq"
shell:
"trim_stupendous {input.fq1} {input.fq2} {output}"
```
There would be nothing wrong with that at all!
However, there is another way to pass inputs to Snakemake rule blocks---by using
Python _Dictionaries_. A Python Dictionary, or _dict_ for short, is very much
like a "named list" in R. The elements of a dictionary can be accessed using their
names (called "keys" in Python). Let's go back and evaluate some code in our Python
interpreter to illustrate this:
```python
# do this in your python interpreter...
# define a dictionary called boing:
boing = {"fq1": "bing", "fq2": "bong"}
# print it:
boing
# access the fq2 element from it:
boing["fq2"]
```
Cool. That just shows us how to define a dictionary in Python.
Next, let's make a function that takes `wildcards` as input and returns
a dictionary (with keys `fq1` and `fq2`) of file paths to the nasty-name
FASTQs.
```python
# do this in your python interpreter...
# define an input function that returns a dict
def fq_dict_from_sample(wildcards):
return {
"fq1": samples_table.loc[wildcards.sample, "fastq1"],
"fq2": samples_table.loc[wildcards.sample, "fastq2"]
}
# make sure our spoofed wildcards variable is set:
wildcards.sample = "s003"
# Now see what that function returns on that wildcards input:
fq_dict_from_sample(wildcards)
```
OK! It returns a dict object. So, we should be able to use
that dict to return the proper paths for `fq1` and `fq2`.
**BUT** in order for this to work properly (i.e., the way we
expect it to) we need to wrap our input function inside
a special Snakemake function called `unpack()` which turns the
dict into a collection of named inputs.
Replace the contents of your Snakefile with this:
```python
import pandas as pd
samples_table = pd.read_csv("samples.csv").set_index("sample", drop=False)
# fastq1 input function definition
def fq1_from_sample(wildcards):
return samples_table.loc[wildcards.sample, "fastq1"]
# fastq2 input function definition
def fq2_from_sample(wildcards):
return samples_table.loc[wildcards.sample, "fastq2"]
# function to return both fastqs, keyed by `fq1` and `fq2`, in a dict.
# note that the order of the dict elements is not important here. We
# illustrate that by putting key `fq2` first.
def fq_dict_from_sample(wildcards):
return {
"fq2": samples_table.loc[wildcards.sample, "fastq2"],
"fq1": samples_table.loc[wildcards.sample, "fastq1"]
}
rule trim_awesome:
input:
fq1_from_sample, # <--- Note how the function names go here,
fq2_from_sample # <--- in place of the file names
output:
"results/awesome/{sample}_R1.fq",
"results/awesome/{sample}_R2.fq"
shell:
"TrimmoMcAwesome {input} {output}"
rule trim_stupendous:
input:
unpack(fq_dict_from_sample) # <--- see, it is wrapped in unpack()
output:
"results/stupendous/{sample}_R1.fq",
"results/stupendous/{sample}_R2.fq"
shell:
"trim_stupendous {input.fq1} {input.fq2} {output}"
```
Then, give it a whirl with:
```sh
# do this in your unix terminal in directory smk-tut
snakemake -np results/stupendous/s00{1..3}_R{1,2}.fq
```
Booyah! This can be quite helpful if you have a lot of named
arguments that you wish to create as a dict from a function.
Note that if you use an input function that returns a dict,
and you forget to wrap it in the `unpack()`, then you will typically
see an error like this:
```
Function did not return str or list of str.
```
## Generating lists of files with expand()
## A Python Primer for Using Snakemake
Snakefiles are evaluated in a Python context, and any valid Python
code can be placed within them, and it will be evaluated.
Stuff to talk about:
- Lists
- Cycling over list elements and testing them
- Building up a list with `for`
- Adding lists together
- Testing if something is in a list.
- Getting the unique values in a list.
- Formatting strings.
- The Config variable, an OrderedDict
- Loading it with load_configfile
- Accessing different levels by specific names.
- Cycling over all members of a level and getting a list back
- Getting lists out of the keys() and vals()
## Using Snakemake on a Computing Cluster
One key advantage of Snakemake is that it can allow intelligent use of the resources on a
high-performance computing cluster. This is done by supplying customization to Snakemake via its
`--cluster` command line option. Here we focus on methods for running your Snakemake workflow
on an HPCC that uses the SLURM scheduler (See Chapter \@ref(chap-HPCC)). Although getting Snakemake to schedule jobs
using SLURM's `sbatch` command can be as easy as invoking Snakemake with a command that includes an
option like `--cluster sbatch`, making such a simple addition to the command line will not bring
all the Snakemake goodness to SLURM that you might hope. Rather than just a simple `sbatch` argument
to the `--cluster` option, the Snakemake documentation notes that you can provide more job-specific information
to the `--cluster` option. For example, suppose you have rules that use different numbers of threads, specified
in the rule blocks like this:
```yaml
rule easy:
input:
"small_input_{sample}.txt"
threads: 1
output:
"small_output_{sample}.txt"
rule hard:
input:
"big_input_{sample}.txt"
threads: 8
output:
"big_output_{sample}.txt"
```
Here the rule `easy` uses just a single thread, while the rule `hard` is multithreaded, and we
want to run it with 8 threads. If we were running this locally on a single workstation, Snakemake
would automatically supply 8 cores (if available) to instances of rule `hard` that were running.
However, when running on a cluster, the `--cluster` command argument would need to be
expanded to instruct `sbatch` to request more cores for rule `hard` than for rule `easy`.
A Snakemake command that started like this would work:
```sh