---
output:
pdf_document: default
html_document: default
---
# Working on remote servers
## Accessing remote computers
The primary protocol for accessing remote computers in this day and age
is `ssh`, which stands for "Secure Shell." In the protocol, your computer
and the remote computer talk to one another and then settle upon a "shared
secret" which they can use as a key to encrypt data traffic from one to the other.
The amazing thing is that the two computers can actually tell each other what
that shared secret is by having a conversation "in the open" with one another.
That is a topic for another day, but if you are interested, you could
read about it [here](https://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exchange).
At any rate, the SSH protocol allows for secure access to a remote server. It involves
using a username and a password, and, in many cases today, some form of two-factor
authentication (i.e., you need to have your phone involved, too!). Different
remote servers have different routines for logging in to them, and they are also
all configured a little differently. The main servers we are concerned about in
these teaching materials are:
1. The Hummingbird cluster at UCSC, which is accessible by anyone with a UCSC blue username/password.
1. The Alpine Supercomputer at CU Boulder, which is accessible by all graduate students and
faculty at CSU.
1. The Sedna cluster housed at the National Marine Fisheries Service, Northwest Fisheries Science
Center. This is accessible only by those Federal NMFS employees who have been granted access.
Happily, all of these systems use SLURM for job scheduling (much more about that in the
next chapter); however, there are a few vagaries to each of these systems that we will cover below.
### Windows
If you are on a Windows machine, you can use the `ssh` utility from your Git Bash shell, but
that is a bit of a hassle from RStudio. And a better terminal emulator is available if you
are going to be accessing remote computers. It is recommended that you install and
use the program [PuTTY](https://www.ssh.com/ssh/putty). The steps are pretty self-explanatory
and well documented. Instead of using `ssh` on a command line you put a host name into
a dialog box, etc.
WHOA! I'm not a Windows person, but I just saw Matthew Hopken working on Windows, using [MobaXterm](https://mobaxterm.mobatek.net/) to connect to the server, and it looks
pretty nice.
### Hummingbird
Directions for UCSC students and staff to login to Hummingbird are available
at [https://www.hb.ucsc.edu/getting-started/](https://www.hb.ucsc.edu/getting-started/).
If you are not on the UCSC campus network, you need to use the UCSC VPN to connect.
By default, this cluster uses `tcsh` for a shell rather than `bash`. To keep things
consistent with what you have learned about `bash`, you will want to automatically switch
to `bash` upon login. You can do this by adding a file `~/.tcshrc` whose contents are:
```sh
setenv SHELL /usr/bin/bash
exec /usr/bin/bash --login
```
Then, configure your `bash` environment with your `~/.bashrc` and `~/.bash_profile` as
described in Chapter \@ref(unix-env).
The `tmux` settings (see Section \@ref(tmux)) on Hummingbird are a little messed up as well, making
it hard to set window names that don't get changed the moment you issue another command. Therefore,
you must make a file called `~/.tmux.conf` and put this line in it:
```
set-option -g allow-rename off
```
### Alpine
To get an account on the CU Boulder computing resources (which includes Alpine), see [https://www.acns.colostate.edu/hpc/summit-get-started/](https://www.acns.colostate.edu/hpc/summit-get-started/). Account creation is automatic for graduate students and faculty. This setup requires
that you get an app called Duo on your phone for doing two-factor authentication.
Instructions for logging into Summit are at [https://www.acns.colostate.edu/hpc/#remote-login](https://www.acns.colostate.edu/hpc/#remote-login).
On your local machine (i.e., laptop), you might consider adding an alias to your
`.bashrc` that will let you type `summit` to issue the login command. For example:
```sh
alias summit='ssh csu_eid@colostate.edu@login.rc.colorado.edu'
```
where you replace `csu_eid` with your actual CSU eID.
### Sedna
To connect to this cluster you must be on the NMFS network, or
connected to it via the VPN, then `ssh` with, for example:
```sh
ssh your_username@<sedna-address>
```
where you replace `your_username` with your own user name, and `<sedna-address>` with the cluster's address (which is not reproduced here).
## Transferring files to remote computers
### `sftp` and several systems that use it
Most Unix systems have a command called `scp`, which works like `cp`, but which is
designed for copying files to and from
remote servers using the SSH protocol for security. This works really well
if you have set up a public/private key pair to allow SSH access to your server
without constantly having to type in your password. Use of public/private key pairs is, unfortunately, not
an option (as far as I can tell) on new NSF-funded clusters that use two-factor authentication (like SUMMIT
at CU Boulder). Trying to use `scp` in such a context becomes an endless cycle of
entering your password and checking your phone for a DUO push. Fortunately, there are
alternatives.
#### Vanilla sftp
The administrators of the SUMMIT supercomputer at CU Boulder recommend
the `sftp` utility for transferring files from your laptop to the server.
This works reasonably well. The syntax for a CSU student or affiliate connecting to the server is
```sh
# replace csu_eid with your own CSU eID
sftp csu_eid@colostate.edu@login.rc.colorado.edu
```
After doing this you have to give your eID password followed by `,push`, and then
approve the DUO push request on your phone. Once that is done, you have a "line open"
to the server and can use the commands of `sftp` to transfer files around.
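Once connected, a minimal session might look like the following sketch (all the directory and file names here are hypothetical):
```sh
sftp> lcd /Users/me/projects     # change the working directory on your laptop
sftp> cd scratch/data            # change the working directory on the server
sftp> put big_file.vcf.gz        # copy a file from your laptop to the server
sftp> get results.txt            # copy a file from the server to your laptop
sftp> bye                        # close the connection
```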
However, the vanilla version of `sftp` (at least on a Mac) is unbelievably limited,
because there is simply no good support for TAB completion within
the utility for navigating directories on the server or upon your laptop.
It must have been developed by troglodytes...consequently, I won't describe
vanilla `sftp` further.
#### Windows alternatives
If you are on Windows, it looks like the makers of PuTTY also bring you
[PSFTP](https://www.ssh.com/ssh/putty/putty-manuals/0.68/Chapter6.html#psftp) which
might be useful for you for file transfer. Even better, MobaXterm has native GUI file transfer
capabilities. Go for it!
#### A GUI solution for Mac or Windows
When you are first getting started transferring files to a server, it might be easiest
to use a graphical user interface. There is a decently-supported (and freely available)
application called FileZilla that does this. You can download the FileZilla client
application appropriate for your operating system (note! you download and install this _on your
own laptop_, not the server) from [https://filezilla-project.org/download.php?type=client](https://filezilla-project.org/download.php?type=client).
Once you install it, there are a few configurations to be done. First, go to `Edit->Settings` and activate
saving of passwords protected by a master password. This master password should be something that
you will remember easily. It does not have to be, and, really, should not be, the same as your Summit password.
```{r filezilla-passwd, echo=FALSE, fig.align='center', dpi=100, fig.cap="Setting FileZilla's master password"}
knitr::include_graphics("figs/filezilla-password.png", auto_pdf = TRUE)
```
Second, from `Edit->Settings`, request a longer connection timeout:
```{r filezilla-timeout, echo=FALSE, fig.align='center', dpi=100, fig.cap="Setting FileZilla's connection timeout"}
knitr::include_graphics("figs/filezilla-connection-timeout.png", auto_pdf = TRUE)
```
And finally, go to `File->Site Manager` and set up a connection to your remote machine.
For SUMMIT, do like this:
```{r filezilla-site, echo=FALSE, fig.align='center', dpi=100, fig.cap="Setting up a connection in FileZilla's Site Manager"}
knitr::include_graphics("figs/filezilla-site.png", auto_pdf = TRUE)
```
After you hit OK and have established this site, you can do `File->Site Manager`, then choose
your Summit connection in the left pane and hit "connect" to connect to Summit. You may have to
type in the "Master Password" that you gave to FileZilla.
After connecting, you have two file-browser panes. The one on your left is typically your
local computer, and the one on the right is the server (remote computer). You can change the
local or remote directory by clicking in either the left or right pane, and transfer
files and folders by dragging and dropping. The setup looks like this:
```{r filezilla-connected, echo=FALSE, fig.align='center', dpi=100, fig.cap="FileZilla connected to a remote server"}
knitr::include_graphics("figs/filezilla-connected.png", auto_pdf = TRUE)
```
#### lftp
If you are on a Mac, you can install `lftp` (`brew install lftp`: note that I need to write
a section about installing command line utilities via homebrew somewhere in this handbook).
`lftp` provides the sort of TAB completion of paths that you, by now, will have come to
know and love and expect.
Before you connect to your server with `lftp` there are a few customizations that you will
want to do in order to get nicely colored output, and to avoid having to login repeatedly
during your `lftp` session. You must make a file on your laptop called `~/.lftprc` and put
the following lines in it:
```sh
set color:dir-colors "rs=0:di=01;36:fi=01;32:ln=01;31:*.txt=01;35:*.html=00;35:"
set color:use-color true
set net:idle 5h
set net:timeout 5h
```
Now, to connect to SUMMIT with `lftp`, you use this syntax (with `csu_eid` replaced by your own eID):
```sh
lftp sftp://csu_eid@colostate.edu@login.rc.colorado.edu
```
That can be a lot to type, so I would recommend putting something like this in your
`.bashrc`:
```sh
alias summit_ftp='lftp sftp://csu_eid@colostate.edu@login.rc.colorado.edu'
```
so you can just type `summit_ftp` (which will TAB complete...) to launch that command.
After you issue that command, you put in your password (on SUMMIT, followed by `,push`). `lftp` then caches your
password, and will re-issue it, if necessary, to execute commands. It doesn't actually send your
password until you try a command like `cls`. On the SUMMIT system, with the default `lftp` settings,
after 3 minutes of idle time, when you issue an `sftp` command on the server, you will have to approve
access with the DUO app on your phone again. However, the last two lines in the
`~/.lftprc` file listed above ensure that your connection to SUMMIT will stay active even
through 5 hours of idle time, so you don't have to keep clicking DUO pushes on your phone.
After 5 hours, if you try issuing a command to the server in `lftp`, it will use your cached
password to reconnect to the server. On SUMMIT, this means that you only need to deal with
approving a DUO push again---not re-entering your password. If you are working on SUMMIT daily,
it makes sense to just keep one Terminal window open, running `lftp`, all the time.
Once you have started your `lftp/sftp` session this way, there are some important things to keep in mind.
The most important of which is that the `lftp` session you are in maintains a _current working directory_
on both the server and on your laptop. We will call these the _server working directory_ and
the _laptop working directory_, respectively. (Technically, we ought to call the laptop working directory the _client working directory_,
but I find that is confusing for people, so we will stick with _laptop_.)
There are two different commands to see what each
current working directory is:
- `pwd` : print the _server working directory_
- `lpwd` : print the _laptop working directory_ (the preceding `l` stands
for _local_).
If you want to change either the server or the laptop current working directory you use:
- `cd` _path_ : change the server working directory to _path_
- `lcd` _path_ : change the laptop working directory to _path_.
Following `lcd`, TAB-completion is done for paths _on the laptop_, while following
`cd`, TAB-completion is done for paths _on the server_.
If you want to list the contents of directories _on the server_ you use:
- `cls` : list things in the server working directory, or
- `cls` _path_ : list things in _path_ on the server.
Note that `cls` is a little different than the `ls` command that comes
with `sftp`. The latter command always prints in long format and does not play
nicely with colorized output. By contrast, `cls` is part of `lftp` and it
behaves mostly like your typical Unix `ls` command, taking options like `-a`, `-l` and `-d`, and
it will even do `cls -lrt`. Type `help cls` at the `lftp` prompt for more information.
If you want to list the contents of the different directories on your laptop, you
use `ls` _but you preface it with a_ `!`, which means "execute the following on my
laptop, not the server." So, we have:
- `!ls` : list the contents of the laptop working directory.
- `!ls` _path_ : list the contents of the laptop path _path_.
When you use the `!` at the beginning of the line, then all the TAB completion occurs
in the context of the laptop current working directory. Note that with the `!`
you can do all sorts of typical shell commands on your laptop from within the `lftp`
session. For example `!mkdir this_on_my_laptop` or `!cat that_file`, etc.
If you wish to make a directory on the _server_, just use `mkdir`. If you wish to
remove a file from the server, just use `rm`. The latter works much like it does in
bash, but does not seem to support globbing (use `mrm` for that!). In fact, you can
do a lot of things (like `cat` and `less`) on the server
_as if you had a bash shell running on it_ through an
SSH connection. Just type those commands at the `lftp` prompt.
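Putting all of that together, a short `lftp` session might look like this sketch (the paths are made up for illustration):
```sh
lpwd                       # print the laptop working directory
lcd ~/projects/chinook     # change the laptop working directory
pwd                        # print the server working directory
cd scratch/chinook         # change the server working directory
cls -lrt                   # long listing of the server working directory, by time
!ls                        # list the contents of the laptop working directory
mkdir new_results          # make a directory on the server
```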
#### Transferring files using `lftp`
To this point, we haven't even talked about our original goal with `lftp`, which
was to _transfer files from our laptop to the server_ or from _the server to our laptop_.
The main `lftp` commands for those tasks are: `get`, `put`, `mget`, `mput`, and `mirror`---it is
not too much to have to remember.
As the name suggests, `put` is for _putting_ files from your laptop onto the server. By default it
puts files into the server working directory. Here is an example:
```sh
put laptopFile_1 laptopFile_2
```
If you want to put the file into a different directory on the server (that must already exist)
you can use the `-O` option:
```sh
put -O server_dest_dir laptopFile_1 laptopFile_2
```
The command `get` works in much the same way, but in reverse: you are _getting_ things
_from the server to your laptop_. For example:
```sh
# copy to laptop working directory
get serverFile_1 serverFile_2
# copy to existing directory laptop_dest_dir
get -O laptop_dest_dir serverFile_1 serverFile_2
```
Neither of the commands `get` or `put` does any of the pathname expansion (or "globbing," as
we have called it) that you will be familiar with from the `bash` shell. To effect that sort
of functionality you must use `mput` and `mget`, which, as the `m` prefix in the
command names suggests, are the "multi-file" versions of `put` and `get`. Both of these
commands also take the -O option, if desired, so that the above commands could be
rewritten like this:
```sh
mput -O server_dest_dir laptopFile_[12]
# and
mget -O laptop_dest_dir serverFile_[12]
```
Finally, there is not a _recursive_ option, like there is with `cp`, to any of `get`, `put`, `mget`,
or `mput`. Thus, you cannot use any of those four to put/get entire directories on/from the
server. For that purpose, `lftp` has reserved the `mirror` command. It does what it sounds like:
it mirrors a directory from the server to the laptop. The `mirror` command can actually
be used in a lot of different configurations (between two remote servers, for example) and
with different settings (for example to change only pre-existing files older than
a certain date).
However, here, we will demonstrate only its common use case
of copying directories between a server and a laptop.
To copy a directory `dir`, and its contents, from your server to your
laptop current directory you use:
```sh
mirror dir
```
To copy a directory `ldir` from your laptop to your server current directory you
use `-R` which transmits the directory in the reverse direction:
```sh
mirror -R ldir
```
Learning to use `lftp` will require a little bit more of your time, but it is worth
it, allowing you to keep a dedicated terminal window open for file transfers with sensible
TAB-completion capability.
### git
Most remote servers you work on will have `git` by default.
If you are doing all your work on a project within a single
repository, you can use `git` to keep scripts and other files
version-controlled on the server. You can also push and pull files
(not big data or output files!) to GitHub, thus keeping things backed up
and version controlled, and providing a useful way to synchronize scripts
and other files in your project between the server and your laptop.
Example:
1. write and test scripts on your laptop in a repo called `my-project`
1. commit scripts on your laptop and push them to GitHub in a repo also
called `my-project`
1. pull `my-project` from GitHub to the server.
1. Try running your scripts in `my-project` on your server. In the process,
you may discover that you need to change/fix some things so they will
run correctly on the server. Fix them!
1. Once things are fixed and successfully running on the server, commit
those changes and push them to GitHub.
1. Update the files on your laptop so that they reflect the changes you
had to make on the server, by pulling `my-project` from GitHub to your
laptop.
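In terms of actual commands, that round trip might look something like the following sketch (the repository, user, and file names are just examples):
```sh
# on your laptop: commit and push the scripts
git add my_script.sh
git commit -m "add alignment script"
git push origin master

# on the server: clone the repo the first time (pull thereafter)
git clone git@github.com:your_username/my-project.git

# on the server: after fixing things so they run there
git add my_script.sh
git commit -m "fix paths so the script runs on the server"
git push origin master

# back on your laptop: retrieve the server-side fixes
git pull origin master
```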
#### Configuring git on the remote server
In order to make this sort of workflow successful, you first need
to ensure that you have set up git on your remote server. Doing
so involves:
1. establishing your name and email that will be used with your git commits
made from the server.
1. Ensuring that git password caching is set up so you don't always have
to type your GitHub password when you push and pull.
1. configuring your git text editor to be something that you know how
to use.
It can be useful to give yourself a git name on the server that reflects
the fact that the changes you are committing were made on the server.
For example, for my own setup on the Summit cluster at Boulder, I might
do my git configurations by issuing these commands on the
command line on the server:
```sh
git config --global user.name "Eric C. Anderson (From Summit)"
git config --global user.email [email protected]
git config --global core.editor nano
```
In all actuality, I tend to set my editor to be `vim` or `emacs`, because those are
more powerful editors and I am familiar with them; however, if you are new to Unix,
then `nano` is an easy-to-use editor, and one is less likely to get "stuck" inside of it, as can happen in `vim`.
You should set configurations on your server appropriate to yourself
(i.e., with your name and email and preferred text editor). Once these configurations are set, you are ready to start cloning
repositories from GitHub and then pushing and pulling them, as well.
To this point, we have always done those actions from within
RStudio. On a remote server, however, you will have to do all these
actions from the command line. That is OK, it just requires learning
a few new things.
The first, and most important, issue to understand is that if you want
to push new changes back to a repository that is on your GitHub account,
GitHub needs to know that you have privileges to do so. Back in the days
when you could make authenticated https connections to GitHub, there were some
tricks to this. But, since all your connections to GitHub must now be done with
SSH, it has actually gotten a lot easier (but it involves setting up SSH keys,
as described in the next section).
#### Using git on the remote server
When on the server, you don't have the convenient RStudio interface
to git, so you have to use git commands on the command line. Fortunately
these provide straightforward, command-line analogies to the RStudio
GUI git interface you have become familiar with.
Instead of having an RStudio Git panel that shows you files that are new or
have been modified, etc., you use `git status` in your repo to give
a text report of the same.
For example, imagine that Figure \@ref(fig:git-window) shows an RStudio project Git window describing the status of files in the repository.
```{r git-window, echo=FALSE, fig.align='center', dpi=100, fig.cap="An example of what an RStudio git window might look like."}
knitr::include_graphics("figs/git-window.png", auto_pdf = TRUE)
```
That window is merely showing you a graphical view of the output of
the `git status` command run at the top level of the repository, which
looks like this:
```sh
% git status
On branch master
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: .gitignore
modified: 002-homeologue-permutation-with-bedr.Rmd
Untracked files:
(use "git add <file>..." to include in what will be committed)
002-homeologue-permutation-with-bedr.nb.html
data/
mykiss-rad-project-with-mac-and-devon.Rproj
reconcile/
no changes added to commit (use "git add" and/or "git commit -a")
```
Aha! Be sure to read that output and understand that it tells you which
files are tracked by git and modified (blue M in RStudio) and which
are untracked (yellow ? in RStudio).
If you wanted to see a report of the changes in the files relative
to the currently committed version, you could use `git diff`, passing
it the file name as an argument. We will see an example of that below...
Now, recall, that in order to commit files to `git` you first must
_stage_ them. In RStudio you do that by clicking the little button to
the left of the file or directory in the Git window. For example,
if we clicked the buttons for the `data/` directory, as well as for
`.gitignore` and `002-homeologue-permutation-with-bedr.Rmd`, we would
have staged them and it would look like Figure \@ref(fig:git-staged).
```{r git-staged, echo=FALSE, fig.align='center', dpi=100, fig.cap="The RStudio git window after staging some of the files."}
knitr::include_graphics("figs/git-staged.png", auto_pdf = TRUE)
```
In order to do the equivalent operations with `git` on the command line
you would use the `git add` command, explicitly naming the files you wish to
_stage_ for committing:
```sh
git add .gitignore 002-homeologue-permutation-with-bedr.Rmd data
```
Now, if you check `git status` you will see:
```sh
% git status
On branch master
Changes to be committed:
(use "git reset HEAD <file>..." to unstage)
modified: .gitignore
modified: 002-homeologue-permutation-with-bedr.Rmd
new file: data/Pearse_Barson_etal_Supp_Table_7.tsv
new file: data/high-fst-rad-locus-indices.txt
Untracked files:
(use "git add <file>..." to include in what will be committed)
002-homeologue-permutation-with-bedr.nb.html
mykiss-rad-project-with-mac-and-devon.Rproj
reconcile/
```
It tells you which files are ready to be committed!
In order to commit the files to git you do:
```sh
git commit
```
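If you would rather not be dropped into a text editor to write your commit message, you can supply the message inline with the `-m` option:
```sh
git commit -m "a brief description of the changes"
```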
And then, to push them back to GitHub (if you cloned this repository
from GitHub), you can simply do:
```sh
git push origin master
```
That syntax is telling git to push the `master` branch (which is
the default branch in a git repository), to the repository labeled as
`origin`, which will be the GitHub repository if you cloned the repository
from GitHub. (If you are working with a different git branch than master,
you would need to specify its name here. That is not difficult, but is
beyond the scope of this chapter.)
Now, assuming that we cloned the `alignment-play` repository to our
server, here are the steps involved in editing a file, committing the
changes, and then pushing them back to GitHub. The command prompt in the following
is written as `[alignment-play]--% `, which tells us that we are in the
`alignment-play` repository.
```sh
# check git status
[alignment-play]--% git status
# On branch master
nothing to commit, working directory clean
# Aha! That says nothing has been modified.
# But, now we edit the file alignment-play.Rmd
[alignment-play]--% nano alignment-play.Rmd
# In this case I merely added a line to the YAML header.
# Now, check status of the files:
[alignment-play]--% git status
# On branch master
# Changes not staged for commit:
# (use "git add <file>..." to update what will be committed)
# (use "git checkout -- <file>..." to discard changes in working directory)
#
# modified: alignment-play.Rmd
#
no changes added to commit (use "git add" and/or "git commit -a")
# We see that the file has been modified.
# Now we can use git diff to see what the changes were
[alignment-play]--% git diff alignment-play.Rmd
diff --git a/alignment-play.Rmd b/alignment-play.Rmd
index 9f75ebb..b389fae 100644
--- a/alignment-play.Rmd
+++ b/alignment-play.Rmd
@@ -3,6 +3,7 @@ title: "Alignment Play!"
output:
html_notebook:
toc: true
+ toc_float: true
---
# The output above is a little hard to parse, but it shows
# the line that has been added: " toc_float: true" with a
# "+" sign.
# In order to commit the changes, we do:
[alignment-play]--% git add alignment-play.Rmd
[alignment-play]--% git commit
# after that, we are bumped into the nano text editor
# to write a short message about the commit. After exiting
# from the editor, it tells us:
[master 001e650] yaml change
1 file changed, 1 insertion(+)
# Now, to send that new commit to GitHub, we use git push origin master
[alignment-play]--% git push origin master
Password for 'https://eriqande@github.com':
Counting objects: 5, done.
Delta compression using up to 24 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 325 bytes | 0 bytes/s, done.
Total 3 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://eriqande@github.com/eriqande/alignment-play
0c1707f..001e650 master -> master
```
In order to push to a GitHub repository from your remote server you will
need to establish a public/private SSH key pair, and share the public key
in the settings of your GitHub account. The process for this is similar to
what you have already done for accessing GitHub via git with your laptop:
follow the directions for Linux systems at:
[https://happygitwithr.com/ssh-keys.html](https://happygitwithr.com/ssh-keys.html).
In order to copy your public key to GitHub, it will be easiest to
`cat ~/.ssh/id_ed25519.pub` to stdout and then copy it from your terminal to
GitHub.
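In brief, the process on the server looks something like this sketch (following those directions; the comment string after `-C` is just a label of your choosing):
```sh
# create an ed25519 key pair on the server (accept the default file location)
ssh-keygen -t ed25519 -C "my-key-on-summit"
# print the public key so you can paste it into GitHub under
# Settings -> SSH and GPG keys
cat ~/.ssh/id_ed25519.pub
```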
Finally, if after pushing those changes to GitHub, we then pull them
down to our laptop, and make more changes on top of them and push those
back to GitHub, we can retrieve from GitHub to the server those changes we
made on our laptop with `git pull origin master`. In other words, from the
server we simply issue the command:
```sh
[alignment-play]--% git pull origin master
```
### Globus
Globus is a file transfer system for high performance computing that was
developed long ago by a group at the University of Chicago. If you work at
an institution that has a subscription to the Globus system (as is the case with
Colorado State University!), then it is quite easy to use it.
In the Globus model, files get transferred between different "endpoints," which are
typically file servers on large university computing systems. You, as a user, are entitled
to initiate transfers between the endpoints to which you have access rights. You can
initiate these transfers using a web interface through your web browser. This makes
it incredibly convenient, especially if you want to transfer large files between
different computing clusters that are endpoints on the Globus network.
Additionally, Globus provides a small software application that can turn your own
laptop or your desktop workstation into a Globus endpoint, allowing you to initiate
data transfers between your laptop/desktop and the cluster. Globus is a well-tested
and robust system, so, since it is offered for Colorado State University students and
faculty, it is well worth using.
The steps to using it are:
1. Sign in to Globus as a Colorado State Affiliate by going to
[https://www.globus.org/app/login](https://www.globus.org/app/login), and finding
Colorado State University in the dropdown menu, and hitting continue.
```{r globus1, echo=FALSE, fig.align='center', dpi=120}
knitr::include_graphics("figs/globus1.png", auto_pdf = TRUE)
```
When you do that the first time, you might need to agree to using CILogon. Do so.
2. You are then taken to a page to authenticate with CSU---it is the familiar eID login.
Login to it. For me it looks like this:
```{r globus2, echo=FALSE, fig.align='center', dpi=120}
knitr::include_graphics("figs/globus2.png", auto_pdf = TRUE)
```
3. After authenticating, you might be taken to a page that looks like this:
```{r globus3, echo=FALSE, fig.align='center', dpi=120}
knitr::include_graphics("figs/globus3.png", auto_pdf = TRUE)
```
To be honest, I don't know what this is about. I think it is Globus pitching its paid
options. Whatever....You don't need it.
4. Instead, proceed directly to [https://app.globus.org/file-manager](https://app.globus.org/file-manager) which looks like
this:
```{r globus4, echo=FALSE, fig.align='center', dpi=120}
knitr::include_graphics("figs/globus4.png", auto_pdf = TRUE)
```
Search for `CU Boulder Research Computing` in the right hand box. When you find it and
select it, you should see your home directory on SUMMIT in it, like this:
```{r globus5, echo=FALSE, fig.align='center', dpi=120}
knitr::include_graphics("figs/globus5.png", auto_pdf = TRUE)
```
5. For the next step, you want to create an endpoint on your own laptop. Choose
"Endpoints" in the left menu (see the red arrow in the picture above). When you do that, you
can find the "Create a personal endpoint" link:
```{r globus6, echo=FALSE, fig.align='center', dpi=120}
knitr::include_graphics("figs/globus6.png", auto_pdf = TRUE)
```
After clicking that, click the link to download "Globus Connect Personal" for your
operating system.
6. After downloading it, install "Globus Connect Personal".
7. After installing it, open "Globus Connect Personal". If you haven't used it before,
it should ask you to log in:
```{r globus7, echo=FALSE, fig.align='center', dpi=120}
knitr::include_graphics("figs/globus7.png", auto_pdf = TRUE)
```
8. After clicking log-in, enter a name by which you would like to call your endpoint, and then
choose "Allow":
```{r globus8, echo=FALSE, fig.align='center', dpi=120}
knitr::include_graphics("figs/globus8.png", auto_pdf = TRUE)
```
9. Only one more screen to go. Fill in some more names that are appropriate to your
laptop/desktop and choose "Save". (Don't put in the names I have used...) You probably
do not want to choose the High Assurance option, as that requires an extra round of
work for the sys admins...
```{r globus9, echo=FALSE, fig.align='center', dpi=120}
knitr::include_graphics("figs/globus9.png", auto_pdf = TRUE)
```
10. Yay! You are done. Now, on a Mac, you can find the Globus icon in the menu bar
and use that to start a web transfer session:
```{r globus10, echo=FALSE, fig.align='center', dpi=120}
knitr::include_graphics("figs/globus10.png", auto_pdf = TRUE)
```
11. And when you get that web page, your laptop will be the left endpoint and you can
search for "CU Boulder Research Computing" in the right endpoint box.
```{r globus11, echo=FALSE, fig.align='center', dpi=120}
knitr::include_graphics("figs/globus11.png", auto_pdf = TRUE)
```
Now copying things from one endpoint to another is as easy as highlighting files from your
desired source endpoint (left or right) and then hitting the "Start" button for that
source endpoint.
### Interfacing with "The Cloud"
Increasingly, data scientists and tech companies alike are keeping their
data "in the cloud." This means that they
pay a large tech firm like Amazon, Dropbox, or Google to store their data for them in
a place that can be accessed via the internet. There are many advantages to this
model. For one thing, the company that serves the data often will create multiple copies
of the data for backup and redundancy: a fire in a single data center is not a calamity
because the data are also stored elsewhere, and can often be accessed seamlessly from those
other locations with no apparent disruption of service. For another, companies that are
in the business of storing and serving
data to multiple clients have data centers that are well-networked, so that getting
data onto and off of their storage systems can be done very quickly over the internet
by an end-user with a good internet connection.
Five years ago, the idea of storing next generation sequencing data in the cloud might have
sounded a little
crazy---it always seemed a laborious task getting the data off of the remote server at the
sequencing center, so why not just keep the data in-house once you have it?
To be sure, keeping a copy of your
data in-house still can make sense for long-term data archiving needs, but, today, cloud
storage for your sequencing data can make a lot of sense. A few reasons are:
1. Transferring your data from the cloud to the remote HPC system
that you use to process the data can be very fast.
2. As above, your data can be redundantly backed up.
3. If your institution (university, agency, etc.) has an agreement with a cloud storage
service that provides you with unlimited storage and free network access, then storing
your sequencing data in the cloud will cost considerably less than buying a dedicated
large system of hard drives for data backup. (One must wonder if service
agreements might not be at risk of renegotiation if many researchers start using their
unlimited institutional cloud storage space to store and/or archive their
next generation sequencing data sets. My own agency's contract with Google runs
through 2021...but I have to think that these services are making plenty of money, even
if a handful of researchers store big sequence data in the cloud. Nonetheless, you
should be careful not to put multiple copies of data sets, or intermediate files that
are easily regenerated, up in the cloud.)
4. If you are a PI with many lab members wishing to access the same data set, or even if
you are just a regular Joe/Joanna researcher but you wish to share your data, it is
possible to effect that using your cloud service's sharing settings. We will discuss
how to do this with Google Drive.
There are clearly advantages to using the cloud, but one small hurdle remains. Most
of the time, working in an HPC environment, we are using Unix, which provides a consistent
set of tools for interfacing with other computers using SSH-based protocols (like `scp`
for copying files from one remote computer to another). Unfortunately, many common
cloud storage services do not offer an SSH based interface. Rather, they typically process
requests from clients using an HTTPS protocol. This protocol, which effectively runs the
world-wide web, is a natural choice for cloud services that most people will access
using a web browser; however, Unix does not traditionally come with a utility or command
to easily process the types of HTTPS transactions needed to network with
cloud storage. Furthermore, there must be some security when it comes to accessing
your cloud-based storage---you don't want everyone to be able to access your files, so
your cloud service needs to have some way of authenticating people
(you and your labmates for example) that are authorized to access your data.
These problems have been overcome by a utility called `rclone`, the product of a
comprehensive open-source software project that brings the functionality of the
`rsync` utility (a common Unix tool used to synchronize and mirror file systems)
to cloud-based storage. (Note: `rclone` has nothing to do with the R programming
language, despite its name that looks like an R package.)
Currently `rclone` provides a consistent interface for accessing
files from over 35 different cloud storage providers, including Box, Dropbox, Google Drive,
and Microsoft OneDrive. Binaries for `rclone` can be downloaded for your desktop
machine from [https://rclone.org/downloads/](https://rclone.org/downloads/). We will
talk about how to install it on your HPC system later.
Once `rclone` is installed and in your `PATH`, you invoke it in your terminal
with the command `rclone`. Before we get into the details of the various `rclone` subcommands,
it will be helpful to take a glance at the information `rclone` records when it
configures itself to talk to your cloud service. To do so, it creates a file called `~/.config/rclone/rclone.conf`, where it stores information about all the different
connections to cloud services you have set up. For example, that
file on my system looks like this:
```
[gdrive-rclone]
type = drive
scope = drive
root_folder_id = 1I2EDV465N5732Tx1FFAiLWOqZRJcAzUd
token = {"access_token":"bs43.94cUFOe6SjjkofZ","token_type":"Bearer","refresh_token":"1/MrtfsRoXhgc","expiry":"2019-04-29T22:51:58.148286-06:00"}
client_id = 2934793-oldk97lhld88dlkh301hd.apps.googleusercontent.com
client_secret = MMq3jdsjdjgKTGH4rNV_y-NbbG
```
In this configuration:
* `gdrive-rclone` is the name by which rclone refers to this cloud storage location
* `root_folder_id` is the ID of the Google Drive folder that can be thought of as the root directory of `gdrive-rclone`. This ID is not the simple name of that directory on
your Google Drive, rather it is the unique name given by Google Drive to that directory.
You can see it by navigating in your browser to the directory you want and finding it
after the last slash in the URL. For example, in the above case, the URL is:
`https://drive.google.com/drive/u/1/folders/1I2EDV465N5732Tx1FFAiLWOqZRJcAzUd`
* `client_id` and `client_secret` are like a username and a shared secret that `rclone` uses
to authenticate the user to Google Drive as who they say they are.
* `token` holds the credentials used by `rclone` to make requests of Google Drive on behalf
of the user.
Note: the above does not include my
real credentials, as then anyone could use them to access my Google Drive!
To set up your own configuration file to use Google Drive, you will use the `rclone config`
command, but before you do that, you will want to wrangle a client_id from Google. Follow
the directions at [https://rclone.org/drive/#making-your-own-client-id](https://rclone.org/drive/#making-your-own-client-id). Things are a little different from their step-by-step
directions, but you can muddle through to get to a screen with a client ID and a client
secret that you can copy onto your clipboard.
Once you have done that, then run `rclone config` and follow the prompts. A
typical session of `rclone config` for Google Drive access is given
[here](https://rclone.org/drive/). Don't choose to do the advanced setup; however
do use "auto config," which will bounce up a web page and let you authenticate rclone
to your Google account.
It is worthwhile first setting up a config file on your laptop, and making sure
that it is working. After that, you can copy that config file to other remote
servers you work on and immediately have the same functionality.
#### Encrypting your config file
While it is a powerful thing to be able to copy a config file from
one computer to the next and immediately be able to access your Google
Drive account, that might (and should) also make you a little bit
uneasy. It means that if the config file falls into the wrong hands,
whoever has it can gain access to everything on your Google Drive. Clearly
this is not good. Consequently, once you have created your rclone config
file, and well before you transfer it to another computer, you must
encrypt it. This makes sense, and fortunately it is fairly easy: you can
use `rclone config` and see that encryption is one of
the options. When it is encrypted, use `rclone config show` to see what
it looks like in clear text.
The downside of using encryption is that you have to enter your password
every time you make an rclone command, but it is worth it to have the
security.
Here is what it looks like when choosing to encrypt one's config file:
```sh
% rclone config
Current remotes:
Name Type
==== ====
gdrive-rclone drive
e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> s
Your configuration is not encrypted.
If you add a password, you will protect your login information to cloud services.
a) Add Password
q) Quit to main menu
a/q> a
Enter NEW configuration password:
password:
Confirm NEW configuration password:
password:
Password set
Your configuration is encrypted.
c) Change Password
u) Unencrypt configuration
q) Quit to main menu
c/u/q> q
Current remotes:
Name Type
==== ====
gdrive-rclone drive
e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q
```
Once that file is encrypted, you can copy it to other machines for use.
#### Basic Maneuvers
The syntax for use is:
```sh
rclone [options] subcommand parameter1 [parameter 2...]
```
The "subcommand" part tells `rclone` what you want to do, like `copy` or `sync`, and
the "parameter" part of the above syntax is typically a path
specification to a directory or a file. In using rclone to access the
cloud there is not a root directory, like `/` in Unix. Instead, each remote
cloud access point is treated as the root directory, and you refer to it
by the name of the configuration followed by a colon. In our example,
`gdrive-rclone:` is the root, and we don't need to add a `/` after it to
start a path with it. Thus `gdrive-rclone:this_dir/that_dir` is a
valid path for `rclone` to a location on my Google Drive.
Very often when moving, copying, or syncing files, the parameters
consist of:
```sh
source-directory destination-directory
```
One very important point is that, unlike the Unix commands `cp` and `mv`, rclone
likes to operate on directories, not on multiple named files.
A few key subcommands:
- `ls`, `lsd`, and `lsl` are like `ls`, `ls -d` and `ls -l`
```sh
rclone lsd gdrive-rclone:
rclone lsd gdrive-rclone:NOFU
```
- `copy`: copy the _contents_ of a source _directory_ to a destination _directory_. One super cool
thing about this is that `rclone` won't re-copy files that are already on the destination and which
are identical to those in the source directory.
```sh
rclone copy bams gdrive-rclone:NOFU/bams
```
Note that the destination directory will be created if it does not already exist.
- `sync`: make the contents of the destination directory look just like the
contents of the source directory. *WARNING* This will delete files in the destination
directory that do not appear in the source directory.
A few key options:
- `--dry-run`: don't actually copy, sync, or move anything. Just tell me what you would have done.
- `--progress`: give me progress information when files are being copied.
This will tell you which file is being transferred, the rate at which
files are being transferred, and an estimated amount of time for all the
files to be transferred.
- `--tpslimit 10`: don't make any more than 10 transactions a second with Google Drive (should always be used when transferring files)
- `--fast-list`: combine multiple transactions together. Should always be used with Google Drive,
especially when handling lots of files.
- `--drive-shared-with-me`: make the "root" directory a directory that shows all
of the Google Drive folders that people have shared with you. This is key for accessing
folders that have been shared with you.
For example, try something like:
```sh
rclone --drive-shared-with-me lsd gdrive-rclone:
```
**Important Configuration Notes!!** Rather than always giving the `--progress`
option on the command line, or always having to remember to use
`--fast-list` and `--tpslimit 10` (and remember what they should be...),
you can set those options to be invoked "by default" whenever you use
rclone. The developers of `rclone` have made this possible through
_environment variables_ that you can set in your `~/.bashrc`.
If you have an rclone option called `--fast-list`, then the corresponding
environment variable is named `RCLONE_FAST_LIST`---basically, you
start with `RCLONE_`, then you just
drop the first two dashes of the option name, replace the remaining dashes
with underscores, and turn it all into uppercase to make the
environment variable. So, you should, at a minimum, add these
lines to your `~/.bashrc`:
```sh
# Environment variables to use with rclone/google drive always
export RCLONE_TPSLIMIT=10
export RCLONE_FAST_LIST=true
export RCLONE_PROGRESS=true
```
#### filtering: Be particular about the files you transfer {#rclone-filter}
`rclone` works a little differently than the Unix utility `cp`. In particular,
`rclone` is not set up very well to copy individual files. While there is
an `rclone` command known as `copyto` that will allow you to copy a single file,
you cannot (apparently) specify multiple, individual files that you wish to copy.
In other words, you can't do:
```sh
rclone copyto this_file.txt that_file.txt another_file.bam gdrive-rclone:dest_dir
```
In general, you will be better off using `rclone` to copy the *contents* of a directory
to the inside of the destination directory. However, there are options in `rclone` that
can keep you from being totally indiscriminate about the files you transfer. In other words,
you can *filter* the files that get transferred. You can read about that at
[https://rclone.org/filtering/](https://rclone.org/filtering/).
For a quick example, imagine that you have a directory called `Data` on your Google Drive
that contains both VCF and BAM files. You want to get only the VCF files (ending with `.vcf.gz`, say)
into the current working directory on your cluster. Then something like this works:
```sh
rclone copy --include "*.vcf.gz" gdrive-rclone:Data ./
```
Note that the glob pattern is quoted so that your shell does not expand it before `rclone` sees it.
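Before running a transfer like that for real, it can be reassuring to preview it with the `--dry-run` option described above:
```sh
# report what would be transferred, without actually copying anything
rclone copy --dry-run --include "*.vcf.gz" gdrive-rclone:Data ./
```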