<link rel="stylesheet" href="http://yandex.st/highlightjs/7.3/styles/default.min.css">
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.0/jquery.min.js"></script>
<script src="http://yandex.st/highlightjs/7.3/highlight.min.js"></script>
<script>
$(document).ready(function() {
$('pre code').each(function(i, e) {hljs.highlightBlock(e)});
});
</script>
<style>
pre code.bash {
background: black;
color: white;
font-size: 1em;
}
</style>
Examples of parallel computing and using the cluster
========================================================
width: 1440
height: 900
transition: none
font-family: 'Helvetica'
css: my_style.css
title: "Examples of parallel computing and using the cluster"
author: Bryan Mayer
date: 2/4/2016
Topics
========================================================
1. Example background: a simulation study in CMV
2. Local machine parallel computing
3. Working on the remote machine
4. Sending jobs to the cluster
[All materials and code available on my fhcrc github](https://github.com/mayerbry/cluster-presentation/)
Simulating a stochastic model
========================================================
type: sub-section
When a virus infects a cell, there are two possible outcomes: 1) many cells become infected and the infection is sustained (what we observe), or 2) only a few cells become infected and the infection is transient (not generally detected), i.e., stochastic extinction.
Two parameters strongly govern this probability:
1. Cell infectivity at initial exposure - if one cell is infected, how many cells will it infect during its lifetime?
2. How many cells are initially infected from an initial exposure?
![alt text](images/model_small.png)
Model code
========================================================
```{r, warning = F, message = F}
library(knitr)
library(ggplot2)
theme_set(theme_classic())
source("Code/model_functions.R") #model functions in here
```
[model function code](https://github.com/mayerbry/cluster-presentation/blob/master/Code/model_functions.R)
Example of a simulation:
========================================================
type: sub-section
```{r, warning = F, message = F}
start_time = Sys.time()
example = stochastic_model_latent(max_time = 100, initI = 10, infectivity = 1.5, parms)
print(Sys.time() - start_time)
```
The simulation function takes four arguments: 1) the total time to run the simulation (max_time); 2) the number of initially infected cells (initI); 3) the initial cell infectivity (infectivity); and 4) parms (a data.frame created in model_functions.R).
Example of a simulation:
========================================================
The output of interest is time and corresponding viral load at that time.
```{r, warning = F, message = F, fig.width = 12, fig.height = 4, fig.align='center'}
ggplot(data = example, aes(x = time, y = viral_load)) + geom_line()
```
In this single simulation, the infection was sustained.
Multiple simulations example
========================================================
```{r, warning = F, message = F, fig.width = 12, fig.height = 4, fig.align='center', cache = T}
total_simulations = 10
start_time = Sys.time()
example_simulations = NULL
for(run in 1:total_simulations){
results = stochastic_model_latent(max_time = 100, initI = 10, infectivity = 1.1, parms)
results$run = run #keep track of the run
example_simulations = rbind(example_simulations, results)
}
print(Sys.time() - start_time)
ggplot(data = example_simulations,
aes(x = time, y = viral_load, colour = factor(run))) +
geom_line()
```
In these 10 simulations, we observe several instances where the infection ends early.
Multiple simulations - how time scales with total simulations
========================================================
```{r, warning = F, message = F, echo = T, cache = T}
total_simulations = 100
start_time = Sys.time()
example_simulations = NULL
for(run in 1:total_simulations){
results = stochastic_model_latent(max_time = 100, initI = 10, infectivity = 1.1, parms)
results$run = run #keep track of the run
example_simulations = rbind(example_simulations, results)
}
print(Sys.time() - start_time)
rm(example_simulations)
```
If this scaled linearly, we would expect about 5 seconds, but this approach gets worse as the data.frame grows: roughly 20x longer per 10x increase in simulations. A run of 1000 took 5 minutes, 600x longer for a 100x increase.
Note: looping and stacking with rbind is inefficient in R (see the sketch below).
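One common workaround (a minimal sketch I am adding here, not timed in this talk) is to collect each run in a pre-allocated list and bind once at the end, which avoids re-copying the growing data.frame on every iteration:
```{r, eval = F}
total_simulations = 100
sim_list = vector("list", total_simulations) #pre-allocate a list of results
for(run in 1:total_simulations){
  results = stochastic_model_latent(max_time = 100, initI = 10, infectivity = 1.1, parms)
  results$run = run #keep track of the run
  sim_list[[run]] = results
}
example_simulations = do.call(rbind, sim_list) #single bind at the end
```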
Multiple simulations - planning ahead
========================================================
The actual simulation requirements
```{r, warning = F, message = F, echo = T}
total_simulations = 1000
I0s = c(1:10, 15, 10 * 2:9, 1:4 * 10^2)
infectivity_set = seq(1, 2, 0.05)
est_time = 5 * length(I0s) * length(infectivity_set)/24
cat(paste(est_time, "days"))
```
In this example, we need 1000 simulations for each combination of parameter settings. (For the actual study, I ran 10,000 per combination overnight.)
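For a preview of how the sweep itself could be structured, here is a hypothetical sketch (my addition) using the nested ldply pattern that the batch script uses later:
```{r, eval = F}
#hypothetical sketch of the full sweep: one loop over the grid, one over the runs
grid = expand.grid(I0 = I0s, infectivity = infectivity_set)
all_results = plyr::ldply(1:nrow(grid), function(i){
  runs = plyr::ldply(1:total_simulations, function(run){
    results = stochastic_model_latent(max_time = 100, initI = grid$I0[i],
                                      infectivity = grid$infectivity[i], parms)
    results$run = run
    results
  })
  runs$I0 = grid$I0[i] #record the parameter setting
  runs$infectivity = grid$infectivity[i]
  runs
})
```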
Local machine parallelization
========================================================
type: section
1. The `doParallel` package
2. Using `foreach` (from doParallel)
3. The `plyr` package (uses doParallel)
the doParallel package
========================================================
type:sub-section
id:parallel_setup
https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf
```{r, warning = F}
library(doParallel)
```
How many cores do I have?
```{r, warning = F}
detectCores()
```
How many is R using (1 is default)?
```{r, warning = F}
getDoParWorkers()
```
Registering cores in R
========================================================
Register your cores and check:
```{r, warning = F}
registerDoParallel(2)
getDoParWorkers()
```
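Not in the original workflow, but useful to know: when you are finished, you can release the workers and return to sequential execution (a minimal sketch using doParallel's own helpers):
```{r, eval = F}
stopImplicitCluster() #release the implicit cluster registered above
registerDoSEQ()       #fall back to sequential execution
getDoParWorkers()     #back to 1
```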
Using foreach
========================================================
type: sub-section
id: foreach_sim
A `foreach` call looks very similar to writing a loop. It takes two key arguments: first, the sequence to loop over (here we name it 'run'); second, how the results are combined (`.combine = rbind`, which stacks the rows with rbind). You then call `%dopar% {}` with the task inside the brackets. If parallelization is not needed, you can use `%do%` instead.
```{r, warning = F, message = F, echo = T, cache = T}
total_simulations = 1000
start_time = Sys.time()
example_simulations = foreach(run = 1:total_simulations, .combine=rbind) %dopar% {
results = stochastic_model_latent(max_time = 100, initI = 10, infectivity = 1.1, parms,
seed_set = 5)
results$run = run #keep track of the run
results
}
print(Sys.time() - start_time)
```
```{r, echo = F}
rm(example_simulations)
```
Versus 5 minutes for the sequential loop, so a substantial improvement.
The plyr package
========================================================
type:sub-section
The `plyr` package contains a set of functions that stack R objects more efficiently. The functions follow the `apply` style of programming (which is just looping).
```{r}
?apply
```
- We will focus on `ldply`, a function that carries out a loop and lets us stack data frames
- In the following example, `ldply` takes two arguments: 1) a set of values to loop through (`1:5`), and 2) a function that takes each value as a parameter (`i` here). The function just returns its input, and `ldply` stacks the results into a data.frame.
```{r}
plyr::ldply(1:5, function(i) i)
```
[A biostatistics example](#/bio_plyr)
Simulation example - using plyr
========================================================
id:plyr_example
```{r, warning = F, message = F, echo = T, cache = T}
total_simulations = 1000
start_time = Sys.time()
example_simulations = plyr::ldply(1:total_simulations, function(run){ #ldply is a loop
results = stochastic_model_latent(max_time = 100, initI = 10, infectivity = 1.1, parms,
seed_set = 5)
results$run = run #keep track of the run
results
})
print(Sys.time() - start_time)
rm(example_simulations)
```
This did even better than foreach without any parallel processing!
Adding parallel computing in plyr is simple.
========================================================
id: plyr_sim
```{r, warning = F, message = F, echo = T, cache = T}
total_simulations = 1000
start_time = Sys.time()
example_simulations = plyr::ldply(1:total_simulations, function(run){ #ldply is like a loop
results = stochastic_model_latent(max_time = 100, initI = 10, infectivity = 1.1, parms,
seed_set = 5)
results$run = run #keep track of the run
results
}, .parallel = T) #Just set this .parallel = T after you close the function
print(Sys.time() - start_time)
rm(example_simulations)
```
About 2x improvement.
Note: to use `.parallel = T`, the [doParallel package needs to be loaded and cores need to be registered](#/parallel_setup)
plyr vs. foreach (an opinion)
========================================================
type: sub-section
- Generally, if the resulting output is very large (in this example, > 1 million rows), the computing solution needs to handle the data efficiently. In this case, `ldply` seems superior to `foreach` with `.combine = rbind`. However, I did not search for better stacking methods for `foreach`, and even `ldply` can hit limits at this data size.
- If the method is slow (e.g., MCMC computations) but the resulting output isn't very big, then this `foreach` implementation should be comparable.
- Large data manipulations: `data.table` and `dplyr` have implementations that make this more efficient (see the sketch below).
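For example (my addition; not benchmarked in this talk), `data.table::rbindlist` gives a fast single bind over a list of results:
```{r, eval = F}
library(data.table)
sim_list = lapply(1:total_simulations, function(run){
  results = stochastic_model_latent(max_time = 100, initI = 10, infectivity = 1.1, parms)
  results$run = run
  results
})
example_simulations = rbindlist(sim_list) #single fast bind
```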
Using the server
========================================================
type:section
I use the rhino servers through scicomp. It requires an account and a "fred" drive.
https://teams.fhcrc.org/sites/citwiki/SciComp/Pages/How%20to%20get%20or%20use%20an%20account.aspx
I will be doing tasks using Bash in the command line (the Terminal app on a Mac). For Windows, you can use PuTTY (it is free):
http://sharedresources.fredhutch.org/libresources/putty
[Installing and using PuTTY](#/rhino_demo)
Any blocked code with black background is command line (bash) code.
Any blocked code with white background is R code.
Connecting to rhino - through the command line
========================================================
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
ssh bmayer@rhino.fhcrc.org
```
Then enter your password when prompted.
Then this might happen (enter yes)
![alt text](images/connect.png)
[Installing and using PuTTY](#/rhino_demo)
Setting up the remote environment
========================================================
type:sub-section
id:remote_directory
Create a directory to work from and move into it:
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
mkdir cluster-example
cd cluster-example
```
Inside it, make a directory for remotely installing R packages (next slide):
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
mkdir installed_packages
```
Open R
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
R
```
To quit R:
```{r, eval = F}
quit()
```
How to check if a package is available.
========================================================
Check its version: this will throw an error if the package is not installed.
```{r, eval = F}
packageVersion("data.table")
```
data.table is not installed, so let's install it.
```{r, eval = F}
install.packages("data.table", lib = "installed_packages/")
library(data.table, lib.loc = "installed_packages/")
packageVersion("data.table")
```
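An alternative I find convenient (my suggestion, not from the talk): prepend the personal library to R's search path once per session, so `library()` finds it without `lib.loc`:
```{r, eval = F}
.libPaths(c("installed_packages/", .libPaths())) #personal library searched first
library(data.table) #now found without lib.loc
```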
Note: When you load R on the server, it will always set your working directory to where you called R from. So all paths are relative to that. Get the absolute path with `getwd()`.
Transferring R scripts and files from local machine to the remote machine
========================================================
Four options:
1. Use the command line (from the local computer, transferring to the correct directory on the server). This can be a pain and will prompt for a password each time unless you set up SSH keys.
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
scp model_functions.R bmayer@rhino.fhcrc.org:/home/bmayer/cluster-example/
```
2. Use an FTP client (there are a few free ones for Windows).
https://cyberduck.io/?l=en
[Using Cyberduck](#/cyberduck_demo)
3. Use remote desktop software. For example, NoMachine remotely connects to a Linux machine with a graphical interface.
https://teams.fhcrc.org/sites/citwiki/SciComp/Pages/Connecting%20to%20Session%20Server%20from%20Mac.aspx
4. Do your programming from the shell using a text editor like vim or emacs (very high learning curve), combined with option 1 to transfer figure and data files.
Using R on the server
========================================================
type:sub-section
id:r_server
When using R through the command line, a lot of convenient functionality is lost. For example, it is no longer easy to highlight and run chunks of code. However, using R on the server frees up your local machine for other tasks while you run computationally intensive processes (like using all of your cores for parallel processing).
One option for continuing to use the interactive console is to edit scripts locally, transfer them to your remote drive using software like Cyberduck, and read them in with the `source` function.
Remote workflows
========================================================
id: workflow
Interaction between your computer and the remote workspace.
![alt text](images/workflow.png)
Note: git is also available on the server.
Running R scripts on the server
========================================================
The general code to run an R script on the server:
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
R CMD BATCH myR_script.R
```
This will create myR_script.Rout (the default naming), which contains an echo of all the code, any printed results, and a process time statement at the end. You can view the file from the command line with:
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
less myR_script.Rout
```
To suppress the code echo and output only the printed results, add the `--slave` option. We can also supply a second file name (results.txt) to name the output file.
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
R CMD BATCH --slave myR_script.R results.txt
```
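A common alternative (assuming `Rscript`, which ships with R, is on the PATH) is to redirect the output explicitly:
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
Rscript myR_script.R > results.txt 2>&1
```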
Parallelization on the server
========================================================
type:section
1. In a session while logged on
2. Sending batch jobs
Grabbing nodes
========================================================
type: sub-section
While logged on, you can request a set of cores for computational work.
https://teams.fhcrc.org/sites/citwiki/SciComp/Pages/Grab%20Commands.aspx
```{r, echo = F}
node_cmds = c("grabfullnode", "12 processors on one node",
              "grabnode", "6 processors on one node",
              "grabhalfnode", "6 processors on one node (same as grabnode)",
              "grabquarternode", "3 processors on one node",
              "grabcpu", "1 processor",
              "grabR", "1 processor (starts an R shell)",
              "grablargenode", "32 processors on a large-memory node")
node_table = as.data.frame(matrix(node_cmds, nrow = 7, byrow = T))
kable(node_table, col.names = c("bash command", "total cores"))
```
These commands don't actually grant you access (you already have access to cores) but they tell the server what your intentions are. Your session will then be moved to a node with available resources. The link contains information on server etiquette: do not request more cores in your R session than you grabbed.
Grabbing nodes (grabnode command)
========================================================
In this example, we use grabnode to request 6 cores for a duration of 1 day.
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
grabnode
```
![This is PuTTY!](images/grabnode.png)
You may have to accept key changes like when you first logged on to the server.
Running the simulation script
========================================================
[foreach example](#/foreach_sim)
[plyr example](#/plyr_sim)
Add the following two scripts to the server directory (for me, "/home/bmayer/cluster-example/" from earlier):
[model function code](https://github.com/mayerbry/cluster-presentation/blob/master/Code/model_functions.R)
[a script to run simulations](https://github.com/mayerbry/cluster-presentation/blob/master/Code/test_server_code.R)
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
R CMD BATCH --slave test_server_code.R results.txt
less results.txt
```
Performance of simulation script
========================================================
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
less results.txt
```
![alt text](images/results.png)
plyr performance is far superior, and it scaled closely to the estimates from the earlier simulations.
Note that these results were only performance checks; nothing from the model simulations was saved for later use. Alternatively, this script could have been run within R with `source("test_server_code.R")`, and then example_simulations would be available for analysis in the console (after a 15 minute wait).
Sending a batch job using slurm
========================================================
type: sub-section
Another option is to send your job to the server where it will be assigned to a node. One advantage of this process is that you don't have to remain logged on to the server while your job completes. One disadvantage is that there is no interactive element.
https://teams.fhcrc.org/sites/citwiki/SciComp/Pages/R%20Howto%20for%20Gizmo.aspx
There are a lot of options and commands that will not be covered here. There are training sessions offered through the Hutch.
http://centernet.fhcrc.org/CN/depts/hr/training/courses/Introduction_to_Gizmo.html
Batch job using slurm - R script
========================================================
[An R script for batch simulations](https://github.com/mayerbry/cluster-presentation/blob/master/Code/test_batch_code.R)
Key differences from test_server_code.R:
1. Removed the `foreach` method because it was too slow.
2. In addition to 10,000 simulations, there are 2 different parameter values to cycle through (a nested ldply).
3. Instead of saving all of the raw output (millions of rows), the results are aggregated to one outcome per simulation per parameter setting. The goal is to count the simulations that ended in stochastic extinction (denoted T or F for each run).
The last run took 1.2 minutes; with 2x more simulations, expect about 2 minutes or so now.
Batch job using slurm - command line
========================================================
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
sbatch --cpus-per-task=6 --time=0-2 --wrap="R --no-save --no-restore < test_batch_code.R"
```
Important options here:
1) We request a node with at least 6 cores using --cpus-per-task=6
2) We estimate the time we require (days-hours) with --time=0-2
3) There are other options not set here (e.g., email yourself updates when the job finishes; see the sketch below)
4) Leave the R options (--no-save and --no-restore) inside the --wrap string
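For instance, a hedged sketch naming the output file and requesting an email notification (the address is a placeholder; replace it with your own):
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
# %j expands to the job ID in the output file name
sbatch --cpus-per-task=6 --time=0-2 \
  --output=sim_%j.out \
  --mail-type=END --mail-user=your_id@fredhutch.org \
  --wrap="R --no-save --no-restore < test_batch_code.R"
```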
Check on your jobs (with your hutchID):
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
squeue -u bmayer
```
Sometimes the job will be queued for a while. The wait time depends on the cores and the time requested, and seems to increase over the week (Monday morning shortest, Friday afternoon longest). You can cancel a job using scancel followed by the JOBID number (found with squeue).
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
scancel 0000000x
```
Batch job using slurm
========================================================
An output file is created for each batch job (it will look like slurm-JOBID#.out unless designated otherwise). The output file is updated as the job runs, so you can see what code is currently being executed. If the job terminates early because of a bug, the error will be the last line written to the .out file. You can inspect this file using `less`.
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
less slurm-0000000x.out
```
After the job is finished, can quickly preview results:
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
less batch_results.csv
```
![results](images/example_results.png)
I might then transfer batch_results.csv to my computer for analysis and graphing.
Additional parallelization options
========================================================
type:sub-section
Each job is sent to a node where the cores are accessed for your program. By sending multiple jobs, you get access to many nodes (>100 jobs can run at once). Taking advantage of this can be very powerful:
- Simplest method: send multiple jobs to the server (rerun the script, and make sure you vary the output file names in the R code).
- Write more complicated bash scripts that use loops: have the loop pass values into R that select different simulation settings.
Bash loops (shell files)
========================================================
A bash loop can just repeatedly send jobs to the server but it can also pass values into R.
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
#!/bin/bash
for x in {1,2}; do
sbatch --cpus-per-task=6 --time=0-1 --wrap="R --no-save --no-restore '--args inf_set=$x' < test_loopbatch_code.R"
done
```
The '--args inf_set=$x' (single quotes required) defines the argument to be passed (the variable will be called inf_set in R). The $x grabs each value of x from the loop. Bash can be finicky; for example, don't add spaces between the values in the loop.
This code isn't for copying and pasting; it needs to live in a bash script file (.sh).
[bash script for loop](https://github.com/mayerbry/cluster-presentation/blob/master/Code/example_bash_loop.sh)
Bash loops - passing values into R
========================================================
The following R code must be in the R script (I put it at the top):
```{r, eval = F}
args <- commandArgs(TRUE)
if(length(args) == 0){
  print("No arguments supplied.")
}else{
  #each argument arrives as text like "inf_set=1"; evaluating it creates the variable in R
  for(i in seq_along(args)){
    eval(parse(text = args[[i]]))
  }
  print(args)
}
```
This tells R to check whether any values were passed in. The variable name comes from the bash script. You can use the input to select certain parameter ranges or change output file names:
```{r, eval = F}
#use the value passed in from the bash loop to select our subset (and change the output name)
if(inf_set == 1) infectivity_list = infectivity_list[1]
if(inf_set == 2) infectivity_list = infectivity_list[2]
out_file_name = paste("batch_results/batch_results_loop", inf_set, ".csv", sep = "") #output name varies by input; save in folder
```
For this example, a batch_results folder must be created in the remote directory. [The R script is available here.](https://github.com/mayerbry/cluster-presentation/blob/master/Code/test_loopbatch_code.R)
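Before submitting a batch job, you can test the argument passing locally; this mirrors the command inside --wrap:
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
R --no-save --no-restore '--args inf_set=1' < test_loopbatch_code.R
```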
Running the bash script
========================================================
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
./example_bash_loop.sh
```
If you get a permission error, you may need to make the script executable:
```{r, eval = F, engine = "bash", out.width="1920px",height="1080px"}
chmod u+x example_bash_loop.sh
```
If the job seemingly completes but there is no output, check the .out file for errors. For example, was the batch_results subdirectory created?
Output from multiple jobs
========================================================
__If you are running multiple jobs using the same script, make sure the output file names change or they will just be overwritten.__
With multiple output files, it may be necessary to write a script to combine them. Here is the script for this example.
```{r, eval = F}
#remember this has to be in the cluster-example folder to access the relative path "batch_results/"
library(plyr)
output_file_names = list.files("batch_results/")
output = ldply(output_file_names, function(file_name){
read.csv(paste("batch_results/", file_name, sep = ""), stringsAsFactors = F)
})
write.csv(output, "combined_batchloop_output.csv", row.names = F)
```
[R script to combine output](https://github.com/mayerbry/cluster-presentation/blob/master/Code/combine_batch_output.R)
Summary using the remote servers
========================================================
type: sub-section
A big drawback of using the server/command line: no convenient interactive software like RStudio.
Tips:
- Identify the output you want from your runs (.Rda files, csv's, plots).
- Settle on a workflow: find a comfortable way to quickly access your results and update code as needed (e.g., using NoMachine or Cyberduck to pass files back and forth). [Workflow diagram](#/workflow)
- Anticipate bugs: don't let a 15-hour batch run go to waste because you wrote a write.csv command incorrectly at the end or forgot to vary output file names. First run a quick example batch to check for bugs. Consider saving results or printing status reports periodically (see the sketch after this list).
- Parallelizing with nodes (looping batch scripts) means you may have to combine your output files.
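A hypothetical checkpointing sketch (the file name and interval are illustrative, not from the talk):
```{r, eval = F}
#save intermediate output every 100 runs so a crash late in a long batch
#doesn't lose everything; cat() output appears in the .out file as the job runs
sim_list = vector("list", total_simulations)
for(run in 1:total_simulations){
  sim_list[[run]] = stochastic_model_latent(max_time = 100, initI = 10, infectivity = 1.1, parms)
  if(run %% 100 == 0){
    saveRDS(sim_list, "checkpoint.rds")
    cat("completed run", run, "\n")
  }
}
```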
The End
========================================================
type: section
(Extra slides follow)
Biostat example
========================================================
id:bio_plyr
This fake data set has randomly assigned measurements of IgG and IgA. Treatment measurements are on average 2 units higher.
```{r, warning = F}
ptid = LETTERS[1:20]
immuno = c("IgG", "IgA")
trt = c("control", "treatment")
test_data = expand.grid(ptid = ptid, immuno = immuno, trt = trt)
test_data$outcome = with(test_data, ifelse(trt == "control", rnorm(40, 2), rnorm(40, 4)))
head(test_data)
```
Biostat example
========================================================
Run a t-test for both immunoglobulins and save all the results together.
```{r, warning = F, fig.align='center'}
ttest_results = plyr::ldply(immuno, function(imm){
temp = with(subset(test_data, immuno == imm), t.test(outcome ~ trt))
data.frame(
immuno = imm,
mean_diff = diff(temp$estimate),
lower_ci = -temp$conf.int[2],
upper_ci = -temp$conf.int[1]
)
})
ttest_results
```
Biostat example
========================================================
```{r, warning = F, fig.align='center'}
ggplot(data = ttest_results,
aes(x = immuno, y = mean_diff, ymin = lower_ci, ymax = upper_ci)) +
geom_point() +
geom_errorbar(width = 0.2)
```
[Go back to main talk (simulation example with plyr)](#/plyr_example)
Connecting to rhino - using PuTTY (download)
========================================================
id: rhino_demo
Download PuTTY:
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
The red box is what I downloaded.
![alt text](images/putty_install.png)
[Go back to main talk](#/remote_directory)
Connecting to rhino - using PuTTY (connecting)
========================================================
Most of the settings are already set; just fill in the Host Name and click Open.
![alt text](images/putty_connect.png)
Connecting to rhino - using PuTTY (connecting)
========================================================
Then click yes.
![alt text](images/putty_connect2.png)
And then login.
![alt text](images/putty_login.png)
All of the code in the black boxes is meant for the command line and should work in PuTTY.
[Go back to main talk (creating remote directories)](#/remote_directory)
Using Cyberduck
========================================================
id:cyberduck_demo
After Cyberduck is installed, click "Open Connection". Select "SFTP (SSH File Transfer Protocol)" from the dropdown. Enter the server name (rhino.fhcrc.org), leave the port as 22, and enter your HutchNet ID and password. Click Connect.
![alt text](images/cyberduck_login.png)
Using Cyberduck
========================================================
This is what the interface looks like (on a Mac) after connecting. You can drag and drop files and folders between your local machine and your server directory.
![alt text](images/cyberduck_interface.png)
[Back to talk (R on server)](#/r_server)