# Sed, awk, and regular expressions
In the course of doing bioinformatics, you will be dealing with myriad different
_text_ files. As we noted in the previous chapters, Unix, with its file I/O model,
piping capabilities, and numerous utilities, is well-suited to handling
large text files. Two utilities found on every Unix installation---`awk` and `sed`---merit
special attention in this context. `awk` is a lightweight scripting language that lets
you write succinct programs to operate line by line on the contents of text files. It is
particularly useful for handling text files that have columns of data separated by whitespace
or tabs. `sed`, on the other hand, is particularly useful for automating
"find-and-replace" operations on text files. Each of them is optimized to
handle large files without storing a lot of information in memory, so they can be
useful for quick operations on large bioinformatic data sets. Neither is a fully-featured
programming language that you would want to write large, complex programs in (that said,
I did once implement a complete program for full-sibling inference from multiallelic markers
in `awk`); however, they do share many of the
useful text-manipulation capabilities of full-blown scripting languages such as Perl and Python. Additionally,
`awk` and `sed` are deployed in a consistent fashion across most Unix operating systems, and they
don't require much time to learn to use effectively for common text-processing tasks.
As a consequence, `awk` and `sed` are a useful addition to the bioinformatician's toolbox.
Both `awk` and `sed` rely heavily on _regular expressions_ to describe _patterns_
in text upon which some operation should be performed. You can think of
regular expressions as providing a succinct language for performing very advanced "find"
and "find-and-replace" operations
in a text file.
In this chapter we will only scratch the surface of what can be done with `awk` and `sed`. Indeed,
there is an entire [432-page book](http://shop.oreilly.com/product/9781565922259.do) published decades
ago by O'Reilly about `awk` and `sed`. Our goal here is to provide an introduction to a few basic
maneuvers with both `awk` and `sed` and to describe instances where they can be useful, as well as
to give an introduction to many (but certainly not all) of the patterns that can be expressed
using regular expressions. We will start with a basic overview of how `awk` works. Then we will
have a short look at regular expressions, then we will use those further in `awk`, and finally we
will play with `sed` a little bit.
In order to have a set of files to use in the examples, I have made a small GitHub
repository called `awk-and-sed-inputs`. You can get that with
```sh
git clone https://github.com/eriqande/awk-and-sed-inputs
```
All of the examples in this chapter that use such external files
assume that the current working directory is
the repository directory `awk-and-sed-inputs`.
## awk {#awk}
### Line-cycling, tests and actions
`awk` operates by cycling through a file (or an input stream from _stdin_) line-by-line. At each
new line `awk` tests whether it should do anything with the contents of the line. If the answer
to that test is "yes", then it executes the code that describes what should be done to the
contents of the line. The scope of ways that `awk` can operate on text is quite wide, but
the most common use of `awk` is to print parts of the line, or do small calculations on parts of the
line. The instructions that describe the tests and the actions are included in the `awk` "script", though,
in practice, this script is often given to `awk` not as a file, but, rather, as text between single-quotes on the
command line. The basic syntax looks like this:
```sh
% awk '
test1 {action1}
test2 {action2}
' file
```
where `file` is the path of the file you want `awk` to read, and the "tests" and "actions"
above are just placeholders showing where those parts of the script are written. The layout
makes it clear that if `test1` is TRUE, `action1` will be executed, and if `test2` is TRUE, then
`action2` is executed. Code describing the actions must always appear within a set of
curly braces. We will refer to the text within those curly braces as an _action block_.
Two things to note: first, different tests and actions do not have to be written on separate lines. That makes
things easier to read, but if you are writing these short scripts on the command line, it is often
easier to put everything on a single line, like `'test1 {action1} test2 {action2}'`, and that is fine.
Second, if you don't supply a file path to `awk` it expects (and gladly processes) data from _stdin_.
This makes it easy to pipe data into `awk`. For example, putting the above two points together,
the above `awk` script skeleton could
have been written this way:
```sh
% cat file | awk 'test1 {action1} test2 {action2}'
```
Before we get too abstract talking about tests and actions, let's look at a few
examples:
```sh
# print lines starting with @SQ in the SAM header of a file
awk '/^@SQ/ {print}' data/DPCh_plate1_F12_S72.sam
# print only those @SQ lines with a sequence name tag starting with "NC_"
awk '/^@SQ/ && /SN:NC_/ {print}' data/DPCh_plate1_F12_S72.sam
# same as above, but quit processing the file as soon as you hit
# an @SQ line with a sequence name starting with "NW_"
awk '
/^@SQ/ && /SN:NC_/ {print}
/^@SQ/ && /SN:NW_/ {exit}
' data/DPCh_plate1_F12_S72.sam
# print only lines 101 and 102 from the fastq file (on Linux, use zcat instead of gzcat)
gzcat data/DPCh_plate1_F12_S72.R1.fq.gz | awk 'NR==101 || NR==102 {print}'
```
The above examples show only two different actions (`print` and `exit`) and
a variety of tests based on matching substrings in each line, or on the
line number (`NR`, which stands for "number of the record").
Take a moment to make sure you understand which parts are the tests, and which are the actions
in the above examples.
### Column splitting, fields, `-F`, `$`, `NF`, `print`, `OFS` and `BEGIN`
Every time `awk` processes a line of text it breaks it into different _fields_, which you
can think of as columns (as in a spreadsheet). By default, any number of whitespace (space or TAB)
characters constitutes a break between columns. If you want field-splitting to be done using
more specific characters, you can specify that on the command line with the `-F` option.
For example:
```sh
awk -F"\t" 'test {action}' file # split lines into fields on single TAB characters
awk -F"," 'test {action}' file # split lines into fields on single commas
awk -F":" 'test {action}' file # split lines into fields on single colons
awk -F";" 'test {action}' file # split lines into fields on single semicolons
# and so forth...
```
While in all those examples, the field separator is a single character, in full-blown
practice, the field separator can be specified as a _regular expression_ (see below).
Within an action block, you can access the different fields that `awk` has split out
of the line using a `$` followed immediately by a field number, i.e., `$1`, `$2`, `$3`, ... . For fields
with indexes greater than 9, you can wrap the number in parentheses, like
`$(10)`, `$(11)`, ... (plain `$10` works too, but the parentheses make the intent clear). The special value `$0` denotes the whole line, not just one of its
fields. (Note that if you give the action `print` without any arguments, that also just
prints the whole line.) The special variable `NF` is the "Number of Fields", so
`$NF` is the "last column."
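For a quick illustration of fields in action, here is a minimal sketch using `echo` (rather than one of the repository files):

```sh
echo "alpha beta gamma" | awk '{print $2}'    # 2nd field: beta
echo "alpha beta gamma" | awk '{print $NF}'   # last field: gamma
echo "alpha beta gamma" | awk '{print NF}'    # number of fields: 3
echo "alpha beta gamma" | awk '{print $0}'    # whole line: alpha beta gamma
```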
Let's say we have a tab-separated file in `data/wgs-chinook-samples.tsv` that looks like this:
```
vcf_name ID_Berk NMFS_DNA_ID BOX_ID BOX_POSITION Population Concentration (ng/ul) run_type
DPCh_plate1_A01_S1 CH_Plate1_1A T144767 T1512 1A Salmon River Fall 23.4 Fall
DPCh_plate1_A02_S2 CH_Plate1_2A T144804 T1512 5F Salmon River Fall 67.4 Fall
DPCh_plate1_A03_S3 CH_Plate1_3A T145109 T1515 7G Salmon River Spring 3.52 Spring
DPCh_plate1_A04_S4 CH_Plate1_4A T145118 T1515 8H Salmon River Spring 10.3 Spring
DPCh_plate1_A05_S5 CH_Plate1_5A T144863 T1513 1A Feather River Hatchery Fall 220 Fall
```
Then, if we wanted to print the first and the last columns (fields) we could do
```sh
awk -F"\t" '{print $1, $NF}' data/wgs-chinook-samples.tsv
# the first few lines of output look like:
vcf_name run_type
DPCh_plate1_A01_S1 Fall
DPCh_plate1_A02_S2 Fall
DPCh_plate1_A03_S3 Spring
DPCh_plate1_A04_S4 Spring
```
Notice that if there is no `test` before the action block, then the action is
done on every line.
The `print` command prints variables to _stdout_. If you separate those variables with
a comma, then in the output they will be separated by the _output field separator_ which,
by default, is a single space. You can set the _output field separator_ using the `OFS`
variable, in an action block that is forced to run at the beginning of execution using the
special `BEGIN` test keyword:
```sh
awk -F"\t" '
BEGIN {OFS=";"}
{print $1, $NF}
' data/wgs-chinook-samples.tsv
# makes output like:
vcf_name;run_type
DPCh_plate1_A01_S1;Fall
DPCh_plate1_A02_S2;Fall
DPCh_plate1_A03_S3;Spring
DPCh_plate1_A04_S4;Spring
DPCh_plate1_A05_S5;Fall
```
The comma between the arguments to `print` is what makes `awk` print the output field
separator between the items. If you
separate arguments to `print` with just a space (and not a comma), there will be nothing
printed between the two arguments on output. This, coupled with the fact that
`print` will happily print any strings and numbers (in addition to variables!) you pass to it,
provides a way to do some light formatting of the output text:
```sh
awk -F"\t" 'NR > 1 {print "sample_name:" $1, "run_type:" $NF}' data/wgs-chinook-samples.tsv
# gives output like:
sample_name:DPCh_plate1_A01_S1 run_type:Fall
sample_name:DPCh_plate1_A02_S2 run_type:Fall
sample_name:DPCh_plate1_A03_S3 run_type:Spring
sample_name:DPCh_plate1_A04_S4 run_type:Spring
sample_name:DPCh_plate1_A05_S5 run_type:Fall
```
Note that if you don't provide an action block after a test, the default action, if the
test is true, is assumed to be "print the whole line as is." Thus you can use `awk` to
print matching lines simply like this:
```sh
awk '/regex/'
```
where `/regex/` just means "some regular expression," as is explained in the
next section.
**Now it's your turn!**
```sh
# 1. Using the file data/wgs-chinook-samples.csv, print out the
# NMFS_DNA_ID, the BOX_ID, and the BOX_POSITION, separated by periods
```
### A brief introduction to regular expressions
In a few of the above examples you will see tests that look like: `/^@SQ/`. This is a regular expression. In
`awk`, regular expressions are enclosed in forward slashes, so the actual regular expression part
in the above is `^@SQ`, the enclosing forward slashes are just delimiters that are telling `awk`, "Hey! The stuff inside here should
be interpreted as a regular expression."
At this stage, you can think of a regular expression as a "search-string" that `awk` will try to match
against the text in a line. At its simplest, a regular expression just describes how characters should
match between the regular expression and the line being matched. For example:
```
/ACGGTC/
```
is saying, "search for the word `ACGGTC`," which is something that you might find in a DNA string.
If all that regular expressions did was express a search word (like your familiar find function in
Microslop Word, for example), then they would be very easy to learn, but also very limited in
utility. The good news is that all your standard numerals and upper- and lowercase letters work
in regular expressions just like they do in your vanilla "find" function. So, all of the following
regular expressions are just requesting that the word enclosed in the `/`'s be found:
```
/Fall/
/A01/
/plate/
/LN/
```
Regular expressions get more complicated (and useful) because some of the familiar punctuation
marks have special meaning within a regular expression. These are called _metacharacters_.
The fundamental metacharacters in `awk` are listed in Table \@ref(tab:metachars). It is
worth getting familiar with all of these as most are common to all languages that use
regular expressions, such as R, python, and perl, so learning these will be helpful not
just in using `awk`, but also in your programming, in general.
```{r, echo=FALSE, message=FALSE}
tab <- readr::read_delim("table_inputs/metachars.txt", delim = "&", trim_ws = TRUE)
pander::pander(
tab,
booktabs = TRUE,
caption = '(\\#tab:metachars) Basic metacharacters used in regular expressions with `awk`.',
justify = "left")
```
We note in the above table that more explanation is needed for the concept of _character classes_ defined by `[ ]`.
These are similar to what you are already
familiar with in terms of _globbing_ file names. In short, if you
put a variety of characters between square brackets, it means "match any one of these characters."
For example, `/[aceACE]/` means "match any upper- or lowercase `a`, `c`, or `e`." Within
those square brackets, `^` and `-` have special meaning depending on where between
the square brackets they occur:
- A `-`, when _between_ letters or numbers, indicates a range: `/[a-zA-Z]/` means any letter;
`/[0-5]/` means any numeral between 0 and 5.
- A `-` at the beginning or end of the characters inside the square brackets just means `-`: `/[-;ab]/` means
match any of the characters `-`, `;`, `a`, or `b`.
- A `^` _at the beginning of the characters inside the square brackets_ negates the character class, meaning
the match will be to anything _not_ within the character class, i.e., `[^ABC]` means match any character
that is _not_ `A` nor `B` nor `C`. If the `^` is not at the beginning of the characters within the `[ ]`
then it carries no special meaning: `[:;^%]` matches any of those four punctuation characters.
- All characters, except `-` and `^` in the correct positions, and the backslash `\`, are interpreted _literally_
(i.e., not as metacharacters) within the `[ ]`. So you can match any one of `?`, `*`, `(`, `)`, or `+`, for instance, with `/[?*()+]/`.
All these metacharacters might seem a little cryptic, so I provide a few
examples here that might help to make them more clear. They are presented
as if they are part of an `awk` test (i.e., "match any line that...").
```sh
# match any line that starts with @RG
/^@RG/
# match any line that starts with @ followed by any two characters
/^@../
# match any line that starts with @ followed by any
# two uppercase letters
/^@[A-Z][A-Z]/
# the above could also be written as:
/^@[A-Z]{2}/
# This matches phone numbers formatted either like
# (###) ###-####, or ###-###-####
/\(?[0-9]{3}(\) |-)[0-9]{3}-[0-9]{4}/
# And if you can parse that out, you get an A for the day!
# match anything that starts with T456, then has any
# number (including 0) of any other characters, then
# ends with ".fq"
/T456.*\.fq/
# Note the use of backslash to escape the .
# Match either "big mess" or "huge mess"
/(big|huge) mess/
```
**Now it's your turn**
Here are some things to try.
```sh
# 1. use alternation (the |) to match lines in the SAM file data/DPCh_plate1_F12_S72.sam
# that start with either @PG or @RG
# 2. search through the fastq file data/DPCh_plate1_F12_S72.R1.fq.gz
# to find any line with exactly two consecutive occurrences of any of the following characters:
# ! " # $ % &
# What are we searching for in the above?
```
### A variety of tests
When you want to test whether a line will be processed by an action block, or not,
you have many options. Among the major ones (each given with an example or two) are:
1. match a regular expression anywhere in a line:
```{sh, eval=FALSE}
awk '/^@RG/'
awk '/Fall/'
```
1. match a regular expression in a specific column/field
```{sh, eval=FALSE}
awk '$3 ~ /Sacramento/'
awk '$1 ~ /Chr[123]/'
```
1. test for equality, using `==`, of a single column to a string, or a number. Note
these are _not_ regular expressions, but actual statements of equality. Strings in
such a context are surrounded by double quotes.
```{sh, eval=FALSE}
awk '$1 == "NC_07124.1"'
awk '$5 == 100'
```
1. Use comparison operators `<`, `<=`, `>`, `>=`, and `!=` (not equals) to compare
a column to a string (compared by lexicographical order) or a number
(compared by numerical order)
```{sh, eval=FALSE}
awk '$2 <= "Aardvark"'
awk '$7 > 25'
```
1. test the value of a user-defined variable that may be changing as lines are getting
processed.
```{sh, eval=FALSE}
awk 'n > 356'
awk 'my_word == "Loony"'
```
1. test the value of an internal variable, like `NR` or `NF`
```{sh, eval=FALSE}
awk 'NR < 25'
awk 'NF == 13'
```
As these tests are effectively things that return a value of TRUE or FALSE, they can
be combined with logical AND and logical OR operators. In awk, the logical AND is
`&&` and the logical OR is `||`. To make the intent of long combinations clear, and to
control the order of evaluation, the tests can be grouped with parentheses:
```sh
awk '(units = "days" && n > 356) || (units == "months" && n > 12) {print "More than a year!"}'
```
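For a concrete, runnable combination, here is a sketch using the sample sheet from earlier (whose last column, `run_type`, holds values like "Fall" and "Spring"):

```sh
# print the vcf_name of every Spring-run fish, skipping the header line
awk -F"\t" 'NR > 1 && $NF == "Spring" {print $1}' data/wgs-chinook-samples.tsv
```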
### Code in the action blocks
Within the action blocks you write computer code to do things, and `awk` has many features
you expect in a programming language.
Separate lines of code can be ended with a line return (i.e., in scripts), or they can be
ended with a semicolon.
Variable assignment is done with the `=` sign. Unlike the bash shell, you can have spaces around
the `=`. One interesting aspect of `awk` is that the
variables are _untyped_. This means that you don't have to tell `awk` ahead of time whether
a variable is going to hold a number, or a string, etc. Everything is stored as a string,
but when used in a numeric context, if it makes sense, the variable will be treated
as a number. So, you don't have to worry too much about the type of variables.
If a value has not yet been assigned to a variable, but it is used in a numeric
context, the value is assumed to be 0; if used in a string context, its value
is assumed to be the empty string `""`.
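Here is a small sketch of that untyped behavior, using `echo` so it is self-contained:

```sh
# $1 holds the string "5", but in a numeric context it acts like the number 5
echo "5 fish" | awk '{print $1 + 2}'         # prints 7
echo "5 fish" | awk '{print $1 $2}'          # string context: prints 5fish
# an unassigned variable x is 0 numerically and "" as a string
echo "" | awk '{print x + 1, "[" x "]"}'     # prints 1 []
```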
**You can use `for` loops within `awk`**. They have the syntax of the C language:
```sh
for(var = initial; test; increment)
```
For example, this cycles `i` over values starting from 1, until 10, each
time incrementing the value by 1:
```sh
for(i=1;i<=10;i++)
```
Note that the `++` means "add one to the variable to my left."
You can also increment by larger amounts. What do you think this would do?
```sh
echo boing | awk '{for(i=5; i<=25; i+=5) print i}'
```
If the body of the loop consists of multiple lines of code (or even just one) those lines
can be _grouped_ using curly braces.
Here is an example of using a `for` loop to print the columns of the Chinook sample sheet
in row format:
```sh
awk -F"," '
NR == 1 {for(i=1;i<=NF;i++) head[i] = $i; next}
{for(i=1;i<=NF;i++) {
print i ". " head[i] ": " $i;
}
print "----------------------"
}
' data/wgs-chinook-samples.csv
```
**Arrays are implemented as _associative arrays_.** Thus, rather than being arrays
with elements indexed (and accessed) by natural numbers (i.e. `array[1]`, `array[2]`, etc.),
arrays are indexed by any string. This is sometimes called a hash. In Python it is
called a _dictionary_. So, within an awk action block you could see
```sh
n[$3]++
```
which means "find the element of the array `n` that is associated with the _key_ $3, and then
add 1 to that element". This can
be quite useful for counting up the occurrences of different strings or values
in a column. Of course, in order to actually see what the values are, after they
have been counted up, you need to be able
to cycle over all the different _keys_ of the array, and print the key and the
value for each. In `awk`, you cycle over the keys of an array using
```
for (i in array)
```
where `for` and `in` are keywords, `i` is any variable name you want, and `array` is the
name of an array variable. Unfortunately, you have no control over the _order_ in which the
different keys in the array are visited! Example:
```sh
# count the number of fish from different sampling collections
# in the wgs-chinook-samples.tsv file:
awk -F"\t" '
NR > 1 {n[$6]++}
END {for(i in n) print i ":", n[i]}
' data/wgs-chinook-samples.tsv
# gives us:
Coleman Hatchery Late Fall: 16
Feather River Hatchery Spring: 16
Feather River Hatchery Fall: 16
Salmon River Spring: 16
Trinity River Hatchery Fall: 16
Trinity River Hatchery Spring: 16
Butte Creek Spring: 16
Sacramento River Winter: 16
Salmon River Fall: 16
San Joaquin River Fall: 16
```
The above demonstrates the use of the very important `END` specifier in the
test position. The `END` there means "perform the actions in the action block
once all the lines of the file have been processed."
**Now it's your turn!**
We want to cycle through the file `DPCh_plate1_F12_S72.R1.fq.gz` and
count up the number of times different base-quality sequences occur in the file. The
base quality scores occur on lines 4, 8, 12, ... (so the line number divided by 4 has
no remainder; note that `x % 4` gives the remainder when `x` is divided by 4). We want the
output on each line to be in the format:
```
Number_of_occurrences Base_quality_score_sequence
```
```sh
# go for it!
```
After doing that, you might wish that things were sorted differently, like highest to lowest
in terms of # of occurrences. Try piping the output into:
```sh
sort -n -b -r -k 1
```
Pipe that to less and look through it. It is actually pretty cool.
**Conditional tests use `if` and `else`, with blocking by curly braces.**
Below we will demonstrate the use of `if` and `else` and, how they can be put together
to make an `ifelse`-like construction. At the same time we demonstrate how
output from `print` statements in `awk` can be redirected to files, from within
the `awk` script itself. This uses much the same syntax as the shell.
Imagine that we want to put the IDs (called the `vcf_names`) of the fish
in `data/wgs-chinook-samples.csv` into separate files, one for each
of the `run_types` of "Fall", "Winter", and "Spring", and another file
for anything else. We could do that like this:
```sh
awk -F"," '
NR > 1 {
if($NF == "Fall") {
print $1, $NF > "fall.txt"
} else if($NF == "Winter") {
print $1, $NF > "winter.txt"
} else if($NF == "Spring") {
print $1, $NF > "spring.txt"
} else {
print $1, $NF > "other.txt"
}
}
' data/wgs-chinook-samples.csv
```
Now look at the four files produced.
**NB:** Parsing CSV files with `awk` is not always as straightforward as doing `-F","` because
the CSV specification allows for commas that do not separate fields to be hidden within quotation marks.
Don't expect complex CSV files made by Excel, for example, to be parsed this easily
with `awk`. It is typically better to save them as TAB-separated files.
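Here is a minimal sketch of the pitfall, on a made-up line of input:

```sh
# the middle field contains a quoted comma, so -F"," over-splits the line
echo 'fish1,"Salmon River, CA",Fall' | awk -F"," '{print NF}'   # prints 4, not 3
```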
**Mathematical operations:** `awk` has got them all: `+`, `-`, `*`, `/`, `%`, as well as "operate-and-reassign"
versions (`+=`, `-=`, `*=`, `/=`), and the mathematical functions `exp`, `log`, `sqrt`, `sin`, `cos`, and `atan2`.
**Built-in functions:** `awk` has a limited set of built-in functions. See `man awk` for a full
listing. The ones I find I use all the time are:
- `length(n)` : return the number of characters of the variable `n`
- `substr(s, m, n)` : return the portion of string `s` that begins at position `m`
(counted starting from 1), and is `n` characters long
- `sub(r, t, s)` : substitute string `t` for the first occurrence of the regular expression `r` in the string `s`. If `s` is not given, `$0` is used.
- `gsub` : just like `sub` but replaces _all_ (rather than just the first) occurrences of the regular expression `r` in string `s`.
- `split(s, a, fs)` : split the string `s` into an array variable named `a` at occurrences of the regular expression `fs`.
The function returns the number of pieces the string was split into. Afterward each part of the string can be accessed like
`a[1]`, `a[2]`, and so forth. (Good example = splitting out fields from a column in a VCF file.)
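Here is a short sketch exercising these functions on one of the sample names from the sample sheet:

```sh
echo "DPCh_plate1_A01_S1" | awk '{
  print length($1)          # 18 characters
  print substr($1, 6, 6)    # plate1
  n = split($1, a, "_")     # n gets 4; the pieces land in a[1]..a[4]
  print n, a[2], a[4]       # 4 plate1 S1
  gsub(/_/, "-", $1)        # replace every underscore with a dash
  print $1                  # DPCh-plate1-A01-S1
}'
```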
**Take control of output formatting with the `printf()` function**:
`printf()` works much like it does in C: the first argument is a format string containing
placeholders like `%s` (string), `%d` (integer), and `%f` (floating point), and the remaining
arguments supply the values to be printed. Unlike `print`, it adds no newline unless you put
a `\n` in the format string.
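For example, a minimal sketch printing a fixed-width table from the sample sheet:

```sh
# sample name left-justified in 25 columns; concentration (column 7)
# right-justified, rounded to one decimal place
awk -F"\t" 'NR > 1 {printf("%-25s %6.1f\n", $1, $7)}' data/wgs-chinook-samples.tsv
```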
### Using `awk` to assign to shell variables
We leave our discussion of `awk` by noting that its terse syntax makes it
a perfect tool for creating shell variables that we can then do things with
(like cycling over their contents).
For example, imagine we wanted to do something to all the FASTQ files associated with
the fish in `data/wgs-chinook-samples.csv` that are of `run_type` = "Winter". We can
get all their IDs using command substitution and then cycle over them in the shell like this:
```sh
WINTERS=$(awk -F"," '$NF == "Winter" {print $1}' data/wgs-chinook-samples.csv)
for i in $WINTERS; do
echo "FASTQS are: $i.R1.fq.gz $i.R2.fq.gz"
done
```
### Passing Variables into `awk` with `-v`
The `-v` option lets you assign a value to an `awk` variable on the command line. This is the
standard way to get the value of a shell variable into an `awk` script, since the single quotes
around the script keep the shell from expanding variables inside it.
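A minimal sketch, pulling the value of a shell variable into `awk` via `-v`:

```sh
RT="Spring"
awk -F"," -v run_type="$RT" '$NF == run_type {print $1}' data/wgs-chinook-samples.csv
```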
### Writing awk scripts in files
If your `awk` program gets long, you can put the tests and action blocks into their own file
and run it with the `-f` option, as in `awk -f script.awk file`.
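For example, we could save a run-type counting script in a (hypothetical) file called `count-runs.awk`:

```
# contents of count-runs.awk: count the fish of each run_type
NR > 1 {n[$NF]++}
END {for (i in n) print i ":", n[i]}
```

and then run it with `awk -F"," -f count-runs.awk data/wgs-chinook-samples.csv`.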
## sed
`sed` is a "stream editor". Though it is capable of all sorts of things,
to be honest, I use it almost exclusively to
do simple find-and-replace operations. The syntax for that is:
```sh
sed 's/regex/replacement/g;' file
```
where `regex` is a regular expression and `replacement` is the string with which you wish to replace
any segments of `file` that match `regex`. The `s` means "substitute" and is
a command to `sed`, and the `g` means "globally", without which only the first
match of `regex` on each line would be replaced.
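For example, a quick sketch on a made-up line of input:

```sh
# replace every occurrence of "Fall" with "Autumn"
echo "Salmon River Fall, Feather River Fall" | sed 's/Fall/Autumn/g;'
# gives: Salmon River Autumn, Feather River Autumn
```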
Multiple instances of the "s" command can be given, like:
```sh
sed '
s/regex1/replacement1/g;
s/regex2/replacement2/g;
s/regex3/replacement3/g;
' file
```
The separate commands are done in order, line by line.
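Because they are done in order, later commands see the results of earlier ones, as this small
sketch (again on made-up input) shows:

```sh
echo "spring run" | sed '
s/spring/fall/g;
s/fall run/fall-run fish/g;
'
# gives: fall-run fish
```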
**Now it's your turn!**
Take the output of:
```sh
WINTERS=$(awk -F"," '$NF == "Winter" {print $1}' data/wgs-chinook-samples.csv)
for i in $WINTERS; do
echo "FASTQS are: $i.R1.fq.gz $i.R2.fq.gz"
done
```
and pipe it into `sed` to change the `R1` and `R2` into `r1` and `r2`,
and to remove the `.gz` from the end of each file name. Remember to
backslash-escape that period!