-
Notifications
You must be signed in to change notification settings - Fork 1
/
workshop.slide
1728 lines (1065 loc) · 49.4 KB
/
workshop.slide
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
Performance and Scalability of Go Applications
24 Sep 2019
Tags: Go, performance, scalability
Fabio Falzoi
Senior Software Engineer @ Develer
https://github.com/Pippolo84
@Pippolo84
* Before starting
- check your Go environment: compile and run the "Hello World"
package main
import "fmt"
func main() {
fmt.Println("Hello world!")
}
- download the workshop material
git clone --depth 1 [email protected]:Pippolo84/performance-and-scalability-of-go-applications.git
- peeking inside the solution folders is not allowed ;-)
* Introduction
* Big O notation in a nutshell
Adapted from [[https://en.wikipedia.org/wiki/Big_O_notation][Wikipedia]]
Let f be a real valued function and g a real valued function. One writes:
f(x) = O(g(x)) for x -> +inf
if for all sufficiently large values of x, the absolute value of f(x) is at most a positive
constant multiple of g(x).
That is, f(x) = O(g(x)) if and only if there exists a positive real number M and
a real number x0 such that
abs(f(x)) <= Mg(x) for all x >= x0
* Time complexity examples
- _O(1)_ constant time (look-up table)
- _O(logn)_ logarithmic time (binary search)
- _O(n)_ linear time (linear search)
- _O(nlogn)_ linearithmic time (quick sort)
- _O(n^2)_ quadratic time (bubble sort)
* A taste of performance analysis
Try to complete the Isogram exercise!
Follow the instructions inside
01-introduction/INSTRUCTIONS.md
Happy coding!
* Isogram solutions
1) Using a set
- iterate over the string
- for each rune: if it is already in the map, return false
- return true
2) Using sorting
- sort the string
- check each pair of adjacent runes
- if you find a pair of equal runes, return false, otherwise return true
3) Using linear searching
- iterate over the string
- for each rune, search for it in the remaining part of the string
- if found, return false, otherwise, return true
*Quick*quiz*: can you guess the time complexity of each solution?
* Isogram solutions time complexities
1) Using a set: _O(n)_
2) Using sorting: _O(nlogn)_
3) Using linear searching: _O(n^2)_
* Let's benchmark them!
Inside the solution folder
$ go test -run=^$ -bench=.
goos: linux
goarch: amd64
pkg: performance-and-scalability-of-go-applications/01-introduction/solution
BenchmarkIsIsogram/01-map-8 200000 8203 ns/op
BenchmarkIsIsogram/02-sorting-8 300000 5534 ns/op
BenchmarkIsIsogram/03-linear-search-8 1000000 1382 ns/op
...
PASS
ok performance-and-scalability-of-go-applications/01-introduction/solution 6.295s
Wait... What!?
* Map vs Sort vs Linear Search
.image img/RuneUniqueAll.svg
For #runes < 256 the linear search outperforms the other implementation
* Map vs Sort
.image img/RuneUniqueMapVsSort.svg
For #runes < 1024 the sort implementation is still faster than the map
* A closer look to the map data type
There are no generics in Go 1.13
The compiler rewrites lookups, insertion and removal into functions calls to the runtime package
v := found[r]
func mapaccess1_*(t *maptype, h *hmap, key uint32) unsafe.Pointer
found[r] = true
func mapassign_*(t *maptype, h *hmap, key uint32) unsafe.Pointer
- hmap implements the generic map logic
- maptype is unique to the specific map type declared and contains key and value type descriptors
* A closer look to the map data type
go tool compile -S isogram.go
0x00df 00223 (isogram.go:18) MOVL AX, "".r+40(SP) ;; key loaded in AX
0x00e3 00227 (isogram.go:19) LEAQ type.map[int32]bool(SB), CX ;; t *maptype loaded in CX
0x00ea 00234 (isogram.go:19) MOVQ CX, (SP) ;; t *maptype on the stack
0x00ee 00238 (isogram.go:19) LEAQ ""..autotmp_8+112(SP), DX ;; h *hmap loaded in DX
0x00f3 00243 (isogram.go:19) MOVQ DX, 8(SP) ;; h *hmap on the stack
0x00f8 00248 (isogram.go:19) MOVL AX, 16(SP) ;; key on the stack
;; runtime.mapaccess1_fast32(t, h, key)
0x00fc 00252 (isogram.go:19) CALL runtime.mapaccess1_fast32(SB)
0x010b 00267 (isogram.go:23) LEAQ type.map[int32]bool(SB), AX ;; t *maptype loaded in CX
0x0112 00274 (isogram.go:23) MOVQ AX, (SP) ;; t *maptype on the stack
0x0116 00278 (isogram.go:23) LEAQ ""..autotmp_8+112(SP), CX ;; h *hmap loaded in DX
0x011b 00283 (isogram.go:23) MOVQ CX, 8(SP) ;; h *hmap on the stack
0x0120 00288 (isogram.go:23) MOVL "".r+40(SP), DX ;; key loaded in DX
0x0124 00292 (isogram.go:23) MOVL DX, 16(SP) ;; key on the stack
;; runtime.mapassign_fast32(t, h, k)
0x0128 00296 (isogram.go:23) CALL runtime.mapassign_fast32(SB)
* A closer look to sort.Sort
from sort.go:
func quickSort(data Interface, a, b, maxDepth int) {
for b-a > 12 { // Use ShellSort for slices <= 12 elements
...
}
if b-a > 1 {
// Do ShellSort pass with gap 6
// It could be written in this simplified form cause b-a <= 12
for i := a + 6; i < b; i++ {
if data.Less(i, i-6) {
data.Swap(i, i-6)
}
}
insertionSort(data, a, b)
}
}
For small slices, the standard library doesn't even use Quicksort!
* A closer look to strings.ContainsRune
from indexbyte_amd64.s:
sseloop:
// Move the next 16-byte chunk of the data into X1.
MOVOU (DI), X1
// Compare bytes in X0 to X1.
PCMPEQB X0, X1
// Take the top bit of each byte in X1 and put the result in DX.
PMOVMSKB X1, DX
// Find first set bit, if any.
BSFL DX, DX
JNZ ssesuccess
// Advance to next block.
ADDQ $16, DI
Search is done using SIMD instructions
* Can we do even better?
Consider the test cases, we can see that:
- all runes are in the ASCII set
- we are only interested in the 26 alphabetical letters
* Array as a map
func IsIsogram(s string) bool {
foundRune := [26]bool{} //'a' to 'z'
for _, r := range s {
if !unicode.IsLetter(r) {
continue
}
// convert the rune to lowercase to index foundRune
r = unicode.ToLower(r)
i := r - 'a'
if foundRune[i] == true {
return false
}
foundRune[i] = true
}
return true
}
* Benchmarks
BenchmarkIsIsogram/01-map-8 200000 8203 ns/op
BenchmarkIsIsogram/02-sorting-8 300000 5534 ns/op
BenchmarkIsIsogram/03-linear-search-8 1000000 1382 ns/op
BenchmarkIsIsogram/04-bool-array-8 3000000 525 ns/op
* Pitfalls of the time complexity model
Constant factors are important. What are they determined by?
- memory access pattern
- algorithms implemented in hardware
- characteristics of our input dataset
- ...
* Take home message
Time complexity with Big O notation tells us just a part of the story!
If we want to get the most ouf of our applications we need to:
- understand our hardware, the Go runtime internals and their interactions
- learn to profile and benchmark our application with meaningful dataset
That's why the rest of this workshop is dedicated to that! ;-)
* Roadmap
- Modern CPU architecture
- Benchmarking
- Profiling & Tracing
- The Go scheduler
- The Go memory allocator & garbage collector
* Modern CPU architecture
* Microprocessors evolution
.image img/40-years-processor-trend.png 580 _
* Single core performance
Single core performance is approaching a limit
CPUs manufacturers need to keep power consumption, and thus heat dissipation, below levels that will damage their CPUs
* Number of transistors
Number of transistors on a die continues to increase
But packing more transistors in the same area is getting harder and expensive!
In conclusion, adding cores is not that easy.
Besides, improvements from parallelization are limited by the portion of work that is not parallelizable.
* Architectural improvements
*Out*of*order* execution to take advantage of _Instruction_level_parallelism_
1) h = a + b
2) f = c + d
3) g = h * f
1) and 2) can be executed in parallel
But connecting each execution unit to all others is costly
* Architectural improvements
*Speculative*execution* to guess the outcome of a branch condition and keeps the pipeline full
.image https://upload.wikimedia.org/wikipedia/commons/2/21/Fivestagespipeline.png
Performance improvements depend on the rate of branch prediction successes!
* Bulk operations
Modern CPUs are optimized for *bulk* operations
- memory is loaded per multiple of caches lines
- SIMD instructions
We've seen this effects in the introductory exercise!
* Memory latency
.image https://www.extremetech.com/wp-content/uploads/2017/07/Ch5-fig02.jpg
* Memory latency
L1 cache reference ~1ns
Branch mispredict ~3ns
L2 cache reference ~4ns
Main memory reference ~100ns
from [[https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html][Latency Numbers Every Programmer Should Know]]
Caches are critical to reduce memory latencies, unfortunately, their size are limited by their physical size on CPU die and power consumption
* Conclusions
_future_programmers_will_not_longer_be_able_to_rely_on_faster_hardware_to_fix_slow_programs_or_slow_programming_languages_
from [[http://www.gotw.ca/publications/concurrency-ddj.htm][The free lunch is over]] - Herb Sutter
* Why Go?
- efficient type system: it does not pretend every number is an ideal float
- efficient implementation of structs: in contrast with java objects that store references
- easily scales to multiple cores for highly parallelizable applications
* Benchmarking
* A stable environment
When benchmarking, try to make your environment as stable as possible:
- avoid shared hardware, virtual machines and shared cloud hosting
- close all resource hungry applications so that CPU utilization is lower than ~5%
Specifically
- do not browse the Web
- do not check emails
- do not use your editor to write stuff
and above all...
* A stable environment
- resist the urge to answer on Slack using Giphy
.image https://media.giphy.com/media/n2IekUAIvjz2w/giphy.gif
*Quick*quiz*: try to guess the increase in CPU utilization due to this GIF, then measure it with the top utility
* Advanced HW tweaking to improve stability
- disable Turbo Boost
# Intel
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
# AMD
echo 0 > /sys/devices/system/cpu/cpufreq/boost
- Set scaling_governor to "performance"
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
...
- Set process priority
nice -n 10 ./isogram.bench -test.run=^$ -test.bench=. -test.benchmem
More on this [[https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux][here]]
* Expensive HW tweaking to improve stability
Buy dedicated HW to run your benchmarks
* Benchmarking for humans
- create a test executable to do a before/after comparison
- use [[https://godoc.org/golang.org/x/perf/cmd/benchstat][benchstat]]
more on this later!
* Go benchmark tooling
Very similar to testing
func BenchmarkIsIsogram(b *testing.B) {
for i := 0; i < b.N; i++ {
IsIsogram("input test string")
}
}
You can execute them with:
go test -bench=.
This will first run the tests, and then the benchmarks
* Go benchmarking output
Sample output produced
goos: linux
goarch: amd64
pkg: path/to/package
BenchmarkName-8 100000 1859 ns/op 57 B/op
PASS
ok path/to/package 2.042s
A line starting with the BenchmarkName followed by the number of cores (GOMAXPROCS) available, then the number of times (100000) the benchmark was run.
The next two numbers indicate:
- the time it took to execute the benchmark, in ns/op (nanoseconds per operation)
- the memory usage, in B/op (bytes of memory allocated per operation)
lower is better.
* How benchmarks work
Each benchmark function is repeatedly called with different value for b.N, the number of iterations the benchmark should run for.
b.N starts at 1, if the benchmark function completes in less than 1 second, then b.N is increased and the benchmark function is called again.
b.N increases in a sequence adapted by the benchmark framework, [[https://github.com/golang/go/commit/03a79e94ac72f2425a6da0b399dc2f660cf295b6][up to 1.000.000]]
BenchmarkIsIsogram/01-map-8 200000 8203 ns/op
BenchmarkIsIsogram/02-sorting-8 300000 5534 ns/op
BenchmarkIsIsogram/03-linear-search-8 1000000 1382 ns/op
BenchmarkIsIsogram/04-bool-array-8 3000000 525 ns/op
*Quick*quiz*: why each benchmark run for different number of iterations (200000, 300000, 1000000 and 3000000)?
* Benchmark option -cpu
Change the value of CPUs used
-cpu
Example
go test -v -run=^$ -bench=. -cpu=1,4,8
goos: linux
goarch: amd64
pkg: performance-and-scalability-of-go-applications/01-introduction/isogram/solution
BenchmarkIsIsogram/01-map 200000 6886 ns/op
BenchmarkIsIsogram/01-map-4 200000 7246 ns/op
BenchmarkIsIsogram/01-map-8 200000 7951 ns/op
...
PASS
ok performance-and-scalability-of-go-applications/01-introduction/isogram/solution 19.518s
* Benchmark option -benchtime
By default, Go will run the benchmark, increasing b.N, until it takes more than 1s to complete.
You can change this behaviour using -benchtime option
Examples
- run the benchmark until it hits the 10s threshold
go test -v -run=^$ -bench=. -benchtime=10s
- make exactly 100 iterations
go test -v -run=^$ -bench=. -benchtime=100x
* Benchmark option -count
By default, each benchmark is repeated just once.
You can change this behaviour using -count option
Example
- run each benchmark twice
go test -v -run=^$ -bench=. -count=2
* benchstat
[[https://godoc.org/golang.org/x/perf/cmd/benchstat][benchstat]] is a tool to compute statitics about benchmarks
To download it
go get -u -v golang.org/x/perf/cmd/benchstat
How it works
_For_each_different_benchmark_listed_in_an_input_file,_benchstat_computes_the_mean,_minimum,_and_maximum_run_time,_after_removing_outliers_
It greatly helps to understand the statistical significance of your benchmarks
* benchstat on a single input file
If invoked on a single input file, benchstat prints the per-benchmark statistics for that file
$ go test -v -run=^$ -bench=. -count=10 > isogram.txt
$ benchstat isogram.txt
name time/op
IsIsogram/01-map-8 7.60µs ± 3%
IsIsogram/02-sorting-8 5.34µs ± 1%
IsIsogram/03-linear-search-8 1.31µs ± 2%
IsIsogram/04-bool-array-8 529ns ± 0%
It computes the mean value and gives the variation across the samples
*Lower* variation means *better* stability!
* benchstat on multiple input files
benchstat is very useful to analyze the impact of your changes
$ go test -v -run=^$ -bench=. -count=10 > old.txt
// changes to the code
$ go test -v -run=^$ -bench=. -count=10 > new.txt
$ benchstat old.txt new.txt
name old time/op new time/op delta
IsIsogram/01-map-8 7.52µs ± 2% 7.40µs ± 5% ~ (p=0.105 n=10+10)
IsIsogram/02-sorting-8 5.23µs ± 3% 5.13µs ± 3% -1.95% (p=0.016 n=10+10)
IsIsogram/03-linear-search-8 1.32µs ± 1% 1.32µs ± 1% ~ (p=0.669 n=10+10)
IsIsogram/04-bool-array-8 530ns ± 0% 533ns ± 0% +0.71% (p=0.000 n=9+10)
A ~ means that the difference between the benchmarks is not statistically significant
* Benchmarking allocations
Using the -benchmem option, go will report the heap allocations per operation
$ go test -v -run=^$ -bench=. -benchmem
goos: linux
goarch: amd64
pkg: performance-and-scalability-of-go-applications/01-introduction/isogram/solution
BenchmarkIsIsogram/01-map-8 200000 7458 ns/op 1273 B/op 11 allocs/op
BenchmarkIsIsogram/02-sorting-8 300000 5266 ns/op 928 B/op 25 allocs/op
BenchmarkIsIsogram/03-linear-search-8 1000000 1315 ns/op 48 B/op 3 allocs/op
BenchmarkIsIsogram/04-bool-array-8 3000000 532 ns/op 0 B/op 0 allocs/op
PASS
ok performance-and-scalability-of-go-applications/01-introduction/isogram/solution 6.692s
Note how less allocations means better performance... More on this later!
* Creating a benchmark "gold standard" file
Instead of saving the textual output of a benchmark run, you can save a binary to reproduce the benchmark
$ go test -c -o isogram.bench
This way you will be able to benchmark your code (with whatever set of options you like) for later comparison
$ ./isogram.bench -test.run=^$ -test.bench=. -test.benchmem
* Benchmarking pitfalls
Consider the following benchmark
func NaturalNumbersSum(n int) int {
return n * (n + 1) / 2
}
func BenchmarkNaturalNumbersSum(b *testing.B) {
for i := 0; i < b.N; i++ {
NaturalNumbersSum(i)
}
}
And its results
...
BenchmarkNaturalNumbersSum-8 2000000000 0.30 ns/op
PASS
0.30 ns/op is the time of a single clock cycle...
* Benchmarking pitfalls
Using
go test -v -run=^$ -bench=. -gcflags="-m"
We can see that the call to NaturalNumbersSum is inlined by the compiler...
...
./gauss.go:3:6: can inline NaturalNumbersSum
./gauss_test.go:7:20: inlining call to NaturalNumbersSum
...
* Benchmarking pitfalls
And since the body of the function has no side effects, it is optimized away by the CPU!
"".BenchmarkNaturalNumbersSum STEXT nosplit size=23 args=0x8 locals=0x0
0x0000 00000 (gauss_test.go:5) TEXT "".BenchmarkNaturalNumbersSum(SB), NOSPLIT|ABIInternal, $0-8
...
0x0000 00000 (gauss_test.go:6) MOVQ "".b+8(SP), AX
0x0005 00005 (gauss_test.go:6) XORL CX, CX
0x0007 00007 (gauss_test.go:6) JMP 13
0x0009 00009 (gauss_test.go:6) INCQ CX
0x000c 00012 (gauss_test.go:7) XCHGL AX, AX
0x000d 00013 (gauss_test.go:6) CMPQ 264(AX), CX
0x0014 00020 (gauss_test.go:6) JGT 9
...
0x0016 00022 (<unknown line number>) RET
* Fixing the benchmark
Force the body of the for loop to produce effects visible outside the loop itself
var result int
func BenchmarkNaturalNumbersSum(b *testing.B) {
var r int
for i := 0; i < b.N; i++ {
r = NaturalNumbersSum(i)
}
result = r
}
inspired by [[https://github.com/golang/go/issues/14813#issue-140603392][issue #14183]]
* Fixing the benchmark
"".BenchmarkNaturalNumbersSum STEXT nosplit size=54 args=0x8 locals=0x0
0x0000 00000 (gauss_test.go:7) TEXT "".BenchmarkNaturalNumbersSum(SB), NOSPLIT|ABIIn
ternal, $0-8
...
0x0000 00000 (gauss_test.go:10) MOVQ "".b+8(SP), AX
0x0005 00005 (gauss_test.go:10) XORL CX, CX
0x0007 00007 (gauss_test.go:10) XORL DX, DX
0x0009 00009 (gauss_test.go:10) JMP 37
0x000b 00011 (gauss_test.go:11) XCHGL AX, AX
0x000c 00012 (gauss.go:4) LEAQ 1(CX), BX
0x0010 00016 (gauss.go:4) IMULQ BX, CX
0x0014 00020 (gauss.go:4) MOVQ CX, SI
0x0017 00023 (gauss.go:4) SHRQ $63, CX
0x001b 00027 (gauss.go:4) LEAQ (CX)(SI*1), DX
0x001f 00031 (gauss.go:4) SARQ $1, DX
0x0022 00034 (gauss_test.go:10) MOVQ BX, CX
0x0025 00037 (gauss_test.go:10) CMPQ 264(AX), CX
0x002c 00044 (gauss_test.go:10) JGT 11
...
0x002e 00046 (gauss_test.go:14) MOVQ DX, "".result(SB)
0x0035 00053 (gauss_test.go:15) RET
* Insert and Remove exercise
Follow the instructions inside
03-benchmarking/INSTRUCTIONS.md
Happy coding!
* Insert and Remove exercise solution
.image img/InsertRemove.svg
Predictability of memory access makes slice the winner!
* Insert and Remove exercise solution
Useful things:
- benchstat
- -count
- -benchtime
to check benchmark stability and get more meaningful results for longer execution test cases
* Take home message(s)
- don't store data unnecessarily (more on that later)
- keep data compact (again... more on that later)
- access memory in a predictable manner
- don't make statements about performance without measurements
_"I_emphasize_the_importance_of_cache_effects._In_my_experience,_all_but_true_experts_tend_to_forget_those_when_algorithms_are_discussed."_
Bjarne Stroustrup - from [[https://isocpp.org/blog/2014/06/stroustrup-lists][Are lists evil?]]
* Profiling & Tracing
* Benchmarking vs Profiling
_Benchmarking_ allows us to inspect the performance of a specific piece of code.
_Profiling_ gives us a way to understand where our application is spending the most of the time and using the majority of resources.
* What is a profile?
_A_Profile_is_a_collection_of_stack_traces_showing_the_call_sequences_that_led_to_instances_of_a_particular_event,_such_as_allocation._
* Types os profiles
- CPU
- Heap
- Block
- Mutex
In this workshop we'll focus on CPU profiles and Heap profiles
* How the CPU profiler works
An handler to the `SIGPROF` signal is registered and a timer is started. The timer will count only when the program is actively consuming CPU cycles.
Every 10ms, the signal `SIGPROF` is delivered, and the handler unwinds the stack, registering the stack trace.
* How the Heap profiler works
These profiles don't require the delivery of any signals, the runtime itself will collect samples while allocating memory on the heap.
Not all allocations are sampled, if you like to see every one of them, you should set
runtime.MemProfileRate
to 1 as early as possibile in your program.
* Profiling benchmarks
You can generate profiles directly when launching benchmarks
go test -run=^$ -bench=. -cpuprofile cpu.prof
go test -run=^$ -bench=. -memprofile mem.prof
* Profiling web servers
For web servers, you can use the `net/http/pprof` package to expose an endpoint where you can collect various profiles type
1) Import `pprof` package with a blank import
import _ net/http/pprof
2) If your application is not already running an HTTP server, then start one
http.ListenAndServe(":8080", nil)
You can find an example in
04-profiling/profilingexamples/server
* Profiling whole programs
To profile a whole program you can use the `runtime/pprof` package.
Just add this lines at the start of your `main` function
cpuProfile := flag.String("cpuprofile", "", "write cpu profile to file")
flag.Parse()
if *cpuProfile != "" {
f, err := os.Create(*cpuProfile)
if err != nil {
log.Fatal(err)
}
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()
}
Then launch your program with the flag
-cpuprofile cpu.prof
To save a CPU profile inside `cpu.prof` file
* pprof live demo
To analyse the profile you generated, you can use the Go `pprof` tool, with a textual interface
go tool pprof <profile>
or a web based one
go tool pprof -http=:9090 <profile>
Let's explore both of them with a live demo!
* Tracing
Excerpt from the Go [[https://golang.org/pkg/runtime/trace/][documentation]]
"The execution trace captures a wide range of execution events related to goroutines, syscalls, GC and heap size, processor start/stop, etc.
A precise nanosecond-precision timestamp and a stack trace is captured for most events."
While profiling helps us understand *what* happened, tracing is very useful to understand *when* things happened
* Tracing benchmarks
You can generate execution traces directly when launching benchmarks
go test -run=^$ -bench=. -trace trace.out
* Tracing web servers
1) Import `pprof` package with a blank import
import _ net/http/pprof
2) If your application is not already running an HTTP server, then start one
http.ListenAndServe(":8080", nil)
3) Collect the trace
go tool trace http://localhost:8080/debug/trace?seconds=10 > trace.out
* Tracing whole programs
To trace a whole program you can use the `runtime/trace` package.
Just add this lines at the start of your `main` function
traceFile := flag.String("trace", "", "trace file name")
flag.Parse()
if *traceFile != "" {
f, err := os.Create(*traceFile)
// check and handle error
if err := trace.Start(f); err != nil {
log.Fatalf("could not start trace: %v", err)
}
defer trace.Stop()
}
Then launch your program with the flag
-trace trace.out
To save an execution trace inside `trace.out` file
* Tracing cheatsheet
To analyze a trace
go tool trace <trace>
- use `?` to show Tracing Help
- navigate using `W`, `S`, `A` and `D` keys
- hold `Shift` to drag a box with your mouse pointer to ease selection of tiny events
*WARNING*: It can be viewed *only* using the Chrome web browser... may our RAM forgive us!
Let's explore it with a live demo!
* Run Length Encoding exercise
Follow the instructions inside
04-profiling/INSTRUCTIONS.md
Happy coding!
* Run Length Encoding exercise solution
The program opens an input file, read all its content one byte at a time to encode or decode it, then write the results in an output file.
* Run Length Encoding exercise solution
As a reference, let's compare runlengthencoding with the gzip/gunzip utilities.
$ time gzip divina-commedia.txt
real 0m0,056s
user 0m0,052s
sys 0m0,004s
$ time gunzip divina-commedia.txt.gz
real 0m0,043s
user 0m0,006s
sys 0m0,006s
$ time ./runlengthencoding e divina-commedia.txt
real 0m1,617s
user 0m0,568s
sys 0m1,065s
$ time ./runlengthencoding d encoded.rle
real 0m2,075s
user 0m0,845s
sys 0m1,249s
Lot of room for improvement!
* Run Length Encoding exercise solution
Let's inspect this code adding a `-cpuprofile` option to store the profile in a file
./runlengthencoding -cpuprofile cpu.prof e divina-commedia.txt
Now, we can inspect this profile with:
go tool pprof -http=:8080 cpu.prof
What can you see from this profile?
* Run Length Encoding exercise solution
.image img/RlePprofCpu.png _ 500
* Run Length Encoding exercise solution
We are spending the majority of the time doing syscalls.
A read syscall for very few bytes in memory can cost us [[http://arkanis.de/weblog/2017-01-05-measurements-of-system-call-performance-and-overhead][from ~50 to ~100ns]] (and Meltdown/Spectre mitigation patches are [[https://www.zdnet.com/article/linux-meltdown-patch-up-to-800-percent-cpu-overhead-netflix-tests-show/][making it worse]])
We are using unbufferead read and write, 1 byte at a time.
For the encoding, the total number of syscalls equals the number of encoded bytes.
* Run Length Encoding exercise solution
To solve the problem, we can simply buffer our read operations using [[https://golang.org/pkg/bufio/#NewReader][bufio.NewReader]]
$ time ./solution e divina-commedia.txt
real 0m0,020s
user 0m0,017s
sys 0m0,004s
$ time ./solution d encoded.rle
real 0m0,024s
user 0m0,020s
sys 0m0,005s
Here, you should note how the sys time has dropped down!
* Parallel quicksort exercise solution
Live demo
* The Go scheduler
* Requirements & constraints
- lightweight (up to 1 million goroutines per process)
- parallel and scalable
- simple (minimal API with just a single keyword)
go func() {
// ...
}()
- efficient handling of IO calls and syscalls
- one goroutine per connection model, without callbacks
* Why a new scheduler?
- OS thread context switch latency: from ~1500 to ~2000ns
- Goroutines context switch latency: from ~150 to ~200ns
Tested on my laptop, using Linux 4.19 and Go 1.13
More info about the benchmarks [[https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads][here]]
Considering a medium frequency of 3 GHz CPU, and an _Istructions_Per_Cycle_ of ~4, we can execute ~12 instructions per ns
- OS thread context switch cost ranges from ~18000 to ~24000 instructions
- Goroutines context switch cost ranges from ~1800 to ~2400 instructions
A ten times improvement!
* M:P:N scheduler
Based on a M:N threading models, where we *multiplex* N goroutines onto M OS threads
- *M* stands for _machine_ and represents the OS thread
- *P* stands for _processor_ and is an execution context for Go code
- *G* stands for _goroutine_
*P* holds the list of runnable goroutines and other Go execution contexts caches
*G* holds the Instruction Pointer and the goroutine stack
*To*execute*a*G,*a*M*needs*to*hold*a*P*
* Scheduler steady state
.image img/SchedulerSteadyState.png 450 _
At startup, the Go runtime creates *GOMAXPROCS* Ps, where *GOMAXPROCS* is set to the number of available cores
* Cooperative scheduling