%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Master Thesis
% Ralf Krauth
% April 2021
%
% License:
% CC-BY-SA 4.0 -- Creative Commons Attribution-ShareAlike 4.0 International
% https://creativecommons.org/licenses/by-sa/4.0/legalcode
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Methods}
In the following sections, the technical details for the methods used within the thesis will be discussed.
First, input data and data preprocessing will be described, followed by an explanation of the quality metrics used to assess the predictions
of Hi-C matrices made throughout the thesis.
Next, \cref{sec:methods:dnn,sec:methods:hicgan} give details on the two
different approaches taken in this thesis to advance the state of the art in predicting Hi-C matrices.
Finally, some special methods were employed to compare the results of this thesis to
other approaches in the field, which will be described in \cref{sec:methods:comparison}.
\subsection{Input data} \label{sec:methods:input_data}
For the thesis at hand, data from human cell lines GM12878 and K562 was used.
The exact data sources and data preprocessing for the Hi-C matrices and ChIP-seq data
will be outlined below.
\emph{Hi-C data} generated by Rao et al. \cite{Rao2014} was downloaded
in hic file format from Gene Expression Omnibus under accession key GSE63525.
Here, the quality-filtered ``combined\_30'' matrices were chosen, which contain only high-quality reads from
both replicates.
Next, matrices at \SI{5}{\kilo\bp} bin size were extracted and converted to cooler format using \texttt{hic2cool}
and subsequently coarsened to resolutions of 10, 25 and \SI{50}{\kilo\bp} using \texttt{cooler coarsen},
see \cref{list:methods:hic2cool} (p.~\pageref{list:methods:hic2cool}).
Contrary to the work by Farr\'e et al. \cite{Farre2018a}, which uses ICE plus distance-based normalization,
and many others in the field that use ICE or KR normalization,
these matrices have not been normalized for the thesis at hand
because no benefit of doing so was found during the study project \cite{Krauth2020}.
Within this thesis, \emph{ChIP-seq} data for 13 chromatin features plus DNaseI-seq data was used, cf. \cref{tab:methods:csdata}.
To this end, read alignments for replicates 1 and 2 were downloaded in bam format either via the \acrshort{encode} project \cite{Encode2012,Davis2017}
or directly from the file server of the University of California, Santa Cruz (UCSC), in their most recent versions (where applicable).
The UCSC- (wgEncode...) or GEO- (GSM...) identifiers are given in \cref{tab:methods:csdata}.
For convenience, the pdf version of this document also provides download links for all 28 files in \cref{sec:chromFeat_download_links}.
\begin{table}[ht!]
\centering
\begin{tabular}{lll}
\hline
\multicolumn{1}{c}{\multirow{2}{*}{\textbf{feature name}}} & \multicolumn{2}{l}{\hspace*{18mm}\textbf{cell line}} \\
\multicolumn{1}{c}{} & GM12878 & K562 \\ \hline
CTCF & GSM733752 & GSM733719\\
DNaseI & wgEncodeEH000534 & wgEncodeEH000530\\
H3k27ac & GSM733771 & GSM733656\\
H3k27me3 &GSM733758 & GSM733658\\
H3k36me3 &GSM733679 & GSM733714\\
H3k4me1 &GSM733772 & GSM733692\\
H3k4me2 &GSM733769 & GSM733651\\
H3k4me3 &GSM733708 & GSM733680\\
H3k79me2 &GSM733736 & GSM733653 \\
H3k9ac &GSM733677 & GSM733778\\
H3k9me3 &GSM733664 & GSM733776 \\
H4k20me1 &GSM733642 & GSM733675\\
Rad21 & wgEncodeEH000749 & wgEncodeEH000649\\
Smc3 & wgEncodeEH001833 & wgEncodeEH001845\\ \hline
\end{tabular}
\caption{Chromatin features used for the thesis} \label{tab:methods:csdata}
\end{table}
After downloading, the bam files were converted to bigwig format, which was found more convenient to handle, and the replicates were merged into
one bigwig file by taking the mean, as in the study project \cite{Krauth2020}.
Pseudocode for the full conversion process is given in \cref{list:methods:bamtobigwig} (p.~\pageref{list:methods:bamtobigwig}).
The choice of chromatin features is largely in line with the work by Zhang et al. \cite{Zhang2019};
besides structural proteins like CTCF and the cohesin subunits RAD21 and SMC3, markers of active and repressed chromatin are used as well.
\subsection{Quality metrics for predicted Hi-C matrices} \label{sec:methods:metrics}
Assessing the quality of synthetically generated images is a long-standing problem, which has not yet been solved, see e.\,g. \cite[p.\,19]{Wang2020}.
Within this thesis, distance-stratified Pearson correlations were computed between predicted and real matrices to measure the quality of the predictions.
While the suitability of such correlations as a quality metric for Hi-C matrices seems debatable in general \cite{Yang2017},
distance-stratified Pearson correlation is quite common in the field and thus useful at least for comparisons with other approaches.
To compute the correlations, the $w$ bins at the left and right boundaries of the investigated matrices were first removed, because these never contain
valid values due to the chosen sample generation approach,
cf.~\cref{sec:methods:sample_gen}, \cref{fig:methods:prediction}.
Next, the remaining values were grouped into vectors according to their genomic distance $d$, up to the maximum distance $w$,
and then Pearson correlations $\rho$ were computed separately for each vector as usual, \cref{eq:pearson}.
Here, $\mathit{true}|_d$ and $\mathit{pred}|_d$ are the true and predicted values restricted to bin distance $d$, while
$\sigma(\mathit{true}|_d)$ and $\sigma(\mathit{pred}|_d)$ are the respective standard deviations and $cov(\cdot,\cdot)$ is the covariance.
\begin{equation}
\rho(\mathit{true}|_d, \; \mathit{pred}|_d) = \frac{cov(\mathit{true}|_d, \mathit{pred}|_d)}{\sigma(\mathit{true}|_d)\; \sigma(\mathit{pred}|_d)} \label{eq:pearson}
\end{equation}
Since a numerically stable and efficient implementation for computing Pearson correlation is non-trivial, the corresponding function from pandas was used
for actual calculations \cite{pandasPearson}.
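For illustration, the computation can be sketched as follows; the function below is a minimal numpy/pandas version operating on dense, symmetric matrices and is not the exact implementation used for the thesis.
\begin{verbatim}
import numpy as np
import pandas as pd

def distance_stratified_pearson(true_mat, pred_mat, w):
    """Pearson correlation per genomic distance d = 0..w for two
    dense, symmetric Hi-C matrices of equal shape."""
    # drop the w boundary bins that never contain valid predictions
    true_mat = true_mat[w:-w, w:-w]
    pred_mat = pred_mat[w:-w, w:-w]
    correlations = []
    for d in range(w + 1):
        # the d-th side diagonal holds all pairs at bin distance d
        t = pd.Series(np.diagonal(true_mat, offset=d))
        p = pd.Series(np.diagonal(pred_mat, offset=d))
        # pandas handles the numerics; returns NaN if a std. dev. is zero
        correlations.append(t.corr(p, method="pearson"))
    return np.array(correlations)
\end{verbatim}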
An obvious disadvantage of distance-stratified Pearson correlations is that they are undefined whenever one of the matrices holds a constant value at a given distance,
and thus has zero standard deviation, e.\,g. $\sigma(\mathit{pred}|_{d^*})=0$ for a certain distance $d^*$.
In addition to the correlations themselves, the \acrfull{auc} was computed for all curves in each plot
to obtain a single-valued, albeit very abstract quality metric.
In line with Zhang et al. \cite{Zhang2019}, the trapezoidal integration method was used for numerical area computations
and the values were divided by the maximum distance $d_\mathit{max} = w$ to normalize them to value range [-1...1],
allowing for comparisons across experiments.
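The normalized area under such a correlation curve can then be computed as sketched below, with numpy's trapezoidal rule standing in for the integration method mentioned above.
\begin{verbatim}
import numpy as np

def normalized_auc(correlations):
    """Area under the distance-stratified correlation curve, computed
    with the trapezoidal rule and divided by the maximum distance
    d_max = w, so that the result lies in [-1...1]."""
    distances = np.arange(len(correlations))
    d_max = distances[-1]
    return np.trapz(correlations, distances) / d_max
\end{verbatim}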
Furthermore, most Pearson correlation plots show the correlation between the ``true'' training matrix and the corresponding target matrix as a baseline,
i.\,e. the Pearson correlation that is obtained when simply using the training matrix as a ``prediction'' for a given target matrix.
In computer graphics, other similarity metrics like the \acrfull{ssim}, \acrfull{psnr} or, especially for images generated by \acrshort{gan}s,
\acrfull{fid} seem more common than Pearson correlation \cite[p.\,19\,ff.]{Wang2020}.
However, their suitability for plots of Hi-C matrices seems not to have been investigated so far.
Unlike Pearson correlation, \acrshort{ssim} and \acrshort{psnr} both depend on absolute values in the matrices $\mathbf{M}$ to be compared,
so that in general for example $\mathit{PSNR}(\mathbf{M}) \not=\mathit{PSNR}(k\cdot \mathbf{M})$ for $k\in\mathbb{R}^{>1}$.
This is likely not optimal for a use case where structures are more interesting than actual values.
Computing \acrshort{fid} would require reshaping outputs into images of shape (299, 299, 3) to meet Inception v3 specifications \cite{Szegedy2016}.
This is a non-trivial task for the given matrix plots,
where the longer edge -- defined by $\mathit{chromsize}/\mathit{binsize}$ -- is usually much longer than 299 bins
and the shorter edge -- defined by the window size $w$ -- is usually shorter than 299 bins, or pixels, respectively.
While computing the \acrshort{fid} for individual $(w,w)$-submatrices would be possible,
generalizing such results to the whole matrix seems to be without precedent in the literature so far.
Since relying on one metric alone seemed problematic, selected areas of the test set were plotted against the true
Hi-C matrices in these areas, see \cref{sec:methods:matrix_plots} below.
Note that visual comparison of predicted Hi-C matrices and true Hi-C matrices may be affected by
the value range of the matrices (and the color map used to visualize them), while the Pearson correlation is generally not affected by different
value ranges.
\subsection{Matrix plots} \label{sec:methods:matrix_plots}
To allow for a visual comparison, selected areas of the predicted matrices were plotted against their true counterparts using pygenometracks \cite{LopezDelisle2020}.
Since plotting the full test set every time was infeasible due to the number of figures required,
three exemplary regions (a) to (c) were selected: (a) chr21:30-\SI{40}{\mega\bp}; (b) chr19:30-\SI{40}{\mega\bp} and (c) chr3:30-\SI{40}{\mega\bp}.
These regions were chosen because they feature both small and large \acrshort{tad}s as well as nested \acrshort{tad}s and subregions with very little chromatin feature signal values.
In general, most matrix plots additionally feature the sign of the first eigenvector in a purple track, named ``binarized PCA'' (values +1 or -1)
and the summarized value of all 14 chromatin features in a green track, named ``feature sum'', see e.g. \cref{fig:results:basic500}.
Feature sum was computed by binning all chromatin features at \SI{25}{\kilo\bp} and
summing up the values across all features for each bin.
Eigenvectors were computed using \texttt{hicPCA} \cite{Wolff2018} and then the sign for each \SI{25}{\kilo\bp} bin was extracted using
a simple python script. Finally, eigenvectors and feature sum were written out in bedgraph format, which pygenometracks can plot natively.
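The sign extraction can be sketched as follows, assuming \texttt{hicPCA} was run with bedgraph output; the file names are illustrative and zero eigenvector entries are mapped to $+1$ here by convention.
\begin{verbatim}
import numpy as np
import pandas as pd

# first eigenvector at 25 kbp resolution, written by hicPCA in bedgraph format
pca = pd.read_csv("pca1_25kb.bedgraph", sep="\t", header=None,
                  names=["chrom", "start", "end", "value"])
# binarize to +1 / -1; zeros are arbitrarily mapped to +1
pca["value"] = np.sign(pca["value"]).replace(0, 1).astype(int)
pca.to_csv("pca1_sign_25kb.bedgraph", sep="\t", header=False, index=False)
\end{verbatim}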
All tracks were plotted with an original resolution of 200\,dpi and a diagram width of \SI{150}{\mm} and then scaled to match the text width.
For the interaction counts in the Hi-C matrix plots, ``log1p'' transformation was applied and the true matrices were inverted,
i.\,e. reflected at the diagonal. Within this thesis, true (target) matrices are thus always shown at the bottom, while predicted matrices are always shown at the top
of each matrix plot (where applicable).
Except for comparisons with the approach by Farr\'e et al., see \cref{sec:methods:comparison},
no minimum or maximum values were set for plotting, but note that predicted matrices were scaled to value range 0...1000
before they were stored, cf. \cref{sec:methods:sample_gen}.
\subsection{Dense neural network approach} \label{sec:methods:dnn}
In the following sections, the dense neural network approach inspired by Farr\'e et al. \cite{Farre2018a}
will be discussed in detail.
First, the sample generation process will be explained in \cref{sec:methods:sample_gen,sec:methods:inputBinning},
followed by the basic network setup and its subsequent modifications in sections \ref{sec:methods:basicSetup} to \ref{sec:methods:score_loss}.
\subsubsection{Sample generation process for the dense neural network} \label{sec:methods:sample_gen}
With regard to sample generation, this thesis widely follows the sliding window approach proposed by Farr\'e et al. \cite{Farre2018a},
yet with an important difference, as will be outlined in the following paragraphs.
First, all chromatin features were binned to bin size $b_\mathit{feat}$ by splitting each chromosome of size $cs$ into
$l_\mathit{feat}=\left \lceil{\frac{cs}{b_\mathit{feat}}}\right \rceil$ non-overlapping bins of size $b_\mathit{feat}$
and taking the mean feature signal value within the genomic region belonging to each bin.
All $n$ chromatin factors were processed in this way and then stacked into a $(l_\mathit{feat},n)$-shaped array.
Last, all $n$ columns were individually scaled to value range 0...1.
The number of chromatin features was constant for all investigations within this thesis, $n=14$ (cf.~\cref{sec:methods:input_data}),
except for the comparisons with the method by Farr\'e et al., where 49 chromatin features from D. melanogaster were used, cf. \cref{sec:methods:comparison}.
It should be noted that the way of feature binning described in the paragraph above constitutes a major difference to the proposal by Farr\'e et al. \cite[p.~9]{Farre2018a}.
In their work, values are assigned to each bin according to the percentage of basepairs in the respective bin covered by statistically significant enrichments of the respective feature,
where a value of one means 100\% coverage and a value of zero means the feature is not present in the respective bin at all.
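For reference, the mean-signal binning and per-feature scaling used in this thesis can be sketched with pyBigWig and numpy as shown below; the handling of missing signal (set to zero) and the exact min-max scaling are assumptions.
\begin{verbatim}
import numpy as np
import pyBigWig

def bin_feature(bigwig_path, chrom, bin_size):
    """Mean signal value per non-overlapping bin of size bin_size."""
    bw = pyBigWig.open(bigwig_path)
    chrom_size = bw.chroms(chrom)
    n_bins = int(np.ceil(chrom_size / bin_size))
    # stats() returns one mean per bin; bins without data yield None
    values = bw.stats(chrom, 0, chrom_size, type="mean", nBins=n_bins)
    bw.close()
    return np.array([v if v is not None else 0.0 for v in values])

def build_feature_array(bigwig_paths, chrom, bin_size):
    """Stack all n features into an (l_feat, n) array and scale each
    column individually to the value range 0...1."""
    cols = [bin_feature(p, chrom, bin_size) for p in bigwig_paths]
    arr = np.stack(cols, axis=1)
    col_min = arr.min(axis=0)
    col_range = arr.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # avoid division by zero for empty columns
    return (arr - col_min) / col_range
\end{verbatim}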
Separate Hi-C matrices for each chromosome were derived from the cooler format as $(l_\mathit{mat}, l_\mathit{mat})$-shaped matrices,
$l_\mathit{mat}=\left \lceil{\frac{cs}{b_\mathit{mat}}}\right \rceil$ being the number of bins in the given chromosome.
Initially, $b_\mathit{feat} = b_\mathit{mat}$ was used, which leads to $l_\mathit{feat} = l_\mathit{mat}$, because $cs$ is a constant for a given chromosome.
While this is also a different process compared to the method used by Farr\'e et al. \cite[p.~8\,f.]{Farre2018a}, it should lead to the same outcome --
apart from the fact that non-normalized matrices have been used here, cf. \cref{sec:methods:input_data}.
The same sliding window approach as proposed by Farr\'e et al. \cite{Farre2018a} was then applied to generate training samples
from the chromatin feature array and the Hi-C matrices for usage with the neural networks $G$ described below.
Here, subarrays of size $(3w, n)$ were cut out from the feature array
such that the $i$-th training sample corresponded to the subarray containing the rows $i,\,i+1,\,i+2,\,\dots,\,i+3w-1$ of the full array.
Sliding the window along the array with step size one obviously yields $N=l-3w+1$ training samples.
The corresponding Hi-C matrices were then cut out along the diagonal of the original matrix
as submatrices with corner indices [$j,\;j$], [$j,\;j+w$], [$j+w,\;j+w$], [$j+w,\;j$] in clockwise order, where $j=i+w$.
The idea here is that the first $w$ rows $1,\,2,\,\dots,\,w$ of each feature sample form the left flanking region of the training matrix,
the next rows $w+1,\,w+2,\,\dots,\,2w$ correspond to the matrix region and the last rows $2w+1,\,2w+2,\,\dots,\,3w$ form the right flanking region.
Since Hi-C matrices are symmetric by definition, only the upper triangular part of the submatrices was used,
flattened into a vector of length $w\cdot (w+1)/2$.
\Cref{fig:methods:sample_gen} exemplarily shows the sample generation process for a matrix of shape (16,16) with $w=4$ and $n=3$ chromatin features.
In this case, five training samples would be generated -- the one encircled in green and four more to the right, as the window is sliding from left to right until the
right flanking region hits the feature array boundary.
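A minimal numpy sketch of this sample generation is given below; it assumes the feature array and the Hi-C matrix are already binned at the same bin size ($l = l_\mathit{feat} = l_\mathit{mat}$) and omits all file handling.
\begin{verbatim}
import numpy as np

def generate_samples(feature_array, hic_matrix, w):
    """Sliding-window samples: (3w, n) feature windows plus the
    corresponding (w, w) submatrices, flattened to their upper
    triangular part (length w*(w+1)/2)."""
    l = feature_array.shape[0]
    triu_idx = np.triu_indices(w)          # upper triangle incl. diagonal
    features, matrices = [], []
    for i in range(l - 3 * w + 1):         # N = l - 3w + 1 samples
        features.append(feature_array[i:i + 3 * w, :])
        j = i + w                          # matrix region follows the left flank
        sub = hic_matrix[j:j + w, j:j + w]
        matrices.append(sub[triu_idx])
    return np.stack(features), np.stack(matrices)
\end{verbatim}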
\begin{figure}[hbp]
\begin{minipage}{0.65\textwidth}
\resizebox{\textwidth}{!}{
\small
\import{figures/}{sample_generation_process.pdf_tex}}
\caption{Sample generation process}
\label{fig:methods:sample_gen}
\end{minipage}\hfill
\begin{minipage}{0.3\textwidth}
\scriptsize
\begin{enumerate}[label=\Alph*:,leftmargin=*]
\raggedright
\item left flanking region
\item matrix region
\item right flanking region
\item chromatin feature array
\item sliding direction for sample generation windows
\item first 4 diagonals of a $(16, 16)$ Hi-C matrix, rotated \SI{45}{\deg} counterclockwise
\item first $(4, 4)$ training matrix, corresponding to feature window A-B-C
\end{enumerate}
\end{minipage}
\end{figure}
The sample generation process for predicting (unknown) matrices was the same as for training,
except that no matrix window was generated.
Due to the sliding window approach, the output of the network is a set of overlapping submatrices along the main diagonal of the actual target Hi-C matrix.
To generate the final matrix, all predicted submatrices were added up in a position-aware fashion
and finally all values were divided by the number of overlapping predictions at their respective position.
\Cref{fig:methods:prediction} exemplarily shows the prediction process for $N=5$ samples with window size $w=4$ for a matrix of shape (16,16).
Note that the first and last $w$ bins in each row (matrix diagonal) always remain empty due to the flanking regions,
as do all bins outside the main diagonal and the first $w-1$ side diagonals.
To improve visualization, all predicted matrices were scaled to value range 0...1000 after re-assembly and stored in cooler format for further processing.
Conveniently, cooler also supports storing only the upper triangular part of symmetric matrices, minimizing conversion effort for the data at hand.
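The position-aware re-assembly can be sketched as follows; the subsequent scaling to value range 0...1000 and the export to cooler format are omitted.
\begin{verbatim}
import numpy as np

def assemble_prediction(submatrices, w, l):
    """Average N overlapping (w, w) predictions along the diagonal of an
    (l, l) matrix; submatrix s covers rows/columns s+w ... s+2w-1."""
    matrix = np.zeros((l, l))
    counts = np.zeros((l, l))
    for s, sub in enumerate(submatrices):
        j = s + w                          # skip the left flanking region
        matrix[j:j + w, j:j + w] += sub
        counts[j:j + w, j:j + w] += 1
    counts[counts == 0] = 1                # positions never predicted stay zero
    return matrix / counts
\end{verbatim}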
\begin{figure}
\begin{minipage}{0.65\textwidth}
\resizebox{\textwidth}{!}{
\small
\import{figures/}{prediction_process2.pdf_tex}}
\caption{Prediction process}
\label{fig:methods:prediction}
\end{minipage}\hfill
\begin{minipage}{0.3\textwidth}
\scriptsize
\begin{enumerate}[label=\Alph*:,leftmargin=*]
\raggedright
\item left flanking region
\item matrix region
\item right flanking region
\item chromatin feature array
\item sliding direction for predicted submatrices
\item first 4 diagonals of a predicted $(16, 16)$ Hi-C matrix, rotated \SI{45}{\deg} counterclockwise
\item first predicted $(4, 4)$ matrix
\end{enumerate}
\raggedright{gray background colors symbolize number of overlapping predictions at that position (bright = 1, dark = 4)}
\end{minipage}
\end{figure}
Generally, training samples were drawn from chromosomes 1, 2, 4, 7, 9, 11, 13, 14, 16, 17, 18, 20 and 22 of cell line GM12878,
validation samples from chromosomes 6, 8, 12 and 15 of cell line GM12878 and test samples from chromosomes 3, 5, 10, 19 and 21 of cell line K562.
This approximately implements a 60:20:20 train:validation:test split, see \cref{tab:methods:samples} for details.
Reversing training/validation and test cell lines, which differ strongly in sequencing depth within the \mbox{Hi-C} experiments,
was not tested for the \acrshort{dnn}, but briefly for the \acrshort{cgan} approach and seemed to work, see \cref{sec:appendix:reverseTrainTest}.
For comparability reasons, a \acrshort{dnn} and a \acrshort{cgan} were also trained and tested on data from \emph{Drosophila melanogaster} embryonic cells, refer to \cref{sec:methods:comparison}
for details.
Note that window size $w=80$ is likely not suitable for bin sizes larger than \SI{50}{\kilo\bp}, because the number of training samples becomes too small
to train a network with more than seven million parameters, cf.~\cref{tab:methods:samples,sec:methods:basicSetup}.
\begin{table}[hbp]
\centering
\begin{tabular}{lrrrrrrr}
\hline
\multicolumn{1}{c}{\multirow{2}{*}{\textbf{chrom.}}} & \multicolumn{1}{c}{\multirow{2}{*}{\textbf{length/bp}}} & \multicolumn{6}{c}{\textbf{samples at $\mathbf{w=80}$ and bin size...}} \\
\multicolumn{1}{c}{} & \multicolumn{1}{c}{} & \multicolumn{1}{c}{5k} & \multicolumn{1}{c}{10k} & \multicolumn{1}{c}{25k} & \multicolumn{1}{c}{50k} & \multicolumn{1}{c}{250k} & \multicolumn{1}{c}{500k} \\ \hline
1 & 249250621 & 49612 & 24687 & 9732 & 4747 & 759 & 260 \\
2 & 243199373 & 48401 & 24081 & 9489 & 4625 & 734 & 248 \\
4 & 191154276 & 37992 & 18877 & 7408 & 3585 & 526 & 144 \\
7 & 159138663 & 31589 & 15675 & 6127 & 2944 & 398 & 80 \\
9 & 141213431 & 28004 & 13883 & 5410 & 2586 & 326 & 44 \\
11 & 135006516 & 26763 & 13262 & 5162 & 2462 & 302 & 32 \\
13 & 115169878 & 22795 & 11278 & 4368 & 2065 & 222 & \\
14 & 107349540 & 21231 & 10496 & 4055 & 1908 & 191 & \\
16 & 90354753 & 17832 & 8797 & 3376 & 1569 & 123 & \\
17 & 81195210 & 16001 & 7881 & 3009 & 1385 & 86 & \\
18 & 78077248 & 15377 & 7569 & 2885 & 1323 & 74 & \\
20 & 63025520 & 12367 & 6064 & 2283 & 1022 & 14 & \\
22 & 51304566 & 10022 & 4892 & 1814 & 788 & & \\
\multicolumn{2}{r}{$\mathbf{\sum}$ \textbf{train samples}} & \textbf{337986} & \textbf{167442} & \textbf{65118} & \textbf{31009} & \textbf{3755} & \textbf{808} \\
& & & & & & & \\
6 & 171115067 & 33985 & 16873 & 6606 & 3184 & 446 & 104 \\
8 & 146364022 & 29034 & 14398 & 5616 & 2689 & 347 & 54 \\
12 & 133851895 & 26532 & 13147 & 5116 & 2439 & 297 & 29 \\
15 & 102531392 & 20268 & 10015 & 3863 & 1812 & 172 & \\
\multicolumn{2}{r}{$\mathbf{\sum}$ \textbf{valid. samples}} & \textbf{109819} & \textbf{54433} & \textbf{21201} & \textbf{10124} & \textbf{1262} & \textbf{187} \\
& & & & & & & \\
3 & 198022430 & 39366 & 19564 & 7682 & 3722 & 554 & 158 \\
5 & 180915260 & 35945 & 17853 & 6998 & 3380 & 485 & 123 \\
10 & 135534747 & 26868 & 13315 & 5183 & 2472 & 304 & 33 \\
19 & 59128983 & 11587 & 5674 & 2127 & 944 & & \\
21 & 48129895 & 9387 & 4574 & 1687 & 724 & & \\
\multicolumn{2}{r}{$\mathbf{\sum}$ \textbf{test samples}} & \textbf{123153} & \textbf{60980} & \textbf{23677} & \textbf{11242} & \textbf{1343} & \textbf{314} \\
& & \multicolumn{1}{l}{} & \multicolumn{1}{l}{} & \multicolumn{1}{l}{} & \multicolumn{1}{l}{} & \multicolumn{1}{l}{} & \multicolumn{1}{l}{} \\
\multicolumn{2}{r}{$\mathbf{\sum}$ \textbf{total samples}} & \textbf{570958} & \textbf{282855} & \textbf{109996} & \textbf{52375} & \textbf{6360} & \textbf{1309} \\ \hline
\end{tabular}
\caption{Training, validation and test samples for sliding window approach} \label{tab:methods:samples}
\end{table}
\subsubsection{Generalization of feature binning} \label{sec:methods:inputBinning}
Most of the binning- and sample generation procedures described above
also work for bin size relations $k=\frac{b_\mathit{mat}}{b_\mathit{feat}} \in \mathbb{N}^{>1}$.
To this end, the training matrices remained unchanged, i.\,e. $(l_\mathit{mat}, l_\mathit{mat})$-shaped arrays, from which training submatrices of size $(w_\mathit{mat}, w_\mathit{mat})$
were extracted.
With $k \in \mathbb{N}^{>1}$, one bin on the matrix diagonal corresponded to $k$ bins of the feature array,
so the feature window size needed to be $k$ times the submatrix window size, $w_\mathit{feat} = k \cdot w_\mathit{mat}$.
Since the first layer of all dense neural networks used in this thesis was a 1D convolution, cf. \cref{sec:methods:basicSetup} (\cref{fig:methods:basic_dnn}),
this was achieved by setting the filter width and strides parameters of the (first) convolutional layer to $k$, leaving the rest of the network unchanged,
cf. \cref{sec:methods:variants}.
However, the number of bins along the matrix diagonal is generally not $k$ times the number of bins in the feature array,
see \cref{eq:methods:binning}.
\begin{equation}
l_\mathit{feat} = \left \lceil{\frac{cs}{b_\mathit{feat}}}\right \rceil
= \left \lceil{\frac{cs}{k \cdot b_\mathit{mat}}}\right \rceil
\not = k \cdot \left \lceil{\frac{cs}{ b_\mathit{mat}}}\right \rceil
= k \cdot l_\mathit{mat}
\mathrm{\quad for\;most\;} b_{feat}, b_{mat}, cs \in \mathbb{N} \label{eq:methods:binning}
\end{equation}
For the training process, this discrepancy was resolved by simply dropping the last training sample,
if the feature window belonging to it had missing bins.
For the prediction process, the feature array was padded with zeros on its end.
This procedure ensured that no errors were introduced into the \emph{training} process by imputing values,
but kept the size of the \emph{predicted} matrix consistent with the actual chromosome size.
\Cref{fig:methods:sample_gen_generalized} shows the generalized training process with a $(16, 16)$-training matrix and $k=2$.
If $15\cdot b_\mathit{mat} + 1 \leq cs \leq 15\cdot b_\mathit{mat} + b_\mathit{feat}$ holds for the chromosome size $cs$, as in the example,
then the number of bins on the matrix diagonal would be $l_\mathit{mat}=16$, while the number of chromatin feature bins would be $l_\mathit{feat}=31 \not = 2 \cdot l_\mathit{mat}$.
\begin{figure}
\begin{minipage}{0.65\textwidth}
\resizebox{\textwidth}{!}{
\small
\import{figures/}{sample_generation_process_generalized.pdf_tex}}
\caption{Generalized sample generation process f. $k=2$}
\label{fig:methods:sample_gen_generalized}
\end{minipage}\hfill
\begin{minipage}{0.3\textwidth}
\scriptsize
\begin{enumerate}[label=\Alph*:,leftmargin=*]
\raggedright
\item left flanking region
\item matrix region
\item right flanking region
\item chromatin features array ($k=2$)
\item sliding direction for sample generation windows
\item first 4 diagonals of a $(16, 16)$ Hi-C matrix, rotated \SI{45}{\deg} counterclockwise
\item last $(4, 4)$ training matrix, corresponding to feature window A-B-C; to be dropped/padded
\end{enumerate}
\end{minipage}
\end{figure}
In this case, the 5th sample would be dropped for training,
while a column with zero bins -- symbolized by pink crosses in \cref{fig:methods:sample_gen_generalized} -- would be added to the feature array for prediction
so that the resulting matrix would still have the desired size of $(16, 16)$.
\subsubsection{Basic setup} \label{sec:methods:basicSetup}
The basic setup for the dense neural network approach closely follows the proposal by Farr\'e et al. \cite{Farre2018a},
so a window size of $w=80$ and $b_\mathit{feat}=b_\mathit{mat} = \SI{10}{\kilo\bp}$ was initially used.
With these parameters, the network has a total of \SI{7486436}{} trainable weights in five trainable layers;
one convolutional layer and four densely connected layers, \cref{fig:methods:basic_dnn}.
For brevity, the sigmoid activation after the 1D convolution and the \acrshort{relu} activations after the Dense layers are not shown.
The dense layers all use an ``L2'' kernel regularizer and all dropout layers have a dropout probability of $10\%$.
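A Keras sketch of this basic architecture is shown below; the hidden-layer widths, the placement of the dropout layers and the regularization factor are placeholders, the actual configuration follows \cref{fig:methods:basic_dnn}.
\begin{verbatim}
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_basic_dnn(w=80, n=14, k=1, hidden=(2000, 1000, 1000), l2=0.01):
    """Basic DNN: one 1D convolution plus four dense layers, predicting
    the upper triangle of a (w, w) Hi-C submatrix.
    'hidden' and 'l2' are placeholders, not the thesis values."""
    inputs = layers.Input(shape=(3 * w * k, n))
    # kernel size = strides = k compresses k feature bins per matrix bin
    x = layers.Conv1D(filters=1, kernel_size=k, strides=k,
                      activation="sigmoid")(inputs)
    x = layers.Flatten()(x)
    for units in hidden:
        x = layers.Dropout(0.1)(x)
        x = layers.Dense(units, activation="relu",
                         kernel_regularizer=regularizers.l2(l2))(x)
    x = layers.Dropout(0.1)(x)
    outputs = layers.Dense(w * (w + 1) // 2, activation="relu",
                           kernel_regularizer=regularizers.l2(l2))(x)
    return tf.keras.Model(inputs, outputs)
\end{verbatim}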
\begin{figure}[htb]
\small
\centering
\import{figures/}{DNN_basic_model.pdf_tex}
\caption{Basic dense neural network with generalized feature binning}
\label{fig:methods:basic_dnn}
\end{figure}
The training goal for the neural network $G$ is to find weights $\bm{\omega}^*$ for its five trainable layers
such that the mean squared error $L_2$ between the predicted submatrices $\mathbf{M}_s = G_s(\bm{\omega})$
and the training submatrices $\mathbf{T}_s$ becomes minimal for all $N$ training samples $s \in (1,2,\dots, N)$.
Here, $\mathbf{M}_s$ is given by the activations of the last dense layer, which are to be interpreted as the upper triangular
part of a symmetric $(w, w)$-shaped Hi-C matrix.
Formally, one needs to optimize
\begin{equation}
\bm{\omega}^* = \mathit{arg\,min}_{\bm{\omega}} L_2(\bm{\omega}) =
\mathit{arg\,min}_{\bm{\omega}} \frac{1}{N} \sum_{s=1}^N (\mathbf{M}_s - \mathbf{T}_s)^2 \label{eq:methods:nn-mse}
\end{equation}
For the thesis at hand, \acrfull{sgd} with learning rate $10^{-5}$ was used to find $\bm{\omega}^*$,
except for some of the comparisons with other approaches described in \cref{sec:methods:comparison}.
Initial values $\bm{\omega}_\mathit{init}$ were drawn from a Xavier initializer, a uniform distribution with limits depending on the in- and output shapes.
Following \cite{Farre2018a}, optimization was performed on minibatches of 30 samples, assembled randomly from the $N$ training samples
to prevent location-dependent effects and improve generalization.
For training, the last batch was dropped, if $N/30 \not \in \mathbb{N}$.
The network and its learning algorithm were implemented in python using the tensorflow deep learning framework, partially with the keras frontend \cite{Abadi2015,Chollet2015}.
All code is available under an open source license, see github repository \cite{Krauth2021b}.
All computations were performed on virtual machines with different properties, partially with GPU assistance; see \cref{sec:appendix:hardware} for details.
\subsubsection{Modifying kernel size, number of filter layers and filters} \label{sec:methods:variants}
For the ``wider'' variant, the kernel size of the 1D convolutional layer was increased to $4k$ with strides $k$, \cref{fig:methods:basic_dnn},
where $k=b_\mathit{mat}/b_\mathit{feat}$ is the relation between matrix- and feature bin size.
Mirror-type padding was used to maintain the output dimensions of the basic network; this variant had \SI{7486478}{} trainable weights
for $k=1$.
For the ``longer'' variant, three 1D-convolutional layers with 4, 8 and 16 filters
were used in place of the basic network's single convolutional layer, cf. \cref{fig:methods:longer_dnn}.
This replacement was also made for the ``wider-longer''-variant,
additionally increasing the respective kernel sizes to $4k$ (with strides $k$), 4 and 4, cf. \cref{fig:methods:longer_dnn} (right).
In both cases, the dropout rate was increased to 20\%.
The ``longer'' variant had \SI{9142665}{} trainable weights and the ``wider-longer'' variant had \SI{9143313}{} for $k=1$.
For the variant with generalized binning, features were binned at $b_\textrm{feat} = \SI{5}{\kilo\bp}$ while keeping the matrix bin size $b_\textrm{mat} = \SI{25}{\kilo\bp}$,
so the factor for the relation between the two bin sizes was $k=25/5=5$.
This yields input feature arrays of size $(3w\cdot k, n) = (3\cdot80\cdot5 , 14) = (1200, 14)$.
Replacing the first convolutional layer of the basic network by a 1D-convolutional filter with kernel size $5$ and strides $5$ without padding,
this input was again compressed to a $(3w, 1) = (240, 1)$-shaped tensor and fed into the flatten layer, cf.~\cref{fig:methods:basic_dnn}.
The rest of the network remained the same so that the number of trainable parameters increased only slightly to \SI{7486492}{}.
\begin{figure}[p]
\small
\centering
\import{figures/}{DNN_longer.pdf_tex}
\caption{``Longer'' and ``wider-longer'' variants for \acrshort{dnn}}
\label{fig:methods:longer_dnn}
\end{figure}
\subsubsection{Combination of mean squared error, perception loss and TV loss} \label{sec:methods:combined_loss}
For computing the combined \acrshort{mse}, perception- and \acrshort{tv}-loss,
input features were first passed through the network as normal,
and mean squared error was computed between the predicted ``upper triangular'' vectors and the real vectors.
Next, both the output- and target vectors were converted to symmetric grayscale images by reshaping them into
$(w, w, 1)$-shaped tensors, where $w$ is again the window size.
For a grayscale pixel-image $\mathbf{y}$, total variation can be defined as the summed magnitude of the differences between neighboring pixel values, \cref{eq:methods:tv} \cite{Rudin1992, wikiTV2021}.
\begin{equation}
tv(\mathbf{y}) = \sum_{i,j}\sqrt{|y_{i+1,j,1} - y_{i,j,1}|^2 + |y_{i,j+1,1} - y_{i,j,1} |^2 } \label{eq:methods:tv}
\end{equation}
For efficiency, the tensorflow implementation from tf.image.total\_variation was used,
taking the sum across batches as loss value, as recommended in the tensorflow documentation \cite{TensorflowTV2020}.
For perception loss, the predicted images and the true images were first fed through the pre-trained \emph{VGG-16} network \cite{Simonyan2015}.
Here, the network was truncated after the third convolution layer in the fourth block (``block4\_conv3''), which is the last layer also used by the influential work of Johnson et al. \cite{Johnson2016}.
The perception loss was then computed as mean squared error between the ``predicted'' and ``true'' output activations of the truncated \emph{VGG-16} network.
Let $\mathbf{M}_s=G_s(\bm{\omega})$ again be the output of the neural network $G$ described above, and $\mathbf{T}_s$ the true matrices for training samples $s$ in vector form,
and let $\mathbf{M}^\prime_s$ and $\mathbf{T}^\prime_s$ be their grayscale image counterparts as described above.
Furthermore, let $\mathit{tv}(\bm{y})$ be the total variation of image $\bm{y}$ and $\mathit{vgg}(\bm{y})$ the output of the perception loss network for image $\bm{y}$.
The optimization problem for the modified network was then formulated by means of \cref{eq:methods:combined_loss} to find weights $\bm{\omega}^*$ such that
\begin{equation}
\bm{\omega}^* = \mathit{arg\,min}_\omega ( \lambda_\mathit{MSE} \frac{1}{N} \sum_{s=1}^N (\mathbf{M}_s - \mathbf{T}_s)^2
+ \lambda_\mathit{TV} \sum_{s=1}^N \mathit{tv}( \mathbf{M}^\prime_s)
+ \lambda_\mathit{VGG} \frac{1}{N} \sum_{s=1}^N (\mathit{vgg}(\mathbf{M}^\prime_s) - \mathit{vgg}(\mathbf{T}^\prime_s))^2 ) \label{eq:methods:combined_loss}
\end{equation}
where $N$ is again the number of samples.
Weight initialization for network $G$ and minibatching were done as described in \cref{sec:methods:basicSetup}.
The \emph{VGG-16} network and the corresponding weights were taken from the keras \cite{Chollet2015} implementation pre-trained on \emph{ImageNet} \cite{deng2009}.
As a side effect, the usage of \emph{VGG-16} imposes the restriction $w \geq 32$ on the window size $w$, which is not problematic, since again $w=80$ was chosen for all experiments.
The network $G(\bm{\omega})$ could in principle be any of the variants described above in \cref{sec:methods:variants},
but for the thesis at hand, only the initial network from \cref{sec:methods:basicSetup} was used.
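A TensorFlow sketch of how these three loss terms could be assembled is given below; the grayscale-to-RGB conversion and the absence of further \emph{VGG-16} input preprocessing are assumptions, and the loss weights are placeholders.
\begin{verbatim}
import tensorflow as tf
from tensorflow.keras.applications import VGG16

# VGG-16 truncated after block4_conv3, with frozen ImageNet weights (w = 80)
vgg = VGG16(include_top=False, weights="imagenet", input_shape=(80, 80, 3))
vgg_truncated = tf.keras.Model(vgg.input,
                               vgg.get_layer("block4_conv3").output)
vgg_truncated.trainable = False

def combined_loss(pred_img, true_img,
                  lam_mse=1.0, lam_tv=1e-9, lam_vgg=1e-3):
    """MSE + total variation + perception loss for (batch, w, w, 1)
    grayscale images; the lambda defaults are placeholders."""
    mse = tf.reduce_mean(tf.square(pred_img - true_img))
    # sum of the per-image total variations across the batch
    tv = tf.reduce_sum(tf.image.total_variation(pred_img))
    # VGG-16 expects three channels, so the grayscale channel is repeated
    pred_feat = vgg_truncated(tf.image.grayscale_to_rgb(pred_img))
    true_feat = vgg_truncated(tf.image.grayscale_to_rgb(true_img))
    vgg_loss = tf.reduce_mean(tf.square(pred_feat - true_feat))
    return lam_mse * mse + lam_tv * tv + lam_vgg * vgg_loss
\end{verbatim}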
\subsubsection{Combination of mean squared error and TAD-score-based loss} \label{sec:methods:score_loss}
To optimize both for mean-squared error and a TAD-score-derived loss, the following optimization task was defined to find weights $\bm{\omega}^*$ such that
\begin{equation}
\bm{\omega}^* = \mathit{arg\,min}_\omega ( \lambda_\mathit{MSE} \frac{1}{N} \sum_{s=1}^N (\mathbf{M}_s - \mathbf{T}_s)^2
+ \lambda_\mathit{score} \frac{1}{N} \sum_{s=1}^N (\mathit{score}(\mathbf{M}^\prime_s,ds) - \mathit{score}(\mathbf{T}^\prime_s,ds))^2 ) \label{eq:methods:score_loss}
\end{equation}
where $\mathbf{M}_s$ is again the Hi-C submatrix predicted by the network $G(\bm{\omega})$ for sample $s$, $\mathbf{T}_s$ is the corresponding true Hi-C submatrix
and $N$ is the number of samples.
To compute the TAD-score-based loss $\mathit{score}(\cdot,\cdot)$, the predictions and true Hi-C matrices $\mathbf{M}$ and $\mathbf{T}$ in vector form (upper triangular part)
were first converted back to complete, symmetric Hi-C matrices $\mathbf{M}^\prime,\; \mathbf{T}^\prime$.
Next, in a custom network layer, all diamonds with size $ds$ inside the submatrices of size $w$ were cut out using tensor slicing and the values inside the diamonds were reduced to their respective mean.
This yields score vectors -- more exactly, tensors with shape $(w - 2\,ds, 1)$.
After computing the latter for both predicted- and real Hi-C submatrices, the mean squared error between them was computed as usual and weighted with
a user-selected loss weight $\lambda_\mathit{score}$, see \cref{eq:methods:score_loss}.
While it would also have been possible, and probably faster, to slice the outputs of the networks directly in vector form,
this is rather counterintuitive and was therefore not implemented for the thesis.
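A sketch of the diamond-score computation and the resulting loss term is shown below; the slicing follows one common (insulation-score-like) diamond definition and is therefore only an approximation of the custom layer described above.
\begin{verbatim}
import tensorflow as tf

def diamond_scores(matrices, ds):
    """Mean interaction count of every diamond of size ds along the
    diagonal of symmetric (batch, w, w) matrices; the result has
    shape (batch, w - 2*ds)."""
    w = matrices.shape[-1]
    scores = []
    for i in range(ds, w - ds):
        # contacts between the ds bins upstream and downstream of position i
        diamond = matrices[:, i - ds:i, i:i + ds]
        scores.append(tf.reduce_mean(diamond, axis=[1, 2]))
    return tf.stack(scores, axis=1)

def score_loss(pred, true, ds):
    """Mean squared error between predicted and true score vectors."""
    return tf.reduce_mean(
        tf.square(diamond_scores(pred, ds) - diamond_scores(true, ds)))
\end{verbatim}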
\clearpage
\subsection{Hi-cGAN approach} \label{sec:methods:hicgan}
In the following subsection, the setup and training process of the \acrshort{cgan} network developed for this thesis -- \emph{Hi-cGAN} -- and the respective
embeddings for the conditional input -- the chromatin features -- will be described in more detail.
First, the sample generation process and the basic network setup based on the well-known \emph{pix2pix} \acrshort{cgan} architecture \cite{Isola2017}
are explained in \cref{sec:methods:sample_gen_cgan,sec:methods:cGAN_initial},
then the respective embedding networks are discussed in \cref{sec:methods:dnn-embedding,sec:methods:cnn-embedding,sec:methods:mixed-embedding}.
\subsubsection{Sample generation process for Hi-cGAN} \label{sec:methods:sample_gen_cgan}
The samples for \emph{Hi-\acrshort{cgan}} were generated the same way as the samples for the dense neural network described
in \cref{sec:methods:sample_gen}, with two exceptions.
First, Hi-C (sub-)matrices were not entered as vectors corresponding to their upper triangular part,
but were instead taken as $(w, w, 1)$-shaped grayscale images with value range [0...1] in 32-bit floating point format.
Second, for the \acrshort{cgan} approach, the input matrices must not only be square,
but their edge lengths must also be powers of two.
This requirement is due to the connected up- and downsampling operations in the generator,
which essentially are 2D-convolutions and transposed 2D-convolutions with strides two, see below.
Within the thesis, window sizes of $w=\{64,128,256\}$ were used, see \cref{sec:results:cgan}.
Training, validation and test chromosomes were generally the same as for the \acrshort{dnn} approach.
For comparability with \emph{HiC-Reg}, two \acrshort{cgan}s were additionally trained on single chromosomes 14 and 17 of cell line GM12878, cf. \cref{sec:methods:comparison}.
\subsubsection{Modified pix2pix network}\label{sec:methods:cGAN_initial}
Several implementations of the original \emph{pix2pix} network \cite{Isola2017} are publicly available, usually for the original image size of $256\times256$.
For the thesis at hand, implementation concepts from two tutorials \cite{tfpix2pix2020, brownlee2019} were combined and the code was adapted to the given requirements.
Within the generator, two of the down- and upsampling layers inside the \emph{U-Net} portion were made optional
to allow processing smaller images of sizes $64\times64$ and $128\times128$, see \cref{fig:methods:GAN_arch:generator}.
Note that the generator (still) only supports square images with edge lengths that are powers of two and at least $2^6=64$.
Furthermore, symmetry of the output images was enforced by adding their transpose and multiplying by 0.5, cf. \cite{Fudenberg2020}.
The down- and upsampling layers shown in \cref{fig:methods:GAN_arch:generator} are custom blocks
detailed in \cref{fig:methods:GAN_arch:downsampling,fig:methods:GAN_arch:upsampling}.
All 2D-convolutions and 2D-deconvolutions had kernel size $(4,4)$ and were initialized with values drawn from a normal distribution with mean $\mu=0$ and
standard deviation $\sigma=0.02$.
Finally, the output layer of the generator was changed from tanh- to sigmoid activation.
This empirically showed better results, probably because the training- and test matrices were scaled to 0...1, which
is also the value range of the sigmoid function, cf. \cref{sec:methods:sample_gen_cgan,eq:sigmoid} on page \pageref{eq:sigmoid}.
Compared to the original \emph{pix2pix} setup,
one downsampling layer was omitted in the discriminator for window size $w\in\{128,256,\dots\}$ and another one for $w=64$.
This made the discriminator patches larger, especially for the smaller window sizes,
cf. \cref{fig:methods:GAN_arch:discriminator}, and was empirically found to improve the results
slightly in the given application.
After each convolution, symmetry was enforced in the same way as for the generator, see above.
Kernel sizes and initializations for all 2D convolutions also were the same as for the generator,
and leaky \acrshort{relu}-activations were used with parameter $\alpha=0.2$, as in all downsampling layers.
Both the discriminator and the generator featured their own, trainable embedding network
to convert the conditional input, i.\,e. the chromatin feature data of shape $(3w, n)$,
into grayscale images of shape $(w,w,1)$.
These networks will be discussed below in \cref{sec:methods:dnn-embedding,sec:methods:cnn-embedding}.
\Cref{tab:methods:gen_disc_params} shows the number of trainable parameters
for the generator and discriminator without considering the embedding networks.
\begin{table}[htbp]
\centering
\begin{tabular}{lcc}
\hline
\multirow{2}{*}{\textbf{window size}} & \multicolumn{2}{c}{\textbf{trainable weights}} \\
& \multicolumn{1}{c}{Generator} & \multicolumn{1}{c}{Discriminator} \\ \hline
64 & \SI{29237889}{} & \SI{3164673}{} \\
128 & \SI{41822849}{} & \SI{2765057}{} \\
256 & \SI{54407809}{} & \SI{2765057}{} \\ \hline
\end{tabular}
\caption{Trainable weights for Generator and Discriminator w/o embedding network}
\label{tab:methods:gen_disc_params}
\end{table}
The loss functions for the generator and discriminator were implemented as shown in equations (\ref{eq:improve:cGAN_loss}) to (\ref{eq:improve:disc_loss_total})
with parameters $\lambda_\mathit{adv}=1.0, \lambda_\mathit{MAE}=100.0, \lambda_\mathit{TV}=10^{-12}$ and $\lambda_D=0.5$.
Optimization was performed on minibatches of size 32, 8 and 2 for window sizes 64, 128 and 256, respectively,
using Adam optimizers with learning rate $2\cdot10^{-5}$ for the generator, $10^{-6}$ for the discriminator and $\beta_1=0.5$ for both generator and discriminator.
The choice of batch sizes was partially dictated by memory limitations of the available GPUs, cf. \cref{sec:appendix:hardware}.
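For reference, the two optimizers with these hyperparameters can be instantiated as follows (a sketch; the full training loop is part of the published code \cite{Krauth2021b}).
\begin{verbatim}
import tensorflow as tf

# Adam with learning rate 2e-5 (generator) and 1e-6 (discriminator),
# beta_1 = 0.5 for both, as described above
generator_optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5, beta_1=0.5)
discriminator_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-6, beta_1=0.5)
\end{verbatim}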
\subsubsection{Using a DNN for feature embedding} \label{sec:methods:dnn-embedding}
In order to use the \acrshort{dnn} described in \cref{sec:methods:basicSetup} as an embedding network
for the \acrshort{cgan}, only small amendments were required to adjust the input shapes,
i.\,e. to provide symmetric Hi-C matrices as grayscale images instead of the upper-triangular-vector representation
native to the \acrshort{dnn}, see \cref{fig:methods:dnn-embedding}.
The triu-reshape layer in \cref{fig:methods:dnn-embedding} was a custom tensorflow network layer which generates an output tensor
of appropriate shape $(w,w)$ and sets its upper triangular part to the values given by the input vector.
Symmetrization was then performed by adding this tensor to its transpose and subtracting the values on the diagonal once,
because the diagonal was contained both in the upper triangular part of the matrix and its transpose.
Finally, the required third axis was added to get the shape of a grayscale image.
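A per-sample TensorFlow sketch of this triu-reshape and symmetrization step is given below; batching, e.\,g. via a custom Keras layer, is omitted.
\begin{verbatim}
import numpy as np
import tensorflow as tf

def triu_vector_to_grayscale(vec, w):
    """Turn a vector of length w*(w+1)/2 (upper triangle incl. diagonal)
    into a symmetric (w, w, 1) grayscale image."""
    rows, cols = np.triu_indices(w)
    indices = np.stack([rows, cols], axis=1)            # shape (w*(w+1)/2, 2)
    upper = tf.scatter_nd(indices, vec, shape=(w, w))   # upper triangular matrix
    # add the transpose and subtract the diagonal, which is contained twice
    sym = upper + tf.transpose(upper) \
          - tf.linalg.diag(tf.linalg.diag_part(upper))
    return tf.expand_dims(sym, axis=-1)                 # shape (w, w, 1)
\end{verbatim}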
The number of trainable parameters for the \acrshort{dnn} embedding is shown in \cref{tab:methods:embedding_network_params}, rightmost column.
Note that all trainable parameters stem from the \acrshort{dnn} here; the reshaping layers do not have any trainable parameters.
\begin{table}[htbp]
\centering
\begin{tabular}{lrr}
\hline
\multicolumn{1}{c}{\multirow{2}{*}{\textbf{window size}}} & \multicolumn{2}{c}{\textbf{trainable weights}} \\
\multicolumn{1}{c}{} & \multicolumn{1}{c}{CNN} & \multicolumn{1}{c}{DNN} \\ \hline
64 & \SI{4243968}{} & \SI{5502796}{} \\
128 & \SI{4260416}{} & \SI{16034732}{} \\
256 & \SI{4293312}{} & \SI{57877612}{} \\ \hline
\end{tabular}
\caption{Trainable weights for embedding networks}\label{tab:methods:embedding_network_params}
\end{table}
For this thesis, the \acrshort{dnn} embedding was used in two ways.
First, it was trained together with the rest of the \acrshort{cgan} with weight initialization as described in
\cref{sec:methods:basicSetup,sec:methods:cGAN_initial}.
Second, a \acrshort{dnn} was pre-trained as described in \cref{sec:methods:basicSetup} with window size $w=64$
and the learned weights were transferred to the \acrshort{cgan} once before the start of the training process,
which was then continued as described in \cref{sec:methods:cGAN_initial}.
The results of the pre-training are visualized in \cref{sec:appendix:pretraining_results} (\cref{fig:results:DNN64_pearson}, p.~\pageref{fig:results:DNN64_pearson});
the state after 250 epochs was the one transferred to the \acrshort{cgan}.
\acrshort{dnn} embeddings were only used with window size $w=64$ due to the large number of parameters
to be trained at window sizes 128, 256 and higher.
\subsubsection{Using a CNN for feature embedding} \label{sec:methods:cnn-embedding}
The convolutional embedding network consisted of 8 convolutional blocks and a final 1D convolution layer,
as shown in \cref{fig:methods:GAN_arch:embedding_network}.
Each of the convolution blocks started with a 1D convolution with kernel size 4, strides 1, padding ``same''
and ``L2'' kernel regularization with parameter $l=0.02$, followed by batch normalization and leaky \acrshort{relu} activation
with parameter $\alpha=0.2$.
The last 1D convolution consisted of $w$ filters with kernel size 4, strides 3 and padding ``same'',
followed by sigmoid activation; this last convolution layer was not using kernel regularization.
All kernel weights of the 1D convolutions in the embedding network were initialized by a Xavier initializer.
In the \acrshort{cnn} embedding, the number of trainable parameters was independent of the window size for all layers except the last convolutional layer, where the number
of convolutional filters was equal to the window size, cf. \cref{fig:methods:GAN_arch:embedding_network}.
For the three window sizes $w=\{64,128,256\}$ used within this thesis, the \acrshort{cnn}-embedding network
contained about 4.2 to 4.3 million trainable weights, see left column of \cref{tab:methods:embedding_network_params} for details.
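A Keras sketch of this embedding network is given below; the number of filters per convolution block is a placeholder, the actual counts follow \cref{fig:methods:GAN_arch:embedding_network}.
\begin{verbatim}
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_cnn_embedding(w, n=14, n_filters=256):
    """CNN embedding: 8 convolution blocks plus a final 1D convolution
    mapping a (3w, n) feature window to a (w, w, 1) grayscale image.
    n_filters is a placeholder, not the thesis value."""
    inputs = layers.Input(shape=(3 * w, n))
    x = inputs
    for _ in range(8):
        x = layers.Conv1D(n_filters, kernel_size=4, strides=1, padding="same",
                          kernel_regularizer=regularizers.l2(0.02))(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(alpha=0.2)(x)
    # w filters with strides 3 compress the length-3w axis to length w,
    # giving a (w, w) map that is reinterpreted as a grayscale image
    x = layers.Conv1D(w, kernel_size=4, strides=3, padding="same",
                      activation="sigmoid")(x)
    outputs = layers.Reshape((w, w, 1))(x)
    return tf.keras.Model(inputs, outputs)
\end{verbatim}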
\subsubsection{Mixed embedding for generator and discriminator} \label{sec:methods:mixed-embedding}
In the mixed setting, the dense neural network described above in \cref{sec:methods:dnn-embedding} was used as
embedding network for the generator and the \acrshort{cnn} described in \cref{sec:methods:cnn-embedding} was used as embedding
network for the discriminator.
The mixed setting was used both without and with weight transfer for its \acrshort{dnn} embedding.
In the latter case, again the weights of the \acrshort{dnn} were replaced by the ones of the same pre-trained \acrshort{dnn} as in \cref{sec:methods:dnn-embedding}
before the training process started.
\begin{figure}[p] %generator
\tiny
\import{figures/GAN_arch/}{generatorModel.pdf_tex}
\caption{Adapted generator model from \emph{pix2pix}} \label{fig:methods:GAN_arch:generator}
\end{figure}
\begin{figure}[p] %discriminator
\scriptsize
\centering
\import{figures/GAN_arch/}{discriminatorModel.pdf_tex}
\caption{Adapted discriminator model from \emph{pix2pix}} \label{fig:methods:GAN_arch:discriminator}
\end{figure}
\begin{figure}[p]%downsampling blocks
\scriptsize
\centering
\import{figures/GAN_arch/}{downsampling_block.pdf_tex}
\caption{Downsampling block} \label{fig:methods:GAN_arch:downsampling}
\end{figure}
\begin{figure}[p] %upsampling blocks
\scriptsize
\centering
\import{figures/GAN_arch/}{upsampling_block.pdf_tex}
\caption{Upsampling block} \label{fig:methods:GAN_arch:upsampling}
\end{figure}
\begin{figure}[p] %DNN-embedding
\scriptsize
\centering
\import{figures/GAN_arch/}{dnn_embedding.pdf_tex}
\caption{Embedding network, \acrshort{dnn}} \label{fig:methods:dnn-embedding}
\end{figure}
\begin{figure}[p] %CNN-embedding
\scriptsize
\centering
\import{figures/GAN_arch/}{embedding_network.pdf_tex}
\caption{Embedding network, \acrshort{cnn}} \label{fig:methods:GAN_arch:embedding_network}
\end{figure}
\clearpage
\subsection{Comparison with other approaches} \label{sec:methods:comparison}
The results of this thesis were compared to the ones from \emph{HiC-Reg} by Zhang et al. \cite{Zhang2019},
who used largely the same Hi-C and chromatin feature data
as described above in \cref{sec:methods:input_data},
and to results by Farr\'e et al. \cite{Farre2018a}, who used data from \emph{Drosophila melanogaster} embryonic cells.
\textbf{Comparison with HiC-Reg}\\
For comparison with \emph{HiC-Reg}, stored predictions for cell line K562, chromosomes 14 and 17 were downloaded from Zenodo \cite{ShiluZhang2019,ShiluZhang2019a}.
Here, results from the ``WINDOW'' and ``MULTICELL'' approaches were selected, whereby the ``WINDOW'' method used training data from GM12878, chromosome 14 or 17,
while the ``MULTICELL'' approach used chromatin feature data from cell lines GM12878, HMEC, HUVEC, K562(!) and NHEK with training matrices from GM12878, chromosome 14 or 17.
These results were deemed to offer the best comparability to the \acrshort{cgan}- and \acrshort{dnn} approaches of this thesis.
The downloaded data is in text format and was first converted to cooler format; during this conversion, a surprising sparsity of the data was noted.
For chromosome 14, only \SI{42.4}{\percent} and for chromosome 17, only \SI{59.9}{\percent} of all possible interacting pairs
for the chosen window size $w=200$ at bin size \SI{5}{\kilo\bp} were contained in the datasets,
and no interacting pairs with predicted interaction counts below 0.01 were found in the data.
Instead, the value range was between 0.36 and 5.10 across all four datasets, with similar values for each dataset alone.
The missing pairs were found to be distributed all across the chromosome and all distance ranges, except for distance zero, i.\,e. the Hi-C matrix diagonal, which was not contained in the datasets at all.
For the conversion to cooler format, the missing values were assumed to be zero.
Next, the cooler matrices were coarsened to bin size \SI{25}{\kilo\bp} to allow comparisons with the \acrshort{cgan}- and \acrshort{dnn} results, using \texttt{cooler} \texttt{coarsen}.
For the comparison between \emph{HiC-Reg} and \emph{Hi-\acrshort{cgan}}, two additional \acrshort{cgan} models with \acrshort{cnn} embedding were trained on chromosome 14 and 17, respectively,
using (arbitrarily selected) chromosome 10 for validation.
Due to the small number of training samples in this setting, window size was set to $w=64$ and batch size to $bs=2$.
The rest of the training process remained the same, cf. \cref{sec:methods:cGAN_initial}.
Additionally, \emph{HiC-Reg} was compared to a \acrshort{cgan} model with window size $w=256$ and \acrshort{cnn} embedding, trained on the typical training data set,
which includes chromosomes 14 and 17 besides others, cf. \cref{sec:methods:cGAN_initial,sec:methods:sample_gen,sec:methods:sample_gen_cgan}.
When computing the distance-stratified Pearson correlations only from the available (non-zero) predicted values in the datasets \cite{ShiluZhang2019,ShiluZhang2019a},
the results for chromosome 17 showed very good accordance with the data published in the paper \cite[fig.\,10]{Zhang2019}, \cref{fig:methods:zhang_correlations_reconstructed},
allowing the conclusion that the correlation computations according to \cref{eq:pearson} should in principle be comparable between this thesis and the paper by Zhang et al.
\begin{figure}[htbp]
\begin{subfigure}{0.45\textwidth}
\resizebox{\textwidth}{!}{
\scriptsize
\import{figures/randomforest/}{pearson_chr14_originalData_ZhangEtAl.pdf_tex}}
\end{subfigure}\hfill
\begin{subfigure}{0.45\textwidth}
\resizebox{\textwidth}{!}{
\scriptsize
\import{figures/randomforest/}{pearson_chr17_originalData_ZhangEtAl.pdf_tex}}
\end{subfigure}
\caption{Pearson correlations reconstructed from \cite{Zhang2019,ShiluZhang2019,ShiluZhang2019a}} \label{fig:methods:zhang_correlations_reconstructed}
\end{figure}
For reference, training and prediction were re-done for the WINDOW approach and cell lines GM12878 $\rightarrow$ K562, chromosome 17.
With regard to chromatin features, the same bam files were taken as in \cref{sec:methods:input_data}, except for SMC3, which was replaced by TBP,
in line with the publication by Zhang et al. \cite{Zhang2019}.
The bam files were then converted to text files using custom bash scripts and the \texttt{aggregateSignal} application provided by \emph{HiC-Reg},
cf. \cref{list:methods:bam2bedgraph,list:methods:bedgraph2hicreg}.
For simplicity, only replicate 1 was used in all cases.
With regard to matrix data, cooler files were first prepared as described in \cref{sec:methods:input_data}.
Next, square root vanilla coverage correction was applied, the matrices were dumped into text files and a custom python script was used
to convert the text inputs to meet \emph{HiC-Reg}'s requirements, cf. \cref{list:methods:convertForHicReg_bash,list:methods:convertForHicReg_python}.
For the rest of the training- and prediction process, the instructions in \emph{HiC-Reg}'s github repository were followed \cite{Zhang2019}.
The resulting text files were converted back to cooler using another custom python script, see \cref{list:methods:hicreg2cool}.
The recomputed results again contained only \SI{59.6}{\percent} of all possible interacting pairs, confirming that filtering occurs during \emph{HiC-Reg}'s sample generation or training process,
although it is not explicitly mentioned in the paper.
In terms of Pearson correlations -- here, including zero-values -- the recomputed results from replicate 1 were slightly better,
but overall very similar to the ones restored from the published datasets \cite{ShiluZhang2019,ShiluZhang2019a}.
It is surprising that zero-values have not been considered by Zhang et al. when computing the correlations for their publication \cite{Zhang2019}.
Correctly predicted zero-values strongly increase the quality of the output matrices
and should thus be included in the correlation computations.
Pearson correlations for the predictions in \cref{sec:results:comparison} were thus computed for full chromosomes, including zero-values.
\textbf{Comparison with dense network by Farr\'e et al.}\\
Unfortunately, Farr\'e et al. have not published any data from which (predicted) Hi-C matrices and Pearson correlations could have been reconstructed,
but Pau Farr\'e provided some of the non-public source code used for the paper \cite{Farre2018a} following private communication.
After some minor bug fixes and adaptations to make the code work with python version 3, it could be used to recompute the published results.
To this end, \emph{Hi-C data} from 16--18 hours old \emph{Drosophila melanogaster} embryonic cells by Schuettengruber et al. \cite{Schuettengruber2014}
was downloaded in text format from the gene expression omnibus, accession GSE61471, for bin size \SI{10}{\kilo\bp}.
Next, 49 \emph{chromatin features} from modENCODE ChIP-seq experiments with 14--16 hours old \emph{Drosophila melanogaster} embryonic cells \cite{Roy2010}
were obtained from GEO in the gff3 file format required by the provided python application, see \cref{tab:methods:features_farre} (p.~\pageref{tab:methods:features_farre}).
Note that while Farr\'e et al. state that they used 50 features, only 49 unique features are actually specified in their paper, H3K4me1 being listed twice \cite[p.~9]{Farre2018a}.
This explains why table \ref{tab:methods:features_farre} contains 49 instead of the expected 50 entries.
Judging from the alphabetical ordering in the table, H3K4me2 is presumably the 50th feature missing from the paper.
Note that in contrast to the bigwig files used for the \acrshort{dnn} and \emph{Hi-cGAN} so far, the gff3 files contain locations of statistically significant enrichments,
and not mean read counts for the chromatin features at hand.
\begin{table}[htb]
\centering
\begin{tabular}{lllllll}
\cline{1-3} \cline{5-7}
& \multicolumn{2}{c}{\textbf{accession}} & & & \multicolumn{2}{c}{\textbf{accession}} \\
\textbf{feature} & \multicolumn{1}{c}{reads} & \multicolumn{1}{c}{peaks} & & \textbf{feature} & \multicolumn{1}{c}{reads} & \multicolumn{1}{c}{peaks} \\ \cline{1-3} \cline{5-7}
BEAF & SRR1164521 & GSE51986 & & H3K9me1 & SRR869989 & GSE47240 \\
Chro Chriz WR & SRR869929 & GSE47263 & & H3K9me2 & SRR869864 & GSE47247 \\
CP190-HB & SRR869732 & GSE47234 & & H3K9me3 & SRR869860 & GSE47246 \\
CTCF & SRR869933 & GSE47264 & & H4 & SRR870049 & GSE47291 \\
dMi & SRR869937 & GSE47265 & & H4K16acM & SRR869856 & GSE47245 \\
dRING & SRR869941 & GSE47266 & & H4K20me1 & SRR870057 & GSE47293 \\
GAF & SRR869740 & GSE47236 & & HP1 & SRR869718 & GSE47231 \\
H1 & SRR870123 & GSE47309 & & HP1b & SRR870061 & GSE47294 \\
H2AV & SRR869950 & GSE47268 & & HP1c & SRR870183 & GSE47323 \\
H2B-ubiq & SRR869956 & GSE47269 & & HP2 & SRR870138 & GSE47312 \\
H3 & SRR869965 & GSE47271 & & HP4 & SRR869880 & GSE47251 \\
H3K18ac & SRR869977 & GSE47274 & & JHDM1 & SRR870069 & GSE47296 \\
H3K23ac & SRR869981 & GSE47275 & & LSD1 & SRR870073 & GSE47297 \\
H3K27ac & SRR869744 & GSE47237 & & MBD-R2 & SRR870077 & GSE47298 \\
H3K27me2 & SRR869993 & GSE47277 & & MOF & SRR870081 & GSE47299 \\
H3K27me3 & SRR869712 & GSE47230 & & NURF301 & SRR870085 & GSE47300 \\
H3K36me1 & SRR869839 & GSE47241 & & Pc & SRR869724 & GSE47232 \\
H3K36me2 & SRR870231 & GSE47335 & & POF & SRR870089 & GSE47301 \\
H3K36me3 & SRR869900 & GSE47256 & & Psc & SRR869736 & GSE47235 \\
H3K4me1 & SRR870009 & GSE47281 & & RNA Pol II & SRR870093 & GSE47302 \\
H3K4me3 & SRR870025 & GSE47285 & & RPD3 & SRR870105 & GSE47305 \\
H3K79me1 & SRR869748 & GSE47239 & & SU(HW)-HB & SRR869728 & GSE47233 \\
H3K79me2 & SRR869884 & GSE47252 & & Su(var)3 & SRR869876 & GSE47250 \\
H3K79me3 & SRR870191 & GSE47325 & & ZW5 & SRR870117 & GSE47308 \\
H3K9acS10P & SRR870033 & GSE47287 & & & & \\ \hline
\end{tabular}
\caption[Chromatin features from D. melanogaster used for the comparison DNN / cGAN / Farré et al. \cite{Farre2018a}]{Chromatin features from D. melanogaster used for the comparison\\DNN / cGAN / Farré et al. \cite{Farre2018a}}
\label{tab:methods:features_farre}
\end{table}
Originally, Farré et al. used chromosomes 2L, 2R, 3L and the first half of 3R for training,
but after another simple modification to the provided source code, the training set was reduced to samples from chromosomes 2L, 2R and 3L,
which is easier to match with the implementations of the \acrshort{dnn} and \emph{Hi-cGAN}, see below.
This change did not visibly impair the results compared to the published ones.
Validation samples were generated by an 80:20 split of the training/validation set, leaving Pau Farr\'e's implementation unchanged in this respect.
Contrary to \cite{Farre2018a}, the provided implementation does not use the \acrshort{sgd} optimizer with batch size 30,
but the RMSprop optimizer with batch size 100.
For the \acrshort{dnn} and \emph{Hi-cGAN}, the Hi-C matrix text files were converted to cooler format
using a custom python script \cite[scripts/schuettengruberToCooler.py]{Krauth2021b}.
Since the gff3 file format is not useful for the \acrshort{dnn} and \emph{Hi-cGAN}, reads from the modEncode ChIP-seq experiments \cite{Roy2010}
were obtained from the Sequence Read Archive (SRA) for 49 features in fastqsanger format, using the accession keys given in \cref{tab:methods:features_farre}.
After downloading, the reads were mapped to the Drosophila melanogaster reference genome BDGP5/dm3 using bowtie2 \cite{Langmead2012}
mostly with default parameters, see \cref{list:methods:map_drosophila} for details.
The resulting bam alignment files were then converted to bigwig format as described in \cref{sec:methods:input_data,list:methods:bamtobigwig}.
While most experiments feature 2 to 4 replicates, only replicate 2 was considered for each chromatin feature to limit disk space usage and processing time.
The bigwig files were then used as inputs for the \acrshort{dnn} and \acrshort{cgan} as described above in \cref{sec:methods:sample_gen,sec:methods:sample_gen_cgan,sec:methods:dnn,sec:methods:hicgan}.
For the comparisons, the \acrshort{dnn} was used with window size $w=80$,
but with a larger batch size of 100.
Additionally, sample flipping was used as described in \cite{Farre2018a} and implemented in the provided code.
Furthermore, the \acrshort{sgd} optimizer used so far was replaced by the RMS\-prop optimizer with standard parameters, namely a learning rate of 0.001;
training matrices were ICE-normalized and predicted matrices were not scaled.
All changes were made following a comparison with the source code supposedly used for the paper \cite{Farre2018a}, see above.
Training was performed on samples from chromosome 2L, 2R and 3L, and chromosome X was used for validation.
All other parameters and settings of the \acrshort{dnn} were left unchanged, cf. \cref{sec:methods:dnn}.
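As a minimal, hypothetical sketch of these optimizer and batch-size settings (the actual network architecture is described in \cref{sec:methods:dnn} and not reproduced here;
the stand-in model, the loss and the dummy data below are placeholders):
\begin{lstlisting}[language=Python]
# Hypothetical stand-in model, only illustrating RMSprop with its default
# learning rate of 0.001 and a batch size of 100; not the actual DNN.
import numpy as np
import tensorflow as tf

window, n_features = 80, 49
model = tf.keras.Sequential([
    tf.keras.Input(shape=(window, n_features)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(window, activation="relu"),
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
              loss="mean_squared_error")    # placeholder loss

# random dummy data, only to make the sketch executable end-to-end
x = np.random.rand(200, window, n_features).astype("float32")
y = np.random.rand(200, window).astype("float32")
model.fit(x, y, batch_size=100, epochs=1)
\end{lstlisting}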
\emph{Hi-cGAN} was used with a window size $w=64$, which is the largest power of two
smaller than the maximum window size of $w_\mathit{max}=80$ imposed by the given Hi-C data from Schuettengruber et al. \cite{Schuettengruber2014}.
Here, the batch size was set to 2, and training was performed on samples from chromosomes 2L and 2R,
using chromosome 3L for validation (in contrast to the \acrshort{dnn} setup above, where 3L was part of the training set) and chromosomes 3R and X for testing.
All other parameters and settings were left unchanged, cf. \cref{sec:methods:hicgan}.
For \emph{Hi-cGAN}, the training matrices were not ICE-normalized,
but the \emph{predicted} matrices were ICE-normalized according to \cref{list:appendix:ice-norm} for better comparability (\cref{fig:appendix:farre-vs-ours_ours-ice-normalized}).
The predicted matrices from Pau Farr\'e's python program for chromosome 3R were written out as numpy arrays by adding a simple one-liner to the provided code.
These arrays were then converted to cooler and Pearson correlations \emph{per distance} were computed from them as usual.
Since Farr\'e et al. have reported Pearson correlations \emph{per position}, an additional custom python script
was written to compute such correlations \cite[scripts/corr\_per\_pos.py]{Krauth2021b}.
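The following sketch shows one plausible way to compute such per-position correlations, assuming dense target and predicted matrices and correlating, for every bin, its interaction values with all neighbours up to a fixed distance;
the window of 80 bins and the exact definition are assumptions for illustration and may differ from the actual script.
\begin{lstlisting}[language=Python]
# Sketch only: one Pearson correlation value per genomic bin (position),
# computed over that bin's interactions with neighbours up to `window` bins away.
import numpy as np
from scipy.stats import pearsonr

def pearson_per_position(target, predicted, window=80):
    n = target.shape[0]
    corrs = np.full(n, np.nan)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        t, p = target[i, lo:hi], predicted[i, lo:hi]
        if np.std(t) > 0 and np.std(p) > 0:  # pearsonr is undefined for constant input
            corrs[i], _ = pearsonr(t, p)
    return corrs
\end{lstlisting}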
For the matrix plots, the aim was to use the same cutout as in the paper \cite{Farre2018a}.
However, the paper seems to contain an error, since the plotted excerpt does not coincide with Hi-C data from chromosome 3R:15-\SI{20}{\mega\bp},
as stated in the caption \cite[fig.~2]{Farre2018a}, but most likely stems from chromosome 3R:16.75-\SI{21.75}{\mega\bp},
which was subsequently used for the thesis at hand.
For comparability reasons, all matrix cutouts were plotted twice, once with the standard color map of this thesis (``RdYlBu\_r'', \cref{fig:results:farre-vs-ours_matrices}) and once
with the blue and red color maps used in the publication \cite{Farre2018a} (\cref{fig:appendix:farre-vs-ours_matrices}).
Additionally, the track file for pygenometracks was set up to use the minimum and maximum values of the given cutout explicitly,
because the automatic value range adjustment did not work well in this case.
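A small sketch of how the required value range can be determined is given below; the file name is a placeholder and the chromosome naming depends on the cooler file at hand.
The resulting numbers can then be entered as fixed minimum and maximum values in the pygenometracks track file.
\begin{lstlisting}[language=Python]
# Sketch: determine the value range of the plotted cutout so that fixed
# minimum/maximum values can be set in the pygenometracks track file.
import numpy as np
import cooler

clr = cooler.Cooler("predicted_chr3R.cool")    # placeholder file name
mat = clr.matrix(balance=False).fetch("chr3R:16750000-21750000")
finite = mat[np.isfinite(mat)]
print("min:", finite.min(), "max:", finite.max())
\end{lstlisting}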
The results for the comparison between the \acrshort{dnn}, \emph{Hi-cGAN} and Farr\'e et al. are shown at the end of section \ref{sec:results:comparison}.
\clearpage