<!--
University of Freiburg WS 2017/2018
Chair for Bioinformatics
Supervisor: Martin Raden
Authors: Alexander Mattheis, Martin Raden
-->
<div id="algorithm_description">
<div class="description">
Several agglomerative (hierarchical) clustering algorithms have been proposed to compute phylogenetic trees
from evolutionary distances:
<ul class="listed">
<li>
<em>Complete Linkage (Furthest Neighbour)</em>
by Thorvald Sørensen (1948)
- Merges the two clusters $i$ and $j$ whose largest inter-cluster distance $D_{i,j}$
(taken over all pairs of elements, one from each cluster)
is minimal among all pairs of clusters.
</li>
<li>
<em>Neighbour-Joining (Nearest Neighbour on clusters)</em> by
<a href="https://doi.org/10.1093/oxfordjournals.molbev.a040454">Naruya Saitou and Masatoshi Nei (1987)</a>,
modified by
<a href="https://doi.org/10.1093/oxfordjournals.molbev.a040527">James A. Studier and Karl J. Keppler (1988)</a>
- Merges the two clusters $i$ and $j$ whose value $M_{i,j}$ (defined further below) is minimal among all pairs of clusters.
</li>
<li>
<em>Single Linkage (Nearest Neighbour on elements between two different clusters)</em>
by Kazimierz Florek, Józef Łukaszewicz, Julian Perkal, Hugo Steinhaus and Stefan Zubrzycki (1951)
- Merges the two clusters $i$ and $j$ whose smallest inter-cluster distance $D_{i,j}$ is minimal among all pairs of clusters.
</li>
<li>
<em>Unweighted Pair Group Method with Arithmetic Mean (Group Average or Average Linkage)</em> by <br />
<a href="https://archive.org/details/cbarchive_33927_astatisticalmethodforevaluatin1902">Robert R. Sokal and Charles D. Michener (1958)</a>
- Merges the two clusters $i$ and $j$ whose average inter-cluster distance $D_{i,j}$ is minimal among all pairs of clusters.
</li>
<li>
<em>Weighted Pair Group Method with Arithmetic Mean (Simple Average)</em> by
<a href="http://journals.sagepub.com/doi/abs/10.1177/001316446602600201">Louis L. McQuitty (1967)</a>
- Merges the two clusters $i$ and $j$ whose weighted-average inter-cluster distance $D_{i,j}$ is minimal among all pairs of clusters.
</li>
</ul>
Neighbour-Joining is considered the more realistic approach (giving biologically more relevant results)
because, unlike UPGMA, Complete Linkage, Single Linkage and WPGMA, it does not assume
the same evolutionary rate for merged taxa (branch lengths of different taxa can differ).
This follows from the fact that UPGMA assumes an ultrametric tree
(a tree in which the distance from the root to every leaf is the same),
whereas Neighbour-Joining only assumes an additive tree
(a tree in which the distance between two leaves equals the sum of the branch lengths
on the path connecting them, so branch lengths can be arbitrary).
That makes Neighbour-Joining more reliable when evolutionary rates differ between lineages.
<br />
<br />
All agglomerative (bottom-up) approaches follow the same principle.
Initially, a matrix $D$ of pairwise evolutionary distances
between singleton clusters (each containing only one element) is given.
In this matrix $D$, or in a matrix $M$ derived from it (e.g. a neighbour-joining matrix),
you search for the entry with the lowest value $D_{min} = D_{i,j}$ (where $i \neq j$)
or $M_{min} = M_{i,j}$, respectively.
You merge the clusters $i$ and $j$ into a new cluster $ij = i \cup j$
and recompute the distances to the remaining clusters with a distance formula $\delta_m$
that depends on the respective approach.
Afterwards, the distances in the tree between $i$, $j$ and the merged cluster $ij$
are computed with a second formula $\delta_t$.
These steps are iterated for $O(N)$ rounds (usually $N-1$ rounds, but not in Neighbour-Joining),
where $N$ is the number of input singleton clusters.
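<br />
<br />
As a small illustration of one such round, consider four taxa $a, b, c, d$ with the distances
$D_{a,b} = 3$, $D_{a,c} = 7$, $D_{a,d} = 8$, $D_{b,c} = 6$, $D_{b,d} = 7$ and $D_{c,d} = 3$
(values invented for this example only).
The minimum $D_{min} = 3$ is attained by both $(a,b)$ and $(c,d)$; assume $(a,b)$ is merged first.
With the usual textbook forms of $\delta_m$ (an assumption here; the formula actually applied
is the one shown for the selected approach in the input section),
the distance from the new cluster $ab$ to $c$ becomes:
<ul class="listed">
<li>Single Linkage: $D_{ab,c} = \min(D_{a,c}, D_{b,c}) = \min(7, 6) = 6$</li>
<li>Complete Linkage: $D_{ab,c} = \max(D_{a,c}, D_{b,c}) = \max(7, 6) = 7$</li>
<li>UPGMA: $D_{ab,c} = \frac{|a| \cdot D_{a,c} + |b| \cdot D_{b,c}}{|a| + |b|} = \frac{1 \cdot 7 + 1 \cdot 6}{2} = 6.5$</li>
<li>WPGMA: $D_{ab,c} = \frac{D_{a,c} + D_{b,c}}{2} = \frac{7 + 6}{2} = 6.5$</li>
</ul>
UPGMA and WPGMA coincide here because $a$ and $b$ are both singletons;
they differ once clusters of unequal size are merged.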
<br />
<br />
In the Neighbour-Joining approach,
the entries of the $N \times N$ neighbour-joining matrix $M$
are computed by $M_{i,j} = (N-2) \cdot D_{i,j} - D_{i,J} - D_{I,j}$,
where $D_{i,J} = \sum^N_{k=1} D_{i,k}$
is the total distance between cluster $i$ and all other clusters
($D_{I,j} = \sum^N_{k=1} D_{k,j}$ is defined analogously for cluster $j$).
Together with the distance formulas, this matrix allows us
to find an additive tree.
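<br />
<br />
A worked example may make the formula concrete. Assume four taxa $a, b, c, d$ ($N = 4$) with the distances
$D_{a,b} = 3$, $D_{a,c} = 7$, $D_{a,d} = 8$, $D_{b,c} = 6$, $D_{b,d} = 7$ and $D_{c,d} = 3$
(values invented for this illustration only).
The total distances are $D_{a,J} = 3 + 7 + 8 = 18$, $D_{b,J} = 3 + 6 + 7 = 16$,
$D_{c,J} = 7 + 6 + 3 = 16$ and $D_{d,J} = 8 + 7 + 3 = 18$.
With $N - 2 = 2$, the entries of $M$ are:
<ul class="listed">
<li>$M_{a,b} = 2 \cdot 3 - 18 - 16 = -28$</li>
<li>$M_{a,c} = 2 \cdot 7 - 18 - 16 = -20$</li>
<li>$M_{a,d} = 2 \cdot 8 - 18 - 18 = -20$</li>
<li>$M_{b,c} = 2 \cdot 6 - 16 - 16 = -20$</li>
<li>$M_{b,d} = 2 \cdot 7 - 16 - 18 = -20$</li>
<li>$M_{c,d} = 2 \cdot 3 - 16 - 18 = -28$</li>
</ul>
The minimum $M_{min} = -28$ is attained by $(a,b)$ and by $(c,d)$, so one of these two pairs is merged first.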
<br />
<br />
For the given input, the following are computed and displayed in reverse order
(the tree first, then the matrices):
<ul>
<li>changing distance matrices (with intermediate matrices $M$)</li>
<li>phylogenetic tree</li>
</ul>
</div>
<div class="picture">
<img src="Agglomerative-Clustering-120x90.png" />
</div>
</div>
<h1>Input:</h1>
<div id="algorithm_input">
<div class="row">
<div class="colW100"><label>Approach:</label></div>
<div class="colW400">
<select class="selector" data-bind="options: input.availableApproaches, selectedOptions: input.selectedApproach"
id="approach_selector"></select>
</div>
</div>
<div class="row">
<div class="colW100">
</div>
<div class="colW600">
<span data-bind="text: $root.input.selectedFormula()"></span>
<span data-bind="text: $root.input.selectedSubformula()"></span>
<!-- ko if: $root.input.selectedApproach() == "Neighbour-Joining" -->
where $\Delta_{i,j} = \frac{D_{i,J} - D_{I,j}}{N-2}$
<!-- /ko -->
</div>
</div>
<div class="row">
<div class="colW100"><label>CSV-Data:</label></div>
<div class="colW400">
<textarea class="csv_data" data-bind="text: input.csvTable" id="csv_table"></textarea>
<br />
<span class="error_hint" data-bind="text: input.errorInput"></span>
<div class="group_hint">
<b>Hints:</b> <br />
Empty cells are interpreted as "0", and entries below the diagonal are ignored.
The ";" symbol is used as the column separator, and every non-empty line
must contain the same number of separators; otherwise no output is produced. <br />
</div>
</div>
</div>
</div>
<h1>Output:</h1>
<div id="algorithm_output">
<h2>Phylogenetic Tree</h2>
<div class="newick_tree">
<table class="result_header">
<thead>
<tr>
<th>
Newick Tree
</th>
</tr>
</thead>
</table>
<div class="result_with_scrollbar">
<table class="result">
<tbody>
<tr>
<td class="entry entry_start">
<code data-bind="text: $root.output.newickString()"></code>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<!-- ko if: $root.input.selectedApproach() != "Neighbour-Joining" -->
<div class="tree_container"> <!-- allows deleting and reinserting the div -->
<div id="phylogenetic_tree"></div>
</div>
<!-- ko if: $root.output.newickString().length !== 1 && $root.output.newickString().indexOf(SYMBOLS.MINUS) === -1 -->
<div class="group_hint">
<b>Visualization done with</b> <br />
Smits SA, Ouverney CC, 2010. jsPhyloSVG: <br />
A Javascript Library for Visualizing Interactive and Vector-Based Phylogenetic Trees on the Web. <br />
<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0012267">
PLoS ONE 5(8): e12267. doi:10.1371/journal.pone.0012267
</a>
</div>
<!-- /ko -->
<!-- /ko -->
<!-- ko if: $root.input.selectedApproach() == "Neighbour-Joining" -->
<div class="tree_container"> <!-- allows deleting and reinserting the div -->
<div id="phylogenetic_tree"></div>
</div>
<!-- ko if: $root.output.newickString().length !== 1 && $root.output.newickString().indexOf(SYMBOLS.MINUS) === -1 -->
<div class="group_hint">
<b>Visualization done with</b> <br />
Smits SA, Ouverney CC, 2010. jsPhyloSVG: <br />
A Javascript Library for Visualizing Interactive and Vector-Based Phylogenetic Trees on the Web. <br />
<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0012267">
PLoS ONE 5(8): e12267. doi:10.1371/journal.pone.0012267
</a>
</div>
<!-- /ko -->
<!-- /ko -->
<!-- ko if: $root.input.selectedApproach() != "Neighbour-Joining" -->
<h2>Distance Tables</h2>
<!-- /ko -->
<!-- ko if: $root.input.selectedApproach() == "Neighbour-Joining" -->
<h2>Distance &amp; Neighbour-Joining Tables</h2>
<!-- /ko -->
<div class="matrices">
<!-- ko foreach: $root.output.distanceMatrices() -->
<!-- ko if: $index() > 0 -->
<h3 class="header">Iteration <span data-bind="text: $index()"></span></h3>
<!-- /ko -->
<!-- ko if: $root.output.distanceMatrices()[$index()].length > 0 -->
<!-- ko if: $root.input.selectedApproach() != "Neighbour-Joining" -->
<div class="distance_table">
<table class="distances">
<thead>
<tr>
<th><span data-bind="text: $root.output.matrixDLatex()[$index()]"></span></th>
<!-- ko foreach: $root.output.remainingClusters()[$index()] -->
<th data-bind="drawChar: [$data, undefined]"></th>
<!-- /ko -->
</tr>
</thead>
<tbody>
<!-- ko foreach: $root.output.distanceMatrices()[$index()] --> <!-- to get i-indexes = $parentContext.$index() -->
<tr>
<th data-bind="drawChar: [$root.output.remainingClusters()[$parentContext.$index()][$index()], undefined]"></th>
<!-- ko foreach: $root.output.distanceMatrices()[$parentContext.$index()][$index()] --> <!-- to get j-indexes = $index() -->
<!-- ko if: $index() >= $parentContext.$index() -->
<td class="non_selectable_entry"
data-bind="text: $root.output.distanceMatrices()[$parentContext.$parentContext.$index()][$parentContext.$index()][$index()]">
</td>
<!-- /ko -->
<!-- ko if: $index() < $parentContext.$index() -->
<td class="non_selectable_entry gray"
data-bind="text: $root.output.distanceMatrices()[$parentContext.$parentContext.$index()][$parentContext.$index()][$index()]">
</td>
<!-- /ko -->
<!-- /ko -->
</tr>
<!-- /ko -->
<!-- ko if: $index() < $root.output.distanceMatrices().length-1 --> <!-- The last minimum is not computed -->
<tr>
<th class="hint" colspan=100%> <!-- HINT: move colspan into "hint"-class when browsers are ready! -->
<small>
<b>Minimum:</b> <span data-bind="text: $root.output.minimums()[$index()]"></span>
</small>
</th>
</tr>
<!-- /ko -->
</tbody>
</table>
</div>
<!-- /ko -->
<!-- ko if: $root.input.selectedApproach() == "Neighbour-Joining" -->
<div class="distance_table extra_small">
<table class="distances">
<thead>
<tr>
<th><span data-bind="text: $root.output.matrixDLatex()[$index()]"></span></th>
<!-- ko foreach: $root.output.remainingClusters()[$index()] -->
<th data-bind="drawChar: [$data, undefined]"></th>
<!-- /ko -->
</tr>
</thead>
<tbody>
<!-- ko foreach: $root.output.distanceMatrices()[$index()] --> <!-- to get i-indexes = $parentContext.$index() -->
<tr>
<th data-bind="drawChar: [$root.output.remainingClusters()[$parentContext.$index()][$index()], undefined]"></th>
<!-- ko foreach: $root.output.distanceMatrices()[$parentContext.$index()][$index()] --> <!-- to get j-indexes = $index() -->
<!-- ko if: $index() >= $parentContext.$index() -->
<td class="non_selectable_entry"
data-bind="text: $root.output.distanceMatrices()[$parentContext.$parentContext.$index()][$parentContext.$index()][$index()]">
</td>
<!-- /ko -->
<!-- ko if: $index() < $parentContext.$index() -->
<td class="non_selectable_entry gray"
data-bind="text: $root.output.distanceMatrices()[$parentContext.$parentContext.$index()][$parentContext.$index()][$index()]">
</td>
<!-- /ko -->
<!-- /ko -->
</tr>
<!-- /ko -->
<!-- ko if: $index() < $root.output.distanceMatrices().length - 1 -->
<tr class="thick_line">
<th><span data-bind="text: $root.output.sumLatex"></span></th>
<!-- ko foreach: $root.output.totalDistancesPerRound()[$index()] -->
<td class="entry" data-bind="text: $data"></td>
<!-- /ko -->
</tr>
<!-- /ko -->
</tbody>
</table>
</div>
<!-- ko if: $index() < $root.output.distanceMatrices().length - 1 -->
<div class="distance_table extra_small">
<table class="distances">
<thead>
<tr>
<th><span data-bind="text: $root.output.matrixDStarLatex()[$index()]"></span></th>
<!-- ko foreach: $root.output.remainingClusters()[$index()] -->
<th data-bind="drawChar: [$data, undefined]"></th>
<!-- /ko -->
</tr>
</thead>
<tbody>
<!-- ko foreach: $root.output.neighbourJoiningMatrices()[$index()] --> <!-- to get i-indexes = $parentContext.$index() -->
<tr>
<th data-bind="drawChar: [$root.output.remainingClusters()[$parentContext.$index()][$index()], undefined]"></th>
<!-- ko foreach: $root.output.neighbourJoiningMatrices()[$parentContext.$index()][$index()] --> <!-- to get j-indexes = $index() -->
<td class="non_selectable_entry"
data-bind="text: $root.output.neighbourJoiningMatrices()[$parentContext.$parentContext.$index()][$parentContext.$index()][$index()]">
</td>
<!-- /ko -->
</tr>
<!-- /ko -->
<tr>
<th class="hint" colspan=100%> <!-- HINT: move colspan into "hint"-class when browsers are ready! -->
<small>
<b>Minimum:</b> <span data-bind="text: $root.output.minimums()[$index()]"></span>
</small>
</th>
</tr>
</tbody>
</table>
</div>
<!-- /ko -->
<!-- /ko -->
<!-- /ko -->
<!-- /ko -->
</div>
</div>