-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME
200 lines (171 loc) · 11.9 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
### README --- CS194 HMM PS 5
Jae Ryoo
Saba Khalilnaji
In this project we were responsible for implementing a Hidden Markov Model (HMM) from scratch.
Programming Language: Python
INSTRUCTIONS on compiling and running our code:
Get into the file HMM_Khalilnaji_Ryoo
on the command line$python main.py sequences.fasta initial_parameters.txt
you should get the following files:
estimated_parameters.txt
likelihoods.txt
decodings_initial.txt
decodings_estimated.txt
TROUBLESHOOTING:
NOTE: you will get print lines assuring you that our code is working but taking its
time :) Thanks!
Please email: [email protected] if you are having trouble running our code. Thanks!
_______________________________________________________________________________________________________________________
Instructor: Yun S. Song
Out: March 22, 2012
Due: April 16, 2012
Problem Set 5
UC Berkeley, CS194-1: Algorithms for Computational Biology (Spring 2012)
_________________________
Instructions:
HMM implementation
The goal of this problem set is to implement the key algorithms for HMM discussed in class. Throughout, we will consider
the following interesting biological application:
Meiotic recombination is an important biological mechanism common to most forms of life. As a consequence of
recombination, different positions on the same chromosome may have different genealogical histories. For example, given
a pair of homologous sequences, different positions may have different times (denoted TMRCA) to the most recent common
ancestor (MRCA), as illustrated in Figure 1. Recently, Li and Durbin (Nature, 475:493-496, 2011) used a hidden Markov
model to estimate the position-specific T_MRCA for a pair of sequences. The transition and emission probabilities in
their HMM arise from a stochastic genealogical process (called the coalescent), which you do not need to know to do the
problems described below.
-----------------------------------------------------------------------------------------
| || | | |
| || | | |
4 -| || | | |
| || | | |
| || | | |
| || | | |
3 -| || | | |
| || | | |
Time to | || | | |
MRCA | || | | |
2 -| || | | |
| .-..------------. .------------. | . |
| | | | | | | |
| | | | | | | |
1 -| | | | | | | |
| | | | | | | |
| | | | | | | |
| -------------------. .----. .-----------------. .-------- |
-----------------------------------------------------------------------------------------
| | | | | |
0 20 40 60 80 100
Position (kb)
Figure 1: Time to the most recent common ancestor along a pair of homologous sequences, each of length 100 kb. Time is
measured in units of 2N_e generations, where N_e is the so-called "effective" population size. For humans, an
approximate long-term effective population size is N_e = 10,000.
Instruction:
- You may use any of the following programming languages: C, C++, Java, Python, Perl, Ruby. Use only the standard
libraries for each language. Please provide a document detailing how one can compile and run your code. you should
submit your source code and answers to the questions below, via e-mail to both the instructor and the GSI.
- You are strongly encouraged to pair up with a fellow student in class.
- Download ps5data.tgz from the course webpage. Included in the tar archive is a file called sequences.fasta, which
contains a pair of DNA sequences of length L = 100,000 in FASTA format. Consider the following HMM:
- The observed symbol x_i in (SIGMA) = {I,D} at position 1 <= i <= L corresponds to whether the two sequences are
identical (I) or different (D) at that position.
- The hidden state q_i in S = {t_1, t_2, t_3, t_4} at position 1 <= i <= L corresponds to the T_MRCA at that
position.
- Assume that the hidden random variables {Q_i, 1 <= i <= L} form a homogeneous Markov chain, with transition
probabilities a_kl, for k,l in S.
- As usual, the probability of emitting symbol (sigma) in (SIGMA) from state k in S is denoted by e_k((sigma)). The
parameters of the model are (THETA) = {a_kl, e_k((sigma)), m_k}_{k,l in S; (sigma) in (SIGMA)}, where m_k denotes
the marginal probability P(Q_1 = k), which can be regarded as the transition probability a_begin,k.
Remark: We expect P(D|Q = t_j) > P(D|Q=t_i), for t_j > t_i. Why?
Problems:
1. Implement the forward and backward algorithms.
2. Implement the Baum-Welch algorithm and use it to estimate the parameters of the model. For initialization,
use the parameters (THETA)_initial provided in initial_parameters.txt. Store your estimated parameters
(THETA)_estimated in a file called estimated_parameters.txt.
3. In a file called likelihoods.txt, store the log-likelihoods for the initial parameters (THETA)_initial and
for your estimated parameters (THETA)_estimated.
4. Using the initial parameters (THETA)_initial, produce both Viterbi and posterior decodings, and compute the
posterior mean E[T_MRCA | x, (THETA)_initial] for each position. Assume that S = {0.32, 1.75, 4.54, 9.40}. To
identify which hidden state should correspond to which time, think about the remark mentioned above.
(a) Output your results to decodings_initial.txt in a 3-column format (Viterbi decoding, posterior
decoding, posterior mean).
(b) Plot your results, together with the true TMRCA in true_tmrca.txt. (In fact, Figure 1 shows the true
T_MRCA for the data you are analyzing.) Name your figure file plot_initial.pdf.
5. Using your estimated parameters estimated, produce both Viterbi and posterior decodings, and compute the
posterior mean E[T_MRCA | x, (THETA)_estimated] for each position.
(a) Output your results to decodings_estimated.txt in a 3-column format (Viterbi decoding, posterior
decoding, posterior mean).
(b) Plot your results, together with the true T_MRCA in true_tmrca.txt. Name your figure file
plot_estimated.pdf.
Additional exercise (not to be turned in): Try starting the Baum-Welch algorithm with different initial parameter
settings. Do you obtain the same final estimates?
Copyright 2012: Yun S. Song
_______________________________________________________________________________________________________________________
FILE OUTPUT INSTRUCTIONS:
For ease of grading, please use the output format described below.
Your program should take 2 input arguments:
The input sequence file: sequence.fasta
The initial parameters file: initial_parameters.txt
Your program should output 4 files:
estimated_parameters.txt
likelihoods.txt
decodings_initial.txt
decodings_estimated.txt
For any floating-point output, use the format "%.2e", which is scientific notation with 2 digits of precision after the
decimal. For example, 123.456 should be output as "1.23e+02". Note the lower case "e".
In C/C++, this would be
printf("%.4e", 123.45678);
and in Python, this would be
print "%.4e" % 123.45678
Label the 4 states 1, 2, 3, and 4, where 1 corresponds to coalescent time 0.32, 2 corresponds to 1.75, etc.
The format for the estimated_parameters.txt file is:
The first 4 lines are the probabilities for the initial state. Each line should have the state label (an integer from 1 to 4)
followed by a single space followed by the probability in the proper floating-point format.
The next 4 lines are the transition probabilities in matrix format. Each line should have 4 numbers, separated by _single_
spaces, using the proper floating-point format. Do not label the states in the transition matrix. It should be 4 lines, with
4 numbers on each line.
The next 4 lines are the emission probabilities. Each line should have the state label (an integer from 1 to 4) followed by
the emission probability for "I", in the proper floating-point format, followed by the emission probability for "D".
As an example, the initial_parameters.txt file in this format would look as follows.
1 6.03e-01
2 3.57e-01
3 3.89e-02
4 5.39e-04
1.00e+00 7.61e-05 8.28e-06 1.15e-07
1.28e-04 1.00e+00 8.49e-05 1.18e-06
1.28e-04 7.80e-04 9.99e-01 2.36e-05
1.28e-04 7.80e-04 1.70e-03 9.97e-01
1 9.96e-01 3.90e-03
2 9.84e-01 1.64e-02
3 9.60e-01 4.00e-02
4 9.22e-01 7.83e-02
As a quick check, your output file should be exactly 268 bytes.
Please do not forget the new line character at the end of the last line. This does _not_ mean the last line should be an empty
line; it means that the last line should have a new line character at the end.
The output file likelihoods.txt should have exactly 2 lines. The first line is the log likelihood for the initial parameters and
the second line is the log likelihood for the your estimated parameters. Again, use the proper floating-point format, and don't
forget the newline character at the end of the second line.
For example:
-100e+00
-100e+00
The output file decodings_initial.txt and decodings_estimated.txt have the same format.
Each file should have 100,000 lines, one for each site in the data. Each line should have 3 numbers. The first number is the
state
(an integer) corresponding to the Viterbi decoding, the second number is the state (an integer) corresponding to the posterior
decoding, and the third number is the posterior mean (a floating-point number). Use the above format for the floating-point
number. These numbers should be separated by _single_ spaces.
For example, the first 5 lines of decodings_initial.txt might look like this
1 1 8.23e-01
1 1 8.23e-01
1 1 8.23e-01
2 2 1.95e+00
2 3 2.33e+00
As a check, decodings_initial.txt and decodings_estimated.txt should each be exactly 1300000 bytes. (Each line is exactly 13 bytes,
including the newline character, and there are 100,000 lines.)
The plots for 4(b) and 5(b) may be generated however you choose. Those should _not_ be automatically generated by your program.
When writing to files, be sure that they are open for "write" and not for "append".
Let me know if you have any questions on this specification.
- Andrew Chan
_______________________________________________________________________________________________________________________
## Copyright 2012: Saba Khalilnaji, Jae Young Ryoo
##
## Keywords: CS194-1, CS194, HMM