2014-02-07 <[email protected]>
* Makefile: Trying to make things more standards compliant.
- Pure C99 does not support: getopt, popen, exp10, strdup, getline.
getopt(): _POSIX_C_SOURCE >= 2 || _XOPEN_SOURCE
popen(), pclose(): _POSIX_C_SOURCE >= 2 || _XOPEN_SOURCE || _BSD_SOURCE || _SVID_SOURCE
exp10(): _GNU_SOURCE
strdup(): _SVID_SOURCE || _BSD_SOURCE || _XOPEN_SOURCE >= 500 || _XOPEN_SOURCE && _XOPEN_SOURCE_EXTENDED || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
getline(): /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200809L || _XOPEN_SOURCE >= 700 /* Before glibc 2.10: */ _GNU_SOURCE
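The requirements listed above can be met in a source file by defining the feature test macro before the first system header. A minimal sketch, assuming glibc 2.10 or later so that _POSIX_C_SOURCE >= 200809L exposes both strdup() and getline(); the helper first_line_copy is hypothetical and not part of fastsubs:

```c
/* Must come before any #include so the headers see it. */
#define _POSIX_C_SOURCE 200809L

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: read the first line of fp with getline() and
   return a strdup()'ed copy without the trailing newline.
   Returns NULL on EOF or allocation failure; the caller frees the result. */
char *first_line_copy(FILE *fp) {
    char *buf = NULL;
    size_t cap = 0;
    ssize_t len = getline(&buf, &cap, fp);
    if (len < 0) { free(buf); return NULL; }
    if (len > 0 && buf[len - 1] == '\n') buf[len - 1] = '\0';
    char *copy = strdup(buf);
    free(buf);
    return copy;
}
```

The same effect can be had by passing -D_POSIX_C_SOURCE=200809L from the Makefile, which keeps the sources free of macro definitions. exp10() would still need _GNU_SOURCE, as noted above.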
2014-01-12 Deniz Yuret <[email protected]>
* test27.log: Using OMP_NUM_THREADS=30 on balina finished wsj in
23m instead of 4h45m. Output identical.
2014-01-11 Deniz Yuret <[email protected]>
* fastsubs-omp.c: OpenMP multithreaded version. Tests with
test.lm4.gz and the top 3000 sentences of wsj show that the time
to complete fastsubs is 11 + 144/P where P is the number of
threads up to about P=24. After 24 there is no benefit in
increasing the number of threads and after 40 the performance
starts to deteriorate. At the best time of 17 secs, about 9 secs
is spent on things outside of fastsubs, i.e. read, write, alloc,
free. Note that memory accounting does not work because of the
global variable _d_memsize. Turning off memory accounting if
NDEBUG.
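The fit above has the shape of a fixed serial cost plus a parallel cost divided evenly among threads. A small sketch of that arithmetic, assuming the fit 11 + 144/P holds exactly up to P=24 (the function names are hypothetical):

```c
#include <stdio.h>

/* Predicted wall-clock seconds from the fit reported in this entry:
   11 s that do not shrink with threads plus 144 s split among P threads.
   The entry says the fit only holds up to about P=24. */
double predicted_secs(int threads) {
    return 11.0 + 144.0 / threads;
}

/* Speedup relative to a single thread under the same fit. */
double predicted_speedup(int threads) {
    return predicted_secs(1) / predicted_secs(threads);
}

/* Print the predicted time for 1..max_threads threads. */
void print_predictions(int max_threads) {
    for (int p = 1; p <= max_threads; p++)
        printf("P=%d %.2f secs\n", p, predicted_secs(p));
}
```

With P=24 this gives 11 + 6 = 17 secs, matching the best time reported. Under this fit the speedup can never exceed 155/11, roughly 14x, however many threads are added.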
2014-01-09 Deniz Yuret <[email protected]>
* test26.out: Ran it on the whole wsj dataset. Checked for
unordered substitutes using test26-check.pl. None detected.
Checked the old wsj.sub.gz. 31 lines have unordered substitutes
(saved in test26.unordered). There are a total of 66 lines that
differ between the two files after normalization. DONE: lm_free
does not seem to get back to 0 memory in lm-test.c. It turns out
strhash was not getting freed.
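The ordering check described above amounts to verifying that the log probabilities on each line never increase from left to right. A minimal sketch in C; test26-check.pl itself is a Perl script whose contents are not shown here, so the function name and interface below are assumptions:

```c
#include <stddef.h>

/* Return the index of the first substitute whose log probability is
   higher than the one before it, or n if the list is in non-increasing
   order. Equal neighbours are accepted, since ties can appear in
   either order. */
size_t first_unordered(const double *logp, size_t n) {
    for (size_t i = 1; i < n; i++) {
        if (logp[i] > logp[i - 1]) return i;
    }
    return n;
}
```

Applied to the Sangyo values recorded on 2013-12-30 (-11.73654747, -12.17256355, -12.34492683, -11.86773300, -12.58209801), it returns 3, the position of Capital.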
2014-01-05 Deniz Yuret <[email protected]>
* test25.out: Ran it on the Sangyo sentence and it fixed the badly
ordered substitutes.
* test24.out: Turned on logB_heaps, still same output for zcat
/work/upos/data/wsj.test1M.tok.gz | head -1 | fastsubs -n 100
/work/upos/run/wsj.lm.gz > test24.out
* test23.out: Rewrote fastsubs to use the logB_heap. Confirmed
same output when logB_heaps are ignored.
2014-01-01 Deniz Yuret <[email protected]>
* test22.out: Implemented logB_heap.
* test21.out: Got rid of lmheap.c.
* test20.out: Got rid of static variables in lmheap. Output same
as test.out (except one switch).
* TODO:
- Modify lmheap to have logB queues for positive values.
- Add a sum node so we can use logB queues, fix bug, publish.
- Remove static variables and implement multithreading.
- Separate the node types into different structures.
- Make lmheap more efficient by using pointer-and-hole keys.
- Implement heaps in dlib (2^n?)
- Put dot in dlib.
2013-12-30 Deniz Yuret <[email protected]>
* test19.out: Debugging the difference on the Sangyo sentence.
Old fastsubs output (/work/upos/bin/fastsubs):
[t=38 m=1618706432] lmheap_init start
[t=146 m=4031504384] lmheap_init done
[t=147 m=1832644608] calls=9 subs/call=100 pops/call=5184.44
New fastsubs output:
[1m32.65s 1,117,273,488 1,129,984,000b] lm_init done: logP=16x(27427357/33554432) logB=16x(7119768/8388608) toks=83530373
[3m30.01s 3,663,812,672 3,676,844,032b] lmheap_init done: 16x(29792712/67108864), hpairs=83530373
[3m30.55s 397,744 515,514,368b] calls=9 subs/call=100 pops/call=4512.44
DONE: Why does lm_init take three times as long in the new
version? Using zcat is 10 times faster than zlib (6 secs vs 50
secs). Is it my getline? No, if I use my getline for regular
files it is as fast as the GNU getline. This seems to be a
problem only with the old version of zlib1g (1.2.3.4) installed on
altay and is documented on the internet. The new version on
istanbul (1.2.7) does not have this problem.
Output difference:
Douglas -14.65023136 is missing in the new output.
Instead, ': -15.43919945' is added at the end.
Things are out of order in both old and new output! (see Capital
below)
test19.out: Sangyo Financial -11.73654747 <unk> -12.17256355 Bond -12.34492683 Capital -11.86773300 Mortgage -12.58209801
test19old.out:Sangyo Financial -11.73654747 <unk> -12.17256355 Bond -12.34492683 Capital -11.86773300 Mortgage -12.58209801
test19subs.out: is the output by the old subs, it has the
probabilities for all the words.
test19sangyo.out,sort: is the Sangyo substitutes sorted. The
values seem ok but the order is wrong.
* test18.out: Did the full wsj test set.
[1m31.34s 1,117,273,488 1,129,988,096b] lm_init done: logP=16x(27427357/33554432) logB=16x(7119768/8388608) toks=83530373
[3m15.89s 3,663,812,672 3,676,848,128b] lmheap_init done: 16x(29792712/67108864), hpairs=83530373
[4h44m25.47s 3,664,899,608 3,676,852,224b] free lmheap...
[4h44m25.61s 458,928 515,518,464b] calls=1222974 subs/call=100 pops/call=1510.1
TODO: investigate leftover memory
Most differences from the original wsj.sub.gz are switches of
equal probability substitutes, except for the following 23
suspicious cases in test18.diff.
TODO: compile subs.c to see who is right.
First comma in: If government or private watchdogs insist , however , on introducing greater friction between the markets -LRB- limits on price moves , two-tiered execution , higher margin requirements , taxation , etc. -RRB- , the end loser will be the markets themselves .
Sangyo in: Koizumi Sangyo Corp . -LRB- Japan -RRB- --
System in: A $ 550 million offering of Turner Broadcasting System Inc . high-yield securities sold last week by Drexel was increased $ 50 million because of strong demand .
2013-12-29 Deniz Yuret <[email protected]>
* TODO: all optimization will probably halve the memory
requirement. The basic cost, sorted word arrays for each ngram
position (8 bytes x total number of words in lm), is irreducible.
Faster sorting may reduce init time, but currently at 3 minutes it
is not a big cost. The speed at which we spit out substitutes
will not be affected.
* test17.out: I moved the hash resize from 75% to 87.5% to gain
some memory.
[1m31.22s 1,117,273,488 1,129,988,096b] lm_init done: logP=16x(27427357/33554432) logB=16x(7119768/8388608) toks=83530373
[3m15.81s 3,663,812,672 3,676,852,224b] lmheap_init done: 16x(29792712/67108864), hpairs=83530373
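The effect of the threshold change can be reproduced from the numbers in the log lines. A sketch, assuming the tables are power-of-two sized and doubled whenever the entry count exceeds the load limit; the function is hypothetical, not the dlib code:

```c
#include <stdint.h>

/* Smallest power-of-two capacity that holds n entries without
   exceeding a load limit of 1 - 1/2^shift:
   shift=2 gives 75%, shift=3 gives 87.5%. */
uint64_t table_capacity(uint64_t n, unsigned shift) {
    uint64_t cap = 1;
    while (n > cap - (cap >> shift)) cap <<= 1;
    return cap;
}
```

For logP (27427357 entries) this gives 67108864 slots at 75% and 33554432 at 87.5%; for logB (7119768 entries) it gives 16777216 and 8388608, matching the sizes printed in test16 and test17. At 16 bytes per slot the saving is 16 x (33554432 + 8388608) = 671,088,640 bytes, which is exactly the drop in the dallocsize column of the lm_init lines (1,788,362,128 to 1,117,273,488).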
* test16.out: Printing out hash sizes.
[1m34.73s 1,788,362,128 1,801,076,736b] lm_init done: logP=16x(27427357/67108864) logB=16x(7119768/16777216) toks=83530373
[3m14.47s 4,334,901,120 4,346,888,192b] lmheap_init done: 16x(29792712/67108864), hpairs=83530373
TODO: add dots to dlib. Replace % with bit manipulation. Arg to
dot is power of 2. Use macro. Add newline. See old macro.
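The bit manipulation mentioned above relies on the identity i % n == (i & (n - 1)) for unsigned i when n is a power of two. A sketch of a progress-dot macro built on it; the macro names and exact behaviour are assumptions, since the old macro is not shown here:

```c
#include <stdio.h>
#include <stdint.h>

/* True when i is a multiple of n, for n a power of two.
   Uses a mask instead of the % operator. */
#define IS_MULTIPLE_POW2(i, n) \
    ((((uint64_t)(i)) & ((uint64_t)(n) - 1)) == 0)

/* Hypothetical progress reporter: print a dot to stderr every n items. */
#define PROGRESS_DOT(i, n) \
    do { if (IS_MULTIPLE_POW2((i), (n))) fputc('.', stderr); } while (0)
```

The mask form gives wrong answers when n is not a power of two, so the argument restriction noted in the TODO is essential.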
TODO: Separate ngram tables (array+hash) for each ngram order.
Order 1 already done in symtable. Hash can use 4 byte keys if
index into array, not pointer. Tough for strings as they vary in
size. Maybe just keep symtable. lmheap could also have 4 byte
keys as array indices into an array of structs. But we need
pointer to variable length hpairs. Think about these 4 byte keys
some more.
* test15.out: Liberated all code from glib and procinfo. TODO:
fix subs.c. Output same as test.out. DONE: remove mallinfo, it
uses 32 bit integers. We have our own accounting and proc vss
output now. New output has time, dallocsize, vsize.
[3m15.05s 4,334,901,120 4,346,884,096b] lmheap_init done
zcat /work/upos/data/wsj.test1M.tok.gz | head -1 | fastsubs -n 100 /work/upos/run/wsj.lm.gz > test15.out 2> test15.log
* test14.out: trying to find the best way to report memory. Here
are before and after lm load numbers (page size 4096 bytes, lm
load time 96 secs):
[t=195 m=4352221184] lmheap_init done
sbrk(0): 8M 456M (x4 = 1825M 1793M)
stat.vss: 15M 1806M (bytes)
stat.rss: 206 437474 (pages)
stat.utime: 0 9366
stat.stime: 0 147
statm.size: 3891 441010 (pages=15M 1806M bytes)
statm.data: 520 437639 (pages=2M 1792M bytes data+stack)
status.1: VmPeak=VmSize=15564kB VmData=1224kB VmStk=856kB VmExe=32kB (text) VmLib=3172kB (shared libs)
status.2: VmPeak=2244M VmSize=1764M VmData=1749M VmStk=856kB VmExe=32kB VmLib=3172kB
mallinfo.arena: 135K 448M (system bytes)
mallinfo.uordblks: 1408 445M (in use bytes)
mallinfo.hblkhd: 1052K 1343M (arena+hblkhd=1791627264)
malloc_stats.total system bytes: (arena+hblkhd=1791627264)
malloc_stats.total in use bytes: (uordblks+hblkhd=1788366864)
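The statm figures above are in pages and have to be multiplied by the page size before they can be compared with the byte counts. A sketch of reading the first field of /proc/self/statm on Linux; the function names are hypothetical:

```c
#include <stdio.h>
#include <unistd.h>

/* Convert a page count to bytes. */
unsigned long long pages_to_bytes(unsigned long long pages, long page_size) {
    return pages * (unsigned long long)page_size;
}

/* Total program size in bytes from /proc/self/statm (Linux only).
   Returns 0 if the file cannot be read. */
unsigned long long statm_size_bytes(void) {
    unsigned long long pages = 0;
    FILE *fp = fopen("/proc/self/statm", "r");
    if (fp == NULL) return 0;
    if (fscanf(fp, "%llu", &pages) != 1) pages = 0;
    fclose(fp);
    return pages_to_bytes(pages, sysconf(_SC_PAGESIZE));
}
```

With the 4096-byte page size stated in this entry, pages_to_bytes(441010, 4096) is 1,806,376,960, i.e. the 1806M shown for statm.size, so M in this table means 10^6 bytes.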
* test13.out: using dalloc for darr_t and D_HASH. The only system
malloc left is in file open and gets. Code gets SEGFAULT with -O3
(specifically with -ftree-vectorize) at _mknull(data[i++]).
Could be alignment related. -O2 or less is fine. However the
bigger problem is excessive memory use without real free or
realloc during array/hash resizing, undoing it:
[t=200 m=6773850112] lmheap_init done
* test12.out: cleanup in dlib.c dlib.h.
[t=194 m=4352221184] lmheap_init done
* test11.out: testing fnv1a applied to ngrams.
[t=195 m=4353826816] lmheap_init done
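For reference, a sketch of 64-bit FNV-1a over the bytes of an ngram, assuming an ngram is stored as an array of 32-bit token ids. The key layout actually used in test11 is not shown here, and the function names are hypothetical. The offset basis and prime are the standard published FNV constants:

```c
#include <stddef.h>
#include <stdint.h>

#define FNV1A_OFFSET_64 14695981039346656037ULL
#define FNV1A_PRIME_64  1099511628211ULL

/* 64-bit FNV-1a over an arbitrary byte buffer:
   xor each byte into the hash, then multiply by the prime. */
uint64_t fnv1a64(const void *data, size_t len) {
    const unsigned char *p = data;
    uint64_t h = FNV1A_OFFSET_64;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= FNV1A_PRIME_64;
    }
    return h;
}

/* Hash an ngram given as n token ids. */
uint64_t ngram_hash(const uint32_t *tokens, size_t n) {
    return fnv1a64(tokens, n * sizeof tokens[0]);
}
```

Because the hash is a fixed function of the key bytes, the same ngram always lands in the same slot across runs, unlike a randomly seeded hash.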
2013-12-28 Deniz Yuret <[email protected]>
* test10.out: cleaned glib from fastsubs.
[t=186 m=4355215360] lmheap_init done
* test09.out: cleaned glib from lm and lmheap. Memory use
higher probably due to hash doubling. Output identical to
test.out.
[t=186 m=4355215360] lmheap_init done
* lmheap.c: Removing glib.
TODO: try heapify or a faster sort.
DONE: get gives pointer, forhash gives element, confusing?
forhash should give pointers, otherwise we cannot modify each
value, for example.
* lm.c: Removing glib.
TODO: dlib needs chomp, or split needs to take a set of strings
like strtok:
size_t len = strlen(tok[1]);
if (len > 0 && tok[1][len-1] == '\n') tok[1][len-1] = '\0';
TODO: die("Only one LM is allowed."); // why?
fastsubs takes an lm and initializes its internal state which
includes a static lmheap. Should rearrange the code so lm and
lmheap can be initialized / freed by the caller.
Memory with new D_HASH:
$ lm-test /work/upos/run/wsj.lm.gz
[t=0 m=14077952] Loading model file /work/upos/run/wsj.lm.gz
[t=94 m=1807220736] ngram order = 4
[t=94 m=1807224832] logP=27427357
[t=94 m=1807228928] logB=7119768
Memory with glib hash from test08:
[t=0 m=14876672] Loading model file /work/upos/run/wsj.lm.gz
[t=91 m=1610895360] vocab:78499
DONE: memory I get from mallinfo seems off. Should debug. The
arrows are reported by mallinfo:
[t=0 m=14077952] Loading model file wsj.lm.gz
==> [t=0 m=832] sizeof(_lp_t)=16
==> [t=93 m=447640640] sizeof(_lp_t)=16
[t=93 m=1807220736] ngram order = 4
[t=93 m=1807224832] logP=27427357/67108864
[t=93 m=1807228928] logB=7119768/16777216
[t=93 m=1807233024] ==> Enter ngram:
Counts confirm the accuracy of new lm:
$ zcat wsj.lm.gz | head
ngram 1=78499
ngram 2=8587685
ngram 3=8768188
ngram 4=9992985
total 27427357
$ zcat wsj.lm.gz | awk -F'\t' '{print NF}' | rcount
20307589 2
7119768 3
TODO: try resizing less often (one more bit shift?). ngram size
is 4 bytes, it could fit into one (in fact 3 bits). logp and logb
can be part of the struct (but most logb is empty!). struct packs
8+4 into 16 bytes so using float is useless. Using 4 byte
pointers for ngrams (like sym_t) is not a good idea; there could
be more than 4B? Do the optimization after converting lmheap as
well. We would not need length if we just kept the ngrams of
different lengths separate. lmheap keys can point to full ngrams
and the index for wildcard instead of copying the whole ngram.
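The remark about float being useless follows from alignment padding. A sketch, assuming a typical 64-bit ABI with 8-byte pointers; the struct layouts are illustrative guesses, not the real _lp_t definition:

```c
#include <stddef.h>

/* An 8-byte pointer followed by a 4-byte float: the compiler pads the
   struct to a multiple of the pointer alignment, so 4 bytes are wasted. */
struct entry_float  { const void *ngram; float  logp; };

/* The same struct with a double uses the padding for real data. */
struct entry_double { const void *ngram; double logp; };

/* Bytes of padding at the end of entry_float. */
size_t entry_float_padding(void) {
    return sizeof(struct entry_float)
         - (offsetof(struct entry_float, logp) + sizeof(float));
}
```

On x86-64 Linux both structs are 16 bytes, which agrees with the sizeof(_lp_t)=16 reported above.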
* test08.out: Heap is cleaned of glib.
[t=190 m=4006862848] lmheap_init done
* test07.out: Got rid of minialloc and foreach, replacing them
with dalloc and dlib. It did help with memory a bit. Next steps:
DONE: lm, lmheap, fastsubs, fastsubs-main
DONE: *-test, get rid of procinfo.
TODO: implement heap.c in dlib?
[t=187 m=4006862848] lmheap_init done
* test06.out: Inlined dalloc. No difference.
[t=194 m=4019134464] lmheap_init done
* test05.out: Tried to optimize dalloc a bit. Reallocing leftover
memory did not make a difference. Should make dalloc inline.
Need to declare internal vars and export them. Should port
everything that uses minialloc to dalloc.
[t=193 m=4019134464] lmheap_init done
* test04.out: Replaced dalloc with regular malloc. Cleaned
sentence.h and sentence.c to use standard C99 types. Same output
as test.out. This definitely spends more memory. Going back to
dalloc.
[t=197 m=4837580800] lmheap_init done
* test03.out: cleaned up ngram.c and ngram.h from glib. Same
output as test.out. Replaced glib random with stdlib random.
Replaced minialloc with dalloc. DONE: Try having a non-random
hash function.
[t=192 m=4019134464] lmheap_init done
* test02.out: this is the output after we converted GQuark to
sym_t in token.h.
[t=184 m=4013473792] lmheap_init done
* test01.out: this is the output with latest version of fastsubs.
Identical to test.out except one switch of equal substitutes.
[t=195 m=4032561152] lmheap_init done
* test.out: this is the output for first sentence from
/work/upos/run/wsj.sub.gz.