ChangeLog

2014-02-07    <dyuret@ku.edu.tr>

	* Makefile: Trying to make things more standards compliant.
	- Pure c99 does not support: getopt, popen, exp10, strdup, getline.
	getopt(): _POSIX_C_SOURCE >= 2 || _XOPEN_SOURCE
	popen(), pclose(): _POSIX_C_SOURCE >= 2 || _XOPEN_SOURCE || _BSD_SOURCE || _SVID_SOURCE
	exp10(): _GNU_SOURCE
	strdup(): _SVID_SOURCE || _BSD_SOURCE || _XOPEN_SOURCE >= 500 || _XOPEN_SOURCE && _XOPEN_SOURCE_EXTENDED || /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
	getline(): /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200809L || _XOPEN_SOURCE >= 700 /* Before glibc 2.10: */ _GNU_SOURCE


2014-01-12  Deniz Yuret  <dyuret@ku.edu.tr>

	* test27.log: Using OMP_NUM_THREADS=30 on balina finished wsj in
	23m instead of 4h45m.  Output identical.

2014-01-11  Deniz Yuret  <dyuret@ku.edu.tr>

	* fastsubs-omp.c: OpenMP multithreaded version.  Tests with
	test.lm4.gz and the top 3000 sentences of wsj show that the time
	to complete fastsubs is 11 + 144/P where P is the number of
	threads up to about P=24.  After 24 there is no benefit in
	increasing the number of threads and after 40 the performance
	starts to deteriorate.  At the best time of 17 secs, about 9 secs
	is spent on things outside of fastsubs, i.e. read, write, alloc,
	free.  Note that memory accounting does not work because of the
	global variable _d_memsize.  Turning off memory accounting if
	NDEBUG.

2014-01-09  Deniz Yuret  <dyuret@ku.edu.tr>

	* test26.out: Ran it on the whole wsj dataset.  Checked for
	unordered substitutes using test26-check.pl.  None detected.
	Checked the old wsj.sub.gz.  31 lines have unordered substitutes
	(saved in test26.unordered).  There are a total of 66 lines that
	differ between the two files after normalization.  DONE: lm_free
	does not seem to get back to 0 memory in lm-test.c.  It turns out
	strhash was not getting freed.

2014-01-05  Deniz Yuret  <dyuret@ku.edu.tr>

	* test25.out: Ran it on the Sangyo sentence and it fixed the badly
	ordered substitutes.

	* test24.out: Turned on logB_heaps, still same output for zcat
	/work/upos/data/wsj.test1M.tok.gz | head -1 | fastsubs -n 100
	/work/upos/run/wsj.lm.gz > test24.out

	* test23.out: Rewrote fastsubs to use the logB_heap.  Confirmed
	same output when logB_heaps are ignored.

2014-01-01  Deniz Yuret  <dyuret@ku.edu.tr>

	* test22.out: Implemented logB_heap.

	* test21.out: Got rid of lmheap.c.

	* test20.out: Got rid of static variables in lmheap.  Output same
	as test.out (except one switch).

	* TODO:
	- Modify lmheap to have logB queues for positive values.
	- Add a sum node so we can use logB queues, fix bug, publish.
	- Remove static variables and implement multithreading.
	- Separate the node types into different structures.
	- Make lmheap more efficient by using pointer-and-hole keys.
	- Implement heaps in dlib (2^n?)
	- Put dot in dlib.

2013-12-30  Deniz Yuret  <dyuret@ku.edu.tr>

	* test19.out: Debugging the difference on the Sangyo sentence.
	Old fastsubs output (/work/upos/bin/fastsubs):
	[t=38 m=1618706432] lmheap_init start
	[t=146 m=4031504384] lmheap_init done
	[t=147 m=1832644608] calls=9 subs/call=100 pops/call=5184.44

	New fastsubs output:
	[1m32.65s 1,117,273,488 1,129,984,000b] lm_init done: logP=16x(27427357/33554432) logB=16x(7119768/8388608) toks=83530373
	[3m30.01s 3,663,812,672 3,676,844,032b] lmheap_init done: 16x(29792712/67108864), hpairs=83530373
	[3m30.55s 397,744 515,514,368b] calls=9 subs/call=100 pops/call=4512.44

	DONE: Why does lm_init take three times as long in the new
	version?  Using zcat 10 times faster than zlib (6 secs vs 50
	secs).  Is it my getline?  No, if I use my getline for regular
	files it is as fast as the GNU getline.  This seems to be a
	problem only with the old version of zlib1g (1.2.3.4) installed on
	altay and is documented on the internet.  The new version on
	istanbul (1.2.7) does not have this problem.

	Output difference:
	Douglas -14.65023136 is missing in the new output.
	instead ': -15.43919945' is added at the end.

	Things are out of order in both old and new output!  (see Capital
	below)
	test19.out: Sangyo	Financial -11.73654747	<unk> -12.17256355	Bond -12.34492683	Capital -11.86773300	Mortgage -12.58209801
	test19old.out:Sangyo	Financial -11.73654747	<unk> -12.17256355	Bond -12.34492683	Capital -11.86773300	Mortgage -12.58209801

	test19subs.out: is the output by the old subs, it has the
	probabilities for all the words.

	test19sangyo.out,sort: is the Sangyo substitutes sorted.  The
	values seem ok but the order is wrong.

	* test18.out: Did the full wsj test set.
	[1m31.34s 1,117,273,488 1,129,988,096b] lm_init done: logP=16x(27427357/33554432) logB=16x(7119768/8388608) toks=83530373
	[3m15.89s 3,663,812,672 3,676,848,128b] lmheap_init done: 16x(29792712/67108864), hpairs=83530373
	[4h44m25.47s 3,664,899,608 3,676,852,224b] free lmheap...
	[4h44m25.61s 458,928 515,518,464b] calls=1222974 subs/call=100 pops/call=1510.1
	TODO: investigate leftover memory

	Most differences from the original wsj.sub.gz is the switching of
	equal probability substitutes, except for the 23 following
	suspicious cases in test18.diff.
	TODO: compile subs.c to see who is right.

	First comma in: If government or private watchdogs insist , however , on introducing greater friction between the markets -LRB- limits on price moves , two-tiered execution , higher margin requirements , taxation , etc. -RRB- , the end loser will be the markets themselves .

	Sangyo in: Koizumi Sangyo Corp . -LRB- Japan -RRB- --

	System in: A $ 550 million offering of Turner Broadcasting System Inc . high-yield securities sold last week by Drexel was increased $ 50 million because of strong demand .


2013-12-29  Deniz Yuret  <dyuret@ku.edu.tr>

	* TODO: all optimization will probably halve the memory
	requirement.  The basic cost, sorted word arrays for each ngram
	position (8 bytes x total number of words in lm), is irreducible.
	Faster sorting may reduce init time, but currently at 3 minutes it
	is not a big cost.  The speed at which we spit out substitutes
	will not get effected.

	* test17.out: I moved the hash resize from 75% to 87.5% to gain
	some memory.
	[1m31.22s 1,117,273,488 1,129,988,096b] lm_init done: logP=16x(27427357/33554432) logB=16x(7119768/8388608) toks=83530373
	[3m15.81s 3,663,812,672 3,676,852,224b] lmheap_init done: 16x(29792712/67108864), hpairs=83530373

	* test16.out: Printing out hash sizes.
	[1m34.73s 1,788,362,128 1,801,076,736b] lm_init done: logP=16x(27427357/67108864) logB=16x(7119768/16777216) toks=83530373
	[3m14.47s 4,334,901,120 4,346,888,192b] lmheap_init done: 16x(29792712/67108864), hpairs=83530373

	TODO: add dots to dlib.  Replace % with bit manipulation.  Arg to
	dot is power of 2.  Use macro.  Add newline.  See old macro.

	TODO: Separate ngram tables (array+hash) for each ngram order.
	Order 1 already done in symtable.  Hash can use 4 byte keys if
	index into array, not pointer.  Tough for strings as they vary in
	size.  Maybe just keep symtable.  lmheap could also have 4 byte
	keys as array indices into an array of structs.  But we need
	pointer to variable length hpairs.  Think of this 4 byte keys some
	more.

	* test15.out: Liberated all code from glib and procinfo.  TODO:
	fix subs.c. Output same as test.out.  DONE: remove mallinfo, it
	uses 32 bit integers.  We have our own accounting and proc vss
	output now.  New output has time, dallocsize, vsize.
	[3m15.05s 4,334,901,120 4,346,884,096b] lmheap_init done

	zcat /work/upos/data/wsj.test1M.tok.gz | head -1 | fastsubs -n 100 /work/upos/run/wsj.lm.gz > test15.out 2> test15.log

	* test14.out: trying to find the best way to report memory.  Here
	are before and after lm load numbers (page size 4096 bytes, lm
	load time 96 secs):
	[t=195 m=4352221184] lmheap_init done

	sbrk(0): 8M 456M  (x4 = 1825M 1793M)
	stat.vss: 15M 1806M (bytes)
	stat.rss: 206 437474 (pages)
	stat.utime: 0 9366
	stat.stime: 0 147
	statm.size: 3891 441010 (pages=15M 1806M bytes)
	statm.data: 520 437639 (pages=2M 1792M bytes data+stack)
	status.1: VmPeak=VmSize=15564kB VmData=1224kB VmStk=856kB VmExe=32kB (text) VmLib=3172kB (shared libs)
	status.2: VmPeak=2244M VmSize=1764M VmData=1749M VmStk=856kB VmExe=32kB VmLib=3172kB
	mallinfo.arena: 135K 448M (system bytes)
	mallinfo.uordblks: 1408 445M (in use bytes)
	mallinfo.hblkhd: 1052K 1343M (arena+hblkhd=1791627264)
	malloc_stats.total system bytes: (arena+hblkhd=1791627264)
	malloc_stats.total in use bytes: (uordblks+hblkhd=1788366864)

	* test13.out: using dalloc for darr_t and D_HASH.  The only system
	malloc left is in file open and gets.  Code gets SEGFAULT with -O3
	(specifically with the -ftree-vectorize) at _mknull(data[i++]).
	Could be alignment related. -O2 or less is fine.  However the
	bigger problem is excessive memory use without real free or
	realloc during array/hash resizing, undoing it:
	[t=200 m=6773850112] lmheap_init done

	* test12.out: cleanup in dlib.c dlib.h.
	[t=194 m=4352221184] lmheap_init done

	* test11.out: testing fnv1a applied to ngrams.
	[t=195 m=4353826816] lmheap_init done

2013-12-28  Deniz Yuret  <dyuret@ku.edu.tr>

	* test10.out: cleaned glib from fastsubs.
	[t=186 m=4355215360] lmheap_init done

	* test09.out: cleaned glib from lm and lmheap.  Memory use
	higher probably due to hash doubling.  Output identical to
	test.out.
	[t=186 m=4355215360] lmheap_init done

	* lmheap.c: Removing glib.
	TODO: try heapify or a faster sort.

	DONE: get gives pointer, forhash gives element, confusing?
	forhash should give pointers, otherwise we cannot modify each
	value, for example.

	* lm.c: Removing glib.
	TODO: dlib needs chomp, or split needs to take a set of strings
	like strtok:
	size_t len = strlen(tok[1]);
	if (tok[1][len-1] == '\n') tok[1][len-1] = '\0';

	TODO: die("Only one LM is allowed."); // why?
	fastsubs takes an lm and initializes its internal state which
	includes a static lmheap.  Should rearrange the code so lm and
	lmheap can be initialized / freed by the caller.

	Memory with new D_HASH:
	$ lm-test /work/upos/run/wsj.lm.gz
	[t=0 m=14077952] Loading model file /work/upos/run/wsj.lm.gz
	[t=94 m=1807220736] ngram order = 4
	[t=94 m=1807224832] logP=27427357
	[t=94 m=1807228928] logB=7119768

	Memory with glib hash from test08:
	[t=0 m=14876672] Loading model file /work/upos/run/wsj.lm.gz
	[t=91 m=1610895360] vocab:78499

	DONE: memory I get from mallinfo seems off.  Should debug.  The
	arrows are reported by mallinfo:
	[t=0 m=14077952] Loading model file wsj.lm.gz
	==> [t=0 m=832] sizeof(_lp_t)=16
	==> [t=93 m=447640640] sizeof(_lp_t)=16
	[t=93 m=1807220736] ngram order = 4
	[t=93 m=1807224832] logP=27427357/67108864
	[t=93 m=1807228928] logB=7119768/16777216
	[t=93 m=1807233024] ==> Enter ngram:

	Counts confirm the accuracy of new lm:
	$ zcat wsj.lm.gz | head
	ngram 1=78499
	ngram 2=8587685
	ngram 3=8768188
	ngram 4=9992985
	total 27427357
	$ zcat wsj.lm.gz | awk -F'\t' '{print NF}' | rcount
	20307589	2
	7119768	3

	TODO: try resizing less often (one more bit shift?).  ngram size
	is 4 bytes, it could fit into one (in fact 3 bits).  logp and logb
	can be part of the struct (but most logb is empty!).  struct packs
	8+4 into 16 bytes so using float is useless.  Using 4 byte
	pointers for ngrams (like sym_t) is not a good idea there could be
	more than 4B?  Do the optimization after converting lmheap as
	well.  We would not need length if we just kept the ngrams of
	different lengths separate.  lmheap keys can point to full ngrams
	and the index for wildcard instead of copying the whole ngram.

	* test08.out: Heap is cleaned of glib.
	[t=190 m=4006862848] lmheap_init done

	* test07.out: Got rid of minialloc and foreach, replacing them
	with dalloc and dlib.  It did help with memory a bit.  Next steps:
	DONE: lm, lmheap, fastsubs, fastsubs-main
	DONE: *-test, get rid of procinfo.
	TODO: implement heap.c in dlib?
	[t=187 m=4006862848] lmheap_init done

	* test06.out: Inlined dalloc.  No difference.
	[t=194 m=4019134464] lmheap_init done

	* test05.out: Tried to optimize dalloc a bit.  Reallocing leftover
	memory did not make a difference.  Should make dalloc inline.
	Need declare internal vars and export them.  Should port
	everything that uses minialloc to dalloc.
	[t=193 m=4019134464] lmheap_init done

	* test04.out: Replaced dalloc with regular malloc.  Cleaned
	sentence.h and sentence.c to use standard C99 types.  Same output
	as test.out.  This definitely spends more memory.  Going back to
	dalloc.
	[t=197 m=4837580800] lmheap_init done

	* test03.out: cleaned up ngram.c and ngram.h from glib.  Same
	output as test.out.  Replaced glib random with stdlib random.
	Replaced minialloc with dalloc.  done: Try having a non-random
	hash function.
	[t=192 m=4019134464] lmheap_init done

	* test02.out: this is the output after we converted GQuark to
	sym_t in token.h.
	[t=184 m=4013473792] lmheap_init done

	* test01.out: this is the output with latest version of fastsubs.
	Identical to test.out except one switch of equal substitutes.
	[t=195 m=4032561152] lmheap_init done

	* test.out: this is the output for first sentence from
	/work/upos/run/wsj.sub.gz.