forked from pkgw/bloomdemo
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME
74 lines (62 loc) · 3.01 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
Bloom Filter Demo
Version: 1.0
========================================
This Git repository contains a very small Python program, "indict",
which tells you whether a word might be in the system
dictionary. Sample usage is:
$ ./indict barn bern birn born burn
barn MIGHT BE in the dictionary
bern is DEFINITELY NOT in the dictionary
birn MIGHT BE in the dictionary
born MIGHT BE in the dictionary
burn MIGHT BE in the dictionary
"indict" takes one potential option, "-s", which tells it not to
print out information about words that are definitely not in the
dictionary:
$ ./indict -s barn bern birn
barn MIGHT BE in the dictionary
birn MIGHT BE in the dictionary
Implementation
========================================
This program is implemented using a data structure called a Bloom
filter. YOU DON'T HAVE TO WORRY ABOUT HOW IT WORKS FOR THIS EXERCISE
but you may find it interesting. Bloom filters are essentially like
Python sets: you can add items to them and later test whether they
contain a given item. For a given number of items, a Bloom filter is
much more efficient in computation time and memory than a Python
set. The price of this efficiency is that a Bloom filter can yield
false positives: it may say that a word is in the dictionary when it
actually isn't. On the other hand, a Bloom filter will never yield
false negatives: it won't say that a word isn't in the dictionary when
it actually is.
"indict" takes one potential option, "-s", which tells it not to
print out information about words that are definitely not in the
dictionary:
$ ./indict -s barn bern birn
barn MIGHT BE in the dictionary
birn MIGHT BE in the dictionary
$
Bloom filters are useful when you have a large dictionary-type
database that is slow to scan. If you're going to check the database
for the presence of many keys that it may not contain, the filter can
help you avoid a large number of slow checks to the database itself.
NOTE WELL that this particular demo is not any faster than just
grepping the system dictionary (/usr/share/dict/words on traditional
Unix systems). This is because of the significant overhead of starting
up the Python interpreter and loading the Bloom filter data as well as
the simple nature of the dictionary "database". Furthermore, the
filter implementation that I slapped together in 20 minutes is
completely unoptimized. The best data structure is the one that you
don't use because you don't actually need it.
Repository Contents
========================================
.gitignore -- tells "git status" not to report boring files
README -- this document
bloom.py -- a simple Bloom filter implementation. Don't use it
for anything real, because if you actually have a problem that
calls for a Bloom filter, you'll almost surely need an
implementation that was actually designed with care.
dictbf.dat.gz -- the precomputed filter data. (It takes a long
time to compute the filter data from the dictionary.)
importdictdata -- the program to precompute the filter data
indict -- the main dictionary testing program