Source code for the Witan algorithm and accompanying experimentation framework, as used in the paper "Witan: Unsupervised Labelling Function Generation for Assisted Data Programming".
In order to run this project, you must have the following dependencies installed on your host:
- Docker Community Edition (>= 17.09)
- Docker Compose (>= 1.17) (Included with Docker Desktop on Mac/Windows)
- Make (technically optional if you don't mind running the commands in the Makefile directly)
Note: If you use Git Bash on Windows and also install make into Git Bash, then you should be able to run this project on Windows.
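Before going further, it can help to confirm that the tools listed above are on your `PATH`. The following is a minimal sketch and not part of the project: `check_tool` is a hypothetical helper, and the loop assumes the Compose v1 binary name `docker-compose` (newer Docker installations may expose Compose as `docker compose` instead).

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

# Assumed binary names for the dependencies listed above.
for tool in docker docker-compose make; do
    check_tool "$tool"
done
```

This only checks that each tool is present. It does not verify the minimum versions listed above.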
- Ensure the dependencies listed above are installed.
- Run `make run` in this directory.
  - This will perform all Docker image build steps and dependency installations every time you run it, so that you can never forget to rebuild. The first time you run this, it may take some time for the base Docker image and other dependencies to be downloaded.
- Browse to http://localhost:9999 and enter the token displayed in the terminal (or just follow the link in the terminal).
- Run the following Jupyter notebooks in the order listed here to reproduce the results:
Notebook | Description |
---|---|
1_Datasets | Summary of datasets. NOTE: Some datasets must be manually downloaded by the user. When such a dataset is referenced during execution, an exception will be raised informing the user where to download the dataset from, and where to place the results. |
2_RuntimeExperiments.ipynb | Experiments comparing method runtimes |
3_BinaryClassExperiments.ipynb | Experiments with binary-class datasets |
4_MultiClassExperiments.ipynb | Experiments with multi-class datasets |
5_WitanLFs | Examples of labelling functions generated with Witan |
6_AblationExperiments | Experiments with variants of the Witan algorithm |
You can run flake8 linting with: `make lint`.
You can run pytest unit tests with: `make test`.
An HTML code-coverage report will be generated for each module at: `<module-dir>/test/coverage/index.html`.
You can run type checks with: `make types`.
You may not want to commit the outputs of notebook cells to your Git repository. If you have Python 3 installed, you can use nbstripout to configure your Git repository to exclude the outputs of notebook cells when running `git add`:
- Run `python3 -m pip install nbstripout nbconvert`.
- Run `nbstripout --install` in this directory (installs hooks into `.git`).
If you would like to open a bash shell inside the Docker container running the Jupyter notebook server, use: `make bash` or `make sudo-bash`. If `make run` is not currently running, you can instead use `make run-bash` or `make run-sudo-bash`.
To install system packages or otherwise alter the Docker image's operating system, you can make changes in the Dockerfile. An example section that will allow you to install apt packages is included.
Whenever the Docker image is rebuilt (after certain files are changed), Docker will transmit the contents of this directory to the Docker daemon.
To speed up build times, you should add an entry to the `.dockerignore` file for any directory containing large files that do not need to be included in the Docker image at build time.
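As an illustration, entries can be appended to `.dockerignore` from the shell. This is a sketch under assumptions: `data/` and `models/` are hypothetical directory names, so substitute whichever large directories exist in your checkout.

```shell
# Hypothetical directory names; replace with the large directories in your checkout.
for dir in data/ models/; do
    # Append the entry only if it is not already listed in .dockerignore.
    grep -qxF -- "$dir" .dockerignore 2>/dev/null || echo "$dir" >> .dockerignore
done

# Show the resulting ignore list.
cat .dockerignore
```

The `grep` guard keeps the snippet safe to re-run, since it never writes the same entry twice.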
When Docker builds new versions of its images, it does not delete the old versions. Over time, this can lead to a significant amount of disk space being used for old versions of images that are no longer needed.
Because of this, you may wish to periodically run `docker image prune` to delete any "dangling images" - images that aren't currently associated with any image tags/names.
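If you would rather see what will be removed before pruning, the two steps can be wrapped together. A minimal sketch: `review_and_prune` is a hypothetical helper name, not something defined by this project. Both Docker invocations are standard CLI subcommands.

```shell
# Hypothetical helper: list dangling images, then prune them.
review_and_prune() {
    # List images that are not associated with any tag/name.
    docker image ls --filter dangling=true
    # Remove them; without --force, Docker asks for confirmation first.
    docker image prune
}
```

After defining the function, call `review_and_prune` from a shell on a host where Docker is installed.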
Did you know you can work with Jupyter notebooks from Emacs? All you need to do is install EIN:
- `M-x package-refresh-contents <Enter>`
- `M-x package-install <Enter> ein`
- Ensure `make run` is running.
- `M-x ein:login` (URL: http://127.0.0.1:8888, Password: token from `make run`)
- `M-x ein:notebooklist-open`
M-<enter> - Execute cell and move to next.
C-c C-c - Execute cell.
C-c C-z - Interrupt command.
C-c C-x C-r - Restart session.
C-<up/down> - Navigate cells.
M-<up/down> - Move cells.
C-c C-b - Insert cell below (C-a for above).
C-c C-l - Clear cell output.
C-c C-k - Delete cell.
C-c C-f - Open file.
C-c C-h - Help at cursor.
C-c C-S-l - Clear all output.
C-c C-t - Toggle cell type.