- Introduction
- Running Knetminer via Docker
- Creating new settings for a new dataset
- Using Dockerised Knetminer with Amazon AWS and ElasticBeanStalk
- The Knetminer Docker architecture
- Developing Knetminer with Docker
- Troubleshooting
## Introduction

Knetminer can be deployed quickly and simply, using the Docker container platform. In the simplest case, you just need to install the Docker environment and get our Knetminer image from DockerHub. For more advanced uses, our image can be used with host-provided configurations.
Additionally, you can develop changes to Knetminer, starting from our image definitions.
## Running Knetminer via Docker

You can quickly run an instance of Knetminer based on a small sample dataset (a subset of Arabidopsis thaliana, named Aratiny, details below). To do so, just install Docker and issue this command:

```bash
docker run -p 8080:8080 -it knetminer/knetminer
```
This will get our Knetminer image straight from Docker Hub. You'll see output on your shell window and eventually you'll be able to access the sample application via http://localhost:8080/client (as usual, you can change the port on the host via `-p`; in our case, its format will always be `-p <host-port>:8080`).
Running without further parameters makes our container use the configuration and data it has in its file system (which, as said, is the Arabidopsis subset).
Note: the `-it` options are needed because of the way the startup command associated with our container works, as explained here.
WARNING: until this note is removed, you might need to rebuild the Docker image(s) locally before issuing `docker run` commands, since the DockerHub images might not be up to date.
The docker-run.sh script is provided to make the invocation of the Knetminer container as simple as possible, essentially by building Knetminer-specific `docker run` commands. For instance, the equivalent of the command that we just showed, assuming Docker is running, is simply:

```bash
./docker-run.sh
```
A variant of it, which maps the container to port 8090, is:

```bash
./docker-run.sh --host-port 8090
```
In the following, we describe more complex cases.
Note that you don't need to download (ie, `git clone`) the whole Knetminer codebase just to get this script; you just need to download a raw copy of it, and then it will use our DockerHub image (see the warning above).
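For instance, a raw copy can be fetched as sketched below. The exact URL/branch is an assumption; check the Knetminer GitHub repository for the script's current location.

```bash
# Fetch a raw copy of the runner script and make it executable
# (path/branch below are assumptions, verify them on GitHub)
wget https://raw.githubusercontent.com/Rothamsted/knetminer/master/common/quickstart/docker-run.sh
chmod +x docker-run.sh
./docker-run.sh --help
```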
While the example above runs Knetminer against a default dataset, a different application instance can be run against a dataset of your choice. In order to do so, we need to explain the concepts of dataset and dataset directory.
A dataset is a set of files that serves a Knetminer instance. Most of the datasets we provide are focused on a single species, so we have datasets named arabidopsis, wheat, rice, etc. This is also the reason why we sometimes use the word species as a synonym of dataset.
A dataset directory contains all the dataset-specific information that a Knetminer instance needs to run a dataset. Except for the default case (based on Aratiny and its data, already stored in the Docker image), you must provide a dataset directory to run Knetminer against a new dataset. The typical way to do so is to create such a directory in your Docker host (ie, the one that runs the container). When using a predefined dataset ID (`--dataset-id`), the only thing you MUST have in the dataset directory is `<dataset-dir>/data/knowledge-network.oxl`, which contains the dataset's core data in the Ondex format. For pre-defined datasets, we usually provide such a file upon request. More details about creating new datasets and the structure of the dataset directory are given below.
So, suppose you want to run the wheat dataset, using our pre-defined settings. The steps to follow are:
- Create the dataset directory on the Docker host, eg, `mkdir /home/knetminer/wheat-dataset`
- Create the data directory, `mkdir /home/knetminer/wheat-dataset/data`, and copy our wheat OXL under `/home/knetminer/wheat-dataset/data/knowledge-network.oxl`
- Run with the wheat predefined dataset ID (`--dataset-id` option):
```bash
./docker-run.sh \
  --dataset-id wheat \
  --dataset-dir /home/knetminer/wheat-dataset \
  --host-port 8090 \
  --container-name 'docker-wheat' \
  --container-memory 20G \
  --detach
```
This command will create `/home/knetminer/wheat-dataset/settings`, which will contain a copy of the settings coming from our GitHub. The other options given to the command above do the following:
- `--dataset-dir` points the instance to the dataset directory prepared above
- `--host-port` maps the "Knetminer Wheat" instance to the host port 8090
- `--container-name` names the corresponding container `docker-wheat` (a useful reference for all Docker commands)
- `--container-memory` grants 20G of memory to the container
- `--detach` detaches the container from the script (ie, the script ends immediately, with the container remaining in the background).
See `docker-run.sh --help` for a documented list of all the available options.
The bare Docker equivalent of the command above is:

```bash
docker run -it --detach -p 8090:8080 --name docker-wheat --memory 20G \
  --volume /home/knetminer/wheat-dataset:/root/knetminer-dataset \
  knetminer/knetminer:latest wheat
```
As you can see, your dataset directory on the Docker host is bound to the container by means of a Docker volume, which maps it onto a known path in the container.
After the command above, the corresponding Docker instance should be available at http://localhost:8090/client. The running container can be stopped via Control-C (when not using `--detach`; else, use something like `docker stop docker-wheat`).
Beware that the startup phase of large datasets takes a long time, so you might need to wait a while (30 minutes to a few hours) before this URL starts showing something in your browser.
The Knetminer WS application creates a number of working files in the configured dataset directory. We recommend cleaning them before restarting a container. The cleanup-volume.sh script is provided for this, and its use is pretty simple:

```bash
./cleanup-volume.sh /home/knetminer/wheat-dataset
```
The command will leave directories like `settings/` or files like `data/knowledge-network.oxl` untouched.
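For instance, a typical restart cycle for the detached wheat instance shown earlier would be (container and path names are those from the previous example):

```bash
# Stop and remove the old container, clean the working files, then run again
docker stop docker-wheat
docker rm docker-wheat
./cleanup-volume.sh /home/knetminer/wheat-dataset
./docker-run.sh --dataset-id wheat --dataset-dir /home/knetminer/wheat-dataset \
  --host-port 8090 --container-name 'docker-wheat' --detach
```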
TODO
## Creating new settings for a new dataset

When running the Knetminer container with the `--dataset-id <id>` option, it will use `<id>` to pick 'settings files' from a subdirectory with the same name under the species/ folder of the Knetminer codebase.
On the other hand, if you want to create a new dataset (eg, for a new species), rather than using one of the predefined ones, typically you'll want to define your own settings instead of using `--dataset-id`. You do so by creating certain files under `<dataset-dir>/settings/` (which is created from a codebase copy when using predefined settings). The quickest way to create such files is to make a copy from our species/ directory and modify it.
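For instance, a minimal sketch of this bootstrap, assuming you start from the arabidopsis settings in a clone of the codebase (all paths are illustrative and the exact species/ layout should be checked in the repository):

```bash
# Copy an existing species' settings as a starting point
mkdir -p /home/knetminer/myspecie-dataset
cp -r knetminer/species/arabidopsis /home/knetminer/myspecie-dataset/settings
# then adapt maven-settings.xml, client/sampleQuery.xml, ws/SemanticMotifs.txt, etc.
```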
In this section, we describe the details of what this settings directory contains, as well as the other subdirectories in the dataset directory, which are populated by the Knetminer application.
The following is an excerpt of the contents in the dataset directory for the Arabidopsis dataset:
```
arabidopsis
├── client-src
│   ├── ...
│
├── config
│   ├── actual-maven-settings.xml
│   ├── data-source.xml
│   ├── neo4j
│   │   ├── ...
│   │
│   └── SemanticMotifs.txt
├── data
│   ├── index
│   │   ├── ...
│   ├── knowledge-network.oxl
│   ├── ...
└── settings
    ├── client
    │   ├── background.jpg
    │   ├── basemap.xml
    │   ├── release_notes.html
    │   └── sampleQuery.xml
    ├── maven-settings.xml
    └── ws
        ├── neo4j
        │   └── state-machine-queries
        │       ├── sm-001_L03_bioProc_4.cypher
        │       ├── sm-002_L03_celComp_5.cypher
        │       ├── ...
        ├── ...
        └── SemanticMotifs.txt
```
The main files listed above are:

- `<dataset-dir>/settings/maven-settings.xml`, defining a number of configuration values, which are passed to Maven to populate the other files mentioned below
- `<dataset-dir>/settings/client/*`, defining species-specific web files for the client application
- `<dataset-dir>/settings/ws/SemanticMotifs.txt`, defining graph patterns that relate genes to other entities of interest
- `<dataset-dir>/settings/ws/neo4j/`, to be used in Neo4j mode (TODO: details)
Even more details are given in the next sections.
The dataset-specific configuration values are mostly defined by `<dataset-dir>/settings/maven-settings.xml`. The values in this file are used to instantiate Maven build variables throughout the aratiny project, which is a test/demo/reference implementation of a Knetminer application (see below). The variables in the species-specific Maven settings override defaults defined in Knetminer's main POM (both in the `<properties>` and `<profiles>` sections).
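For instance, you can inspect the overridable defaults in a clone of the codebase (a sketch, paths per the `git clone` commands shown later):

```bash
# List the default properties defined in the main POM
grep -n -A 30 '<properties>' knetminer/pom.xml | less
```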
The client settings are located in `<dataset-dir>/settings/client` and contain:

- dataset-specific web files for the client WAR (ie, the user interface application), eg, `organism.png`, `background.png`, `release_notes.html`
- dataset-specific configuration files put into the client WAR. At the moment, the only file of this type is `sampleQuery.xml`, which defines the sample queries appearing on the right column of the Knetminer interface.
Client files are used by the Docker container at runtime, after the corresponding image has been built. In fact, the Docker container uses the files in `<dataset-dir>/settings/client` to override the reference client application and then rebuilds the corresponding WAR, before passing it to the Tomcat server and launching it. Rebuilding the client is necessary, since there isn't another practical way to push specific static files, or variants of Javascript files (which are based on Maven settings, see the next sections).
The main server setting files, located in `<dataset-dir>/settings/ws/`, are:

- `SemanticMotifs.txt`, which defines graph patterns that relate genes to other entities of interest. This is used for ranking gene searches based on gene evidence.
- `neo4j/`, which contains configuration related to Knetminer running in Neo4j mode (TODO: additions due).
Note that, differently from the client, the server WAR is generated once and for all at image build time, not at runtime. That's because this WAR is independent of any specific dataset you want to run Knetminer with. However, the runtime still uses Maven to instantiate (in Maven terms, interpolate) client and server configuration files with species/dataset-specific values. It does so by issuing `mvn test-compile` against `aratiny-ws`.
### Running Docker against a dataset with custom settings
Once you've prepared the right settings and the OXL file, you can run a non-predefined dataset in a way similar to the commands shown previously:

```bash
./docker-run.sh \
  --dataset-dir /home/knetminer/myspecie-dataset \
  --host-port 8090 \
  --container-name 'docker-myspecie'
```
As you can see, the only difference is that you don't give any `--dataset-id` option.

This example corresponds to the Docker command:

```bash
docker run -it -p 8090:8080 --name docker-myspecie \
  --volume /home/knetminer/myspecie-dataset:/root/knetminer-dataset \
  knetminer/knetminer:latest ''
```
WARNING: if you want to type Docker commands manually, remember to pass an empty string as the parameter after the Docker image name. Else, it will use 'aratiny' as the value for the dataset ID, since the default sample dataset is the one the container is expected to use when run without parameters.
Knetminer uses the files in the `settings/` directory (in addition to defaults in the Maven POMs) to generate the actual configuration files, into `<dataset-dir>/config`. Usually you don't need to change these files; we list some of them hereby, since they might be useful for Knetminer development and customisation.

Note: for the time being, changes to the files below must be put into the Knetminer codebase. You cannot place them into `<dataset-dir>/config`, since the copies in this directory are always re-generated from the settings. TODO: in the future, we will add a `--keep-config` option, for testing purposes.
The main one of these files is data-source.xml, which is the first config item the WS application looks for during its bootstrap stage. In turn, this file points to other paths, like `${knetminer.dataDir}`, the location where application-generated data are stored (eg, Lucene indexes), or `${knetminer.oxlFile}`, the location of the dataset's OXL file. The defaults for these values are such that they fall under `<dataset-dir>/data`.
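For instance, after running the wheat example above, you can check the generated files from the host side:

```bash
# The generated config and the application-generated data, on the host
less /home/knetminer/wheat-dataset/config/data-source.xml
ls /home/knetminer/wheat-dataset/data    # index/, knowledge-network.oxl, ...
```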
Note also that these two locations aren't dataset-specific in the Docker container: the data dir is always `/root/knetminer-dataset/data` and the OXL file is always `/root/knetminer-dataset/data/knowledge-network.oxl`. That's because, as mentioned above, the final locations for these directories and files are established by means of the volume-based mapping. That's why you don't find the corresponding variables in `<dataset-dir>/settings/maven-settings.xml`. Rather, they're defined in the main POM (the `docker` Maven profile defines the Docker-specific values, while the defaults at the top of the POMs are used for building a Jetty-based test environment during Maven builds).
Finally, the `<dataset-dir>` directory contains a `client-src/` directory, where the source files for the client (ie, the user interface) are generated, using a combination of default files (coming from Aratiny, see the next sections) and files coming from `settings/client/`, after these are populated with values from `maven-settings.xml` and other Maven default properties.
## Using Dockerised Knetminer with Amazon AWS and ElasticBeanStalk

TODO: we need to test the new image on AWS and write documentation.
## The Knetminer Docker architecture

Knetminer, and consequently its Docker image, is based on two WAR applications: the web service (WS), which provides an HTTP/JSON-based API, and the client, which is the user interface, based on a number of technologies like JSP, Javascript and NPM packages.
Both the WS and the client WAR are built from the Aratiny project. This is a Knetminer reference/test application, which is usually run against a small default dataset. In turn, aratiny depends on other, more generic Maven modules and files (eg, `client-base`, `ws-base`).
As already mentioned, in Docker the WS WAR is made once and for all during the image build, while the client WAR is rebuilt at runtime, after its source files are regenerated by overriding the aratiny client files with files coming from the settings, as explained above (aratiny is built and tested during the default Knetminer build, without any overriding; see the GitHub sources for details).
The Docker container gets/writes dataset-specific files from/to a fixed location inside its file system, which is `/root/knetminer-dataset/`. As explained above, this maps to customised versions on the Docker host by means of Docker volumes.
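For instance, you can verify this volume mapping on a running container (named as in the wheat example above):

```bash
# Show host path -> container path for each volume of the container
docker inspect -f \
  '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{println}}{{end}}' \
  docker-wheat
```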
The figure below (Right click => Open Image to view it at the original size) summarises how the Knetminer Docker container works, as explained above: which directories are used, either on the container file system or on the host, plus the mappings between them.
As said above, the Docker image contains a copy of the Knetminer codebase, which is used initially to build the WS WAR and, later, to build a dataset-specific instance of the client WAR. In both cases, the Aratiny reference application is used. Namely, `mvn test-compile` is issued at runtime against the dataset-specific variants of the aratiny WAR projects.
For practical reasons, we use different Docker images to instantiate the final Knetminer container.
As usual, the main container is defined by the Dockerfile image file. This builds what is needed to deploy our web applications (two .war files) on our Java web server (Tomcat).
This application image extends the Dockerfile-base image, which prepares a container with all the Maven-reachable dependencies downloaded and placed into the container file system.
Finally, the base image depends on the Dockerfile-bare image. This pulls Andre's Tomcat container, which is essentially a Debian Linux with Java and Tomcat installed, and additionally deploys third-party dependencies that are needed by Knetminer (eg, Maven, the Node.js package manager NPM).
The motivation for distributing the build of the final container over three different images is that this makes development quicker. For instance, if one just needs to change a web application page, pushing such a change forward requires rebuilding only the main image.
## Developing Knetminer with Docker

Building the images works with the usual `docker build` command, but you need to issue it from the root of our codebase (the one created by `git clone`). That's because the build process copies the codebase into the container, and there isn't a way in Docker to copy from directories located above the one the build command is started from. So, to build from scratch (and from the master branch in our repository), you need:
```bash
git clone https://github.com/Rothamsted/knetminer # Or another fork/branch
cd knetminer

# You don't always need to rebuild all of them, see below
# ==> After the first time, you might need --no-cache to force the rebuild
docker build -t knetminer/knetminer-bare -f common/quickstart/Dockerfile-bare .
docker build -t knetminer/knetminer-base -f common/quickstart/Dockerfile-base .
docker build -t knetminer/knetminer -f common/quickstart/Dockerfile .
```
Which images you need to rebuild depends on what you're doing:
- If you changed the codebase without adding third-party Maven dependencies, you probably only need to rebuild the main image. However, sometimes this doesn't work (eg, an existing third-party dependency was re-uploaded to an external Maven artifactory) and you need to rebuild the base image too.
- If you changed third-party dependencies, rebuild the base image and then the main image.
- You rarely need to rebuild the bare image, essentially only if you want to change general settings for the container environment (eg, the Linux or JDK version).
When you don't rebuild an image and it isn't already in your Docker cache, it is taken from our DockerHub.
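For instance, after a typical code change, a quick incremental rebuild might be just:

```bash
# Rebuild only the main image; --no-cache forces a fresh copy of the codebase
docker build --no-cache -t knetminer/knetminer -f common/quickstart/Dockerfile .
```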
WARNING: we're still unable to auto-update the images as we update our code on GitHub. So, it's likely that the two won't always be aligned, for the time being. Until we remove this warning, try rebuilding the image(s) on your side if you see problems.
You can use the Docker tagging (versioning) mechanism to build different versions of the Knetminer images. For instance, the following builds images with a 'test' version (instead of 'latest', which is the implied tag when none is specified):

```bash
# Build the base image and assign it the 'test' tag/version
docker build -t knetminer/knetminer-base:test -f common/quickstart/Dockerfile-base .

# Does the same for the main image and also uses the 'test' variant as a dependency (via --build-arg)
docker build -t knetminer/knetminer:test --build-arg DOCKER_TAG=test -f common/quickstart/Dockerfile .
```
Once you have an image with a given tag, you can use it at runtime by adding something like `--image-version test` to the `docker-run.sh` commands explained in the previous sections.
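For example:

```bash
# Run a container based on the image tagged 'test' above
./docker-run.sh --image-version test --host-port 8090
```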
Note to Knetminer developers: this is also the way to push images to DockerHub, ie, push an image after having built it with a given version (and using the same image coordinates). This can be useful to publish images that correspond to a given branch or fork.
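For instance (assuming your DockerHub login has push rights on the knetminer organisation):

```bash
# Publish the image tagged above
docker push knetminer/knetminer:test
```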
As long as you install the necessary requirements (Linux, Windows), Knetminer can be built on your working computer, without involving Docker at all.
A simple way to do so is to build and play with the already-mentioned Aratiny application. This is mainly useful to develop new application features (which apply to any dataset), test them and even add unit tests that verify them (`aratiny-ws` already contains a few of these). Changes to this existing application can happen directly in its Maven modules, or in its dependencies (eg, `client-base`), depending on the type of change you're introducing.
WARNING: always consider that Aratiny is a reference application: anything you change here is reflected in the dataset-specific instances that Docker builds based on Aratiny.
Another useful thing is the manual-test scripts within the Aratiny application, which can be used to quickly launch a working server that runs the Aratiny applications and is accessible from your browser. See the files on GitHub for details.
A third option is to build and run the WAR files on your system, using the same scripts that are used to build the Docker image and to run the corresponding container. These are `build-helper.sh` and `runtime-helper.sh`. Indeed, these scripts are designed to work both with Docker and outside of it. See their content for comments on how to use them. Moreover, have a look at this example in the codebase.

Warning: if you are a Knetminer developer putting your hands on those scripts, always ensure the above portability isn't broken.
## Troubleshooting

In this section, we show a few details that can be useful to debug and troubleshoot the Docker container for Knetminer.
There are two log outputs from the container that you might want to check. One is the standard output from Tomcat, which can be accessed from the host via a command like:
```bash
docker logs -f arabidopsis
```

where `arabidopsis` is the name you assigned to the container upon run.
The other interesting logs are in the Tomcat home. In particular, you might want to have a look at the Knetminer WS log. This host command will show it live, until you stop it with Control-C:

```bash
docker exec -it arabidopsis tail -f /usr/local/tomcat/logs/ws.log
```
Another thing that might be useful is mapping the Tomcat logs directory to some location on the host, eg:
```bash
export DOCKER_OPTS='-it --volume /home/knetminer-logs:/usr/local/tomcat/logs'
./docker-run.sh ...
```
In particular, this might be what you need to ease access to the logs by tools like web log analysers.
The Docker build command for the Knetminer image can be invoked with one parameter, a password for the Tomcat Manager web application. This application can be useful to check if our WARs were started, as well as for operations like the quick deployment of a new WAR under a running Tomcat/container.
This can be done with a command like:
```bash
docker build -t knetminer/knetminer \
  -f common/quickstart/Dockerfile --build-arg TOMCAT_PASSWORD='foo123' .
```
Remember NOT to use easy passwords for production images! These are accessible via the web. Ideally, don't enable the manager in production at all.
If the client shows messages like "server is offline", it might be useful to check if the WS application is responding. This has its own URL. For instance, if the container is mapped on the 8080 port and you started the default dataset (ie, aratiny), you can type this URL in your browser:

```
http://localhost:8080/ws/aratiny/countHits?keyword=seed
```
And you should get back some JSON showing how many results correspond to that keyword.
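The same check can be done from the command line, eg:

```bash
# Expect a small JSON document with the number of hits for 'seed'
curl 'http://localhost:8080/ws/aratiny/countHits?keyword=seed'
```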
When using other datasets, you have to type the internal dataset identifier in place of `/aratiny/` (eg, for Arabidopsis, http://localhost:8080/ws/araknet/countHits?keyword=seed). This ID is defined in the dataset's `maven-settings.xml` file, by the property `knetminer.dataSourceId`.
During the development stage, when you frequently have to change the Knetminer code a little and rebuild the corresponding image, you might want to build only certain Maven modules, rather than the whole of Knetminer.
You can do this in the following way:
- Ensure the latest versions of the Maven modules you're not rebuilding are correctly deployed on our Maven artifactory server.
- Edit the build helper, as specified by the following lines, which you can find in the script:
```bash
# ---- Regular full build of the server web reference application (aratiny-ws.war) ----
mvn clean install $MAVEN_ARGS -DskipTests -DskipITs
cd common/aratiny/aratiny-ws

# --- Alternatively, you can enable fast build during debugging
# mvn dependency:resolve
# cd common/aratiny/aratiny-ws
# mvn clean install $MAVEN_ARGS -DskipTests -DskipITs
# ---
```
- Run the `docker build` command as explained above. This should rebuild things quickly, depending on the subset of modules you select for the Maven build.
Remember not to push these changes back to GitHub.