This is a guide to some aspects of how a multi-node Vespa cluster works. The example uses three nodes; it is a simplified setup and not a template for multi-node systems - see multinode-HA for a template.
The guide is written for functional testing and is deployed on one host for simplicity. The configuration server operations guide is a good resource for troubleshooting.
The guide goes through the use of Apache ZooKeeper by the Vespa clustercontrollers. Summary:
- Clustercontrollers manage content node states.
- Clustercontrollers use ZooKeeper to coordinate and save the cluster state.
- By default, the ZooKeeper cluster runs on config servers.
- As the ZooKeeper quorum requires a minimum of two nodes up, content node state changes are not distributed when only one config server is up (a way to verify the quorum is sketched after this list).
- When content nodes do not get cluster state updates, replicas are not activated for queries, causing partial query results. To avoid duplicates, only one replica of each bucket of documents is active - the other replicas are not active for queries.
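To verify the ZooKeeper quorum on the config servers, the ZooKeeper CLI wrapper bundled with Vespa can be used. This is a minimal sketch, assuming vespa-zkcli is on its default path inside the container and that a listing only succeeds while a quorum is up:

# List the ZooKeeper root from node0 - this requires a working quorum:
$ docker exec node0 bash -c "/opt/vespa/bin/vespa-zkcli ls /"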
Note that this guide is configured for minimum memory use for easier testing, adding:
-e VESPA_CONFIGSERVER_JVMARGS="-Xms32M -Xmx128M" \
-e VESPA_CONFIGPROXY_JVMARGS="-Xms32M -Xmx32M" \
to the docker run commands.
Do not do this for real production use cases.
Also remove the annotated memory settings in services.xml.
Prerequisites:
- Docker with 12G Memory
- git
- zip
$ docker info | grep "Total Memory"
$ git clone --depth 1 https://github.com/vespa-engine/sample-apps.git
$ cd sample-apps/examples/operations/multinode
$ docker network create --driver bridge vespanet
The nodes communicate over a Docker network; this guide stops Docker containers to simulate node stops. Ports are mapped out of the Docker containers for easy access to the query/feed (8080-8082), config server (19071-19073), clustercontroller status (19050-19052) and metrics proxy (19092-19094) interfaces.
Use the Docker for Mac dashboard to see container output and status.
Also refer to https://github.com/vespa-engine/docker-image/blob/master/Dockerfile and its start script https://github.com/vespa-engine/docker-image/blob/master/include/start-container.sh to understand how Vespa is started in a Docker container using the vespaengine/vespa image.
$ docker run --detach --name node0 --hostname node0.vespanet \
    -e VESPA_CONFIGSERVERS=node0.vespanet,node1.vespanet,node2.vespanet \
    -e VESPA_CONFIGSERVER_JVMARGS="-Xms32M -Xmx128M" \
    -e VESPA_CONFIGPROXY_JVMARGS="-Xms32M -Xmx32M" \
    --network vespanet \
    --publish 8080:8080 --publish 19071:19071 --publish 19050:19050 --publish 19092:19092 \
    vespaengine/vespa configserver,services

$ docker run --detach --name node1 --hostname node1.vespanet \
    -e VESPA_CONFIGSERVERS=node0.vespanet,node1.vespanet,node2.vespanet \
    -e VESPA_CONFIGSERVER_JVMARGS="-Xms32M -Xmx128M" \
    -e VESPA_CONFIGPROXY_JVMARGS="-Xms32M -Xmx32M" \
    --network vespanet \
    --publish 8081:8080 --publish 19072:19071 --publish 19051:19050 --publish 19093:19092 \
    vespaengine/vespa configserver,services

$ docker run --detach --name node2 --hostname node2.vespanet \
    -e VESPA_CONFIGSERVERS=node0.vespanet,node1.vespanet,node2.vespanet \
    -e VESPA_CONFIGSERVER_JVMARGS="-Xms32M -Xmx128M" \
    -e VESPA_CONFIGPROXY_JVMARGS="-Xms32M -Xmx32M" \
    --network vespanet \
    --publish 8082:8080 --publish 19073:19071 --publish 19052:19050 --publish 19094:19092 \
    vespaengine/vespa configserver,services
Notes:
- Use fully qualified hostnames.
- VESPA_CONFIGSERVERS lists all nodes, using exactly the same hostnames as in hosts.xml (see the sketch below).
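For reference, hosts.xml in this sample app maps the fully qualified hostnames to the host aliases used by services.xml. The exact file may differ slightly from this abridged sketch:

$ cat hosts.xml
<hosts>
    <host name="node0.vespanet">
        <alias>node0</alias>
    </host>
    <host name="node1.vespanet">
        <alias>node1</alias>
    </host>
    <host name="node2.vespanet">
        <alias>node2</alias>
    </host>
</hosts>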
Wait for last config server to start:
$ curl -s --head http://localhost:19073/ApplicationStatus
HTTP/1.1 200 OK
Date: Thu, 17 Jun 2021 11:26:19 GMT
Content-Type: application/json
Content-Length: 12732
Make sure all ports are listed before continuing:
$ netstat -an | egrep '1907[1,2,3]|1905[0,1,2]|808[0,1,2]|1909[2,3,4]' | sort
tcp46 0 0 *.19050 *.* LISTEN
tcp46 0 0 *.19051 *.* LISTEN
tcp46 0 0 *.19052 *.* LISTEN
tcp46 0 0 *.19071 *.* LISTEN
tcp46 0 0 *.19072 *.* LISTEN
tcp46 0 0 *.19073 *.* LISTEN
tcp46 0 0 *.19092 *.* LISTEN
tcp46 0 0 *.19093 *.* LISTEN
tcp46 0 0 *.19094 *.* LISTEN
tcp46 0 0 *.8080 *.* LISTEN
tcp46 0 0 *.8081 *.* LISTEN
tcp46 0 0 *.8082 *.* LISTEN
$ zip -r - . -x "img/*" README.md .gitignore | \
    curl --header Content-Type:application/zip --data-binary @- \
    localhost:19071/application/v2/tenant/default/prepareandactivate
Wait for services to start:
$ curl -s http://localhost:8082/state/v1/health
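When scripting, a readiness poll over all three query ports can replace the manual check - a sketch, assuming /state/v1/health reports status code "up" once a container is ready:

$ for port in 8080 8081 8082; do \
    until curl -fs http://localhost:$port/state/v1/health | grep -q '"up"'; do sleep 5; done; \
    echo "port $port is up"; \
  done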
Check that this works:
$ curl http://localhost:19050/clustercontroller-status/v1/music
Then open these in a browser:
- http://localhost:19050/clustercontroller-status/v1/music
- http://localhost:19051/clustercontroller-status/v1/music
- http://localhost:19052/clustercontroller-status/v1/music
0 is normally master, 1 is next (and hence has an overview table), 2 is cold.
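The same information is also exposed programmatically. A sketch using the cluster controller REST API on the status port, assuming the default /cluster/v2 paths:

# Cluster state for the "music" cluster, as JSON (same port as the status page):
$ curl -s http://localhost:19050/cluster/v2/music | jq .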
$ docker stop node2
Observe at http://localhost:19050/clustercontroller-status/v1/music that storage and distributor on node2 go to state down, then start node2 again:
$ docker start node2
Observe at http://localhost:19050/clustercontroller-status/v1/music that storage and distributor go to state up again (this can take a minute or two).
$ docker stop node0
http://localhost:19050/clustercontroller-status/v1/music now goes blank as node0 is stopped.
Observe at http://localhost:19051/clustercontroller-status/v1/music that storage and distributor on node0 go to state down. Also see in "Master state" further down the page that this clustercontroller takes over as master after 60 seconds.
$ docker start node0
Observe that 0 is master again.
$ docker stop node0 node1
http://localhost:19050/clustercontroller-status/v1/music and http://localhost:19051/clustercontroller-status/v1/music now go blank as node0 and node1 are stopped.
Find at http://localhost:19052/clustercontroller-status/v1/music that node2 never becomes master! To understand, review https://stackoverflow.com/questions/32152467/can-zookeeper-remain-highly-available-if-one-of-three-nodes-fails :
in a 3 node cluster, if 2 of the nodes die, the third one will not be serving requests. The reason for that is that the one remaining node cannot know if it is in fact the only survivor or if it has been partitioned off from others. Continuing to serve request at that point could cause a split brain scenario and violate the core ZooKeeper guarantee.
By default, clustercontrollers use a ZooKeeper cluster running on the config servers:
With config server on node0 and node1 out, the ZooKeeper cluster quorum (the green part in the illustration) is broken - the clustercontrollers will not update the cluster state. This can be observed on node2's clustercontroller status page, where the current cluster state lags what node2's clustercontroller is observing (but cannot write to ZooKeeper):
[2021-06-18 07:57:52.246] WARNING : container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler Fleetcontroller 0: Failed to connect to ZooKeeper at node0.vespanet:2181,node1.vespanet:2181,node2.vespanet:2181 with session timeout 30000: java.lang.NullPointerException
  at org.apache.zookeeper.ClientCnxnSocketNetty.onClosing(ClientCnxnSocketNetty.java:247)
  at org.apache.zookeeper.ClientCnxn$SendThread.close(ClientCnxn.java:1465)
  at org.apache.zookeeper.ClientCnxn.disconnect(ClientCnxn.java:1508)
  at org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1537)
  at org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:1614)
  at com.yahoo.vespa.clustercontroller.core.database.ZooKeeperDatabase.<init>(ZooKeeperDatabase.java:120)
  at com.yahoo.vespa.clustercontroller.core.database.ZooKeeperDatabaseFactory.create(ZooKeeperDatabaseFactory.java:8)
  at com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler.connect(DatabaseHandler.java:197)
  at com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler.doNextZooKeeperTask(DatabaseHandler.java:252)
  at com.yahoo.vespa.clustercontroller.core.FleetController.tick(FleetController.java:604)
  at com.yahoo.vespa.clustercontroller.core.FleetController.run(FleetController.java:1127)
  at java.base/java.lang.Thread.run(Thread.java:829)
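Entries like the one above can be pulled from the Vespa log on node2 - a sketch, assuming the vespa-logfmt tool shipped with Vespa and its log level filter:

# Show warnings/errors from node2's Vespa log, filtered on ZooKeeper:
$ docker exec node2 bash -c "/opt/vespa/bin/vespa-logfmt -l warning,error | grep -i zookeeper"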
$ docker start node0 node1
Observe 0 is master again.
Kubernetes users sometimes have issues with the ZooKeeper cluster at startup, see troubleshooting.
Make sure the three nodes are started and up - then feed 5 documents:
$ i=0; (for doc in $(ls ../../../album-recommendation/ext/*.json); \
    do \
      curl -H Content-Type:application/json -d @$doc \
        http://localhost:8080/document/v1/mynamespace/music/docid/$i; \
      i=$(($i + 1)); echo; \
    done)
Use vespa-visit to validate all documents are fed:
$ docker exec node0 bash -c "/opt/vespa/bin/vespa-visit -i"
The redundancy configuration in services.xml is 3 replicas, i.e. one replica per node. Using metrics, expect 5 documents per node:
$ (for port in 19092 19093 19094; \
    do \
      curl -s http://localhost:$port/metrics/v1/values | \
        jq '.services[] | select (.name=="vespa.searchnode") | .metrics[].values' | \
        grep content.proton.documentdb.documents.total.last; \
    done)
Query any of the nodes using 8080, 8081 or 8082 - this query selects all documents:
$ curl --data-urlencode 'yql=select * from sources * where sddocname contains "music"' \
    http://localhost:8080/search/
Check http://localhost:19050/clustercontroller-status/v1/music, then set node2 down:
$ docker stop node2
$ sleep 5
$ curl --data-urlencode 'yql=select * from sources * where sddocname contains "music"' \
    http://localhost:8080/search/
See {"totalCount":5}. Then stop node1:
$ docker stop node1
$ sleep 5
$ curl --data-urlencode 'yql=select * from sources * where sddocname contains "music"' \
    http://localhost:8080/search/
We see that the last clustercontroller is still up. Count documents on the content node:
$ docker exec node0 /opt/vespa/bin/vespa-proton-cmd --local getState
  ...
  "onlineDocs", "5"
However, query results are partial - the 5 documents are not all returned. Check the "totalCount" field in the query response.
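To check just the count, extract the field with jq - the hit count is found at .root.fields.totalCount in the default JSON result format:

$ curl -s --data-urlencode 'yql=select * from sources * where sddocname contains "music"' \
    http://localhost:8080/search/ | jq '.root.fields.totalCount'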
Look at "SSV" ("cluster state version") in the table - this shows the content node's view of the cluster state.
Compare this with the state changes in the table below it, and find a higher state version number that has not yet been published due to the missing quorum.
The replica activation is missing on node0, as the cluster state reflecting two nodes down was never published due to the missing ZooKeeper quorum.
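The published cluster state can also be inspected from the command line - a sketch, assuming the vespa-get-cluster-state tool shipped with Vespa. With the quorum down, expect it to report the stale state, if it responds at all:

$ docker exec node0 /opt/vespa/bin/vespa-get-cluster-state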
It is possible to set up clusters with only one clustercontroller - changes:
<host name="node3.vespanet">
    <alias>node3</alias>
</host>

<admin version='2.0'>
    <adminserver hostalias="node3" />
    <configservers>
        <configserver hostalias="node0" />
        <configserver hostalias="node1" />
        <configserver hostalias="node2" />
    </configservers>
    <cluster-controllers standalone-zookeeper="true">
        <cluster-controller hostalias="node3" />
    </cluster-controllers>
</admin>

$ docker run --detach --name node3 --hostname node3.vespanet \
    -e VESPA_CONFIGSERVERS=node0.vespanet,node1.vespanet,node2.vespanet \
    --network vespanet \
    --publish 8083:8080 --publish 19074:19071 --publish 19053:19050 --publish 19095:19092 \
    vespaengine/vespa services
Here, two content nodes, like node0 and node1, can go down while node2 serves the full data set in queries.
The clustercontroller can also go down with no impact on query serving, as long as no content node changes state. I.e., if the single clustercontroller is down and a content node goes down thereafter, the cluster state is not updated and partial query results are expected.
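If trying this variant, the single clustercontroller's status page is reachable on the host through the 19053:19050 port mapping from the docker run command above:

$ curl http://localhost:19053/clustercontroller-status/v1/music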
$ docker rm -f node0 node1 node2
$ docker network rm vespanet