
Unipept API Load Balancer configuration

Pieter Verschaffelt edited this page Aug 11, 2023 · 11 revisions

Since Unipept has multiple API servers handling client requests, we have set up a separate load balancer that spreads all incoming requests over these servers. HAProxy is the software package that handles this for us, and the full configuration of this load balancer can be found in this document.

Load balancing / HAProxy configuration

The configuration file for HAProxy can be found in /etc/haproxy/haproxy.cfg and looks like this. Each non-standard configuration option is explained in a comment in the file itself.

global
	log /dev/log	local0
	log /dev/log	local1 notice
	chroot /var/lib/haproxy
	stats socket /run/haproxy/admin.sock mode 660 level admin
	stats timeout 30s
	user haproxy
	group haproxy
	daemon

	# Default SSL material locations
	ca-base /etc/ssl/certs
	crt-base /etc/ssl/private

	# See: https://ssl-config.mozilla.org/#server=haproxy&server-version=2.0.3&config=intermediate
        ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
        ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
        ssl-default-bind-options ssl-min-ver TLSv1.2 no-tls-tickets

# These values are defaults that are inherited by all frontend and backend sections below. They can
# still be overridden by specifying them again in one of the frontend or backend blocks.
defaults
	log	global
	mode	http
	option	httplog
	option	dontlognull
        # The time that HAProxy waits for a TCP connection to the backend to be established.
        timeout connect 5s
        # This setting measures inactivity during a period that we expect the client to be speaking.
        timeout client  5s
        # This setting measures inactivity during a period that we expect the backend server to be
        # speaking.
        timeout server  1800s
	errorfile 400 /etc/haproxy/errors/400.http
	errorfile 403 /etc/haproxy/errors/403.http
	errorfile 408 /etc/haproxy/errors/408.http
	errorfile 500 /etc/haproxy/errors/500.http
	errorfile 502 /etc/haproxy/errors/502.http
	errorfile 503 /etc/haproxy/errors/503.http
	errorfile 504 /etc/haproxy/errors/504.http

frontend stats
	mode http
	bind *:8084
	stats enable
	stats uri /stats
	stats refresh 10s
	stats admin if LOCALHOST

frontend handlers
	mode http
        # Allow HAProxy to load balance plain HTTP requests
        bind *:80
        # Allow HAProxy to load balance secure HTTPS requests.
        # HTTP/2 is enabled and preferred (the alpn list puts h2 before http1.1)
        bind *:443 ssl crt /etc/ssl/unipeptapi.ugent.be/unipeptapi.ugent.be.pem alpn h2,http1.1
        # Keep track of the last 100k HTTP requests and the IP address they originated from
        # (stored in ipv6 format). Records in this table are automatically removed after 120s
        # (expire 120s). By setting http_req_rate(60s) we tell HAProxy to count the number of
        # requests made by each IP address in the last 60s.
	stick-table type ipv6 size 100k expire 120s store http_req_rate(60s)
        # Track each client's source address in sticky counter sc0, backed by the stick table above
	http-request track-sc0 src

	# Allow HAProxy to scan the body of requests
	option http-buffer-request

	# Allow 5000 requests per minute from a client. If more requests are made, respond with a
        # status 429 (Too Many Requests) error. (Currently disabled.)
	# http-request deny deny_status 429 if { sc_http_req_rate(0) gt 5000 }

        # Automatically redirect traffic to https if it came from http. This is disabled for the API
        # for performance reasons since some clients don't want to use HTTPS
	# redirect scheme https code 301 if !{ ssl_fc }

	acl letsencrypt-acl path_beg /.well-known/acme-challenge/
	use_backend letsencrypt if letsencrypt-acl

	acl is_pept2data path_beg /mpa/pept2data
	acl is_peptinfo path_beg /api/v2/peptinfo
	acl is_protinfo path_beg /api/v2/protinfo
	acl is_missed_cleavage req.body -m reg \"missed\":[^,]*true

        # Multiple conditions on a single line are AND-ed together
        use_backend ssd_handlers if is_pept2data is_missed_cleavage
        use_backend ssd_handlers if is_peptinfo || is_protinfo

	default_backend all_handlers

backend letsencrypt
	server letsencrypt 127.0.0.1:8888

backend ssd_handlers
        # GZIP responses from the backend servers before sending them to clients.
        filter compression
        compression algo gzip
        # Always send new requests to the backend handler that is currently handling the fewest
        # connections.
        balance leastconn
        mode http
        # Check if a backend server is still healthy by periodically contacting a specific endpoint.
        option httpchk
        # The metadata endpoint does use the database, yet is very lightweight, making it an ideal
        # candidate for the HTTP check method. This way we verify that both Apache and MySQL
        # are still functioning on the handler.
        http-check send meth GET uri /private_api/metadata.json

        # List of the different backend servers that are available for handling Unipept API-requests.
	server patty patty.ugent.be:80 check maxconn 100
	server selma selma.ugent.be:80 check maxconn 100

backend all_handlers
        # GZIP responses from the backend servers before sending them to clients.
        filter compression
        compression algo gzip
        # Always send new requests to the backend handler that is currently handling the fewest
        # connections.
        balance leastconn
        mode http
        # Check if a backend server is still healthy by periodically contacting a specific endpoint.
        option httpchk
        # The metadata endpoint does use the database, yet is very lightweight, making it an ideal
        # candidate for the HTTP check method. This way we verify that both Apache and MySQL
        # are still functioning on the handler.
        http-check send meth GET uri /private_api/metadata.json

        # List of the different backend servers that are available for handling Unipept API-requests.
        server patty patty.ugent.be:80 check maxconn 100
        server selma selma.ugent.be:80 check maxconn 100
	server rick rick.ugent.be:80 check maxconn 100
	server sherlock sherlock.ugent.be:80 check maxconn 100
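The body-inspection rule above hinges on one small regular expression. Before deploying a change to it, the pattern can be exercised locally with grep (the request bodies below are illustrative examples, not real API payloads):

```shell
# The ACL pattern used above: "missed":[^,]*true
# It should match a body in which the "missed" option is set to true...
echo '{"peptides":["AALTER"],"missed":true}' | grep -Eq '"missed":[^,]*true' && echo "matches"
# ...and reject one in which it is false
echo '{"peptides":["AALTER"],"missed":false}' | grep -Eq '"missed":[^,]*true' || echo "no match"
```

After editing the configuration file itself, it can be validated with haproxy -c -f /etc/haproxy/haproxy.cfg before HAProxy is reloaded.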

Logging and monitoring server status

To check how many requests are made to the Unipept API, which endpoints are the most popular, and how many server resources are consumed, we have set up a system that automatically analyses and summarizes HAProxy's logs. This summary is kept in a local MySQL database that can be accessed by Grafana.

1. Install and configure logrotate

We want to keep HAProxy's log files from the last 30 days. To keep things structured, we install the logrotate package, which processes the log file once every day, stores it in a separate file and clears the original one. Rotated files older than 30 days are removed automatically.

  • So, first install logrotate by running sudo apt install logrotate.
  • Then, create a new logrotate config file for HAProxy with sudo nano -c /etc/logrotate.d/haproxy and paste in the following configuration:
/var/log/haproxy.log {
    daily
    # Keep backlog of the last 30 days
    rotate 30
    missingok
    notifempty
    compress
    # Delay compressing the current log file to the next day
    delaycompress
    postrotate
        [ ! -x /usr/lib/rsyslog/rsyslog-rotate ] || /usr/lib/rsyslog/rsyslog-rotate
    endscript
}
  • Check that a logrotate timer exists that automatically triggers at midnight. This timer lives in /lib/systemd/system/logrotate.timer and /lib/systemd/system/logrotate.service; both are normally installed on the system by default. Make sure that the timer is enabled and active, e.g. with systemctl status logrotate.timer.
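The new rule can be tried out without touching any log files by running logrotate in debug mode; the commands below assume the configuration was saved to /etc/logrotate.d/haproxy as described above:

```shell
# Dry run: prints what logrotate would do, without rotating anything
sudo logrotate -d /etc/logrotate.d/haproxy

# Show when the daily logrotate timer will fire next
systemctl list-timers logrotate.timer
```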

2. Download Unipept's monitoring script

We have developed our own script for parsing and transforming the data from HAProxy's HALog utility, which we are going to download now:

  • Navigate into /usr/local/bin and clone the repository with git clone https://github.com/unipept/unipept-utilities.git.
  • Install NVM (Node Version Manager): curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash.
  • In order to start using NVM, we must first source our profile: source ~/.bashrc.
  • Install Node 20 (or higher) and set as the default: nvm install 20 && nvm alias default 20.
  • Globally install yarn which we need for the halog-collector script's dependencies: npm install --global yarn.
  • Navigate into the halog-collector scripts directory: cd /usr/local/bin/unipept-utilities/scripts/halog-collector.
  • Install all required Node-packages: yarn install.
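A quick sanity check that the toolchain is in place before moving on (the exact version numbers will differ; anything from Node 20 up is fine):

```shell
node --version   # should print v20 or higher
yarn --version
# After `yarn install`, the collector's dependencies live in node_modules
ls /usr/local/bin/unipept-utilities/scripts/halog-collector/node_modules >/dev/null && echo "dependencies installed"
```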

3. Install and configure MySQL server

To keep track of the log summary that will be produced by our self-made script, we need to install a MySQL server and configure it properly.

  • Start by following this DigitalOcean guide on how to install and set up the MySQL server. Make sure to use the username root for the MySQL root user and run through the mysql_secure_installation script so that the server is only accessible from localhost and anonymous access is disabled.
  • Create a new database called load_balancer_stats by running mysql -uroot -p$PASSWORD < /usr/local/bin/unipept-utilities/scripts/halog-collector/schema/default_schema.sql. Replace $PASSWORD with the password of your MySQL installation that you chose during the previous step.
  • Create a new user for the MySQL database that can only read data. This user will be used later on by Grafana and protects our database against accidental data deletion. Therefore, open a new MySQL terminal: mysql -uroot -p$PASSWORD (replace $PASSWORD with the real deal) and execute the following SQL commands one by one:
# Replace $password with the real thing!
CREATE USER 'grafana'@'%' IDENTIFIED BY '$password';
# The host part ('%') must match the one used in CREATE USER, since Grafana connects remotely
GRANT SELECT ON load_balancer_stats.* TO 'grafana'@'%';
FLUSH PRIVILEGES;
  • In order for this MySQL server to be accessible from the Grafana host, we need to expose the database on a specific port (we chose 4840 in this example). Open the server's configuration (sudo nano -c /etc/mysql/mysql.conf.d/mysqld.cnf) and make the following changes:
port = 4840
bind-address = xxx.xxx.xxx.xxx # the server's actual IP address
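A sketch of how the database setup can be verified end to end; note that MySQL has to be restarted before the new port and bind address take effect (service and tool names below assume a default Ubuntu installation):

```shell
# Restart MySQL so the new port/bind-address settings are picked up
sudo systemctl restart mysql

# Check that MySQL is now listening on the chosen port
ss -ltn | grep 4840

# Confirm that the grafana user only has read access (expect USAGE plus
# a single GRANT SELECT on load_balancer_stats.*)
mysql -ugrafana -p -e "SHOW GRANTS FOR CURRENT_USER();"
```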

4. Automatically run halog-collector once a day

We are going to produce statistics for our load balancer once a day and therefore need to run the halog-collector script at a fixed point in time every day. We will set up a new systemd service and timer for this purpose.

  • Create a new script that will automatically call halog-collector with the correct parameters: sudo nano -c /usr/local/bin/halog-collector.sh and paste the following contents in there (replace PASSWORD and USER with the correct credentials for your installation of MySQL):
#!/usr/bin/env bash

DB_NAME="load_balancer_stats"
DB_USER="USER"
DB_PASS="PASSWORD"
DB_PORT="4840"

# Always process the HAProxy log from yesterday
halog -u -H < /var/log/haproxy.log.1 | node /usr/local/bin/unipept-utilities/scripts/halog-collector/collect.js "$DB_USER" "$DB_PASS" "$DB_PORT" "$DB_NAME"
  • Make the new script executable: sudo chmod u+x /usr/local/bin/halog-collector.sh.
  • Create a new systemd service: sudo nano -c /lib/systemd/system/halog-collector.service and add the following contents to this file:
[Unit]
Description=Collects and summarizes HALog-files
# This service should only be started once logrotate is finished
After=logrotate.service
RequiresMountsFor=/var/log

[Service]
Type=oneshot
ExecStart=/usr/local/bin/halog-collector.sh
  • Now, create a systemd timer to accompany the service that we've just constructed. Create a new file (sudo nano -c /lib/systemd/system/halog-collector.timer) and add the following contents:
[Unit]
Description=Daily summary of load balancing endpoints

[Timer]
OnCalendar=daily
AccuracySec=1h
Persistent=true

[Install]
WantedBy=timers.target
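Creating the unit files alone does not activate anything; a final step, sketched below with the unit names used above, registers the files with systemd and enables the timer:

```shell
# Make systemd pick up the newly created unit files
sudo systemctl daemon-reload
# Enable the timer and start it right away
sudo systemctl enable --now halog-collector.timer
# Verify that the timer is scheduled
systemctl list-timers halog-collector.timer
```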