Expose healthcheck server (#643) #644

Open
wants to merge 1 commit into base: main
11 changes: 11 additions & 0 deletions main.go
@@ -64,6 +64,8 @@ func main() {
	var reloadRun bool
	var noReload bool
	var benchmark bool
	var collectorHealthCheckEnabled bool
	var collectorHealthCheckAddress string

	logFlags := log.LstdFlags
	logger := &util.Logger{}
@@ -107,6 +109,8 @@ func main() {
	flag.StringVar(&stateFilename, "statefile", defaultStateFile, "Specify alternative path for state file")
	flag.StringVar(&pidFilename, "pidfile", "", "Specifies a path that a pidfile should be written to (default is no pidfile being written)")
	flag.BoolVar(&benchmark, "benchmark", false, "Runs collector in benchmark mode (skip submitting the statistics to the server)")
	flag.BoolVar(&collectorHealthCheckEnabled, "collector-health-check-enabled", true, "Enable the health check endpoint on the collector")
	flag.StringVar(&collectorHealthCheckAddress, "collector-health-check-address", ":8080", "Address the health check webserver should listen on")
	flag.Parse()

	// Automatically reload the configuration after a successful test run.
@@ -281,6 +285,13 @@ ReadConfigAndRun:
	exitCode := 0
	keepRunning, testRunSuccess, writeStateFile, shutdown := runner.Run(ctx, &wg, globalCollectionOpts, logger, configFilename)

	if collectorHealthCheckEnabled {
		err := util.SetupHealthCheck(ctx, logger, &wg, collectorHealthCheckAddress)
		if err != nil {
			logger.PrintError("Failed to set up health check server: %s", err)
		}
	}

	if keepRunning {
		// Block here until we get any of the registered signals
		s := <-sigs

49 changes: 49 additions & 0 deletions util/healthcheck.go
@@ -0,0 +1,49 @@
package util

import (
	"context"
	"net/http"
	"sync"
	"time"
)

var (
	healthCheckServerShutdownTimeout = 1 * time.Second
)

// SetupHealthCheck starts a small HTTP server that answers liveness probes on
// /health and shuts it down gracefully once the passed context is canceled.
func SetupHealthCheck(ctx context.Context, logger *Logger, wg *sync.WaitGroup, address string) error {
	// Register the handler on a dedicated mux rather than the global
	// DefaultServeMux, so calling SetupHealthCheck again (e.g. after a config
	// reload) cannot panic on a duplicate registration.
	mux := http.NewServeMux()

	// Construct the server before starting the goroutines below, so the
	// shutdown goroutine never races against its initialization.
	srv := &http.Server{
		Addr:    address,
		Handler: mux,
	}

	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)

Member:

I wonder how hard it'd be to determine whether all configured servers have successful recent full snapshots - still not sure if that'd catch #637, but it would catch e.g. password errors or other problems with collecting the data and (some) errors when sending it.

We could potentially use server.PrevState.LastStatementStatsAt for this purpose? (https://github.com/pganalyze/collector/blob/main/state/state.go#L44)

If we do this, we should also make sure not to report failure at startup (where it can take up to 10 minutes before the first full snapshot sets server.PrevState).
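
A rough sketch of that idea, assuming the health check is handed the server list (the servers argument, the thresholds, and the package placement below are illustrative assumptions, not existing collector APIs; only PrevState.LastStatementStatsAt comes from state/state.go):

package healthcheck

import (
	"net/http"
	"time"

	"github.com/pganalyze/collector/state"
)

// healthHandler reports 503 when any monitored server has gone too long
// without a full snapshot, but always reports 200 during a startup grace
// period so a fresh collector is not flagged before its first snapshot sets
// PrevState. The thresholds are assumed values, not collector settings.
func healthHandler(servers []*state.Server, startedAt time.Time) http.HandlerFunc {
	const staleness = 20 * time.Minute    // assumed: roughly two missed 10-minute snapshots
	const startupGrace = 15 * time.Minute // assumed startup grace period

	return func(w http.ResponseWriter, r *http.Request) {
		if time.Since(startedAt) < startupGrace {
			w.WriteHeader(http.StatusOK)
			return
		}
		for _, server := range servers {
			if time.Since(server.PrevState.LastStatementStatsAt) > staleness {
				w.WriteHeader(http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}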

Author:

My intention was to configure this health check in the ECS task definition or as a Kubernetes liveness probe, so that if the process becomes unresponsive, ECS or Kubernetes restarts it.

In that context, I'm not sure an issue with one of the monitored databases is a good enough reason to kick the entire collector. Maybe the statuses of the monitored databases, as well as the statuses of uploads, could be reported as metrics on the collector? In other words, the collector could expose metrics for the status of its targets and the status of its uploads. If my understanding is correct, the latter would cover #637.
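
A minimal sketch of that metrics idea, with every name below hypothetical (nothing like it exists in this PR): the runner would record per-target and per-upload outcomes, and the health check web server would expose them read-only as JSON on a separate endpoint.

package healthcheck

import (
	"encoding/json"
	"net/http"
	"sync"
)

// CollectorStatus is a hypothetical holder for per-server collection and
// upload outcomes; the runner would update it, the web server would serve it.
type CollectorStatus struct {
	mu      sync.Mutex
	Targets map[string]bool `json:"targets"` // server name -> last collection succeeded
	Uploads map[string]bool `json:"uploads"` // server name -> last snapshot upload succeeded
}

// SetTarget records the outcome of the latest collection for one server.
// (A SetUpload twin for the Uploads map is omitted for brevity.)
func (s *CollectorStatus) SetTarget(name string, ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.Targets == nil {
		s.Targets = map[string]bool{}
	}
	s.Targets[name] = ok
}

// StatusHandler serves the current state as JSON, e.g. on /status next to /health.
func (s *CollectorStatus) StatusHandler(w http.ResponseWriter, r *http.Request) {
	s.mu.Lock()
	defer s.mu.Unlock()
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(s) // only the exported maps are encoded
}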

Contributor:

Yeah, perhaps I dismissed this idea prematurely. I think what Lukas proposed could work: we'd just need to make sure the health check has access to the server list. We could potentially expose configuration around that to have some flexibility over what constitutes a health check failure.

@mwasilew2 for the collector metric approach, how were you thinking of reporting these metrics? That seems different from a health check endpoint, right?

	})

	wg.Add(1)
	go func() {
		defer wg.Done()

		err := srv.ListenAndServe()
		if err != nil && err != http.ErrServerClosed {
			logger.PrintError("Error when running the health check server: %s", err)
		}
	}()

	wg.Add(1)
	go func() {
		defer wg.Done()

		// Wait for shutdown, then give in-flight requests a short grace period.
		<-ctx.Done()
		ctxWithTimeout, cancel := context.WithTimeout(context.Background(), healthCheckServerShutdownTimeout)
		defer cancel()
		err := srv.Shutdown(ctxWithTimeout)
		if err != nil {
			logger.PrintError("Failed to shut down the health check server: %s", err)
		}
	}()

	return nil
}