Implement minimal collection mode
This commit adds the capability to reduce metrics collection to just a critical subset of the metrics. The set of metrics to be collected is derived from the YBA "MINIMAL" metrics collection level.

Add a constant minimalCollectionPromRE that stores the string for the PromQL regular expression used to filter metrics in MINIMAL collection mode.

Add the --collection_level flag which accepts the values "normal" (the default) or "minimal".

Add code to validate the --collection_level flag value.

Add handling for the new --collection_level flag to the PromQL metric builder. This adds a RegEx match expression for minimalCollectionPromRE on the saved_name label when the collection level is set to minimal and we are processing the tserver export.

Update README.md with documentation for the new flag.
ionthegeek committed Dec 24, 2024
1 parent 5bc7b4a commit eb607a7
Showing 2 changed files with 80 additions and 2 deletions.
66 changes: 64 additions & 2 deletions README.md
@@ -171,8 +171,9 @@ that too many samples would be loaded into memory. The backoff mechanism halves

Flags in this section are applicable to YBA API Mode and Legacy Mode (see below). They do not apply to manual mode.

These flags will enable or disable specific exports or jobs when exporting data from Prometheus. In general, these
settings should be left at their defaults unless there is a compelling reason to do otherwise.
These flags will filter the metrics when exporting data from Prometheus. Most of these flags enable or disable specific
exports or jobs. In general, these settings should be left at their defaults unless there is a compelling reason to do
otherwise.

##### Exporters

@@ -206,6 +207,67 @@ for only the specified nodes.
| `--nodes` | | v0.2.0 | | Optional | Collect metrics for only the specified subset of nodes. Accepts a comma separated list of node numbers or ranges. For example, `--nodes=1,3-6,14` would collect metrics for nodes 1, 3, 4, 5, 6, and 14. Mutually exclusive with `--instances`. |
| `--instances` | | v0.2.0 | | Optional | Collect metrics for only the specified subset of nodes. Accepts a comma separated list of instance names. For example, `--instances=yb-prod-appname-n1,yb-prod-appname-n3,yb-prod-appname-n4,yb-prod-appname-n5,yb-prod-appname-n6,yb-prod-appname-n14`. Mutually exclusive with `--nodes`. |
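
The range syntax accepted by `--nodes` can be illustrated with a small Go sketch. `parseNodeSet` is a hypothetical helper written for this example, not the parser promdump actually uses; it assumes that ranges are inclusive and that a descending range such as `6-3` is an error.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseNodeSet expands a list such as "1,3-6,14" into sorted, de-duplicated
// node numbers. Hypothetical helper for illustration only.
func parseNodeSet(spec string) ([]int, error) {
	seen := map[int]bool{}
	for _, part := range strings.Split(spec, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			return nil, fmt.Errorf("empty element in node list %q", spec)
		}
		// A single number has no "-"; a range is "start-end".
		lo, hi, isRange := strings.Cut(part, "-")
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("invalid node number %q: %w", lo, err)
		}
		end := start
		if isRange {
			if end, err = strconv.Atoi(hi); err != nil {
				return nil, fmt.Errorf("invalid node number %q: %w", hi, err)
			}
		}
		// Assumption: descending ranges are rejected rather than reversed.
		if end < start {
			return nil, fmt.Errorf("invalid range %q: end is less than start", part)
		}
		for n := start; n <= end; n++ {
			seen[n] = true
		}
	}
	nodes := make([]int, 0, len(seen))
	for n := range seen {
		nodes = append(nodes, n)
	}
	sort.Ints(nodes)
	return nodes, nil
}

func main() {
	nodes, err := parseNodeSet("1,3-6,14")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(nodes) // prints [1 3 4 5 6 14]
}
```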

#### Collection Level

It can be extremely challenging to export tserver metrics from systems with very large numbers of nodes, tables, or
tablets due to the sheer volume of data. The following flag can be used to apply the YBA "minimal" collection level
rules to the dump, reducing the amount of data dumped and reducing dump runtime and size. Note that this setting can
only *reduce* the amount of metrics data collected; if the YBA collection level is set to any level lower than
`MINIMAL` (e.g. if metrics collection is `OFF` in YBA), no metrics will be dumped. You can't dump what was never
collected.

| Canonical Flag Name | Alias(es) | Added In | Default | Required? | Description |
|----------------------|-----------|----------|----------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------|
| `--collection_level` | | v0.7.0 | `normal` | Optional | Limit the size of the dump by collecting only the metrics associated with the YBA "minimal" collection level. One of `normal` or `minimal`. |
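
As a rough sketch of how the "one of `normal` or `minimal`" rule might be enforced, the following Go program parses the flag and rejects any other value. The `validateCollectionLevel` helper and the `promdump-sketch` flag set name are made up for this example; the comparison is assumed to be case sensitive, so `MINIMAL` would be rejected.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// validateCollectionLevel returns an error unless level is one of the two
// accepted values. Hypothetical helper; the match is case sensitive.
func validateCollectionLevel(level string) error {
	switch level {
	case "normal", "minimal":
		return nil
	default:
		return fmt.Errorf("invalid collection level %q: must be one of 'normal' or 'minimal'", level)
	}
}

func main() {
	// A private FlagSet keeps the sketch independent of the real command line.
	fs := flag.NewFlagSet("promdump-sketch", flag.ContinueOnError)
	level := fs.String("collection_level", "normal", "the scope of metrics to collect")

	if err := fs.Parse([]string{"--collection_level=minimal"}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	if err := validateCollectionLevel(*level); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("collection level:", *level)
}
```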

Minimal collection mode limits promdump to collecting only the following tserver metrics (where `*` matches anything):

`async_replication_committed_lag_micros`,
`async_replication_sent_lag_micros`,
`block_cache_*`,
`cpu_stime`,
`cpu_utime`,
`follower_lag_ms`,
`follower_memory_pressure_rejections`,
`generic_current_allocated_bytes`,
`generic_heap_size`,
`glog*`,
`handler_latency_outbound_call_queue_time*`,
`handler_latency_outbound_transfer*`,
`handler_latency_yb_client*`,
`handler_latency_yb_consensus_ConsensusService*`,
`handler_latency_yb_cqlserver_CQLServerService*`,
`handler_latency_yb_cqlserver_SQLProcessor*`,
`handler_latency_yb_master*`,
`handler_latency_yb_redisserver_RedisServerService_*`,
`handler_latency_yb_tserver_TabletServerService*`,
`handler_latency_yb_ysqlserver_SQLProcessor*`,
`hybrid_clock_skew`,
`involuntary_context_switches*`,
`leader_memory_pressure_rejections`,
`log_wal_size`,
`majority_sst_files_rejections`,
`operation_memory_pressure_rejections`,
`rocksdb_current_version_sst_files_size`,
`rpc_connections_alive`,
`rpc_inbound_calls_created`,
`rpc_incoming_queue_time*`,
`rpcs_in_queue*`,
`rpcs_queue_overflow`,
`rpcs_timed_out_in_queue`,
`spinlock_contention_time*`,
`threads_running*`,
`threads_started*`,
`transaction_pool_cache*`,
`voluntary_context_switches*`,
`yb_ysqlserver_active_connection_total`,
`yb_ysqlserver_connection_over_limit_total`,
`yb_ysqlserver_connection_total`,
`yb_ysqlserver_new_connection_total`

This list (and the corresponding `promdump` code) was derived from the metrics level params configuration file
`minimal_level_params.json` from the 2024.2.0-b145 YBA release.
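
The Prometheus documentation states that regex label matchers are fully anchored, so a pattern has to cover the entire label value in order to match. The sketch below shows one way the `*` wildcards in the list above could be written out as explicit `.*` in an alternation, and emulates anchored matching with Go's `regexp` package. The helper names and the sample metric names `block_cache_hit`, `cpu_stime_extra`, and `glog_info_messages` are invented for illustration; this is not the code promdump uses to build its pattern.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wildcardsToPromRE joins metric names into a single alternation, turning
// each "*" into ".*" and quoting everything else literally.
func wildcardsToPromRE(names []string) string {
	parts := make([]string, 0, len(names))
	for _, name := range names {
		segs := strings.Split(name, "*")
		for i, s := range segs {
			segs[i] = regexp.QuoteMeta(s)
		}
		parts = append(parts, strings.Join(segs, ".*"))
	}
	return strings.Join(parts, "|")
}

// matchesAnchored emulates a fully anchored matcher by wrapping the pattern
// in ^(?:...)$ before compiling it.
func matchesAnchored(pattern, value string) (bool, error) {
	re, err := regexp.Compile("^(?:" + pattern + ")$")
	if err != nil {
		return false, err
	}
	return re.MatchString(value), nil
}

func main() {
	pattern := wildcardsToPromRE([]string{"block_cache_*", "cpu_stime", "glog*"})
	fmt.Println(pattern)

	// Sample names are made up for this example.
	for _, name := range []string{"block_cache_hit", "cpu_stime", "cpu_stime_extra", "glog_info_messages"} {
		ok, err := matchesAnchored(pattern, name)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("%-20s %v\n", name, ok)
	}
}
```

Under this emulation a bare prefix such as `block_cache_` matches only that exact string, which is why the sketch spells the wildcard out as `.*`.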

### Output Flags

Flags in this section control aspects of how the exported data are written to disk.
16 changes: 16 additions & 0 deletions promdump/promdump.go
@@ -41,6 +41,14 @@ const defaultBatchDuration = 15 * time.Minute
const defaultYbaHostname = "localhost"
const defaultPromPort = 9090

// Derived from the metrics level params configuration file `minimal_level_params.json` from the 2024.2.0-b145
// YBA release. Check this file for changes periodically and merge them in!
//
// Full path in repo:
//
// https://github.com/yugabyte/yugabyte-db/blob/2024.2.0.0-b145/managed/src/main/resources/metric/minimal_level_params.json
const minimalCollectionPromRE = "async_replication_committed_lag_micros|async_replication_sent_lag_micros|block_cache_|cpu_stime|cpu_utime|follower_lag_ms|follower_memory_pressure_rejections|generic_current_allocated_bytes|generic_heap_size|glog|handler_latency_outbound_call_queue_time|handler_latency_outbound_transfer|handler_latency_yb_client|handler_latency_yb_consensus_ConsensusService|handler_latency_yb_cqlserver_CQLServerService|handler_latency_yb_cqlserver_SQLProcessor|handler_latency_yb_master|handler_latency_yb_redisserver_RedisServerService_|handler_latency_yb_tserver_TabletServerService|handler_latency_yb_ysqlserver_SQLProcessor|hybrid_clock_skew|involuntary_context_switches|leader_memory_pressure_rejections|log_wal_size|majority_sst_files_rejections|operation_memory_pressure_rejections|rocksdb_current_version_sst_files_size|rpc_connections_alive|rpc_inbound_calls_created|rpc_incoming_queue_time|rpcs_in_queue|rpcs_queue_overflow|rpcs_timed_out_in_queue|spinlock_contention_time|threads_running|threads_started|transaction_pool_cache|voluntary_context_switches|yb_ysqlserver_active_connection_total|yb_ysqlserver_connection_over_limit_total|yb_ysqlserver_connection_total|yb_ysqlserver_new_connection_total"

type promExport struct {
	exportName string
	jobName string
@@ -79,6 +87,7 @@ var (
	prefixValidation = flag.Bool("node_prefix_validation", true, "set to false to disable node prefix validation")
	universeName = flag.String("universe_name", "", "the name of the Universe for which to collect metrics, as shown in the YBA UI")
	universeUuid = flag.String("universe_uuid", "", "the UUID of the Universe for which to collect metrics")
	collectionLevel = flag.String("collection_level", "normal", "the scope of metrics to collect; set to \"minimal\" to collect a subset of only critical metrics")
	instanceList = flag.String("instances", "", "the instance name(s) for which to collect metrics (optional, mutually exclusive with --nodes; comma separated list, e.g. yb-prod-appname-n1,yb-prod-appname-n3,yb-prod-appname-n4,yb-prod-appname-n5,yb-prod-appname-n6,yb-prod-appname-n14; disables collection of platform metrics unless explicitly enabled with --platform")
	nodeSet = flag.String("nodes", "", "the node number(s) for which to collect metrics (optional, mutually exclusive with --instances); comma separated list of node numbers or ranges, e.g. 1,3-6,14; disables collection of platform metrics unless explicitly requested with --platform")
	batchesPerFile = flag.Uint("batches_per_file", 1, "batches per output file")
@@ -1053,6 +1062,10 @@ func main() {
		useYbaApi = true
	}

	if *collectionLevel != "normal" && *collectionLevel != "minimal" {
		logger.Fatalf("main: invalid collection level '%v': must be one of 'normal' or 'minimal'", *collectionLevel)
	}

	if useYbaApi {
		if *ybaToken == "" {
			logger.Fatalln("The --yba_api_token flag is required when using the YBA API. See the YBA API documentation at: https://api-docs.yugabyte.com/")
@@ -1316,6 +1329,9 @@ func main() {
			// prefix isn't required.
			labels = append(labels, fmt.Sprintf("node_prefix=\"%s\"", *nodePrefix))
		}
		if *collectionLevel == "minimal" && v.exportName == "tserver_export" {
			labels = append(labels, fmt.Sprintf("saved_name=~\"%s\"", minimalCollectionPromRE))
		}
		if instanceLabelString != "" {
			labels = append(labels, instanceLabelString)
		}
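
The effect of the new branch on the generated selector can be seen in a standalone sketch. It assumes the label list is eventually joined with commas inside braces to form a PromQL selector. The `buildSelector` function, the `export_type` label, the `master_export` name, and the shortened `minimalRE` constant are stand-ins chosen for this example; only `node_prefix`, `saved_name`, and `tserver_export` come from the diff itself.

```go
package main

import (
	"fmt"
	"strings"
)

// minimalRE is a shortened stand-in for minimalCollectionPromRE.
const minimalRE = "cpu_stime|cpu_utime|follower_lag_ms"

// buildSelector assembles a PromQL label selector. Hypothetical helper that
// mirrors the shape of the label-building code in the hunk above.
func buildSelector(exportName, nodePrefix, collectionLevel string) string {
	// Assumption: the export is selected through an export_type label.
	labels := []string{fmt.Sprintf("export_type=\"%s\"", exportName)}
	if nodePrefix != "" {
		labels = append(labels, fmt.Sprintf("node_prefix=\"%s\"", nodePrefix))
	}
	// The regex matcher is added only for the tserver export in minimal mode.
	if collectionLevel == "minimal" && exportName == "tserver_export" {
		labels = append(labels, fmt.Sprintf("saved_name=~\"%s\"", minimalRE))
	}
	return "{" + strings.Join(labels, ",") + "}"
}

func main() {
	fmt.Println(buildSelector("tserver_export", "yb-prod-appname", "minimal"))
	fmt.Println(buildSelector("tserver_export", "yb-prod-appname", "normal"))
	fmt.Println(buildSelector("master_export", "yb-prod-appname", "minimal"))
}
```

Only the first call produces a selector containing the `saved_name=~` matcher; the other two are unaffected by the collection level.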