Official documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/
Prometheus is configured through command-line flags and a configuration file. Run ./prometheus -h to list the available command-line flags.
Prometheus can reload its configuration at runtime; if the new configuration is not well-formed, it is not applied. A reload is triggered by sending a SIGHUP signal to the Prometheus process, or by sending an HTTP POST to the /-/reload endpoint (this requires the --web.enable-lifecycle flag).
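For example, assuming Prometheus is listening on localhost:9090 (and using pgrep as just one way to find its PID), either of the following triggers a reload:
$ # Reload via signal
$ kill -HUP "$(pgrep -f prometheus)"
$ # Reload via HTTP; only works when Prometheus was started with --web.enable-lifecycle
$ curl -X POST http://localhost:9090/-/reload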
Commonly used flags:
- --config.file: path to the configuration file.
- --web.listen-address: address to listen on for the UI, API, and telemetry.
- --web.read-timeout: maximum duration before timing out the read of a request.
- --web.max-connections: maximum number of simultaneous connections.
- --web.external-url: the URL under which Prometheus is externally reachable (for example, if it is served via a reverse proxy). Used for generating relative and absolute links back to Prometheus itself. If the URL has a path portion, it is used to prefix all HTTP endpoints served by Prometheus. If omitted, the relevant URL components are derived automatically.
- --web.route-prefix: prefix for the internal routes of web endpoints. Defaults to the path of --web.external-url.
- --web.user-assets: path to static asset directory, available at /user_assets.
- --web.enable-lifecycle: enable shutdown and reload via HTTP requests.
- --web.enable-admin-api: enable API endpoints for admin control actions.
- --web.console.templates: path to the console template (HTML) directory.
- --web.console.libraries: path to the console library directory.
- --web.page-title: document title of the Prometheus web UI.
- --web.cors.origin: regex for CORS origins.
- --storage.tsdb.path: base path of the data directory.
- --storage.tsdb.retention.time: how long to retain samples in storage, 15d by default. Supported units: y, w, d, h, m, s, ms.
- --storage.tsdb.retention.size: [experimental] maximum number of bytes that can be stored. Supported units: B, KB, MB, GB, TB, PB, EB.
- --storage.tsdb.no-lockfile: do not create a lockfile in the data directory.
- --storage.tsdb.allow-overlapping-blocks: [experimental] allow overlapping blocks, which enables vertical compaction and vertical query merging.
- --storage.tsdb.wal-compression: compress the TSDB WAL.
- --storage.remote.flush-deadline: how long to wait for pending data to flush when shutting down or reloading the configuration; flushing is aborted after this duration.
- --storage.remote.read-sample-limit: maximum overall number of samples to return via the remote read interface in a single query. 0 means no limit. This limit is ignored for streamed response types.
- --storage.remote.read-concurrent-limit: maximum number of concurrent remote read calls. 0 means no limit.
- --storage.remote.read-max-bytes-in-frame: maximum number of bytes in a single frame for the streamed remote read response type, before marshalling. Note that the client might also have a limit on frame size. Defaults to 1MB as recommended by protobuf.
- --rules.alert.for-outage-tolerance: maximum time to tolerate a Prometheus outage when restoring the "for" state of alerts.
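As a sketch, a startup command combining a few of these flags might look like this (the paths and retention value are assumptions, not defaults):
$ ./prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --web.listen-address=0.0.0.0:9090 \
    --web.enable-lifecycle \
    --storage.tsdb.path=/var/lib/prometheus \
    --storage.tsdb.retention.time=30d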
The configuration file is specified with the --config.file flag and is written in YAML. An example configuration file is available at: https://github.com/prometheus/prometheus/blob/release-2.16/config/testdata/conf.good.yml
In the configuration snippets below, anything in square brackets is optional. The placeholders follow these conventions:
- <boolean>: a boolean that can take the values true or false
- <duration>: a duration matching the regular expression [0-9]+(ms|[smhdwy])
- <labelname>: a string matching the regular expression [a-zA-Z_][a-zA-Z0-9_]*
- <labelvalue>: a string of unicode characters
- <filename>: a valid path
- <host>: a valid string consisting of a hostname or IP followed by an optional port number
- <path>: a valid URL path
- <scheme>: a string that can take the values http or https
- <secret>: a regular string that is a secret, such as a password
- <tmpl_string>: a string which is template-expanded before usage
The global configuration specifies parameters that are valid in all other configuration contexts. They also serve as defaults for other configuration sections. The global configuration looks like this (reproduced here because the rendering on the official site is hard to read):
global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]
  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]
  # How frequently to evaluate rules (recording and alerting rules).
  [ evaluation_interval: <duration> | default = 1m ]
  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]
  # File to which PromQL queries are logged.
  [ query_log_file: <string> ]
# List of rule files; globs are supported.
rule_files:
  [ - <filepath_glob> ... ]
# List of scrape configurations.
scrape_configs:
  [ - <scrape_config> ... ]
# Alerting settings.
alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]
# Settings related to the remote write feature,
# e.g. writing samples out to a remote store such as Kafka (via an adapter).
remote_write:
  [ - <remote_write> ... ]
# Settings related to the remote read feature.
remote_read:
  [ - <remote_read> ... ]
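As a concrete illustration, a minimal prometheus.yml could look like the following sketch (the job name, external label, and rule file path are made up):
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: demo                # assumed label identifying this Prometheus instance
rule_files:
  - "rules/*.yml"                # assumed location of rule files
scrape_configs:
  - job_name: prometheus         # scrape Prometheus itself
    static_configs:
      - targets: ["localhost:9090"]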
A scrape_config section specifies a set of targets and parameters describing how to scrape them. Targets may be configured statically or discovered dynamically via one of the service discovery mechanisms. The fields are as follows:
# The job name assigned to scraped metrics.
job_name: <job_name>
# How frequently to scrape targets from this job.
[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]
# Per-scrape timeout when scraping this job.
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]
# The HTTP resource path on which to fetch metrics from targets.
[ metrics_path: <path> | default = /metrics ]
# honor_labels controls how Prometheus handles conflicts between labels that are
# already present in scraped data and labels that Prometheus would attach
# server-side ("job" and "instance" labels, manually configured target
# labels, and labels generated by service discovery implementations).
#
# If honor_labels is set to "true", label conflicts are resolved by keeping label
# values from the scraped data and ignoring the conflicting server-side labels.
#
# If honor_labels is set to "false", label conflicts are resolved by renaming
# conflicting labels in the scraped data to "exported_<original-label>" (for
# example "exported_instance", "exported_job") and then attaching server-side
# labels.
#
# Setting honor_labels to "true" is useful for use cases such as federation and
# scraping the Pushgateway, where all labels specified in the target should be
# preserved.
#
# Note that any globally configured "external_labels" are unaffected by this
# setting. In communication with external systems, they are always applied only
# when a time series does not have a given label yet and are ignored otherwise.
[ honor_labels: <boolean> | default = false ]
# honor_timestamps controls whether Prometheus respects the timestamps present
# in scraped data.
#
# If honor_timestamps is set to "true", the timestamps of the metrics exposed
# by the target will be used.
#
# If honor_timestamps is set to "false", the timestamps of the metrics exposed
# by the target will be ignored.
[ honor_timestamps: <boolean> | default = true ]
# Configures the protocol scheme used for requests.
[ scheme: <scheme> | default = http ]
# Optional HTTP URL parameters.
params:
  [ <string>: [<string>, ...] ]
# Sets the `Authorization` header on every scrape request with the
# configured username and password.
# password and password_file are mutually exclusive.
basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]
# Sets the `Authorization` header on every scrape request with
# the configured bearer token. It is mutually exclusive with `bearer_token_file`.
[ bearer_token: <secret> ]
# Sets the `Authorization` header on every scrape request with the bearer token
# read from the configured file. It is mutually exclusive with `bearer_token`.
[ bearer_token_file: /path/to/bearer/token/file ]
# Configures the scrape request's TLS settings.
tls_config:
  [ <tls_config> ]
# Optional proxy URL.
[ proxy_url: <string> ]
# List of Azure service discovery configurations.
azure_sd_configs:
  [ - <azure_sd_config> ... ]
# List of Consul service discovery configurations.
consul_sd_configs:
  [ - <consul_sd_config> ... ]
# List of DNS service discovery configurations.
dns_sd_configs:
  [ - <dns_sd_config> ... ]
# List of EC2 service discovery configurations.
ec2_sd_configs:
  [ - <ec2_sd_config> ... ]
# List of OpenStack service discovery configurations.
openstack_sd_configs:
  [ - <openstack_sd_config> ... ]
# List of file-based service discovery configurations.
file_sd_configs:
  [ - <file_sd_config> ... ]
# List of GCE service discovery configurations.
gce_sd_configs:
  [ - <gce_sd_config> ... ]
# List of Kubernetes service discovery configurations.
kubernetes_sd_configs:
  [ - <kubernetes_sd_config> ... ]
# List of Marathon service discovery configurations.
marathon_sd_configs:
  [ - <marathon_sd_config> ... ]
# List of AirBnB's Nerve service discovery configurations.
nerve_sd_configs:
  [ - <nerve_sd_config> ... ]
# List of Zookeeper Serverset service discovery configurations.
serverset_sd_configs:
  [ - <serverset_sd_config> ... ]
# List of Triton service discovery configurations.
triton_sd_configs:
  [ - <triton_sd_config> ... ]
# List of statically configured targets; the most common option when getting started.
static_configs:
  [ - <static_config> ... ]
# List of target relabel configurations.
relabel_configs:
  [ - <relabel_config> ... ]
# List of metric relabel configurations.
metric_relabel_configs:
  [ - <relabel_config> ... ]
# Per-scrape limit on number of scraped samples that will be accepted.
# If more than this number of samples are present after metric relabelling
# the entire scrape will be treated as failed. 0 means no limit.
[ sample_limit: <int> | default = 0 ]
The <tls_config> referenced above allows configuring TLS connections:
# CA certificate to validate API server certificate with.
[ ca_file: <filename> ]
# Certificate and key files for client cert authentication to the server.
[ cert_file: <filename> ]
[ key_file: <filename> ]
# ServerName extension to indicate the name of the server.
# https://tools.ietf.org/html/rfc4366#section-3.1
[ server_name: <string> ]
# Disable validation of the server certificate.
[ insecure_skip_verify: <boolean> ]
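Putting several of these fields together, a static scrape job could look like the following sketch (target addresses, credentials, and certificate paths are assumptions for illustration):
scrape_configs:
  - job_name: "node"
    scrape_interval: 30s
    metrics_path: /metrics
    scheme: https
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/scrape_password   # assumed path
    tls_config:
      ca_file: /etc/prometheus/ca.crt                   # assumed path
      insecure_skip_verify: false
    static_configs:
      - targets: ["10.0.0.1:9100", "10.0.0.2:9100"]
        labels:
          env: production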
Let's look at kubernetes_sd_config in more detail; the other service discovery mechanisms are skipped here. Kubernetes SD configurations retrieve scrape targets from the Kubernetes REST API and always stay synchronized with the cluster state. One of several role types can be configured to discover targets.
The node role discovers one target per cluster node, with the address defaulting to the Kubelet's HTTP port.
Available meta labels:
- __meta_kubernetes_node_name: the name of the node object.
- __meta_kubernetes_node_label_<labelname>: each label from the node object.
- __meta_kubernetes_node_labelpresent_<labelname>: true for each label from the node object.
- __meta_kubernetes_node_annotation_<annotationname>: each annotation from the node object.
- __meta_kubernetes_node_annotationpresent_<annotationname>: true for each annotation from the node object.
- __meta_kubernetes_node_address_<address_type>: the first address for each node address type, if it exists.
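As a sketch of how these meta labels are typically used (the "monitoring" node label below is an assumption), a relabel_configs block in the scrape job can copy node labels onto targets and filter out unwanted nodes:
relabel_configs:
  # Copy every Kubernetes node label onto the target; labelmap strips the matched prefix.
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  # Keep only nodes carrying the (assumed) node label monitoring=enabled.
  - source_labels: [__meta_kubernetes_node_label_monitoring]
    regex: enabled
    action: keep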
The kubernetes_sd_config block itself has the following fields:
# The information to access the Kubernetes API.
# The API server addresses. If left empty, Prometheus is assumed to run inside
# of the cluster and will discover API servers automatically and use the pod's
# CA certificate and bearer token file at /var/run/secrets/kubernetes.io/serviceaccount/.
[ api_server: <host> ]
# The Kubernetes role of entities that should be discovered.
role: <role>
# Optional authentication information used to authenticate to the API server.
# Note that `basic_auth`, `bearer_token` and `bearer_token_file` options are
# mutually exclusive.
# password and password_file are mutually exclusive.
# Optional HTTP basic authentication information.
basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]
# Optional bearer token authentication information.
[ bearer_token: <secret> ]
# Optional bearer token file authentication information.
[ bearer_token_file: <filename> ]
# Optional proxy URL.
[ proxy_url: <string> ]
# TLS configuration.
tls_config:
  [ <tls_config> ]
# Optional namespace discovery. If omitted, all namespaces are used.
namespaces:
  names:
    [ - <string> ]
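A sketch of a scrape_configs entry that discovers cluster nodes from inside the cluster, using the in-cluster service account credentials mentioned above (the job name is arbitrary):
- job_name: "kubernetes-nodes"
  scheme: https
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    # Expose Kubernetes node labels as target labels.
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)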
Prometheus has two kinds of rules: recording rules and alerting rules. The rule_files field lists the files in which both kinds are defined; rule files use YAML format.
Rule syntax check:
The binary release ships an executable called promtool, which can check rule files for syntax errors:
$ ./promtool check rules prometheus.rules.yml
Checking prometheus.rules.yml
SUCCESS: 1 rules found
Recording rules let you precompute frequently needed or computationally expensive expressions and save the result as a new set of time series. Querying the precomputed result is then much faster than evaluating the original expression every time. This is especially useful for dashboards, which query the same expressions on every refresh.
Every rule file contains one or more groups:
groups:
  [ - <rule_group> ]
Here is a simple rule file:
groups:
  - name: example
    rules:
      - record: job:http_inprogress_requests:sum
        expr: sum(http_inprogress_requests) by (job)
Each group holds a list of rules. A recording rule has the following fields:
# The name of the time series to output to. Must be a valid metric name.
record: <string>
# The PromQL expression to evaluate. Every evaluation cycle this is
# evaluated at the current time, and the result recorded as a new set of
# time series with the metric name as given by 'record'.
expr: <string>
# Labels to add or overwrite before storing the result.
labels:
  [ <labelname>: <labelvalue> ]
Alerting rules let you define alert conditions in the Prometheus expression language and send notifications about firing alerts to an external service. Alerting rules are configured much like recording rules; here is an example:
groups:
  - name: example
    rules:
      - alert: HighRequestLatency
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
for: the optional evaluation wait time. The alert fires only once the condition has held for this duration; while waiting, a newly triggered alert is in the pending state.
Templates can be used to make alerts more readable; both labels and annotations may be templated. Within a rule's annotations, summary typically describes the alert in one line, while description carries the detailed information; the Alertmanager UI also displays alerts based on these two annotations. To improve readability, Prometheus supports templating the values of labels and annotations: the $labels variable gives access to the label values of the current alert instance, and $value holds the evaluated value of the alert's PromQL expression. For example:
groups:
  - name: example
    rules:
      # Alert for any instance that is unreachable for >5 minutes.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      # Alert for any instance that has a median request latency >1s.
      - alert: APIHighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
Active alerts can be viewed at http://localhost:9090/alerts. They can also be queried with an expression:
ALERTS{alertname="<alert name>", alertstate="pending|firing", <additional alert labels>}
A sample value of 1 means the alert is active (pending or firing); when the alert transitions from active back to inactive, the sample value becomes 0.
For finer control over how alerts are routed and delivered, use Alertmanager.
Prometheus templates are Go templates, the same templating language used by Helm, so they are not covered in detail here. The variables and functions available in Prometheus templates are documented at: https://prometheus.io/docs/prometheus/latest/configuration/template_reference/
Rules can be unit-tested with promtool:
$ # For a single test file.
$ ./promtool test rules test.yml
$ # If you have multiple test files, say test1.yml, test2.yml, test3.yml
$ ./promtool test rules test1.yml test2.yml test3.yml
Test file format:
# This is a list of rule files to consider for testing. Globs are supported.
rule_files:
  [ - <file_name> ]
# optional, default = 1m
evaluation_interval: <duration>
# The order in which group names are listed below will be the order of evaluation of
# rule groups (at a given evaluation time). The order is guaranteed only for the groups mentioned below.
# All the groups need not be mentioned below.
group_eval_order:
  [ - <group_name> ]
# All the tests are listed here.
tests:
  [ - <test_group> ]
Each <test_group> has the following fields:
# Series data
interval: <duration>
input_series:
  [ - <series> ]
# Unit tests for the above data.
# Unit tests for alerting rules. We consider the alerting rules from the input file.
alert_rule_test:
  [ - <alert_test_case> ]
# Unit tests for PromQL expressions.
promql_expr_test:
  [ - <promql_test_case> ]
# External labels accessible to the alert template.
external_labels:
  [ <labelname>: <string> ... ]
Each <series> entry has a series field and a values field:
# This follows the usual series notation '<metric name>{<label name>=<label value>, ...}'
# Examples:
# series_name{label1="value1", label2="value2"}
# go_goroutines{job="prometheus", instance="localhost:9090"}
series: <string>
# This uses expanding notation.
# Expanding notation:
# 'a+bxc' becomes 'a a+b a+(2*b) a+(3*b) … a+(c*b)'
# 'a-bxc' becomes 'a a-b a-(2*b) a-(3*b) … a-(c*b)'
# Examples:
# 1. '-2+4x3' becomes '-2 2 6 10'
# 2. ' 1-2x4' becomes '1 -1 -3 -5 -7'
values: <string>
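A small end-to-end sketch, assuming the InstanceDown alerting rule from earlier is saved in a file called alerts.yml (the file name, instance name, and series values are made up):
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="myjob", instance="host:9100"}'
        values: '1 1 0 0 0 0 0 0'        # target goes down after two samples
    alert_rule_test:
      - eval_time: 8m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: page
              instance: host:9100
              job: myjob
            exp_annotations:
              summary: "Instance host:9100 down"
              description: "host:9100 of job myjob has been down for more than 5 minutes."
Saved as test.yml, running ./promtool test rules test.yml against this file should report the test as passing, since up is 0 for more than the 5m "for" duration by the 8m evaluation time.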