Add docs about multi cluster prom monitoring #1490

Merged · 24 commits · Mar 21, 2022
Binary file added media/pull-prom-federation.png

Binary file added media/pull-thanos-query.png

Binary file added media/push-thanos-receive.png
1 change: 1 addition & 0 deletions zh/TOC.md
@@ -33,6 +33,7 @@
- [Deploy Monitoring and Alerts for a TiDB Cluster](monitor-a-tidb-cluster.md)
- [Monitor and Diagnose TiDB Clusters Using TiDB Dashboard](access-dashboard.md)
- [Aggregate Monitoring Data of Multiple TiDB Clusters](aggregate-multiple-cluster-monitor-data.md)
- [Monitor TiDB Clusters Across Multiple Kubernetes Clusters](deploy-tidb-monitor-across-multiple-kubernetes.md)
- [Enable Dynamic Configuration for TidbMonitor](enable-monitor-dynamic-configuration.md)
- [Enable Shards for TidbMonitor](enable-monitor-shards.md)
- Data Migration
4 changes: 4 additions & 0 deletions zh/deploy-tidb-cluster-across-multiple-kubernetes.md
@@ -633,3 +633,7 @@ kubectl patch tidbcluster cluster1 --type merge -p '{"spec":{"acrossK8s": true}}'
After completing the above steps, this TidbCluster can serve as the initial TidbCluster of a TiDB cluster deployed across Kubernetes clusters. To deploy the other TidbClusters, refer to the [Step 2: Deploy the new TidbCluster to join the TiDB cluster](#第-2-步部署新的-tidbcluster-加入-tidb-集群) section.

For more examples and development information, see [`multi-cluster`](https://github.com/pingcap/tidb-operator/tree/master/examples/multi-cluster).

## Monitor a TiDB cluster deployed across multiple Kubernetes clusters

See [Monitor a TiDB Cluster Across Multiple Kubernetes Clusters](deploy-tidb-monitor-across-multiple-kubernetes.md).
321 changes: 321 additions & 0 deletions zh/deploy-tidb-monitor-across-multiple-kubernetes.md
@@ -0,0 +1,321 @@
---
title: Monitor a TiDB Cluster Across Multiple Kubernetes Clusters
summary: Learn how to monitor TiDB clusters deployed across multiple Kubernetes clusters and integrate them into common Prometheus multi-cluster monitoring architectures.
---

# Monitor a TiDB Cluster Across Multiple Kubernetes Clusters

You can monitor TiDB clusters deployed across multiple Kubernetes clusters and access the monitoring data from a single global view. This document describes how to integrate with several common Prometheus multi-cluster monitoring solutions and how to visualize the multi-cluster data with Grafana:

- [Push method](#push-method)
- [Pull method - using Thanos Query](#use-thanos-query)
- [Pull method - using Prometheus Federation](#use-prometheus-federation)
- [Visualize multi-cluster monitoring data with Grafana](#visualize-multi-cluster-monitoring-data-with-grafana)

## Push method

The push method uses the Prometheus remote-write feature: the Prometheus instances in different Kubernetes clusters push their monitoring data to a centralized storage.

The push method described in this section uses Thanos as an example. If you use another [centralized storage solution compatible with the Prometheus remote API](https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage), you only need to replace the Thanos components accordingly.

### Prerequisites

Components across the multiple Kubernetes clusters must meet the following condition:

- The Prometheus (that is, TidbMonitor) component in each Kubernetes cluster can access the Thanos Receiver component.

For the deployment of Thanos Receiver, see [kube-thanos](https://github.com/thanos-io/kube-thanos) and [Example](https://github.com/pingcap/tidb-operator/tree/master/examples/monitor-prom-remotewrite).
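If you only want to experiment with the push method outside Kubernetes, the following is a minimal single-replica sketch of starting Thanos Receive; the ports, TSDB path, and label are illustrative assumptions, and a production deployment should follow the kube-thanos manifests referenced above:

{{< copyable "shell-regular" >}}

```sh
# Minimal single-replica Thanos Receive for local experiments (not production).
# Port 19291 matches the ${remote_write_url} used later in this document.
thanos receive \
    --grpc-address=0.0.0.0:10901 \
    --http-address=0.0.0.0:10902 \
    --remote-write.address=0.0.0.0:19291 \
    --tsdb.path=/var/thanos/receive \
    --label='receive_replica="0"'
```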

### Deployment architecture

![push-thanos-receive.png](/media/push-thanos-receive.png)

### Deploy TiDB cluster monitoring

1. Set the following environment variables according to the Kubernetes cluster where each TiDB cluster is deployed:

    - `cluster_name`: the TiDB cluster name.
    - `cluster_namespace`: the namespace of the TiDB cluster.
    - `kubernetes_cluster_name`: a user-defined Kubernetes cluster name, used in the `externalLabels` that identify the Prometheus instance.
    - `storageclass_name`: the storage class in the current cluster.
    - `remote_write_url`: the host of the `thanos-receiver` component, or of another component compatible with the Prometheus remote API.

    {{< copyable "shell-regular" >}}

    ```sh
    cluster_name="cluster1"
    cluster_namespace="pingcap"
    kubernetes_cluster_name="kind-cluster-1"
    storageclass_name="local-storage"
    remote_write_url="http://thanos-receiver:19291/api/v1/receive"
    ```

2. Run the following command to create the `TidbMonitor` resource:

    {{< copyable "shell-regular" >}}

    ```sh
    cat << EOF | kubectl apply -n ${cluster_namespace} -f -
    apiVersion: pingcap.com/v1alpha1
    kind: TidbMonitor
    metadata:
      name: ${cluster_name}
    spec:
      clusters:
        - name: ${cluster_name}
          namespace: ${cluster_namespace}
      externalLabels:
        # k8s_cluster indicates the Kubernetes cluster name. You can change
        # the label's name on your own, but note that the "cluster" label
        # is already used by the TiDB metrics. For more information, see
        # https://github.com/pingcap/tidb-operator/issues/4219.
        k8s_cluster: ${kubernetes_cluster_name}
        # add other meta labels here
        #region: us-east-1
      initializer:
        baseImage: pingcap/tidb-monitor-initializer
        version: v5.4.0
      persistent: true
      storage: 100Gi
      storageClassName: ${storageclass_name}
      prometheus:
        baseImage: prom/prometheus
        logLevel: info
        remoteWrite:
          - url: ${remote_write_url}
        retentionTime: 2h
        version: v2.27.1
      reloader:
        baseImage: pingcap/tidb-monitor-reloader
        version: v1.0.1
      imagePullPolicy: IfNotPresent
    EOF
    ```
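After the `TidbMonitor` is created, you can verify that the monitor is running before checking data on the Thanos Receiver side; this quick check is a suggested sanity check, not part of the original procedure:

{{< copyable "shell-regular" >}}

```sh
# Check that the TidbMonitor resource exists
kubectl -n ${cluster_namespace} get tidbmonitor ${cluster_name}
# The monitor Pod is typically named ${cluster_name}-monitor-<suffix>
kubectl -n ${cluster_namespace} get pods | grep monitor
```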

## Pull method

The pull method pulls monitoring data from the Prometheus instances in different Kubernetes clusters and aggregates the data to provide a unified global query view. This document covers two approaches: [using Thanos Query](#use-thanos-query) and [using Prometheus Federation](#use-prometheus-federation).

<SimpleTab>
<div label="Thanos Query">

### Use Thanos Query

The example in this section deploys a Thanos Sidecar for each Prometheus (TidbMonitor) component and uses the thanos-query component for aggregated queries. If you do not need long-term storage of monitoring data, you can skip deploying components such as thanos-store and S3.

#### Prerequisites

You need to configure the network and DNS of the Kubernetes clusters so that they meet the following conditions:

- The Thanos Query component can access the Pod IP of the Prometheus (that is, TidbMonitor) component in each Kubernetes cluster.
- The Thanos Query component can access the Pod FQDN of the Prometheus (that is, TidbMonitor) component in each Kubernetes cluster.

For the deployment of Thanos Query, see [kube-thanos](https://github.com/thanos-io/kube-thanos) and [Example](https://github.com/pingcap/tidb-operator/tree/master/examples/monitor-with-thanos).
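As an optional sanity check (not part of the original procedure), you can verify the cross-cluster connectivity from the cluster that runs Thanos Query, assuming the variables from the next section are set and `nslookup`/`nc` are available:

{{< copyable "shell-regular" >}}

```sh
# Resolve the Prometheus service FQDN that Thanos Query will use as a store
nslookup ${cluster_name}-prometheus.${cluster_namespace}.svc.${cluster_domain}
# Probe the Thanos Sidecar gRPC port (10901)
nc -zv ${cluster_name}-prometheus.${cluster_namespace}.svc.${cluster_domain} 10901
```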

#### Deployment architecture

![pull-thanos-query.png](/media/pull-thanos-query.png)

#### Deploy TiDB cluster monitoring

1. Set the following environment variables according to the Kubernetes cluster where each TiDB cluster is deployed:

    - `cluster_name`: the TiDB cluster name.
    - `cluster_namespace`: the namespace of the TiDB cluster.
    - `kubernetes_cluster_name`: a user-defined Kubernetes cluster name, used in the `externalLabels` that identify the Prometheus instance.
    - `cluster_domain`: the [Cluster Domain](https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/#introduction) of the current Kubernetes cluster.
    - `storageclass_name`: the storage class in the current Kubernetes cluster.

    {{< copyable "shell-regular" >}}

    ```sh
    cluster_name="cluster1"
    cluster_namespace="pingcap"
    kubernetes_cluster_name="kind-cluster-1"
    storageclass_name="local-storage"
    cluster_domain="svc.local"
    ```

2. Run the following command to create the `TidbMonitor` resource:

    {{< copyable "shell-regular" >}}

    ```sh
    cat <<EOF | kubectl apply -n ${cluster_namespace} -f -
    apiVersion: pingcap.com/v1alpha1
    kind: TidbMonitor
    metadata:
      name: ${cluster_name}
    spec:
      clusters:
        - name: ${cluster_name}
          namespace: ${cluster_namespace}
      externalLabels:
        # k8s_cluster indicates the Kubernetes cluster name. You can change
        # the label's name on your own, but note that the "cluster" label
        # is already used by the TiDB metrics. For more information, see
        # https://github.com/pingcap/tidb-operator/issues/4219.
        k8s_cluster: ${kubernetes_cluster_name}
        # add other meta labels here
        #region: us-east-1
      initializer:
        baseImage: pingcap/tidb-monitor-initializer
        version: v5.4.0
      persistent: true
      storage: 20Gi
      storageClassName: ${storageclass_name}
      prometheus:
        baseImage: prom/prometheus
        logLevel: info
        version: v2.27.1
      reloader:
        baseImage: pingcap/tidb-monitor-reloader
        version: v1.0.1
      thanos:
        baseImage: quay.io/thanos/thanos
        version: v0.22.0
        # enable the config below if long-term storage is needed
        #objectStorageConfig:
        #  key: objectstorage.yaml
        #  name: thanos-objectstorage
      imagePullPolicy: IfNotPresent
    EOF
    ```

3. Configure the Thanos Query stores:

    To register the store nodes via static service discovery, add `--store=${cluster_name}-prometheus.${cluster_namespace}.svc.${cluster_domain}:10901` to the command-line startup arguments of Thanos Query, replacing the variables with their actual values, as in the sketch below. If you use another service discovery mechanism, see [thanos-service-discovery](https://thanos.io/tip/thanos/service-discovery.md/) for the configuration.
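    For example, with two clusters the Thanos Query startup arguments might look like the following sketch; the cluster names, namespace, and cluster domains are illustrative assumptions:

    {{< copyable "shell-regular" >}}

    ```sh
    # Example only: aggregate two TidbMonitor Prometheus instances
    thanos query \
        --http-address=0.0.0.0:9090 \
        --store=cluster1-prometheus.pingcap.svc.cluster1.local:10901 \
        --store=cluster2-prometheus.pingcap.svc.cluster2.local:10901
    ```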

</div>

<div label="Federation">

### Use Prometheus Federation

The example in this section uses a federation Prometheus server as the entry point for unified data storage and queries. It is recommended only for environments with a small data volume.

#### Prerequisites

You need to configure the network and DNS of the Kubernetes clusters so that they meet the following conditions:

- The federation Prometheus component can access the Pod IP of the Prometheus (that is, TidbMonitor) component in each Kubernetes cluster.
- The federation Prometheus component can access the Pod FQDN of the Prometheus (that is, TidbMonitor) component in each Kubernetes cluster.

#### Deployment architecture

![pull-prom-federation.png](/media/pull-prom-federation.png)

#### Deploy TiDB cluster monitoring

1. Set the following environment variables according to the Kubernetes cluster where each TiDB cluster is deployed:

    - `cluster_name`: the TiDB cluster name.
    - `cluster_namespace`: the namespace of the TiDB cluster.
    - `kubernetes_cluster_name`: a user-defined Kubernetes cluster name, used in the `externalLabels` that identify the Prometheus instance.
    - `storageclass_name`: the storage class in the current cluster.

    {{< copyable "shell-regular" >}}

    ```sh
    cluster_name="cluster1"
    cluster_namespace="pingcap"
    kubernetes_cluster_name="kind-cluster-1"
    storageclass_name="local-storage"
    ```

2. Run the following command to create the `TidbMonitor` resource:

    {{< copyable "shell-regular" >}}

    ```sh
    cat << EOF | kubectl apply -n ${cluster_namespace} -f -
    apiVersion: pingcap.com/v1alpha1
    kind: TidbMonitor
    metadata:
      name: ${cluster_name}
    spec:
      clusters:
        - name: ${cluster_name}
          namespace: ${cluster_namespace}
      externalLabels:
        # k8s_cluster indicates the Kubernetes cluster name. You can change
        # the label's name on your own, but note that the "cluster" label
        # is already used by the TiDB metrics. For more information, see
        # https://github.com/pingcap/tidb-operator/issues/4219.
        k8s_cluster: ${kubernetes_cluster_name}
        # add other meta labels here
        #region: us-east-1
      initializer:
        baseImage: pingcap/tidb-monitor-initializer
        version: v5.4.0
      persistent: true
      storage: 20Gi
      storageClassName: ${storageclass_name}
      prometheus:
        baseImage: prom/prometheus
        logLevel: info
        version: v2.27.1
      reloader:
        baseImage: pingcap/tidb-monitor-reloader
        version: v1.0.1
      imagePullPolicy: IfNotPresent
    EOF
    ```

#### Configure the federation Prometheus

> **Reviewer comment (Contributor):**
>
> 1. Does the federation Prometheus need to be deployed? If so, add the deployment steps or a reference link.
> 2. Is there a required order between "Configure the federation Prometheus" and "Deploy TiDB cluster monitoring"?

> **Reply (Contributor Author):**
>
> 1. It does need to be deployed; deploying it simply means deploying Prometheus. Pointing users to https://prometheus.io/docs/prometheus/latest/federation/#hierarchical-federation is enough, so there is no need to repeat it in the tidb-operator docs.
> 2. There is no required order between the two.

For the federation solution, see the [Federation documentation](https://prometheus.io/docs/prometheus/latest/federation/#hierarchical-federation). After deployment, modify the Prometheus scrape configuration and add the host information of the Prometheus (TidbMonitor) instances whose data needs to be aggregated.

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s

    honor_labels: true
    metrics_path: '/federate'

    params:
      'match[]':
        - '{__name__=~".+"}'

    static_configs:
      - targets:
        - 'source-prometheus-1:9090'
        - 'source-prometheus-2:9090'
        - 'source-prometheus-3:9090'
```
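Before wiring a source Prometheus into the configuration above, you can optionally confirm that it serves the federation endpoint; the hostname below is one of the placeholder targets from the configuration:

{{< copyable "shell-regular" >}}

```sh
# Fetch a sample of federated metrics from one source Prometheus
curl -G 'http://source-prometheus-1:9090/federate' \
    --data-urlencode 'match[]={__name__=~".+"}' | head
```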

</div>
</SimpleTab>

## Visualize multi-cluster monitoring data with Grafana

After Prometheus collects the data, you can use Grafana to visualize the multi-cluster monitoring data.

1. Run the following command to get the Grafana dashboards of the TiDB components. The value of the `version` variable is the initializer image version, which should match the TiDB version. Note that currently only the **nightly** version of the initializer image supports **multi-Kubernetes-cluster** monitoring.

    {{< copyable "shell-regular" >}}

    ```sh
    # set tidb version here
    version=nightly
    docker run --rm -i -v ${PWD}/dashboards:/dashboards/ pingcap/tidb-monitor-initializer:${version} && \
    cd dashboards
    ```

    After running the above command, you can find the JSON definition files of all component dashboards in the current directory.

2. [Configure the Prometheus data source](https://grafana.com/docs/grafana/latest/datasources/prometheus/). To stay consistent with the dashboard JSON files obtained above, set the Name field of the data source to `tidb-cluster`. If you want to use an existing data source instead, run the following command to replace the data source name in the dashboard JSON files, where the value of the `DS_NAME` variable is the name of your data source. Note that `$DS_NAME` must stay outside the single quotes so that the shell expands it.

    {{< copyable "shell-regular" >}}

    ```sh
    # define your datasource name here.
    DS_NAME=thanos
    sed -i 's/"datasource": "tidb-cluster"/"datasource": "'$DS_NAME'"/g' *.json
    ```
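    You can optionally spot-check the replacement with `grep` (a quick verification, not part of the original steps):

    {{< copyable "shell-regular" >}}

    ```sh
    # Every dashboard should now reference the new data source name
    grep -h '"datasource"' *.json | sort -u
    ```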

3. [Import the dashboards into Grafana](https://grafana.com/docs/grafana/latest/dashboards/export-import/#import-dashboard).
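    If you prefer to script the import instead of using the Grafana UI, the Grafana HTTP API can create dashboards. The following is a sketch with a placeholder URL and credentials; exported JSON may need minor adjustments (for example, a null `id`) before the API accepts it:

    {{< copyable "shell-regular" >}}

    ```sh
    # Bulk-import all dashboard JSON files via the Grafana HTTP API
    for f in *.json; do
      curl -s -X POST 'http://admin:admin@127.0.0.1:3000/api/dashboards/db' \
        -H 'Content-Type: application/json' \
        -d "{\"dashboard\": $(cat "$f"), \"overwrite\": true}"
    done
    ```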