Monitoring

Workflow Overview

The overall workflow for monitoring services with Prometheus and the Prometheus Operator on Kubernetes. This section only walks through the steps; the YAML manifests come later (a minimal sketch follows this list).

1. Install the Prometheus Operator
   First, make sure the Prometheus Operator is installed in the cluster. The Operator manages the deployment, configuration, service discovery, and monitoring of Prometheus. It can be installed with Helm or by applying its YAML manifests directly.

2. Create a Service (expose the application)
   Create a Kubernetes Service for your application so that Prometheus can reach its metrics endpoint.
   A Service exposes the application's port so that other services (or external clients) can access it.
   When creating it, specify the port, the selector (which Pods are exposed), and so on.

3. Create Endpoints (optional)
   Normally Kubernetes creates Endpoints for a Service automatically, exposing the application's addresses and ports. If the application sits at a static address or needs manual wiring, create the Endpoints resource yourself.

4. Create a ServiceMonitor (Prometheus scraping)
   A ServiceMonitor is a custom resource provided by the Prometheus Operator that tells Prometheus which services to scrape. It selects a Kubernetes Service and defines how that service's metrics are collected.
   In the ServiceMonitor, specify the labels of the Service to monitor (for example, select Services labeled app: my-app) and configure the scrape port, interval, and so on.
   Prometheus then discovers and scrapes the service's metrics automatically based on the ServiceMonitor.

5. Verify the Prometheus configuration
   Once the ServiceMonitor exists, Prometheus starts monitoring the service. Check the scrape targets through the Prometheus web UI.
   Open the Prometheus UI, look at the "Targets" page, and confirm the target service shows up and is healthy.
   Query the relevant metrics in the Prometheus UI to confirm data is actually being scraped.

6. Configure alerting (optional)
   Prometheus can watch the service's health and fire alerts when something goes wrong, for example when response times get too long.
   Define alerting rules with a PrometheusRule resource, such as slow HTTP responses or a high error rate.
   Combined with Alertmanager, alerts can be routed to different notification channels (Slack, email, and so on).

7. Maintenance and tuning
   Scaling and adjustment: as the application grows, extend the Prometheus configuration as needed, add more ServiceMonitors, or tune scrape intervals.
   Troubleshooting: use alerts and monitoring data to spot and fix service problems early, and as input for root cause analysis (RCA).
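As a concrete illustration of steps 2 and 4, here is a minimal sketch of a Service plus ServiceMonitor pair. The application name my-app, its namespace default, the port name http-metrics (8080), and the /metrics path are assumptions for illustration only; adapt them to your own workload.

```yaml
# Sketch only: assumes a workload whose Pods carry the label app: my-app
# and expose Prometheus metrics on port 8080 at /metrics.
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: default
  labels:
    app: my-app
spec:
  selector:
    app: my-app               # which Pods this Service exposes
  ports:
  - name: http-metrics        # port name referenced by the ServiceMonitor below
    port: 8080
    targetPort: 8080
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring       # same namespace as the Prometheus instance
  labels:
    app: my-app
spec:
  selector:
    matchLabels:
      app: my-app             # select Services carrying this label
  namespaceSelector:
    matchNames:
    - default                 # namespace where the Service lives
  endpoints:
  - port: http-metrics        # must match the Service port name
    path: /metrics
    interval: 30s
```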
Installation

1. Download and unpack the release archive (wget is told to save the archive under the name the following commands expect):

```bash
wget -O kube-prometheus-0.13.0.zip https://github.com/prometheus-operator/kube-prometheus/archive/refs/tags/v0.13.0.zip
unzip kube-prometheus-0.13.0.zip
rm -f kube-prometheus-0.13.0.zip && cd kube-prometheus-0.13.0
```

2. Check which images are referenced, then rewrite the registries to a mirror that is reachable from the cluster (docker.io/bogeit here):

```bash
# Images used by kube-prometheus v0.13.0
quay.io/prometheus-operator/prometheus-config-reloader:v0.67.1
grafana/grafana:9.5.3
docker.io/cloudnativelabs/kube-router
quay.io/brancz/kube-rbac-proxy:v0.14.2
quay.io/fabxc/prometheus_demo_service
quay.io/prometheus/alertmanager:v0.26.0
quay.io/prometheus/blackbox-exporter:v0.24.0
quay.io/prometheus/node-exporter:v1.6.1
quay.io/prometheus-operator/prometheus-operator:v0.67.1
quay.io/prometheus/prometheus:v2.46.0
registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.9.2
registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.11.1

# Rewrite the image references in every manifest
find ./ -type f | xargs sed -ri 's+quay.io/.*/+docker.io/bogeit/+g'
find ./ -type f | xargs sed -ri 's+docker.io/cloudnativelabs/+docker.io/bogeit/+g'
find ./ -type f | xargs sed -ri 's+grafana/+docker.io/bogeit/+g'
find ./ -type f | xargs sed -ri 's+registry.k8s.io/.*/+docker.io/bogeit/+g'
```

3. Create all the resources:

```bash
kubectl create -f manifests/setup
kubectl create -f manifests/
```

Check the result after a while:

```bash
kubectl -n monitoring get all
kubectl -n monitoring get pod -w
```

To tear everything down again later:

```bash
kubectl delete --ignore-not-found=true -f manifests/ -f manifests/setup
```
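Between the two create commands it can help to wait until the Operator's CRDs are registered; otherwise the second command may fail with "no matches for kind ServiceMonitor". A small sketch, assuming the standard CRD names shipped by kube-prometheus:

```bash
# Wait until the CRDs created by manifests/setup are established
kubectl wait --for condition=Established --timeout=60s \
  crd/servicemonitors.monitoring.coreos.com \
  crd/prometheuses.monitoring.coreos.com \
  crd/alertmanagers.monitoring.coreos.com \
  crd/prometheusrules.monitoring.coreos.com

# Then create the remaining manifests and wait for the pods to come up
kubectl create -f manifests/
kubectl -n monitoring wait --for condition=Ready pod --all --timeout=300s
```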
Accessing the Prometheus UI

Create an Ingress for the prometheus-k8s Service:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: monitoring
  name: prometheus
spec:
  rules:
  - host: prometheus.k8s.com
    http:
      paths:
      - backend:
          service:
            name: prometheus-k8s
            port:
              number: 9090
        path: /
        pathType: Prefix
```
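Depending on how ingress-nginx was installed, the Ingress may also need an explicit `ingressClassName` in its spec (or the older kubernetes.io/ingress.class annotation) before the controller picks it up; `nginx` is the usual class name, but check what your own install registered:

```bash
# List the IngressClass objects registered in the cluster;
# put that name into spec.ingressClassName of the Ingress above if needed
kubectl get ingressclass
```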
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 apiVersion: networking.k8s.io/v1 kind: Ingress metadata: namespace: monitoring name: grafana spec: rules: - host: grafana.k8s.com http: paths: - backend: service: name: grafana port: number: 3000 path: / pathType: Prefix grafana 账号密码都是admin
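Both hostnames have to resolve to a node running the ingress controller. Without DNS, an /etc/hosts entry on the client machine is enough; 192.168.85.129 is the node used later in this post, so substitute your own node IP:

```bash
# Point both ingress hosts at the node where ingress-nginx runs (adjust the IP)
echo "192.168.85.129 prometheus.k8s.com grafana.k8s.com" | sudo tee -a /etc/hosts
```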
Note: delete the NetworkPolicies that ship with kube-prometheus, otherwise access to the services will be blocked.
```bash
kubectl -n monitoring delete networkpolicies.networking.k8s.io --all
```
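To confirm nothing is left that could block traffic:

```bash
kubectl -n monitoring get networkpolicies
# Expected: "No resources found in monitoring namespace."
```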
Accessing the UIs through the Ingress
The access path works like this: a request for grafana.k8s.com or prometheus.k8s.com reaches the Service in the ingress-nginx namespace, which forwards it to the controller Pod it selects; that Pod acts like an nginx reverse proxy. In other words, you reach the UIs through the node running the controller Pod plus the port exposed by the Service, i.e. the NodePort 30000: grafana.k8s.com:30000 or prometheus.k8s.com:30000.

```bash
kubectl get svc -n ingress-nginx
NAME                       TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller   NodePort   10.96.186.188   <none>        80:30000/TCP,443:31388/TCP   7d3h

kubectl get po -n ingress-nginx -o wide
NAME                             READY   STATUS    RESTARTS        AGE     IP               NODE        NOMINATED NODE   READINESS GATES
ingress-nginx-controller-x84p4   1/1     Running   6 (5m20s ago)   8m31s   192.168.85.129   k8s-node1   <none>           <none>
```
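A quick way to verify the whole chain without touching DNS is to send a request straight to the NodePort with the right Host header (node IP and port taken from the output above):

```bash
# Expect an HTTP 200 (or a 302 redirect) from Prometheus via the ingress controller
curl -s -o /dev/null -w '%{http_code}\n' -H 'Host: prometheus.k8s.com' http://192.168.85.129:30000/

# Same check for Grafana (usually a 302 redirect to /login)
curl -s -o /dev/null -w '%{http_code}\n' -H 'Host: grafana.k8s.com' http://192.168.85.129:30000/
```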
Monitoring kube-controller-manager and kube-scheduler

On the master, check which addresses the two components are listening on (ports 10257 and 10259):

```bash
ss -tlnp | grep -E '10257|10259'
LISTEN 0 32768 *:10257 *:* users:(("kube-controller",pid=3528,fd=3))
LISTEN 0 32768 *:10259 *:* users:(("kube-scheduler",pid=837,fd=3))
```

If they are bound to 127.0.0.1 (e.g. 127.0.0.1:10257), Prometheus cannot reach them from other nodes. Since both run as static pods, fix it by editing the manifests on the master (/etc/kubernetes/manifests/kube-controller-manager.yaml and kube-scheduler.yaml) and setting:

```yaml
- --bind-address=0.0.0.0
```

The kubelet recreates the pods automatically after the files are saved.

Because these two control-plane components are not exposed through any Service by default, Prometheus has nothing to discover; we need to create the matching Service and Endpoints ourselves to wire them up. The ServiceMonitors already ship with kube-prometheus, so there is no need to create those:

```bash
kubectl get servicemonitors.monitoring.coreos.com -A
NAMESPACE    NAME                      AGE
monitoring   alertmanager-main         28h
monitoring   blackbox-exporter         28h
monitoring   coredns                   28h
monitoring   grafana                   28h
monitoring   kube-apiserver            28h
monitoring   kube-controller-manager   28h
monitoring   kube-scheduler            28h
monitoring   kube-state-metrics        28h
monitoring   kubelet                   28h
monitoring   node-exporter             28h
monitoring   prometheus-adapter        28h
monitoring   prometheus-k8s            28h
monitoring   prometheus-operator       28h
```

```yaml
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    app.kubernetes.io/name: kube-controller-manager
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10257
    targetPort: 10257
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: kube-controller-manager
  namespace: kube-system
subsets:
- addresses:
  - ip: 192.168.85.128
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    app.kubernetes.io/name: kube-scheduler
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10259
    targetPort: 10259
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/name: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
subsets:
- addresses:
  - ip: 192.168.85.128
  ports:
  - name: https-metrics
    port: 10259
    protocol: TCP
```

Save the YAML above as repair-prometheus.yaml and create it:

```bash
kubectl apply -f repair-prometheus.yaml
```

Confirm the Services exist (the ports match what we just defined):

```bash
kubectl -n kube-system get svc kube-controller-manager kube-scheduler
NAME                      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE
kube-controller-manager   ClusterIP   None         <none>        10257/TCP   58s
kube-scheduler            ClusterIP   None         <none>        10259/TCP   58s
```

Then go back to the Prometheus UI and wait a little; the targets are discovered:

```
serviceMonitor/monitoring/kube-controller-manager/0 (2/2 up)
serviceMonitor/monitoring/kube-scheduler/0 (2/2 up)
```
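Once the targets are up, a quick sanity check is to query the `up` series for the two jobs (a value of 1 means the scrape succeeds). A sketch using the Prometheus HTTP API through the same ingress/NodePort path as before; you can also just type the expression into the Graph page of the UI:

```bash
# Query the "up" metric for both control-plane jobs via the Prometheus API
curl -sG 'http://192.168.85.129:30000/api/v1/query' \
  -H 'Host: prometheus.k8s.com' \
  --data-urlencode 'query=up{job=~"kube-controller-manager|kube-scheduler"}'
```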
Monitoring etcd

With the command below you can see which metrics etcd exposes:

```bash
curl -k --cacert /etc/kubernetes/pki/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  https://192.168.85.128:2379/metrics
```

Once that looks fine, configure things so etcd can be discovered and scraped by Prometheus. First put the certificates into a Secret and attach it to the Prometheus instance:

```bash
kubectl -n monitoring create secret generic etcd-certs \
  --from-file=/etc/kubernetes/pki/ca.crt \
  --from-file=/etc/kubernetes/pki/etcd/server.crt \
  --from-file=/etc/kubernetes/pki/etcd/server.key

kubectl -n monitoring edit prometheus k8s
```

```yaml
spec:
  ...
  secrets:
  - etcd-certs
```

After the Prometheus pods restart, the certificates are mounted inside the container:

```bash
/prometheus $ ls /etc/prometheus/secrets/etcd-certs/
ca.crt  server.crt  server.key
```

Because kube-prometheus does not ship a ServiceMonitor for etcd, we have to create the Service, Endpoints and ServiceMonitor ourselves. Replace the IP below with the actual internal IP of the node where etcd runs:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd
subsets:
- addresses:
  - ip: 192.168.85.128
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd-k8s
spec:
  jobLabel: k8s-app
  endpoints:
  - port: api
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/server.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/server.key
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
    - monitoring
```

Create the resources above:

```
service/etcd-k8s created
endpoints/etcd-k8s created
servicemonitor.monitoring.coreos.com/etcd-k8s created
```

After a short while the etcd cluster shows up in the Prometheus UI:

```
serviceMonitor/monitoring/etcd-k8s/0 (3/3 up)
```

For dashboards, search for etcd in the Grafana dashboard library and download the JSON template:

https://grafana.com/grafana/dashboards/3070-etcd/

Then open the Grafana instance deployed earlier, go to Home -> Data source -> Add data source -> choose Prometheus, fill in the Prometheus address and save. Then click Import dashboard in the top right (+ menu), click the Upload .json File button, upload the downloaded 3070_rev3.json, and click Import. The etcd cluster's monitoring graphs are now displayed.
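Before importing the dashboard you can confirm data is actually flowing with a well-known etcd series such as etcd_server_has_leader (each member should report 1). A sketch querying through the same ingress path as before:

```bash
# 1 means each etcd member currently sees a leader
curl -sG 'http://192.168.85.129:30000/api/v1/query' \
  -H 'Host: prometheus.k8s.com' \
  --data-urlencode 'query=etcd_server_has_leader'
```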
Monitoring ingress-nginx

Note: Prometheus runs in the monitoring namespace while the ingress controller runs in ingress-nginx, so the prometheus-k8s ServiceAccount needs RBAC permissions to discover targets in that namespace (watch is included alongside get/list because Prometheus' Kubernetes service discovery needs it):

```bash
cat cr.yaml
```

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-access-all-namespaces
rules:
- apiGroups: [""]
  resources: ["services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
```

```bash
cat crb.yaml
```

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-access-all-namespaces-binding
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: monitoring
roleRef:
  kind: ClusterRole
  name: prometheus-access-all-namespaces
  apiGroup: rbac.authorization.k8s.io
```

Because ingress-nginx was deployed earlier as a DaemonSet and maps its ports onto the host, the metrics endpoint can be reached directly on the node IP where the pod runs:

```bash
kubectl -n ingress-nginx get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP               NODE        NOMINATED NODE   READINESS GATES
ingress-nginx-controller-mbs95   1/1     Running   0          74m   192.168.85.129   k8s-node1   <none>           <none>
```

Create a ServiceMonitor so Prometheus can discover the ingress-nginx metrics:

```bash
cat ingress.servicemonitoring.yaml
```

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ingress-servicemonitor
  namespace: monitoring
  labels:
    app.kubernetes.io/name: ingress-nginx
spec:
  jobLabel: ingress-test
  endpoints:
  - port: app
    interval: 30s
    scheme: http
    path: /metrics
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
  namespaceSelector:
    matchNames:
    - ingress-nginx
```

Create it:

```
servicemonitor.monitoring.coreos.com/ingress-servicemonitor created
```

```bash
kubectl -n monitoring get servicemonitors
NAME                      AGE
alertmanager-main         32h
blackbox-exporter         32h
coredns                   32h
etcd-k8s                  27h
grafana                   32h
ingress-servicemonitor    78m
kube-apiserver            32h
kube-controller-manager   32h
kube-scheduler            32h
kube-state-metrics        32h
kubelet                   32h
node-exporter             32h
prometheus-adapter        32h
prometheus-k8s            32h
prometheus-operator       32h
```

Check the Prometheus UI again; the new target is already there:

```
serviceMonitor/monitoring/ingress-servicemonitor/0 (1/1 up)
```

Download the Grafana template and import it as before:

https://grafana.com/grafana/dashboards/14314-kubernetes-nginx-ingress-controller-nextgen-devops-nirvana/
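To confirm the controller's metrics are actually being scraped, query one of the standard ingress-nginx series, for example nginx_ingress_controller_requests (per-ingress request counters), using the same ingress/NodePort access path assumed earlier:

```bash
# Should return one time series per Ingress handled by the controller
curl -sG 'http://192.168.85.129:30000/api/v1/query' \
  -H 'Host: prometheus.k8s.com' \
  --data-urlencode 'query=nginx_ingress_controller_requests'
```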