Integrating Prometheus, Grafana, and the ELK Stack: Best Practices
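# View a Pod's logs
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container-name> # View logs from a specific container in the Pod
# Stream logs in real time
kubectl logs -f <pod-name>
# View logs from the previous container instance (for example, after a restart)
kubectl logs --previous <pod-name>
# View logs for a Deployment (kubectl picks one of its Pods)
kubectl logs deployment/<deployment-name>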
Components: Elasticsearch + Logstash + Kibana
Purpose: centralized log collection, storage, analysis, and visualization
Pod logs → Fluentd/Filebeat → Logstash → Elasticsearch → Kibana
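The pipeline above works best when the application writes structured lines to stdout, because the collecting agent can then forward parsed fields instead of raw text. The following is a minimal sketch, assuming a Node.js application; `formatLogLine`, `logInfo`, and the field names are hypothetical and not part of the original setup.

```javascript
// Minimal sketch: emit one JSON object per line on stdout so a log agent
// (Fluentd, Filebeat, ...) can parse fields without custom patterns.
// formatLogLine, logInfo, and the field names are illustrative assumptions.
function formatLogLine(level, message, fields = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...fields,
  });
}

function logInfo(message, fields) {
  // Containers should log to stdout/stderr; the node-level agent reads
  // the lines from the container runtime's log files.
  process.stdout.write(formatLogLine('info', message, fields) + '\n');
}

logInfo('request completed', { method: 'GET', route: '/healthz', status: 200 });
```

One JSON object per line keeps multi-field events intact through each hop of the pipeline, so Kibana can filter on `level` or `status` directly.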
Features: Loki is a lightweight log aggregation system developed by Grafana Labs that integrates closely with Prometheus and Grafana
# Deploy Loki and Promtail with Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack --set grafana.enabled=true
Purpose: Prometheus is an open-source monitoring system and time-series database that collects and stores metrics
# Deploy Prometheus with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack
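Purpose: Grafana is an open-source visualization platform for building monitoring dashboards
# Access Grafana (then open http://localhost:3000)
kubectl port-forward svc/prometheus-grafana 3000:80
# Default login credentials
# Username: admin
# Password: read it from the Secret
kubectl get secret prometheus-grafana -o jsonpath='{.data.admin-password}' | base64 --decode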
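apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nginx-monitor
  namespace: monitoring
  labels:
    # By default, kube-prometheus-stack only selects ServiceMonitors
    # labeled with its Helm release name ("prometheus" in this guide).
    release: prometheus
spec:
  selector:
    matchLabels:
      app: nginx
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics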
// Expose Prometheus metrics from the application.
// Example: a Node.js (Express) app using the prom-client library.
const express = require('express');
const promClient = require('prom-client');

const app = express();

const counter = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status']
});

// Count every finished request, labeled by method, route, and status code.
app.use((req, res, next) => {
  res.on('finish', () => {
    // req.route is undefined when no route matched (for example, a 404).
    const route = req.route ? req.route.path : 'unmatched';
    counter.labels(req.method, route, res.statusCode).inc();
  });
  next();
});

// Expose the metrics endpoint.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});
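To make clear what Prometheus receives when it scrapes `/metrics`, the sketch below renders one counter in the Prometheus text exposition format by hand. `renderCounter` and `escapeLabelValue` are hypothetical helpers written for illustration; in a real application prom-client produces this output for you.

```javascript
// Minimal sketch of the Prometheus text exposition format for one counter.
// renderCounter and escapeLabelValue are illustrative helpers, not prom-client APIs.
function escapeLabelValue(value) {
  // Label values escape backslash, double quote, and newline.
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const labelText = Object.entries(labels)
      .map(([key, v]) => `${key}="${escapeLabelValue(v)}"`)
      .join(',');
    lines.push(`${name}{${labelText}} ${value}`);
  }
  return lines.join('\n') + '\n';
}

const text = renderCounter('http_requests_total', 'Total number of HTTP requests', [
  { labels: { method: 'GET', route: '/', status: 200 }, value: 42 },
  { labels: { method: 'POST', route: '/login', status: 500 }, value: 3 },
]);
process.stdout.write(text);
```

Each unique combination of label values becomes its own time series, which is why high-cardinality labels such as raw URLs or user IDs are usually avoided.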
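apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nginx-alerts
  namespace: monitoring
  labels:
    # By default, kube-prometheus-stack only loads rules labeled with
    # its Helm release name ("prometheus" in this guide).
    release: prometheus
spec:
  groups:
    - name: nginx
      rules:
        - alert: HighErrorRate
          expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate detected"
            # $value is a ratio (0 to 1), so format it as a percentage.
            description: "Error rate is {{ $value | humanizePercentage }}, which is above the threshold of 5%"
        - alert: HighCPUUsage
          # kube-state-metrics v2 replaced kube_pod_container_resource_requests_cpu_cores
          # with kube_pod_container_resource_requests{resource="cpu"}.
          expr: sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod) / sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod) > 0.8
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "High CPU usage detected"
            description: "CPU usage is {{ $value | humanizePercentage }} of the requested CPU, which is above the threshold of 80%"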
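The arithmetic behind the `HighErrorRate` expression can be sketched in plain JavaScript. This is an illustration under simplifying assumptions, not how Prometheus evaluates the rule: the sample data and the `perSecondRate` and `errorRatio` helpers are hypothetical, and unlike the real `rate()` function the sketch ignores counter resets and extrapolation.

```javascript
// Minimal sketch of the arithmetic behind the HighErrorRate expression.
// Each series carries its counter value at the start and end of the window.
// Unlike Prometheus's rate(), this ignores counter resets and extrapolation.
function perSecondRate(series, windowSeconds) {
  return (series.end - series.start) / windowSeconds;
}

function errorRatio(seriesList, windowSeconds) {
  let errors = 0;
  let total = 0;
  for (const series of seriesList) {
    const rate = perSecondRate(series, windowSeconds);
    total += rate;
    // Same match as status=~"5..": exactly three characters starting with 5.
    if (/^5..$/.test(series.status)) errors += rate;
  }
  return total === 0 ? NaN : errors / total;
}

const windowSeconds = 300; // the [5m] range
const ratio = errorRatio([
  { status: '200', start: 1000, end: 1900 },
  { status: '500', start: 10, end: 70 },
  { status: '503', start: 0, end: 40 },
], windowSeconds);
const firing = ratio > 0.05; // the alert threshold
console.log(ratio, firing);
```

Here 100 of 1000 requests in the window returned a 5xx status, so the ratio is about 0.1 and the condition holds. The actual alert additionally requires the condition to stay true for the `for: 5m` duration before it fires.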
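# Check Prometheus scrape targets
# (with kube-prometheus-stack, the operator creates the prometheus-operated Service)
kubectl port-forward svc/prometheus-operated 9090:9090
# Open http://localhost:9090/targets
# Open the Grafana dashboards
kubectl port-forward svc/prometheus-grafana 3000:80
# Check alert rules and Alertmanager
kubectl get prometheusrule -n monitoring
kubectl get alertmanager -n monitoring
# Test a query used by the alert rules
curl -X POST http://localhost:9090/api/v1/query --data-urlencode 'query=sum(rate(http_requests_total[5m]))'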
# View resource usage for Pods and nodes (requires metrics-server)
kubectl top pod
kubectl top node
# View events
kubectl get events
kubectl get events --sort-by='.lastTimestamp'
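Possible cause: a NetworkPolicy blocks Prometheus from reaching the application's metrics endpoint
Solution: configure the NetworkPolicy to allow Prometheus to reach the application's metrics endpoint
Possible cause: log rotation is misconfigured, or the log collection agent is not configured correctly
Solution: check the log rotation settings and make sure the log collection agent is deployed correctly
Possible cause: alert thresholds are set too low, or alert rules are too sensitive
Solution: adjust the alert thresholds, and use alert grouping and inhibition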
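Grouping and inhibition are configured in Alertmanager (through `group_by` on a route and `inhibit_rules`), but the two ideas are easy to see in a few lines of code. The sketch below is a conceptual illustration only, not Alertmanager's implementation; `groupAlerts`, `inhibit`, and the sample alerts are hypothetical.

```javascript
// Minimal sketch of alert grouping and inhibition.
// Conceptual illustration only; this is not Alertmanager's implementation.
function groupAlerts(alerts, groupBy) {
  // Alerts that share the same values for the groupBy labels are
  // delivered together as one notification.
  const groups = new Map();
  for (const alert of alerts) {
    const key = groupBy.map((label) => alert.labels[label] ?? '').join('|');
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(alert);
  }
  return groups;
}

// Drop "warning" alerts when a "critical" alert has the same values
// for every label listed in `equal`.
function inhibit(alerts, equal) {
  const sameLabels = (a, b) => equal.every((l) => a.labels[l] === b.labels[l]);
  const critical = alerts.filter((a) => a.labels.severity === 'critical');
  return alerts.filter(
    (a) => a.labels.severity !== 'warning' || !critical.some((c) => sameLabels(a, c))
  );
}

const alerts = [
  { labels: { alertname: 'HighCPUUsage', pod: 'web-0', severity: 'critical' } },
  { labels: { alertname: 'HighErrorRate', pod: 'web-0', severity: 'warning' } },
  { labels: { alertname: 'HighErrorRate', pod: 'web-1', severity: 'warning' } },
];
const remaining = inhibit(alerts, ['pod']);
const groups = groupAlerts(remaining, ['alertname']);
console.log(remaining.length, [...groups.keys()]);
```

In this sample the warning on `web-0` is suppressed because a critical alert already covers that Pod, and the remaining alerts collapse into one notification per alert name.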
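Logging and monitoring are an essential part of Kubernetes cluster management, and both depend on properly configured log collection and monitoring systems.
In practice, choose a log collection and monitoring solution that fits the application's characteristics and requirements, and configure it accordingly. The design of a logging and monitoring system should account for scalability, reliability, and ease of use.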