convert to gitea

This commit is contained in:
2025-09-15 13:56:20 +09:00
commit 07eb8d9ca4
17 changed files with 5350 additions and 0 deletions

README.md Normal file

@ -0,0 +1,774 @@
# Prometheus & Grafana Monitoring System Setup Guide
**Version:** 1.0.0
**Last Modified:** 2025-08-08
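## Overview
This document details the engineering procedure for building a monitoring stack consisting of Prometheus, Grafana, and Alertmanager. The system collects and visualizes core metrics from the server infrastructure and sends automated alerts to MS Teams when defined thresholds are crossed.
Technical problems encountered during the build, and how they were resolved, are recorded in the 'Troubleshooting' section so that similar systems can be built reproducibly and reliably, with minimal technical debt.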
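## Table of Contents
1. [System Architecture](#1-system-architecture)
2. [Installation and Configuration](#2-installation-and-configuration)
    1. [Prerequisites (`ds-commandcenter` server)](#21-prerequisites-ds-commandcenter-server)
    2. [Install Node Exporter (all target servers)](#22-install-node-exporter-all-target-servers)
    3. [Install and Configure Prometheus](#23-install-and-configure-prometheus)
    4. [Install and Configure Alertmanager](#24-install-and-configure-alertmanager)
    5. [Install Prometheus-MSTeams (Docker)](#25-install-prometheus-msteams-docker)
    6. [Install and Configure Grafana](#26-install-and-configure-grafana)
    7. [Load Balancer and TLS Configuration](#27-load-balancer-and-tls-configuration)
3. [Final Verification and Testing](#3-final-verification-and-testing)
4. [Troubleshooting](#4-troubleshooting)
5. [Appendix](#5-appendix)
    1. [Component Versions](#51-component-versions)
    2. [Security Recommendations](#52-security-recommendations)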
---
## 1. System Architecture
![System architecture diagram](https://i.imgur.com/your-architecture-diagram.png) <!-- If a diagram image exists, add its link here -->
1. **Data Collection:** `Node Exporter`, installed on every target server, exposes system metrics on port `:9500`.
2. **Scraping & Evaluation:** `Prometheus` on the `ds-commandcenter` server scrapes metrics from every Node Exporter and evaluates alert conditions against the rules in `resource_alert.yml`.
3. **Alert Routing:** When an alert condition is met, `Prometheus` sends the alert to `Alertmanager`.
4. **Alert Processing & Proxying:** `Alertmanager` groups alerts, applies the `msteams.tmpl` template, and sends a webhook to the `prometheus-msteams` Docker container.
5. **Final Delivery:** The `prometheus-msteams` container converts the data received from Alertmanager into the MS Teams card format and posts the alert to the MS Teams channel.
6. **Visualization:** `Grafana` uses Prometheus as its data source and visualizes all metrics on dashboards.
7. **External Access:** Users reach the Grafana and Prometheus UIs securely over HTTPS through the `Load Balancer`.
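
Each hop listed above exposes an HTTP endpoint, so the chain can be probed from `ds-commandcenter` in one pass. The following is a minimal sketch, not part of the original setup: `pipeline_endpoints` and `check_pipeline` are hypothetical helper names, `HOST` is an assumed variable, and the ports are the ones used in this guide. The `prometheus-msteams` proxy is left out because this guide does not document a health endpoint for it.

```bash
#!/bin/bash
# Sketch: probe each HTTP hop of the monitoring pipeline.
# HOST is an assumption; ports are the ones configured in this guide.
HOST="${HOST:-localhost}"

# Print "name url" pairs, one per line.
pipeline_endpoints() {
  cat <<EOF
node_exporter http://${HOST}:9500/metrics
prometheus http://${HOST}:9091/-/healthy
alertmanager http://${HOST}:9094/-/healthy
grafana http://${HOST}:3001/api/health
EOF
}

# Probe every endpoint; report OK/FAIL and return non-zero if any hop fails.
check_pipeline() {
  local rc=0 name url
  while read -r name url; do
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "OK   $name"
    else
      echo "FAIL $name ($url)"
      rc=1
    fi
  done < <(pipeline_endpoints)
  return "$rc"
}
```

Running `check_pipeline` after each installation step should narrow a failure down to a single component.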
---
## 2. Installation and Configuration
### 2.1. Prerequisites (`ds-commandcenter` server)
* **Purpose:** Create the system accounts and directories the services need, and download the packages.
* **Run on:** `ds-commandcenter` server.
* **Commands:**
```bash
# Create system accounts
useradd --no-create-home --shell /bin/false prometheus
useradd --no-create-home --shell /bin/false alertmanager
# Create directories and set ownership
mkdir -p /etc/prometheus/rules /etc/alertmanager/templates /data/prometheus /data/alertmanager /data/promteams
chown -R prometheus:prometheus /etc/prometheus /data/prometheus
chown -R alertmanager:alertmanager /etc/alertmanager /data/alertmanager
# Download packages
cd /data
wget https://github.com/prometheus/prometheus/releases/download/v3.5.0/prometheus-3.5.0.linux-amd64.tar.gz
wget https://github.com/prometheus/alertmanager/releases/download/v0.28.1/alertmanager-0.28.1.linux-amd64.tar.gz
wget https://github.com/prometheus/node_exporter/releases/download/v1.9.1/node_exporter-1.9.1.linux-amd64.tar.gz
# Extract archives
tar -xvf prometheus-3.5.0.linux-amd64.tar.gz
tar -xvf alertmanager-0.28.1.linux-amd64.tar.gz
tar -xvf node_exporter-1.9.1.linux-amd64.tar.gz
```
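The commands above extract the downloaded archives without checking them. As an optional extra step, the sketch below compares an archive against a checksum list. It assumes a `sha256sums.txt` file in the usual `sha256sum` format has been downloaded next to the tarball; `verify_tarball` is a hypothetical helper name.

```bash
#!/bin/bash
# Sketch: verify a downloaded tarball against a sha256sum-format list.
# verify_tarball TARBALL SUMS_FILE -> exit status 0 only if the hash matches.
verify_tarball() {
  local tarball="$1" sums="$2" expected actual
  # Look up the expected hash by file name (second column of the list).
  expected="$(awk -v f="$(basename "$tarball")" '$2 == f { print $1 }' "$sums")"
  actual="$(sha256sum "$tarball" | awk '{ print $1 }')"
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}
```

For example, `verify_tarball /data/prometheus-3.5.0.linux-amd64.tar.gz /data/sha256sums.txt || echo "checksum mismatch"` would flag a corrupted download before `tar` runs.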
### 2.2. Install Node Exporter (all target servers)
* **Purpose:** Install and run the metrics collection agent.
* **Run on:** Every server to be monitored (common and game servers).
* **Script:**
```bash
#!/bin/bash
# Node Exporter install and start script
# Move the binary into place
mv /data/node_exporter-1.9.1.linux-amd64/node_exporter /usr/local/bin/
# Create the system account
useradd --no-create-home --shell /bin/false node_exporter
# Create the systemd service file
cat <<EOF > /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter --web.listen-address=":9500"
[Install]
WantedBy=multi-user.target
EOF
# Register and start the service
systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter
systemctl status node_exporter
```
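`systemctl status` only shows that the process is running. To confirm that the exporter actually serves data on `:9500`, the metrics output can be checked for a known series. This is a minimal sketch; `metric_present` is a hypothetical helper that reads Prometheus text-format output on stdin.

```bash
#!/bin/bash
# Sketch: succeed if the given metric name appears as a sample line
# (not a "# HELP"/"# TYPE" comment) in text-format metrics read from stdin.
metric_present() {
  local metric="$1"
  # A sample line starts with the metric name followed by "{" or a space.
  grep -Eq "^${metric}(\{| )"
}
```

Typical use on a target server: `curl -s http://localhost:9500/metrics | metric_present node_cpu_seconds_total && echo "exporter OK"`.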
### 2.3. Install and Configure Prometheus
* **Purpose:** Set up the metrics collection server.
* **Run on:** `ds-commandcenter` server.
* **Install commands:**
```bash
# Move binaries and configuration files into place
mv /data/prometheus-3.5.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
mv /data/prometheus-3.5.0.linux-amd64/{consoles,console_libraries} /etc/prometheus
chown -R prometheus:prometheus /etc/prometheus/*
chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
```
* **`/etc/prometheus/prometheus.yml`**
```yaml
global:
  scrape_interval: 1m
  evaluation_interval: 1m
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9094
rule_files:
  - "/etc/prometheus/rules/resource_alert.yml"
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9091"]
  - job_name: "common_servers"
    static_configs:
      - targets: ["10.0.10.21:9500"]
        labels:
          hostname: "ds-battlefield"
          ip: "10.0.10.21"
      - targets: ["10.0.10.6:9500"]
        labels:
          hostname: "ds-commandcenter"
          ip: "10.0.10.6"
      - targets: ["10.0.10.7:9500"]
        labels:
          hostname: "ds-crashreport"
          ip: "10.0.10.7"
      - targets: ["10.0.10.17:9500"]
        labels:
          hostname: "ds-maingate001"
          ip: "10.0.10.17"
      - targets: ["10.0.10.18:9500"]
        labels:
          hostname: "ds-maingate002"
          ip: "10.0.10.18"
      - targets: ["10.0.10.14:9500"]
        labels:
          hostname: "ds-mongodb001"
          ip: "10.0.10.14"
      - targets: ["10.0.10.15:9500"]
        labels:
          hostname: "ds-mongodb002"
          ip: "10.0.10.15"
      - targets: ["10.0.10.16:9500"]
        labels:
          hostname: "ds-mongodb003"
          ip: "10.0.10.16"
      - targets: ["10.0.10.8:9500"]
        labels:
          hostname: "ds-opensearch001"
          ip: "10.0.10.8"
      - targets: ["10.0.10.9:9500"]
        labels:
          hostname: "ds-opensearch002"
          ip: "10.0.10.9"
      - targets: ["10.0.10.10:9500"]
        labels:
          hostname: "ds-opensearch003"
          ip: "10.0.10.10"
      - targets: ["10.0.10.22:9500"]
        labels:
          hostname: "ds-promotor"
          ip: "10.0.10.22"
      - targets: ["10.0.10.24:9500"]
        labels:
          hostname: "ds-racetrack"
          ip: "10.0.10.24"
      - targets: ["10.0.10.11:9500"]
        labels:
          hostname: "ds-redis001"
          ip: "10.0.10.11"
      - targets: ["10.0.10.12:9500"]
        labels:
          hostname: "ds-redis002"
          ip: "10.0.10.12"
      - targets: ["10.0.10.13:9500"]
        labels:
          hostname: "ds-redis003"
          ip: "10.0.10.13"
      - targets: ["10.0.10.23:9500"]
        labels:
          hostname: "ds-social"
          ip: "10.0.10.23"
      - targets: ["10.0.10.25:9500"]
        labels:
          hostname: "ds-tavern"
          ip: "10.0.10.25"
      - targets: ["10.0.10.19:9500"]
        labels:
          hostname: "ds-warehouse001"
          ip: "10.0.10.19"
      - targets: ["10.0.10.20:9500"]
        labels:
          hostname: "ds-warehouse002"
          ip: "10.0.10.20"
  - job_name: "game_servers"
    static_configs:
      - targets: ["110.234.163.37:9500"]
        labels:
          hostname: "ds-jpn-game001"
          ip: "110.234.163.37"
      - targets: ["110.234.163.30:9500"]
        labels:
          hostname: "ds-jpn-game002"
          ip: "110.234.163.30"
      - targets: ["110.234.161.170:9500"]
        labels:
          hostname: "ds-jpn-game003"
          ip: "110.234.161.170"
      - targets: ["110.234.160.149:9500"]
        labels:
          hostname: "ds-jpn-game004"
          ip: "110.234.160.149"
      - targets: ["110.234.162.181:9500"]
        labels:
          hostname: "ds-jpn-game005"
          ip: "110.234.162.181"
      - targets: ["110.234.160.50:9500"]
        labels:
          hostname: "ds-jpn-game006"
          ip: "110.234.160.50"
      - targets: ["110.234.165.61:9500"]
        labels:
          hostname: "ds-jpn-game007"
          ip: "110.234.165.61"
      - targets: ["110.234.163.151:9500"]
        labels:
          hostname: "ds-jpn-game008"
          ip: "110.234.163.151"
      - targets: ["110.234.195.8:9500"]
        labels:
          hostname: "ds-sgn-game001"
          ip: "110.234.195.8"
      - targets: ["110.234.193.164:9500"]
        labels:
          hostname: "ds-sgn-game002"
          ip: "110.234.193.164"
      - targets: ["110.234.193.189:9500"]
        labels:
          hostname: "ds-sgn-game003"
          ip: "110.234.193.189"
      - targets: ["110.234.192.213:9500"]
        labels:
          hostname: "ds-sgn-game004"
          ip: "110.234.192.213"
      - targets: ["110.234.194.108:9500"]
        labels:
          hostname: "ds-sgn-game005"
          ip: "110.234.194.108"
      - targets: ["110.234.194.199:9500"]
        labels:
          hostname: "ds-sgn-game006"
          ip: "110.234.194.199"
      - targets: ["110.234.194.179:9500"]
        labels:
          hostname: "ds-sgn-game007"
          ip: "110.234.194.179"
      - targets: ["110.234.193.159:9500"]
        labels:
          hostname: "ds-sgn-game008"
          ip: "110.234.193.159"
      - targets: ["44.198.4.245:9500"]
        labels:
          hostname: "ds-us-game001"
          ip: "44.198.4.245"
      - targets: ["52.5.176.32:9500"]
        labels:
          hostname: "ds-us-game002"
          ip: "52.5.176.32"
      - targets: ["98.86.208.130:9500"]
        labels:
          hostname: "ds-us-game003"
          ip: "98.86.208.130"
      - targets: ["98.87.57.10:9500"]
        labels:
          hostname: "ds-us-game004"
          ip: "98.87.57.10"
      - targets: ["18.153.131.248:9500"]
        labels:
          hostname: "ds-de-game001"
          ip: "18.153.131.248"
      - targets: ["18.185.201.217:9500"]
        labels:
          hostname: "ds-de-game002"
          ip: "18.185.201.217"
      - targets: ["3.124.28.212:9500"]
        labels:
          hostname: "ds-de-game003"
          ip: "3.124.28.212"
      - targets: ["3.69.139.75:9500"]
        labels:
          hostname: "ds-de-game004"
          ip: "3.69.139.75"
  - job_name: "game_info"
    static_configs:
      - targets: ["10.0.10.22:9200"]
        labels:
          hostname: "ds-promotor"
          ip: "10.0.10.22"
      - targets: ["110.234.163.37:9200"]
        labels:
          hostname: "ds-jpn-game001"
          ip: "110.234.163.37"
      - targets: ["110.234.163.30:9200"]
        labels:
          hostname: "ds-jpn-game002"
          ip: "110.234.163.30"
      - targets: ["110.234.161.170:9200"]
        labels:
          hostname: "ds-jpn-game003"
          ip: "110.234.161.170"
      - targets: ["110.234.160.149:9200"]
        labels:
          hostname: "ds-jpn-game004"
          ip: "110.234.160.149"
      - targets: ["110.234.162.181:9200"]
        labels:
          hostname: "ds-jpn-game005"
          ip: "110.234.162.181"
      - targets: ["110.234.160.50:9200"]
        labels:
          hostname: "ds-jpn-game006"
          ip: "110.234.160.50"
      - targets: ["110.234.165.61:9200"]
        labels:
          hostname: "ds-jpn-game007"
          ip: "110.234.165.61"
      - targets: ["110.234.163.151:9200"]
        labels:
          hostname: "ds-jpn-game008"
          ip: "110.234.163.151"
      - targets: ["110.234.195.8:9200"]
        labels:
          hostname: "ds-sgn-game001"
          ip: "110.234.195.8"
      - targets: ["110.234.193.164:9200"]
        labels:
          hostname: "ds-sgn-game002"
          ip: "110.234.193.164"
      - targets: ["110.234.193.189:9200"]
        labels:
          hostname: "ds-sgn-game003"
          ip: "110.234.193.189"
      - targets: ["110.234.192.213:9200"]
        labels:
          hostname: "ds-sgn-game004"
          ip: "110.234.192.213"
      - targets: ["110.234.194.108:9200"]
        labels:
          hostname: "ds-sgn-game005"
          ip: "110.234.194.108"
      - targets: ["110.234.194.199:9200"]
        labels:
          hostname: "ds-sgn-game006"
          ip: "110.234.194.199"
      - targets: ["110.234.194.179:9200"]
        labels:
          hostname: "ds-sgn-game007"
          ip: "110.234.194.179"
      - targets: ["110.234.193.159:9200"]
        labels:
          hostname: "ds-sgn-game008"
          ip: "110.234.193.159"
      - targets: ["44.198.4.245:9200"]
        labels:
          hostname: "ds-us-game001"
          ip: "44.198.4.245"
      - targets: ["52.5.176.32:9200"]
        labels:
          hostname: "ds-us-game002"
          ip: "52.5.176.32"
      - targets: ["98.86.208.130:9200"]
        labels:
          hostname: "ds-us-game003"
          ip: "98.86.208.130"
      - targets: ["98.87.57.10:9200"]
        labels:
          hostname: "ds-us-game004"
          ip: "98.87.57.10"
      - targets: ["18.153.131.248:9200"]
        labels:
          hostname: "ds-de-game001"
          ip: "18.153.131.248"
      - targets: ["18.185.201.217:9200"]
        labels:
          hostname: "ds-de-game002"
          ip: "18.185.201.217"
      - targets: ["3.124.28.212:9200"]
        labels:
          hostname: "ds-de-game003"
          ip: "3.124.28.212"
      - targets: ["3.69.139.75:9200"]
        labels:
          hostname: "ds-de-game004"
          ip: "3.69.139.75"
```
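Every `static_configs` entry in the file above follows the same four-line pattern, so the entries could be generated from a host list rather than typed by hand. The following is a minimal sketch under that assumption. `emit_target`, `emit_targets`, and the `hostname ip` input format are hypothetical helpers and are not part of Prometheus.

```bash
#!/bin/bash
# Hypothetical helpers (not part of Prometheus): render static_configs
# entries in the same shape as the prometheus.yml shown above.

# emit_target HOSTNAME IP PORT -> one YAML target entry on stdout.
emit_target() {
  local hostname="$1" ip="$2" port="$3"
  printf '      - targets: ["%s:%s"]\n' "$ip" "$port"
  printf '        labels:\n'
  printf '          hostname: "%s"\n' "$hostname"
  printf '          ip: "%s"\n' "$ip"
}

# emit_targets PORT: read "hostname ip" pairs from stdin, skipping blank
# lines and lines that start with "#".
emit_targets() {
  local port="$1" hostname ip
  while read -r hostname ip; do
    case "$hostname" in ''|\#*) continue ;; esac
    emit_target "$hostname" "$ip" "$port"
  done
}
```

For example, `emit_targets 9500 < hosts.txt` would print the body of a job's `static_configs` list. Whatever produces the file, `promtool check config /etc/prometheus/prometheus.yml` can be run before starting the service to catch indentation mistakes.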
* **`/etc/prometheus/rules/resource_alert.yml`**
```yaml
groups:
  - name: 리소스 사용량 경고  # "Resource usage alerts"
    rules:
      - alert: CPU사용량경고  # "CPU usage warning"
        expr: 100 - (avg by(instance, hostname, ip) (rate(node_cpu_seconds_total{mode="idle"}[10m])) * 100) > 70
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage on {{ $labels.hostname }} has exceeded 70% for the past 10 minutes."
          value: "{{ $value | printf \"%.2f\" }}"
          runbook_url: "https://grafana.dungeonstalkers.com:8443"
      - alert: CPU사용량심각  # "CPU usage critical"
        expr: 100 - (avg by(instance, hostname, ip) (rate(node_cpu_seconds_total{mode="idle"}[10m])) * 100) > 80
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU usage detected"
          description: "CPU usage on {{ $labels.hostname }} has exceeded 80% for the past 10 minutes."
          value: "{{ $value | printf \"%.2f\" }}"
          runbook_url: "https://grafana.dungeonstalkers.com:8443"
      - alert: 메모리사용량경고  # "Memory usage warning"
        expr: (1 - (node_memory_MemAvailable_bytes{hostname!=""} / node_memory_MemTotal_bytes{hostname!=""})) * 100 > 70
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage on {{ $labels.hostname }} has exceeded 70% for the past 10 minutes."
          value: "{{ $value | printf \"%.2f\" }}"
          runbook_url: "https://grafana.dungeonstalkers.com:8443"
      - alert: 메모리사용량심각  # "Memory usage critical"
        expr: (1 - (node_memory_MemAvailable_bytes{hostname!=""} / node_memory_MemTotal_bytes{hostname!=""})) * 100 > 80
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Critical memory usage detected"
          description: "Memory usage on {{ $labels.hostname }} has exceeded 80% for the past 10 minutes."
          value: "{{ $value | printf \"%.2f\" }}"
          runbook_url: "https://grafana.dungeonstalkers.com:8443"
```
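The CPU expressions above turn the average idle rate into a usage percentage and compare it with the 70% and 80% thresholds. The arithmetic can be reproduced offline as a sanity check. This is an illustrative sketch only: `cpu_used_pct` and `severity_for` are hypothetical helpers, and Prometheus itself performs the real evaluation.

```bash
#!/bin/bash
# Sketch of the rule arithmetic: usage% = 100 - idle_rate * 100.
# cpu_used_pct IDLE_RATE, where IDLE_RATE is the average idle fraction (0..1).
cpu_used_pct() {
  awk -v idle="$1" 'BEGIN { printf "%.2f", 100 - idle * 100 }'
}

# severity_for USAGE_PCT -> highest severity whose threshold is exceeded.
# Thresholds mirror the rules above (> 70 warning, > 80 critical).
severity_for() {
  awk -v v="$1" 'BEGIN {
    if (v > 80) print "critical"
    else if (v > 70) print "warning"
    else print "ok"
  }'
}
```

An idle rate of `0.25` gives `75.00`, which maps to `warning`. In the real rule set a value above 80 satisfies both expressions, so the warning and critical alerts fire together once the `for: 10m` window has elapsed.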
* **`/etc/systemd/system/prometheus.service`**
```ini
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /data/prometheus/ \
--web.listen-address=":9091" \
--web.enable-lifecycle \
--web.external-url=https://prometheus.dungeonstalkers.com:8444/
[Install]
WantedBy=multi-user.target
```
* **Start the service:**
```bash
systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus
```
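Because the unit file passes `--web.enable-lifecycle`, later configuration changes can be applied without restarting the service by POSTing to the `/-/reload` endpoint. The sketch below validates the file first and then triggers the reload. `run`, `reload_prometheus`, and the `DRY_RUN`, `PROM_URL`, and `PROM_CONFIG` variables are hypothetical names chosen for this example.

```bash
#!/bin/bash
# Sketch: validate prometheus.yml, then ask the running server to reload it.
PROM_URL="${PROM_URL:-http://localhost:9091}"
PROM_CONFIG="${PROM_CONFIG:-/etc/prometheus/prometheus.yml}"

# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Reload only if promtool accepts the configuration.
reload_prometheus() {
  run promtool check config "$PROM_CONFIG" || return 1
  run curl -fsS -X POST "${PROM_URL}/-/reload"
}
```

`DRY_RUN=1 reload_prometheus` prints the two commands without executing them, which is a convenient way to review the step before running it on the live server.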
### 2.4. Install and Configure Alertmanager
* **Purpose:** Set up the alert processing and routing server.
* **Run on:** `ds-commandcenter` server.
* **Install commands:**
```bash
mv /data/alertmanager-0.28.1.linux-amd64/{alertmanager,amtool} /usr/local/bin/
chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool
```
* **`/etc/alertmanager/alertmanager.yml`**
```yaml
route:
  group_by: ['alertname', 'hostname']
  group_wait: 15s
  group_interval: 1m
  repeat_interval: 10m
  receiver: "resource_alert"
receivers:
  - name: "resource_alert"
    webhook_configs:
      - url: "http://127.0.0.1:2000/resource_alert"
templates:
  - '/etc/alertmanager/templates/msteams.tmpl'
```
* **`/etc/alertmanager/templates/msteams.tmpl`**
```go-template
{{ define "teams.card" }}
{
"@type": "MessageCard",
"@context": "http://schema.org/extensions",
"summary": "{{ .CommonAnnotations.summary }}",
"themeColor": "0078D7",
"title": "🚨 {{ .CommonAnnotations.summary }}",
"sections": [
{{ range $index, $alert := .Alerts }}
{{ if $index }},{{ end }}
{
"activityTitle": "{{ $alert.Annotations.description }}",
"facts": [
{
"name": "Status",
"value": "**{{ printf "%.0f" $alert.Annotations.value }}%**"
},
{
"name": "Severity",
"value": "{{ $alert.Labels.severity }}"
},
{
"name": "Hostname",
"value": "{{ $alert.Labels.hostname }}"
},
{
"name": "IP address",
"value": "{{ $alert.Labels.ip }}"
},
{
"name": "Started at",
"value": "{{ $alert.StartsAt }}"
}
],
"markdown": true
}
{{ end }}
],
"potentialAction": [
{
"@type": "OpenUri",
"name": "View in Grafana",
"targets": [
{
"os": "default",
"uri": "{{ .CommonAnnotations.runbook_url }}"
}
]
}
]
}
{{ end }}
```
* **`/etc/systemd/system/alertmanager.service`**
```ini
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/data/alertmanager/ \
--web.listen-address=":9094" \
--cluster.listen-address=":9095"
[Install]
WantedBy=multi-user.target
```
* **Start the service:**
```bash
systemctl daemon-reload
systemctl enable alertmanager
systemctl start alertmanager
```
### 2.5. Install Prometheus-MSTeams (Docker)
* **Purpose:** Install the proxy that connects Alertmanager to MS Teams.
* **Run on:** `ds-commandcenter` server.
* **`/data/promteams/config.env`**
```env
# MS Teams Webhook URL
WEBHOOK_URL="https://oneunivrs.webhook.office.com/webhookb2/7248d32a-3473-43bd-961b-c2a2516f28f5@1e8605cc-8007-46b0-993f-b388917f9499/IncomingWebhook/ac17804386cc4efdad5c78b3a8c182f7/f5368752-03f7-4e64-93e6-b40991c04c0c/V2jpWgnliaoihAzy3iMA2p_2KWou2hMIj4T32F8MCMVH01"
# Request URI; must match the path in Alertmanager's webhook_configs.url
REQUEST_URI="resource_alert"
# Host path of the template file to use (the source file to mount)
TEMPLATE_HOST_PATH="/etc/alertmanager/templates/msteams.tmpl"
# Path where the template file is placed inside the container
TEMPLATE_CONTAINER_PATH="/app/default-message-card.tmpl"
```
* **`/data/promteams/start_promteams.sh`**
```bash
#!/bin/bash
# Load the configuration file
source /data/promteams/config.env
# Check required variables
if [ -z "$WEBHOOK_URL" ] || [ -z "$REQUEST_URI" ]; then
  echo "Required settings are missing. Check the config.env file."
  exit 1
fi
echo "Stopping and removing the existing promteams container."
docker stop promteams >/dev/null 2>&1
docker rm promteams >/dev/null 2>&1
echo "Starting the Prometheus-MSTeams container with the older image (v1.5.2) that supports environment-variable configuration."
docker run -d -p 2000:2000 \
--name="promteams" \
--restart=always \
-e TEAMS_INCOMING_WEBHOOK_URL="$WEBHOOK_URL" \
-e TEAMS_REQUEST_URI="$REQUEST_URI" \
-v "$TEMPLATE_HOST_PATH:$TEMPLATE_CONTAINER_PATH" \
quay.io/prometheusmsteams/prometheus-msteams:v1.5.2
echo "Container started. Check its status with the command below:"
echo "docker ps | grep promteams"
```
* **`/data/promteams/stop_promteams.sh`**
```bash
#!/bin/bash
CONTAINER_NAME="promteams"
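# Quote the command substitution so the test stays valid when no ID (or more than one) is returned
if [ -n "$(docker ps -q -f name="$CONTAINER_NAME")" ]; then
  echo "Stopping and removing the Prometheus-MSTeams container ($CONTAINER_NAME)."
  docker stop "$CONTAINER_NAME"
  docker rm "$CONTAINER_NAME"
  echo "Done."
else
  echo "No running Prometheus-MSTeams container found."
fi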
```
* **Run the container:**
```bash
chmod +x /data/promteams/*.sh
/data/promteams/start_promteams.sh
```
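Once the container is up, the proxy can be exercised on its own, without Alertmanager, by posting a hand-built webhook payload to the request URI. The sketch below follows the Alertmanager webhook JSON layout (version 4). All label and annotation values are illustrative test data, and `test_payload` and `send_test_payload` are hypothetical helper names.

```bash
#!/bin/bash
# Sketch: hand-built Alertmanager-style webhook payload for testing the proxy.
# Every value below is illustrative test data, not a real alert.
test_payload() {
  cat <<'EOF'
{
  "version": "4",
  "groupKey": "{}:{alertname=\"ProxyTest\"}",
  "status": "firing",
  "receiver": "resource_alert",
  "groupLabels": {"alertname": "ProxyTest"},
  "commonLabels": {"alertname": "ProxyTest", "severity": "warning", "hostname": "ds-commandcenter", "ip": "10.0.10.6"},
  "commonAnnotations": {"summary": "Proxy test", "description": "Test alert sent directly to prometheus-msteams", "value": "42", "runbook_url": "https://grafana.dungeonstalkers.com:8443"},
  "externalURL": "http://localhost:9094",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "ProxyTest", "severity": "warning", "hostname": "ds-commandcenter", "ip": "10.0.10.6"},
      "annotations": {"summary": "Proxy test", "description": "Test alert sent directly to prometheus-msteams", "value": "42", "runbook_url": "https://grafana.dungeonstalkers.com:8443"},
      "startsAt": "2025-08-08T00:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": ""
    }
  ]
}
EOF
}

# POST the payload to the proxy; the default URL matches alertmanager.yml.
send_test_payload() {
  local url="${1:-http://127.0.0.1:2000/resource_alert}"
  test_payload | curl -fsS -X POST -H 'Content-Type: application/json' --data-binary @- "$url"
}
```

Note that `send_test_payload` posts a real card to the configured MS Teams channel, so it is best run only when a test message there is acceptable.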
### 2.6. Install and Configure Grafana
* **Purpose:** Install and configure the visualization dashboards.
* **Run on:** `ds-commandcenter` server.
* **Install commands:**
```bash
apt-get update && apt-get install -y adduser libfontconfig1 musl
wget https://dl.grafana.com/enterprise/release/grafana-enterprise_12.1.0_amd64.deb
dpkg -i grafana-enterprise_12.1.0_amd64.deb
```
* **Edit `/etc/grafana/grafana.ini`:** The `sed` commands below change the main settings.
```bash
# Set the external address and port
sed -i 's/;http_port = 3000/http_port = 3001/' /etc/grafana/grafana.ini
sed -i 's/;domain = localhost/domain = grafana.dungeonstalkers.com/' /etc/grafana/grafana.ini
sed -i "s|;root_url = .*|root_url = https://grafana.dungeonstalkers.com:8443/|" /etc/grafana/grafana.ini
# Add embedding and anonymous-access settings
cat <<'EOF' | tee -a /etc/grafana/grafana.ini
[security]
allow_embedding = true
[auth.anonymous]
enabled = true
org_name = Main Org.
org_role = Viewer
EOF
```
* **Start the service:**
```bash
systemctl enable grafana-server
systemctl start grafana-server
```
### 2.7. Load Balancer and TLS Configuration
* **Purpose:** Configure HTTPS and port forwarding for external access.
* **Configured in:** The cloud console (AWS, GCP, etc.) or an L4 appliance.
* **Configuration summary:**
    * `https://grafana.dungeonstalkers.com:8443` -> `http://10.0.10.6:3001`
    * `https://prometheus.dungeonstalkers.com:8444` -> `http://10.0.10.6:9091`
    * A valid TLS certificate for `dungeonstalkers.com` is required.
---
## 3. Final Verification and Testing
1. **Web UI access:**
    * `https://grafana.dungeonstalkers.com:8443`
    * `https://prometheus.dungeonstalkers.com:8444`
2. **Check Prometheus targets:** On the 'Status' -> 'Targets' page of the Prometheus UI, confirm that every target is `UP`.
3. **Test the full alert pipeline:**
```bash
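# Positional key=value pairs become labels; annotations must be passed with --annotation.
amtool alert add \
  --alertmanager.url="http://localhost:9094" \
  --annotation=summary="Full system final test" \
  --annotation=description="If this alert arrives and every link works, the test has passed." \
  --annotation=value="99" \
  --annotation=runbook_url="https://grafana.dungeonstalkers.com:8443" \
  alertname="Final-System-Test" \
  severity="critical" \
  hostname="ds-commandcenter" \
  ip="10.0.10.6"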
```
Confirm that the alert card arrives in the MS Teams channel and that the link behind the 'View in Grafana' (`Grafana에서 보기`) button works.
---
## 4. Troubleshooting
This section describes the main problems encountered during the build and how each was resolved.
| Symptom | Cause | Resolution |
| --- | --- | --- |
| `alertmanager` service fails to start (`address already in use`) | The web port (`:9094`) collided with the cluster port (default `:9094`). | Added `--cluster.listen-address=":9095"` to `alertmanager.service` to move the cluster port explicitly. |
| `alertmanager` service fails to start (`function "add"/"floor" not defined`) | The MS Teams template (`msteams.tmpl`) used functions that Alertmanager does not support. | Replaced `floor` with `printf "%.0f"` and rewrote the logic that used `add` in a simpler, more compatible way. |
| `prometheus` service fails to start (`yaml: did not find expected key`) | YAML syntax error in `prometheus.yml` (mostly an indentation problem). | Removed the `external_url` setting from `prometheus.yml` and passed `--web.external-url` in the `prometheus.service` start options instead, keeping the YAML file valid. |
| `prometheus-msteams` container crash loop (restarts repeatedly) | Incompatibility between the latest Docker image and the older configuration style (environment variables). | Confirmed that the previously working setup used environment variables, then pinned the older image tag (`v1.5.2`) that supports them. |
| MS Teams notifications fail with a template-related error. | The test alert sent with `amtool` lacked fields the template requires (`ip`, `runbook_url`). | Added an `ip` label to every target in the Prometheus configuration and a `runbook_url` annotation to the alert rules so that real alerts carry these fields. `amtool` tests must also include them explicitly. |
| `node_exporter` port conflict (`:9300`) | Other services such as OpenSearch were already using ports in the `:9300` range. | Changed the `node_exporter` port to `:9500` on every server and updated every scrape target in `prometheus.yml` to `:9500`. |
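
Two of the rows above are port collisions, which can be detected before a service is started. The sketch below checks `ss -ltn` output for a listener on a given port. `port_in_use` is a hypothetical helper, and it assumes the usual `ss` column layout in which the fourth column is the local address and port.

```bash
#!/bin/bash
# Sketch: read `ss -ltn` output on stdin and succeed if PORT is already
# bound. Assumes column 4 holds "address:port"; the header row is skipped.
port_in_use() {
  local port="$1"
  awk -v p=":${port}" 'NR > 1 && $4 ~ (p "$") { found = 1 } END { exit !found }'
}
```

For example, `ss -ltn | port_in_use 9094 && echo "9094 is taken"` would have exposed the Alertmanager web/cluster conflict, and the same check for `9300` would have exposed the OpenSearch conflict.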
---
## 5. Appendix
### 5.1. Component Versions
* **Prometheus:** `3.5.0`
* **Alertmanager:** `0.28.1`
* **Node Exporter:** `1.9.1`
* **Grafana:** `12.1.0`
* **Prometheus-MSTeams (Docker):** `quay.io/prometheusmsteams/prometheus-msteams:v1.5.2`
### 5.2. Security Recommendations
* **Change the Grafana admin password:** Change the password of Grafana's `admin` account immediately after installation.
```bash
grafana-cli admin reset-admin-password <new-secure-password>
```
* **Network firewall:** Apart from the public ports on the LB, it is recommended to block direct external access to each service's internal ports (`9091`, `9094`, `3001`, etc.) with a firewall.
* **Webhook URL security:** The MS Teams webhook URL is sensitive, so restrict the permissions of `config.env` (`chmod 600`) and make sure the file is not committed to Git or any other version control system.
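
The webhook recommendation above can be applied with a few commands. The sketch below restricts the file mode and adds the file name to `.gitignore` exactly once. `secure_env_file` is a hypothetical helper, and the repository root passed to it is an assumption about where this configuration is tracked.

```bash
#!/bin/bash
# Sketch: lock down an env file and keep it out of version control.
# secure_env_file FILE REPO_ROOT
secure_env_file() {
  local file="$1" repo_root="$2" name
  # Owner read/write only.
  chmod 600 "$file" || return 1
  name="$(basename "$file")"
  # Append to .gitignore only if the exact entry is not already present.
  grep -qxF "$name" "$repo_root/.gitignore" 2>/dev/null \
    || echo "$name" >> "$repo_root/.gitignore"
}
```

For example, `secure_env_file /data/promteams/config.env /data/promteams` would apply both measures if `/data/promteams` were the repository root. A URL that has already been committed stays in the Git history, so in that case the webhook should also be regenerated in MS Teams.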


@ -0,0 +1,16 @@
route:
  group_by: ['...']
  group_wait: 15s
  group_interval: 1m
  repeat_interval: 10m
  receiver: "resource_alert"
receivers:
  - name: "resource_alert"
    webhook_configs:
      - url: "http://127.0.0.1:2000/resource_alert"
        send_resolved: false
templates:
  - '/etc/alertmanager/templates/msteams.tmpl'


@ -0,0 +1,53 @@
{{ define "teams.card" }}
{
"@type": "MessageCard",
"@context": "http://schema.org/extensions",
"summary": "{{ .CommonAnnotations.summary }}",
"themeColor": "0078D7",
"title": "🚨 {{ .CommonAnnotations.summary }}",
"sections": [
{{- /* Emit a comma before every section except the first, so no "add" function is needed. */ -}}
{{- range $index, $alert := .Alerts }}
{{- if $index }},{{ end }}
{
"activityTitle": "{{ .Annotations.description }}",
"facts": [
{
"name": "Status",
"value": "**{{ printf "%.0f" .Annotations.value }}%**"
},
{
"name": "Severity",
"value": "{{ .Labels.severity }}"
},
{
"name": "Hostname",
"value": "{{ .Labels.hostname }}"
},
{
"name": "IP address",
"value": "{{ .Labels.ip }}"
},
{
"name": "Started at",
"value": "{{ .StartsAt }}"
}
],
"markdown": true
}
{{- end }}
],
"potentialAction": [
{
"@type": "OpenUri",
"name": "View in Grafana",
"targets": [
{
"os": "default",
"uri": "{{ .CommonAnnotations.runbook_url }}"
}
]
}
]
}
{{ end }}

dashboard001.json Normal file

@ -0,0 +1,957 @@
{
"__inputs": [
{
"name": "DS_PROMETHEUS",
"label": "prometheus",
"description": "",
"type": "datasource",
"pluginId": "prometheus",
"pluginName": "Prometheus"
}
],
"__elements": {},
"__requires": [
{
"type": "grafana",
"id": "grafana",
"name": "Grafana",
"version": "11.6.0"
},
{
"type": "datasource",
"id": "prometheus",
"name": "Prometheus",
"version": "1.0.0"
},
{
"type": "panel",
"id": "table",
"name": "Table",
"version": ""
}
],
"annotations": {
"list": [
{
"$$hashKey": "object:2875",
"builtIn": 1,
"datasource": {
"type": "datasource",
"uid": "grafana"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"target": {
"limit": 100,
"matchAny": false,
"tags": [],
"type": "dashboard"
},
"type": "dashboard"
}
]
},
"description": "Command Center Frontend Dashboard",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": null,
"links": [
{
"$$hashKey": "object:2302",
"asDropdown": true,
"icon": "external link",
"tags": [],
"targetBlank": true,
"title": "",
"type": "dashboards"
}
],
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"description": "",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"custom": {
"align": "center",
"cellOptions": {
"type": "auto"
},
"filterable": false,
"inspect": false
},
"decimals": 1,
"mappings": [],
"max": 100,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green"
}
]
},
"unit": "none"
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "Memory"
},
"properties": [
{
"id": "unit",
"value": "bytes"
},
{
"id": "decimals"
},
{
"id": "custom.width",
"value": 89
},
{
"id": "decimals",
"value": 0
}
]
},
{
"matcher": {
"id": "byName",
"options": "Uptime"
},
"properties": [
{
"id": "unit",
"value": "none"
},
{
"id": "custom.width",
"value": 90
},
{
"id": "decimals"
}
]
},
{
"matcher": {
"id": "byName",
"options": "Disk read"
},
"properties": [
{
"id": "unit",
"value": "binBps"
},
{
"id": "custom.cellOptions",
"value": {
"mode": "gradient",
"type": "color-background"
}
},
{
"id": "thresholds",
"value": {
"mode": "absolute",
"steps": [
{
"color": "rgba(50, 172, 45, 0.97)"
},
{
"color": "rgba(237, 129, 40, 0.89)",
"value": 10485760
},
{
"color": "rgba(245, 54, 54, 0.9)",
"value": 20485760
}
]
}
},
{
"id": "custom.width",
"value": 108
}
]
},
{
"matcher": {
"id": "byName",
"options": "Disk write"
},
"properties": [
{
"id": "unit",
"value": "binBps"
},
{
"id": "custom.cellOptions",
"value": {
"mode": "gradient",
"type": "color-background"
}
},
{
"id": "thresholds",
"value": {
"mode": "absolute",
"steps": [
{
"color": "rgba(50, 172, 45, 0.97)"
},
{
"color": "rgba(237, 129, 40, 0.89)",
"value": 10485760
},
{
"color": "rgba(245, 54, 54, 0.9)",
"value": 20485760
}
]
}
},
{
"id": "custom.width",
"value": 107
}
]
},
{
"matcher": {
"id": "byName",
"options": "Download"
},
"properties": [
{
"id": "unit",
"value": "binbps"
},
{
"id": "custom.cellOptions",
"value": {
"mode": "gradient",
"type": "color-background"
}
},
{
"id": "thresholds",
"value": {
"mode": "absolute",
"steps": [
{
"color": "rgba(50, 172, 45, 0.97)"
},
{
"color": "rgba(237, 129, 40, 0.89)",
"value": 30485760
},
{
"color": "rgba(245, 54, 54, 0.9)",
"value": 104857600
}
]
}
},
{
"id": "custom.width",
"value": 109
},
{
"id": "decimals"
}
]
},
{
"matcher": {
"id": "byName",
"options": "Upload"
},
"properties": [
{
"id": "unit",
"value": "binbps"
},
{
"id": "custom.cellOptions",
"value": {
"mode": "gradient",
"type": "color-background"
}
},
{
"id": "thresholds",
"value": {
"mode": "absolute",
"steps": [
{
"color": "rgba(50, 172, 45, 0.97)"
},
{
"color": "rgba(237, 129, 40, 0.89)",
"value": 30485760
},
{
"color": "rgba(245, 54, 54, 0.9)",
"value": 104857600
}
]
}
},
{
"id": "custom.width",
"value": 95
},
{
"id": "decimals"
}
]
},
{
"matcher": {
"id": "byName",
"options": "TCP conn"
},
"properties": [
{
"id": "custom.cellOptions",
"value": {
"mode": "gradient",
"type": "color-background"
}
},
{
"id": "thresholds",
"value": {
"mode": "absolute",
"steps": [
{
"color": "rgba(50, 172, 45, 0.97)"
},
{
"color": "rgba(237, 129, 40, 0.89)",
"value": 1000
},
{
"color": "rgba(245, 54, 54, 0.9)",
"value": 1500
}
]
}
},
{
"id": "custom.width",
"value": 106
},
{
"id": "decimals"
}
]
},
{
"matcher": {
"id": "byName",
"options": "CPU"
},
"properties": [
{
"id": "custom.width",
"value": 75
},
{
"id": "decimals",
"value": 0
}
]
},
{
"matcher": {
"id": "byRegexp",
"options": "/.*used.*/"
},
"properties": [
{
"id": "unit",
"value": "percent"
},
{
"id": "custom.cellOptions",
"value": {
"mode": "gradient",
"type": "gauge"
}
},
{
"id": "color",
"value": {
"mode": "continuous-GrYlRd"
}
},
{
"id": "custom.width",
"value": 110
}
]
},
{
"matcher": {
"id": "byName",
"options": "Memory used%"
},
"properties": [
{
"id": "custom.width",
"value": 144
}
]
},
{
"matcher": {
"id": "byName",
"options": "CPU used%"
},
"properties": [
{
"id": "custom.width",
"value": 132
}
]
},
{
"matcher": {
"id": "byName",
"options": "IO used"
},
"properties": [
{
"id": "custom.width",
"value": 116
}
]
},
{
"matcher": {
"id": "byName",
"options": "Partition used"
},
"properties": [
{
"id": "custom.width",
"value": 122
}
]
},
{
"matcher": {
"id": "byName",
"options": "IP"
},
"properties": [
{
"id": "links",
"value": [
{
"title": "Show details",
"url": "d/rYdddlPWk/node-exporter-full?orgId=1&var-job=${job}&var-node=${__value.raw}"
}
]
},
{
"id": "custom.align",
"value": "left"
}
]
},
{
"matcher": {
"id": "byName",
"options": "Hostname"
},
"properties": [
{
"id": "links",
"value": [
{
"title": "Show details",
"url": "/d/feh6u5st2pou8b/node-exporter-full-with-node-name?orgId=1&var-name=${__value.raw}"
}
]
},
{
"id": "custom.align",
"value": "left"
}
]
}
]
},
"gridPos": {
"h": 15,
"w": 24,
"x": 0,
"y": 0
},
"id": 198,
"options": {
"cellHeight": "sm",
"footer": {
"countRows": false,
"enablePagination": false,
"fields": [
"Value #B",
"Value #C",
"Value #L",
"Value #H",
"Value #I",
"Value #M",
"Value #N",
"Value #J",
"Value #K"
],
"reducer": [
"sum"
],
"show": false
},
"showHeader": true,
"sortBy": [
{
"desc": false,
"displayName": "Hostname"
}
]
},
"pluginVersion": "11.6.0",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "node_uname_info{job=~\"$job\"} - 0",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "Hostname",
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "node_memory_MemTotal_bytes{job=~\"$job\"} - 0",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "Memory",
"refId": "B"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "count(node_cpu_seconds_total{job=~\"$job\",mode='system'}) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "CPU cores",
"refId": "C"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "sum(time() - node_boot_time_seconds{job=~\"$job\"})by(instance)/86400",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "Uptime",
"refId": "D"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "node_load5{job=~\"$job\"}",
"format": "table",
"instant": true,
"interval": "",
"legendFormat": "5m load",
"refId": "L"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "(1 - avg(irate(node_cpu_seconds_total{job=~\"$job\",mode=\"idle\"}[$interval])) by (instance)) * 100",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "CPU used%",
"refId": "F"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "(1 - (node_memory_MemAvailable_bytes{job=~\"$job\"} / (node_memory_MemTotal_bytes{job=~\"$job\"})))* 100",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "Memory used%",
"refId": "G"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "max((node_filesystem_size_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}-node_filesystem_free_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}) *100/(node_filesystem_avail_bytes {job=~\"$job\",fstype=~\"ext.?|xfs\"}+(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}-node_filesystem_free_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"})))by(instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "Partition used",
"refId": "E"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "max(irate(node_disk_read_bytes_total{job=~\"$job\"}[$interval])) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "Disk read",
"refId": "H"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "max(irate(node_disk_written_bytes_total{job=~\"$job\"}[$interval])) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "Disk write",
"refId": "I"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "node_netstat_Tcp_CurrEstab{job=~\"$job\"} - 0",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "TCP connections",
"refId": "M"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "node_sockstat_TCP_tw{job=~\"$job\"} - 0",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "TCP sockets",
"refId": "N"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "max(irate(node_network_receive_bytes_total{job=~\"$job\"}[$interval])*8) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "Download",
"refId": "J"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "max(irate(node_network_transmit_bytes_total{job=~\"$job\"}[$interval])*8) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "Upload",
"refId": "K"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "((1-(1 - avg(irate(node_cpu_seconds_total{job=~\"$job\",mode=\"idle\"}[$interval])) by (instance))^1.3)^(1/3)*0.5 + \r\n(1-(1 - avg(node_memory_MemAvailable_bytes{job=~\"$job\"} / node_memory_MemTotal_bytes{job=~\"$job\"})by (instance))^6)^(1/3)*0.3 + \r\n(1 - max(irate(node_disk_io_time_seconds_total{job=~\"$job\"}[$interval]))by (instance)^1.1)^(1/2)*0.2)*100",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "__auto",
"refId": "O"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "max(irate(node_disk_io_time_seconds_total{job=~\"$job\"}[$interval])) by (instance) *100",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "IO used",
"refId": "P"
}
],
"transformations": [
{
"id": "merge",
"options": {
"reducers": []
}
},
{
"id": "organize",
"options": {
"excludeByName": {
"Value #C": false,
"Value #L": true,
"Value #N": true,
"Value #O": true,
"exp": false,
"iid": false
},
"includeByName": {},
"indexByName": {
"Time": 20,
"Value #A": 36,
"Value #B": 7,
"Value #C": 8,
"Value #D": 4,
"Value #E": 13,
"Value #F": 10,
"Value #G": 11,
"Value #H": 14,
"Value #I": 15,
"Value #J": 18,
"Value #K": 19,
"Value #L": 9,
"Value #M": 16,
"Value #N": 17,
"Value #O": 6,
"Value #P": 12,
"__name__": 37,
"account": 21,
"cservice": 22,
"domainname": 23,
"exp": 5,
"group": 24,
"iaccount": 25,
"igroup": 26,
"iid": 3,
"iname": 27,
"instance": 2,
"job": 28,
"machine": 29,
"name": 1,
"nodename": 0,
"origin_prometheus": 30,
"region": 31,
"release": 32,
"sysname": 33,
"vendor": 34,
"version": 35
},
"renameByName": {
"Value #B": "Memory",
"Value #C": "CPU",
"Value #D": "Uptime",
"Value #E": "Partition used",
"Value #F": "CPU used%",
"Value #G": "Memory used%",
"Value #H": "Disk read",
"Value #I": "Disk write",
"Value #J": "Download",
"Value #K": "Upload",
"Value #L": "5m load",
"Value #M": "TCP conn",
"Value #N": "TCP sockets",
"Value #O": "Health",
"Value #P": "IO used",
"exp": "Expiration date",
"iid": "Instance ID",
"instance": "IP",
"name": "",
"nodename": "Hostname"
}
}
},
{
"id": "filterFieldsByName",
"options": {
"include": {
"names": [
"Hostname",
"IP",
"Uptime",
"Health",
"Memory",
"CPU",
"CPU used%",
"Memory used%",
"IO used",
"Partition used",
"Disk read",
"Disk write",
"TCP conn",
"TCP sockets",
"Download",
"Upload",
"5m load"
]
}
}
}
],
"type": "table"
}
],
"refresh": "",
"schemaVersion": 41,
"tags": [
"Dashboard",
"CommandCenter"
],
"templating": {
"list": [
{
"auto": false,
"auto_count": 30,
"auto_min": "10s",
"current": {
"text": "3m",
"value": "3m"
},
"label": "Interval",
"name": "interval",
"options": [
{
"selected": true,
"text": "3m",
"value": "3m"
}
],
"query": "3m",
"refresh": 2,
"type": "interval"
},
{
"current": {},
"definition": "label_values(node_uname_info,job)",
"label": "JOB",
"name": "job",
"options": [],
"query": {
"qryType": 1,
"query": "label_values(node_uname_info,job)",
"refId": "PrometheusVariableQueryEditor-VariableQuery"
},
"refresh": 1,
"regex": "",
"sort": 5,
"type": "query"
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"30s",
"1m",
"3m",
"5m",
"15m",
"30m"
]
},
"timezone": "browser",
"title": "dashboard",
"uid": "behgmepd9v08wd",
"version": 35,
"weekStart": ""
}

331
dashboard002.json Normal file

@ -0,0 +1,331 @@
{
"__inputs": [
{
"name": "DS_PROMETHEUS",
"label": "Prometheus",
"description": "",
"type": "datasource",
"pluginId": "prometheus",
"pluginName": "Prometheus"
}
],
"__elements": {},
"__requires": [
{
"type": "grafana",
"id": "grafana",
"name": "Grafana",
"version": "11.1.0"
},
{
"type": "datasource",
"id": "prometheus",
"name": "Prometheus",
"version": "1.0.0"
},
{
"type": "panel",
"id": "table",
"name": "Table",
"version": ""
}
],
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "grafana"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"description": "Server Resource Dashboard",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": null,
"links": [],
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus_ds"
},
"description": "",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"custom": {
"align": "auto",
"cellOptions": {
"type": "auto"
},
"filterable": false
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "Hostname"
},
"properties": [
{
"id": "custom.align",
"value": "left"
}
]
},
{
"matcher": {
"id": "byName",
"options": "IP"
},
"properties": [
{
"id": "custom.align",
"value": "left"
}
]
},
{
"matcher": {
"id": "byName",
"options": "CPU used%"
},
"properties": [
{
"id": "unit",
"value": "percent"
},
{
"id": "custom.cellOptions",
"value": {
"type": "gauge",
"mode": "gradient"
}
},
{
"id": "thresholds",
"value": {
"mode": "absolute",
"steps": [
{ "color": "#73BF69", "value": null },
{ "color": "#F2CC0C", "value": 70 },
{ "color": "#F2495C", "value": 80 }
]
}
},
{
"id": "min",
"value": 0
},
{
"id": "max",
"value": 100
}
]
},
{
"matcher": {
"id": "byName",
"options": "Memory used%"
},
"properties": [
{
"id": "unit",
"value": "percent"
},
{
"id": "custom.cellOptions",
"value": {
"type": "gauge",
"mode": "gradient"
}
},
{
"id": "thresholds",
"value": {
"mode": "absolute",
"steps": [
{ "color": "#73BF69", "value": null },
{ "color": "#F2CC0C", "value": 70 },
{ "color": "#F2495C", "value": 80 }
]
}
},
{
"id": "min",
"value": 0
},
{
"id": "max",
"value": 100
}
]
},
{
"matcher": {
"id": "byName",
"options": "Disk used"
},
"properties": [
{
"id": "unit",
"value": "percent"
},
{
"id": "custom.cellOptions",
"value": {
"type": "gauge",
"mode": "gradient"
}
},
{
"id": "thresholds",
"value": {
"mode": "absolute",
"steps": [
{ "color": "#73BF69", "value": null },
{ "color": "#F2CC0C", "value": 70 },
{ "color": "#F2495C", "value": 80 }
]
}
},
{
"id": "min",
"value": 0
},
{
"id": "max",
"value": 100
}
]
},
{
"matcher": {
"id": "byName",
"options": "Uptime"
},
"properties": [
{
"id": "unit",
"value": "dtdurations"
}
]
},
{
"matcher": {
"id": "byName",
"options": "Download"
},
"properties": [
{
"id": "unit",
"value": "bps"
}
]
},
{
"matcher": {
"id": "byName",
"options": "Upload"
},
"properties": [
{
"id": "unit",
"value": "bps"
}
]
}
]
},
"gridPos": { "h": 24, "w": 24, "x": 0, "y": 0 },
"id": 1,
"options": {
"cellHeight": "sm",
"footer": { "show": false },
"showHeader": true
},
"pluginVersion": "11.1.0",
"targets": [
{ "refId": "A", "datasource": { "type": "prometheus", "uid": "prometheus_ds" }, "expr": "node_uname_info{job=~\"$job\"}", "format": "table", "instant": true, "legendFormat": "Hostname" },
{ "refId": "B", "datasource": { "type": "prometheus", "uid": "prometheus_ds" }, "expr": "100 - (avg by(hostname) (rate(node_cpu_seconds_total{mode=\"idle\", job=~\"$job\"}[$interval])) * 100)", "format": "table", "instant": true, "legendFormat": "CPU used%" },
{ "refId": "C", "datasource": { "type": "prometheus", "uid": "prometheus_ds" }, "expr": "(1 - (node_memory_MemAvailable_bytes{job=~\"$job\"} / node_memory_MemTotal_bytes{job=~\"$job\"})) * 100", "format": "table", "instant": true, "legendFormat": "Memory used%" },
{ "refId": "D", "datasource": { "type": "prometheus", "uid": "prometheus_ds" }, "expr": "(1 - (node_filesystem_avail_bytes{mountpoint=~\"/|/data\", job=~\"$job\"} / node_filesystem_size_bytes{mountpoint=~\"/|/data\", job=~\"$job\"})) * 100", "format": "table", "instant": true, "legendFormat": "Disk used" },
{ "refId": "E", "datasource": { "type": "prometheus", "uid": "prometheus_ds" }, "expr": "time() - node_boot_time_seconds{job=~\"$job\"}", "format": "table", "instant": true, "legendFormat": "Uptime" },
{ "refId": "F", "datasource": { "type": "prometheus", "uid": "prometheus_ds" }, "expr": "sum by (hostname) (rate(node_network_receive_bytes_total{job=~\"$job\"}[$interval])) * 8", "format": "table", "instant": true, "legendFormat": "Download" },
{ "refId": "G", "datasource": { "type": "prometheus", "uid": "prometheus_ds" }, "expr": "sum by (hostname) (rate(node_network_transmit_bytes_total{job=~\"$job\"}[$interval])) * 8", "format": "table", "instant": true, "legendFormat": "Upload" }
],
"transformations": [
{ "id": "merge", "options": {} },
{ "id": "organize", "options": { "excludeByName": { "Value #A": true }, "indexByName": {}, "renameByName": { "hostname": "Hostname", "Value #B": "CPU used%", "Value #C": "Memory used%", "Value #D": "Disk used", "Value #E": "Uptime", "Value #F": "Download", "Value #G": "Upload", "instance": "IP" } } }
],
"type": "table"
}
],
"refresh": "1m",
"schemaVersion": 39,
"tags": ["command-center", "overview"],
"templating": {
"list": [
{
"current": { "selected": true, "text": "common_servers", "value": "common_servers" },
"hide": 0,
"includeAll": false,
"multi": false,
"name": "job",
"options": [
{ "selected": true, "text": "common_servers", "value": "common_servers" },
{ "selected": false, "text": "game_servers", "value": "game_servers" }
],
"query": "common_servers,game_servers",
"skipUrlSync": false,
"type": "custom"
},
{
"current": { "selected": true, "text": "1m", "value": "1m" },
"hide": 0,
"name": "interval",
"options": [
{ "selected": false, "text": "30s", "value": "30s" },
{ "selected": true, "text": "1m", "value": "1m" },
{ "selected": false, "text": "5m", "value": "5m" }
],
"query": "30s,1m,5m",
"skipUrlSync": false,
"type": "interval"
}
]
},
"time": { "from": "now-1h", "to": "now" },
"timepicker": {},
"timezone": "browser",
"title": "Server Overview",
"uid": "server-overview-dashboard",
"version": 1,
"weekStart": ""
}

2464
grafana/grafana.ini Normal file

File diff suppressed because it is too large

75
grafana/ldap.toml Normal file

@ -0,0 +1,75 @@
# To troubleshoot and get more log info enable ldap debug logging in grafana.ini
# [log]
# filters = ldap:debug
[[servers]]
# Ldap server host (specify multiple hosts space separated)
host = "127.0.0.1"
# Default port is 389 or 636 if use_ssl = true
port = 389
# Set to true if LDAP server should use an encrypted TLS connection (either with STARTTLS or LDAPS)
use_ssl = false
# If set to true, use LDAP with STARTTLS instead of LDAPS
start_tls = false
# The value of an accepted TLS cipher. By default, this value is empty. Example value: ["TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384"]
# For a complete list of supported ciphers and TLS versions, refer to: https://go.dev/src/crypto/tls/cipher_suites.go
# Starting with Grafana v11.0 only ciphers with ECDHE support are accepted for TLS 1.2 connections.
tls_ciphers = []
# This is the minimum TLS version allowed. By default, this value is empty. Accepted values are: TLS1.1 (only for Grafana v10.4 or older), TLS1.2, TLS1.3.
min_tls_version = ""
# set to true if you want to skip ssl cert validation
ssl_skip_verify = false
# set to the path to your root CA certificate or leave unset to use system defaults
# root_ca_cert = "/path/to/certificate.crt"
# Authentication against LDAP servers requiring client certificates
# client_cert = "/path/to/client.crt"
# client_key = "/path/to/client.key"
# Search user bind dn
bind_dn = "cn=admin,dc=grafana,dc=org"
# Search user bind password
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
bind_password = 'grafana'
# We recommend using variable expansion for the bind_password, for more info https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/#variable-expansion
# bind_password = '$__env{LDAP_BIND_PASSWORD}'
# Timeout in seconds (applies to each host specified in the 'host' entry (space separated))
timeout = 10
# User search filter, for example "(cn=%s)" or "(sAMAccountName=%s)" or "(uid=%s)"
search_filter = "(cn=%s)"
# An array of base dns to search through
search_base_dns = ["dc=grafana,dc=org"]
## For Posix or LDAP setups that do not support the member_of attribute you can define the below settings
## Please check grafana LDAP docs for examples
# group_search_filter = "(&(objectClass=posixGroup)(memberUid=%s))"
# group_search_base_dns = ["ou=groups,dc=grafana,dc=org"]
# group_search_filter_user_attribute = "uid"
# Specify names of the ldap attributes your ldap uses
[servers.attributes]
name = "givenName"
surname = "sn"
username = "cn"
member_of = "memberOf"
email = "email"
# Map ldap groups to grafana org roles
[[servers.group_mappings]]
group_dn = "cn=admins,ou=groups,dc=grafana,dc=org"
org_role = "Admin"
# To make user an instance admin (Grafana Admin) uncomment line below
# grafana_admin = true
# The Grafana organization database id, optional, if left out the default org (id 1) will be used
# org_id = 1
[[servers.group_mappings]]
group_dn = "cn=editors,ou=groups,dc=grafana,dc=org"
org_role = "Editor"
[[servers.group_mappings]]
# If you want to match all (or no ldap groups) then you can use wildcard
group_dn = "*"
org_role = "Viewer"


@ -0,0 +1,68 @@
# ---
# # config file version
# apiVersion: 2
# # <list> list of roles to insert/update/delete
# roles:
# # <string, required> name of the role you want to create or update. Required.
# - name: 'custom:users:writer'
# # <string> uid of the role. Has to be unique for all orgs.
# uid: customuserswriter1
# # <string> description of the role, informative purpose only.
# description: 'Create, read, write users'
# # <int> version of the role, Grafana will update the role when increased.
# version: 2
# # <int> org id. Defaults to Grafana's default if not specified.
# orgId: 1
# # <list> list of the permissions granted by this role.
# permissions:
# # <string, required> action allowed.
# - action: 'users:read'
# #<string> scope it applies to.
# scope: 'global.users:*'
# - action: 'users:write'
# scope: 'global.users:*'
# - action: 'users:create'
# - name: 'custom:global:users:reader'
# # <bool> overwrite org id and creates a global role.
# global: true
# # <string> state of the role. Defaults to 'present'. If 'absent', role will be deleted.
# state: 'absent'
# # <bool> force deletion revoking all grants of the role.
# force: true
# - uid: 'basic_editor'
# version: 2
# global: true
# # <list> list of roles to copy permissions from.
# from:
# - uid: 'basic_editor'
# global: true
# - name: 'fixed:users:writer'
# global: true
# # <list> list of the permissions to add/remove on top of the copied ones.
# permissions:
# - action: 'users:read'
# scope: 'global.users:*'
# - action: 'users:write'
# scope: 'global.users:*'
# # <string> state of the permission. Defaults to 'present'. If 'absent', the permission will be removed.
# state: absent
# # <list> list role assignments to teams to create or remove.
# teams:
# # <string, required> name of the team you want to assign roles to. Required.
# - name: 'Users writers'
# # <int> org id. Will default to Grafana's default if not specified.
# orgId: 1
# # <list> list of roles to assign to the team
# roles:
# # <string> uid of the role you want to assign to the team.
# - uid: 'customuserswriter1'
# # <int> org id. Will default to Grafana's default if not specified.
# orgId: 1
# # <string> name of the role you want to assign to the team.
# - name: 'fixed:users:writer'
# # <bool> overwrite org id to specify the role is global.
# global: true
# # <string> state of the assignment. Defaults to 'present'. If 'absent', the assignment will be revoked.
# state: absent


@ -0,0 +1,227 @@
# # config file version
apiVersion: 1
# # List of rule groups to import or update
# groups:
# # <int> organization ID, default = 1
# - orgId: 1
# # <string, required> name of the rule group
# name: my_rule_group
# # <string, required> name of the folder the rule group will be stored in
# folder: my_first_folder
# # <duration, required> interval of the rule group evaluation
# interval: 60s
# # <list, required> list of rules that are part of the rule group
# rules:
# # <string, required> unique identifier for the rule. Should not exceed 40 symbols. Only letters, numbers, - (hyphen), and _ (underscore) allowed.
# - uid: my_id_1
# # <string, required> title of the rule, will be displayed in the UI
# title: my_first_rule
# # <string, required> query used for the condition
# condition: A
# # <list, required> list of query objects that should be executed on each
# # evaluation - should be obtained via the API
# data:
# - refId: A
# datasourceUid: "__expr__"
# model:
# conditions:
# - evaluator:
# params:
# - 3
# type: gt
# operator:
# type: and
# query:
# params:
# - A
# reducer:
# type: last
# type: query
# datasource:
# type: __expr__
# uid: "__expr__"
# expression: 1==0
# intervalMs: 1000
# maxDataPoints: 43200
# refId: A
# type: math
# # <string> UID of a dashboard that the alert rule should be linked to
# dashboardUid: my_dashboard
# # <int> ID of the panel that the alert rule should be linked to
# panelId: 123
# # <string> state of the alert rule when no data is returned
# # possible values: "NoData", "Alerting", "OK", default = NoData
# noDataState: Alerting
# # <string> state of the alert rule when the query execution
# # fails - possible values: "Error", "Alerting", "OK"
# # default = Alerting
# executionErrorState: Alerting
# # <duration, required> how long the alert condition should be breached before Firing. Before this time has elapsed, the alert is considered to be Pending
# for: 60s
# # <map<string, string>> map of strings to attach arbitrary custom data
# annotations:
# some_key: some_value
# # <map<string, string>> map of strings to filter and
# # route alerts
# labels:
# team: sre_team_1
# isPaused: false
# # optional settings that let you configure the notification settings applied to alerts created by this rule
# notification_settings:
# # <string> name of the receiver (contact-point) that should be used for this route
# receiver: grafana-default-email
# # <list<string>> The labels by which incoming alerts are grouped together. For example,
# # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# # be batched into a single group.
# #
# # To aggregate by all possible labels, use the special value '...' as
# # the sole label name, for example:
# # group_by: ['...']
# # This effectively disables aggregation entirely, passing through all
# # alerts as-is. This is unlikely to be what you want, unless you have
# # a very low alert volume or your upstream notification system performs
# # its own grouping.
# # If defined, must contain the labels 'alertname' and 'grafana_folder', except when contains '...'
# group_by: ["alertname", "grafana_folder", "region"]
# # <list> Times when the route should be muted. These must match the name of a
# # mute time interval.
# # Additionally, the root node cannot have any mute times.
# # When a route is muted it will not send any notifications, but
# # otherwise acts normally (including ending the route-matching process
# # if the `continue` option is not set)
# mute_time_intervals:
# - abc
# # <duration> How long to initially wait to send a notification for a group
# # of alerts. Allows to collect more initial alerts for the same group.
# # (Usually ~0s to few minutes).
# # If not specified, the corresponding setting of the default policy is used.
# group_wait: 30s
# # <duration> How long to wait before sending a notification about new alerts that
# # are added to a group of alerts for which an initial notification has
# # already been sent. (Usually ~5m or more).
# # If not specified, the corresponding setting of the default policy is used.
# group_interval: 5m
# # <duration> How long to wait before sending a notification again if it has already
# # been sent successfully for an alert. (Usually ~3h or more)
# # If not specified, the corresponding setting of the default policy is used.
# repeat_interval: 4h
# # List of alert rule UIDs that should be deleted
# deleteRules:
# # <int> organization ID, default = 1
# - orgId: 1
# # <string, required> unique identifier for the rule
# uid: my_id_1
# # List of contact points to import or update
# contactPoints:
# # <int> organization ID, default = 1
# - orgId: 1
# # <string, required> name of the contact point
# name: cp_1
# receivers:
# # <string, required> unique identifier for the receiver. Should not exceed 40 symbols. Only letters, numbers, - (hyphen), and _ (underscore) allowed.
# - uid: first_uid
# # <string, required> type of the receiver
# type: prometheus-alertmanager
# # <object, required> settings for the specific receiver type
# settings:
# url: http://test:9000
# # List of receivers that should be deleted
# deleteContactPoints:
# - orgId: 1
# uid: first_uid
# # List of notification policies to import or update
# policies:
# # <int> organization ID, default = 1
# - orgId: 1
# # <string> name of the receiver that should be used for this route
# receiver: grafana-default-email
# # <list<string>> The labels by which incoming alerts are grouped together. For example,
# # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# # be batched into a single group.
# #
# # To aggregate by all possible labels, use the special value '...' as
# # the sole label name, for example:
# # group_by: ['...']
# # This effectively disables aggregation entirely, passing through all
# # alerts as-is. This is unlikely to be what you want, unless you have
# # a very low alert volume or your upstream notification system performs
# # its own grouping.
# group_by:
# - grafana_folder
# - alertname
# # <list> a list of matchers that an alert has to fulfill to match the node
# matchers:
# - alertname = Watchdog
# - severity =~ "warning|critical"
# # <list> Times when the route should be muted. These must match the name of a
# # mute time interval.
# # Additionally, the root node cannot have any mute times.
# # When a route is muted it will not send any notifications, but
# # otherwise acts normally (including ending the route-matching process
# # if the `continue` option is not set)
# mute_time_intervals:
# - abc
# # <duration> How long to initially wait to send a notification for a group
# # of alerts. Allows to collect more initial alerts for the same group.
# # (Usually ~0s to few minutes), default = 30s
# group_wait: 30s
# # <duration> How long to wait before sending a notification about new alerts that
# # are added to a group of alerts for which an initial notification has
# # already been sent. (Usually ~5m or more), default = 5m
# group_interval: 5m
# # <duration> How long to wait before sending a notification again if it has already
# # been sent successfully for an alert. (Usually ~3h or more), default = 4h
# repeat_interval: 4h
# # <list> Zero or more child routes
# routes:
# ...
# # List of orgIds that should be reset to the default policy
# resetPolicies:
# - 1
# # List of templates to import or update
# templates:
# # <int> organization ID, default = 1
# - orgID: 1
# # <string, required> name of the template, must be unique
# name: my_first_template
# # <string, required> content of the template
# template: Alerting with a custom text template
# # List of templates that should be deleted
# deleteTemplates:
# # <int> organization ID, default = 1
# - orgId: 1
# # <string, required> name of the template, must be unique
# name: my_first_template
# # List of mute time intervals to import or update
# muteTimes:
# # <int> organization ID, default = 1
# - orgId: 1
# # <string, required> name of the mute time interval, must be unique
# name: mti_1
# # <list> time intervals that should trigger the muting
# refer to https://prometheus.io/docs/alerting/latest/configuration/#time_interval-0
# time_intervals:
# - times:
# - start_time: '06:00'
# end_time: '23:59'
# weekdays: ['monday:wednesday','saturday', 'sunday']
# months: ['1:3', 'may:august', 'december']
# years: ['2020:2022', '2030']
# days_of_month: ['1:5', '-3:-1']
# # List of mute time intervals that should be deleted
# deleteMuteTimes:
# # <int> organization ID, default = 1
# - orgId: 1
# # <string, required> name of the mute time interval, must be unique
# name: mti_1


@ -0,0 +1,11 @@
# # config file version
apiVersion: 1
#providers:
# - name: 'default'
# orgId: 1
# folder: ''
# folderUid: ''
# type: file
# options:
# path: /var/lib/grafana/dashboards


@ -0,0 +1,71 @@
# Configuration file version
apiVersion: 1
# # List of data sources to delete from the database.
# deleteDatasources:
# - name: Graphite
# orgId: 1
# # List of data sources to insert/update depending on what's
# # available in the database.
# datasources:
# # <string, required> Sets the name you use to refer to
# # the data source in panels and queries.
# - name: Graphite
# # <string, required> Sets the data source type.
# type: graphite
# # <string, required> Sets the access mode, either
# # proxy or direct (Server or Browser in the UI).
# # Some data sources are incompatible with any setting
# # but proxy (Server).
# access: proxy
# # <int> Sets the organization id. Defaults to orgId 1.
# orgId: 1
# # <string> Sets a custom UID to reference this
# # data source in other parts of the configuration.
# # If not specified, Grafana generates one.
# uid: my_unique_uid
# # <string> Sets the data source's URL, including the
# # port.
# url: http://localhost:8080
# # <string> Sets the database user, if necessary.
# user:
# # <string> Sets the database name, if necessary.
# database:
# # <bool> Enables basic authorization.
# basicAuth:
# # <string> Sets the basic authorization username.
# basicAuthUser:
# # <bool> Enables credential headers.
# withCredentials:
# # <bool> Toggles whether the data source is pre-selected
# # for new panels. You can set only one default
# # data source per organization.
# isDefault:
# # <map> Fields to convert to JSON and store in jsonData.
# jsonData:
# # <string> Defines the Graphite service's version.
# graphiteVersion: '1.1'
# # <bool> Enables TLS authentication using a client
# # certificate configured in secureJsonData.
# tlsAuth: true
# # <bool> Enables TLS authentication using a CA
# # certificate.
# tlsAuthWithCACert: true
# # <map> Fields to encrypt before storing in jsonData.
# secureJsonData:
# # <string> Defines the CA cert, client cert, and
# # client key for encrypted authentication.
# tlsCACert: '...'
# tlsClientCert: '...'
# tlsClientKey: '...'
# # <string> Sets the database password, if necessary.
# password:
# # <string> Sets the basic authorization password.
# basicAuthPassword:
# # <int> Sets the version. Used to compare versions when
# # updating. Ignored when creating a new data source.
# version: 1
# # <bool> Allows users to edit data sources from the
# # Grafana UI.
# editable: false
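#
# A hedged sketch, not part of the original commit: a Prometheus data source
# entry for this stack might look like the commented block below.
# Assumptions: the uid "prometheus_ds" is taken from the panel targets in
# dashboard002.json, and the URL reuses the "localhost:9091" target of the
# "prometheus" scrape job in prometheus/prometheus.yml. Adjust both if the
# actual deployment differs.
#
# datasources:
#   - name: Prometheus
#     type: prometheus
#     access: proxy
#     orgId: 1
#     uid: prometheus_ds            # assumption: uid referenced by dashboard002.json
#     url: http://localhost:9091    # assumption: Prometheus listen address
#     isDefault: true
#     editable: false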


@ -0,0 +1,11 @@
# # config file version
apiVersion: 1
# apps:
# - type: grafana-example-app
# org_name: Main Org.
# disabled: true
# - type: raintank-worldping-app
# org_id: 1
# jsonData:
# apiKey: "API KEY"

199
prometheus/prometheus.yml Normal file

@ -0,0 +1,199 @@
global:
  scrape_interval: 1m
  evaluation_interval: 1m
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9094
rule_files:
  - "/etc/prometheus/rules/resource_alert.yml"
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9091"]
  - job_name: "common_servers"
    static_configs:
      - targets: ["10.0.10.21:9500"]
        labels:
          hostname: "ds-battlefield"
          ip: "10.0.10.21"
      - targets: ["10.0.10.6:9500"]
        labels:
          hostname: "ds-commandcenter"
          ip: "10.0.10.6"
      - targets: ["10.0.10.7:9500"]
        labels:
          hostname: "ds-crashreport"
          ip: "10.0.10.7"
      - targets: ["10.0.10.17:9500"]
        labels:
          hostname: "ds-maingate001"
          ip: "10.0.10.17"
      - targets: ["10.0.10.18:9500"]
        labels:
          hostname: "ds-maingate002"
          ip: "10.0.10.18"
      - targets: ["10.0.10.14:9500"]
        labels:
          hostname: "ds-mongodb001"
          ip: "10.0.10.14"
      - targets: ["10.0.10.15:9500"]
        labels:
          hostname: "ds-mongodb002"
          ip: "10.0.10.15"
      - targets: ["10.0.10.16:9500"]
        labels:
          hostname: "ds-mongodb003"
          ip: "10.0.10.16"
      - targets: ["10.0.10.8:9500"]
        labels:
          hostname: "ds-opensearch001"
          ip: "10.0.10.8"
      - targets: ["10.0.10.9:9500"]
        labels:
          hostname: "ds-opensearch002"
          ip: "10.0.10.9"
      - targets: ["10.0.10.10:9500"]
        labels:
          hostname: "ds-opensearch003"
          ip: "10.0.10.10"
      - targets: ["10.0.10.22:9500"]
        labels:
          hostname: "ds-promotor"
          ip: "10.0.10.22"
      - targets: ["10.0.10.24:9500"]
        labels:
          hostname: "ds-racetrack"
          ip: "10.0.10.24"
      - targets: ["10.0.10.11:9500"]
        labels:
          hostname: "ds-redis001"
          ip: "10.0.10.11"
      - targets: ["10.0.10.12:9500"]
        labels:
          hostname: "ds-redis002"
          ip: "10.0.10.12"
      - targets: ["10.0.10.13:9500"]
        labels:
          hostname: "ds-redis003"
          ip: "10.0.10.13"
      - targets: ["10.0.10.23:9500"]
        labels:
          hostname: "ds-social"
          ip: "10.0.10.23"
      - targets: ["10.0.10.25:9500"]
        labels:
          hostname: "ds-tavern"
          ip: "10.0.10.25"
      - targets: ["10.0.10.19:9500"]
        labels:
          hostname: "ds-warehouse001"
          ip: "10.0.10.19"
      - targets: ["10.0.10.20:9500"]
        labels:
          hostname: "ds-warehouse002"
          ip: "10.0.10.20"
  - job_name: "game_servers"
    static_configs:
      - targets: ["110.234.163.37:9500"]
        labels:
          hostname: "ds-jpn-game001"
          ip: "110.234.163.37"
      - targets: ["110.234.163.30:9500"]
        labels:
          hostname: "ds-jpn-game002"
          ip: "110.234.163.30"
      - targets: ["110.234.161.170:9500"]
        labels:
          hostname: "ds-jpn-game003"
          ip: "110.234.161.170"
      - targets: ["110.234.160.149:9500"]
        labels:
          hostname: "ds-jpn-game004"
          ip: "110.234.160.149"
      - targets: ["110.234.162.181:9500"]
        labels:
          hostname: "ds-jpn-game005"
          ip: "110.234.162.181"
      - targets: ["110.234.160.50:9500"]
        labels:
          hostname: "ds-jpn-game006"
          ip: "110.234.160.50"
      - targets: ["110.234.165.61:9500"]
        labels:
          hostname: "ds-jpn-game007"
          ip: "110.234.165.61"
      - targets: ["110.234.163.151:9500"]
        labels:
          hostname: "ds-jpn-game008"
          ip: "110.234.163.151"
      - targets: ["110.234.195.8:9500"]
        labels:
          hostname: "ds-sgn-game001"
          ip: "110.234.195.8"
      - targets: ["110.234.193.164:9500"]
        labels:
          hostname: "ds-sgn-game002"
          ip: "110.234.193.164"
      - targets: ["110.234.193.189:9500"]
        labels:
          hostname: "ds-sgn-game003"
          ip: "110.234.193.189"
      - targets: ["110.234.192.213:9500"]
        labels:
          hostname: "ds-sgn-game004"
          ip: "110.234.192.213"
      - targets: ["110.234.194.108:9500"]
        labels:
          hostname: "ds-sgn-game005"
          ip: "110.234.194.108"
      - targets: ["110.234.194.199:9500"]
        labels:
          hostname: "ds-sgn-game006"
          ip: "110.234.194.199"
      - targets: ["110.234.194.179:9500"]
        labels:
          hostname: "ds-sgn-game007"
          ip: "110.234.194.179"
      - targets: ["110.234.193.159:9500"]
        labels:
          hostname: "ds-sgn-game008"
          ip: "110.234.193.159"
      - targets: ["44.198.4.245:9500"]
        labels:
          hostname: "ds-us-game001"
          ip: "44.198.4.245"
      - targets: ["52.5.176.32:9500"]
        labels:
          hostname: "ds-us-game002"
          ip: "52.5.176.32"
      - targets: ["98.86.208.130:9500"]
        labels:
          hostname: "ds-us-game003"
          ip: "98.86.208.130"
      - targets: ["98.87.57.10:9500"]
        labels:
          hostname: "ds-us-game004"
          ip: "98.87.57.10"
      - targets: ["18.153.131.248:9500"]
        labels:
          hostname: "ds-de-game001"
          ip: "18.153.131.248"
      - targets: ["18.185.201.217:9500"]
        labels:
          hostname: "ds-de-game002"
          ip: "18.185.201.217"
      - targets: ["3.124.28.212:9500"]
        labels:
          hostname: "ds-de-game003"
          ip: "3.124.28.212"
      - targets: ["3.69.139.75:9500"]
        labels:
          hostname: "ds-de-game004"
          ip: "3.69.139.75"


@ -0,0 +1,46 @@
groups:
  - name: 리소스 사용량 경고
    rules:
      - alert: CPU사용량경고
        expr: 100 - (avg by(instance, hostname, ip) (rate(node_cpu_seconds_total{mode="idle"}[10m])) * 100) > 70
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage on {{ $labels.hostname }} has exceeded 70% for the past 10 minutes."
          value: "{{ $value | printf \"%.2f\" }}"
          runbook_url: "https://grafana.dungeonstalkers.com:8443"
      - alert: CPU사용량심각
        expr: 100 - (avg by(instance, hostname, ip) (rate(node_cpu_seconds_total{mode="idle"}[10m])) * 100) > 80
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU usage detected"
          description: "CPU usage on {{ $labels.hostname }} has exceeded 80% for the past 10 minutes."
          value: "{{ $value | printf \"%.2f\" }}"
          runbook_url: "https://grafana.dungeonstalkers.com:8443"
      - alert: 메모리사용량경고
        expr: (1 - (node_memory_MemAvailable_bytes{hostname!=""} / node_memory_MemTotal_bytes{hostname!=""})) * 100 > 70
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage on {{ $labels.hostname }} has exceeded 70% for the past 10 minutes."
          value: "{{ $value | printf \"%.2f\" }}"
          runbook_url: "https://grafana.dungeonstalkers.com:8443"
      - alert: 메모리사용량심각
        expr: (1 - (node_memory_MemAvailable_bytes{hostname!=""} / node_memory_MemTotal_bytes{hostname!=""})) * 100 > 80
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Critical memory usage detected"
          description: "Memory usage on {{ $labels.hostname }} has exceeded 80% for the past 10 minutes."
          value: "{{ $value | printf \"%.2f\" }}"
          runbook_url: "https://grafana.dungeonstalkers.com:8443"
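
As a rough illustration of the CPU expressions above: `rate(node_cpu_seconds_total{mode="idle"}[10m])` gives the fraction of each second spent idle, so usage is 100 minus that fraction times 100. The sketch below only mirrors that arithmetic and the 70/80 thresholds. The helper names `cpu_usage_pct` and `cpu_severity` and the sample idle fraction are hypothetical and not part of this repository.

```shell
#!/bin/bash
# Hypothetical helper: mirrors 100 - (average idle rate * 100) from the CPU rules.
cpu_usage_pct() {
    # $1: average idle fraction across cores (0.0 to 1.0), as rate() would report it
    awk -v idle="$1" 'BEGIN { printf "%.2f", 100 - idle * 100 }'
}

# Hypothetical helper: returns the highest severity whose threshold is exceeded.
# In Prometheus itself, a value above 80 satisfies both the warning and critical rules.
cpu_severity() {
    awk -v u="$1" 'BEGIN {
        if (u > 80)      print "critical"
        else if (u > 70) print "warning"
        else             print "none"
    }'
}

# Sample value (assumption): the CPUs were idle 25% of the time over the window.
usage=$(cpu_usage_pct 0.25)
echo "usage=${usage}% severity=$(cpu_severity "$usage")"
```

With an idle fraction of 0.25 the computed usage is 75.00, which crosses only the warning threshold. The real rules additionally require the condition to hold for the `for: 10m` duration before firing.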

promteams/config.env Normal file

@ -0,0 +1,10 @@
# MS Teams Webhook URL
WEBHOOK_URL="https://oneunivrs.webhook.office.com/webhookb2/7248d32a-3473-43bd-961b-c2a2516f28f5@1e8605cc-8007-46b0-993f-b388917f9499/IncomingWebhook/ac17804386cc4efdad5c78b3a8c182f7/f5368752-03f7-4e64-93e6-b40991c04c0c/V2jpWgnliaoihAzy3iMA2p_2KWou2hMIj4T32F8MCMVH01"
# Request URI; must match the path in Alertmanager's webhook_configs.url
REQUEST_URI="resource_alert"
# Host path of the template file to use (the source file to mount)
TEMPLATE_HOST_PATH="/etc/alertmanager/templates/msteams.tmpl"
# Path where the template file is placed inside the container
TEMPLATE_CONTAINER_PATH="/app/default-message-card.tmpl"
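
Because `REQUEST_URI` has to match the path in Alertmanager's `webhook_configs.url`, it can help to see how that URL is composed. The following is a minimal sketch, assuming Alertmanager reaches the container on `localhost` and using the port 2000 published by the start script. The function name `promteams_webhook_url` is hypothetical and not part of this repository.

```shell
#!/bin/bash
# Hypothetical helper: builds the URL that Alertmanager's webhook_configs.url
# would need to point at for a given host, port, and request URI.
promteams_webhook_url() {
    local host="$1" port="$2" request_uri="$3"
    # Strip one leading slash so "resource_alert" and "/resource_alert" give the same result.
    echo "http://${host}:${port}/${request_uri#/}"
}

# Assumption: Alertmanager and the container run on the same host.
promteams_webhook_url "localhost" 2000 "resource_alert"
```

If the container runs on a different host or behind another published port, only the first two arguments change; the path segment must still equal `REQUEST_URI`.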


@ -0,0 +1,26 @@
#!/bin/bash
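# Load the configuration file
source /data/promteams/config.env
# Verify that all required variables are set (including the template paths used by -v below)
if [ -z "$WEBHOOK_URL" ] || [ -z "$REQUEST_URI" ] || [ -z "$TEMPLATE_HOST_PATH" ] || [ -z "$TEMPLATE_CONTAINER_PATH" ]; then
    echo "Required settings are missing. Check the config.env file."
    exit 1
fi
echo "Stopping and removing the existing promteams container."
docker stop promteams >/dev/null 2>&1
docker rm promteams >/dev/null 2>&1
echo "Starting the Prometheus-MSTeams container with the legacy image (v1.5.2), which is configured through environment variables."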
docker run -d -p 2000:2000 \
--name="promteams" \
--restart=always \
-e TEAMS_INCOMING_WEBHOOK_URL="$WEBHOOK_URL" \
-e TEAMS_REQUEST_URI="$REQUEST_URI" \
-v "$TEMPLATE_HOST_PATH:$TEMPLATE_CONTAINER_PATH" \
quay.io/prometheusmsteams/prometheus-msteams:v1.5.2
echo "The container has started. Check its status with the following command:"
echo "docker ps | grep promteams"


@ -0,0 +1,11 @@
#!/bin/bash
CONTAINER_NAME="promteams"
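# Quote the command substitution and variables so the test stays valid for empty or multi-line output
if [ -n "$(docker ps -q -f "name=$CONTAINER_NAME")" ]; then
    echo "Stopping and removing the Prometheus-MSTeams container ($CONTAINER_NAME)."
    docker stop "$CONTAINER_NAME"
    docker rm "$CONTAINER_NAME"
    echo "Done."
else
    echo "No running Prometheus-MSTeams container found."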
fi