Monitoring and Alerting

11 April 2023


Monitoring and alerting refer to the practice of continuously collecting and analyzing system metrics in real time to detect issues or anomalies and notify the people responsible. Monitoring involves collecting and analyzing data from sources such as servers, networks, applications, and services, while alerting involves notifying IT teams of issues or events that require immediate attention or remediation.


Architecture Diagram:

The Architecture includes:

Kong Gateway: This is the entry point for external traffic to access the different services. It can also expose metrics about the traffic for Prometheus to scrape.

Prometheus: This is the primary data source for metrics. It scrapes metrics from the exporters and from other applications that expose metrics, stores them, and serves them for querying and alerting.

Exporters: These are applications that expose metrics in a format that Prometheus can scrape. Examples include node_exporter for system-level metrics, blackbox_exporter for synthetic monitoring, and more.

Other Applications: These are applications that expose metrics directly, without the need for an exporter, and can be scraped by Prometheus. Examples include Spring Boot applications with Micrometer, Kubernetes API server, and more.

Grafana: This is a visualization and dashboarding tool that can be used to create dashboards to display Prometheus metrics.

Alertmanager: This is responsible for processing alerts generated by Prometheus based on user-defined rules and sending them to different notification channels such as Slack, PagerDuty, or email.

Overall, this architecture provides a comprehensive monitoring solution for various applications, services, and infrastructure components, enabling users to collect, store, query, visualize, and alert on metrics across the entire system.

Why do we go for Monitoring and Alerting tools?

Monitoring and alerting tools help ensure the reliability and availability of computer systems, networks, and applications. They detect and alert on events and issues such as system failures, errors, performance degradation, security threats, and other anomalies. Typical uses include:


  • Resource utilization monitoring
  • Forecasting future resource needs
  • Performance metric analysis
  • Optimization identification
  • System performance improvement
  • Compliance monitoring
  • Security threat detection
  • Vulnerability monitoring
  • Industry regulation compliance
  • Data breach prevention

Required stack for development:

  • Prometheus
  • Alertmanager
  • Grafana

Prometheus collects metrics by scraping HTTP endpoints exposed by the different sources and stores them as time series. Its query language, PromQL, allows users to query and analyze the collected metrics.
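As a rough illustration of querying collected metrics, the sketch below builds a request URL for the instant-query endpoint of the Prometheus HTTP API (/api/v1/query) and extracts values from a response. The base URL, the example query, and the sample payload are assumptions made for illustration; the payload only mimics the shape of a vector result and was not captured from this project.

```python
from typing import Dict
from urllib.parse import urlencode


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def values_by_instance(payload: dict) -> Dict[str, float]:
    """Map each series' 'instance' label to its sample value (vector results)."""
    if payload.get("status") != "success":
        raise ValueError("query failed")
    return {
        series["metric"].get("instance", ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


# Hypothetical response body, shaped like a vector result from the API.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "instance": "kong:8001"},
                "value": [1681200000, "1"],
            },
        ],
    },
}

print(build_instant_query_url("http://localhost:9090", "up"))
print(values_by_instance(sample_payload))
```

Fetching the built URL with any HTTP client returns JSON in the same general shape as the sample payload.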

Alertmanager handles the alerts that Prometheus generates from user-defined alerting rules, which specify the conditions under which an alert should fire. When an alert arrives, Alertmanager evaluates its routing tree to determine which receivers should get it, then sends the notification to those receivers by email, webhook, Slack, or another supported integration.

Grafana is an open-source data visualization and monitoring tool that can be used to create dashboards and panels to display metrics and analytics from various data sources. It is highly customizable and supports a wide range of data sources, including Prometheus, Elasticsearch, Graphite, InfluxDB, and many others.

How to Set Up the Project?

Project folder Structure:

Clk-monitoring
├── Volumes
│   ├── alertmanager              # Config directory for Alertmanager
│   │   └── alertmanger.yml       # YAML config for Alertmanager
│   └── prometheus                # Config directory for Prometheus
│       └── config
│           ├── alert.rules.yml   # YAML file for alerting rules
│           └── prometheus.yml    # YAML config for Prometheus
├── Dockerfile                    # Dockerfile for Kong
├── docker-compose.yml            # Used by docker-compose
├── .env
├── .gitignore
└── Readme.md

In this case we use Docker to set up the project on the local machine.

1. Install Docker and Docker-compose

https://docs.docker.com/get-docker/

2. Check whether Docker and docker-compose are installed

For Docker:

docker --version

For docker-compose:

docker-compose --version

3. Clone the project from GitHub using the command below:

    git clone https://github.com/arockiyastephenl/Clk-Moniter.git

4. Run the “Auto_script.sh” shell script from your terminal, or run the commands below one by one

Auto_script.sh


  docker-compose build kong
  docker-compose up -d kong-db
  docker-compose run --rm kong kong migrations bootstrap
  docker-compose run --rm kong kong migrations up
  docker-compose up -d kong
  docker-compose ps
  docker-compose up -d konga
  sleep 2m
  docker-compose up -d keycloak-db
  docker-compose up -d keycloak
  docker-compose up -d alertmanager
  docker-compose up -d prometheus
  docker-compose up -d grafana
  docker-compose ps
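The script pauses with a fixed "sleep 2m" before starting the remaining services. One alternative, sketched below under assumptions, is to poll until a service accepts TCP connections and give up after a timeout. Which service the pause is waiting for is not stated here, so the host and port in the usage comment (Kong's admin port 8001, taken from the Prometheus targets) are only an example.

```python
import socket
import time
from typing import Callable


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until(check: Callable[[], bool], timeout: float = 120.0,
               interval: float = 2.0) -> bool:
    """Call check() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example usage (assumed host and port; adjust to the service being awaited):
# ready = wait_until(lambda: port_open("localhost", 8001), timeout=120)
```

Polling returns as soon as the service is reachable, so the wait is no longer than it needs to be, while the timeout keeps the upper bound that the fixed sleep provided.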

After that, all services are running on the machine. The screenshot below shows the applications running on the machine.

5. Verify the services

The file below contains the Prometheus configuration. It specifies the target addresses from which Prometheus collects metrics.

prometheus.yml


global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - ./alert.rules.yml
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - '192.168.43.16:9093'
scrape_configs:
  - job_name: prometheus
    metrics_path: /metrics
    honor_labels: false
    honor_timestamps: true
    sample_limit: 0
    static_configs:
      - targets:
          - '192.168.43.16:9090'
          - 'kong:8001'
          - '192.168.43.16:8001'
  - job_name: Fast-api
    static_configs:
      - targets:
          - '192.168.43.16:8097'
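  # Note: 3306 is MySQL's default protocol port, which does not serve
  # Prometheus metrics over HTTP; scraping MySQL normally goes through an
  # exporter such as mysqld_exporter.
  - job_name: mysql_DB
    static_configs:
      - targets:
          - '192.168.43.16:3306'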
          

Targets:

In Prometheus, a target is a resource or endpoint that is monitored by the Prometheus server. Targets can be specified as URLs, IP addresses, or DNS names, and can be located anywhere on a network.

When Prometheus is configured to scrape a target, it sends HTTP requests to the target's metrics endpoint to collect data about the target's performance and health. This data is then stored in the Prometheus time-series database and can be queried and visualized using Grafana or other data visualization tools.

Assuming Prometheus is running on the same machine where the command is executed, the following command retrieves the metrics that the Prometheus server exposes about itself:


  curl http://localhost:9090/metrics
  

This command sends an HTTP GET request to the Prometheus server on port 9090 of the local machine, requesting its own metrics endpoint. The response contains the metrics in plain-text format, which can be parsed and analyzed using various tools and libraries.
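To make the plain-text format concrete, here is a minimal parsing sketch. It handles only the simple "name{labels} value" lines and skips comment lines; it is not a full implementation of the exposition format, and the sample text is invented for illustration.

```python
from typing import Dict

# Invented sample in the style of a /metrics response.
SAMPLE = """\
# HELP prometheus_http_requests_total Counter of HTTP requests.
# TYPE prometheus_http_requests_total counter
prometheus_http_requests_total{code="200",handler="/metrics"} 42
process_cpu_seconds_total 1.5
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Return {series: value} for each sample line, ignoring # comment lines."""
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # The series ends at the last closing brace of the label set.
            end = line.rfind("}")
            series, rest = line[: end + 1], line[end + 1:]
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if fields:
            # The first field is the value; an optional timestamp may follow.
            samples[series] = float(fields[0])
    return samples


print(parse_metrics(SAMPLE))
```

For anything beyond a quick look, an existing client library's parser is likely a better choice than hand-rolled parsing.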

Rules:

In Prometheus, rules are used to define additional recording and alerting rules based on the existing time-series data. Recording rules allow users to define new time-series data based on existing data, while alerting rules allow users to define alerts that are triggered when certain conditions are met.

Recording rules allow users to define new metrics based on existing metrics. These rules are defined using PromQL expressions that are evaluated at regular intervals and the result is stored as a new time-series. This allows users to preprocess and aggregate metrics data before querying it, which can improve performance and reduce the load on the Prometheus server.

Alerting rules allow users to define alerts based on the time-series data. These rules are also PromQL expressions evaluated at regular intervals; an alert becomes active for every time series the expression returns, and it fires once the condition has held for the configured "for" duration. Users can attach labels, such as a severity level, and annotations, such as a summary and description. Notification channels and the frequency of repeated notifications are configured in Alertmanager.

Rules are defined in separate rule files that the Prometheus configuration file references under rule_files. Once loaded, Prometheus evaluates the rules at the configured evaluation interval and stores the results of recording rules in the time-series database. The results can then be queried and visualized using Grafana or other data visualization tools.

We can write customized alerts based on our needs. Sample alert rules are available on GitHub.

alert.rules.yml


groups:
- name: alert.rules
  rules:
    - alert: InstanceDown
      expr: up == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: 'Endpoint {{ $labels.instance }} down'
        description: >-
          {{ $labels.instance }} of job {{ $labels.job }} has been down for
          more than 1 minute.

The loaded rules can be viewed at http://localhost:9090/rules
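The effect of the "for: 1m" clause can be pictured with a small simulation. This is a simplified model written for illustration, not Prometheus code: it assumes one evaluation every 15 seconds (the evaluation_interval in prometheus.yml) and tracks how long the condition "up == 0" has held.

```python
from typing import Iterable, List, Optional


def alert_states(condition_holds: Iterable[bool], eval_interval_s: int,
                 for_s: int) -> List[str]:
    """Return the alert state after each evaluation of the rule expression."""
    states: List[str] = []
    held_for: Optional[int] = None  # seconds the condition has held so far
    for holds in condition_holds:
        if not holds:
            held_for = None
            states.append("inactive")
            continue
        held_for = 0 if held_for is None else held_for + eval_interval_s
        states.append("firing" if held_for >= for_s else "pending")
    return states


# One healthy evaluation, five consecutive failures, then recovery.
timeline = [False] + [True] * 5 + [False]
print(alert_states(timeline, eval_interval_s=15, for_s=60))
```

In this model the alert stays pending for the first four failing evaluations and fires on the fifth, 60 seconds after the condition first held. A short outage that recovers within the minute never reaches the firing state.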

Alertmanager:

Alertmanager is a component of the Prometheus monitoring system that handles alerts sent by client applications such as Prometheus server or other monitoring tools. Its main purpose is to receive, group, deduplicate, and route alerts to the appropriate receivers, such as email, Slack, PagerDuty, or other notification mechanisms.

Alertmanager can be configured to group similar alerts together, silence specific alerts for a period of time, and route alerts to different receivers based on their severity, source, or other attributes. It also provides a web interface for viewing and managing alerts, as well as a set of APIs for integrating with other systems.
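Grouping can be pictured as bucketing alerts by a chosen set of label names, which corresponds to the group_by setting on an Alertmanager route. The sketch below is a simplified model for illustration only, not how Alertmanager is implemented; the alert dictionaries and label values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_alerts(alerts: List[dict],
                 group_by: Sequence[str]) -> Dict[Tuple[str, ...], List[dict]]:
    """Bucket alerts by the values of the group_by labels."""
    groups: Dict[Tuple[str, ...], List[dict]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts: two instances of one job down, plus one other job.
alerts = [
    {"labels": {"alertname": "InstanceDown", "job": "prometheus", "instance": "a:9090"}},
    {"labels": {"alertname": "InstanceDown", "job": "prometheus", "instance": "b:9090"}},
    {"labels": {"alertname": "InstanceDown", "job": "Fast-api", "instance": "c:8097"}},
]

for key, members in group_alerts(alerts, ["alertname", "job"]).items():
    print(key, len(members))
```

With this grouping, the two alerts for the same job would be delivered in one notification instead of two.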

We have to create a webhook for Slack and set it in the Alertmanager config file. Alertmanager then sends a notification automatically whenever the alert rule conditions are met.


global:
  resolve_timeout: 1m
  slack_api_url: >-
    https://hooks.slack.com/services/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
route:
  receiver: slack-notifications
receivers:
- name: slack-notifications
  slack_configs:
    - channel: '#testing'
      send_resolved: true
      title: >-
        [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing
        | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{
        .CommonLabels.job }}

        {{- if gt (len .CommonLabels) (len .GroupLabels) -}}
          {{" "}}(
          {{- with .CommonLabels.Remove .GroupLabels.Names }}
            {{- range $index, $label := .SortedPairs -}}
              {{ if $index }}, {{ end }}
              {{- $label.Name }}="{{ $label.Value -}}"
            {{- end }}
          {{- end -}}
          )
        {{- end }}
      text: >-
        {{ range .Alerts -}} *Alert:* {{ .Annotations.summary }}{{ if
        .Labels.severity }} - {{ .Labels.severity }} {{ end }}

        *Description:* {{ .Annotations.description }}

        *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* {{ .Value }} 
          {{ end }}
        {{ end }}
  
The Alertmanager web interface is available at http://localhost:9093
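To check the Slack route without waiting for a real outage, a test alert can be posted to Alertmanager's v2 API (POST /api/v2/alerts). The sketch below only builds the request; the label and annotation values are made up for the test, and the send step is left as a commented-out call because it needs a running Alertmanager.

```python
import json
from datetime import datetime, timezone
from urllib.request import Request, urlopen


def build_test_alert_request(alertmanager_url: str, labels: dict,
                             annotations: dict) -> Request:
    """Build a POST request carrying one alert for /api/v2/alerts."""
    body = [{
        "labels": labels,
        "annotations": annotations,
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]
    return Request(
        f"{alertmanager_url.rstrip('/')}/api/v2/alerts",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_test_alert_request(
    "http://localhost:9093",
    labels={"alertname": "InstanceDown", "severity": "critical", "job": "manual-test"},
    annotations={"summary": "Manual test alert", "description": "Sent by hand."},
)
print(request.full_url)

# To actually send it (requires Alertmanager to be reachable):
# urlopen(request, timeout=5)
```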

Grafana

  • Open the browser and navigate to this URL: http://localhost:3000
  • The default username and password for Grafana are “admin/admin”

Steps to Create Dashboard on Grafana:

Here are the step-by-step instructions to create a Grafana dashboard using Prometheus as the data source:

  1. First, you need to have Grafana and Prometheus running. You can install them on the same server or separate servers. Once installed, make sure that both services are up and running.

  2. Open Grafana in your web browser and log in using your credentials.

  3. In the Grafana web interface, click on the "+" icon in the left sidebar and select "Dashboard" from the dropdown menu.

  4. On the new dashboard screen, click on the "Add Query" button to add a data source. Select Prometheus from the list of available data sources.

  5. In the "Query" field, enter a Prometheus query that you want to use for the dashboard. For example, you can use the following query to display the CPU usage of a server:

     100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

  6. Click on the "Run Query" button to test the query and make sure that it returns the expected results.

  7. If the query returns the expected results, click on the "Save Dashboard" button to save the dashboard.

  8. Give your dashboard a name and click on the "Save" button to save it.

  9. Your new dashboard will now be displayed in the Grafana web interface. You can customize it further by adding new panels, changing the layout, or applying different visualization options.
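The arithmetic behind the CPU usage query can be worked through by hand. node_cpu_seconds_total with mode="idle" counts the seconds each core has spent idle, so its per-second rate is the idle fraction of that core. The sketch below mirrors the query's steps on made-up numbers: a rate from the last two samples per core (what irate does), an average across cores, then 100 minus the idle percentage.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, idle counter value)


def cpu_usage_percent(idle_samples: List[Tuple[Sample, Sample]]) -> float:
    """Compute 100 - avg(idle rate) * 100 from two samples per core."""
    rates = [(v1 - v0) / (t1 - t0) for (t0, v0), (t1, v1) in idle_samples]
    return 100 - (sum(rates) / len(rates)) * 100


# Made-up data: two cores sampled 15 seconds apart.
# Core 0 was idle for 12 of the 15 seconds, core 1 for 9 of them.
samples = [
    ((0.0, 1000.0), (15.0, 1012.0)),
    ((0.0, 2000.0), (15.0, 2009.0)),
]
print(round(cpu_usage_percent(samples), 1))  # 30.0
```

Here the idle fractions are 0.8 and 0.6, their average is 0.7, and the resulting CPU usage is about 30 percent.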

Slack Notification:

The screenshot shows what the Slack notification looks like:

Steps to create a webhook on Slack:

Here are the step-by-step instructions to create a Slack webhook:

  1. Log in to your Slack workspace and go to the "Apps" page.

  2. Search for "Incoming Webhooks" in the search bar and select it from the results.

  3. Click the "Add to Slack" button.

  4. Choose the channel where you want to post messages with the webhook and click "Add Incoming Webhooks Integration".

  5. On the next page, you will see a "Webhook URL" field. This is the URL you will use to post messages to the selected channel. You can customize the name and icon of the webhook to better identify it in the channel.

  6. Scroll down to the "Integration Settings" section, where you can customize other settings for the webhook, such as the default username and avatar for the messages.

  7. Once you have configured the webhook to your liking, click the "Save Settings" button.
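Once the webhook exists, it can be tried out independently of Alertmanager. Slack incoming webhooks accept a JSON body with a "text" field sent by HTTP POST. The sketch below builds such a request; the webhook URL is a placeholder, not a real one, and the send step is left commented out so nothing is posted by accident.

```python
import json
from urllib.request import Request, urlopen


def build_slack_request(webhook_url: str, text: str) -> Request:
    """Build a POST request that sends a plain text message to a webhook."""
    return Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; substitute the "Webhook URL" value from the Slack settings page.
request = build_slack_request(
    "https://hooks.slack.com/services/T000/B000/XXXX",
    "Test message from the monitoring setup",
)
print(request.get_method(), request.data.decode("utf-8"))

# To actually post the message:
# urlopen(request, timeout=5)
```

If the message appears in the chosen channel, the same URL can be used as slack_api_url in the Alertmanager configuration.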
