Monitoring and Alerting

Overview

Although Milvus is highly available, you should still actively monitor the overall performance of a system running in production, and create alerting rules that promptly notify you of events that require investigation or intervention.

Monitoring solution

Milvus uses Prometheus to store and monitor its metrics and Grafana to visualize data.

Prometheus

Prometheus is a system monitoring and alerting toolkit with a multi-dimensional data model and a flexible query language.

The Prometheus ecosystem consists of multiple components, of which the following are used in Milvus:

  • Prometheus server, which scrapes and stores time series data.
  • Client libraries for instrumenting application metrics.
  • Alertmanager for alert handling.
  • Pushgateway to allow short-lived, batch metrics, which may not be scraped in time, to be exposed to Prometheus.

Milvus collects monitoring data and pushes it to Pushgateway. The Prometheus server periodically pulls the data from Pushgateway and saves it to its time-series database. The following diagram shows how Prometheus works in Milvus:

[Figure: how Prometheus works in Milvus]
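
The push-then-scrape flow described above can be sketched in Python. This is only an illustration, not the way Milvus itself is implemented: the helper names, the metric name demo_metric, and the job name demo_job are hypothetical. The sketch relies on two things Pushgateway does provide: the /metrics/job/<job> push endpoint and the Prometheus text exposition format.

```python
from typing import Dict, Optional
from urllib import request
from urllib.parse import quote


def render_gauge(name: str, value: float,
                 labels: Optional[Dict[str, str]] = None) -> str:
    """Render one gauge sample in the Prometheus text exposition format.

    Assumes label values contain no characters that need escaping.
    """
    label_text = ""
    if labels:
        pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        label_text = "{" + pairs + "}"
    # The body must end with a newline.
    return f"# TYPE {name} gauge\n{name}{label_text} {value}\n"


def pushgateway_url(host: str, port: int, job: str) -> str:
    """Build the push URL; the job name acts as the grouping key."""
    return f"http://{host}:{port}/metrics/job/{quote(job, safe='')}"


def push(host: str, port: int, job: str, body: str) -> int:
    """PUT the metrics to Pushgateway, replacing the job's previous group."""
    req = request.Request(
        pushgateway_url(host, port, job),
        data=body.encode("utf-8"),
        method="PUT",
    )
    with request.urlopen(req, timeout=5) as resp:
        return resp.status
```

With Pushgateway listening on 127.0.0.1:9091, calling push("127.0.0.1", 9091, "demo_job", render_gauge("demo_metric", 1.0)) would send one sample, which the Prometheus server then collects the next time it scrapes Pushgateway.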

Grafana

Grafana is an open-source platform for time-series analytics. Milvus uses it to visualize various performance metrics:

[Figure: Grafana dashboard for Milvus]

Events to create alert rules

Active monitoring helps you identify problems early, but you also need alerting rules that promptly notify you of events that require investigation or intervention.

This section includes the most important events for which you must create alerting rules.

Server is down

  • Rule: Send an alert when the Milvus server is down.
  • How to detect: If the Milvus server is down, the monitoring dashboard displays No Data.
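
The same check can be scripted against the Prometheus HTTP API, where an empty result set is the counterpart of a panel showing No Data. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the query expression you pass in must name a metric that your Milvus deployment actually reports.

```python
import json
from typing import Any, Dict
from urllib import request
from urllib.parse import urlencode


def instant_query(prometheus_url: str, expr: str) -> Dict[str, Any]:
    """Run an instant query through the Prometheus HTTP API (/api/v1/query)."""
    url = f"{prometheus_url}/api/v1/query?{urlencode({'query': expr})}"
    with request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def has_no_data(response: Dict[str, Any]) -> bool:
    """Return True when a successful query yields an empty result set."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    return len(response["data"]["result"]) == 0
```

For example, has_no_data(instant_query("http://localhost:9090", "<metric reported by Milvus>")) returning True would be a reason to send the alert.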

CPU/GPU temperature is too high

  • Rule: Send an alert when the CPU/GPU temperature exceeds 80 degrees Celsius.
  • How to detect: Check the metrics CPU Temperature and GPU Temperature on the monitoring dashboard.
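
To illustrate the comparison behind this rule, the sketch below filters an instant-query response from the Prometheus HTTP API for samples above 80 degrees Celsius. It is an assumption-laden example: the function name is hypothetical, and the response must come from a query on whichever temperature metrics your dashboard uses. In a real deployment the threshold would normally live in the Prometheus alerting rules file rather than in a script.

```python
from typing import Any, Dict, List, Tuple

TEMPERATURE_LIMIT_CELSIUS = 80.0  # threshold taken from the rule above


def samples_over_limit(
    response: Dict[str, Any],
    limit: float = TEMPERATURE_LIMIT_CELSIUS,
) -> List[Tuple[Dict[str, str], float]]:
    """Return (labels, value) pairs whose value exceeds the limit.

    Expects the JSON body of a successful /api/v1/query call that
    returned an instant vector.
    """
    hot = []
    for series in response["data"]["result"]:
        _timestamp, raw_value = series["value"]
        value = float(raw_value)  # Prometheus encodes sample values as strings
        if value > limit:
            hot.append((series["metric"], value))
    return hot
```

An empty return value means that no CPU or GPU in the response is above the limit.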

Use Prometheus and Alertmanager

Milvus generates detailed time series metrics. This page shows you how to pull these metrics into Prometheus, and how to connect Grafana and Alertmanager to Prometheus for flexible data visualizations and notifications.

Before you begin

  • Make sure you have already started a Milvus server and enabled the monitoring function.

Install Prometheus

  1. Download the Prometheus tarball for your OS.
  2. Go to the Prometheus file directory, and make sure Prometheus is installed successfully:

    $ ./prometheus --version
    
    You can add the path to Prometheus to PATH. This makes it easy to start Prometheus from any shell.

Configure and start Prometheus

  1. Start Pushgateway:

    ./pushgateway
    
    You must start Pushgateway before starting the Milvus server.
  2. Enable the Prometheus monitor in server_config.yaml and set the address and port number of Pushgateway:

    metric:
      enable: true       # Set the value to true to enable the Prometheus monitor.
      address: 127.0.0.1 # Set the IP address of Pushgateway.
      port: 9091         # Set the port number of Pushgateway.
    
  3. Go to the Prometheus root directory, and download the starter Prometheus configuration file for Milvus:

    $ wget https://raw.githubusercontent.com/milvus-io/docs/v0.10.0/assets/monitoring/prometheus.yml -O prometheus.yml
    
  4. Configure the file to suit your requirements. Refer to https://prometheus.io/docs/prometheus/latest/configuration/configuration/ to learn more about the configuration file for Prometheus.

    Note: If you use a distributed cluster, you must expand the targets field to include localhost:<http-port> for each additional node in the cluster.

  5. Download starter alerting rules for Milvus to the Prometheus root directory:

    wget -P rules https://raw.githubusercontent.com/milvus-io/docs/v0.10.0/assets/monitoring/alert_rules.yml
    
  6. Edit the Prometheus configuration file according to your needs:

    • global: Configures parameters such as scrape_interval and evaluation_interval.
    global:
      scrape_interval:     2s # Set the scrape interval to 2s.
      evaluation_interval: 2s # Set the evaluation interval to 2s.
    
    • alerting: Sets the address and port of Alertmanager.
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ['localhost:9093']
    
    • rule_files: Specifies the file that defines the alerting rules.
    rule_files:
      - "alert_rules.yml"
    
    • scrape_configs: Sets job_name and targets for scraping data.
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
    
    - job_name: 'pushgateway'
      honor_labels: true
      static_configs:
      - targets: ['localhost:9091']
    
    See Prometheus Configuration for more information about the configuration file of Prometheus.
  7. Start Prometheus:

    ./prometheus --config.file=prometheus.yml
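
Once Prometheus is running, you may want to confirm that both scrape jobs from the configuration above are healthy. The sketch below reads the target list from the Prometheus HTTP API (/api/v1/targets) and reports any expected job without a healthy target. The function names are hypothetical; the job names prometheus and pushgateway are the ones used in scrape_configs above.

```python
import json
from typing import Any, Dict, Iterable, List
from urllib import request

EXPECTED_JOBS = ("prometheus", "pushgateway")  # job names from scrape_configs


def fetch_targets(prometheus_url: str) -> Dict[str, Any]:
    """Fetch the current scrape targets from the Prometheus HTTP API."""
    with request.urlopen(f"{prometheus_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def unhealthy_jobs(
    targets_response: Dict[str, Any],
    expected_jobs: Iterable[str] = EXPECTED_JOBS,
) -> List[str]:
    """List expected jobs that have no active target with health 'up'."""
    healthy = {
        target["labels"]["job"]
        for target in targets_response["data"]["activeTargets"]
        if target["health"] == "up"
    }
    return [job for job in expected_jobs if job not in healthy]
```

For example, unhealthy_jobs(fetch_targets("http://localhost:9090")) should return an empty list when Prometheus can scrape both itself and Pushgateway.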
    

Configure Prometheus in Kubernetes

  1. Start up Pushgateway and Prometheus.
  2. On the node to monitor in the Kubernetes cluster, set the following in server_config.yaml:

    metric:
      enable: true       # Enable the Prometheus monitor.
      address: 127.0.0.1 # Set the IP address of Pushgateway.
      port: 9091         # Set the port number of Pushgateway.
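
Because a typo in these three fields silently disables monitoring, a small sanity check can help before you restart Milvus. The sketch below is hypothetical and assumes the metric section has already been loaded into a Python dictionary by a YAML parser of your choice (not shown here).

```python
import ipaddress
from typing import Any, Dict, List


def check_metric_section(metric: Dict[str, Any]) -> List[str]:
    """Return the problems found in the 'metric' section; empty if none."""
    problems = []
    if metric.get("enable") is not True:
        problems.append("enable must be true to turn on the Prometheus monitor")
    try:
        ipaddress.ip_address(str(metric.get("address")))
    except ValueError:
        problems.append("address must be a valid IP address")
    port = metric.get("port")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(port, int) or isinstance(port, bool) or not 1 <= port <= 65535:
        problems.append("port must be an integer between 1 and 65535")
    return problems
```

An empty list means the section has the shape shown above; it does not prove that Pushgateway is reachable at that address.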

Visualize metrics in Grafana

  1. Use the following command to start Grafana in a Docker container:

    $ docker run -i -p 3000:3000 grafana/grafana
    
  2. Use your browser to open http://<hostname of machine running grafana>:3000 and log into the Grafana UI.

    Grafana's default username and password are both "admin". You can also create a Grafana account of your own.
  3. Add Prometheus as a data source.
  4. In the Grafana UI, click Configuration > Data Sources > Prometheus, and then configure the data source as follows:

    Field     Definition
    Name      Prometheus
    Default   True
    URL       http://<hostname of machine running prometheus>:9090
    Access    Browser
  5. Download the starter Grafana dashboard for Milvus:

    $ wget https://raw.githubusercontent.com/milvus-io/docs/v0.10.0/assets/monitoring/dashboard.json
    
  6. Add the dashboard to Grafana.
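
If you prefer to script the data source setup instead of clicking through the UI, Grafana's HTTP API accepts the same fields at /api/datasources. The following is a sketch under assumptions: the function names are hypothetical, the default admin credentials are still in place, and your Grafana version still supports the Browser access mode (sent to the API as "direct").

```python
import base64
import json
from typing import Any, Dict
from urllib import request


def datasource_payload(prometheus_host: str) -> Dict[str, Any]:
    """Build the request body that mirrors the field table above."""
    return {
        "name": "Prometheus",
        "type": "prometheus",
        "url": f"http://{prometheus_host}:9090",
        "access": "direct",  # "Browser" in the UI; "proxy" is "Server"
        "isDefault": True,
    }


def create_datasource(grafana_url: str, payload: Dict[str, Any],
                      user: str = "admin", password: str = "admin") -> int:
    """POST the data source definition to Grafana using basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
    with request.urlopen(req, timeout=5) as resp:
        return resp.status
```

For example, create_datasource("http://localhost:3000", datasource_payload("localhost")) would register a Prometheus server running on the same machine as Grafana.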

Send notifications with Alertmanager

  1. Download the latest Alertmanager tarball for your OS.
  2. Ensure that Alertmanager is properly installed:

    $ alertmanager --version
    
    You can add the path to Alertmanager to PATH. This makes it easy to start Alertmanager from any shell.
  3. Create the Alertmanager configuration file to specify the desired receivers for notifications, and add it to the Alertmanager root directory.
  4. Start the Alertmanager server, with the --config.file flag pointing to the configuration file:

    alertmanager --config.file=simple.yml
    
  5. Use your browser to open http://<hostname of machine running alertmanager>:9093, and use the Alertmanager UI to define rules for muting alerts.
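
To check that your receivers actually deliver notifications, you can post a test alert directly to Alertmanager. The sketch below is hypothetical: the function names, the alert name MilvusTestAlert, and the label values are placeholders, and it assumes an Alertmanager version that serves the v2 HTTP API at /api/v2/alerts.

```python
import json
from typing import Any, Dict, List
from urllib import request


def build_alert(alertname: str, severity: str, summary: str) -> List[Dict[str, Any]]:
    """Build the JSON body for one alert; the API expects a list of alerts."""
    return [
        {
            "labels": {"alertname": alertname, "severity": severity},
            "annotations": {"summary": summary},
        }
    ]


def send_alert(alertmanager_url: str, alerts: List[Dict[str, Any]]) -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = request.Request(
        f"{alertmanager_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=5) as resp:
        return resp.status
```

For example, send_alert("http://localhost:9093", build_alert("MilvusTestAlert", "warning", "Test notification")) should make the alert appear in the Alertmanager UI and trigger the configured receiver.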

