Load metrics from Prometheus server

View on TensorFlow.org Run in Google Colab View source on GitHub Download notebook

Overview

This tutorial loads CoreDNS metrics from a Prometheus server into a tf.data.Dataset, then uses tf.keras for training and inference.

CoreDNS is a DNS server with a focus on service discovery, and is widely deployed as a part of the Kubernetes cluster. For that reason it is often closely monitoring by devops operations.

This tutorial is an example that could be used by devops looking for automation in their operations through machine learning.

Setup and usage

Install required tensorflow-io package, and restart runtime

import os
try:
  %tensorflow_version 2.x
except Exception:
  pass
TensorFlow 2.x selected.
pip install tensorflow-io
from datetime import datetime

import tensorflow as tf
import tensorflow_io as tfio

Install and setup CoreDNS and Prometheus

For demo purposes, a CoreDNS server locally with port 9053 open to receive DNS queries and port 9153 (defult) open to expose metrics for scraping. The following is a basic Corefile configuration for CoreDNS and is available to download:

.:9053 {
  prometheus
  whoami
}

More details about installation could be found on CoreDNS's documentation.

curl -s -OL https://github.com/coredns/coredns/releases/download/v1.6.7/coredns_1.6.7_linux_amd64.tgz
tar -xzf coredns_1.6.7_linux_amd64.tgz

curl -s -OL https://raw.githubusercontent.com/tensorflow/io/master/docs/tutorials/prometheus/Corefile

cat Corefile
.:9053 {
  prometheus
  whoami
}
# Run `./coredns` as a background process.
# IPython doesn't recognize `&` in inline bash cells.
get_ipython().system_raw('./coredns &')

The next step is to setup Prometheus server and use Prometheus to scrape CoreDNS metrics that are exposed on port 9153 from above. The prometheus.yml file for configuration is also available for download:

curl -s -OL https://github.com/prometheus/prometheus/releases/download/v2.15.2/prometheus-2.15.2.linux-amd64.tar.gz
tar -xzf prometheus-2.15.2.linux-amd64.tar.gz --strip-components=1

curl -s -OL https://raw.githubusercontent.com/tensorflow/io/master/docs/tutorials/prometheus/prometheus.yml

cat prometheus.yml
global:
  scrape_interval:     1s
  evaluation_interval: 1s
alerting:
  alertmanagers:

  - static_configs:
    - targets:
rule_files:
scrape_configs:
- job_name: 'prometheus'
  static_configs:
  - targets: ['localhost:9090']
- job_name: "coredns"
  static_configs:
  - targets: ['localhost:9153']
# Run `./prometheus` as a background process.
# IPython doesn't recognize `&` in inline bash cells.
get_ipython().system_raw('./prometheus &')

In order to show some activity, dig command could be used to generate a few DNS queries against the CoreDNS server that has been setup:

sudo apt-get install -y -qq dnsutils
dig @127.0.0.1 -p 9053 demo1.example.org
; <<>> DiG 9.11.3-1ubuntu1.11-Ubuntu <<>> @127.0.0.1 -p 9053 demo1.example.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53868
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 3
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 855234f1adcb7a28 (echoed)
;; QUESTION SECTION:
;demo1.example.org.     IN  A

;; ADDITIONAL SECTION:
demo1.example.org.  0   IN  A   127.0.0.1
_udp.demo1.example.org. 0   IN  SRV 0 0 45361 .

;; Query time: 0 msec
;; SERVER: 127.0.0.1#9053(127.0.0.1)
;; WHEN: Tue Mar 03 22:35:20 UTC 2020
;; MSG SIZE  rcvd: 132
dig @127.0.0.1 -p 9053 demo2.example.org
; <<>> DiG 9.11.3-1ubuntu1.11-Ubuntu <<>> @127.0.0.1 -p 9053 demo2.example.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53163
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 3
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: f18b2ba23e13446d (echoed)
;; QUESTION SECTION:
;demo2.example.org.     IN  A

;; ADDITIONAL SECTION:
demo2.example.org.  0   IN  A   127.0.0.1
_udp.demo2.example.org. 0   IN  SRV 0 0 42194 .

;; Query time: 0 msec
;; SERVER: 127.0.0.1#9053(127.0.0.1)
;; WHEN: Tue Mar 03 22:35:21 UTC 2020
;; MSG SIZE  rcvd: 132

Now a CoreDNS server whose metrics are scraped by a Prometheus server and ready to be consumed by TensorFlow.

Create Dataset for CoreDNS metrics and use it in TensorFlow

Create a Dataset for CoreDNS metrics that is available from PostgreSQL server, could be done with tfio.experimental.IODataset.from_prometheus. At the minimium two arguments are needed. query is passed to Prometheus server to select the metrics and length is the period you want to load into Dataset.

You can start with "coredns_dns_request_count_total" and "5" (secs) to create the Dataset below. Since earlier in the tutorial two DNS queries were sent, it is expected that the metrics for "coredns_dns_request_count_total" will be "2.0" at the end of the time series:

dataset = tfio.experimental.IODataset.from_prometheus(
      "coredns_dns_request_count_total", 5, endpoint="http://localhost:9090")


print("Dataset Spec:\n{}\n".format(dataset.element_spec))

print("CoreDNS Time Series:")
for (time, value) in dataset:
  # time is milli second, convert to data time:
  time = datetime.fromtimestamp(time // 1000)
  print("{}: {}".format(time, value['coredns']['localhost:9153']['coredns_dns_request_count_total']))
Dataset Spec:
(TensorSpec(shape=(), dtype=tf.int64, name=None), {'coredns': {'localhost:9153': {'coredns_dns_request_count_total': TensorSpec(shape=(), dtype=tf.float64, name=None)} } })

CoreDNS Time Series:
2020-03-03 22:35:17: 2.0
2020-03-03 22:35:18: 2.0
2020-03-03 22:35:19: 2.0
2020-03-03 22:35:20: 2.0
2020-03-03 22:35:21: 2.0

Further looking into the spec of the Dataset:

(
  TensorSpec(shape=(), dtype=tf.int64, name=None),
  {
    'coredns': {
      'localhost:9153': {
        'coredns_dns_request_count_total': TensorSpec(shape=(), dtype=tf.float64, name=None)
      }
    }
  }
)

It is obvious that the dataset consists of a (time, values) tuple where the values field is a python dict expanded into:

"job_name": {
  "instance_name": {
    "metric_name": value,
  },
}

In the above example, 'coredns' is the job name, 'localhost:9153' is the instance name, and 'coredns_dns_request_count_total' is the metric name. Note that depending on the Prometheus query used, it is possible that multiple jobs/instances/metrics could be returned. This is also the reason why python dict has been used in the structure of the Dataset.

Take another query "go_memstats_gc_sys_bytes" as an example. Since both CoreDNS and Prometheus are written in Golang, "go_memstats_gc_sys_bytes" metric is available for both "coredns" job and "prometheus" job:

dataset = tfio.experimental.IODataset.from_prometheus(
    "go_memstats_gc_sys_bytes", 5, endpoint="http://localhost:9090")

print("Time Series CoreDNS/Prometheus Comparision:")
for (time, value) in dataset:
  # time is milli second, convert to data time:
  time = datetime.fromtimestamp(time // 1000)
  print("{}: {}/{}".format(
      time,
      value['coredns']['localhost:9153']['go_memstats_gc_sys_bytes'],
      value['prometheus']['localhost:9090']['go_memstats_gc_sys_bytes']))
Time Series CoreDNS/Prometheus Comparision:
2020-03-03 22:35:17: 2385920.0/2775040.0
2020-03-03 22:35:18: 2385920.0/2775040.0
2020-03-03 22:35:19: 2385920.0/2775040.0
2020-03-03 22:35:20: 2385920.0/2775040.0
2020-03-03 22:35:21: 2385920.0/2775040.0

The created Dataset is ready to be passed to tf.keras directly for either training or inference purposes now.

Use Dataset for model training

With metrics Dataset created, it is possible to directly pass the Dataset to tf.keras for model training or inference.

For demo purposes, this tutorial will just use a very simple LSTM model with 1 feature and 2 steps as input:

n_steps, n_features = 2, 1
simple_lstm_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(8, input_shape=(n_steps, n_features)),
    tf.keras.layers.Dense(1)
])

simple_lstm_model.compile(optimizer='adam', loss='mae')

The dataset to be used is the value of 'go_memstats_sys_bytes' for CoreDNS with 10 samples. However, since a sliding window of window=n_steps and shift=1 are formed, additional samples are needed (for any two consecute elements, the first is taken as x and the second is taken as y for training). The total is 10 + n_steps - 1 + 1 = 12 seconds.

The data value is also scaled to [0, 1].

n_samples = 10

dataset = tfio.experimental.IODataset.from_prometheus(
    "go_memstats_sys_bytes", n_samples + n_steps - 1 + 1, endpoint="http://localhost:9090")

# take go_memstats_gc_sys_bytes from coredns job 
dataset = dataset.map(lambda _, v: v['coredns']['localhost:9153']['go_memstats_sys_bytes'])

# find the max value and scale the value to [0, 1]
v_max = dataset.reduce(tf.constant(0.0, tf.float64), tf.math.maximum)
dataset = dataset.map(lambda v: (v / v_max))

# expand the dimension by 1 to fit n_features=1
dataset = dataset.map(lambda v: tf.expand_dims(v, -1))

# take a sliding window
dataset = dataset.window(n_steps, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda d: d.batch(n_steps))


# the first value is x and the next value is y, only take 10 samples
x = dataset.take(n_samples)
y = dataset.skip(1).take(n_samples)

dataset = tf.data.Dataset.zip((x, y))

# pass the final dataset to model.fit for training
simple_lstm_model.fit(dataset.batch(1).repeat(10),  epochs=5, steps_per_epoch=10)
Train for 10 steps
Epoch 1/5
10/10 [==============================] - 2s 150ms/step - loss: 0.8484
Epoch 2/5
10/10 [==============================] - 0s 10ms/step - loss: 0.7808
Epoch 3/5
10/10 [==============================] - 0s 10ms/step - loss: 0.7102
Epoch 4/5
10/10 [==============================] - 0s 11ms/step - loss: 0.6359
Epoch 5/5
10/10 [==============================] - 0s 11ms/step - loss: 0.5572
<tensorflow.python.keras.callbacks.History at 0x7f1758f3da90>

The trained model above is not very useful in reality, as the CoreDNS server that has been setup in this tutorial does not have any workload. However, this is a working pipeline that could be used to load metrics from true production servers. The model could then be improved to solve the real-world problem of devops automation.