OpenMetrics and the future of the Prometheus exposition format

23 Aug 2018 - Tags: go, open source, metrics, prometheus, influxdb, openmetrics, cncf, oss, exposition format, snmp

Who am I to tell you about the future of the Prometheus exposition format? Nobody!

I was at PromCon in Munich in August 2018 and I found the conference great! A lot of use cases about metrics, monitoring and Prometheus itself. I work at InfluxData and we were there as a sponsor, but I followed a lot of talks and I had the chance to attend the developer summit the next day with a lot of Prometheus maintainers. Really good conversations!

To be honest, my scope a few years ago was very different. I was working in PHP, writing web applications that, yes, I was deploying, but I wasn’t digging too much around them, and I was not smart enough to understand that the whole pull vs push debate was just garbage. Smoke in the eyes that luckily I left behind pretty soon, because I had the chance to meet smart people who drove me out of it.

Providing a comfortable way for me to expose and store metrics is a vital requirement, and the library needs to expose the RIGHT data; it doesn’t matter whether it is pushing or pulling.

RIGHT means the best I can get to have more observability from an ops point of view, but also from a business intelligence perspective, probably just by manipulating the same data again.

It is safe to say that a pull-based exposition format is easy to put together because it works even if the server that should scrape the exposed endpoint is unavailable, or even if nothing will ever scrape it. A push-based service will always create some network noise even if nobody is interested in getting the metrics.
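
To make the pull model concrete, here is a minimal sketch that uses only the Go standard library instead of the official Prometheus client. The names `requestCount`, `recordRequest`, `renderMetrics`, `metricsHandler` and the metric `app_requests_total` are all made up for illustration. The point is that the application only renders text when somebody asks for it; if nobody scrapes, nothing happens on the network.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// requestCount is a hypothetical counter owned by the application.
var requestCount int64

// recordRequest is what the application calls whenever it handles a request.
func recordRequest() { atomic.AddInt64(&requestCount, 1) }

// renderMetrics prints the current counter value in the Prometheus text format.
func renderMetrics() string {
	return fmt.Sprintf(
		"# HELP app_requests_total The total number of handled requests.\n"+
			"# TYPE app_requests_total counter\n"+
			"app_requests_total %d\n",
		atomic.LoadInt64(&requestCount),
	)
}

// metricsHandler does work only when a scraper actually asks for the metrics.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, renderMetrics())
}

func main() {
	recordRequest()
	recordRequest()

	// Simulate a single scrape in-process so the example terminates.
	// A real service would instead register the handler with
	// http.HandleFunc("/metrics", metricsHandler) and call http.ListenAndServe.
	rec := httptest.NewRecorder()
	metricsHandler(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```

In a real project you would most likely reach for the official client library rather than formatting the text by hand; this sketch is only meant to show how little the pull side asks of the application.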

Back in the day we had SNMP, but other than being an internet standard its adoption is not comparable with the Prometheus one. If we add how old it is and how fast Prometheus grew, the situation gets even worse.

.1.0.0.0.1.1.0 octet_str "foo"
.1.0.0.0.1.1.1 octet_str "bar"
.1.0.0.0.1.102 octet_str "bad"
.1.0.0.0.1.2.0 integer 1
.1.0.0.0.1.2.1 integer 2
.1.0.0.0.1.3.0 octet_str "0.123"
.1.0.0.0.1.3.1 octet_str "0.456"
.1.0.0.0.1.3.2 octet_str "9.999"
.1.0.0.1.1 octet_str "baz"
.1.0.0.1.2 uinteger 54321
.1.0.0.1.3 uinteger 234

It also started as a format for exposing network data, so it doesn’t express other kinds of metrics very well.

The Prometheus exposition format is extremely valuable. I recently instrumented a legacy application using the Prometheus SDK, and my code looks a lot cleaner and more readable.

At the beginning I was using logs as the transport layer for my metrics and time series, but I ended up with a lot of spam in the logs themselves because I was also streaming a lot of “not logs but metrics” garbage.

The link to the Prometheus doc above is the best place to start; here I am just copy-pasting something from there:

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"}    3 1395066363000

# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9

# Minimalistic line:
metric_without_timestamp_and_labels 12.47

# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045

# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693
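
The histogram in the sample above is the part that usually confuses people: the `le` buckets are cumulative, so each bucket also counts everything from the smaller ones. Here is a small sketch of how you could read it, with the values copied from the sample. The `bucket` type and the helpers `sampleBuckets`, `perBucketCounts` and `mean` are hypothetical names, not part of any Prometheus library.

```go
package main

import "fmt"

// bucket is one cumulative histogram bucket: count is the number of
// observations less than or equal to the upper bound le.
type bucket struct {
	le    string
	count uint64
}

// sampleBuckets returns the http_request_duration_seconds buckets
// from the exposition sample above.
func sampleBuckets() []bucket {
	return []bucket{
		{"0.05", 24054},
		{"0.1", 33444},
		{"0.2", 100392},
		{"0.5", 129389},
		{"1", 133988},
		{"+Inf", 144320},
	}
}

// perBucketCounts turns cumulative counts into the number of
// observations that fell into each individual bucket.
func perBucketCounts(cumulative []bucket) []uint64 {
	out := make([]uint64, len(cumulative))
	var prev uint64
	for i, b := range cumulative {
		out[i] = b.count - prev
		prev = b.count
	}
	return out
}

// mean returns sum/count, or 0 when nothing has been observed yet.
func mean(sum float64, count uint64) float64 {
	if count == 0 {
		return 0
	}
	return sum / float64(count)
}

func main() {
	buckets := sampleBuckets()
	for i, n := range perBucketCounts(buckets) {
		fmt.Printf("le=%s: %d observations\n", buckets[i].le, n)
	}
	// _sum and _count from the sample give the average request duration.
	fmt.Printf("mean duration: %.3fs\n", mean(53423, 144320))
}
```

Note that the `+Inf` bucket always matches `_count`, which is a handy sanity check when you write your own exporter.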

Think about it not as the Prometheus way to grab metrics, but as the language your application uses to tell the outside world how it feels.

It is just a plain text endpoint over HTTP that everyone can parse and re-use.
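
To give an idea of how approachable the format is, here is a rough sketch of a parser for a single sample line, written with only the Go standard library. `Sample`, `parseLine` and `parseLabels` are names I made up, the timestamp is ignored, and `# HELP` / `# TYPE` comments are simply skipped, so treat it as an illustration and not as a complete implementation of the format.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Sample is a hypothetical struct holding one parsed line.
type Sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// parseLabels reads `k="v",...}` and returns the labels plus the text after '}'.
func parseLabels(in string) (labels map[string]string, rest string, err error) {
	labels = map[string]string{}
	i := 0
	for {
		for i < len(in) && (in[i] == ',' || in[i] == ' ') {
			i++
		}
		if i >= len(in) {
			return nil, "", fmt.Errorf("unterminated label set")
		}
		if in[i] == '}' {
			return labels, in[i+1:], nil
		}
		eq := strings.IndexByte(in[i:], '=')
		if eq < 0 || i+eq+1 >= len(in) || in[i+eq+1] != '"' {
			return nil, "", fmt.Errorf("malformed label near %q", in[i:])
		}
		name := strings.TrimSpace(in[i : i+eq])
		i += eq + 2 // skip past the `="`
		var val strings.Builder
		closed := false
		for i < len(in) {
			c := in[i]
			i++
			if c == '\\' && i < len(in) {
				if in[i] == 'n' {
					val.WriteByte('\n')
				} else {
					val.WriteByte(in[i]) // covers \\ and \"
				}
				i++
				continue
			}
			if c == '"' {
				closed = true
				break
			}
			val.WriteByte(c)
		}
		if !closed {
			return nil, "", fmt.Errorf("unterminated value for label %q", name)
		}
		labels[name] = val.String()
	}
}

// parseLine parses one line; ok is false for blank lines and comments.
func parseLine(line string) (s Sample, ok bool, err error) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return Sample{}, false, nil
	}
	s.Labels = map[string]string{}
	rest := line
	if open := strings.IndexByte(line, '{'); open >= 0 {
		s.Name = strings.TrimSpace(line[:open])
		if s.Labels, rest, err = parseLabels(line[open+1:]); err != nil {
			return Sample{}, false, err
		}
	} else {
		s.Name = strings.Fields(line)[0]
		rest = strings.TrimPrefix(line, s.Name)
	}
	fields := strings.Fields(rest) // fields[1], if present, is the timestamp
	if len(fields) == 0 {
		return Sample{}, false, fmt.Errorf("missing value in %q", line)
	}
	if s.Value, err = strconv.ParseFloat(fields[0], 64); err != nil {
		return Sample{}, false, err
	}
	return s, true, nil
}

func main() {
	for _, l := range []string{
		`# TYPE http_requests_total counter`,
		`http_requests_total{method="post",code="200"} 1027 1395066363000`,
		`metric_without_timestamp_and_labels 12.47`,
	} {
		if s, ok, err := parseLine(l); err != nil {
			fmt.Println("error:", err)
		} else if ok {
			fmt.Printf("%s %v %g\n", s.Name, s.Labels, s.Value)
		}
	}
}
```

For anything serious you are better off with an existing parser, but it is nice to know that the format is simple enough to handle in a screen of code.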

For example, Kapacitor and Telegraf have specific ways to parse and extract metrics from that URL.

If you don’t have time to write a parser for that, you can use prom2json to get a JSON version of it.

In Go you can dig a bit more into that code and reuse some of its functions, for example:

// FetchMetricFamilies retrieves metrics from the provided URL, decodes them
// into MetricFamily proto messages, and sends them to the provided channel. It
// returns after all MetricFamilies have been sent.
func FetchMetricFamilies(
	url string, ch chan<- *dto.MetricFamily,
	certificate string, key string,
	skipServerCertCheck bool,
) error {
	defer close(ch)
	var transport *http.Transport
	if certificate != "" && key != "" {
		cert, err := tls.LoadX509KeyPair(certificate, key)
		if err != nil {
			return err
		}
		tlsConfig := &tls.Config{
			Certificates:       []tls.Certificate{cert},
			InsecureSkipVerify: skipServerCertCheck,
		}
		tlsConfig.BuildNameToCertificate()
		transport = &http.Transport{TLSClientConfig: tlsConfig}
	} else {
		transport = &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: skipServerCertCheck},
		}
	}
	client := &http.Client{Transport: transport}
	return decodeContent(client, url, ch)
}

FetchMetricFamilies can be used to fill a channel with all the fetched metrics. Once you have the channel you can do whatever you want with it:

mfChan := make(chan *dto.MetricFamily, 1024)

go func() {
    err := prom2json.FetchMetricFamilies(flag.Args()[0], mfChan, *cert, *key, *skipServerCertCheck)
    if err != nil {
        log.Fatal(err)
    }
}()

result := []*prom2json.Family{}
for mf := range mfChan {
    result = append(result, prom2json.NewFamily(mf))
}

As you can see prom2json converts the result to JSON.

It is pretty flexible! And it is a common API to read application status. A common API we all know means automation! Dope automation!

Future

The Prometheus exposition format has grown in adoption across the board, and a couple of people led by Richard are now pushing to make this format a new Internet Standard!

The project is called OpenMetrics and it is a Sandbox project under CNCF.

If you want to follow the project, here is the official repository on GitHub.

It probably looks like just a political step with no value at all from a tech point of view, but I bet that when it becomes a standard and not just “the Prometheus exposition format” we will start to have routers exposing stats over http://192.168.1.1/metrics, and it will be a lot of fun!

It will be obvious that it is not a Prometheus-only feature, and this new group has people from different companies and backgrounds. So the exposition format will probably be not just for operational metrics but more generic.