Metrics Concepts

Definition

Metrics are quantitative measurements that provide insights into the performance and behavior of a system, application, or infrastructure. In the context of observability, metrics typically represent numerical data points collected over time.

These data points can include various types of information, such as resource utilization (CPU, memory), request counts, error rates, and latency. Metrics are fundamental for monitoring and analyzing the health and efficiency of a system, helping teams identify issues, track trends, and make informed decisions to optimize performance and reliability.

In short: metrics are the backbone of any monitoring system. In Prometheus, monitoring data is modeled as time series: streams of timestamped values that each describe one specific aspect of system behavior.

What is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit. It was originally built at SoundCloud, where development started in 2012. Today it is a standalone open-source project, maintained independently of any company.

Official Prometheus Documentation

Prometheus has well-maintained official documentation that covers installation, configuration, querying, and more. Refer to it for a deeper understanding of the tool and how to use it.

Core Principles of Prometheus

Prometheus's design revolves around several key principles:

Multi-dimensional data model: Prometheus stores all data as time series, i.e., streams of timestamped values that belong to the same metric and the same set of labeled dimensions.
Query language: PromQL lets you select and aggregate time-series data in real time, and the dimensional data model is what makes a wide variety of queries possible.
Pull-based metric collection: Instead of relying on applications to push monitoring data, Prometheus collects, or “scrapes”, metric data at regular intervals over HTTP from endpoints that the applications expose.
Visualization: Query results can be graphed in the built-in expression browser or displayed in external dashboarding tools.
Alerting: Flexible alert rules are evaluated against the stored data, and the resulting alerts drive notifications.
Autonomous servers: Each Prometheus server works on its own and does not depend on distributed storage.

Architecture of Prometheus

The Prometheus architecture is relatively simple. Applications expose an HTTP endpoint (often /metrics), and the Prometheus server scrapes these endpoints at a regular interval, storing the results as time-series data.
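To make the application side of this concrete, here is a minimal sketch of a process that exposes a /metrics endpoint using only the Python standard library. The metric names and values, the render_metrics helper, and port 8000 are assumptions chosen for illustration; real applications would normally use a Prometheus client library rather than hand-rolling the output.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process metric values; a real application would update
# these as it does its work.
METRICS = {
    "http_requests_total": 123,
    "process_open_fds": 17,
}


def render_metrics(metrics):
    """Render name/value pairs as one 'name value' sample per line."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the scrape path is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and then requesting http://localhost:8000/metrics would return the current values as plain text, which is the kind of payload a Prometheus server collects on every scrape.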

This data can be queried through an HTTP API and visualized with the built-in expression browser or an external UI such as Grafana. Alerting rules are defined in the Prometheus server, which sends firing alerts to the Alertmanager, a separate component that deduplicates, groups, and routes them as notifications to receivers such as email or chat.

Finding More Information About Prometheus

Given its widespread usage, there's a wealth of documentation available for Prometheus. The official Prometheus documentation is an excellent place to start, providing an overview of concepts, detailed guides, and best practices. For advanced topics, there are numerous blogs, tutorials, and talks available online.

The Prometheus GitHub repository is also worth a visit if you want to follow the latest development or contribute to the project.

Remember, understanding and optimizing the use of Prometheus is an ongoing process. Keep seeking out and researching information, and don't hesitate to consult the community or professional literature if you hit any roadblocks.

Existing tooling

For many existing systems, such as database servers, web servers, and application stacks, there are ready-made tools to export metrics. The official Prometheus documentation contains a (non-exhaustive) list of available exporters.

If no exporter exists for your system, check that system's own documentation: Prometheus support may be built in and only need to be enabled in the configuration.

When using existing tooling, check which metrics and how many series are generated in your setup. More configuration may be required if the number of series is too low or too high.

Types of Metrics

Prometheus defines four core metric types. All of them are described in detail in the official Prometheus documentation.

Counter

A counter is a cumulative metric that represents a single monotonically increasing value: it can only increase, or be reset to zero when the process restarts. Typical examples are the number of requests served, tasks completed, or errors produced.
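The counter semantics can be sketched in a few lines of Python. The Counter class and the increase helper below are hypothetical illustrations, not the API of any Prometheus client library. The reset handling is a simplification of the idea that a drop in a counter's value signals a restart.

```python
class Counter:
    """A value that only goes up (or starts again from zero with the process)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


def increase(samples):
    """Total growth across successive counter readings.

    A drop between two readings is treated as a restart: the counter is
    assumed to have begun again from zero, so the new reading counts in full.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        total += current - previous if current >= previous else current
    return total
```

For the readings 10, 15, 3, 8 the helper adds 5, then 3 (after the assumed restart), then 5. This is why a counter stays useful across restarts even though its raw value drops.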

Gauge

A gauge represents a single numerical value that can arbitrarily go up or down; think of it as taking a snapshot of a system state. Examples include current memory usage, the temperature of a server room, or the number of active requests.

Histogram

A histogram samples observations (such as request durations or response sizes) and counts them in configurable buckets. It also tracks the total count and the sum of all observed values, so you can analyze the distribution of your data and estimate quantiles at query time.
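As a rough sketch of the bucketing idea, the hypothetical Histogram class below keeps cumulative bucket counts, meaning each bucket counts every observation less than or equal to its upper bound, plus a final catch-all bucket. The class name and bucket bounds are assumptions for illustration only.

```python
import math


class Histogram:
    def __init__(self, bounds):
        # Upper bounds in ascending order, with a final catch-all +Inf bucket.
        self.bounds = sorted(bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        self.count += 1
        self.sum += value
        # Buckets are cumulative: an observation counts toward every bucket
        # whose upper bound is greater than or equal to it.
        for i, upper in enumerate(self.bounds):
            if value <= upper:
                self.bucket_counts[i] += 1
```

With bounds 0.1, 0.5, and 1.0, observing 0.05, 0.3, 0.7, and 2.0 leaves the bucket counts at 1, 2, 3, and 4. The last bucket always equals the total number of observations.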

Summary

Like a histogram, a summary samples observations and tracks their total count and sum. The difference is that a summary calculates configurable quantiles on the client side, over a sliding time window, instead of exposing buckets.
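The sliding-window behavior can be illustrated with the sketch below. It is a deliberately naive version: the Summary class is hypothetical, it keeps every recent observation in memory, and it uses the nearest-rank method, whereas real client libraries typically use streaming approximations. Timestamps are passed in explicitly (an assumption made to keep the example deterministic).

```python
import math
from collections import deque


class Summary:
    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.recent = deque()  # (timestamp, value) pairs still inside the window
        self.count = 0         # lifetime count, never expires
        self.sum = 0.0         # lifetime sum, never expires

    def observe(self, value, now):
        self.count += 1
        self.sum += value
        self.recent.append((now, value))
        self._expire(now)

    def _expire(self, now):
        # Drop observations that have fallen out of the sliding window.
        while self.recent and self.recent[0][0] <= now - self.window_seconds:
            self.recent.popleft()

    def quantile(self, q, now):
        self._expire(now)
        values = sorted(v for _, v in self.recent)
        if not values:
            return math.nan
        # Nearest rank: the smallest value with at least a fraction q of the
        # windowed observations at or below it.
        rank = max(1, math.ceil(q * len(values)))
        return values[rank - 1]
```

Note that the count and sum cover the whole lifetime of the process, while the quantiles only reflect observations inside the window.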

Naming metrics

When creating Prometheus metrics, it's essential to adhere to certain naming conventions to ensure consistency and compatibility.

Remember that clear and consistent metric naming is crucial for effective monitoring and querying in Prometheus. Well-named metrics make it easier to write queries and create informative dashboards and alerts.

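Prometheus provides a set of best practices for naming your metrics, such as including the base unit in the name (for example http_request_duration_seconds) and ending counter names with _total.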

If you mostly use existing exporters and instrumentation, the metrics have already been named for you. The only choice you might need to make is whether to add a custom namespace prefix to distinguish your metrics from other metrics with the same name.
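A namespace prefix and a basic name check could look like the sketch below. The helper names are hypothetical, and the pattern is the character set long documented for Prometheus metric names; newer Prometheus releases may accept a wider character set, so treat the check as a conservative assumption.

```python
import re

# Conservative pattern for metric names: letters, digits, underscores, and
# colons, not starting with a digit.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def with_namespace(namespace, name):
    """Prefix a metric name with a namespace, separated by an underscore."""
    return f"{namespace}_{name}"


def is_valid_name(name):
    """Return True if the whole name matches the conservative pattern."""
    return METRIC_NAME.fullmatch(name) is not None
```

For example, with_namespace("ourapp", "http_requests_total") yields ourapp_http_requests_total, which keeps the metric distinct from an identically named metric exported by another component.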

Labels

A metric is identified by its name, and labels are optional key-value pairs that add dimension and context to its values. Each distinct combination of name and label values forms its own time series. For example, labels can categorize a metric by endpoint, status code, or other dimensions, enabling more precise analysis and alerting.

For instance:

http_requests_total{app="ourapp"} 123 tells you that this value is specific to the app named "ourapp".
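The sample line above can be produced by a small formatting helper. The sketch below is a hypothetical illustration: it sorts labels for a stable output and escapes backslashes, double quotes, and newlines in label values, which is how the Prometheus text exposition format expects them to be written.

```python
def escape_label_value(value):
    """Escape the characters that are special inside a quoted label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name, labels, value):
    """Format one sample line: name, optional {label="value",...}, then value."""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"
```

Calling format_sample("http_requests_total", {"app": "ourapp"}, 123) reproduces the example line, and passing an empty label dictionary produces a bare name followed by the value.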

Cardinality

Every unique combination of a metric name and its label values is stored on our platform as a separate series. Take this example:

http_requests_total{app="ourapp", server="app1", status="200", handler="/api/v2/foo"} 509

If we store this metric for four apps, six servers, four statuses, and twenty handlers, and every combination occurs, the result is 4 × 6 × 4 × 20 = 1920 series. That number is not unusual, but make sure that this many series for a single metric is intentional, because each series that actively receives samples counts towards your monthly usage.
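The arithmetic behind that figure is simply the product of the number of distinct values per label. The sketch below assumes the worst case, in which every combination of label values actually occurs; the series_count helper is hypothetical.

```python
import math


def series_count(label_value_counts):
    """Upper bound on series for one metric: product of distinct values per label."""
    return math.prod(label_value_counts.values())


# The example from the text: four apps, six servers, four statuses, twenty handlers.
example = {"app": 4, "server": 6, "status": 4, "handler": 20}
total = series_count(example)

# Adding a single new label with ten distinct values multiplies the total by ten.
with_extra_label = series_count({**example, "region": 10})
```

Here total is 1920 and with_extra_label is 19200, which shows why one extra label (the region label is an invented example) can multiply cardinality quickly.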

Making metrics usable

Each system or application monitored by Prometheus exposes an HTTP endpoint (usually /metrics) that provides the current value of all its metrics. Prometheus servers scrape this endpoint at regular intervals and store the results as time-series data for later analysis. You can then use PromQL, the Prometheus query language, to filter and aggregate this data for alert rules or visualizations.
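To illustrate what a scrape yields, the sketch below parses a scraped text payload into (name, labels, value) tuples. It is a simplified, hypothetical parser, not the one Prometheus uses: it skips comment lines, leaves label values in their escaped form, and ignores any line it cannot match, including lines that carry an optional timestamp.

```python
import re

# name, optional {labels}, then a single value token.
SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$")
# One key="value" pair; the value may contain backslash escapes.
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Turn scraped text into a list of (name, labels, value) samples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments carry no samples
        match = SAMPLE.match(line)
        if match is None:
            continue  # unsupported line shape in this simplified sketch
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Each parsed tuple corresponds to one series at the moment of the scrape. Storing these values with the scrape timestamp, scrape after scrape, is what builds up the time series that PromQL later queries.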