Prometheus Metrics

How to set up and view prometheus metrics for valuable workload insights

Overview

NGINX Service Mesh uses Prometheus for metrics and Grafana for visualizations. Both are included in the installation by default.

Important:
In version 1.5, NGINX Service Mesh won’t deploy Prometheus or Grafana for you anymore. Refer to the Monitoring and Tracing guide for instructions on using your own servers with NGINX Service Mesh.
Note:
You can configure NGINX Service Mesh to use a different Prometheus server when deploying. Refer to the Monitoring and Tracing task for instructions.

The mesh supports the SMI spec, including traffic metrics. The NGINX Service Mesh creates an extension API Server and shim that query Prometheus and return the results in a traffic metrics format. See SMI Traffic Metrics for more information.

Note:
Occasionally metrics are reset when the nginx-mesh-sidecar reloads NGINX Plus. If traffic is flowing and you fail to see metrics, retry after 30 seconds.

If you are deploying NGINX Plus Ingress Controller with the NGINX Service Mesh, make sure to configure the NGINX Plus Ingress Controller to export metrics. Refer to the Metrics section of the NGINX Plus Ingress Controller Deployment tutorial for instructions.

Prometheus Metrics

The NGINX Service Mesh sidecar exposes the following metrics in Prometheus format via the /metrics path on port 8887:

  • NGINX Plus metrics.
  • upstream_server_response_latency_ms: a histogram of upstream server response latencies in milliseconds. The response time is the time from when NGINX establishes a connection to an upstream server to when the last byte of the response body is received by NGINX.

All metrics have the namespace nginxplus, for example nginxplus_http_requests_total and nginxplus_upstream_server_response_latency_ms_count.

Examples

This section includes a set of example metrics that you may plug into your existing Prometheus-based tooling to gain insights into the traffic flowing through your applications.

HTTP
  • View the rate of requests currently flowing:

    irate(nginxplus_http_requests_total[30s])
    
  • View unsuccessful response codes of your applications:

    nginxplus_upstream_server_responses{code=~"3xx|4xx|5xx"}
    

    This can be used to form more complex queries such as current success rate:

    sum(irate(nginxplus_upstream_server_responses{code=~"1xx|2xx"}[30s])) by (app, version) / sum(irate(nginxplus_upstream_server_responses[30s])) by (app, version)
    
UDP/TCP
  • View the current throughput of clients sending to upstreams:

    irate(nginxplus_stream_upstream_server_sent[30s])
    
  • You can also see the total number of connections made:

    nginxplus_stream_upstream_server_connections
    
  • (TCP Only): NGINX Service Mesh exposes a whole host of latency information for TCP connections:

    nginxplus_stream_upstream_server_connect_time
    
    nginxplus_stream_upstream_server_first_byte_time
    
    nginxplus_stream_upstream_server_response_time
    

Labels

All metrics have the following labels:

Metric Name Description
job Prometheus job name. All metrics scraped from an nginx-mesh-sidecar have a job name of nginx-mesh-sidecars, and all metrics scraped from an NGINX Plus Ingress Controller have a job name of nginx-plus-ingress.
pod Name of the Pod.
namespace Namespace where the Pod resides.
instance Address of the Pod.
pod_template_hash Value of the pod-template-hash Kubernetes label.
deployment, statefulset, or daemonset Name of the Deployment, StatefulSet, or DaemonSet that the Pod belongs to.

Metrics for upstream servers, such as nginxplus_upstream_server_requests, have these additional labels:

Metric Name Description
code Response code of the upstream server. For NGINX Plus metrics, the code will be one of the following: 1xx, 2xx, 3xx, 4xx, or 5xx. For the upstream_server_response_latency_ms metrics, the code is the specific response code, such as 201.
upstream Name of the upstream server group.
server Address of the upstream server selected by NGINX.

Metrics for outgoing requests have the following destination labels:

Metric Name Description
dst_pod Name of the Pod that the request was sent to.
dst_service Name of the Service that the request was sent to.
dst_deployment, dst_statefulset, or dst_daemonset Name of the Deployment, StatefulSet, or DaemonSet that the request was sent to.
dst_namespace Namespace that the request was sent to.

Metrics exported by NGINX Plus Ingress Controller have these additional labels:

Metric Name Description
ingress Set to true if ingress traffic is enabled.
egress Set to true if egress traffic is enabled.
class Ingress class of the NGINX Plus Ingress Controller.
resource_type Type of resource: VirtualServer, VirtualServerRoute, or Ingress.
resource_name Name of the VirtualServer, VirtualServerRoute, or Ingress resource.
resource_namespace Namespace of the resource. This value is kept for backwards compatibility; for consistency with NGINX Service Mesh metrics you can use dst_namespace for queries and filters.
service Service the request was sent to. This value is kept for backwards compatibility; for consistency with NGINX Service Mesh metrics you can use dst_service for queries and filters.
pod_name Name of the Pod that the request was sent to. This value is kept for backwards compatibility; for consistency with NGINX Service Mesh metrics you can use dst_pod for queries and filters.
Filter Prometheus Metrics using Labels

Here are some examples of how you can use the labels above to filter your Prometheus metrics:

  • Find all upstream server responses with server side errors for deployment productpage-v1 in namespace prod:

    nginxplus_upstream_server_responses{deployment="productpage-v1",namespace="prod",code="5xx"}
    
  • Find all upstream server responses with successful response codes for deployment productpage-v1 in namespace prod:

    nginxplus_upstream_server_responses{deployment="productpage-v1",namespace="prod",code=~"1xx|2xx"}
    
  • Find the p99 latency of all requests sent from deployment productpage-v1 in namespace prod to service details in namespace prod over the last 30 seconds:

    histogram_quantile(0.99, sum(irate(nginxplus_upstream_server_response_latency_ms_bucket{namespace="prod",deployment="productpage-v1",dst_service="details"}[30s])) by (le))
    
  • Find the p90 latency of all requests sent from deployment productpage-v1 in namespace prod to service details in namespace prod over the last 30 seconds, excluding 301 response codes:

    histogram_quantile(0.90, sum(irate(nginxplus_upstream_server_response_latency_ms_bucket{namespace="prod",deployment="productpage-v1",dst_service="details",code!="301"}[30s])) by (le))
    
  • Find the p50 latency of all successful(response codes of 200, or 201) requests sent from deployment productpage-v1 in namespace prod to service details in namespace prod over the last 30 seconds:

    histogram_quantile(0.50, sum(irate(nginxplus_upstream_server_response_latency_ms_bucket{namespace="prod",deployment="productpage-v1",dst_service="details",code=~"200|201"}[30s])) by (le))
    
  • Find all active connections for the NGINX Plus Ingress Controller:

    nginxplus_connections_active{job="nginx-plus-ingress"}
    

Grafana

By default, NGINX Service Mesh deploys Grafana for metric visualization and configures the NGINX Mesh Top dashboard as the default dashboard. For an example of the dashboard and a list of its features, see the Grafana example in the nginx-service-mesh GitHub repo.

To view the dashboard, port-forward the Grafana service:

kubectl port-forward -n nginx-mesh svc/grafana 3000

If you prefer to use your own Grafana instance refer to the Monitoring and Tracing task for instructions.