Observability

Uses: Kong Mesh

This page describes how to configure different observability tools to work with Kong Mesh.

kumactl ships with a built-in observability stack that includes:

  • Prometheus for metrics
  • Jaeger for ingesting and storing traces
  • Loki for ingesting and storing logs
  • Grafana for querying and displaying metrics, traces, and logs

To enable observability, you need the following policies:

  • MeshMetric for metrics
  • MeshTrace for traces
  • MeshAccessLog for access logs

On Kubernetes, the stack can be installed with:

kumactl install observability | kubectl apply -f -

This creates a namespace named mesh-observability with Prometheus, Jaeger, Loki, and Grafana installed and set up to work with Kong Mesh.

This setup is meant for testing purposes. Do not use it for production. For production setups, we recommend referring to each project’s website or using a hosted solution such as Grafana Cloud or Datadog.

Control plane observability

The control plane supports metrics and traces for observability.

Metrics

Control plane metrics are exposed on port :5680 and available under the standard path /metrics.
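
As a minimal sketch, a Prometheus scrape job for the control plane could look like the following. The job name is hypothetical, and the host name is an assumption borrowed from the service address used elsewhere on this page; adjust both to your installation.

```yaml
scrape_configs:
  # Hypothetical job name; pick whatever fits your naming scheme.
  - job_name: "kong-mesh-control-plane"
    # The control plane serves metrics under the standard path.
    metrics_path: /metrics
    static_configs:
      # Assumed in-cluster address of the control plane; port 5680 is the metrics port.
      - targets: ["kong-mesh-control-plane.kong-mesh-system.svc:5680"]
```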

Traces

You can configure Kong Mesh to export OpenTelemetry traces. It exports traces for:

  • API server
  • KDS on global (only basic information about the connections to zones is traced, nothing resource-specific)
  • Inter-CP server

To enable tracing, set the KUMA_TRACING_OPENTELEMETRY_ENABLED or tracing.openTelemetry.enabled control plane config variable to "true" and configure OpenTelemetry using the standard OTEL_EXPORTER_OTLP_* environment variables.
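
For illustration, the environment variables could be set on the control plane container roughly like this. The fragment assumes a Kubernetes Deployment, and the collector address is a hypothetical endpoint; OTEL_EXPORTER_OTLP_ENDPOINT is one of the standard OpenTelemetry exporter variables.

```yaml
# Fragment of the control plane container spec (assumed Kubernetes install).
env:
  # Turns on OpenTelemetry trace export in the control plane.
  - name: KUMA_TRACING_OPENTELEMETRY_ENABLED
    value: "true"
  # Standard OTLP exporter variable; the address below is an assumption.
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.observability:4317"
```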

Configure Prometheus

The Kuma community has contributed built-in service discovery for Prometheus. It is documented in the Prometheus docs. This service discovery connects to the control plane and retrieves all data planes that have metrics enabled. Prometheus then scrapes those data planes according to your MeshMetric policies.

There are three ways to run Prometheus:

  1. Inside the mesh (default with kumactl install observability).
  2. Outside the mesh. In this case, you must specify tls.mode: disabled in the MeshMetric configuration. This is less secure but ensures Prometheus is as available as possible. It’s also easier to add to an existing setup with services in and outside the mesh.
  3. Outside the mesh with TLS enabled. In this case, you need to provide certificates for each data plane and specify the configuration in the MeshMetric policy. This is more secure than the second option but requires more configuration.

In production, we recommend the second option because it provides better visibility when things go wrong, and it’s usually acceptable for metrics to be less secure.
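
A MeshMetric policy for the second option might look like the sketch below. The policy name, the namespace, the mesh label, and the port and path values are assumptions for illustration, and the exact field names and the casing of the TLS mode value should be checked against the MeshMetric reference for your version.

```yaml
apiVersion: kuma.io/v1alpha1
kind: MeshMetric
metadata:
  name: metrics-outside-mesh      # hypothetical name
  namespace: kong-mesh-system     # assumed system namespace
  labels:
    kuma.io/mesh: default         # assumed mesh name
spec:
  targetRef:
    kind: Mesh
  default:
    backends:
      - type: Prometheus
        prometheus:
          port: 5670              # assumed data plane metrics port
          path: /metrics
          tls:
            # Lets a Prometheus running outside the mesh scrape without mTLS.
            mode: Disabled
```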

Use an existing Prometheus setup

In Prometheus version 2.29 or later, you can add Kong Mesh metrics to your prometheus.yml:

scrape_configs:
  - job_name: 'kuma-dataplanes'
    scrape_interval: "5s"
    relabel_configs:
      - source_labels:
          - __meta_kuma_mesh
        regex: "(.*)"
        target_label: mesh
      - source_labels:
          - __meta_kuma_dataplane
        regex: "(.*)"
        target_label: dataplane
      - action: labelmap
        regex: __meta_kuma_label_(.+)
    kuma_sd_configs:
      - server: "http://kong-mesh-control-plane.kong-mesh-system.svc:5676"

For more information, see the Prometheus documentation.

If you have MeshMetric enabled for your mesh, check the Targets page in the Prometheus dashboard. You should see a list of data plane proxies from your mesh.

Configure Grafana

You can use Grafana to visualize traces from Jaeger and logs from Loki, and the Kuma community ships dashboards and a data source for deeper integration.

Visualize traces

To visualize your traces with Grafana, you can configure a new data source with the URL http://jaeger-query.mesh-observability/ (or any other URL Jaeger can be queried at). Grafana can then retrieve traces from Jaeger.
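
If you manage Grafana through provisioning files rather than the UI, the data source could be declared along these lines. This is a sketch: the data source name is hypothetical and the URL is the one mentioned above.

```yaml
# Example Grafana data source provisioning file.
apiVersion: 1
datasources:
  - name: Jaeger                  # hypothetical display name
    type: jaeger
    access: proxy
    # Any URL where the Jaeger query service is reachable works here.
    url: http://jaeger-query.mesh-observability/
```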

You can then add a MeshTrace policy to your mesh to start emitting traces. At this point you can visualize your traces in Grafana by choosing the Jaeger data source in the Explore section.
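
As a rough sketch, a MeshTrace policy that sends spans to the bundled Jaeger through its Zipkin-compatible endpoint could look like this. The policy name, namespace, mesh label, collector URL, and sampling value are all assumptions; verify the fields against the MeshTrace reference before using it.

```yaml
apiVersion: kuma.io/v1alpha1
kind: MeshTrace
metadata:
  name: trace-to-jaeger           # hypothetical name
  namespace: kong-mesh-system     # assumed system namespace
  labels:
    kuma.io/mesh: default         # assumed mesh name
spec:
  targetRef:
    kind: Mesh
  default:
    backends:
      - type: Zipkin
        zipkin:
          # Assumed address of the Jaeger collector's Zipkin endpoint.
          url: http://jaeger-collector.mesh-observability:9411/api/v2/spans
    sampling:
      overall: 80                 # illustrative sampling percentage
```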

Visualize logs

To visualize your containers’ logs and your access logs with Grafana, add a MeshAccessLog policy to your mesh to start emitting access logs. Loki picks up logs that are sent to stdout, so configure the policy’s logging backend to write to stdout.
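
One way to write access logs to stdout is a file backend that points at /dev/stdout. The following is a sketch under that assumption; the policy name, namespace, and mesh label are hypothetical, and the policy structure may differ between Kong Mesh versions.

```yaml
apiVersion: kuma.io/v1alpha1
kind: MeshAccessLog
metadata:
  name: access-log-stdout         # hypothetical name
  namespace: kong-mesh-system     # assumed system namespace
  labels:
    kuma.io/mesh: default         # assumed mesh name
spec:
  targetRef:
    kind: Mesh
  from:
    - targetRef:
        kind: Mesh
      default:
        backends:
          - type: File
            file:
              # Writing to stdout lets Loki collect the sidecar's access logs.
              path: /dev/stdout
```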

You can then visualize your containers’ logs and your access logs in Grafana by choosing the Loki data source in the Explore section.

For example, running {container="kuma-sidecar"} |= "GET" shows all GET requests on your cluster. For more information about the search syntax, see the Loki docs.

Grafana extensions

The Kuma community has built a data source and a set of dashboards to provide better integrations between Kong Mesh and Grafana.

Data source and service map

The Grafana data source is specifically built to relate information from the control plane with Prometheus metrics.

Current features include:

  • Display the graph of your services with MeshGraph using the Grafana node graph panel.
  • List meshes.
  • List zones.
  • List services.

To use the plugin, you need to add the binary to your Grafana instance by following the installation instructions.

The data source is installed and configured when using kumactl install observability.

Dashboards

Kong Mesh ships with default dashboards that are available to import from the Grafana Labs repository:

  • Kuma CP: Investigate control plane statistics.
  • Kuma Dataplane: Investigate the status of a single data plane in the mesh. To see these metrics, you need to create a MeshMetric policy first.
  • Kuma Gateway: Investigate aggregated statistics for each built-in gateway.
  • Kuma Mesh: Investigate the aggregated statistics of a single mesh. It provides a topology view of your service traffic dependencies (Service Map) and includes information such as the number of requests and error rates.
  • Kuma Service: Investigate aggregated statistics for each service.
  • Kuma Service to Service: Investigate aggregated statistics from data planes of specified source services to data planes of specified destination services.

Configure Datadog

The recommended way to use Datadog is with its agent.

Metrics

Kong Mesh exposes metrics with the MeshMetric policy in Prometheus format.

You can add annotations to your Pods to enable the Datadog agent to scrape metrics.

For Kubernetes, refer to the dedicated documentation.

On Universal, set up your agent with an openmetrics.d/conf.yaml.
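
A minimal openmetrics.d/conf.yaml might look like the sketch below. The endpoint assumes the data plane exposes metrics on localhost port 5670 under /metrics, and the namespace prefix is hypothetical; match both to your MeshMetric configuration.

```yaml
init_config:

instances:
    # Assumed local data plane metrics endpoint.
  - openmetrics_endpoint: http://localhost:5670/metrics
    # Hypothetical prefix added to every collected metric name.
    namespace: kong_mesh
    # Pattern intended to collect every exposed metric; narrow it for production.
    metrics:
      - ".*"
```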

Tracing

To configure tracing using Datadog on Universal, see the Datadog agent docs.

On Kubernetes, configure the Datadog agent for APM.

If Datadog isn’t running on each node, you can expose the APM agent port to Kong Mesh via a Kubernetes service.

apiVersion: v1
kind: Service
metadata:
  name: trace-svc
spec:
  selector:
    app.kubernetes.io/name: datadog-agent-deployment
  ports:
    - protocol: TCP
      port: 8126
      targetPort: 8126

Check that the label of the installed Datadog Pod hasn’t changed (app.kubernetes.io/name: datadog-agent-deployment). If it changed, adjust accordingly.

Once the agent is configured to ingest traces, you must configure a MeshTrace policy.
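
For example, a MeshTrace policy pointing at the trace-svc service above could be sketched as follows. The policy name, the system namespace, the mesh label, and the assumption that trace-svc lives in the default namespace are illustrative only.

```yaml
apiVersion: kuma.io/v1alpha1
kind: MeshTrace
metadata:
  name: trace-to-datadog          # hypothetical name
  namespace: kong-mesh-system     # assumed system namespace
  labels:
    kuma.io/mesh: default         # assumed mesh name
spec:
  targetRef:
    kind: Mesh
  default:
    backends:
      - type: Datadog
        datadog:
          # Assumes trace-svc was created in the "default" namespace.
          url: http://trace-svc.default.svc.cluster.local:8126
```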

Logs

The best way to have Kong Mesh and Datadog work together is with TCP ingest.

Once your agent is configured with TCP ingest, you can configure a MeshAccessLog policy for data plane proxies to send logs.
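
A MeshAccessLog policy with a TCP backend could look roughly like this. The agent address and port are hypothetical placeholders for wherever your Datadog agent listens for TCP logs, and the names and labels are assumptions.

```yaml
apiVersion: kuma.io/v1alpha1
kind: MeshAccessLog
metadata:
  name: access-log-datadog        # hypothetical name
  namespace: kong-mesh-system     # assumed system namespace
  labels:
    kuma.io/mesh: default         # assumed mesh name
spec:
  targetRef:
    kind: Mesh
  from:
    - targetRef:
        kind: Mesh
      default:
        backends:
          - type: Tcp
            tcp:
              # Hypothetical host and port of the agent's TCP log listener.
              address: datadog-agent.datadog.svc:10518
```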

OpenTelemetry collector

You can run an OpenTelemetry collector to receive metrics, traces, and access logs from Kong Mesh sidecars and forward them to one or more backends. For step-by-step setup, see Deploy an OpenTelemetry collector.

How Kong Mesh talks to the collector

Sidecars push telemetry to the collector over OTLP gRPC on port 4317. The collector receives the telemetry, batches it, and exports it to whatever backends you configure.
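
The receive, batch, and export flow can be sketched as a collector configuration like the one below. The debug exporter is only a stand-in that prints telemetry to the collector's own output; replace it with the exporters for your real backends.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        # Listen for OTLP gRPC from sidecars on the port described above.
        endpoint: 0.0.0.0:4317

processors:
  # Groups telemetry into batches before export.
  batch: {}

exporters:
  # Placeholder exporter for testing; swap in your actual backends.
  debug: {}

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
```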

Kong Mesh uses a push model: each sidecar opens an outbound connection to one collector Pod and writes its own telemetry. In a pull model, by contrast, a collector scrapes Prometheus endpoints from every workload it can reach.

The distinction matters when you pick a topology. A CNCF post warns about 20-40x metric explosion when DaemonSet collectors all scrape the same Prometheus targets, a problem specific to the pull model. Because Kong Mesh pushes, each metric reaches one collector instance regardless of how many collector Pods exist.

Topologies

Two patterns work for the OTLP receiver.

Deployment + ClusterIP service

Run two or three collector replicas behind a ClusterIP service. Sidecars resolve otel-collector.observability:4317 to the Service IP, and kube-proxy load-balances each connection to a collector Pod.

We recommend this topology because it’s simple, the failure domain is the whole replica set, and a rolling update of the collector doesn’t drop telemetry from any specific node. Use a Deployment for small and medium clusters, or any cluster where collector throughput isn’t a bottleneck.
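
A minimal sketch of this topology follows. The app label, replica count, and image are assumptions, and the collector's own configuration (normally mounted from a ConfigMap) is omitted for brevity.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  replicas: 2                     # two or three replicas, as described above
  selector:
    matchLabels:
      app: otel-collector         # hypothetical label
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          # Assumed image; pin a specific version in practice.
          image: otel/opentelemetry-collector-contrib
          ports:
            - containerPort: 4317
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: observability
spec:
  type: ClusterIP
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      protocol: TCP
      port: 4317
      targetPort: 4317
```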

Per-node DaemonSet

Run one collector Pod per node and route traffic node-locally. With internalTrafficPolicy: Local on the service, kube-proxy on each node only forwards to the collector Pod on that same node. Sidecars still resolve the same DNS name (otel-collector.observability:4317), but the hop never leaves the node.

Pick a DaemonSet for large clusters or workloads where the extra network hop matters. A DaemonSet improves locality, distributes load across nodes, and isolates collector failure to a single node’s telemetry.

The trade-off is silent loss. If the collector Pod on a node crashes or is restarting, sidecars on that node have no fallback and drop their telemetry until the Pod is ready. The Local traffic policy does not fail over to other nodes.
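
For the DaemonSet variant, the Service differs mainly in its traffic policy. The sketch below assumes the DaemonSet Pods carry a hypothetical app: otel-collector label.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: observability
spec:
  # Only route to collector Pods on the same node as the client.
  # There is no failover to other nodes if the local Pod is not ready.
  internalTrafficPolicy: Local
  selector:
    app: otel-collector           # hypothetical label on the DaemonSet Pods
  ports:
    - name: otlp-grpc
      protocol: TCP
      port: 4317
      targetPort: 4317
```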

Observability in multi-zone

The following sections explain how to architect your telemetry stack to accommodate multi-zone deployments.

Prometheus

When Kong Mesh is used in multi-zone, the recommended approach is to use one Prometheus instance in each zone and send the metrics of each zone to a global Prometheus instance.

Prometheus offers different ways to do this:

  • Federation: The global Prometheus scrapes Prometheus in each zone.
  • Remote Write: Prometheus in each zone directly writes metrics to the global instance. This is usually more efficient than federation.
  • Remote Read: The global Prometheus reads metrics from the zone instances.
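
As an illustration of the Remote Write option, the prometheus.yml of a zone instance could include something like the following. The global instance URL and the zone label value are hypothetical, and the global Prometheus must be started with its remote write receiver enabled to accept the data.

```yaml
global:
  external_labels:
    # Hypothetical label that identifies which zone the samples came from.
    zone: zone-1

remote_write:
    # Hypothetical address of the global Prometheus remote write endpoint.
  - url: "http://prometheus-global.example.com/api/v1/write"
```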

Jaeger, Loki, Datadog, and others

Most telemetry components don’t have a hierarchical setup like Prometheus. If you want to have a central view of everything, you can set up the system in the global instance and have each zone send data to it. Because the zone is present in the data plane tags, metrics, logs, and traces should not overlap between zones.

Known issues

The following are known observability issues in Kong Mesh.

MADS server bug in 2.6.0

Version 2.6.0 of Kong Mesh introduced a bug in the MADS server that was fixed in version 2.7.0. This bug can cause delays in delivering monitoring assignments to Prometheus if you changed the default Prometheus configuration for kuma_sd_configs.fetch_timeout. This results in Prometheus not collecting metrics from new data plane proxies during that period.

To fix this issue, configure kuma_sd_configs as follows:

kuma_sd_configs:
  - fetch_timeout: 0s

This disables long polling on Prometheus service discovery.
