
Kubernetes Cluster Receiver Dashboards

Kubernetes cluster monitoring dashboards using OpenTelemetry k8sclusterreceiver metrics, designed for SRE and DevOps workflows.

Overview

The k8sclusterreceiver is an OpenTelemetry Collector receiver that collects cluster-level metrics from the Kubernetes API server. It provides visibility into cluster health, workload status, resource utilization, and autoscaling behavior.
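As a rough illustration of how the receiver is wired into a Collector, the sketch below shows a minimal metrics pipeline. The receiver is configured under the `k8s_cluster` key; the 30s interval and the `debug` exporter are placeholder assumptions, and a real setup would swap in whichever exporter ships data to the backend behind the `metrics-*` data view.

```yaml
# Minimal Collector config sketch (assumed values, not the authors' setup).
receivers:
  k8s_cluster:
    auth_type: serviceAccount      # authenticate with the pod's service account
    collection_interval: 30s       # assumption; the receiver default is shorter

exporters:
  debug:                           # placeholder exporter for local verification
    verbosity: basic

service:
  pipelines:
    metrics:
      receivers: [k8s_cluster]
      exporters: [debug]
```

The service account used by the Collector needs read access (get, list, watch) to the cluster objects the receiver observes, which is typically granted through a ClusterRole and ClusterRoleBinding.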

Important: The k8sclusterreceiver must be deployed as a single instance per cluster to avoid duplicate metrics.
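One way to satisfy the single-instance requirement is to run the Collector as a Deployment with exactly one replica. The manifest below is a sketch under that assumption; every name in it (`otel-cluster-collector`, the `observability` namespace, the ConfigMap name) is hypothetical, and the image tag should be pinned to a tested version.

```yaml
# Hypothetical single-replica Collector Deployment for the cluster receiver.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-cluster-collector        # hypothetical name
  namespace: observability            # hypothetical namespace
spec:
  replicas: 1                         # one instance per cluster avoids duplicate metrics
  strategy:
    type: Recreate                    # stop the old pod before starting the new one
  selector:
    matchLabels:
      app: otel-cluster-collector
  template:
    metadata:
      labels:
        app: otel-cluster-collector
    spec:
      serviceAccountName: otel-cluster-collector   # hypothetical; needs read RBAC
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:latest  # pin a version in practice
          args: ["--config=/conf/config.yaml"]
          volumeMounts:
            - name: config
              mountPath: /conf
      volumes:
        - name: config
          configMap:
            name: otel-cluster-collector-config    # hypothetical ConfigMap with config.yaml
```

The `Recreate` strategy is a design choice here: a rolling update would briefly run two pods side by side, and both would report the same cluster metrics during the overlap.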

Dashboards

Dashboard | File | Description
Cluster Overview | 01-cluster-overview.yaml | Entry point for cluster health triage
Workload Health | 02-workload-health.yaml | Deployment and container health
Resource Allocation | 03-resource-allocation.yaml | Capacity planning and quota analysis
Batch Jobs | 04-batch-jobs.yaml | Job and CronJob monitoring
Autoscaling | 05-autoscaling.yaml | HPA scaling behavior

All dashboards include navigation links for easy switching between views.

Dashboard Definitions

Cluster Overview (01-cluster-overview.yaml)
---
# Kubernetes Cluster Overview Dashboard
# SRE Entry Point: "Is my cluster healthy? Where should I look?"
dashboards:
  - id: k8s-cluster-overview
    name: '[Metrics K8s Cluster] Overview'
    description: High-level Kubernetes cluster health for rapid SRE triage
    controls:
      - type: options
        label: Namespace
        data_view: metrics-*
        field: k8s.namespace.name
    filters:
      - field: data_stream.dataset
        equals: kubernetesclusterreceiver.otel
    panels:
      # ═══════════════════════════════════════════════════════════════════════
      # NAVIGATION
      # ═══════════════════════════════════════════════════════════════════════
      - title: Navigation
        size: {w: 48, h: 3}
        links:
          layout: horizontal
          items:
            - label: 📊 Overview
              dashboard: k8s-cluster-overview
            - label: ⚙️ Workloads
              dashboard: k8s-cluster-workloads
            - label: 📦 Resources
              dashboard: k8s-cluster-resources
            - label: 🔄 Batch Jobs
              dashboard: k8s-cluster-batch
            - label: 📈 Autoscaling
              dashboard: k8s-cluster-hpa

      # ═══════════════════════════════════════════════════════════════════════
      # CLUSTER HEALTH SUMMARY (4 metric cards - at-a-glance health)
      # ═══════════════════════════════════════════════════════════════════════
      - title: Cluster Health
        size: {w: 48, h: 3}
        markdown:
          content: '## 🏥 Cluster Health'
          font_size: 14
      - title: Running Pods
        description: Pods in Running phase (phase=2).
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: unique_count(k8s.pod.name)
            label: Running
            format:
              type: number
              decimals: 0
          filters:
            - field: k8s.pod.phase
              equals: '2'
      - title: Pending Pods
        description: >-
          Pods in Pending phase (phase=1), waiting for scheduling or container
          image pull.
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: unique_count(k8s.pod.name)
            label: Pending
            format:
              type: number
              decimals: 0
          filters:
            - field: k8s.pod.phase
              equals: '1'
      - title: Failed Pods
        description: Pods in Failed phase (phase=4). Check pod logs for root cause.
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: unique_count(k8s.pod.name)
            label: Failed
            format:
              type: number
              decimals: 0
          filters:
            - field: k8s.pod.phase
              equals: '4'
      - title: Container Restarts
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.container.restarts)
            label: Restarts
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.container.restarts

      # ═══════════════════════════════════════════════════════════════════════
      # ANALYSIS: Pod Health Distribution & Trends
      # ═══════════════════════════════════════════════════════════════════════
      - title: Pod Health Distribution
        size: {w: 20, h: 14}
        lens:
          type: pie
          data_view: metrics-*
          breakdowns:
            - field: k8s.pod.phase
              type: values
              label: Status
              size: 5
          metrics:
            - aggregation: unique_count
              field: k8s.pod.name
              label: Pods
              format:
                type: number
                decimals: 0
          color:
            palette: eui_amsterdam_color_blind
            assignments:
              - value: '1'
                color: '#FEC514'
              - value: '2'
                color: '#54B399'
              - value: '3'
                color: '#6092C0'
              - value: '4'
                color: '#D36086'
              - value: '5'
                color: '#9170B8'
      - title: Pod Health Over Time
        size: {w: 28, h: 14}
        lens:
          type: area
          mode: stacked
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          breakdown:
            field: k8s.pod.phase
            type: values
            size: 5
          metrics:
            - aggregation: unique_count
              field: k8s.pod.name
              label: Pods
              format:
                type: number
                decimals: 0
          color:
            palette: eui_amsterdam_color_blind
            assignments:
              - value: '1'
                color: '#FEC514'
              - value: '2'
                color: '#54B399'
              - value: '3'
                color: '#6092C0'
              - value: '4'
                color: '#D36086'
              - value: '5'
                color: '#9170B8'

      # ═══════════════════════════════════════════════════════════════════════
      # WORKLOAD HEALTH PREVIEW
      # ═══════════════════════════════════════════════════════════════════════
      - title: Workload Health
        size: {w: 48, h: 3}
        markdown:
          content: '## 🚀 Workload Health Preview'
          font_size: 14
      - title: Deployments - Desired vs Available
        description: >-
          Gap between lines indicates deployments that can't reach desired
          replica count.
        size: {w: 24, h: 12}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.deployment.desired)
              label: Desired
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.deployment.available)
              label: Available
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.deployment.name
      - title: Container Restarts by Namespace
        size: {w: 24, h: 12}
        lens:
          type: bar
          data_view: metrics-*
          dimension:
            field: k8s.namespace.name
            type: values
            size: 10
            sort:
              by: Restarts
              direction: desc
          metrics:
            - formula: sum(k8s.container.restarts)
              label: Restarts
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.container.restarts

      # ═══════════════════════════════════════════════════════════════════════
      # DETAIL: Unhealthy Deployments Table
      # ═══════════════════════════════════════════════════════════════════════
      - title: Unhealthy Deployments
        size: {w: 48, h: 3}
        markdown:
          content: '## 🔍 Unhealthy Deployments (Desired > Available)'
          font_size: 14
      - title: Deployments Missing Replicas
        description: >-
          Missing = Desired - Available. Positive values indicate failed
          provisioning or insufficient resources.
        size: {w: 48, h: 12}
        lens:
          type: datatable
          data_view: metrics-*
          breakdowns:
            - field: k8s.deployment.name
              type: values
              size: 25
              label: Deployment
              sort:
                by: Missing
                direction: desc
            - field: k8s.namespace.name
              type: values
              size: 1
              label: Namespace
          metrics:
            - field: k8s.deployment.desired
              aggregation: max
              label: Desired
              format:
                type: number
                decimals: 0
            - field: k8s.deployment.available
              aggregation: max
              label: Available
              format:
                type: number
                decimals: 0
            - formula: max(k8s.deployment.desired) - max(k8s.deployment.available)
              label: Missing
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.deployment.name
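
The pod panels in this dashboard filter and color on the raw numeric values of k8s.pod.phase, which the receiver reports as 1 = Pending, 2 = Running, 3 = Succeeded, 4 = Failed, 5 = Unknown. A legend panel could make that mapping visible on the dashboard itself. The sketch below reuses the markdown panel shape already used in this file; the panel title, size, and multi-line content are assumptions, and whether the dashboard tooling renders a multi-line block scalar as a list has not been verified.

```yaml
# Hypothetical legend panel; would be appended under `panels:` in 01-cluster-overview.yaml.
- title: Pod Phase Legend
  size: {w: 48, h: 5}
  markdown:
    # Block scalar keeps one phase per line in the rendered markdown.
    content: |
      **k8s.pod.phase values**
      - `1` Pending
      - `2` Running
      - `3` Succeeded
      - `4` Failed
      - `5` Unknown
    font_size: 12
```
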
Workload Health (02-workload-health.yaml)
---
# Kubernetes Workload Health Dashboard
# SRE Question: "Are my deployments healthy? What's crashing?"
dashboards:
  - id: k8s-cluster-workloads
    name: '[Metrics K8s Cluster] Workload Health'
    description: Deployment, StatefulSet, DaemonSet, and container health monitoring
    controls:
      - type: options
        label: Namespace
        data_view: metrics-*
        field: k8s.namespace.name
      - type: options
        label: Deployment
        data_view: metrics-*
        field: k8s.deployment.name
    filters:
      - field: data_stream.dataset
        equals: kubernetesclusterreceiver.otel
    panels:
      # ═══════════════════════════════════════════════════════════════════════
      # NAVIGATION
      # ═══════════════════════════════════════════════════════════════════════
      - title: Navigation
        size: {w: 48, h: 3}
        links:
          layout: horizontal
          items:
            - label: 📊 Overview
              dashboard: k8s-cluster-overview
            - label: ⚙️ Workloads
              dashboard: k8s-cluster-workloads
            - label: 📦 Resources
              dashboard: k8s-cluster-resources
            - label: 🔄 Batch Jobs
              dashboard: k8s-cluster-batch
            - label: 📈 Autoscaling
              dashboard: k8s-cluster-hpa

      # ═══════════════════════════════════════════════════════════════════════
      # CONTAINER HEALTH SUMMARY (4 metric cards)
      # ═══════════════════════════════════════════════════════════════════════
      - title: Container Health
        size: {w: 48, h: 3}
        markdown:
          content: '## 🐳 Container Health'
          font_size: 14
      - title: Ready Containers
        description: Containers whose readiness probe is passing, so they can receive traffic.
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: unique_count(k8s.container.name)
            label: Ready
            format:
              type: number
              decimals: 0
          filters:
            - field: k8s.container.ready
              equals: '1'
      - title: Not Ready Containers
        description: >-
          Containers failing their readiness probe. Check pod logs and events
          for startup or health issues.
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: unique_count(k8s.container.name)
            label: Not Ready
            format:
              type: number
              decimals: 0
          filters:
            - field: k8s.container.ready
              equals: '0'
      - title: Total Restarts
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.container.restarts)
            label: Restarts
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.container.restarts
      - title: Containers Restarting
        description: >-
          Containers that have restarted at least once since pod creation.
          Frequent restarts indicate instability.
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: unique_count(k8s.container.name)
            label: With Restarts
            format:
              type: number
              decimals: 0
          filters:
            - field: k8s.container.restarts
              gt: '0'

      # ═══════════════════════════════════════════════════════════════════════
      # DEPLOYMENT HEALTH
      # ═══════════════════════════════════════════════════════════════════════
      - title: Deployment Health
        size: {w: 48, h: 3}
        markdown:
          content: '## 🚀 Deployment Health (Desired vs Available)'
          font_size: 14
      - title: Deployments
        size: {w: 24, h: 12}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.deployment.desired)
              label: Desired
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.deployment.available)
              label: Available
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.deployment.name
      - title: StatefulSets
        size: {w: 24, h: 12}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.statefulset.desired_pods)
              label: Desired
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.statefulset.ready_pods)
              label: Ready
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.statefulset.name
      - title: DaemonSets
        size: {w: 24, h: 12}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.daemonset.desired_scheduled_nodes)
              label: Desired Nodes
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.daemonset.ready_nodes)
              label: Ready Nodes
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.daemonset.name
      - title: ReplicaSets
        size: {w: 24, h: 12}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.replicaset.desired)
              label: Desired
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.replicaset.available)
              label: Available
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.replicaset.name

      # ═══════════════════════════════════════════════════════════════════════
      # CONTAINER ANALYSIS
      # ═══════════════════════════════════════════════════════════════════════
      - title: Container Analysis
        size: {w: 48, h: 3}
        markdown:
          content: '## 📊 Container Analysis'
          font_size: 14
      - title: Container Readiness Over Time
        description: >-
          Green (1) = ready, red (0) = not ready. Correlate dips with
          deployments or incidents.
        size: {w: 24, h: 12}
        lens:
          type: area
          mode: stacked
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          breakdown:
            field: k8s.container.ready
            type: values
            size: 2
          metrics:
            - aggregation: unique_count
              field: k8s.container.name
              label: Containers
              format:
                type: number
                decimals: 0
          color:
            palette: eui_amsterdam_color_blind
            assignments:
              - value: '1'
                color: '#54B399'
              - value: '0'
                color: '#D36086'
      - title: Top Restarting Containers
        size: {w: 24, h: 12}
        lens:
          type: bar
          data_view: metrics-*
          dimension:
            field: k8s.container.name
            type: values
            size: 15
            sort:
              by: Restarts
              direction: desc
          metrics:
            - field: k8s.container.restarts
              aggregation: max
              label: Restarts
              format:
                type: number
                decimals: 0
          filters:
            - field: k8s.container.restarts
              gt: '0'

      # ═══════════════════════════════════════════════════════════════════════
      # DETAIL TABLES
      # ═══════════════════════════════════════════════════════════════════════
      - title: Workload Details
        size: {w: 48, h: 3}
        markdown:
          content: '## 🔍 Workload Status Details'
          font_size: 14
      - title: Deployment Status
        size: {w: 24, h: 12}
        lens:
          type: datatable
          data_view: metrics-*
          breakdowns:
            - field: k8s.deployment.name
              type: values
              size: 20
              label: Deployment
            - field: k8s.namespace.name
              type: values
              size: 1
              label: Namespace
          metrics:
            - field: k8s.deployment.desired
              aggregation: max
              label: Desired
              format:
                type: number
                decimals: 0
            - field: k8s.deployment.available
              aggregation: max
              label: Available
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.deployment.name
      - title: StatefulSet Status
        size: {w: 24, h: 12}
        lens:
          type: datatable
          data_view: metrics-*
          breakdowns:
            - field: k8s.statefulset.name
              type: values
              size: 20
              label: StatefulSet
            - field: k8s.namespace.name
              type: values
              size: 1
              label: Namespace
          metrics:
            - field: k8s.statefulset.desired_pods
              aggregation: max
              label: Desired
              format:
                type: number
                decimals: 0
            - field: k8s.statefulset.ready_pods
              aggregation: max
              label: Ready
              format:
                type: number
                decimals: 0
            - field: k8s.statefulset.current_pods
              aggregation: max
              label: Current
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.statefulset.name
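
The detail tables in this dashboard cover Deployments and StatefulSets, while DaemonSets appear only as a trend chart. A matching DaemonSet table could follow the same pattern. The sketch below copies the StatefulSet Status panel structure; the panel itself is not part of the original dashboard, and `k8s.daemonset.misscheduled_nodes` is a receiver metric not used elsewhere in this file, so confirm it is present in your data before relying on it.

```yaml
# Hypothetical panel; would be appended under `panels:` in 02-workload-health.yaml.
- title: DaemonSet Status
  size: {w: 24, h: 12}
  lens:
    type: datatable
    data_view: metrics-*
    breakdowns:
      - field: k8s.daemonset.name
        type: values
        size: 20
        label: DaemonSet
      - field: k8s.namespace.name
        type: values
        size: 1
        label: Namespace
    metrics:
      # Nodes that should be running the daemon pod.
      - field: k8s.daemonset.desired_scheduled_nodes
        aggregation: max
        label: Desired
        format:
          type: number
          decimals: 0
      # Nodes where the daemon pod is running and ready.
      - field: k8s.daemonset.ready_nodes
        aggregation: max
        label: Ready
        format:
          type: number
          decimals: 0
      # Nodes running the daemon pod that are not supposed to.
      - field: k8s.daemonset.misscheduled_nodes
        aggregation: max
        label: Misscheduled
        format:
          type: number
          decimals: 0
    filters:
      - exists: k8s.daemonset.name
```
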
Resource Allocation (03-resource-allocation.yaml)
---
# Kubernetes Resource Allocation Dashboard
# SRE Question: "Am I running out of resources? Are workloads over/under-provisioned?"
dashboards:
  - id: k8s-cluster-resources
    name: '[Metrics K8s Cluster] Resource Allocation'
    description: CPU, memory, and storage requests vs limits for capacity planning
    controls:
      - type: options
        label: Namespace
        data_view: metrics-*
        field: k8s.namespace.name
      - type: options
        label: Node
        data_view: metrics-*
        field: k8s.node.name
    filters:
      - field: data_stream.dataset
        equals: kubernetesclusterreceiver.otel
    panels:
      # ═══════════════════════════════════════════════════════════════════════
      # NAVIGATION
      # ═══════════════════════════════════════════════════════════════════════
      - title: Navigation
        size: {w: 48, h: 3}
        links:
          layout: horizontal
          items:
            - label: 📊 Overview
              dashboard: k8s-cluster-overview
            - label: ⚙️ Workloads
              dashboard: k8s-cluster-workloads
            - label: 📦 Resources
              dashboard: k8s-cluster-resources
            - label: 🔄 Batch Jobs
              dashboard: k8s-cluster-batch
            - label: 📈 Autoscaling
              dashboard: k8s-cluster-hpa

      # ═══════════════════════════════════════════════════════════════════════
      # CLUSTER CAPACITY OVERVIEW
      # ═══════════════════════════════════════════════════════════════════════
      - title: Cluster Capacity
        size: {w: 48, h: 3}
        markdown:
          content: '## 📊 Cluster Capacity (Requests vs Limits)'
          font_size: 14
      - title: CPU Requests vs Limits
        size: {w: 24, h: 12}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.container.cpu_request)
              label: CPU Requests
            - formula: sum(k8s.container.cpu_limit)
              label: CPU Limits
          filters:
            - exists: k8s.container.name
      - title: Memory Requests vs Limits
        size: {w: 24, h: 12}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.container.memory_request)
              label: Memory Requests
              format:
                type: bytes
            - formula: sum(k8s.container.memory_limit)
              label: Memory Limits
              format:
                type: bytes
          filters:
            - exists: k8s.container.name
      - title: Storage Requests vs Limits
        size: {w: 24, h: 12}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.container.storage_request)
              label: Storage Requests
              format:
                type: bytes
            - formula: sum(k8s.container.storage_limit)
              label: Storage Limits
              format:
                type: bytes
          filters:
            - exists: k8s.container.storage_request
      - title: Resource Quota Usage
        size: {w: 24, h: 12}
        lens:
          type: bar
          mode: stacked
          data_view: metrics-*
          dimension:
            field: resource
            type: values
            size: 10
            label: Resource Type
          metrics:
            - field: k8s.resource_quota.used
              aggregation: max
              label: Used
            - formula: max(k8s.resource_quota.hard_limit) - max(k8s.resource_quota.used)
              label: Available
          filters:
            - exists: k8s.resource_quota.hard_limit
          color:
            palette: eui_amsterdam_color_blind
            assignments:
              - value: Used
                color: '#6092C0'
              - value: Available
                color: '#54B399'

      # ═══════════════════════════════════════════════════════════════════════
      # NAMESPACE ALLOCATION
      # ═══════════════════════════════════════════════════════════════════════
      - title: Namespace Allocation
        size: {w: 48, h: 3}
        markdown:
          content: '## 🏷️ Resource Allocation by Namespace'
          font_size: 14
      - title: CPU by Namespace
        size: {w: 24, h: 14}
        lens:
          type: bar
          mode: stacked
          data_view: metrics-*
          dimension:
            field: k8s.namespace.name
            type: values
            size: 15
            sort:
              by: CPU Limits
              direction: desc
          metrics:
            - formula: sum(k8s.container.cpu_request)
              label: CPU Requests
            - formula: sum(k8s.container.cpu_limit)
              label: CPU Limits
          filters:
            - exists: k8s.container.name
      - title: Memory by Namespace
        size: {w: 24, h: 14}
        lens:
          type: bar
          mode: stacked
          data_view: metrics-*
          dimension:
            field: k8s.namespace.name
            type: values
            size: 15
            sort:
              by: Memory Limits
              direction: desc
          metrics:
            - formula: sum(k8s.container.memory_request)
              label: Memory Requests
              format:
                type: bytes
            - formula: sum(k8s.container.memory_limit)
              label: Memory Limits
              format:
                type: bytes
          filters:
            - exists: k8s.container.name

      # ═══════════════════════════════════════════════════════════════════════
      # POD RESOURCE DETAILS
      # ═══════════════════════════════════════════════════════════════════════
      - title: Pod Details
        size: {w: 48, h: 3}
        markdown:
          content: '## 🔍 Pod Resource Details'
          font_size: 14
      - title: Pod Resource Summary
        size: {w: 48, h: 14}
        lens:
          type: datatable
          data_view: metrics-*
          breakdowns:
            - field: k8s.pod.name
              type: values
              size: 25
              label: Pod
            - field: k8s.namespace.name
              type: values
              size: 1
              label: Namespace
          metrics:
            - formula: sum(k8s.container.cpu_request)
              label: CPU Req
            - formula: sum(k8s.container.cpu_limit)
              label: CPU Lim
            - formula: sum(k8s.container.memory_request)
              label: Mem Req
              format:
                type: bytes
            - formula: sum(k8s.container.memory_limit)
              label: Mem Lim
              format:
                type: bytes
            - formula: sum(k8s.container.restarts)
              label: Restarts
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.container.name
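
The Resource Quota Usage panel only has data for namespaces that define a ResourceQuota object, since the k8s.resource_quota.hard_limit and k8s.resource_quota.used metrics are derived from those objects. The manifest below is a minimal example of such an object; the name, namespace, and limit values are hypothetical. Each key under `spec.hard` would be expected to show up as one value of the `resource` field the panel breaks down by.

```yaml
# Hypothetical ResourceQuota; names and values are illustrative only.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota          # hypothetical name
  namespace: team-a           # hypothetical namespace
spec:
  hard:
    requests.cpu: "4"         # total CPU requests allowed in the namespace
    requests.memory: 8Gi      # total memory requests allowed
    limits.cpu: "8"           # total CPU limits allowed
    limits.memory: 16Gi       # total memory limits allowed
    pods: "20"                # maximum number of pods
```
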
Batch Jobs (04-batch-jobs.yaml)
---
# Kubernetes Batch Jobs Dashboard
# SRE Question: "Are my jobs completing successfully? What's failing?"
dashboards:
  - id: k8s-cluster-batch
    name: '[Metrics K8s Cluster] Batch Jobs'
    description: Job and CronJob execution status and completion tracking
    controls:
      - type: options
        label: Namespace
        data_view: metrics-*
        field: k8s.namespace.name
      - type: options
        label: Job
        data_view: metrics-*
        field: k8s.job.name
    filters:
      - field: data_stream.dataset
        equals: kubernetesclusterreceiver.otel
    panels:
      # ═══════════════════════════════════════════════════════════════════════
      # NAVIGATION
      # ═══════════════════════════════════════════════════════════════════════
      - title: Navigation
        size: {w: 48, h: 3}
        links:
          layout: horizontal
          items:
            - label: 📊 Overview
              dashboard: k8s-cluster-overview
            - label: ⚙️ Workloads
              dashboard: k8s-cluster-workloads
            - label: 📦 Resources
              dashboard: k8s-cluster-resources
            - label: 🔄 Batch Jobs
              dashboard: k8s-cluster-batch
            - label: 📈 Autoscaling
              dashboard: k8s-cluster-hpa

      # ═══════════════════════════════════════════════════════════════════════
      # JOB STATUS SUMMARY (4 metric cards)
      # ═══════════════════════════════════════════════════════════════════════
      - title: Job Status Summary
        size: {w: 48, h: 3}
        markdown:
          content: '## 📋 Job Status Summary'
          font_size: 14
      - title: Successful Jobs
        description: Sum of k8s.job.successful_pods. This counts succeeded pods, not Job objects.
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.job.successful_pods)
            label: Successful
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.job.name
      - title: Failed Jobs
        description: Sum of k8s.job.failed_pods. This counts failed pods, not Job objects.
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.job.failed_pods)
            label: Failed
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.job.name
      - title: Active Jobs
        description: Sum of k8s.job.active_pods. This counts running pods, not Job objects.
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.job.active_pods)
            label: Active
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.job.name
      - title: Active CronJobs
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.cronjob.active_jobs)
            label: Active
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.cronjob.name

      # ═══════════════════════════════════════════════════════════════════════
      # JOB EXECUTION TRENDS
      # ═══════════════════════════════════════════════════════════════════════
      - title: Job Execution Trends
        size: {w: 48, h: 3}
        markdown:
          content: '## 📈 Job Execution Trends'
          font_size: 14
      - title: Job Success vs Failure Trend
        size: {w: 48, h: 14}
        lens:
          type: area
          mode: stacked
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.job.successful_pods)
              label: Successful
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.job.failed_pods)
              label: Failed
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.job.active_pods)
              label: Active
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.job.name
          color:
            palette: eui_amsterdam_color_blind
            assignments:
              - value: Successful
                color: '#54B399'
              - value: Failed
                color: '#D36086'
              - value: Active
                color: '#6092C0'

      # ═══════════════════════════════════════════════════════════════════════
      # JOB DETAILS
      # ═══════════════════════════════════════════════════════════════════════
      - title: Job Details
        size: {w: 48, h: 3}
        markdown:
          content: '## 🔍 Job Details'
          font_size: 14
      - title: Jobs by Status
        size: {w: 48, h: 14}
        lens:
          type: datatable
          data_view: metrics-*
          breakdowns:
            - field: k8s.job.name
              type: values
              size: 25
              label: Job
            - field: k8s.namespace.name
              type: values
              size: 1
              label: Namespace
          metrics:
            - field: k8s.job.active_pods
              aggregation: max
              label: Active
              format:
                type: number
                decimals: 0
            - field: k8s.job.successful_pods
              aggregation: max
              label: Successful
              format:
                type: number
                decimals: 0
            - field: k8s.job.failed_pods
              aggregation: max
              label: Failed
              format:
                type: number
                decimals: 0
            - field: k8s.job.desired_successful_pods
              aggregation: max
              label: Desired
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.job.name
Autoscaling (05-autoscaling.yaml)
---
# Kubernetes Autoscaling Dashboard
# SRE Question: "Is autoscaling working? Am I hitting limits?"
dashboards:
  - id: k8s-cluster-hpa
    name: '[Metrics K8s Cluster] Autoscaling'
    description: Horizontal Pod Autoscaler scaling behavior and capacity tracking
    controls:
      - type: options
        label: Namespace
        data_view: metrics-*
        field: k8s.namespace.name
      - type: options
        label: HPA
        data_view: metrics-*
        field: k8s.hpa.name
    filters:
      - field: data_stream.dataset
        equals: kubernetesclusterreceiver.otel
    panels:
      # ═══════════════════════════════════════════════════════════════════════
      # NAVIGATION
      # ═══════════════════════════════════════════════════════════════════════
      - title: Navigation
        size: {w: 48, h: 3}
        links:
          layout: horizontal
          items:
            - label: 📊 Overview
              dashboard: k8s-cluster-overview
            - label: ⚙️ Workloads
              dashboard: k8s-cluster-workloads
            - label: 📦 Resources
              dashboard: k8s-cluster-resources
            - label: 🔄 Batch Jobs
              dashboard: k8s-cluster-batch
            - label: 📈 Autoscaling
              dashboard: k8s-cluster-hpa

      # ═══════════════════════════════════════════════════════════════════════
      # HPA STATUS SUMMARY (4 metric cards)
      # ═══════════════════════════════════════════════════════════════════════
      - title: HPA Status Summary
        size: {w: 48, h: 3}
        markdown:
          content: '## 📈 HPA Status Summary'
          font_size: 14
      - title: Total HPAs
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: unique_count(k8s.hpa.name)
            label: HPAs
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.hpa.name
      - title: Current Replicas
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.hpa.current_replicas)
            label: Current
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.hpa.name
      - title: Desired Replicas
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.hpa.desired_replicas)
            label: Desired
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.hpa.name
      - title: Max Replicas Limit
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.hpa.max_replicas)
            label: Max Total
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.hpa.name

      # ═══════════════════════════════════════════════════════════════════════
      # SCALING BEHAVIOR
      # ═══════════════════════════════════════════════════════════════════════
      - title: Scaling Behavior
        size: {w: 48, h: 3}
        markdown:
          content: '## 🔄 Scaling Behavior'
          font_size: 14
      - title: Scaling Activity (Current vs Desired)
        size: {w: 24, h: 14}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.hpa.current_replicas)
              label: Current Replicas
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.hpa.desired_replicas)
              label: Desired Replicas
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.hpa.name
      - title: Capacity Headroom (Min / Current / Max)
        size: {w: 24, h: 14}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.hpa.min_replicas)
              label: Min Replicas
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.hpa.current_replicas)
              label: Current Replicas
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.hpa.max_replicas)
              label: Max Replicas
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.hpa.name

      # ═══════════════════════════════════════════════════════════════════════
      # HPA DETAILS
      # ═══════════════════════════════════════════════════════════════════════
      - title: HPA Details
        size: {w: 48, h: 3}
        markdown:
          content: '## 🔍 HPA Configuration & Status'
          font_size: 14
      - title: HPA Status by Name
        size: {w: 48, h: 14}
        lens:
          type: datatable
          data_view: metrics-*
          breakdowns:
            - field: k8s.hpa.name
              type: values
              size: 25
              label: HPA
            - field: k8s.namespace.name
              type: values
              size: 1
              label: Namespace
          metrics:
            - field: k8s.hpa.current_replicas
              aggregation: max
              label: Current
              format:
                type: number
                decimals: 0
            - field: k8s.hpa.desired_replicas
              aggregation: max
              label: Desired
              format:
                type: number
                decimals: 0
            - field: k8s.hpa.min_replicas
              aggregation: max
              label: Min
              format:
                type: number
                decimals: 0
            - field: k8s.hpa.max_replicas
              aggregation: max
              label: Max
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.hpa.name

Prerequisites

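  • Kubernetes cluster: v1.24 or later
  • OpenTelemetry Collector: Contrib distribution, which includes the k8sclusterreceiver
  • Kibana: 8.x or later
  • Cluster admin permissions: required to create the RBAC objects (ServiceAccount, ClusterRole, ClusterRoleBinding) that let the receiver read from the API server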
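
The RBAC setup mentioned in the last prerequisite might look like the following sketch. The name otel-k8s-cluster-receiver and the namespace observability are hypothetical placeholders, and the resource list is a typical read-only set for the object types covered on this page, not an authoritative list; check it against the receiver documentation for the Collector version in use.

```yaml
# Sketch only: names and namespace are placeholders, not values from this guide.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-k8s-cluster-receiver
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-k8s-cluster-receiver
rules:
  # Read-only access to the objects the receiver watches.
  - apiGroups: [""]
    resources: [events, namespaces, nodes, pods, replicationcontrollers, resourcequotas, services]
    verbs: [get, list, watch]
  - apiGroups: [apps]
    resources: [daemonsets, deployments, replicasets, statefulsets]
    verbs: [get, list, watch]
  - apiGroups: [batch]
    resources: [jobs, cronjobs]
    verbs: [get, list, watch]
  - apiGroups: [autoscaling]
    resources: [horizontalpodautoscalers]
    verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-k8s-cluster-receiver
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-k8s-cluster-receiver
subjects:
  - kind: ServiceAccount
    name: otel-k8s-cluster-receiver
    namespace: observability
```

The Collector pod would then run under this ServiceAccount, which matches the auth_type: serviceAccount setting used in the receiver configuration.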

Data Requirements

  • Data stream dataset: kubernetesclusterreceiver.otel
  • Data view: metrics-*

OpenTelemetry Collector Configuration

Receiver Configuration

receivers:
  k8s_cluster:
    auth_type: serviceAccount
    collection_interval: 10s
    node_conditions_to_report: [Ready]
    distribution: kubernetes
    allocatable_types_to_report: [cpu, memory, ephemeral-storage, storage]
    metadata_collection_interval: 5m

exporters:
  elasticsearch:
    endpoints: ["https://elasticsearch:9200"]
    auth:
      authenticator: basicauth
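    mapping:
      # The dashboards filter on a dataset ending in `.otel`, a suffix the
      # Elasticsearch exporter appends in OTel mapping mode.
      mode: otel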

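service:
  # The components referenced here (the batch, resourcedetection, and resource
  # processors, plus the basicauth extension used by the exporter) must also be
  # defined under top-level `processors:` and `extensions:` keys, and the
  # extension must be listed under `service.extensions`. They are omitted from
  # this excerpt.
  pipelines:
    metrics:
      receivers: [k8s_cluster]
      processors: [batch, resourcedetection, resource]
      exporters: [elasticsearch]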
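
The pipeline above names three processors and a basicauth authenticator without showing their definitions. A minimal sketch of those pieces follows. The environment variable names, the detector list, and the deployment.environment attribute are illustrative assumptions rather than values taken from this guide; merge the keys into the configuration above instead of treating this as a standalone file.

```yaml
extensions:
  basicauth:
    client_auth:
      # Hypothetical variable names; supply credentials however your deployment does.
      username: ${env:ELASTICSEARCH_USERNAME}
      password: ${env:ELASTICSEARCH_PASSWORD}

processors:
  batch: {}
  resourcedetection:
    # Assumed detector choice for illustration.
    detectors: [env, system]
  resource:
    attributes:
      # Example attribute only; replace with the attributes you need.
      - key: deployment.environment
        value: production
        action: upsert

service:
  # The authenticator extension has to be enabled here to be usable by the exporter.
  extensions: [basicauth]
```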

Receiver Configuration Options

| YAML Key | Type | Description | Default |
|---|---|---|---|
| auth_type | string | Kubernetes API authentication method (serviceAccount, kubeConfig) | serviceAccount |
| collection_interval | duration | Metric collection frequency | 10s |
| node_conditions_to_report | list | Node conditions to monitor | [Ready] |
| distribution | string | Cluster type (kubernetes, openshift) | kubernetes |
| allocatable_types_to_report | list | Node allocatable resource types to report (cpu, memory, ephemeral-storage, storage) | [] |
| metadata_collection_interval | duration | Entity metadata collection frequency | 5m |
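
As an illustration of how these options combine, the sketch below shows a non-default receiver block for an OpenShift cluster that authenticates with a kubeconfig file and reports additional node conditions. The specific values are chosen for the example only and are not recommendations from this guide.

```yaml
receivers:
  k8s_cluster:
    # Use the local kubeconfig instead of the pod's ServiceAccount token.
    auth_type: kubeConfig
    # Collect less frequently than the 10s default.
    collection_interval: 30s
    # Report pressure conditions in addition to Ready.
    node_conditions_to_report: [Ready, MemoryPressure, DiskPressure]
    distribution: openshift
    # Only report allocatable CPU and memory per node.
    allocatable_types_to_report: [cpu, memory]
    metadata_collection_interval: 5m
```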

Metrics Reference

All metrics in the tables below are enabled by default, except those listed under Optional Metrics.

Container Metrics

| Metric | Type | Unit | Description |
|---|---|---|---|
| k8s.container.cpu_limit | Gauge | {cpu} | Maximum CPU resource limit for container |
| k8s.container.cpu_request | Gauge | {cpu} | CPU resources requested for container |
| k8s.container.memory_limit | Gauge | By | Maximum memory resource limit |
| k8s.container.memory_request | Gauge | By | Memory resources requested |
| k8s.container.storage_limit | Gauge | By | Maximum storage resource limit |
| k8s.container.storage_request | Gauge | By | Storage resources requested |
| k8s.container.ephemeralstorage_limit | Gauge | By | Maximum ephemeral storage limit |
| k8s.container.ephemeralstorage_request | Gauge | By | Ephemeral storage requested |
| k8s.container.ready | Gauge | | Whether container passed readiness probe (0/1) |
| k8s.container.restarts | Gauge | {restart} | Container restart count |

Pod Metrics

| Metric | Type | Unit | Description |
|---|---|---|---|
| k8s.pod.phase | Gauge | | Current pod phase (numeric encoding, see below) |

Deployment Metrics

| Metric | Type | Unit | Description |
|---|---|---|---|
| k8s.deployment.desired | Gauge | {pod} | Desired pod count in deployment |
| k8s.deployment.available | Gauge | {pod} | Available pods (ready for minReadySeconds) |

StatefulSet Metrics

| Metric | Type | Unit | Description |
|---|---|---|---|
| k8s.statefulset.desired_pods | Gauge | {pod} | Desired pods (spec.replicas) |
| k8s.statefulset.ready_pods | Gauge | {pod} | Pods with Ready condition |
| k8s.statefulset.current_pods | Gauge | {pod} | Pods created from StatefulSet version |
| k8s.statefulset.updated_pods | Gauge | {pod} | Pods created from current version |

DaemonSet Metrics

| Metric | Type | Unit | Description |
|---|---|---|---|
| k8s.daemonset.desired_scheduled_nodes | Gauge | {node} | Nodes that should run daemon pods |
| k8s.daemonset.current_scheduled_nodes | Gauge | {node} | Nodes running daemon pods as intended |
| k8s.daemonset.ready_nodes | Gauge | {node} | Nodes with ready daemon pods |
| k8s.daemonset.misscheduled_nodes | Gauge | {node} | Nodes running daemon pods incorrectly |

ReplicaSet Metrics

| Metric | Type | Unit | Description |
|---|---|---|---|
| k8s.replicaset.desired | Gauge | {pod} | Desired pod count in replicaset |
| k8s.replicaset.available | Gauge | {pod} | Available pods targeted by replicaset |

Job Metrics

| Metric | Type | Unit | Description |
|---|---|---|---|
| k8s.job.active_pods | Gauge | {pod} | Actively running job pods |
| k8s.job.desired_successful_pods | Gauge | {pod} | Desired successful pod count |
| k8s.job.successful_pods | Gauge | {pod} | Pods in Succeeded phase |
| k8s.job.failed_pods | Gauge | {pod} | Pods in Failed phase |
| k8s.job.max_parallel_pods | Gauge | {pod} | Maximum concurrent pods |

CronJob Metrics

| Metric | Type | Unit | Description |
|---|---|---|---|
| k8s.cronjob.active_jobs | Gauge | {job} | Count of actively running jobs |

HPA Metrics

| Metric | Type | Unit | Description |
|---|---|---|---|
| k8s.hpa.current_replicas | Gauge | {pod} | Current pod replicas managed by autoscaler |
| k8s.hpa.desired_replicas | Gauge | {pod} | Desired pod replicas for autoscaler |
| k8s.hpa.min_replicas | Gauge | {pod} | Minimum autoscaler replica count |
| k8s.hpa.max_replicas | Gauge | {pod} | Maximum autoscaler replica count |

Resource Quota Metrics

| Metric | Type | Unit | Description | Attributes |
|---|---|---|---|---|
| k8s.resource_quota.hard_limit | Gauge | {resource} | Upper resource limit in namespace quota | resource |
| k8s.resource_quota.used | Gauge | {resource} | Resource usage against quota | resource |

Namespace Metrics

| Metric | Type | Unit | Description |
|---|---|---|---|
| k8s.namespace.phase | Gauge | | Current phase (1=active, 0=terminating) |

Optional Metrics (disabled by default)

| Metric | Type | Unit | Description | Attributes |
|---|---|---|---|---|
| k8s.container.status.reason | Sum | {container} | Container count by status reason | k8s.container.status.reason |
| k8s.container.status.state | Sum | {container} | Container count by state | k8s.container.status.state |
| k8s.node.condition | Gauge | {condition} | Node condition status | condition |
| k8s.pod.status_reason | Gauge | | Pod status reason (numeric encoding) | |
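
Because these metrics are off by default, they have to be switched on per metric in the receiver block. The sketch below assumes the standard per-metric enabled toggle that Collector receivers expose under a metrics key; enable only the metrics you plan to chart, since each one adds to the volume of data exported.

```yaml
receivers:
  k8s_cluster:
    metrics:
      # Each optional metric is enabled individually.
      k8s.container.status.state:
        enabled: true
      k8s.container.status.reason:
        enabled: true
      k8s.node.condition:
        enabled: true
      k8s.pod.status_reason:
        enabled: true
```

The same toggle with enabled: false can be used to drop a default metric that no dashboard needs.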

Phase Value Encoding

The k8s.pod.phase metric uses numeric values:

| Value | Phase |
|---|---|
| 1 | Pending |
| 2 | Running |
| 3 | Succeeded |
| 4 | Failed |
| 5 | Unknown |
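
One way to use this encoding in a panel is to filter on the numeric value. The sketch below reuses the panel layout from the dashboard definitions on this page to count pods whose phase is Failed (4). It assumes that the field/equals filter form shown at the dashboard level is also accepted inside a panel's filters list, and that pod documents carry a k8s.pod.name field; neither is confirmed by this page, so treat the panel as a starting point rather than a tested definition.

```yaml
# Hypothetical panel: counts distinct pods currently reporting phase 4 (Failed).
- title: Failed Pods
  hide_title: true
  size: {w: 12, h: 4}
  lens:
    type: metric
    data_view: metrics-*
    primary:
      formula: unique_count(k8s.pod.name)
      label: Failed
      format:
        type: number
        decimals: 0
    filters:
      # 4 is the numeric encoding for the Failed phase (see the table above).
      - field: k8s.pod.phase
        equals: 4
```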

Metric Attributes

| Attribute | Values | Description |
|---|---|---|
| resource | cpu, memory, pods, requests.cpu, requests.memory, limits.cpu, limits.memory | Resource quota type |
| k8s.container.status.reason | ContainerCreating, CrashLoopBackOff, CreateContainerConfigError, ErrImagePull, ImagePullBackOff, OOMKilled, Completed, Error, ContainerCannotRun | Container status reason |
| k8s.container.status.state | terminated, running, waiting | Container state |
| condition | Ready, MemoryPressure, PIDPressure, DiskPressure | Node condition |

Metrics Not Used in Dashboards

The following metrics are available from the k8sclusterreceiver but are not currently visualized in the dashboards:

Default Metrics Not Used

| Metric | Type | Unit | Description |
|---|---|---|---|
| k8s.container.ephemeralstorage_limit | Gauge | By | Maximum ephemeral storage limit |
| k8s.container.ephemeralstorage_request | Gauge | By | Ephemeral storage requested |
| k8s.statefulset.updated_pods | Gauge | {pod} | Pods created from current version |
| k8s.daemonset.current_scheduled_nodes | Gauge | {node} | Nodes running daemon pods as intended |
| k8s.daemonset.misscheduled_nodes | Gauge | {node} | Nodes running daemon pods incorrectly |
| k8s.job.max_parallel_pods | Gauge | {pod} | Maximum concurrent pods |
| k8s.namespace.phase | Gauge | | Current phase (1=active, 0=terminating) |

Optional Metrics Not Used

| Metric | Type | Unit | Description | Attributes |
|---|---|---|---|---|
| k8s.container.status.reason | Sum | {container} | Container count by status reason | k8s.container.status.reason |
| k8s.container.status.state | Sum | {container} | Container count by state | k8s.container.status.state |
| k8s.node.condition | Gauge | {condition} | Node condition status | condition |
| k8s.pod.status_reason | Gauge | | Pod status reason (numeric encoding) | |