Kubernetes Cluster Receiver Dashboards

These dashboards monitor Kubernetes clusters using metrics from the OpenTelemetry k8sclusterreceiver and are designed for SRE and DevOps workflows.

Overview

The k8sclusterreceiver is an OpenTelemetry Collector receiver that collects cluster-level metrics from the Kubernetes API server. It provides visibility into cluster health, workload status, resource allocation (requests, limits, and quotas), and autoscaling behavior.

Important: The k8sclusterreceiver must be deployed as a single instance per cluster to avoid duplicate metrics.
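
One way to satisfy the single-instance requirement is to run a dedicated Collector as a Deployment with `replicas: 1`. The manifest below is a minimal sketch and is not taken from this project: the resource names, namespace, image tag, and the `debug` exporter are placeholders, and the ServiceAccount is assumed to already have RBAC permissions to read cluster objects.

```yaml
# Sketch only. All names below are hypothetical; adjust to your environment.
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-cluster-collector-config
  namespace: observability
data:
  config.yaml: |
    receivers:
      k8s_cluster:
        auth_type: serviceAccount
        collection_interval: 10s
    exporters:
      debug: {}   # placeholder; replace with the exporter for your backend
    service:
      pipelines:
        metrics:
          receivers: [k8s_cluster]
          exporters: [debug]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-cluster-collector
  namespace: observability
spec:
  replicas: 1   # exactly one instance per cluster, to avoid duplicate metrics
  selector:
    matchLabels:
      app: otel-cluster-collector
  template:
    metadata:
      labels:
        app: otel-cluster-collector
    spec:
      serviceAccountName: otel-cluster-collector   # assumed to exist with read access
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otelcol/config.yaml"]
          volumeMounts:
            - name: config
              mountPath: /etc/otelcol
      volumes:
        - name: config
          configMap:
            name: otel-cluster-collector-config
```

Running the receiver in a Deployment rather than a DaemonSet is the design choice that matters here: a DaemonSet would start one receiver per node, and each would report the same cluster-level objects.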

Dashboards

| Dashboard | File | Description |
|-----------|------|-------------|
| Cluster Overview | `01-cluster-overview.yaml` | Entry point for cluster health triage |
| Workload Health | `02-workload-health.yaml` | Deployment and container health |
| Resource Allocation | `03-resource-allocation.yaml` | Capacity planning and quota analysis |
| Batch Jobs | `04-batch-jobs.yaml` | Job and CronJob monitoring |
| Autoscaling | `05-autoscaling.yaml` | HPA scaling behavior |

All dashboards include navigation links for easy switching between views.

Dashboard Definitions

Cluster Overview (01-cluster-overview.yaml)

---
# Kubernetes Cluster Overview Dashboard
# SRE Entry Point: "Is my cluster healthy? Where should I look?"
dashboards:
  - id: k8s-cluster-overview
    name: '[Metrics K8s Cluster] Overview'
    description: High-level Kubernetes cluster health for rapid SRE triage
    controls:
      - type: options
        label: Namespace
        data_view: metrics-*
        field: k8s.namespace.name
    filters:
      - field: data_stream.dataset
        equals: kubernetesclusterreceiver.otel
    panels:
      # ═══════════════════════════════════════════════════════════════════════
      # NAVIGATION
      # ═══════════════════════════════════════════════════════════════════════
      - title: Navigation
        size: {w: 48, h: 3}
        links:
          layout: horizontal
          items:
            - label: 📊 Overview
              dashboard: k8s-cluster-overview
            - label: ⚙️ Workloads
              dashboard: k8s-cluster-workloads
            - label: 📦 Resources
              dashboard: k8s-cluster-resources
            - label: 🔄 Batch Jobs
              dashboard: k8s-cluster-batch
            - label: 📈 Autoscaling
              dashboard: k8s-cluster-hpa

      # ═══════════════════════════════════════════════════════════════════════
      # CLUSTER HEALTH SUMMARY (4 metric cards - at-a-glance health)
      # ═══════════════════════════════════════════════════════════════════════
      - title: Cluster Health
        size: {w: 48, h: 3}
        markdown:
          content: '## 🏥 Cluster Health'
          font_size: 14
      - title: Running Pods
        description: Pods in Running phase (phase=2).
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: unique_count(k8s.pod.name)
            label: Running
            format:
              type: number
              decimals: 0
          filters:
            - field: k8s.pod.phase
              equals: '2'
      - title: Pending Pods
        description: >-
          Pods in Pending phase (phase=1), waiting for scheduling or container
          image pull.
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: unique_count(k8s.pod.name)
            label: Pending
            format:
              type: number
              decimals: 0
          filters:
            - field: k8s.pod.phase
              equals: '1'
      - title: Failed Pods
        description: Pods in Failed phase (phase=4). Check pod logs for root cause.
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: unique_count(k8s.pod.name)
            label: Failed
            format:
              type: number
              decimals: 0
          filters:
            - field: k8s.pod.phase
              equals: '4'
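      - title: Container Restarts
        description: >-
          Sum of k8s.container.restarts over every sample in the selected time
          range. The total grows with the range, so read it as a relative signal.
        hide_title: true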
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.container.restarts)
            label: Restarts
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.container.restarts

      # ═══════════════════════════════════════════════════════════════════════
      # ANALYSIS: Pod Health Distribution & Trends
      # ═══════════════════════════════════════════════════════════════════════
      - title: Pod Health Distribution
        size: {w: 20, h: 14}
        lens:
          type: pie
          data_view: metrics-*
          breakdowns:
            - field: k8s.pod.phase
              type: values
              label: Status
              size: 5
          metrics:
            - aggregation: unique_count
              field: k8s.pod.name
              label: Pods
              format:
                type: number
                decimals: 0
          color:
            palette: eui_amsterdam_color_blind
            assignments:
              - value: '1'
                color: '#FEC514'
              - value: '2'
                color: '#54B399'
              - value: '3'
                color: '#6092C0'
              - value: '4'
                color: '#D36086'
              - value: '5'
                color: '#9170B8'
      - title: Pod Health Over Time
        size: {w: 28, h: 14}
        lens:
          type: area
          mode: stacked
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          breakdown:
            field: k8s.pod.phase
            type: values
            size: 5
          metrics:
            - aggregation: unique_count
              field: k8s.pod.name
              label: Pods
              format:
                type: number
                decimals: 0
          color:
            palette: eui_amsterdam_color_blind
            assignments:
              - value: '1'
                color: '#FEC514'
              - value: '2'
                color: '#54B399'
              - value: '3'
                color: '#6092C0'
              - value: '4'
                color: '#D36086'
              - value: '5'
                color: '#9170B8'

      # ═══════════════════════════════════════════════════════════════════════
      # WORKLOAD HEALTH PREVIEW
      # ═══════════════════════════════════════════════════════════════════════
      - title: Workload Health
        size: {w: 48, h: 3}
        markdown:
          content: '## 🚀 Workload Health Preview'
          font_size: 14
      - title: Deployments - Desired vs Available
        description: >-
          Gap between lines indicates deployments that can't reach desired
          replica count.
        size: {w: 24, h: 12}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.deployment.desired)
              label: Desired
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.deployment.available)
              label: Available
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.deployment.name
      - title: Container Restarts by Namespace
        size: {w: 24, h: 12}
        lens:
          type: bar
          data_view: metrics-*
          dimension:
            field: k8s.namespace.name
            type: values
            size: 10
            sort:
              by: Restarts
              direction: desc
          metrics:
            - formula: sum(k8s.container.restarts)
              label: Restarts
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.container.restarts

      # ═══════════════════════════════════════════════════════════════════════
      # DETAIL: Unhealthy Deployments Table
      # ═══════════════════════════════════════════════════════════════════════
      - title: Unhealthy Deployments
        size: {w: 48, h: 3}
        markdown:
          content: '## 🔍 Unhealthy Deployments (Desired ≠ Available)'
          font_size: 14
      - title: Deployments Missing Replicas
        description: >-
          Missing = Desired - Available. Positive values indicate failed
          provisioning or insufficient resources.
        size: {w: 48, h: 12}
        lens:
          type: datatable
          data_view: metrics-*
          breakdowns:
            - field: k8s.deployment.name
              type: values
              size: 25
              label: Deployment
              sort:
                by: Missing
                direction: desc
            - field: k8s.namespace.name
              type: values
              size: 1
              label: Namespace
          metrics:
            - field: k8s.deployment.desired
              aggregation: max
              label: Desired
              format:
                type: number
                decimals: 0
            - field: k8s.deployment.available
              aggregation: max
              label: Available
              format:
                type: number
                decimals: 0
            - formula: max(k8s.deployment.desired) - max(k8s.deployment.available)
              label: Missing
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.deployment.name
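
The pod cards in this dashboard all follow one pattern: count unique pod names and filter on a single `k8s.pod.phase` value. As an illustration only (this panel is not part of `01-cluster-overview.yaml`), a card for the Unknown phase could be sketched the same way, assuming the receiver's phase encoding continues as 3 = Succeeded and 5 = Unknown:

```yaml
# Hypothetical extra panel; it would sit under `panels:` next to the other cards.
- title: Unknown Pods
  description: Pods in Unknown phase (phase=5), often because the node is unreachable.
  hide_title: true
  size: {w: 12, h: 4}
  lens:
    type: metric
    data_view: metrics-*
    primary:
      formula: unique_count(k8s.pod.name)
      label: Unknown
      format:
        type: number
        decimals: 0
    filters:
      - field: k8s.pod.phase
        equals: '5'
```

The existing four cards already fill the 48-unit row at `w: 12` each, so adding a fifth would mean resizing them or placing this card on its own row.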

Workload Health (02-workload-health.yaml)

---
# Kubernetes Workload Health Dashboard
# SRE Question: "Are my deployments healthy? What's crashing?"
dashboards:
  - id: k8s-cluster-workloads
    name: '[Metrics K8s Cluster] Workload Health'
    description: Deployment, StatefulSet, DaemonSet, and container health monitoring
    controls:
      - type: options
        label: Namespace
        data_view: metrics-*
        field: k8s.namespace.name
      - type: options
        label: Deployment
        data_view: metrics-*
        field: k8s.deployment.name
    filters:
      - field: data_stream.dataset
        equals: kubernetesclusterreceiver.otel
    panels:
      # ═══════════════════════════════════════════════════════════════════════
      # NAVIGATION
      # ═══════════════════════════════════════════════════════════════════════
      - title: Navigation
        size: {w: 48, h: 3}
        links:
          layout: horizontal
          items:
            - label: 📊 Overview
              dashboard: k8s-cluster-overview
            - label: ⚙️ Workloads
              dashboard: k8s-cluster-workloads
            - label: 📦 Resources
              dashboard: k8s-cluster-resources
            - label: 🔄 Batch Jobs
              dashboard: k8s-cluster-batch
            - label: 📈 Autoscaling
              dashboard: k8s-cluster-hpa

      # ═══════════════════════════════════════════════════════════════════════
      # CONTAINER HEALTH SUMMARY (4 metric cards)
      # ═══════════════════════════════════════════════════════════════════════
      - title: Container Health
        size: {w: 48, h: 3}
        markdown:
          content: '## 🐳 Container Health'
          font_size: 14
      - title: Ready Containers
        description: Containers that have passed their readiness probe and can receive traffic.
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: unique_count(k8s.container.name)
            label: Ready
            format:
              type: number
              decimals: 0
          filters:
            - field: k8s.container.ready
              equals: '1'
      - title: Not Ready Containers
        description: >-
          Containers that are failing their readiness probe. Check pod logs and
          events for startup or health issues.
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: unique_count(k8s.container.name)
            label: Not Ready
            format:
              type: number
              decimals: 0
          filters:
            - field: k8s.container.ready
              equals: '0'
      - title: Total Restarts
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.container.restarts)
            label: Restarts
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.container.restarts
      - title: Containers Restarting
        description: >-
          Containers that have restarted at least once since pod creation.
          Frequent restarts indicate instability.
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: unique_count(k8s.container.name)
            label: With Restarts
            format:
              type: number
              decimals: 0
          filters:
            - field: k8s.container.restarts
              gt: '0'

      # ═══════════════════════════════════════════════════════════════════════
      # DEPLOYMENT HEALTH
      # ═══════════════════════════════════════════════════════════════════════
      - title: Deployment Health
        size: {w: 48, h: 3}
        markdown:
          content: '## 🚀 Deployment Health (Desired vs Available)'
          font_size: 14
      - title: Deployments
        size: {w: 24, h: 12}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.deployment.desired)
              label: Desired
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.deployment.available)
              label: Available
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.deployment.name
      - title: StatefulSets
        size: {w: 24, h: 12}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.statefulset.desired_pods)
              label: Desired
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.statefulset.ready_pods)
              label: Ready
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.statefulset.name
      - title: DaemonSets
        size: {w: 24, h: 12}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.daemonset.desired_scheduled_nodes)
              label: Desired Nodes
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.daemonset.ready_nodes)
              label: Ready Nodes
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.daemonset.name
      - title: ReplicaSets
        size: {w: 24, h: 12}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.replicaset.desired)
              label: Desired
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.replicaset.available)
              label: Available
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.replicaset.name

      # ═══════════════════════════════════════════════════════════════════════
      # CONTAINER ANALYSIS
      # ═══════════════════════════════════════════════════════════════════════
      - title: Container Analysis
        size: {w: 48, h: 3}
        markdown:
          content: '## 📊 Container Analysis'
          font_size: 14
      - title: Container Readiness Over Time
        description: >-
          Green (1) = ready, red (0) = not ready. Correlate dips with
          deployments or incidents.
        size: {w: 24, h: 12}
        lens:
          type: area
          mode: stacked
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          breakdown:
            field: k8s.container.ready
            type: values
            size: 2
          metrics:
            - aggregation: unique_count
              field: k8s.container.name
              label: Containers
              format:
                type: number
                decimals: 0
          color:
            palette: eui_amsterdam_color_blind
            assignments:
              - value: '1'
                color: '#54B399'
              - value: '0'
                color: '#D36086'
      - title: Top Restarting Containers
        size: {w: 24, h: 12}
        lens:
          type: bar
          data_view: metrics-*
          dimension:
            field: k8s.container.name
            type: values
            size: 15
            sort:
              by: Restarts
              direction: desc
          metrics:
            - field: k8s.container.restarts
              aggregation: max
              label: Restarts
              format:
                type: number
                decimals: 0
          filters:
            - field: k8s.container.restarts
              gt: '0'

      # ═══════════════════════════════════════════════════════════════════════
      # DETAIL TABLES
      # ═══════════════════════════════════════════════════════════════════════
      - title: Workload Details
        size: {w: 48, h: 3}
        markdown:
          content: '## 🔍 Workload Status Details'
          font_size: 14
      - title: Deployment Status
        size: {w: 24, h: 12}
        lens:
          type: datatable
          data_view: metrics-*
          breakdowns:
            - field: k8s.deployment.name
              type: values
              size: 20
              label: Deployment
            - field: k8s.namespace.name
              type: values
              size: 1
              label: Namespace
          metrics:
            - field: k8s.deployment.desired
              aggregation: max
              label: Desired
              format:
                type: number
                decimals: 0
            - field: k8s.deployment.available
              aggregation: max
              label: Available
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.deployment.name
      - title: StatefulSet Status
        size: {w: 24, h: 12}
        lens:
          type: datatable
          data_view: metrics-*
          breakdowns:
            - field: k8s.statefulset.name
              type: values
              size: 20
              label: StatefulSet
            - field: k8s.namespace.name
              type: values
              size: 1
              label: Namespace
          metrics:
            - field: k8s.statefulset.desired_pods
              aggregation: max
              label: Desired
              format:
                type: number
                decimals: 0
            - field: k8s.statefulset.ready_pods
              aggregation: max
              label: Ready
              format:
                type: number
                decimals: 0
            - field: k8s.statefulset.current_pods
              aggregation: max
              label: Current
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.statefulset.name
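
The detail tables in this dashboard cover Deployments and StatefulSets. A matching table for DaemonSets could be sketched with the same structure, reusing the two DaemonSet metrics already charted in this dashboard. This panel is hypothetical and not part of `02-workload-health.yaml`; the `Not Ready` column assumes the formula syntax accepts subtraction here as it does in the Cluster Overview dashboard.

```yaml
# Hypothetical extra panel, modeled on the StatefulSet Status table.
- title: DaemonSet Status
  size: {w: 24, h: 12}
  lens:
    type: datatable
    data_view: metrics-*
    breakdowns:
      - field: k8s.daemonset.name
        type: values
        size: 20
        label: DaemonSet
      - field: k8s.namespace.name
        type: values
        size: 1
        label: Namespace
    metrics:
      - field: k8s.daemonset.desired_scheduled_nodes
        aggregation: max
        label: Desired Nodes
        format:
          type: number
          decimals: 0
      - field: k8s.daemonset.ready_nodes
        aggregation: max
        label: Ready Nodes
        format:
          type: number
          decimals: 0
      # Nodes that should run the DaemonSet pod but have no ready pod yet
      - formula: max(k8s.daemonset.desired_scheduled_nodes) - max(k8s.daemonset.ready_nodes)
        label: Not Ready
        format:
          type: number
          decimals: 0
    filters:
      - exists: k8s.daemonset.name
```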

Resource Allocation (03-resource-allocation.yaml)

---
# Kubernetes Resource Allocation Dashboard
# SRE Question: "Am I running out of resources? Are workloads over/under-provisioned?"
dashboards:
  - id: k8s-cluster-resources
    name: '[Metrics K8s Cluster] Resource Allocation'
    description: CPU, memory, and storage requests vs limits for capacity planning
    controls:
      - type: options
        label: Namespace
        data_view: metrics-*
        field: k8s.namespace.name
      - type: options
        label: Node
        data_view: metrics-*
        field: k8s.node.name
    filters:
      - field: data_stream.dataset
        equals: kubernetesclusterreceiver.otel
    panels:
      # ═══════════════════════════════════════════════════════════════════════
      # NAVIGATION
      # ═══════════════════════════════════════════════════════════════════════
      - title: Navigation
        size: {w: 48, h: 3}
        links:
          layout: horizontal
          items:
            - label: 📊 Overview
              dashboard: k8s-cluster-overview
            - label: ⚙️ Workloads
              dashboard: k8s-cluster-workloads
            - label: 📦 Resources
              dashboard: k8s-cluster-resources
            - label: 🔄 Batch Jobs
              dashboard: k8s-cluster-batch
            - label: 📈 Autoscaling
              dashboard: k8s-cluster-hpa

      # ═══════════════════════════════════════════════════════════════════════
      # CLUSTER CAPACITY OVERVIEW
      # ═══════════════════════════════════════════════════════════════════════
      - title: Cluster Capacity
        size: {w: 48, h: 3}
        markdown:
          content: '## 📊 Cluster Capacity (Requests vs Limits)'
          font_size: 14
      - title: CPU Requests vs Limits
        size: {w: 24, h: 12}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.container.cpu_request)
              label: CPU Requests
            - formula: sum(k8s.container.cpu_limit)
              label: CPU Limits
          filters:
            - exists: k8s.container.name
      - title: Memory Requests vs Limits
        size: {w: 24, h: 12}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.container.memory_request)
              label: Memory Requests
              format:
                type: bytes
            - formula: sum(k8s.container.memory_limit)
              label: Memory Limits
              format:
                type: bytes
          filters:
            - exists: k8s.container.name
      - title: Storage Requests vs Limits
        size: {w: 24, h: 12}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.container.storage_request)
              label: Storage Requests
              format:
                type: bytes
            - formula: sum(k8s.container.storage_limit)
              label: Storage Limits
              format:
                type: bytes
          filters:
            - exists: k8s.container.storage_request
      - title: Resource Quota Usage
        size: {w: 24, h: 12}
        lens:
          type: bar
          mode: stacked
          data_view: metrics-*
          dimension:
            field: resource
            type: values
            size: 10
            label: Resource Type
          metrics:
            - field: k8s.resource_quota.used
              aggregation: max
              label: Used
            - formula: max(k8s.resource_quota.hard_limit) - max(k8s.resource_quota.used)
              label: Available
          filters:
            - exists: k8s.resource_quota.hard_limit
          color:
            palette: eui_amsterdam_color_blind
            assignments:
              - value: Used
                color: '#6092C0'
              - value: Available
                color: '#54B399'

      # ═══════════════════════════════════════════════════════════════════════
      # NAMESPACE ALLOCATION
      # ═══════════════════════════════════════════════════════════════════════
      - title: Namespace Allocation
        size: {w: 48, h: 3}
        markdown:
          content: '## 🏷️ Resource Allocation by Namespace'
          font_size: 14
      - title: CPU by Namespace
        size: {w: 24, h: 14}
        lens:
          type: bar
          mode: stacked
          data_view: metrics-*
          dimension:
            field: k8s.namespace.name
            type: values
            size: 15
            sort:
              by: CPU Limits
              direction: desc
          metrics:
            - formula: sum(k8s.container.cpu_request)
              label: CPU Requests
            - formula: sum(k8s.container.cpu_limit)
              label: CPU Limits
          filters:
            - exists: k8s.container.name
      - title: Memory by Namespace
        size: {w: 24, h: 14}
        lens:
          type: bar
          mode: stacked
          data_view: metrics-*
          dimension:
            field: k8s.namespace.name
            type: values
            size: 15
            sort:
              by: Memory Limits
              direction: desc
          metrics:
            - formula: sum(k8s.container.memory_request)
              label: Memory Requests
              format:
                type: bytes
            - formula: sum(k8s.container.memory_limit)
              label: Memory Limits
              format:
                type: bytes
          filters:
            - exists: k8s.container.name

      # ═══════════════════════════════════════════════════════════════════════
      # POD RESOURCE DETAILS
      # ═══════════════════════════════════════════════════════════════════════
      - title: Pod Details
        size: {w: 48, h: 3}
        markdown:
          content: '## 🔍 Pod Resource Details'
          font_size: 14
      - title: Pod Resource Summary
        size: {w: 48, h: 14}
        lens:
          type: datatable
          data_view: metrics-*
          breakdowns:
            - field: k8s.pod.name
              type: values
              size: 25
              label: Pod
            - field: k8s.namespace.name
              type: values
              size: 1
              label: Namespace
          metrics:
            - formula: sum(k8s.container.cpu_request)
              label: CPU Req
            - formula: sum(k8s.container.cpu_limit)
              label: CPU Lim
            - formula: sum(k8s.container.memory_request)
              label: Mem Req
              format:
                type: bytes
            - formula: sum(k8s.container.memory_limit)
              label: Mem Lim
              format:
                type: bytes
            - formula: sum(k8s.container.restarts)
              label: Restarts
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.container.name
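
The request and limit charts in this dashboard show the two series side by side. For a single over-commit figure, the ratio of limits to requests can be computed in one formula. The card below is a sketch, not part of `03-resource-allocation.yaml`, and assumes the `formula` field accepts division in the same way it accepts the subtraction used in the Resource Quota Usage panel.

```yaml
# Hypothetical extra panel: how far CPU limits exceed CPU requests.
- title: CPU Limit-to-Request Ratio
  description: >-
    Values above 1 mean containers may burst beyond the CPU reserved for
    them at scheduling time.
  hide_title: true
  size: {w: 12, h: 4}
  lens:
    type: metric
    data_view: metrics-*
    primary:
      formula: sum(k8s.container.cpu_limit) / sum(k8s.container.cpu_request)
      label: CPU Limit / Request
      format:
        type: number
        decimals: 2
    filters:
      - exists: k8s.container.name
```

For example, if the limits total 12 cores and the requests total 8 cores, the card shows 1.50.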

Batch Jobs (04-batch-jobs.yaml)

---
# Kubernetes Batch Jobs Dashboard
# SRE Question: "Are my jobs completing successfully? What's failing?"
dashboards:
  - id: k8s-cluster-batch
    name: '[Metrics K8s Cluster] Batch Jobs'
    description: Job and CronJob execution status and completion tracking
    controls:
      - type: options
        label: Namespace
        data_view: metrics-*
        field: k8s.namespace.name
      - type: options
        label: Job
        data_view: metrics-*
        field: k8s.job.name
    filters:
      - field: data_stream.dataset
        equals: kubernetesclusterreceiver.otel
    panels:
      # ═══════════════════════════════════════════════════════════════════════
      # NAVIGATION
      # ═══════════════════════════════════════════════════════════════════════
      - title: Navigation
        size: {w: 48, h: 3}
        links:
          layout: horizontal
          items:
            - label: 📊 Overview
              dashboard: k8s-cluster-overview
            - label: ⚙️ Workloads
              dashboard: k8s-cluster-workloads
            - label: 📦 Resources
              dashboard: k8s-cluster-resources
            - label: 🔄 Batch Jobs
              dashboard: k8s-cluster-batch
            - label: 📈 Autoscaling
              dashboard: k8s-cluster-hpa

      # ═══════════════════════════════════════════════════════════════════════
      # JOB STATUS SUMMARY (4 metric cards)
      # ═══════════════════════════════════════════════════════════════════════
      - title: Job Status Summary
        size: {w: 48, h: 3}
        markdown:
          content: '## 📋 Job Status Summary'
          font_size: 14
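      - title: Successful Jobs
        description: >-
          Sum of k8s.job.successful_pods, that is, pods that completed
          successfully. This is a pod count, not a count of Job objects.
        hide_title: true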
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.job.successful_pods)
            label: Successful
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.job.name
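      - title: Failed Jobs
        description: >-
          Sum of k8s.job.failed_pods, that is, pods that failed. This is a pod
          count, not a count of Job objects.
        hide_title: true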
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.job.failed_pods)
            label: Failed
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.job.name
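      - title: Active Jobs
        description: >-
          Sum of k8s.job.active_pods, that is, pods currently running for jobs.
          This is a pod count, not a count of Job objects.
        hide_title: true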
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.job.active_pods)
            label: Active
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.job.name
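      - title: Active CronJobs
        description: >-
          Sum of k8s.cronjob.active_jobs, that is, jobs currently running that
          were started by CronJobs.
        hide_title: true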
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.cronjob.active_jobs)
            label: Active
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.cronjob.name

      # ═══════════════════════════════════════════════════════════════════════
      # JOB EXECUTION TRENDS
      # ═══════════════════════════════════════════════════════════════════════
      - title: Job Execution Trends
        size: {w: 48, h: 3}
        markdown:
          content: '## 📈 Job Execution Trends'
          font_size: 14
      - title: Job Success vs Failure Trend
        size: {w: 48, h: 14}
        lens:
          type: area
          mode: stacked
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.job.successful_pods)
              label: Successful
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.job.failed_pods)
              label: Failed
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.job.active_pods)
              label: Active
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.job.name
          color:
            palette: eui_amsterdam_color_blind
            assignments:
              - value: Successful
                color: '#54B399'
              - value: Failed
                color: '#D36086'
              - value: Active
                color: '#6092C0'

      # ═══════════════════════════════════════════════════════════════════════
      # JOB DETAILS
      # ═══════════════════════════════════════════════════════════════════════
      - title: Job Details
        size: {w: 48, h: 3}
        markdown:
          content: '## 🔍 Job Details'
          font_size: 14
      - title: Jobs by Status
        size: {w: 48, h: 14}
        lens:
          type: datatable
          data_view: metrics-*
          breakdowns:
            - field: k8s.job.name
              type: values
              size: 25
              label: Job
            - field: k8s.namespace.name
              type: values
              size: 1
              label: Namespace
          metrics:
            - field: k8s.job.active_pods
              aggregation: max
              label: Active
              format:
                type: number
                decimals: 0
            - field: k8s.job.successful_pods
              aggregation: max
              label: Successful
              format:
                type: number
                decimals: 0
            - field: k8s.job.failed_pods
              aggregation: max
              label: Failed
              format:
                type: number
                decimals: 0
            - field: k8s.job.desired_successful_pods
              aggregation: max
              label: Desired
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.job.name
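
As a possible extension of the Jobs by Status table, a metric card could count the Jobs that report at least one failed pod. This is a sketch only and is not part of `04-batch-jobs.yaml`. It assumes the `formula` key passes Kibana Lens formula syntax through unchanged, so that `unique_count` with a `kql` argument is accepted. The panel title and label are hypothetical.

```yaml
# Hypothetical extra panel for the Batch Jobs dashboard (not in the original file)
- title: Jobs with Failed Pods
  hide_title: true
  size: {w: 12, h: 4}
  lens:
    type: metric
    data_view: metrics-*
    primary:
      # Counts distinct Job names with at least one sample where failed_pods > 0
      formula: "unique_count(k8s.job.name, kql='k8s.job.failed_pods > 0')"
      label: Jobs with Failures
      format:
        type: number
        decimals: 0
    filters:
      - exists: k8s.job.name
```

Counting distinct Job names rather than summing `k8s.job.failed_pods` keeps the card independent of how many samples fall inside the selected time range.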

Autoscaling (05-autoscaling.yaml)

---
# Kubernetes Autoscaling Dashboard
# SRE Question: "Is autoscaling working? Am I hitting limits?"
dashboards:
  - id: k8s-cluster-hpa
    name: '[Metrics K8s Cluster] Autoscaling'
    description: Horizontal Pod Autoscaler scaling behavior and capacity tracking
    controls:
      - type: options
        label: Namespace
        data_view: metrics-*
        field: k8s.namespace.name
      - type: options
        label: HPA
        data_view: metrics-*
        field: k8s.hpa.name
    filters:
      - field: data_stream.dataset
        equals: kubernetesclusterreceiver.otel
    panels:
      # ═══════════════════════════════════════════════════════════════════════
      # NAVIGATION
      # ═══════════════════════════════════════════════════════════════════════
      - title: Navigation
        size: {w: 48, h: 3}
        links:
          layout: horizontal
          items:
            - label: 📊 Overview
              dashboard: k8s-cluster-overview
            - label: ⚙️ Workloads
              dashboard: k8s-cluster-workloads
            - label: 📦 Resources
              dashboard: k8s-cluster-resources
            - label: 🔄 Batch Jobs
              dashboard: k8s-cluster-batch
            - label: 📈 Autoscaling
              dashboard: k8s-cluster-hpa

      # ═══════════════════════════════════════════════════════════════════════
      # HPA STATUS SUMMARY (4 metric cards)
      # ═══════════════════════════════════════════════════════════════════════
      - title: HPA Status Summary
        size: {w: 48, h: 3}
        markdown:
          content: '## 📈 HPA Status Summary'
          font_size: 14
      - title: Total HPAs
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: unique_count(k8s.hpa.name)
            label: HPAs
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.hpa.name
      - title: Current Replicas
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.hpa.current_replicas)
            label: Current
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.hpa.name
      - title: Desired Replicas
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.hpa.desired_replicas)
            label: Desired
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.hpa.name
      - title: Max Replicas Limit
        hide_title: true
        size: {w: 12, h: 4}
        lens:
          type: metric
          data_view: metrics-*
          primary:
            formula: sum(k8s.hpa.max_replicas)
            label: Max Total
            format:
              type: number
              decimals: 0
          filters:
            - exists: k8s.hpa.name

      # ═══════════════════════════════════════════════════════════════════════
      # SCALING BEHAVIOR
      # ═══════════════════════════════════════════════════════════════════════
      - title: Scaling Behavior
        size: {w: 48, h: 3}
        markdown:
          content: '## 🔄 Scaling Behavior'
          font_size: 14
      - title: Scaling Activity (Current vs Desired)
        size: {w: 24, h: 14}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.hpa.current_replicas)
              label: Current Replicas
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.hpa.desired_replicas)
              label: Desired Replicas
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.hpa.name
      - title: Capacity Headroom (Min / Current / Max)
        size: {w: 24, h: 14}
        lens:
          type: line
          data_view: metrics-*
          dimension:
            field: '@timestamp'
            type: date_histogram
          metrics:
            - formula: sum(k8s.hpa.min_replicas)
              label: Min Replicas
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.hpa.current_replicas)
              label: Current Replicas
              format:
                type: number
                decimals: 0
            - formula: sum(k8s.hpa.max_replicas)
              label: Max Replicas
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.hpa.name

      # ═══════════════════════════════════════════════════════════════════════
      # HPA DETAILS
      # ═══════════════════════════════════════════════════════════════════════
      - title: HPA Details
        size: {w: 48, h: 3}
        markdown:
          content: '## 🔍 HPA Configuration & Status'
          font_size: 14
      - title: HPA Status by Name
        size: {w: 48, h: 14}
        lens:
          type: datatable
          data_view: metrics-*
          breakdowns:
            - field: k8s.hpa.name
              type: values
              size: 25
              label: HPA
            - field: k8s.namespace.name
              type: values
              size: 1
              label: Namespace
          metrics:
            - field: k8s.hpa.current_replicas
              aggregation: max
              label: Current
              format:
                type: number
                decimals: 0
            - field: k8s.hpa.desired_replicas
              aggregation: max
              label: Desired
              format:
                type: number
                decimals: 0
            - field: k8s.hpa.min_replicas
              aggregation: max
              label: Min
              format:
                type: number
                decimals: 0
            - field: k8s.hpa.max_replicas
              aggregation: max
              label: Max
              format:
                type: number
                decimals: 0
          filters:
            - exists: k8s.hpa.name
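
The Capacity Headroom panel plots min, current, and max replicas as separate lines. The same idea can be condensed into a single ratio. The following is a sketch only and is not part of `05-autoscaling.yaml`. It assumes the `formula` key accepts Lens formula arithmetic. The panel title and label are hypothetical.

```yaml
# Hypothetical extra panel for the Autoscaling dashboard (not in the original file)
- title: HPA Capacity Used
  hide_title: true
  size: {w: 12, h: 4}
  lens:
    type: metric
    data_view: metrics-*
    primary:
      # Ratio of current replicas to the configured maximum, from 0 to 1
      formula: sum(k8s.hpa.current_replicas) / sum(k8s.hpa.max_replicas)
      label: Capacity Used
      format:
        type: number
        decimals: 2
    filters:
      - exists: k8s.hpa.name
```

A value close to 1 suggests the selected HPAs are running near their `max_replicas` limit and have little room left to scale out.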

Prerequisites¶

Kubernetes cluster: v1.24+
OpenTelemetry Collector: Contrib distribution with k8sclusterreceiver
Kibana: Version 8.x or later
Cluster admin permissions: For RBAC configuration
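
The RBAC configuration mentioned above could look roughly like the following sketch. It grants read-only access to the resource types that the metrics in this document are derived from. The names `otel-collector` and `observability` are placeholders, and the resource list is an assumption that should be checked against the receiver documentation for your Collector version.

```yaml
# Sketch of read-only RBAC for the collector's service account.
# All names and the namespace are placeholders; adjust to your deployment.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  - apiGroups: [""]
    resources:
      - events
      - namespaces
      - namespaces/status
      - nodes
      - nodes/spec
      - pods
      - pods/status
      - replicationcontrollers
      - replicationcontrollers/status
      - resourcequotas
      - services
    verbs: [get, list, watch]
  - apiGroups: [apps]
    resources: [daemonsets, deployments, replicasets, statefulsets]
    verbs: [get, list, watch]
  - apiGroups: [batch]
    resources: [jobs, cronjobs]
    verbs: [get, list, watch]
  - apiGroups: [autoscaling]
    resources: [horizontalpodautoscalers]
    verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: observability
```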

Data Requirements¶

Data stream dataset: kubernetesclusterreceiver.otel
Data view: metrics-*

OpenTelemetry Collector Configuration¶

Receiver Configuration¶

receivers:
  k8s_cluster:
    auth_type: serviceAccount
    collection_interval: 10s
    node_conditions_to_report: [Ready]
    distribution: kubernetes
    allocatable_types_to_report: [cpu, memory, ephemeral-storage, storage]
    metadata_collection_interval: 5m

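processors:
  # Placeholder definitions (assumptions): every processor referenced in the
  # pipeline must be defined, so adjust these to your environment.
  batch: {}
  resourcedetection:
    detectors: [env, system]
  resource:
    attributes:
      - key: k8s.cluster.name
        value: my-cluster  # hypothetical cluster name
        action: upsert

extensions:
  # The authenticator referenced by the exporter must be defined and enabled.
  basicauth:
    client_auth:
      username: ${env:ELASTICSEARCH_USERNAME}  # hypothetical variable names
      password: ${env:ELASTICSEARCH_PASSWORD}

exporters:
  elasticsearch:
    endpoints: ["https://elasticsearch:9200"]
    auth:
      authenticator: basicauth
    mapping:
      # OTel mapping mode keeps the k8s.* field names and adds the .otel
      # dataset suffix that the dashboards filter on.
      mode: otel

service:
  extensions: [basicauth]
  pipelines:
    metrics:
      receivers: [k8s_cluster]
      processors: [batch, resourcedetection, resource]
      exporters: [elasticsearch]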

Receiver Configuration Options¶

YAML Key	Type	Description	Default
`auth_type`	string	Kubernetes API authentication method (`serviceAccount`, `kubeConfig`)	`serviceAccount`
`collection_interval`	duration	Metric collection frequency	`10s`
`node_conditions_to_report`	list	Node conditions to monitor	`[Ready]`
`distribution`	string	Cluster type (`kubernetes`, `openshift`)	`kubernetes`
`allocatable_types_to_report`	list	Node allocatable resource types to report (`cpu`, `memory`, `ephemeral-storage`, `storage`)	`[]`
`metadata_collection_interval`	duration	Entity metadata collection frequency	`5m`
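
As an illustration of how these options combine, the following sketch shows a variant intended for testing from a workstation. It authenticates through a local kubeconfig instead of a service account, collects less often, and reports additional node conditions. The specific values are assumptions chosen for the example, not recommendations.

```yaml
receivers:
  k8s_cluster:
    # Use the local kubeconfig rather than an in-cluster service account
    auth_type: kubeConfig
    # Poll less frequently than the 10s default to reduce API server load
    collection_interval: 30s
    # Report pressure conditions in addition to Ready
    node_conditions_to_report: [Ready, MemoryPressure, DiskPressure]
    # Only report allocatable CPU and memory
    allocatable_types_to_report: [cpu, memory]
    metadata_collection_interval: 5m
```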

Metrics Reference¶

All metrics in the following tables are enabled by default, except those listed under Optional Metrics.

Container Metrics¶

Metric	Type	Unit	Description
`k8s.container.cpu_limit`	Gauge	`{cpu}`	Maximum CPU resource limit for container
`k8s.container.cpu_request`	Gauge	`{cpu}`	CPU resources requested for container
`k8s.container.memory_limit`	Gauge	`By`	Maximum memory resource limit
`k8s.container.memory_request`	Gauge	`By`	Memory resources requested
`k8s.container.storage_limit`	Gauge	`By`	Maximum storage resource limit
`k8s.container.storage_request`	Gauge	`By`	Storage resources requested
`k8s.container.ephemeralstorage_limit`	Gauge	`By`	Maximum ephemeral storage limit
`k8s.container.ephemeralstorage_request`	Gauge	`By`	Ephemeral storage requested
`k8s.container.ready`	Gauge	—	Whether container passed readiness probe (0/1)
`k8s.container.restarts`	Gauge	`{restart}`	Container restart count

Pod Metrics¶

Metric	Type	Unit	Description
`k8s.pod.phase`	Gauge	—	Current pod phase (numeric encoding, see below)

Deployment Metrics¶

Metric	Type	Unit	Description
`k8s.deployment.desired`	Gauge	`{pod}`	Desired pod count in deployment
`k8s.deployment.available`	Gauge	`{pod}`	Available pods (ready for at least minReadySeconds)

StatefulSet Metrics¶

Metric	Type	Unit	Description
`k8s.statefulset.desired_pods`	Gauge	`{pod}`	Desired pods (spec.replicas)
`k8s.statefulset.ready_pods`	Gauge	`{pod}`	Pods with Ready condition
`k8s.statefulset.current_pods`	Gauge	`{pod}`	Pods created from the StatefulSet's current revision
`k8s.statefulset.updated_pods`	Gauge	`{pod}`	Pods created from the StatefulSet's update revision

DaemonSet Metrics¶

Metric	Type	Unit	Description
`k8s.daemonset.desired_scheduled_nodes`	Gauge	`{node}`	Nodes that should run daemon pods
`k8s.daemonset.current_scheduled_nodes`	Gauge	`{node}`	Nodes running daemon pods as intended
`k8s.daemonset.ready_nodes`	Gauge	`{node}`	Nodes with ready daemon pods
`k8s.daemonset.misscheduled_nodes`	Gauge	`{node}`	Nodes running daemon pods incorrectly

ReplicaSet Metrics¶

Metric	Type	Unit	Description
`k8s.replicaset.desired`	Gauge	`{pod}`	Desired pod count in replicaset
`k8s.replicaset.available`	Gauge	`{pod}`	Available pods targeted by replicaset

Job Metrics¶

Metric	Type	Unit	Description
`k8s.job.active_pods`	Gauge	`{pod}`	Actively running job pods
`k8s.job.desired_successful_pods`	Gauge	`{pod}`	Desired successful pod count
`k8s.job.successful_pods`	Gauge	`{pod}`	Pods in Succeeded phase
`k8s.job.failed_pods`	Gauge	`{pod}`	Pods in Failed phase
`k8s.job.max_parallel_pods`	Gauge	`{pod}`	Maximum concurrent pods

CronJob Metrics¶

Metric	Type	Unit	Description
`k8s.cronjob.active_jobs`	Gauge	`{job}`	Count of actively running jobs

HPA Metrics¶

Metric	Type	Unit	Description
`k8s.hpa.current_replicas`	Gauge	`{pod}`	Current pod replicas managed by autoscaler
`k8s.hpa.desired_replicas`	Gauge	`{pod}`	Desired pod replicas for autoscaler
`k8s.hpa.min_replicas`	Gauge	`{pod}`	Minimum autoscaler replica count
`k8s.hpa.max_replicas`	Gauge	`{pod}`	Maximum autoscaler replica count

Resource Quota Metrics¶

Metric	Type	Unit	Description	Attributes
`k8s.resource_quota.hard_limit`	Gauge	`{resource}`	Upper resource limit in namespace quota	`resource`
`k8s.resource_quota.used`	Gauge	`{resource}`	Resource usage against quota	`resource`

Namespace Metrics¶

Metric	Type	Unit	Description
`k8s.namespace.phase`	Gauge	—	Current phase (1=active, 0=terminating)

Optional Metrics (disabled by default)¶

Metric	Type	Unit	Description	Attributes
`k8s.container.status.reason`	Sum	`{container}`	Container count by status reason	`k8s.container.status.reason`
`k8s.container.status.state`	Sum	`{container}`	Container count by state	`k8s.container.status.state`
`k8s.node.condition`	Gauge	`{condition}`	Node condition status	`condition`
`k8s.pod.status_reason`	Gauge	—	Pod status reason (numeric encoding)	—
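
Optional metrics are switched on per metric in the receiver configuration. The sketch below enables two of the metrics from the table above. It assumes the standard per-metric `metrics.<name>.enabled` setting that Collector receivers generally expose, so verify the exact keys against your Collector version.

```yaml
receivers:
  k8s_cluster:
    auth_type: serviceAccount
    metrics:
      # Each optional metric is enabled individually by name
      k8s.container.status.reason:
        enabled: true
      k8s.node.condition:
        enabled: true
```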

Phase Value Encoding¶

The k8s.pod.phase metric uses numeric values:

Value	Phase
`1`	Pending
`2`	Running
`3`	Succeeded
`4`	Failed
`5`	Unknown
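
The numeric encoding can be used directly in a panel filter. The sketch below counts pods that reported the Failed phase (value `4`). It is not part of the shipped dashboards and makes two assumptions: that the `formula` key accepts a Lens `kql` argument, and that pod metrics carry a `k8s.pod.name` field.

```yaml
# Hypothetical panel: number of distinct pods seen in the Failed phase
- title: Failed Pods
  hide_title: true
  size: {w: 12, h: 4}
  lens:
    type: metric
    data_view: metrics-*
    primary:
      # k8s.pod.phase value 4 corresponds to Failed in the table above
      formula: "unique_count(k8s.pod.name, kql='k8s.pod.phase: 4')"
      label: Failed
      format:
        type: number
        decimals: 0
    filters:
      - exists: k8s.pod.name
```

Swapping the value in the `kql` argument for `1` would give the same card for Pending pods.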

Metric Attributes¶

Attribute	Values	Description
`resource`	`cpu`, `memory`, `pods`, `requests.cpu`, `requests.memory`, `limits.cpu`, `limits.memory`	Resource quota type
`k8s.container.status.reason`	`ContainerCreating`, `CrashLoopBackOff`, `CreateContainerConfigError`, `ErrImagePull`, `ImagePullBackOff`, `OOMKilled`, `Completed`, `Error`, `ContainerCannotRun`	Container status reason
`k8s.container.status.state`	`terminated`, `running`, `waiting`	Container state
`condition`	`Ready`, `MemoryPressure`, `PIDPressure`, `DiskPressure`	Node condition

Metrics Not Used in Dashboards¶

The following metrics are available from the k8sclusterreceiver but are not currently visualized in the dashboards:

Default Metrics Not Used¶

Metric	Type	Unit	Description
`k8s.container.ephemeralstorage_limit`	Gauge	`By`	Maximum ephemeral storage limit
`k8s.container.ephemeralstorage_request`	Gauge	`By`	Ephemeral storage requested
`k8s.statefulset.updated_pods`	Gauge	`{pod}`	Pods created from the StatefulSet's update revision
`k8s.daemonset.current_scheduled_nodes`	Gauge	`{node}`	Nodes running daemon pods as intended
`k8s.daemonset.misscheduled_nodes`	Gauge	`{node}`	Nodes running daemon pods incorrectly
`k8s.job.max_parallel_pods`	Gauge	`{pod}`	Maximum concurrent pods
`k8s.namespace.phase`	Gauge	—	Current phase (1=active, 0=terminating)
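
Any of these metrics can be added to a dashboard with the same panel structure used in the definitions above. The sketch below shows a table for the two unused DaemonSet metrics. It is hypothetical, and it assumes DaemonSet metrics carry a `k8s.daemonset.name` field in the same way that Job and HPA metrics carry their name fields.

```yaml
# Hypothetical panel: DaemonSet scheduling status per DaemonSet
- title: DaemonSet Scheduling
  size: {w: 48, h: 14}
  lens:
    type: datatable
    data_view: metrics-*
    breakdowns:
      - field: k8s.daemonset.name
        type: values
        size: 25
        label: DaemonSet
    metrics:
      - field: k8s.daemonset.current_scheduled_nodes
        aggregation: max
        label: Scheduled
        format:
          type: number
          decimals: 0
      - field: k8s.daemonset.misscheduled_nodes
        aggregation: max
        label: Misscheduled
        format:
          type: number
          decimals: 0
    filters:
      - exists: k8s.daemonset.name
```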

Optional Metrics Not Used¶

Metric	Type	Unit	Description	Attributes
`k8s.container.status.reason`	Sum	`{container}`	Container count by status reason	`k8s.container.status.reason`
`k8s.container.status.state`	Sum	`{container}`	Container count by state	`k8s.container.status.state`
`k8s.node.condition`	Gauge	`{condition}`	Node condition status	`condition`
`k8s.pod.status_reason`	Gauge	—	Pod status reason (numeric encoding)	—