
Create a custom check

Diagnose ships with a set of built-in checks that cover common Kubernetes issues. You can browse the source for existing checks and extend Diagnose with custom checks to validate conditions specific to your setup.

This guide walks you through the process of creating, registering, and testing a custom diagnostic check.

How checks work

Every diagnostic check is a bash script that:

  1. Validates that the resources it needs exist (pods, services, ingresses).
  2. Reads data from pre-collected JSON files.
  3. Analyzes the data and prints findings.
  4. Reports a result with a status and evidence.

Checks don't make Kubernetes API calls. During a Diagnose run, the build_context step collects a snapshot of the cluster before any checks execute. Your check reads from that snapshot, which keeps results consistent and fast.
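To see what "reading from the snapshot" looks like in practice, here is a minimal sketch. The local `PODS_FILE` below is a stand-in for the real `$PODS_FILE` exported by build_context, with a hand-written two-pod snapshot:

```shell
# Sketch: count running pods from a pre-collected snapshot file.
# PODS_FILE here is a stand-in for the real $PODS_FILE from build_context.
PODS_FILE=$(mktemp)
cat > "$PODS_FILE" <<'EOF'
{"items":[
  {"metadata":{"name":"web-1"},"status":{"phase":"Running"}},
  {"metadata":{"name":"web-2"},"status":{"phase":"Pending"}}
]}
EOF

# No kubectl call: everything comes from the snapshot taken before checks ran.
RUNNING=$(jq '[.items[] | select(.status.phase == "Running")] | length' "$PODS_FILE")
echo "Running pods: $RUNNING"
rm -f "$PODS_FILE"
```

Because every check reads the same frozen snapshot, two checks can never disagree about cluster state within a single run.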

Check statuses

Each check must report exactly one of these statuses:

  Status    Meaning
  success   The check passed. No issues found.
  failed    The check found a problem that needs attention.
  warning   Something looks unusual but isn't necessarily broken.
  skipped   The check couldn't run because a prerequisite was missing (e.g., no pods exist).
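The four statuses map onto a simple decision order: skip if there is nothing to inspect, fail on concrete problems, warn on oddities, succeed otherwise. As an illustration only (real checks report via update_check_result, not a helper like this):

```shell
# Illustrative decision function; choose_status is NOT part of diagnose_utils.
choose_status() {
  local resources_found=$1 issues=$2 oddities=$3
  if [ "$resources_found" -eq 0 ]; then echo "skipped"   # nothing to inspect
  elif [ "$issues" -gt 0 ];      then echo "failed"      # concrete problems
  elif [ "$oddities" -gt 0 ];    then echo "warning"     # unusual, not broken
  else                                echo "success"
  fi
}

choose_status 0 0 0   # prints: skipped
choose_status 5 2 0   # prints: failed
choose_status 5 0 1   # prints: warning
choose_status 5 0 0   # prints: success
```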

Step 1: Create the check script

Create a new file in the appropriate category folder under k8s/diagnose/:

  • networking/ for ingress, routing, and load balancer checks
  • scope/ for pod and workload checks
  • service/ for Kubernetes Service checks

The filename becomes the check identifier. Use snake_case with no file extension.

touch k8s/diagnose/scope/my_custom_check
chmod +x k8s/diagnose/scope/my_custom_check

Step 2: Write the check logic

Here's the structure every check should follow:

#!/bin/bash
# Check: My Custom Check
# Validates that [what this check does]

# 1. Validate prerequisites
require_pods || return 0

# 2. Read data from pre-collected files
PODS=$(jq -r '.items[].metadata.name' "$PODS_FILE" 2>/dev/null | tr '\n' ' ')

# 3. Analyze and report
HAS_ISSUES=0

for POD_NAME in $PODS; do
  POD_INFO=$(jq --arg name "$POD_NAME" \
    '.items[] | select(.metadata.name == $name)' \
    "$PODS_FILE" 2>/dev/null)

  # Your validation logic here
  SOME_VALUE=$(echo "$POD_INFO" | jq -r '.spec.someField // empty')

  if [[ -z "$SOME_VALUE" ]]; then
    HAS_ISSUES=1
    print_error "Pod $POD_NAME: Missing someField"
    print_action "Add someField to your deployment spec"
  else
    print_success "Pod $POD_NAME: someField is configured"
  fi
done

# 4. Report final result
if [[ $HAS_ISSUES -eq 1 ]]; then
  update_check_result --status "failed" --evidence '{"checked": true}'
else
  update_check_result --status "success" --evidence '{"checked": true}'
fi

Always return 0 after require

Use require_pods || return 0 (not return 1). Returning 0 tells the executor the script ran successfully, even though the check was skipped. Returning 1 would cause the executor to treat it as an unexpected error.
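The difference is observable in the script's exit status. A minimal sketch, with require_pods stubbed to simulate "no pods found" (in a real check it comes from diagnose_utils):

```shell
# Stub: simulate a missing prerequisite. The real require_pods also marks
# the check as "skipped" before returning 1.
require_pods() { return 1; }

check_skips_cleanly() {
  require_pods || return 0   # skipped, but the script itself succeeded
  echo "never reached"
}

check_skips_cleanly
echo "exit code: $?"         # prints: exit code: 0
```

With `|| return 1` instead, the function would exit non-zero and the executor would report an unexpected error rather than a skip.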

Step 3: Register the check in the workflow

Open the workflow.yml file in the same category folder and add your check:

# k8s/diagnose/scope/workflow.yml
steps:
  # ... existing checks ...
  - name: My Custom Check
    description: Validates that [what it checks]
    category: Scope
    type: script
    file: "$SERVICE_PATH/diagnose/scope/my_custom_check"

The category field determines how the check is grouped in the UI. Use one of:

  • Networking for networking checks
  • Scope for scope checks
  • K8s Service for service checks

The name and description are what users see in the Diagnose results.

Available data files

The build_context step exports these environment variables pointing to pre-collected JSON files:

  Variable                    Contents
  $PODS_FILE                  Pods matching the scope/deployment labels
  $SERVICES_FILE              Services matching the labels
  $ENDPOINTS_FILE             All endpoints in the namespace
  $INGRESSES_FILE             Ingresses matching the scope labels
  $SECRETS_FILE               Secret metadata (no secret data)
  $INGRESSCLASSES_FILE        Available IngressClasses in the cluster
  $EVENTS_FILE                Recent events in the namespace
  $ALB_CONTROLLER_PODS_FILE   ALB controller pod information
  $ALB_CONTROLLER_LOGS_DIR    Directory with ALB controller log files

You also have access to these context variables:

  Variable                 Contents
  $NAMESPACE               The Kubernetes namespace
  $SCOPE_ID                The nullplatform scope ID
  $DEPLOYMENT_ID           The deployment ID (if running a deployment-level diagnosis)
  $LABEL_SELECTOR          Full label selector including scope and deployment
  $SCOPE_LABEL_SELECTOR    Label selector for scope only (scope_id=...)
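If a check needs the raw value out of a selector, plain parameter expansion is enough. A sketch with an illustrative selector value (the real `$SCOPE_LABEL_SELECTOR` is exported by build_context):

```shell
# Illustrative value; in a real check this variable is already exported.
SCOPE_LABEL_SELECTOR="scope_id=12345"

# Strip the "scope_id=" prefix to recover the ID.
SCOPE_ID_FROM_SELECTOR=${SCOPE_LABEL_SELECTOR#scope_id=}
echo "$SCOPE_ID_FROM_SELECTOR"   # prints: 12345
```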

Helper functions reference

These functions are available to every check, loaded automatically from diagnose_utils.

Output functions

Use these to print formatted messages. They appear in the check logs that users see in the UI.

print_success "All pods are healthy"        # ✓ green
print_error "Pod web-123 is not ready"      # ✗ red
print_warning "High restart count detected" # ⚠ yellow
print_info "Checking pod web-123"           # ℹ cyan
print_action "Increase memory limits"       # 🔧 cyan

Resource validation

These functions check whether prerequisite resources exist. If they don't, the check is automatically set to "skipped" and the function returns 1.

require_pods      # Checks $PODS_FILE has items
require_services  # Checks $SERVICES_FILE has items
require_ingresses # Checks $INGRESSES_FILE has items

Always call these before accessing the corresponding data files:

require_pods || return 0
# Safe to read $PODS_FILE from here

Reporting results

Call update_check_result once at the end of your check to report the final status.

# Named parameters
update_check_result --status "success" --evidence '{"pods_checked": 3}'

# Positional parameters
update_check_result "failed" '{"reason": "no healthy endpoints"}'

The evidence parameter must be valid JSON. It's stored with the check result and can contain any data that helps explain what the check found.
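When evidence values come from variables, building the JSON with jq is safer than hand-escaping a string, since `--arg`/`--argjson` quote values correctly even if they contain spaces or quotes. A sketch with illustrative values:

```shell
# Sketch: assemble evidence JSON with jq instead of manual escaping.
# --argjson keeps numbers as numbers; --arg quotes strings safely.
EVIDENCE=$(jq -nc \
  --argjson total 3 \
  --arg reason "no healthy endpoints" \
  '{total: $total, reason: $reason}')
echo "$EVIDENCE"   # prints: {"total":3,"reason":"no healthy endpoints"}
```

The result can then be passed straight to `update_check_result --evidence "$EVIDENCE"`.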

Example: a complete custom check

Here's a real-world example that validates containers have CPU limits configured:

#!/bin/bash
# Check: CPU Limits
# Validates that all containers have CPU limits configured

require_pods || return 0

PODS=$(jq -r '.items[].metadata.name' "$PODS_FILE" 2>/dev/null | tr '\n' ' ')

TOTAL_CONTAINERS=0
MISSING_LIMITS=0

for POD_NAME in $PODS; do
  CONTAINERS=$(jq --arg name "$POD_NAME" '
    .items[] | select(.metadata.name == $name) |
    .spec.containers[] | {
      name: .name,
      cpu_limit: (.resources.limits.cpu // null)
    }
  ' "$PODS_FILE" 2>/dev/null)

  while IFS= read -r CONTAINER; do
    [[ -z "$CONTAINER" ]] && continue   # skip blank lines from empty jq output

    CONTAINER_NAME=$(echo "$CONTAINER" | jq -r '.name')
    CPU_LIMIT=$(echo "$CONTAINER" | jq -r '.cpu_limit')
    TOTAL_CONTAINERS=$((TOTAL_CONTAINERS + 1))

    if [[ "$CPU_LIMIT" == "null" ]]; then
      MISSING_LIMITS=$((MISSING_LIMITS + 1))
      print_error "Pod $POD_NAME, container $CONTAINER_NAME: No CPU limit"
      print_action "Add resources.limits.cpu to the container spec"
    else
      print_success "Pod $POD_NAME, container $CONTAINER_NAME: CPU limit $CPU_LIMIT"
    fi
  done <<< "$(echo "$CONTAINERS" | jq -c '.')"
done

if [[ $MISSING_LIMITS -gt 0 ]]; then
  update_check_result --status "failed" \
    --evidence "{\"total\":$TOTAL_CONTAINERS,\"missing_limits\":$MISSING_LIMITS}"
else
  update_check_result --status "success" \
    --evidence "{\"total\":$TOTAL_CONTAINERS,\"missing_limits\":0}"
fi

Register it in k8s/diagnose/scope/workflow.yml:

  - name: CPU Limits
    description: Validates that all containers have CPU limits configured
    category: Scope
    type: script
    file: "$SERVICE_PATH/diagnose/scope/cpu_limits_check"

Tips

  • Read from files, not the API. Never use kubectl inside a check. All data you need is in the pre-collected JSON files.
  • Keep evidence useful. Include counts, names, and specific values in your evidence JSON. This helps users understand the result without reading logs.
  • Handle empty data gracefully. Use jq with // empty or // null defaults to avoid errors when fields are missing.
  • One concern per check. Each check should validate a single condition. If you're checking multiple things, consider splitting into separate checks.
  • Use the right category folder. Place your check where it logically belongs. If it inspects pods, it goes in scope/. If it inspects ingress configuration, it goes in networking/.
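The "handle empty data" tip in action, using a hand-written pod fragment for illustration:

```shell
# A minimal pod fragment with no nodeName and no resource limits set.
POD='{"spec":{"containers":[{"name":"app"}]}}'

# "//" supplies a default when the path resolves to null or is missing:
echo "$POD" | jq -r '.spec.nodeName // "unscheduled"'                       # prints: unscheduled
echo "$POD" | jq -r '.spec.containers[0].resources.limits.cpu // empty'     # prints nothing
```

With `// empty` the filter emits no output at all for a missing field, so `[[ -z "$VALUE" ]]` tests work cleanly; without it you would have to compare against the literal string "null".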