
Create a custom check

Diagnose ships with a set of built-in checks that cover common Kubernetes issues. You can browse the source for existing checks and extend Diagnose with custom checks to validate conditions specific to your setup.

This guide walks you through the process of creating, registering, and testing a custom diagnostic check.

How checks work

Every diagnostic check is a bash script that:

  1. Validates that the resources it needs exist (pods, services, ingresses).
  2. Reads data from pre-collected JSON files.
  3. Analyzes the data and prints findings.
  4. Reports a result with a status and evidence.

Checks don't make Kubernetes API calls. During a Diagnose run, the build_context step collects a snapshot of the cluster before any checks execute. Your check reads from that snapshot, which keeps results consistent and fast.
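To see what "reading from the snapshot" looks like in practice, here is a minimal sketch. The local `PODS_FILE` below is a stand-in for the real `$PODS_FILE` exported by build_context, with a hand-written two-pod snapshot:

```shell
# Sketch: count running pods from a pre-collected snapshot file.
# PODS_FILE here is a stand-in for the real $PODS_FILE from build_context.
PODS_FILE=$(mktemp)
cat > "$PODS_FILE" <<'EOF'
{"items":[
  {"metadata":{"name":"web-1"},"status":{"phase":"Running"}},
  {"metadata":{"name":"web-2"},"status":{"phase":"Pending"}}
]}
EOF

# No kubectl call: everything comes from the snapshot taken before checks ran.
RUNNING=$(jq '[.items[] | select(.status.phase == "Running")] | length' "$PODS_FILE")
echo "Running pods: $RUNNING"
rm -f "$PODS_FILE"
```

Because every check reads the same frozen snapshot, two checks can never disagree about cluster state within a single run.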

Check statuses

Each check must report exactly one of these statuses:

  Status    Meaning
  success   The check passed. No issues found.
  failed    The check found a problem that needs attention.
  warning   Something looks unusual but isn't necessarily broken.
  skipped   The check couldn't run because a prerequisite was missing (e.g., no pods exist).
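The four statuses map onto a simple decision order: skip if there is nothing to inspect, fail on concrete problems, warn on oddities, succeed otherwise. As an illustration only (real checks report via update_check_result, not a helper like this):

```shell
# Illustrative decision function; choose_status is NOT part of diagnose_utils.
choose_status() {
  local resources_found=$1 issues=$2 oddities=$3
  if [ "$resources_found" -eq 0 ]; then echo "skipped"   # nothing to inspect
  elif [ "$issues" -gt 0 ];      then echo "failed"      # concrete problems
  elif [ "$oddities" -gt 0 ];    then echo "warning"     # unusual, not broken
  else                                echo "success"
  fi
}

choose_status 0 0 0   # prints: skipped
choose_status 5 2 0   # prints: failed
choose_status 5 0 1   # prints: warning
choose_status 5 0 0   # prints: success
```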

Step 1: Create the check script

Create a new file in the appropriate category folder under k8s/diagnose/:

  • networking/ for ingress, routing, and load balancer checks
  • scope/ for pod and workload checks
  • service/ for Kubernetes Service checks

The filename becomes the check identifier. Use snake_case with no file extension.

touch k8s/diagnose/scope/my_custom_check
chmod +x k8s/diagnose/scope/my_custom_check

Step 2: Write the check logic

Here's the structure every check should follow:

#!/bin/bash
# Check: My Custom Check
# Validates that [what this check does]

# 1. Validate prerequisites
require_pods || return 0

# 2. Read data from pre-collected files
PODS=$(jq -r '.items[].metadata.name' "$PODS_FILE" 2>/dev/null | tr '\n' ' ')

# 3. Analyze and report
HAS_ISSUES=0

for POD_NAME in $PODS; do
  POD_INFO=$(jq --arg name "$POD_NAME" \
    '.items[] | select(.metadata.name == $name)' \
    "$PODS_FILE" 2>/dev/null)

  # Your validation logic here
  SOME_VALUE=$(echo "$POD_INFO" | jq -r '.spec.someField // empty')

  if [[ -z "$SOME_VALUE" ]]; then
    HAS_ISSUES=1
    print_error "Pod $POD_NAME: Missing someField"
    print_action "Add someField to your deployment spec"
  else
    print_success "Pod $POD_NAME: someField is configured"
  fi
done

# 4. Report final result
if [[ $HAS_ISSUES -eq 1 ]]; then
  update_check_result --status "failed" --evidence '{"checked": true}'
else
  update_check_result --status "success" --evidence '{"checked": true}'
fi

Always return 0 after require

Use require_pods || return 0 (not return 1). Returning 0 tells the executor the script ran successfully, even though the check was skipped. Returning 1 would cause the executor to treat it as an unexpected error.
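The difference is observable in the script's exit status. A minimal sketch, with require_pods stubbed to simulate "no pods found" (in a real check it comes from diagnose_utils):

```shell
# Stub: simulate a missing prerequisite. The real require_pods also marks
# the check as "skipped" before returning 1.
require_pods() { return 1; }

check_skips_cleanly() {
  require_pods || return 0   # skipped, but the script itself succeeded
  echo "never reached"
}

check_skips_cleanly
echo "exit code: $?"         # prints: exit code: 0
```

With `|| return 1` instead, the function would exit non-zero and the executor would report an unexpected error rather than a skip.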

Step 3: Register the check in the workflow

Open the workflow.yml file in the same category folder and add your check:

# k8s/diagnose/scope/workflow.yml
steps:
  # ... existing checks ...
  - name: My Custom Check
    description: Validates that [what it checks]
    category: Scope
    type: script
    file: "$SERVICE_PATH/diagnose/scope/my_custom_check"

The category field determines how the check is grouped in the UI. Use one of:

  • Networking for networking checks
  • Scope for scope checks
  • K8s Service for service checks

The name and description are what users see in the Diagnose results.

Available data files

The build_context step exports these environment variables pointing to pre-collected JSON files:

  Variable                    Contents
  $PODS_FILE                  Pods matching the scope/deployment labels
  $SERVICES_FILE              Services matching the labels
  $ENDPOINTS_FILE             All endpoints in the namespace
  $INGRESSES_FILE             Ingresses matching the scope labels
  $SECRETS_FILE               Secret metadata (no secret data)
  $INGRESSCLASSES_FILE        Available IngressClasses in the cluster
  $EVENTS_FILE                Recent events in the namespace
  $ALB_CONTROLLER_PODS_FILE   ALB controller pod information
  $ALB_CONTROLLER_LOGS_DIR    Directory with ALB controller log files

You also have access to these context variables:

  Variable                 Contents
  $NAMESPACE               The Kubernetes namespace
  $SCOPE_ID                The nullplatform scope ID
  $DEPLOYMENT_ID           The deployment ID (if running a deployment-level diagnosis)
  $LABEL_SELECTOR          Full label selector including scope and deployment
  $SCOPE_LABEL_SELECTOR    Label selector for scope only (scope_id=...)
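If a check needs the raw value out of a selector, plain parameter expansion is enough. A sketch with an illustrative selector value (the real `$SCOPE_LABEL_SELECTOR` is exported by build_context):

```shell
# Illustrative value; in a real check this variable is already exported.
SCOPE_LABEL_SELECTOR="scope_id=12345"

# Strip the "scope_id=" prefix to recover the ID.
SCOPE_ID_FROM_SELECTOR=${SCOPE_LABEL_SELECTOR#scope_id=}
echo "$SCOPE_ID_FROM_SELECTOR"   # prints: 12345
```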

Helper functions reference

These functions are available to every check, loaded automatically from diagnose_utils.

Output functions

Use these to print formatted messages. They appear in the check logs that users see in the UI.

print_success "All pods are healthy"        # ✓ green
print_error "Pod web-123 is not ready"      # ✗ red
print_warning "High restart count detected" # ⚠ yellow
print_info "Checking pod web-123"           # ℹ cyan
print_action "Increase memory limits"       # 🔧 cyan

Resource validation

These functions check whether prerequisite resources exist. If they don't, the check is automatically set to "skipped" and the function returns 1.

require_pods      # Checks $PODS_FILE has items
require_services  # Checks $SERVICES_FILE has items
require_ingresses # Checks $INGRESSES_FILE has items

Always call these before accessing the corresponding data files:

require_pods || return 0
# Safe to read $PODS_FILE from here

Reporting results

Call update_check_result once at the end of your check to report the final status.

# Named parameters
update_check_result --status "success" --evidence '{"pods_checked": 3}'

# Positional parameters
update_check_result "failed" '{"reason": "no healthy endpoints"}'

The evidence parameter must be valid JSON. It's stored with the check result and can contain any data that helps explain what the check found.
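When evidence values come from variables, building the JSON with jq is safer than hand-escaping a string, since `--arg`/`--argjson` quote values correctly even if they contain spaces or quotes. A sketch with illustrative values:

```shell
# Sketch: assemble evidence JSON with jq instead of manual escaping.
# --argjson keeps numbers as numbers; --arg quotes strings safely.
EVIDENCE=$(jq -nc \
  --argjson total 3 \
  --arg reason "no healthy endpoints" \
  '{total: $total, reason: $reason}')
echo "$EVIDENCE"   # prints: {"total":3,"reason":"no healthy endpoints"}
```

The result can then be passed straight to `update_check_result --evidence "$EVIDENCE"`.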

Example: a complete custom check

Here's a real-world example that validates containers have CPU limits configured:

#!/bin/bash
# Check: CPU Limits
# Validates that all containers have CPU limits configured

require_pods || return 0

PODS=$(jq -r '.items[].metadata.name' "$PODS_FILE" 2>/dev/null | tr '\n' ' ')

TOTAL_CONTAINERS=0
MISSING_LIMITS=0

for POD_NAME in $PODS; do
  CONTAINERS=$(jq --arg name "$POD_NAME" '
    .items[] | select(.metadata.name == $name) |
    .spec.containers[] | {
      name: .name,
      cpu_limit: (.resources.limits.cpu // null)
    }
  ' "$PODS_FILE" 2>/dev/null)

  while IFS= read -r CONTAINER; do
    [[ -z "$CONTAINER" ]] && continue   # skip blank lines from empty jq output

    CONTAINER_NAME=$(echo "$CONTAINER" | jq -r '.name')
    CPU_LIMIT=$(echo "$CONTAINER" | jq -r '.cpu_limit')
    TOTAL_CONTAINERS=$((TOTAL_CONTAINERS + 1))

    if [[ "$CPU_LIMIT" == "null" ]]; then
      MISSING_LIMITS=$((MISSING_LIMITS + 1))
      print_error "Pod $POD_NAME, container $CONTAINER_NAME: No CPU limit"
      print_action "Add resources.limits.cpu to the container spec"
    else
      print_success "Pod $POD_NAME, container $CONTAINER_NAME: CPU limit $CPU_LIMIT"
    fi
  done <<< "$(echo "$CONTAINERS" | jq -c '.')"
done

if [[ $MISSING_LIMITS -gt 0 ]]; then
  update_check_result --status "failed" \
    --evidence "{\"total\":$TOTAL_CONTAINERS,\"missing_limits\":$MISSING_LIMITS}"
else
  update_check_result --status "success" \
    --evidence "{\"total\":$TOTAL_CONTAINERS,\"missing_limits\":0}"
fi

Register it in k8s/diagnose/scope/workflow.yml:

  - name: CPU Limits
    description: Validates that all containers have CPU limits configured
    category: Scope
    type: script
    file: "$SERVICE_PATH/diagnose/scope/cpu_limits_check"

Tips

  • Read from files, not the API. Never use kubectl inside a check. All data you need is in the pre-collected JSON files.
  • Keep evidence useful. Include counts, names, and specific values in your evidence JSON. This helps users understand the result without reading logs.
  • Handle empty data gracefully. Use jq with // empty or // null defaults to avoid errors when fields are missing.
  • One concern per check. Each check should validate a single condition. If you're checking multiple things, consider splitting into separate checks.
  • Use the right category folder. Place your check where it logically belongs. If it inspects pods, it goes in scope/. If it inspects ingress configuration, it goes in networking/.
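The "handle empty data" tip in action, using a hand-written pod fragment for illustration:

```shell
# A minimal pod fragment with no nodeName and no resource limits set.
POD='{"spec":{"containers":[{"name":"app"}]}}'

# "//" supplies a default when the path resolves to null or is missing:
echo "$POD" | jq -r '.spec.nodeName // "unscheduled"'                       # prints: unscheduled
echo "$POD" | jq -r '.spec.containers[0].resources.limits.cpu // empty'     # prints nothing
```

With `// empty` the filter emits no output at all for a missing field, so `[[ -z "$VALUE" ]]` tests work cleanly; without it you would have to compare against the literal string "null".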