Create a custom check
Diagnose ships with a set of built-in checks that cover common Kubernetes issues. You can browse the source for existing checks and extend Diagnose with custom checks to validate conditions specific to your setup.
This guide walks you through the process of creating, registering, and testing a custom diagnostic check.
How checks work
Every diagnostic check is a bash script that:
- Validates that the resources it needs exist (pods, services, ingresses).
- Reads data from pre-collected JSON files.
- Analyzes the data and prints findings.
- Reports a result with a status and evidence.
Checks don't make Kubernetes API calls. During a Diagnose run, the `build_context` step collects a snapshot of the cluster before any checks execute. Your check reads from that snapshot, which keeps results consistent and fast.
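For example, counting pods means reading the pre-collected file with `jq`, never calling the API. A minimal sketch, where a temporary sample file stands in for the real `$PODS_FILE` that `build_context` exports:

```shell
#!/bin/bash
# Hypothetical stand-in for the snapshot file build_context would export.
PODS_FILE=$(mktemp)
cat > "$PODS_FILE" <<'EOF'
{"items":[{"metadata":{"name":"web-1"}},{"metadata":{"name":"web-2"}}]}
EOF

# Checks read the snapshot with jq instead of querying Kubernetes.
POD_COUNT=$(jq '.items | length' "$PODS_FILE")
echo "pods in snapshot: $POD_COUNT"
```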
Check statuses
Each check must report exactly one of these statuses:
| Status | Meaning |
|---|---|
| `success` | The check passed. No issues found. |
| `failed` | The check found a problem that needs attention. |
| `warning` | Something looks unusual but isn't necessarily broken. |
| `skipped` | The check couldn't run because a prerequisite was missing (e.g., no pods exist). |
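The mapping from findings to status is mechanical: skip when the prerequisite is absent, fail on a confirmed problem, warn on a suspicious-but-working state. A sketch with illustrative counter values (the threshold and variable names are hypothetical, not part of Diagnose):

```shell
#!/bin/bash
# Illustrative counters a check might compute from the snapshot.
POD_COUNT=3
CRASHED_COUNT=0
RESTART_COUNT=7

if [ "$POD_COUNT" -eq 0 ]; then
  STATUS="skipped"   # prerequisite missing: nothing to inspect
elif [ "$CRASHED_COUNT" -gt 0 ]; then
  STATUS="failed"    # confirmed problem that needs attention
elif [ "$RESTART_COUNT" -gt 5 ]; then
  STATUS="warning"   # unusual but not necessarily broken
else
  STATUS="success"
fi
echo "$STATUS"
```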
Step 1: Create the check script
Create a new file in the appropriate category folder under `k8s/diagnose/`:

- `networking/` for ingress, routing, and load balancer checks
- `scope/` for pod and workload checks
- `service/` for Kubernetes Service checks
The filename becomes the check identifier. Use snake_case with no file extension.
```bash
touch k8s/diagnose/scope/my_custom_check
chmod +x k8s/diagnose/scope/my_custom_check
```
Step 2: Write the check logic
Here's the structure every check should follow:
```bash
#!/bin/bash
# Check: My Custom Check
# Validates that [what this check does]

# 1. Validate prerequisites
require_pods || return 0

# 2. Read data from pre-collected files
PODS=$(jq -r '.items[].metadata.name' "$PODS_FILE" 2>/dev/null | tr '\n' ' ')

# 3. Analyze and report
HAS_ISSUES=0
for POD_NAME in $PODS; do
  POD_INFO=$(jq --arg name "$POD_NAME" \
    '.items[] | select(.metadata.name == $name)' \
    "$PODS_FILE" 2>/dev/null)

  # Your validation logic here
  SOME_VALUE=$(echo "$POD_INFO" | jq -r '.spec.someField // empty')
  if [[ -z "$SOME_VALUE" ]]; then
    HAS_ISSUES=1
    print_error "Pod $POD_NAME: Missing someField"
    print_action "Add someField to your deployment spec"
  else
    print_success "Pod $POD_NAME: someField is configured"
  fi
done

# 4. Report final result
if [[ $HAS_ISSUES -eq 1 ]]; then
  update_check_result --status "failed" --evidence '{"checked": true}'
else
  update_check_result --status "success" --evidence '{"checked": true}'
fi
```
Use `require_pods || return 0` (not `return 1`). Returning 0 tells the executor the script ran successfully, even though the check was skipped. Returning 1 would cause the executor to treat it as an unexpected error.
Step 3: Register the check in the workflow
Open the `workflow.yml` file in the same category folder and add your check:
```yaml
# k8s/diagnose/scope/workflow.yml
steps:
  # ... existing checks ...
  - name: My Custom Check
    description: Validates that [what it checks]
    category: Scope
    type: script
    file: "$SERVICE_PATH/diagnose/scope/my_custom_check"
```
The `category` field determines how the check is grouped in the UI. Use one of:

- `Networking` for networking checks
- `Scope` for scope checks
- `K8s Service` for service checks
The `name` and `description` are what users see in the Diagnose results.
Available data files
The `build_context` step exports these environment variables pointing to pre-collected JSON files:

| Variable | Contents |
|---|---|
| `$PODS_FILE` | Pods matching the scope/deployment labels |
| `$SERVICES_FILE` | Services matching the labels |
| `$ENDPOINTS_FILE` | All endpoints in the namespace |
| `$INGRESSES_FILE` | Ingresses matching the scope labels |
| `$SECRETS_FILE` | Secret metadata (no secret data) |
| `$INGRESSCLASSES_FILE` | Available IngressClasses in the cluster |
| `$EVENTS_FILE` | Recent events in the namespace |
| `$ALB_CONTROLLER_PODS_FILE` | ALB controller pod information |
| `$ALB_CONTROLLER_LOGS_DIR` | Directory with ALB controller log files |
You also have access to these context variables:
| Variable | Contents |
|---|---|
| `$NAMESPACE` | The Kubernetes namespace |
| `$SCOPE_ID` | The nullplatform scope ID |
| `$DEPLOYMENT_ID` | The deployment ID (if running a deployment-level diagnosis) |
| `$LABEL_SELECTOR` | Full label selector including scope and deployment |
| `$SCOPE_LABEL_SELECTOR` | Label selector for scope only (`scope_id=...`) |
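As an example of combining a data file with the context variables, a check could surface recent Warning events from `$EVENTS_FILE`. A sketch, with sample data standing in for the real snapshot and a hypothetical namespace value:

```shell
#!/bin/bash
# Hypothetical stand-ins for what build_context would export.
NAMESPACE="demo"
EVENTS_FILE=$(mktemp)
cat > "$EVENTS_FILE" <<'EOF'
{"items":[
  {"type":"Warning","reason":"BackOff","message":"restarting container"},
  {"type":"Normal","reason":"Pulled","message":"image pulled"}
]}
EOF

# Keep only Warning events; the snapshot replaces any API query.
WARNINGS=$(jq -r '.items[] | select(.type == "Warning") | .reason' "$EVENTS_FILE")
echo "warning events in $NAMESPACE: $WARNINGS"
```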
Helper functions reference
These functions are available to every check, loaded automatically from `diagnose_utils`.
Output functions
Use these to print formatted messages. They appear in the check logs that users see in the UI.
```bash
print_success "All pods are healthy"        # ✓ green
print_error "Pod web-123 is not ready"      # ✗ red
print_warning "High restart count detected" # ⚠ yellow
print_info "Checking pod web-123"           # ℹ cyan
print_action "Increase memory limits"       # 🔧 cyan
```
Resource validation
These functions check whether prerequisite resources exist. If they don't, the check is automatically set to "skipped" and the function returns 1.
```bash
require_pods       # Checks $PODS_FILE has items
require_services   # Checks $SERVICES_FILE has items
require_ingresses  # Checks $INGRESSES_FILE has items
```
Always call these before accessing the corresponding data files:
```bash
require_pods || return 0
# Safe to read $PODS_FILE from here
```
Reporting results
Call `update_check_result` once at the end of your check to report the final status.

```bash
# Named parameters
update_check_result --status "success" --evidence '{"pods_checked": 3}'

# Positional parameters
update_check_result "failed" '{"reason": "no healthy endpoints"}'
```
The `evidence` parameter must be valid JSON. It's stored with the check result and can contain any data that helps explain what the check found.
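When evidence includes shell variables, building it with `jq -n` avoids hand-rolled quoting bugs (a sketch; the variable names and values are illustrative):

```shell
#!/bin/bash
TOTAL=3
REASON='no healthy endpoints for "web"'

# jq -n constructs guaranteed-valid JSON, escaping the quotes in $REASON.
EVIDENCE=$(jq -n --argjson total "$TOTAL" --arg reason "$REASON" \
  '{total: $total, reason: $reason}')
echo "$EVIDENCE"
# update_check_result --status "failed" --evidence "$EVIDENCE"
```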
Example: a complete custom check
Here's a real-world example that validates containers have CPU limits configured:
```bash
#!/bin/bash
# Check: CPU Limits
# Validates that all containers have CPU limits configured

require_pods || return 0

PODS=$(jq -r '.items[].metadata.name' "$PODS_FILE" 2>/dev/null | tr '\n' ' ')

TOTAL_CONTAINERS=0
MISSING_LIMITS=0

for POD_NAME in $PODS; do
  CONTAINERS=$(jq --arg name "$POD_NAME" '
    .items[] | select(.metadata.name == $name) |
    .spec.containers[] | {
      name: .name,
      cpu_limit: (.resources.limits.cpu // null)
    }
  ' "$PODS_FILE" 2>/dev/null)

  while IFS= read -r CONTAINER; do
    [[ -z "$CONTAINER" ]] && continue  # skip blank lines so empty output isn't counted
    CONTAINER_NAME=$(echo "$CONTAINER" | jq -r '.name')
    CPU_LIMIT=$(echo "$CONTAINER" | jq -r '.cpu_limit')
    TOTAL_CONTAINERS=$((TOTAL_CONTAINERS + 1))

    if [[ "$CPU_LIMIT" == "null" ]]; then
      MISSING_LIMITS=$((MISSING_LIMITS + 1))
      print_error "Pod $POD_NAME, container $CONTAINER_NAME: No CPU limit"
      print_action "Add resources.limits.cpu to the container spec"
    else
      print_success "Pod $POD_NAME, container $CONTAINER_NAME: CPU limit $CPU_LIMIT"
    fi
  done <<< "$(echo "$CONTAINERS" | jq -c '.')"
done

if [[ $MISSING_LIMITS -gt 0 ]]; then
  update_check_result --status "failed" \
    --evidence "{\"total\":$TOTAL_CONTAINERS,\"missing_limits\":$MISSING_LIMITS}"
else
  update_check_result --status "success" \
    --evidence "{\"total\":$TOTAL_CONTAINERS,\"missing_limits\":0}"
fi
```
Register it in `k8s/diagnose/scope/workflow.yml`:

```yaml
- name: CPU Limits
  description: Validates that all containers have CPU limits configured
  category: Scope
  type: script
  file: "$SERVICE_PATH/diagnose/scope/cpu_limits_check"
```
Tips
- **Read from files, not the API.** Never use `kubectl` inside a check. All data you need is in the pre-collected JSON files.
- **Keep evidence useful.** Include counts, names, and specific values in your evidence JSON. This helps users understand the result without reading logs.
- **Handle empty data gracefully.** Use `jq` with `// empty` or `// null` defaults to avoid errors when fields are missing.
- **One concern per check.** Each check should validate a single condition. If you're checking multiple things, consider splitting into separate checks.
- **Use the right category folder.** Place your check where it logically belongs. If it inspects pods, it goes in `scope/`. If it inspects ingress configuration, it goes in `networking/`.
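The `// empty` tip in action: without a default, `jq -r` prints the literal string `null` for a missing field, which then compares unequal to an empty string in your check logic. A quick sketch (the field name is hypothetical):

```shell
#!/bin/bash
POD='{"spec":{}}'

# Without a default, a missing field prints the literal string "null".
WITHOUT=$(echo "$POD" | jq -r '.spec.someField')
# With // empty, a missing field produces no output at all.
WITH=$(echo "$POD" | jq -r '.spec.someField // empty')

echo "without default: [$WITHOUT]"
echo "with default:    [$WITH]"
```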