Troubleshooting actions on running scopes
There are situations where we want to perform specific actions on a running scope. For example, if we observe odd behaviour in production, we might want to kill a specific instance to see whether the problem is related to that instance. One possible approach is to make a new deployment, but that can be time-consuming and might not be the best fit for the situation.
These actions can be triggered through the Performance View in the UI. The changes they make are not persisted: if we make a new deployment, any change made by these actions will be lost.
Use Case Scenario
Let's suppose we have a traffic spike on production. Through the Performance View we can quickly change the scaling behaviour by adding more instances to the underlying Autoscaling Group. If we then make a deployment, the changes we previously made will be overridden by the scope configuration.
For example, we can have a scope with scaling enabled and a min/max range of 2 and 5 instances, respectively. When the traffic spike hits, we can quickly set the number of instances to 10 in order to cope with the increase. If we then make a deployment, the number of instances will be set back to the min/max range of 2 and 5, because that's what the scope configuration says. These troubleshooting actions are meant to quickly act on a running scope, without the need for a deployment.
Current Supported Actions
There are several actions that we can perform on a running scope. If there is any other action that you would find useful to have, please reach out! Currently, these actions are:
Instance Kill
This action terminates an instance or pod. We can kill a single instance or several at once. We need to be careful with this action, as it may cause service degradation.
Autoscaling Fix
This is a quick way to scale our infrastructure by adding or removing instances/pods, depending on the desired amount that we set. Along with this amount, we can also specify whether to leave autoscaling enabled or not.
Autoscaling Stop
If a scope has autoscaling enabled, we can stop it. This is useful when we observe that the scaling rules we previously set were too aggressive and we want to avoid any further scaling.
Autoscaling Start
If a scope has autoscaling enabled and it was previously stopped, we can enable it again.
Setup
This setup applies only to instance_kill in server-based scopes; the other actions do not require further setup.
To kill an instance, nullplatform needs permission to terminate instances. To grant it, we need to follow two steps:
- Create a new policy
- Attach policy to the role we assume
New Policy
From the AWS Management Console, in IAM > Policies, click on "Create policy" and select the JSON tab. Then add the following:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": "ec2:TerminateInstances",
      "Resource": "*"
    }
  ]
}
After setting this JSON, we will name the policy np-troubleshooting-manager. Any new action that we want to add in the future can be added to this policy.
Attach Policy to Role
From the AWS Management Console, in IAM > Roles, click on the null-scope-and-deploy-manager role. Click on "Attach policies" and search for the policy we just created. Attach it to the role.
That's it! Now we are ready to use the Troubleshooting Actions.
API Examples
If we need to create a troubleshooting action without going through the UI, we can use the API.
Please refer to our API Section for more details about usage.
Creating an Action
instance_kill
We can kill a single instance/pod or multiple at once. The instance_id parameter is used for both server-based scopes and k8s-based scopes.
For server-based scopes, we need the instance ID, which we can find through the Performance View, the EC2 Dashboard, or the AWS CLI describe-instances command.
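For example, a quick way to list instance IDs from the CLI might look like this; the tag filter is only an illustrative assumption, so adjust it to however your scope's instances are tagged.

# List instance IDs (and their state) for instances matching a tag filter
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=my-app-scope*" \
  --query "Reservations[].Instances[].[InstanceId,State.Name]" \
  --output table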
For k8s-based scopes, we need the pod name, which we can find through the Performance View, the EKS Dashboard (or equivalent), or the kubectl get pods command.
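For example, assuming the scope's pods run in a dedicated namespace (an assumption; use whatever namespace your setup actually creates):

# List pod names in the namespace where the scope's pods run
kubectl get pods -n <NAMESPACE> -o wide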
POST /scope/<SCOPE_ID>/action
{
  "name": "instance_kill",
  "parameters": {
    "instance_id": "i-abcdef123456"
  }
}
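As a sketch, the request above can be sent with curl; the API host and the Authorization header here are assumptions, so substitute whatever base URL and authentication your setup uses.

# Kill a single instance on scope <SCOPE_ID> (host and auth header are placeholders)
curl -X POST "https://api.nullplatform.com/scope/<SCOPE_ID>/action" \
  -H "Authorization: Bearer $NP_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "instance_kill", "parameters": {"instance_id": "i-abcdef123456"}}'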
POST /scope/<SCOPE_ID>/action
{
  "name": "instance_kill",
  "parameters": {
    "instance_id": [
      "i-abcdef123456",
      "i-defghi654321"
    ]
  }
}
autoscaling_fix
With this action we set the desired number of instances/pods. Under the hood, if we set 10 instances and we currently have 5, nullplatform will add 5 more instances; if we set 5 instances and we currently have 10, nullplatform will remove 5 instances.
Along with the desired amount, we can also specify whether to leave the autoscaling enabled or not (in case the scope previously had scaling enabled through the scope details).
POST /scope/<SCOPE_ID>/action
{
  "name": "autoscaling_fix",
  "parameters": {
    "desired_instances": 10,
    "autoscaling_enabled": false
  }
}
autoscaling_stop
If the scope has autoscaling enabled, we can stop it. This is useful when we observe that the scaling rules we previously set were too aggressive and we want to avoid any further scaling.
POST /scope/<SCOPE_ID>/action
{
  "name": "autoscaling_stop",
  "parameters": {}
}
autoscaling_start
If the scope has autoscaling enabled and it was previously stopped, we can enable it again.
POST /scope/<SCOPE_ID>/action
{
  "name": "autoscaling_start",
  "parameters": {}
}
Reading an Action
GET /scope/<SCOPE_ID>/action/<ACTION_ID>
{
  "id": 5678,
  "status": "failed",
  "entity": "scope",
  "entity_id": 1234,
  "name": "instance_kill",
  "parameters": {
    "instance_id": "i-abcdef123456"
  },
  "messages": [
    {
      "level": "ERROR",
      "message": "Instance id i-abcdef123456 not found"
    }
  ],
  "created_at": "2024-09-26T19:24:02.714Z",
  "updated_at": "2024-09-26T19:24:05.801Z",
  "nrn": "organization=1:account=2:namespace=3:application=4:scope=1234"
}
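As with creation, reading an action can be scripted. A minimal sketch, again assuming the same placeholder host and token as above, fetches the action and prints only its status and messages:

# Check the result of a previously created action (host and auth header are placeholders)
curl -s "https://api.nullplatform.com/scope/<SCOPE_ID>/action/<ACTION_ID>" \
  -H "Authorization: Bearer $NP_API_TOKEN" | jq '{status, messages}'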