Troubleshooting actions on running scopes
There are situations where we want to perform specific actions on a running scope. For example, if we observe odd behaviour in production, we might want to kill a specific instance to see whether the problem is related to that instance. One possible approach is to make a new deployment, but that can be time-consuming and might not be the best fit for the situation.
These actions can be triggered through the Performance View in the UI. The changes they make are not persisted: if we make a new deployment, any change made by these actions will be lost.
Use Case Scenario
Let's suppose we have a traffic spike on production. Through the Performance View we can quickly change the scaling behaviour by adding more instances to the underlying Autoscaling Group. If we then make a deployment, the changes we previously made will be overridden by the scope configuration.
For example, we can have a scope with scaling enabled and a min/max range of 2 and 5 instances, respectively. When the traffic spike hits, we can quickly set the number of instances to 10 in order to cope with the increase. If we then make a deployment, the number of instances will be set back to the min/max range of 2 and 5, because that's what the scope configuration says. These troubleshooting actions are meant to quickly act on a running scope, without the need for a deployment.
Current Supported Actions
There are several actions that we can perform on a running scope. If there is any other action that you would find useful to have, please reach out! Currently, these actions are:
Instance Kill
This action terminates an instance or pod. We can kill a single instance or several at once. We need to be careful with this action, as it may cause service degradation.
Autoscaling Fix
This is a quick way to scale our infrastructure by adding or removing instances/pods, depending on the desired amount that we set. Along with this amount, we can also specify whether to leave autoscaling enabled or not.
Autoscaling Stop
If a scope has autoscaling enabled, we can stop it. This is useful when we observe that the scaling rules we previously set were too aggressive and we want to avoid any further scaling.
Autoscaling Start
If a scope has autoscaling enabled and it was previously stopped, we can enable it again.
Setup
This setup applies only to instance_kill in server-based scopes; the other actions do not require further setup.
To kill an instance, nullplatform needs permission to terminate instances. To grant it, we need to follow two steps:
- Create a new policy
- Attach policy to the role we assume
New Policy
From the AWS Management Console, in IAM > Policies, click on "Create policy" and select the JSON tab. Then add the following:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": "ec2:TerminateInstances",
      "Resource": "*"
    }
  ]
}
After setting this JSON, we will name the policy np-troubleshooting-manager. Any new action that we want to add in the future can be added to this policy.
Attach Policy to Role
From the AWS Management Console, in IAM > Roles, click on the null-scope-and-deploy-manager role. Click on "Attach policies" and search for the policy we just created. Attach it to the role.
That's it! Now we are ready to use the Troubleshooting Actions.
API Examples
If we need to create a troubleshooting action without going through the UI, we can use the API.
Please refer to our API Section for more details about usage.
Creating an Action
instance_kill
We can kill a single instance/pod or multiple at once. The instance_id parameter is used for both server-based scopes and k8s-based scopes.
For server-based scopes, we need the instance ID, which we can find through the Performance View, the EC2 Dashboard, or the AWS CLI describe-instances command.
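For example, a quick way to list instance IDs from the CLI might look like this; the tag filter is only an illustrative assumption, so adjust it to however your scope's instances are tagged.

# List instance IDs (and their state) for instances matching a tag filter
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=my-app-scope*" \
  --query "Reservations[].Instances[].[InstanceId,State.Name]" \
  --output table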
For k8s-based scopes, we need the pod name, which we can find through the Performance View, the EKS Dashboard (or equivalent), or the kubectl get pods command.
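For example, assuming the scope's pods run in a dedicated namespace (an assumption; use whatever namespace your setup actually creates):

# List pod names in the namespace where the scope's pods run
kubectl get pods -n <NAMESPACE> -o wide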
POST /scope/<SCOPE_ID>/action
{
  "name": "instance_kill",
  "parameters": {
    "instance_id": "i-abcdef123456"
  }
}
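As a sketch, the request above can be sent with curl; the API host and the Authorization header here are assumptions, so substitute whatever base URL and authentication your setup uses.

# Kill a single instance on scope <SCOPE_ID> (host and auth header are placeholders)
curl -X POST "https://api.nullplatform.com/scope/<SCOPE_ID>/action" \
  -H "Authorization: Bearer $NP_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "instance_kill", "parameters": {"instance_id": "i-abcdef123456"}}'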
POST /scope/<SCOPE_ID>/action
{
  "name": "instance_kill",
  "parameters": {
    "instance_id": [
      "i-abcdef123456",
      "i-defghi654321"
    ]
  }
}
autoscaling_fix
With this action we set the desired number of instances/pods. Under the hood, if we set 10 instances and we currently have 5, nullplatform will add 5 more instances; if we set 5 instances and we currently have 10, nullplatform will remove 5 instances.
Along with the desired amount, we can also specify whether to leave the autoscaling enabled or not (in case the scope previously had scaling enabled through the scope details).
POST /scope/<SCOPE_ID>/action
{
  "name": "autoscaling_fix",
  "parameters": {
    "desired_instances": 10,
    "autoscaling_enabled": false
  }
}
autoscaling_stop
If the scope has autoscaling enabled, we can stop it. This is useful when we observe that the scaling rules we previously set were too aggressive and we want to avoid any further scaling.
POST /scope/<SCOPE_ID>/action
{
  "name": "autoscaling_stop",
  "parameters": {}
}
autoscaling_start
If the scope has autoscaling enabled and it was previously stopped, we can enable it again.
POST /scope/<SCOPE_ID>/action
{
  "name": "autoscaling_start",
  "parameters": {}
}
Reading an Action
GET /scope/<SCOPE_ID>/action/<ACTION_ID>
{
  "id": 5678,
  "status": "failed",
  "entity": "scope",
  "entity_id": 1234,
  "name": "instance_kill",
  "parameters": {
    "instance_id": "i-abcdef123456"
  },
  "messages": [
    {
      "level": "ERROR",
      "message": "Instance id i-abcdef123456 not found"
    }
  ],
  "created_at": "2024-09-26T19:24:02.714Z",
  "updated_at": "2024-09-26T19:24:05.801Z",
  "nrn": "organization=1:account=2:namespace=3:application=4:scope=1234"
}
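As with creation, reading an action can be scripted. A minimal sketch, again assuming the same placeholder host and token as above, fetches the action and prints only its status and messages:

# Check the result of a previously created action (host and auth header are placeholders)
curl -s "https://api.nullplatform.com/scope/<SCOPE_ID>/action/<ACTION_ID>" \
  -H "Authorization: Bearer $NP_API_TOKEN" | jq '{status, messages}'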