Use the Nomad Autoscaler to scale a service
The Nomad Autoscaler is a tool that can scale workloads and client nodes in a Nomad cluster automatically. It supports two kinds of scaling scenarios:
- Horizontal application autoscaling is when the autoscaler controls the number of allocations (service instances) Nomad schedules.
- Horizontal cluster autoscaling is when the autoscaler controls the number of Nomad client nodes in the cluster.
You configure both types of autoscaling with scaling policies that react to changes in resource usage, such as CPU and memory consumption, or to metrics from an Application Performance Monitoring (APM) tool.
When you deploy an application as microservices, the autoscaler helps you scale each service independently. If only one service experiences additional load, then Nomad can add additional allocations for that service only. This approach uses available resources more efficiently than scaling the entire application.
In this tutorial, you deploy a version of HashiCups with a modified job definition for the frontend service. The modified job instructs the autoscaler to create additional instances during high CPU load.
Infrastructure overview
At the beginning of this tutorial, you have the Consul API gateway deployed on the public client node of your cluster.
 

Prerequisites
This tutorial uses the infrastructure set up in the previous tutorial of this collection, Integrate service mesh and gateway. Complete that tutorial to set up the infrastructure if you have not done so.
Review configuration files
The frontend service renders the HashiCups UI and includes a value in the page footer that shows which instance sent the response to the request. This tutorial uses the footer to show how scaling works when the autoscaler responds to increased load.
This version of HashiCups adds a scaling block to the frontend service and runs the Nomad Autoscaler as a job in Nomad.
Additional configurations for the autoscaler exist in the shared/jobs directory and include 05.autoscaler.config.sh and 05.autoscaler.nomad.hcl.
Review the autoscaler configuration
The Nomad Autoscaler is a separate piece of software that runs as a system process like the Consul and Nomad agents, or as a job in Nomad. It scales workloads running in Nomad based on the scaling block in the jobspec.
The repository provides a script, 05.autoscaler.config.sh, that automates the initial configuration required for Nomad to integrate with the Autoscaler.
The setup script first cleans up previous ACL configurations and then applies the ACL policy for the autoscaler.
/shared/jobs/05.autoscaler.config.sh
## ...
## Delete Nomad ACL policy
nomad acl policy delete autoscaler
## ...
## Create Nomad ACL policy 'autoscaler'
tee ${_scale_policy_FILE} > /dev/null << EOF
namespace "default" {
  policy = "scale"
}
namespace "default" {
  capabilities = ["read-job"]
}
operator {
  policy = "read"
}
namespace "default" {
  variables {
    path "nomad-autoscaler/lock" {
      capabilities = ["write"]
    }
  }
}
EOF
nomad acl policy apply \
        -namespace default \
        -job autoscaler \
        autoscaler ${_scale_policy_FILE}
Review the autoscaler jobspec
The autoscaler runs as a Docker container. Its configuration defines the Nomad cluster address as well as the APM tool it uses to retrieve metrics.
This jobspec uses the Nomad APM plugin. It is suitable for scaling based on CPU and memory usage. It is not as flexible as other APM plugins, but does not require additional installation or configuration. If you want to scale based on other metrics, consider using the Prometheus plugin or the Datadog plugin.
/shared/jobs/05.autoscaler.nomad.hcl
job "autoscaler" {
  group "autoscaler" {
    # ...
    task "autoscaler" {
      # ...
      template {
        data = <<EOF
          log_level = "debug"
          plugin_dir = "/plugins"
          nomad {
            address = "https://nomad.service.dc1.global:4646"
            skip_verify = "true"
          }
          apm "nomad" {
            driver = "nomad-apm"
          }
        EOF
        destination = "${NOMAD_TASK_DIR}/config.hcl"
      }
    }
  }
}
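For comparison, switching to the Prometheus plugin would only require a different apm block inside the same template. The following is a minimal sketch, not part of this tutorial, and the Prometheus address is a placeholder you would replace with your own server:
apm "prometheus" {
  driver = "prometheus"
  config = {
    address = "http://prometheus.example.com:9090"
  }
}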
Review the HashiCups jobspec
Open the 05.hashicups.nomad.hcl jobspec file and view the contents.
Nomad scales the frontend service when CPU usage of the tasks in the frontend group reaches 70% of the CPU allocated to the group. The target value strategy plugin performs the CPU usage calculation. Scaling up happens by at most one instance at a time, while scaling down happens by at most two instances at a time. These values are part of the strategy block configuration.
/shared/jobs/05.hashicups.nomad.hcl
# ...
variable "frontend_max_instances" {
  description = "The maximum number of instances to scale up to."
  default     = 5
}
variable "frontend_max_scale_up" {
  description = "The maximum number of instances to scale up by."
  default     = 1
}
variable "frontend_max_scale_down" {
  description = "The maximum number of instances to scale down by."
  default     = 2
}
job "hashicups" {
  # ...
  group "frontend" {
    # ...
    scaling {
      enabled = true
      min     = 1
      max     = var.frontend_max_instances
      policy {
        evaluation_interval = "5s"
        cooldown            = "10s"
        check "high-cpu-usage" {
          source = "nomad-apm"
          query = "max_cpu-allocated"
          strategy "target-value" {
            target = 70
            threshold = 0.05
            max_scale_up = var.frontend_max_scale_up
            max_scale_down = var.frontend_max_scale_down
          }
        }
      }
    }
    # ...
  }
  # ...
}
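As a rough illustration of the target value calculation: if the frontend group runs two allocations and the nomad-apm query reports CPU usage at 95% of the group's allocated CPU, the strategy multiplies the current count by 95 / 70 ≈ 1.36, which yields a desired count of about 2.7 and therefore three instances. Because max_scale_up is one, the group grows by at most one instance per evaluation, and the 0.05 threshold keeps the count unchanged while usage stays within roughly five percent of the target.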
Review the load test script
The load testing script makes requests to the HashiCups URL with the hey tool to trigger scaling. It does so in several waves and adds more requests with each additional wave.
/shared/jobs/05.load-test.sh
#!/bin/bash
# This script requires the hey tool
# https://github.com/rakyll/hey
[ -z "$1" ] && echo "No URL passed as first argument...exiting" && exit 1
_URL=$1
echo "Application address: $_URL"
_waves=5
_wave_duration=15
_workers_multiplier=7
_rate_multiplier=6
_sleep_time=7
for i in $(seq 1 $_waves);
do
    _wave_duration=15
    _concurrent_workers=$(($_workers_multiplier * $i))
    _rate_limit_per_sec_per_worker=$(($_rate_multiplier * $i))
    echo "Sending $(($_wave_duration * $_concurrent_workers * $_rate_limit_per_sec_per_worker)) requests over $_wave_duration seconds"
    hey -z "$_wave_duration"s -c $_concurrent_workers -q $_rate_limit_per_sec_per_worker -m GET ${_URL} > /dev/null
    echo "Waiting $_sleep_time seconds..."
    sleep $_sleep_time
done
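The script requires the hey tool on the machine where you run it. If you do not already have it, one way to install it is with Go, assuming a recent Go toolchain is available; prebuilt binaries are also linked from the project repository.
$ go install github.com/rakyll/hey@latest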
Deploy the Nomad Autoscaler
Deploy the Nomad Autoscaler before you deploy the HashiCups application.
Run the autoscaler setup script and jobspec
Run the autoscaler configuration script.
$ ./05.autoscaler.config.sh
Configure environment.
Clean previous configurations.
Successfully deleted autoscaler policy!
Create Nomad ACL policy 'autoscaler'
Successfully wrote "autoscaler" ACL policy!
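Optionally, verify that the policy exists and is attached to the autoscaler job. This check is not part of the setup script and assumes your CLI token can read ACL policies.
$ nomad acl policy info autoscaler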
Submit the autoscaler job to Nomad.
$ nomad job run 05.autoscaler.nomad.hcl
==> 2024-11-14T11:10:53+01:00: Monitoring evaluation "470de665"
    2024-11-14T11:10:53+01:00: Evaluation triggered by job "autoscaler"
    2024-11-14T11:10:53+01:00: Allocation "aab6f744" created: node "44bf6336", group "autoscaler"
    2024-11-14T11:10:55+01:00: Evaluation within deployment: "30dcd6fb"
    2024-11-14T11:10:55+01:00: Evaluation status changed: "pending" -> "complete"
==> 2024-11-14T11:10:55+01:00: Evaluation "470de665" finished with status "complete"
==> 2024-11-14T11:10:55+01:00: Monitoring deployment "30dcd6fb"
  ✓ Deployment "30dcd6fb" successful
    2024-11-14T11:11:08+01:00
    ID          = 30dcd6fb
    Job ID      = autoscaler
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully
    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    autoscaler  1        1       1        0          2024-11-14T10:21:06Z
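Optionally, confirm that the autoscaler allocation is running before you continue.
$ nomad job status autoscaler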
Deploy HashiCups
Submit the HashiCups job to Nomad.
$ nomad job run 05.hashicups.nomad.hcl
==> 2024-11-14T11:12:46+01:00: Monitoring evaluation "ddff28e0"
    2024-11-14T11:12:46+01:00: Evaluation triggered by job "hashicups"
    2024-11-14T11:12:46+01:00: Evaluation within deployment: "ace8ef73"
    2024-11-14T11:12:46+01:00: Allocation "8af7e475" created: node "dda24c18", group "frontend"
    2024-11-14T11:12:46+01:00: Allocation "976c2df2" created: node "3fadad86", group "db"
    2024-11-14T11:12:46+01:00: Allocation "a69ca45d" created: node "3fadad86", group "nginx"
    2024-11-14T11:12:46+01:00: Allocation "b2e63ebb" created: node "3fadad86", group "public-api"
    2024-11-14T11:12:46+01:00: Allocation "bfc46236" created: node "3fadad86", group "payments"
    2024-11-14T11:12:46+01:00: Allocation "f755b328" created: node "dda24c18", group "product-api"
    2024-11-14T11:12:46+01:00: Evaluation status changed: "pending" -> "complete"
==> 2024-11-14T11:12:46+01:00: Evaluation "ddff28e0" finished with status "complete"
==> 2024-11-14T11:12:46+01:00: Monitoring deployment "ace8ef73"
  ✓ Deployment "ace8ef73" successful
    2024-11-14T11:13:28+01:00
    ID          = ace8ef73
    Job ID      = hashicups
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully
    Deployed
    Task Group   Desired  Placed  Healthy  Unhealthy  Progress Deadline
    db           1        1       1        0          2024-11-14T10:23:19Z
    frontend     1        1       1        0          2024-11-14T10:23:07Z
    nginx        1        1       1        0          2024-11-14T10:23:03Z
    payments     1        1       1        0          2024-11-14T10:23:19Z
    product-api  1        1       1        0          2024-11-14T10:23:22Z
    public-api   1        1       1        0          2024-11-14T10:23:26Z
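Optionally, confirm that Nomad registered the scaling policy for the frontend group. The policy ID in your output will differ.
$ nomad scaling policy list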
Scale the frontend service
Get the public address of the API gateway and export it as the API_GW environment variable.
$ export API_GW=`nomad node status -verbose \
    $(nomad job allocs --namespace=ingress api-gateway | grep -i running | awk '{print $2}') | \
    grep -i public-ipv4 | awk -F "=" '{print $2}' | xargs | \
    awk '{print "https://"$1":8443"}'`
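Optionally, verify the gateway address before you start the load test. This quick check is not part of the tutorial scripts; the --insecure flag skips certificate verification because the gateway uses a self-signed certificate.
$ echo $API_GW
$ curl --silent --insecure --output /dev/null --write-out "%{http_code}\n" $API_GW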
Open the Nomad UI and log in with the nomad ui -authenticate command. This command opens a web browser window on your machine. Alternatively, you can open the Nomad UI with the IP address in Nomad_UI and log in with the Nomad_UI_token value.
$ nomad ui -authenticate
Opening URL "https://18.116.52.247:4646" with one-time token
The hashicups job, which consists of multiple services, appears in the list of jobs.
 

Click the hashicups job, and then select frontend from the list of task groups.
The bottom of this page displays a graph of scaling events. Keep this page open so that you can reference it when scaling starts.
 

Run the load test script and observe the graph on the frontend task page in the Nomad UI. Observe Nomad create additional allocations when the autoscaler scales the frontend service up, and then remove the allocations as the autoscaler scales the service back down.
$ ./05.load-test.sh $API_GW
Application address: https://3.15.17.40:8443
Sending 630 requests over 15 seconds
Waiting 7 seconds...
Sending 2520 requests over 15 seconds
Waiting 7 seconds...
Sending 5670 requests over 15 seconds
Waiting 7 seconds...
Sending 10080 requests over 15 seconds
Waiting 7 seconds...
Sending 15750 requests over 15 seconds
Waiting 7 seconds...
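If you prefer the CLI, you can also follow the scaling activity from a second terminal while the load test runs. This is optional and assumes the watch utility is available on your machine.
$ watch -n 5 nomad job allocs hashicups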
 

In the Consul UI, the number of instances of the frontend service registered in the catalog changes as the autoscaler scales up and down.
 

In the Consul UI, click the frontend service and then click the Instances tab to view details about each instance.
 

Before you clean up your environment, you can re-run the load script and observe changes in the Nomad UI and Consul UI as they occur.
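As an optional alternative to the Consul UI, you can count the registered frontend instances with the Consul catalog API. This assumes your shell exports CONSUL_HTTP_ADDR and CONSUL_HTTP_TOKEN from the earlier tutorials and that jq is installed; add --insecure if your Consul address uses a self-signed certificate.
$ curl --silent --header "X-Consul-Token: $CONSUL_HTTP_TOKEN" "$CONSUL_HTTP_ADDR/v1/catalog/service/frontend" | jq length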
Clean up
After you complete this tutorial, you should clean up the deployment. If you want to keep experimenting with the cluster, you can clean the cluster state without destroying the underlying infrastructure.
When you are finished, we recommend you destroy the infrastructure to avoid unnecessary costs.
Open the terminal session from which you submitted the jobs and stop the deployment when you are ready to move on. The nomad job stop command accepts more than one job.
$ nomad job stop -purge hashicups autoscaler
==> 2024-11-14T18:31:37+01:00: Monitoring evaluation "a26c41b6"
==> 2024-11-14T18:31:37+01:00: Monitoring evaluation "9592a117"
    2024-11-14T18:31:37+01:00: Evaluation triggered by job "hashicups"
    2024-11-14T18:31:37+01:00: Evaluation status changed: "pending" -> "complete"
    2024-11-14T18:31:37+01:00: Evaluation triggered by job "autoscaler"
==> 2024-11-14T18:31:37+01:00: Evaluation "9592a117" finished with status "complete"
    2024-11-14T18:31:37+01:00: Evaluation status changed: "pending" -> "complete"
==> 2024-11-14T18:31:37+01:00: Evaluation "a26c41b6" finished with status "complete"
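Optionally, confirm that the purged jobs no longer appear in the default namespace.
$ nomad job status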
Clean the autoscaler configuration.
$ ./05.autoscaler.config.sh -clean
Configure environment.
Clean previous configurations.
Successfully deleted autoscaler policy!
Only cleaning selected...Exiting.
Stop the API gateway deployment.
$ nomad job stop --namespace ingress -purge api-gateway
==> 2024-11-14T18:32:45+01:00: Monitoring evaluation "d412f90c"
    2024-11-14T18:32:45+01:00: Evaluation triggered by job "api-gateway"
    2024-11-14T18:32:45+01:00: Evaluation status changed: "pending" -> "complete"
==> 2024-11-14T18:32:45+01:00: Evaluation "d412f90c" finished with status "complete"
Remove Consul intentions.
$ ./04.intentions.consul.sh -clean
Configure environment.
Clean previous configurations.
Config entry deleted: service-intentions/database
Config entry deleted: service-intentions/product-api
Config entry deleted: service-intentions/payments-api
Config entry deleted: service-intentions/public-api
Config entry deleted: service-intentions/frontend
Config entry deleted: service-intentions/nginx
Only cleaning selected...Exiting.
Remove Consul and Nomad configuration.
$ ./04.api-gateway.config.sh -clean
Configure environment.
Clean previous configurations.
Config entry deleted: http-route/hashicups-http-route
Config entry deleted: inline-certificate/api-gw-certicate
Config entry deleted: api-gateway/api-gateway
Binding rule "29fc7353-623c-ed89-5caa-b252ebc59aad" deleted successfully
Successfully deleted namespace "ingress"!
Auth method "nomad-workloads" deleted successfully
Only cleaning selected...Exiting.
Next steps
In this tutorial, you deployed a version of HashiCups with a modified job definition for the frontend service that instructed the Nomad Autoscaler to scale up and down based on CPU load.
In this collection, you learned how to migrate a monolithic application to microservices and run them in Nomad with Consul. You deployed a cluster running Consul and Nomad, configured access to the CLI and UI components, deployed several versions of the HashiCups application to show different stages of integration with Consul and Nomad, and automatically and independently scaled one of the HashiCups services with the Nomad Autoscaler.
Check out the resources below to learn more and continue your learning and development.