Nomad Enterprise observability
Observability is a crucial aspect of operating any system. In the context of Nomad Enterprise, it allows operators to gain insights into the health, performance, and behavior of both the Nomad Enterprise cluster itself and the workloads it manages.
Tip: The three pillars of observability are:
- Metrics: Quantitative measurements of system behavior over time.
- Logs: Detailed records of events and actions within the system.
- Tracing (sometimes overlapping with Application Performance Monitoring, or APM): The ability to follow a request or transaction through a distributed system.
In a Nomad Enterprise environment, these pillars apply as follows:
Metrics
- Nomad-provided metrics help monitor cluster health, resource utilization, and job performance.
- Application metrics are typically exposed by the applications and their developers, and help monitor application-specific behavior. These metrics can be technical (number of requests per second, latency) or business-oriented (number of orders processed, revenue, and so on).
- System metrics provided by the underlying infrastructure (be it bare metal or virtualized) help monitor the health and performance of the infrastructure on which Nomad Enterprise runs.
Having all three in a single system, and being able to dashboard and alert on them together, is key to a complete understanding. As an example, the underlying infrastructure's I/O performance impacts application request latency: a spike in the latter can be hard to explain without knowledge of the former. Metrics also let you verify that the cluster can handle the load it is under with minimal waste, and they are a useful input for autoscaling.
Logs
- Nomad-provided logs come in two types: operational logs, which provide detailed information about Nomad operations (evaluations, scheduling allocations, provisioning storage/networking, operating on templates, and so on), and audit logs, which keep track of human and machine actions in Nomad.
- The applications themselves generate application logs, which help you understand application-specific behavior and events.
- The underlying operating system, and related infrastructure if it exists (for example, a hypervisor when running in VMs), provide logs that help you understand the health of the infrastructure on which Nomad runs.
Having all three in a single system, and being able to search and correlate them together, is key to a complete understanding. As an example, a network card driver issue (visible in the OS or hypervisor logs) can cause network-related errors in both the Nomad and application logs; neither would be aware of the root cause, and neither would an operator with access only to those logs.
Tracing
Tracing allows you to understand the flow of requests through applications deployed on Nomad. This is useful in a microservices architecture where a single request can span multiple services. By tracing the request, you can identify bottlenecks, latency issues, and other performance problems, and correlate them to metrics/logs explaining them.
Implementing a comprehensive observability strategy in Nomad Enterprise is crucial for both operators and application developers - it enables proactive issue detection, faster troubleshooting, and informed capacity planning, but also helps to better understand applications' behavior, identify bottlenecks and catch regressions.
The exact composition and configuration of observability tools and practices depend on the specific requirements of each Nomad Enterprise environment, the organization's needs, and the existing tooling already in use. Most observability agents support multiple inputs and outputs. The following sections provide generally applicable good practices to follow. The key is to have a holistic view of the system, and to be able to correlate information and events across the three pillars.
Observability of the Nomad cluster
Metrics to monitor
When monitoring the Nomad cluster itself, focus on the most crucial metrics as described in the Monitoring Nomad documentation. They include vital information about the health of the cluster itself, as well as scheduling performance and bottlenecks. Alongside metric collection, it is important to set up alerting on the most important and relevant metrics. There is also a complete Metrics Reference available; some metrics might not be crucial, but can still be useful for debugging or correlation.
To enable the collection of these metrics, Nomad Enterprise provides the telemetry block in agent configurations. Metrics are available in a variety of formats, and are either pushed to a compatible system/agent or made available for pulling. Statsd and Datadog are the supported push formats, and the Prometheus format is available for pulling metrics. Most modern metrics/observability agents support multiple metric formats, Prometheus if nothing else (this applies to the OpenTelemetry Collector, Datadog Agent, Elastic Agent, Grafana Alloy, InfluxData Telegraf, New Relic, AppDynamics, Dynatrace, and so on).
For example, to enable Prometheus-compatible metrics collection, you can add the following to your Nomad configuration:
telemetry {
    publish_allocation_metrics = true # if metrics from the allocations running in Nomad should be published
    publish_node_metrics = true       # if metrics from Nomad nodes should be published
    prometheus_metrics = true         # if metrics should be made available in the Prometheus format on the /metrics path
}
Operational logs and audit logs
Operational logs provide detailed information about Nomad Enterprise's internal operations, while audit logs capture security-relevant events for compliance and forensics. Ship both to a centralized location to index them for search, to ensure they survive node failures, and to correlate them with other logs and metrics; ideally this is your organization's strategic log aggregation system.
Operational logs
- Server logs: Contain information about leadership changes, job submissions, and evaluations.
- Client logs: Include details about task starts and stops, and related activities such as downloading container images, provisioning networking, attaching volumes, and so on.
To configure logging, adjust the log_level and log_file settings in your Nomad Enterprise agent configuration.
log_level = "INFO"
log_file = "/var/log/nomad/nomad.log"
Examine the log rotation settings and adjust them according to the specifics of the environment, such as available disk space.
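As a sketch, the rotation-related agent options might look like the following (the specific values are illustrative assumptions, not recommendations):
# Illustrative rotation settings alongside the options above - tune them to
# the available disk space and retention requirements.
log_rotate_bytes     = 104857600 # rotate once the active log file reaches ~100 MiB
log_rotate_duration  = "24h"     # or at least once a day
log_rotate_max_files = 10        # keep the last 10 rotated files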
Audit logs
- Capture all API requests and responses.
- Record authentication and authorization decisions.
Enable audit logging by adding an audit block to your Nomad Enterprise server configuration.
audit {
    enabled = true
    sink "file" {
        type = "file"
        format = "json"
        path = "/var/log/nomad/audit.log"
    }
}
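Audit sinks can be combined with filters. As a sketch, assuming you want to keep noisy read-only requests to health and metrics endpoints out of the audit trail (the endpoint list is an illustrative assumption), a filter block placed inside the audit block alongside the sink might look like this:
    # Drop read-only requests to health/metrics endpoints so the audit log
    # focuses on meaningful actions.
    filter "health-checks" {
        type       = "HTTPEvent"
        endpoints  = ["/v1/agent/health", "/v1/metrics"]
        stages     = ["*"]
        operations = ["GET"]
    }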
Observability of workloads scheduled by Nomad Enterprise
Pertinent metrics
Nomad Enterprise collects metrics about its workloads, such as how many resources they are using. This is useful for a broad understanding of how the workloads are behaving, but it is not enough to understand the workloads themselves. For that, collect application-specific metrics, which application developers are responsible for exposing.
For workloads running on Nomad Enterprise, focus on these generic metrics and anything else that is relevant to your applications.
- Application performance (entirely up to application developers to expose):
  - Request rates and latencies
  - Error rates
  - Custom application-specific metrics such as number of transactions, unique users, third-party API failure rates, and so on
- Resource consumption:
  - CPU and memory usage per task
  - Number of allocations running/failed
  - Number of allocations that were OOM (out of memory) killed
  - Amount of time the allocation was CPU throttled
Nomad Enterprise exposes the resource consumption metrics through its telemetry system; the application performance metrics are collected by integrating metric collection tools with Nomad Enterprise.
For example, a Prometheus agent running as a system job on all Nomad Enterprise clients can collect both the Nomad Enterprise-provided metrics from the local Nomad agent and the application-provided ones. It either scrapes the application's metrics endpoint (discoverable through service discovery or agent-native application discovery), or receives metrics that the application pushes to it over a local port/socket. This pattern scales well: the agent is centrally managed and thus easy to update and reconfigure, and the applications only need basic configuration.
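A minimal sketch of this pattern follows, assuming the Docker driver and a Prometheus image running in agent (remote-write) mode; the image version, resource sizes, and remote write endpoint are illustrative assumptions:
job "prometheus-agent" {
  type = "system" # one instance on every Nomad client

  group "monitoring" {
    task "prometheus" {
      driver = "docker"

      config {
        image        = "prom/prometheus:v2.53.0" # illustrative image/version
        network_mode = "host"                    # so localhost reaches the local Nomad agent
        args = [
          "--enable-feature=agent",              # remote-write-only agent mode
          "--config.file=/local/prometheus.yml",
        ]
      }

      # Illustrative scrape configuration rendered at runtime: it scrapes the local
      # Nomad agent's Prometheus endpoint and forwards to a hypothetical central store.
      # Application targets would be added through service discovery or static configs.
      template {
        destination = "local/prometheus.yml"
        data        = <<EOH
scrape_configs:
  - job_name: nomad
    metrics_path: /v1/metrics
    params:
      format: ["prometheus"]
    static_configs:
      - targets: ["localhost:4646"]

remote_write:
  - url: "https://metrics.example.internal/api/v1/write" # hypothetical central endpoint
EOH
      }

      resources {
        cpu    = 200
        memory = 256
      }
    }
  }
}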
Logs
By default, Nomad collects stdout and stderr from running allocations (storing them in the alloc/logs folder) and makes them available through the API, UI, and command-line tool. This is useful for live debugging and troubleshooting, but it is not enough for a production system: the logs are not persisted (they are eventually garbage collected), are not easily searchable, and do not correlate well with logs from other allocations.
To have complete log management, ship these logs to a centralized log aggregation system like ElasticSearch/OpenSearch, Splunk, or equivalent. As with everything else, this would ideally be the same place Nomad Enterprise server and client logs, and OS logs, are shipped to, so you can correlate them with each other.
There are a number of different ways to collect allocation logs from Nomad Enterprise, and the most appropriate one depends on the log aggregation system in use, the agent intermediaries, and most importantly, the allocation type (for instance, Docker, Java, exec). Some methods work in all scenarios, while others are task driver specific.
Sidecar
Since each allocation has access to all of its logs in the alloc/logs folder, a sidecar task can be run in the same task group as the main workload to collect the logs and ship them to the log aggregation system. Do this with a sidecar container that runs a log shipper like Filebeat, Fluentd, Logstash, Vector, Promtail, or equivalent. It allows custom logic such as sampling (only send 50% of the logs), filtering (such as only error-level logs), and transformations (such as removing specific labels or tags, or adding custom metadata such as node name or version). Also, since the sidecar shares the same environment, log enrichment is straightforward as it has access to all the same metadata (task/task group name, environment variables, cloud metadata, and so on). The main downside of this approach is that it is wasteful on resources (each allocation has its own logging agent), and that the configuration is per-task group, making it hard to scale without an advanced deployment flow implementing Nomad Pack and Pack dependencies or equivalent.
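A minimal sketch of the sidecar pattern follows, assuming the Docker driver and Vector as the log shipper; the image versions, application image, and aggregation endpoint are illustrative assumptions:
job "webapp" {
  group "app" {
    # Main workload; its stdout/stderr end up in the shared alloc/logs directory.
    task "app" {
      driver = "docker"

      config {
        image = "example/webapp:1.0" # hypothetical application image
      }
    }

    # Sidecar log shipper reading the shared alloc/logs directory.
    task "log-shipper" {
      driver = "docker"

      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      config {
        image = "timberio/vector:0.39.0-alpine" # illustrative shipper image/version
        args  = ["--config", "/local/vector.toml"]
      }

      # Illustrative Vector configuration: tail the allocation's log files and
      # forward them to a hypothetical central aggregation endpoint.
      template {
        destination = "local/vector.toml"
        data        = <<EOH
[sources.alloc_logs]
type    = "file"
include = ["/alloc/logs/*"]

[sinks.central]
type           = "http"
inputs         = ["alloc_logs"]
uri            = "https://logs.example.internal/ingest" # hypothetical endpoint
encoding.codec = "json"
EOH
      }
    }
  }
}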
Task driver logging
Some task drivers, most notably Docker (the Docker documentation lists the log destinations available out of the box), support various logging options and logging plugins. Configure these to send logs directly to a supported backend such as AWS CloudWatch Logs, Splunk, syslog, or equivalent.
This is one of the most efficient ways to collect logs, as it does not require an intermediary agent. However, it has a number of downsides. It relies on the task driver implementing the feature and thus does not work for heterogeneous workloads. The task driver ships logs directly as they are, so client-side sampling or transformations are not possible. The configuration is also per-task, making it hard to scale without an advanced deployment flow implementing Nomad Pack and Pack dependencies or equivalent.
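As a sketch of the task driver approach, assuming the Docker driver and a Fluentd-compatible collector listening locally (the address and tag are illustrative assumptions), the task-level logging configuration might look like this:
task "app" {
  driver = "docker"

  config {
    image = "example/webapp:1.0" # hypothetical application image

    # Send container logs straight to a Fluentd-compatible collector,
    # bypassing Nomad's own log collection.
    logging {
      type = "fluentd"
      config {
        fluentd-address = "localhost:24224" # illustrative collector address
        tag             = "webapp"
      }
    }
  }
}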
Agent integrating with task driver
Most logging agents come with support for some of the most popular task drivers (such as Docker). This allows the agent to collect logs from the Docker daemon directly (as orchestrated by Nomad Enterprise) and ship them to the log aggregation system. This is a good middle ground between the two previous approaches: it is per Nomad client (the agent can run as a system job) and thus more efficient, does not require a sidecar, allows for custom transformations/filters/sampling, and can scale.
The downside is that it requires the agent to be able to integrate with the task driver, and thus it might not work for all task drivers. For Docker in particular, agents rely on Docker labels to enrich logs with metadata (to know which container the logs are from), and by default those only include the allocation ID. Configure the Docker driver's extra_labels option to expose more metadata.
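As a sketch, the client-side Docker driver configuration to expose additional labels might look like this (the label selection is an illustrative choice):
# Nomad client configuration (not a job spec): add more Docker labels so
# logging agents can enrich container logs with Nomad metadata.
plugin "docker" {
  config {
    extra_labels = ["job_name", "task_group_name", "task_name", "namespace", "node_name"]
  }
}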
Agent integrating with Nomad Enterprise
The most advanced approach is to have the agent (running as a system job) integrate with Nomad Enterprise directly, and collect logs from the Nomad Enterprise API. This allows for the most flexibility, as the agent can collect logs from any allocation, regardless of the task driver, and do any transformations/filters required.
Configure it to collect logs from all allocations, or only those that match certain criteria. The downside here is that it requires the agent to be able to integrate with the Nomad Enterprise API, which limits the usable agents. At the time of writing, the only agent with direct support for Nomad Enterprise is Filebeat; for other agents, there are community utilities to bridge the gap (such as vector-logger for Vector, and nomad_follower for others).
Agent collecting logs from the filesystem
Nomad Enterprise clients store allocation folders locally in the data_dir. Any logging agent can also collect the logs directly from the host filesystem.
This approach is not the best because the only metadata available is the allocation ID, which is not enough information, and enriching the logs with more relevant information such as the task/task group name is not straightforward to achieve.
Tracing
Implementing distributed tracing for microservices running on Nomad Enterprise provides end-to-end visibility into request flows, service dependencies, and performance bottlenecks. Tracing is crucial for distributed environments to have a clear picture of how everything works. Nomad Enterprise itself does not provide tracing, but it allows you to run popular tracing systems alongside your workloads.
Similar to metrics and log collection, an agent can be run as a system job, and applications can send traces to it. A common pattern is to have the agent listen on a local (to the host) port/socket, and have all applications on that host send traces to it. The agent collects them, performs any necessary sampling, transformations, and filtering, and ships them to the tracing backend. This pattern scales well: the agent is centrally managed and thus straightforward to update and reconfigure, and the applications only need basic configuration.
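A minimal sketch of this pattern follows, assuming the Docker driver and the OpenTelemetry Collector (contrib distribution); the image version, sampling rate, and backend endpoint are illustrative assumptions:
job "otel-collector" {
  type = "system" # one collector per Nomad client

  group "collector" {
    network {
      port "otlp_grpc" {
        static = 4317 # well-known local port applications send traces to
      }
    }

    task "collector" {
      driver = "docker"

      config {
        image = "otel/opentelemetry-collector-contrib:0.104.0" # illustrative image/version
        ports = ["otlp_grpc"]
        args  = ["--config=/local/otel-config.yaml"]
      }

      # Illustrative collector pipeline: receive OTLP traces locally, sample them,
      # and forward to a hypothetical central tracing backend.
      template {
        destination = "local/otel-config.yaml"
        data        = <<EOH
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  probabilistic_sampler:
    sampling_percentage: 25

exporters:
  otlp:
    endpoint: tracing.example.internal:4317 # hypothetical backend
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler]
      exporters: [otlp]
EOH
      }
    }
  }
}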
Observability of the underlying infrastructure
Monitoring the underlying infrastructure is crucial for maintaining a healthy Nomad cluster:
- Host-level metrics:
  - CPU, memory, and disk usage
  - File descriptors and open connections
- Network performance:
  - Throughput and packet loss
- Storage metrics:
  - IOPS and latency
  - Read/write throughput
- Virtualization layer metrics (if applicable):
  - Hypervisor metrics like CPU ready/CPU contention, memory ballooning, and so on
When running Nomad Enterprise on bare metal, run the metric collection using Nomad Enterprise itself (an agent running as a system job) or using the underlying host OS's service management (such as systemd). When running Nomad Enterprise on virtualized infrastructure, collect the virtualization layer metrics using the virtualization provider's APIs.
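As a sketch of the system job approach, assuming the raw_exec driver is enabled on the clients and the Prometheus node_exporter binary is pre-installed at the path shown (both are assumptions):
job "node-exporter" {
  type = "system" # collect host-level metrics from every Nomad client

  group "exporter" {
    task "node_exporter" {
      # raw_exec runs directly on the host with full visibility into /proc and /sys;
      # it must be explicitly enabled in the client configuration.
      driver = "raw_exec"

      config {
        command = "/usr/local/bin/node_exporter" # assumed pre-installed path
        args    = ["--web.listen-address=:9100"]
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}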
Who watches the watchman
If you are self-hosting the observability stack on top of Nomad Enterprise, it is crucial to monitor the monitoring system itself, so that a Nomad Enterprise cluster outage does not take out your ability to notice and debug it. Do this by having a separate Nomad Enterprise cluster running the monitoring stack, or by using a SaaS monitoring system for high-level health checks. This is especially important for the alerting system: if it goes down, you cannot know whether anything else is also down. "Dead man's switches" (raising an alert if a heartbeat signal from the monitoring system stops arriving) are a good way to ensure that you are aware of the monitoring system being down.
Benchmarking Nomad Enterprise
Benchmarking Nomad Enterprise is a crucial part of setting up observability, and evaluating your setup. It allows you to understand the performance limits of your Nomad Enterprise cluster and identify bottlenecks before they impact production workloads.
HashiCorp provides a Nomad Benchmarking project to help you get started with Nomad cluster performance testing. Refer to this blog post from the Nomad team on the project and their results.
Reference/useful links
- Documentation: Nomad metrics reference
- Documentation: Monitoring Nomad
- Blog: Logging on Nomad with Vector
- Blog: Logging on Nomad and log aggregation with Loki
- OpenTelemetry collector running in Nomad example
- Running the OpenTelemetry demo app on HashiCorp Nomad
- Datadog integration for Nomad
- Nomad integration for Grafana Cloud
- Filebeat autodiscover
- Prometheus configuration reference