Monitoring and logging

This best practice describes how to implement monitoring in a backend service.

Description

Monitoring is an important part of developing and managing a service. It is important that we are aware of any issues the service is experiencing and are able to quickly debug and get to the root cause when issues do occur. There are 3 types of monitoring we will look into in this document: logs, metrics, and tracing.

Logs

When writing logs in a backend service the following should be adhered to:

Never log personally identifiable information (PII). If you're unsure what constitutes PII, refer to the ICO's guidance.
Write JSON logs.
Make the error message unique and descriptive.
Include transaction ids on every line (AWS request id, registration id).
Provide contextual information to aid debugging.
Avoid excessive logging; unnecessary log lines increase the cost of CloudWatch Logs. Generally 1 or 2 lines per successful request is sufficient.
Log errors close to where they happen.
Include stack traces.
Ensure the CloudWatch log group name is human-readable (e.g. set explicit names for Lambdas).
Set appropriate retention (1 month is the norm).
Never log passwords or secret information.
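
Several of the rules above (JSON output, a transaction id on every line, no secrets in the output) can be enforced in one place. The following is a minimal sketch of that idea in plain TypeScript, not the API of either recommended package: buildLogLine, redact, REDACTED_KEYS and the output field names are hypothetical, and the key list is only a starting point that each service would need to extend.

```typescript
// Hypothetical helper names; not part of Powertools or Winston.
type LogLevel = "INFO" | "WARN" | "ERROR";

// Keys whose values must never reach the logs (illustrative list only).
const REDACTED_KEYS: string[] = ["password", "secret", "token", "email"];

// Recursively replace the value of any sensitive key with a placeholder.
// Assumes plain JSON-like data (objects, arrays, primitives).
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      const sensitive = REDACTED_KEYS.indexOf(key.toLowerCase()) !== -1;
      out[key] = sensitive ? "[REDACTED]" : redact(source[key]);
    }
    return out;
  }
  return value;
}

// Build one JSON log line that always carries the transaction id.
function buildLogLine(
  level: LogLevel,
  message: string,
  awsRequestId: string,
  context: Record<string, unknown> = {},
): string {
  return JSON.stringify({
    level,
    message,
    awsRequestId,
    timestamp: new Date().toISOString(),
    context: redact(context),
  });
}
```

Matching on key names only catches fields that are labelled as sensitive; PII embedded in free-text values still has to be kept out of the log call itself.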

Log packages

We have two recommended logging packages. You are free to choose your own if neither of these suits your requirements.

The two recommended packages are:

Powertools for AWS Lambda (Recommended as of 2024)
Winston

We recommend using Powertools for AWS Lambda as it is a package that is built for AWS Lambda and provides a lot of useful features out of the box. It is also easy to integrate with other AWS services.

Powertools for AWS Lambda

You can read in depth about the Powertools package here: https://docs.powertools.aws.dev/lambda/typescript/latest/core/logger/#getting-started

import { Logger } from "@aws-lambda-powertools/logger";
import type { Context } from "aws-lambda";

// Initialize the logger
const logger = new Logger({
  serviceName: "MyService",
});

// Set context information from a Lambda handler
export const handler = async (
  _event: unknown,
  context: Context,
): Promise<void> => {
  logger.addContext(context);

  logger.info("This is an INFO log with some context");

  // Example usage: the response object here is a placeholder
  try {
    const response = { statusCode: 200 };
    logger.info("Response generated", { response });
  } catch (error) {
    logger.error("Error processing request", { error });
  }
};

Winston

You can read in depth about the winston package here: https://github.com/winstonjs/winston

import { createLogger, format, transports } from "winston";

export const logger = createLogger({
  level: process.env.LOG_LEVEL || "info",
  format: format.combine(
    format.errors({ stack: true }),
    format.timestamp(),
    format.json(),
  ),
  transports: [new transports.Console()],
});

// Set context information on every request.
// `context` and `registrationId` come from the request being handled.
logger.defaultMeta = {
  awsRequestId: context.awsRequestId,
  registrationId,
};

// Example usage (`query` and `error` are values from the surrounding code)
logger.error("Database returned disallowed query exception", { query });
logger.error(error);

logger.info("Successfully completed registration", {
  context: "contextual information",
  registrationId: "123",
});

Metrics, Alarms and Dashboards

Logging is very good at providing detailed information about what happened, but it is not effective at providing a clear overview of the health of a service at a glance. Dashboards and metrics provide a way for engineers to quickly assess the status of a service before diving into logs for more detail. Metrics are also an effective trigger for alarms, so engineers are notified of an issue without having to check constantly.

In general, metrics, alarms and dashboards should be defined and maintained in CDK. The reason for this is that the monitoring can then sit close to the code and easily evolve with the code. Another reason is that configuration can be easily replicated to monitor multiple resources or environments without the need for duplication. And, as always with infrastructure as code, it enables version control.

CloudWatch and Datadog

The two tools available for tracking metrics and using them to create dashboards and alarms are AWS CloudWatch and Datadog. The most useful CloudWatch metrics are captured for AWS resources by default, but additional configuration is required to capture these metrics in Datadog from an AWS account. Datadog is a more specialist and sophisticated monitoring tool. One of the benefits of using monitoring tools in Datadog is the ability to navigate traces across different resources and services to build a bigger picture of a full journey. Note that metrics can take up to 10 minutes to appear in Datadog, which may affect your decision over which tool to use to create alarms or dashboards, depending on how business-critical the product is and how quickly you need to respond to issues.

The Datadog account is used by all of engineering in CRUK, so bear that in mind when naming resources. All resources that are sending logs or metrics to Datadog must be tagged with the Datadog unified service tags: env, service and version.
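
One way to make sure the three unified service tags are always present is to build them through a single helper that fails fast on a blank value. This is a minimal sketch under stated assumptions: unifiedServiceTags is a hypothetical helper, and the example values (including the product-prefixed service name) are placeholders rather than an agreed naming convention.

```typescript
// Hypothetical helper; the tag keys are the Datadog unified service tags named above.
interface UnifiedServiceTags {
  env: string;
  service: string;
  version: string;
}

// Return the tags as key/value pairs, rejecting any blank value.
function unifiedServiceTags(tags: UnifiedServiceTags): Record<string, string> {
  const result: Record<string, string> = {};
  const keys: (keyof UnifiedServiceTags)[] = ["env", "service", "version"];
  for (const key of keys) {
    const value = tags[key].trim();
    if (value === "") {
      throw new Error(`Datadog unified service tag "${key}" must not be empty`);
    }
    result[key] = value;
  }
  return result;
}

// Placeholder values for illustration only.
const exampleTags = unifiedServiceTags({
  env: "production",
  service: "activity-management-api",
  version: "1.4.2",
});
```

In a CDK app the resulting pairs could then be applied to a whole stack, for example with Tags.of(stack).add(key, value) from aws-cdk-lib.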

In order to send logs or metrics from an AWS account to Datadog, see the docs and seek support from the Infrastructure team to ensure the AWS account is integrated with Datadog correctly.

Metrics

Most AWS services emit metrics by default (or metrics can be enabled in CDK). The following list covers some of the most useful metrics for AWS resources.

API Gateway / AppSync - Invocations, Latency, Errors (5xx, 4xx)
Lambda - Invocations, Errors, Throttles, Latency
Step Functions - Invocations, Errors, Throttles, Latency
SQS - Visible Messages, Not Visible Messages, Age of Oldest Item, Messages Received

CloudWatch

In addition to the above default metrics, custom Metric Filters can be created in CloudWatch for metrics generated from specific log messages.
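
To illustrate what such a filter computes when logs are written as JSON: a pattern like { $.level = "ERROR" } counts the log lines whose level field equals ERROR. CloudWatch evaluates the pattern itself, so the function below is only a plain TypeScript imitation of that counting, with a hypothetical name and hypothetical sample log lines.

```typescript
// Illustration only: mimics the count a JSON metric filter would produce.
function countMatchingLogLines(
  lines: string[],
  field: string,
  expected: string,
): number {
  let count = 0;
  for (const line of lines) {
    try {
      const parsed: unknown = JSON.parse(line);
      if (
        parsed !== null &&
        typeof parsed === "object" &&
        (parsed as Record<string, unknown>)[field] === expected
      ) {
        count += 1;
      }
    } catch {
      // Lines that are not JSON can never match a JSON field pattern.
    }
  }
  return count;
}

// Hypothetical log lines: two JSON errors, one JSON info line, one non-JSON line.
const sampleLogLines = [
  '{"level":"ERROR","message":"Database returned disallowed query exception"}',
  '{"level":"INFO","message":"Successfully completed registration"}',
  "START RequestId: example",
  '{"level":"ERROR","message":"Error processing request"}',
];
const errorCount = countMatchingLogLines(sampleLogLines, "level", "ERROR");
```

This is also why unique, structured error messages matter: a filter can only isolate a specific failure if the log line carries a field or message that distinguishes it.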

Datadog

Refer to the Datadog setup above in order to send AWS metrics to Datadog.

Custom metrics can also be created in Datadog by combining default metrics in formulas, in order to track more complex behaviour.

Alarms

Alarms should focus on issues that could occur in a system, such as a DLQ being non-empty, increased latency, or an increased number of errors. If an alarm is not actionable then there is little point alarming on it.

Alarms alone are not sufficient to ensure engineers know when an issue is occurring. In addition, engineers must be notified when alarms are triggered. The preference is to send alarm notifications to dedicated Slack channels which engineers monitor. If there are any noisy alarms, try to remediate the cause before adjusting trigger thresholds.

CloudWatch

Alarms can be created to trigger on certain metrics hitting thresholds or behaving unusually. Alarms should ideally be written and managed in CDK.
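
To make "hitting thresholds" concrete: a CloudWatch alarm can be configured to fire only when several of the most recent datapoints breach the threshold ("M out of N"), which helps avoid noise from a single spike. The function below is a plain TypeScript illustration of that evaluation, not CloudWatch code; AlarmRule and isInAlarm are hypothetical names and the greater-than comparison is one assumed choice of operator.

```typescript
// Illustration only: mimics "M out of N datapoints" threshold evaluation.
interface AlarmRule {
  threshold: number; // a datapoint breaches when it is greater than this value
  evaluationPeriods: number; // N: how many of the most recent datapoints to inspect
  datapointsToAlarm: number; // M: how many of those must breach
}

function isInAlarm(datapoints: number[], rule: AlarmRule): boolean {
  if (rule.evaluationPeriods < 1 || rule.datapointsToAlarm < 1) {
    throw new Error("evaluationPeriods and datapointsToAlarm must be at least 1");
  }
  const recent = datapoints.slice(-rule.evaluationPeriods);
  const breaching = recent.filter((value) => value > rule.threshold).length;
  return breaching >= rule.datapointsToAlarm;
}

// Error counts per period, oldest first (made-up numbers).
const twoOfThree: AlarmRule = { threshold: 1, evaluationPeriods: 3, datapointsToAlarm: 2 };
const sustainedErrors = isInAlarm([0, 0, 5, 0, 6], twoOfThree); // two of the last three breach
const singleSpike = isInAlarm([9, 0, 0, 0, 2], twoOfThree); // only one of the last three breaches
```

Requiring more than one breaching datapoint trades a slower notification for fewer false alarms, so the right values depend on how quickly the product needs a response.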

The recommended approach is to use SNS with AWS Chatbot to write messages to Slack on alarms triggering.

Examples

Example from activity management

In this example the monitoring construct in the activity management stack only creates the Chatbot resources when the Slack props are passed (this is usually done only for the production environment, but can be done for certain PRs when new monitoring resources need testing).

When Slack props are provided, the construct adds the Chatbot configuration to send alarm notifications to specified Slack channels.

The monitoring construct also sets up CloudWatch metrics and alarms to monitor the service's performance and detect issues. Alarms are configured to trigger notifications for critical events, such as high error rates or increased latency. These alarms use the same SNS topic as the Chatbot configuration for their alarm action, which ensures that Slack messages are sent when an alarm is triggered. This allows the team to receive real-time alerts and respond quickly to issues.

This example also includes the creation of CloudWatch dashboards that provide a visual overview of the service's health. These dashboards display key metrics such as API Gateway invocations, Lambda errors, and SQS message counts. However, the dashboard might not be needed as Datadog provides such metrics.
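
The pattern described in this example (one SNS topic shared by the alarms and an optional Chatbot Slack configuration) might look roughly like the CDK fragment below. It is a sketch under stated assumptions and is not copied from the activity management repository: the Monitoring class, its props, the construct ids, the threshold and the configuration name are all hypothetical, and it assumes aws-cdk-lib v2.

```typescript
import { Duration } from "aws-cdk-lib";
import * as chatbot from "aws-cdk-lib/aws-chatbot";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";
import { SnsAction } from "aws-cdk-lib/aws-cloudwatch-actions";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as sns from "aws-cdk-lib/aws-sns";
import { Construct } from "constructs";

// Hypothetical props: Slack ids are only passed for environments that should notify.
export interface MonitoringProps {
  handler: lambda.IFunction;
  slack?: { workspaceId: string; channelId: string };
}

export class Monitoring extends Construct {
  constructor(scope: Construct, id: string, props: MonitoringProps) {
    super(scope, id);

    // Single topic used as the alarm action for every alarm in this construct.
    const alarmTopic = new sns.Topic(this, "AlarmTopic");

    // Alarm when the Lambda reports at least one error in a 5 minute period.
    const errorAlarm = new cloudwatch.Alarm(this, "HandlerErrors", {
      metric: props.handler.metricErrors({
        period: Duration.minutes(5),
        statistic: "Sum",
      }),
      threshold: 1,
      evaluationPeriods: 1,
      comparisonOperator:
        cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
      treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
    });
    errorAlarm.addAlarmAction(new SnsAction(alarmTopic));

    // Chatbot resources are only created when Slack props are provided.
    if (props.slack) {
      new chatbot.SlackChannelConfiguration(this, "SlackAlarms", {
        slackChannelConfigurationName: "service-alarms",
        slackWorkspaceId: props.slack.workspaceId,
        slackChannelId: props.slack.channelId,
        notificationTopics: [alarmTopic],
      });
    }
  }
}
```

Keeping the topic unconditional means the alarms are identical in every environment, and only the Slack delivery differs.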

Datadog

Once metrics are in Datadog for AWS resources, monitors (Datadog term for alarms) can be created to trigger against certain metrics.

Monitors should ideally be written and managed in CDK. CDK constructs can be imported from the npm module @cdk-cloudformation/datadog-monitors-monitor. Follow the instructions in the module docs to install the third party extension to your AWS account.

To send monitor notifications to Slack, the channel must be included in the monitor configuration and also added to the Datadog/Slack integration by a member of the Infrastructure team.

Dashboards

Products can have a dashboard that includes all the metrics that are relevant to the infrastructure and running of the service. Multiple dashboards can be created for different sections of a product (e.g. one dashboard for infrastructure and one for product-centric metrics).

CloudWatch

Dashboards should ideally be written and managed in CDK.

Datadog

Dashboards can be created in CDK. CDK constructs can be imported from the npm module @cdk-cloudformation/datadog-dashboards-dashboard, although the TypeScript type definitions in this module are fairly loose. Follow the instructions in the module docs to install the third party extension to your AWS account.

Rationale

Logs

By following the recommendations above we have the following benefits:

Easily queryable logs using CloudWatch Insights. This gives great filters and searching of our logs as we are using a JSON format.
Find all logs against a specific registration/request easily.
Provides useful information to debuggers.
Avoids unnecessary CloudWatch Logs cost by setting retention appropriately and reducing unnecessary logging, while still keeping the necessary logs on errors.
By using a library it is easily integrated with Lambdas and ECS and logging patterns are shared across products.

Metrics

Having dashboards, metrics and appropriate alarms is important for making engineers aware of issues on the products. Without them we will be unaware of problems which could impact our supporters' experience of the product.

Having well defined dashboards, metrics and alarms in the CDK means they can be tested, reviewed and documented easily for future maintainers. Having a standard on what to alarm and what to display also reduces disparity across products.

Tracing

Due to its ease of integration (very minimal changes required, if any) and superb tracing capability, Datadog is recommended for tracing over other services such as AWS X-Ray. (Both services were tested side by side to compare.)

CloudWatch dashboards are preferred as they can be managed internally and by the CDK. Full control is also provided by CloudWatch and there is a greater degree of customization.

Examples

The activity management product follows the guidance above. The CDK stack can be found here with monitoring examples.

https://github.com/CRUKorg/activity-management/blob/master/packages/server-cdk-stack/lib/constructs/monitoring.ts

References & Further Reading

Logging best practices - https://www.sentinelone.com/blog/the-10-commandments-of-logging/
Logging with Winston - https://www.section.io/engineering-education/logging-with-winston/
CloudWatch CDK docs - https://docs.aws.amazon.com/cdk/api/latest/docs/aws-cloudwatch-readme.html
Monitoring Best practices (from Microsoft) - https://docs.microsoft.com/en-us/azure/architecture/best-practices/monitoring
