How to Stop a Node Server: Handling Node.js Errors on Shutdown

This blog post is adapted from a talk given by Julián Duque at NodeConf EU 2019 titled "Let it crash!"

Before coming to Heroku, I did some consulting work as a Node.js solutions architect. My job was to visit various companies and make sure that they were successful in designing production-ready Node applications. Unfortunately, I witnessed many different problems when it came to error handling, especially on process shutdown. When an error occurred, there was often not enough visibility on why it happened, a lack of logging details, and bouts of downtime as applications attempted to recover from crashes.

We started to assemble a collection of best practices and recommendations on error handling, the best way to stop a Node server process, and building for fast Node.js process restarts. In this post, I'll walk through some of the background on the Node.js process lifecycle, some strategies to properly handle graceful shutdown, and how to quickly restart Node after a catastrophic error terminates your program.

The Node.js process lifecycle

Let's first briefly explore how Node.js operates. A Node.js process is very lightweight and has a small memory footprint. Because crashes are an inevitable part of programming, your primary goal when architecting an application is to keep the startup process very lean, so that the Node.js process restarts as fast as possible. If your startup operations include CPU-intensive work or synchronous operations, your Node.js processes may not be able to restart quickly.

A strategy you can use here is to prebuild as much as possible. That might mean preparing data or compiling assets during the building process. It may increase your deployment times, but it's better to spend more time outside of the startup process. Ultimately, this ensures that when a crash does happen, you can exit a process and start a new one without much downtime.
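As a rough illustration of prebuilding, the sketch below splits the work into a build step and a startup step. The file name `lookup.json` and the helpers `buildLookup` and `loadLookup` are hypothetical, and the loop is only a stand-in for whatever expensive preparation your application needs.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical location for the prebuilt artifact
const ARTIFACT = path.join(os.tmpdir(), 'lookup.json')

// Build step: run once at build/deploy time, not on every process start
function buildLookup (file = ARTIFACT) {
  const table = {}
  for (let i = 0; i < 1000; i++) {
    table[i] = i * i // stand-in for CPU-intensive preparation
  }
  fs.writeFileSync(file, JSON.stringify(table))
  return file
}

// Startup step: only reads and parses the prebuilt file
function loadLookup (file = ARTIFACT) {
  return JSON.parse(fs.readFileSync(file, 'utf8'))
}
```

In this arrangement, `buildLookup()` would be called from a build script, and the server would call only `loadLookup()` when it boots. One small read at startup is usually far cheaper than recomputing the data on every restart.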

Node.js exit methods

Let's take a look at several ways you can terminate a Node.js process and the differences between them.

The most commonly used Node.js exit function is process.exit(), which takes a single optional argument, an integer exit code. If the argument is 0, it represents a successful exit state. Any non-zero value indicates that an error occurred; 1 is a common exit code for failures here.
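To see exit codes from the outside, here is a minimal sketch that runs one-line scripts in a child Node.js process and reads back the code each one exited with. The helper name `exitCodeOf` is made up for this example.

```javascript
const { spawnSync } = require('child_process')

// Runs a one-line script in a child Node.js process and returns its exit code
function exitCodeOf (script) {
  const result = spawnSync(process.execPath, ['-e', script], { timeout: 10000 })
  return result.status
}

console.log(exitCodeOf('process.exit(0)')) // successful exit
console.log(exitCodeOf('process.exit(1)')) // generic failure
console.log(exitCodeOf('process.exit()')) // no argument: falls back to process.exitCode, or 0
```

A process monitor or shell script sees exactly this number, which is why it matters to pass a non-zero code when something went wrong.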

Another option is process.abort(). When this method is called, the Node.js process terminates immediately. More importantly, if your operating system allows it, the Node.js exit will also generate a core dump file, which contains a ton of useful information about the process. You can use this core dump to do some postmortem debugging using tools like llnode.
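The sketch below shows how an aborted process looks to its parent. It assumes a POSIX system such as Linux or macOS, where I would expect the child to be reported as killed by SIGABRT rather than returning a normal exit code. Whether a core file is actually written depends on system settings such as `ulimit -c`, which this example does not touch.

```javascript
const { spawnSync } = require('child_process')

// Runs process.abort() in a child process and reports how it ended
function runAbort () {
  const result = spawnSync(process.execPath, ['-e', 'process.abort()'], {
    timeout: 10000
  })
  return { status: result.status, signal: result.signal }
}

const outcome = runAbort()
console.log(outcome)
```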

Node.js exit events

As Node.js is built on top of JavaScript, it has an event loop, which allows you to listen for events that occur and act on them. When Node.js exits, it also emits several types of events.

One of these is beforeExit, and as its name implies, it is emitted right before a Node process exits. You can provide an event handler which can make asynchronous calls, and the event loop will continue to perform the work until it's all finished. It's important to note that this event is not emitted on process.exit() calls or uncaughtExceptions; we'll get into when you might use this event a little later.

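Another event is exit, which is emitted when the process is about to exit, either because process.exit() was called explicitly or because the event loop has no more work to do. Its listeners must be synchronous: once they return, the process exits, so any asynchronous work scheduled in this handler is abandoned.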

The code sample below illustrates the differences between the two events:

process.on('beforeExit', code => {
  // Can make asynchronous calls
  setTimeout(() => {
    console.log(`Process will exit with code: ${code}`)
    process.exit(code)
  }, 100)
})

process.on('exit', code => {
  // Only synchronous calls
  console.log(`Process exited with code: ${code}`)
})
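To check which events fire in which situation, the following sketch runs the two handlers in a child process and collects what they print. The helper `eventsFor` is a name invented for this example, and the handlers only log the event name.

```javascript
const { spawnSync } = require('child_process')

// Both handlers only log, so the child's stdout shows which events fired
const handlers = `
  process.on('beforeExit', () => console.log('beforeExit'))
  process.on('exit', () => console.log('exit'))
`

// Runs the handlers plus some extra code in a child process
// and returns the names of the events that were emitted, in order
function eventsFor (extra) {
  const result = spawnSync(process.execPath, ['-e', handlers + extra], {
    encoding: 'utf8',
    timeout: 10000
  })
  return result.stdout.trim().split('\n')
}

// The event loop drains naturally: both events fire
console.log(eventsFor(''))
// Explicit process.exit(): beforeExit is skipped
console.log(eventsFor('process.exit(2)'))
```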

OS signal events

Your operating system emits events to your Node.js process, too, depending on the circumstances occurring outside of your program. These are referred to as signals. Two of the more common signals are SIGTERM and SIGINT.

SIGTERM is normally sent by a process monitor to tell Node.js to expect a successful termination. If you're running systemd or upstart to manage your Node application, and you stop the service, it sends a SIGTERM event so that you can handle the process shutdown.

SIGINT is emitted when a Node.js process is interrupted, usually as the result of a Ctrl+C (^C) keyboard event. You can also capture that event and do some work around it.

Here is an example showing how you may act on these signal events:

process.on('SIGTERM', signal => {
  console.log(`Process ${process.pid} received a SIGTERM signal`)
  process.exit(0)
})

process.on('SIGINT', signal => {
  console.log(`Process ${process.pid} has been interrupted`)
  process.exit(0)
})

Since these two events are considered a successful termination, we call process.exit and pass an argument of 0 because it is something that is expected.
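The difference a handler makes can be observed from a parent process. In this sketch, which assumes a POSIX system, a child process sends SIGTERM to itself, once with a handler installed and once without. The names `childScript` and `run` are made up for the example.

```javascript
const { spawnSync } = require('child_process')

// Child script: optionally installs a SIGTERM handler, then signals itself
const childScript = withHandler => `
  ${withHandler ? "process.on('SIGTERM', () => process.exit(0))" : ''}
  setInterval(() => {}, 1000) // keep the event loop alive
  process.kill(process.pid, 'SIGTERM')
`

// Runs the child and reports its exit code and terminating signal
function run (withHandler) {
  const { status, signal } = spawnSync(
    process.execPath,
    ['-e', childScript(withHandler)],
    { timeout: 10000 }
  )
  return { status, signal }
}

console.log(run(true)) // handler ran and chose the exit code
console.log(run(false)) // no handler: the signal terminates the process
```

With the handler, the parent sees a clean exit code of 0. Without it, the parent sees a process that was killed by the signal and never got a chance to clean up.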

JavaScript error events

At last, we arrive at higher-level error types: the error events thrown by JavaScript itself.

When a JavaScript error is not properly handled, an uncaughtException is emitted. These suggest the programmer has made an error, and they should be treated with the utmost priority. Usually, it means a bug occurred in a piece of logic that needed more testing, such as calling a method on a null value.

An unhandledRejection error is a newer concept. It is emitted when a promise is not satisfied; in other words, a promise was rejected (it failed), and there was no handler attached to respond. These errors can indicate an operational error or a programmer error, and they should also be treated as high priority.

In both of these cases, you should do something counterintuitive and let your program crash! Let the Node.js process restart play out. Please don't try to be clever and introduce some complex logic trying to avoid restarting a Node process on an uncaughtException. Doing so will almost always leave your application in a bad state, whether that's having a memory leak or leaving sockets hanging. It's simpler to let it crash, start a new process from scratch, and continue receiving more requests.

Here's some code indicating how you might best handle these events:

process.on('uncaughtException', err => {
  console.log(`Uncaught Exception: ${err.message}`)
  process.exit(1)
})

We’re explicitly “crashing” the Node.js process here! Don’t be afraid of this! It is more likely than not unsafe to continue. The Node.js documentation says,

Unhandled exceptions inherently mean that an application is in an undefined state...The correct use of 'uncaughtException' is to perform synchronous cleanup of allocated resources (e.g. file descriptors, handles, etc) before shutting down the process. It is not safe to resume normal operation after 'uncaughtException'.

process.on('unhandledRejection', (reason, promise) => {
  // `reason` is usually an Error, but a promise can reject with any value
  console.log('Unhandled rejection at', promise, 'reason:', reason)
  process.exit(1)
})

unhandledRejection is such a common error that the Node.js maintainers decided it should crash the process. At the time of the talk, Node.js printed the deprecation warning below; since Node.js 15, an unhandled rejection terminates the process by default.

[DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
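To watch both handlers from this section do their job, the sketch below triggers each kind of error in a child process and checks how the child ended. The helper `crash` and the error message `boom` are invented for the example.

```javascript
const { spawnSync } = require('child_process')

// Child script with a programmer error (calling a method on null)
const onException = `
  process.on('uncaughtException', err => {
    console.error('Uncaught Exception: ' + err.message)
    process.exit(1)
  })
  null.method()
`

// Child script with a rejected promise that nobody handles
const onRejection = `
  process.on('unhandledRejection', reason => {
    console.error('Unhandled Rejection: ' + reason.message)
    process.exit(1)
  })
  Promise.reject(new Error('boom'))
`

// Runs a script in a child process; returns its exit code and stderr
function crash (script) {
  const { status, stderr } = spawnSync(process.execPath, ['-e', script], {
    encoding: 'utf8',
    timeout: 10000
  })
  return { status, stderr: stderr.trim() }
}

console.log(crash(onException))
console.log(crash(onRejection))
```

In both cases the child logs the error and exits with code 1, which is the signal a process monitor needs to start a fresh process.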

Run more than one process

Even if your process startup time is extremely quick, running just a single process is a risk to safe and uninterrupted application operation. We recommend running more than one process and using a load balancer to handle the scheduling. That way, if one of the processes crashes, another process is still alive and able to receive new requests. This gives you a little more leverage and helps prevent downtime.

Use whatever you have on hand for the load balancing. You can configure a reverse proxy like nginx or HAProxy to do this. If you're on Heroku, you can scale your application to increase the number of dynos. If you're on Kubernetes, you can use Ingress or other load balancer strategies for your application.

Monitor your processes

You should have process monitoring in place: something running in your operating system or application environment that constantly checks whether your Node.js process is alive. If the process crashes due to a failure, the process monitor is in charge of restarting it.

Our recommendation is to always use the native process monitoring that's available on your operating system. For example, if you're running on Unix or Linux, you can use an init system such as systemd or upstart. If you're using containers, Docker has a --restart flag, and Kubernetes has restartPolicy, both of which are useful.

If you can't use any existing tools, use a Node.js process monitor like PM2 or forever as a last resort. These tools are okay for development environments, but I can't really recommend them for production use.

If your application is running on Heroku, don’t worry—we take care of the restart for you!

Graceful shutdowns

Let's say we have a server running. It's receiving requests and establishing connections with clients. But what happens if the process crashes? If we're not performing a graceful shutdown, some of those sockets are going to hang around and keep waiting for a response until a timeout has been reached. That unnecessary time spent consumes resources, eventually leading to downtime and a degraded experience for your users.

It's best to explicitly stop receiving connections, so that the server can finish or drop its existing connections while it shuts down. Any new connections will go to the other Node.js processes running behind the load balancer.

To do this, you can call server.close(), which tells the server to stop accepting new connections. Most Node servers implement this method, and it accepts a callback function as an argument; the callback runs once all existing connections have ended.

Now, imagine that your server has many clients connected, and the majority of them have not experienced an error or crashed. How can you close the server without abruptly disconnecting valid clients? We'll use a timeout: if all the connections don't close within a certain limit, we shut down the server completely. We do this because we want to give existing, healthy clients time to finish up, but we don't want the server to wait an excessively long time to shut down.

Here's some sample code of what that might look like:

process.on('<signal or error event>', _ => {
  server.close(() => {
    process.exit(0)
  })
  // If server hasn't finished in 1000ms, shut down process
  setTimeout(() => {
    process.exit(0)
  }, 1000).unref() // unref() keeps this timer from holding the event loop open
})
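The same close-or-time-out idea can be pulled into a small helper that reports which of the two happened. This is only a sketch: `closeWithTimeout` and its outcome strings are hypothetical names, and it assumes nothing about the server beyond a `close(callback)` method.

```javascript
// Closes `server`, but gives up waiting after `timeoutMs` milliseconds.
// Calls `done` exactly once with either 'closed' or 'timeout'.
function closeWithTimeout (server, timeoutMs, done) {
  let finished = false

  const finish = outcome => {
    if (finished) return // ignore whichever path loses the race
    finished = true
    clearTimeout(timer)
    done(outcome)
  }

  // Fallback: stop waiting for slow connections after the limit
  const timer = setTimeout(() => finish('timeout'), timeoutMs)
  timer.unref() // the timer alone should not keep the process alive

  // Stop accepting new connections; the callback runs once existing ones end
  server.close(() => finish('closed'))
}
```

Inside a signal handler this could be used as `closeWithTimeout(server, 1000, () => process.exit(0))`. Receiving the outcome also makes it possible to log whether connections drained in time.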

Logging

Chances are you have already implemented a robust logging strategy for your running application, so I won't go into much detail here. Just remember to log with the same rigor and level of detail when the application shuts down!

If a crash occurs, log as much relevant information as possible, including the errors and stack trace. Rely on libraries like pino or winston in your application, and store these logs using one of their transports for better visibility. You can also take a look at our various logging add-ons to find a provider which matches your application’s needs.
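As a dependency-free stand-in for what a library like pino or winston would produce, the sketch below builds a single JSON log line for a fatal error. The function name `crashLogLine` and the field names are assumptions for this example, not the format of either library.

```javascript
// Builds one JSON log line describing a fatal error.
// `reason` is a short label such as 'uncaughtException' or 'SIGTERM'.
function crashLogLine (err, reason) {
  return JSON.stringify({
    level: 'fatal',
    time: new Date().toISOString(),
    pid: process.pid,
    reason,
    message: err && err.message,
    stack: err && err.stack
  })
}

console.error(crashLogLine(new Error('boom'), 'uncaughtException'))
```

Writing the line synchronously before exiting matters here, because asynchronous work started in a crash handler may never complete.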

Make sure everything is still good

Last, and certainly not least, we recommend that you add a health check route. This is a simple endpoint that returns a 200 status code if your application is running:

// Add a health check route in express
app.get('/_health', (req, res) => {
  res.status(200).send('ok')
})

You can have a separate service continuously monitor that route. You can configure this in a number of ways, whether by using a reverse proxy, such as nginx or HAProxy, or a load balancer, like ELB or ALB.

Any application that acts as the top layer of your Node.js process can be used to constantly monitor that the health check is returning. These will also give you much more visibility into the health of your Node.js processes, and you can rest easy knowing that your Node processes are running properly. There are some great monitoring services to help you with this in the Add-ons section of our Elements Marketplace.
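One possible refinement, which is my own assumption rather than something from the talk, is to make the health check report failure once a shutdown has started, so that a load balancer stops routing traffic to a process that is about to exit. The sketch uses only the `writeHead` and `end` methods of Node's built-in HTTP response; `createHealthCheck` and `markShuttingDown` are hypothetical names.

```javascript
// Creates a health check handler plus a switch to flip when shutdown begins
function createHealthCheck () {
  let shuttingDown = false

  return {
    // Call this from your SIGTERM or error handler
    markShuttingDown () {
      shuttingDown = true
    },
    // Compatible with http.createServer((req, res) => ...)
    handler (req, res) {
      const status = shuttingDown ? 503 : 200
      res.writeHead(status, { 'Content-Type': 'text/plain' })
      res.end(shuttingDown ? 'shutting down' : 'ok')
    }
  }
}
```

The handler reads the flag through a closure rather than `this`, so it can be passed around as a plain function.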

Putting it all together: A function to gracefully restart Node apps

Whenever I work on a new Node.js project, I use the same function to ensure that my crashes are logged and my recoveries are guaranteed. It looks something like this:

function terminate (server, options = { coredump: false, timeout: 500 }) {
  // Exit function: abort to produce a core dump, or exit with the given code
  const exit = code => {
    options.coredump ? process.abort() : process.exit(code)
  }

  return (code, reason) => (err, promise) => {
    if (err && err instanceof Error) {
      // Log error information, use a proper logging library here :)
      console.log(reason, err.message, err.stack)
    }

    // Attempt a graceful shutdown, then exit with the configured code
    server.close(() => exit(code))
    // Force the exit if connections have not closed within the timeout
    setTimeout(() => exit(code), options.timeout).unref()
  }
}

module.exports = terminate

Here, I've created a module called terminate. I pass the instance of that server that I'm going to be closing, and some configuration options, such as whether I want to enable core dumps, as well as the timeout. I usually use an environment variable to control when I want to enable a core dump. I enable them only when I am going to do some performance testing on my application or whenever I want to replicate the error.

This exported function can then be set to listen to our error events:

const http = require('http')
const terminate = require('./terminate')
const server = http.createServer(...)

const exitHandler = terminate(server, {
  coredump: false,
  timeout: 500
})

process.on('uncaughtException', exitHandler(1, 'Unexpected Error'))
process.on('unhandledRejection', exitHandler(1, 'Unhandled Promise'))
process.on('SIGTERM', exitHandler(0, 'SIGTERM'))
process.on('SIGINT', exitHandler(0, 'SIGINT'))
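The environment-variable toggle for core dumps mentioned earlier could be sketched like this. The variable names `ENABLE_COREDUMP` and `SHUTDOWN_TIMEOUT_MS` and the helper `terminateOptions` are hypothetical; pick whatever fits your deployment.

```javascript
// Builds the options object for terminate() from environment variables.
// ENABLE_COREDUMP and SHUTDOWN_TIMEOUT_MS are hypothetical variable names.
function terminateOptions (env = process.env) {
  const timeout = Number.parseInt(env.SHUTDOWN_TIMEOUT_MS, 10)
  return {
    // Core dumps are opt-in: only the exact string 'true' enables them
    coredump: env.ENABLE_COREDUMP === 'true',
    // Fall back to 500ms when the variable is missing or not a number
    timeout: Number.isNaN(timeout) ? 500 : timeout
  }
}

console.log(terminateOptions())
```

The result can be passed straight through, as in `terminate(server, terminateOptions())`, so that core dumps stay off in normal operation and can be switched on for a debugging session without a code change.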

Additional resources

There are a number of existing npm modules that solve the aforementioned issues in similar ways. You can check these out as well:

Hopefully, this information will simplify your life and enable your Node app to run better and safer in production!

Originally published: December 17, 2019
