As outlined in a previous blog post, Heroku Data services undergo routine maintenances for security and patching. In this post, we describe the process used to minimize downtime for Heroku Postgres and Heroku Key-Value Store premium ‘High Availability’ plans and how we optimized the process to perform up to 75% faster.
Data Services Architecture
High availability plans for Heroku Postgres and Heroku Key-Value Store are designed to have two database instances running at the same time. One is a writeable primary database server and the other is a read-only hidden standby. Since the standby is hidden, customers cannot access it during normal operations.
Before starting a planned maintenance, we do our best to ensure that your standby is fully caught up with your primary.
During the maintenance, we replace the primary with its standby (which becomes the new primary), and tell your app to start connecting to the new primary instead of the old one. We then build a new standby in the background. If your app attempts to write to the old primary after the old standby unfollows it, you may experience lost writes. The most critical part of the maintenance is ensuring your app connects and writes to the proper database instance.
The Old Way of Controlling Writes During a Maintenance
In the past, we changed the config var in your app so that your DATABASE_URL pointed to the standby, then triggered an app restart to ensure that your app reconnected to the read-only standby. While this was happening, your primary could still accept writes and replicate them to the standby.
During the rolling restart of your app’s dynos, some dynos could be connected to the old primary while others were already connected to the standby. To minimize lost writes, we would terminate the primary instance. Any long-running transactions were given time to complete as the database instance shut down safely. The maintenance then had to pause long enough to ensure that any remaining writes were replicated from the shutting-down primary to the standby.
Once the primary was shut down, the standby could stop following the primary, exit read-only mode, and become the new primary. At that point, your app would no longer see errors connecting to the new primary. The maintenance then finished cleanup and built a new standby to follow the new primary.
The New Way of Controlling Writes During a Maintenance
As mentioned above, the old way waited for the primary to shut down safely, which meant downtime could be extended by long-running transactions.
The new method immediately redirects any new connections to the old standby (the soon-to-be new primary), which ensures that no new connections are made to the old primary. We then kill any existing connections to the old primary. Your app may see connection errors at that point, but any reconnections safely move to the new primary. The ensuing rolling restart of your app correctly points all dynos to the new primary.
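Because existing connections are killed during the cutover, it helps if your app reconnects instead of failing outright. The following is a minimal sketch of that idea, not a Heroku API: `run_with_reconnect`, `connect`, and `operation` are hypothetical names, and it assumes your database driver's connection failures surface as (or are translated to) Python's built-in `ConnectionError`.

```python
import time


def run_with_reconnect(connect, operation, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run operation(conn), reconnecting with exponential backoff on ConnectionError.

    `connect` and `operation` are hypothetical callables supplied by the app.
    `connect` must open a fresh connection every time it is called, so a retry
    never reuses a connection that was killed during the maintenance.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            conn = connect()  # a fresh connection is routed to the current primary
            return operation(conn)
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Back off before retrying: base_delay, 2x, 4x, ...
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

Only wrap work that is safe to repeat, such as a single transaction that either commits fully or not at all. Retrying a non-idempotent write whose outcome is unknown can apply it twice.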
Now that any connections to the old primary are redirected, we know that no further writes can reach the old primary. It is safe for the old standby to unfollow the old primary, exit read-only mode, and become the new primary. We can now safely perform the unfollow step much more quickly than in the past. We also gain additional control over the steps of the maintenance and ensure that tail latencies are greatly reduced.
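The ordering described above (redirect new connections, kill existing ones, then promote the standby) can be illustrated with a toy in-memory model. This is only a sketch of the sequence, not how Heroku implements it: the `Instance` and `Cluster` classes and all their fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    writable: bool
    connections: set = field(default_factory=set)


class Cluster:
    """Toy model: a primary, a hidden standby, and a routing target."""

    def __init__(self):
        self.primary = Instance("old-primary", writable=True)
        self.standby = Instance("old-standby", writable=False)
        self.route_to = self.primary  # where new connections land

    def connect(self, client):
        self.route_to.connections.add(client)
        return self.route_to

    def cutover(self):
        old, new = self.primary, self.standby
        self.route_to = new            # 1. redirect all new connections
        dropped = set(old.connections)
        old.connections.clear()        # 2. kill existing connections to the old primary
        assert not old.connections     # nothing can write to the old primary anymore
        new.writable = True            # 3. standby unfollows and exits read-only mode
        self.primary, self.standby = new, None  # a new standby is built afterwards
        return dropped
```

In this model, a client connected before `cutover()` is dropped and lands on the promoted instance when it reconnects. Promotion never has to wait for the old primary to shut down.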
The New Way Is Up to 75% Faster
This new approach brings significant improvements to the most critical and potentially problematic parts of maintenances. As a result, we now see maintenances for HA data services complete in 15-40 seconds, a big improvement on the 60 seconds or more required before. Heroku’s own internal databases now use this behavior for performing maintenances, and we’ve reduced customer impact to the point that in many cases we can act without advance notice.
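One way to read the headline figure is as the best case of the numbers quoted above: 15 seconds against a 60-second baseline. The short calculation below makes that explicit. Treating 60 seconds as the baseline is an assumption (the old process took "60 seconds or more"), and `percent_faster` is a hypothetical helper.

```python
def percent_faster(old_seconds, new_seconds):
    """Reduction in duration, as a percentage of the old duration."""
    return (old_seconds - new_seconds) / old_seconds * 100


best_case = percent_faster(60, 15)   # fastest new maintenance vs. assumed baseline
worst_case = percent_faster(60, 40)  # slowest new maintenance vs. assumed baseline
print(f"{best_case:.0f}% faster at best, {worst_case:.0f}% faster at worst")
```

Against a baseline longer than 60 seconds, the same 15 to 40 second range would work out to a larger percentage improvement.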
Best of all, this is now the default behavior for maintenances on Heroku Postgres and Heroku Key-Value Store databases with Premium, Private, or Shield plans. As always, we are constantly working to minimize the downtime caused by maintenances.