May 22, 2012 by Mark Pundsack
Developers like you deploy code to hundreds of thousands of apps every month on the Heroku platform. Some of these are production apps which serve hundreds of millions or even billions of requests per month. Uptime of the platform is critical for such apps.
We want to achieve the sustained reliability that these apps require. But when there are incidents that impact uptime, we want to maximize our transparency and accountability to you and all developers on the platform.
Today, we’re launching a completely redesigned status.heroku.com, which provides real-time status of the platform, the ability to sign up for email or SMS notification of incidents, and recent uptime history in both visual and numeric formats.
Let’s zoom in on each of these points.
Incident monitoring: The circles at the top show green (all systems go), yellow (intermittent errors or partial impact), or red (major outage of the specified component). The boxes below describe the incident in detail and show how long it has been going on. The page uses Pusher to update itself automatically as we post updates, so you don’t have to keep reloading to follow an ongoing incident.
Proactive Alerts: Click “Subscribe to Notifications” in the upper right corner to receive notifications on platform incidents in a variety of formats: email, SMS, Twitter, or RSS.
Uptime numbers for the last month: At the top, you’ll see uptime numbers for the previous month. This provides an at-a-glance answer to the question of “How stable has Heroku been lately?” Our numbers here haven’t been as good as we’d like in the last few months, but by posting them publicly here we intend to create the transparency and accountability that will help drive us to improve. After all, you make what you measure.
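To make the idea of a monthly uptime number concrete, here is a minimal sketch of how such a figure could be derived from a list of incident intervals. It is an illustration only, not a description of how the status site computes its numbers: the function name `uptime_percent`, the choice to merge overlapping incidents, and the sample incidents are all assumptions made for this example.

```python
from datetime import datetime


def uptime_percent(window_start, window_end, incidents):
    """Return uptime as a percentage of the window.

    `incidents` is a list of (start, end) datetime pairs. Intervals are
    clipped to the window and overlapping intervals are merged, so the
    same minute of downtime is never counted twice. (Hypothetical helper.)
    """
    window_seconds = (window_end - window_start).total_seconds()
    if window_seconds <= 0:
        raise ValueError("window_end must be after window_start")

    # Keep only incidents that touch the window, clipped to its bounds.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in incidents
        if end > window_start and start < window_end
    )

    down_seconds = 0.0
    current_start = current_end = None
    for start, end in clipped:
        if current_end is None or start > current_end:
            # Previous merged interval is finished; add it to the total.
            if current_end is not None:
                down_seconds += (current_end - current_start).total_seconds()
            current_start, current_end = start, end
        else:
            # Overlapping incident: extend the current merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        down_seconds += (current_end - current_start).total_seconds()

    return 100.0 * (1.0 - down_seconds / window_seconds)


# Made-up incidents for a 30-day month (not real Heroku data).
april = (datetime(2012, 4, 1), datetime(2012, 5, 1))
sample_incidents = [
    (datetime(2012, 4, 3, 10, 0), datetime(2012, 4, 3, 11, 30)),
    (datetime(2012, 4, 3, 11, 0), datetime(2012, 4, 3, 12, 0)),   # overlaps the first
    (datetime(2012, 4, 20, 0, 0), datetime(2012, 4, 20, 1, 36)),
]
print(f"{uptime_percent(*april, sample_incidents):.2f}%")  # prints "99.50%"
```

In this made-up month, 216 minutes of merged downtime out of 43,200 minutes works out to 99.5% uptime.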
Timeline: Most status sites, including the previous implementation of the Heroku status site, list incidents in a blog-like format. For the new Heroku status site, we took the opportunity to try something more innovative, and the result was the timeline view. Incidents are displayed on a vertical timeline. When the platform is performing normally, the timeline is green. When an incident occurs, a red or yellow bar sized to the duration of the incident is plotted against the timeline. By scrolling down the timeline, you can get a feel for the duration and frequency of recent incidents, without needing to scrutinize each one individually.
Production vs Development: There are two circles for status, two uptime numbers, and two timelines. Why the separation? Heroku’s operational efforts always prioritize continuity of service for existing production applications over service for development/prototype/hobby apps, or the ability to take development actions (such as deploying new code) against production apps. If you can’t push code to your app for ten minutes, it’s an annoyance. If your production app stops serving traffic to your users for ten minutes, that’s a much bigger problem. Today, production apps are defined as any app running two or more dynos with a production-grade database.
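The production definition above can be written as a simple predicate. This is only a sketch of the rule as stated in this post: the `App` type and its field names are hypothetical, and which database plans count as production-grade is not specified here, so the example takes that as a plain boolean.

```python
from dataclasses import dataclass


@dataclass
class App:
    # Hypothetical record; field names are assumptions for this example.
    name: str
    dynos: int
    has_production_database: bool


def is_production(app: App) -> bool:
    """Apply the stated rule: two or more dynos AND a production-grade database."""
    return app.dynos >= 2 and app.has_production_database


apps = [
    App("storefront", dynos=4, has_production_database=True),
    App("prototype", dynos=1, has_production_database=False),
    App("staging", dynos=2, has_production_database=False),
]
for app in apps:
    tier = "production" if is_production(app) else "development"
    print(f"{app.name}: {tier}")
```

Under this rule, an app with several dynos but no production-grade database would still be tracked by the development status and uptime numbers.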
API: Like other parts of the Heroku platform, the new status site has its own API. Use this to create a custom monitoring tool, your own front-end, or whatever you like.
Documentation can be found in the Dev Center.
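As a rough idea of what a custom monitoring tool built on the status API might do, the sketch below summarizes a status payload. The payload shape shown here (a `status` object mapping "Production" and "Development" to a color) is an assumption made for illustration, as is the helper name `summarize_status`; check the Dev Center documentation for the actual endpoints and response format before relying on any of it. No network call is made, so the example runs on a hard-coded sample.

```python
import json

# Colors described in this post, ordered from healthy to worst.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def summarize_status(payload: str) -> dict:
    """Summarize a status payload (assumed shape, see the note above)."""
    data = json.loads(payload)
    colors = {name.lower(): color.lower() for name, color in data["status"].items()}

    for name, color in colors.items():
        if color not in SEVERITY:
            raise ValueError(f"unexpected status {color!r} for {name}")

    worst = max(colors.values(), key=SEVERITY.__getitem__)
    degraded = sorted(name for name, color in colors.items() if color != "green")
    return {"colors": colors, "worst": worst, "degraded": degraded}


# Hypothetical sample response, not captured from the real API.
sample = '{"status": {"Production": "green", "Development": "yellow"}, "issues": []}'
summary = summarize_status(sample)
if summary["degraded"]:
    print("Degraded:", ", ".join(summary["degraded"]), "- worst status:", summary["worst"])
else:
    print("All systems go")
```

A real tool would fetch the payload over HTTPS on a schedule and alert only when the summary changes, but those details depend on the documented API.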
This new status site is more than just eye candy: it provides transparency into the uptime of the Heroku platform and holds us accountable for it. It’s our goal to earn a long-term track record of reliability, one worthy of the critical production apps that many of you have entrusted to us.
We’d like to extend a big thank-you to the approximately 350 people who helped us beta test the new status site. If you’d like to help us test future beta releases of Heroku products, sign up for our private beta list.