Heroku Postgres Databases Patched

Data is one of the most valuable assets of any company. As a database-as-a-service provider, one of our biggest responsibilities is ensuring your data is kept safe. A few weeks ago, one of the worst security vulnerabilities to date in PostgreSQL was discovered. To address this issue, Heroku deployed a point release upgrade across the entire Heroku Postgres service earlier this week. This resulted in a period of database unavailability, typically lasting less than one minute. Every database running on Heroku Postgres is now appropriately patched and is unaffected by the vulnerability.

PostgreSQL Vulnerability Details

The PostgreSQL project has published official details on CVE-2013-1899.

Several weeks ago, Mitsumasa Kondo and Kyotaro Horiguchi responsibly disclosed a serious security vulnerability in PostgreSQL. The vulnerability allows unauthenticated remote users to use the ‘postmaster’ process to write data to any accessible file, including critical internal database files.

The vulnerability was fixed and committed to the PostgreSQL project’s private git repository, but only after updates to anonymously accessible copies were disabled. Updated versions of PostgreSQL were released today to most large packaging repositories, as were source code and installers.

Heroku Postgres Patching

The Heroku Postgres team worked with the PostgreSQL community to ensure we would be able to apply this patch rapidly. However, due to the nature of the issue, and to mitigate risk for others, we were not able to discuss specifics until now. Our goal — beyond ensuring your data was safe — was to monitor this upgrade as it was deployed, to provide early feedback to the community should bugs be found, and to avoid jeopardizing in any way the coordinated public disclosure process stewarded by the PostgreSQL community. Most importantly, the PostgreSQL source code that included the patch was held in the utmost secrecy. In addition, the deployment plan was reviewed by PostgreSQL community members in advance.

Once the source code was released to the PostgreSQL packagers—a group that includes a member of the Heroku Postgres staff—we began applying this patch to all Heroku Postgres databases, with the first updates starting on Monday. As of Wednesday at 6:30 PM PDT, all Heroku Postgres databases had been upgraded to their appropriate point release and were no longer vulnerable to CVE-2013-1899.

Conclusion

We realize that having no control over a maintenance window, however brief, is among the worst possible experiences. We are very sorry. Two reasons prevented us from working with you to schedule the security update. First, we prioritize keeping your data safe above all else, so making sure that every database was patched before this exploit was weaponized was paramount. Second, this was the first time we had dealt with a security update of this scale, and we had no machinery in place to schedule upgrades of this sort. Spending time building such machinery would have prevented us from patching every database in time. We will continue to work on improving our process around such maintenance to provide a better experience in the future.

As of late Wednesday all Heroku Postgres databases were upgraded and no longer at risk of CVE-2013-1899. No further action is required on your part to ensure your data remains safe.

Routing and Web Performance on Heroku: a FAQ

Hi. I'm Adam Wiggins, cofounder and CTO of Heroku.

Heroku has been my life’s work. Millions of apps depend on us, and I take that responsibility very personally.

Recently, Heroku has faced criticism from the hacker community about how our HTTP router works, and about web performance on the platform in general. I’ve read all the public discussions, and have spent a lot of time over the past month talking with our customers about this subject.

The concerns I've heard from you span past, present, and future.

The past: some customers have hit serious problems with poor web performance and insufficient visibility on their apps, and have been left very frustrated as a result. What happened here? The present: how do you know if your app is affected, and if so what should you do? And the future: what is Heroku doing about this? Is Heroku a good place to run and scale an app over the long term?

To answer these questions, we’ve written a FAQ, found below. It covers what happened, why the router works the way that it does, whether your app is affected by excessive queue time, and what the solution is.

As to the future, here’s what we’re doing. We’re ramping up hands-on migration assistance for all users running on our older stack, Bamboo, or running a non-concurrent backend on our new stack, Cedar. (See the FAQ for why this is the fix.) We’re adding new features such as 2X dynos to make it easier to run concurrent backends for large Rails apps. And we're making performance and visibility a bigger area of product attention, starting with some tools we've already released in the last month.

If you have a question not answered by this FAQ, post it as a comment here, on Hacker News, or on Twitter. I’ll attempt to answer all such questions posted in the next 24 hours.

To all our customers who experienced real pain from this: we're truly sorry. After reading this FAQ, I hope you feel we're taking every reasonable step to set things right, but if not, please let us know.

Adam


Overview

Q. Is Heroku’s router broken?

A. No. While hundreds of pages could be written on this topic, we’ll address some of this in Routing technology. Summary: the current version of the router was designed to provide the optimum combination of uptime, throughput, and support for modern concurrent backends. It works as designed.

Q. So what’s this whole thing about then?

A. Since early 2011, high-volume Rails apps that run on Heroku and use single-threaded web servers have sometimes experienced severe tail latencies and poor utilization of web backends (dynos). Lack of visibility into app performance, including incorrect queue time reporting prior to the New Relic update in February 2013, made these latencies very difficult to diagnose — for customers, and even for Heroku’s own support team.

Q. What types of apps are affected?

A. Rails apps running on Thin, with six or more dynos, and serving 1k reqs/min or more are the most likely to be affected. The impact becomes more pronounced as such apps use more dynos, serve more traffic, or have large request time variances.

Q. How can I tell if my app is affected?

A. Add the free version of New Relic (heroku addons:add newrelic) and install the latest version of the newrelic_rpm gem, then watch your queue time. Average queue times above 40ms are usually indicative of a problem.

Some apps with lower request volume may be affected if they have extremely high request time variances (e.g., HTTP requests lasting 10+ seconds) or make callbacks like this OAuth example.

Q. What’s the fix?

A. Switch to a concurrent web backend like Unicorn or Puma on JRuby, which allows the dyno to manage its own request queue and avoid blocking on long requests.

This requires that your app be on our most current stack, Cedar.
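For a typical Rails app, the switch means adding the unicorn gem and a small config file. The sketch below follows the general pattern Heroku's documentation describes for running Unicorn on Cedar; the worker count and timeout values are illustrative assumptions you should tune to your app's memory footprint:

```ruby
# config/unicorn.rb -- illustrative values; tune to your app and dyno size
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)  # concurrent workers per dyno
timeout 15                                             # abort requests stuck longer than this
preload_app true                                       # load the app once, then fork workers

before_fork do |server, worker|
  # Shared connections must be closed before forking...
  defined?(ActiveRecord::Base) && ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  # ...and reopened in each worker so they aren't shared across processes
  defined?(ActiveRecord::Base) && ActiveRecord::Base.establish_connection
end
```

Then point your Procfile at it: web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb.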

Q. Can you give me some help with this?

A. Certainly. We’ve already emailed all customers running Thin apps on more than six dynos, with self-migration instructions and a way to reach us for direct assistance.

If you haven’t received the email and want help making the switch, contact us for migrating to Cedar or migrating to Unicorn.

Routing technology

Q. Why does the router work the way that it does?

A. The Cedar router was built with two goals in mind: (1) to support the new world of concurrent web backends which have become the standard in Ruby and all other language communities; and (2) to handle the throughput and availability needs of high-traffic apps.

Read detailed documentation of Heroku’s HTTP routing.

Q. Even with concurrent web backends, wouldn’t a single global request queue still use web dynos more efficiently?

A. Probably, but it comes with trade-offs for availability and performance. The Heroku router favors availability, stateless horizontal scaling, and low latency through individual routing nodes. Per-app global request queues require a sacrifice on one or more of these fronts. See Kyle Kingsbury’s post on the CAP theorem implications for global request queueing.

After extensive research and experimentation, we have yet to find either a theoretical model or a practical implementation that beats the simplicity and robustness of random routing to web backends that can support multiple concurrent connections.
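To see why backend concurrency matters so much here, consider a toy, seeded simulation (the workload numbers are invented for illustration; this is not a model of the actual router) comparing random routing against an idealized global queue when every backend is single-threaded:

```ruby
# Toy comparison: random routing vs. an idealized global queue over
# single-threaded dynos. Workload numbers are invented for illustration.
def mean_queue_time(arrivals, service_times, dynos, strategy)
  free_at = Array.new(dynos, 0.0)   # time at which each dyno next goes idle
  rng = Random.new(42)              # seeded so runs are repeatable
  total_wait = 0.0
  arrivals.zip(service_times) do |arrival, service|
    i = strategy == :random ? rng.rand(dynos) : free_at.index(free_at.min)
    start = [arrival, free_at[i]].max   # queue behind the chosen dyno if busy
    total_wait += start - arrival
    free_at[i] = start + service
  end
  total_wait / arrivals.size
end

rng = Random.new(1)
arrivals = (0...5000).map { |i| i * 0.025 }                # ~40 req/s stream
service  = arrivals.map { rng.rand < 0.05 ? 2.0 : 0.05 }   # 5% slow requests

random_wait = mean_queue_time(arrivals, service, 10, :random)
global_wait = mean_queue_time(arrivals, service, 10, :global)
puts format("random: %.3fs  global: %.3fs", random_wait, global_wait)
```

With even a few slow requests in the mix, random routing queues fast requests behind slow ones while idle dynos sit unused — exactly the effect that a concurrent backend mitigates, since each dyno can absorb several requests at once.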

Q. So does that mean you aren’t working on improving HTTP performance?

A. Not at all. We're always looking for new ways to make HTTP requests on Heroku faster, more reliable, and more efficient. For example, we’ve been experimenting with backpressure routing for web dynos to signal to the router that they are overloaded.

You, our customers, have told us that it’s not routing algorithms you ultimately care about, but rather overall web performance. You want to serve HTTP requests as quickly as possible, for fast page loads or API calls for your users. And you want to be able to quickly and easily diagnose performance problems.

Performance and visibility are what matters, and that’s what we’ll work on. This will include ongoing improvements to dynos, the router, visibility tools, and our docs.

Retrospective

Q. Did the Bamboo router degrade?

A. Yes. Our older router was designed and built during the early years of Heroku to support the Aspen and, later, Bamboo stacks. These stacks did not support concurrent backends, so the router used a per-app global request queue. It originally worked as designed, but degraded slowly over the course of the next two years.

Q. Were the docs wrong?

A. Yes, for Bamboo. They were correct when written, but fell out of date starting in early 2011. Until February 2013, the documentation described the Bamboo router as sending only one connection at a time to any given web dyno.

Q. Why didn’t you update Bamboo docs in 2011?

A. At the time, our entire product and engineering team was focused on our new product, Cedar. Being so focused on the future meant that we slipped on stewardship of our existing product.

Q. Was the "How It Works" section of the Heroku website wrong?

A. Yes. Like the docs, the "How It Works" section of our website described the router as tracking which dynos were tied up by long HTTP requests. This was accurate when written, but gradually fell out of date starting in early 2011. Unlike the docs, we completely rewrote the homepage in June 2011, and it no longer referenced tracking of long requests.

Q. Was the queue time metric in New Relic wrong?

A. Yes, for the same 2011–2013 period described in the previous questions. The metric was transmitted to the New Relic instrumentation in the app via a set of HTTP headers set by the Heroku router. The root cause was the same as the Bamboo router degradation: the code didn't change, but scaling out the router nodes made the data increasingly inaccurate and eventually useless. With New Relic's help, we fixed this in February 2013 by calculating queue time using a different method.

Q. Why didn’t Heroku take action on this until Rap Genius went public?

A. We’re sorry that we didn’t take action on this based on the customer complaints via support tickets and other channels sooner. We didn’t understand the magnitude of the confusion and frustration caused by the out-of-date Bamboo docs, incorrect queue time information in New Relic, and the general lack of visibility into web performance on the platform. The huge response to the Rap Genius post showed us that this touched a nerve in our community.

The Future

Q. What are we doing to make things right from here forward?

A. We’ve been working with many of our customers to get their queue times down, get them accurate visibility into their app’s performance, and make sure their app is fast and running on the right number of dynos. So far, the results are good.

Q. What about everyone else?

A. If we haven’t been in touch yet, here’s what we’re doing for you:

  • Migration assistance: We’ll give you hands-on help migrating to a concurrent backend, either individually or in online workshops. This includes the move to Cedar if you’re still on Bamboo. If you’re running a multi-dyno app on a non-concurrent backend and haven’t received an email, drop us a line about Thin to Unicorn or Bamboo to Cedar.
  • 2X dynos: We’re fast-tracking the launch of 2X dynos, to provide double the memory and allow for double (or more) Unicorn concurrency for large Rails apps. This is already in private beta with several hundred customers, and will be available in public beta shortly.
  • New visibility tools: We’re putting more focus on bringing you new performance visibility features, such as the log2viz dashboard, CPU and memory use logging, and HTTP request IDs. We’ll be doing much more on this front to make sure that you can diagnose performance problems when they happen and know what to do about them.

Want something else not mentioned here? Let us know.

Helios - open source framework for mobile

Heroku has a strong tradition of open source. Our engineers have dedicated countless hours to the projects that developers count on every day. Open source software is in our DNA.

Speaking personally, I’m passionate about building tools like AFNetworking and cupertino to help developers build insanely great experiences for mobile devices. It’s with great pleasure that I introduce something new I’ve been working on:

Helios is an open-source framework that provides essential backend services for iOS apps, including data synchronization, push notifications, in-app purchases, and Passbook integration. It allows developers to get a client-server app up and running while seamlessly incorporating functionality as necessary.

Helios is designed for "mobile first" development: build out great features on the device, and implement the server-side components as necessary. Pour all of your energy into crafting a great user experience, rather than getting mired in the backend.

Built on the Rack webserver interface, Helios can be easily added into any existing Rails or Sinatra application. Or, if you're starting with a Helios application, you can build a new Rails or Sinatra application on top of it. This means that you can develop your application using the tools and frameworks you love, and maintain flexibility with your architecture as your needs evolve.
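Since Helios is a Rack application, running it standalone comes down to a config.ru. This is only a sketch based on the gem's naming — treat the constant and any service configuration as assumptions, and check the Helios README for the real interface:

```ruby
# config.ru -- minimal standalone Helios app (sketch; verify the class
# name and any service configuration against the Helios README)
require 'helios'

run Helios::Application
```

Because it speaks Rack, the same app can also be mounted under a path inside an existing Rails or Sinatra app.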

Give it a try and let me know what you think!

Waza 2013: How Ecosystems Build Mastery

When we think of the concept of Waza (技) or "art and technique," it's easy to get caught up in the idea of individual mastery. It's true that works of art are often created by those with great skill, but acquiring that skill is neither solitary nor static. Generations of masters contribute to a canon and it is in that spirit that we built the Heroku platform and the Waza event. This year's Waza was no exception.

On February 28th, more than 900 attendees participated in Waza, including Ruby creator Yukihiro "Matz" Matsumoto, Django co-creator Jacob Kaplan-Moss, and Codecademy’s Linda Liukas. True to form, we offered you a platform for experimentation and you surprised us with your contributions.

From your origami creations, to your Arduino hacks, to the technical conversations over craft beer -- you taught us that the definition of software development is ever-evolving. Thank you for allowing us to help you change lives and push boundaries. We will continue our commitment to growing the platform for you and look forward to collaborating with you in the future.

For more event highlights visit the Waza videos and photos. To learn more about Heroku, add yourself to our mailing list.

log2viz: Logs as Data for Performance Visibility

If you’re building a customer-facing web app or mobile backend, performance is a critical part of user experience. Fast is a feature, and affects everything from conversion rates to your site’s search ranking.

The first step in performance tuning is getting visibility into the app’s web performance in production. For this, we turn to the app’s logs.

Logs as data

There are many ways to collect metrics, the most common being direct instrumentation into the app. New Relic, Librato, and Hosted Graphite are cloud services that use this approach, and there are numerous roll-your-own options like StatsD and Metrics.

Another approach is to send metrics to the logs. Beginning with the idea that logs are event streams, we can use logs for a holistic view of the app: your code, and the infrastructure that surrounds it (such as the Heroku router). Mark McGranaghan’s Logs as Data and Ryan Daigle’s 5 Steps to Better Application Logging offer an overview of the logs-as-data approach.

Put simply, logs as data means writing semi-structured data to your app's logs via STDOUT. The logs can then be consumed by one or more services to drive dashboards, long-term trending, and threshold alerting.

The benefits of logs-as-data over direct instrumentation include:

  • No additional library dependencies for your app
  • No CPU cost to your dyno by in-app instrumentation
  • Introspection capability by reading the logs directly
  • Metrics backends can be swapped out without changes to app code
  • Possible to split the log stream and send it to multiple backends, for different views and alerting on the same data
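As a minimal sketch of what this looks like in practice (the helper below is our own, not part of any Heroku library), a metric line is just formatted key=value output on STDOUT:

```ruby
# Minimal logs-as-data emitter: write key=value metric lines to STDOUT,
# mirroring the measure=/val=/units= convention seen in Heroku's log lines.
# This helper is a sketch, not an official API.
def log_measure(name, value, units: nil)
  line = "measure=#{name} val=#{value}"
  line += " units=#{units}" if units
  $stdout.puts(line)
  line
end

log_measure("db.query_time", 12.3, units: "ms")
# writes "measure=db.query_time val=12.3 units=ms"
```

Any log consumer downstream — an add-on, a drain, or a one-off script — can then grep and aggregate these lines without the app taking on a metrics library dependency.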

Introducing log2viz, a public experiment

log2viz is an open-source demonstration of the logs-as-data concept for Heroku apps. Log in and select one of your apps to see a live-updating dashboard of its web activity.

For example, here’s a screenshot of log2viz running against the Rubygems Bundler API (written and maintained by Terence Lee, André Arko, and Larry Marburger, and running on Heroku):

log2viz gets all of its data from the Heroku log stream — the same data you see when running heroku logs --tail at the command line. It requires no changes to your app code and works for apps written in any language and web framework, demonstrating some of the benefits of logs as data.

Also introducing: log-runtime-metrics

To get memory use stats for your dynos, we’ve added a new experimental Heroku Labs feature that logs CPU and memory use by the dyno: log-runtime-metrics.

To enable this for your app (and see memory stats in log2viz), type the following:

$ heroku labs:enable log-runtime-metrics -a myapp
$ heroku restart

This inserts data into your logs like this:

heroku[web.1]: measure=load_avg_5m val=0.0
heroku[web.1]: measure=memory_total val=209.64 units=MB

log2viz reads these stats and displays average and max memory use across your dynos. (Like all Labs features, this is experimental and the format may change in the future.)
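On the consuming side, a dashboard only has to pick those tokens apart. A minimal parser for lines in the format above might look like this (our own sketch, not log2viz's actual code):

```ruby
# Parse "measure=... val=... units=..." tokens from a log line into a hash.
# Tokens without an "=" (like the "heroku[web.1]:" prefix) are skipped.
def parse_metric(line)
  line.split.each_with_object({}) do |token, fields|
    key, _, value = token.partition("=")
    fields[key] = value unless value.empty?
  end
end

m = parse_metric("heroku[web.1]: measure=memory_total val=209.64 units=MB")
# m["measure"] => "memory_total", m["val"] => "209.64", m["units"] => "MB"
```

Values stay as strings here; a real consumer would coerce val to a number and aggregate per dyno.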

Looking under the hood

log2viz is open source, so you can dig into the code and see exactly how it works.

You can deploy your own copy of log2viz on Heroku, so fork away! For example, Heroku customer Timehop has experimented with trending graphs via Rickshaw.

Logs-as-data add-ons

log2viz isn't the only way to take advantage of your log stream for visibility on Heroku today. Here are a few add-ons which consume your app's logs.

Loggly offers a web console that lets you search your log history and graph event types over time. For example, you can search for status=404, which the Heroku router logs whenever your app serves a page not found.

Papertrail offers log search and archiving, and can also alert when events pass a certain threshold. For example, to get an email every time your app experiences more than 10 H12 errors in a 60-second period, search for the router’s H12 log line, click “Save Search,” and attach an alert to the saved search.

Other add-ons that consume logs include Treasure Data and Logentries.

You can also use non-add-on cloud services, as shown in thoughtbot's writeup on using Splunk Storm with Heroku.

Conclusion

Visibility is a vast and challenging problem space. The logs-as-data approach is still young, and log2viz is just an experiment to get us started. We look forward to your feedback on log2viz, log visibility via add-ons, and your own experiments on performance visibility.
