At Branch, we’ve been through several feature launches on Branch.com and, more recently, several more on our new site, Potluck. Although it gets easier each time, building high-quality, high-traffic web applications still isn’t easy. Here are a few things we’ve learned about hosting our apps on Heroku that have helped keep our latency down and our confidence up.
One thing that has been consistently helpful is not hosting services ourselves. Heroku provides a pretty extensive Add-on Marketplace you can use to get most services you’d need up and running in a matter of minutes.
At Branch, we use Redis pretty extensively both for caching and for all the feeds across our sites. We started off by running our own Redis server on EC2, but after a few months of constantly worrying about what we’d do for failover and how best to persist our Redis data, we decided to switch to OpenRedis. Since then, we’ve been able to focus on building great social apps instead of getting into the business of hosting databases. It gives us one less thing to worry about as we’re trying to iterate on our product. (There is a caveat to this: it’s sometimes difficult to find well-priced, easy-to-use, and reliable third-party providers for certain technologies. When looking for a new service, we usually scour the Heroku Add-ons page and try out each vendor on a free plan before deciding which to go with.)
For us, the biggest advantage of using SaaS providers for various services has been having access to customer support. When we hosted our own Redis or Postgres boxes, we had nobody we could call up and nag about how best to use the software or how to fix obscure bugs. When you’re paying for a service, you also get consultants (within reason) for free. This has been enormously helpful when facing weird bugs or just wanting advice on how best to optimize our setup.
Along with Redis and Elasticsearch, our main persistence mechanism at Branch is Postgres. We love Postgres, and Heroku’s Postgres service has been phenomenal at both keeping our boxes up and running well, and at providing features and advice to allow us to do our jobs as well as possible.
When building a webapp with any complexity, you quickly realize that architecting the DB correctly becomes important in the fight against slow response times. A few tools have been really helpful to us when lowering our Postgres usage and query times.
Using the slow query log: You can access your Postgres logs from the command line with
heroku logs --tail --ps postgres --app APP_NAME
I still haven’t gotten it to give me historical data, but it shows slow queries that are happening right now. Leave the window open for a few minutes on any Postgres instance with moderate to high traffic and you’ll start to see slow queries show up. This is great for figuring out what’s unindexed or just plain gnarly.
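Because the tail only shows what is happening right now, it can help to run the output through a small filter that keeps only the slowest statements. The following is a minimal Python sketch, assuming the log lines contain Postgres’s usual `duration: <ms> ms  statement: <sql>` text. The function name, the 1000 ms threshold, and the sample line in the comment are made up for illustration.

```python
import re

# Matches the text Postgres writes when it logs a slow statement, e.g.
# "LOG:  duration: 2043.5 ms  statement: SELECT * FROM posts" (hypothetical line).
DURATION_RE = re.compile(
    r"duration: (\d+(?:\.\d+)?) ms\s+(?:statement|execute [^:]*): (.*)"
)


def parse_slow_query(line, threshold_ms=1000.0):
    """Return (duration_ms, sql) if the line logs a query at or above
    threshold_ms, otherwise None."""
    match = DURATION_RE.search(line)
    if match is None:
        return None  # not a duration line at all
    duration_ms = float(match.group(1))
    if duration_ms < threshold_ms:
        return None  # logged, but faster than we care about
    return duration_ms, match.group(2).strip()
```

To use it on the live tail, you could pipe the `heroku logs` command above into a script that loops over `sys.stdin` and prints every line for which `parse_slow_query` returns a result.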
Dataclips: Dataclips are SQL queries you can save and pull up anytime via the Heroku Postgres web GUI. These are great for diagnostic queries, such as queries against Postgres’s built-in pg_stat_user_tables view. One we use pretty often is
select relname as "table",
seq_scan as "non-index lookups",
seq_tup_read as "tuples scanned",
idx_scan as "index lookups",
idx_tup_fetch as "tuples scanned via index"
from pg_stat_user_tables;
This shows how many index scans vs non-index scans Postgres is doing on each table and how bad those table scans are (by showing how much data Postgres has to scan through to fulfill those non-indexed queries). This is great for diagnosing which tables are getting hit most often with complex queries or don’t have indexes.
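One way to read the output of that query is to flag tables where most lookups skip the index, then sort them by how many tuples those scans read. Here is a rough Python sketch of that triage. The `TableStats` type, the 0.5 cutoff, and the sample numbers are all hypothetical and only show the shape of the data.

```python
from typing import NamedTuple, Optional


class TableStats(NamedTuple):
    table: str
    seq_scan: int              # "non-index lookups"
    seq_tup_read: int          # "tuples scanned"
    idx_scan: Optional[int]    # "index lookups"; NULL when a table has no indexes


def seq_scan_share(row):
    """Fraction of lookups on this table that did not use an index."""
    idx_scan = row.idx_scan or 0
    total = row.seq_scan + idx_scan
    return row.seq_scan / total if total else 0.0


def worst_offenders(rows, min_share=0.5):
    """Tables where at least min_share of lookups are sequential,
    ordered by tuples scanned (largest first)."""
    flagged = [r for r in rows if seq_scan_share(r) >= min_share]
    return sorted(flagged, key=lambda r: r.seq_tup_read, reverse=True)


# Made-up numbers, not real measurements.
sample = [
    TableStats("users", seq_scan=10, seq_tup_read=5_000, idx_scan=990),
    TableStats("posts", seq_scan=800, seq_tup_read=4_000_000, idx_scan=200),
    TableStats("invites", seq_scan=50, seq_tup_read=20_000, idx_scan=None),
]
```

With this sample, `worst_offenders(sample)` flags `posts` and `invites`, in that order, as the first candidates for a new index.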
pg-extras: Heroku also has a CLI plugin, pg-extras, for getting diagnostic data out of your Postgres instance. This is great for figuring out your cache and index hit rates (both of which should ideally be above 0.99), index sizes, and other info for tuning your Postgres instance.
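For context on the 0.99 target: a hit rate like this is usually the share of block requests served from Postgres’s buffer cache instead of disk, that is, hits / (hits + reads). Postgres exposes the underlying counters in views such as pg_statio_user_tables (`heap_blks_hit`, `heap_blks_read`). The Python sketch below shows that arithmetic. It assumes pg-extras reports the same ratio, and the function names are made up.

```python
def hit_rate(blocks_hit, blocks_read):
    """Share of block requests served from cache; None if there was no traffic."""
    total = blocks_hit + blocks_read
    return blocks_hit / total if total else None


def below_target(rates, target=0.99):
    """Keep only the named rates that fall under the target."""
    return {
        name: rate
        for name, rate in rates.items()
        if rate is not None and rate < target
    }
```

For example, 950 cache hits against 50 disk reads gives a rate of 0.95, which `below_target` would report as worth investigating.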
After so many launches, we’ve also learned a lot about how to stay calm and get through a launch successfully. You can find some of our more general learnings here, but a lot of what we’ve learned about using Heroku has been around simplicity and not over-optimizing.
During our first Branch.com launch, we obsessed over having the right number of dynos up for any situation. We implemented an auto-scaling dyno algorithm to make sure we’d never be caught with our pants down. This turned out to be a horrible idea. When some of our workers went haywire and started sending thousands of emails to a few users because of an obscure Rails bug, we couldn’t scale them down to 0 because the algorithm would just scale them back up. We had to have an engineer sit and type in
heroku ps:scale worker=0 every 3 seconds (sorry, Heroku!) until someone else debugged and fixed the problem.
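To make the failure mode concrete, here is a toy Python sketch. It is not the algorithm described above, and every name and number in it is hypothetical. It shows a queue-driven scaling rule that keeps asking for workers while jobs pile up, next to a variant where a manual override always wins.

```python
def autoscaler_target(queue_depth, jobs_per_dyno=100, max_dynos=20):
    """Toy rule: one worker dyno per jobs_per_dyno queued jobs, rounded up."""
    needed = -(-queue_depth // jobs_per_dyno)  # ceiling division
    return min(needed, max_dynos)


def next_dyno_count(queue_depth, manual_override=None):
    """Same rule, except an operator-set override takes precedence."""
    if manual_override is not None:
        return manual_override
    return autoscaler_target(queue_depth)
```

With 5,000 runaway jobs queued, `autoscaler_target(5000)` asks for 20 dynos on every tick, so a manual scale to 0 is undone almost immediately. `next_dyno_count(5000, manual_override=0)` stays at 0 until someone clears the override.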
Now we just scale the dynos up above what we think we’ll need and leave them until we see some latency. It’s manual and not very scientific, and it probably costs us a bit extra, but it’s way simpler than our other solutions, and trying to optimize this just wasn’t worth the complexity.
Likewise, we used to fixate on analytics. We use New Relic for measuring app latency, and it’s been fantastic when we need to troubleshoot issues or figure out where a user’s time is going. After each of our launches, we generally saw latency spikes as portions of the app were tested with actual load.
One of the worst things we did on our first launch, though, was to keep the latency graph up for everyone to see all day. It caused a lot of stress and, as random latency spikes showed up now and then, a lot of our time and energy went into watching and worrying about the graphs instead of fixing bugs and pushing out new features. Intermittent latency spikes are inevitable in new applications, and the time spent chasing them is often better spent on other, more pressing issues.
Heroku has been fantastic for building and launching apps quickly and relatively painlessly. Learning about and using the features Heroku and Heroku Postgres provide has been invaluable for our productivity, keeping our turnaround time low and keeping us shipping fast.