This blog post is adapted from a talk given by Amy Unger at RailsConf 2018 titled "Knobs, buttons & switches: Operating your application at scale."
We've all seen applications that keel over when a single, upstream service goes down. Despite our best intentions, sometimes an unexpected outage has us scrambling to make repairs. In this blog post, we'll take a look at some tools you can integrate into your application before disaster strikes. We'll talk about seven strategies that can help you shed load, fail gracefully, and protect struggling services. We'll also talk about the technical implementations of these techniques—particularly in Ruby, though the lessons are applicable to any language!
All right. Well, thank you everyone. I'm glad you're here. I hope you enjoyed the cheesecake, and I hope it doesn't put you to sleep. I'm Amy. I'm a backend engineer at Heroku, and I'm talking about how you can add knobs, buttons, and switches to your application to make it alter its behavior when things go wrong.
We've all seen applications that can keel right over when a single unimportant service is down. So, let's not have that be you.
All right. Pilots operate their airplanes from the flight deck and I have fond memories of Captain Kirk yelling every week to divert power to the shields. This talk is about what kinds of levers you should have for operating your application when the going gets tough. I want you to feel like when you're on call you have that level of control over your application.
So, this means that this talk is about application resilience. But it's only one part of the topic. This is what I call the just right talk for the, well, not just right fires. But it is not about the baby fires. These are your casual, everyday failures. No one action that you take on behalf of a customer has a 100% chance of succeeding. Maybe they provided bad data. Maybe there's some conflicting state, whether that's between you and another service or between two other dependent services. Maybe that customer has found a particular race condition or you've hit a network blip. Whatever the reason, that request and many others like it may not succeed, but those are not what I'm talking about today. So, this talk assumes that you have functionality for retrying requests and unwinding multi-step actions when you hit a snag six steps in. I've talked about those strategies at a previous RailsConf, and I wanted to highlight them because they will probably give you more bang for your buck depending on where you're at. I'm also not talking about disaster recovery scenarios, those business-ending, terrible, horrific, catastrophic events. Something like, I'm sorry, your database is gone. All backups have been lost, and aliens have abducted all of US East. Good luck.
So, while this is the just right talk, it may be more useful for you at this moment to work on failures that are happening quietly right now, or to plan for the ones that you hope will never happen, but might end your business. And while this is the just right talk, the entire talk may not be just right for you. I've been lucky enough to work at companies that cared deeply about providing a great, reliable, and resilient customer experience. But how we provided those services to customers reflects what we value. When you have to make a difficult choice about what you do under bad circumstances, that choice is very particular to the size of your application, your customer base, and your product. So you may end up asking your product people, even your business owners: what do you want me to do in this situation? So, what am I talking about? I'm talking about strategies that can help you shed load, fail gracefully, and protect struggling services, and we'll talk about these seven tools that will help you do that. I'll go into some implementation details for each, and then I'll give you some buyer-beware warnings at the end. So, let's jump right in.
The first tool I wanna talk about is maintenance mode. Going into maintenance is your hard nope, your fail whale. It should have a clear, consistent message with a link to your status page, and most importantly it should be really easy to switch on. At Heroku, we have this implemented as an environment variable. The key thing here is that it's one button you can press and not a series of levers and dials. You should not have to follow a very long playbook in order to get this working for you.
The next one I wanna talk about is read-only mode. So, most pieces of software exist to effect some sort of change in another system. I'm guessing for most of us, since it is RailsConf, that the work our application performs is to alter a relational database. But it could be any series of things. Think about what your application does for users, whether that's storing data in the database, transforming files and uploading them into a file store, or, for us at Heroku, launching containers on EC2 instances. Once you have an idea of what your application is modifying, think about what you can do if you can't modify that. What questions can you answer? Some of you may be operating a very narrowly scoped service, and the answer may be nothing, and that's fine. This is not the tool for you. But if you're at the classic Rails blog size, maybe larger, this could be very useful. Most people probably just want to read your blog. They don't want to alter it. They're not publishing. And then for my job currently, the primary application I work on has a variety of disparate services, so we need finer-grained tools. So this is not quite the tool for us, but it's a good first step. Again, the way we would probably implement this is through an environment variable, and that's mostly just because it would have similarity to the maintenance mode. But, you know, consider what tool you want to use and use it consistently.
Next, feature flags. So, feature flags can be used for more than just new features. They can allow you to provide a controlled experience when part of your app isn't working. So, imagine: what if billing, or selling new things, was a feature flag for you? There are different levels of feature flags that we find useful. First is the user level, the individual feature flag. This probably isn't very helpful for you during an incident. Hopefully your incidents aren't called just for one user. There's the global level, application-wide. So, as I mentioned, what if billing globally was a feature? For us that might also be freezing modifications to all the containers we're running for customers. But what we really find useful is the group level. So, at Heroku we run users' applications on our platform. So for us, the most relevant groups are usually groups of applications running in a particular region. They might also be groups of applications written in Go or Ruby. You'll want to think about what groupings are meaningful for your business, because it really ends up being a combination of what you want to control and who your users really are. So the way we implement this is we have a class that can answer these questions about current application state. This could be a normal Active Record model talking to a relational database. For us, that is currently what this particular class does: it talks to our database. But that's not necessarily the right choice for you. This model could be backed by talking to Redis. It could be talking to an in-memory cache. Of course, an in-memory cache would mean that each different web process would have different application state, which might be more complicated than you want. One of the most interesting options I could think of was curling a file named billing-enabled in a particular S3 bucket, if that is what you need to do in order to make sure that this check doesn't fail when the thing you're trying to handle the failure of also fails. Sorry. Too many failures in that sentence. But you would want to choose something that is going to be able to answer the question of am I down, if the thing you're trying to control is also down. For groups: so the previous one we looked at was billing-enabled, and this one would be looking at the setting for billing for our EU customers. So, for groups I really recommend having one switch for an entire group. It may seem silly to have these strings, just tons of them, and this may not be your experience, but at two AM we find that strings really are easier to copy and paste, rather than trying to instantiate an application setting model for billing, and then say that it's for the US group, and then, you know, toggle the enabled flag. Just a string works a little bit better for us, and really gives us more confidence that when we ask what the current state of the application is, we know exactly what we're getting.
Next I wanna talk about rate limits. So, rate limits protect you from disrespectful and malicious traffic. But they can also help you shed load. So if you need to drop half of your traffic to stay up, you should drop half your traffic. Your customers, that respectful traffic, they may have to try two or three times to get a particular request through. But if they keep trying, they'll be able to do what they need to do. We see this strategy from AWS all the time. When we understand that we are in that state, that they are rejecting a fair number of requests because they are under some sort of load, we start behaving in a way that is helpful to them and to us. We stop sending excess traffic, and we start repeating our single most important request to them, and then our second-most important request. Eventually it gets through. Eventually that most important request will be accepted by them, and during that period we won't have been sending them tons and tons of traffic. Rate limits can also help you protect access to your application from other parts of the business that rely on you. Oftentimes the single application that a user sees is actually a mesh of assorted different services all acting together to create a single user experience, and while you absolutely can make that internal system function when other parts of it are down, it can be easier to just really try to protect that preferred traffic, in addition to the strategies here which will help you stay up even if parts do go down. So, if you can prefer your internal traffic, it can help continue to present that unified front to customers and keep you looking up for longer. So, we implement rate limits as a combination of two different kinds of levers: a single default, and many modifiers for user accounts. We find that this gives us the flexibility to provide certain users the rate limits that they need, while at the same time retaining a single control for how much traffic we are able to handle at any one point. So, where we start is again an application setting. This is a global default of a rate limit. Here we're saying it's 100 requests per minute. Hopefully we can handle more than that, but let's just say that for easy math. And we have our customer here. Our customer starts with a modifier of one, and what this means is that to determine the customer's rate limit we will multiply the default, 100, by their modifier of one, which results in a rate limit, as you might expect, of 100 requests per minute. Now, let's say this customer writes in and says, hey, you know, I really have these legitimate reasons that I need twice as much traffic. We say, great, we'll bump you up to a modifier of two, which means at the end of the day they get a rate limit of 200 requests per minute. Some time later, we end up under a lot of load, we're not able to keep up, and we make the tough choice to say, hey, we need to cut our traffic. And so we cut the default rate by half. So this used to be 100; we're now at 50. But what this means is that all of our customers, all of our accounts actually, including the preferred internal accounts, can in one setting be cut in half. So, our customer here is back at 50% of his or her rate limit, 100 requests per minute. But that's still well above the default of 50. So it allows us to rapidly cut traffic coming in, without having to run a script over every single user to adjust their rate limit.
I should mention that depending on your application you may want to consider doing cost-based rate limiting. That may be a far better choice than request-based rate limiting. In cost-based rate limiting, you're going to charge a user a number of tokens depending on the length of their request, so that they can't call your really slow endpoints as frequently as your blazing fast endpoints. This is helpful if you're doing request-based rate limiting and then you drop your users to, again, maybe 50% of normal traffic, but they're still hitting that one horribly underperforming reporting endpoint because it's the end of the month and everybody needs their stuff. You could still be under excess load, and you might want to consider cost-based limits if you have a lot of reporting endpoints that really tax your application at particular times. Finally, it may seem counter-intuitive, but the more complex the algorithm for rate limiting, the worse off you will be during denial of service attacks. The more computation time it takes for you to say that you can't process a request, the worse off you are when you're dealing with a flood of requests. This is no reason to not implement some complex rate limiting if you need it, but it is a reason to make sure you have other layers in place to handle distributed denial of service attacks, and honestly, even the denial of service attacks that happen by mistake when someone just messes up, deploys a bug, and you're getting hit over and over again.
All right, next, stopping non-critical work. So, let's say you're hitting limits on your database, maxing out your compute, hitting the limits of some other dependent service. You should be able to stop any reports, any cleaners, any jobs that are making this worse and that don't have to happen in the next hour, or maybe don't have to happen in the next four hours. You should be able to just turn them off. So, how do we do this? Like the application setting, we have a report setting as a model here. Similarly, it takes a string, and what we do is we make sure that every report and every job checks to make sure that it is enabled before it runs. So, let's look at a quick code example. Let's say we have a monthly user report, and that responds to a run method, and it's gonna do something. Who knows what it does, but it has a decent chance of being very intensive. All right, so, before we do any work we're gonna check to make sure that we're enabled. For our monthly user report, we're going to implement a method called enabled. We're going to check report setting and see if this particular report is enabled at this time. But let's make this a little more general. So, let's make our monthly user report inherit from report, and then let's say monthly user report is really just responsible for building itself. It doesn't know much else. It's not really gonna be responsible for knowing where it's gonna run, or whether it should run. It just can build its report, which means the parent class report will then get a bunch of additional features: it knows how to respond to run, and it can figure out if its child class is enabled. So, this is really useful for reports and jobs. Having this be standard means that any time someone creates a new job, it is by default able to be enabled or disabled with one change to, in our case, the database, or that Redis, that S3 bucket, whatever you wanna do.
So next, known unknowns. So, I am confident that all of you have never shipped non-performant code, ever. But I definitely have. The SQL that you're shipping when you don't know how it will perform for your biggest customers, you might wanna have it under control if it does go haywire. We have plenty of new features that go out that we think are fine. We've done as much testing as we think is reasonable, but there's still, you know, the hair on the back of your neck. So, if you're scared of it, put a flag around it. If it's a new feature we'll put it in a feature flag. It's pretty straightforward and we covered that a little bit already, but if it's a refactor we usually have anything scary go out within GitHub Scientist. So using Scientist allows us to gradually roll out changes--refactors--but it also allows us to enable or disable the experimental code immediately if we see any problems. And the great thing is because it's so fast to disable, we can do it even before we're 100% confident that this is what's causing issues. The beautiful thing about having so many things configurable is that if you have a little bit of doubt you can just turn it off, and we find that eliminating those rabbit holes, things that might take one person an hour during an incident to look into just to prove that that's not it, is really helpful. We all have biases: you know, I know that person's code, that's not gonna be it; or, that has to be it; or maybe it's just that a change went out right at the time that you saw the issue. Being able to turn suspect things off is a really great tool for moving you closer to the real problem faster.
All right, finally, I wanna talk about circuit breakers. So, circuit breakers allow you to be nice to the services that you depend on. They allow you to be a good neighbor. They allow you to not break them, and they allow you to not swamp those services as they're just recovering. So, circuit breakers are typically responsive shutoffs. Responsive shutoffs look at all of the calls you're making to a particular service, and they can be configured to look at particular metrics. So whether that is the number of timeouts over the course of five minutes, or maybe it's a 50% error rate over 10 minutes, whatever you've configured them to look for, responsive shutoffs can automatically kick in and back off any calls to those services. That gives the services you're depending on time to recover, but it also frees up your web processes to not spend the time calling down to a service that is most likely failing. Responsive shutoffs work far faster than any monitoring service can page your on-call person, get them awake, get them on their computer, have them look up the right playbook, and then take action. So, the hope is that by the time you page in the on-call person, the responsive shutoff has already kicked in and you're in a better failure mode. But you can also use circuit breakers as a hard or manual shutoff. This would help you specifically keep traffic away from a struggling service. In some cases, you might want to allow high latency. Let's say, you know, maybe you have a 29-second request to a service every once in a while. You don't want that kind of request to trip the circuit breaker. But that does mean that if that service is in a high-latency state where it's taking 29 seconds to respond to every single one of your requests, you're probably grinding to a halt, since your web workers are going to be tied up trying to resolve those downstream calls on behalf of your customers and not servicing the massive backlog of requests that you have coming in. So, in that case, while you wouldn't want a circuit breaker to automatically trip, you may want to manually turn it off. The other use for these manual shutoffs is a misbehaving service. This is usually for internal services, where engineers can be a little more honest with each other. If you have an internal service that's responding with a 204 and you know it's just dropping requests on the floor, a 503 error can actually be better for your customers than allowing those two services to drift out of state, or telling your customers that something's gonna happen and it never does. So, these would work in a similar way to how our monthly billing report worked. In the same way that the monthly billing report inherited from a report class, our billing service client would inherit from a client class that would set up circuit breakers by default for any of its children and would keep track of those individual circuits, again, backed by anything you want, whether that is, you know, in-memory state, a shared cache, a data store. There are a number of good circuit breaker gems out there that you can just include that will have support for this, and so I won't get into implementation too much; please go read their READMEs. They're lovely. With all these approaches, I would highly recommend writing tools to manage these circuit breakers that do not assume a developer typing into a production console as I have here. A case in point.
How many of your on-call engineers remember enough about electrical engineering at two AM to confidently say whether open means sending requests or not? If you watch me giving this talk at RubyConf Australia these will be flipped because I did not remember. So in this case a circuit being open would mean that communication to the service is closed. So, I would highly recommend writing a tool that allows your on-call engineers to see whether a service is off and turn it off or on, or some other vocabulary that is universal and hard to misconstrue, hopefully not at two AM, but if needed, at two AM.
All right, so I wanna talk a little bit about implementation. With all these buttons and switches you really want to consider carefully how you form them and where you store their state. You have a number of options, some of which I have listed here. You can store them in a relational database, a data caching layer, in environment variables. You can even have them in your code as a last resort: if you think that, you know, a way to control for failure is a deploy, then absolutely, have a place in your code that has a comment that says, hey, change this line, and then push it out to production as quickly as you can. For us, we're a deployment platform running on a deployment platform, so usually that option is not available to us. But it gets at this point: consider whether flipping a switch would require access to a component that could be down in a way that you would want to use the switch to control. Does it require access to a running production server, and what happens if you can't communicate with that running production server? How might you change the behavior of your application if you can't deploy changes? If you have an immutable infrastructure, that might mean environment variables are totally out of the question for handling certain failure cases. One of the reasons why we rely so heavily on databases for storing our application state is because we have high confidence that our wonderful Postgres team, thank you, Gabe, will be able to get us access to the database in order to manually run SQL statements to flip certain bits. In many cases, instead of the lovely Ruby code, our failure states would end up being us running SQL in order to flip a switch and allow our still-running, but poorly behaving, application to discover those changes. So, a final note is to really consider how much work it would take to figure out if a switch is flipped or not, because in general, the fancier and more sophisticated your switch is, the more likely it is to become part of the problem, or to confuse your engineers such that it is eventually the entire problem.
And with that, I promised you some buyer bewares, some caveats. So, here they are. First is about visibility. So, you remember this picture? Yeah, so we've built a lot of knobs and switches in this talk, but you haven't actually seen the dashboard. That's because you'll need to build one. Whether it has a lot of pretty graphics or just command line output, having something that can pull from the different places where you're storing the state for your buttons and switches and combine it into one comprehensible place is really important for incident operations. Clearly understandable is not a bar that we meet at the moment, but we can discover the state of every single switch, even if it's just way too much output. But we're working on that. Next, does it actually work? How many of you have tested your smoke detectors in the last month? Excellent, all right, congrats, Gabe. You may end up being surprised at your dependencies, and more interestingly, you may actually be surprised at the dependencies that your dependencies have, especially if you're working with vendors. They might be running on the same infrastructure that you are. So if it's a critical switch, perform game days. You really don't have the confidence to know that it will work until you've really tried it. Of course, with that mention of vendors: you can try to turn off certain things, but if you don't have complete confidence about what other people's work is built on, it can be really hard to tease out what exactly you need to turn off in order to simulate a complete outage of a particular component. So, this leads me to my next and final point. You're really trading knowledge for control here. The more configurable you make your application at runtime, the less confident you can be that it will work in predictable ways. Have you tested for this user when flagged into three things, flagged out of two, and with a service shut off? I'm guessing not, and if you did test that, I'm questioning the size of your test suite. And more than just unit tests, keeping production, staging, and development environments in the same state has been a problem for many of the teams that I've worked on, and I don't know of a good solution. And yet, while you are trading knowledge for control, I'd still take this deal any day. I'd rather have control over my app to mitigate issues than to know confidently the exact and particular way that app is down and have no way to do anything about it.
So that, thank you. I hope that this has given you some ideas about ways you can make your application a little bit more resilient to the fires you inevitably will see. I do work at Heroku. We have two lovely other speakers. Stella starts tomorrow morning right after the keynote, followed up by Gabe. So if you wanna learn about using Kafka in Rails or how Reddit, sorry, Postgres 10 is gonna make your life awesome, please check us out there, and obviously come by our booth which opens tomorrow as well.
So, thank you, and I am happy to take questions for seven-ish minutes, or have you all disperse.
Thank you. Yes, yup, yup, yup. So, the question is: when do we start thinking about adding in a new knob or switch? Usually after an incident. Some of them are longer-term, more thoughtful things. But, yeah. At this point, most of the new ones are because something went wrong and we didn't have the ability to control it, so we don't wanna have that happen again.
Yup, how do we train new developers? I think that ties into how we on-board people into on-call. So, we rely on a couple things. First of all, shadowing. We really do try to get people comfortable with the idea of being on-call by having them shadow on-call engineers during the day. So, they're not getting any pages, but they're in there. They're, you know, seeing the person's screen. We do have documentation, but honestly, if you're in a really tired state, the likelihood that you are going to think, oh, let me read through these 50 pages of documentation, is next to nothing. So we really want to get people to the point that they know what they're searching for through our docs, and hopefully with a playbook it's a little bit more directed; you know which ones you're looking for. But again, if you're in that kind of information discovery phase during the incident process, we're probably four or five hours in. So yeah, lots and lots of shadowing, encouragement to read the docs, and then a strong reliance on telling people they really should just page someone. As the secondary for a new person going on primary, I am more than happy to be woken up. It just needs to happen every once in a while, and I want them to feel supported. So, making sure that they are totally okay paging regardless of the hour, and that I am relatively chipper when I am in fact paged in, is really important to us. So, yeah. No magical system to it, but just making sure people feel confident and aware of things.
Where do we store state? Okay, so, as I mentioned, we primarily store state in Postgres and also in Redis, because, again, we have confidence in our data team, and our infrastructure is separate enough that if something catastrophic has happened to us, most likely they're gonna be able to get us in. All right, well, I see people queuing at the back for the next talk, so thank you very much everyone. Yeah.
1. Maintenance mode
Transforming your application from a live, active site into a single downtime page is too large a hammer for many applications. However, it can be the only choice when you're not certain what the actual problem is, and it can be the perfect tool for a smaller application or microservice. Either way, it should be one of the first safeguards you build.
When you've gone into maintenance mode, your site should have a clear message with a link to a status page, so that your users know what to expect. More importantly, maintenance mode should be really easy to toggle. At Heroku, we have this implemented as an environment variable:
MAINTENANCE_MODE=on
While there are many ways to implement this mode, your implementation should be easy for your application operators to toggle whether it’s 2pm or 2am!
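As a sketch of what that toggle can look like in a Rails app, a tiny Rack middleware can short-circuit every request when the variable is set. This is an illustration rather than Heroku's actual implementation; the middleware name, response body, and status page URL are made up:

class MaintenanceMode
  def initialize(app)
    @app = app
  end

  def call(env)
    if ENV["MAINTENANCE_MODE"] == "on"
      # 503 signals a temporary outage; point users at your status page.
      body = "<h1>Down for maintenance</h1><p>See https://status.example.com</p>"
      [503, { "Content-Type" => "text/html" }, [body]]
    else
      @app.call(env)
    end
  end
end

# config/application.rb
config.middleware.insert_before 0, MaintenanceMode

Placing the middleware first in the stack means no application code, and ideally no database connection, is exercised while you are in maintenance mode.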
2. Read-only mode
For applications which modify any data, read-only mode helps preserve a minimum amount of functionality for your users and give them confidence that you haven’t lost all their information. What this looks like depends on what functions your application provides to users. Perhaps it stores data in the database, uploads files to a document store, or changes user preferences held in a cache. Once you know what your application is modifying, you can plan for what would happen if a user can't modify it.
Suppose there's a sharp increase in replication lag from a bad actor, and all of your users can no longer make any changes on your site. Rather than take the whole application down, it may be better to enter a read-only mode. This can also be set as a simple environment variable:
READONLY_MODE=on
Customers often appreciate a read-only mode over a full-blown maintenance mode so that they can retrieve data and know which of their most recent changes went through. If you know that your site is still able to serve its visitors, your read-only mode should indicate via a banner (or some other UI element) that certain features are temporarily disabled while an ongoing problem is being resolved.
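One minimal way to sketch this in Rails is a before_action that rejects anything but reads while the flag is on. Again, this is illustrative, not a canonical implementation:

class ApplicationController < ActionController::Base
  before_action :enforce_read_only

  private

  def enforce_read_only
    return unless ENV["READONLY_MODE"] == "on"
    return if request.get? || request.head?

    # A banner in the layout can key off the same flag to warn users
    # before they even attempt a write.
    render plain: "We're temporarily read-only while we resolve an issue.",
           status: :service_unavailable
  end
end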
3. Feature flags
Often, feature flags are introduced as a means of A/B testing new site functionality, but they can also be used for handling incidents as they occur.
There are three different types of feature flags to consider:
- User-level: these flags are toggled on a per-user basis. During an outage, they're probably not very useful, due to their narrow effect.
- Application-level: these flags affect all users of your site. These might behave more like the maintenance mode and read-only mode toggles listed above.
- Group-level: these flags only affect a subset of users that you have previously identified.
When it comes to incident handling, group-level feature flags are the most useful of the three. You'll want to think about what groupings are meaningful for your application; these end up being a combination of what you want to control and who your application’s users are.
Suppose your application has started selling products to a limited number of users. One evening, there's a critical issue, and the feature needs to be disabled. We implement this at Heroku within the code itself. A single class can answer questions about the current application state and toggled features:
ApplicationSetting.get('billing-enabled')
=> true
This ApplicationSetting model could be backed by a database, by Redis, whatever provides the most resiliency to make sure that this check doesn't fail.
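For instance, a database-backed version might be as small as this sketch, assuming a hypothetical application_settings table with name and value columns:

class ApplicationSetting < ActiveRecord::Base
  # Returns the stored value, or nil if the setting has never been created.
  def self.get(name)
    find_by(name: name)&.value
  end

  def self.set(name, value)
    find_or_initialize_by(name: name).update!(value: value)
  end
end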
Depending on your company’s need for stability, it may make sense to further subdivide into smaller segments. For example, perhaps your EU users have an entirely different feature flag for billing:
ApplicationSetting.get('billing-enabled-eu')
=> false
For earlier-stage companies, it may be silly to have so many levels of refinement, but if your directive is to shave customer impact down by tenths of percentages, you'll be grateful for the confidence about which segment of the application is being affected!
4. Rate limits
Rate limits are intended to protect you from disrespectful and malicious traffic, but they can also help you shed load. When you are receiving a mixture of normal and malicious traffic, you may need to artificially slow down everything while getting to the problem's source.
If you need to drop half your traffic, drop half your traffic. Your legitimate users may need to try two or three times to get a particular request handled, but if you make it clear to them that your service is unexpectedly (but intentionally!) rejecting a fair number of requests because it's under some sort of load, they will understand and adjust their expectations.
Rate limits can also protect access to your application from other parts of your business that rely on your service. Often, the single application that a user sees is actually a mesh of different services all acting together to create a single user experience. While you absolutely can make that internal system function when other services are down, it can be easier to just prioritize internal requests over external ones.
At Heroku, we implement rate limits as a combination of two different kinds of levers: a single default for every account, plus additional modifiers for different users. We find that this gives us the flexibility to provide certain users the rate limits that they need, while at the same time retaining a single control for how much traffic we are able to handle at any one point.
We set this value as an application setting with a global rate limit default:
ApplicationSetting.set('default-rate', 100)
Here, we're assuming it's 100 requests per minute—hopefully your site can handle much more than that! Next, we assign all the users a default modifier:
user.rate_limit_modifier
=> 1.0
Every user starts with a modifier of one. To determine the customer's rate limit, we multiply the application default by their modifier:
user.rate_limit
=> 100.0 # requests per minute
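Deriving the limit at read time keeps the global default as the single lever. A sketch, using the setting name and modifier column assumed above:

class User < ActiveRecord::Base
  def rate_limit
    # Recomputed on every check, so changing the one default
    # instantly rescales every account's limit.
    ApplicationSetting.get('default-rate') * rate_limit_modifier
  end
end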
Suppose a power user writes in to support and provides legitimate reasons for needing twice the rate limit. In that case, we can set their modifier to two:
power_user.rate_limit_modifier
=> 2.0
This will grant them a rate limit of 200 requests per minute.
At some point, we might need to cut down our traffic. In that case, we can cut the rate limit in half:
ApplicationSetting.set('default-rate', 50)
Every user now has their rate limit halved, including the power user above. But their value of 100 is still above everyone else's default of 50, so they can continue on with their important work.
Setting limits like this allows us to rapidly adjust traffic coming in without having to run a script over every single user to adjust their rate limit. It's important to note that, depending on your application, you may want to consider doing cost-based rate limiting. With a cost-based rate limiting system, you "charge" a user a number of tokens depending on the length of their request, such that they can't call your really slow endpoints as frequently as your blazing fast endpoints.
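A cost-based variant can be sketched by charging tokens in proportion to server time spent. Everything here, the token helper in particular, is hypothetical:

# Wrap each request handler; slow endpoints drain the bucket faster.
def charge_for_request(user)
  started_at = Time.now
  response = yield
  elapsed = Time.now - started_at

  # One token per request, plus one per 100ms of server time.
  cost = 1 + (elapsed / 0.1).floor
  user.spend_tokens!(cost) # hypothetical per-user counter, e.g. in Redis
  response
end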
Finally, it may seem counter-intuitive, but the more complex the algorithm for rate limiting, the worse it will be during denial of service attacks. The more computation time it takes for you to say that you can't process a request, the worse off you are when you're dealing with a flood of them. This is no reason to not implement sophisticated rate limiting if you need it, but it is a reason to make sure that you have other layers in place to handle distributed denial of service attacks.
5. Stop non-critical work
If your application is consistently pushing up against the limits of its infrastructure, you should be able to pull the plug on anything that isn't urgent. For example, if there are any jobs or processes that don't need to be fulfilled immediately, you should just be able to turn them off.
Let's take a look at how that can be accomplished in the context of a function which generates a monthly user report:
class MonthlyUserReport
  def run
    do_something
  end
end
do_something has a decent chance of being very computationally expensive. We can instead rework this class to first assert that reports can be generated:
class MonthlyUserReport
  def run
    return unless enabled?
    do_something
  end

  def enabled?
    ReportSetting.get("monthly_user_report")
  end
end
Now, before we do any work, we can check to make sure the generation is enabled. Just like the application settings above, we have ReportSetting defined as a model here:
ReportSetting.get("monthly_user_report")
=> false
We can also generalize this implementation. Let's make the monthly user report inherit from a parent Report class:
class MonthlyUserReport < Report
  def build
    do_something
  end
end
Now, the monthly user report is only responsible for performing the build, and the parent class is responsible for figuring out whether or not the job ought to run:
class Report
  def run
    return unless enabled?
    build
  end

  def enabled?
    # self.class is the child class at runtime, e.g. MonthlyUserReport,
    # so this looks up "monthly_user_report".
    ReportSetting.get(self.class.name.underscore)
  end
end
6. Known unknowns
Sometimes, observing the effects of a new change will be beyond the scope of a feature flag. Even if you believe that all your tests are flawless, you still carry doubt knowing that a disastrous outcome is looming in the shadows. In these cases, you can use a control/candidate system such as Scientist to monitor the behavior of both the new and the old code paths.
Scientist allows you to gradually roll out changes and refactors. It also allows you to enable or disable new or experimental code immediately if there are any problems. Being able to turn suspicious code paths off one by one is a really great tool for moving you closer to the real problem faster.
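Following the pattern in Scientist's README, a scary refactor can run both code paths and compare the results, while customers keep seeing the old behavior. The billing class and method names below are hypothetical:

require "scientist"

class BillingCalculator
  include Scientist

  def total(account)
    # Scientist returns the control's result and records any mismatches,
    # so a misbehaving candidate can be disabled without customer impact.
    science "billing-total" do |experiment|
      experiment.use { legacy_total(account) }     # control: current code path
      experiment.try { refactored_total(account) } # candidate: the refactor
    end
  end
end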
7. Circuit breakers
Circuit breakers allow you to play nice with the services that you depend on. These are typically responsive shut-offs that safeguard interactions between services under dire situations. For example, if the number of 500 errors you see from a service in the last 60 seconds passes a threshold, a responsive shut-off can automatically step in and halt any calls to that struggling service. This gives the service time to recover, but it also frees up your web processes from spending time calling a service that is most likely failing.
A responsive shut-off works far faster than any monitoring service. A monitoring service may page your on-call engineer, which prompts them to go to their computer, then search for the right playbook, and then finally take action. By the time the original page reaches a human, your responsive shut-off has already kicked in and you're in a better failure mode.
A circuit breaker could work in a similar way to the monthly billing report. Just as the monthly billing report inherited from a parent Report class, a billing service client could inherit from a Client class that sets up circuit breakers by default for any of its children.
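There are good Ruby circuit breaker gems that handle this (circuitbox and semian, for example), but as a rough sketch of the idea, with hypothetical thresholds and helpers:

class Client
  class CircuitOpenError < StandardError; end

  FAILURE_THRESHOLD = 5 # illustrative: trip after 5 consecutive failures
  COOL_OFF_SECONDS = 60 # illustrative: reject calls for a minute once tripped

  def initialize
    @failures = 0
    @opened_at = nil
  end

  # Wrap every outbound call made by child classes.
  def with_circuit
    raise CircuitOpenError, "circuit tripped; skipping call" if open?
    result = yield
    @failures = 0 # any success resets the breaker
    result
  rescue CircuitOpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = Time.now if @failures >= FAILURE_THRESHOLD
    raise
  end

  private

  def open?
    @opened_at && (Time.now - @opened_at) < COOL_OFF_SECONDS
  end
end

class BillingServiceClient < Client
  def charge(account)
    with_circuit { post("/charges", account) } # post is a hypothetical HTTP helper
  end
end

In production you'd back the failure count with shared state (Redis or the database) rather than instance variables, so that every web process sees the same circuit, and you'd build the manual on/off switch discussed above rather than relying on a production console.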
Further considerations
There are a number of additional caveats you may want to investigate.
The first one is around visibility. Whether it has a lot of pretty graphics or just some command line output, having a way to display the different places where you're storing the state for your buttons and switches and combining it into one comprehensible place is really important for incident operations. Really consider how much work it will take to figure out if a switch is flipped or not, because in general, the fancier and more sophisticated your switch is, the more likely it is to become part of the problem!
You should also routinely test whether these switches actually work. You can't have confidence that a switch will work until you've tried it; if it's a critical switch, exercise it with game days.
With the variety of techniques listed above, you will want to carefully consider how you form these safeguards and where you store their state. There are a number of options available: in a relational database, a data caching layer, as environment variables, etc. You can even have configurations in your code as a last resort if you believe that a way to control for failure is a fresh deploy.
Consider whether flipping a switch would require access to a component that could be down. If that switch requires access to a running production server and you can't communicate to that server, what happens? How might you change the behavior of your application if you can't deploy changes? If you have an immutable infrastructure, that might mean environment variables are totally out of the question for handling certain failure cases. One of the reasons why we rely so heavily on databases for storing our application state is because we have high confidence that we can retain access to the database in order to manually run SQL statements to toggle those safeguards.
What this boils down to is this: the more configurable you make your application at runtime, the less confident you can be that it will work in predictable ways. Have you tested how a certain user, when flagged into three features, interacts with all of your services? As you implement these knobs and buttons, keep in mind that you are trading knowledge for control. However, it's still a better deal at the end of the day. More control over mitigating issues in the app is better than confidently knowing the exact and particular way an app is down, but having no way to do anything about it.