Optimizing for the cloud often involves infrastructure changes that can be difficult and filled with unknowns, making migrations particularly painful. Maintenance windows aren’t an option because testing happens in production. But there is a way to migrate safely with fewer headaches.
Back in November, LaunchDarkly’s own Mike Zorn and Justin Caballero attended AWS re:Invent 2021 in Las Vegas to present a live session on techniques for painless reinvention. At re:Invent, Mike and Justin demonstrated how to convert a streaming event architecture and migrate production databases with zero downtime, using a gradual, reversible, and verifiable process controlled with feature flags.
Below you can watch the full video of both presentations. In this post, we've written up some highlights of Justin's presentation (which begins at the 15:00 point), focusing on the story of LaunchDarkly’s migration to a new database. If you're interested in a recap of Mike’s presentation, which kicked off the event and covered how we migrated our streaming event ingestion pipeline, you can check that out here.
Has any team really ever felt safe during a database migration?
Justin began by asking those in attendance if they had ever really felt safe when working on a database migration in their application. To no one’s surprise, nearly every respondent signified that they did not feel safe during this process.
So, why would teams need to perform a database migration in their application? A few reasons could include:
- User growth
- Lack of satisfaction with a query model or database
- The desire to invest in more resilience
- Working on a cloud migration to enable using an AWS-provided database instead of an on-premises database
As Justin explains, in the case of LaunchDarkly, we wanted to improve our disaster recovery abilities. We were using Mongo and PostgreSQL as our main data stores at the time, but Mongo’s document model was not a good fit for our fairly structured data. So, we moved to CockroachDB.
The main objective during the migration
Now that the data storage part is settled, Justin pointed out the ultimate goal with this migration:
“The key constraint is that our users can't notice we're doing this. We don't want any downtime. We don't want a maintenance window on the weekend. We want our users to always get the correct data. We want the migration process itself not to cause performance regressions—it's not good if we add a second of latency just because we're fiddling around with a new database.”
After harkening back to Mike Zorn’s San Francisco-Oakland Bay Bridge analogy from earlier, Justin explained how the goal with this migration was to stand up the new database, connect the application to the new database, and then discard the old one. But this is something that obviously can’t be done instantly—it must be tackled gradually. However, we also needed the new changes to be immediately reversible to avoid the need for a deploy if anything went wrong. And one more thing: we also needed a way to verify that the new system was working right before switching over. That’s a lot to manage.
The blueprint for a seamless database migration
As Justin goes on to explain, we achieved this by running our application in a sort of dual-mode that allows it to talk to the old and new databases at the same time. This process started off with the old database remaining active while the new database ran in what we call “shadow mode.”
With this method, we still needed to ensure that the user experience was seamless and would be unaffected if something went wrong along the shadow pathway. We also needed a way to guard against user queries performing slowly in the new database. And finally, when the migration was over and we’d arrived at the end state, we needed a way to essentially flip a switch and drop the connection to the old database. And again, this all needed to be done without any disruptions to user experience. Think of it like replacing the entire plumbing of someone’s home without them even knowing it was happening. (Or you can just go back to Mike’s San Francisco-Oakland Bay Bridge. You do you.)
Fortunately, we had a way to make all of this happen: feature flags.
Mission accomplished
Justin goes on to demonstrate how feature flags allowed us to be gradual with changes while making it easier to roll all of this out to multiple call sites in our application without having to introduce error-prone code all over the place. Plus, the cleanup part was incredibly easy.
Ultimately, feature flags gave our team immediate control inside the application. They gave us the ability to turn a change off within a second if needed, along with a clear view of metrics that could verify that the results were the same between the two databases.
Bottom line: We were able to test out the new database in production, find problems, and optimize the many performance issues during the process—and our customers had no idea this migration even happened because user experience was never affected.
We’re obviously leaving out a lot of details of the actual process, but don’t worry; you can watch the video of Justin’s presentation to see exactly how we pulled this off. He’ll walk you through the whole thing. Click here to watch—Justin’s portion starts right at the 15:00 mark.
And if you haven’t watched Mike’s equally informative talk on how we safely migrated our streaming event ingestion pipeline using feature flags, you can watch it by using the same link.
Full transcript:
Hi, everybody. I'm Justin, I'm an engineer at LaunchDarkly. Mike just explained how we used feature flags to safely migrate our streaming event ingestion pipeline. I'm going to talk about another kind of migration that we've been working on. I'm curious if folks in the audience maybe have experience with this one. Has anybody ever worked on a database migration in their application? Okay, quite a few hands. I don't know if I'll be able to tell, but how safe did that feel? Thumbs up for safe, thumbs down for not safe. Okay, well quite a few thumbs down. I'm not sure if I saw any thumbs up, actually. Anybody with the thumbs up? Oh, maybe one? Okay. Let's see what we got here. There's lots of reasons we might want to do this. There's more than I've listed here.
- Our application may have grown. We have a lot more users, maybe we're going from a local deployment to a global one.
- Maybe we don't like the query model we chose originally, maybe something about our database, it doesn't have strong enough isolation.
- Maybe we want to invest more in resilience, time to recovery, making sure we don't lose data if there's some sort of failure.
- Maybe we're working on a cloud migration and we want to start using an AWS-provided database instead of an on-premises database.
Lots of reasons.
I'll talk a little bit about our particular case. So originally we were using, for a number of years, a combination of Mongo and Postgres as our main data stores. And we wanted to get to a situation where we had a much better disaster recovery story. So if we lost a region, we would still be able to continue operating without any downtime or data loss, or to minimize that as much as possible. Also, our data is pretty structured. So the document model we were getting out of Mongo didn't feel like it fit as well. We were starting to see some strains from that. We wanted something essentially like a cloud native database with strong isolation properties. So we chose CockroachDB, but the specifics here aren't really the point. Everyone has their own reasons for needing to do this.
I really want to talk more about the general process of how we tackle this and not particular technologies. The key constraint, really, is that our users can't notice that we're doing this. We don't want any downtime. We don't want a maintenance window on the weekend or something. We want our users to always get correct data. We want the migration process itself not to cause performance regressions. It's not good if we add a second of latency just because we're fiddling around with a new database.
Here's a picture of what we want, and this ties back to the bridge scenario a little bit. In the before state, we've got our application connected to an old database. It's filled up with blue data, or maybe these are the cars on the bridge. So we're going to stand up a new empty database. And in the after state, we want our app to be connected to the new database, with all the data in it. And then we're going to throw away the old one. If only we could snap our fingers and have that bridge up and all the cars on it... Of course we can't. So we need to think of a process that's going to work here. Of key importance is that we make the changes gradually. We can't do a big bang approach, obviously, because that's super risky.
We also need the changes to be reversible and to be reversible immediately. We don't want to have to do a deploy if something goes wrong. And even if that takes five minutes, that would be very fast, but more realistically, it’s like 30 minutes. That's not good. And we also need to verify that the new system is working. Does it give us the same answers as the old database did? So the way we do this is by running our application in this sort of dual mode where it's talking to both databases at the same time. We call the one that's connected to the user experience the active database. So that's where, essentially, when [users] do a query, that's where the answers that they're seeing are coming from, would be the active database.
We start off with the old database being active and then the new database is running in what we call shadow mode. So it'll be doing the same queries as the old database, off to the side. And this can be a little more tricky than it seems. We do have to take some care to ensure that if something happens along the shadow pathway, that doesn't affect the user experience. And particularly if the queries are performing slowly in the new database, we need to somehow guard against that.
The other thing is that the new database starts out empty and we need some way to fill that up. So this depends on your own scenario, what the database structure is. In some cases, just by the fact that you're doing writes to both databases means that the new one will start to catch up immediately. We do have to do something about preexisting data, so that's another area we need to think about. This is a picture of this iterative process we go through, where once we're connected to both, we're going to essentially just sit and watch what happens. And since we're going to be doing some verification, we'll look to make sure that things are adding up right. If we see something's wrong, we'll maybe stop and go make some bug fixes. We're also going to be observing the performance and doing optimizations if we need to for new queries.
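One way to keep the shadow pathway from affecting users is to answer from the active database and run the shadow query in the background with its own deadline. The following is a minimal Go sketch of that idea, not the code from the talk: `Account`, `FindFunc`, `findWithShadow`, the `report` callback, and the 50 ms timeout are all assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Account is a hypothetical entity used only for illustration.
type Account struct {
	ID   string
	Name string
}

// FindFunc looks up an account by ID in one database.
type FindFunc func(ctx context.Context, id string) (Account, error)

// findWithShadow answers from the active database, then runs the same query
// against the shadow database in a background goroutine. The shadow call has
// its own deadline, and its outcome goes to report, never to the caller.
func findWithShadow(ctx context.Context, id string, active, shadow FindFunc,
	shadowTimeout time.Duration, report func(match bool, err error)) (Account, error) {

	got, err := active(ctx, id)

	go func() {
		// Use a fresh context so the shadow query is bounded by its own
		// timeout rather than by the lifetime of the user's request.
		sctx, cancel := context.WithTimeout(context.Background(), shadowTimeout)
		defer cancel()

		shadowGot, shadowErr := shadow(sctx, id)
		if shadowErr != nil {
			report(false, shadowErr)
			return
		}
		report(err == nil && shadowGot == got, nil)
	}()

	return got, err
}

// demo pairs a fast active database with a deliberately slow shadow database.
// It returns what the caller saw and whether the shadow call timed out.
func demo() (acct Account, shadowTimedOut bool, err error) {
	active := func(ctx context.Context, id string) (Account, error) {
		return Account{ID: id, Name: "Acme"}, nil
	}
	slowShadow := func(ctx context.Context, id string) (Account, error) {
		select {
		case <-time.After(2 * time.Second):
			return Account{ID: id, Name: "Acme"}, nil
		case <-ctx.Done():
			return Account{}, ctx.Err()
		}
	}

	reported := make(chan error, 1)
	acct, err = findWithShadow(context.Background(), "a1", active, slowShadow,
		50*time.Millisecond, func(_ bool, shadowErr error) { reported <- shadowErr })

	shadowTimedOut = errors.Is(<-reported, context.DeadlineExceeded)
	return acct, shadowTimedOut, err
}

func main() {
	acct, shadowTimedOut, err := demo()
	fmt.Println("user saw:", acct.Name, "error:", err)
	fmt.Println("shadow timed out:", shadowTimedOut)
}
```

In this sketch the caller never waits on the shadow database; a slow or failing shadow query only shows up in whatever `report` records.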
As I mentioned, we also will have some sort of process in the background to be able to sync all data. In our case, we were able to get by with a very simple, sort of ad hoc, essentially, scripts to do queries and shuffle data over. If you have really large data sets or there's some complexity, you might need some sort of offline batch processing or something fancier to deal with that problem. But that's also something you probably need to do iteratively. And then when we get to the end state, all we need to do is flip the switch and drop our connection to the old one, and we're done. And this is something, this switching between the two, is something we do with feature flags and we'll get into that.
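To make the background sync concrete, here is a small Go sketch of an iterative backfill over in-memory stand-ins for the two databases. It is an illustration under stated assumptions rather than the scripts described in the talk: `Store`, `backfillBatch`, `backfillAll`, and the policy of skipping rows the new database already holds (on the assumption that dual writes put newer data there) are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Account is a hypothetical entity used only for illustration.
type Account struct{ ID, Name string }

// Store is a hypothetical in-memory stand-in for a database table.
type Store map[string]Account

// backfillBatch copies up to limit accounts with ID > after from oldDB to
// newDB. IDs already present in newDB are skipped, on the assumption that
// dual writes put newer data there. It returns the last ID examined and
// whether more rows may remain. limit must be positive.
func backfillBatch(oldDB, newDB Store, after string, limit int) (last string, more bool) {
	ids := make([]string, 0, len(oldDB))
	for id := range oldDB {
		if id > after {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	if len(ids) > limit {
		ids, more = ids[:limit], true
	}

	last = after
	for _, id := range ids {
		if _, exists := newDB[id]; !exists {
			newDB[id] = oldDB[id]
		}
		last = id
	}
	return last, more
}

// backfillAll repeats batches until the cursor reaches the end and reports
// how many batches it ran.
func backfillAll(oldDB, newDB Store, limit int) int {
	batches := 0
	cursor, more := "", true
	for more {
		cursor, more = backfillBatch(oldDB, newDB, cursor, limit)
		batches++
	}
	return batches
}

func main() {
	oldDB := Store{
		"a1": {ID: "a1", Name: "Alpha"},
		"a2": {ID: "a2", Name: "Beta"},
		"a3": {ID: "a3", Name: "Gamma"},
	}
	// a2 was already written to the new database by a dual write.
	newDB := Store{"a2": {ID: "a2", Name: "Beta (renamed)"}}

	batches := backfillAll(oldDB, newDB, 2)
	fmt.Println("batches:", batches, "rows in new database:", len(newDB))
	fmt.Println("a2 in new database:", newDB["a2"].Name)
}
```

Working in small batches with a cursor keeps each step cheap and restartable, which fits the iterative approach described above.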
Okay. Let's take a look at how to actually make this work inside your application. This is one of the things where you might have a question like, "Could we do this sort of migration with canary deploys?" And this is the thing where you really do need feature flags, because you need that control internal to your application to control the pathway of execution. We wouldn't be able to do this with canary deploys, although they can be useful as we're introducing the tooling for the first time to give us an extra degree of safety. It's a very good technique, of course.
Let's look at a little bit of code. We're a Go shop, so these examples are going to be in Go, but of course this would work with any language. So let's say we have an account entity and, if you're familiar with domain-driven design, this pattern of the repository may seem familiar. So we've got this interface with a find method. Let's just look up an account. It's got some query in it and we return an account. That's pretty simple. In our application code, like an API handler or something, we simply call the find method. Of course, in a real app, we would actually do something with that account. But here we're just maybe sending it back to the user. So how do we introduce the second database into this code? That's what we want to do. So we introduce the idea of a mode, and this is the thing that we'll bind to our feature flag. A mode tells us whether we want to talk to database A, database B, or both. In this case, we have the three choices as an enumeration and then in our code, we write some if/else kind of logic here. Once we've made the database calls to either or both databases, we can do a comparison if we talk to both. And we do something with those results of the comparison, which I haven't shown here. But that would be the thing that tells us, are both systems working the same way?
And then we have to choose. In this example, we always return the result from A, if we made a call to A, so that would mean the old database. When we did this, we actually also designed the flag in a way that we could make B the active database. So we could have a both mode that would return either A or B.
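The slide code is not reproduced in this transcript, so here is a minimal Go sketch of the pattern as described: a repository interface, a three-value mode, sequential calls, a comparison, and a result taken from A whenever A was called. Every name here (`AccountRepo`, `Mode`, `find`, `mapRepo`, `onCompare`) is an assumption for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Account is a hypothetical entity used only for illustration.
type Account struct{ ID, Name string }

// AccountRepo is a repository interface reduced to a single method.
type AccountRepo interface {
	Find(id string) (Account, error)
}

// Mode is the value that would be bound to the feature flag.
type Mode int

const (
	ModeA    Mode = iota // old database only
	ModeB                // new database only
	ModeBoth             // query both, compare, answer from A
)

// find is the hand-written if/else version: sequential calls, then a
// comparison when both databases were queried.
func find(mode Mode, a, b AccountRepo, id string, onCompare func(match bool)) (Account, error) {
	var resA, resB Account
	var errA, errB error

	if mode == ModeA || mode == ModeBoth {
		resA, errA = a.Find(id)
	}
	if mode == ModeB || mode == ModeBoth {
		resB, errB = b.Find(id)
	}
	if mode == ModeBoth {
		onCompare(resA == resB && (errA == nil) == (errB == nil))
	}

	// Return A's result whenever A was called; otherwise return B's.
	if mode == ModeB {
		return resB, errB
	}
	return resA, errA
}

// mapRepo is an in-memory stand-in for a database.
type mapRepo map[string]Account

var errNotFound = errors.New("account not found")

func (m mapRepo) Find(id string) (Account, error) {
	acct, ok := m[id]
	if !ok {
		return Account{}, errNotFound
	}
	return acct, nil
}

func main() {
	oldDB := mapRepo{"a1": {ID: "a1", Name: "Acme"}}
	newDB := mapRepo{} // not backfilled yet

	acct, err := find(ModeBoth, oldDB, newDB, "a1", func(match bool) {
		fmt.Println("results match:", match)
	})
	fmt.Println("user saw:", acct.Name, "error:", err)
}
```

A flag that can also make B the active database would need a fourth mode value, or a second flag, that returns B's result while both are queried.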
Okay, so this isn't too bad. It's just a little if/else, but there's some problems. We would really like to have some tooling for metrics. We want to be able to compare apples to apples, how these find methods are performing. And we'd like that metric to be done at this level of interface. We might have automatic driver-level metrics, but the queries might not map one-to-one, so we'd like to have the higher-level metrics as well. And the fact that we're doing sequential calls, calling A and then B, is obviously bad because now we're taking the sum of those latencies and that's what our users are going to be hit with. It's not good.
So, let's add some more code. We can add some goroutines here to increase concurrency and improve latency. We can add some tracing spans around each call. The latency still isn't great. It's the max of A and B, so if B is taking five seconds, that's going to be the latency the user experiences. And more importantly, this code is getting kind of hard to follow. I feel guilty even putting this on the slide. Imagine if you had lots of places in your application where you're doing queries, you wouldn't want to have to do this. It's going to be error-prone; you'll probably make mistakes. There's probably mistakes in this example, even.
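As a rough illustration of the concurrent version, the Go sketch below runs both lookups in goroutines and times each one. It is not the slide code: `FindFunc`, `timed`, `findBoth`, and `sleepyDB` are assumptions, and the timing in `timed` is only a stand-in for a real tracing span.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Account is a hypothetical entity used only for illustration.
type Account struct{ ID, Name string }

// FindFunc looks up an account by ID in one database.
type FindFunc func(id string) (Account, error)

// result holds one database's answer plus how long the call took.
type result struct {
	acct Account
	err  error
	took time.Duration
}

// timed runs one lookup and records its latency.
func timed(find FindFunc, id string) result {
	start := time.Now()
	acct, err := find(id)
	return result{acct: acct, err: err, took: time.Since(start)}
}

// findBoth queries A and B concurrently and waits for both, so the caller's
// latency is roughly max(A, B) instead of A + B.
func findBoth(a, b FindFunc, id string) (resA, resB result) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		resA = timed(a, id)
	}()
	go func() {
		defer wg.Done()
		resB = timed(b, id)
	}()
	wg.Wait()
	return resA, resB
}

// sleepyDB simulates a database with a fixed response time.
func sleepyDB(delay time.Duration) FindFunc {
	return func(id string) (Account, error) {
		time.Sleep(delay)
		return Account{ID: id, Name: "Acme"}, nil
	}
}

// demo runs one concurrent lookup and reports how long the caller waited.
func demo() (resA, resB result, total time.Duration) {
	start := time.Now()
	resA, resB = findBoth(sleepyDB(100*time.Millisecond), sleepyDB(200*time.Millisecond), "a1")
	return resA, resB, time.Since(start)
}

func main() {
	resA, resB, total := demo()
	fmt.Println("A:", resA.took, "B:", resB.took, "caller waited:", total)
	fmt.Println("results match:", resA.acct == resB.acct)
}
```

Because `findBoth` waits for both calls, a slow B still delays the user, which is the remaining latency problem described above.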
So how can we get back to this where it's just a simple one-line call? That would be great. So, we want to factor out all these cross-cutting concerns like instrumentation; the A/B mode selection, which is driven by our feature flag; the execution strategy of whether they execute concurrently, the two calls, or maybe there's some sort of background process we can use for the shadow database. And then the verification strategy, we want to be able to plug that in somehow.
So since we're using Go, we opted to use code generation to build out this tooling. Essentially, we generated a set of wrappers for each repository interface, and we added a wrapper for doing metrics and a wrapper for the other stuff related to the migration process itself. In your own language, like if you're doing Java or something, you might use a dynamic proxy or something that's more runtimey, but code generation worked great for us since we wanted a low-magic approach.
So we add this directive to our interface that says, "Please generate some wrappers for us using this go generate command." And this is what it looks like in our initialization, our bootstrap code for the application. So this is basically like the decorator pattern applied a couple of times. We wrap each of our repositories, the A and the B repo, in an instrumentation wrapper. And then we wrap both of those inside a migration wrapper. And that migration wrapper controls the switching between A and B, as well as things like the verification and the metrics. So once we've created this, we can use it just like a plain old repo. We can call the find method on that in our main code.
The other nice thing about this is when we're done with the migration, we haven't interfered with the code in our API endpoint handlers, like the main request processing code. So we can just delete this initialization stuff from the app and we're not worried about breaking the request flow.
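The layering can be sketched in Go with hand-written wrappers standing in for the generated ones. This is an illustration only: `instrumented`, `migrating`, the `mode` and `verify` callbacks, and `mapRepo` are assumed names, and the call counter is a placeholder for real metrics.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Account is a hypothetical entity used only for illustration.
type Account struct{ ID, Name string }

// AccountRepo is a repository interface reduced to a single method.
type AccountRepo interface {
	Find(id string) (Account, error)
}

// Mode is the value that would be bound to the feature flag.
type Mode int

const (
	ModeA Mode = iota
	ModeB
	ModeBoth
)

// instrumented is the inner decorator. It only counts calls here; real
// tooling would record latency and errors as well.
type instrumented struct {
	next  AccountRepo
	calls atomic.Int64
}

func (i *instrumented) Find(id string) (Account, error) {
	i.calls.Add(1)
	return i.next.Find(id)
}

// migrating is the outer decorator. It reads the current mode on every call,
// so a flag change takes effect without a deploy.
type migrating struct {
	a, b   AccountRepo
	mode   func() Mode
	verify func(method string, match bool)
}

func (m *migrating) Find(id string) (Account, error) {
	switch m.mode() {
	case ModeB:
		return m.b.Find(id)
	case ModeBoth:
		resA, errA := m.a.Find(id)
		resB, errB := m.b.Find(id)
		m.verify("Find", resA == resB && (errA == nil) == (errB == nil))
		return resA, errA
	default:
		return m.a.Find(id)
	}
}

// mapRepo is an in-memory stand-in for a database.
type mapRepo map[string]Account

func (r mapRepo) Find(id string) (Account, error) {
	acct, ok := r[id]
	if !ok {
		return Account{}, fmt.Errorf("account %q not found", id)
	}
	return acct, nil
}

func main() {
	// Bootstrap code: wrap each repo for metrics, then wrap both for migration.
	oldDB := &instrumented{next: mapRepo{"a1": {ID: "a1", Name: "Acme"}}}
	newDB := &instrumented{next: mapRepo{"a1": {ID: "a1", Name: "Acme"}}}
	mode := ModeBoth
	var repo AccountRepo = &migrating{
		a: oldDB, b: newDB,
		mode:   func() Mode { return mode },
		verify: func(method string, match bool) { fmt.Println(method, "match:", match) },
	}

	// The call site looks exactly like a call on a plain repository.
	acct, err := repo.Find("a1")
	fmt.Println(acct.Name, err, "calls A:", oldDB.calls.Load(), "calls B:", newDB.calls.Load())
}
```

Since the call site depends only on `AccountRepo`, removing the migration later means changing the bootstrap code alone.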
Since we're doing code gen, we use this as an opportunity to generate a little record for every method on the data stores so we could capture the package, the type, and the method name. And this works really well for reporting on metrics. We can get metrics down to the method level, as well as giving us something to feature flag on. So we can use these three properties of each method on that interface as attributes in our feature flag.
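A per-method record of this kind might look like the Go sketch below. The struct shape, the metric naming scheme, and the attribute keys are assumptions made for illustration, not the generated code from the talk.

```go
package main

import "fmt"

// MethodInfo is a hypothetical shape for the record generated per method.
type MethodInfo struct {
	Package string
	Type    string
	Method  string
}

// MetricName builds a per-method metric key, for example
// "repo.accounts.AccountRepo.Find.latency".
func (m MethodInfo) MetricName(suffix string) string {
	return fmt.Sprintf("repo.%s.%s.%s.%s", m.Package, m.Type, m.Method, suffix)
}

// FlagAttributes exposes the same three properties as attributes that flag
// targeting rules could match on.
func (m MethodInfo) FlagAttributes() map[string]string {
	return map[string]string{
		"package": m.Package,
		"type":    m.Type,
		"method":  m.Method,
	}
}

func main() {
	info := MethodInfo{Package: "accounts", Type: "AccountRepo", Method: "Find"}
	fmt.Println(info.MetricName("latency"))
	fmt.Println(info.FlagAttributes()["method"])
}
```

With attributes like these, a rule could in principle turn on both-database mode for a single method before widening it to a whole type or package.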
Speaking of flags, this is what it would look like as an example in LaunchDarkly to control how this repository behaves as we're migrating. Imagine you have a set of users that are your guinea pig users, or maybe they signed up for the bleeding edge; they're willing to tolerate some risk. We could say, just for those users, turn on both databases. Or you could say, you could do a percentage rollout or something. I just want to do this for 2% of my traffic. And this gives us a way to slowly introduce even the tooling, because there's a little bit of concern when you first introduce the tooling itself, that that works. So you want to take it slow at every step.
There are other ways you can imagine you might slice and dice this. There was an interesting blog post that Stripe had about how they did a similar process to this, but they first switched all their read queries to the new database and kept the writes and migrated them separately later. So, you could do something like that as well with this mechanism.
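To show how rules like these translate into a mode, here is a small in-process Go sketch. It is a stand-in for a flag service, not LaunchDarkly's SDK or its bucketing algorithm: `rollout`, `bucket`, `modeFor`, and the FNV-based hashing are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Mode is the flag variation returned for a user.
type Mode string

const (
	ModeA    Mode = "a"
	ModeB    Mode = "b"
	ModeBoth Mode = "both"
)

// rollout is a hypothetical stand-in for a flag's targeting rules.
type rollout struct {
	earlyAdopters map[string]bool // users who opted in to the bleeding edge
	bothPercent   uint32          // share of remaining traffic in both mode, 0 to 100
}

// bucket maps a user key to a stable value in [0, 100), so a given user
// always lands on the same side of the percentage rollout.
func bucket(userKey string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(userKey))
	return h.Sum32() % 100
}

// modeFor applies individual targeting first, then the percentage rollout,
// and falls back to the old database for everyone else.
func (r rollout) modeFor(userKey string) Mode {
	if r.earlyAdopters[userKey] || bucket(userKey) < r.bothPercent {
		return ModeBoth
	}
	return ModeA
}

func main() {
	rules := rollout{
		earlyAdopters: map[string]bool{"beta-tester-1": true},
		bothPercent:   2,
	}

	both := 0
	for i := 0; i < 10000; i++ {
		if rules.modeFor(fmt.Sprintf("user-%d", i)) == ModeBoth {
			both++
		}
	}
	fmt.Println("beta-tester-1:", rules.modeFor("beta-tester-1"))
	fmt.Printf("%d of 10000 synthetic users are in both mode\n", both)
}
```

In a real setup the flag service evaluates these rules and the application only reads the resulting variation, so changing the percentage requires no deploy.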
The highlights here are, with little investment in tooling, we're able to make it easier to roll this out to multiple call sites on our application without having to introduce error-prone code spread out all over the place. It's designed for easy cleanup. And the big thing really, is using the flags and being very gradual about our changes to increase our peace of mind. That immediate control that's inside your application, the ability to turn it off within a second, combined with the metrics and the verification that the results are the same between two databases, really makes it possible to do this kind of work without very much fear.
And of course, the bottom line, our customers had no idea we were doing this migration. That's really what our goal was. We were able to test out the new database in production, find problems with it. There were lots of performance issues we needed to optimize and our customers were never the wiser that we were doing this. So, that's our story. Thank you.