The adventure of changing the storage backend in HashiCorp Vault

In BESTSELLER Tech, we rely on HashiCorp Vault to store and protect our secrets. It is obvious to anyone working in tech that such a system is critical to the daily operations and security of the organization. Therefore, when the Engineering Services team decided to poke around in our Vault instance, it became quite the adventure.

Vault is a product in the HashiCorp suite. It is a secrets manager where applications can fetch and store the keys to the castle. In BESTSELLER Tech, we deploy our Vault instance in Kubernetes. HashiCorp provides a cloud offering of Vault, which includes some nifty features for backup and restore, but with a price tag to match.

We had been using Vault for some time, and it was due for a service check. So we kicked the tires and scraped off some rust to see if the critical parts still held up in a pressure test.

We asked ourselves - Do we trust the backup?

The disturbing collective answer was no! So we sat down and had a serious talk about what we needed to do to restore our confidence in the recovery process. At that time, the backup relied on a dump of the newest version of every secret in Vault into one password-protected file. This solution was problematic in itself, but the real issue was that Vault holds more than just secrets. It also holds a complex policy structure defining what the individual teams in the organization can access, and this structure was not part of the backup.

We considered our options: Pay the price and go to HashiCorp Cloud, use Banzai Cloud to back up Vault, or try to improve the current implementation. After some sniffing around, we stumbled upon Raft.

Raft is a storage backend for your Vault instance that is officially supported by HashiCorp. Its selling points are high availability and scalability: your secrets are replicated to every Vault node, so if one node goes down, another will take the lead. This was in itself a major improvement over our round-robin load balancing, which would hit the dead node every other time. But the real benefit of Raft was that it enabled us to use snapshots. These snapshots are a complete picture of our Vault instance, including access policies, secrets and their previous versions, and they give us the ability to do a nearly one-click restore with minimal downtime. If you are sitting there thinking, "I must know how this Raft thing works," have a look here.

So Raft it is, what now then?

Thankfully, HashiCorp provides a great migration guide on how to change your storage backend to what HashiCorp calls Integrated Storage. In our case, the Vault instance relied on a single Google Cloud Storage bucket to store Vault data. With Vault integrated storage, the data is replicated to every node, each with its own persistent storage. This lets us scale horizontally with more nodes for future growth, removes the dependency on a single storage bucket, and makes the Vault cluster more resilient to node failures.

We are deploying Vault in Kubernetes using the Vault Helm chart through Terraform, and we quickly realized that the world hadn't stopped turning since we first deployed our Vault instance:

  • The Ingress for Vault was now part of the Helm chart and leveraged Service Discovery to always hit the active Vault node in the cluster

  • Security Context was added, so we could actually follow the container security practices that we ourselves preached

  • Raft support was added for the High Availability mode

Implementing these changes became a prerequisite for the migration, elevating both the security and the user experience of Vault in BESTSELLER. Since the existing Ingress was deployed separately from the Vault deployment and had no good way to serve only healthy Vault nodes, the Ingress configuration provided in the Vault Helm chart was really a must-have.
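
For context, this is roughly how one might inspect the newer chart versions and their values before touching anything. The repository URL is the public HashiCorp one; the commands are a sketch, not our exact process:

# Add the public HashiCorp Helm repository and refresh the index
helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update

# See how many chart versions we had fallen behind
helm search repo hashicorp/vault --versions

# Inspect the values of the newer chart: the Ingress, Security Context and
# Raft options all show up here before anything is applied to the cluster
helm show values hashicorp/vault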

Cool - We have a test environment, does it reflect the production instance? Well...

No, not really, but thanks to the modern wonders of Infrastructure as Code, it was a breeze to deploy the latest release of our Vault environment with Terraform. Now we could get started with testing the steps towards an integrated storage backend for Vault.
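
Rebuilding the environment was, in essence, just another Terraform run. A minimal sketch, assuming the Vault environment lives in its own Terraform directory with a dedicated test workspace (directory and workspace names are hypothetical):

# Hypothetical directory holding the Vault Terraform configuration
cd vault-environment

# Initialise, select the test workspace and roll out the latest release
terraform init
terraform workspace select test
terraform plan -out=vault-test.plan
terraform apply vault-test.plan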

First off, the migration guide expects that the folders for the new storage destination already exist, which they didn't... We took a good look at the Vault Helm chart to figure out how to trigger the creation of the Persistent Volume mounts in our Kubernetes cluster while still keeping the old configuration for the migration ahead.

We found out that if we twisted the Helm chart slightly, we could trick the HA configuration into creating the data mounts. If we added server.ha.raft.enabled = true and server.ha.raft.nodeId = true, along with an extra config block called raft, as shown in the snippet below, it would hit this if statement in the chart and create the data mount that we needed for the Raft migration.

ha:
  config: |
    ui = true

    storage "gcs" {
      bucket        = "bucket"
      ha_enabled    = "true"
    }

  raft:
    config: |
      ui = true

      storage "gcs" {
        bucket        = "bucket"
        ha_enabled    = "true"
      }

Let's see what breaks!

With all the migration prerequisites done, we were ready for a test run on our test environment.

Vault is deployed as a Statefulset in Kubernetes, which I personally didn't have much experience with. So when I discovered that enabling Raft in the Vault Helm chart was rejected by the running Statefulset, I found myself knee-deep in Stack Overflow posts about how to change a running Statefulset, and the answer was - You don't. As with everything in tech, there is, of course, a workaround. We leveraged the --cascade=orphan flag for kubectl delete statefulset, which deletes the Statefulset but not the running pods. Then you can execute your own manual rolling update: deploy the new Statefulset and kill the old pods one by one until all of them have been replaced with the new Raft settings and configuration.
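
In practice, the workaround looks roughly like this (namespace, release and pod names below are assumptions, not our exact setup):

# Delete the Statefulset object but leave the Vault pods running
kubectl delete statefulset vault -n vault --cascade=orphan

# Re-deploy the Statefulset with the new Raft settings (Helm chart via Terraform),
# then replace the old pods one at a time so they pick up the new spec
kubectl delete pod vault-0 -n vault
# ...wait for the pod to come back up (and unseal it if needed) before moving on
# to vault-1, vault-2 and so on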

Our test environment was now ready for the actual migration from the Google Cloud Storage bucket into persistent storage in Kubernetes. We started out by creating the migration HCL (HashiCorp Configuration Language) script, which contained the current storage_source and the new storage_destination, along with the address of the Vault cluster, cluster_addr.

storage_source "gcs" {
  bucket = "bucket"
}

storage_destination "raft" {
  path = "/vault/data/"
  node_id = "vault"
}

cluster_addr = "https://vault:port"

Now, with root-level access in Vault, we could execute the migration by running vault operator migrate -config=raftmigration.hcl. This operation copies all of your Vault data to the new storage backend. We then removed the duplicated ha configuration so that it only contained the raft config. This leaves Vault with only one active Raft member, which you can confirm with vault operator raft list-peers, listing the members of the Raft cluster. It is now our job to add the remaining members by hand, by running vault operator raft join {leader api_addr} in the pods we want to join the Raft cluster. This is only needed if you forget to add retry_join to the ha:raft config before your migration...
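
Put together, the sequence looks roughly like this (pod names and the leader address are placeholders based on the Vault Helm chart's naming, not our exact values):

# Copy all Vault data from the GCS bucket into the Raft storage backend
vault operator migrate -config=raftmigration.hcl

# After switching the Helm configuration to Raft only, confirm the single member
vault operator raft list-peers

# Join the remaining pods by hand (not needed if retry_join is configured)
kubectl exec -n vault vault-1 -- vault operator raft join https://vault-0.vault-internal:8200
kubectl exec -n vault vault-2 -- vault operator raft join https://vault-0.vault-internal:8200

# Depending on the seal type, the newly joined nodes may still need to be unsealed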

Our development Vault instance was now completely migrated to run as a Raft cluster. We tested out the snapshot functionality, which turned out to be a two-command operation to back up and restore the complete Vault with everything (access, policies and secrets). This was the main reason we chose Raft, so it was very satisfying to see it working so smoothly.
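
For reference, the two commands in question are roughly these (the snapshot file name is just an example):

# Save a complete snapshot of the Raft cluster: secrets, versions and policies included
vault operator raft snapshot save vault-backup.snap

# Restore the entire Vault instance from that snapshot
vault operator raft snapshot restore vault-backup.snap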

Migration day!

The day had come - Everything was tested, stakeholders were notified regarding the downtime, and we were all set to go!

The team sat down after work hours, pizza on the way, and a cold beer waiting as the reward for a successful migration at the end. Making the initial changes went pretty smoothly. But when it came to executing the migration script, it quickly became clear that we had only tested with a small amount of data, and that our decision to do the migration after work hours was a good call. While the migration runs, you can neatly follow along in your terminal while eating pizza, and we saw a lot of stuff in our production Vault that needed cleaning up afterwards.

After what felt like an eternity, the migration was done, and we got the Raft cluster set up. We tested the snapshot backup in our production environment and confirmed that everything was working as intended. We sat back and enjoyed our well-earned cold reward from the fridge.

Key takeaways

  • Having a production-like test environment to make mistakes in (which we did) was the key to being able to perform such a major change in a business-critical system
  • Don't let your test environment be a hollow shell. Populate it to simulate some of the load on production
  • Infrastructure as Code is your friend. Break it and recreate it!
  • Put in the time to upgrade. Our road to the latest version was long
  • Test your backup. Many people rely on it!

About the author

Dennis Christensen

My name is Dennis Christensen, and I work as a Systems Engineer in BESTSELLER Tech. I mostly focus on Infrastructure as Code, Cloud and automation, and generally on improving the tools of the developers in BESTSELLER.