Migrate a PostgreSQL Database in Kubernetes with Helm Hooks
Introduction
Kubernetes has become an incredible tool for cloud application development and delivery. It has never been easier to create scalable, resilient and accessible applications. However, there are still some components of an enterprise-level application that can be particularly challenging to run in Kubernetes. One such component is a database. The evolution of Kubernetes has led database vendors to adapt their databases to run in a K8s cluster, but it's not always a smooth process.
In our application, we use Keycloak, which is an identity and access management service. Keycloak uses PostgreSQL, although it is not limited to it, to store essential data. It is worth mentioning that our application, along with its dependencies, is packaged in a Helm chart, as we use Helm to deploy our application in a K8s cluster. Initially we used a standalone version of PostgreSQL, i.e. only one PostgreSQL pod was up. Naturally this caused problems down the line as our application received more traffic, so we decided to change our PostgreSQL dependency to PostgreSQL HA. This meant we could not just run a Helm upgrade and expect everything to be in a ready state. We had to back up our data, delete any old PostgreSQL instance and artifacts running in the cluster, then run the Helm upgrade and restore the data.
That means a lot of manual intervention, and with it the probability of something going wrong. This Helm chart is also used by our users to deploy the application in their own clusters, and we wanted the upgrade to run without them executing any of the above steps.
Design Overview
After considering different options, we decided the best approach for us was to use Helm hooks and Kubernetes Jobs to perform this migration.
A Helm hook is a mechanism to intervene at certain points in a release's life cycle. For example, if we want to create some Kubernetes objects before our application is deployed, we can use a Helm "pre-install" hook to create such objects. There are a number of hooks available and you can learn more about them here. In the context of our problem we will be focusing on the "pre-upgrade" and "post-upgrade" Helm hooks. We will be combining these hooks with K8s Jobs. A K8s Job is a mechanism that can be used to run a certain task to completion.
When you combine the pre-upgrade Helm hook with a K8s Job, Helm will run that Job before upgrading any K8s component related to our application. Similarly, the post-upgrade Job will run after the Helm upgrade is completed. It is worth noting that if either of these Jobs fails for any reason, the Helm upgrade will fail, and depending on preference a rollback can be initiated.
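To give a feel for the mechanism, this is roughly what a hook resource looks like; the name, image and command below are placeholders rather than our actual chart:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: my-pre-upgrade-job              # placeholder name
  annotations:
    "helm.sh/hook": pre-upgrade         # run this Job before the release is upgraded
    "helm.sh/hook-weight": "1"          # lower weights run first
spec:
  backoffLimit: 0                       # a failed Job makes the Helm upgrade fail
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: task
          image: busybox                # placeholder image
          command: ["sh", "-c", "echo running before the upgrade"]
```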
Diving deeper
Pre-upgrade
This stage takes care of backing up the data and cleaning up the old PostgreSQL K8s objects.
We first create a PVC that will hold the backup of the old PostgreSQL instance.
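A minimal sketch of such a claim, assuming the PVC itself is created through a pre-upgrade hook with a lower weight so it exists before the backup job runs; the name and size are placeholders:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgresql-backup-pvc           # placeholder name
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "0"          # created before the backup Job (weight "1")
    # only replace the previous claim when the hook runs again,
    # so the backup survives the rest of the upgrade
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 8Gi                      # size depends on the database
```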
After this, we run the pre-upgrade job. This job deploys a pod that runs a Python script (a sketch of the Job manifest follows this list). The script does the following:
- It checks whether an upgrade is necessary, e.g. in case the upgrade was already completed or PostgreSQL HA is already deployed.
- Once it is determined that an upgrade is necessary, it creates a ConfigMap that keeps track of the upgrade process. We will cover this in more detail later on.
- It then performs a backup of the PostgreSQL DB and saves it to the PV we created earlier.
- After a successful backup it deletes the Kubernetes objects related to the old PostgreSQL DB, namely the StatefulSet and persistent volumes. It also scales down Keycloak, so that no calls are made to the database during the migration.
- This concludes the pre-upgrade process and Helm can now run the upgrade. In case there is a failure at any point, the upgrade process is terminated and we can check the logs of the pod associated with the pre-upgrade job to determine the cause.
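As an illustration, the pre-upgrade Job might look roughly like this, with the backup step shown as a plain pg_dumpall call instead of our Python script for brevity; the image, service and secret names are assumptions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: postgresql-pre-upgrade
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "1"
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: backup
          image: bitnami/postgresql:latest          # any image with the Postgres client tools
          env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgresql                  # assumed secret of the old chart
                  key: postgresql-password
          # Dump the old standalone instance into the backup volume
          command:
            - sh
            - -c
            - pg_dumpall -h postgresql -U postgres -f /backup/all.sql
          volumeMounts:
            - name: backup
              mountPath: /backup
      volumes:
        - name: backup
          persistentVolumeClaim:
            claimName: postgresql-backup-pvc
```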
Post-upgrade
On completing the upgrade, Helm invokes the post-upgrade job (a sketch follows the list). The pod associated with this job runs an init container whose task is to check whether the new PostgreSQL pods are up and running, after which a Python script is executed. It will:
- Check if the pre-upgrade job was run and whether a restore is necessary.
- If a restore is required, it'll restore the old database and scale up Keycloak.
- It'll also update the ConfigMap to record that the upgrade is completed.
- Similar to the pre-upgrade job, if there is a failure, the Helm upgrade will fail. It is then up to the user to decide whether to re-run the Helm upgrade or roll back.
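The shape of the post-upgrade Job is similar; here the wait and restore steps are sketched as plain shell commands rather than our actual init container and Python script, and the pgpool service, secret and image names are assumptions based on a typical PostgreSQL HA chart:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: postgresql-post-upgrade
  annotations:
    "helm.sh/hook": post-upgrade
    "helm.sh/hook-weight": "1"
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      initContainers:
        # Block until the new PostgreSQL HA endpoint accepts connections
        - name: wait-for-postgresql
          image: bitnami/postgresql:latest
          command:
            - sh
            - -c
            - until pg_isready -h postgresql-ha-pgpool -p 5432; do sleep 5; done
      containers:
        - name: restore
          image: bitnami/postgresql:latest
          env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgresql-ha-postgresql    # assumed secret name
                  key: password
          # Replay the dump taken by the pre-upgrade Job
          command:
            - sh
            - -c
            - psql -h postgresql-ha-pgpool -U postgres -f /backup/all.sql
          volumeMounts:
            - name: backup
              mountPath: /backup
      volumes:
        - name: backup
          persistentVolumeClaim:
            claimName: postgresql-backup-pvc
```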
Ensuring state consistency
Our challenge was to make this upgrade as safe as possible. This means no data loss and no chance of the upgrade process leaving our K8s cluster in a bad state. We achieved this by keeping track of our progress during the upgrade via a ConfigMap. If for some reason the pre-upgrade job succeeds but the post-upgrade job fails, we can run the Helm upgrade again and the post-upgrade job will know from the ConfigMap that a backup exists and proceed with restoring the data. The ConfigMap is also useful for skipping these upgrade jobs in case the upgrade was already performed earlier.
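The ConfigMap itself can be as simple as a couple of flags that the jobs read and update through the Kubernetes API; the name and keys here are hypothetical:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgresql-migration-state     # hypothetical name
data:
  backup-completed: "true"             # set by the pre-upgrade job after a successful dump
  restore-completed: "false"           # flipped to "true" by the post-upgrade job
```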
Advantages
- The key advantage of this whole exercise was that we did not need an operator or any manual intervention to perform a safe migration.
- The migration process is tracked from start to finish.
- We can reuse this framework to easily run more complex database upgrade scenarios like changing database vendors.
Always be cautious
- Kubernetes resources that use Helm hooks, in our case Jobs, ConfigMaps, etc., are not managed by the Helm release lifecycle. So we have to be careful about how these are created and destroyed (see the annotation sketch after this list).
- This is a solution for a niche problem that we had, and the pattern can be used to solve various other problems. However, there are many emerging technologies for managing highly available database instances that take care of backups, scaling and disruptions.
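For instance, the hook-delete-policy annotation is the main lever for cleanup; it tells Helm when to remove a hook resource. A typical choice (illustrative, not necessarily what your chart needs) is:

```yaml
metadata:
  annotations:
    "helm.sh/hook": pre-upgrade
    # remove the Job once it succeeds, and clear any leftover copy from a
    # previous run before launching a new one
    "helm.sh/hook-delete-policy": hook-succeeded,before-hook-creation
```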
Conclusion
We came up with a pattern that can solve data migration and problems of a similar nature when an application is deployed in Kubernetes using Helm. It is simple, extensible and only requires tools we are already using, i.e. Helm and K8s objects.