NetApp migration manual
The following steps describe what customers need to do to point their configuration at the newly created storage class. These steps may not apply if you always use the default storage class and never set one explicitly in your objects.
ArgoCD
Because the storage class name has changed, you will have to update it in the git repositories used by ArgoCD. The examples below show what needs to change on a PVC. Customers who use multiple storage classes, for example classes that carry snapshot settings, need to check the cluster or the ArgoCD output for the new storage class names.
From:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: examplepvc
spec:
  ...
  storageClassName: 'netapp-premium'
  resources:
    requests:
      storage: 1Gi
To:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: examplepvc
spec:
  ...
  storageClassName: 'true-nfs'
  resources:
    requests:
      storage: 1Gi
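If many manifests in a repository reference the old storage class, the change shown above could be applied in bulk. The following is a minimal sketch, assuming a grep and sed that support -r, --include and -E (GNU and BSD versions do). The function name replace_storage_class and the directory argument are hypothetical; netapp-premium and true-nfs are the names from the example above. It only touches lines where the old name is the whole storageClassName value, so templated values (Helm, Kustomize) are not covered.

```shell
#!/bin/sh
# Hypothetical helper: rewrite an explicit storageClassName in every YAML
# manifest below a directory. Arguments: <directory> <old-class> <new-class>
replace_storage_class() {
  dir="$1"; old="$2"; new="$3"
  # Find manifests where the old class is the full value of storageClassName
  # (optionally wrapped in single or double quotes).
  grep -rlE --include='*.yaml' --include='*.yml' \
    "storageClassName:[[:space:]]*['\"]?${old}['\"]?[[:space:]]*\$" "$dir" |
  while IFS= read -r file; do
    # -i.bak works with both GNU and BSD sed; the backup is removed afterwards.
    sed -i.bak -E \
      "s/(storageClassName:[[:space:]]*['\"]?)${old}(['\"]?[[:space:]]*)\$/\1${new}\2/" \
      "$file"
    rm -f "$file.bak"
    echo "updated: $file"
  done
}
```

For example, run replace_storage_class ./manifests netapp-premium true-nfs (the path is illustrative) and review the result with git diff before committing. To see which storage class names exist in the cluster, kubectl get storageclass lists them.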
Statefulsets
Beware of operators and ArgoCD auto-heal functionality
These instructions assume that the StatefulSets you are recreating with the new storage class are not controlled by an operator, ArgoCD's auto-heal (self-heal) function, or anything else that could recreate the StatefulSet before your own commands do. Check for such systems and disable them before you proceed, so they cannot recreate it first.
If something like this does control the StatefulSet, also make sure the storage class is set correctly there, either before or after running these steps.
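One way to look for such controllers is to inspect the metadata of the StatefulSet. This is a rough sketch rather than a complete check: show_sts_managers is a hypothetical helper, and what shows up depends on the tooling in your cluster. Operators typically set ownerReferences on the objects they create, and ArgoCD typically marks tracked resources with an app.kubernetes.io/instance label or an argocd.argoproj.io/tracking-id annotation.

```shell
#!/bin/sh
# Hypothetical helper: print hints about what manages a StatefulSet.
# Arguments: <namespace> <statefulset-name>
show_sts_managers() {
  ns="$1"; name="$2"
  # Kinds of the owning objects, if any (usually set by operators).
  printf 'owners: '
  kubectl get statefulset "$name" -n "$ns" \
    -o jsonpath='{.metadata.ownerReferences[*].kind}{"\n"}'
  # Labels and annotations often carry ArgoCD or Helm tracking information.
  printf 'labels: '
  kubectl get statefulset "$name" -n "$ns" \
    -o jsonpath='{.metadata.labels}{"\n"}'
  printf 'annotations: '
  kubectl get statefulset "$name" -n "$ns" \
    -o jsonpath='{.metadata.annotations}{"\n"}'
}
```

An empty owners line together with no ArgoCD or Helm markers suggests, but does not prove, that nothing else manages the object.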
For an existing StatefulSet (STS), there is no direct way to change the storageClassName it defines. Instead, you must recreate the STS while retaining its existing Pods.
The storage class name changes for StatefulSets as well. This affects you if the storageClassName is defined explicitly in the YAML of the StatefulSet, like so:
...
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: rabbitmq
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 8Gi
      storageClassName: netapp-premium
      volumeMode: Filesystem
...
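To find out which StatefulSets set the class explicitly in this way, the volumeClaimTemplates can be queried from the cluster. A minimal sketch, assuming you are allowed to list StatefulSets in all namespaces; find_sts_using_class is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: list StatefulSets whose volumeClaimTemplates name a
# given storage class. Argument: <storage-class-name>
find_sts_using_class() {
  class="$1"
  # One line per StatefulSet: namespace/name, a tab, then the class names
  # used by its volumeClaimTemplates (space-separated).
  kubectl get statefulsets --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.volumeClaimTemplates[*].spec.storageClassName}{"\n"}{end}' |
  awk -F '\t' -v class="$class" '{
    n = split($2, used, " ")
    for (i = 1; i <= n; i++) if (used[i] == class) { print $1; break }
  }'
}
```

For example, find_sts_using_class netapp-premium prints one namespace/name per matching StatefulSet. Without cluster-wide access, the same query can be run per namespace by replacing --all-namespaces with -n and a namespace.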
To change the storage class, delete the StatefulSet object without deleting the Pods behind it:
- Get a copy of the StatefulSet YAML beforehand. This is very important.
- Edit the saved YAML so that it reflects the new storageClassName.
- Delete the StatefulSet object and recreate it with the correct settings (this does not delete the Pods). Pay close attention to the --cascade=orphan flag: it instructs Kubernetes to delete the STS resource without terminating the existing Pods.
Now the StatefulSet should be successfully recreated with the correct storageClassName in the template. The existing Pods will still be managed by the STS, and new PVCs of the StatefulSet will use the correct storage class.
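The steps above can be sketched as one shell function. This is an illustration under stated assumptions, not the exact commands from the original manual: recreate_sts_with_class and the file names are hypothetical, and the sed expression assumes the old class name is the whole storageClassName value. The kubectl subcommands and the --cascade=orphan flag are standard kubectl.

```shell
#!/bin/sh
# Hypothetical helper: recreate a StatefulSet with a new storage class while
# leaving its Pods running. Arguments: <namespace> <name> <old-class> <new-class>
recreate_sts_with_class() {
  ns="$1"; name="$2"; old="$3"; new="$4"
  file="${name}-sts.yaml"

  # 1. Save the current definition first and keep an untouched copy.
  kubectl get statefulset "$name" -n "$ns" -o yaml > "$file" || return 1
  cp "$file" "$file.orig"

  # 2. Point the volumeClaimTemplates at the new storage class.
  sed -i.bak -E \
    "s/(storageClassName:[[:space:]]*['\"]?)${old}(['\"]?[[:space:]]*)\$/\1${new}\2/" \
    "$file"
  rm -f "$file.bak"

  # 3. Delete only the StatefulSet object; --cascade=orphan keeps the Pods.
  kubectl delete statefulset "$name" -n "$ns" --cascade=orphan || return 1

  # 4. Recreate it from the edited file so it manages the orphaned Pods again.
  kubectl apply -f "$file"
}
```

For example: recreate_sts_with_class my-namespace rabbitmq netapp-premium true-nfs (the namespace is illustrative). You may want to review the saved file, and optionally strip server-populated fields such as status, before step 4. Afterwards, kubectl get statefulset and kubectl get pods in the same namespace should show the recreated object and the original Pods.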
FAQ
Why are we moving to another storage device?
The current NetApp storage device is being decommissioned, so we are moving to a new NetApp storage device.
Will the pricing change on the new storage?
No, the pricing will stay the same when the data has been moved.
Will the performance be the same?
Yes. In all of our tests, the new NetApp performed the same as the old one.
Will snapshots be the same?
Snapshots will work the same way on your side. You can still recover certain files from the snapshots, and we can still recover PVCs if the volume was removed less than a week ago.
What are statefulsets?
A StatefulSet is a Kubernetes API object designed, as the name suggests, for stateful workloads. This makes it different from, for example, a regular Deployment. A StatefulSet usually has a PVC for each Pod, so it resembles a cluster in which every Pod has its own 'disk' to write to. For the same reason, a StatefulSet is mostly used with multiple replicas. With a single replica you might run into issues, such as the Pod preferring the node it previously ran on when it restarts.
For a more detailed explanation, see https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/ in the official Kubernetes docs.
What is the 'StatefulSet termination grace period'?
In the email, we asked you to review the termination grace period (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination), specifically for your StatefulSets. The reason is that when we scale down your workloads to move the data, we do not take terminationGracePeriodSeconds into account. This setting determines how long a workload gets to shut down gracefully before it is forcefully killed. By default it is 30 seconds, after which a SIGKILL signal is sent to the process. For most web workloads handling basic files this is acceptable, because they do not need more time to shut down properly. Stateful workloads, however, may not cope well with such a short time and can end up in a recovery procedure when they are killed after those 30 seconds. We therefore strongly urge you to raise this setting to at least 10 minutes (600 seconds), or preferably 30 minutes (1800 seconds), before we start the migration, so that downtime is not extended by stateful workloads running a recovery procedure after the data migration.
Below is an example YAML file showing the location of the setting in a StatefulSet:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    elasticsearch.k8s.elastic.co/cluster-name: quickstart
    elasticsearch.k8s.elastic.co/statefulset-name: quickstart-es-default
  name: quickstart-es-default
  namespace: elastic-system
spec:
  ...
    spec:
      ...
      terminationGracePeriodSeconds: 180
      ...
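For a StatefulSet that you manage directly, the setting could be changed with a patch on the Pod template. A minimal sketch, assuming nothing else (an operator or ArgoCD) owns the object; set_grace_period is a hypothetical helper name. If the StatefulSet is generated by an operator or synced from git, the value has to be changed in that source instead.

```shell
#!/bin/sh
# Hypothetical helper: set terminationGracePeriodSeconds on the Pod template
# of a StatefulSet. Arguments: <namespace> <name> <seconds>
set_grace_period() {
  ns="$1"; name="$2"; seconds="$3"
  # A merge patch changes only the listed field and leaves the rest as-is.
  kubectl patch statefulset "$name" -n "$ns" --type merge \
    -p "{\"spec\":{\"template\":{\"spec\":{\"terminationGracePeriodSeconds\":${seconds}}}}}"
}
```

For example: set_grace_period my-namespace rabbitmq 600 (names are illustrative). Note that changing the Pod template causes the StatefulSet to roll out its Pods again under the default RollingUpdate strategy, so plan the change accordingly.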
Why do I need to make backups for my statefulsets?
We ask you to make backups of StatefulSets because they usually run databases and other stateful workloads. Many different ones are in use, such as MySQL, PostgreSQL, MongoDB, and Elasticsearch, and each requires its own recovery plan. For NetApp volumes we do have snapshots available (enabled by default), but these are not the same as, for example, a database backup. A snapshot is an exact copy of the files on 'disk' at a given moment. A backup is something like a MySQL dump: a file produced for that application from which the data can easily be restored.
We move the data of these workloads by removing all of the Pods at the same time, which gives the least chance of data structures getting corrupted. However, especially in a multi-replica StatefulSet, we cannot completely rule out corruption. We therefore ask you to create backups beforehand, so there is always stable data to recover from if something happens that we cannot recover from ourselves. We think the chance of this happening is very low, but we still want to be sure that recovery is possible from the customer side.