---
title: "Changing Service Mesh"
description: "How we swapped Istio for Linkerd with hardly any downtime"
date: 2021-05-04T20:37:13+02:00
draft: false
author: Frode Sundby
tags: [istio, linkerd, LoadBalancing]
---

## Why change?
With an ambition of making our environments as secure as possible, we jumped on the service-mesh bandwagon in 2018 with Istio 0.7 and have stuck with it since.

Istio is a large and feature-rich system that brings capabilities aplenty.
Although there are a plethora of nifty and useful things we could do with Istio, we've primarily used it for mTLS and authorization policies.

One might think that having lots of features available but not using them couldn't possibly be a problem.
However, all these extra capabilities come with a cost - namely complexity - and we've felt encumbered by this complexity every time we've configured, maintained, or troubleshot our clusters.
Our suspicion was that since we hardly used any of the capabilities, we could probably make do with a much simpler alternative.
So, after yet another _"Oh... This problem was caused by Istio!"_ moment, we decided the time was ripe to consider the alternatives out there.

We looked to the grand ol' Internet for alternatives and fixed our gaze on the rising star Linkerd 2.
Having homed in on our preferred candidate, we decided to take it for a quick spin in a cluster and found our suspicions to be accurate.

Rarely has a meme depicted a feeling more strongly.

Even though we'd invested a lot of time and built quite a bit of Istio into our platform, we knew we had to make the change.

## How did we do it?
### Original architecture
Let's first have a quick look at what we were dealing with:

The first thing an end user encountered was our Google LoadBalancer, configured by an IstioOperator.
The traffic was then forwarded to the Istio IngressGateway, which in turn sent it along via an mTLS connection to the application.
Before the IngressGateway could reach the application, both NetworkPolicies and AuthorizationPolicies were required to allow the traffic.
We used an [operator](https://github.com/nais/naiserator) to configure these policies when an application was deployed.

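To give a sense of what this looked like, here's a minimal sketch of the kind of AuthorizationPolicy the operator would generate - the names, labels, and the assumption that the gateway ran in `istio-system` are all hypothetical:
```yaml
# Hypothetical sketch of an operator-generated policy allowing the
# ingress gateway to reach an application. Names and labels are made up;
# the gateway is assumed to live in the istio-system namespace.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: <application-name>
  namespace: <application-namespace>
spec:
  selector:
    matchLabels:
      app: <application-name>
  rules:
    - from:
        - source:
            namespaces: ["istio-system"]
```
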
### New LoadBalancers and ingress controllers
Since our LoadBalancers were configured by (and sent traffic to) Istio, we had to change the way we configured them.
Separating load balancing from the mesh is a healthy separation of concerns that will give us greater flexibility in the future as well.
We also had to swap out the Istio IngressGateway for an Ingress Controller - we opted for NGINX.

We started by creating IP addresses and Cloud Armor security policies for our new LoadBalancers with [Terraform](https://www.terraform.io/).

The LoadBalancers themselves were created by an Ingress object:
```yaml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    networking.gke.io/v1beta1.FrontendConfig: <tls-config>
    kubernetes.io/ingress.global-static-ip-name: <global-ip-name>
    kubernetes.io/ingress.allow-http: "false"
  name: <loadbalancer-name>
  namespace: <ingress-controller-namespace>
spec:
  backend:
    serviceName: <ingress-controller-service>
    servicePort: 443
  tls:
    - secretName: <kubernetes-secret-with-certificates>
```

We tied the Cloud Armor security policy to the LoadBalancer with a `BackendConfig` on the Ingress Controller's service:
```yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    cloud.google.com/app-protocols: '{"https": "HTTP2"}'
    cloud.google.com/backend-config: '{"default": "<backendconfig-name>"}'
    cloud.google.com/neg: '{"ingress": true}'
  ...
---
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: <backendconfig-name>
spec:
  securityPolicy:
    name: <security-policy-name>
  ...
```

Alrighty. We'd now gotten ourselves a brand new set of independently configured LoadBalancers and a shiny new Ingress Controller.

However - if we'd started shipping traffic to the new components at this stage, things would have broken, as there were no Ingresses in the cluster - only VirtualServices.
To avoid downtime, we created an interim Ingress that forwarded all traffic to the Istio IngressGateway:
```yaml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: <interim-ingress-name>
spec:
  rules:
  - host: '<domain-name>'
    http:
      paths:
      - backend:
          serviceName: <istio-ingressgateway-service>
          servicePort: 443
        path: /
  ...
```
With this Ingress in place, we could reach all the existing VirtualServices exposed by the Istio IngressGateway via the new LoadBalancers and NGINX.
And we could point our DNS records to the new rig without anyone noticing a thing.

### Migrating workloads from Istio to Linkerd
Once load balancing and ingress traffic were closed chapters, we turned our attention to migrating workloads from Istio to Linkerd.
When moving a workload to a new service mesh, there's a bit more to it than just swapping out the sidecar via a new namespace annotation.
Our primary concerns were:
- The new sidecar would require NetworkPolicies allowing traffic to and from Linkerd (see the sketch after this list).
- The application's VirtualService would have to be transformed into an Ingress.
- Applications that used [scuttle](https://github.com/redboxllc/scuttle) to wait for the Istio sidecar to be ready had to have it disabled.
- We couldn't possibly migrate all workloads simultaneously due to scale.
- Applications have to communicate, but they can't when they're in different service meshes.

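To illustrate the first point, here's a rough sketch of the kind of per-namespace change involved. The annotation is Linkerd's standard injection switch; the NetworkPolicy and all names are hypothetical, and a real policy would need more than just the control-plane rules shown here:
```yaml
# Opt the namespace's workloads into Linkerd's sidecar injection.
apiVersion: v1
kind: Namespace
metadata:
  name: <application-namespace>
  annotations:
    linkerd.io/inject: enabled
---
# Hypothetical NetworkPolicy letting pods talk to (and be reached by)
# the Linkerd control plane, matched via Linkerd's namespace label.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-linkerd
  namespace: <application-namespace>
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              linkerd.io/control-plane-ns: linkerd
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              linkerd.io/control-plane-ns: linkerd
```
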
Since applications have a tendency of communicating with each other, and communication between different service meshes was a bit of a bother, we decided to migrate workloads based on who they were communicating with to avoid causing trouble.
Using the NetworkPolicies to map out who was communicating with whom, we found a suitable order to migrate workloads in.

We then gave this list to our [operator](https://github.com/nais/naiserator), which in turn removed Istio resources, updated NetworkPolicies, created Ingresses, and restarted workloads.
Slowly but surely our Linkerd dashboard started to populate, and the only downtime was the few seconds it took for the first Linkerd pod to be ready.
One thing we didn't take into consideration (but should have) was that some applications shared a hostname.
When an Ingress was created for a shared hostname, NGINX would stop forwarding requests for that host to the Istio IngressGateway, resulting in non-migrated applications not getting any traffic.
Realizing this, we started migrating applications that shared a hostname simultaneously too.

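For illustration, an operator-generated Ingress replacing an application's VirtualService might have looked roughly like this, pointing NGINX straight at the application's service instead of at the IngressGateway - all names and the port are hypothetical:
```yaml
# Hypothetical per-application Ingress replacing its VirtualService.
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: <application-name>
  namespace: <application-namespace>
spec:
  rules:
  - host: <application-host>
    http:
      paths:
      - backend:
          serviceName: <application-service>
          servicePort: 80
        path: /
```
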
And within a couple of hours, all workloads were migrated and we had ourselves a brand spanking new service mesh in production.
And then they all lived happily ever after...

Except that we had to clean up Istio's mess.

### Cleaning up
What was left after the party was a fully operational Istio control plane, a whole bunch of Istio CRDs, and a completely unused set of LoadBalancers. In addition, we had to clean up everything related to Istio in a whole lot of pipelines and components.

It has to be said - there is a certain satisfaction in cleaning up after a party that has been going on for too long.