
[EKS][auto mode] Workloads fail due to inability to locate kube-dns service #2546

Closed
truongnht opened this issue Feb 17, 2025 · 11 comments
Labels
EKS Auto Mode · EKS Networking (EKS Networking related issues) · EKS (Amazon Elastic Kubernetes Service)

Comments

@truongnht

In a typical system we deploy, CoreDNS (kube-dns) is installed in the kube-system namespace, and our observability services such as Loki (loki-gateway) and Tempo (tempo-gateway) depend on it to resolve their related services. Now with Auto Mode enabled, we do not install the CoreDNS add-on and therefore run into service name resolution issues.

@mikestef9 added the EKS (Amazon Elastic Kubernetes Service), EKS Networking (EKS Networking related issues), and EKS Auto Mode labels on Feb 18, 2025
@oliviassss

@truongnht hi, thanks for the info. As a workaround, you can create a dummy kube-dns service in your cluster to unblock yourself.
For your application, can you provide more info on how it locates the kube-dns service? Is it through an nslookup or some k8s API?

@wimspaargaren

Hello, I'm facing the same issue. The Loki gateway appears to be essentially an nginx container with a specific config. To resolve DNS for proxy passes within Kubernetes, it sets the resolver directive (by default) as follows:

resolver kube-dns.kube-system.svc.cluster.local.;

I'm not familiar with the Loki repo, but this looks very similar to what appears in the loki-gateway ConfigMap (which contains the nginx conf): https://github.com/grafana/loki/blob/main/production/ksonnet/loki/gateway.libsonnet
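
For context, here's a minimal sketch of how that resolver is typically used in the generated nginx conf; the upstream service name and port below are illustrative, not the exact Loki config. Because proxy_pass uses a variable, nginx resolves the name at request time through the resolver above instead of only once at startup:

http {
  # nginx resolves this hostname once at startup (via the system resolver)
  # and then uses the resulting IP as its DNS server at runtime
  resolver kube-dns.kube-system.svc.cluster.local. valid=30s;

  server {
    listen 8080;

    location /loki/api/v1/push {
      # using a variable forces a runtime DNS lookup through the resolver
      set $distributor loki-distributor.loki.svc.cluster.local;
      proxy_pass http://$distributor:3100;
    }
  }
}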

I've tried your suggestion by applying a dummy kube-dns service (basically a copy-paste of what CoreDNS creates):

apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/port: "9153"
    prometheus.io/scrape: "true"
  labels:
    eks.amazonaws.com/component: kube-dns
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: CoreDNS
  name: kube-dns
  namespace: kube-system
spec:
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 53
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 53
  - name: metrics
    port: 9153
    protocol: TCP
    targetPort: 9153
  selector:
    k8s-app: kube-dns
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

It actually does get an IP, but now I get the following error:

2025/02/20 21:53:47 [error] 9#9: send() failed (111: Connection refused) while resolving, resolver: 172.XX.YY.ZZZ:53

So my assumption is that the app selector is incorrect, but maybe it's something else.

@oliviassss any idea what I should adjust here/could try?

@truongnht
Author

@oliviassss I believe the info provided by @wimspaargaren is pretty complete. To get past the issue, I have enabled the CoreDNS add-on, so our Auto Mode cluster is now running with CoreDNS (kube-dns) enabled.
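
For reference, enabling the add-on via the AWS CLI looks roughly like this (cluster name and region are placeholders):

aws eks create-addon \
  --cluster-name my-cluster \
  --region us-east-1 \
  --addon-name coredns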

@wimspaargaren

Thanks @truongnht, I will do that as well to work around the issue for now.

@oliviassss

@wimspaargaren, @truongnht, thanks for the details. We will reproduce this error internally.

@wimspaargaren I think the issue with the dummy kube-dns service is that it isn't assigned the correct cluster DNS IP for DNS resolution. In Auto Mode, the CoreDNS server listens on the cluster DNS IP on port 53. If the dummy service isn't created with that same IP, DNS resolution will fail.
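
A quick way to check for that mismatch (commands are illustrative; the pod name is a placeholder):

# ClusterIP assigned to the dummy service
kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}{"\n"}'

# nameserver the pods are actually configured with
kubectl exec my-pod -- cat /etc/resolv.conf

If the two IPs differ, the resolver points at an address where nothing is listening on port 53, which matches the "Connection refused" error above.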

@truongnht, once you re-install the CoreDNS add-on, if you then delete the deployment (so the CoreDNS Pods are gone and only the kube-dns service remains), does the loki-gateway still work?

@tomhemmes-rl

@oliviassss We ran into the same issue as @wimspaargaren described above, also for loki-gateway.

When I created a dummy kube-dns service with the ClusterIP set explicitly (I grabbed the IP from another pod's /etc/resolv.conf), loki-gateway was able to connect and boot successfully.
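
The manifest is the same as the one posted above, except spec.clusterIP is pinned explicitly; the IP below is a placeholder, use the nameserver from a pod's /etc/resolv.conf:

spec:
  clusterIP: 172.20.0.10  # placeholder: the cluster DNS IP taken from /etc/resolv.conf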

@oliviassss

Hi all, we are rolling out a fix for this issue in EKS Auto Mode this week: we map the kube-dns FQDN to the cluster DNS IP in the hosts file, so you no longer need to create a service if all you need is to resolve the FQDN.
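
Once the fix has rolled out, a sanity check from a test pod could look like the following (image and pod name are illustrative; note that getent consults /etc/hosts as well as DNS, whereas nslookup queries DNS servers directly):

kubectl run dns-check --rm -it --restart=Never --image=debian:stable-slim -- \
  getent hosts kube-dns.kube-system.svc.cluster.local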

@truongnht
Author

@oliviassss may I know what else is being fixed for EKS Auto Mode? I'm interested to learn more about the releases.

@oliviassss

@truongnht

- 10 March [DONE] - IPv4 egress from IPv6 clusters
- 17 March [rolling out] - the kube-dns service will resolve for Pods on Auto Mode, even if the kube-dns service doesn't exist in the cluster
- 24 March [ECD] - fixes an issue where, if a CoreDNS Pod is running on an Auto Mode node, DNS queries from Pods on that node would hit that CoreDNS Pod instead of the node-local DNS server

@truongnht
Author

@oliviassss thanks, do you have that information published somewhere?

@mikestef9
Contributor

The fix described above has been fully deployed; closing.
