Lab 9.3 issue with CoreDNS

Hello,
I am stuck on lab 9.3, to the point that I've recreated the cluster from scratch (following the k8s docs on teardown), and I still cannot get DNS to work in the nettool pod. I ran kubectl describe on the coredns pods and it shows DNS listening on the cluster IP, port 53. The same IP appears in the nettool pod's /etc/resolv.conf, and yet I get the following:
[email protected]:~$ kubectl create -f sols/s_09/nettool.yaml
pod/nettool created
[email protected]:~$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default nettool 1/1 Running 0 5s
kube-system calico-kube-controllers-5c6f6b67db-b67kz 1/1 Running 0 4m58s
kube-system calico-node-hqbcg 1/1 Running 1 3m54s
kube-system calico-node-tjwzd 1/1 Running 0 4m58s
kube-system coredns-f9fd979d6-k9kc4 1/1 Running 0 8m16s
kube-system coredns-f9fd979d6-z5dtb 1/1 Running 0 8m16s
kube-system etcd-ip-172-31-12-197 1/1 Running 0 8m33s
kube-system kube-apiserver-ip-172-31-12-197 1/1 Running 0 8m33s
kube-system kube-controller-manager-ip-172-31-12-197 1/1 Running 0 8m33s
kube-system kube-proxy-dnqq4 1/1 Running 0 8m17s
kube-system kube-proxy-hc8lh 1/1 Running 1 3m54s
kube-system kube-scheduler-ip-172-31-12-197 1/1 Running 0 8m32s
[email protected]:~$ kubectl exec -it nettool -- /bin/bash
[email protected]:/# apt-get update
Err:1 http://security.ubuntu.com/ubuntu focal-security InRelease
Temporary failure resolving 'security.ubuntu.com'
Err:2 http://archive.ubuntu.com/ubuntu focal InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Temporary failure resolving 'archive.ubuntu.com'
Reading package lists... Done
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-backports/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/focal-security/InRelease Temporary failure resolving 'security.ubuntu.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
[email protected]:/# cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local eu-central-1.compute.internal
options ndots:5
[email protected]:/#
I would appreciate any leads, so I can learn to troubleshoot issues like this myself.
Comments
Hi @rzadzins,
Similar errors have been reported on Kubernetes clusters running on AWS. It typically has to do with firewalls, at the VPC and/or EC2 level.
Are you able to reach ubuntu.com from your node (not from a container)?
Can you compare the resolv.conf of your node vs. the nettool/ubuntu container?
Can you provide details about the coredns ConfigMap object?
kubectl -n kube-system get cm coredns -o yaml
Can you provide the logs of a coredns Pod?
kubectl -n kube-system logs coredns-...
Regards,
-Chris
Hi @chrispokorni
Yes, ubuntu.com resolves from the node. The /etc/resolv.conf on the node has been populated by systemd-resolved:
nameserver 127.0.0.53
options edns0 trust-ad
search eu-central-1.compute.internal
On the pod however, it points to the cluster IP:
[email protected]:/# cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local eu-central-1.compute.internal
options ndots:5
The coredns configmap is pristine:
[email protected]:~$ kubectl -n kube-system get cm coredns -o yaml
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2020-11-17T21:50:49Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:Corefile: {}
    manager: kubeadm
    operation: Update
    time: "2020-11-17T21:50:49Z"
  name: coredns
  namespace: kube-system
  resourceVersion: "193"
  selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
  uid: e78e3cc4-c931-4bc7-ac1a-18af785909fb
The logs of the coredns pods show timeouts:
[email protected]:~$ kubectl -n kube-system logs coredns-f9fd979d6-g5bsd
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[ERROR] plugin/errors: 2 3509682171098248668.7412719447730720321. HINFO: read udp 172.31.4.1:54497->172.31.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 3509682171098248668.7412719447730720321. HINFO: read udp 172.31.4.1:59711->172.31.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 3509682171098248668.7412719447730720321. HINFO: read udp 172.31.4.1:55194->172.31.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 3509682171098248668.7412719447730720321. HINFO: read udp 172.31.4.1:41570->172.31.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 3509682171098248668.7412719447730720321. HINFO: read udp 172.31.4.1:51632->172.31.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 archive.ubuntu.com.eu-central-1.compute.internal. A: read udp 172.31.4.1:35942->172.31.0.2:53: i/o timeout
(that last one is probably me trying to run apt-get update on the pod)
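Since CoreDNS forwards queries to the resolver listed in the node's /etc/resolv.conf (per the forward plugin in the Corefile above), I also tried testing that upstream path from the node itself (a sketch; 172.31.0.2 is the VPC resolver that appears in the logs):

```shell
# From the node, not from a container - query the VPC resolver directly:
dig @172.31.0.2 archive.ubuntu.com +short
```

If this succeeds from the node but CoreDNS times out reaching the same IP, then the pod-to-VPC network path is being dropped, not DNS itself.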
I've run iptables cleanups before creating the cluster and made sure ufw is disabled:
[email protected]:~# ufw status
Status: inactive
Next I checked the VPC, its subnets, and ACLs, but they show all traffic is allowed to all destinations. The subnet in this VPC is 172.31.0.0/20, so the communication that times out above stays within it.
Are the firewall rules built on top of existing default rules in a default VPC? Or did you build a custom/new VPC with an all-open firewall rule?
I have experienced such conflicts in the past when I used default VPCs with default rules as a foundation for my cluster-specific rules. My rules for this class allow all traffic - all protocols, from all sources, to all ports - no restrictions of any kind.
Regards,
-Chris
@chrispokorni I've set up a new AWS account for the sake of the labs, created a new VPC in it with all traffic allowed, and then launched the two new EC2 instances in that VPC and its subnets.
I checked one other thing: overriding the DNS settings in the pod's YAML file puts the Google DNS in the pod's /etc/resolv.conf, and domain resolution then works. I'm not sure where else traffic could be filtered out...
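For reference, the pod-level override mentioned above can be expressed in the pod's manifest like this (a sketch only; dnsPolicy: None with Google's 8.8.8.8, and the ubuntu image name, are assumptions, not necessarily what the lab's nettool.yaml contains):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nettool
spec:
  dnsPolicy: None          # ignore the cluster DNS (10.96.0.10) entirely
  dnsConfig:
    nameservers:
    - 8.8.8.8              # resolve directly against Google's public DNS
  containers:
  - name: nettool
    image: ubuntu:20.04    # image name assumed; the lab's nettool image may differ
    command: ["sleep", "infinity"]
```

This is only useful as a test: it bypasses cluster DNS entirely, so service names like kubernetes.default will no longer resolve from the pod.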
One would expect container DNS to be configured on EC2 instances similarly to other environments. It must be something specific to how AWS handles network and DNS configuration.
Regards,
-Chris
I solved the problem - it was my mistake.
After some more research, I noticed I was getting timeouts for other pod activity too, not only DNS. After a lot of trial and error (and cluster teardowns) I realized I had misread the instructions: there must be zero overlap between the network configuration of the local interfaces and the cluster's pod network (I had somehow been confused by the lab asking to check the IP on the instance's interface). Now DNS is smooth. Thanks for the help and patience!
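For anyone hitting the same issue: with the nodes on 172.31.0.0/20, the pod network CIDR given to kubeadm (and to the CNI plugin) must not overlap that range. A sketch of what that looks like (192.168.0.0/16 is the Calico default and just an example of a non-overlapping range):

```shell
# Node subnet here is 172.31.0.0/20, so pick a pod CIDR that does not overlap it:
sudo kubeadm init --pod-network-cidr=192.168.0.0/16

# The CNI plugin must agree on the same pod CIDR; for Calico, the
# CALICO_IPV4POOL_CIDR value in calico.yaml must match the kubeadm value.
```

With an overlapping CIDR, pod traffic destined for the VPC resolver (172.31.0.2) can be misrouted inside the pod network, which matches the i/o timeouts in the CoreDNS logs.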
PS. One thing that bothered me: why is the CoreDNS service called kube-dns? From what I understood, kube-dns was replaced by CoreDNS.
Hi @rzadzins,
It is common to have one application (foo) exposed via a service of a different name (bar). Exposing the newly introduced CoreDNS application via the kube-dns Service helps with backward compatibility.
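You can see this for yourself on a kubeadm cluster (a sketch):

```shell
# The Service keeps the legacy kube-dns name:
kubectl -n kube-system get svc kube-dns -o wide

# Its selector matches the CoreDNS pods, which carry the legacy label
# k8s-app=kube-dns precisely for this backward compatibility:
kubectl -n kube-system get pods -l k8s-app=kube-dns
```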
Regards,
-Chris
Good point, thank you Chris!