Lab 9.3 issue with coredns
Hello,
I am stuck on Lab 9.3, to the point that I've recreated the cluster from scratch (following the Kubernetes docs on teardown), and I still cannot get DNS to work in the nettool pod. I checked the CoreDNS pod with kubectl describe and it shows it is listening on the cluster IP, port 53. The same IP is found in the nettool pod's /etc/resolv.conf, and yet I get the following:
ubuntu@master:~$ kubectl create -f sols/s_09/nettool.yaml
pod/nettool created
ubuntu@master:~$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default nettool 1/1 Running 0 5s
kube-system calico-kube-controllers-5c6f6b67db-b67kz 1/1 Running 0 4m58s
kube-system calico-node-hqbcg 1/1 Running 1 3m54s
kube-system calico-node-tjwzd 1/1 Running 0 4m58s
kube-system coredns-f9fd979d6-k9kc4 1/1 Running 0 8m16s
kube-system coredns-f9fd979d6-z5dtb 1/1 Running 0 8m16s
kube-system etcd-ip-172-31-12-197 1/1 Running 0 8m33s
kube-system kube-apiserver-ip-172-31-12-197 1/1 Running 0 8m33s
kube-system kube-controller-manager-ip-172-31-12-197 1/1 Running 0 8m33s
kube-system kube-proxy-dnqq4 1/1 Running 0 8m17s
kube-system kube-proxy-hc8lh 1/1 Running 1 3m54s
kube-system kube-scheduler-ip-172-31-12-197 1/1 Running 0 8m32s
ubuntu@master:~$ kubectl exec -it nettool -- /bin/bash
root@nettool:/# apt-get update
Err:1 http://security.ubuntu.com/ubuntu focal-security InRelease
Temporary failure resolving 'security.ubuntu.com'
Err:2 http://archive.ubuntu.com/ubuntu focal InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Temporary failure resolving 'archive.ubuntu.com'
Reading package lists... Done
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-backports/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/focal-security/InRelease Temporary failure resolving 'security.ubuntu.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
root@nettool:/# cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local eu-central-1.compute.internal
options ndots:5
root@nettool:/#
I would appreciate any leads so I can learn to troubleshoot such issues, please.
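(For reference, a minimal set of checks from inside the pod can separate "CoreDNS unreachable" from "CoreDNS reachable but its upstream failing" - assuming a DNS client such as nslookup is present in the nettool image:)
kubectl exec -it nettool -- nslookup kubernetes.default       # cluster-internal name, answered by CoreDNS itself
kubectl exec -it nettool -- nslookup ubuntu.com               # external name, needs CoreDNS to reach its upstream
kubectl exec -it nettool -- nslookup ubuntu.com 10.96.0.10    # query the cluster DNS Service IP explicitly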
Comments
-
Hi @rzadzins,
Similar errors have been reported on Kubernetes clusters running on AWS. It typically has to do with firewalls, at VPC and/or EC2 level.
Are you able to reach ubuntu.com from your node (not from a container)?
Can you compare the resolv.conf of your node vs. the nettool/ubuntu container?
Can you provide details about the coredns ConfigMap object? kubectl -n kube-system get cm coredns -o yaml
Can you provide the logs of a coredns Pod? kubectl -n kube-system logs coredns-...
Regards,
-Chris
-
Yes, ubuntu.com resolves from the node. The /etc/resolv.conf on the node has been populated by systemd:
nameserver 127.0.0.53
options edns0 trust-ad
search eu-central-1.compute.internal
On the pod, however, it points to the cluster IP:
root@nettool:/# cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local eu-central-1.compute.internal
options ndots:5
The coredns ConfigMap is pristine:
ubuntu@master:~$ kubectl -n kube-system get cm coredns -o yaml
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2020-11-17T21:50:49Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:Corefile: {}
    manager: kubeadm
    operation: Update
    time: "2020-11-17T21:50:49Z"
  name: coredns
  namespace: kube-system
  resourceVersion: "193"
  selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
  uid: e78e3cc4-c931-4bc7-ac1a-18af785909fb
The logs of the coredns pods show timeouts:
ubuntu@master:~$ kubectl -n kube-system logs coredns-f9fd979d6-g5bsd
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[ERROR] plugin/errors: 2 3509682171098248668.7412719447730720321. HINFO: read udp 172.31.4.1:54497->172.31.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 3509682171098248668.7412719447730720321. HINFO: read udp 172.31.4.1:59711->172.31.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 3509682171098248668.7412719447730720321. HINFO: read udp 172.31.4.1:55194->172.31.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 3509682171098248668.7412719447730720321. HINFO: read udp 172.31.4.1:41570->172.31.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 3509682171098248668.7412719447730720321. HINFO: read udp 172.31.4.1:51632->172.31.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 archive.ubuntu.com.eu-central-1.compute.internal. A: read udp 172.31.4.1:35942->172.31.0.2:53: i/o timeout
(that last one is probably me trying to run apt-get update on the pod)
I've run iptables cleanups before creating the cluster and made sure ufw is disabled:
root@master:~# ufw status
Status: inactive
Next I checked the VPC, its subnets and ACLs, but they show all traffic is allowed to all destinations. The subnet in this VPC is 172.31.0.0/20, so the communication above that times out belongs to it.
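(Side note: CoreDNS forwards to the resolv.conf it sees inside its own pod; on nodes running systemd-resolved, kubelet normally hands pods the upstream list from /run/systemd/resolve/resolv.conf rather than the 127.0.0.53 stub, which is why the log entries above target 172.31.0.2, the VPC resolver. A sketch of how one could confirm that path on a kubeadm-built node - file locations are the kubeadm defaults:)
sudo grep -i resolvConf /var/lib/kubelet/config.yaml      # kubelet's resolv.conf setting for pods
cat /run/systemd/resolve/resolv.conf                      # the real upstream list behind systemd-resolved
dig @172.31.0.2 ubuntu.com +short                         # query the same upstream directly from the node (if dig is installed)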
-
Are the firewall rules built on top of existing default rules in a default VPC? Or did you build a custom/new VPC with an all-open firewall rule?
I have experienced such conflicts in the past when I used default VPCs with default rules as the foundation for my cluster-specific rules. My rules for this class allow all traffic - all protocols, from all sources, to all ports - no restrictions of any kind.
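(One way to double-check what the instance's security group actually allows is to dump it with the AWS CLI rather than rely on the console view; the group ID below is a placeholder:)
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0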
Regards,
-Chris
-
@chrispokorni I've set up a new AWS account for the sake of the labs, with a new VPC in it that allows all traffic, and then two new EC2 instances using this VPC and its subnets.
I checked one other thing - the DNS settings in the yaml file of the pod result in the Google DNS showing up in /etc/resolv.conf in the pod, which then allows domain resolution. I'm not sure where else traffic could be filtered out...
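(The lab's nettool.yaml isn't reproduced in this thread, but pod-level DNS overrides of this kind are normally expressed with dnsPolicy/dnsConfig; a hypothetical sketch - pod name and image are placeholders:)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nettool-gdns
spec:
  containers:
  - name: nettool
    image: ubuntu:20.04
    command: ["sleep", "infinity"]
  dnsPolicy: "None"           # ignore the cluster DNS entirely
  dnsConfig:
    nameservers:
    - 8.8.8.8                 # lands in the pod's /etc/resolv.conf
EOF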
-
One would expect container DNS to be configured on EC2 instances similarly to other environments. It must be something specific to how AWS handles network and DNS configuration.
Regards,
-Chris
-
I solved the problem - it was my mistake.
After some more research, I noticed I was getting timeouts for other pod activity as well, not only DNS. After a lot of trial and error (and cluster teardowns) I realized I had misread the instructions - there must be zero overlap in the network configuration between the local interfaces and the cluster (I somehow got confused by the lab asking to check the IP on the interface of the instance). Now DNS is smooth. Thanks for the help and patience!
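(A quick way to put the ranges side by side and spot an overlap like this, assuming a kubeadm-built cluster; podSubnet only appears if it was set at init time:)
ip -4 route                                                                # node/VM network, e.g. 172.31.0.0/20 here
kubectl -n kube-system get cm kubeadm-config -o yaml | grep -iE 'serviceSubnet|podSubnet'
kubectl get pods -A -o wide                                                # pod IPs show which range the CNI actually assigns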
PS: one thing that bothered me - why is the CoreDNS service called kube-dns? From what I understood, kube-dns was replaced by CoreDNS.
-
Hi @rzadzins,
It is common to have one application (foo) exposed via a Service of a different name (bar). Exposing the newly introduced CoreDNS application via the kube-dns Service helps with backward compatibility.
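(This is easy to see on a kubeadm cluster: the Service keeps the old name while the endpoints behind it are CoreDNS pods; the label and names below are the kubeadm defaults:)
kubectl -n kube-system get svc kube-dns
kubectl -n kube-system get pods -l k8s-app=kube-dns     # the CoreDNS pods still carry the kube-dns label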
Regards,
-Chris
-
Good point, thank you Chris!
-
rzadzins, I seem to be having the same name resolution issue. I see that the pod on the worker node is not able to reach the control-plane node.
23:45:00.823 INFO [SelfRegisteringRemote$1.run] - Couldn't register this node: The hub is down or not responding: selenium-hub: Temporary failure in name resolution
23:45:05.825 INFO [SelfRegisteringRemote$1.run] - Couldn't register this node: The hub is down or not responding: selenium-hub
I am trying to register to a Selenium hub. On removing the service name and changing it to the IP, I get a connection timeout:
23:48:41.336 INFO [GridLauncherV3.lambda$buildLaunchers$7] - Selenium Grid node is up and ready to register to the hub
23:48:41.398 INFO [SelfRegisteringRemote$1.run] - Starting auto registration thread. Will try to register every 5000 ms.
23:50:41.628 WARN [SelfRegisteringRemote.registerToHub] - Error getting the parameters from the hub. The node may end up with wrong timeouts.connect timed out
23:50:41.629 INFO [SelfRegisteringRemote.registerToHub] - Registering the node to the hub: http://10.0.0.6:4444/grid/register
23:52:41.730 INFO [SelfRegisteringRemote$1.run] - Couldn't register this node: Error sending the registration request: connect timed out
23:54:46.748 INFO [SelfRegisteringRemote$1.run] - Couldn't register this node: The hub is down or not responding: connect timed out
Would really appreciate any pointers on this.
-
Hi @akarthik,
Is your Kubernetes cluster bootstrapped with the kubeadm tool, following the lab guide?
Regards,
-Chris
-
I was able to work around this issue by following these steps:
1. Exec into the ubuntu container as described in the course.
2. Create a copy of the /etc/resolv.conf file: cp /etc/resolv.conf /etc/resolv.conf.orig
3. Replace the resolv.conf by executing: echo "nameserver 8.8.8.8" | tee /etc/resolv.conf > /dev/null
4. Run the apt-get commands as described in the course.
5. Replace the resolv.conf with the original: cp /etc/resolv.conf.orig /etc/resolv.conf
6. Continue with the course materials for Lab 9.3.
-
Hi @rzadzins
I'm facing the same issue right now.
What did you change in the network installation? You wrote: "I noticed I had misread the instructions - there must be zero overlap in the network configuration between the local interfaces and the cluster (I somehow got confused by the lab asking to check the IP on the interface of the instance)"
Could you share the mistake?
-
Hi @dandrade.dev,
In summary, at bootstrapping time the cluster admin should pay close attention to all the networks involved and ensure there is no overlap between them. By default, the Kubernetes cluster uses the 10.96.0.0/12 network for Services, and the Calico network plugin uses 192.168.0.0/16 for the application Pod network (this can be modified at bootstrapping time). The VM/node network should therefore not overlap with either of the networks above; it should be a distinct network - for example 10.200.0.0/16.
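(For example, the two cluster-side ranges can be pinned at init time so they stay clear of the VM network; the flags below show the default values, and a kubeadm-config.yaml with networking.podSubnet / networking.serviceSubnet achieves the same thing:)
sudo kubeadm init \
  --pod-network-cidr=192.168.0.0/16 \
  --service-cidr=10.96.0.0/12
# the VM/node network itself (e.g. 10.200.0.0/16 above) must not overlap either range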
Regards,
-Chris
-
Hi @chrispokorni ! Thanks for your response.
Ok, I got it.
Let me show what I did:
-> Remove all nodes from the cluster (from the cp):
kubectl cordon node-1
kubectl drain node-1 --force --ignore-daemonsets
kubectl delete node node-1
... and also cleared the net config and restarted services:
kubeadm reset
rm -rf /etc/cni/net.d && iptables -F
systemctl restart kubelet && systemctl restart containerd
-> Restart cp (as root):
1 - kubeadm reset
2 - rm -rf /etc/cni/net.d && iptables -F
3 - systemctl restart kubelet && systemctl restart containerd
-> Changed the Calico and kubeadm-config manifests with the new CIDR block
calico.yaml:
[...]
- name: CALICO_IPV4POOL_CIDR
  value: "10.200.0.0/16"
[...]
kubeadm-config.yaml:
[...]
podSubnet: 10.200.0.0/16
[...]
-> Init the cp (as root):
kubeadm init --config=kubeadm-config.yaml --upload-cert
... and reconfigured kubectl access (as the admin user):
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
-> Calico Applied
kubectl apply -f calico.yaml
-> Rejoin the nodes to cluster:
sudo kubeadm join cp_server:6443 --token XXX --discovery-token-ca-cert-hash YYY
sudo kubeadm init
Up to here, everything seems OK. The nodes joined the cluster.
When I return to the exercise (Lab 9.3) and launch the Ubuntu pod, I'm still OK. But when I try to run, as the exercise says,
$ apt-get update
I'm still stuck. Please, could you help me with this?
FYI:
Calico ConfigMap:
apiVersion: v1
data:
  calico_backend: bird
  cni_network_config: |-
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "calico",
          "log_level": "info",
          "log_file_path": "/var/log/calico/cni/cni.log",
          "datastore_type": "kubernetes",
          "nodename": "KUBERNETES_NODE_NAME",
          "mtu": CNI_MTU,
          "ipam": {
            "type": "calico-ipam"
          },
          "policy": {
            "type": "k8s"
          },
          "kubernetes": {
            "kubeconfig": "KUBECONFIG_FILEPATH"
          }
        },
        {
          "type": "portmap",
          "snat": true,
          "capabilities": {"portMappings": true}
        },
        {
          "type": "bandwidth",
          "capabilities": {"bandwidth": true}
        }
      ]
    }
  typha_service_name: none
  veth_mtu: "0"
kind: ConfigMap
My VPC CIDR: 172.31.0.0/16 (172.31.0.0 - 172.31.255.255)
Default K8s Service network CIDR: 10.96.0.0/12 (10.96.0.0 - 10.111.255.255)
Calico and cluster (pod) CIDR: 10.200.0.0/16 (10.200.0.0 - 10.200.255.255)
My Security Groups on AWS EC2 have outbound to the world allowed (0.0.0.0/0) for all ports/protocols.
apt-get update commands work fine on the node, but not inside the pod.
-
Hi @dandrade.dev,
On EC2 instances there was no need to go through all that trouble and modify the pod network, since all IP networks were already distinct, with no overlaps.
For the purposes of this course, Kubernetes 1.26.1 or better yet 1.27.1 (instead of 1.26.0) is recommended.
What is the output of kubectl get pods -A -o wide?
On the "ubuntu" container, can you check the /etc/resolv.conf file? Can you try adding nameserver 8.8.8.8, as suggested in an earlier comment of this thread?
Regards,
-Chris
-
Sure!
The pods: (screenshot not shown)
The /etc/resolv.conf: (screenshot not shown)
If I modify the /etc/resolv.conf file to nameserver 8.8.8.8, the apt-get update and install commands work well. But that does not seem like the right way to solve this. Beyond that, further on in the exercise we need to perform some curl commands against service-lab.accounting.svc.cluster.local, and this does not work as expected with that nameserver in resolv.conf.
Cheers!
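(One way to narrow this down - placeholders in angle brackets: query each CoreDNS replica by pod IP from the ubuntu pod; if only the replica on the same node answers, UDP/53 is being dropped between nodes, which points at the firewall rather than CoreDNS.)
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide      # note each CoreDNS pod IP and its node
kubectl exec -it <ubuntu-pod> -- nslookup google.com <coredns-pod-ip>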
-
Hi @dandrade.dev,
How are your VPC and SG configured? A possible firewall misconfiguration could impact pod-to-pod and pod-to-service communication across nodes in your cluster.
When setting up your VPC, SG, and EC2 instances, did you happen to watch and follow the video guide from the introductory chapter?
Regards,
-Chris
-
Hey @chrispokorni
Yep!
I followed the steps presented here: https://cm.lf.training/LFS258/LabSetup-AWS.mp4
Btw, the VPC used was the default one created by AWS.
And, although not the best approach, I allowed all traffic as the instructional video said.
Cheers and sorry for the situation!
-
Hi @dandrade.dev,
It seems you are blocking all inbound protocols with the exception of TCP. Please open inbound traffic for all protocols.
Regards,
-Chris
-
That's it!!
Thanks a lot! Just a side note here: it is not necessary to open all inbound traffic; allowing UDP on port 53, and only from the VPC CIDR, was enough.
Cheers!!!
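(For completeness, a sketch of that narrower rule with the AWS CLI - the group ID is a placeholder, the CIDR is this thread's VPC; note that the course setup guide still recommends allowing all traffic between the nodes:)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol udp --port 53 \
  --cidr 172.31.0.0/16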