
LAB 3.4 - Failed to pull image when Master and Worker are ready

The deployment does not become ready when both the master and the worker are Ready; however, when only the master is working, the error does not appear.

USE CASE OK:

u2004@k8sm0:~$ date
Wed 23 Dec 2020 06:09:17 PM UTC

u2004@k8sm0:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8sm0 Ready control-plane,master 6h22m v1.20.1
k8sw0 NotReady <none> 4h58m v1.20.1

u2004@k8sm0:~$ kubectl create deployment nginx --image=nginx
deployment.apps/nginx created

u2004@k8sm0:~$ kubectl get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
nginx 1/1 1 1 11s <------------------------------------------------- READY 1/1

u2004@k8sm0:~$ kubectl get events --sort-by='.lastTimestamp'
36s Normal ScalingReplicaSet deployment/nginx Scaled up replica set nginx-6799fc88d8 to 1
36s Normal SuccessfulCreate replicaset/nginx-6799fc88d8 Created pod: nginx-6799fc88d8-vr2mr
35s Normal Pulling pod/nginx-6799fc88d8-vr2mr Pulling image "nginx"
32s Normal Pulled pod/nginx-6799fc88d8-vr2mr Successfully pulled image "nginx" in 2.797267818s
32s Normal Created pod/nginx-6799fc88d8-vr2mr Created container nginx
32s Normal Started pod/nginx-6799fc88d8-vr2mr Started container nginx

u2004@k8sm0:~$ kubectl delete deployment nginx
deployment.apps "nginx" deleted
u2004@k8sm0:~$ kubectl get deployment
No resources found in default namespace.

USE CASE NOK:

u2004@k8sm0:~$ date
Wed 23 Dec 2020 06:13:49 PM UTC
u2004@k8sm0:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8sm0 Ready control-plane,master 6h26m v1.20.1
k8sw0 Ready <none> 5h2m v1.20.1

u2004@k8sm0:~$ kubectl get deployment
No resources found in default namespace.

u2004@k8sm0:~$ kubectl create deployment nginx --image=nginx
deployment.apps/nginx created

u2004@k8sm0:~$ kubectl get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
nginx 0/1 1 0 16s
u2004@k8sm0:~$ kubectl get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
nginx 0/1 1 0 41s

u2004@k8sm0:~$ kubectl get events --sort-by='.lastTimestamp'
2m23s Normal NodeAllocatableEnforced node/k8sw0 Updated Node Allocatable limit across pods
2m23s Normal NodeHasSufficientPID node/k8sw0 Node k8sw0 status is now: NodeHasSufficientPID
2m23s Normal NodeHasNoDiskPressure node/k8sw0 Node k8sw0 status is now: NodeHasNoDiskPressure
2m23s Normal NodeHasSufficientMemory node/k8sw0 Node k8sw0 status is now: NodeHasSufficientMemory
2m23s Normal Starting node/k8sw0 Starting kubelet.
2m23s Warning Rebooted node/k8sw0 Node k8sw0 has been rebooted, boot id: 3148704a-d187-451c-b603-43b3a30be807
2m23s Normal NodeReady node/k8sw0 Node k8sw0 status is now: NodeReady
2m12s Normal Starting node/k8sw0 Starting kube-proxy.
117s Normal SuccessfulCreate replicaset/nginx-6799fc88d8 Created pod: nginx-6799fc88d8-z8hpq
117s Normal ScalingReplicaSet deployment/nginx Scaled up replica set nginx-6799fc88d8 to 1
116s Normal Pulling pod/nginx-6799fc88d8-z8hpq Pulling image "nginx"
0s Warning Failed pod/nginx-6799fc88d8-z8hpq Failed to pull image "nginx": rpc error: code = Unknown desc = dial tcp: lookup registry-1.docker.io: Temporary failure in name resolution
0s Warning Failed pod/nginx-6799fc88d8-z8hpq Error: ErrImagePull
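
A quick way to see the same pull error directly on the pod (a sketch; the pod name is the one from the events above):

u2004@k8sm0:~$ kubectl get pods -o wide
u2004@k8sm0:~$ kubectl describe pod nginx-6799fc88d8-z8hpq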

Comments

  • Name resolution fails when both nodes are up and running:

    u2004@k8sm0:/tmp$ kubectl get nodes
    NAME STATUS ROLES AGE VERSION
    k8sm0 Ready control-plane,master 6h47m v1.20.1
    k8sw0 Ready <none> 5h23m v1.20.1

    u2004@k8sm0:/tmp$ ping google.es
    ping: google.es: Temporary failure in name resolution


    u2004@k8sm0:/tmp$ kubectl get nodes
    NAME STATUS ROLES AGE VERSION
    k8sm0 Ready control-plane,master 6h51m v1.20.1
    k8sw0 NotReady <none> 5h26m v1.20.1

    u2004@k8sm0:/tmp$ ping google.es
    PING google.es (172.217.17.3) 56(84) bytes of data.
    64 bytes from mad07s09-in-f3.1e100.net (172.217.17.3): icmp_seq=1 ttl=128 time=15.9 ms
    64 bytes from mad07s09-in-f3.1e100.net (172.217.17.3): icmp_seq=2 ttl=128 time=21.0 ms
    64 bytes from mad07s09-in-f3.1e100.net (172.217.17.3): icmp_seq=3 ttl=128 time=18.0 ms
    64 bytes from mad07s09-in-f3.1e100.net (172.217.17.3): icmp_seq=4 ttl=128 time=19.8 ms
    64 bytes from mad07s09-in-f3.1e100.net (172.217.17.3): icmp_seq=5 ttl=128 time=20.4 ms
    ^C
    --- google.es ping statistics ---

  • serewicz

    Hello,

    I notice that your output is dissimilar to what one would expect. For example, this is what I see when I run kubectl get nodes:
    student@master:~$ kubectl get nodes
    NAME STATUS ROLES AGE VERSION
    master Ready master 4d19h v1.19.0
    worker Ready <none> 4d19h v1.19.0
    What are you using for your lab environment: GCE, AWS, VirtualBox, bare metal, etc.?
    What version of Ubuntu are you using?
    Please follow the lab guide when using kubeadm to build your cluster. Since you are using a different version of Kubernetes than the current course, it is clear you have deviated from the lab; how much you deviated could explain why other steps are not working.

    Regards,

  • My Lab:

    • VMWare
    • Ubuntu 20.04 LTS
    • K8S 1.20.1

    Latest versions of all components, as the Kubernetes documentation recommends.

    This is a strange case: is there any dependency between node status and the network?

    I believe this case is not related to the versions.

    Regards

  • To rule out a version issue, I have redeployed my lab according to the current course documentation.

    The issue is reproduced; I have reproduced it on my laptop and in another environment.

    ###### VERSIONS #######

    student@master:~$ lsb_release -a
    No LSB modules are available.
    Distributor ID: Ubuntu
    Description: Ubuntu 18.04.5 LTS
    Release: 18.04
    Codename: bionic

    student@master:~$ apt-show-versions | grep -i kube
    cri-tools:amd64/kubernetes-xenial 1.13.0-01 uptodate
    kubeadm:amd64/kubernetes-xenial 1.18.1-00 upgradeable to 1.20.1-00
    kubectl:amd64/kubernetes-xenial 1.18.1-00 upgradeable to 1.20.1-00
    kubelet:amd64/kubernetes-xenial 1.18.1-00 upgradeable to 1.20.1-00
    kubernetes-cni:amd64/kubernetes-xenial 0.8.7-00 uptodate

    ###### NOK USE CASE #####

    student@master:~$ kubectl get nodes
    NAME STATUS ROLES AGE VERSION
    master Ready master 19m v1.18.1
    worker Ready <none> 4m11s v1.18.1

    student@master:~$ kubectl get deployment
    NAME READY UP-TO-DATE AVAILABLE AGE
    nginx 0/1 1 0 2m2s

    student@master:~$ kubectl get events --sort-by='.lastTimestamp'
    13m Normal NodeReady node/worker Node worker status is now: NodeReady
    6m9s Normal SuccessfulCreate replicaset/nginx-6799fc88d8 Created pod: nginx-6799fc88d8-9xd8t
    6m9s Normal ScalingReplicaSet deployment/nginx Scaled up replica set nginx-6799fc88d8 to 1
    6m9s Normal Scheduled pod/nginx-6799fc88d8-9xd8t Successfully assigned default/nginx-6799fc88d8-9xd8t to worker
    4m4s Normal Pulling pod/nginx-6799fc88d8-9xd8t Pulling image "nginx"
    3m49s Warning Failed pod/nginx-6799fc88d8-9xd8t Error: ErrImagePull
    3m49s Warning Failed pod/nginx-6799fc88d8-9xd8t Failed to pull image "nginx": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    3m22s Warning Failed pod/nginx-6799fc88d8-9xd8t Error: ImagePullBackOff
    3m22s Normal BackOff pod/nginx-6799fc88d8-9xd8t Back-off pulling image "nginx"

    ###### OK USE CASE #####

    student@master:~$ kubectl get nodes
    NAME STATUS ROLES AGE VERSION
    master Ready master 33m v1.18.1
    worker NotReady <none> 17m v1.18.1

    student@master:~$ kubectl get deployment
    NAME READY UP-TO-DATE AVAILABLE AGE
    nginx 1/1 1 1 20s

    student@master:~$ kubectl get events --sort-by='.lastTimestamp'
    5m41s Normal NodeNotReady node/worker Node worker status is now: NodeNotReady
    89s Normal ScalingReplicaSet deployment/nginx Scaled up replica set nginx-6799fc88d8 to 1
    89s Normal SuccessfulCreate replicaset/nginx-6799fc88d8 Created pod: nginx-6799fc88d8-wn9m4
    89s Normal Scheduled pod/nginx-6799fc88d8-wn9m4 Successfully assigned default/nginx-6799fc88d8-wn9m4 to master
    88s Normal Pulling pod/nginx-6799fc88d8-wn9m4 Pulling image "nginx"
    74s Normal Started pod/nginx-6799fc88d8-wn9m4 Started container nginx
    74s Normal Created pod/nginx-6799fc88d8-wn9m4 Created container nginx
    74s Normal Pulled pod/nginx-6799fc88d8-wn9m4 Successfully pulled image "nginx"

    ###### OTHER TESTS ######

    When both nodes in the cluster are Ready, all name resolution is affected:

    student@master:~$ sudo apt install apt-show-versions
    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    The following additional packages will be installed:
    libapt-pkg-perl
    The following NEW packages will be installed:
    apt-show-versions libapt-pkg-perl
    0 upgraded, 2 newly installed, 0 to remove and 5 not upgraded.
    Need to get 96.6 kB of archives.
    After this operation, 312 kB of additional disk space will be used.
    Do you want to continue? [Y/n] y
    Err:1 http://es.archive.ubuntu.com/ubuntu bionic/main amd64 libapt-pkg-perl amd64 0.1.33build1
    Temporary failure resolving 'es.archive.ubuntu.com'
    0% [Working]^C

    student@master:~$ ping google.es
    ping: google.es: Temporary failure in name resolution

    When only the MASTER is Ready in the cluster, name resolution works fine:

    student@master:~$ sudo apt install apt-show-versions
    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    The following additional packages will be installed:
    libapt-pkg-perl
    The following NEW packages will be installed:
    apt-show-versions libapt-pkg-perl
    0 upgraded, 2 newly installed, 0 to remove and 5 not upgraded.
    Need to get 96.6 kB of archives.
    After this operation, 312 kB of additional disk space will be used.
    Do you want to continue? [Y/n] y
    Get:1 http://es.archive.ubuntu.com/ubuntu bionic/main amd64 libapt-pkg-perl amd64 0.1.33build1 [68.0 kB]
    Get:2 http://es.archive.ubuntu.com/ubuntu bionic/universe amd64 apt-show-versions all 0.22.7ubuntu1 [28.6 kB]
    Fetched 96.6 kB in 16s (6,140 B/s)
    Selecting previously unselected package libapt-pkg-perl.
    (Reading database ... 67530 files and directories currently installed.)
    Preparing to unpack .../libapt-pkg-perl_0.1.33build1_amd64.deb ...
    Unpacking libapt-pkg-perl (0.1.33build1) ...
    Selecting previously unselected package apt-show-versions.
    Preparing to unpack .../apt-show-versions_0.22.7ubuntu1_all.deb ...
    Unpacking apt-show-versions (0.22.7ubuntu1) ...
    Setting up libapt-pkg-perl (0.1.33build1) ...
    Setting up apt-show-versions (0.22.7ubuntu1) ...
    ** initializing cache. This may take a while **
    Processing triggers for man-db (2.8.3-2ubuntu0.1) ...

    student@master:~$ ping google.es
    PING google.es (216.58.209.67) 56(84) bytes of data.
    64 bytes from mad07s22-in-f3.1e100.net (216.58.209.67): icmp_seq=1 ttl=128 time=38.2 ms
    64 bytes from mad07s22-in-f3.1e100.net (216.58.209.67): icmp_seq=2 ttl=128 time=20.6 ms
    64 bytes from mad07s22-in-f3.1e100.net (216.58.209.67): icmp_seq=3 ttl=128 time=42.9 ms
    64 bytes from mad07s22-in-f3.1e100.net (216.58.209.67): icmp_seq=4 ttl=128 time=17.5 ms

  • Check and compare the route and DNS settings changes between the OK and NOK use cases; see the sketch below.
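
    For example, capturing both before and after the worker becomes Ready might look like this (the file names are just placeholders):

    student@master:~$ ip route > /tmp/routes-ok.txt; cat /etc/resolv.conf > /tmp/resolv-ok.txt     # worker NotReady
    student@master:~$ ip route > /tmp/routes-nok.txt; cat /etc/resolv.conf > /tmp/resolv-nok.txt   # worker Ready
    student@master:~$ diff /tmp/routes-ok.txt /tmp/routes-nok.txt
    student@master:~$ diff /tmp/resolv-ok.txt /tmp/resolv-nok.txt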

  • serewicz

    Hello,

    When you have two nodes, the output indicates that the network stops working. This would mean that the issue is Linux networking, not Kubernetes.

    Please make sure that the VMware virtual machines do not have any firewall between each other or the outside world.

    If you are using the default calico.yaml file, then you would be using a combination of 10. addresses and 192.168. addresses for your cluster. You can verify this by looking at the pools mentioned in the calico.yaml file. Are you using an overlapping network for your VMWare VMs, or the laptop it is running on? If so then the routing becomes an issue. You can view the before and after routes using ip route. As you compare and contrast, what is different between them?
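
    For example, checking which pool calico.yaml would apply could look like this (the key is commented out by default, and its name can vary between Calico versions):

    student@master:~$ grep -A1 CALICO_IPV4POOL_CIDR calico.yaml
    # - name: CALICO_IPV4POOL_CIDR
    #   value: "192.168.0.0/16"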

    Regards,

  • There isn't a firewall.

    The routes appear OK.

    # Use case OK

    student@master:~$ route -n
    Kernel IP routing table
    Destination Gateway Genmask Flags Metric Ref Use Iface
    0.0.0.0 192.168.10.2 0.0.0.0 UG 100 0 0 ens33
    172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
    192.168.10.0 0.0.0.0 255.255.255.0 U 0 0 0 ens33
    192.168.10.2 0.0.0.0 255.255.255.255 UH 100 0 0 ens33
    192.168.219.64 0.0.0.0 255.255.255.192 U 0 0 0 *
    192.168.219.76 0.0.0.0 255.255.255.255 UH 0 0 0 cali4f2dae3ae57
    192.168.219.77 0.0.0.0 255.255.255.255 UH 0 0 0 cali3b44909318d
    192.168.219.78 0.0.0.0 255.255.255.255 UH 0 0 0 calif48570d0d2e

    student@master:~$ ip route
    default via 192.168.10.2 dev ens33 proto dhcp src 192.168.10.133 metric 100
    172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
    192.168.10.0/24 dev ens33 proto kernel scope link src 192.168.10.133
    192.168.10.2 dev ens33 proto dhcp scope link src 192.168.10.133 metric 100
    192.168.219.76 dev cali4f2dae3ae57 scope link
    192.168.219.77 dev cali3b44909318d scope link
    192.168.219.78 dev calif48570d0d2e scope link

    # Use case NOK

    student@master:~$ route -n
    Kernel IP routing table
    Destination Gateway Genmask Flags Metric Ref Use Iface
    0.0.0.0 192.168.10.2 0.0.0.0 UG 100 0 0 ens33
    172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
    192.168.10.0 0.0.0.0 255.255.255.0 U 0 0 0 ens33
    192.168.10.2 192.168.10.134 255.255.255.255 UGH 0 0 0 tunl0
    192.168.10.2 0.0.0.0 255.255.255.255 UH 100 0 0 ens33
    192.168.171.64 192.168.10.134 255.255.255.192 UG 0 0 0 tunl0
    192.168.219.64 0.0.0.0 255.255.255.192 U 0 0 0 *
    192.168.219.76 0.0.0.0 255.255.255.255 UH 0 0 0 cali4f2dae3ae57
    192.168.219.77 0.0.0.0 255.255.255.255 UH 0 0 0 cali3b44909318d
    192.168.219.78 0.0.0.0 255.255.255.255 UH 0 0 0 calif48570d0d2e

    student@master:~$ ip route
    default via 192.168.10.2 dev ens33 proto dhcp src 192.168.10.133 metric 100
    172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
    192.168.10.0/24 dev ens33 proto kernel scope link src 192.168.10.133
    192.168.10.2 via 192.168.10.134 dev tunl0 proto bird onlink
    192.168.10.2 dev ens33 proto dhcp scope link src 192.168.10.133 metric 100
    192.168.171.64/26 via 192.168.10.134 dev tunl0 proto bird onlink
    blackhole 192.168.219.64/26 proto bird
    192.168.219.73 dev cali3b44909318d scope link
    192.168.219.74 dev cali4f2dae3ae57 scope link
    192.168.219.75 dev calif48570d0d2e scope link

  • fjlozano

    The problem is with the DNS resolution.

    When the second node is Ready, DNS resolution fails.

    By default in Ubuntu 18.04, name resolution is managed by the systemd-resolved service (standard installation using the official Server ISO); a quick way to confirm the stub-resolver setup is shown below.
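
    A sketch of that check (on 18.04 the CLI is systemd-resolve rather than the newer resolvectl):

    student@master:~$ ls -l /etc/resolv.conf                      # symlink to the 127.0.0.53 stub file
    student@master:~$ cat /run/systemd/resolve/resolv.conf        # the real upstream DNS servers
    student@master:~$ systemd-resolve --status | grep -A3 'DNS Servers'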

    #### Use Case OK

    student@master:/run/systemd/resolve$ dig google.es

    ; <<>> DiG 9.11.3-1ubuntu1.13-Ubuntu <<>> google.es
    ;; global options: +cmd
    ;; connection timed out; no servers could be reached
    student@master:/run/systemd/resolve$ dig google.es

    ; <<>> DiG 9.11.3-1ubuntu1.13-Ubuntu <<>> google.es
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27420
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 65494
    ;; QUESTION SECTION:
    ;google.es. IN A

    ;; ANSWER SECTION:
    google.es. 5 IN A 172.217.17.3

    ;; Query time: 24 msec
    ;; SERVER: 127.0.0.53#53(127.0.0.53)
    ;; WHEN: Sat Dec 26 12:30:10 UTC 2020
    ;; MSG SIZE rcvd: 54

    student@master:/run/systemd/resolve$ netstat -anp | grep -i ":53"
    (Not all processes could be identified, non-owned process info
    will not be shown, you would have to be root to see it all.)
    tcp 0 0 127.0.0.53:53 0.0.0.0:* LISTEN -
    udp 0 0 127.0.0.53:53 0.0.0.0:* -

    #### Use Case NOK

    student@master:~$ dig google.es

    ; <<>> DiG 9.11.3-1ubuntu1.13-Ubuntu <<>> google.es
    ;; global options: +cmd
    ;; connection timed out; no servers could be reached

    student@master:~$ ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
    64 bytes from 8.8.8.8: icmp_seq=1 ttl=128 time=62.9 ms
    64 bytes from 8.8.8.8: icmp_seq=2 ttl=128 time=23.0 ms
    64 bytes from 8.8.8.8: icmp_seq=3 ttl=128 time=24.0 ms

    student@master:/run/systemd/resolve$ netstat -anp | grep -i ":53"
    (Not all processes could be identified, non-owned process info
    will not be shown, you would have to be root to see it all.)
    tcp 0 0 127.0.0.53:53 0.0.0.0:* LISTEN -
    udp 0 0 127.0.0.53:53 0.0.0.0:* -
    udp 0 0 192.168.219.64:35529 192.168.10.2:53 ESTABLISHED -

    I would like to understand why this behaviour is changed by Kubernetes, Calico, or some other component.
    As a workaround I could configure 8.8.8.8 as the DNS server, but I would prefer to understand this use case.

    Regards

  • The issue could be related to this, from the Kubernetes DNS debugging documentation:

    Some Linux distributions (e.g. Ubuntu) use a local DNS resolver by default (systemd-resolved). Systemd-resolved moves and replaces /etc/resolv.conf with a stub file that can cause a fatal forwarding loop when resolving names in upstream servers. This can be fixed manually by using kubelet's --resolv-conf flag to point to the correct resolv.conf (With systemd-resolved, this is /run/systemd/resolve/resolv.conf). kubeadm automatically detects systemd-resolved, and adjusts the kubelet flags accordingly.

    https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
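
    If that is the cause, a sketch of the check and fix on a kubeadm-built node would be (paths assume a default kubeadm install):

    student@worker:~$ grep resolvConf /var/lib/kubelet/config.yaml   # expect: resolvConf: /run/systemd/resolve/resolv.conf
    # if it points at /etc/resolv.conf instead, adjust it and restart the kubelet:
    student@worker:~$ sudo sed -i 's|^resolvConf:.*|resolvConf: /run/systemd/resolve/resolv.conf|' /var/lib/kubelet/config.yaml
    student@worker:~$ sudo systemctl restart kubelet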

  • serewicz

    Hello,

    Are all pods running without error? If you look at the logs for your calico and coredns pods, are there any errors? The command would look something like kubectl -n kube-system logs coredns-<TAB>
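
    Something like the following would pull those logs without tab completion (a sketch; the Calico container name depends on the manifest used):

    kubectl -n kube-system get pods -o wide
    kubectl -n kube-system logs deploy/coredns
    kubectl -n kube-system logs ds/calico-node -c calico-node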

    It looks as if you are using a 192.168. network for the nodes, which overlaps with the default Calico setup. Try changing the pool range to something else, such as a 172.16.0.0/12 network, when you build the cluster, and check whether it works; a sketch of that change follows.
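
    For reference, a sketch of that change (the CIDR is just an example value):

    # in calico.yaml, uncomment and adjust the pool before applying it:
    #   - name: CALICO_IPV4POOL_CIDR
    #     value: "172.16.0.0/16"
    # then initialize the cluster with the matching range and apply the manifest:
    sudo kubeadm init --pod-network-cidr=172.16.0.0/16
    kubectl apply -f calico.yaml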

    Regards,

  • There are no errors in coredns.

    I moved my network to 172.16.0.0/12.

    Finally, I solved the issue by reinstalling the platform completely, but I disabled systemd-resolved first.
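
    For anyone hitting the same thing, disabling the stub resolver on Ubuntu looks roughly like this (the nameserver value is just an example):

    sudo systemctl disable --now systemd-resolved
    sudo rm /etc/resolv.conf                            # it is a symlink to the 127.0.0.53 stub
    echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf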
