Curl-ing a Cluster IP not working

Hi All,
I'm on Exercise 3.4: Deploy A Simple Application, step 20.

I've created a deployment and exposed a service.

pch@master:~/Deployments$ kubectl expose deployment/nginx
service/nginx exposed

pch@master:~/Deployments$ kubectl get svc nginx
NAME    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
nginx   ClusterIP   10.109.43.160   <none>        80/TCP    40s

pch@master:~/Deployments$ kubectl get ep nginx
NAME    ENDPOINTS     AGE
nginx   10.0.1.4:80   74s

pch@master:~/Deployments$ kubectl describe pod nginx-7848d4b86f-hkmlk | grep Node:
Node:         worker/192.168.0.137

When I curl the service IP and the endpoint IP, I get connection timeout errors.

pch@master:~/Deployments$ curl 10.109.43.160:80
curl: (28) Failed to connect to 10.109.43.160 port 80: Connection timed out
pch@master:~/Deployments$ curl 10.0.1.4:80
curl: (28) Failed to connect to 10.0.1.4 port 80: Connection timed out
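
One thing I can check on the master (a sanity check on my part, not a lab step) is whether kube-proxy has programmed rules for that ClusterIP:

# Look for the iptables rules kube-proxy created for the nginx ClusterIP
sudo iptables-save | grep 10.109.43.160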

Looking at all the infrastructure pods (the installation was done with kubeadm), I see two coredns pods and two flannel pods. I'm not seeing any obvious issues.

pch@master:~$ kubectl get pods --all-namespaces
NAMESPACE     NAME                             READY   STATUS    RESTARTS   AGE
kube-system   coredns-78fcd69978-qzfq9         1/1     Running   0          5h46m
kube-system   coredns-78fcd69978-vnr7m         1/1     Running   0          5h46m
kube-system   etcd-master                      1/1     Running   0          5h47m
kube-system   kube-apiserver-master            1/1     Running   1          5h47m
kube-system   kube-controller-manager-master   1/1     Running   1          5h47m
kube-system   kube-flannel-ds-ks94v            1/1     Running   0          5h33m
kube-system   kube-flannel-ds-ztmzv            1/1     Running   0          5h36m
kube-system   kube-proxy-dwxrq                 1/1     Running   0          5h33m
kube-system   kube-proxy-vvsmg                 1/1     Running   0          5h46m
kube-system   kube-scheduler-master            1/1     Running   1          5h47m

When I deployed the Flannel network, it created a flannel.1 interface on both the head node and the worker node.

HEAD NODE
5: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether 1e:d1:a0:5b:66:57 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.0/32 brd 10.0.0.0 scope global flannel.1
valid_lft forever preferred_lft forever
inet6 fe80::1cd1:a0ff:fe5b:6657/64 scope link
valid_lft forever preferred_lft forever

WORKER NODE
5: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether 8e:39:10:e5:87:cc brd ff:ff:ff:ff:ff:ff
inet 10.0.1.0/32 brd 10.0.1.0 scope global flannel.1
valid_lft forever preferred_lft forever
inet6 fe80::8c39:10ff:fee5:87cc/64 scope link
valid_lft forever preferred_lft forever

When I run tcpdump on the worker node (sudo tcpdump -i flannel.1) I do see the SYN packets arriving, but the curl on the head node still times out.

pch@worker:~$ sudo tcpdump -i flannel.1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
00:11:00.870225 IP 10.0.0.0.13917 > 10.0.1.4.http: Flags [S], seq 4066182250, win 64240, options [mss 1460,sackOK,TS val 938402630 ecr 0,nop,wscale 7], length 0
00:11:01.879778 IP 10.0.0.0.13917 > 10.0.1.4.http: Flags [S], seq 4066182250, win 64240, options [mss 1460,sackOK,TS val 938403640 ecr 0,nop,wscale 7], length 0
00:11:03.896173 IP 10.0.0.0.13917 > 10.0.1.4.http: Flags [S], seq 4066182250, win 64240, options [mss 1460,sackOK,TS val 938405656 ecr 0,nop,wscale 7], length 0
00:11:07.995933 IP 10.0.0.0.13917 > 10.0.1.4.http: Flags [S], seq 4066182250, win 64240, options [mss 1460,sackOK,TS val 938409756 ecr 0,nop,wscale 7], length 0
00:11:16.183887 IP 10.0.0.0.13917 > 10.0.1.4.http: Flags [S], seq 4066182250, win 64240, options [mss 1460,sackOK,TS val 938417944 ecr 0,nop,wscale 7], length 0
00:11:32.311722 IP 10.0.0.0.13917 > 10.0.1.4.http: Flags [S], seq 4066182250, win 64240, options [mss 1460,sackOK,TS val 938434072 ecr 0,nop,wscale 7], length 0
00:11:56.524309 IP worker.mdns > 224.0.0.251.mdns: 0 [2q] PTR (QM)? _ipps._tcp.local. PTR (QM)? _ipp._tcp.local. (45)
00:11:57.358047 IP6 worker.mdns > ff02::fb.mdns: 0 [2q] PTR (QM)? _ipps._tcp.local. PTR (QM)? _ipp._tcp.local. (45)
00:12:05.591719 IP 10.0.0.0.13917 > 10.0.1.4.http: Flags [S], seq 4066182250, win 64240, options [mss 1460,sackOK,TS val 938467352 ecr 0,nop,wscale 7], length 0
00:12:12.875671 ARP, Request who-has nicoda.kde.org tell worker, length 28
00:12:13.906135 ARP, Request who-has nicoda.kde.org tell worker, length 28
00:12:14.929559 ARP, Request who-has nicoda.kde.org tell worker, length 28
00:12:15.953546 ARP, Request who-has nicoda.kde.org tell worker, length 28
00:12:16.977931 ARP, Request who-has nicoda.kde.org tell worker, length 28
00:12:18.001375 ARP, Request who-has nicoda.kde.org tell worker, length 28

Could it be that something at the container level isn't working as expected?
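
Since the SYNs reach flannel.1 but never get answered, these are the sorts of checks I plan to run next on the worker (the cni0 bridge name is a guess on my part; ip a will show whatever bridge the pods actually use):

# Is IP forwarding enabled on the worker?
sysctl net.ipv4.ip_forward
# Is there a return route to the head node's pod subnet over the overlay?
ip route get 10.0.0.0
# Do the SYNs make it past flannel.1 toward the pod? (cni0 is a guess)
sudo tcpdump -ni cni0 host 10.0.1.4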

Comments

  • serewicz

    Hello,

    Is there a reason you're using Flannel instead of Calico as the labs call for?

    As the pod is on the worker, it's most likely an issue with your inter-VM network. What are you using to run your VMs?

    When you connect to the worker node, can you curl the pod's ephemeral IP? Then try the service endpoint.
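
    Something along these lines from the worker, using the IPs from your earlier output:

    # From the worker node: the pod IP first, then the service ClusterIP
    curl --max-time 5 http://10.0.1.4:80
    curl --max-time 5 http://10.109.43.160:80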

    Regards,

  • PeterCharuza

    Looking back at when I created the Flannel network, I ran these commands; it looks like the second one failed.

    pch@master:~$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
    Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
    podsecuritypolicy.policy/psp.flannel.unprivileged created
    clusterrole.rbac.authorization.k8s.io/flannel created
    clusterrolebinding.rbac.authorization.k8s.io/flannel created
    serviceaccount/flannel created
    configmap/kube-flannel-cfg created
    daemonset.apps/kube-flannel-ds created
    
    pch@master:~$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/k8s-manifests/kube-flannel-rbac.yml
    unable to recognize "https://raw.githubusercontent.com/coreos/flannel/master/Documentation/k8s-manifests/kube-flannel-rbac.yml": no matches for kind "ClusterRole" in version "rbac.authorization.k8s.io/v1beta1"
    unable to recognize "https://raw.githubusercontent.com/coreos/flannel/master/Documentation/k8s-manifests/kube-flannel-rbac.yml": no matches for kind "ClusterRoleBinding" in version "rbac.authorization.k8s.io/v1beta1"
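
    From what I can tell, that second manifest still uses the old rbac.authorization.k8s.io/v1beta1 API, which this cluster no longer serves; a quick way to confirm what it does accept (just a sanity check):

    # Which RBAC API versions does the API server still serve?
    kubectl api-versions | grep rbac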
    
  • PeterCharuza

    I had followed another guide to set up my environment with kubeadm.
    I'm using Docker on my VMs, and I stuck with the defaults there.

    I've deleted Flannel using:
    kubectl delete -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

    I've installed Calico using the recommended steps, adjusting the config for my pod network CIDR, 10.0.0.0/16.
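
    Roughly what I ran, for reference (the calico.yaml is the manifest the lab points at, with only the CIDR line edited):

    # Apply the Calico manifest after editing CALICO_IPV4POOL_CIDR to 10.0.0.0/16
    kubectl apply -f calico.yaml
    # Watch the Calico pods come up in kube-system
    kubectl get pods -n kube-system -w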

    After rebooting the hosts I see the Calico networks are in place.

    It looks like my deployment isn't too happy for some reason, though; the pod won't spin up.

    pch@master:~/Deployments$ kubectl get deploy,pod
    NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/nginx   0/1     1            0           84m
    
    NAME                         READY   STATUS              RESTARTS   AGE
    pod/nginx-7848d4b86f-zrvnx   0/1     ContainerCreating   0          72s
    
  • PeterCharuza

    Looking into the pod, it's getting errors. I deleted the deployment and redeployed from the YAML file we created in the exercise; same issue.

    pch@master:~/Deployments$ kubectl logs nginx-7848d4b86f-4ms4b
    Error from server (BadRequest): container "nginx" in pod "nginx-7848d4b86f-4ms4b" is waiting to start: ContainerCreating
    

    Ah-ha, OK: when running kubectl describe pod I see a Calico error that's the blocker.

      Warning  FailedCreatePodSandBox  2m50s                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "ced8d79d1a96fa1ef5f47f03cd4910703e86aff693d60e33a5270823dab99658" network for pod "nginx-7848d4b86f-4ms4b": networkPlugin cni failed to set up pod "nginx-7848d4b86f-4ms4b_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
    
  • PeterCharuza

    The issue is that the Calico pods are crashing.

    pch@master:~$ kubectl get pods -n kube-system
    NAME                                       READY   STATUS              RESTARTS         AGE
    calico-kube-controllers-58497c65d5-t6nk7   0/1     ContainerCreating   0                75s
    calico-node-d7t8w                          1/1     Running             1 (29m ago)      42m
    calico-node-rp2cp                          0/1     CrashLoopBackOff    17 (2m32s ago)   42m
    

    It looks like the issue might be a duplicate IP address between the head (master) node and the worker node.

    pch@master:~$ kubectl -n kube-system logs calico-node-rp2cp
    2021-09-01 05:22:53.596 [INFO][9] startup/startup.go 713: Using autodetected IPv4 address on interface br-d55a3b06d6c7: 172.18.0.1/16
    2021-09-01 05:22:53.596 [INFO][9] startup/startup.go 530: Node IPv4 changed, will check for conflicts
    2021-09-01 05:22:53.600 [WARNING][9] startup/startup.go 1074: Calico node 'master' is already using the IPv4 address 172.18.0.1.
    2021-09-01 05:22:53.600 [INFO][9] startup/startup.go 360: Clearing out-of-date IPv4 address from this node IP="172.18.0.1/16"
    2021-09-01 05:22:53.607 [WARNING][9] startup/utils.go 48: Terminating
    Calico node failed to start
    
    pch@master:~$ calicoctl get nodes -o wide
    NAME     ASN       IPV4            IPV6
    master   (64512)   172.18.0.1/16
    worker
    

    I checked Calico support info online; it looks like the issue could be the IP_AUTODETECTION_METHOD discussed here: https://github.com/projectcalico/calico/issues/1628

    It seemed like a good thought, so I updated the Calico config by running:

    kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=can-reach=DESTINATION
    

    This was set successfully. I rebooted my machines, and the issue persists...

    pch@master:~$ kubectl get pods -n kube-system
    NAME                                       READY   STATUS              RESTARTS        AGE
    calico-kube-controllers-58497c65d5-t6nk7   0/1     ContainerCreating   0               13m
    calico-node-hp5nt                          1/1     Running             1 (3m53s ago)   4m45s
    calico-node-vwzt7                          0/1     CrashLoopBackOff    6 (35s ago)     4m45s
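
    (In hindsight, DESTINATION in that set env command is just a placeholder from the Calico docs; presumably it needs a real, reachable address, or the interface option instead. Something like one of these, using the worker's LAN IP and the ens33 NIC from my setup as examples:)

    # can-reach wants a real address; the route to it decides which interface Calico picks
    kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=can-reach=192.168.0.137
    # or pin detection to the VM NIC by name
    kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=ens33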
    

    I do see that both master and worker have the following NIC, with the same name and IP on each:
    MASTER-HEAD
    4: br-d55a3b06d6c7: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:12:93:d4:cd brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 brd 172.18.255.255 scope global br-d55a3b06d6c7
    valid_lft forever preferred_lft forever

    WORKER
    3: br-d55a3b06d6c7: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:17:0d:c6:31 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 brd 172.18.255.255 scope global br-d55a3b06d6c7
    valid_lft forever preferred_lft forever
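
    That br-<id> naming pattern looks like a user-defined Docker network bridge; a way to check where it comes from (my guess, not a lab step):

    # List Docker networks; a user-defined bridge shows up on the host as br-<network id>
    sudo docker network ls
    # Inspect the matching network ID to see the 172.18.0.0/16 subnet it hands out
    sudo docker network inspect d55a3b06d6c7 | grep -i subnet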

  • PeterCharuza

    This is where I'm stuck. I'm not sure how to resolve this IP conflict; Calico auto-detects this IP address and is choosing the same one for each node (master and worker).

  • serewicz

    Hello,

    Docker uses its own networking configuration, and I am not too aware of the details of making it work. Calico does not automatically set the IP range; the lab actually has you look at where the information comes from in the calico.yaml file and in the configuration file passed to kubeadm init.

    The short of it is that you don't want any of your IP ranges (VMs, host, services, ephemeral pod IPs) to overlap. Also, you want to ensure there is NO firewall between your VMs. Are you sure you have configured Docker to allow all traffic?

    Perhaps you can use VirtualBox, VMware, or KVM/QEMU locally, or GCE, AWS, or DigitalOcean, where you can properly control the IP range of your VMs as well as the connectivity of the VMs to each other.

    Regards,

  • PeterCharuza

    Hi Serewicz,

    Totally understood. I'm using VMware, and the VMs themselves are on a 192.168.0.0/24 network, so they are fine. I initialized kubeadm with 10.0.0.0/16 for the pod network, and I set calico.yaml to the same thing, thinking they need to match.

     - name: CALICO_IPV4POOL_CIDR
       value: "10.0.0.0/16"
    

    kubeadm install command:

    sudo kubeadm init --pod-network-cidr=10.0.0.0/16
    

    I'm not sure where the 172.18.0.1/16 came from.

    pch@master:~/Deployments$ ip a
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
        link/ether 00:0c:29:31:02:a4 brd ff:ff:ff:ff:ff:ff
        altname enp2s1
        inet 192.168.0.172/24 brd 192.168.0.255 scope global noprefixroute ens33
           valid_lft forever preferred_lft forever
        inet6 fe80::4afa:aa33:58f0:ee0/64 scope link noprefixroute
           valid_lft forever preferred_lft forever
    3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
        link/ether 02:42:5d:78:aa:29 brd ff:ff:ff:ff:ff:ff
        inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
           valid_lft forever preferred_lft forever
    4: br-d55a3b06d6c7: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
        link/ether 02:42:3c:3d:29:3f brd ff:ff:ff:ff:ff:ff
        inet 172.18.0.1/16 brd 172.18.255.255 scope global br-d55a3b06d6c7
           valid_lft forever preferred_lft forever
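
    Just to double-check the pod CIDR the cluster actually got from kubeadm, these are the checks I can run (a sanity check, not a lab step):

    # Pod CIDR the controller manager assigned to each node
    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
    # The cluster-cidr flag the control plane was started with
    kubectl -n kube-system get pod kube-controller-manager-master -o yaml | grep cluster-cidr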
    
  • serewicz

    Hello,

    The VMs' IP range conflicts with the pods' ephemeral IPs. As a result, when you try to connect to a pod, your node sends the request out of the primary interface instead of across the tunnel to the other node.

    Exercise 3.1, steps 10 and 11, speak to what Calico uses by default. I would encourage you to create two new VMs, this time using an IP range like 10.128.0.0/16 or 172.16.0.0/12 for the VMs, which does not conflict with your host, the pod IPs, or the default ephemeral IPs. If your host is also using 192.168 as its own network there may be conflicts during routing, but that is less likely, as the traffic should stay within the VMs.

    In a previous comment you mentioned you had erased and redone the networking. Unfortunately, this does not really work; there is much more to how the IP is used when you create the cluster. Starting over is the suggested method if you want to change your cluster configuration in such a dramatic manner.

    Regards,

  • PeterCharuza

    I nuked my setup and rebuilt the VMs. This time I took proper clones and snapshots in case I need to rebuild again. :smile:

    I'm running through the installation exercise, Lab 3.1, and I've found that I'm getting odd output when running "hostname -i" on both the master and worker nodes.

    Master:
    pch@master:~$ hostname -i
    127.0.1.1

    Worker:
    pch@worker:~$ hostname -i
    192.168.0.157 172.17.0.1 fe80::20c:29ff:feb1:9de8

    The VM NIC is using 192.168.0.15X.
    Docker created 172.17.0.1 (docker0).

    The servers are Ubuntu 20.04. I set a static IP in the GUI, disabled IPv6, rebooted, and checked again. Same output...

    I realized that /etc/hosts wasn't set up the same way on both servers. I cleaned it up and made sure each hostname resolves to the proper local IP, and now we're all good!
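
    For anyone who hits the same thing, the layout that works for me looks roughly like this on both nodes (addresses are examples from my setup; the important part is that each hostname maps to the real LAN IP, not 127.0.1.1):

    # /etc/hosts on both nodes (the master address here is illustrative; the worker one is from the output above)
    127.0.0.1       localhost
    192.168.0.156   master      # example master LAN IP
    192.168.0.157   worker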

  • serewicz

    Glad to hear it's working.
