connection timeout connecting from worker to the local registry running on master via ClusterIP

crixo · January 2019

I'm stuck at Lab 3.1 - 28.
The worker node successfully joined the cluster using kubeadm join... and its status is Ready.
I'm not able to connect from the worker to the registry:
curl http://10.107.88.26:5000/v2/ -> curl: (7) Failed to connect to 10.107.88.26 port 5000: Connection timed out
The same command works without problems on the master node.
Are there network utilities/tools/commands I could use to provide you with details for troubleshooting?

Below is the output of "kubectl describe nodes":

    Name:               k8s-master1
    Roles:              master
    Labels:             beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        kubernetes.io/hostname=k8s-master1
                        node-role.kubernetes.io/master=
    Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                        node.alpha.kubernetes.io/ttl: 0
                        volumes.kubernetes.io/controller-managed-attach-detach: true
    CreationTimestamp:  Sat, 05 Jan 2019 14:54:13 +0000
    Taints:             <none>
    Unschedulable:      false
    Conditions:
___omitted___
    ready status. AppArmor enabled
    Addresses:
      InternalIP:  10.0.0.4
      Hostname:    k8s-master1
    Capacity:
     attachable-volumes-azure-disk:  16
     cpu:                            2
     ephemeral-storage:              30428648Ki
     hugepages-1Gi:                  0
     hugepages-2Mi:                  0
     memory:                         4040536Ki
     pods:                           110
    Allocatable:
     attachable-volumes-azure-disk:  16
     cpu:                            2
     ephemeral-storage:              28043041951
     hugepages-1Gi:                  0
     hugepages-2Mi:                  0
     memory:                         3938136Ki
     pods:                           110
    System Info:
     Machine ID:                 39f567ff6e2f4b29bd860fd7228c1322
     System UUID:                AE723471-2D04-1B41-A684-3FB8B12C8C31
     Boot ID:                    bf187457-5b0e-43ca-83c2-0e200a7572c9
     Kernel Version:             4.15.0-1036-azure
     OS Image:                   Ubuntu 16.04.5 LTS
     Operating System:           linux
     Architecture:               amd64
     Container Runtime Version:  docker://18.6.1
     Kubelet Version:            v1.12.1
     Kube-Proxy Version:         v1.12.1
    PodCIDR:                     192.168.0.0/24
    Non-terminated Pods:         (10 in total)
      Namespace                  Name                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits
      ---------                  ----                                   ------------  ----------  ---------------  -------------
___omitted___
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource                       Requests    Limits
      --------                       --------    ------
      cpu                            900m (45%)  0 (0%)
      memory                         70Mi (1%)   170Mi (4%)
      attachable-volumes-azure-disk  0           0
    Events:
      Type    Reason                   Age                From                     Message
      ----    ------                   ----               ----                     -------
___omitted___


    Name:               k8s-worker1
    Roles:              <none>
    Labels:             beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        kubernetes.io/hostname=k8s-worker1
    Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                        node.alpha.kubernetes.io/ttl: 0
                        volumes.kubernetes.io/controller-managed-attach-detach: true
    CreationTimestamp:  Sat, 05 Jan 2019 14:55:07 +0000
    Taints:             <none>
    Unschedulable:      false
    Conditions:
      Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
      ----             ------  -----------------                 ------------------                ------                       -------
___omitted___
    ready status. AppArmor enabled
    Addresses:
      InternalIP:  10.0.0.5
      Hostname:    k8s-worker1
    Capacity:
     attachable-volumes-azure-disk:  16
     cpu:                            1
     ephemeral-storage:              30428648Ki
     hugepages-1Gi:                  0
     hugepages-2Mi:                  0
     memory:                         944136Ki
     pods:                           110
    Allocatable:
     attachable-volumes-azure-disk:  16
     cpu:                            1
     ephemeral-storage:              28043041951
     hugepages-1Gi:                  0
     hugepages-2Mi:                  0
     memory:                         841736Ki
     pods:                           110
    System Info:
     Machine ID:                 8977d6e96cdd47ef9fd20f9496ab84f2
     System UUID:                92429DB1-12B5-3342-8C66-5A5119371B50
     Boot ID:                    23371f94-ebaa-4186-b27c-7a756f327aa6
     Kernel Version:             4.15.0-1036-azure
     OS Image:                   Ubuntu 16.04.5 LTS
     Operating System:           linux
     Architecture:               amd64
     Container Runtime Version:  docker://18.6.1
     Kubelet Version:            v1.12.1
     Kube-Proxy Version:         v1.12.1
    PodCIDR:                     192.168.1.0/24
    Non-terminated Pods:         (4 in total)
      Namespace                  Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits
      ---------                  ----                                        ------------  ----------  ---------------  -------------
___omitted___
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource                       Requests    Limits
      --------                       --------    ------
      cpu                            350m (35%)  0 (0%)
      memory                         70Mi (8%)   170Mi (20%)
      attachable-volumes-azure-disk  0           0
    Events:
      Type    Reason                   Age                From                     Message
      ----    ------                   ----               ----                     -------
___omitted___
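
For the troubleshooting-tools question above, a minimal sketch of a TCP reachability check, assuming bash and the coreutils `timeout` command are available on both nodes. The helper name `probe_tcp` is made up for illustration; it only reports whether a TCP connection can be opened, nothing more.

```shell
#!/usr/bin/env bash
# probe_tcp HOST PORT [TIMEOUT_SECONDS]
# Tries to open a TCP connection using bash's built-in /dev/tcp redirection.
# Prints "open" if the connection succeeds, "unreachable" on refusal or timeout.
probe_tcp() {
  local host=$1 port=$2 secs=${3:-3}
  if timeout "$secs" bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "unreachable"
  fi
}

# Example (IP and port taken from the curl command above):
#   probe_tcp 10.107.88.26 5000
```

Running the same probe on the master and on the worker, first against the ClusterIP and then against the pod IP behind it, helps narrow down whether only the service address fails or the pod network between nodes fails as well.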

crixo · January 2019

BTW: both Azure nodes/VMs are on the same vnet, and both have the default rule:

    Priority  Name              Port  Protocol  Source          Destination     Action
    65000     AllowVnetInBound  Any   Any       VirtualNetwork  VirtualNetwork  Allow

ping 10.107.88.26 succeeded

    ping 10.107.88.26
    PING 10.107.88.26 (10.107.88.26) 56(84) bytes of data.
    ^C
    --- 10.107.88.26 ping statistics ---
    22 packets transmitted, 0 received, 100% packet loss, time 21484ms

crixo · January 2019

More details
PS: ping did NOT succeed (I cannot edit my previous comment)

My VM provisioning script:

    # Resource group placeholder, region, image and VM sizes
    RESOURCE_GROUP='<my-rg-group>'
    LOCATION='westeurope'
    IMAGE='UbuntuLTS'
    #MASTER_SKU='Standard_D1_v2'
    MASTER_SKU='Standard_B2s'
    AGENT_SKU='Standard_B1s'
    MASTER_NAME='k8s-master1'

    az group create -g "$RESOURCE_GROUP" -l "$LOCATION"

    # Master VM; k8sMaster.sh is passed as custom data
    az vm create -g "$RESOURCE_GROUP" -n "$MASTER_NAME" \
      --size "$MASTER_SKU" \
      --image "$IMAGE" \
      --public-ip-address-dns-name "master1-$RESOURCE_GROUP" \
      --vnet-name vnet1 \
      --subnet subnet1 \
      --custom-data k8sMaster.sh \
      --ssh-key-value @/Users/cristiano/.ssh/azure-vm_rsa.pub
      ##--generate-ssh-keys

    # Worker VM on the same vnet/subnet; k8sSecond.sh is passed as custom data
    az vm create -g "$RESOURCE_GROUP" -n 'k8s-worker1' \
      --size "$AGENT_SKU" \
      --image "$IMAGE" \
      --public-ip-address-dns-name "worker1-$RESOURCE_GROUP" \
      --vnet-name vnet1 \
      --subnet subnet1 \
      --custom-data k8sSecond.sh \
      --ssh-key-value @/Users/cristiano/.ssh/azure-vm_rsa.pub
      #--generate-ssh-keys

    # Open the NodePort range on the master
    #https://docs.microsoft.com/en-us/cli/azure/vm?view=azure-cli-latest#az-vm-open-port
    az vm open-port -g "$RESOURCE_GROUP" -n "$MASTER_NAME" --port 30000-33000 --priority 1010

The script was inspired by this article:
https://www.aaronmsft.com/posts/azure-vmss-kubernetes-kubeadm/

Network settings in Azure:
https://docs.microsoft.com/en-us/azure/virtual-network/security-overview#denyallinbound

Do you think I have to add rules beyond the default one? I'm not sure whether ClusterIPs require extra rules or whether the vnet/subnet settings should be enough.

Thanks a lot for your support, and I apologize for my poor networking skills.

crixo · January 2019

The only ClusterIP reachable from the worker1 node is service/kubernetes.
The following curl request
curl --insecure https://10.96.0.1/api
succeeds on both nodes.

The same goes for endpoints: the only one working on both nodes is endpoints/kubernetes.

On the master1 node all endpoints and ClusterIPs work as expected.

ClusterIPs should work on all nodes, regardless of where the pod(s) are running. Does the same rule apply to endpoints as well?
I mean, should endpoints be reachable from all nodes?

Below is the output of "route -n" on both nodes. Should the endpoint IPs be included on both nodes, or is the current output correct?

Where are the mappings between ClusterIPs and endpoints stored?

    cristiano@k8s-master1:~$ kubectl get pods,svc,pvc,pv,deploy,endpoints

    NAME                            READY   STATUS    RESTARTS   AGE
    pod/basicpod                    1/1     Running   0          55m
    pod/nginx-67f8fb575f-r8kh5      1/1     Running   0          179m
    pod/registry-56cffc98d6-jkzj5   1/1     Running   0          179m

    NAME                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
    service/basicservice   ClusterIP   10.100.134.88   <none>        80/TCP     55m
    service/kubernetes     ClusterIP   10.96.0.1       <none>        443/TCP    3h17m
    service/nginx          ClusterIP   10.111.131.30   <none>        443/TCP    70m
    service/registry       ClusterIP   10.104.96.176   <none>        5000/TCP   70m

    NAME                                    STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    persistentvolumeclaim/nginx-claim0      Bound    task-pv-volume   200Mi      RWO                           179m
    persistentvolumeclaim/registry-claim0   Bound    registryvm       200Mi      RWO                           179m

    NAME                              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                     STORAGECLASS   REASON   AGE
    persistentvolume/registryvm       200Mi      RWO            Retain           Bound    default/registry-claim0                           3h3m
    persistentvolume/task-pv-volume   200Mi      RWO            Retain           Bound    default/nginx-claim0                              3h3m

    NAME                             DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
    deployment.extensions/nginx      1         1         1            1           179m
    deployment.extensions/registry   1         1         1            1           179m

    NAME                     ENDPOINTS              AGE
    endpoints/basicservice   192.168.159.131:80     55m
    endpoints/kubernetes     10.0.0.4:6443          3h17m
    endpoints/nginx          192.168.159.129:443    70m
    endpoints/registry       192.168.159.130:5000   70m

    cristiano@k8s-master1:~$ curl http://10.100.134.88 -> OK
    cristiano@k8s-worker1:~$ curl http://10.100.134.88 -> Timeout

    cristiano@k8s-master1:~$ route -n

    Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    0.0.0.0         10.0.0.1        0.0.0.0         UG    0      0        0 eth0
    10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 eth0
    168.63.129.16   10.0.0.1        255.255.255.255 UGH   0      0        0 eth0
    169.254.169.254 10.0.0.1        255.255.255.255 UGH   0      0        0 eth0
    172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
    192.168.159.128 0.0.0.0         255.255.255.192 U     0      0        0 *
    192.168.159.129 0.0.0.0         255.255.255.255 UH    0      0        0 cali0968f46c6a2
    192.168.159.130 0.0.0.0         255.255.255.255 UH    0      0        0 calib0effd1e9a6
    192.168.159.131 0.0.0.0         255.255.255.255 UH    0      0        0 cali3fc1b4ac805
    192.168.159.132 0.0.0.0         255.255.255.255 UH    0      0        0 calif49f354fce4
    192.168.159.133 0.0.0.0         255.255.255.255 UH    0      0        0 calie2160929446
    192.168.194.128 10.0.0.5        255.255.255.192 UG    0      0        0 tunl0

    cristiano@k8s-worker1:~$ route -n

    Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    0.0.0.0         10.0.0.1        0.0.0.0         UG    0      0        0 eth0
    10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 eth0
    168.63.129.16   10.0.0.1        255.255.255.255 UGH   0      0        0 eth0
    169.254.169.254 10.0.0.1        255.255.255.255 UGH   0      0        0 eth0
    172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
    192.168.159.128 10.0.0.4        255.255.255.192 UG    0      0        0 tunl0
    192.168.194.128 0.0.0.0         255.255.255.192 U     0      0        0 *
chrispokorni · January 2019

Hi @crixo ,
From your first output, it seems the firewall (AppArmor) is enabled on each node, and it may be blocking traffic to some of the ports - the reason why curls and pings are timing out.
See the instructions from Lab 2.1 - Overview section.
Regards,
-Chris

tanwarsatya · January 2019

This is the exact error I am facing: I'm not able to reach the registry from the worker node. My cluster is in Azure, and the virtual network has traffic enabled on all ports within the cluster (1 master, 2 nodes).

chrispokorni · January 2019

Hi @tanwarsatya ,
Also make sure the firewalls at OS level are disabled, as they may be blocking some traffic.
Regards,
-Chris

tanwarsatya · January 2019

ufw status shows inactive on both the master and worker nodes.

crixo · January 2019

Hi @chrispokorni,
If I understood correctly, AppArmor is a sort of firewall at the kernel level, while the network settings in Azure are something on top of it at the Azure platform level.
I tried on AWS as well and did not have this issue. The Azure and AWS VMs are both based on Ubuntu 16.04; does that mean they have different kernel settings? I mean, does Azure's Ubuntu have AppArmor enabled while AWS's Ubuntu does not?
I'll try to disable AppArmor on Azure and let you know. I assume you are referring to this activity: http://www.techytalk.info/disable-and-remove-apparmor-on-ubuntu-based-linux-distributions/

@chrispokorni, do you have any scripts/tools to verify that AppArmor is blocking some ports/IPs? I'd like to be able to troubleshoot and verify the root cause of the problem with some tool rather than simply fix the problem.

chrispokorni · January 2019

@crixo
You can use different tools on your nodes to troubleshoot networking: netcat (nc) or wireshark.
Each cloud provider makes available images for you to choose from, and these images are customized to work best with the provider's infrastructure.
Also, each provider's networking features will work slightly differently.
Regards,
-Chris

chrispokorni · January 2019

@tanwarsatya
You may have a different firewall enabled. The output provided by @crixo for Ubuntu 16 on Azure VMs shows AppArmor as being enabled.
-Chris

tanwarsatya · January 2019

@chrispokorni I disabled AppArmor and am still not able to work with the registry. I will be trying a couple more options to see what may be causing this.

crixo · January 2019

I fully removed AppArmor: https://webhostinggeeks.com/howto/how-to-disable-and-remove-apparmor-on-ubuntu-14-04/
Now "AppArmor enabled" has disappeared from "kubectl describe nodes", but I still get the same timeout error.
On the Azure VM firewall I also added a rule to allow all traffic in and out on both nodes, but the issue is the same: ClusterIPs are reachable only on the node where the pod has been deployed.
ufw is disabled as well:

    sudo ufw status verbose
    Status: inactive

crixo · January 2019

Hi @chrispokorni,
According to my previous post, all firewalls should be disabled/removed (ufw and AppArmor), and also the one provided by Azure in front of the VM (network settings in Azure).
I tried to use some of the tools you and @serewicz suggested in another post, but I'm lost due to my lack of networking skills... I'm not sure whether I ran the right tools/scripts, or maybe I'm not able to evaluate the results.
I'm not sure whether the problem is related to a firewall restriction or to some k8s network misconfiguration: could Calico have some issue on Azure VMs? I tried to inspect iptables and saw a lot of settings related to Calico... Is there a way/script to check ClusterIPs against the iptables settings?
Do you have a chance to try the cluster deployment on Azure? I posted the VM provisioning scripts above. I'd like to understand what the issue is, how to identify it, and of course solve it to resume the lab.
Thanks a lot for your support
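
On the question of where the ClusterIP-to-endpoint mapping lives and how to check it against iptables: the endpoints themselves are Endpoints objects in the API server (what `kubectl get endpoints` prints), and kube-proxy on every node turns them into NAT rules. A minimal sketch, assuming kube-proxy runs in its iptables mode; the helper name `rules_for_clusterip` is hypothetical.

```shell
#!/usr/bin/env bash
# rules_for_clusterip IP: reads `iptables-save -t nat` output on stdin and
# prints the KUBE-SERVICES rules whose destination is the given ClusterIP.
rules_for_clusterip() {
  local ip=$1
  grep -F -e '-A KUBE-SERVICES' | grep -F -e "-d ${ip}/32"
}

# Example, run as root on a node (ClusterIP of service/registry from above):
#   iptables-save -t nat | rules_for_clusterip 10.104.96.176
# The matching rule jumps to a KUBE-SVC-... chain, which in turn jumps to
# KUBE-SEP-... chains containing the DNAT to the endpoint IP and port.
```

If the same rules show up on both the master and the worker, the ClusterIP translation itself is in place on both nodes, and the difference is more likely in how the translated packet travels to the pod on the other node.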

chrispokorni · January 2019

Hi @crixo ,
Kubernetes' networking is not as complex as we may think, and it doesn't fix networking for us either. It relies on well-configured network infrastructure. So unless our VMs/nodes' networking is set up properly, Kubernetes will not work as expected.
I have not used Azure as of yet, and I am not familiar with its networking configurations. In my research, however, I did find lots of posts around Kubernetes on Azure, with tips on how to fix certain configuration issues.
Is there in Azure a networking configuration similar to a VPC on GCP and on AWS? It allows you to create a custom virtual network (not a VPN) in which to run your nodes for Kubernetes. This configuration resolves similar networking issues on GCP and AWS.
Regards,
-Chris

tanwarsatya · January 2019

I guess I have found the issue, and I am afraid it's not possible to simply create a Kubernetes cluster with a vnet and a couple of nodes.

https://docs.projectcalico.org/v3.3/reference/public-cloud/azure

I will still try some more options by installing the Azure CNI plugin; if this doesn't work, I am moving to Google for these labs.
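
To the best of my understanding, the point of the linked Calico page is that Azure's network does not carry IP-in-IP encapsulated traffic between VMs, which is what the `tunl0` routes shown earlier in the thread rely on. A rough sketch to see which encapsulation mode the Calico IP pools use, assuming `calicoctl` is installed and configured for the cluster; the helper name `ipip_modes` is made up for illustration.

```shell
#!/usr/bin/env bash
# ipip_modes: reads Calico IPPool YAML on stdin and prints the value of
# each ipipMode field it finds (for example Always, CrossSubnet or Never).
ipip_modes() {
  awk '$1 == "ipipMode:" { print $2 }'
}

# Example, assuming calicoctl can reach the cluster's datastore:
#   calicoctl get ippool -o yaml | ipip_modes
```

A value of `Always` would mean that all pod traffic between nodes is sent through the IP-in-IP tunnel.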

crixo · January 2019

Hi @tanwarsatya,
Thanks a lot for your update. I had the same feeling as well.
I assume you'd like to try the Azure CNI plugin instead of the Calico plugin included in our lab setup script. Please keep me posted on the result.
I was misled by the article I shared in my previous post:
https://www.aaronmsft.com/posts/azure-vmss-kubernetes-kubeadm/
In that article the setup script is pretty much equal to the one provided by our lab: it uses the same Calico plugin.
I did not fully test the suggested approach, but I assume it would fail as well.

kvcrajan · December 2019

I have the same problem with my Kubernetes cluster running on VMs through Hyper-V. I have tried all the options mentioned above (disabling AppArmor and ufw), but they did not help. Did anyone manage to resolve this issue? Please help me resolve it.

kvcrajan · December 2019

Hello Serewic,

I am able to ssh between master and worker, and even from the host machine, which indicates that network connectivity/traffic is fine. Are you suggesting that I set up the environment in Google Cloud in order to complete the exercise? I would certainly be glad to attempt that, as long as it works fine at least in Google Cloud. Please advise.
