
Connection timeout from worker to the local registry running on master via ClusterIP

crixo
crixo Posts: 31
edited January 2019 in LFD259 Class Forum

I'm stuck at lab 3.1 - 28.
The worker node joined the cluster successfully using kubeadm join... and its status is Ready.
However, I'm not able to connect from the worker to the registry:
curl http://10.107.88.26:5000/v2/ -> curl: (7) Failed to connect to 10.107.88.26 port 5000: Connection timed out
The same command works without problems on the master node.
Are there any network utilities/tools/commands I could use to provide you details for troubleshooting?
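A quick way to characterize such a failure is a timed TCP probe: a timeout suggests packets are being silently dropped in transit (firewall or routing), while an immediate "connection refused" means the destination answered but nothing is listening. A minimal sketch using bash's built-in /dev/tcp; the address below is the registry ClusterIP from this post, so substitute your own:

```shell
#!/bin/bash
# Probe a Service's TCP port with a timeout. The probe hanging until the
# timeout expires points at a drop; an instant failure points at a refusal.
SVC=10.107.88.26   # registry ClusterIP from this thread -- replace as needed
PORT=5000
if timeout 3 bash -c "exec 3<>/dev/tcp/$SVC/$PORT"; then
  echo "tcp connect to $SVC:$PORT succeeded"
else
  echo "tcp connect to $SVC:$PORT failed"
fi
```

netcat gives the same information with `nc -zvw 3 10.107.88.26 5000`, if it is installed.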

Here is the output of "kubectl describe nodes":

    Name:               k8s-master1
    Roles:              master
    Labels:             beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        kubernetes.io/hostname=k8s-master1
                        node-role.kubernetes.io/master=
    Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                        node.alpha.kubernetes.io/ttl: 0
                        volumes.kubernetes.io/controller-managed-attach-detach: true
    CreationTimestamp:  Sat, 05 Jan 2019 14:54:13 +0000
    Taints:             <none>
    Unschedulable:      false
    Conditions:
___omitted___
    ready status. AppArmor enabled
    Addresses:
      InternalIP:  10.0.0.4
      Hostname:    k8s-master1
    Capacity:
     attachable-volumes-azure-disk:  16
     cpu:                            2
     ephemeral-storage:              30428648Ki
     hugepages-1Gi:                  0
     hugepages-2Mi:                  0
     memory:                         4040536Ki
     pods:                           110
    Allocatable:
     attachable-volumes-azure-disk:  16
     cpu:                            2
     ephemeral-storage:              28043041951
     hugepages-1Gi:                  0
     hugepages-2Mi:                  0
     memory:                         3938136Ki
     pods:                           110
    System Info:
     Machine ID:                 39f567ff6e2f4b29bd860fd7228c1322
     System UUID:                AE723471-2D04-1B41-A684-3FB8B12C8C31
     Boot ID:                    bf187457-5b0e-43ca-83c2-0e200a7572c9
     Kernel Version:             4.15.0-1036-azure
     OS Image:                   Ubuntu 16.04.5 LTS
     Operating System:           linux
     Architecture:               amd64
     Container Runtime Version:  docker://18.6.1
     Kubelet Version:            v1.12.1
     Kube-Proxy Version:         v1.12.1
    PodCIDR:                     192.168.0.0/24
    Non-terminated Pods:         (10 in total)
      Namespace                  Name                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits
      ---------                  ----                                   ------------  ----------  ---------------  -------------
___omitted___
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource                       Requests    Limits
      --------                       --------    ------
      cpu                            900m (45%)  0 (0%)
      memory                         70Mi (1%)   170Mi (4%)
      attachable-volumes-azure-disk  0           0
    Events:
      Type    Reason                   Age                From                     Message
      ----    ------                   ----               ----                     -------
___omitted___


    Name:               k8s-worker1
    Roles:              <none>
    Labels:             beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        kubernetes.io/hostname=k8s-worker1
    Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                        node.alpha.kubernetes.io/ttl: 0
                        volumes.kubernetes.io/controller-managed-attach-detach: true
    CreationTimestamp:  Sat, 05 Jan 2019 14:55:07 +0000
    Taints:             <none>
    Unschedulable:      false
    Conditions:
      Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
      ----             ------  -----------------                 ------------------                ------                       -------
___omitted___
ready status. AppArmor enabled
    Addresses:
      InternalIP:  10.0.0.5
      Hostname:    k8s-worker1
    Capacity:
     attachable-volumes-azure-disk:  16
     cpu:                            1
     ephemeral-storage:              30428648Ki
     hugepages-1Gi:                  0
     hugepages-2Mi:                  0
     memory:                         944136Ki
     pods:                           110
    Allocatable:
     attachable-volumes-azure-disk:  16
     cpu:                            1
     ephemeral-storage:              28043041951
     hugepages-1Gi:                  0
     hugepages-2Mi:                  0
     memory:                         841736Ki
     pods:                           110
    System Info:
     Machine ID:                 8977d6e96cdd47ef9fd20f9496ab84f2
     System UUID:                92429DB1-12B5-3342-8C66-5A5119371B50
     Boot ID:                    23371f94-ebaa-4186-b27c-7a756f327aa6
     Kernel Version:             4.15.0-1036-azure
     OS Image:                   Ubuntu 16.04.5 LTS
     Operating System:           linux
     Architecture:               amd64
     Container Runtime Version:  docker://18.6.1
     Kubelet Version:            v1.12.1
     Kube-Proxy Version:         v1.12.1
    PodCIDR:                     192.168.1.0/24
    Non-terminated Pods:         (4 in total)
      Namespace                  Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits
      ---------                  ----                                        ------------  ----------  ---------------  -------------
___omitted___
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource                       Requests    Limits
      --------                       --------    ------
      cpu                            350m (35%)  0 (0%)
      memory                         70Mi (8%)   170Mi (20%)
      attachable-volumes-azure-disk  0           0
    Events:
      Type    Reason                   Age                From                     Message
      ----    ------                   ----               ----                     -------
___omitted___

Comments

  • crixo
    crixo Posts: 31

    BTW: both Azure nodes/VMs are on the same vnet, and both have the default NSG rule:

        Priority  Name              Port  Protocol  Source          Destination     Action
        65000     AllowVnetInBound  Any   Any       VirtualNetwork  VirtualNetwork  Allow

    ping 10.107.88.26 succeeded

        ping 10.107.88.26
        PING 10.107.88.26 (10.107.88.26) 56(84) bytes of data.
        ^C
        --- 10.107.88.26 ping statistics ---
        22 packets transmitted, 0 received, 100% packet loss, time 21484ms
    
  • crixo
    crixo Posts: 31

    More details
    PS: ping did NOT succeed (I cannot edit my previous comment)

    My VMs provisioning script

        RESOURCE_GROUP='<my-rg-group>'
        LOCATION='westeurope'
        IMAGE='UbuntuLTS'
        #MASTER_SKU='Standard_D1_v2'
        MASTER_SKU='Standard_B2s'
        AGENT_SKU='Standard_B1s'
        MASTER_NAME='k8s-master1'
    
        az group create -g $RESOURCE_GROUP -l $LOCATION
    
        az vm create -g $RESOURCE_GROUP -n $MASTER_NAME \
          --size $MASTER_SKU \
          --image $IMAGE \
          --public-ip-address-dns-name 'master1-'$RESOURCE_GROUP \
          --vnet-name vnet1 \
          --subnet subnet1 \
          --custom-data k8sMaster.sh \
          --ssh-key-value @/Users/cristiano/.ssh/azure-vm_rsa.pub
          ##--generate-ssh-keys
    
        az vm create -g $RESOURCE_GROUP -n 'k8s-worker1' \
          --size $AGENT_SKU \
          --image $IMAGE \
          --public-ip-address-dns-name 'worker1-'$RESOURCE_GROUP \
          --vnet-name vnet1 \
          --subnet subnet1 \
          --custom-data k8sSecond.sh \
          --ssh-key-value @/Users/cristiano/.ssh/azure-vm_rsa.pub
          #--generate-ssh-keys
    
        #https://docs.microsoft.com/en-us/cli/azure/vm?view=azure-cli-latest#az-vm-open-port
        az vm open-port -g $RESOURCE_GROUP -n $MASTER_NAME --port 30000-33000 --priority 1010
    

    The script was inspired by this article:
    https://www.aaronmsft.com/posts/azure-vmss-kubernetes-kubeadm/

    Network settings in azure
    https://docs.microsoft.com/en-us/azure/virtual-network/security-overview#denyallinbound

    Do you think I have to add additional rules beyond the default ones? I'm not sure whether ClusterIPs require extra rules or whether the vnet/subnet settings should be enough.

    Thanks a lot for the support, and I apologize for my poor networking skills.

  • crixo
    crixo Posts: 31
    edited January 2019

    The only ClusterIP reachable from the worker1 node is service/kubernetes.
    The following curl:
    curl --insecure https://10.96.0.1/api
    succeeds on both nodes.

    Same for endpoints: the only one working on both nodes is endpoints/kubernetes.

    On the master1 node all endpoints and ClusterIPs work as expected.

    ClusterIPs should work on all nodes, regardless of where the pod(s) are running. Does the same rule apply to endpoints as well? I mean, should endpoints be reachable from all nodes?

    I'm providing below the output of "route -n" on both nodes. Should the endpoint IPs be included on both nodes, or is the current output correct?

    Where are the mappings between ClusterIPs and endpoints stored?

    cristiano@k8s-master1:~$ kubectl get pods,svc,pvc,pv,deploy,endpoints

    NAME                            READY   STATUS    RESTARTS   AGE
    pod/basicpod                    1/1     Running   0          55m
    pod/nginx-67f8fb575f-r8kh5      1/1     Running   0          179m
    pod/registry-56cffc98d6-jkzj5   1/1     Running   0          179m
    
    NAME                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
    service/basicservice   ClusterIP   10.100.134.88   <none>        80/TCP     55m
    service/kubernetes     ClusterIP   10.96.0.1       <none>        443/TCP    3h17m
    service/nginx          ClusterIP   10.111.131.30   <none>        443/TCP    70m
    service/registry       ClusterIP   10.104.96.176   <none>        5000/TCP   70m
    
    NAME                                    STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    persistentvolumeclaim/nginx-claim0      Bound    task-pv-volume   200Mi      RWO                           179m
    persistentvolumeclaim/registry-claim0   Bound    registryvm       200Mi      RWO                           179m
    
    NAME                              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                     STORAGECLASS   REASON   AGE
    persistentvolume/registryvm       200Mi      RWO            Retain           Bound    default/registry-claim0                           3h3m
    persistentvolume/task-pv-volume   200Mi      RWO            Retain           Bound    default/nginx-claim0                              3h3m
    
    NAME                             DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
    deployment.extensions/nginx      1         1         1            1           179m
    deployment.extensions/registry   1         1         1            1           179m
    
    NAME                     ENDPOINTS              AGE
    endpoints/basicservice   192.168.159.131:80     55m
    endpoints/kubernetes     10.0.0.4:6443          3h17m
    endpoints/nginx          192.168.159.129:443    70m
    endpoints/registry       192.168.159.130:5000   70m
    

    cristiano@k8s-master1:~$ curl http://10.100.134.88 -> OK
    cristiano@k8s-worker1:~$ curl http://10.100.134.88 -> Timeout

    cristiano@k8s-master1:~$ route -n

    Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    0.0.0.0         10.0.0.1        0.0.0.0         UG    0      0        0 eth0
    10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 eth0
    168.63.129.16   10.0.0.1        255.255.255.255 UGH   0      0        0 eth0
    169.254.169.254 10.0.0.1        255.255.255.255 UGH   0      0        0 eth0
    172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
    192.168.159.128 0.0.0.0         255.255.255.192 U     0      0        0 *
    192.168.159.129 0.0.0.0         255.255.255.255 UH    0      0        0 cali0968f46c6a2
    192.168.159.130 0.0.0.0         255.255.255.255 UH    0      0        0 calib0effd1e9a6
    192.168.159.131 0.0.0.0         255.255.255.255 UH    0      0        0 cali3fc1b4ac805
    192.168.159.132 0.0.0.0         255.255.255.255 UH    0      0        0 calif49f354fce4
    192.168.159.133 0.0.0.0         255.255.255.255 UH    0      0        0 calie2160929446
    192.168.194.128 10.0.0.5        255.255.255.192 UG    0      0        0 tunl0
    

    cristiano@k8s-worker1:~$ route -n

    Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    0.0.0.0         10.0.0.1        0.0.0.0         UG    0      0        0 eth0
    10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 eth0
    168.63.129.16   10.0.0.1        255.255.255.255 UGH   0      0        0 eth0
    169.254.169.254 10.0.0.1        255.255.255.255 UGH   0      0        0 eth0
    172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
    192.168.159.128 10.0.0.4        255.255.255.192 UG    0      0        0 tunl0
    192.168.194.128 0.0.0.0         255.255.255.192 U     0      0        0 *
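On the question of where the ClusterIP-to-endpoint mapping lives: the source of truth is the Endpoints object in the API server, and kube-proxy renders it on every node as iptables NAT rules; there is no routing-table entry for a ClusterIP, which is why none appears in "route -n". A sketch of how to inspect those rules (the chain name below is illustrative, and the real command needs root on the node):

```shell
# On a node, list the NAT rules for a given ClusterIP with (requires root):
#   sudo iptables-save -t nat | grep 10.104.96.176
# Each Service endpoint gets a KUBE-SEP chain ending in a DNAT rule. A sample
# rule (captured here for illustration) and how to pull out the pod endpoint:
sample='-A KUBE-SEP-ABC123 -p tcp -m tcp -j DNAT --to-destination 192.168.159.130:5000'
echo "$sample" | grep -o 'to-destination [0-9.:]*'
# prints: to-destination 192.168.159.130:5000
```

If the worker shows the same KUBE-SERVICES/KUBE-SEP rules as the master but connections still time out, the drop is likely below Kubernetes, in the pod-network encapsulation or the cloud fabric.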
    
  • chrispokorni
    chrispokorni Posts: 2,366

    Hi @crixo ,
    From your first output, it seems the firewall (AppArmor) is enabled on each node, and it may be blocking traffic to some of the ports, which may be why the curls and pings are timing out.
    See the instructions from Lab 2.1 - Overview section.
    Regards,
    -Chris

  • This is the exact error I am facing: I am not able to reach the registry from the worker node. My cluster is in Azure and the virtual network has traffic enabled on all ports within the cluster (1 master, 2 nodes).

  • chrispokorni
    chrispokorni Posts: 2,366

    Hi @tanwarsatya ,
    Also make sure the firewalls at OS level are disabled, as they may be blocking some traffic.
    Regards,
    -Chris

  • ufw status is showing inactive on both master and worker nodes.

  • crixo
    crixo Posts: 31

    Hi @chrispokorni,
    If I understood correctly, AppArmor is a sort of firewall at the kernel level, while the Network settings in Azure are something on top of it at the Azure platform level.
    I tried on AWS as well and did not have these issues. The Azure and AWS VMs are both based on Ubuntu 16.04, so does that mean they have different kernel settings? In other words, does the Azure Ubuntu image have AppArmor enabled while the AWS one does not?
    I'll try to disable AppArmor on Azure and let you know. I assume you are referring to this activity: http://www.techytalk.info/disable-and-remove-apparmor-on-ubuntu-based-linux-distributions/

    @chrispokorni, do you have any scripts/tools to verify that AppArmor is blocking certain ports/IPs? I'd like to troubleshoot and verify the root cause of the problem with some tool, rather than simply fix it.

  • chrispokorni
    chrispokorni Posts: 2,366
    edited January 2019

    @crixo
    You can use different tools on your nodes to troubleshoot networking: netcat (nc) or Wireshark.
    Each cloud provider makes images available for you to choose from, and these images are customized to work best with the provider's infrastructure.
    Also, each provider's networking features will work slightly differently.
    Regards,
    -Chris

  • chrispokorni
    chrispokorni Posts: 2,366

    @tanwarsatya
    You may have a different firewall enabled. The output provided by @crixo for Ubuntu 16 on Azure VMs shows AppArmor as being enabled.
    -Chris

  • @chrispokorni I disabled AppArmor and am still not able to work with the registry. I will be trying a couple more options to see what may be causing this.

  • crixo
    crixo Posts: 31

    I fully removed AppArmor: https://webhostinggeeks.com/howto/how-to-disable-and-remove-apparmor-on-ubuntu-14-04/
    Now "AppArmor enabled" has disappeared from "kubectl describe nodes", but I still get the same timeout error.
    In the Azure VM network settings I also added a rule to allow all traffic in and out on both nodes, but same issue: the ClusterIPs are reachable only on the node where the pod has been deployed.
    ufw is disabled as well:

    sudo ufw status verbose
    Status: inactive
    
  • crixo
    crixo Posts: 31

    Hi @chrispokorni,
    According to my previous post, all firewalls should now be disabled/removed (ufw and AppArmor), and also the one provided by Azure in front of the VMs (the Network settings in Azure).
    I tried to use some of the tools you and @serewicz suggested in other posts, but I'm lost due to my lack of networking skills... I'm not sure I ran the right tools/scripts, or maybe I'm just not able to evaluate the results.
    I'm not sure whether the problem is a firewall restriction or some k8s network misconfiguration: could Calico have an issue on Azure VMs? I inspected iptables and saw a lot of settings related to Calico... is there a way/script to check the ClusterIP settings against the iptables rules?
    Do you have a chance to try the cluster deployment on Azure? I posted the VM provisioning script above. I'd like to understand what the issue is, how to identify it, and of course how to solve it so I can resume the lab.
    Thanks a lot for your support

  • chrispokorni
    chrispokorni Posts: 2,366

    Hi @crixo ,
    Kubernetes' networking is not as complex as we may think, but it doesn't fix networking for us either. It relies on well-configured network infrastructure, so unless our VMs/nodes' networking is set up properly, Kubernetes will not work as expected.
    I have not used Azure as of yet, and I am not familiar with its networking configurations. In my research, however, I did find lots of posts about Kubernetes on Azure, with tips on how to fix certain configuration issues.
    Is there a networking configuration in Azure similar to a VPC on GCP and AWS? A VPC allows you to create a custom virtual network (not a VPN) in which to run your Kubernetes nodes, and this configuration resolves similar networking issues on GCP and AWS.
    Regards,
    -Chris

  • I guess I have found the issue, and I am afraid it is not possible to simply create a Kubernetes cluster with a vnet and a couple of nodes.

    https://docs.projectcalico.org/v3.3/reference/public-cloud/azure

    I will still try some more options by installing the Azure CNI plugin; if this doesn't work I am moving to Google for these labs.
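The Calico document linked above describes the root cause seen in this thread: Calico's default IPIP encapsulation (IP protocol 4, the tunl0 routes in the route tables above) is not forwarded by Azure's network fabric, so inter-node pod and Service traffic is silently dropped even with all NSG rules open. Besides Azure CNI, later Calico releases (v3.7+, newer than the v3.3 in this thread) offer VXLAN encapsulation, which Azure does forward; an illustrative IPPool for that setup (the name and CIDR are examples, not from the lab):

```yaml
# Illustrative Calico IPPool (Calico v3.7+) using VXLAN instead of IPIP,
# since Azure drops IPIP (IP protocol 4) traffic between VMs.
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 192.168.0.0/16
  ipipMode: Never      # disable IPIP encapsulation
  vxlanMode: Always    # UDP-based VXLAN traverses Azure's fabric
  natOutgoing: true
```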

  • crixo
    crixo Posts: 31

    Hi @tanwarsatya,
    Thanks a lot for your update. I had the same feeling as well.
    I assume you'd like to try the Azure CNI plugin instead of the Calico plugin included in our lab setup script. Please keep me posted on the results.
    I have been misled by the article I shared in my previous post:
    https://www.aaronmsft.com/posts/azure-vmss-kubernetes-kubeadm/
    In that article the setup script is pretty much equal to the one provided by our lab: it uses the same Calico plugin.
    I did not fully test the suggested approach, but I assume it would fail as well.

  • I have the same problem in my Kubernetes cluster running on VMs through Hyper-V. I have tried all the options mentioned above (disabling AppArmor and ufw), but no luck. Did anyone manage to resolve this issue? Please help me resolve it.

  • serewicz
    serewicz Posts: 1,000

    Hello,

    The issue remains the environment's network. If there is anything blocking traffic between nodes, you will have issues. While you may have disabled UFW in the instance, you would also need to ensure the environment allows all traffic.

    From what others have found, this is not allowed. You may consider a more accessible environment.


    Regards,

Hello @serewicz,

    I am able to SSH between master and worker, and even from the host machine, which indicates network connectivity is fine. Are you suggesting setting up the environment in Google Cloud in order to complete the exercise? I would certainly be glad to attempt that, as long as it works there. Please advise.

  • serewicz
    serewicz Posts: 1,000

    SSH uses port 22; there are many more ports in use. Ensure there are no ports blocked between the instances.

    Should you decide to use GCE, there is a setup video I would encourage you to follow. I believe the video is in the resources URL mentioned in the class setup information.

    Regards,
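To make "ensure there are no ports blocked" concrete: the control-plane ports kubeadm uses in this Kubernetes release are 6443 (API server), 2379-2380 (etcd), and 10250-10252 (kubelet, scheduler, controller-manager). A rough sketch that sweeps them from one node against the other, again using bash's /dev/tcp; NODE is the peer's private IP (10.0.0.4 is the master in this thread):

```shell
#!/bin/bash
# Probe the kubeadm control-plane ports on a peer node. Note "blocked or
# closed" is printed both for firewall drops and for ports with no listener,
# so compare the result against a probe run from the master itself.
NODE=${NODE:-10.0.0.4}   # master's private IP from this thread -- adjust
for port in 6443 2379 2380 10250 10251 10252; do
  if timeout 2 bash -c "exec 3<>/dev/tcp/$NODE/$port" 2>/dev/null; then
    echo "port $port reachable"
  else
    echo "port $port blocked or closed"
  fi
done
```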

  • serewicz
    serewicz Posts: 1,000

    @pistle - we are not using OpenShift, which is a vendor-specific implementation of Kubernetes. Instead we build our clusters using the community-supported and fully open kubeadm.
