
Lab 12.7: Timeout problems on dashboard

CharcoGreen
CharcoGreen Posts: 5
edited February 2020 in LFS258 Class Forum

Hi, I'm having problems with Lab 12.7. I also had problems in other 12.x labs and managed to fix those, but I can't get this one working. I'm getting timeouts. I think it is a network problem, but what is the best way to fix it?

docker@k8s1:~/k8s$ kubectl -n kubernetes-dashboard logs kubernetes-dashboard-b65488c4-2cp6s
2020/02/05 05:23:28 Using namespace: kubernetes-dashboard
2020/02/05 05:23:28 Using in-cluster config to connect to apiserver
2020/02/05 05:23:28 Starting overwatch
2020/02/05 05:23:28 Using secret token for csrf signing
2020/02/05 05:23:28 Initializing csrf token from kubernetes-dashboard-csrf secret
panic: Get https://10.96.0.1:443/api/v1/namespaces/kubernetes-dashboard/secrets/kubernetes-dashboard-csrf: dial tcp 10.96.0.1:443: i/o timeout

goroutine 1 [running]:
github.com/kubernetes/dashboard/src/app/backend/client/csrf.(*csrfTokenManager).init(0xc00050f740)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/csrf/manager.go:40 +0x3b4
github.com/kubernetes/dashboard/src/app/backend/client/csrf.NewCsrfTokenManager(...)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/csrf/manager.go:65
github.com/kubernetes/dashboard/src/app/backend/client.(*clientManager).initCSRFKey(0xc000381b80)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/manager.go:487 +0xc7
github.com/kubernetes/dashboard/src/app/backend/client.(*clientManager).init(0xc000381b80)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/manager.go:455 +0x47
github.com/kubernetes/dashboard/src/app/backend/client.NewClientManager(...)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/manager.go:536
main.main()
/home/travis/build/kubernetes/dashboard/src/app/backend/dashboard.go:105 +0x212

Thanks

Comments

  • Hi @CharcoGreen,

    From your output, it seems you are experiencing a timeout on port 443. Is it in use by another application, or is it blocked by a firewall of your OS or a firewall at the infrastructure level?

    The first step would be to determine why your traffic is blocked, and after that, come up with an action plan to fix the issue.

    Regards,
    -Chris

  • Thanks for your help,
    I'm rebuilding my cluster and my firewall rules.

  • dnx
    dnx Posts: 32
    edited April 2020

    I have the same issue. Running nodes on VMware Fusion. The metrics-server pod logs show:

    Error: Get https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.96.0.1:443: i/o timeout

    If I utilise nodeSelector to force it to the master it works fine.

    But, trying to run it on a worker I always get that error.

    I have the extra args:

          - args:
            - --cert-dir=/tmp
            - --secure-port=4443
            - --kubelet-insecure-tls
            - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
            image: k8s.gcr.io/metrics-server-amd64:v0.3.6
    

    From the worker node I can curl https://10.96.0.1:443/ just fine, and also from within a pod on the same node (I used the kube-proxy pod container to test from).

    # curl -k https://10.96.0.1
    {
      "kind": "Status",
      "apiVersion": "v1",
      "metadata": {
    
      },
      "status": "Failure",
      "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
      "reason": "Forbidden",
      "details": {
    
      },
      "code": 403
    }
    

    No promiscuous mode on any of my interfaces:

    netstat -i #can see no P
    Kernel Interface table
    Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
    calib8f1  1440     1109      0      0 0          1074      0      0      0 BMRU
    docker0   1500        0      0      0 0             0      0      0      0 BMU
    eth0      1500   199755      0      0 0         86684      0      0      0 BMRU
    eth1      1500   263085      0      0 0        219869      0      0      0 BMRU
    lo       65536   208914      0      0 0        208914      0      0      0 LRU
    tunl0     1440    12756      0      0 0         12719      0      0      0 ORU
    

    Here are my interfaces on the worker:

    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
        link/ether 00:0c:29:9d:50:59 brd ff:ff:ff:ff:ff:ff
        inet 192.168.134.131/24 brd 192.168.134.255 scope global dynamic eth0
           valid_lft 1624sec preferred_lft 1624sec
        inet6 fe80::20c:29ff:fe9d:5059/64 scope link 
           valid_lft forever preferred_lft forever
    3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
        link/ether 00:0c:29:9d:50:63 brd ff:ff:ff:ff:ff:ff
        inet 192.168.10.3/24 brd 192.168.10.255 scope global eth1
           valid_lft forever preferred_lft forever
        inet6 fe80::20c:29ff:fe9d:5063/64 scope link 
           valid_lft forever preferred_lft forever
    4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
        link/ether 02:42:cc:78:79:00 brd ff:ff:ff:ff:ff:ff
        inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
           valid_lft forever preferred_lft forever
    7: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1440 qdisc noqueue state UNKNOWN group default qlen 1000
        link/ipip 0.0.0.0 brd 0.0.0.0
        inet 192.168.230.192/32 brd 192.168.230.192 scope global tunl0
           valid_lft forever preferred_lft forever
    74: cali810f9f98cd8@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default 
        link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 0
    

    As I have multiple interfaces I've edited the calico daemonset to ensure the vmware private network interface is being used (btw does anyone know how to set this based on specific nodes? If I wanted eth0 on one node, but eth1 on another?):

          containers:
          - env:
            - name: IP_AUTODETECTION_METHOD
              value: interface=eth1
    

    Running tshark on the worker when the metrics-server pod is starting I see:

    # tshark -i any 'port 443'
    Running as user "root" and group "root". This could be dangerous.
    Capturing on 'any'
        1 0.000000000 192.168.230.250 → 10.96.0.1    TCP 76 55514 → 443 [SYN] Seq=0 Win=28000 Len=0 MSS=1400 SACK_PERM=1 TSval=2513108884 TSecr=0 WS=128
        2 1.018008286 192.168.230.250 → 10.96.0.1    TCP 76 [TCP Retransmission] 55514 → 443 [SYN] Seq=0 Win=28000 Len=0 MSS=1400 SACK_PERM=1 TSval=2513109902 TSecr=0 WS=128
        3 3.033443940 192.168.230.250 → 10.96.0.1    TCP 76 [TCP Retransmission] 55514 → 443 [SYN] Seq=0 Win=28000 Len=0 MSS=1400 SACK_PERM=1 TSval=2513111918 TSecr=0 WS=128
        4 7.192949413 192.168.230.250 → 10.96.0.1    TCP 76 [TCP Retransmission] 55514 → 443 [SYN] Seq=0 Win=28000 Len=0 MSS=1400 SACK_PERM=1 TSval=2513116077 TSecr=0 WS=128
        5 15.385327465 192.168.230.250 → 10.96.0.1    TCP 76 [TCP Retransmission] 55514 → 443 [SYN] Seq=0 Win=28000 Len=0 MSS=1400 SACK_PERM=1 TSval=2513124269 TSecr=0 WS=128
    

    Which shows the traffic via tunl0 I suppose, but I'm now a little lost as to where to go from here.
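
    One way to read the capture above: every packet is the same SYN from the pod (192.168.230.250), and no SYN-ACK ever comes back. The gaps between the packets look like ordinary TCP retransmission backoff. A minimal sketch, assuming an initial retransmission timeout of 1 second that doubles after each attempt (the usual Linux default, not something confirmed in this thread); the function name is made up for illustration:

```python
def syn_retransmit_times(initial_rto=1.0, retries=4):
    """Return the offsets (seconds after the first SYN) at which
    retransmissions are sent if the timeout doubles after each attempt."""
    times = []
    elapsed = 0.0
    rto = initial_rto  # assumed initial timeout, not taken from the thread
    for _ in range(retries):
        elapsed += rto
        times.append(elapsed)
        rto *= 2
    return times


# Timestamps of packets 2 to 5 in the tshark output above.
observed = [1.018008286, 3.033443940, 7.192949413, 15.385327465]
expected = syn_retransmit_times()

for want, got in zip(expected, observed):
    print(f"expected ~{want:4.1f}s  observed {got:6.3f}s")
```

    If the timings line up like this, the client is simply retrying an unanswered SYN, so either the SYN never reaches the API server or the reply is lost on the way back to the pod.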

    I've looked through the iptables rules and can't see anything relevant. I also wondered whether nftables might be involved, but I checked and only iptables modules are loaded, no nft.

    Any more ideas? I feel pretty lost now.

  • dnx
    dnx Posts: 32
    edited April 2020

    I've actually got it to 'work' but I neither like nor understand the solution, which irks me.

    I edited the metrics-server deployment and added hostNetwork: true. New pod starts up on the worker and all is fine. I don't actually understand what this is doing however, and why it works. I also then see nothing from tshark.

    So, I wonder why is this working, and does it indicate where I can fix the problem properly?
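
    For anyone wanting to reproduce the workaround without hand-editing the deployment, here is a rough sketch of the change expressed as a merge patch. The helper name is hypothetical, and the `kube-system` namespace in the usage note is an assumption based on the log line above. This is a workaround for testing, not a fix for the underlying network problem.

```python
import json


def host_network_patch(enabled=True):
    """Build a merge patch that sets hostNetwork on a Deployment's pod template."""
    return {"spec": {"template": {"spec": {"hostNetwork": enabled}}}}


# Print the patch as JSON so it can be passed to kubectl.
print(json.dumps(host_network_patch()))
```

    The printed JSON could then be applied with something like `kubectl -n kube-system patch deployment metrics-server --patch '<json>'`, and reverted by generating the same patch with `enabled=False`.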

  • chrispokorni
    chrispokorni Posts: 2,376

    Hi @dnx,

    The hostNetwork is a feature borrowed from container runtimes, where a container can share the host's network namespace, hence expose itself directly under the host's IP address. While a convenient feature, it also poses security concerns since the container now has access to the node's network stack, which otherwise would not be allowed considering the resource isolation a container was aimed to provide.

    A Kubernetes pod behaves the same way when the hostNetwork attribute is set to true. The pod is exposed directly under the node's IP address, sharing the host's network namespace. This is easy to implement and use, yet not the most secure. In this case, the pod no longer receives its IP address from the CNI network plugin (calico) because it is exposed directly via the node's IP address, which eliminates a level of traffic routing and network abstraction.

    While this seems to be an easy fix, it is not how things were intended to work in Kubernetes. If a pod does not operate as expected over the pod network implemented by the CNI network plugin, there may be several issues with your setup. Several aspects could play a role in why your pod does not behave as expected: the overall infrastructure networking, (in)compatibility between your infrastructure and the CNI plugin, or just a missed configuration option specific to the mix of technologies in your setup.

    Part of being a Kubernetes admin is to figure out compatibility and incompatibilities between your infrastructure and cluster components and to discover specifics about certain configuration options in order to overcome such issues (where such specific options are available). Unfortunately, Kubernetes does not fix misconfigured networks, infrastructure, or incompatibilities for us.

    Regards,
    -Chris

  • dnx
    dnx Posts: 32

    Thanks for the explanation of hostNetwork, @chrispokorni. Given everything I've checked and listed above, do you have any tips or ideas on where to look next? I've spent a whole day on this so far and feel like I've reached the limit of my abilities.

  • dnx
    dnx Posts: 32

    I went back to basics and checked my cluster init. After destroying the cluster and recreating with some changes it now works fine.

    The two things I changed:

    • added --apiserver-advertise-address to kubeadm, set to the IP of eth1(vmware private network)
    • changed calico.yaml and --pod-network-cidr to 172.16.0.0/16 as I was using 192.168 ranges for eth0 and eth1

  • chrispokorni
    chrispokorni Posts: 2,376

    I am glad it all works now.
    I was going to suggest exploring the networking section of your hypervisor's documentation, cross-referenced with the calico network plugin documentation to find the missing link. It seems that you found it in the meantime :smile: Great work!

    Regards,
    -Chris

  • @dnx said:
    I went back to basics and checked my cluster init. After destroying the cluster and recreating with some changes it now works fine.

    The two things I changed:

    • added --apiserver-advertise-address to kubeadm, set to the IP of eth1(vmware private network)
    • changed calico.yaml and --pod-network-cidr to 172.16.0.0/16 as I was using 192.168 ranges for eth0 and eth1

    Thanks, this really helped!
    My conclusion: the IP range from which the control plane and worker nodes get their IP addresses MUST NOT overlap with the IP range that the network plugin (CNI) uses for the pod network.
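
    A quick way to check for this kind of overlap before running kubeadm init is Python's standard ipaddress module. This is only a sketch: the node networks are taken from the `ip addr` output earlier in the thread, and it assumes the original pod network was Calico's default 192.168.0.0/16, which matches the 192.168.230.x pod addresses in the capture but is not stated outright. The function name is made up for illustration.

```python
import ipaddress


def overlapping(pod_cidr, node_cidrs):
    """Return the node networks that overlap with the pod network CIDR."""
    pod_net = ipaddress.ip_network(pod_cidr)
    return [c for c in node_cidrs if pod_net.overlaps(ipaddress.ip_network(c))]


# eth0 and eth1 networks from the worker's `ip addr` output.
node_cidrs = ["192.168.134.0/24", "192.168.10.0/24"]

# Assumed original pod CIDR (Calico default): collides with both node networks.
print(overlapping("192.168.0.0/16", node_cidrs))  # ['192.168.134.0/24', '192.168.10.0/24']

# The pod CIDR chosen in the fix: no collision.
print(overlapping("172.16.0.0/16", node_cidrs))   # []
```

    An empty list means the pod range is safe to use with those node networks.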
