
Lab 12.7 Problem timeouts on dashboard

CharcoGreen
CharcoGreen Posts: 5
edited February 2020 in LFS258 Class Forum

Hi, I'm having problems with Lab 12.7. I also had issues with the other 12.x labs, but I was able to fix those; this one seems impossible for me. I'm getting timeouts, which I think is a network problem, but what is the best way to fix it?

docker@k8s1:~/k8s$ kubectl -n kubernetes-dashboard logs kubernetes-dashboard-b65488c4-2cp6s
2020/02/05 05:23:28 Using namespace: kubernetes-dashboard
2020/02/05 05:23:28 Using in-cluster config to connect to apiserver
2020/02/05 05:23:28 Starting overwatch
2020/02/05 05:23:28 Using secret token for csrf signing
2020/02/05 05:23:28 Initializing csrf token from kubernetes-dashboard-csrf secret
panic: Get https://10.96.0.1:443/api/v1/namespaces/kubernetes-dashboard/secrets/kubernetes-dashboard-csrf: dial tcp 10.96.0.1:443: i/o timeout

goroutine 1 [running]:
github.com/kubernetes/dashboard/src/app/backend/client/csrf.(*csrfTokenManager).init(0xc00050f740)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/csrf/manager.go:40 +0x3b4
github.com/kubernetes/dashboard/src/app/backend/client/csrf.NewCsrfTokenManager(...)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/csrf/manager.go:65
github.com/kubernetes/dashboard/src/app/backend/client.(*clientManager).initCSRFKey(0xc000381b80)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/manager.go:487 +0xc7
github.com/kubernetes/dashboard/src/app/backend/client.(*clientManager).init(0xc000381b80)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/manager.go:455 +0x47
github.com/kubernetes/dashboard/src/app/backend/client.NewClientManager(...)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/manager.go:536
main.main()
/home/travis/build/kubernetes/dashboard/src/app/backend/dashboard.go:105 +0x212

Thanks

Comments

  • Hi @CharcoGreen,

    From your output, it seems you are experiencing a timeout on port 443. Is it in use by another application, or is it blocked by a firewall of your OS or a firewall at the infrastructure level?

    The first step would be to determine why your traffic is blocked, and after that, come up with an action plan to fix the issue.

    Regards,
    -Chris

  • Thanks for your help,
    I rebuilt my cluster and fixed my firewall rules.

  • dnx
    dnx Posts: 32
    edited April 2020

    I have the same issue. Running nodes on VMWare Fusion. The metrics-server pod logs show:

    Error: Get https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.96.0.1:443: i/o timeout

    If I use nodeSelector to force it onto the master node, it works fine.

    But trying to run it on a worker, I always get that error.
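
    For reference, the nodeSelector workaround can be sketched like this in the metrics-server deployment's pod spec (the label is an assumption based on the default kubeadm control-plane label of this era; check yours with kubectl get nodes --show-labels):

          # hypothetical sketch: pin metrics-server to the master node
          spec:
            template:
              spec:
                nodeSelector:
                  node-role.kubernetes.io/master: ""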

    I have the extra args:

          - args:
            - --cert-dir=/tmp
            - --secure-port=4443
            - --kubelet-insecure-tls
            - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
            image: k8s.gcr.io/metrics-server-amd64:v0.3.6
    

    From the worker node I can curl https://10.96.0.1:443/ just fine, and also from within a pod on the same node (I used the kube-proxy pod container to test from).

    # curl -k https://10.96.0.1
    {
      "kind": "Status",
      "apiVersion": "v1",
      "metadata": {
    
      },
      "status": "Failure",
      "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
      "reason": "Forbidden",
      "details": {
    
      },
      "code": 403
    }
    

    No promiscuous mode on any of my interfaces:

    netstat -i   # no P (promiscuous) flag shown on any interface
    Kernel Interface table
    Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
    calib8f1  1440     1109      0      0 0          1074      0      0      0 BMRU
    docker0   1500        0      0      0 0             0      0      0      0 BMU
    eth0      1500   199755      0      0 0         86684      0      0      0 BMRU
    eth1      1500   263085      0      0 0        219869      0      0      0 BMRU
    lo       65536   208914      0      0 0        208914      0      0      0 LRU
    tunl0     1440    12756      0      0 0         12719      0      0      0 ORU
    

    Here are my interfaces on the worker:

    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
        link/ether 00:0c:29:9d:50:59 brd ff:ff:ff:ff:ff:ff
        inet 192.168.134.131/24 brd 192.168.134.255 scope global dynamic eth0
           valid_lft 1624sec preferred_lft 1624sec
        inet6 fe80::20c:29ff:fe9d:5059/64 scope link 
           valid_lft forever preferred_lft forever
    3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
        link/ether 00:0c:29:9d:50:63 brd ff:ff:ff:ff:ff:ff
        inet 192.168.10.3/24 brd 192.168.10.255 scope global eth1
           valid_lft forever preferred_lft forever
        inet6 fe80::20c:29ff:fe9d:5063/64 scope link 
           valid_lft forever preferred_lft forever
    4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
        link/ether 02:42:cc:78:79:00 brd ff:ff:ff:ff:ff:ff
        inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
           valid_lft forever preferred_lft forever
    7: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1440 qdisc noqueue state UNKNOWN group default qlen 1000
        link/ipip 0.0.0.0 brd 0.0.0.0
        inet 192.168.230.192/32 brd 192.168.230.192 scope global tunl0
           valid_lft forever preferred_lft forever
    74: cali810f9f98cd8@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default 
        link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 0
    

    As I have multiple interfaces, I've edited the Calico DaemonSet to ensure the VMware private network interface is used. (By the way, does anyone know how to set this per node? For example, if I wanted eth0 on one node but eth1 on another?)

          containers:
          - env:
            - name: IP_AUTODETECTION_METHOD
              value: interface=eth1
    
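    If hard-coding an interface name for the whole DaemonSet is too coarse, the Calico docs also describe a can-reach autodetection method, which picks whichever interface routes to a given address and so can differ per node. The address below is a hypothetical gateway on the private network; substitute one reachable only via the interface you want:

          - name: IP_AUTODETECTION_METHOD
            value: can-reach=192.168.10.1
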

    Running tshark on the worker when the metrics-server pod is starting I see:

    # tshark -i any 'port 443'
    Running as user "root" and group "root". This could be dangerous.
    Capturing on 'any'
        1 0.000000000 192.168.230.250 → 10.96.0.1    TCP 76 55514 → 443 [SYN] Seq=0 Win=28000 Len=0 MSS=1400 SACK_PERM=1 TSval=2513108884 TSecr=0 WS=128
        2 1.018008286 192.168.230.250 → 10.96.0.1    TCP 76 [TCP Retransmission] 55514 → 443 [SYN] Seq=0 Win=28000 Len=0 MSS=1400 SACK_PERM=1 TSval=2513109902 TSecr=0 WS=128
        3 3.033443940 192.168.230.250 → 10.96.0.1    TCP 76 [TCP Retransmission] 55514 → 443 [SYN] Seq=0 Win=28000 Len=0 MSS=1400 SACK_PERM=1 TSval=2513111918 TSecr=0 WS=128
        4 7.192949413 192.168.230.250 → 10.96.0.1    TCP 76 [TCP Retransmission] 55514 → 443 [SYN] Seq=0 Win=28000 Len=0 MSS=1400 SACK_PERM=1 TSval=2513116077 TSecr=0 WS=128
        5 15.385327465 192.168.230.250 → 10.96.0.1    TCP 76 [TCP Retransmission] 55514 → 443 [SYN] Seq=0 Win=28000 Len=0 MSS=1400 SACK_PERM=1 TSval=2513124269 TSecr=0 WS=128
    

    I suppose that shows the traffic going via tunl0, but I'm now a little lost as to where to go from here.

    I've looked through iptables and can't see anything. I wondered if maybe nftables was interfering, but I checked and only iptables modules are loaded, no nft.

    Any more ideas? I feel pretty lost now.

  • dnx
    dnx Posts: 32
    edited April 2020

    I've actually got it to 'work' but I neither like nor understand the solution, which irks me.

    I edited the metrics-server deployment and added hostNetwork: true. The new pod starts on the worker and all is fine. I don't understand what this is doing, however, or why it works. I also now see nothing from tshark.
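
    The change amounts to something like this sketch in the deployment's pod spec (the dnsPolicy line is my assumption; pods on the host network usually need it so cluster DNS keeps resolving):

          spec:
            template:
              spec:
                hostNetwork: true                   # share the node's network namespace
                dnsPolicy: ClusterFirstWithHostNet  # keep cluster DNS working on the host network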

    So, I wonder why is this working, and does it indicate where I can fix the problem properly?

  • chrispokorni
    chrispokorni Posts: 2,153

    Hi @dnx,

    The hostNetwork setting is a feature borrowed from container runtimes, where a container can share the host's network namespace and is therefore exposed directly under the host's IP address. While convenient, it poses security concerns: the container gains access to the node's network stack, which would otherwise be prevented by the resource isolation containers are meant to provide.

    A Kubernetes pod behaves the same way when the hostNetwork attribute is set to true: the pod is exposed directly under the node's IP address, sharing the host's network namespace. It is easy to implement and use, yet not the most secure. In this case, the pod no longer receives its IP address from the CNI network plugin (Calico), since it is exposed directly via the node's IP address, which eliminates a level of traffic routing and network abstraction. While this seems like an easy fix, it is not how things are intended to work in Kubernetes. If a pod does not operate as expected over the pod network implemented by the CNI plugin, several aspects of your setup could be at fault: the infrastructure networking overall, an incompatibility between your infrastructure and the CNI plugin, or simply a missed configuration option specific to the mix of technologies in your setup.

    Part of being a Kubernetes admin is figuring out compatibilities and incompatibilities between your infrastructure and cluster components, and discovering the specific configuration options needed to overcome such issues (where such options are available). Unfortunately, Kubernetes does not fix misconfigured networks, infrastructure, or incompatibilities for us.

    Regards,
    -Chris

  • dnx
    dnx Posts: 32

    Thanks for the explanation of hostNetwork @chrispokorni . Given all the things that I've checked and listed above do you have any tips or ideas as to where to check next? I've spent a whole day so far on this and feel like I've run to the end of my abilities thus far.

  • dnx
    dnx Posts: 32

    I went back to basics and checked my cluster init. After destroying the cluster and recreating with some changes it now works fine.

    The two things I changed:

    • added --apiserver-advertise-address to kubeadm, set to the IP of eth1(vmware private network)
    • changed calico.yaml and --pod-network-cidr to 172.16.0.0/16 as I was using 192.168 ranges for eth0 and eth1
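
    For anyone following along, the two changes can be sketched as a kubeadm config file. The advertise address is an example eth1 IP, not taken from this thread, and the CALICO_IPV4POOL_CIDR value in calico.yaml must be changed to match the pod subnet:

          apiVersion: kubeadm.k8s.io/v1beta2
          kind: InitConfiguration
          localAPIEndpoint:
            advertiseAddress: 192.168.10.2   # eth1 (VMware private network) IP of this node
          ---
          apiVersion: kubeadm.k8s.io/v1beta2
          kind: ClusterConfiguration
          networking:
            podSubnet: 172.16.0.0/16         # pod CIDR; must not overlap the node networks
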
  • chrispokorni
    chrispokorni Posts: 2,153

    I am glad it all works now.
    I was going to suggest exploring the networking section of your hypervisor's documentation, cross-referenced with the calico network plugin documentation to find the missing link. It seems that you found it in the meantime :smile: Great work!

    Regards,
    -Chris

  • @dnx said:
    I went back to basics and checked my cluster init. After destroying the cluster and recreating with some changes it now works fine.

    The two things I changed:

    • added --apiserver-advertise-address to kubeadm, set to the IP of eth1(vmware private network)
    • changed calico.yaml and --pod-network-cidr to 172.16.0.0/16 as I was using 192.168 ranges for eth0 and eth1

    Thanks this really helped!
    My conclusion: the IP range from which the control plane and worker nodes get their addresses MUST be different from the IP range used by the CNI network plugin for the pod network.
