LAB 12.3 Unable to fully collect metrics

ccamachofg · October 2019

When deploying the metrics server, I cannot get any metrics to show. I get errors on the metrics-server pod like:

reststorage.go:144] unable to fetch pod metrics for pod ....
manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:kube-master: unable to fetch metrics from kubelet:kube-master: Get https://kube-master:10250/stats/summary: dial tcp: lookup kube-master on 10.96.0.10:53: server misbehaving

Not sure if there is some additional configuration to be made on the RBAC definition?

chrispokorni · October 2019

Hi @ccamachofg,
From the limited information provided above, it seems that your metrics-server has trouble finding the kube-master kubelet agent. This may happen based on how your cluster DNS is configured.

Check the metrics-server's GitHub repo. It may provide additional options for the metrics-server command in exercise 12.3, step 4. Additional address-type values you may try are "InternalDNS" and "ExternalDNS".

https://github.com/kubernetes-incubator/metrics-server

Regards,
-Chris

ccamachofg · October 2019

Thanks @chrispokorni,

I did some research and found a solution to my issue. Since I am doing all the labs inside VMs on a VirtualBox NAT network, I had no DNS resolution for my master and worker servers.
So I added static resolution to the coredns ConfigMap, like this:

hosts {
10.0.2.9 kube-worker
10.0.2.8 kube-master
fallthrough
}

With this configuration the metrics server was able to resolve and reach the nodes. Everything was fine after that.
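
A quick way to confirm that the node names resolve is a small lookup script. This is only a sketch: the function name `resolve` is illustrative, and it uses whatever resolver the machine it runs on is configured with, so it exercises CoreDNS only when run inside a pod. On a node it will typically be answered from /etc/hosts instead.

```python
import socket


def resolve(name):
    """Return the sorted IPv4 addresses for name, or [] if the lookup fails."""
    try:
        infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
    except socket.gaierror:
        # Lookup failed (for example the "server misbehaving" case above).
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve("kube-master")` and `resolve("kube-worker")` from inside a pod should return the addresses listed in the hosts block once the ConfigMap change has been picked up; an empty list means the name still does not resolve.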

MarceloSales · April 2020

I'm facing the same problem. How did you solve it by editing the ConfigMap? Can you post the ConfigMap?

MarceloSales · April 2020

I realized that, following the lab, the metrics-server only works when deployed on the Kubernetes master. When the pod is on any worker node, it cannot reach the kubernetes service ClusterIP (in my case 10.96.0.1, port 443); the connection times out.

chrispokorni · April 2020

Hi @MarceloSales ,

That is strange behavior. A service should be accessible on the assigned ClusterIP and exposed port from any node. When it is not, it may be due to a firewall blocking traffic to some ports between the nodes.

Regards,
-Chris

MarceloSales · April 2020

@chrispokorni said:
Hi @MarceloSales ,

That is strange behavior. A service should be accessible on the assigned ClusterIP and exposed port from any node. When it is not, it may be due to a firewall blocking traffic to some ports between the nodes.

Regards,
-Chris

The dashboard-metrics-scraper pod works on worker nodes after I inserted some iptables rules, but it does not collect metrics. The kubernetes-dashboard pod does not work even after inserting the iptables rules.

chrispokorni · April 2020

The dashboard is dependent on the metrics-server to display metrics. Without it, the dashboard cannot display any metrics but still allows you to interact with your cluster. You can find out more from the official documentation:

https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/

https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/

IPtables are used for intra-node traffic routing, therefore rules in a particular IPtable will only affect the internal traffic of that node. Kubernetes has a dedicated agent, kube-proxy, in charge of maintaining all routing rules in the IPtables. Your issue, however, is with node-to-node communication, not managed by IPtables. Depending on how your infrastructure is set up (cloud or local), there may be some sort of firewall blocking traffic between your nodes (not internal to any specific node).

Regards,
-Chris

ccamachofg · April 2020

@MarceloSales said:
I'm facing the same problem. How did you solve it by editing the ConfigMap? Can you post the ConfigMap?

Hi @MarceloSales

Here is the configuration I made:

[student@kube-master ~]$ kubectl -n kube-system get configmap coredns -o yaml
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           upstream
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        hosts {
           10.0.2.9 kube-worker
           10.0.2.8 kube-master
           fallthrough
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2019-08-15T09:35:48Z"
  name: coredns
  namespace: kube-system
  resourceVersion: "115068"
  selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
  uid: eb8297d2-d440-4e5a-8e15-0ac2c6437704

And here is my /etc/hosts file

[student@kube-master ~]$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.0.2.8   kube-master
10.0.2.9   kube-worker

Hope this helps

Regards
Camilo

MarceloSales · April 2020

@chrispokorni said:
The dashboard is dependent on the metrics-server to display metrics. Without it, the dashboard cannot display any metrics but still allows you to interact with your cluster. You can find out more from the official documentation:

https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/

https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/

IPtables are used for intra-node traffic routing, therefore rules in a particular IPtable will only affect the internal traffic of that node. Kubernetes has a dedicated agent, kube-proxy, in charge of maintaining all routing rules in the IPtables. Your issue, however, is with node-to-node communication, not managed by IPtables. Depending on how your infrastructure is set up (cloud or local), there may be some sort of firewall blocking traffic between your nodes (not internal to any specific node).

Regards,
-Chris

Thanks @chrispokorni .

I have three hosts:
192.168.1.200 k8smaster
192.168.1.201 k8sworker1
192.168.1.202 k8sworker2

I can ping from every node to every other node without problems using their IPs. The ClusterIP address of my kubernetes service is 10.96.0.1, but no node except the master can reach it. This is odd because I have no firewall or iptables rules. Maybe it has something to do with the hosts having two network interfaces. I'll test the @ccamachofg configuration and see if it works.

Thanks @ccamachofg for your help.

chrispokorni · April 2020

A ping response is not an indication that all ports are open. For that, you would need a different tool, such as netcat, which lets you target specific ports during your testing.
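
If netcat is not installed, the same kind of per-port test can be sketched in a few lines of Python. The helper name `port_open` and the 3-second default timeout are illustrative choices, not anything from the lab. The result only describes the path from wherever the script runs, so a check from a node's host network may not match what a pod on that node sees.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the full TCP handshake, unlike ping (ICMP).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For example, `port_open("10.96.0.1", 443)` tests the kubernetes service ClusterIP discussed in this thread, and `port_open("kube-master", 10250)` tests the kubelet port from the original error message.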

What exactly are you trying to accomplish by accessing the kubernetes service? I don't remember any step in the lab exercises working with this particular service.

Are you on VirtualBox? Have you enabled promiscuous mode for the node networking? Is your nodes' subnet overlapping the pod subnet?
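
The overlap question can be checked with Python's standard `ipaddress` module. This is a sketch under assumptions: the node addresses in this thread are 192.168.1.200 to 192.168.1.202, but the /24 prefix length used below is a guess, and the pod ranges are the two that come up later in the thread (192.168.0.0/16 and 172.16.0.0/16). The function name `overlapping` is illustrative.

```python
import ipaddress


def overlapping(cidr_a, cidr_b):
    """Return True if the two CIDR ranges share at least one address."""
    # strict=False accepts a host address with a prefix, e.g. 192.168.1.200/24.
    a = ipaddress.ip_network(cidr_a, strict=False)
    b = ipaddress.ip_network(cidr_b, strict=False)
    return a.overlaps(b)


node_net = "192.168.1.0/24"  # assumed prefix length for the node network

print(overlapping(node_net, "192.168.0.0/16"))  # pod range inside the node's address space
print(overlapping(node_net, "172.16.0.0/16"))   # pod range on a separate network
```

When the first check reports an overlap, pod and service traffic can be routed to the wrong network, which is consistent with a ClusterIP that is reachable from one node but times out from pods on the others.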

Regards,
-Chris

MarceloSales · April 2020

Hi @chrispokorni , thanks again for helping.

Well, this is during the metrics-server lab. I'm using VirtualBox. The kubernetes service at 10.96.0.1, port 443, is running on the master node, and that is the IP the metrics pod tries to connect to; it gets a timeout when the pod is running on any worker node. When I use a nodeSelector to force metrics-server to run on the master, the pod starts without problems.

Thanks for the hint about netcat. I'm going crazy; look at the output from a worker node:

nc -vv 10.96.0.1 443
Connection to 10.96.0.1 443 port [tcp/https] succeeded!

This is what happens when I try to start the metrics pod on any worker:
```
kubectl -n kube-system logs metrics-server-XXXXXX

OUTPUT BEGIN

Error: Get https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.96.0.1:443: i/o timeout
Usage:
[flags]

Flags:
--alsologtostderr log to standard error as well as files
--authentication-kubeconfig string kubeconfig file pointing at the 'core' kubernetes server with enough rights to
....
A LOT OF HELP FLAGS
....

panic: Get https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.96.0.1:443: i/o timeout

goroutine 1 [running]:
main.main()
/go/src/github.com/kubernetes-incubator/metrics-server/cmd/metrics-server/metrics-server.go:39 +0x13b

OUTPUT END
```

I have not enabled promiscuous mode for networking.

Thanks again @chrispokorni for your attention.

serewicz · April 2020

Hello,

A couple of possible issues. If I understood your earlier posts, you have multiple interfaces in use. Use Wireshark, or some other tool, to determine which interface the request is being made on. In the past I've had similar issues, which did not happen on single-interface systems. A previous workaround was to initialize the cluster and then add the second interface; if I did that, it worked. My guess is that a request is being misrouted somewhere.

The next issue may be promiscuous mode. You said you did not enable it, but the last time I worked with VB it is enabled by default. You may want to double check each interface on each VM to ensure that all traffic is being allowed. This caused me all sorts of issues until I disabled it.

Regards,

MarceloSales · April 2020

Thanks @serewicz for your attention. I found it. The problem was an overlap in my network configuration, as @chrispokorni suggested: "Is your nodes' subnet overlapping the pod subnet?"
I tried to change my CIDR following this guide, https://docs.projectcalico.org/networking/migrate-pools, but my CoreDNS pods stopped working. So I decided to reinstall my cluster (following the exercises, I had created an Ansible playbook; it takes about 5 minutes to bring up a cluster with Vagrant and kubeadm on VirtualBox), and this time I changed the CIDR to a range that does not conflict with my 192.168.x.x network: I chose 172.16.0.0/16. Do not forget to edit calico.yaml and set the CALICO_IPV4POOL_CIDR variable to your new IP range. Everything is working fine now. When installing the cluster, pay attention to the network ranges to avoid conflicts and overlaps. I hope this information helps someone. Thanks to everyone who helped me.

serewicz · April 2020

Thanks for the feedback. I think if you check out Exercise 3.1, step 10 it speaks to your issue specifically. It is important to read each step, more than just the command being run.

Changing the IP pools after initialization is nearly impossible, and most people rebuild the cluster rather than track down every possible place the information is used.

MarceloSales · April 2020

@serewicz said:
Thanks for the feedback. I think if you check out Exercise 3.1, step 10 it speaks to your issue specifically. It is important to read each step, more than just the command being run.

Changing the IP pools after initialization is nearly impossible, and most people rebuild the cluster rather than track down every possible place the information is used.

Hi @serewicz, thanks. You're right. I read everything, and step 10 shows exactly the range 192.168.0.0/16, but I did not know about the overlap risk, and the exercise does not warn about it. Maybe a warning there could help others in the future. My fault; network configuration has a lot of pitfalls for me.
