LAB 12.3 Unable to fully collect metrics

When deploying the metrics server, I cannot get any metrics to show. I get errors on the metrics-server pod like:

reststorage.go:144] unable to fetch pod metrics for pod ....
manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:kube-master: unable to fetch metrics from kubelet:kube-master: Get https://kube-master:10250/stats/summary: dial tcp: lookup kube-master on 10.96.0.10:53: server misbehaving

Not sure if there is some additional configuration to be made on the RBAC definition?

Answers

  • chrispokorni (Posts: 2,273)

    Hi @ccamachofg,
    From the limited information provided above, it seems that your metrics-server has trouble finding the kube-master kubelet agent. This may happen based on how your cluster DNS is configured.

    Check the metrics-server's GitHub repo. It may provide additional options for the metrics-server command in exercise 12.3 step 4. Additional address types you may try (values for the --kubelet-preferred-address-types option) are "InternalDNS" and "ExternalDNS".

    https://github.com/kubernetes-incubator/metrics-server

    Regards,
    -Chris

  • Thanks @chrispokorni,

    I did some research and found a solution to my issue. Since I am doing all the labs inside VMs on a VirtualBox NAT network, I had no DNS resolution for my master and worker servers.
    So I added static resolution to the coredns ConfigMap, like this:

    hosts {
    10.0.2.9 kube-worker
    10.0.2.8 kube-master
    fallthrough
    }

    With this configuration the metrics server was able to resolve and reach the nodes. Everything was fine after that.
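
For anyone reproducing the static-resolution approach above, here is a minimal Python sketch that renders a CoreDNS hosts stanza from a name-to-IP mapping and flags IPs that are assigned to more than one name. The helper names `render_hosts_block` and `shared_ips` are hypothetical, and the node names and addresses are the ones posted in this thread; substitute your own. It only produces text; applying it to the coredns ConfigMap is still a manual step.

```python
import ipaddress


def render_hosts_block(nodes: dict, indent: str = "    ") -> str:
    """Render a CoreDNS 'hosts' stanza from a {hostname: ip} mapping."""
    lines = ["hosts {"]
    for name, ip in nodes.items():
        # Raises ValueError if the address is malformed.
        ipaddress.ip_address(ip)
        lines.append(f"{indent}{ip} {name}")
    # 'fallthrough' passes names not listed here on to the next plugin.
    lines.append(f"{indent}fallthrough")
    lines.append("}")
    return "\n".join(lines)


def shared_ips(nodes: dict) -> dict:
    """Return {ip: [names]} for every IP assigned to more than one name."""
    by_ip = {}
    for name, ip in nodes.items():
        by_ip.setdefault(ip, []).append(name)
    return {ip: names for ip, names in by_ip.items() if len(names) > 1}


# Example values taken from this thread (assumed, adjust to your nodes).
nodes = {"kube-master": "10.0.2.8", "kube-worker": "10.0.2.9"}
print(render_hosts_block(nodes))
print(shared_ips(nodes))
```

Two names sharing one IP is legal in a hosts stanza, but for distinct nodes it usually points to a copy-paste mistake, which is why the sketch reports it rather than rejecting it.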

  • I'm facing the same problem. How did you solve it by editing the configmap? Can you post the configmap?

  • I realized that, following the lab, the metrics-server only works when it is deployed on the Kubernetes master. When the pod runs on any worker node, it cannot reach the kubernetes service ClusterIP (in my case 10.96.0.1, port 443) and the connection times out.

  • chrispokorni (Posts: 2,273)

    Hi @MarceloSales ,

    That is strange behavior. A service should be accessible on the assigned ClusterIP and exposed port from any node. When it is not, it may be due to a firewall blocking traffic to some ports between the nodes.

    Regards,
    -Chris

  • MarceloSales (Posts: 9)
    edited April 2020

    @chrispokorni The dashboard-metrics-scraper pod works on worker nodes after I inserted some iptables rules, but it does not collect metrics. The kubernetes-dashboard pod does not work even after inserting the iptables rules.

  • chrispokorni (Posts: 2,273)

    The dashboard is dependent on the metrics-server to display metrics. Without it, the dashboard cannot display any metrics but still allows you to interact with your cluster. You can find out more from the official documentation:

    https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/

    https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/

    iptables is used for intra-node traffic routing, so rules in a particular node's iptables only affect the internal traffic of that node. Kubernetes has a dedicated agent, kube-proxy, in charge of maintaining all routing rules in iptables. Your issue, however, is with node-to-node communication, which is not managed by iptables. Depending on how your infrastructure is set up (cloud or local), there may be some sort of firewall blocking traffic between your nodes (not internal to any specific node).

    Regards,
    -Chris

  • ccamachofg (Posts: 3)
    edited April 2020

    @MarceloSales said:
    I'm facing the same problem. How did you solve it by editing the configmap? Can you post the configmap?

    Hi @MarceloSales

    Here is the configuration I made:

    [student@kube-master ~]$ kubectl -n kube-system get configmap coredns -o yaml
    apiVersion: v1
    data:
      Corefile: |
        .:53 {
            errors
            health
            kubernetes cluster.local in-addr.arpa ip6.arpa {
               pods insecure
               upstream
               fallthrough in-addr.arpa ip6.arpa
               ttl 30
            }
            hosts {
               10.0.2.9 kube-worker
               10.0.2.8 kube-master
               fallthrough
            }
            prometheus :9153
            forward . /etc/resolv.conf
            cache 30
            loop
            reload
            loadbalance
        }
    kind: ConfigMap
    metadata:
      creationTimestamp: "2019-08-15T09:35:48Z"
      name: coredns
      namespace: kube-system
      resourceVersion: "115068"
      selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
      uid: eb8297d2-d440-4e5a-8e15-0ac2c6437704
    
    

    And here is my /etc/hosts file

    [student@kube-master ~]$ cat /etc/hosts
    127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
    ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
    10.0.2.8   kube-master
    10.0.2.9   kube-worker
    

    Hope this helps

    Regards
    Camilo

  • Thanks @chrispokorni.

    I have three hosts:
    192.168.1.200 k8smaster
    192.168.1.201 k8sworker1
    192.168.1.202 k8sworker2

    I can ping every node from every other node by IP without problems. The ClusterIP of my kubernetes service is 10.96.0.1, but nothing except the master can reach this address. This is odd because I have no firewall or iptables rules. Maybe it has something to do with the hosts having two network interfaces. I'll test @ccamachofg's configuration and see if it works.

    Thanks @ccamachofg for your help.

  • chrispokorni (Posts: 2,273)

    A ping response is not an indication that all ports are open. To check that, you need a different tool, such as netcat, that lets you target specific ports during your testing.

    What exactly are you trying to accomplish by accessing the kubernetes service? I don't remember any step in the lab exercises working with this particular service.

    Are you on Virtualbox? Have you enabled promiscuous mode for the node networking? Is your nodes' subnet overlapping the pod subnet?

    Regards,
    -Chris
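
The port-level check suggested above can also be sketched in Python for machines where netcat is not installed. This is a minimal sketch under stated assumptions, not part of the lab: `port_is_open` is a hypothetical helper name, and the 3-second timeout is an arbitrary choice. It attempts a plain TCP connection, which is what ping (ICMP) does not test.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For the address discussed in this thread, the call would be `port_is_open("10.96.0.1", 443)`. The result only describes the network path from wherever the script runs, so a check from a node's shell and a check from inside a pod on that node can differ.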

  • MarceloSales (Posts: 9)
    edited April 2020

    Hi @chrispokorni , thanks again for helping.

    Well, this is during the metrics-server lab. I'm using VirtualBox. The kubernetes service (10.96.0.1, port 443) is running on the master node, and that is the IP the metrics-server pod tries to connect to; it gets a timeout when the pod is running on any worker node. When I use nodeSelector to force metrics-server to run on the master, the pod starts without problems.

    Thanks for the hint about netcat. I'm going crazy; look at the output from a worker node:

    nc -vv 10.96.0.1 443
    Connection to 10.96.0.1 443 port [tcp/https] succeeded!
    

    This is what happens when I try to start the metrics-server pod on any worker:
    ```
    kubectl -n kube-system logs metrics-server-XXXXXX

    OUTPUT BEGIN

    Error: Get https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.96.0.1:443: i/o timeout
    Usage:
    [flags]

    Flags:
    --alsologtostderr log to standard error as well as files
    --authentication-kubeconfig string kubeconfig file pointing at the 'core' kubernetes server with enough rights to
    ....
    A LOT OF HELP FLAGS
    ....

    panic: Get https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.96.0.1:443: i/o timeout

    goroutine 1 [running]:
    main.main()
    /go/src/github.com/kubernetes-incubator/metrics-server/cmd/metrics-server/metrics-server.go:39 +0x13b

    OUTPUT END
    ```

    I have not enabled promiscuous mode for networking.

    Thanks again @chrispokorni for your attention.

  • serewicz (Posts: 1,000)

    Hello,

    A couple of possible issues. If I understood your earlier posts, you have multiple interfaces in use. Use wireshark, or some other tool, to determine which interface the request is being made on. In the past I've had similar issues, which did not happen on single-interface systems. A previous workaround was to initialize the cluster and then add the second interface. If I did that, it worked. My guess is that a request is being misrouted somewhere.

    The next issue may be promiscuous mode. You said you did not enable it, but the last time I worked with VB it was enabled by default. You may want to double-check each interface on each VM to ensure that all traffic is being allowed. This caused me all sorts of issues until I disabled it.

    Regards,

  • Thanks @serewicz for your attention. I found it. The problem was an overlap in my network configuration, as @chrispokorni suggested: "Is your nodes' subnet overlapping the pod subnet?"
    I tried to change my CIDR following this guide, https://docs.projectcalico.org/networking/migrate-pools, but my CoreDNS pods stopped working. So I decided to reinstall my cluster (following the exercises, I created an Ansible playbook; it takes about 5 minutes to have a cluster with Vagrant and kubeadm on VirtualBox), and this time I changed the CIDR to a range that does not conflict with my 192.168.x.x network. I chose 172.16.0.0/16. Do not forget to edit calico.yaml and set the CALICO_IPV4POOL_CIDR variable to your new IP range. Everything is working fine now. When installing the cluster, pay attention to the network ranges to avoid conflicts and overlaps. I hope this information helps someone. Thanks to everyone who helped me.
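
The overlap described above can be checked before initializing a cluster with Python's standard `ipaddress` module. This is a minimal sketch: `find_overlaps` is a hypothetical helper name, and the /24 mask on the node network is an assumption, since the thread only lists host addresses 192.168.1.200 to 192.168.1.202. The pod ranges are the two mentioned in the thread.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs: dict) -> list:
    """Return (name_a, name_b) pairs whose networks overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Node network mask (/24) is assumed; pod ranges are from this thread.
print(find_overlaps({"nodes": "192.168.1.0/24", "pods": "192.168.0.0/16"}))
print(find_overlaps({"nodes": "192.168.1.0/24", "pods": "172.16.0.0/16"}))
```

The first call reports the node and pod ranges as overlapping, and the second reports no overlap. Adding the cluster's service range to the mapping would extend the same check to it.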

  • serewicz (Posts: 1,000)

    Thanks for the feedback. I think if you check out Exercise 3.1, step 10 it speaks to your issue specifically. It is important to read each step, more than just the command being run.

    Changing the IP pools after initialization is nearly impossible, and most people rebuild the cluster rather than track down every possible place the information is used.

  • Hi @serewicz, thanks. You're right. I read everything, and step 10 shows exactly the IP range 192.168.0.0/16, but I did not know about the overlap risk and the exercise does not warn about it. Maybe a warning there could help others in the future. My fault; network configuration has a lot of pitfalls for me.
