Lab 9.1 (Services and Endpoints) - Master can't curl to endpoints on Worker/Node

psyrus · May 2021

Hi, I was just going through all the labs, got to 9.1 and have hit an interesting problem.

When I create the NGINX deployment, then create the endpoints using the command:
kubectl -n accounting expose deployment nginx-one

It works fine, and I get the endpoints as below:

[username@k8smaster1 ~]$ kubectl -n accounting get endpoints nginx-one 
NAME        ENDPOINTS                                 AGE
nginx-one   192.168.249.49:8080,192.168.249.50:8080   12m

The problem is that curl only works from the worker node, not from the master node:

[username@k8smaster1 ~]$ curl -l 192.168.249.49:80
^C (after several minutes)

Whereas on the worker:

[username@k8snode1 ~]$ curl -l 192.168.249.49:80
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

I assume this has something to do with my networking setup, but since the kubectl commands on the master can return the information, and I can ping back and forth no problem from the two virtual machines, I'm a little stumped as to what would be causing this issue.

In all honesty, I'm not great on the networking side, so perhaps there is something wrong with the Calico networking that is set up, but I would really appreciate it if anyone could point me in some troubleshooting directions for this.

chrispokorni · May 2021

Hi @psyrus,

Such timeouts are common when the VM-to-VM (or node-to-node) networking is not properly configured at the infrastructure level, outside of the Kubernetes cluster.

What are you using as infrastructure for your cluster? Are you in the cloud or a local hypervisor?

Regards,
-Chris

psyrus · May 2021

Hi @chrispokorni I am using virtual machines hosted in Azure.

They are on the same VNET, on the same subnet with no networking restrictions in place between the nodes.

The Azure layer networking has a CIDR range of:
10.0.22.0/23

The K8s calico setup is such that:

[username@k8smaster1 ~]$ kubectl get configmaps -n kube-system kubeadm-config -o yaml
apiVersion: v1
data:
  ClusterConfiguration: |
    apiServer:
      extraArgs:
        authorization-mode: Node,RBAC
      timeoutForControlPlane: 4m0s
    apiVersion: kubeadm.k8s.io/v1beta2
    certificatesDir: /etc/kubernetes/pki
    clusterName: kubernetes
    controlPlaneEndpoint: k8smaster:6443
    controllerManager: {}
    dns:
      type: CoreDNS
    etcd:
      local:
        dataDir: /var/lib/etcd
    imageRepository: k8s.gcr.io
    kind: ClusterConfiguration
    kubernetesVersion: v1.20.6
    networking:
      dnsDomain: cluster.local
      podSubnet: 192.168.0.0/16
      serviceSubnet: 10.96.0.0/12
    scheduler: {}
  ClusterStatus: |
    apiEndpoints:
      k8smaster1:
        advertiseAddress: 10.0.22.4
        bindPort: 6443
    apiVersion: kubeadm.k8s.io/v1beta2
    kind: ClusterStatus
kind: ConfigMap
metadata:
  creationTimestamp: "2021-04-08T04:49:09Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:ClusterConfiguration: {}
        f:ClusterStatus: {}
    manager: kubeadm
    operation: Update
    time: "2021-04-08T04:49:09Z"
  name: kubeadm-config
  namespace: kube-system
  resourceVersion: "10235"
  uid: 198fd9ad-8fe7-4b01-93b5-2ecb087dfdb0

I am curious if the service subnet is the problem. Both the nodes are available from the k8s perspective:

[username@k8smaster1 ~]$ kubectl get nodes
NAME         STATUS   ROLES                  AGE   VERSION
k8smaster1   Ready    control-plane,master   39d   v1.20.6
k8snode1     Ready    <none>                 39d   v1.20.6

chrispokorni · May 2021

Hi @psyrus,

As long as your networks do not overlap (Pods, Nodes, Services) your cluster should be fine.
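A minimal sketch of that overlap check, using the three ranges posted above (nodes 10.0.22.0/23, pods 192.168.0.0/16, services 10.96.0.0/12). The function names are hypothetical helpers, not part of kubectl or Calico, and the script assumes IPv4 CIDRs only:

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1            # split the address on dots into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# First (network) address of a CIDR block, as an integer.
cidr_first() {
  addr=$(ip_to_int "${1%/*}")
  prefix=${1#*/}
  mask=$(( (4294967295 << (32 - $prefix)) & 4294967295 ))
  echo $(( $addr & $mask ))
}

# Last address of a CIDR block, as an integer.
cidr_last() {
  first=$(cidr_first "$1")
  prefix=${1#*/}
  echo $(( $first + (1 << (32 - $prefix)) - 1 ))
}

# Succeeds (exit 0) if the two CIDR blocks share at least one address.
cidrs_overlap() {
  [ "$(cidr_first "$1")" -le "$(cidr_last "$2")" ] &&
  [ "$(cidr_first "$2")" -le "$(cidr_last "$1")" ]
}

# Compare every pair of the node, pod, and service ranges from this thread.
for pair in "10.0.22.0/23 192.168.0.0/16" \
            "10.0.22.0/23 10.96.0.0/12" \
            "192.168.0.0/16 10.96.0.0/12"; do
  set -- $pair
  if cidrs_overlap "$1" "$2"; then
    echo "OVERLAP: $1 $2"
  else
    echo "ok: $1 $2"
  fi
done
```

For the ranges in this thread all three pairs report "ok", so overlapping subnets would not explain the timeout here.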

On Azure the network plugin may not behave as expected due to Azure's specific network implementation. For the Calico network plugin you can find Azure specific installation details, or you may search the forum for solutions posted by learners who completed the labs on Azure.

Regards,
-Chris

psyrus · May 2021

Thanks @chrispokorni, I will check the Azure-specific stuff. I hadn't thought of that because I figured it's all virtualized and shouldn't make a difference unless I wanted to take specific advantage of Azure's networking 'special features'. Will post back after some investigation.

psyrus · May 2021

@chrispokorni FYI I am not using AKS-Engine, and instead running directly on virtual machines, and as such there is nothing special that needs to be done from a Calico networking perspective.

Quoted below for completeness:
https://docs.projectcalico.org/getting-started/kubernetes/self-managed-public-cloud/azure#other-options-and-tools

Other options and tools

Calico networking

You can also deploy Calico for both networking and policy enforcement. In this mode, Calico uses a VXLAN-based overlay network that masks the IP addresses of the pods from the underlying Azure VNET. This can be useful in large deployments or when running multiple clusters and IP address space is a big concern.

Unfortunately, aks-engine does not support this mode, so you must use a different tool chain to install and manage the cluster. Some options:

Use Terraform to provision the Azure networks and VMs, then kubeadm to install the Kubernetes cluster.
Use Kubespray

Are there any tips/tricks (kubectl interrogation commands) or Linux-level checks that I could run to root out the cause of the "iptables proxy mode" (as far as I can tell) misbehaving?
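One possible starting point, as a rough sketch rather than a definitive procedure: probe the pod address from each node with a short timeout and see how curl fails. The helper names `explain_curl_exit` and `probe` are hypothetical, and the hints are heuristics only. curl exit status 28 means the operation timed out (consistent with the hang on the master above), while 7 means the connection attempt failed outright.

```shell
# Map a curl exit status to a likely cause (heuristic, not exhaustive).
explain_curl_exit() {
  case "$1" in
    0)  echo "reachable" ;;
    7)  echo "connect failed: refused or unreachable, check the port" ;;
    28) echo "timeout: packets are probably being dropped on the way" ;;
    *)  echo "other curl error: $1" ;;
  esac
}

# Probe one address:port with a 5-second connect timeout
# instead of waiting several minutes and pressing Ctrl-C.
probe() {
  curl --silent --output /dev/null --connect-timeout 5 "http://$1/"
  explain_curl_exit "$?"
}

# Example, to be run on each node in turn (needs the live cluster):
#   kubectl -n accounting get pods -o wide   # shows which node hosts each pod IP
#   probe 192.168.249.49:80
```

If the probe succeeds on the node that hosts the pod but times out on every other node, the traffic is being lost between the VMs rather than inside the pod, which points at the overlay or the cloud network rather than at the Service definition.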

psyrus · May 2021

Thanks for the information guys, although it doesn't really help with tracking down the problems. There really should be some way to troubleshoot these networking issues from the k8s side, but perhaps that's not what the forum is about. I will just deal with this myself.

chrispokorni · May 2021

Hi @psyrus,

You may also try one of my prior suggestions:

search the forum for solutions posted by learners who completed the labs on Azure

Between the LFS258 and LFD259 forums I am sure that lessons learned and suggestions have been shared for the benefit of future learners.

Regards,
-Chris

lhensley · December 2022

Commenting here to say that I am currently facing an identical issue, but on GCE. @psyrus did you ever come up with a solution?

chrispokorni · December 2022

Hi @lhensley,

When provisioning the GCE instances, did you happen to follow the video guide from the introductory chapter? It also covers the VPC networking requirements and firewalls needed to enable traffic between your instances.

Regards,
-Chris

lhensley · December 2022

That's probably it. I was initially trying to use local hardware and after pretty quickly coming upon some unexpected behavior decided to use GCE instead - but did not backtrack and follow those steps closely.

pavelhradil · April 2024

Hi, I had the same problem trying to run Kubernetes on Azure VMs with Ubuntu Server.
I have one CP node and two worker nodes. No node can reach the pods running on other nodes with curl. I'm using Calico for pod networking. The problem is twofold: the Calico setup and Azure routing (or lack thereof).
To fix...
TL;DR: disable BIRD in the Calico setup and create an Azure routing table to map routes between internal node IPs and VM IPs.

See the full description with a picture on Stack Overflow: https://stackoverflow.com/questions/60222243/calico-k8s-on-azure-cant-access-pods
