Welcome to the Linux Foundation Forum!

Pod network across nodes does not work

I followed the installation procedure of labs 3.1 to 3.3 closely. Everything looks fine, but whenever I try to establish a network connection between a pod on one node and a pod on another node, it fails. The calico-node pods are up and running, and I don't see any error messages in their logs.
Running calicoctl node status on the cp node returns:

Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+--------------+-------------------+-------+----------+-------------+
| 10.0.0.7     | node-to-node mesh | up    | 13:15:03 | Established |
+--------------+-------------------+-------+----------+-------------+

IPv6 BGP status
No IPv6 peers found.

For the worker node, I get

Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+--------------+-------------------+-------+----------+-------------+
| 10.0.0.6     | node-to-node mesh | up    | 13:15:03 | Established |
+--------------+-------------------+-------+----------+-------------+

IPv6 BGP status
No IPv6 peers found.

On the cp node, ip route returns:

default via 10.0.0.1 dev eth0 proto dhcp src 10.0.0.6 metric 100 
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.6 
168.63.129.16 via 10.0.0.1 dev eth0 proto dhcp src 10.0.0.6 metric 100 
169.254.169.254 via 10.0.0.1 dev eth0 proto dhcp src 10.0.0.6 metric 100 
blackhole 192.168.74.128/26 proto bird 
192.168.74.136 dev calie739583d8fa scope link 
192.168.74.137 dev cali9270933bb0b scope link 
192.168.74.138 dev cali73bd7dd6478 scope link 
192.168.74.139 dev cali3344860a0ad scope link 
192.168.189.64/26 via 10.0.0.7 dev tunl0 proto bird onlink 

On the worker node I see:

default via 10.0.0.1 dev eth0 proto dhcp src 10.0.0.7 metric 100 
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.7 
168.63.129.16 via 10.0.0.1 dev eth0 proto dhcp src 10.0.0.7 metric 100 
169.254.169.254 via 10.0.0.1 dev eth0 proto dhcp src 10.0.0.7 metric 100 
192.168.74.128/26 via 10.0.0.6 dev tunl0 proto bird onlink 
blackhole 192.168.189.64/26 proto bird 
192.168.189.76 dev cali3140fc1dafd scope link 
192.168.189.77 dev calia35b901ce89 scope link 

calicoctl get workloadendpoints -A returns:

NAMESPACE     WORKLOAD                                   NODE      NETWORKS            INTERFACE         
accounting    nginx-one-575f648647-j2rwh                 worker2   192.168.189.77/32   calia35b901ce89   
accounting    nginx-one-575f648647-x5c5c                 worker2   192.168.189.76/32   cali3140fc1dafd   
default       bb2                                        k8scp     192.168.74.137/32   cali9270933bb0b   
kube-system   calico-kube-controllers-5f6cfd688c-h29qd   k8scp     192.168.74.136/32   calie739583d8fa   
kube-system   coredns-74ff55c5b-69n8g                    k8scp     192.168.74.139/32   cali3344860a0ad   
kube-system   coredns-74ff55c5b-bngtf                    k8scp     192.168.74.138/32   cali73bd7dd6478   

The example from lab 9.1 is deployed. In addition, I created the pod bb2, containing busybox, for debugging purposes. The problem became obvious when I tried to curl the nginx pods: this only works when logged into the worker node.
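
The reachability check I ran by hand boils down to a TCP connect. A minimal sketch (the pod IP 192.168.189.76 and port 80 are taken from my cluster above; adjust for yours):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 192.168.189.76:80 is one of the nginx pods on worker2 in my cluster.
# Run from worker2 this succeeds; run from the cp node it does not.
print(can_reach("192.168.189.76", 80))
```

Running this from each node (or from a pod on each node) reproduces the curl observation without depending on curl being installed in the image.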

This is my second cluster. I named the cp node k8scp and the worker worker2; in my first cluster they are still called master and worker. The issue occurs in both clusters. The first one was set up with Docker, the second with CRI-O.

The whole setup runs on VMs on Azure.

Is there anything obvious I missed?

One thing that appears odd to me is that the pods do not get addresses from the PodCIDR range of the corresponding node. If I run kubectl describe node k8scp | grep PodCIDR, I get

PodCIDR:                      192.168.0.0/24
PodCIDRs:                     192.168.0.0/24

The pods on that node are in 192.168.74.128/26, though, as ip route shows. Is that normal?
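
As far as I understand, Calico's own IPAM hands out /26 blocks from its IP pools rather than from the Kubernetes-assigned PodCIDR, so the mismatch itself may be expected. The subnet math can be checked directly; a sketch using Python's ipaddress module (192.168.0.0/16 is Calico's default IP pool, which is an assumption about this install; calicoctl get ippool -o wide would show the actual pool):

```python
import ipaddress

node_pod_cidr = ipaddress.ip_network("192.168.0.0/24")     # from kubectl describe node k8scp
calico_pool = ipaddress.ip_network("192.168.0.0/16")       # Calico's default IPPool (assumed)
pod_block = ipaddress.ip_network("192.168.74.128/26")      # blackhole block from ip route on the cp node

# The pods' /26 block lies inside the Calico pool, but not inside the node's PodCIDR:
print(pod_block.subnet_of(calico_pool))     # True
print(pod_block.subnet_of(node_pod_cidr))   # False
```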

Comments

  • chrispokorni Posts: 2,346

    Hi @deissnerk,

    Azure is not a recommended or supported environment for labs in this course. However, there are learners who ran lab exercises on Azure and shared their findings in the forum. You may use the search option of the forum to locate them for reference.

    Regards,
    Chris

  • Thanks for the quick response @chrispokorni. I suppose I'm running into similar issues as @luis-garza has been describing here.
    In the beginning of lab 3.1 it is stated:

    The labs were written using Ubuntu instances running on GoogleCloudPlatform (GCP). They have been written to be vendor-agnostic so could run on AWS, local hardware, or inside of virtualization to give you the most flexibility and options.

    I didn't read this as a clear recommendation. After all, it should just be a matter of two Ubuntu VMs in an IP subnet. I was prepared to figure out some Azure specifics on my own, but an incompatibility on this level comes as a surprise to me. A warning in section 3.1 that the components used in the lab might have compatibility issues with other cloud providers would be helpful.

    Regards,

    Klaus

  • joov Posts: 10

    Got the same problem on AWS:

    • all firewalls on cp and worker node disabled
    • all input / output traffic enabled

    Any help?

  • Hi @joov,

    On AWS the VPC and Security Group configurations directly impact the cluster networking. If you have not done so already, I would invite you to watch the video "Using AWS to set up labs" found in the introductory chapter of this course. The video outlines important settings needed to enable the networking of your cluster.

    Also, when provisioning the second EC2 instance, make sure it is placed in the same VPC subnet, and under the same SG as the first instance.

    Regards,
    -Chris

  • joov Posts: 10

    I followed the video and got it working. Thank you.
