Part 1: NSX-T routing deep dive - How a stateful service drastically changes routing


I am going to walk through a south-to-north packet walk for two different topologies: a T0 in ECMP with a T1 that has no stateful services, and a T0 in ECMP with a T1 that has stateful services.

One of the main purposes of this post is to point out that when two conditions are met for a T1, a T1-SR is created on the associated Edge Cluster, which stops you from taking full advantage of ECMP for that T1 router.

Two conditions must be met for a T1-SR to be created on your Edge Node Cluster:
- T1 must be connected to a T0
- T1 must be associated with an Edge Cluster

Now, this is something that some people may do without realising its impact. Typically they will just create a T1 and connect it to the T0 during creation, and as part of creation you have the option to associate your T1 with an Edge Cluster.

Now, you may select the Edge Cluster thinking you do not need stateful services right now but may in the future. You should not select an Edge Cluster unless there is currently a requirement for stateful services. You can always come back and associate the T1 with an Edge Cluster when a requirement for stateful services arises (T1 Firewall, NAT, LB etc).

My testing showed the network interruption to be minimal when associating a T1 with an Edge Cluster after deployment, with active north/south and east/west traffic flowing. TCP sessions were maintained. On some occasions a single packet was dropped, and on others no packets were dropped at all.

You may think “But I didn’t set up any stateful services?” Well, when the conditions mentioned above are met, a T1-SR is created, and the stateful service that is enabled by default is the T1 Edge Firewall. Disabling the firewall on the T1 after creation will not remove the T1-SR. I will explain how this affects the routing for a T1 below.

First let’s understand what a traffic flow would look like without a T1 stateful service.

Topology A: T0 in ECMP and a T1 router with no stateful services (T1 connected to T0 but our T1 is NOT associated with an Edge Cluster).

If we look at ESXI-A above, we will see that two Logical Routers exist within the current topology. T1-DR and T0-DR.

  Logical Routers Summary
 ------------------------------------------------------------
                 VDR UUID                 LIF num  Route num
   af33010c-6c43-4e09-ad2d-807be3ee83e6      5        6   <----------- T1-DR
   4fa7d77d-324e-4325-adc0-543577b3c749      6        24  <----------- T0-DR

We are going to look at the forwarding table for this T1-DR to understand the traffic path when we send traffic from a VM to the Internet.

T1-DR Forwarding Table
 Logical Routers Forwarding Table
 --------------------------------------------------------------------------------------------------------------
 Flags Legend: [U: Up], [G: Gateway], [C: Connected], [I: Interface]
 [H: Host], [R: Reject], [B: Blackhole], [F: Soft Flush], [E: ECMP]
  
                    Network                               Gateway                Type               Interface UUID
 ==============================================================================================================
 0.0.0.0/0                                              100.64.128.8              UG     b0143d2d-478d-408c-b5a5-ef734da070d7 

What we can see above is that the T1-DR’s default route points to 100.64.128.8. This address sits on a /31 subnet taken from the reserved 100.64.0.0 range, which is used for the interconnect between a T1 and a T0. If we look at the interfaces of the T0-DR running within the hypervisor, we will see the 100.64.128.8 address.
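As a quick check of the /31 arithmetic, here is a minimal sketch using Python's standard ipaddress module. The 100.64.128.8/31 value comes from the output in this post. The idea that the other address in the /31 belongs to the T1 side of the link is my inference, and the 100.64.0.0/10 block is the RFC 6598 shared address space, not something read from NSX-T.

```python
import ipaddress

# T0-DR side of the T0/T1 link, taken from the output in this post.
t0_side = ipaddress.ip_interface("100.64.128.8/31")
link_net = t0_side.network

# A /31 holds exactly two addresses, one per end of the point-to-point link.
addresses = [str(addr) for addr in link_net]

# Inference (not shown in the post): the remaining address is the T1 side.
assumed_t1_side = [a for a in addresses if a != str(t0_side.ip)][0]

# RFC 6598 shared address space, which contains the 100.64.x.x addresses.
shared_space = ipaddress.ip_network("100.64.0.0/10")

print(link_net)                          # 100.64.128.8/31
print(addresses)                         # ['100.64.128.8', '100.64.128.9']
print(assumed_t1_side)                   # 100.64.128.9
print(link_net.subnet_of(shared_space))  # True
```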

Below I have run the command to show the interface of the T0-DR which the T1-DR uses as its default route. I removed some of the output of the command so we can focus on the specific interface we are interested in.

T0-DR Interface
 Logical Router Interfaces
 ---------------------------------------------------------------------------
 LIF UUID                 : ce451134-7be8-46a9-b7ae-190035b5de95
 Mode                     : [b'Routing-LinkLif']
 Overlay VNI              : 71698
 IP/Mask                  : 100.64.128.8/31;  fe80::50:56ff:fe56:4452/128; fc25:93fe:3ac1:c804::1/64
 Mac                      : 02:50:56:56:44:52
 Connected DVS            : NVDS1
 Control plane enable     : True
 Replication Mode         : 0.0.0.1
 State                    : [b'Enabled']
 Flags                    : 0x8308
 DHCP relay               : Not enable 

Now if we look at the forwarding table of the T0-DR which is running in the hypervisor, we can see it has two default routes.

T0-DR Forwarding Table
 Logical Routers Forwarding Table
 --------------------------------------------------------------------------------------------------------------
 Flags Legend: [U: Up], [G: Gateway], [C: Connected], [I: Interface]
 [H: Host], [R: Reject], [B: Blackhole], [F: Soft Flush], [E: ECMP]
  
                    Network                               Gateway                Type               Interface UUID
 ==============================================================================================================
 0.0.0.0/0                                              169.254.0.2              UGE     105bca47-9671-4079-89cb-e00936764916
 0.0.0.0/0                                              169.254.0.3              UGE     105bca47-9671-4079-89cb-e00936764916 

.2 is the IP associated with the bp-sr0-port on the T0-SR instance on Edge-A, and .3 is associated with the T0-SR instance on Edge-B. In the “Type” column, the Gateway (G) and ECMP (E) flags show that we are doing ECMP across the two default routes.
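If you want to check for ECMP routes without reading the table by eye, a small parser can help. The sketch below is a hypothetical helper, not an NSX-T tool. It assumes each route row has exactly four whitespace-separated columns (network, gateway, flags, interface UUID), as in the output pasted in this post.

```python
SAMPLE = """\
 Flags Legend: [U: Up], [G: Gateway], [C: Connected], [I: Interface]
 [H: Host], [R: Reject], [B: Blackhole], [F: Soft Flush], [E: ECMP]
        Network          Gateway        Type      Interface UUID
 ===============================================================
 0.0.0.0/0   169.254.0.2   UGE   105bca47-9671-4079-89cb-e00936764916
 0.0.0.0/0   169.254.0.3   UGE   105bca47-9671-4079-89cb-e00936764916
"""


def parse_routes(text):
    """Return route rows as dicts, skipping legend and header lines."""
    routes = []
    for line in text.splitlines():
        fields = line.split()
        # Assumed row shape: network, gateway, flags, interface UUID.
        if len(fields) == 4 and "/" in fields[0]:
            network, gateway, flags, lif = fields
            routes.append(
                {"network": network, "gateway": gateway, "flags": flags, "lif": lif}
            )
    return routes


def ecmp_gateways(routes, network):
    """Gateways for `network` whose flags include E (ECMP)."""
    return [r["gateway"] for r in routes if r["network"] == network and "E" in r["flags"]]


print(ecmp_gateways(parse_routes(SAMPLE), "0.0.0.0/0"))
# ['169.254.0.2', '169.254.0.3']
```

Feeding it the T1-DR row from earlier (flags UG, a single gateway) returns an empty list, which matches the single default route seen there.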

Simply put, the T0-DR will distribute northbound traffic across the two T0-SR instances which span the two Edge Transport Nodes in the Edge Cluster.
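To make "distribute" more concrete, here is a minimal sketch of per-flow ECMP next-hop selection. The hash inputs and the CRC32 function are assumptions chosen for illustration, and I have not confirmed which hash NSX-T actually uses. The point is only that a given flow always maps to the same next hop while different flows can land on different ones.

```python
import zlib

# The two T0-SR backplane addresses from the T0-DR forwarding table.
NEXT_HOPS = ["169.254.0.2", "169.254.0.3"]


def pick_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops=NEXT_HOPS):
    """Illustrative flow hash: CRC32 over the 5-tuple, modulo the next-hop count.

    This is an assumed scheme for demonstration, not the NSX-T data path.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


# Hypothetical flows from one VM to one Internet host on different ports.
for dst_port in (80, 84, 443):
    print(dst_port, pick_next_hop("172.16.10.10", "8.8.8.8", "tcp", 40000, dst_port))
```

Because the choice is made per flow rather than per packet, the packets of a single TCP session all follow the same T0-SR.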

Below I have dumped the interfaces of the T0-SR instance running on Edge-A. You can see the interface “bp-sr0-port” has the IP 169.254.0.2. The T0-SR instance on Edge-B will have the .3 IP on its bp-sr-port.

Edge-A - T0-SR
 Logical Router
 UUID                                   VRF    LR-ID  Name                              Type
 33aa3f1f-c5cc-4f4f-9f34-37e800b0bbbd   5      8194   SR-T0-carrot                      SERVICE_ROUTER_TIER0
 Interfaces

     Interface     : 37b97dc0-c12e-4b5d-8d69-07cd9ccac830
     Ifuid         : 334
     Name          : bp-sr0-port
     Mode          : lif
     IP/Mask       : 169.254.0.2/25;fe80::50:56ff:fe56:5300/64
     MAC           : 02:50:56:56:53:00
     VNI           : 71687
     LS port       : 9d226528-49db-4b1e-a958-8a8c103fc7fd
     Urpf-mode     : NONE
     Admin         : up
     Op_state      : up
     MTU           : 1500 

Now we look at the forwarding table of the T0-SR on Edge-A to see how it will forward traffic northbound.

T0-SR Forwarding Table on Edge-A
 
 Flags: t0c - Tier0-Connected, t0s - Tier0-Static, B - BGP,
 t0n - Tier0-NAT, t1s - Tier1-Static, t1c - Tier1-Connected,
 t1n: Tier1-NAT, t1l: Tier1-LB VIP, t1ls: Tier1-LB SNAT,
 t1d: Tier1-DNS FORWARDER, > - selected route, * - FIB route
  
  
 b  > * 0.0.0.0/0 [20/0] via 10.10.11.1, uplink-332, 01:12:29
 b  > * 0.0.0.0/0 [20/0] via 10.10.12.2, uplink-338, 01:12:29 

The 10.10.11.x and 10.10.12.x IPs belong to the two TORs the T0 is peering with, and we can see both routes are selected because ECMP has been enabled.
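The same check can be scripted for the Edge output. The sketch below is a hypothetical helper that assumes route lines look exactly like the two pasted above. It groups the selected next hops by prefix, so a prefix with more than one entry is being load-balanced. I have labelled the two numbers in brackets as distance and metric, which is my reading of the output rather than something stated in it.

```python
import re
from collections import defaultdict

SAMPLE = """\
 Flags: t0c - Tier0-Connected, t0s - Tier0-Static, B - BGP,
 t1d: Tier1-DNS FORWARDER, > - selected route, * - FIB route

 b  > * 0.0.0.0/0 [20/0] via 10.10.11.1, uplink-332, 01:12:29
 b  > * 0.0.0.0/0 [20/0] via 10.10.12.2, uplink-338, 01:12:29
"""

# Assumed line shape: code, ">" (selected), "*" (FIB), prefix, [distance/metric],
# "via <next hop>, <interface>, <age>".
ROUTE_RE = re.compile(
    r"^\s*(?P<code>\S+)\s+>\s+\*\s+(?P<prefix>\S+)\s+"
    r"\[(?P<distance>\d+)/(?P<metric>\d+)\]\s+"
    r"via\s+(?P<next_hop>[0-9.]+),\s+(?P<interface>[^,\s]+),"
)


def selected_next_hops(text):
    """Map each prefix to the (next hop, interface) pairs installed for it."""
    hops = defaultdict(list)
    for line in text.splitlines():
        match = ROUTE_RE.match(line)
        if match:
            hops[match["prefix"]].append((match["next_hop"], match["interface"]))
    return dict(hops)


print(selected_next_hops(SAMPLE))
# {'0.0.0.0/0': [('10.10.11.1', 'uplink-332'), ('10.10.12.2', 'uplink-338')]}
```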

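Summary of what we have covered so far for Topology A (T1 with no stateful services, connected to a T0 configured in ECMP): the T1-DR on the hypervisor sends northbound traffic to the T0-DR on the same hypervisor, the T0-DR balances it across the T0-SR instances on Edge-A and Edge-B, and the T0-SR (shown here for Edge-A) balances it across the two TORs.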

Let’s look at how having a T1-SR changes this. Assume both conditions for creating a T1-SR have now been met.
Conditions:
- T1 must be connected to a T0
- T1 must be associated with an Edge Cluster

T1 is now associated with an Edge Cluster

In Topology B below, the T0-DR that was on ESXI-A no longer exists, and an active T1-SR has been created on Edge-A.

Topology B

Logical Routers on ESXI-A 
                  Logical Routers Summary
 ------------------------------------------------------------
                 VDR UUID                 LIF num  Route num
   af33010c-6c43-4e09-ad2d-807be3ee83e6      5        1 <----------- T1-DR

If we look at our Edge nodes in Topology B we will see a T1-SR has been created on each of them. T1 Service Routers are deployed as active/standby. The active T1-SR has been created on Edge-A and standby T1-SR on Edge-B.

Below I have dumped the output of the get logical-router command showing the running SR/DR instances on Edge-A and B. You can see the new T1-SR has been created.

Edge-A:
 Logical Router
 UUID                                   VRF    LR-ID  Name                              Type                        Ports
 736a80e3-23f6-5a2d-81d6-bbefb2786666   0      0                                        TUNNEL                      3
 4fa7d77d-324e-4325-adc0-543577b3c749   2      8193   DR-T0-carrot                      DISTRIBUTED_ROUTER_TIER0    8
 33aa3f1f-c5cc-4f4f-9f34-37e800b0bbbd   5      8194   SR-T0-carrot                      SERVICE_ROUTER_TIER0        7
 af33010c-6c43-4e09-ad2d-807be3ee83e6   8      10241  DR-T1-spinach                     DISTRIBUTED_ROUTER_TIER1    7
 f9a16c9a-942b-40b5-866b-d3c971a76cc9   9      13316  SR-T1-spinach                     SERVICE_ROUTER_TIER1        5
  
 Edge-B:
 Logical Router
 UUID                                   VRF    LR-ID  Name                              Type                        Ports
 736a80e3-23f6-5a2d-81d6-bbefb2786666   0      0                                        TUNNEL                      3
 b3407be1-e52a-478d-9601-2f820129ce15   1      13313  SR-T0-carrot                      SERVICE_ROUTER_TIER0        7
 4fa7d77d-324e-4325-adc0-543577b3c749   2      8193   DR-T0-carrot                      DISTRIBUTED_ROUTER_TIER0    8
 af33010c-6c43-4e09-ad2d-807be3ee83e6   5      10241  DR-T1-spinach                     DISTRIBUTED_ROUTER_TIER1    7
 f9a16c9a-942b-40b5-866b-d3c971a76cc9   6      13316  SR-T1-spinach                     SERVICE_ROUTER_TIER1        5 
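To spot a T1-SR across many Edge nodes without reading each table, the output above can be parsed. This is a hypothetical helper that assumes the column layout shown in this post, including the TUNNEL row, which has an empty Name column.

```python
import re

EDGE_A = """\
 UUID                                   VRF    LR-ID  Name            Type                        Ports
 736a80e3-23f6-5a2d-81d6-bbefb2786666   0      0                      TUNNEL                      3
 33aa3f1f-c5cc-4f4f-9f34-37e800b0bbbd   5      8194   SR-T0-carrot    SERVICE_ROUTER_TIER0        7
 af33010c-6c43-4e09-ad2d-807be3ee83e6   8      10241  DR-T1-spinach   DISTRIBUTED_ROUTER_TIER1    7
 f9a16c9a-942b-40b5-866b-d3c971a76cc9   9      13316  SR-T1-spinach   SERVICE_ROUTER_TIER1        5
"""

UUID_RE = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$")


def parse_logical_routers(text):
    """Return one dict per logical router row; header lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or not UUID_RE.match(fields[0]):
            continue
        if len(fields) == 6:
            uuid, vrf, lr_id, name, lr_type, ports = fields
        elif len(fields) == 5:
            # Row with an empty Name column, such as TUNNEL.
            uuid, vrf, lr_id, lr_type, ports = fields
            name = ""
        else:
            continue
        rows.append({"uuid": uuid, "name": name, "type": lr_type, "ports": int(ports)})
    return rows


def t1_service_routers(rows):
    """Names of the Tier-1 service routers present on this Edge node."""
    return [r["name"] for r in rows if r["type"] == "SERVICE_ROUTER_TIER1"]


print(t1_service_routers(parse_logical_routers(EDGE_A)))
# ['SR-T1-spinach']
```

An empty list for every Edge node in the cluster would indicate that no T1-SR instances exist there.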

Figure A

Figure A shows that traffic from the T1-DR is no longer distributed across Edge nodes A and B. Instead, it is forwarded only to the active T1-SR instance on Edge-A.

Figure B

We can see in Figure B that the T1-SR has a default route to IP 100.64.128.0/31 with MAC 02:50:56:56:44:52 (this is the IP/MAC of the local T0-DR running within the same Edge node). Traffic following the default route leaves via the pink interface 2259c379-78b1-4ff4-9b91-3234693d1bbe, which is used for connectivity between the Tier-0 and the Tier-1.

Now there is no ECMP occurring when traffic is heading northbound from the hypervisor DR to the Edge Nodes. Instead, it is being pinned to the specific edge with the active T1-SR router.

Let’s take a look at how the T0-DR is forwarding traffic it receives from the T1-SR.

Figure C

The forwarding table of the T0-DR has two default routes (TOR-B - 10.10.12.2 and TOR-A - 10.10.11.1).

The T0-DR will forward the traffic out of interfaces e1493ed9-73e6-41a0-ac1f-3d6349a5ac08 and 61826aca-54ef-400e-935b-96eece0fbc6b on the T0-SR to get there. These two interfaces are the uplinks on the local T0-SR which are peering with the TOR switches. The gateway MAC is the MAC address of the specific TOR peer. The T0-SR will balance the traffic across the TORs northbound in ECMP.
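To compare the two topologies side by side, the sketch below enumerates the northbound (Edge node, TOR) combinations available to a flow leaving the hypervisor. It is a simplified model rather than anything read from NSX-T, and it assumes Edge-B peers with the same two TORs as Edge-A, which this post does not show.

```python
TORS = ["10.10.11.1", "10.10.12.2"]

# Uplink next hops per T0-SR. The Edge-B entry is an assumption.
T0_SR_UPLINKS = {"Edge-A": TORS, "Edge-B": TORS}


def northbound_paths(active_t1_sr_edge=None):
    """List the (edge, tor) pairs a northbound flow could take.

    With no T1-SR (Topology A), the hypervisor T0-DR can use every Edge node.
    With a T1-SR (Topology B), traffic is pinned to the Edge node hosting
    the active T1-SR, and only the TOR choice remains.
    """
    edges = [active_t1_sr_edge] if active_t1_sr_edge else list(T0_SR_UPLINKS)
    return [(edge, tor) for edge in edges for tor in T0_SR_UPLINKS[edge]]


topology_a = northbound_paths()
topology_b = northbound_paths(active_t1_sr_edge="Edge-A")

print(len(topology_a), topology_a)
print(len(topology_b), topology_b)
# 4 [('Edge-A', '10.10.11.1'), ('Edge-A', '10.10.12.2'), ('Edge-B', '10.10.11.1'), ('Edge-B', '10.10.12.2')]
# 2 [('Edge-A', '10.10.11.1'), ('Edge-A', '10.10.12.2')]
```

Under these assumptions the T1-SR halves the number of usable paths, and all of them pass through a single Edge node.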

TL;DR summary:

In Part 2 we will take a look at the north to south traffic flow in depth.