Playing with configurations yields a working layer 2 over layer 3 network with Quagga ospfd (infrastructure routes), Quagga bgpd (end-point routes), pimd (multicast routing), openvswitch (layer 2 bridging), and vxlan (layer 2 over layer 3 encapsulation), which reliably handles interface and device failures. Next stop: routing lxc to lxc via vxlan.
This is using the current Debian Stretch: kernel 4.3.5, pimd 2.3.1, quagga 0.99.24, openvswitch 2.3.0.
Test configuration:
          / core001 \
edge003 +             + host001
          \ core002 /
On all four devices:
apt-get install pimd
On core001 and core002, add a vlan for the rp (rendezvous point). [For some reason, pimd/the kernel don't like to add /32 addresses, and I can't get lo to be part of the multicast network, so revert to a vlan as a pseudo-loopback.]
apt-get install openvswitch-switch
ovs-vsctl init
ovs-vsctl add-br ovsbr0
# vlan for pimd rp (rendezvous point)
ovs-vsctl add-port ovsbr0 vlan10 tag=10 -- set interface vlan10 type=internal
ip link set up dev vlan10
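To sanity-check the bridge and the pseudo-loopback port, something like this works (a verification step, not part of the original setup; output will vary):

ovs-vsctl show
ip -d link show vlan10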
on core001:
ip addr add 10.20.1.1/24 dev vlan10
in /etc/pimd.conf:
bsr-candidate 10.20.1.1 priority 120
rp-candidate 10.20.1.1 time 30 priority 100
group-prefix 224.0.0.0 masklen 4
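pimd reads its config at startup, so restart it after editing (assuming the stock Debian service name):

systemctl restart pimd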
bgp config advertising the rp:
router bgp 64512
 bgp router-id 10.2.0.2
 network 10.20.1.0/24
 neighbor 10.2.4.3 remote-as 64512
 neighbor 10.2.4.5 remote-as 64512
 neighbor 10.2.4.5 route-reflector-client
 neighbor 10.2.4.9 remote-as 64512
 neighbor 10.2.4.9 route-reflector-client
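To confirm the rp network is actually advertised, vtysh (shipped with quagga) can be queried; a sketch, exact output will differ:

vtysh -c 'show ip bgp'
vtysh -c 'show ip bgp neighbors 10.2.4.5 advertised-routes'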
on core002:
ip addr add 10.20.2.1/24 dev vlan10
in /etc/pimd.conf:
bsr-candidate 10.20.2.1 priority 120
rp-candidate 10.20.2.1 time 30 priority 100
group-prefix 224.0.0.0 masklen 4
bgp config advertising the rp:
router bgp 64512
 bgp router-id 10.2.0.3
 network 10.20.2.0/24
 neighbor 10.2.4.2 remote-as 64512
 neighbor 10.2.4.7 remote-as 64512
 neighbor 10.2.4.7 route-reflector-client
 neighbor 10.2.4.11 remote-as 64512
 neighbor 10.2.4.11 route-reflector-client
edge003 and host001 have default /etc/pimd.conf files, and basic bgpd and ospfd configurations.
on edge003 create a vxlan endpoint:
ip route add 224.0.0.0/4 nexthop via 10.2.4.8 dev enp3s0 nexthop via 10.2.4.10 dev enp4s0
ip link add vxlan10 type vxlan id 10 group 229.1.1.10 ttl 10 dstport 4789
ip addr add 10.3.0.5/24 dev vxlan10
ip link set up dev vxlan10
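A quick check that the tunnel endpoint and the multicast path came out as intended (verification only, not from the original steps):

ip -d link show vxlan10
ip route get 229.1.1.10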
on host001 create a vxlan endpoint:
ip route add 224.0.0.0/4 nexthop via 10.2.4.4 dev enp3s0 nexthop via 10.2.4.6 dev enp4s0
ip link add vxlan10 type vxlan id 10 group 229.1.1.10 ttl 10 dstport 4789
ip addr add 10.3.0.6/24 dev vxlan10
ip link set up dev vxlan10
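Once traffic flows, the learned inner-MAC-to-VTEP mappings can be inspected on either endpoint (one of the debug aids summarized at the end):

bridge fdb show dev vxlan10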
It takes a few seconds to connect the endpoints:
root@host001# ping 10.3.0.5
PING 10.3.0.5 (10.3.0.5) 56(84) bytes of data.
From 10.3.0.6 icmp_seq=9 Destination Host Unreachable
From 10.3.0.6 icmp_seq=10 Destination Host Unreachable
From 10.3.0.6 icmp_seq=11 Destination Host Unreachable
64 bytes from 10.3.0.5: icmp_seq=12 ttl=64 time=999 ms
64 bytes from 10.3.0.5: icmp_seq=13 ttl=64 time=0.494 ms
64 bytes from 10.3.0.5: icmp_seq=14 ttl=64 time=0.539 ms
64 bytes from 10.3.0.5: icmp_seq=15 ttl=64 time=0.532 ms
64 bytes from 10.3.0.5: icmp_seq=16 ttl=64 time=0.546 ms
On edge003, as per the PIM-SM RFC, pimd starts traffic on the shared tree via the multicast group, then switches to a point-to-point shortest path. VXLAN is the encapsulation: note the initial ARP request goes to the multicast group (229.1.1.10) while everything after is unicast between the VTEPs, and the ping is visible inside the de-encapsulated frames.
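The capture below watches the VXLAN UDP port; a filter along these lines reproduces it (interface name from this testbed, pick whichever uplink carries the flow):

tcpdump -n -i enp3s0 udp port 4789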
12:41:03.257860 IP 10.2.4.5.38407 > 229.1.1.10.4789: VXLAN, flags [I] (0x08), vni 10
ARP, Request who-has 10.3.0.5 tell 10.3.0.6, length 28
12:41:03.257982 IP 10.2.4.9.37643 > 10.2.4.5.4789: VXLAN, flags [I] (0x08), vni 10
ARP, Reply 10.3.0.5 is-at be:c2:0b:0e:67:79 (oui Unknown), length 28
12:41:03.258330 IP 10.2.4.5.51870 > 10.2.4.9.4789: VXLAN, flags [I] (0x08), vni 10
IP 10.3.0.6 > 10.3.0.5: ICMP echo request, id 1295, seq 12, length 64
12:41:03.258395 IP 10.2.4.9.40003 > 10.2.4.5.4789: VXLAN, flags [I] (0x08), vni 10
IP 10.3.0.5 > 10.3.0.6: ICMP echo reply, id 1295, seq 12, length 64
12:41:03.259115 IP 10.2.4.5.51870 > 10.2.4.9.4789: VXLAN, flags [I] (0x08), vni 10
IP 10.3.0.6 > 10.3.0.5: ICMP echo request, id 1295, seq 13, length 64
12:41:03.259169 IP 10.2.4.9.40003 > 10.2.4.5.4789: VXLAN, flags [I] (0x08), vni 10
IP 10.3.0.5 > 10.3.0.6: ICMP echo reply, id 1295, seq 13, length 64
12:41:04.258135 IP 10.2.4.5.51870 > 10.2.4.9.4789: VXLAN, flags [I] (0x08), vni 10
IP 10.3.0.6 > 10.3.0.5: ICMP echo request, id 1295, seq 14, length 64
12:41:04.258231 IP 10.2.4.9.40003 > 10.2.4.5.4789: VXLAN, flags [I] (0x08), vni 10
IP 10.3.0.5 > 10.3.0.6: ICMP echo reply, id 1295, seq 14, length 64
pimd routing:
root@edge003# pimd -r
Virtual Interface Table ======================================================
Vif  Local Address    Subnet              Thresh  Flags      Neighbors
---  ---------------  ------------------  ------  ---------  -----------------
  0  10.2.2.9         10.2.2.8/31              1  DISABLED
  1  10.2.4.9         10.2.4.8/31              1  DR PIM     10.2.4.8
  2  10.2.4.11        10.2.4.10/31             1  DR PIM     10.2.4.10
  3  10.2.4.9         register_vif0            1

Vif  SSM Group        Sources

Multicast Routing Table ======================================================
----------------------------------- (*,G) ------------------------------------
Source           Group            RP Address       Flags
---------------  ---------------  ---------------  ---------------------------
INADDR_ANY       229.1.1.10       10.20.1.1        WC RP
Joined   oifs: ....
Pruned   oifs: ....
Leaves   oifs: .l..
Asserted oifs: ....
Outgoing oifs: .o..
Incoming     : .I..

TIMERS:  Entry    JP    RS  Assert  VIFS:  0  1  2  3
             0    55     0       0         0  0  0  0
----------------------------------- (S,G) ------------------------------------
Source           Group            RP Address       Flags
---------------  ---------------  ---------------  ---------------------------
10.2.4.5         229.1.1.10       10.20.1.1        SG
Joined   oifs: ....
Pruned   oifs: ....
Leaves   oifs: .l..
Asserted oifs: ....
Outgoing oifs: .o..
Incoming     : .I..

TIMERS:  Entry    JP    RS  Assert  VIFS:  0  1  2  3
            40    15     0       0         0  0  0  0
--------------------------------- (*,*,G) ------------------------------------

Number of Groups: 1
Number of Cache MIRRORs: 0
------------------------------------------------------------------------------
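The kernel's view of the same state makes a useful cross-check ('ip mroute' formats the raw /proc/net/ip_mr_cache data mentioned below):

ip mroute show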
I was able to disable interfaces, and pings were routed around the interface failures.
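The failure test was along these lines (a sketch; the original doesn't record the exact commands):

# on edge003, while ping 10.3.0.5 runs on host001:
ip link set down dev enp3s0
# pings re-route via enp4s0 within a few seconds; restore with:
ip link set up dev enp3s0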
Some debug aids. This one shows which interfaces are taking part in multicast routing:
root@core001# cat /proc/net/ip_mr_vif
Interface      BytesIn  PktsIn  BytesOut PktsOut Flags Local    Remote
 1 enp2s0            0       0         0       0 00000 0204020A 0304020A
 2 enp3s0f0        312       4        78       1 00000 0404020A 0504020A
 3 enp3s0f1         78       1       312       4 00000 0804020A 0904020A
 4 vlan10            0       0         0       0 00000 0101140A 00000000
 5 pimreg            0       0         0       0 00004 0204020A 00000000
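The Local/Remote columns are IPv4 addresses in little-endian hex; a throwaway decode (illustrative only), e.g. 0204020A is 10.2.4.2:

python3 -c 'import socket, struct; print(socket.inet_ntoa(struct.pack("<I", 0x0204020A)))'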
This shows the routes in the current kernel table. To be fixed: enp1s0 is a management interface; its default route needs to be removed and replaced with a default towards the core (a sketch of that fix follows the listing). Then it might be possible to remove the specific route for the multicast range. Also note that the BGP-advertised RP networks are in the list, as is the vxlan route; since the core doesn't advertise the vxlan prefix, that traffic goes through the vxlan encapsulation.
root@edge003# ip -d route
unicast default via 10.2.2.8 dev enp1s0  proto boot  scope global
unicast 10.2.0.2 via 10.2.4.8 dev enp3s0  proto zebra  scope global  metric 20000
unicast 10.2.0.3 via 10.2.4.10 dev enp4s0  proto zebra  scope global  metric 20000
unicast 10.2.0.4  proto zebra  scope global  metric 20100
	nexthop via 10.2.4.8  dev enp3s0 weight 1
	nexthop via 10.2.4.10  dev enp4s0 weight 1
unicast 10.2.2.8/31 dev enp1s0  proto kernel  scope link  src 10.2.2.9
unicast 10.2.4.2/31  proto zebra  scope global  metric 10100
	nexthop via 10.2.4.8  dev enp3s0 weight 1
	nexthop via 10.2.4.10  dev enp4s0 weight 1
unicast 10.2.4.4/31 via 10.2.4.8 dev enp3s0  proto zebra  scope global  metric 10100
unicast 10.2.4.6/31 via 10.2.4.10 dev enp4s0  proto zebra  scope global  metric 10100
unicast 10.2.4.8/31 dev enp3s0  proto kernel  scope link  src 10.2.4.9
unicast 10.2.4.10/31 dev enp4s0  proto kernel  scope link  src 10.2.4.11
unicast 10.3.0.0/24 dev vxlan10  proto kernel  scope link  src 10.3.0.5
unicast 10.20.1.0/24 via 10.2.4.8 dev enp3s0  proto zebra  scope global
unicast 10.20.2.0/24 via 10.2.4.10 dev enp4s0  proto zebra  scope global
unicast 224.0.0.0/4  proto boot  scope global
	nexthop via 10.2.4.8  dev enp3s0 weight 1
	nexthop via 10.2.4.10  dev enp4s0 weight 1
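A sketch of the default-route fix described above (untested here; assumes the two core uplinks as nexthops):

ip route del default via 10.2.2.8 dev enp1s0
ip route replace default nexthop via 10.2.4.8 dev enp3s0 nexthop via 10.2.4.10 dev enp4s0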
Un-optimized iperf numbers: not full speed, but reasonable for the lower-end brown box (single-core, no-hyperthread CPU at 75% per top):
root@host001# iperf -c 10.3.0.5 -d
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 10.3.0.5, TCP port 5001
TCP window size:  272 KByte (default)
------------------------------------------------------------
[  3] local 10.3.0.6 port 52262 connected with 10.3.0.5 port 5001
[  5] local 10.3.0.6 port 5001 connected with 10.3.0.5 port 41712
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.04 GBytes   896 Mbits/sec
[  5]  0.0-10.0 sec   660 MBytes   553 Mbits/sec
Summary of useful commands: 'bridge fdb', 'pimd -r', 'cat /proc/net/ip_mr_vif', 'cat /proc/net/ip_mr_cache' (which is the raw data for 'ip mroute')
Some multicast iperf testing (-b bandwidth, -t seconds, -T ttl, -u udp):
root@host001# iperf -u -c 229.0.0.20 -T 5 -t 4 -b 10000000
------------------------------------------------------------
Client connecting to 229.0.0.20, UDP port 5001
Sending 1470 byte datagrams
Setting multicast TTL to 5
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 10.2.4.5 port 40319 connected with 229.0.0.20 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 4.0 sec  4.77 MBytes  10.0 Mbits/sec
[  3] Sent 3403 datagrams
root@edge003# iperf -u -s -B 229.0.0.20 -i 1
------------------------------------------------------------
Server listening on UDP port 5001
Binding to local address 229.0.0.20
Joining multicast group  229.0.0.20
Receiving 1470 byte datagrams
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 229.0.0.20 port 5001 connected with 10.2.4.5 port 40319
[ ID] Interval       Transfer     Bandwidth        Jitter   Lost/Total Datagrams
[  3]  0.0- 1.0 sec  1.19 MBytes  10.0 Mbits/sec   0.052 ms    0/  850 (0%)
[  3]  1.0- 2.0 sec  1.19 MBytes  10.0 Mbits/sec   0.048 ms    0/  851 (0%)
[  3]  2.0- 3.0 sec  1.19 MBytes  10.0 Mbits/sec   0.037 ms    0/  850 (0%)
[  3]  3.0- 4.0 sec  1.19 MBytes  10.0 Mbits/sec   0.058 ms    0/  850 (0%)
[  3]  0.0- 4.0 sec  4.77 MBytes  10.0 Mbits/sec   0.064 ms    0/ 3402 (0%)
[  3]  0.0- 4.0 sec  1 datagrams received out-of-order