Full Layer 3

info

This document is a work in progress.

When we started with metal-stack, we decided to go full layer-3 for the workload dataplane. The inventory and installation process, however, still runs in a layer-2 segment with a traditional DHCP/TFTP/PXE approach.

This works well and does not require manual configuration steps on any component in the datacenter. New servers only need to be powered on: they boot the metal-hammer via DHCP/TFTP/PXE, get registered, and are ready to use.

But this approach has downsides, most notably:

  • Two different network topologies (L2 and L3) in the dataplane.
  • The switch port of a machine must be reconfigured between these two modes whenever a machine changes from registered to installed and back.
  • The DHCP and TFTP servers are deployed in the management network of a partition. Connecting these services to an L2 segment on the leaf switches mixes control-plane (management) and dataplane traffic, which is not ideal from a security perspective.

We were looking for a solution that offers the same convenient and fast process, but entirely within layer-3.

Requirements

An L3 replacement solution must fulfill the following requirements:

  • Clear separation of control-plane (management) and dataplane traffic
  • Same "no-touch" experience for new servers
  • Configurability of metal-hammer version per partition in real time
  • Token-based authentication of the metal-hammer against the metal-apiserver
  • Cache of metal-images accessible from metal-hammer inside a partition
  • Preserve all existing metal-hammer discovery, hardware detection, and provisioning logic
  • Secure network if a machine reclaim goes wrong: ACLs on the switch allow communication only with the control-plane and the boot helper
  • TODO more

High level Architecture

The main idea is based on three concepts:

  • The boot-from-ISO feature of the server BMC firmware, which can be configured remotely via Redfish.
  • Automated IPv6 address acquisition via SLAAC (RFC 4862), driven by Router Advertisements (RFC 4861), instead of DHCP.
  • IPv6 in a dedicated boot VRF instead of a boot VLAN.
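
To make the SLAAC concept concrete, the following minimal sketch shows how a host could form its address from a /64 prefix announced in a Router Advertisement. It assumes the modified EUI-64 interface identifier (RFC 4291, Appendix A); whether iPXE and the metal-hammer kernel actually use this scheme or another one (for example stable privacy addresses) depends on their configuration. The function name, prefix, and MAC address are illustrative only.

```python
import ipaddress


def slaac_eui64_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Combine an advertised /64 prefix with a MAC-derived interface identifier.

    Assumption: the host uses the modified EUI-64 scheme from RFC 4291.
    """
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("SLAAC expects a /64 prefix")

    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")

    # Flip the universal/local bit of the first octet.
    octets[0] ^= 0x02
    # Insert ff:fe between the two halves of the MAC address.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])

    # Offset the network address by the 64-bit interface identifier.
    return network[int.from_bytes(interface_id, "big")]


# Illustrative values: documentation prefix and a locally administered MAC.
print(slaac_eui64_address("2001:db8:0:1::/64", "52:54:00:12:34:56"))
```

Because the address follows from the port's prefix and the machine's MAC under this assumption, it could in principle be predicted before the machine boots, for instance when deriving switch ACLs.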

This approach requires that the metal-apiserver, the metal-hammer, iPXE, and a new component running in the partition and connected to the boot VRF (called boot helper for now) are IPv6 ready.

The L3-only boot and registration process can be described as follows:

  • The metal-bmc regularly scans every server to check whether iPXE is configured as the boot ISO payload. This is an additional task for the metal-bmc, which already scans all servers on a regular basis to gather power metrics etc.
  • If the boot ISO is set to iPXE, the boot order must be set to CDROM instead of PXE from network, and a reboot must be triggered (this applies during migration to this approach, not when a machine is allocated).
  • Once the server is powered on, iPXE is booted from the CDROM presented by the firmware.
  • The production interfaces then get a routable IPv6 address from the switch, which is configured to enable SLAAC and router advertisements. The configured routes must enable the machine to reach the metal-apiserver in the control plane and the boot helper in the partition.
  • The iPXE ISO must contain a boot configuration which chain loads a secondary boot configuration from a known location. To speed up the iPXE startup, the boot.ipxe should disable IPv4 completely, as iPXE will otherwise try DHCP first. Sample:
    #!ipxe
    chain https://v2.metal-stack.dev/<partition>/boot.ipxe || shell
    The secondary boot.ipxe will then contain the same payload as currently delivered by pixiecore. In particular, it contains the configured Linux kernel, the metal-hammer version, the command line, and the URL of the boot helper in the boot VRF.
  • With this, iPXE boots into the metal-hammer, which first contacts the boot helper at the given URL to get a token for accessing the metal-apiserver.
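
The secondary boot.ipxe described above could be rendered per partition roughly as in the following sketch. It is not the existing pixiecore integration: the `BootConfig` type, the `BOOT_HELPER_URL` kernel parameter, and all URLs are hypothetical placeholders. Only the iPXE commands `kernel`, `initrd`, `boot`, and `shell` are taken from iPXE itself.

```python
from dataclasses import dataclass


@dataclass
class BootConfig:
    """Hypothetical per-partition boot settings."""
    kernel_url: str       # Linux kernel to boot
    initrd_url: str       # metal-hammer initrd in the configured version
    cmdline: str          # kernel command line
    boot_helper_url: str  # boot helper endpoint inside the boot VRF


def render_boot_ipxe(cfg: BootConfig) -> str:
    """Render a secondary boot.ipxe script for one partition."""
    lines = [
        "#!ipxe",
        # BOOT_HELPER_URL is an assumed parameter name, not an existing one.
        f"kernel {cfg.kernel_url} {cfg.cmdline} BOOT_HELPER_URL={cfg.boot_helper_url}",
        f"initrd {cfg.initrd_url}",
        # Drop into the iPXE shell if booting fails, as in the sample above.
        "boot || shell",
    ]
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
example = BootConfig(
    kernel_url="https://images.example/metal-kernel",
    initrd_url="https://images.example/metal-hammer-initrd.img",
    cmdline="console=ttyS1,115200n8",
    boot_helper_url="https://boot-helper.example",
)
print(render_boot_ipxe(example))
```

Rendering the script on request, rather than baking it into the ISO, is what would allow the metal-hammer version to change per partition without touching the ISO stored in the firmware.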

Logical View

Sequence Diagram

Implementation

Before we start the implementation, or decide whether this is the right approach, we should verify that the current draft works as expected.

This must be done in several steps:

  • Ensure iPXE can be packed as an ISO image stored in the firmware, booted with DHCP disabled, and get an IP address with routes from a SLAAC-enabled switch.
  • Verify that the initial boot.ipxe can pull a secondary boot.ipxe which contains kernel, image, and cmdline, and that iPXE chain boots it.
  • Check whether iPXE can resolve hostnames to IPv6 addresses.
  • Specify how the boot VRF must be configured on the SONiC side.
  • Specify how the metal-hammer kernel must be configured to accept router advertisements.
  • Decide how the boot VRF is configured on the switch, e.g. which address space is set per port, and whether it is stored in the metal-apiserver and configured by metal-core.
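
For the router advertisement step, a quick check inside a booted metal-hammer could look like the sketch below. It reads the standard Linux per-interface sysctls `accept_ra`, `autoconf`, and `forwarding` and applies their documented semantics. The function names and the interface name `eth0` are assumptions for illustration; the settings metal-hammer will actually need are still to be specified.

```python
from pathlib import Path


def ra_settings(iface: str, root: Path = Path("/proc/sys/net/ipv6/conf")) -> dict[str, int]:
    """Read the sysctls that control Router Advertisement handling for one interface."""
    base = root / iface
    names = ("accept_ra", "autoconf", "forwarding")
    return {name: int((base / name).read_text().strip()) for name in names}


def slaac_possible(settings: dict[str, int]) -> bool:
    """Return True if the kernel would autoconfigure addresses from RAs.

    accept_ra: 0 = ignore RAs, 1 = accept unless forwarding is enabled,
    2 = accept even if forwarding is enabled.
    autoconf: 1 = build addresses from prefix information in RAs.
    """
    if settings["autoconf"] != 1:
        return False
    if settings["accept_ra"] == 2:
        return True
    return settings["accept_ra"] == 1 and settings["forwarding"] == 0


if __name__ == "__main__":
    iface = "eth0"  # assumed interface name
    try:
        print(iface, slaac_possible(ra_settings(iface)))
    except FileNotFoundError:
        print(f"no IPv6 sysctls found for {iface}")
```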

After all these tasks are done, we can proceed and write a more detailed implementation roadmap, including the required changes to the API, the apiserver, and other microservices.