Network

The network is part of the infrastructure layer and provides the interconnection between the different elements of the infrastructure layer within and between the datacentres. In the context of ZOIS it does not cover the Internet, which is considered the medium between the client location and the DCs hosting the business service, nor does it cover any details of how the client accesses the Internet.

The characteristics of the network need to fit into the overall ZOIS architecture but the specific implementation will depend on the requirements of the business service which needs to be implemented.

The Network connects:

  • Servers and storage within the datacentre.
  • Datacentres to each other.
  • Datacentres to the Internet.

and is composed of network nodes, links between them and network functions.

Network nodes operate either at OSI layer 2 or 3, whereas network functions may also involve communication at OSI layers 4-7, e.g. firewalls. In addition, there are network services that enable the communication between the client and the business services hosted in the DC. The latest generation of networking equipment consolidates many of these different network functions within a single unit.

  • L2 nodes are called switches and operate on OSI layer 2. These days this is typically a form of Ethernet.
  • L3 nodes are called routers and operate on OSI layer 3. Communication in the cloud era has converged on IP as the L3 protocol, and there is a strong push away from IPv4 towards IPv6.

Network nodes exist in the form of dedicated physical appliances as well as in virtualized form. The goal of the network layer is to ensure that connectivity between the devices involved in the ZOIS business service is always available when required.

The following redundancy considerations should be made:

A physical link connects two adjacent network nodes or end hosts with network nodes. A single link represents a single point of failure. The countermeasure against an outage caused by a failing link is redundancy on link level. The simplest form of redundancy is one or more parallel links between the same adjacent nodes.

Figure 1: Link Redundancy

Node redundancy provides redundant physical paths between two non-adjacent network nodes, or between an end-host and the network. In this case as well, the physical links of the end-host or of the network nodes at the path intersections connect to interfaces; these need to be physically independent and connected to disjoint forwarding paths in order to avoid single points of failure at that point.

Figure 2: Node Redundancy

If this is not possible then the redundancy needs to be provided by multiple end-hosts located such that the client reaches the end-host hosting the business service via disjoint paths over the network. Higher layers need to provide the mechanisms to select which end-host is being used and how to switch over in case of failure of the end-host or the network path to it.

The aggregated link capacity needs to be sufficient for the business service to support the projected number of clients. In addition, there needs to be enough link capacity in the network to support the business service without restriction in case of end-host or network failures. With redundant end-hosts and network paths, an intelligent mechanism is required to distribute the traffic over the available service resources. Depending on the implementation, this mechanism is part of the network layer or of higher layers.

In a setup with parallel links and network paths, the capacity reserved for redundancy needs to be at least equal to the capacity of the largest member of the set of parallel links/paths, so that no traffic is lost if a member of the set fails.

The node capacity of an L2/L3 node needs to be sufficient to not lose traffic even under full load of the respective service. In addition, the network node may also contribute certain network functions such as NAT, firewalling, DPI and others. These functions require additional processing of packets that consumes CPU and memory resources. The resources of the network element need to be sufficient to cope with the work at full load, even in a failover situation.

Firewalling is the capability to filter network traffic according to a defined policy. In the context of ZOIS, firewalling is used to protect the resources relevant to a functioning service from security and DoS attacks. Firewall functions can be implemented as appliances that are integrated in the network as nodes, or as a software function of the operating system, either in the hypervisor of the virtualization layer or as part of the virtualized application itself.

Firewalls in the form of appliances are network nodes with the additional firewalling function. Regarding redundancy and capacity, the same considerations apply as for regular network nodes.

The firewalling function maintains state tables. These tables need to be sufficiently large to keep up with the requirements of the business application, but also to protect against DDoS attacks.
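To make the sizing concern tangible, the following minimal sketch (Python, with hypothetical table size and timeout values, not any specific firewall implementation) shows a bounded connection-state table: once it is full, new flows must be rejected or idle entries reclaimed, which is exactly the resource a state-exhaustion DDoS attack tries to consume.

```python
from collections import OrderedDict
import time

# Hypothetical upper bound; real firewalls size this per model/licence.
MAX_STATES = 1_000_000
IDLE_TIMEOUT = 300  # seconds a state may stay idle before it is reclaimed

# flow key (src_ip, src_port, dst_ip, dst_port, proto) -> last-seen timestamp
states: "OrderedDict[tuple, float]" = OrderedDict()

def track(flow_key: tuple) -> bool:
    """Record a packet for its flow; return False if the state table is exhausted."""
    now = time.monotonic()
    if flow_key in states:                 # existing flow: refresh and keep order
        states.move_to_end(flow_key)
        states[flow_key] = now
        return True
    # Reclaim idle entries before refusing new flows.
    while states and now - next(iter(states.values())) > IDLE_TIMEOUT:
        states.popitem(last=False)
    if len(states) >= MAX_STATES:          # table full: the DDoS-relevant condition
        return False
    states[flow_key] = now
    return True
```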

A new set of functionalities has recently been added to firewalls (so-called "next-generation firewalls"), such as security protection based on defined threat patterns, support for more secure SSL/TLS protocols, and the decryption and inspection of traffic to identify applications and behaviour patterns.

Load balancing is the capability of the network to distribute traffic over multiple paths according to a defined policy; it can operate at various levels of the OSI model. Load balancers exist in the form of dedicated physical and virtual devices, as a function of a regular network node, or as an embedded function of the application. Load balancers as dedicated physical and virtual devices can be located anywhere on the communication path used by the business service. They need enough capacity for the business application in regular operation, but also in case of failure of one or multiple forwarding paths. In order to avoid single points of failure, redundancy concepts need to be in place, e.g. parallel load balancers.

Network nodes typically support some form of embedded load balancing, which distributes traffic over multiple paths if available; a typical example is ECMP. The load balancing policy is based on information in the packet headers of the traffic.

Load balancing as embedded function of an application means that the logic to distribute traffic is built-in the application itself and doesn’t rely on any function in the network.

Active/active traffic distribution means that all available forwarding paths are actively used at the same time. There must be a logic in place that determines how the traffic is distributed over the available paths.

The traffic distribution logic in this case can take one of the following forms (a minimal per-flow hashing sketch follows the list):

  • Per Packet
    • Per-packet load distribution means that the path decision is evaluated anew for each individual packet.
  • Per Flow
    • A flow is a set of packets that share some common criteria (typically the 5-tuple of addresses, ports and protocol). The packets of a flow take the same path; the path decision is evaluated on a per-flow basis.
  • Per Flowlet
    • A flowlet is a burst of packets from a flow, where bursts are separated by suitably large gaps in time. If the idle interval between two bursts of packets is larger than the maximum difference in latency among the available paths, the second burst (or flowlet) can be sent along a different path than the first without reordering packets.
  • Per Application/Session
    • In sophisticated implementations, all traffic belonging to a certain application can be grouped and follows the same path. The traffic of different applications is then distributed over the various paths.
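As a minimal sketch of per-flow distribution (Python; the path list is hypothetical), hashing the 5-tuple and mapping the result onto the set of available paths ensures that all packets of a flow follow the same path while different flows spread across the paths:

```python
import hashlib

# Hypothetical set of equal-cost paths / next hops.
PATHS = ["path-A", "path-B", "path-C", "path-D"]

def pick_path(src_ip: str, dst_ip: str, src_port: int, dst_port: int, proto: int) -> str:
    """Per-flow selection: the same 5-tuple always maps to the same path."""
    flow = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(flow).digest()
    index = int.from_bytes(digest[:4], "big") % len(PATHS)
    return PATHS[index]

# Example: every packet of this flow takes the same member of PATHS.
print(pick_path("10.0.0.1", "192.0.2.10", 49152, 443, 6))
```

A per-flowlet variant would additionally re-evaluate the choice whenever the gap since the previous packet of the flow exceeds the latency skew between the paths.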

The active/hot-standby concept is based on a primary path and at least one pre-provisioned standby path. All traffic uses the primary path as long as it is operational. In case of failure, the traffic is redirected from the failed primary path to the pre-provisioned standby path. This requires some form of detection mechanism that verifies the operation of the primary path and triggers redirection in case of failure.

The active/standby concept is very similar to the active/hot-standby concept, but differs in that the redundant path is only established after the failure of the primary path has been detected. As a consequence, it takes more time until the service is restored. Here too, a mechanism is required that verifies the proper operation of the primary path and, in case of failure, triggers the setup of the standby path and the redirection of traffic onto it. Depending on the business requirements, this implementation may impact the ZO design or require proper mitigation.
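The sketch below (Python on Linux; the probe, the next-hop addresses and the redirection function are placeholders) illustrates the detection-and-redirection logic shared by both concepts; the only difference is whether the standby path already exists (hot-standby) or must first be established after detection (standby):

```python
import subprocess
import time

PRIMARY = "10.0.0.1"     # hypothetical next hop of the primary path
STANDBY = "10.0.1.1"     # hypothetical next hop of the standby path

def path_alive(next_hop: str) -> bool:
    """Very coarse liveness probe; real deployments use BFD or similar."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "1", next_hop],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def switch_traffic_to(next_hop: str) -> None:
    """Placeholder: reprogram the routing/FIB entry towards next_hop."""
    print(f"redirecting traffic to {next_hop}")

active = PRIMARY
while True:
    if active == PRIMARY and not path_alive(PRIMARY):
        # Hot-standby: STANDBY is already provisioned, only traffic is redirected.
        # Plain standby would first have to establish the path here (slower restoration).
        switch_traffic_to(STANDBY)
        active = STANDBY
    elif active == STANDBY and path_alive(PRIMARY):
        switch_traffic_to(PRIMARY)   # revert once the primary path is healthy again
        active = PRIMARY
    time.sleep(1)
```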

QoS is the capability to provide differentiated treatment of forwarded traffic. In the context of ZOIS, QoS makes it possible to give preference to the forwarding traffic relevant to a specific application and thus protect the service in case of resource contention. QoS helps to meet the service requirements for packet loss, latency and jitter.

QoS is limited by the availability of resources on network nodes and end-hosts, e.g. the memory available to queue traffic at a port. End-to-end capacity management is responsible for avoiding an overrun of resources.
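The following minimal sketch (Python, with hypothetical queue depths) illustrates the basic idea of differentiated treatment: a strict-priority scheduler always serves the queue of the protected application first, so under contention only the best-effort traffic is delayed or dropped.

```python
from collections import deque

# Hypothetical per-port queues; depth is bounded by the memory available at the port.
# deque(maxlen=...) discards the oldest entry when full; a real scheduler would tail-drop.
high_prio = deque(maxlen=1000)    # traffic of the protected business application
best_effort = deque(maxlen=1000)  # everything else

def enqueue(packet, critical: bool) -> None:
    (high_prio if critical else best_effort).append(packet)

def dequeue():
    """Strict priority: the critical queue is always served first."""
    if high_prio:
        return high_prio.popleft()
    if best_effort:
        return best_effort.popleft()
    return None
```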

In addition to the network components and embedded network functions, some key network-based services are needed to build a ZOIS network; these are outlined below.

A prerequisite for communication over the network is that all relevant elements have assigned IP addresses. Within a domain the assigned IP addresses have to be unique. Whether those addresses are assigned statically or dynamically is irrelevant.

Forwarding packets requires forwarding entries in the network nodes which tell the node where to send the traffic. Setting up forwarding paths in the network requires some form of control plane on all nodes that makes sure all relevant addresses and associated routing metadata, such as link metrics, are distributed to all relevant places in the network. Whether this control plane works with statically programmed entries or via dynamic routing protocols is irrelevant in the context of ZO, as long as the relevant forwarding paths are established and the business service remains unaffected by failures in the network.

The data plane deals locally, on the network nodes or in the end-host, with the programming of the forwarding information base (FIB), a lookup table used to forward packets. The data plane also takes care of the forwarding itself.
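As a simple sketch of what the FIB lookup does (Python, with hypothetical entries), forwarding uses a longest-prefix match: of all entries whose prefix contains the destination address, the most specific one wins.

```python
import ipaddress

# Hypothetical FIB: prefix -> next hop
fib = {
    ipaddress.ip_network("0.0.0.0/0"): "10.0.0.1",       # default route
    ipaddress.ip_network("192.0.2.0/24"): "10.0.1.1",
    ipaddress.ip_network("192.0.2.128/25"): "10.0.2.1",  # more specific entry
}

def lookup(dst: str) -> str:
    """Longest-prefix match over the FIB entries."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in fib if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return fib[best]

print(lookup("192.0.2.200"))  # -> 10.0.2.1 (the /25 wins over the /24 and the default)
```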

In case of a failure in the network, the failure(s) must be detected quickly and traffic must be switched to an alternate forwarding path as fast as possible. ZO requires that all potential failure scenarios be covered either within the network layer or at higher layers.

During failures, packets are typically dropped. Packet loss in case of failure is considered normal rather than exceptional in IP networks. Therefore, adequate measures at higher layers of the OSI model are required, for instance retransmission, so that the business application remains unaffected.

Once failures are detected, traffic needs to be shifted to the next best forwarding option. In general, there are two approaches to deal with failed forwarding paths. One approach provisions enough parallel forwarding paths with sufficient bandwidth in the network (active/active load balancing concept) such that one or more failing paths do not impact the business application.

In that case the network does not need to converge and find next best paths. Measures need to be taken to eliminate the failing forwarding paths from the list of available paths in order to avoid undetected traffic losses, frequently referred to as "blackholing".

The other approach is based on alternate paths in the network (active/(hot-)standby concept). There are mainly two options for finding the next best paths. One is to wait for the control plane to learn a new best path; this takes time because it requires an exchange of information at the control plane level and reprogramming of the FIB. Alternatively, this can be optimized by learning the alternative forwarding options before the failure(s) happen and by pre-programming those next best paths into the FIB upfront.

Time synchronization of all systems in the network, including the servers, is highly desirable. Applications required for the proper operation of a ZO system, such as monitoring, troubleshooting and debugging, rely on all systems having the same time. The accuracy of the time synchronization should be in the order of magnitude of milliseconds.
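As an example only, the offset of a host against an NTP source can be checked with a few lines of Python, assuming the third-party ntplib package and a reachable NTP server (both are assumptions, not part of ZOIS):

```python
import ntplib  # third-party package: pip install ntplib

client = ntplib.NTPClient()
response = client.request("pool.ntp.org", version=3)  # placeholder NTP source

# Offset between the local clock and the NTP server, in seconds.
# For ZO monitoring and troubleshooting, an offset in the millisecond range is the target.
print(f"clock offset: {response.offset * 1000:.1f} ms")
```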

Like IP addressing, name resolution is a key network functionality required to implement a business service over the Internet or in an intranet. In most cases, the first step to initiate a communication session for an application is to resolve domain names to IP addresses. Without a functioning DNS, no business service can be run. Because of the importance of the DNS service, it needs to be provided in a fault-tolerant way.

Whether the implementation of a DNS service is based on a dynamic learning mechanism for the name to address binding or a static one is not relevant in the context of a ZO architecture as long as the implementation provides a reliable and accurate service.
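On the client side, fault tolerance can be as simple as querying a list of redundant resolvers in turn; the sketch below assumes the third-party dnspython package and placeholder resolver addresses:

```python
import dns.resolver  # third-party package: pip install dnspython

# Placeholder addresses of redundant DNS resolvers.
RESOLVERS = ["10.0.0.53", "10.0.1.53"]

def resolve(name: str) -> list[str]:
    last_error = None
    for server in RESOLVERS:                    # try the next resolver on failure
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [server]
        resolver.lifetime = 2.0                 # seconds before falling back
        try:
            return [rr.to_text() for rr in resolver.resolve(name, "A")]
        except Exception as err:
            last_error = err
    raise RuntimeError(f"all resolvers failed for {name}") from last_error

print(resolve("example.com"))
```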

Networks are usually built for different environments like Campus, Branch, Data Centre or WAN. These different networks are designed to meet different requirements such as scale, bandwidth, security and availability. Datacentre and WAN networks typically demand the highest level of availability and redundancy. This is achieved by using redundant links between the network nodes and by doubling the network nodes to avoid single points of failure. Network protocols running on IP and Ethernet networks aim to build forwarding paths based on the available links and recalculate alternative paths in case of a link failure. A recent trend is to converge the DC LAN and SAN onto a single network. The evolution of the Ethernet standard allows the transport of loss-less services like Fibre Channel over an Ethernet-based datacentre fabric.

There are various reasons to build networks with multiple layers, such as link distance, port scale per device, floor plan, etc. A common network implementation has two layers, the core and access layers, also called spine and leaf layers; this is known as a Clos network, first formalized by Charles Clos in 1952.

Example schematic of a Clos Network Design

In a typical data centre today the Server-to-Server communication represents a high percentage of overall traffic. Different applications require different characteristics from the network, for instance they have different sensitivity to latency. The ZO design needs to take care of all such requirements.

The data centre network infrastructure is central to the overall IT architecture. It is where most business-critical applications are hosted and various types of services are provided to the business. Proper planning of the data centre infrastructure design is critical, and performance, resiliency, and scalability need to be carefully considered.

Another important aspect of the data centre network design is the flexibility to quickly deploy and support new services as well as to grow the network as the business services increase. The CLOS design allows scaling out more easily by adding additional network nodes to the leaf layer when new servers need to be attached to the network.

Network capacity can be increased by adding additional spine and leaf nodes (horizontal growth) or by increasing the link speed (vertical growth).

Server virtualization and cloud services pose new requirements on large-scale L2 and L3 services to support workload mobility.

Modern DC networks include the concept of running overlay networks on top of inherently static underlay networks. VXLAN, the de-facto industry standard for overlays, allows the creation of flexible L2 overlays on top of an IP-based underlay. These underlay networks are built with proven IP control plane protocols like IS-IS, OSPF, etc. and provide the same redundancy as a routed WAN network or the Internet. In case of a link failure, these control plane protocols are tuned to allow the network to converge in milliseconds rather than seconds.

VXLAN is designed to provide the same Ethernet Layer 2 network services as VLANs do today, but with greater extensibility and flexibility. Implementing VXLAN technologies in the network will provide large Layer 2 scale and flexible placement of workload.
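To make the encapsulation concrete, the sketch below (Python) builds the 8-byte VXLAN header defined in RFC 7348, with the I flag set and a 24-bit VNI; the header is prepended to the original Ethernet frame and carried in a UDP datagram towards destination port 4789:

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned destination port for VXLAN

def vxlan_header(vni: int) -> bytes:
    """8-byte VXLAN header: flags (0x08 = VNI valid), 24 reserved bits,
    24-bit VNI, 8 reserved bits (RFC 7348)."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!I I", 0x08 << 24, vni << 8)

# Example: ~16 million possible VNIs instead of 4094 usable VLAN IDs.
print(vxlan_header(10123).hex())  # -> 0800000000278b00
```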

On top of the VXLAN encapsulation, the industry-standard VXLAN EVPN provides a robust control plane (Border Gateway Protocol, BGP), which brings the stability and scalability of routing to Layer 2.

Another aspect of a modern DC network is the incorporation of the Software-Defined concept. There are different definitions of Software-Defined, but the common understanding is that it should simplify the orchestration and automation of the network so that it can be more easily consumed by the application layer. A central SDN controller with a RESTful API controls the network policy and pushes the network policy definition down to the network nodes. The ZO network can benefit from such a policy-driven approach by avoiding misconfiguration of individual network nodes and keeping the network policy and function consistent across the network.
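The sketch below shows the policy-driven approach in principle only; the controller URL, endpoint and payload schema are entirely hypothetical, as every real SDN controller exposes its own API (Python with the requests package):

```python
import requests  # third-party package: pip install requests

CONTROLLER = "https://sdn-controller.example.net"   # hypothetical controller address
TOKEN = "<api-token>"                               # authentication is controller-specific

# Hypothetical policy document: declared once and pushed by the controller to all nodes,
# instead of being configured box by box (which is where misconfigurations creep in).
policy = {
    "name": "web-tier-ingress",
    "match": {"dst_port": 443, "protocol": "tcp"},
    "action": "permit",
    "scope": ["leaf-101", "leaf-102"],
}

resp = requests.post(
    f"{CONTROLLER}/api/v1/policies",                # hypothetical endpoint
    json=policy,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
```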

One particular network is responsible for interconnecting storage devices. While many realization options are possible, it is nowadays common to utilize a Storage Area Network (SAN), which is described in the next sections.

A Storage Area Network (SAN) is a high-speed network designed to connect servers to storage devices. Connectivity between servers and disk storage arrays or tape libraries allows multiple servers to share resources.

The SAN allows storage to be moved from individual servers into consolidated environments like a large disk array. This provides several benefits, including improved storage utilisation, support for clustering technologies, snapshot features to aid backup and recovery, and inter-site data replication.

Many different kinds of traffic traverse a SAN fabric. The mix of traffic is typically based on the workload on the servers and the effect that behaviour has on the fabric and the connected storage. Examples of different types of workload include these:

  • I/O-intensive, transaction-based applications: These systems typically do high volumes of short block I/O and do not consume a lot of SAN bandwidth. These applications usually have very high-performance service levels to ensure low response times. Care must be taken to ensure that there are a sufficient number of paths between the storage and hosts to ensure that other traffic does not interfere with the performance of the applications. These applications are usually very sensitive to latencies.
  • I/O-intensive applications: These applications tend to do a lot of long block or sequential I/O (e.g. data mining) and typically generate much higher traffic levels than transaction-based applications. Depending on the type of storage, these applications can consume bandwidth and generate latencies in both storage and hosts that can negatively impact the performance of other applications sharing the environment.
  • Host High Availability (HA) clustering: These clusters utilise storage very differently from standalone systems. They continuously check their storage for data integrity reasons generating a load on both the fabric and the storage arrays.
  • Host-based replication: Host-based replication causes traffic levels to increase significantly across a fabric and can put considerable pressure on Inter-Switch Links (ISL).
  • Array-based replication: Data can be replicated between storage arrays as well.

This section provides some example high-level guidelines necessary to implement a typical SAN installation.

Servers and storage devices should be connected to both SAN fabrics and use Multi-Path I/O (MPIO) software, allowing data to flow through both fabrics. The MPIO software provides either load balancing and failover (known as active/active) or failover only (known as active/passive).

Key suggestions for a Zero Outage design:

  • Are there at least two physically independent paths between each server and destination storage?
  • Are there two redundant fabrics?
  • Does each server connect to two different SAN switches?
  • Are edge switches connected to at least two different core switches? (Edge to Core design.)
  • Are inter-switch connections composed of two or more ISLs?
  • Does each storage array connect to at least two different SAN switches?
  • Are storage ports provisioned such that every host has at least two ports through which it can access its disks?

A typical SAN design comprises devices on the edge of the fabric, switches in the core of the fabric, and the cabling that connects it all together. Topology is usually described in terms of how the switches are interconnected, such as ring, core-edge, edge-core-edge or fully meshed.

The edge-core topology places initiators on the edge tier and storage on the core tier. Since the servers and storage are on different switches, this topology provides ease of management as well as good performance, with most traffic only traversing one hop from the edge to the core.

The disadvantage of this design is that the storage and core connections are limited for expansion; this topology allows for only minimal growth.

Figure – Example of an Edge-Core Design

A full-mesh topology allows servers and storage to be placed anywhere, since communication between any source and destination is no more than one hop. The disadvantage of this design is the number of ports dedicated to inter-switch connectivity, which becomes very inefficient as the size of the SAN fabric grows and ultimately limits scalability.

Figure – Example of a full mesh design

An important aspect of SAN design is the resiliency and redundancy of the fabric. The main objective is to remove any single point of failure. Resiliency is the ability of the network to continue to function during a failure, while redundancy describes duplication of components. It is standard practice to design and implement a dual fabric, to ensure no single point of failure. At the highest level of fabric design, the complete network should be redundant, with two completely separate fabrics that do not share any network equipment.

Figure – Example dual fabric connectivity

To ensure no single point of failure within a fabric, there should be at least two of every interconnection in the fabric to provide redundancy and improve resiliency. The number of ports and device locality (server/storage) determines the number of inter-switch links needed to meet performance requirements. Each switch should be connected to at least two other switches, and so on.

The location of the inter-switch connections on the SAN switch also needs to be considered. It is important to be consistent across the fabric: the same ports should be utilised at either end. Mismatched ISL placement can introduce performance issues and make maintenance or troubleshooting more difficult than it needs to be.

Device placement is a balance between traffic isolation, scalability and performance requirements. With the growth of virtualization and multi-node clustering platforms, frame congestion can become a serious concern in the fabric.

Server-to-storage connectivity should, where possible, stay on a single switch in the fabric. For simplicity, communicating servers and storage should be attached to the same switch. However, this design/practice does not scale well.

It is possible to introduce additional switches for additional server or storage connectivity, but the inter-switch links can become a performance issue causing traffic congestion. Traffic congestion can be mitigated by correct sizing and provisioning of the inter-switch links. For the current generation of SAN switches, locality is not a requirement for performance; however, for mission-critical applications it is a design criterion for solution architects to consider, especially when combined with high-end performance disks like Solid State Drives (SSD).

Figure – Example of servers and storage being local

Another aspect of device placement is the "fan-in ratio" or "oversubscription". This is a measure of the number of source ports to target ports, or of devices to ISLs. It is also referred to as the "fan-out ratio" when viewed from the storage. The ratio is the number of device ports that share a single port, whether ISL or target, and is usually expressed as a ratio, for example 7:1 for 7 hosts utilizing a single ISL or storage port.

During the design phase the solution architect should incorporate a number of factors when determining the optimum number of hosts that can/should be connected to a storage port.

These include:

  • Numbers of servers and if they are clustered, and if virtualisation is to be used
  • Number of storage LUNs per server
  • Estimated traffic, in data change quantity, data reads and IOPS

In a traditional one-application-per-server environment it is very unlikely that all servers will be running at their maximum throughput; periods such as boot or a full data backup are the expected exceptions. In these cases it is possible to "oversubscribe" the fan-in ratio, sometimes by quite a margin. An example may be to combine an application utilising multiple database servers with a DNS server: the DNS server will have a low I/O profile, whereas the database servers will have a considerable I/O profile. The effect of combining these together would reduce the effective oversubscription.

Virtual servers add further complexity, as virtualisation itself introduces a level of oversubscription as part of the virtual server concept. If the overall solution includes virtual servers, then the oversubscription in the SAN fabric, and likely in the storage array, must be reduced.

Generally it’s considered a best practice not to have an oversubscription on storage port to the SAN fabric core. As example, if the storage device has 8 x 16Gbps ports connected to the fabric, we would expect 128Gbps of bandwidth from the storage to core tier.

To ensure a proper design, we suggest to take into close consideration the following aspects:

  • Host-to-storage port fan-in/out ratios
  • Oversubscription ratios:
    • Host to ISL
    • Edge switch to core switch
    • Storage to ISL

As with any equipment used to provide services supporting applications that require zero outage compliance, the SAN should be monitored both in real time and on a historical basis. Monitoring allows organisations to address performance issues proactively and to rapidly diagnose underlying causes, resolving issues before the SAN becomes a bottleneck for critical applications.

The Fibre Channel to Fibre Channel (FC-FC) routing service enables Fibre Channel SANs to share devices between two or more fabrics without merging the fabrics. There are many reasons to consider FC-FC routing; these include providing connectivity for legacy devices, allowing connectivity to a remote disk/LUN, fault isolation and security.

Virtualising fabrics allows multiple virtual SAN switches to be hosted on a single piece of SAN hardware. Deploying virtual switches provides a mechanism for partitioning and sharing hardware resources, providing more efficient use, increased fault isolation and improved scalability. Logical switches are then connected to form logical fabrics, which consist of one or more logical switches over multiple physical switches. The use of virtual fabrics also allows different protocols to run on the same physical hardware; for example, FICON and FCP traffic could run independently of each other over the same physical hardware platform.

One of the advantages of deploying a SAN is that it provides centralised storage, along with a network connecting it together. Once storage is centralised, data can be replicated from one storage array to another storage array at a Disaster Recovery (DR) site. SANs are typically connected over metro or long-distance dark fibre links, providing the ability to replicate data to a remote site. Path latency is critical for mirroring and replication solutions and should be factored into the design.

For short fibre links the length of time that a frame spends on the cable between two ports is negligible, as the speed of light in fibre corresponds to around 5 microseconds per kilometre, while typical disk latency is 5 to 10 milliseconds. It is nevertheless important to understand the distance of the fibre links, as some compensation may be required in the SAN. As the time for frames to pass over long links increases, it is normally necessary to allocate additional buffers on the corresponding SAN switch ports. These additional buffers provide short-term storage to compensate for the latency over the distance link.
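As a rough, first-principles estimate only (assumed frame size and nominal line rate; vendor sizing tools and rules of thumb should be used for a real design), the number of buffer credits needed to keep a long-distance link busy can be approximated from the time a full frame occupies the fibre:

```python
# First-order estimate, not a vendor sizing rule.
PROPAGATION_US_PER_KM = 5.0        # ~5 microseconds per km in fibre (as above)
FRAME_BYTES = 2148                 # assumed full-size FC frame including overhead

def buffer_credits(distance_km: float, line_rate_gbps: float) -> int:
    frame_time_us = FRAME_BYTES * 8 / (line_rate_gbps * 1000)   # serialization time
    round_trip_us = 2 * distance_km * PROPAGATION_US_PER_KM     # frame out + credit back
    # Frames that must be in flight so the sender never stalls waiting for credits.
    return max(1, round(round_trip_us / frame_time_us))

print(buffer_credits(distance_km=50, line_rate_gbps=16))  # roughly a few hundred credits
```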

When using dark fibre links it is usual to utilise a form of multiplexing. Examples are:

  • Dense Wave Division Multiplexing (DWDM)
  • Coarse Wave Division Multiplexing (CWDM)
  • Time Division Multiplexing (TDM)

These technologies allow for the SAN fabric to be extended between the locations, again using additional buffers on the ports in question to compensate for the additional latency introduced.

Fibre Channel over IP (FCIP) links are most commonly used for data replication and remote tape applications. Usually for the purpose of Business Continuity via Disaster Recovery, FCIP links allow data to be transported over significant distances. Data replication is typically used for storage array to storage array communications, although remote tape applications work equally well.

The choice of remote connectivity is going to be driven by the distance, speed (latency) and data requirements.

For data replication, best practice is to use a separate IP connection between the production and backup data centres. Often a dedicated IP connection between data centres is not viable, and in these instances bandwidth must at least be logically dedicated to prevent additional latency. To determine the amount of network bandwidth required, it is recommended to gather a month's worth of data from the servers, SAN fabric and storage. It is important to understand the quantity of server-to-disk write/update traffic, as this will be equal to the data that is replicated to the remote storage.
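A minimal sizing sketch (Python; the daily change volume and overhead factor are assumed figures, to be replaced by the measured data mentioned above):

```python
# Assumed measurement results; replace with real figures from servers, fabric and storage.
daily_write_gb = 500          # server-to-disk write/update volume per day
replication_window_h = 24     # hours available to ship that volume to the DR site
protocol_overhead = 1.2       # assumed allowance for FCIP/TCP/IP overhead and retransmits

required_gbps = daily_write_gb * 8 * protocol_overhead / (replication_window_h * 3600)
print(f"sustained replication bandwidth: {required_gbps * 1000:.0f} Mbps")  # ~56 Mbps here
```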

There are many components to SAN security in relation to SAN design, and the decision to use them depends largely on installation requirements rather than functionality or performance. One clear exception is the SAN zoning feature used to control device communication. The proper use of zoning is key to fabric functionality, performance and stability, especially in larger SAN fabrics. Zoning is used to specify which devices in the fabric are allowed to communicate with each other. If zoning is enforced, devices that are not in the same zone cannot communicate.

It’s considered best practice to create zones containing a single initiator and one target it communicates with. Changes to initiators in this case do not impact other initiators or other targets. Zoning provides protection from disruption in the fabric. Other security-related features are largely mechanisms for limiting access and preventing attacks on the SAN fabric (local regulatory requirements). These are not required for normal fabric operation.

We suggest ensuring that the following zoning best practices are taken into consideration in a design:

  • Always deploy zoning
  • Create zones with only one server and target port
  • Define zones using device WWPNs (World Wide Port Names)
  • Deploy logical naming, to aid troubleshooting and maintenance
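As an illustration of the single-initiator/single-target practice (Python; all WWPNs and the naming scheme are hypothetical, and the configuration syntax of a real SAN switch will differ), a zone set can be generated so that every host port is zoned to each of its storage target ports individually:

```python
# Hypothetical WWPNs; logical names encode host/array and port to aid troubleshooting.
initiators = {
    "host01_hba0": "10:00:00:00:c9:aa:bb:01",
    "host02_hba0": "10:00:00:00:c9:aa:bb:02",
}
targets = {
    "array01_ctrl0_p0": "50:00:09:72:00:11:22:01",
}

def single_initiator_zones(initiators: dict, targets: dict) -> dict:
    """One zone per (initiator, target) pair: changes to one host never touch another."""
    zones = {}
    for host, host_wwpn in initiators.items():
        for tgt, tgt_wwpn in targets.items():
            zones[f"z_{host}__{tgt}"] = [host_wwpn, tgt_wwpn]
    return zones

for name, members in single_initiator_zones(initiators, targets).items():
    print(name, members)
```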