Servers

Today’s ever-changing world places more demands on IT than ever before. Organizations expect IT to rapidly deliver new services while still being able to meet changing objectives. A new style for the future-ready enterprise has evolved, based on maximizing existing data centers while supporting cloud computing and systems management solutions that automate and simplify operations. The end goal? Faster time-to-production through infrastructure delivery automation and greater flexibility in creating the needed solutions, together with greater resiliency and security.

Servers in a ZO-compliant implementation must fulfil a variety of availability, serviceability, agility and performance requirements, depending on the application. Over the last decades, servers have mostly been deployed with hypervisor technologies to allow server consolidation through virtualization. Server virtualization gives companies much higher efficiency, as it usually results in fewer physical servers being needed to run the same number of applications as virtual machines. Savings on physical servers are important, but most current hypervisor technologies also offer operators the ability to increase service availability by allowing the physical server to be changed without notable impact on the application running inside a migrated virtual machine.

Besides this remarkable adoption of server virtualization as a standard asset within most enterprise IT environments today, there are still applications for which either virtualization makes no technical sense (e.g. because of performance) or the application vendor does not support virtualization for production use. In these cases, classical server deployments, known as bare-metal systems, are needed.

Virtual machines and bare-metal systems form the foundation for most applications in today’s IT environments, but technology innovations such as Hyper-Converged Infrastructure (HCI), Software-Defined Storage (SDS), Machine Learning (ML), in-memory databases and microservices demand more, or more specific, physical resources within a server. For example, HCI and SDS architectures need a variety of local storage types (e.g. HDD, SSD, NVMe) and a low-latency, high-bandwidth network for data replication, while machine learning applications require special GPUs with a massive number of cores.

What do customers want? Assurance that they will receive products that perform best in their environments, with minimal to zero downtime and a low total cost of ownership. This requires organizations to focus on reliability to increase customer satisfaction, lengthen the product lifecycle, reduce costs for customers, and maximize uptime.

What is reliability? The ability to perform consistently well over time. Servers must meet reliability requirements at the system, subsystem and component levels. They must operate continuously at specified environmental conditions (temperature, humidity, shock, vibration) and allow for short-term excursions. Systems should specify whether they may be deployed in uncontrolled environments (locations with polluted air and dust).

Server design reliability begins at the component level, starting with choosing and approving component suppliers, followed by qualification of all subsystems, and finally server system qualification and test. Product qualification and release systems ensure that strict design criteria, including deployment life, additional deployment-life margin, and accommodation for the lifetime limited warranty, are met before a product is launched. This qualification and release system should be based on industry standards and other rigorous methods.

ZO-compliant servers should deliver physical redundancy and support hot-swap replacement of the components that fail most frequently, such as Hard Disk Drives (HDD), Power Supplies (PSU) and fans. PSUs and fans should have at least N+1 redundancy (where N+1 means that one redundant device is provisioned to cover the failure of any one of N operating devices, rather than 1+1, where a dedicated backup device is provisioned for each operating device). Single and multiple HDD failures should be protected against by RAID technology, either with a hardware-based RAID controller or with a comparable software-based solution.
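The redundancy status of such hot-swappable components can usually be monitored through the server’s BMC. The following is a minimal sketch, assuming a Redfish-capable BMC and using a placeholder address and credentials, of how PSU and fan health could be checked via the standard DMTF Redfish Chassis resources:

```python
# Hedged sketch: query PSU and fan health via the DMTF Redfish API of a BMC.
# The BMC address and credentials below are placeholders, not real values.
import requests

BMC = "https://bmc.example.net"   # hypothetical BMC address
AUTH = ("monitor", "secret")      # hypothetical read-only account

def redfish_get(path):
    r = requests.get(f"{BMC}{path}", auth=AUTH, verify=False, timeout=10)
    r.raise_for_status()
    return r.json()

# Walk all chassis and report power supply and fan health.
for member in redfish_get("/redfish/v1/Chassis")["Members"]:
    chassis = member["@odata.id"]
    power = redfish_get(f"{chassis}/Power")
    thermal = redfish_get(f"{chassis}/Thermal")
    for psu in power.get("PowerSupplies", []):
        print(chassis, "PSU", psu.get("Name"), psu.get("Status", {}).get("Health"))
    for fan in thermal.get("Fans", []):
        print(chassis, "Fan", fan.get("Name"), fan.get("Status", {}).get("Health"))
```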

Memory redundancy through hardware-based memory mirroring could be considered to address non-recoverable DIMM failures. Mirroring the physical memory comes with a high cost premium, but it can mitigate uncorrectable (non-ECC-correctable) memory failures, which usually lead to operating system panics and a full server outage.
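Short of full mirroring, rising correctable-error counts can serve as an early warning of a failing DIMM. A minimal monitoring sketch, assuming a Linux host with the EDAC subsystem enabled, could look like this:

```python
# Hedged sketch: read correctable (ce) and uncorrectable (ue) memory error
# counters from the Linux EDAC subsystem (assumes EDAC drivers are loaded).
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")

for mc in sorted(EDAC_ROOT.glob("mc*")):
    ce = int((mc / "ce_count").read_text())   # errors corrected by ECC
    ue = int((mc / "ue_count").read_text())   # errors ECC could not correct
    print(f"{mc.name}: correctable={ce} uncorrectable={ue}")
    if ce > 0:
        print(f"  {mc.name}: rising correctable errors may indicate a failing DIMM")
```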

I/O connectivity for Local Area Networks (LAN) and Storage Area Networks (SAN) must deliver at least N+1 redundancy. For SAN redundancy, the most common approach is to utilize a dual-port Host Bus Adapter (HBA), with each HBA Fibre Channel port connected to a dedicated SAN switch. At the operating system level, a software implementation of Multi-Path I/O (MPIO) should be used to enable redundant access to SAN-provided storage devices (LUNs). LAN connectivity redundancy is achieved similarly, by utilizing at least two physical Network Interface Card (NIC) ports and connecting them, equally divided, to a redundant pair of LAN access switches. Aggregation and port-channel technologies within the operating system (a.k.a. NIC teaming or bonding) should be used to deliver redundancy across the physical network connections. Servers running hypervisors usually handle the network connectivity of virtual machines by implementing software-based virtual switches. Besides software-based approaches, there are industry solutions that present a single logical NIC to the operating system or hypervisor through a systematic, integrated approach of managing the server’s physical NICs together with the LAN access switch pair.
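Whichever approach is used, the health of the redundant links should be verifiable from the operating system. As a minimal sketch, assuming a Linux host with a bonded interface named bond0, the kernel’s bonding attributes in sysfs can be read to confirm that link redundancy is still intact:

```python
# Hedged sketch: verify that a Linux bond (NIC team) still has redundant,
# healthy member links by reading the kernel's sysfs bonding attributes.
from pathlib import Path

BOND = "bond0"  # hypothetical bond interface name
bond_dir = Path(f"/sys/class/net/{BOND}/bonding")

slaves = (bond_dir / "slaves").read_text().split()
up = [s for s in slaves
      if Path(f"/sys/class/net/{s}/operstate").read_text().strip() == "up"]

print(f"{BOND}: {len(up)}/{len(slaves)} member links up: {up}")
if len(up) < 2:
    print(f"WARNING: {BOND} has lost link redundancy")
```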

Out-of-band (OOB) management access to the Baseboard Management Controller (BMC) of a physical server is a general requirement for all devices in the data center. It can be provided through a dedicated physical network or through an isolated converged approach. Servers should deliver N+1 redundancy for OOB access to mitigate management network switch failures and keep the management access path alive.
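A simple way to verify that the redundant OOB paths remain usable is a periodic reachability check against the BMC from the management network, as in the following sketch (the addresses and port are placeholders):

```python
# Hedged sketch: confirm that the BMC is reachable over both (redundant)
# out-of-band management paths. Addresses and port are placeholders.
import socket

OOB_PATHS = ["10.0.10.15", "10.0.20.15"]   # hypothetical BMC IPs, one per path
PORT = 443                                  # HTTPS/Redfish endpoint

for ip in OOB_PATHS:
    try:
        with socket.create_connection((ip, PORT), timeout=5):
            print(f"OOB path {ip}:{PORT} reachable")
    except OSError as err:
        print(f"OOB path {ip}:{PORT} UNREACHABLE: {err}")
```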

In a virtualized environment, virtual machines can easily be migrated from one server to another. This allows operators to evacuate physical servers without any impact on service availability, to plan server maintenance windows during usual business hours, or to proactively evacuate servers, based on anomaly detection, before a component fails.
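As an illustration, a live migration between two KVM hosts could be triggered with the libvirt Python bindings, as in the following sketch (the host URIs and the VM name are placeholders):

```python
# Hedged sketch: live-migrate a virtual machine between two KVM hosts using
# the libvirt Python bindings. Host names and the VM name are placeholders.
import libvirt

SRC_URI = "qemu+ssh://root@host-a.example.net/system"   # hypothetical source host
DST_URI = "qemu+ssh://root@host-b.example.net/system"   # hypothetical target host
VM_NAME = "app-vm-01"                                    # hypothetical VM

src = libvirt.open(SRC_URI)
dst = libvirt.open(DST_URI)

dom = src.lookupByName(VM_NAME)
# VIR_MIGRATE_LIVE keeps the guest running while its memory is transferred.
dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)

print(f"{VM_NAME} migrated to {DST_URI}")
dst.close()
src.close()
```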

For bare-metal systems there is currently no industry solution available that allows hot (or live) migration to another bare-metal system. However, there are solutions on the market that abstract the physical server from the actual implementation of the service (e.g. UUID, MAC address, WWx names, BIOS settings, firmware, etc.). Using such technologies allows a seamless (cold) migration of an operating system running on a bare-metal server. Fully stateless computing can be achieved by placing the boot device on SAN storage and the data partitions on either SAN or NAS storage. This approach allows server hardware to fail and be replaced without the need to re-install the operating system. It is frequently seen with blade/chassis type servers, where a blade can be replaced or upgraded without the operating system having to be reinstalled.

Utilising servers without internal system disks also helps when considering disaster recovery plans: if the operating system is centrally stored, the Disaster Recovery (DR) process becomes slightly easier, as there is no requirement to synchronise the system partition to ensure that it is kept updated and maintained.

Servers should use technologies that abstract the physical server from the intent of the service or application running on it. This would allow operators to migrate a bare-metal workload from one server to another with a single reboot. It could also be used to evacuate physical servers for maintenance tasks or for a planned server upgrade (e.g. a CPU refresh or additional physical resources). In some cases, these technologies can even be used to migrate across different form factors (e.g. from a blade to a rack server).

Server management is a critical part of a successful IT solution. Simplicity, efficiency, and availability are key attributes. A large percentage of system downtime results from manual processes and user errors; systems management solutions can help eliminate this risk and bring stability to any IT environment. Processes such as automated and scheduled firmware updates, and automatic call-home capabilities with health reports, reduce infrastructure downtime. Left unmanaged, server maintenance can lead to unexpected downtime in any environment, from the cloud to the branch office. The goal of systems management is to increase server availability as much as possible by removing the impact of any necessary maintenance on applications, so that it does not cause any business downtime or reduction in service.

The management access on a ZO-compliant server should allow easy integration with automation frameworks via open, adoptable and well-documented API data models. At a minimum, an API at the physical server’s BMC should be available to allow basic automation and monitoring integration. Modern industry solutions deliver enriched, open API endpoints at the higher level of converged infrastructure, giving more consolidated and unified API access for simpler automation integration.
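As a sketch of such basic monitoring integration, the DMTF Redfish API exposed by most current BMCs can be queried to enumerate systems and their health (the endpoint and credentials below are placeholders; real BMCs may require session-based authentication):

```python
# Hedged sketch: basic monitoring integration against a BMC's DMTF Redfish API.
# Endpoint and credentials are placeholders, not real values.
import requests

BMC = "https://bmc.example.net"
AUTH = ("monitor", "secret")

def redfish_get(path):
    r = requests.get(f"{BMC}{path}", auth=AUTH, verify=False, timeout=10)
    r.raise_for_status()
    return r.json()

# Enumerate all systems behind this BMC and print basic inventory and health.
for member in redfish_get("/redfish/v1/Systems")["Members"]:
    system = redfish_get(member["@odata.id"])
    print(system.get("Model"),
          system.get("SerialNumber"),
          system.get("PowerState"),
          system.get("Status", {}).get("Health"))
```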

A critical aspect of the physical server lifecycle is the initial deployment and the effort needed for it. A typical physical server bring-up, from unpacking to the start of operating system installation, can range from a couple of hours to a couple of days, depending on the components that need to be configured or modified (e.g. BMC settings, BIOS settings, firmware updates, RAID configuration, boot-order definition, SAN boot integration, etc.). The time to service can be reduced by controlling these parameters through management constructs such as policies, pools and templates. A policy should control a limited scope of a component (e.g. a BIOS policy for BIOS settings, a boot policy for boot-order definition). Policies are used to build templates, which abstract physical components (e.g. a server template, an HBA/NIC template, etc.). Pools allow unique identifiers (e.g. UUIDs, MAC addresses, etc.) to be predefined and referenced by templates. In addition, pools should allow logical grouping of physical servers. The abstraction technology must be able to configure all required components, based on the intent defined by the policies, without any human intervention.
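How these constructs relate to each other can be illustrated with a simplified, vendor-neutral data model; the following sketch uses hypothetical policy, pool and template classes and is not tied to any specific management product:

```python
# Hedged sketch: a simplified data model for policies, pools and templates, and
# how a template could stamp out abstracted server profiles. All names here are
# illustrative only.
from dataclasses import dataclass, field
from itertools import count
import uuid

@dataclass
class BiosPolicy:
    settings: dict              # e.g. {"HyperThreading": "Enabled"}

@dataclass
class BootPolicy:
    boot_order: list            # e.g. ["SAN", "PXE"]

@dataclass
class MacPool:
    prefix: str                 # e.g. "02:00:00:00:01"
    _next: count = field(default_factory=lambda: count(1))
    def allocate(self) -> str:
        return f"{self.prefix}:{next(self._next):02x}"

@dataclass
class ServerTemplate:
    name: str
    bios: BiosPolicy
    boot: BootPolicy
    mac_pool: MacPool
    def instantiate(self, index: int) -> dict:
        """Create one abstracted server profile from the template."""
        return {
            "name": f"{self.name}-{index:03d}",
            "uuid": str(uuid.uuid4()),
            "mac": self.mac_pool.allocate(),
            "bios": self.bios.settings,
            "boot_order": self.boot.boot_order,
        }

template = ServerTemplate(
    name="web-node",
    bios=BiosPolicy({"HyperThreading": "Enabled"}),
    boot=BootPolicy(["SAN", "PXE"]),
    mac_pool=MacPool("02:00:00:00:01"),
)

# A single template can generate one or hundreds of server profiles.
profiles = [template.instantiate(i) for i in range(1, 4)]
for p in profiles:
    print(p["name"], p["uuid"], p["mac"])
```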

Finally, a server template can generate a single instance or hundreds of instances of abstracted physical servers. Further automation can then easily jumpstart the operating system installation from a DHCP/boot network or through media mounted via the OOB management network.