eResearch Nexus Hardware Redundancy and High Availability

Technical Architecture

Following is a brief description of the hardware and software redundancy employed within the Intersect eResearch Nexus, the physical facilities that underpin the Space and Time services. The goal is to use commodity hardware, take advantage of all features that enhance reliability, and remove single points of failure where possible. All services are configured with an Active/Active architecture, and components have been selected and designed to work this way.


A list of common failure/recovery scenarios is included at the end of this page.


Power

  • Every physical device, including servers, storage arrays and switches, has dual power supplies.
  • Each rack has two independent power circuits fed from separate data centre supplies.
  • Each power supply is connected to a different power circuit.
  • The data centre provides uninterruptible power supply, air conditioning and cooling services.
Network
  • There are two core routers.
  • Each core router has a link to AARNet. Both links are active, and traffic automatically switches over to the other router on failure.
  • The systems use two network switches per rack, linked into a single logical stack.
    • The stack has at least one uplink per switch to the core routers.
    • If an uplink port fails, the other uplinks automatically carry the traffic.
  • Each server uses two network ports in an aggregated link, with the individual ports connected to separate switches (see the status-check sketch after this list).
  • If a network port has an error, the other port continues to carry the traffic.
  • If a switch fails, the server uses the other port to carry the traffic.
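
As an illustration of the aggregated-link behaviour above, here is a minimal status-check sketch. It assumes the servers use the Linux bonding driver with a bond named bond0; the actual interface names and bonding mode on the eResearch Nexus servers are not stated in this document.

    #!/usr/bin/env python3
    """Report the link state of each port in a Linux bonded (aggregated) link.

    Assumption: the bond is named bond0; adjust for the actual systems.
    """
    from pathlib import Path

    BOND_STATUS = Path("/proc/net/bonding/bond0")  # hypothetical bond name

    def slave_states(text: str) -> dict:
        """Map each slave interface to its MII link status (up/down)."""
        states, current = {}, None
        for line in text.splitlines():
            if line.startswith("Slave Interface:"):
                current = line.split(":", 1)[1].strip()
            elif line.startswith("MII Status:") and current:
                states[current] = line.split(":", 1)[1].strip()
                current = None
        return states

    if __name__ == "__main__":
        states = slave_states(BOND_STATUS.read_text())
        for iface, status in states.items():
            print(f"{iface}: {status}")
        # One port can be down while the other keeps carrying traffic;
        # only a fully-down bond means loss of connectivity.
        if states and all(s != "up" for s in states.values()):
            print("WARNING: no active ports in the aggregated link")

If one port reports "down", traffic continues on the other port, which matches the failure scenarios listed later on this page.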

Space Storage

  • Space Storage consists of a number of storage controllers, disk arrays and servers.
  • The Storage Area Network (SAN) uses two fibre channel switches.
    • All disk arrays, controllers and servers use at least two fibre channel connections for the Storage Area Network.
    • All disk arrays, controllers and servers have connections to both fibre channel switches.
  • There are three servers providing Network File System (NFS) services to clients.
    • Each server complies with the power and network design above.
  • NFS automount is used on clients with all three servers as targets (a sketch of the selection idea follows this list).
    • At mount time the most suitable NFS server is selected.
  • When data is written to tape a copy is written to two separate tapes. Either copy can be used to read back the file when required.
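
To make the "most suitable server" selection concrete, the sketch below probes three NFS servers and ranks them by connection time. In production this choice is made by the automounter itself; the hostnames here are placeholders, not the real Space servers.

    #!/usr/bin/env python3
    """Rank replicated NFS servers by responsiveness (illustrative only).

    The automounter performs the real selection; hostnames are placeholders.
    """
    import socket
    import time

    NFS_SERVERS = ["nfs1.example.org", "nfs2.example.org", "nfs3.example.org"]
    NFS_PORT = 2049

    def probe(host: str, timeout: float = 2.0) -> float:
        """Return the TCP connect time to the NFS port, or infinity on failure."""
        start = time.monotonic()
        try:
            with socket.create_connection((host, NFS_PORT), timeout=timeout):
                return time.monotonic() - start
        except OSError:
            return float("inf")

    if __name__ == "__main__":
        # A server that is down fails its probe and drops to the end of the
        # preference order, so mounts fall through to a working server.
        ranked = sorted(NFS_SERVERS, key=probe)
        print("mount order preference:", ", ".join(ranked))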

OwnTime
  • The OpenStack control infrastructure consists of several virtual machines running across three physical servers.
  • Each of the physical servers is in a different rack.
  • The various OpenStack services are distributed across these three servers.
  • The MySQL database cluster has three virtual machines, one running on each server. HAProxy provides access to an active database node.
  • The RabbitMQ message queue cluster has three virtual machines, one running on each server. Services are configured to connect to all three nodes (see the connection sketch after this list).
  • There is one availability zone with two cell controllers, running on separate servers. The cell controllers manage scheduling of new virtual machines, reporting of statistics, and various management functions for the core OpenStack services.
  • The compute servers have local storage for hosting virtual machines.
  • Operating system disks all use RAID1.
  • Data disks for virtual machines are RAID6, allowing for two disk failures in each node before loss of service.
  • There is remote console access to allow reboot or recovery from system crashes without attending the site.
  • There is extensive logging and monitoring to track hardware warnings and faults.
  • The operating system installation for all physical and virtual systems is automated using network boot and templates to ensure rapid and consistent installation.
  • The software configuration of all systems is managed using Puppet, so configurations are automated and consistent.
  • The Nagios configuration is generated automatically from the Puppet configuration: adding a new service or system to Puppet causes a corresponding Nagios check to be created and added automatically (see the generation sketch after this list).
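
The connection sketch below shows the pattern of pointing a service at all three RabbitMQ nodes so it can connect to whichever one is up. It assumes the pika client library and uses placeholder hostnames with default credentials; the real services have their own connection settings.

    #!/usr/bin/env python3
    """Connect to whichever of three RabbitMQ nodes is reachable.

    Assumes the pika client library; hostnames and credentials are placeholders.
    """
    import pika

    RABBIT_HOSTS = ["rabbit1.example.org", "rabbit2.example.org", "rabbit3.example.org"]

    def connect() -> pika.BlockingConnection:
        """pika tries each parameter set until one node accepts the connection."""
        params = [pika.ConnectionParameters(host=h) for h in RABBIT_HOSTS]
        return pika.BlockingConnection(params)

    if __name__ == "__main__":
        connection = connect()
        channel = connection.channel()
        channel.queue_declare(queue="health-check", durable=False)
        print("connected to a RabbitMQ node; cluster is reachable")
        connection.close()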
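
The generation sketch below illustrates the idea of deriving monitoring checks from the same inventory that defines the systems, which is why a service added to Puppet picks up a Nagios check automatically. The real generation is done by Puppet; the inventory, hostnames and check commands here are placeholders.

    #!/usr/bin/env python3
    """Render Nagios host and service definitions from one inventory (illustrative).

    The production configuration is generated by Puppet; everything below is a placeholder.
    """

    # Hypothetical inventory, standing in for the Puppet-managed node list.
    HOSTS = {
        "nfs1.example.org": ["check_ssh", "check_nfs"],
        "compute01.example.org": ["check_ssh", "check_libvirt"],
    }

    HOST_TMPL = """define host {{
        use        generic-host
        host_name  {name}
        address    {name}
    }}
    """

    SERVICE_TMPL = """define service {{
        use                 generic-service
        host_name           {name}
        service_description {check}
        check_command       {check}
    }}
    """

    def render(hosts: dict) -> str:
        """Emit one host block plus one service block per declared check."""
        parts = []
        for name, checks in hosts.items():
            parts.append(HOST_TMPL.format(name=name))
            parts.extend(SERVICE_TMPL.format(name=name, check=c) for c in checks)
        return "\n".join(parts)

    if __name__ == "__main__":
        print(render(HOSTS))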

Failure/Recovery Scenarios

Each scenario below notes whether service continues, and describes what happens in the infrastructure and what we do to recover.

Failure scenario: One of the power supplies to a physical server, disk array or switch fails
Continuity of service: Yes
Response: The other power supply continues to carry the load. Normal operation continues with no loss of service. A service call is raised with the vendor to replace the failed part.

Failure scenario: One of the network ports goes down
Continuity of service: Yes
Response: The other port in the aggregated link continues to carry the traffic. Normal operation continues with no loss of service. A service call is raised with the vendor to replace the failed part.

Failure scenario: One of the network switches or routers fails
Continuity of service: Yes
Response: All servers are connected to more than one switch, so the other port in the aggregated link continues to carry the traffic. Normal operation continues. A service call is raised with the vendor to replace the failed part.

Failure scenario: Comms link to AARNet lost
Continuity of service: Yes
Response: The network routes automatically change to use the alternate AARNet link. A brief pause in traffic flow, typically less than 30 seconds, may occur; traffic then continues without service interruption.

Failure scenario: Compute server goes offline
Continuity of service: Partial
Response: Server recovery may require restarting services or rebooting. After the system is updated and verified, the affected VMs are manually restarted. Intersect maintains a pool of spare servers, so in the event of a likely extended outage we would swap in a spare system and/or reallocate the VMs to other servers, whichever is most appropriate. If there is a hardware failure, a service call is raised with the vendor.

Failure scenario: A hard disk fails
Continuity of service: Yes
Response: RAID allows the remaining disk(s) to provide service. A service call is raised with the vendor to replace the failed part. The new disk can be installed without interruption to service.


Last updated: 19 May 2019

Source: https://inter.fyi/R8EaJ