eResearch Nexus Hardware Redundancy and High Availability

Technical Architecture

Following is a brief description of the hardware and software redundancy employed within the Intersect eResearch Nexus, the physical facilities that underpin the Space and Time services. The goal is to use commodity hardware, take advantage of all features that enhance reliability, and remove single points of failure where possible. All services are configured with an Active/Active architecture, and components have been selected and designed to work this way.


A list of common failure/recovery scenarios is included at the end of this page.


Power

  • Every physical device, including servers, storage arrays and switches, has dual power supplies.
  • Each rack has two independent power circuits fed from separate data centre supplies.
  • Each power supply is connected to a different power circuit.
  • The data centre provides uninterruptible power supply, air conditioning and cooling services.
Network
  • There are two core routers.
  • Each core router has a link to AARNet. Both links are active, and traffic automatically switches over to the other router on failure.
  • The systems use two network switches per rack, linked into a single logical stack.
    • The stack has at least one uplink per switch to the core routers.
    • If an uplink port fails, the other uplinks automatically carry the traffic.
  • Each server uses two network ports in an aggregated link, with the individual ports connected to separate switches (see the status-check sketch after this list).
  • If a network port has an error, the other port continues to carry the traffic.
  • If a switch fails, the server uses the other port to carry the traffic.
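
As an illustration of the aggregated-link behaviour above, here is a minimal status-check sketch. It assumes the servers use the Linux bonding driver with a bond named bond0; the actual interface names and bonding mode on the eResearch Nexus servers are not stated in this document.

    #!/usr/bin/env python3
    """Report the link state of each port in a Linux bonded (aggregated) link.

    Assumption: the bond is named bond0; adjust for the actual systems.
    """
    from pathlib import Path

    BOND_STATUS = Path("/proc/net/bonding/bond0")  # hypothetical bond name

    def slave_states(text: str) -> dict:
        """Map each slave interface to its MII link status (up/down)."""
        states, current = {}, None
        for line in text.splitlines():
            if line.startswith("Slave Interface:"):
                current = line.split(":", 1)[1].strip()
            elif line.startswith("MII Status:") and current:
                states[current] = line.split(":", 1)[1].strip()
                current = None
        return states

    if __name__ == "__main__":
        states = slave_states(BOND_STATUS.read_text())
        for iface, status in states.items():
            print(f"{iface}: {status}")
        # One port can be down while the other keeps carrying traffic;
        # only a fully-down bond means loss of connectivity.
        if states and all(s != "up" for s in states.values()):
            print("WARNING: no active ports in the aggregated link")

If one port reports "down", traffic continues on the other port, which matches the failure scenarios listed later on this page.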

Space Storage

  • Space Storage consists of a number of storage controllers, disk arrays and servers.
  • The Storage Area Network (SAN) uses two fibre channel switches.
    • All disk arrays, controllers and servers use at least two fibre channel connections for the Storage Area Network.
    • All disk arrays, controllers and servers have connections to both fibre channel switches.
  • There are three servers providing Network File System (NFS) services to clients.
    • Each server complies with the power and network design above.
  • NFS automount is used on clients with all three servers as targets (a sketch of the selection idea follows this list).
    • At mount time the most suitable NFS server is selected.
  • When data is written to tape a copy is written to two separate tapes. Either copy can be used to read back the file when required.
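
To make the "most suitable server" selection concrete, the sketch below probes three NFS servers and ranks them by connection time. In production this choice is made by the automounter itself; the hostnames here are placeholders, not the real Space servers.

    #!/usr/bin/env python3
    """Rank replicated NFS servers by responsiveness (illustrative only).

    The automounter performs the real selection; hostnames are placeholders.
    """
    import socket
    import time

    NFS_SERVERS = ["nfs1.example.org", "nfs2.example.org", "nfs3.example.org"]
    NFS_PORT = 2049

    def probe(host: str, timeout: float = 2.0) -> float:
        """Return the TCP connect time to the NFS port, or infinity on failure."""
        start = time.monotonic()
        try:
            with socket.create_connection((host, NFS_PORT), timeout=timeout):
                return time.monotonic() - start
        except OSError:
            return float("inf")

    if __name__ == "__main__":
        # A server that is down fails its probe and drops to the end of the
        # preference order, so mounts fall through to a working server.
        ranked = sorted(NFS_SERVERS, key=probe)
        print("mount order preference:", ", ".join(ranked))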

OwnTime
  • The OpenStack control infrastructure consists of several virtual machines running across three physical servers.
  • Each of the physical servers is in a different rack.
  • The various OpenStack services are distributed across these three servers.
  • The MySQL database cluster has three virtual machines, one running on each server. HAProxy provides access to an active database node.
  • The RabbitMQ message queue cluster has three virtual machines, one running on each server. Services are configured to connect to all three nodes (see the connection sketch after this list).
  • There is one availability zone with two cell controllers, running on separate servers. The cell controllers manage scheduling of new virtual machines, reporting of statistics, and various management functions for the core OpenStack services.
  • The compute servers have local storage for hosting virtual machines.
  • Operating system disks all use RAID1.
  • Data disks for virtual machines are RAID6, allowing for two disk failures in each node before loss of service.
  • There is remote console access to allow reboot or recovery from system crashes without attending the site.
  • There is extensive logging and monitoring to track hardware warnings and faults.
  • The operating system installation for all physical and virtual systems is automated using network boot and templates to ensure rapid and consistent installation.
  • The software configuration of all systems is managed using Puppet, so configurations are automated and consistent.
  • The Nagios configuration is generated automatically from the Puppet configuration: adding a new service or system to Puppet causes a corresponding Nagios check to be created and added automatically (see the generation sketch after this list).
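
The connection sketch below shows the pattern of pointing a service at all three RabbitMQ nodes so it can connect to whichever one is up. It assumes the pika client library and uses placeholder hostnames with default credentials; the real services have their own connection settings.

    #!/usr/bin/env python3
    """Connect to whichever of three RabbitMQ nodes is reachable.

    Assumes the pika client library; hostnames and credentials are placeholders.
    """
    import pika

    RABBIT_HOSTS = ["rabbit1.example.org", "rabbit2.example.org", "rabbit3.example.org"]

    def connect() -> pika.BlockingConnection:
        """pika tries each parameter set until one node accepts the connection."""
        params = [pika.ConnectionParameters(host=h) for h in RABBIT_HOSTS]
        return pika.BlockingConnection(params)

    if __name__ == "__main__":
        connection = connect()
        channel = connection.channel()
        channel.queue_declare(queue="health-check", durable=False)
        print("connected to a RabbitMQ node; cluster is reachable")
        connection.close()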
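
The generation sketch below illustrates the idea of deriving monitoring checks from the same inventory that defines the systems, which is why a service added to Puppet picks up a Nagios check automatically. The real generation is done by Puppet; the inventory, hostnames and check commands here are placeholders.

    #!/usr/bin/env python3
    """Render Nagios host and service definitions from one inventory (illustrative).

    The production configuration is generated by Puppet; everything below is a placeholder.
    """

    # Hypothetical inventory, standing in for the Puppet-managed node list.
    HOSTS = {
        "nfs1.example.org": ["check_ssh", "check_nfs"],
        "compute01.example.org": ["check_ssh", "check_libvirt"],
    }

    HOST_TMPL = """define host {{
        use        generic-host
        host_name  {name}
        address    {name}
    }}
    """

    SERVICE_TMPL = """define service {{
        use                 generic-service
        host_name           {name}
        service_description {check}
        check_command       {check}
    }}
    """

    def render(hosts: dict) -> str:
        """Emit one host block plus one service block per declared check."""
        parts = []
        for name, checks in hosts.items():
            parts.append(HOST_TMPL.format(name=name))
            parts.extend(SERVICE_TMPL.format(name=name, check=c) for c in checks)
        return "\n".join(parts)

    if __name__ == "__main__":
        print(render(HOSTS))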

Failure/Recovery Scenarios

Each scenario below notes whether service continues, and describes what happens in the infrastructure and what we do to recover.

Failure scenario: One of the power supplies to a physical server, disk array or switch fails
Continuity of service: Yes
Response: The other power supply continues to carry the load. Normal operation continues with no loss of service. A service call is raised with the vendor to replace the failed part.

Failure scenario: One of the network ports goes down
Continuity of service: Yes
Response: The other port in the aggregated link continues to carry the traffic. Normal operation continues with no loss of service. A service call is raised with the vendor to replace the failed part.

Failure scenario: One of the network switches or routers fails
Continuity of service: Yes
Response: All servers are connected to more than one switch, so the other port in the aggregated link continues to carry the traffic. Normal operation continues. A service call is raised with the vendor to replace the failed part.

Failure scenario: Comms link to AARNet lost
Continuity of service: Yes
Response: The network routes automatically change to use the alternate AARNet link. A brief pause in traffic flow, typically less than 30 seconds, may occur; traffic then continues without service interruption.

Failure scenario: Compute server goes offline
Continuity of service: Partial
Response: Server recovery may require restarting services or rebooting. After the system is updated and verified, the affected VMs are manually restarted. Intersect maintains a pool of spare servers, so in the event of a likely extended outage we would swap in a spare system and/or reallocate the VMs to other servers, whichever is most appropriate. If there is a hardware failure, a service call is raised with the vendor.

Failure scenario: A hard disk fails
Continuity of service: Yes
Response: RAID allows the remaining disk(s) to provide service. A service call is raised with the vendor to replace the failed part. The new disk can be installed without interruption to service.


Last updated: 19 May 2019

Source: https://inter.fyi/R8EaJ