[RESOLVED] Downtime for Hardware Failure - ALL TCHPC Clusters
Update Wed 26 Jun 9am
The replacement controller was installed, and the disk arrays have been rebuilt. The filesystems are back online.
Logins are available on the cluster headnodes again. The queues will be released shortly.
Original post: Friday 21st 9.30am
Due to a hardware failure in the SAN storage system, the clusters (lonsdale, parsons, kelvin) will now be taken offline.
We are taking this step as the storage system is now in a non-redundant state, and we wish to guard against potential data loss.
We expect a replacement unit to be delivered on Monday, and will bring the systems back online as soon as the rebuilds have finished. This could run into Tuesday.
All queues will be unavailable during this time.
The GPFS cluster filesystems (/gscratch) will also be unavailable during this period.
For queries, please contact: firstname.lastname@example.org