[RESOLVED] Downtime for Hardware Failure - ALL TCHPC Clusters
Update Wed 26 Jun 9am
The replacement controller was installed, and the disk arrays have been rebuilt. The filesystems are back online.
Logins are available on the cluster headnodes again. The queues will be released shortly.
Original post: Fri 21 Jun 9.30am
Due to a hardware failure in the SAN storage system, the clusters (lonsdale, parsons, kelvin) will now be taken offline.
We are taking this step as the storage system is now in a non-redundant state, and we wish to guard against potential data loss.
We expect to have a replacement unit delivered on Monday, and will have the systems back online as soon as the rebuilds have finished. This could run into Tuesday.
All queues will be unavailable during this period.
The GPFS cluster filesystems (/gscratch) will also be unavailable during this period.
For queries, please contact: firstname.lastname@example.org