[RESOLVED] Downtime of data center #1 on 4 August 2015

On 4 August 2015 at 18:46, a technician in data center #1 suffered a medium-voltage electric shock. Following the calls and reports, an ambulance, emergency doctors and the fire brigade arrived on site at the data center; the technician was resuscitated immediately and taken to hospital. The entire data center was shut down and switched to diesel generator power with the bypass active.

The power distribution units (PDUs) were then brought back up one at a time. PDU#1 proved faulty and severely damaged, and the unit's vendor was informed so that it could be restored. From around 19:17, two thirds of the infrastructure and services had power again, and we brought services and servers back online one server at a time.

At around 23:06, unfortunately, PDU#2 became overloaded because too many customers had switched over from the faulty PDU#1 to PDU#2. Both the vendor and our technicians kept working to restore power to the entire DC.

At around 04:30 on 5 August 2015, power was restored to the entire data center, and from 04:46 most of the infrastructure was back online. Our technicians replaced faulty hardware (Cisco Cat, among others) and in part restored the (P)VLAN configuration from backup.

At 6 in the morning all servers and the infrastructure were up, and we proceeded to verify the services. Internet traffic returned to normal, with problems still being detected for individual customers (filesystem checks, database recovery, etc.).

Finally, at 05:00 on the morning of 7 August 2015, the switchback to commercial utility power was completed successfully.

We apologize for any inconvenience caused by this downtime (the technician suffered burns but was discharged from hospital the following day). Unfortunately, many of our own services (website, VoIP exchange, mail server) were also affected; thanks to the Twitter channel @server24it we were able to inform at least some of our customers.

The emergency log follows (in English):

08/07/2015 05:43:11 GMT - 
*** CASCADED EXTERNAL NOTES 07-Aug-2015 05:42:42 GMT From CASE: 9542194 - Event
The activity to switch back to commercial power completed by 03:00 GMT. All activities are now completed and full infrastructure resiliency is in place.


08/06/2015 21:21:04 GMT - 
*** CASCADED EXTERNAL NOTES 06-Aug-2015 21:19:54 GMT From CASE: 9542194 - Event
August 6 There has been an unrelated network fault in Germany on our cable infrastructure. The repair operation is progressing well and we expect to have that completed by 01:00 GMT. The activity to switch back to commercial power at the Munich location will now take place between 02:00-03:00 GMT.


08/06/2015 12:18:42 GMT - 
*** CASCADED EXTERNAL NOTES 06-Aug-2015 12:18:18 GMT From CASE: 9542194 - Event
August 6 The specialist's engineers have completed the UPS and infrastructure checks. Preparations are now underway to return the site back to the commercial supply. The switch over from the site generators to the commercial supply will take place tonight between 22:00-23:00GMT during which there is no anticipated interruption to service. A further communication will be sent when switch over has been completed.


08/05/2015 13:59:30 GMT - 
*** CASCADED EXTERNAL NOTES 05-Aug-2015 13:58:38 GMT From CASE: 9542194 - Event
The facility power continues to remain stable.  While further infrastructure checks take place, the Munich facility will remain on generator supply for a further 24 hours.  Provisions have been made by Server24 to ensure that sufficient generator fuel is available. Server24 specialists will continue to remain on site until the investigations have been concluded and the commercial power has been reinstated.


08/05/2015 13:56:27 GMT - 
*** CASCADED EXTERNAL NOTES 05-Aug-2015 13:56:00 GMT From CASE: 9542194 - Event
The facility power continues to remain stable.  While further infrastructure checks take place, the Munich facility will remain on generator supply for a further 24 hours.  Provisions have been made by Server24 to ensure that sufficient generator fuel is available. Server24 specialists will continue to remain on site until the investigations have been concluded and the commercial power has been reinstated.


08/05/2015 10:39:02 GMT - 
*** CASCADED EXTERNAL NOTES 05-Aug-2015 10:38:40 GMT From CASE: 9542194 - Event
On 4th August 2015 at approximately 16:50 GMT, while troubleshooting a potential faulty power track and sub-section of a PDU, a fault occurred which caused an electrical arc to take place. The UPS went into bypass mode and the site generators activated to restore power to customer colocation areas.
Whilst further investigations by the equipment specialists take place, the UPS will remain in bypass mode and the facility will remain on generator supply.
We do not anticipate any further interruptions to service. 
Until such time that full resiliency has been restored to the facility further updates will continue to be communicated.


08/05/2015 06:14:24 GMT - 
*** CASCADED EXTERNAL NOTES 05-Aug-2015 06:13:59 GMT From CASE: 9542194 - Event
Efforts to resolve the single customer issue are ongoing and all other customers remain restored. A final communication will be sent once all work at the site has been completed.


08/05/2015 04:15:06 GMT - 
*** CASCADED EXTERNAL NOTES 05-Aug-2015 04:14:39 GMT From CASE: 9542194 - Event
Facility Management has replaced the breakers and an Electrician completed the required cable checks. The breakers were turned on one by one to complete the bypass of the Uninterrupted Power Supply. Power is currently being supplied by the emergency generator and the affected services have restored. A single customer issue persists due to a power rail down on a separate Power Distribution Unit. The issue was caused by the device being overloaded with customers during the event and efforts to resolve this issue are underway.


08/05/2015 01:45:45 GMT - 
*** CASCADED EXTERNAL NOTES 05-Aug-2015 01:45:16 GMT From CASE: 9542194 - Event
Facility Management arrived to the site and determined it is necessary to replace multiple breakers. The breakers will be replaced, then the internal cables will be checked. Updates will be provided, as they are made available by Facility Management.


08/04/2015 22:32:06 GMT - 
*** CASCADED EXTERNAL NOTES 04-Aug-2015 22:31:30 GMT From CASE: 9542194 - Event
The European NOC advised that Facility Management remains en route to the site to resolve the Power Distribution Unit issue, with an estimated time of arrival of 02:00 GMT. Efforts to coordinate the equipment vendor's dispatch to resolve the initial Uninterrupted Power Supply issue continue. The next update will be provided upon the Facility Management's arrival, or when new information is available.


08/04/2015 21:41:32 GMT - 
*** CASCADED EXTERNAL NOTES 04-Aug-2015 21:40:57 GMT From CASE: 9542194 - Event
Efforts to restore all power at the site continue.


08/04/2015 20:41:36 GMT - 
*** CASCADED EXTERNAL NOTES 04-Aug-2015 20:41:09 GMT From CASE: 9542194 - Event
The equipment vendor worked with Field Services to remotely troubleshoot the issue, but was unable to isolate the fault. As such, efforts to escalate with the equipment vendor and obtain an estimated time of arrival for the electrician continue.


08/04/2015 19:55:44 GMT - 
*** CASCADED EXTERNAL NOTES 04-Aug-2015 19:55:19 GMT From CASE: 9542194 - Event
The European NOC advised that the Facility Management is also dispatching to the site to ensure the integrity of all of the power systems at site and confirm that no further issues are present. An estimated time of arrival of 02:00 GMT on August 5, 2015 was provided, but efforts to expedite the arrival are in progress. The power equipment vendor is also dispatching an electrician to resolve the issue with the Uninterrupted Power Supply, but an estimated time of arrival remains unavailable.


08/04/2015 19:30:01 GMT - 
*** CASCADED EXTERNAL NOTES 04-Aug-2015 19:29:34 GMT From CASE: 9542194 - Event
The European NOC advised that some services have restored at this time, but some remain impacted by the power issue. The equipment vendor is engaged to restore the remaining services and the European NOC is continuing efforts to obtain an estimated time of arrival.