Network incident under investigation

Incident Report for Bloomreach

Postmortem

After further investigation of this network incident with our upstream provider, we would like to correct and extend our initial report.

On the 5th of September, between 11:38 and 12:06 UTC, our EU-B (NL2) platform experienced an issue affecting outgoing traffic from the platform and traffic within it. Incoming requests were unaffected.

Below are details from the upstream provider:

We had an unexpected server hardware failure. While our high-availability service was migrating services to a new node, our regular CloudStack maintenance update happened to run at the same time. This caused the virtual routers to hit a race condition while switching over from the master to the backup router, which in turn led to the ARP entry pointing to the backup router.

To prevent this from occurring in the future, our upstream provider will:

lower the ARP cache entry timeout on their core switches
investigate whether they can force an update of the ARP cache entry on their core switches if this scenario occurs
investigate whether the backup router can keep forwarding traffic in this case
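
To illustrate why the first two measures help, here is a minimal Python sketch of a switch ARP cache holding a stale entry. It is a toy model, not the provider's configuration: the addresses, the timeout values, and the names `ArpCache` and `stale_window` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical addresses, used only for this illustration.
GATEWAY_IP = "10.0.0.1"
MASTER_MAC = "aa:aa:aa:aa:aa:aa"
BACKUP_MAC = "bb:bb:bb:bb:bb:bb"


@dataclass
class ArpEntry:
    mac: str           # MAC address the switch believes owns the IP
    learned_at: float  # time in seconds at which the entry was learned


class ArpCache:
    """Toy model of a switch ARP cache with a fixed entry timeout."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.entries = {}  # maps IP address -> ArpEntry

    def learn(self, ip, mac, now):
        # Store or overwrite an entry (also models a forced update).
        self.entries[ip] = ArpEntry(mac, now)

    def lookup(self, ip, now, resolve):
        entry = self.entries.get(ip)
        if entry is None or now - entry.learned_at >= self.timeout_s:
            # Entry missing or expired: re-resolve and cache the answer.
            self.learn(ip, resolve(ip), now)
        return self.entries[ip].mac


def stale_window(timeout_s):
    """Seconds during which traffic is still sent to the backup router.

    Assumes the wrong entry was learned at t=0 and the master router
    owns the gateway IP again from t=0 onward.
    """
    cache = ArpCache(timeout_s)
    cache.learn(GATEWAY_IP, BACKUP_MAC, now=0.0)
    t = 0.0
    while cache.lookup(GATEWAY_IP, t, lambda ip: MASTER_MAC) == BACKUP_MAC:
        t += 1.0
    return t


def forced_update_mac():
    """MAC seen immediately after a forced update of the stale entry."""
    cache = ArpCache(timeout_s=1200.0)
    cache.learn(GATEWAY_IP, BACKUP_MAC, now=0.0)
    cache.learn(GATEWAY_IP, MASTER_MAC, now=1.0)  # forced refresh
    return cache.lookup(GATEWAY_IP, 2.0, lambda ip: MASTER_MAC)
```

In this model the stale entry survives for exactly the configured timeout, so a lower timeout shortens the window in which traffic goes to the wrong router, while a forced update corrects the entry without waiting for it to expire.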

Posted Sep 11, 2019 - 12:20 CEST

Resolved

We are currently investigating a network incident upstream that led to connectivity problems for traffic exiting the platform. Sites remained up during the incident.

Posted Sep 05, 2019 - 14:11 CEST