Network incident under investigation
Incident Report for Bloomreach
Postmortem

After further investigation of this network incident with our upstream provider, we would like to correct and extend our initial report.

On 5 September 2019, between 11:38 and 12:06 UTC, our EU-B (NL2) platform experienced an issue affecting outgoing traffic from the platform and traffic within it. Incoming requests were unaffected.

Below are details from the upstream provider:

We had an unexpected server hardware failure, and the migration of services to a new node by our high-availability service coincidentally overlapped with our regular CloudStack maintenance update. This caused the virtual routers to run into a race condition while switching over from master to backup, which in turn led to the ARP entry pointing to the backup router.

To prevent this from recurring, our upstream provider will:

  • lower the ARP cache entry timeout on their core switches
  • investigate whether they can force an update of the ARP cache entry on their core switches when this scenario occurs
  • investigate whether the backup router can keep forwarding traffic in this case
Posted Sep 11, 2019 - 12:20 CEST

Resolved
We are currently investigating a network incident upstream that led to connectivity problems for traffic exiting the platform. Sites remained up during the incident.
Posted Sep 05, 2019 - 14:11 CEST
This incident affected: Bloomreach Cloud (Bloomreach Cloud EU-B (NL2)).