Partial Service Outage/Degraded Performance - 14-16 January 2019 [RESOLVED]
Wed Jan 16 23:10 2019 UTC:
As of 11:10pm UTC we are seeing server response times return to normal.
We will continue to monitor closely over the coming hours and provide further updates if necessary; however, we believe this issue has now been resolved.
Please contact us if you're still experiencing an issue with slow response times of your hosted MIDAS service.
Wed Jan 16 20:55 2019 UTC:
We remain in close contact with our hosting provider, who have just informed us that the RAID rebuild (see previous updates) is now 98% complete.
Once again we apologize if you're currently experiencing degraded server response times when accessing your MIDAS system, and we appreciate the frustration this is causing some of our customers.
Again, we would like to reassure you that we are working closely and diligently with our hosting provider to actively monitor the situation, and we are hopeful that "normal service" will be resumed as soon as practically possible.
We genuinely appreciate your ongoing patience, and thank you for it.
Wed Jan 16 01:03 2019 UTC:
The following email has been sent to the Primary Contact we currently hold on file for all customers potentially affected by this issue:
Service Status Update - 15th January 2019
Hi <customer's name>,
You may have noticed degraded performance of your hosted MIDAS system over the last 30 or so hours, and also a brief period of inaccessibility yesterday afternoon/evening.
We therefore wanted to take this opportunity to reach out to the limited number of customers who are potentially affected by this to provide a detailed update and bring you up to speed on exactly what happened...
Just after 5pm UTC yesterday (Monday 14th) we received an isolated report from a customer reporting slow loading times of their MIDAS system. We investigated but were initially unable to replicate the issue, and our internal network monitoring systems were not detecting any unusual issues with response times.
However, at 6:30pm UTC yesterday, we began receiving further reports from other customers whose hosted MIDAS systems all resided on the same node in our network, reporting the same issue of slow loading times, and in some instances temporary "service unavailable" messages.
Whilst we were investigating the potential cause of this issue on one of our client nodes, the node in question then suffered what's known as a "kernel panic" (essentially a computer crash) at 9:42pm UTC yesterday.
As you may know if your own home/work computer has ever crashed, in the run-up to a crash a computer will often become very slow. We now believe that the degraded performance of one of our client nodes was a precursor to its subsequent "kernel panic".
Anyway, this "kernel panic" was instantly detected by our hosting provider, who immediately rebooted the node and brought it back online. This downtime lasted less than 5 minutes.
However, when the node was brought back online, customers' MIDAS databases failed to initialize correctly.
So whilst customers on the affected node could then access their MIDAS login screens, attempts to log in would be met with "access denied" or "unable to connect to database" errors.
Our hosting provider immediately began working to resolve the database problem, which was caused by the unexpected crash (as you may know, if you've ever pulled the plug on your own home/work PC without correctly shutting it down first, files can become corrupt).
In order to resolve the database errors caused by the unexpected crash, databases on the node had to be rebuilt. Given the number of customers who have MIDAS databases on each client node, this took a little time to complete. It was necessary to take the databases offline during the rebuild, meaning you would have been unable to log in to your MIDAS system whilst it was taking place.
No data was lost during this process and our hosting provider was able to fully resolve these database issues within the hour, with access being fully restored for all affected customers.
We are aware however that performance on this node is still not up to par following yesterday's issues, and so you may still presently be experiencing slower than normal page loading times when using your MIDAS system.
Our hosting provider informs us that the ongoing degraded performance is due to the node's RAID (hard disc) array rebuilding itself (essentially "self healing") following the crash.
Whilst an exact time frame for this "self healing" process to complete and for performance to return to optimum is impossible to predict, due to a large variety of contributing factors, our hosting provider are closely monitoring the node, and their current "estimate" is 12 - 36 hours.
So, in summary:
- The degraded performance yesterday was a precursor to an unexpected "kernel panic" (server crash)
- The unexpected crash caused errors in the databases on the node, which were fully repaired (within an hour)
- Any ongoing degraded performance is due to the node "self healing" following the hardware failure
- It may potentially be another 12 - 36 hours before the node's performance returns to optimum
So we'd like to take this opportunity firstly to apologize for any inconvenience this issue has caused or is currently causing you, and secondly to thank you for your patience and understanding.
We're passionate about MIDAS and delivering the best possible service to you, and we hope this update provides reassurance that we are actively and closely monitoring the situation.
We'd also like to reassure you that at no time has your data been at risk, as we take both real-time and daily offsite backups of customers' MIDAS databases.
Additionally, we'd like to remind you that in the very rare instances when an issue arises which may affect your hosted MIDAS service, we have a dedicated Service Status site at https://midas.network where you can find regular updates, and we also post updates to our Twitter feed (@mid_as). Please do bookmark our service status site and follow us on Twitter - these are your primary "go to" places for up-to-date information on any ongoing issues.
Finally, should you have any questions or concerns, please don't hesitate to contact us and we'll be happy to help!
Tue Jan 15 00:10 2019 UTC:
Further to the previous updates (below), our hosting provider believes that the performance issues that some of our customers reported to us earlier yesterday evening were symptomatic of, and a potential precursor to, the subsequent kernel panic.
All systems are back up and running. However, customers on the affected node may continue to face some temporary issues with slowness for the next few hours until the affected node can normalize again, after which we expect the node to return to the high levels of performance our cloud-hosted customers are used to!
Thank you again for your patience and understanding during this brief partial outage which affected some of our customers earlier.
Mon Jan 14 23:50 2019 UTC:
As of 22:45 UTC this evening access to all affected databases should have been fully restored.
If you're currently still unable to access your hosted MIDAS, please contact us and we'll be happy to look into this further.
We are currently monitoring server performance following reports received prior to the outage, but once again we sincerely apologize for the inconvenience this earlier outage may have caused.
Mon Jan 14 22:30 2019 UTC:
One of our client servers suffered a "kernel panic" at 21:37 UTC this evening and had to be power cycled.
Unfortunately, when the power came back up a few moments later, a number of customer databases didn't initialize correctly and are presently offline. Our hosting provider is actively working to resolve this issue and bring the affected databases back online, and we anticipate that access will be restored shortly.
In the meantime, we sincerely apologize for the inconvenience caused and we thank you for your patience.
Mon Jan 14 19:50 2019 UTC:
We are currently experiencing elevated load on one of our client servers. This may be resulting in degraded performance in response times or intermittent "503 Service Temporarily Unavailable" messages for some of our customers. We are currently looking into the cause of this and will post a further update shortly.