Sunday, October 14, 2018

Root cause analysis - Mon 1st Oct

Affected service
- cPanel Hosting on green.phurix.com

The problems
- Support was unavailable
- Server was unable to send emails
- Server ran out of disk space
- Out of band email for urgent issues failed

Timeline of events
- Sun 9th September - Users were not receiving mail into their mailboxes
- Sat 29th September - We were made aware of the issue
- Sun 30th September - The mail service was restored

What happened? Breakdown of the problem
- As part of our ongoing efforts to decommission our older servers, we moved many of our virtual servers to newer hardware
- Unfortunately, server “aqua.phurix.com” was suffering from a disk/quota issue which made a system-level transfer impossible
- In an effort to avoid any downtime, users were migrated from “aqua.phurix.com” to “green.phurix.com” using the express transfer process at the cPanel WHM level
- One of the users that was transferred to the server was running a Twitter aggregation script that was not configured correctly and was outputting errors to an error_log file
- cPanel does not rotate user-level logs and, due to an issue, user quotas were not enforced, so the offending error_log file grew until it filled the cPanel server's quota and the server ran out of disk space
- We had yet to transfer the “phurix.co.uk” domain away from the shared server to a separate instance and so it too ran out of disk space
- As the “phurix.co.uk” server had run out of disk space, we were unfortunately unable to receive or reply to any emails
- Our telephone messaging service and urgent messaging service were configured to point at an “out of band” domain and email address, “urgent@phurix.net”
- Due to a recent migration to a new service, email forwarders were not configured for that address, so the “on call” recipients did not receive any notifications of an issue
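
To illustrate the missing log rotation described above, here is a minimal sketch of size-based rotation for a single user's error_log. It is not how cPanel works and not what we ran. The function name `rotate_if_large`, the size limit and the number of old copies kept are all assumptions made for the example.

```python
import os


def rotate_if_large(log_path, max_bytes, keep=3):
    """Rotate log_path once it exceeds max_bytes.

    Hypothetical helper: old copies are named log_path.1 .. log_path.<keep>,
    and the oldest copy is deleted on each rotation.
    Returns True if a rotation happened, False otherwise.
    """
    if not os.path.isfile(log_path) or os.path.getsize(log_path) <= max_bytes:
        return False

    # Remove the oldest copy so the total number of copies stays bounded.
    oldest = f"{log_path}.{keep}"
    if os.path.exists(oldest):
        os.remove(oldest)

    # Shift the remaining copies up by one: .2 -> .3, .1 -> .2, and so on.
    for i in range(keep - 1, 0, -1):
        src = f"{log_path}.{i}"
        if os.path.exists(src):
            os.replace(src, f"{log_path}.{i + 1}")

    # Move the live log aside and start a fresh, empty one.
    os.replace(log_path, f"{log_path}.1")
    open(log_path, "a").close()
    return True
```

A call such as `rotate_if_large("/home/exampleuser/public_html/error_log", 50 * 1024 * 1024)` would cap that one file at roughly 50 MB plus three old copies; the path is illustrative only. Note that the fresh file is owned by whoever runs the script, and a process that keeps the old file open will continue writing to the renamed copy, so a real deployment would need to handle both.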

What was done to mitigate the issue?
- As soon as we were made aware of the issue we escalated it and an investigation was started
- At first we thought it was a MySQL issue, as MySQL seemed to be consuming a lot of system resources, but investigating this and restarting the service did not resolve the issue
- The disk space was increased, the large file was identified and removed, and the culprit script was disabled
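
As an illustration of the "identify the large file" step, the sketch below walks a directory tree and reports its biggest files. It is a generic approach rather than a record of the exact commands we used, and the function name `largest_files` is an assumption made for the example.

```python
import heapq
import os


def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a file is not counted via its link.
            if os.path.islink(path):
                continue
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # The file vanished or is unreadable; ignore it.
                continue
    return heapq.nlargest(limit, sizes)
```

Running `largest_files("/home", 5)` on a shared server would be expected to surface a runaway error_log near the top of the list; "/home" is an assumed location for user accounts.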

What actions will be taken to ensure that this issue does not happen again?
- The “phurix.co.uk” domain will be migrated away from the shared server
- The “out of band” email will be configured with email forwarding
- More “out of band” contact details will be provided and promoted (such as Twitter)
- User quotas will be enabled
- Service costs will be reviewed to reflect the higher level of support required
- Plans with a higher level of support will be offered

Executive summary

We fully appreciate and share in the frustration you have experienced.

Unfortunately, a number of circumstances led up to this situation, which ultimately resulted in a failure of the service.

Rest assured that there are no excuses for what happened, but we hope that this offers an explanation.

These types of situations are rarely the result of a single action and are instead the result of a “perfect storm”. This is the exception and does not reflect our usual level of service.

We’ve been around for over 10 years now and have always strived to offer a great service at a great price.

We will endeavour to take the appropriate actions to avoid this happening in future.

Thanks.