Server Outage - Reboot



Snowman
01-12-2007, 05:57 PM
Perl on the server is playing up, so I am rebooting the server now to correct the issue

Services should be restored shortly

Apologies for any inconvenience caused

Snowman
01-12-2007, 06:28 PM
The server is not responding to reboot and Angus is currently on his way into the datacentre to investigate the cause of this issue

I will update everyone as soon as he has taken a look at the server

Snowman
01-12-2007, 07:41 PM
Angus has checked over the server and unfortunately, due to a corrupt virtfs folder, the links to the /bin directory where Perl resides have been cut and the bin directory has been wiped

We are now forced to do an OS reload, which Angus is currently doing.

Once that's done I will re-install cPanel and configure the server, and then re-sync everyone's data from the second RAID drive so that today's data is fresh up to the time the server went down

I can't give an ETA at this stage as it is a complex process to complete, but as we progress through each stage I will update everyone with as much info as possible

Snowman
01-12-2007, 08:48 PM
CentOS has been re-installed and I am now configuring and installing cPanel, which generally takes around 1 hour to complete

Once that's done, sites will be restored from the overnight backup. We will then look at re-syncing the data from the second RAID drive if it's possible without causing issues

Snowman
02-12-2007, 12:49 AM
I'm currently configuring cPanel and Apache, and the restoration of sites will start in around an hour's time

Snowman
02-12-2007, 04:00 AM
Apologies for the delay in getting this issue resolved

We are having problems getting the backup drive to mount correctly on the Linux file system

Angus and I have made several attempts to correct the problem but we have not been able to so far.

I've contacted the datacentre and asked them to send someone in to assist, and we are waiting on acknowledgement from them on this

We will update everyone as soon as we have more information

Angus
02-12-2007, 08:24 AM
Hi

The Data Centre were unable to send someone in, and I am not long off the phone organising for Richard to go in. As soon as they have established the problem and are working on correcting it, we will update you

The reason for the delay is we wanted someone from the DC to look at it, as it was either him or Jon who set the server up originally (12 months ago) and who would therefore have the best idea of what was causing the problem and be able to fix it

We only had a response from them about half an hour ago (7.30am Brisbane time) to say they were unable to get to the DC

As a result I organised someone else we use, who is making their way there now

As Steve mentioned in the last post the problem is the main drive will not properly read the backup drive and recognise it is there. This in turn means we have been unable to restore the sites from Friday's backup

Once this problem is corrected, which I would expect to take no longer than 60 - 90 minutes from now (and I hope considerably less), we can start restoring the sites

This is normally done on an alphabetical basis, but I have asked Steve to see if he can start by restoring the sites of anyone who posts here and requests that their account be given priority. He is going to try to do this as long as it will not significantly increase the overall restore time

In relation to the server as a whole, we have been seeing higher than expected loads over the last 2 weeks and have been investigating this for a while
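
For anyone curious how that ordering could work, here is a minimal sketch in Python. It is only an illustration and not the script actually used on the server: the function name `restore_order` and the sample account names are made up for the example. It puts accounts that asked for priority first and keeps everything else alphabetical.

```python
def restore_order(accounts, priority_requests):
    """Return account names in restore order.

    Accounts named in priority_requests come first; within each group the
    order is alphabetical. Names are compared case-insensitively.
    (Hypothetical helper, for illustration only.)
    """
    requested = {name.lower() for name in priority_requests}
    # The sort key is a tuple: False sorts before True, so requested
    # accounts (not in requested == False) land at the front.
    return sorted(
        accounts,
        key=lambda name: (name.lower() not in requested, name.lower()),
    )


if __name__ == "__main__":
    accounts = ["delta", "alpha", "charlie", "bravo"]
    print(restore_order(accounts, ["charlie"]))
    # prints ['charlie', 'alpha', 'bravo', 'delta']
```

Because the queue is just a re-sorted list, giving some accounts priority changes the order of the restores and not the number of them.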

We initially suspected that one particular account was a major contributor and as a result moved it to another server during the week but this didn't make as big a difference as we expected

Last night, while examining the server again, Steve found some corruption of the core CentOS operating system files and we were in the process of looking at what would be required to repair them when the server failed altogether

Our first response was to insert the CentOS disk and try to run the repair feature, in the same way you would try to repair Windows on a PC. This did not prove possible, and as a result we had to wipe the main drive and initiate the re-installation of CentOS and in turn cPanel

To be honest I believe we would have been looking at having to conduct a re-install of everything anyway in order to fix this problem

The difference here is that we would have hoped to keep the server up long enough to give everyone notice of what was to be done and to pick the time and announce it in advance. As it was the server made the decision for us

If it had not been for this backup drive issue all the sites would have been back up already

I completely understand no downtime is what is preferred, but when I think about it, if we had been able to keep the server going in order to provide notice of the re-install (e.g. scheduling it for Monday night), there is the chance it would have failed anyway, and that could have been on Monday morning, causing far greater issues

I would also point out the server was running CentOS version 4 (the version which was current when the server was set up) and we have taken the opportunity to install the latest version, CentOS 5

Richard is on his way to the Data Centre now and as soon as he has had a look at things we will be updating you further

Once again, my apologies for the inconvenience. As the server was running RAID, we have kept the original primary drive aside so we can look at it and try to establish what caused the original issue

Regards
Angus

Snowman
02-12-2007, 10:01 AM
The server is back online and the backup drive has mounted correctly

We are starting the restore of sites from the Friday backup now

Snowman
02-12-2007, 11:33 AM
We have found that some of the cPanel backup files are corrupt (not client site files, which are all intact), so we would like everyone who has a dedicated IP for any of their sites to drop a ticket into the helpdesk with the details. This will help speed up the process for us
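
For those who keep their own copies of backups, the following is a rough sketch in Python of one way to check whether an archive is readable from start to finish. It assumes the backup is a gzip-compressed tar archive, and the function name `archive_is_readable` is invented for the example. It is not the check we ran on the server.

```python
import tarfile
import zlib


def archive_is_readable(path):
    """Return True if a .tar.gz archive can be read to the end.

    Every regular file in the archive is read in chunks, so a damaged or
    truncated archive raises an error, which is reported as False.
    (Hypothetical helper, assuming a gzip-compressed tar archive.)
    """
    try:
        with tarfile.open(path, "r:gz") as archive:
            for member in archive:
                if member.isfile():
                    extracted = archive.extractfile(member)
                    # Read and discard the data in 1 MiB chunks.
                    while extracted.read(1024 * 1024):
                        pass
        return True
    except (tarfile.TarError, OSError, EOFError, zlib.error):
        return False
```

A result of True only means the archive could be unpacked without read errors; it says nothing about whether the contents are complete or up to date.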

Snowman
02-12-2007, 03:44 PM
We are about 1/4 of the way through restoring sites to the server

The process is slow as we can't do more than one restore at a time without locking up the server

Angus
02-12-2007, 06:34 PM
We were having problems with the configuration of MailScanner (which filters the mail for spam etc.) and this meant no mail was getting through while it was being worked on, so instead of delaying this any longer we have temporarily turned it off in order to allow the mail to flow

The downside is that spam will go through as well, which means a lot more mail for the mail server to process, but it is a bit of a catch 22 and I think this is the preferable option at the moment. We have asked ConfigServer in the UK, who provide our MailScanner system, to have a look at it

Angus

Angus
02-12-2007, 07:23 PM
The restoration of the sites is continuing, with priority given to the sites of those who have submitted tickets, and those are almost complete

Steve is also in discussion with someone about the possibility of reinstalling the original drive in order to sync data from it to the new drive

If this is found to be possible we will be able to restore individual sites exactly as they were when the server failed

I don't guarantee this will be possible but we will keep you up to date

Angus

Snowman
02-12-2007, 10:51 PM
We are about 3/4 of the way through the restoration of sites now and we have managed to speed the restore up slightly. I would estimate another 2 hours before all sites are back online.

We will then set about moving resellers' sites onto their shared IPs, and once that's done, tomorrow morning we will restore dedicated IP sites and SSLs