Post Mortem Analysis of Mail Server Crash 10/26/11
At 10:38AM on 10/26/11, the AtNex mail server crashed. We traced the crash to a corrupted hard drive, most likely caused by the drive itself failing. We then found that the backup was also corrupt, so we began data recovery on the drive. The recovery worked well enough for the server to be operational again on the afternoon of 10/27. Through the night of the 27th and the morning of the 28th, we tried repeatedly to take a full system backup. Each attempt failed because of corruption on the drive. However, a backup of the mail store succeeded at around 6:30PM on 10/27.
Then the server crashed again at 6AM Friday 10/28.
We then pursued two recovery paths in parallel.
Path 1 - preferred because we'd get all the mail - Recover the data and get a full system backup from the old server. Unfortunately, as of noon on Sunday 10/30, the recovery software is still running.
Path 2 - New server with the mailstore backup. This, too, is a slow process with all the loading and restoring. But, ultimately, this path won the race. The server is online as of 11:40, Sunday, 10/30/11.
What happened to the mail?
The answer depends on when it was sent and/or received here:
Mail received before 10:38AM 10/26 - should all be on the new server.
Mail sent between 10:38AM 10/26 and the afternoon of 10/27, when the server came back up - the outcome depends on the sender's server configuration. Most servers are configured to retry mail for 3 days. If the sender's server was configured to retry for more than 30 hours, the mail probably came through to the old server sometime in the afternoon or evening of the 27th (but see the next case). If the sending server was configured to time out in less than 30 hours, the sender should have received a non-delivery report (NDR).
Mail received by the old server after the mailstore backup (around 6:30PM 10/27) but before 6AM 10/28 - these messages reside only on the old server. The sender believes they were delivered. If and when the old server is recovered, we will make it available so you can see that mail.
Mail sent after 6AM 10/28 - will either show up on the new server or be rejected, depending on the sending server's configuration and retry period.
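For anyone who wants to check a specific message against the cases above, here is a rough sketch in Python. It is an illustration only, not a tool we run. The function name and the 72-hour default are hypothetical. The exact time the server came back on 10/27 is an assumption derived from the 30-hour figure above. The case for mail arriving between the restart and the mailstore backup is our inference and is not stated in the list.

```python
from datetime import datetime, timedelta

# Times taken from the notice above (local time, 2011).
FIRST_CRASH = datetime(2011, 10, 26, 10, 38)
MAILSTORE_BACKUP = datetime(2011, 10, 27, 18, 30)  # "around 6:30PM"
SECOND_CRASH = datetime(2011, 10, 28, 6, 0)

# Assumption: the notice only says "afternoon of 10/27"; the 30-hour
# figure it uses would put the restart at roughly 4:38PM.
OUTAGE = timedelta(hours=30)
SERVER_BACK = FIRST_CRASH + OUTAGE


def disposition(sent_at, sender_retry_hours=72):
    """Rough guess at where a message ended up (hypothetical helper)."""
    if sent_at < FIRST_CRASH:
        return "on new server"
    if sent_at < SERVER_BACK:
        # Sent during the first outage. Like the notice, this compares the
        # sender's retry window against the full 30-hour outage, which is
        # a simplification for mail sent late in the outage.
        if timedelta(hours=sender_retry_hours) > OUTAGE:
            return "probably delivered to old server on 10/27"
        return "sender should have received an NDR"
    if sent_at < MAILSTORE_BACKUP:
        # Inference, not stated in the notice: mail that arrived before
        # the mailstore backup should have been captured by it.
        return "on new server (inferred)"
    if sent_at < SECOND_CRASH:
        return "on old server only"
    return "on new server or rejected, depending on sender retry"


# Example: a message sent at 8PM on 10/27, after the mailstore backup.
print(disposition(datetime(2011, 10, 27, 20, 0)))  # on old server only
```

Mail in the "probably delivered to old server" case may still fall into the old-server-only window if the sender's retry landed after the 6:30PM backup.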
We will update this page once we have a final disposition on the old server.
Also, the new server needs some housekeeping that will cause short outages over the next few days.
At Atlantic Nexus, we care about our customers. We know that this had a significant impact on your lives. We fixed it as fast as possible under the circumstances. We will be implementing other steps to minimize future issues with the server. Please accept our apologies.