Outage Details
Downtime for April 2007
| Start | End | Comments |
|---|---|---|
| 04 Apr 08:00 | 04 Apr 12:30 | Replaced power supplies on two modules. Upgraded MOAB to version 5.1.0p1. |
| 09 Apr 00:50 | 09 Apr 02:30 | System down. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 09 Apr 09:16 | 09 Apr 11:32 | System panic due to memory problem. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 09 Apr 16:44 | 09 Apr 20:08 | System down. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 09 Apr 22:05 | 10 Apr 02:22 | System panic due to memory problem. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 10 Apr 07:30 | 10 Apr 09:52 | System down due to memory problem. Identified and replaced problem component. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 12 Apr 00:03 | 12 Apr 10:53 | System down. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 13 Apr 10:02 | 13 Apr 16:58 | System down due to pump failure. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 24 Apr 16:48 | 24 Apr 18:36 | System rebooted after a panic. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 25 Apr 08:00 | 25 Apr 13:32 | During maintenance, the OS was upgraded to UNICOS/mp 3.1.30 and several DIMMs were replaced. |
Downtime for May 2007
| Start | End | Comments |
|---|---|---|
| 02 May 22:12 | 03 May 03:30 | System down due to I/O channel error. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 09 May 08:20 | 09 May 09:00 | System upgraded to UNICOS/mp 3.1.31. |
| 14 May 13:40 | 14 May 15:57 | Power fluctuation caused the system to go down. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 14 May 23:08 | 15 May 00:48 | System rebooted after hardware failure caused a panic. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 19 May 13:47 | 19 May 16:05 | System rebooted after hardware failure caused a panic. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 23 May 08:15 | 23 May 12:30 | During maintenance, four fan speed controllers were replaced. |
| 31 May 10:55 | 31 May 12:15 | System panic due to scalar TLB miss. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
Downtime for June 2007
| Start | End | Comments |
|---|---|---|
| 03 Jun 12:00 | 03 Jun 15:35 | Phoenix lost its connection to /spin. New logins were hanging and commands that accessed /spin were hanging as well. The system was rebooted and returned to service. Jobs running at the time of the reboot were killed; jobs in the queue (but not yet running) were not affected. |
| 06 Jun 08:04 | 06 Jun 12:00 | Scheduled maintenance. |
| 06 Jun 21:32 | 07 Jun 10:50 | A job failed due to a CRPE and this caused PBS to hang. The system was rebooted and returned to production use. |
| 13 Jun 21:10 | 14 Jun 02:30 | System crashed due to site power outage. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 18 Jun 20:10 | 19 Jun 09:37 | The system hung, requiring a reboot. |
| 20 Jun 08:00 | 20 Jun 11:38 | Scheduled maintenance. Replaced a power supply. |
| 27 Jun 08:06 | 27 Jun 12:05 | Replaced a pump during scheduled maintenance. |
| 28 Jun 22:00 | 29 Jun 10:30 | System stopped processing jobs and was rebooted. Jobs in a run state at the time of the outage were killed. Jobs in the queue (but not yet running) were not affected. |
Downtime for July 2007
| Start | End | Comments |
|---|---|---|
| 03 Jul 01:43 | 03 Jul 09:39 | System stopped running jobs and was rebooted. During the outage, one power supply was replaced. |
| 10 Jul 11:10 | 10 Jul 12:15 | System interaction was degraded, leading to a system reboot. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 10 Jul 14:04 | 10 Jul 14:51 | System interaction was degraded, leading to a system reboot. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 18 Jul 14:30 | 18 Jul 15:40 | System was rebooted to clear a job that was hung in an exiting state. Scheduling was stopped and jobs that were running were allowed to complete prior to the reboot. |
| 24 Jul 00:16 | 24 Jul 02:13 | System crashed due to CRPE. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 24 Jul 11:46 | 24 Jul 13:08 | System became unresponsive and was rebooted. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 28 Jul 03:01 | 28 Jul 04:05 | System rebooted after a CRPE caused a system panic. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
Downtime for August 2007
| Start | End | Comments |
|---|---|---|
| 08 Aug 08:00 | 08 Aug 12:00 | Maintenance |
| 15 Aug 08:10 | 15 Aug 11:10 | Scheduled maintenance |
| 22 Aug 08:00 | 22 Aug 11:40 | Scheduled maintenance. A power supply was replaced and maintenance was performed on the cooling system in one of the cabinets. |
| 22 Aug 23:57 | 23 Aug 00:49 | System crashed due to Kernel Mode Processor Parity Error. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 29 Aug 08:00 | 29 Aug 11:38 | During scheduled maintenance, replaced two power supplies and performed maintenance on one of the cabinets. |
| 29 Aug 08:00 | 29 Aug 12:00 | Scheduled maintenance |
Downtime for September 2007
| Start | End | Comments |
|---|---|---|
| 05 Sep 08:00 | 05 Sep 12:00 | System Maintenance |
| 05 Sep 08:00 | 05 Sep 13:08 | Performed maintenance on one of the cabinets and replaced a power supply. |
| 08 Sep 05:00 | 08 Sep 15:30 | System was unavailable while NFS mounted directories were moved to a new server. |
| 10 Sep 02:00 | 10 Sep 03:47 | System became unresponsive and was rebooted. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 12 Sep 08:00 | 12 Sep 11:50 | Scheduled maintenance. |
| 25 Sep 12:40 | 25 Sep 15:04 | System became unresponsive and was rebooted. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 26 Sep 08:00 | 26 Sep 11:40 | System maintenance. During maintenance, the default Programming Environment was changed to 5.6.0.3. NOTE: On the cross-compilers, the Programming Environment module is named 'PrgEnv-x1', while on phoenix it is still named 'PrgEnv'. |
Downtime for October 2007
| Start | End | Comments |
|---|---|---|
| 03 Oct 08:00 | 03 Oct 11:45 | Hardware maintenance. |
| 04 Oct 12:35 | 04 Oct 14:30 | System crashed due to a bad memory controller on one of the modules. The module was replaced and the system returned to service. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 10 Oct 08:00 | 10 Oct 12:16 | Hardware Maintenance |
| 17 Oct 08:00 | 17 Oct 12:40 | Replaced a memory module and performed maintenance on one of the cabinets. |
| 20 Oct 20:33 | 20 Oct 21:27 | System panic due to a processor parity error. System was rebooted and returned to service. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 24 Oct 08:00 | 24 Oct 12:30 | Performed hardware maintenance and installed patches. The new cross-compiler (robin1) was made the default and the DNS for robin.ccs.ornl.gov was changed to point to robin1. |
| 28 Oct 02:30 | 28 Oct 03:48 | System stopped responding after /scratch/scr101 directory filled to 100%. The system was rebooted and a sweep was run on /scratch/scr101. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 29 Oct 00:00 | 29 Oct 00:00 | Due to low utilization, the debug reservation will no longer be held. |
| 31 Oct 08:10 | 31 Oct 12:45 | Hardware maintenance |
Downtime for November 2007
| Start | End | Comments |
|---|---|---|
| 07 Nov 08:00 | 07 Nov 11:45 | Hardware maintenance |
| 08 Nov 00:00 | 08 Nov 01:43 | System lost connectivity with the NFS server and was rebooted. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 11 Nov 11:30 | 11 Nov 14:54 | System crashed due to a processor parity error on the boot node. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 13 Nov 12:02 | 13 Nov 12:55 | Neither robin nor the CPES could mount scratch filesystems from phoenix. Phoenix was rebooted and the problem cleared. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 14 Nov 09:30 | 14 Nov 11:21 | System crashed due to site power interruption. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 14 Nov 21:00 | 14 Nov 21:41 | System crashed due to a Kernel Mode Processor Parity Error. |
| 22 Nov 12:40 | 22 Nov 18:38 | System rebooted to clear problems communicating with one of the NFS servers. |
| 23 Nov 10:33 | 23 Nov 11:51 | System rebooted after a panic. After rebooting, jobs that were in a Deferred state were released. |
| 23 Nov 21:21 | 24 Nov 01:03 | System rebooted due to Kernel Mode Processor Parity error. |
| 26 Nov 20:00 | 26 Nov 21:28 | Neither robin nor the CPES could mount scratch filesystems from phoenix. Phoenix was rebooted and the problem cleared. No jobs were running at the time of the outage. Jobs waiting to run were not affected. |
| 27 Nov 14:48 | 27 Nov 16:21 | System panic/crash due to a bad hardware module. The module was disabled and the system returned to service. |
| 28 Nov 08:00 | 28 Nov 12:05 | During system maintenance, replaced a DIMM and a hardware module. The system is once again running with all processors available. |
| 30 Nov 04:05 | 30 Nov 05:34 | System rebooted due to CRPE hardware errors. The cause of the errors is under investigation. |
Downtime for December 2007
| Start | End | Comments |
|---|---|---|
| 05 Dec 08:00 | 05 Dec 12:15 | Replaced three hardware modules during scheduled maintenance. |
| 08 Dec 11:00 | 08 Dec 23:00 | Phoenix will be unavailable for login and batch processing during this time. |
| 08 Dec 22:38 | 09 Dec 00:30 | System rebooted due to a system panic. |
| 12 Dec 07:53 | 12 Dec 10:31 | During system maintenance, the OS was upgraded to UNICOS/mp 3.1.42. |
| 13 Dec 03:31 | 13 Dec 04:50 | System rebooted after a panic. |
| 13 Dec 10:09 | 13 Dec 11:33 | System rebooted after a panic. |
| 17 Dec 13:16 | 17 Dec 16:15 | System crashed due to an error on a hardware module. This module was replaced and the system was returned to service. |
| 22 Dec 13:20 | 22 Dec 14:15 | System rebooted to clear problems accessing scratch directories on the CPES and robin1. |
| 23 Dec 11:40 | 23 Dec 13:10 | System rebooted to clear issues accessing some filesystems. |
| 22 Dec 14:15 | 23 Dec 15:30 | System was unavailable for login and batch processing during this time. |
| 23 Dec 15:30 | 23 Dec 16:30 | System rebooted and returned to general availability. |
| 26 Dec 08:00 | 26 Dec 12:09 | Replaced a pump during scheduled maintenance. |
Downtime for January 2008
| Start | End | Comments |
|---|---|---|
| 02 Jan 15:20 | 02 Jan 18:30 | Problems on several processors were preventing large jobs from starting. The system was rebooted to clear these problems. |
| 09 Jan 10:28 | 09 Jan 12:42 | System crashed due to hardware error. |
| 11 Jan 09:40 | 11 Jan 12:00 | System rebooted after a panic. |
| 16 Jan 08:00 | 16 Jan 14:30 | System maintenance. |
| 23 Jan 08:00 | 23 Jan 11:20 | Updated firmware on disk controllers during scheduled maintenance. |
| 24 Jan 07:25 | 24 Jan 17:00 | Phoenix crashed due to a site power problem. |
| 25 Jan 12:30 | 25 Jan 16:55 | System shut down due to work on the site power system. |
| 28 Jan 21:13 | 28 Jan 22:44 | System rebooted after a panic. |
| 30 Jan 08:20 | 30 Jan 12:35 | Updated firmware on a disk controller and replaced a power supply during scheduled maintenance. |
Downtime for February 2008
| Start | End | Comments |
|---|---|---|
| 03 Feb 06:41 | 03 Feb 08:06 | System rebooted after a panic. |
| 04 Feb 02:51 | 04 Feb 04:29 | System rebooted after a panic. |
| 06 Feb 08:15 | 06 Feb 12:03 | Updated disk controller firmware during scheduled maintenance. |
| 13 Feb 08:00 | 13 Feb 11:30 | Updated disk firmware during scheduled maintenance. |
| 14 Feb 13:30 | 14 Feb 14:05 | System rebooted to clear problems with NFS. |
| 15 Feb 10:15 | 15 Feb 11:00 | System rebooted to clear problems with NFS exports. |
| 20 Feb 08:00 | 20 Feb 10:00 | Replaced a power converter during scheduled maintenance. |
| 23 Feb 05:30 | 23 Feb 07:30 | System rebooted due to a software panic. |
| 27 Feb 08:00 | 27 Feb 11:15 | System maintenance |
Downtime for March 2008
| Start | End | Comments |
|---|---|---|
| 05 Mar 08:00 | 05 Mar 12:00 | The phoenix cross-compiler (robin) will be unavailable due to maintenance. Phoenix will remain in production. |
| 12 Mar 08:00 | 12 Mar 14:10 | During maintenance, replaced a memory module and upgraded disk firmware. |
| 14 Mar 16:08 | 16 Mar 19:58 | System unavailable due to site power outage. |
| 19 Mar 08:00 | 19 Mar 15:10 | System maintenance. |
| 26 Mar 08:00 | 26 Mar 11:15 | Replaced a pump during maintenance. |
Downtime for April 2008
| Start | End | Comments |
|---|---|---|
| 02 Apr 08:00 | 02 Apr 11:50 | During scheduled maintenance, repaired the network connection between phoenix and the CPES and upgraded the OS to UNICOS/mp 3.1.46. |
| 09 Apr 08:00 | 09 Apr 11:45 | System maintenance |
| 10 Apr 14:30 | 10 Apr 17:10 | System became unresponsive. A DIMM was replaced and the system was returned to service. |
| 11 Apr 12:40 | 11 Apr 13:50 | System rebooted after a panic. |
| 17 Apr 14:22 | 17 Apr 17:56 | System rebooted after a panic. |
| 23 Apr 08:00 | 23 Apr 11:55 | System maintenance |
| 25 Apr 11:47 | 25 Apr 15:45 | System crashed due to hardware failure. |
| 28 Apr 16:00 | 28 Apr 22:30 | Phoenix and robin were not available for general use. Jobs running at the time of the outage were killed and rerun after the outage. If you had jobs running at the time, please check your output to verify that they finished without errors. |
| 30 Apr 08:00 | 30 Apr 11:35 | System maintenance |
Downtime for May 2008
| Start | End | Comments |
|---|---|---|
| 05 May 21:00 | 05 May 21:45 | System crashed due to Kernel Mode Processor Parity Error. |
| 07 May 08:00 | 07 May 13:50 | Scheduled Maintenance. |
| 14 May 08:00 | 14 May 12:00 | Scheduled Maintenance. |
Downtime for June 2008
| Start | End | Comments |
|---|---|---|