JaguarCNL Calendar
Availability Calendar
Current Events
- Since hardware maintenance was completed during the outage last night, the scheduled downtime for today has been cancelled.
Updated: 2008-04-29 08:30:00
Outage Details
Downtime for April 2007
| Start | End | Comments |
|---|---|---|
Downtime for May 2007
| Start | End | Comments |
|---|---|---|
Downtime for June 2007
| Start | End | Comments |
|---|---|---|
Downtime for July 2007
| Start | End | Comments |
|---|---|---|
Downtime for August 2007
| Start | End | Comments |
|---|---|---|
| 19 Aug 10:32 | 19 Aug 13:42 | System rebooted after an OSS panic caused Lustre to become unresponsive. Jobs running at the time of the outage were killed; those in the queue (but not yet running) were not affected. |
| 20 Aug 23:40 | 21 Aug 02:30 | Many nodes reported "out of memory". System was rebooted to bring these back into the compute pool. Jobs running at the time of the outage were killed; those in the queue (but not yet running) were not affected. |
| 21 Aug 16:44 | 21 Aug 18:57 | System rebooted to enable a new version of ALPS. During the outage, a module was replaced. Jobs running at the time of the outage were killed; those in the queue (but not yet running) were not affected. |
| 24 Aug 18:44 | 24 Aug 20:50 | System rebooted due to a Lustre panic. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 25 Aug 12:57 | 25 Aug 14:06 | System rebooted. Jobs running prior to the outage were killed; jobs in the queue (but not yet running) were not affected. |
Downtime for September 2007
| Start | End | Comments |
|---|---|---|
| 02 Sep 11:12 | 02 Sep 12:08 | System rebooted after report of jobs hanging. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 04 Sep 13:00 | 04 Sep 15:00 | Downtime to install kernel patch. During the outage, a VRTY was replaced and diagnostic testing was performed. |
| 05 Sep 16:29 | 05 Sep 17:33 | System rebooted due to Lustre problems and slow response time. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 06 Sep 15:05 | 06 Sep 15:44 | System rebooted after Lustre and several OSSs stopped responding. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 07 Sep 08:46 | 07 Sep 09:12 | System crashed after portals problems caused Lustre to fail. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 07 Sep 17:43 | 07 Sep 18:06 | System crashed and was rebooted. Jobs running at the time of the crash were killed; jobs in the queue (but not yet running) were not affected. |
| 08 Sep 05:00 | 08 Sep 15:19 | System unavailable while NFS mounted directories were moved to a new server. During the outage, additional system testing was performed and as a result the system was rebooted. |
| 12 Sep 13:00 | 12 Sep 16:07 | System taken down for dedicated application testing. |
| 13 Sep 11:54 | 13 Sep 17:37 | System crashed due to a failed hardware link. During the outage, maintenance was performed. Failed VRTYs and processors were replaced. In addition, system patches were applied. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 13 Sep 20:54 | 13 Sep 21:51 | System crashed due to failed hardware link. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 17 Sep 21:20 | 17 Sep 21:52 | Portals errors caused system performance to degrade. System was rebooted to clear the errors. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 18 Sep 08:49 | 18 Sep 09:13 | System crashed due to a problem with Global Arrays. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 21 Sep 15:05 | 21 Sep 17:05 | System rebooted after PBS node became unresponsive. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 22 Sep 11:37 | 22 Sep 13:43 | System rebooted after a module powered off and would not power back on. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
Downtime for October 2007
| Start | End | Comments |
|---|---|---|
| 02 Oct 08:00 | 02 Oct 14:25 | Installed patches and performed hardware maintenance. |
| 04 Oct 08:00 | 10 Oct 08:00 | System unavailable while a new Lustre filesystem was built. Additionally, several dedicated runs were performed. |
| 22 Oct 08:30 | 22 Oct 11:52 | System rebooted after many nodes became unresponsive. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 26 Oct 08:00 | 26 Oct 18:30 | OS upgraded to UNICOS/lc 2.0.26. |
| 30 Oct 08:00 | 30 Oct 12:00 | Scheduled maintenance |
Downtime for November 2007
| Start | End | Comments |
|---|---|---|
| 01 Nov 08:00 | 01 Nov 15:00 | Moved 8 cabinets from jaguar to jaguarcnl. Jaguarcnl now has 40 cabinets. |
| 01 Nov 17:16 | 01 Nov 21:46 | System rebooted due to portals problems. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 02 Nov 07:39 | 02 Nov 17:03 | System crashed. Jobs running at the time of the outage were killed; jobs in the queue (but not yet running) were not affected. |
| 14 Nov 08:00 | 14 Nov 15:16 | Hardware maintenance followed by system testing. |
| 20 Nov 08:00 | 20 Nov 12:00 | System maintenance |
| 20 Nov 12:00 | 20 Nov 22:07 | System reboot following the downtime failed due to several hardware problems. These problems were corrected and the system was rebooted with one node disabled. |
| 27 Nov 22:15 | 27 Nov 23:25 | System rebooted after one of the OSS nodes crashed. |
| 29 Nov 11:37 | 29 Nov 17:55 | System rebooted due to a hardware link failure. During the outage, a hardware module and a DDN controller were replaced. |
| 29 Nov 19:55 | 30 Nov 04:12 | System rebooted due to problems with the Lustre filesystem. |
Downtime for December 2007
| Start | End | Comments |
|---|---|---|
| 06 Dec 08:00 | 06 Dec 15:00 | OS upgraded to UNICOS/lc 2.0.33 |
| 07 Dec 16:07 | 07 Dec 20:15 | System rebooted due to problems with a hardware module. |
| 12 Dec 07:30 | 12 Dec 09:47 | System down due to maintenance on the site chilled water system. Installed a patch during the outage. |
| 30 Dec 16:40 | 30 Dec 17:16 | System performance had been degrading (node panics, login node problems, etc.). A debug patch was removed and the system was rebooted to clear these problems. |
Downtime for January 2008
| Start | End | Comments |
|---|---|---|
| 04 Jan 08:17 | 04 Jan 10:35 | System rebooted due to a failed VRTY. |
| 15 Jan 05:45 | 15 Jan 10:45 | System rebooted. |
| 15 Jan 23:08 | 15 Jan 23:31 | System rebooted due to failed hardware link. |
| 18 Jan 15:52 | 18 Jan 16:25 | Several compute nodes were marked 'up' but were causing jobs to hang. The system was rebooted to clear the problems on these nodes. |
| 22 Jan 08:00 | 22 Jan 11:30 | Replaced a mezzanine card and relocated several hardware modules during scheduled maintenance. |
| 24 Jan 20:00 | 24 Jan 21:09 | System rebooted. During the outage, a new portals patch was installed. |
| 28 Jan 00:09 | 28 Jan 00:45 | System rebooted after a module powered off. |
| 28 Jan 01:51 | 28 Jan 03:09 | System rebooted after a module powered off. |
| 29 Jan 07:20 | 29 Jan 11:02 | During maintenance, replaced a mezzanine card and two DIMMs. Additionally, replaced three VRTYs that caused modules to power off earlier in the week. |
| 29 Jan 20:11 | 29 Jan 21:23 | System rebooted after hardware link failure. |
| 31 Jan 10:02 | 31 Jan 11:38 | System was unavailable. During the outage, a portals patch was installed. |
Downtime for February 2008
| Start | End | Comments |
|---|---|---|
| 01 Feb 16:32 | 01 Feb 17:45 | System rebooted after a panic. |
| 05 Feb 07:30 | 05 Feb 12:24 | Maintenance to repair failed hardware links |
| 12 Feb 09:39 | 12 Feb 10:51 | System rebooted after an OSS panic. |
| 19 Feb 08:00 | 19 Feb 12:00 | System testing |
| 19 Feb 20:39 | 19 Feb 21:51 | System rebooted to clear problems with one of the OSS nodes. |
| 26 Feb 08:00 | 26 Feb 09:00 | System maintenance |
Downtime for March 2008
| Start | End | Comments |
|---|---|---|
| 11 Mar 08:00 | 11 Mar 11:43 | System maintenance. |
| 14 Mar 16:00 | 17 Mar 05:47 | System unavailable due to site power outage. |
| 18 Mar 07:30 | 18 Mar 14:01 | System unavailable |
| 28 Mar 11:20 | 28 Mar 13:49 | System rebooted due to HSN hang. |
Downtime for April 2008
| Start | End | Comments |
|---|---|---|
| 01 Apr 08:00 | 01 Apr 11:23 | Replaced a hardware module during system maintenance. |
| 17 Apr 18:32 | 17 Apr 19:17 | System rebooted to repair a failed hardware link. |
| 19 Apr 10:14 | 19 Apr 11:55 | Several nodes powered off. They were disabled and the system was rebooted. |
| 25 Apr 22:08 | 25 Apr 23:58 | System rebooted due to problems on an OSS node. |
| 28 Apr 16:00 | 28 Apr 18:19 | System crashed due to a DDN failure. During the outage, other hardware was replaced so the maintenance scheduled for 29 April is cancelled. |
| 29 Apr 18:30 | 29 Apr 22:30 | System down due to a failure on a hardware module. The module was disabled and the system was rebooted. |
Downtime for May 2008
| Start | End | Comments |
|---|---|---|
| 01 May 04:25 | 01 May 05:23 | System was rebooted due to an error on one of the hardware modules. |
Downtime for June 2008
| Start | End | Comments |
|---|---|---|