Scheduling Policy

General Information

In a simple batch queue system, jobs run in first-in, first-out (FIFO) order. This often does not make effective use of the system. Suppose a large job is next in line to run: under a strict FIFO queue, many processors may sit idle while the large job waits for enough of them to become free. Backfilling allows smaller, shorter jobs to use those otherwise idle resources, and with the proper algorithm, the start time of the large job is not delayed. While this makes more effective use of the system, it also encourages users to submit smaller jobs, because smaller jobs see quicker turnaround.
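The core test behind backfilling can be sketched as follows. This is a minimal Python illustration of the common "EASY" style of backfilling, in which only the job at the head of the queue holds a reservation. It is an assumption for illustration, not necessarily the algorithm the batch system on these machines uses; the `Job` class, the parameter names, and the premise that the scheduler already knows the large job's reserved start time are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    msps: int           # processors (MSPs) requested
    walltime_h: float   # requested wall time, in hours


def can_backfill(candidate: Job, idle_msps: int, now_h: float,
                 reserved_start_h: float, spare_msps_at_reservation: int) -> bool:
    """Return True if `candidate` can start now without delaying the reserved large job.

    idle_msps: processors free right now.
    reserved_start_h: earliest time the large job at the head of the queue can start.
    spare_msps_at_reservation: processors still free once the large job starts.
    """
    if candidate.msps > idle_msps:
        return False  # does not fit in the currently idle processors
    # Safe if it is guaranteed to finish before the large job's reserved start...
    finishes_before_reservation = now_h + candidate.walltime_h <= reserved_start_h
    # ...or if it only uses processors the large job will not need.
    fits_beside_large_job = candidate.msps <= spare_msps_at_reservation
    return finishes_before_reservation or fits_beside_large_job


# Hypothetical situation: 64 MSPs idle, large job reserved to start in 3 hours,
# and no processors left over once it starts.
print(can_backfill(Job("short", 32, 2.0), 64, 0.0, 3.0, 0))  # True
print(can_backfill(Job("long", 32, 6.0), 64, 0.0, 3.0, 0))   # False
```

The check is deliberately simplified: a real scheduler would also subtract each backfilled job from the spare processors before testing the next candidate.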

Basic Queue Policies

The NCCS has implemented the following queue policies to enable large jobs to run in a timely fashion on Leadership Computing Facility (LCF) systems.

The basic priority factor for jobs remains the time a given job has been waiting relative to other jobs in the queue. However, the batch system applies several factors that modify the apparent time a job has been waiting. These factors include the size of the job and the queue to which it was submitted (both described below), as well as the percentage of its allocation that the project under which the job runs has used (see the next section). If you have any questions or comments on the queue policies below, please direct them to help@nccs.gov.

  • Jobs are aged 8 minutes per MSP requested. Thus, if a 100-MSP job is submitted at the same time as a 400-MSP job, the system will consider the 400-MSP job 40 hours older for priority purposes (300 additional MSPs × 8 minutes = 2,400 minutes). Accordingly, the 400-MSP job can be submitted up to 40 hours after the 100-MSP job and still be considered older for priority purposes.

  • To improve debug job turnaround time, Phoenix reserves 128 MSPs for the debug queue from noon until 8:00 p.m. (Eastern time) each workday. At other times the system does not set aside processors for debug-only work. Debug jobs have a maximum wall time of 1 hour. Users are limited to one job in the debug queue at any time. NOTE: Currently, the debug reservation is held only on Mondays and Fridays.

  • The debug queue is for debugging only. Production jobs are not allowed in the debug queue.

  • Non-debug jobs of fewer than 32 MSPs have a maximum wall time of 4 hours.

  • Non-debug jobs using 32 to 255 MSPs have a maximum wall time of 12 hours.

  • Non-debug jobs using 256 or more MSPs have a maximum wall time of 24 hours.

  • The batch queue system places a limit of two jobs in the “queued” (i.e., eligible-to-run) state per user. If a user submits more than two jobs, the additional jobs will enter a “held” state. Once one of the user’s queued jobs begins execution, one of the held jobs will be moved into the queued state. Note that this is not a limit on the number of jobs that a user may have running simultaneously. Instead, it limits the number of jobs that are eligible to start running.

  • The maximum wall time for any queue is 24 hours.
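The size-based aging, wall-time, and queued-job rules in the list above can be sketched as a few small functions. This is a minimal Python sketch for illustration only, not the batch system's actual implementation. The function names, the `"batch"` default queue name, and the choice to release held jobs in submission order are assumptions (the policy says only that "one of the held jobs" is moved).

```python
# Minimal sketch of the size-based priority and limit rules (illustrative only).

AGING_MINUTES_PER_MSP = 8   # policy: jobs are aged 8 minutes per MSP requested
MAX_QUEUED_PER_USER = 2     # policy: at most two jobs per user in the "queued" state


def effective_age_minutes(wait_minutes: float, msps: int) -> float:
    """Apparent wait time used for priority: actual wait plus the size-based aging."""
    return wait_minutes + AGING_MINUTES_PER_MSP * msps


def max_walltime_hours(msps: int, queue: str = "batch") -> int:
    """Maximum wall time for a job of `msps` MSPs; 'batch' is a hypothetical queue name."""
    if queue == "debug":
        return 1            # debug jobs: 1 hour
    if msps < 32:
        return 4            # fewer than 32 MSPs
    if msps < 256:
        return 12           # 32 to 255 MSPs
    return 24               # 256 or more MSPs (also the overall maximum)


def split_queued_held(pending_jobs: list[str]) -> tuple[list[str], list[str]]:
    """Split a user's not-yet-running jobs into eligible ("queued") and "held".

    Assumes jobs are listed in submission order and released in that order.
    """
    return pending_jobs[:MAX_QUEUED_PER_USER], pending_jobs[MAX_QUEUED_PER_USER:]


# Worked example from the aging rule: 400-MSP vs 100-MSP job submitted together.
gap_hours = (effective_age_minutes(0, 400) - effective_age_minutes(0, 100)) / 60
print(gap_hours)  # 40.0
```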

Allocation Overuse Policies

Projects that overrun their allocation are still allowed to run on LCF systems, although at a reduced priority. As with the adjustment for the number of processors requested, this is an adjustment to the apparent submit time of the job. However, this adjustment has the effect of making jobs appear much younger than jobs submitted under projects that have not exceeded their allocation. In addition to the priority change, these jobs are also subject to a lower wall-time limit.

For example, if job1 is submitted at the same time as job2, and the project for job1 is over its allocation (while the project for job2 is not over its allocation), the batch system will consider job2 to have been waiting for a longer time than job1. The length of time applied as an adjustment to the project that is over its allocation depends on the system being used and the percentage that the project is over its allocation.

  • Jobs from projects that have used between 100% and 125% of their allocations are handled the same as jobs from projects that are under their allocation.

  • For projects that have used more than 125% of their allocations, the following rules apply:

    • Jobs have their priority reduced by 365 days; that is, their apparent submit time is moved 365 days later.

    • Jobs have a maximum wall time of 4 hours.
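The overuse rules above can be sketched in the same spirit. This minimal Python sketch assumes project usage is expressed as a fraction of the allocation (1.30 means 130%) and models the 365-day penalty as a shift of the apparent submit time, as described earlier in this section. The names are hypothetical, and the code is not the batch system's implementation.

```python
from datetime import datetime, timedelta

OVERUSE_THRESHOLD = 1.25                  # more than 125% of the allocation
OVERUSE_PENALTY = timedelta(days=365)     # priority reduced by 365 days
OVERUSE_MAX_WALLTIME_HOURS = 4            # wall-time cap for over-allocation jobs


def is_overused(usage_fraction: float) -> bool:
    """True only above 125%; 100%-125% is treated like an under-allocation project."""
    return usage_fraction > OVERUSE_THRESHOLD


def apparent_submit_time(submitted: datetime, usage_fraction: float) -> datetime:
    """A later apparent submit time makes the job look younger, so it ranks lower."""
    return submitted + OVERUSE_PENALTY if is_overused(usage_fraction) else submitted


def walltime_limit_hours(size_limit_hours: int, usage_fraction: float) -> int:
    """Combine the normal size-based limit with the 4-hour cap for overused projects."""
    if is_overused(usage_fraction):
        return min(size_limit_hours, OVERUSE_MAX_WALLTIME_HOURS)
    return size_limit_hours


submitted = datetime(2007, 3, 1, 9, 0)
print(apparent_submit_time(submitted, 1.10) == submitted)   # True: 110% is not penalized
print(apparent_submit_time(submitted, 1.30) - submitted)    # 365 days, 0:00:00
print(walltime_limit_hours(24, 1.30))                       # 4
```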

These policies are subject to change. This page is intended to reflect the scheduling policy currently in effect on the systems, so you may want to check back occasionally.