Setting up your cluster account for Sun Grid Engine (SGE):

What is SGE and why should I use SGE to submit my jobs?

SGE is a queuing system for job submissions to the Cipres cluster. It tracks
the cluster resources, chooses the most suitable machine to run your job on,
and if requested, notifies you when the job is completed. SGE allows us to make
the most of our available computational resources by distributing jobs on all
available machines.

Submission Priorities

Currently, SGE implements three types of queues on the cluster, each with a
different priority level. Access to each queue is based on your affiliation
with the CIPRES project. You can specify the queue type when submitting a job
with "qsub" (see qsub below).

Submission using SGE step-by-step instructions

  1. ssh
  2. SGE Environment Setup
    In order to use SGE, you must have SGE binaries in your path. This can be done
    by executing one of the following commands depending on the shell you are using:

    * To determine the shell you are using, type the command: echo $SHELL

    For csh and tcsh: source /projects/cipres/gridengine/default/common/settings.csh
    For bash and ksh: . /projects/cipres/gridengine/default/common/

    Advanced users: you can add the above command to your .cshrc (for csh and tcsh)
    or .bashrc file (for bash and ksh)

    Now try to execute the command "qstat". If it returns "command not found",
    the setup did not succeed. Please repeat the above steps.
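
    The check described above can be extended to all of the SGE commands used on
    this page. This is a minimal sketch, not part of the official setup; the
    function names have_cmd and check_sge_setup are hypothetical helpers.

```shell
# have_cmd: succeed if the named command can be found through PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# check_sge_setup: report the first SGE command that is not reachable.
check_sge_setup() {
    for cmd in qsub qstat qacct qdel qhost; do
        if ! have_cmd "$cmd"; then
            echo "missing: $cmd (source the SGE settings file for your shell)"
            return 1
        fi
    done
    echo "all SGE commands found"
}

# Run the check; "|| true" keeps a failed check from aborting a calling script.
check_sge_setup || true
```

    If a command is reported missing, source the settings file for your shell again
    and rerun the check.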
  3. Submitting a job to SGE: qsub
    You submit a job using the SGE command "qsub". "qsub" reads commands from either
    the keyboard or a script file, but in practice a script file is almost always used.
    The file contains the commands that instruct "qsub" how to submit the job.

    Note: many options are available when running "qsub". Please refer to "man qsub" for a list
    of possible options. Only lines in bold are required. All others are optional.

    A) sample script file for serial job submission:

    # the name of my job, whatever makes sense to you
    #$ -N myprogram

    # instructs SGE to save stdout and stderr in current directory of submission
    #$ -cwd

    # it's recommended to run the script in BASH
    #$ -S /bin/bash

    # optionally, set the priority (see Submission Priorities in earlier section)
    # Cipres Users: high
    # ATOL Users: medium
    # All others: all (or leave it blank)
    #$ -l [ all | medium | high]

    # your command goes here - it is the same command you would use when running without SGE:
    # [your_program]: path to your program, e.g. /users/u3/joe/mb
    # [arguments]: Optional: list of arguments to your program if any
    [your_program] [arguments]
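
    As a concrete illustration of the template above, the following sketch writes
    a small serial job script to disk. The file name hello_sge.sh, the job name
    hello_sge, and the helper write_serial_job are hypothetical. The optional
    priority line is left out, and the job simply prints the execution host and
    the date.

```shell
# write_serial_job: create a small serial job script at the given path.
# Lines starting with "#$" are options read by qsub; they must start at column 0.
write_serial_job() {
    cat > "$1" <<'EOF'
# the name of the job
#$ -N hello_sge
# save stdout and stderr in the directory of submission
#$ -cwd
# run the script with bash
#$ -S /bin/bash
# the command to run: print the execution host and the date
hostname
date
EOF
}

write_serial_job hello_sge.sh
cat hello_sge.sh
```

    Replace the last two lines of the generated file with the command you would
    normally run without SGE.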

    B) sample script file for parallel job submission:
    To run a job in parallel, you must do a one-time setup to establish SSH connections from
    cipres1 (the submission host) to all the other cipres nodes (the execution hosts). See
    "How to set up SSH connections for cipres1 to all other nodes" below.


    # it's recommended to run the script in BASH
    #$ -S /bin/bash

    # instructs SGE to save stdout and stderr in current directory of submission
    #$ -cwd

    # set the priority (see Submission Priorities in earlier section)
    # Cipres Users: high
    # ATOL Users: medium
    # All others: all (or leave it blank)
    #$ -l [ all | medium | high]

    # specify "lam" as the programming environment (PE) and request number of processors to run your job with
    #$ -pe lam [number_of_processors]

    # specify the job command
    /projects/cipres/gridengine/tight-lammpi/bin/mpirun -np $NSLOTS [your_parallel_program] [arguments]

    C) How to run interactive SGE jobs
    To run a job interactively via SGE, perform the following steps.
    Remember to include any option needed to run the job interactively, either in your command or in your_parallel_program.
    (Note that all requested processors will be spawned on the same node that you log into.)
    1. ssh
    2. qlogin _OR_
    3. xterm -e /bin/sh -c "qlogin" (this executes the command in a new xterm)
    4. (login with username/password as requested)
    5. lamboot
    6. /projects/cipres/gridengine/tight-lammpi-7.0.6/bin/mpirun -np [number_of_processors] [your_parallel_program] [arguments]

    How to set up SSH connections for cipres1 to all other nodes:
    Commands you need to execute are shown on the lines prefixed with "%".
    1. you are logged into
    2. Create a public key in ~/.ssh directory:
      % mkdir -p ~/.ssh
      % ssh-keygen -t dsa
      Generating public/private dsa key pair.
      Enter file in which to save the key (~/.ssh/id_dsa): [hit enter to accept the default]
      Enter passphrase (empty for no passphrase): [hit enter to accept the default]
      Enter same passphrase again: [hit enter to accept the default]
      Your identification has been saved in ~/.ssh/id_dsa
      Your public key has been saved in ~/.ssh/
      The key fingerprint is: [Some really long string]
    3. % touch ~/.ssh/authorized_keys
      % cat ~/.ssh/ >> ~/.ssh/authorized_keys
    4. get a copy of known_hosts file
      % wget
    5. move this file to ~/.ssh directory:
      % mv known_hosts ~/.ssh/
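
    If SSH still asks for a password after these steps, overly permissive modes on
    the ~/.ssh directory are a common cause: with OpenSSH's default StrictModes
    setting, authorized_keys is ignored when the file or its directory is writable
    by other users. The sketch below is an optional extra, not part of the original
    instructions; tighten_ssh_perms is a hypothetical helper name, and the default
    location ~/.ssh is assumed.

```shell
# tighten_ssh_perms: restrict an SSH directory and its authorized_keys file
# to the owner only. $1: directory to fix (defaults to ~/.ssh).
tighten_ssh_perms() {
    dir="${1:-$HOME/.ssh}"
    chmod 700 "$dir" || return 1
    if [ -f "$dir/authorized_keys" ]; then
        chmod 600 "$dir/authorized_keys" || return 1
    fi
}

# Usage (uncomment to apply to your own account):
# tighten_ssh_perms
```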
    Once the script file is created and the authentication step is completed, submit your job using "qsub":
    % qsub [script_file]
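
    When a job is accepted, "qsub" prints a confirmation that includes the job ID,
    which the later commands (qstat, qacct, qdel) take as <jobid>. The sketch below
    extracts that ID. It assumes the confirmation has the usual SGE form
    'Your job 4427 ("myprogram") has been submitted'; check the message on your
    system before relying on it. parse_job_id is a hypothetical helper name.

```shell
# parse_job_id: read qsub's confirmation message on stdin and print the job ID.
# Prints nothing if the message does not have the assumed form.
parse_job_id() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Usage on the cluster (script_file is your own job script):
# jobid=$(qsub script_file | parse_job_id)
# echo "submitted job $jobid"
```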

    Note: Both "emacs" and "vi" are available on the machine. Let us know if you have trouble
    using either editor, and we may be able to install another editor that is easier for
    non-Unix users.

  4. Getting Info on your running jobs: qstat
    To get a list of your pending/running jobs, you can issue the "qstat" command. Below is
    an example output:
    job-ID prior name user state submit/start at queue slots ja-task-ID
    4427 0.55500 cipresw r 06/23/2008 19:43:14 1
    4437 0.55500 cipresw qw 06/24/2008 16:23:04

    r: running
    qw: queue waiting

    If you want more information on a job (including a helpful message explaining why your job has not run), use: qstat -j <jobid>
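
    The same command can be used to wait for a job to finish. This is a minimal
    sketch, assuming that "qstat -j <jobid>" exits with a non-zero status once the
    job is no longer pending or running; verify this on your installation.
    wait_for_job is a hypothetical helper name.

```shell
# wait_for_job: poll until the job is no longer known to the scheduler.
# $1: job ID, $2: seconds between checks (default 30)
wait_for_job() {
    while qstat -j "$1" >/dev/null 2>&1; do
        sleep "${2:-30}"
    done
    echo "job $1 has left the queue"
}

# Usage on the cluster:
# wait_for_job 4427 60
```

    Once the job has left the queue, "qacct -j <jobid>" reports how it finished.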
  5. Checking status of a completed job: qacct
    To get information on a job that's already completed: qacct -j <jobid>

  6. Deleting a running job: qdel
    To delete or remove a running job from the queue after submission, use "qdel <jobid>".
  7. Getting status of nodes being used by SGE: qhost

    Note: There are many other options available for the SGE commands, too many to list here.
    To see the options, use "-help" with the command, e.g.
    qsub -help
    qstat -help
    qacct -help
    qdel -help
    qhost -help
  8. For questions or problems, please contact us at: Report an Issue