Setting up your cluster account for Sun Grid Engine (SGE):
What is SGE and why should I use SGE to submit my jobs?
SGE is a queuing system for job submissions to the Cipres cluster. It tracks
the cluster resources, chooses the most suitable machine to run your job on,
and if requested, notifies you when the job is completed. SGE allows us to make
the most of our available computational resources by distributing jobs on all
available machines.
Submission Priorities
Currently, there are 3 types of queues implemented by SGE on the cluster.
Each has a different priority level. Access to each queue is based on your affiliation
with the CIPRES project. The queue type can be specified when submitting a job
using "qsub" (see qsub below).
- high: highest priority: for Cipres users only;
- medium: medium priority: for ATOL users only;
- all: normal priority (this is the default) for ALL users;
Submission using SGE: step-by-step instructions
- ssh cipres1.sdsc.edu
- SGE Environment Setup
In order to use SGE, you must have SGE binaries in your path. This can be done
by executing one of the following commands depending on the shell you are using:
* To determine the shell you are using, type the command: echo $SHELL
For csh and tcsh: source /projects/cipres/gridengine/default/common/settings.csh
For bash and ksh: . /projects/cipres/gridengine/default/common/settings.sh
Advanced users: you can add the above command to your .cshrc (for csh and tcsh)
or .bashrc file (for bash and ksh)
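For bash users, the .bashrc addition mentioned above could look something like the following sketch. The names SGE_SETTINGS and sge_ready are made up for this example; only the settings.sh path comes from the instructions above. The guard around the "." command is an addition that avoids an error message at login if the settings file cannot be read.

```shell
# Sketch of a ~/.bashrc fragment for bash users.
# SGE_SETTINGS and sge_ready are hypothetical names used only in this example.
SGE_SETTINGS=/projects/cipres/gridengine/default/common/settings.sh

# Load the SGE settings only when the file is readable.
if [ -r "$SGE_SETTINGS" ]; then
    . "$SGE_SETTINGS"
fi

# sge_ready: succeed (exit status 0) when qstat can be found in the PATH.
sge_ready() {
    command -v qstat >/dev/null 2>&1
}
```

After logging in again, typing: sge_ready && echo "SGE is set up" should print the message only if the SGE binaries are in your path.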
Now try to execute the command "qstat". If it returns "command not found",
the setup did not succeed. Please repeat the
above steps.
- Submitting a job to SGE: qsub
You submit a job using the SGE command "qsub". "qsub" reads commands from either
the keyboard or a script file, but for all practical purposes, a script file is used. The file
contains the commands to instruct "qsub" how to submit the job.
Note: many options are available when running "qsub". Please refer to "man qsub" for a list
of possible options. Only lines in bold are required. All others are optional.
A) sample script file for serial job submission:
#!/bin/bash
# the name of my job, whatever makes sense to you
#$ -N myprogram
# instructs SGE to save stdout and stderr in current directory of submission
#$ -cwd
# it's recommended to run the script in BASH
#$ -S /bin/bash
# optionally, set the priority (see Submission Priorities in earlier section)
# Cipres Users: high
# ATOL Users: medium
# All others: all (or leave it blank)
#$ -l [ all | medium | high]
# your command goes here - it is the same command you would use when running without SGE:
# [your_program]: path to your program, e.g. /users/u3/joe/mb
# [arguments]: Optional: list of arguments to your program if any
[your_program] [arguments]
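As an illustration of the template above, the following sketch prints a filled-in serial job script. The function name write_serial_job and its argument order are invented for this example, and it leaves out the optional priority line so that the default queue ("all") is used.

```shell
#!/bin/bash
# write_serial_job: print a serial SGE job script to standard output.
# The function name and its arguments are hypothetical.
#   $1 = job name, $2 = full path to your program, remaining = program arguments
write_serial_job() {
    if [ $# -lt 2 ]; then
        echo "usage: write_serial_job job_name program [arguments]" >&2
        return 1
    fi
    local name=$1 program=$2
    shift 2
    # Build the command line exactly as it would be typed without SGE.
    local cmd=$program
    if [ $# -gt 0 ]; then
        cmd="$cmd $*"
    fi
    # The "#$" lines are read by qsub as options; the last line is the command.
    printf '%s\n' \
        '#!/bin/bash' \
        "#\$ -N $name" \
        '#$ -cwd' \
        '#$ -S /bin/bash' \
        "$cmd"
}
```

For example, write_serial_job mb_run /users/u3/joe/mb > mb_run.sh creates a script file named mb_run.sh (the job name mb_run is only an example) that can then be passed to "qsub".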
B) sample script file for parallel job submission:
To run a job in parallel, you must first do a one-time setup to establish SSH connections from
cipres1 (the submission host) to all the other cipres nodes (the execution hosts). Please see the
instructions in "How to set up SSH connections for cipres1 to all other nodes" below.
#!/bin/bash
# it's recommended to run the script in BASH
#$ -S /bin/bash
# instructs SGE to save stdout and stderr in current directory of submission
#$ -cwd
# set the priority (see Submission Priorities in earlier section)
# Cipres Users: high
# ATOL Users: medium
# All others: all (or leave it blank)
#$ -l [ all | medium | high]
# specify "lam" as the programming environment (PE) and request number of processors to run your job with
#$ -pe lam [number_of_processors]
# specify the job command
/projects/cipres/gridengine/tight-lammpi/bin/mpirun -np $NSLOTS [your_parallel_program] [arguments]
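The parallel template can be filled in the same way. The sketch below prints a parallel job script for a given number of processors and refuses values that are not positive integers. The function name write_parallel_job and its argument order are invented for this example; the priority line is again left out so that the default applies.

```shell
#!/bin/bash
# write_parallel_job: print a parallel (LAM/MPI) SGE job script to stdout.
# The function name and argument order are hypothetical.
#   $1 = number of processors, $2 = parallel program, remaining = arguments
write_parallel_job() {
    local nprocs=$1 program=$2
    # Reject anything that is not a positive integer without leading zeros.
    case $nprocs in
        ''|*[!0-9]*|0*)
            echo "number of processors must be a positive integer" >&2
            return 1 ;;
    esac
    if [ -z "$program" ]; then
        echo "missing path to the parallel program" >&2
        return 1
    fi
    shift 2
    # \$NSLOTS stays literal here; SGE sets NSLOTS when the job runs.
    local cmd="/projects/cipres/gridengine/tight-lammpi/bin/mpirun -np \$NSLOTS $program"
    if [ $# -gt 0 ]; then
        cmd="$cmd $*"
    fi
    printf '%s\n' \
        '#!/bin/bash' \
        '#$ -S /bin/bash' \
        '#$ -cwd' \
        "#\$ -pe lam $nprocs" \
        "$cmd"
}
```

For example, write_parallel_job 4 /users/u3/joe/mb_mpi > mb_mpi.sh would request 4 processors (the program path is only an example).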
C) How to run interactive SGE jobs
To run a job interactively via SGE, perform the steps below.
Remember to include any option needed to run the job interactively, either in your command or in your_parallel_program.
(Note that all requested processors will be spawned on the same node that you log into.)
- ssh cipres1.sdsc.edu
- qlogin _OR_
- xterm -e /bin/sh -c "qlogin" (this executes the command in a new xterm)
- (login with username/password as requested)
- lamboot
- /projects/cipres/gridengine/tight-lammpi-7.0.6/bin/mpirun -np [number_of_processors] [your_parallel_program] [arguments]
How to set up SSH connections for cipres1 to all other nodes
Commands you need to execute are shown after the "%" prompt:
- Make sure you are logged into cipres1.sdsc.edu
- Create a public key in ~/.ssh directory:
% mkdir -p ~/.ssh
% ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (~/.ssh/id_dsa): [hit enter to accept the default]
Enter passphrase (empty for no passphrase): [hit enter to accept the default]
Enter same passphrase again: [hit enter to accept the default]
Your identification has been saved in ~/.ssh/id_dsa
Your public key has been saved in ~/.ssh/id_dsa.pub
The key fingerprint is: [Some really long string]
- Add the public key to your authorized_keys file:
% touch ~/.ssh/authorized_keys
% cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
- Get a copy of the known_hosts file:
% wget http://www.phylo.org/sub_sections/sge/known_hosts
- Move this file to the ~/.ssh directory:
% mv known_hosts ~/.ssh/
Once the script file is created and the authentication step is completed, submit your job using "qsub":
% qsub [script_file]
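When "qsub" accepts a job it prints a confirmation message that contains the job ID, which the other SGE commands take as an argument. The sketch below extracts that ID. It assumes the message has the form: Your job 4427 ("job1.sh") has been submitted. The function names parse_job_id and submit_job are made up for this example.

```shell
#!/bin/bash
# parse_job_id: read qsub's confirmation message on standard input and
# print only the numeric job ID.
# Assumption: the message looks like
#   Your job 4427 ("job1.sh") has been submitted
parse_job_id() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# submit_job: submit the script file given as $1 and print its job ID.
submit_job() {
    qsub "$1" | parse_job_id
}
```

For example, jobid=$(submit_job myscript.sh) would store the ID in a shell variable for later use (myscript.sh stands for your own script file).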
Note: Both "emacs" and "vi" are available on the machine. Let us know if you have trouble
using either editor, and maybe we can install another editor that's easier for non-Unix
users to use.
- Getting Info on your running jobs: qstat
To get a list of your pending/running jobs, you can issue the "qstat" command. Below is
an example output:
job-ID  prior    name     user     state  submit/start at      queue                    slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------
4427    0.55500  job1.sh  cipresw  r      06/23/2008 19:43:14  all.q@cipres10.sdsc.edu  1
4437    0.55500  job2.sh  cipresw  qw     06/24/2008 16:23:04
state:
r: running
qw: queued and waiting
If you want to obtain more information on a job (including a helpful message explaining why your job is not running), use: qstat -j <jobid>
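The state column in the "qstat" listing can also be filtered from the command line. The sketch below prints the IDs of all jobs in a given state. It assumes the column layout of the example output above (two header lines, job ID in column 1, state in column 5); the function name jobs_in_state is made up for this example.

```shell
#!/bin/bash
# jobs_in_state: read "qstat" output on standard input and print the IDs of
# the jobs whose state column equals $1 (for example r or qw).
# Assumption: the first two lines are the header and the separator, the
# job ID is in column 1 and the state is in column 5.
jobs_in_state() {
    awk -v state="$1" 'NR > 2 && $5 == state { print $1 }'
}
```

For example, qstat | jobs_in_state qw lists the IDs of your jobs that are still waiting.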
- Checking status of a completed job: qacct
To get information on a job that has already completed: qacct -j <jobid>
- Deleting a running job: qdel
To delete or remove a running job from the queue after submission, use "qdel <jobid>".
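As a sketch of how "qstat" and "qacct" fit together, the following function waits until a job no longer appears in the "qstat" listing and then prints its accounting record. The function names job_listed and wait_for_job and the 30-second polling interval are choices made for this example, not part of SGE.

```shell
#!/bin/bash
# job_listed: read "qstat" output on standard input; succeed if job ID $1
# appears in the first column (the job is still pending or running).
job_listed() {
    awk -v id="$1" 'NR > 2 && $1 == id { found = 1 } END { exit !found }'
}

# wait_for_job: poll qstat until job $1 leaves the queue, then show its
# accounting record. $2 is the polling interval in seconds (default 30,
# an arbitrary choice).
wait_for_job() {
    local id=$1 interval=${2:-30}
    while qstat | job_listed "$id"; do
        sleep "$interval"
    done
    qacct -j "$id"
}
```

For example, wait_for_job 4427 returns once job 4427 has finished or has been removed with "qdel".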
- Getting status of nodes being used by SGE: qhost
Note: There are many other options available when running SGE commands. It's impossible to list them all here.
To see the options, use "-help" with the command, e.g.
qsub -help
qstat -help
qacct -help
qdel -help
qhost -help

