Cluster Quickstart
February 2015

Welcome to LSI.

This document is intended as an introduction to the Computing Cluster System of the Llenguatges i Sistemes Informàtics Department (LSI) at the Universitat Politècnica de Catalunya (UPC), managed by the Research and Development Lab (RDlab).


Clustering, in a computing context, refers to a group of computers whose hardware and software, connected through high-speed networks, work together in order to solve a problem.

LSI's Computing Cluster System at the Universitat Politècnica de Catalunya is a powerful computing tool thanks to its high number of processors, its memory and its large disk space. Execution nodes are grouped in queues, which collect users' submitted jobs and manage their assignment.


It is necessary to bear in mind some default parameters when submitting jobs to the cluster's queues.

The cluster is divided into four execution queues. Each queue is defined by how much execution time its processes can consume:

Short Queue: for jobs lasting one day (24 hours) or less. Default queue if no other is stated.
Medium Queue: for jobs lasting one week at most.
Long Queue: with unlimited job execution time.

If the job duration is not specified, the job will be sent to the Short queue. If a job exceeds the maximum time a queue allows, it will be automatically terminated.

IMPORTANT: Users are advised to specify the duration of the job (qsub -l h_rt=hours:minutes:seconds) in order to avoid the system killing the job prematurely.
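As a sketch of the advice above, the runtime limit can also be embedded in the job script itself with a #$ directive; the file name and the 12-hour value below are only illustrative:

```shell
# Write an illustrative job script that requests a 12-hour runtime
# (h_rt), keeping it within the Short queue's 24-hour limit.
# The file name "myjob.sh" is an assumption for this example.
cat > myjob.sh <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -l h_rt=12:00:00
date
sleep 2
date
EOF
chmod +x myjob.sh
# Syntax-check the script locally; on the cluster it would be
# submitted with: qsub myjob.sh
sh -n myjob.sh && echo "script OK"
```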

There is also a queue named Test, which may be used for immediate trials. This queue is made up of the least powerful nodes and should only be used to check the jobs' correctness, not for real executions.

The maximum memory for a job is 2 Gigabytes by default. This value can be modified by the user.

The maximum number of processes that can be executed simultaneously is determined by the number of slots and the amount of memory, values which vary according to the kind of user. Should you need more information, please contact the RDlab.

The default configuration uses the core binding property, which attaches (binds) each job to a processor core. This guarantees that a job will never be migrated to another core or processor, avoiding the context switch cost. In case of a parallel job, core binding will attach the job to as many processors as reserved, with the same guarantees.

Finally, the cluster's configuration defines that each job maps to one processor core exactly, which guarantees exclusivity in the use of the processor by the job.

Connection to the Computing Cluster System in text mode

To connect in text mode (terminal or Command Line Interface) in UNIX systems we will need a Secure Shell (SSH) client.
From the terminal, type the following command, where <username> refers to the LSI department username.

ssh <username>@<cluster-hostname>

The system will ask for our password and, once entered, we will gain access to the system.

In Windows environments, an SSH client such as PuTTY is needed:

Connection to the Computing Cluster System in graphical mode

If we want to use a graphical environment on a UNIX system and redirect it, it is necessary to use the -X flag:

ssh -X <username>@<cluster-hostname>

Alternatively, after allowing incoming X connections on our local machine, and with an open connection to the cluster, we can execute:

setenv DISPLAY <our_ip>:0.0

In Windows environments, if we want graphic support, it is necessary to redirect the X server using a program like Xwin32 or similar.

User environment configuration (path)

Certain applications of Open Grid (the queue manager) are architecture dependent and require redefining the system path. To do so, it is necessary to modify the PATH variable in the .tcshrc file located in our home directory:

set ARCH=`/usr/local/sge/util/arch`
set path=( /usr/local/sge/bin/${ARCH} $path)
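To see what the change does, the following sketch reproduces the PATH manipulation in POSIX shell; the architecture string is hard-coded here because the arch utility only exists on the cluster:

```shell
# Simulate the PATH update from .tcshrc in POSIX shell.
# On the cluster, ARCH would come from /usr/local/sge/util/arch;
# "lx24-amd64" is an assumed value for illustration.
ARCH=lx24-amd64
PATH="/usr/local/sge/bin/${ARCH}:$PATH"
# The SGE bin directory is now first in the search path.
echo "$PATH" | cut -d: -f1   # prints /usr/local/sge/bin/lx24-amd64
```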

It is important to take into account that if we have an explicit reference to a concrete binary architecture, we must delete that reference from the PATH value:

setenv PATH /usr/local/sge/bin/lx24-x86

To apply the changes, it is necessary to close the session and log back into the system.

Submitting a batch job

You can submit to the grid engine system any shell script that you can run by hand from your command prompt. Such scripts must not require a terminal connection and must not need interactive user intervention. As an example, we will use the following script, which sleeps for 20 seconds:

You can find the following job in the file /sge-root/examples/jobs/

#!/bin/sh
# (c) 2004 Sun Microsystems, Inc. Use is subject to license terms.
# This is a simple example of a SGE batch script
# request Bourne shell as shell for job
#$ -S /bin/sh
# print date and time
date
# Sleep for 20 seconds
sleep 20
# print date and time again
date

To be able to execute the script, it is necessary to set the execution permission (755, or just +x):

chmod +x <script>

Submitting a job: qsub

We send the job to a queue using the following command:

qsub <script>
If the job has been correctly submitted, we will see the following on screen:

Your job 1 ("") has been submitted

Some important flags

The qsub command, as well as the script body, allows the user to specify flags (properties) to be applied when the job is executed. If we want them specified when calling qsub, they should be added as parameters. For example:

qsub -m bea -o output.txt

Otherwise, if we want them to apply permanently, we add them to the job's script body as shown:

#$ -m bea
#$ -o output.txt

Some of the most important flags are:

-e: Specifies where to place the error output file.

qsub -e error.txt

-l: Allows the user to set special requirements for a job (e.g. execution time, memory, etc.).
To specify an execution time different from the default value:

qsub -l h_rt=hours:minutes:seconds

If we submit a job whose execution time will be shorter than the default limit and specify it at submission time, the job's priority increases over jobs using the default value; in other words, it may be executed before other waiting jobs.

To specify a memory amount different from the default value (currently set to 4 GB):

qsub -l h_vmem=1G

NOTE: If we know the maximum amount of memory our job is going to consume, it is recommended to specify it at submission time in order to speed up the job's assignment to the nodes.

-m: Allows the user to specify when to receive mail: 'n' (none), 'a' (aborted), 'b' (begin), 'e' (end), 's' (suspended).

qsub -m bea

In order to use this flag it is necessary to provide, through the -M flag, the address where we want to receive the e-mail:

qsub -M <email-address>

-o: Specifies where to leave the standard output file.

qsub -o output.txt

-q: Indicates the queue to use.
To send a job to the short queue:

qsub -q short

It is also possible to send a job to one or several specific nodes.

qsub -q short@node112,short@node113

This specifies that we want the job to be executed in the short queue on one of the selected nodes: node112 or node113.

-S: Specifies which shell must be used on execution.

qsub -S /bin/sh
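The flags above can also be combined as #$ directives inside one job script; the sketch below builds such a script, where the mail address and file names are placeholders:

```shell
# Illustrative job script combining the flags from this section.
# user@example.com and the output file names are placeholders.
cat > flags_job.sh <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -q short
#$ -l h_rt=01:00:00
#$ -l h_vmem=1G
#$ -o output.txt
#$ -e error.txt
#$ -m bea
#$ -M user@example.com
date
sleep 1
date
EOF
# Verify the script parses; on the cluster: qsub flags_job.sh
sh -n flags_job.sh && echo "directives OK"
```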

For further information:

man qsub

Querying queued jobs' state: qstat

After sending a job to any queue of nodes, it will not be immediately executed. The scheduler analyzes the system status in order to execute the job in the best conditions. We can check its state using the following command:

qstat
The program shows us this information:

job-ID prior name user state submit/start at queue slots ja-task-ID
------------------------------------- ------------------------------------------------------
000001 0.55000 gabriel qw 10/14/2010 11:16:56 short@node112 1

We can see only our jobs with the -u flag and typing our username:

qstat -u <username>

We can get extra information about a job with the -j flag, indicating the job number:

qstat -j <#job>

We will get a similar output:

qstat -j 000001
submission_time: Fri Oct 15 13:42:05 2010
hard resource_list: h_rt=604800,h_vmem=4G
usage 1: cpu=00:00:00, mem=0.00000 GBs, io=0.00000, vmem=N/A, maxvmem=N/A
scheduling info: queue instance "short@node112" dropped because it is disabled
queue instance "short@node113" dropped because it is disabled
queue instance "short@node114" dropped because it is disabled

Moreover, if the job is not being executed and remains waiting, qstat also reports the reason why it is still queued. If qstat shows no output, our job's execution has probably finished.
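Since qstat is only available on the cluster, this sketch parses a saved line of qstat output (the sample row shown earlier in this section) to extract the job state field:

```shell
# Parse a saved qstat row (fields as in the sample output:
# job-ID, priority, user, state, date, time, queue, slots).
line='000001 0.55000 gabriel qw 10/14/2010 11:16:56 short@node112 1'
state=$(echo "$line" | awk '{print $4}')
echo "$state"   # qw means the job is queued and waiting
```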

Deleting jobs from the queue: qdel

If we want to delete a job we have sent to any queue, we can do it using the following command:

qdel <#job>
gabriel has deleted job 000001

NOTE: To obtain the job's identifier (#job) we can use the qstat command.

Suspend and restart a job: qmod

If we want to suspend a job until we decide to restart it, we have to execute:

qmod -sj <#job>

When we want to restart it, we have to run:

qmod -usj <#job>

Executed process output file

Because the jobs we submit are not executed instantly, the standard and error outputs are redirected to files. By default, once the job runs, it leaves the standard and error outputs in two files in our home directory, identified by the job name and the job number:

<job_name>.o<job_id>
<job_name>.e<job_id>
Furthermore, if we want to redirect one (or both) of the outputs somewhere else, we can do it, as seen before, with these flags:

-e : error
-o : output

How to debug a job

If we want to know whether our job has finished correctly, we must look at the job's accounting information, focusing on the exit_status and failed fields. To obtain the accounting information, it is necessary to execute:

qacct -j <job_id>

owner        fgalindo
project      NONE
department   sistemas
qsub_time    Fri Mar 14 12:13:07 2014
start_time   Fri Mar 14 12:13:22 2014
end_time     Fri Mar 14 12:13:22 2014
ru_msgsnd    0
ru_msgrcv    0

The failed field indicates the problem that occurred if the job could not be started on the execution host. The exit_status field shows the job's exit code, following the normal shell conventions.

If the job dies through a signal, the exit status is 128 plus the signal number. For instance, if a job dies through signal 9 (SIGKILL), the exit status becomes 137 (128 + 9).
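The exit-code arithmetic can be checked directly in the shell:

```shell
# A job killed by a signal exits with code 128 + signal number.
SIGKILL_NUM=9
echo $((128 + SIGKILL_NUM))   # prints 137, as in the SIGKILL example
```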

If we tell the system to send us a mail when a job has finished or has been killed, we will also receive relevant information:

Job 3240728 ("") Aborted
Exit Status      = 137
Signal           = KILL
User             = fgalindo
Queue            = short@node112
Host             = node112
Start Time       = 03/19/2014 12:32:33
End Time         = 03/19/2014 12:32:35
CPU              = 00:00:00
Max vmem         = 14.855M
failed assumedly after job because:
job 3240728.1 died through signal KILL (9)


Graphical interface: Qmon

Qmon is a graphical user interface that provides a Job Submission dialog box and a Job Control dialog box for submitting and monitoring jobs. We can start Qmon with the following command:

qmon
After the splash screen, we will reach the main dialog box, which allows us to select any of the program's options.


We can find further information on Qmon at:

External links

Node information and its properties:
Ganglia software monitoring program: