Cluster Advanced

RDlab@lsi.upc.edu
February 2015

Welcome to LSI.

The aim of this document is to serve as an introduction to advanced usage of the Computing Cluster System of the Llenguatges i Sistemes Informàtics (LSI) Department at the Universitat Politècnica de Catalunya (UPC), managed by the Research and Development Lab (RDlab).

Array jobs

If you have a number of identical jobs that differ only in their parameters, it is recommended to use an array job instead of submitting each job separately.

All the tasks of an array job share the same job identifier (job_id) but have a different task identifier (task_id). An array job is submitted as follows:

qsub -t <initial_task_id>-<final_task_id>:<step>

initial_task_id is the task_id of the first task, final_task_id that of the last one, and step (optional) is the difference between the task_ids of consecutive tasks. For instance:

qsub -l h_cpu=0:45:0 -t 2-10:2 render.sh data.in

This will execute the script named render.sh with data.in as a parameter, and every task will share the same job_id but have a different task_id (2, 4, 6, 8 and 10).

It is possible to use the task_id number as a parameter by using $SGE_TASK_ID:

#!/bin/bash
#$ -t 1-100
# Read the line of the parameter file matching this task's $SGE_TASK_ID
PARAM=$(awk "NR==$SGE_TASK_ID" "$HOME/myJob_params.txt")
# Run the job with the parameters read from that line
$HOME/myJob.sh $PARAM

In the example above, the file $HOME/myJob_params.txt contains the parameters for the script $HOME/myJob.sh, one call per line: line one holds the parameters for the call with task_id=1, line two those for the call with task_id=2, and so on. The awk command uses the $SGE_TASK_ID value to read the line matching this value and stores it in PARAM, which is finally passed to myJob.sh.
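
For instance, a hypothetical myJob_params.txt with three parameter sets (its contents are purely illustrative) could look like this:

--size 10 --seed 1
--size 20 --seed 2
--size 30 --seed 3

Submitting the script above with -t 1-3 would then run myJob.sh three times, once per line of the file.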

Parallel environments

To execute jobs on more than one core simultaneously it is necessary to use a parallel environment. The RDlab cluster offers two parallel environments: MPI (Message Passing Interface), based on message exchange, and MAKE, based on SMP (symmetric multiprocessing).

In short, the main difference between them is that MPI allows the processes to run on different nodes, whereas MAKE only allows parallel execution on cores of the same node.

Furthermore, the MPI environment requires jobs explicitly prepared to use it, whilst MAKE (SMP) allows certain jobs (OpenMP, Matlab, Java, multithreaded processes, etc.) to boost their performance by having more execution cores available.

qsub -pe <environment> <num_cores>

If we want our job to be executed on four cores in a make environment, we should execute:

qsub -pe make 4

This property can be used in conjunction with core binding to make the system try to reserve the 4 cores consecutively:

qsub -binding linear:4 -pe make 4
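
As an illustration, here is a minimal sketch of an OpenMP job script using the make environment (the program name is an assumption; $NSLOTS is set by Grid Engine to the number of slots granted):

#!/bin/bash
#$ -pe make 4
#$ -cwd
# Use as many OpenMP threads as slots granted by the queue manager
export OMP_NUM_THREADS=$NSLOTS
./my_openmp_program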

Execution of MPI jobs

The cluster allows the execution of MPI jobs through the integration of OpenMPI with the Grid Engine queue manager via the "ompi" parallel environment. MPI applications must be compiled with the cluster's version of OpenMPI, which resides in /home/soft/openmpi.
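
For instance, a program could be compiled with the cluster's mpicc wrapper (the source file name is illustrative):

/home/soft/openmpi/bin/mpicc -o my_mpi_program my_mpi_program.c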

Beforehand, you must configure passwordless ssh access between nodes, since the interaction between the different nodes of the MPI pool is done as your user via ssh:

ssh-keygen

Press return at every prompt to generate a key pair without a passphrase.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
cp /home/soft/rdlab/known_hosts ~/.ssh/
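
You can check that passwordless access works by connecting to one of the nodes (node name illustrative); no password should be requested:

ssh node210 hostname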

Below is a template of a job script for a typical MPI job:

#!/bin/bash
# openmpi example job
#
#$ -N job_name
#
# Use current working directory
#$ -cwd
#
# PARALLEL ENVIRONMENT:
# number of slots to be used (in the example: 1 master, 19 slaves)
#$ -pe ompi 20
#
# Mandatory variables declaration
export PATH=/home/soft/openmpi/bin:${PATH}
export LD_LIBRARY_PATH=/home/soft/openmpi/lib

# Execute mpirun command ($NSLOTS is set by Grid Engine to the granted slots)
mpirun -np $NSLOTS path_to_my_mpi_program

In this type of job, memory limits (h_vmem) apply to each of the MPI processes individually.
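
For instance, assuming the template above is saved as mpi_job.sh (name illustrative), the following submission would let each of the 20 MPI processes use up to 2 GB:

qsub -l h_vmem=2G mpi_job.sh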

It is possible to see on which nodes the job's processes are being executed with the command:

qstat -g t

Hadoop integration

The RDlab's HPC system also offers the possibility of executing a Hadoop environment. To access this environment you must request it from the RDlab.

Job dependency

Job dependency allows delaying the execution of a job until another job has finished. This way it is possible to define the execution order of the jobs and ensure that any dependencies between them are satisfied.

To do so, the flag hold_jid, followed by the job_id of the job to depend on, must be added:

qsub -hold_jid <job_id>
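
For example, a hypothetical two-step pipeline (script names are illustrative; the -terse flag makes qsub print only the job_id, so it can be captured in a shell variable):

JOB1=$(qsub -terse preprocess.sh)
qsub -hold_jid $JOB1 analyze.sh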

In the special case of array jobs, these can depend on other jobs, but individual tasks cannot have dependencies on other jobs or tasks.

Interactive jobs: qlogin

Until this point it has been shown how to interact with the cluster through batch jobs. However, it is also possible to open an interactive shell directly on an execution node, using the binary qlogin.

This interactive session is useful for jobs requiring direct user input, windowed applications or long compilations.

qlogin -q short

NOTE: It is important to know that this shell is a cluster process and, as such, it is limited in its resources (memory and execution time).

The qlogin tool accepts the same flags as the qsub call, has the same restrictions and will only grant access to those nodes where it is allowed. The flags allow specifying other limits as is done with jobs sent by qsub, as well as specifying which node to connect to by using the hostname flag:

qlogin -l hostname=node210,h_vmem=1G

In the example above, a connection to node210 would be established provided that it has 1 GB of free memory.

It is necessary to keep in mind that, unlike a regular job, if the requested resources are not available the job will not be queued, and no interactive connection will be opened.

Affinity: Core Binding

Core binding refers to the association of a job with a specific processor core. By default, the RDlab's cluster uses this property for all user processes, as stated in the basic usage indications.

However, it is possible to set the binding flag to adjust it to the job's specific needs, associating it with as many cores as necessary.

qsub -binding <binding_strategy>

The binding strategies that can be used on the cluster are linear and striding:

With linear the system will try to associate the job to as many consecutive cores as specified.

qsub -binding linear:<num_cores>

With striding the system will try to associate the job to the specified number of cores, separated from each other by a step-size distance.

qsub -binding striding:<amount>:<step-size>
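
For instance, the following submission (script name illustrative) would try to bind the job to 4 cores, taking every second core:

qsub -binding striding:4:2 -pe make 4 myJob.sh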

GPU execution

Currently, the HPC cluster provides computing nodes with NVIDIA Tesla and NVIDIA GTX Titan cards, located on nodes 800 and 801. To obtain information about these cards it is necessary to run this command on the nodes:

nvidia-smi

node 800

+------------------------------------------------------+                       
| NVIDIA-SMI 361.28     Driver Version: 361.28         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m          Off  | 0000:05:00.0     Off |                  Off |
| N/A   32C    P0    46W / 225W |     14MiB /  5119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          Off  | 0000:42:00.0     Off |                  Off |
| N/A   28C    P0    43W / 225W |     14MiB /  5119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
		    

node 801

 
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:03:00.0     Off |                  N/A |
| 22%   48C    P0    74W / 250W |     23MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 0000:06:00.0     Off |                    0 |
| 23%   44C    P0    67W / 235W |     22MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

		    

The command also shows information about the jobs being executed on the cards.

                                                     
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+
		    

To use these nodes it is necessary to send the jobs to a queue named gpu. To be allowed to access it, you must ask the RDlab. It is both a batch and an interactive queue, and can be used in the usual way.

IMPORTANT: Users with access to the GPU queue must explicitly set the queue (short, medium, long or gpu) for every job execution; otherwise, a "normal" job may be scheduled to execute on the GPU queue.

To send a batch job:

qsub -q gpu ...

For an interactive connection:

qlogin -q gpu

Jobs in this queue have no time limit, but they do have a 2 GB RAM limit by default, which can be changed in the usual way.
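
For instance, to submit a batch job to the gpu queue with an 8 GB memory limit instead of the default (script name illustrative):

qsub -q gpu -l h_vmem=8G myGpuJob.sh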