
/rdlab HPC service manual


Introduction

The CS department High Performance Computing (HPC) system runs a queue manager environment that collects all user requests (jobs). The queue scheduler then sorts and prioritizes every user task using several defined criteria (user quota, estimated execution time, RAM, CPU cores…).

When a gap that fits the job becomes available in the HPC system, the user request is transferred to an execution node and controlled by the queue system. If the user process tries to use more resources (RAM, time…) than requested, the queue system will kill the job to ensure system stability.

The /rdlab HPC system is currently using:

Warning: On a common server, laptop, or desktop computer, users simply run whatever they want directly. In an HPC environment, all user requests/processes/jobs must be queued and controlled through the queue system.

Our HPC system groups all the execution nodes into several specific queues based on time criteria or specific hardware.
The queues available to users are:

Queue name (partition)   Purpose and limits
short                    execution time < 1 day
medium                   execution time < 1 week
long                     unlimited execution time
gpu                      execution nodes with GPU

Warning: Other queues are reserved for internal or specific services and are only available to HPC administrators.


Access and configuration

Our HPC environment is only accessible through the Secure Shell (SSH) protocol. For security reasons there are different access paths:

  • Inside CS Department network:
    • ssh <username>@logincluster.cs.upc.edu

  • Worldwide access:
    1. ssh <username>@login1.cs.upc.edu
    2. ssh <username>@logincluster.cs.upc.edu

logincluster.cs.upc.edu is “just” a bridge between the user and the HPC system. It is not an execution node, so you should not run any program on it: it is the least powerful server in the HPC environment. Moreover, the system will kill all user processes on logincluster.cs.upc.edu after about 60 minutes of execution time.

Warning: In order to access the SLURM commands directly, you should update your PATH variable, adding the directory “/usr/local/slurm/bin” to it.
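The PATH update above can be done in one line; a minimal sketch for a bash shell (the directory /usr/local/slurm/bin is the one named in the warning):

```shell
# Add the SLURM binaries to PATH for the current session
export PATH="$PATH:/usr/local/slurm/bin"

# To make the change permanent, append the same line to your ~/.bashrc:
# echo 'export PATH="$PATH:/usr/local/slurm/bin"' >> ~/.bashrc
```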

You should be familiar with Linux environments. You can find lots of easy tutorials and how-tos for Linux beginners.
Ex: https://maker.pro/linux/tutorial/basic-linux-commands-for-beginners
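For instance, a few of the everyday commands such tutorials cover, which you will use constantly on the login node (a generic sketch, not specific to our HPC system):

```shell
pwd                               # print the current working directory
mkdir -p myproject                # create a working directory (no error if it exists)
cd myproject
echo "some results" > notes.txt   # create a small text file
cat notes.txt                     # print its contents
ls -l                             # list files with sizes and permissions
```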


Uploading/Downloading data to the HPC environment

You can upload and download files to/from the HPC environment using the Secure Copy (scp) or the Secure File Transfer Protocol (sftp) services.

  • Inside CS Department network you can connect directly:
    • Ex: from your computer to the HPC system:
      scp [filename] <username>@logincluster.cs.upc.edu:~/

    • Ex: from HPC system to your computer:
      scp [filename] <username>@<your_ip_address>:~/

  • Worldwide access: you can only initiate the connection from the HPC system to your computer:
    • Ex: from the HPC system to your computer:
      scp [filename] <username>@<your_ip_address>:~/

    • Ex: to the HPC system from your computer (your computer must be accessible from the Internet; run this on the HPC system):
      scp <username>@<your_ip_address>:<file_path> .

Warning: You can create a .tar file if multiple directories or files must be transferred between the systems. https://alvinalexander.com/unix/edu/examples/tar.shtml
https://www.tecmint.com/18-tar-command-examples-in-linux/
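For example, to bundle a whole directory into a single compressed archive before copying it (a sketch; the directory name results/ and its files are just an illustration):

```shell
# Example directory with a couple of files to transfer
mkdir -p results
echo "run 1" > results/run1.txt
echo "run 2" > results/run2.txt

# Pack the directory into one compressed .tar.gz archive
tar czf results.tar.gz results/

# List the archive contents without extracting
tar tzf results.tar.gz

# On the destination machine, unpack with:
# tar xzf results.tar.gz
```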


My first HPC job (Hello world!)

In the next example we will create a C program and run it on the HPC system. For educational purposes this job will use 1 CPU core and 1024 MBytes of RAM on the short queue.

  1. Create a C source file named “helloworld.c” and copy:
    
    #include <stdio.h>
    #include <unistd.h>

    int main() {
        printf("Hello, world!\n");
        sleep(60);
        printf("Finishing after 60 seconds waiting\n");
        return 0;
    }
                                        
  2. Create a Shell script file named “helloworld.sh” and copy:
    
    #!/bin/bash -l
    #
    #SBATCH -J my-hello-world
    #SBATCH -o my-hello-world."%j".out
    #SBATCH -e my-hello-world."%j".err
    #
    #SBATCH --mail-user=$USER@cs.upc.edu
    #SBATCH --mail-type=ALL
    #
    #SBATCH --mem=1024M
    #SBATCH -c 1
    #SBATCH -p short

    # First, compile the .c program
    gcc -o helloworld.exe helloworld.c

    # Second, run the program
    ./helloworld.exe
                                        
  3. Send your job to the HPC queue system:
    sbatch ./helloworld.sh

  4. Check your queue status:
    squeue -u <your_username>

  5. Wait until the queue system finds a free execution node that fits your job needs. You can monitor your job output “in real time” through the output file (my-hello-world.<job_id>.out) and the error file (my-hello-world.<job_id>.err):
    tail -f ./*.out ./*.err

Warning: If the execution nodes are “full” of user jobs, or your execution takes a long time, you can log out: the HPC system will notify you via email when your job execution is completed.


My first interactive job

If you need to compile or execute any program in interactive mode (never run programs on logincluster), you can ask for an interactive shell environment that fits your needs on an execution node.

  • Request a shell with 3 CPU cores, 1024MBytes RAM and less than 24h (short queue):
srun -p short --mem=1024M -c 3 --pty bash

  • Request a shell with 6 CPU cores, 16GBytes RAM and less than 24h (short queue):
srun -p short --mem=16G -c 6 --pty bash

Warning: This interactive shell will only be available if there is a free execution node that can fulfill your request; otherwise you will receive an error, because no interactive shell can be provided while all the execution resources are in use. Besides, you must also have enough quota available, or you will not be able to execute your interactive job.


SGE to Slurm Commands table

Below you can find the SLURM equivalents of typical SGE commands:

User Commands        SGE                       SLURM
Interactive login    qlogin                    srun --pty <shellname>
Job submission       qsub [script_file]        sbatch [script_file]
Job deletion         qdel [job_id]             scancel [job_id]
Job status by job    qstat -u \* [-j job_id]   squeue [job_id]
Job status by user   qstat [-u user_name]      squeue -u [user_name]
Job hold             qhold [job_id]            scontrol hold [job_id]
Job release          qrls [job_id]             scontrol release [job_id]
Queue list           qconf -sql                squeue
List nodes           qhost                     sinfo -N OR scontrol show nodes
Cluster status       qhost -q                  sinfo
GUI                  qmon                      sview

Below you can find the SLURM equivalents of SGE job parameters:

Job Specification              SGE                                            SLURM
Script directive               #$                                             #SBATCH
Queue                          -q [queue]                                     -p [queue]
Count of nodes                 N/A                                            -N [min[-max]]
CPU count                      -pe [PE] [count]                               -c [count]
Wall clock limit               -l h_rt=[seconds]                              -t [min] OR -t [days-hh:mm:ss]
Standard out file              -o [file_name]                                 -o [file_name]
Standard error file            -e [file_name]                                 -e [file_name]
Combine STDOUT & STDERR files  -j yes                                         (use -o without -e)
Copy environment               -V                                             --export=[ALL | NONE | variables]
Event notification             -m abe                                         --mail-type=[events]
Send notification email        -M [address]                                   --mail-user=[address]
Job name                       -N [name]                                      --job-name=[name]
Restart job                    -r [yes|no]                                    --requeue OR --no-requeue (NOTE: configurable default)
Set working directory          -wd [directory]                                --workdir=[dir_name]
Resource sharing               -l exclusive                                   --exclusive OR --shared
Memory size                    -l mem_free=[memory][K|M|G]                    --mem=[mem][M|G|T] OR --mem-per-cpu=[mem][M|G|T]
Charge to an account           -A [account]                                   --account=[account]
Tasks per node                 (fixed allocation_rule in PE)                  --tasks-per-node=[count] AND/OR --cpus-per-task=[count]
Job dependency                 -hold_jid [job_id | job_name]                  --depend=[state:job_id]
Job project                    -P [name]                                      --wckey=[name]
Job host preference            -q [queue]@[node] OR -q [queue]@@[hostgroup]   --nodelist=[nodes] AND/OR --exclude=[nodes]
Quality of service             N/A                                            --qos=[name]
Job arrays                     -t [array_spec]                                --array=[array_spec] (Slurm version 2.6+)
Generic resources              -l [resource]=[value]                          --gres=[resource_spec]
Begin time                     -a [YYMMDDhhmm]                                --begin=YYYY-MM-DD[THH:MM[:SS]]

Examples

Some examples on the commands and parameters above:

SGE                       SLURM
qstat                     squeue
qstat -u username         squeue -u username
qstat -f                  squeue -al
qsub                      sbatch
qsub -N jobname           sbatch -J jobname
qsub -l h_rt=24:00:00     sbatch -t 24:00:00
qsub -pe make 8           sbatch -c 8
qsub -l mem=4G            sbatch --mem=4000
qsub -o filename          sbatch -o filename
qsub -e filename          sbatch -e filename
qsub -q gpu               sbatch -p gpu --gres=gpu:n
qlogin                    srun --pty bash
qdel                      scancel