Oneseventeen (ts)

Tennessine/Oneseventeen (ts) is our GPU cluster, composed of 33 compute nodes (Dell R730s), each with 28 cores, 128 GB of RAM, and a single NVIDIA P100 GPU with 16 GB of memory.

This cluster has an EDR InfiniBand compute fabric.

The login/head node is a Dell R630 with the same amount of RAM but no GPU.

  • Head node FQDN: ts.simcenter.utc.edu
  • Other nodes: ts{01-33}

Login procedure

To log into the Oneseventeen (ts) cluster, use one of the following commands:

$ ssh ts
# or
$ ssh ts.simcenter.utc.edu
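
If the short form "ssh ts" is not already resolvable from your machine, a host alias can be added to your local SSH configuration; a minimal sketch (the username below is a placeholder):

{{{
# ~/.ssh/config on your own machine; replace "username" with your SimCenter username
Host ts
    HostName ts.simcenter.utc.edu
    User username
}}}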

Open Grid Scheduler

Running a job

The TS cluster uses the Open Grid Scheduler (OGS), similar to the Sun Grid Engine (SGE).

Load the SGE scheduler application using the command:

module load sge

To submit a job to the cluster use:

qsub job_script

The qsub command has many flags. Some of the more interesting and often used ones are listed below. These can be used either as command-line arguments or as arguments embedded in the job script.

-S path: Specifies the interpreting shell for the job.

-N name: Specifies the job name.

-V : Specifies that all environment variables active within the qsub utility be exported to the context of the job.

-pe <pe> N: Specifies the parallel environment required for the job, with N slots (processes) requested.

-cwd : Executes the job from the current working directory.

-e path: Defines or redefines the path used for the standard error stream of the job.

-o path: Defines or redefines the path used for the standard output stream of the job.

-j y|n: Specifies whether or not the standard error stream of the job is merged into the standard output stream.

-l resource=value: Launches the job in a Grid Engine queue meeting the given resource request list.
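
For example, several of these flags can be combined on the command line (the job name and output file below are only illustrative):

qsub -N mytest -S /bin/bash -V -cwd -j y -o mytest.out job_script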

Sample SGE Submission Scripts

This first script will launch a simple 'sge_helloworld' MATLAB script in batch mode (remember to modify the path to the program in the script before using it):

{{{
#!/bin/bash
#$ -V ## pass all environment variables to the job, VERY IMPORTANT
#$ -N matlab_test ## job name
#$ -S /bin/bash ## shell where it will run this job
#$ -j y ## join error output to normal output
#$ -cwd ## Execute the job from the current working directory
#$ -o matlab_test.out
#$ -e matlab_test.err
 
module load matlab/R2019a
echo "Starting test: $(date)"
matlab -nosplash -nodesktop -r "run('/path/to/helloworld.m'); exit;"
echo "Test complete: $(date)"
}}}

where 'helloworld.m' looks like:

{{{
      disp('Hello, World!');
}}}

To submit this job script (the -o flag on the command line overrides the output file set inside the script):

qsub -o output_file.o sge_helloworld

To submit an MPI job with 8 slots:

qsub -pe openmpi_ib 8 -o mpi_hello.o mpi_helloworld 
 
#or
 
# Add this to script
#$ -pe openmpi_ib 8
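
For reference, the mpi_helloworld job script referenced above might look like the following sketch (the openmpi module name and the ./hello_mpi binary are placeholders; adjust to what is actually installed):

{{{
#!/bin/bash
#$ -V                    ## pass all environment variables to the job
#$ -N mpi_hello          ## job name
#$ -S /bin/bash          ## shell for the job
#$ -cwd                  ## run from the current working directory
#$ -j y                  ## merge stderr into stdout
#$ -o mpi_hello.out
#$ -pe openmpi_ib 8      ## request 8 slots in the openmpi_ib parallel environment

module load openmpi      ## placeholder module name; check `module avail`
echo "Starting MPI test: $(date)"
mpirun -np $NSLOTS ./hello_mpi
echo "Test complete: $(date)"
}}}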

To check the status of your script:

qstat
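
qstat can also be narrowed to your own jobs or to a single job, for example:

qstat -u $USER      # only your jobs
qstat -j <job_id>   # detailed information (and scheduling messages) for one job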

Allocate a job to desired node(s)

The flag -l in qsub allows allocating a job to a desired list of resources.

# qsub -l h=<node-name> <job-script>
# For example: to force-allocate the job to node ts17:
$ qsub -l h=ts17 myjobscript.sh 

Debugging Failed qdel Jobs

Use qhost to determine whether any nodes have a higher load average than normal. Use qstat -u <user> -g t | grep <host> to find out how many jobs SGE is aware of on <host>. Then, on the node, run ps -o pid,etime -u <user> to list the elapsed run time of all processes owned by <user>; this should help identify processes that have been running longer than the rest, which can then be killed by PID.
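
A rough sketch of that workflow, using ts17 as the suspect node and placeholders for <user> and <pid> (logging into the node to inspect its processes is assumed):

qhost                                  # look for nodes with an unusually high load average
qstat -u <user> -g t | grep ts17       # how many jobs SGE is aware of on ts17
ssh ts17                               # log into the suspect node
ps -o pid,etime -u <user>              # elapsed run time of <user>'s processes on the node
kill <pid>                             # kill a stray long-running process by PID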

Debugging Failed Nodes Using qstat

Occasionally nodes will fail certain tests and will no longer be scheduled by SGE. This recently happened on Oneseventeen, and Bright helped me figure out why. The steps taken are outlined below for future reference.

  1. Run: qstat -f -qs "acousuACDES" -explain "acAE" -q *\@ts20 2>&1 to get information about why a compute node (in this case ts20) is in any of the acAE states.
  2. If errors are found, they may be cleared, if appropriate, using this command: qmod -cq all.q@ts20.cm.cluster
  3. Then, cmsh may be used to check the scheduler status of each node specified: cmsh -c "device; foreach -n ts20,ts22 (check schedulers)"

Flags can be embedded in the job_script using the "#$" escape comment, which the scheduler reads. Organise the job script so that these flags are at the top of the file.

#!/bin/bash
#$ -cwd
#$ -N job_name
#$ -j y
 
...

Modules can be loaded as done on the command line in the job_script.

module load namd/2.12

Job Dependencies

Occasionally we want to be able to trigger the execution of a job only after another has finished. This is possible through either using the same job allocation and including both program executions in the same job script or through job dependencies. To submit a job with a dependency on another job use the following syntax with at least one job_id or job_name:

qsub -hold_jid JOB_ID1,JOB_ID2,... job_script

Using job names provides an easier interface for dependency handling since it does not depend on scheduler state (job IDs); however, it requires non-colliding job names unless a dependency on all jobs with that name is intended. Regular expressions can also be used.
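
For example, using job names (the names and scripts below are placeholders):

qsub -N preprocess preprocess.sh
qsub -N solve -hold_jid preprocess solve.sh   # runs only after "preprocess" finishes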

Requesting Exclusivity

We (sysadmins) have decided to allow node sharing to enable greater resource utilization at busy times. However, it is occasionally desirable to have a guarantee that a node is wholly owned by a single job. This is done by requesting the exclusive resource, either on the command line or in the job script:

qsub -l excl=true job_script
qsub -l exclusive=true job_script

Requesting GPUs

To enable the use of GPUs, you will also need to request the GPU resource. This is done with the following resource request flag:

qsub -l gpu=N job_script

On the Oneseventeen cluster all nodes have a single GPU, therefore "gpu=1" should be used.
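
A sketch of a GPU job script is given below; the cuda module name is a placeholder, and nvidia-smi simply stands in for your actual GPU program:

{{{
#!/bin/bash
#$ -V
#$ -N gpu_test          ## job name
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -o gpu_test.out
#$ -l gpu=1             ## request the node's single P100

module load cuda        ## placeholder module name; check `module avail`
nvidia-smi              ## confirm the GPU is visible, then run your GPU program here
}}}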

VNC functionality

tigervnc-server is installed on the login node for this cluster; this VNC server can be used to remotely access GUI-based applications while connected to the cluster. It requires both server-side and client-side setup; brief details for each are provided below:

  1. Once you are logged on to ts.simcenter.utc.edu, launch the VNC server using the following command:
    vncserver -geometry 1400x780 -depth 24
    NOTE: 1400×780 can be adjusted based on the client-side screen size. This resolution suits a device with a 15.4-inch screen.
  2. The above command will provide output like given below:
     New 'os-hn:2 (username)' desktop is os-hn:2 
    It will also ask to set up a password (only the first time), which should be different from your SimCenter password.
  3. The server hostname is “os-hn” and the port number is 5902. Note, however, if you were to see “os-hn:1”, the port number would be 5901.
  4. Now, you can connect to this server from another device. I have set it up to work on my MacBook using a tunnel-based approach by using the following command from the terminal:
    ssh -nNT -L5902:os-hn:5902 ranjan@ts.simcenter.utc.edu 
    This creates a tunnel to the login node using port number 5902.
  5. I have an application installed on my MacBook called TurboVNC. I launch the application, and it asks for the server; I provide the following as the server name: localhost:5902
  6. After that, it asks for the password, which is the same as the one set in step #2 above. This provides a desktop window, which can be used to launch a GUI application.
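
When finished, shut down the VNC server session on the login node, using the display number reported in step 2 (here :2):

vncserver -kill :2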