Tutorial jobs

Job 1: A simple, nonparallel job

Create a workload

Inside the tutorial directory that you created earlier, let’s create a test script to execute as your job:

$ nano short.sh
file: short.sh
#!/bin/bash
date
hostname
id
pwd
echo
echo "Working hard..."
sleep 30
echo "Science complete!"

To close nano, hold down Ctrl and press X, then press Y to save and Enter to confirm the file name.

Now, make the script executable.

$ chmod a+x short.sh

Run the job locally

tm103@dcc-slogin-03  /work/tm103/tutorial $ ./short.sh 
Thu Jan 30 13:44:37 EST 2020
dcc-slogin-03
uid=1024131(tm103) gid=1000000(dukeusers) groups=1000000(dukeusers),507(chg),517(neuro),519(rrsg),547(sysbio),552(pcharbon),582(zijlab),584(mtchem),585(scsc),609(dhvi),619(bmannlab),1022(vooralab),100000029(dscr),100000191(pn-jump),100000422(prdn-rc-staff),100000429(stoutlab),100000477(condor-admins),100000494(ssri-prdn-test),100000501(burchfiel),100000626(training),100000630(smrtanalysis),100000648(musselman-comsol),100000768(nsoe-it),100000837(ssri-prdn-slurm-managers),100000899(docacad),100001306(securehpc_noduo),100001989(artai),100005349(pcharbonlow),100005449(rescomp),100006149(jan30test) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
/work/tm103/tutorial

Working hard...
Science complete!
tm103@dcc-slogin-03  /work/tm103/tutorial $ 

Create a Slurm job script

$ nano tutorial1.sh
file: tutorial1.sh
#!/bin/bash
#SBATCH -e slurm.err
#SBATCH -p scavenger 
./short.sh
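
If your job needs more than the defaults, you can add further #SBATCH directives to the same script. The sketch below is illustrative only: the filename tutorial1-extended.sh is hypothetical, and the memory and wall-time values are placeholders you should adjust to your own workload and to the partitions your account can use.

file: tutorial1-extended.sh
#!/bin/bash
#SBATCH --job-name=tutorial1     # name shown by squeue and sacct
#SBATCH -o slurm-%j.out          # standard output file (%j expands to the job ID)
#SBATCH -e slurm-%j.err          # standard error file
#SBATCH -p scavenger             # partition to submit to
#SBATCH --mem=2G                 # requested memory (example value)
#SBATCH --time=00:10:00          # wall-time limit (example value)
./short.sh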

Submit the job

Submit the job using sbatch.

tm103@dcc-slogin-03  /work/tm103/tutorial $ sbatch tutorial1.sh 
Submitted batch job 16900869
tm103@dcc-slogin-03  /work/tm103/tutorial $
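
Most #SBATCH directives can also be supplied (or overridden) on the sbatch command line, which is handy for quick experiments without editing the script. For example, to submit the same script with a different error-file name (the name here is just an example):

$ sbatch -p scavenger -e tutorial1-%j.err tutorial1.sh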

Check job status

The squeue command shows the status of pending and running jobs. Generally you will want to limit the output to your own jobs:

tm103@dcc-slogin-03  /work/tm103/tutorial $ squeue -u tm103
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          16900869 scavenger tutorial    tm103  R       0:10      1 dcc-hashimilab-02
tm103@dcc-slogin-03  /work/tm103/tutorial $
When your job has completed, it will disappear from the list.
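
If you want different columns than the default listing, squeue also accepts an output format string with -o. The field list below is only one possible choice (job ID, partition, name, state, run time, node count, and node list or wait reason):

$ squeue -u tm103 -o "%.10i %.9P %.12j %.2t %.10M %.6D %R"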

Job history

Once your job has finished, you can get information about its execution from the sacct command:

tm103@dcc-slogin-03  ~ $ sacct -u tm103 
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
16900869      tutorial1  scavenger    rescomp         1  COMPLETED      0:0 
tm103@dcc-slogin-03  ~ $

You can see much more information about your job’s final status by using the --format option to specify additional fields.

tm103@dcc-slogin-03  ~ $ sacct -j 16900869 --format=User,JobID,state,partition,start,end,elapsed,nodelist%15,MaxRss%10,ReqMem,NCPUS,ExitCode,Workdir%110
     User        JobID      State  Partition               Start                 End    Elapsed        NodeList     MaxRSS     ReqMem      NCPUS ExitCode              WorkDir 
--------- ------------ ---------- ---------- ------------------- ------------------- ---------- --------------- ---------- ---------- ---------- -------- --------------------
    tm103 16900869      COMPLETED  scavenger 2020-01-30T13:49:55 2020-01-30T13:50:12   00:00:17 dcc-hashimilab+                   2Gc          1      0:0 /work/tm103/tutorial 
          16900869.ba+  COMPLETED            2020-01-30T13:49:55 2020-01-30T13:50:12   00:00:17 dcc-hashimilab+      1084K        2Gc          1      0:0 
          16900869.ex+  COMPLETED            2020-01-30T13:49:55 2020-01-30T13:50:12   00:00:17 dcc-hashimilab+       816K        2Gc          1      0:0 
tm103@dcc-slogin-03  ~ $

You can get a list of all available fields with sacct -e.
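
By default sacct only reports jobs from the current day. To look further back, add a start time (and optionally an end time); the dates and field list below are just an example:

$ sacct -u tm103 -S 2020-01-29 -E 2020-01-31 --format=JobID,JobName,Partition,State,Elapsed,ExitCode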

Check the job output

tm103@dcc-slogin-03  /work/tm103/tutorial $ ls -ltr
total 138
-rwxr-xr-x. 1 tm103 root 103 Jan 30 13:44 short.sh
-rw-r--r--. 1 tm103 root  67 Jan 30 13:49 tutorial1.sh
-rw-r--r--. 1 tm103 root   0 Jan 30 13:49 slurm.err
-rw-r--r--. 1 tm103 root 755 Jan 30 13:50 slurm-16900869.out
tm103@dcc-slogin-03  /work/tm103/tutorial $ cat slurm-16900869.out 
Thu Jan 30 13:49:56 EST 2020
dcc-hashimilab-02
uid=1024131(tm103) gid=1000000(dukeusers) groups=1000000(dukeusers),507(chg),517(neuro),519(rrsg),547(sysbio),552(pcharbon),582(zijlab),584(mtchem),585(scsc),609(dhvi),619(bmannlab),1022(vooralab),100000029(dscr),100000191(pn-jump),100000422(prdn-rc-staff),100000429(stoutlab),100000477(condor-admins),100000494(ssri-prdn-test),100000501(burchfiel),100000626(training),100000630(smrtanalysis),100000648(musselman-comsol),100000768(nsoe-it),100000837(ssri-prdn-slurm-managers),100000899(docacad),100001306(securehpc_noduo),100001989(artai),100005349(pcharbonlow),100005449(rescomp),100006149(jan30test) context=system_u:system_r:unconfined_service_t:s0
/work/tm103/tutorial

Working hard...
Science complete!
tm103@dcc-slogin-03  /work/tm103/tutorial $
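
Because tutorial1.sh did not set an output file, Slurm wrote standard output to the default name slurm-<jobid>.out in the submission directory, while the #SBATCH -e line sent standard error to slurm.err. If you prefer to choose both names yourself, you could add directives like these to the script (the filenames are just examples):

#SBATCH -o tutorial1-%j.out   # standard output, %j expands to the job ID
#SBATCH -e tutorial1-%j.err   # standard error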

Job 2: Slurm job arrays

A job array lets you submit many near-identical tasks with a single sbatch command: Slurm runs one copy of the script for each index in the range given by -a, and exposes that index to each task as SLURM_ARRAY_TASK_ID.

Create a job array script

$ nano tutorial2.sh
file: tutorial2.sh
#!/bin/bash
#SBATCH -e slurm.err
#SBATCH -o slurm__%A_%a.out
#SBATCH -p scavenger 
#SBATCH -a 1-10
echo $SLURM_ARRAY_TASK_ID
./short.sh
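
In real workloads the array index is usually used to pick a different input for each task. A minimal sketch, assuming hypothetical input files named input_1.txt through input_10.txt in the submission directory:

file: array-example.sh
#!/bin/bash
#SBATCH -p scavenger
#SBATCH -a 1-10
#SBATCH -o slurm__%A_%a.out
# Each task processes its own input file, selected by the array index.
INPUT=input_${SLURM_ARRAY_TASK_ID}.txt
echo "Task ${SLURM_ARRAY_TASK_ID} processing ${INPUT}"
wc -l "${INPUT}"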

Submit the job array


tm103@dcc-slogin-03  /work/tm103/tutorial $ sbatch tutorial2.sh 
Submitted batch job 16901515
tm103@dcc-slogin-03  /work/tm103/tutorial $ squeue -u tm103
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        16901515_1 scavenger tutorial    tm103  R       0:02      1 dcc-aryalab-01
        16901515_2 scavenger tutorial    tm103  R       0:02      1 dcc-aryalab-01
        16901515_3 scavenger tutorial    tm103  R       0:02      1 dcc-pfister-01
        16901515_4 scavenger tutorial    tm103  R       0:02      1 dcc-pfister-01
        16901515_5 scavenger tutorial    tm103  R       0:02      1 dcc-pfister-01
        16901515_6 scavenger tutorial    tm103  R       0:02      1 dcc-pfister-01
        16901515_7 scavenger tutorial    tm103  R       0:02      1 dcc-pfister-01
        16901515_8 scavenger tutorial    tm103  R       0:02      1 dcc-owzarteam-01
        16901515_9 scavenger tutorial    tm103  R       0:02      1 dcc-owzarteam-01
       16901515_10 scavenger tutorial    tm103  R       0:02      1 dcc-owzarteam-01
tm103@dcc-slogin-03  /work/tm103/tutorial $ squeue -u tm103
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
tm103@dcc-slogin-03  /work/tm103/tutorial $ ls -ltr
total 584
-rwxr-xr-x. 1 tm103 root 103 Jan 30 13:44 short.sh
-rw-r--r--. 1 tm103 root  67 Jan 30 13:49 tutorial1.sh
-rw-r--r--. 1 tm103 root 755 Jan 30 13:50 slurm-16900869.out
-rw-r--r--. 1 tm103 root 137 Jan 30 14:15 tutorial2.sh
-rw-r--r--. 1 tm103 root   0 Jan 30 14:15 slurm.err
-rw-r--r--. 1 tm103 root 754 Jan 30 14:16 slurm__16901515_1.out
-rw-r--r--. 1 tm103 root 754 Jan 30 14:16 slurm__16901515_2.out
-rw-r--r--. 1 tm103 root 754 Jan 30 14:16 slurm__16901515_5.out
-rw-r--r--. 1 tm103 root 754 Jan 30 14:16 slurm__16901515_4.out
-rw-r--r--. 1 tm103 root 754 Jan 30 14:16 slurm__16901515_7.out
-rw-r--r--. 1 tm103 root 754 Jan 30 14:16 slurm__16901515_3.out
-rw-r--r--. 1 tm103 root 754 Jan 30 14:16 slurm__16901515_6.out
-rw-r--r--. 1 tm103 root 756 Jan 30 14:16 slurm__16901515_9.out
-rw-r--r--. 1 tm103 root 757 Jan 30 14:16 slurm__16901515_10.out
-rw-r--r--. 1 tm103 root 756 Jan 30 14:16 slurm__16901515_8.out
tm103@dcc-slogin-03  /work/tm103/tutorial $ cat slurm__16901515_8.out
8
Thu Jan 30 14:15:56 EST 2020
dcc-owzarteam-01
uid=1024131(tm103) gid=1000000(dukeusers) groups=1000000(dukeusers),507(chg),517(neuro),519(rrsg),547(sysbio),552(pcharbon),582(zijlab),584(mtchem),585(scsc),609(dhvi),619(bmannlab),1022(vooralab),100000029(dscr),100000191(pn-jump),100000422(prdn-rc-staff),100000429(stoutlab),100000477(condor-admins),100000494(ssri-prdn-test),100000501(burchfiel),100000626(training),100000630(smrtanalysis),100000648(musselman-comsol),100000768(nsoe-it),100000837(ssri-prdn-slurm-managers),100000899(docacad),100001306(securehpc_noduo),100001989(artai),100005349(pcharbonlow),100005449(rescomp),100006149(jan30test) context=system_u:system_r:unconfined_service_t:s0
/work/tm103/tutorial

Working hard...
Science complete!
tm103@dcc-slogin-03  /work/tm103/tutorial $
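
If you need to limit how many array tasks run at the same time (for example, to avoid overloading a shared resource), Slurm supports a throttle suffix on the array range; the limit of 2 below is just an example:

#SBATCH -a 1-10%2    # run at most 2 array tasks at a time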