Tutorial jobs
Job 1: A simple, nonparallel job
Create a workload
Inside the tutorial directory that you created or installed previously, let’s create a test script to execute as your job:
$ nano short.sh
#!/bin/bash
date
hostname
id
pwd
echo
echo "Working hard..."
sleep 30
echo "Science complete!"
To close nano, hold down Ctrl and press X. Press Y to save, and then press Enter.
Now, make the script executable.
$ chmod a+x short.sh
Run the job locally
Before handing the script to Slurm, run it directly on the login node to make sure it works:
tm103@dcc-slogin-03 /work/tm103/tutorial $ ./short.sh
Thu Jan 30 13:44:37 EST 2020
dcc-slogin-03
uid=1024131(tm103) gid=1000000(dukeusers) groups=1000000(dukeusers),507(chg),517(neuro),519(rrsg),547(sysbio),552(pcharbon),582(zijlab),584(mtchem),585(scsc),609(dhvi),619(bmannlab),1022(vooralab),100000029(dscr),100000191(pn-jump),100000422(prdn-rc-staff),100000429(stoutlab),100000477(condor-admins),100000494(ssri-prdn-test),100000501(burchfiel),100000626(training),100000630(smrtanalysis),100000648(musselman-comsol),100000768(nsoe-it),100000837(ssri-prdn-slurm-managers),100000899(docacad),100001306(securehpc_noduo),100001989(artai),100005349(pcharbonlow),100005449(rescomp),100006149(jan30test) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
/work/tm103/tutorial

Working hard...
Science complete!
tm103@dcc-slogin-03 /work/tm103/tutorial $
Create a Slurm job script
$ nano tutorial1.sh
#!/bin/bash
#SBATCH -e slurm.err
#SBATCH -p scavenger

./short.sh

Here, -e names the file that will receive the job's stderr, and -p selects the partition to run in. Since no -o option is given, stdout goes to the default file, slurm-<jobid>.out.
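For reference, job scripts often carry more directives than this tutorial needs. The sketch below is illustrative only; the job name, output pattern, memory, and time limit are example values, not requirements of this exercise:

#!/bin/bash
#SBATCH -J tutorial1        # job name shown in squeue (example value)
#SBATCH -e slurm.err        # file to receive stderr
#SBATCH -o slurm-%j.out     # file to receive stdout (%j expands to the job ID)
#SBATCH -p scavenger        # partition to run in
#SBATCH --mem=2G            # requested memory (example value)
#SBATCH -t 00:05:00         # wall-clock limit, HH:MM:SS (example value)

./short.sh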
Submit the job
Submit the job using sbatch.
tm103@dcc-slogin-03 /work/tm103/tutorial $ sbatch tutorial1.sh
Submitted batch job 16900869
tm103@dcc-slogin-03 /work/tm103/tutorial $
Check job status
The squeue command shows the status of pending and running jobs. Generally you will want to limit the output to your own jobs:
tm103@dcc-slogin-03 /work/tm103/tutorial $ squeue -u tm103
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          16900869 scavenger tutorial    tm103  R       0:10      1 dcc-hashimilab-02
tm103@dcc-slogin-03 /work/tm103/tutorial $
When your job has completed, it will disappear from the list.
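While a job waits or runs, you can ask squeue for more detail. A small sketch of two variations (the state filter and format string here are just examples, not required steps of this tutorial):

$ squeue -u tm103 -t PD                                      # show only pending jobs, with the wait reason
$ squeue -u tm103 --format="%.10i %.9P %.8j %.2t %.10M %R"   # custom columns: ID, partition, name, state, time, nodes/reason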
Job history
Once your job has finished, you can get information about its execution from the sacct command:
tm103@dcc-slogin-03 ~ $ sacct -u tm103
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
16900869      tutorial1  scavenger    rescomp          1  COMPLETED      0:0
tm103@dcc-slogin-03 ~ $
You can see much more information about your job's final status using the --format option, specifying additional fields.
tm103@dcc-slogin-03 ~ $ sacct -j 16900869 --format=User,JobID,state,partition,start,end,elapsed,nodelist%15,MaxRss%10,ReqMem,NCPUS,ExitCode,Workdir%110
     User        JobID      State  Partition               Start                 End    Elapsed        NodeList     MaxRSS     ReqMem      NCPUS ExitCode              WorkDir
--------- ------------ ---------- ---------- ------------------- ------------------- ---------- --------------- ---------- ---------- ---------- -------- --------------------
    tm103 16900869      COMPLETED  scavenger 2020-01-30T13:49:55 2020-01-30T13:50:12   00:00:17 dcc-hashimilab+                   2Gc          1      0:0 /work/tm103/tutorial
          16900869.ba+  COMPLETED            2020-01-30T13:49:55 2020-01-30T13:50:12   00:00:17 dcc-hashimilab+      1084K        2Gc          1      0:0
          16900869.ex+  COMPLETED            2020-01-30T13:49:55 2020-01-30T13:50:12   00:00:17 dcc-hashimilab+       816K        2Gc          1      0:0
tm103@dcc-slogin-03 ~ $
Get a list of all the available fields with sacct -e.
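If you use the same field list often, sacct can also read a default from the SACCT_FORMAT environment variable, so you don't have to retype --format. A quick sketch (the field list here is just an example):

$ export SACCT_FORMAT="JobID,JobName,State,Elapsed,MaxRSS,ExitCode"
$ sacct -u tm103    # now prints the fields above by default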
Check the job output
tm103@dcc-slogin-03 /work/tm103/tutorial $ ls -ltr
total 138
-rwxr-xr-x. 1 tm103 root 103 Jan 30 13:44 short.sh
-rw-r--r--. 1 tm103 root  67 Jan 30 13:49 tutorial1.sh
-rw-r--r--. 1 tm103 root   0 Jan 30 13:49 slurm.err
-rw-r--r--. 1 tm103 root 755 Jan 30 13:50 slurm-16900869.out
tm103@dcc-slogin-03 /work/tm103/tutorial $ cat slurm-16900869.out
Thu Jan 30 13:49:56 EST 2020
dcc-hashimilab-02
uid=1024131(tm103) gid=1000000(dukeusers) groups=1000000(dukeusers),507(chg),517(neuro),519(rrsg),547(sysbio),552(pcharbon),582(zijlab),584(mtchem),585(scsc),609(dhvi),619(bmannlab),1022(vooralab),100000029(dscr),100000191(pn-jump),100000422(prdn-rc-staff),100000429(stoutlab),100000477(condor-admins),100000494(ssri-prdn-test),100000501(burchfiel),100000626(training),100000630(smrtanalysis),100000648(musselman-comsol),100000768(nsoe-it),100000837(ssri-prdn-slurm-managers),100000899(docacad),100001306(securehpc_noduo),100001989(artai),100005349(pcharbonlow),100005449(rescomp),100006149(jan30test) context=system_u:system_r:unconfined_service_t:s0
/work/tm103/tutorial

Working hard...
Science complete!
tm103@dcc-slogin-03 /work/tm103/tutorial $
Job 2: Slurm job arrays
A job array submits many nearly identical tasks from a single script, with each task given its own index. Create a second job script:
$ nano tutorial2.sh
#!/bin/bash
#SBATCH -e slurm.err
#SBATCH -o slurm__%A_%a.out
#SBATCH -p scavenger
#SBATCH -a 1-10

echo $SLURM_ARRAY_TASK_ID
./short.sh

The -a 1-10 option submits this script as ten tasks. In the -o file name pattern, %A expands to the job ID and %a to the task's index, and each task sees its own index in the $SLURM_ARRAY_TASK_ID environment variable.
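The task index is what makes arrays useful: each task can use it to select its own piece of work. A minimal sketch, assuming hypothetical input files input_1.txt through input_10.txt and a hypothetical program myprogram (neither is part of this tutorial):

#!/bin/bash
#SBATCH -e slurm.err
#SBATCH -o slurm__%A_%a.out
#SBATCH -p scavenger
#SBATCH -a 1-10

# Hypothetical: each task picks its own input file by its array index.
INPUT="input_${SLURM_ARRAY_TASK_ID}.txt"
./myprogram "$INPUT"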
Submit the job array
tm103@dcc-slogin-03 /work/tm103/tutorial $ sbatch tutorial2.sh
Submitted batch job 16901515
tm103@dcc-slogin-03 /work/tm103/tutorial $ squeue -u tm103
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        16901515_1 scavenger tutorial    tm103  R       0:02      1 dcc-aryalab-01
        16901515_2 scavenger tutorial    tm103  R       0:02      1 dcc-aryalab-01
        16901515_3 scavenger tutorial    tm103  R       0:02      1 dcc-pfister-01
        16901515_4 scavenger tutorial    tm103  R       0:02      1 dcc-pfister-01
        16901515_5 scavenger tutorial    tm103  R       0:02      1 dcc-pfister-01
        16901515_6 scavenger tutorial    tm103  R       0:02      1 dcc-pfister-01
        16901515_7 scavenger tutorial    tm103  R       0:02      1 dcc-pfister-01
        16901515_8 scavenger tutorial    tm103  R       0:02      1 dcc-owzarteam-01
        16901515_9 scavenger tutorial    tm103  R       0:02      1 dcc-owzarteam-01
       16901515_10 scavenger tutorial    tm103  R       0:02      1 dcc-owzarteam-01
tm103@dcc-slogin-03 /work/tm103/tutorial $ squeue -u tm103
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
tm103@dcc-slogin-03 /work/tm103/tutorial $ ls -ltr
total 584
-rwxr-xr-x. 1 tm103 root 103 Jan 30 13:44 short.sh
-rw-r--r--. 1 tm103 root  67 Jan 30 13:49 tutorial1.sh
-rw-r--r--. 1 tm103 root 755 Jan 30 13:50 slurm-16900869.out
-rw-r--r--. 1 tm103 root 137 Jan 30 14:15 tutorial2.sh
-rw-r--r--. 1 tm103 root   0 Jan 30 14:15 slurm.err
-rw-r--r--. 1 tm103 root 754 Jan 30 14:16 slurm__16901515_1.out
-rw-r--r--. 1 tm103 root 754 Jan 30 14:16 slurm__16901515_2.out
-rw-r--r--. 1 tm103 root 754 Jan 30 14:16 slurm__16901515_5.out
-rw-r--r--. 1 tm103 root 754 Jan 30 14:16 slurm__16901515_4.out
-rw-r--r--. 1 tm103 root 754 Jan 30 14:16 slurm__16901515_7.out
-rw-r--r--. 1 tm103 root 754 Jan 30 14:16 slurm__16901515_3.out
-rw-r--r--. 1 tm103 root 754 Jan 30 14:16 slurm__16901515_6.out
-rw-r--r--. 1 tm103 root 756 Jan 30 14:16 slurm__16901515_9.out
-rw-r--r--. 1 tm103 root 757 Jan 30 14:16 slurm__16901515_10.out
-rw-r--r--. 1 tm103 root 756 Jan 30 14:16 slurm__16901515_8.out
tm103@dcc-slogin-03 /work/tm103/tutorial $ cat slurm__16901515_8.out
8
Thu Jan 30 14:15:56 EST 2020
dcc-owzarteam-01
uid=1024131(tm103) gid=1000000(dukeusers) groups=1000000(dukeusers),507(chg),517(neuro),519(rrsg),547(sysbio),552(pcharbon),582(zijlab),584(mtchem),585(scsc),609(dhvi),619(bmannlab),1022(vooralab),100000029(dscr),100000191(pn-jump),100000422(prdn-rc-staff),100000429(stoutlab),100000477(condor-admins),100000494(ssri-prdn-test),100000501(burchfiel),100000626(training),100000630(smrtanalysis),100000648(musselman-comsol),100000768(nsoe-it),100000837(ssri-prdn-slurm-managers),100000899(docacad),100001306(securehpc_noduo),100001989(artai),100005349(pcharbonlow),100005449(rescomp),100006149(jan30test) context=system_u:system_r:unconfined_service_t:s0
/work/tm103/tutorial

Working hard...
Science complete!
tm103@dcc-slogin-03 /work/tm103/tutorial $
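Two follow-ups that often come in handy with arrays, sketched here against this session's job ID (16901515):

$ grep -L "Science complete!" slurm__16901515_*.out   # list output files missing the success line
$ scancel 16901515_3                                  # cancel just task 3 of the array

To limit how many tasks run at once, append %N to the array range in the job script; for example, #SBATCH -a 1-10%2 runs at most two tasks simultaneously.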