The Duke Compute Cluster (“DCC”)
The Duke Compute Cluster (formerly called the Duke Shared Cluster Resource or “DSCR”) consists of machines that the University has provided for community use and that researchers have purchased to conduct their research. At present, the cluster consists of about 7000 CPU-cores, with underlying hardware from Cisco UCS and Dell M600-series blades in Dell M1000-series chassis. Interconnects are 10 Gbps.
The cluster itself is a project of the University community, with the hardware provided by individual researchers and the University. The University, through Duke Research Computing and the Office of Information Technology, maintains and administers the equipment for its useful life (designated to be four years) and provides support for cluster users. Because the equipment was purchased incrementally, the cluster is heterogeneous. The variation is limited to a narrow range of Intel chipsets and RAM capacities, however, because Duke Research Computing organizes and channels equipment purchases to ease maintenance and exploit economies of scale. New “standard” nodes have 512 GB of RAM and 44 physical CPU-cores, with twice that many logical cores available through hyperthreading.
In February 2016, machines fitted with Nvidia Tesla K80 GPUs were added. They are available for purchase by research groups with a sustained need for GPU-accelerated computing, and are also available on a limited basis to all cluster users as a common resource.
Researchers who have provided equipment have “high priority” access to their own nodes and “low priority” (or “common”) access to others’ nodes, including those purchased by the University, when idle cycles are available. Since researchers tend not to use 100 percent of the CPU of the nodes they have purchased, “low priority” consumption of cycles greatly increases the efficiency of the cluster overall. It also gives all users access to more than their own nodes’ cycles when they need them. Jobs submitted with high priority run only on the nodes that the submitting group has bought, and low priority jobs on those machines yield to high priority jobs.
The Duke Compute Cluster is a general purpose high-performance/high-throughput installation, and it is fitted with software used for a broad array of scientific projects. For the most part, applications on the cluster are Free and Open Source Software (FOSS), though some researchers have arranged for proprietary licenses for software they use on the cluster. The operating system and the software installation and configuration are standard across all nodes (barring license restrictions), with Red Hat Enterprise Linux 6 as the current operating system. SLURM is the scheduler for the entire system. The entire system is professionally managed by systems administrators in the Office of Information Technology, and the equipment is housed in enterprise-grade data centers on Duke’s West Campus. Software installations and user support, including training on using the system, are provided by experienced staff of Duke Research Computing.
Users of the cluster agree to an Acceptable Use Policy.
Accessing the Duke Compute Cluster
There are currently three “front-end” (login) machines, and users must log in to one of them first.
Logging in connects you to one of the three head nodes (dcc-slogin-01.oit.duke.edu, dcc-slogin-02.oit.duke.edu and dcc-slogin-03.oit.duke.edu).
Once you are logged in to a front-end, you will be able to log in from there to any node in the cluster. Most of your non-computational work will be done on the front-ends: compilation, job submission, debugging. Do not use the login nodes for computationally intensive processes. All computationally demanding jobs should be submitted and run through the Slurm queueing system.
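As a rough sketch of the login step, the snippet below assembles an SSH command for one of the front-end hosts named above. The NetID `abc123` is a hypothetical placeholder, and the use of plain `ssh` with a NetID as the user name is an assumption about the login method rather than something this page states. The command is printed instead of executed because a real login prompts for credentials.

```shell
#!/bin/sh
# Hypothetical NetID used for illustration; substitute your own.
NETID="abc123"

# One of the three front-end hosts listed above.
LOGIN_HOST="dcc-slogin-01.oit.duke.edu"

# Assemble the login command. It is only printed here, because running it
# would open an interactive session and prompt for credentials.
LOGIN_CMD="ssh ${NETID}@${LOGIN_HOST}"
echo "$LOGIN_CMD"
```

Running the printed command by hand from a terminal should then open a session on that front-end, assuming your account is active.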
If you are a member of a group that already participates in the DSCR, please direct your new account request through your designated Point of Contact.
Using the Duke Compute Cluster
You can manage your work on the DCC in either interactive or batch mode.
In an interactive session, you type commands and receive output back to your screen as the commands complete.
To work in batch mode, you must first create a batch (or job) script that contains the commands to be run, then submit the job to a SLURM partition. It will be scheduled and run as soon as possible. Whether you use an interactive session or a batch job, all of your computing must be done on DCC compute nodes. You cannot do your computational work on the login nodes.
In both interactive and batch modes, the SLURM scheduler controls access to all of the compute nodes. SLURM manages four partition types, which are defined by the type of compute node that they control.
There are different DCC partitions to which batch jobs and interactive sessions can be directed:
- common, for jobs that will run on the DCC core nodes (up to 64 GB RAM).
- common-large, for jobs that will run on the DCC core nodes (64-240 GB RAM).
- gpu-common, for jobs that will run on DCC GPU nodes.
- Group partitions (partition name varies), for jobs that will run on lab-owned nodes.
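To make the batch workflow concrete, here is a minimal sketch that writes a small job script targeting the common partition and submits it when `sbatch` is available. The file name `hello_dcc.sh`, the job name, the output file pattern, and the 2 GB memory request are all hypothetical choices for illustration, not DCC recommendations.

```shell
#!/bin/sh
# Write a small job script. File name, job name and memory request are
# hypothetical values chosen for this example.
cat > hello_dcc.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello_dcc
#SBATCH --partition=common
#SBATCH --output=hello_dcc_%j.out
#SBATCH --mem=2G
# Print the name of the compute node the job landed on.
hostname
EOF

# Submit only where sbatch exists (for example, on a DCC front-end).
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello_dcc.sh
else
    echo "sbatch not found; run this on a cluster front-end"
fi
```

In the output file name, `%j` is replaced by Slurm with the job ID, so repeated submissions do not overwrite each other's output.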
All the partitions use FIFO scheduling, although if the top job in the partition will not fit on the machine, SLURM will skip that job and try to schedule the next job in the partition.
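The skip-ahead behaviour described above can be illustrated with a toy model. This is only a sketch of the idea, not Slurm's actual scheduling code: the job names, core requests, and the single pool of 8 free cores are invented for the example.

```shell
#!/bin/sh
# Toy model: one pool of free cores and a queue in submission order.
# Each entry is name:cores_requested (all values are made up).
free_cores=8
queue="jobA:4 jobB:16 jobC:2 jobD:4"
started=""

for entry in $queue; do
    name=${entry%%:*}   # text before the colon
    need=${entry##*:}   # text after the colon
    if [ "$need" -le "$free_cores" ]; then
        # The job fits: start it and reserve its cores.
        free_cores=$((free_cores - need))
        started="$started $name"
    fi
    # Otherwise the job is skipped for now and the next one is tried.
done

echo "started:$started, free cores: $free_cores"
```

In this run jobA starts first, jobB is skipped because it needs more cores than are free, jobC starts in its place, and jobD must wait because only 2 cores remain.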
Additional information can be found here: slurm.schedmd.com
Jobs in the GPU partition use DCC GPU nodes. To request one GPU in the gpu-common partition, use the Slurm directive
#SBATCH -p gpu-common --gres=gpu:1
in a job script or
srun -p gpu-common --gres=gpu:1 --pty bash -i
for an interactive session.
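Putting the directive above into a complete job script might look like the following sketch. The file name `gpu_check.sh`, the job name, and the output file pattern are hypothetical, and the script assumes that `nvidia-smi` is installed on the GPU nodes, which this page does not confirm.

```shell
#!/bin/sh
# Write a GPU job script. File and job names are hypothetical.
cat > gpu_check.sh <<'EOF'
#!/bin/bash
#SBATCH -p gpu-common --gres=gpu:1
#SBATCH --job-name=gpu_check
#SBATCH --output=gpu_check_%j.out
# Assumption: nvidia-smi is available on the GPU nodes.
# It reports the GPU(s) visible to this job.
nvidia-smi
EOF

# Check the generated script for shell syntax errors before submitting.
bash -n gpu_check.sh && echo "gpu_check.sh: syntax OK"

# Submit only where sbatch exists (for example, on a DCC front-end).
if command -v sbatch >/dev/null 2>&1; then
    sbatch gpu_check.sh
else
    echo "sbatch not found; run this on a cluster front-end"
fi
```

If the job runs, its output file should list the single GPU that was allocated through `--gres=gpu:1`.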
Office hours: Gross Hall 241, M/W 1:00-5:00
Duke Compute Cluster workshop
Next class: Thursday, October 4th, 2:30-3:30 at TEC 132
- DCC_Workshop_10-04-2018 Slides (PDF)