(Image: an airplane undergoing maintenance, San Diego Air & Space Museum. Wikimedia Commons.)

From December 17 at 8:00 am to December 19 at 5:00 pm, most research computing resources will be unavailable for scheduled maintenance. During that timeframe, we will complete routine maintenance, such as firmware and software updates, and improve the current storage scheme on the Duke Compute Cluster to increase storage capacity for users and groups.

Researchers should plan for a complete outage of services, but we will do our best to return some services early as we complete the maintenance.

What’s affected?

  • Duke Compute Cluster (DCC), including home and group directories
  • All research virtual machines (VMs), including Research Toolkits RAPID virtual machines
  • Research computing resources (cluster and individual virtual machines) in the PRDN and Protected Network
  • Globus endpoints
  • PACE VMs (OIT GPU Resources)
  • Duke Data Commons storage services

What’s going to be done during the outage?

Since the last maintenance outage in early 2019, routine maintenance has been performed “in the background,” unobtrusively and often invisibly to users. These updates and changes were made using the cluster’s virtualization capabilities and duplicate devices installed for fail-over.

Some updates, however, require complete reinstallation and rebooting of servers. This includes (for example) cluster essentials like the SLURM master and the database machines that support it.

Maintenance will upgrade or change:

  • Firmware in all machines and networking devices
  • SLURM software to the current stable version
  • SLURM databases
  • Operating system (Red Hat Enterprise Linux) updates
  • Machine placement in the data center to rebalance cooling and electric power requirements

Group and “home directory” storage changes on the Duke Compute Cluster

The scheme for providing data storage to users of the Duke Compute Cluster will be changed to increase the capacity available to researchers and to improve the stability and availability of cluster resources for groups. The current storage scheme was established over fifteen years ago, and it is showing its age.

Some deficiencies we’re aiming to fix:

  • The 250 GB standard capacity does not reflect the realities of much data-driven science and scholarship
  • The provisioning of individual “home directories” within shared group space has led to lock-outs of entire groups if a single cluster user inadvertently consumes all of the storage space allocated to the group
  • Deprovisioning accounts of former group members has been difficult for “points-of-contact” (POCs) and PIs
  • Changes in the group membership of individuals (graduate students rotating through labs, for example) have been cumbersome
  • Cluster groups with many members have been pinched by having to share 250 GB of space, while small groups or “groups” of one have often had too much space at hand

The storage scheme will be changed in the following manner:

  • Groups will be granted 1 TB of storage capacity that can be shared by members of the group. This is four times the current allocation. Storage above the 1 TB allocation will be available for labs to purchase.
  • Individual user home directories will no longer be located in a group’s shared storage, where they consume group storage resources. Instead, each individual user of the cluster will be granted 10 GB in a separate home directory, which is more capacity than is typical on many clusters. If a user inadvertently fills their home directory, only that user will be affected, by being unable to log in. (A sketch for checking usage against these new limits follows this list.)
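
To gauge how your current usage compares with the new limits, the following Python sketch totals the size of a directory tree and reports it against a quota. Only the 10 GB and 1 TB figures come from this announcement; the example paths (“/work/mylab” and the user’s home directory) are hypothetical placeholders, since the actual mount points will be published before the outage.

    # A minimal sketch, assuming only the 10 GB / 1 TB figures from this
    # announcement; the paths below are hypothetical placeholders.
    import os

    def tree_size_bytes(root: str) -> int:
        """Sum apparent file sizes under root, skipping unreadable entries."""
        total = 0
        for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda e: None):
            for name in filenames:
                try:
                    # lstat counts symlinks themselves rather than following them
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        return total

    if __name__ == "__main__":
        checks = [
            # (label, hypothetical path, quota in bytes)
            ("home directory", os.path.expanduser("~"), 10 * 1024**3),  # 10 GB
            ("group share", "/work/mylab", 1024**4),                    # 1 TB
        ]
        for label, path, quota in checks:
            if not os.path.isdir(path):
                print(f"{label}: {path} not found, skipping")
                continue
            used = tree_size_bytes(path)
            pct = 100 * used / quota
            print(f"{label}: {used / 1024**3:.1f} GiB used of "
                  f"{quota / 1024**3:.0f} GiB ({pct:.0f}%)")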

These changes will eliminate many of the deficiencies of the old set-up, but they mean that scripts may need to be updated, especially if “hard paths” are coded into them. Also, scripts and data stored in home directories will not be available to users’ lab groups.
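
One quick way to find scripts that may break is to scan them for absolute paths matching the old layout. The Python sketch below is one way to do this; the “/dscrhome” prefix is a hypothetical example rather than a confirmed DCC path, so substitute whatever paths your own scripts hard-code.

    # A minimal sketch; "/dscrhome" is a hypothetical example of an old
    # prefix -- substitute whatever absolute paths your scripts embed.
    import pathlib
    import re

    OLD_PREFIXES = re.compile(r"/dscrhome/\w+")  # hypothetical old path pattern

    def find_hard_paths(script_dir: str) -> None:
        """Print file:line for every line matching OLD_PREFIXES."""
        for path in pathlib.Path(script_dir).rglob("*"):
            if path.suffix not in {".sh", ".slurm", ".py"}:
                continue  # only scan common job-script and code suffixes
            try:
                lines = path.read_text(errors="replace").splitlines()
            except OSError:
                continue  # unreadable file; skip it
            for lineno, line in enumerate(lines, start=1):
                if OLD_PREFIXES.search(line):
                    print(f"{path}:{lineno}: {line.strip()}")

    if __name__ == "__main__":
        find_hard_paths(".")  # scan the current directory tree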

PIs, lab members, and POCs will need to emphasize the importance of storing lab data in their group’s shared storage space, since individual home directories will not be accessible to the rest of the group. In addition, when cluster users leave the University or are deprovisioned on the cluster, their individual home directories will be deleted.

More detailed information about the specific storage changes will be provided before the planned maintenance outage begins on December 17.

Questions about the outage and the changes should be emailed to rescomputing@duke.edu.